Configuration frameworks aplenty! A standard pattern in orchestration and metadata-driven ETL is to stuff a bunch of configuration metadata into a small RDBMS, put a relatively lightweight API around it, and then run your pipelines off it. This requires your Databricks workloads to connect to that database and exchange metadata with it.
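To make the pattern concrete, here's a minimal sketch of that metadata lookup, using an in-memory SQLite database as a stand-in for the real RDBMS; the table and column names (pipeline_config, pipeline_name, source_path) are purely illustrative:

```python
import sqlite3

# Stand-in for the small configuration RDBMS
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pipeline_config (pipeline_name TEXT, source_path TEXT)"
)
conn.execute(
    "INSERT INTO pipeline_config VALUES ('sales_ingest', '/mnt/raw/sales')"
)

def get_pipeline_config(conn, pipeline_name):
    """Fetch the configuration row for one pipeline - a classic
    OLTP-style point lookup, no distributed compute required."""
    row = conn.execute(
        "SELECT source_path FROM pipeline_config WHERE pipeline_name = ?",
        (pipeline_name,),
    ).fetchone()
    return {"source_path": row[0]} if row else None

print(get_pipeline_config(conn, "sales_ingest"))
```

A tiny point lookup like this is exactly the kind of workload that gains nothing from Spark.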
A common approach to this in Databricks is to just read the metadata or call table functions through the Spark DataFrame API, and whilst that works… well, there are some drawbacks:
It’s overkill: you take a significant performance hit for all that distribution, only to bring everything back to the driver for what is essentially an OLTP workload.
Development and unit testing become clunky and cumbersome, since you’ve introduced a massive framework (Spark) to implement a small configuration-state use case in a component that otherwise doesn’t need Spark at all.
The problem is, though, that we can’t just connect from the Spark driver using standard pyodbc and the SQL Server ODBC driver, since none of that is installed. Or can we?
There is a feature in Databricks that allows us to boot our clusters from Docker images! It’s been around for ages but is in GA now, I believe… so we can actually install whatever the hell we want!
With this in mind, here’s a Databricks Docker build I threw together that provides an image with pyodbc and the SQL Server drivers installed. Here’s the code:
FROM databricksruntime/minimal:9.x
RUN apt-get update \
&& apt-get install -y curl gnupg2
RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
RUN curl https://packages.microsoft.com/config/ubuntu/18.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
RUN apt-get update \
&& ACCEPT_EULA=Y apt-get install -y \
msodbcsql17 \
mssql-tools \
gcc \
python3.8-dev \
g++ \
unixodbc-dev \
python3.8 \
virtualenv \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Initialize the default environment that Spark and notebooks will use
RUN virtualenv -p python3.8 --system-site-packages /databricks/python3
# These python libraries are used by Databricks notebooks and the Python REPL
# You do not need to install pyspark - it is injected when the cluster is launched
# Versions are intended to reflect DBR 9.0
# ipython >= 7.19 is needed for Python autocomplete with jedi 0.17.x
RUN /databricks/python3/bin/pip install \
pyodbc==4.0.32 \
six==1.15.0 \
ipython==7.19.0 \
numpy==1.19.2 \
pandas==1.2.4 \
pyarrow==4.0.0 \
matplotlib==3.4.2 \
jinja2==2.11.3
# blah blah blah whatever you want
# Specifies where Spark will look for the python process
ENV PYSPARK_PYTHON=/databricks/python3/bin/python3
ENV PATH="$PATH:/opt/mssql-tools/bin"
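Before the cluster can use it, the image needs to live in a registry your workspace can pull from. Assuming an Azure Container Registry and an image name of my choosing (both placeholders, not from the original build):

```shell
# Build the image from the Dockerfile above and push it to a registry
# Databricks can pull from. Registry and image names are illustrative.
docker build -t myregistry.azurecr.io/databricks-pyodbc:9.x .
docker login myregistry.azurecr.io
docker push myregistry.azurecr.io/databricks-pyodbc:9.x
```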
Throw this into your cluster configuration:
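If you’re creating the cluster through the Clusters API or JSON rather than the UI, the relevant part of the spec is the docker_image block; everything here other than the field names (registry URL, credentials, node type) is a placeholder:

```json
{
  "cluster_name": "pyodbc-cluster",
  "spark_version": "9.1.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "docker_image": {
    "url": "myregistry.azurecr.io/databricks-pyodbc:9.x",
    "basic_auth": {
      "username": "<registry-user>",
      "password": "<registry-password>"
    }
  }
}
```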
Then, on the resulting Databricks cluster, we can make a simple connection to a SQL Server database:
import pyodbc

connection_str = 'DRIVER={ODBC Driver 17 for SQL Server};SERVER=tcp:myserver.database.windows.net,1433'
user = 'test'
password = 'THISAINTREAL'
database = 'test'

# pyodbc maps the user, password and database keywords onto the
# UID, PWD and DATABASE connection string attributes
conn = pyodbc.connect(
    connection_str,
    user=user,
    password=password,
    database=database
)

cursor = conn.cursor()
cursor.execute('SELECT * FROM test.result')
for row in cursor:
    print(row)
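If you’d rather keep everything in the connection string itself (handy when you want to add attributes like Encrypt=yes for Azure SQL), a small helper along these lines works; the helper name is my own, the attribute names are standard ODBC keywords, and the values are of course placeholders:

```python
def build_conn_str(server, database, user, password,
                   driver='ODBC Driver 17 for SQL Server'):
    """Assemble an ODBC connection string from its parts."""
    parts = {
        'DRIVER': '{%s}' % driver,
        'SERVER': server,
        'DATABASE': database,
        'UID': user,
        'PWD': password,
        'Encrypt': 'yes',  # Azure SQL only accepts encrypted connections
    }
    return ';'.join('%s=%s' % (k, v) for k, v in parts.items())

print(build_conn_str('tcp:myserver.database.windows.net,1433',
                     'test', 'test', 'THISAINTREAL'))
```

You’d then connect with pyodbc.connect(build_conn_str(…)) and no extra keywords.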