Design Pattern Context

Consists of:

  • Participants
    • Classes involved in forming a design pattern
    • They play different roles to accomplish the goals of the design pattern (see the sketch after these lists)
  • Quality Attributes
    • Non-functional requirements that affect the entire software system; architectural solutions address them
    • Usability
    • Modifiability
    • Reliability
    • Performance
    • More…
  • Forces
    • Various factors or trade-offs to consider
    • Quality attributes manifest as forces
  • Consequences
    • If we don’t reason about these forces well, we will face unintended consequences
    • Worse performance, etc.
    • Decision makers should consider the consequences
    • Knowing when to use a design pattern and when not to use it is crucial

Pattern Language:

  • Name
    • Captures the gist of the pattern
    • Vocabulary – meaningful and memorable
  • Context
    • Provides a scenario in which we use these patterns
    • Offers more insight on when to use the pattern
  • Problem
    • Describes the design challenge the pattern is addressing
  • Solution
    • Specifies the pattern itself
    • Structure -> the relationships between the elements in the pattern
    • Behavior -> all the interactions between the pattern’s elements
  • Related Patterns
    • Lists other patterns used together with the pattern being described
    • Or similar patterns
    • Crucial to precisely describe the subtle differences between the patterns
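
To make the vocabulary above concrete, here is a minimal sketch in Python using the classic Observer pattern; the class and event names are illustrative. Subject and ConsoleObserver are the participants, each playing a different role; the subject’s list of observers is the structure, and the notify/update interaction is the behavior.

# Minimal Observer sketch: the participants are Subject and ConsoleObserver,
# each playing a different role. The subject's reference to its observers is
# the structure; the notify()/update() interaction is the behavior.

class Observer:
    def update(self, event):
        raise NotImplementedError


class ConsoleObserver(Observer):
    def update(self, event):
        print(f"received event: {event}")


class Subject:
    def __init__(self):
        self._observers = []

    def attach(self, observer):
        self._observers.append(observer)

    def notify(self, event):
        for observer in self._observers:
            observer.update(event)


subject = Subject()
subject.attach(ConsoleObserver())
subject.notify("config_changed")  # prints "received event: config_changed"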

Databricks Cluster with pyodbc and SQL Server Drivers

Configuration frameworks aplenty! A standard pattern with orchestration and metadata-driven ETL is to stuff a bunch of configuration metadata into a small RDBMS, put a relatively lightweight API around it, and then run your pipelines off it. This requires your Databricks workloads to connect to that database and exchange metadata with it.

A common approach to this in Databricks is to read the metadata or call table functions through the Spark DataFrame API, and whilst that works… well, there are some drawbacks (a typical version of that read is sketched after this list):

  • It’s overkill: you take a significant performance hit for all that distribution, only to bring everything back to the driver for what is essentially an OLTP workload
  • Development and unit testing become clunky and cumbersome, since you’ve introduced a massive framework (Spark) to implement a small configuration-state use case for a component that otherwise doesn’t need Spark at all
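
For context, the Spark route referenced above typically looks something like this sketch; the server, database, and table names are placeholders, and it assumes the SQL Server JDBC driver that ships with the Databricks runtime:

# The Spark DataFrame JDBC read described above; connection details are
# placeholders. Note the collect() at the end: after the distributed read,
# everything comes back to the driver anyway.
metadata_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=test")
    .option("dbtable", "test.result")
    .option("user", "test")
    .option("password", "THISAINTREAL")
    .load()
)

rows = metadata_df.collect()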

The problem, though, is that we can’t simply connect from the Spark driver using standard pyodbc and the SQL Server ODBC driver, since none of that is installed on the cluster. Or can we?

There is a feature in Databricks, Databricks Container Services, that allows us to boot our clusters from Docker images! It’s been around for ages and is in GA now, I believe… so we can actually install whatever the hell we want!

With this in mind, here’s a Docker Databricks build I threw together that provides a Databricks image with pyodbc and the SQL Server drivers installed. Here’s the code as well:

FROM databricksruntime/minimal:9.x

RUN apt-get update \
  && apt-get install -y curl gnupg2

# Register Microsoft's package signing key and apt repository for the SQL Server ODBC driver
RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
RUN curl https://packages.microsoft.com/config/ubuntu/18.04/prod.list > /etc/apt/sources.list.d/mssql-release.list

# Install the ODBC driver and mssql-tools, plus Python 3.8 and the build dependencies pyodbc needs
RUN apt-get update \
  && ACCEPT_EULA=Y apt-get install -y \
    msodbcsql17 \
    mssql-tools \
    gcc \
    python3.8-dev \
    g++ \
    unixodbc-dev \
    python3.8 \
    virtualenv \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*


# Initialize the default environment that Spark and notebooks will use
RUN virtualenv -p python3.8 --system-site-packages /databricks/python3

# These python libraries are used by Databricks notebooks and the Python REPL
# You do not need to install pyspark - it is injected when the cluster is launched
# Versions are intended to reflect DBR 9.0

RUN /databricks/python3/bin/pip install \
  pyodbc==4.0.32 \
  six==1.15.0 \
  # ensure minimum ipython version for Python autocomplete with jedi 0.17.x
  ipython==7.19.0 \
  numpy==1.19.2 \
  pandas==1.2.4 \
  pyarrow==4.0.0 \
  matplotlib==3.4.2 \
  jinja2==2.11.3
# blah blah blah whatever you want

# Specifies where Spark will look for the python process
ENV PYSPARK_PYTHON=/databricks/python3/bin/python3
ENV PATH="$PATH:/opt/mssql-tools/bin"

Throw the image reference into your cluster configuration (the Docker tab under the cluster’s advanced options, or the docker_image field of the Clusters API, as sketched below):
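
If you prefer to create the cluster programmatically, here is a minimal sketch against the Clusters API 2.0. It assumes the image above has already been built and pushed to a registry the workspace can reach; the workspace URL, token, registry credentials, node type, and runtime version are all placeholders to swap for your own.

# Minimal sketch of creating a cluster from the custom image via the
# Clusters API 2.0. All names and environment variables here are placeholders.
import os
import requests

cluster_spec = {
    "cluster_name": "metadata-cluster",
    "spark_version": "9.1.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 1,
    "docker_image": {
        "url": "myregistry.azurecr.io/databricks-pyodbc:latest",
        "basic_auth": {
            "username": os.environ["REGISTRY_USER"],
            "password": os.environ["REGISTRY_PASSWORD"],
        },
    },
}

response = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json())  # contains the new cluster_id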

Then, on the resulting Databricks cluster, we can make a simple connection to the SQL Server database:

import pyodbc

# The ODBC Driver 17 for SQL Server installed in the image above
connection_str = 'DRIVER={ODBC Driver 17 for SQL Server};SERVER=tcp:myserver.database.windows.net,1433'
user = 'test'
password = 'THISAINTREAL'
database = 'test'

# Extra keyword arguments are folded into the ODBC connection string by pyodbc
conn = pyodbc.connect(
    connection_str,
    user=user,
    password=password,
    database=database
)

cursor = conn.cursor()
cursor.execute('SELECT * FROM test.result')

for row in cursor:
    print(row)

Here’s some proof that it does actually work!
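
One closing aside: since the whole point here is metadata lookups, you would normally parameterise these queries rather than build SQL strings by hand. pyodbc supports qmark-style placeholders; the table and column names below are purely illustrative:

# Hypothetical metadata lookup; 'etl.pipeline_config' and 'pipeline_name'
# are illustrative names, not objects from the example above.
cursor.execute(
    'SELECT * FROM etl.pipeline_config WHERE pipeline_name = ?',
    'daily_sales_load',
)
config_row = cursor.fetchone()
print(config_row)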