Study Notes – Databricks Cram Sheet

What’s the difference between databricks and spark?

  • Databricks is a PaaS platform built on Spark that adds the features needed to easily productionise Spark into an enterprise-grade integrated platform, with claimed performance gains of 10-40x. A comparison is here

Is Databricks database software?

  • No – it’s a distributed calculation engine that provides analytics, streaming, data lake and data warehouse capabilities over distributed NoSQL storage

What distributed storage can it run on?

  • AWS S3
  • Azure Data Lake Storage (possibly also Azure Blob Storage – not confirmed yet)
  • Hadoop

What cluster managers does it support for distributing the calculation engine?

  • YARN
  • Mesos
  • Spark standalone – built in; good for dev & learning

What is it implemented in?

  • Scala

What programming languages does it support?

  • Python
  • Java
  • R
  • Scala
  • SQL

What classes of use could I put it to?

  • Streaming
  • SQL Analytics
  • Data Transformation (Batch or Realtime)
  • Data Provisioning into Data Warehouse or Data Lake solution
  • Deep Learning
  • Machine Learning (Batch or Realtime)
  • Graph Analysis

What core APIs does it have?

  • MLlib – machine learning
  • Streaming
  • SQL
  • GraphX

Can I use 3rd-party non-core APIs?

  • Yes

Its APIs are unified, but what does that mean?

  • It means code can be ported between streaming and batch with little modification; a lot of work has gone into minimising time to production and easing development – for example, migrating a solution from batch analytics to streaming with ease

Is it free?

  • Spark is free; Databricks is not

How can I use it?

  • Databricks has a cloud portal – there is a free trial
  • Databricks can be provisioned on AWS
  • We’ll soon be able to provision Databricks in Azure – it’s in preview

What features differentiate it as a leading data platform?

  • Unified coding model gives shorter dev cycles and time to production
  • It’s PaaS – no hardware cluster to create, manage or look after, and I can easily scale it
  • Has a rich collaborative development experience allowing data engineers and data scientists to work together
  • I can run data processing and querying over S3, Azure Data Lake Storage and Hadoop HDFS with:
      • Much greater performance than other distributed storage query engines
      • Automatic Index Creation
      • Automatic Caching
      • Automatic Data Compacting
      • Transactional Support


  • There is no buy-in to a proprietary storage format – i.e. the data just sits in S3, for example, and I can access and manage it with other processes and tools
  • Delta (2018) transactionally incorporates new batch and/or streaming data so it is immediately available to queries – no other data platform has this



The Basics – SSAS Tabular Many-to-Many

I won’t go into this in depth since it’s a fairly well-established Kimball modelling approach for cubes…

Essentially: what do you do when the grain of a dimension is lower than the grain of the fact it would be denormalised onto? If you design it that way, the resulting model will double count the measures unless you make slow, unusable and complex adjustments to the query patterns. That’s the generic definition of the modelling problem or pattern.

This requires a many-to-many dimension model using a factless fact or bridge table. Essentially we identify the other dimension entity that bridges the lower-grain dimension onto the fact, create a bridge table, and join the dimension to the fact through it. We can then hide the bridge table so users are completely unaware of the complexity, and the cube engine takes care of the measure aggregation automatically.

All well and good if you have a good grasp of this design pattern. A classic, more intuitive use case is Customers having many Accounts and Accounts having many Customers. Transactions are at an Account level, but Customers relate to Accounts at a lower grain; however, we don’t want to double count transactions when we aggregate a total.
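As a rough sketch of the shape of this model (the table and column names here are hypothetical, not taken from the cube below), the fact stays at account grain and customers reach it only through the bridge:

-- Hypothetical bridge-table model for customers <-> accounts
CREATE TABLE dim_customer (
  customer_id   INT NOT NULL PRIMARY KEY,
  customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE dim_account (
  account_id CHAR(3) NOT NULL PRIMARY KEY
);

-- Bridge: many customers per account, many accounts per customer
CREATE TABLE bridge_customer_account (
  customer_id INT     NOT NULL REFERENCES dim_customer (customer_id),
  account_id  CHAR(3) NOT NULL REFERENCES dim_account (account_id),
  PRIMARY KEY (customer_id, account_id)
);

-- Fact is at account grain; it never references customer directly,
-- so aggregating by account can't double count
CREATE TABLE fact_transaction (
  transaction_id INT IDENTITY (1, 1) PRIMARY KEY,
  account_id     CHAR(3) NOT NULL REFERENCES dim_account (account_id),
  amount         DECIMAL (18, 2) NOT NULL
);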

So in cubes:

  • SSAS Multi-Dimensional – has a built-in dimension relationship type for this
  • SSAS Tabular – you had to work around it using DAX
  • SSAS Tabular 2016, 2017 & Azure – now also has a built-in relationship feature that handles this automatically – you no longer need DAX or several role-playing customer dimensions

Finally, note this is also in Power BI, but not in Excel, which is on a slower release cycle! Last I checked you still have to use DAX in Excel. Hopefully Excel Power Pivot models will catch up soon.

Here’s a simple example of how it’s modelled:

SSAS Tabular

And here it is in Excel; see that the accounts total correctly.

by account

When we break it out by customer we see that it aggregates correctly at the lower grain for individual customers, but the total does not double count. E.g. Shaun and Sarah both have 15 in account 001, yet the total for account 001 is 15, because they share the same account and it isn’t double counted at the account level. The grand total is 75, again because the engine doesn’t double count the measures.

by customer

Again, just by customer, we see that it aggregates properly at the lower grain, but the total is 75, not 90, which would be wrong – i.e. it does not double count the facts at customer level and shows the total correctly at account level.

by customer

Quick Tip – TSQL: Parameter & Variable Naming

Saw this a while ago and have used it ever since… can’t remember who from, otherwise I would credit them…

I always use @@ for parameters and @ for variables. It makes for easier IntelliSense and readability.

CREATE PROCEDURE [dbo].[my_proc] (
  @@name varchar(100),
  @@is_enabled bit,
  @@category char(5)
) AS
BEGIN

  DECLARE @dte datetime = GETDATE();            -- local variable: single @

  SELECT name, is_enabled, category, rundate    -- illustrative column list
  FROM dbo.process
  WHERE name = @@name                           -- parameters: double @@
    AND is_enabled = @@is_enabled
    AND category = @@category
    AND rundate < @dte;

END
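In the proc above, @@name, @@is_enabled and @@category are parameters while @dte is a local variable – the double prefix keeps the two visually distinct and groups the parameters together in IntelliSense.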


Quick Tip – TSQL: Validate Procedure Metadata

So let’s say you’re building some sort of DB integration and you want to list only those procs that have a specific input and output definition. Here’s one way to do it… I used a table variable to hold the definitions, but there might well be a better way to do that!

First create a proc we can target:

CREATE PROCEDURE [dbo].[my_proc] (
  @@name varchar(100),
  @@is_enabled bit,
  @@category char(5)
) AS
BEGIN

  -- return the first five columns of sys.tables; this is the
  -- output signature we'll validate against below
  SELECT name, object_id, schema_id, type_desc, create_date
  FROM sys.tables;

END


Some T-SQL that returns the list of procs that meet the specified inputs and outputs:

DECLARE @valid_definition TABLE
(
  name      VARCHAR(150) NOT NULL,
  type      VARCHAR(50)  NOT NULL,
  ordinal   INT          NOT NULL,
  direction BIT          NOT NULL -- 1 = input parameter, 0 = output column
);

INSERT INTO @valid_definition (name, type, ordinal, direction)
VALUES
  ('@@name',       'varchar',       1, 1),
  ('@@is_enabled', 'bit',           2, 1),
  ('@@category',   'char',          3, 1),
  ('name',         'nvarchar(128)', 1, 0),
  ('object_id',    'int',           2, 0),
  ('schema_id',    'int',           3, 0),
  ('type_desc',    'nvarchar(60)',  4, 0),
  ('create_date',  'datetime',      5, 0);

WITH cte_params AS
(
  SELECT
    object_id        = pa.object_id,
    procedure_name   = '[' + OBJECT_SCHEMA_NAME(pa.object_id) + '].[' + OBJECT_NAME(pa.object_id) + ']',
    parameter_name   =,
    parameter_type   = TYPE_NAME(pa.user_type_id),
    parameter_length = pa.max_length,
    parameter_prec   = CASE WHEN TYPE_NAME(pa.system_type_id) = 'uniqueidentifier'
                            THEN pa.precision
                            ELSE OdbcPrec(pa.system_type_id, pa.max_length, pa.precision) END,
    parameter_scale  = OdbcScale(pa.system_type_id, pa.scale),
    parameter_order  = pa.parameter_id
  FROM sys.parameters pa
)
SELECT proc_signature.procedure_name
FROM
(
  -- get the procedures with a matching input signature
  SELECT pa.procedure_name
  FROM sys.procedures pr
  JOIN cte_params pa ON OBJECT_ID(SCHEMA_NAME(pr.schema_id) + '.' + = pa.object_id
  JOIN @valid_definition vd ON    = pa.parameter_name
                           AND vd.type    = pa.parameter_type
                           AND vd.ordinal = pa.parameter_order
                           AND vd.direction = 1
  GROUP BY pa.procedure_name
  HAVING COUNT(1) =
  (
    SELECT COUNT(1)
    FROM @valid_definition WHERE direction = 1
  )
) proc_signature
-- get the output metadata
CROSS APPLY sys.dm_exec_describe_first_result_set (proc_signature.procedure_name, NULL, 0) proc_metadata
JOIN @valid_definition vd ON    =
                         AND vd.type    = proc_metadata.system_type_name
                         AND vd.ordinal = proc_metadata.column_ordinal
                         AND vd.direction = 0
GROUP BY proc_signature.procedure_name
HAVING COUNT(1) =
(
  SELECT COUNT(1)
  FROM @valid_definition WHERE direction = 0
);
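Run against the example proc above, this should return just [dbo].[my_proc] – a proc only qualifies when the count of its matching parameters and the count of its matching result set columns both equal the counts held in @valid_definition.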

The Basics – T-SQL: Recursive Inheritance

This is a post about a SQL design pattern for applying configuration attributes to a parent-child table. There are a bunch of obvious use cases, particularly around metadata and ETL control & configuration repositories…

Imagine your ETL has hundreds, if not thousands, of processes. It often makes sense to model process metadata in a parent-child table, since you have one root process that, when executed, executes a hierarchy of child processes. If you then have configuration options for those processes, you won’t want to maintain a configuration row for every process, right? That would be a huge maintenance headache. Ideally we would configure the root process with options that the child processes simply inherit unless they’re specifically overridden. This gives us the best of both worlds: child processes can have specific configurations, but where they don’t, they inherit the configurations of their parent, resulting in less configuration maintenance. The model doesn’t persist the inherited configuration; it remains sparse and derives the inheritance logically when queried, removing the need for updates or inserts.

We’ll have 2 tables:

  1. The parent child entity: process_executable
  2. The configuration attributes: process_configuration

Here they are:

CREATE TABLE [etl].[process_executable] (
  [process_executable_id] INT IDENTITY (1, 1) NOT NULL,
  [parent_process_executable_id] INT DEFAULT ((-1)) NOT NULL,
  [connection_id] INT NOT NULL,
  [process_executable_name] VARCHAR (200) NOT NULL,
  [created_date] DATETIME DEFAULT (getdate()) NOT NULL,
  [is_enabled] BIT DEFAULT ((1)) NOT NULL,
  CONSTRAINT [PK_process_executable] PRIMARY KEY CLUSTERED ([process_executable_id] ASC)
);

CREATE TABLE [etl].[process_configuration] (
  [process_configuration_id] INT IDENTITY (1, 1) NOT NULL,
  [process_executable_id] INT NULL,
  [number_of_retries] INT NULL,
  [is_incremental] BIT NULL,
  CONSTRAINT [PK_process_configuration] PRIMARY KEY CLUSTERED ([process_configuration_id] ASC)
);

Here is the solution:

First, left join the configurations in so we get all processes, with and without configurations. Then recurse down the hierarchy with a recursive CTE, defaulting the configurations to the parent values where they are null.

WITH [cte_process_configuration]
AS --join in the configuration
(
  SELECT p.process_executable_id, p.parent_process_executable_id, pc.number_of_retries, pc.is_incremental
  FROM etl.process_executable p
  LEFT JOIN [etl].[process_configuration] pc ON p.process_executable_id = pc.process_executable_id
),
[cte_collapse]
AS --recurse down, tacking the parent settings onto children that don't have their own
(
  SELECT m.process_executable_id, m.parent_process_executable_id, m.number_of_retries, m.is_incremental
  FROM [cte_process_configuration] m
  WHERE m.parent_process_executable_id = -1

  UNION ALL

  SELECT m.process_executable_id, m.parent_process_executable_id,
    --carry down the settings where there aren't any
    ISNULL(m.number_of_retries, pm.number_of_retries) AS number_of_retries,
    ISNULL(m.is_incremental, pm.is_incremental) AS is_incremental
  FROM [cte_process_configuration] m
  JOIN cte_collapse pm ON m.parent_process_executable_id = pm.process_executable_id
)
--show me the money!
SELECT c.process_executable_id, c.number_of_retries, c.is_incremental
FROM cte_collapse c;
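As a quick, hypothetical smoke test (the ids in the comments assume empty tables, so the identity values start at 1), configure only the root and check that the children inherit its settings:

INSERT INTO etl.process_executable (parent_process_executable_id, connection_id, process_executable_name)
VALUES (-1, 1, 'root_load'),      -- id 1: the root
       (1,  1, 'child_extract'),  -- id 2: child of the root
       (1,  1, 'child_load');     -- id 3: child of the root

-- configure the root only; the children have no configuration rows
INSERT INTO etl.process_configuration (process_executable_id, number_of_retries, is_incremental)
VALUES (1, 3, 1);

-- the query above should now return number_of_retries = 3 and
-- is_incremental = 1 for all three processes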


Quick Tip – C# Batch Parallel Processing

There are a fair few creative solutions around for executing data processing tasks in parallel in SQL and SSIS. As good as they are, they’re not really necessary if you use Parallel.ForEach from the System.Threading.Tasks namespace in C#. This is a parallel loop that iterates the items in a collection and runs some code for each item; all the thread allocation and loop tracking is handled internally without you having to do anything.

The only other thing you might want to do is pass it a parameter specifying how many concurrent executions you want, making it adjustable and easy to configure at runtime:

using System.Collections.Generic;
using System.Threading.Tasks;

int maxP = 5;

// the collection of work items to iterate (illustrative)
var processes = new List<string> { "process_a", "process_b", "process_c" };

Parallel.ForEach(
    processes,
    new ParallelOptions { MaxDegreeOfParallelism = maxP },
    (currentProcess) =>
    {
        // Execute SSIS Package
        // Execute SQL Proc
        // Azure Data Lake Analytics Job
    });
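Note that Parallel.ForEach blocks the calling thread until all iterations have completed, and MaxDegreeOfParallelism is an upper bound on concurrency rather than a guarantee – a sensible cap when each iteration hits a shared resource such as a database server.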


The Basics – BIML : Preview Pain & Multiple File Dependencies

BIML Express 2017 has been formally released, and a great new feature is the preview pane that lets you see how your BIML will be rendered during development. For metaprogramming this is awesome and saves a lot of pain. You can read about it here and here.


So what if you have multiple dependent BIML files? E.g. I have a file that retrieves my table metadata to build table definition BIML, and a file that consumes the table definition BIML to create my packages. This way I can use the same table definition BIML file for multiple package patterns. I thought this would be too much for the first release to cope with – but it does in fact work, which is super awesome.

The trick is that the preview pane operates on all the open files. So all the dependent files have to be open to fully render the desired result in the preview pane.

So with just my package definition BIML file open, no package BIML code is rendered in the preview… this is because there are no table defs to iterate to create the packages.


If I open both the table definition BIML script and the package definition script then we’re all good.

Table definition BIML


Package definition BIML – the preview now has packages!