Study Notes – Databricks Cram Sheet

What’s the difference between Databricks and Spark?

  • Databricks is a PaaS offering built on Spark that adds the features required to easily productionise Spark into an enterprise-grade integrated platform, with claimed 10–40x performance gains over vanilla Spark

Is Databricks database software?

  • No – it’s a distributed computation engine that provides analytics, streaming, data lake and data warehouse capabilities over distributed object storage

What distributed storage can it run on?

  • AWS S3
  • Azure Data Lake Storage (possibly Azure Blob Storage too – not yet confirmed)
  • Hadoop

What cluster managers does it support for distributing the calculation engine?

  • YARN
  • Mesos
  • Spark Standalone – built in; suited to dev &amp; learning

What is it implemented in?

  • Scala

What programming languages does it support?

  • Python
  • Java
  • R
  • Scala
  • SQL

What class of use could I use it for?

  • Streaming
  • SQL Analytics
  • Data Transformation (Batch or Realtime)
  • Data Provisioning into Data Warehouse or Data Lake solution
  • Deep Learning
  • Machine Learning (Batch or Realtime)
  • Graph Analysis

What core APIs does it have?

  • MLlib – machine learning
  • Streaming
  • SQL
  • GraphX

Can I use 3rd-party non-core APIs?

  • Yes

Its APIs are unified but what does that mean?

  • It means code can be ported between batch and streaming with little modification; a lot of work has gone into minimising time to production and easing development, so that, for example, a batch analytics solution can be migrated to streaming (or vice versa) with ease

Is it free?

  • Spark is free; Databricks is not

How can I use it?

  • Databricks has a cloud portal – there is a free trial
  • Databricks can be provisioned on AWS
  • Databricks can soon be provisioned in Azure – it’s currently in preview

What features differentiate it as a leading data platform?

  • Unified coding model gives shorter dev cycles and time to production
  • It’s PaaS – no hardware or cluster to create, manage or look after, and it scales easily
  • Has a rich collaborative development experience allowing data engineers and data scientists to work together
  • I can run data processing and querying over S3, Azure Data Lake Storage and Hadoop HDFS with:
      • Much greater performance than other distributed storage query engines
      • Automatic Index Creation
      • Automatic Caching
      • Automatic Data Compacting
      • Transactional Support

  • There is no buy-in to a proprietary storage format – i.e. the data just sits in S3, for example, and I can access and manage it with other processes and tools
  • Delta (2018) transactionally incorporates new batch and/or streaming data so it is immediately available to queries – a capability few other data platforms offer (see the SQL sketch below)
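
To make the Delta point concrete, here is a minimal Spark SQL sketch assuming a Databricks workspace; the events and staging_events table names and the event_type column are invented purely for illustration:

-- Hypothetical example: create a Delta table from a staging source
CREATE TABLE events
USING DELTA
AS SELECT * FROM staging_events;

-- New batch or streaming appends to the Delta table become visible
-- to queries transactionally, with no separate publish step
SELECT event_type,
       COUNT(*) AS event_count
FROM events
GROUP BY event_type;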

 


The Basics – SSAS Tabular Many-to-Many

I won’t go into this loads since it’s a fairly well established Kimball modelling approach for cubes…

Essentially, what do you do if the grain of a dimension is lower than the fact grain when denormalised directly onto the fact table? If you design it that way, the resulting model will double count the measures unless you make slow, unusable and complex adjustments to the query patterns. That’s the generic definition of the modelling problem or pattern.

This requires a many-to-many dimension model using a factless fact or a bridge table. Essentially we identify the other dimension entity that bridges the lower-grain dimension onto the fact, create a bridge table, and join the lower-grain dimension to the fact through it. We can then hide the bridge table; the users are completely unaware of this complexity and the cube engine takes care of the measure aggregation automatically.

All well and good if you have a good grasp of this design pattern. A classic use case that is more intuitive to think of is Customers having many Accounts and Accounts having many Customers. Transactions are at an Account level but Customers relate to Accounts at a lower grain. However, we don’t want to double count transactions when we aggregate a total. A rough relational sketch of this example follows below.
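
As a rough sketch only (the table and column names are invented for illustration, not taken from the screenshots further down), the underlying shape of the customer/account example might look like this:

CREATE TABLE dim_customer (customer_id INT PRIMARY KEY, customer_name VARCHAR(100));
CREATE TABLE dim_account  (account_id  INT PRIMARY KEY, account_number CHAR(3));

-- Bridge table resolving the many-to-many between customers and accounts
CREATE TABLE bridge_customer_account (
  customer_id INT NOT NULL REFERENCES dim_customer (customer_id),
  account_id  INT NOT NULL REFERENCES dim_account (account_id),
  PRIMARY KEY (customer_id, account_id)
);

-- Transactions are at account grain; customers reach the fact via the bridge
CREATE TABLE fact_transaction (
  transaction_id INT IDENTITY (1, 1) PRIMARY KEY,
  account_id     INT NOT NULL REFERENCES dim_account (account_id),
  amount         DECIMAL (18, 2) NOT NULL
);

In the Tabular model the fact relates to the account dimension, the account dimension relates to the bridge, and the bridge relates to the customer dimension; the bridge is then hidden from users.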

So in cubes:

  • SSAS Multi-Dimensional – has built in dimension relationship type for this feature
  • SSAS Tabular – you had to get around it using DAX
  • SSAS Tabular 2016, 2017 & Azure – Now also has a built-in relationship feature that handles this automatically – You don’t need to use DAX anymore or create several role playing customer dimensions.

Finally, note this is also in Power BI, but not in Excel Power Pivot, which is on a slower release cycle! Last I checked you still have to use DAX in Excel. Hopefully Excel Power Pivot models will catch up soon.

Here’s a simple example of how it’s modelled:

[Image: SSAS Tabular model diagram]

And here it is in Excel; see that accounts total correctly.

[Image: pivot table by account]

When we break it out by customer we see that it aggregates correctly at the lower grain for individual customers but the total does not double count. E.g. both Shaun and Sarah have 15 in account 001, but the total for account 001 is still 15 because they share the same account and it isn’t double counted at the account level. Finally, the grand total is 75, again because the engine doesn’t double count the measures for the total.

[Image: pivot table by account and customer]

Again, just by customer, we see that it aggregates properly at the lower grain but the total is 75, not 90 (which would be wrong), i.e. it doesn’t double count the facts at customer level and shows the total correctly at account level.

[Image: pivot table by customer]

Quick Tip – SSIS: Excel Drivers

So last week I was just finishing off a generic metadata import solution for MDS. Basically you map your semi-structured Excel data source file to the entities and attributes in MDS, run the solution, and it automatically imports and transforms the file and loads it through to the mapped entities. It was all working great – until Windows installed a patch! It looks like they have finally nailed the coffin shut on the Jet 4 driver; to be fair, I probably shouldn’t have been using it.

So for Excel files you can use the Microsoft.ACE.OLEDB.12.0 driver (the Access Database Engine 2007/2010 redistributable), which handles both the older .xls and the newer .xlsx formats. If you have 64-bit Office installed like me, your SSIS IDE may not have this driver available; but why?

Well, because the IDE requires the 32-bit driver. You can flip the debugging runtime in SSIS to 64-bit (the Run64BitRuntime project property), but the IDE still requires the 32-bit driver to validate the component at design time, and the solution can’t run in 64-bit until it’s validated, which is a tad annoying.

You can install the 32 (& 64) bit drivers from here. The really annoying thing is that if you have office 64 bit installed it won’t let you install the 32 drivers. This is the workaround cmd:

c:\>AccessDatabaseEngine.exe /quiet
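
Once the driver is installed, one quick way to sanity-check it outside SSIS is to query the workbook from T-SQL with OPENROWSET. This is a rough sketch only: it assumes ad hoc distributed queries are enabled on the instance, and the file path and sheet name are made up:

-- Hypothetical check that the ACE provider can read the workbook
SELECT *
FROM OPENROWSET(
  'Microsoft.ACE.OLEDB.12.0',
  'Excel 12.0;Database=C:\temp\metadata.xlsx;HDR=YES',
  'SELECT * FROM [Sheet1$]'
);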


Quick Tip – TSQL: Parameter & Variable Naming

Saw this a while ago and have used it ever since… can’t remember who from, otherwise I would credit them…

I always use @@ for parameters and @ for variables. Makes for easier intellisense and readability.

CREATE PROCEDURE [dbo].[my_proc]
( 
 @@name varchar(100), 
 @@is_enabled bit, 
 @@category char(5)
) as
BEGIN

DECLARE @dte DATETIME = GETDATE();

SELECT 
 name,
 status
FROM dbo.process
WHERE name = @@name
  AND is_enabled = @@is_enabled
  AND category = @@category
  AND rundate < @dte

END
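
For completeness, a call to the proc then names the parameters with the same @@ prefix (the values below are made up):

EXEC [dbo].[my_proc]
  @@name = 'nightly_load',
  @@is_enabled = 1,
  @@category = 'batch';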

Quick Tip – TSQL: Validate Procedure Metadata

So let’s say you’re building some sort of DB integration and you want to list only those procs that have a specific input and output definition. Here’s one way to do it… I used a table variable to hold the definitions, but obviously there might be a better way to do that!

First create a proc we can target:

CREATE PROCEDURE [dbo].[my_proc]
( 
  @@name varchar(100), 
  @@is_enabled bit, 
  @@category char(5)
) as
BEGIN

  SELECT 
    name, 
    object_id, 
    schema_id, 
    type_desc, 
    create_date 
  FROM sys.tables 

END

Some T-SQL that returns the list of procs that meet the specified inputs and outputs:

DECLARE @valid_definition TABLE
( 
  name VARCHAR(150) NOT NULL, 
  type VARCHAR(50) NOT NULL, 
  ordinal INT NOT NULL, 
  direction BIT NOT NULL --1=input, 0=output 
) 

  INSERT INTO @valid_definition (name, type, ordinal, direction) 
  VALUES 
    ('@@name',       'varchar'       ,1 ,1), 
    ('@@is_enabled', 'bit'           ,2 ,1), 
    ('@@category',   'char'          ,3 ,1), 
    ('name',         'nvarchar(128)' ,1 ,0), 
    ('object_id',    'int'           ,2 ,0), 
    ('schema_id',    'int'           ,3 ,0), 
    ('type_desc',    'nvarchar(60)'  ,4 ,0), 
    ('create_date',  'datetime'      ,5 ,0)

  ;WITH cte_params AS 
  ( 
    SELECT  
      pa.object_id, 
      procedure_name   = '[' + OBJECT_SCHEMA_NAME(pa.object_id) + '].[' + OBJECT_NAME(pa.object_id) + ']', 
      parameter_name   = pa.name,
      parameter_type   = TYPE_NAME(pa.user_type_id),
      parameter_length = pa.max_length,
      parameter_prec   = CASE WHEN TYPE_NAME(pa.system_type_id) = 'uniqueidentifier' THEN precision ELSE OdbcPrec(pa.system_type_id, pa.max_length, precision) END,   
      parameter_scale  = OdbcScale(pa.system_type_id, pa.scale),
      parameter_order  = pa.parameter_id 
    FROM sys.parameters pa  
  )
  SELECT DISTINCT 
    proc_signature.procedure_name   
  FROM 
  ( 
    SELECT 
      pa.procedure_name 
    FROM sys.procedures pr 
    JOIN cte_params pa ON OBJECT_ID(SCHEMA_NAME(pr.schema_id) + '.' + pr.name) = pa.object_id 
    JOIN @valid_definition vd ON vd.name = pa.Parameter_name AND vd.type = pa.parameter_type AND vd.ordinal = pa.parameter_order AND vd.direction = 1 
    GROUP BY pa.procedure_name 
    HAVING COUNT(1) = 
    (
      SELECT COUNT(1) 
      FROM @valid_definition WHERE direction = 1
    ) 
  ) proc_signature 
  -- get the input procedure signature 
  CROSS APPLY sys.dm_exec_describe_first_result_set (proc_signature.procedure_name, NULL, 0) proc_metadata 
  -- get the output metadata 
  JOIN @valid_definition vd ON vd.name = proc_metadata.name AND vd.type = proc_metadata.system_type_name AND vd.ordinal = proc_metadata.column_ordinal 
  GROUP BY proc_signature.procedure_name 
  HAVING COUNT(1) = 
  (
    SELECT COUNT(1) 
    FROM @valid_definition WHERE direction = 0
  )

The Basics – T-SQL: Recursive Inheritance

This is a post regarding a SQL design pattern for applying configuration attributes to a parent child table. There’s a bunch of obvious use cases particularly around metadata and ETL control & configuration repositories…

Imagine your ETL has hundreds, if not thousands, of processes. It often makes sense to model process metadata in a parent-child table, since you have one root process that, when executed, executes a hierarchy of child processes. If you then have configuration options for those processes, you won’t want to maintain a configuration row for every process, right? That would be a huge maintenance headache. Ideally we would configure the root process with options that the child processes inherit unless they’re specifically overridden. This gives us the best of both worlds: the flexibility of specific configurations for child processes, while processes without configurations inherit them from their parent, resulting in less configuration maintenance. This model doesn’t persist the inherited configuration; it stays sparse and derives the inheritance logically when queried, removing the need for updates or inserts.

We’ll have 2 tables:

  1. The parent child entity: process_executable
  2. The configuration attributes: process_configuration

Here they are:

CREATE TABLE [etl].[process_executable] (
 [process_executable_id] INT IDENTITY (1, 1) NOT NULL,
 [parent_process_executable_id] INT DEFAULT ((-1)) NOT NULL,
 [connection_id] INT NOT NULL,
 [process_executable_name] VARCHAR (200) NOT NULL,
 [created_date] DATETIME DEFAULT (getdate()) NOT NULL,
 [is_enabled] BIT DEFAULT ((1)) NOT NULL,
 CONSTRAINT [PK_process_executable] PRIMARY KEY CLUSTERED ([process_executable_id] ASC)
);


CREATE TABLE [etl].[process_configuration] (
 [process_configuration_id] INT IDENTITY (1, 1) NOT NULL,
 [process_executable_id] INT NULL,
 [number_of_retries] INT NULL,
 [is_incremental] BIT NULL,
 CONSTRAINT [PK_process_configuration] PRIMARY KEY CLUSTERED ([process_configuration_id] ASC)
);

Here is the solution.

First, left join the configurations in so we get all processes, with and without configurations. Then recurse down the hierarchy using a recursive CTE, defaulting the configurations to the parent values where they are null.

WITH [cte_process_configuration]  
AS --join in the configuration 
(
  SELECT  
    p.[process_executable_id],  
    p.[parent_process_executable_id],  
    pc.number_of_retries, 
    pc.is_incremental 
  FROM etl.process_executable p 
  LEFT JOIN [etl].[process_configuration] pc ON p.process_executable_id = pc.process_executable_id
 ), 
cte_collapse 
AS --recurse down, carrying the parent settings down to children where they don't have their own settings. 
(
  SELECT  
    m.process_executable_id, 
    m.parent_process_executable_id,
    m.number_of_retries,
    m.is_incremental 
  FROM [cte_process_configuration] m 
  WHERE m.parent_process_executable_id = -1 

  UNION ALL

  SELECT 
    m.process_executable_id, 
    m.parent_process_executable_id, 
    --carry down the settings where there aren't any 
    ISNULL(m.number_of_retries, pm.number_of_retries) AS number_of_retries, 
    ISNULL(m.is_incremental, pm.is_incremental) AS is_incremental 
  FROM [cte_process_configuration] m 
  JOIN cte_collapse pm ON m.parent_process_executable_id = pm.process_executable_id 
)
--show me the money! 
SELECT 
  process_executable_id,  
  number_of_retries, 
  is_incremental 
FROM cte_collapse c
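
To see the inheritance in action, you could seed the tables with something like the following (the ids and values are made up, and the identity values assume empty tables). Child process 2 has no configuration row, so the query above returns the root’s number_of_retries and is_incremental for it, while process 3’s explicit row overrides the root’s settings:

-- Hypothetical seed data: one root process and two children
INSERT INTO etl.process_executable (parent_process_executable_id, connection_id, process_executable_name)
VALUES (-1, 1, 'root_load'),    -- id 1: the root
       ( 1, 1, 'child_load_a'), -- id 2: no configuration row, inherits from root
       ( 1, 1, 'child_load_b'); -- id 3: has its own configuration row

INSERT INTO etl.process_configuration (process_executable_id, number_of_retries, is_incremental)
VALUES (1, 3, 1), -- root defaults
       (3, 0, 0); -- explicit override for child_load_b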


Quick Tip – C# Batch Parallel Processing

There are a fair few creative solutions around for executing data processing tasks in parallel in SQL and SSIS. As good as they are, they’re not really necessary if you use Parallel.ForEach from the .NET System.Threading.Tasks namespace. This is a parallel loop that iterates the items in a collection and runs some code for each item. All the thread allocation and loop tracking is handled internally without you having to do anything.

The only other thing you might want to do is pass it a ParallelOptions argument to cap how many concurrent executions you want, making the degree of parallelism adjustable and easy to configure at runtime:

// Cap the number of concurrent executions (this could come from configuration)
int maxP = 5;

Parallel.ForEach(
    collectionOfProcesses,
    new ParallelOptions { MaxDegreeOfParallelism = maxP },
    currentProcess =>
    {
        // Execute an SSIS package, a SQL proc,
        // or an Azure Data Lake Analytics job, etc.
        Execute(currentProcess);
    });