Azure Databricks Walkabout

Finally got my Azure Databricks preview enabled. This is just a quick overview of how it all hooks together.

Technical Architecture Overview

Basically, Databricks is the PaaS and Azure is the IaaS, which is a smart play by Databricks. Why try to compete with the scale and resilience of the IaaS hosting that Microsoft and AWS already provide? Leveraging partner IaaS to host their PaaS service lets them focus on what they do well: pushing the Databricks Spark platform above and beyond what other data platforms are capable of.

Creating the Service

When you create the Databricks service it creates what appears to be a PaaS service that sits as a blade in the resource group you select. In my case I created a service called sibytes-databricks in my DataLakePatterns resource group.

[Screenshot: the sibytes-databricks service blade overview]

At the top:

  • Managed Resource Group – this is a resource group that houses the managed IaaS – see below
  • URL – the URL of the Databricks PaaS service. This is also where the “Launch Workspace” button takes you
  • Resource Group – just where this pass-through blade to the Databricks service sits in your subscription

When you click the Launch Workspace button it takes you through single sign-on to Databricks, which is where the PaaS service operates. The first time you go through, you’re asked to delegate authorisation to your Azure account from the Databricks service so it can manage the IaaS as part of the service, so you’ll see this:

[Screenshot: the Azure consent prompt delegating permissions to the Databricks service]

NOTE: this has to be an Azure domain account, not a Microsoft account, so you’ll need to set one up beforehand. It’s probably wise to create a domain account specifically for this purpose rather than using a personal user account, so you can lock its permissions down to exactly what it should have – particularly with regard to data access.

Once in you’ll see the usual, and in my opinion brilliantly usable, Databricks PaaS interface.

[Screenshot: the Databricks workspace home screen]

Remember that it’s a single sign-on delegated permission to Azure, so when you manage your account in Databricks it will take you straight back into Azure.

Managed IaaS

When you create your Databricks service you’ll also notice that it creates an additional resource group to house the infrastructure that Databricks spins up to run the enhanced Spark data platform.

[Screenshot: the managed resource group in the Azure portal]

Initially it will look like the following, i.e. with no clusters created or running.

[Screenshot: the managed resource group contents with no clusters running]

It has:

  • A storage account – so it’s not entirely IaaS. The storage is read-only and you cannot access it. I’m not 100% sure what this is for yet: if it’s for data content storage then we might be limited to one account, which would also mean data can only go in and out through Databricks. Or maybe it’s just for specific types of Databricks metadata or processing storage, or for internal config, logs and security. I need to look into this further when I get into data processing and explore what, if any, other Azure storage accounts can be attached (see the sketch after this list). Will blog on it…
  • Virtual Network – this is basically a secure VNet for the IaaS clusters.
  • Network Security Group – the inbound and outbound network security rules that allow Azure and Databricks to talk to each other to spin up IaaS VMs and access data.
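To test the storage question, the obvious thing to try is attaching an external storage account from a notebook. Here’s a minimal sketch of what I’d expect that to look like, assuming the standard Databricks dbutils mount API for Azure Blob Storage behaves the same on the Azure service; the storage account, container and key below are placeholders, not real values.

# Hedged sketch (Databricks notebook): attaching an external Azure Blob Storage
# account via the standard dbutils mount API. All names and keys are placeholders.
storage_account = "<my-storage-account>"
container = "<my-container>"
account_key = "<storage-account-access-key>"

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point="/mnt/external",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net": account_key
    },
)

# If the mount works, the container should be browsable like any other DBFS path
display(dbutils.fs.ls("/mnt/external"))

If that mount works then external storage isn’t locked out, which would answer a big chunk of the flexibility question.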

Creating A Cluster

Things get a little more interesting when we create a cluster. I created a 2-node cluster with 1 driver using Standard DS3s. In the drop-downs for choosing machines you’re limited to a specific collection, and currently they are all listed as beta.

[Screenshot: the Databricks cluster creation dialog]
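As an aside, the same cluster can presumably be scripted rather than clicked together. This is a hedged sketch against the standard Databricks Clusters REST API (clusters/create), assuming it’s exposed on the Azure workspace URL in the usual way; the workspace URL, token and runtime version string are placeholders, not real values.

# Hedged sketch: creating the 2-worker Standard_DS3_v2 cluster via the Databricks
# Clusters REST API rather than the UI. URL, token and spark_version are placeholders.
import requests

workspace_url = "https://<your-workspace-url>"    # the URL shown on the service blade
token = "<personal-access-token>"                 # generated under User Settings in Databricks

payload = {
    "cluster_name": "sibytes-walkabout",
    "spark_version": "<runtime-version-from-the-ui>",  # pick from the same drop-down list
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
}

response = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(response.json())  # should return the new cluster_id on success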

Looking back into the Azure IaaS resource group we can see it has spun up 3 Linux machines of the VM size we chose in Databricks:

[Screenshot: the VMs created in the managed resource group]

When we stop the cluster in Databricks I was expecting the VMs to be stopped. However, that doesn’t happen. Currently Databricks retains the cluster as a terminated cluster and Azure deletes all of the VMs, so all you’re left with is the VNet, NSG and storage account – which is what we started with, plus our data.

What this means is that you’ll pay less when your cluster is stopped (essentially just the storage); however, it can take quite a while for clusters to come up and down, particularly if they’re pretty large. It would be good to have an option to pause the service that keeps the VMs persisted.
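As far as I can tell the stop/start behaviour maps onto the same Clusters REST API: “delete” terminates the cluster (keeping its definition but releasing the VMs) and “start” brings a terminated cluster back by recreating them. Another hedged sketch with placeholder values:

# Hedged sketch: terminate and restart a cluster via the Clusters REST API.
# "delete" terminates (keeps the definition, releases the VMs); "start" recreates the VMs.
import requests

workspace_url = "https://<your-workspace-url>"
headers = {"Authorization": "Bearer <personal-access-token>"}
cluster = {"cluster_id": "<cluster-id>"}

requests.post(f"{workspace_url}/api/2.0/clusters/delete", headers=headers, json=cluster)
requests.post(f"{workspace_url}/api/2.0/clusters/start", headers=headers, json=cluster)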

Thoughts & More to Follow

This is just a quick look and intro – more to follow particularly on data processing, storage and hooking into the PaaS service through their data provider.

How the storage accounts hook in is a big part of how good this will be, because if interaction and flexibility around storage is limited then it may significantly reduce its appeal as an architectural option. If it is significantly locked down then we’ll be forced to use and pay for services that Microsoft and Databricks have agreed to integrate, which we might not otherwise need in order to build a viable and compelling data platform, e.g. Azure Data Warehouse…


Study Notes – Databricks Cram Sheet

What’s the difference between Databricks and Spark?

  • Databricks is a PaaS platform built on Spark that offers all the additional features required to easily productionise Spark into an enterprise-grade integrated platform, with 10-40x performance gains. Comparison is here

Is Databricks database software?

  • No – it’s a distributed calculation engine that provides an analytics, streaming, data lake and data warehouse platform across distributed NoSQL storage

What distributed storage can it run on?

  • AWS S3
  • Azure Data Lake Storage (I think possibly even Blob Storage – not sure yet)
  • Hadoop

What cluster managers does it support for distributing the calculation engine?

  • YARN
  • Mesos
  • Spark Standalone – built in, for dev & learning

What is it implemented in?

  • Scala

What programming languages does it support?

  • Python
  • Java
  • R
  • Scala
  • SQL

What class of use could I use it for?

  • Streaming
  • SQL Analytics
  • Data Transformation (Batch or Realtime)
  • Data Provisioning into Data Warehouse or Data Lake solution
  • Deep Learning
  • Machine Learning (Batch or Realtime)
  • Graph Analysis

What core APIs does it have?

  • MLlib – machine learning
  • Streaming
  • SQL
  • GraphX

Can I use 3rd party non-core APIs?

  • Yes

Its APIs are unified, but what does that mean?

  • It means code can be ported between streaming and batch with little modification; a lot of work has been put into minimising time to production and ease of development, so that, for example, a solution can be migrated from batch to streaming analytics with ease – see the sketch below
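A minimal PySpark sketch of what that portability means in practice (the paths and schema are made up, and spark is the session a Databricks notebook gives you): the only real difference between the batch and streaming versions below is read vs readStream.

# Minimal sketch of the unified API: the same transformation expressed as batch
# and as streaming. Paths and schema are illustrative only.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("device", StringType()),
    StructField("reading", DoubleType()),
])

# Batch: process whatever files are in the folder right now
batch = spark.read.schema(schema).json("/mnt/landing/readings/")
batch_avg = batch.groupBy("device").avg("reading")
batch_avg.show()

# Streaming: identical transformation, only the read/write calls change
stream = spark.readStream.schema(schema).json("/mnt/landing/readings/")
stream_avg = stream.groupBy("device").avg("reading")
(stream_avg.writeStream
    .outputMode("complete")
    .format("memory")
    .queryName("readings_avg")
    .start())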

Is it free?

  • Spark is free; Databricks is not

How can I use it?

  • Databricks has a cloud portal – there is a free trial
  • Databricks can be provisioned on AWS
  • We’ll soon be able to provision Databricks in Azure – it’s in preview

What features differentiate it as a leading data platform?

  • Unified coding model gives shorter dev cycles and time to production
  • It’s PaaS – no hardware cluster to manage, create or look after and I can easily scale it
  • Has a rich collaborative development experience allowing data engineers and data scientists to work together
  • I can run data processing and querying over S3, Azure Data Lake Storage and Hadoop HDFS with:
      • Much greater performance than other distributed storage query engines
      • Automatic Index Creation
      • Automatic Caching
      • Automatic Data Compacting
      • Transactional Support

     

  • There is no buy-in to a proprietary storage format – it just sits on S3, for example, and I can access and manage the data with other processes and tools
  • Delta (2018) transactionally incorporates new batch and/or streaming data immediately for queries – no other data platform has this; see the sketch below
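Here’s a hedged sketch of what that Delta behaviour looks like as I understand it (the paths are made up and Delta has to be available on the workspace): a batch write and a streaming write land in the same table, and queries see whatever has been committed so far.

# Hedged sketch of Databricks Delta: batch and streaming writes into one table,
# with queries seeing committed data transactionally. Paths are placeholders.
events = spark.read.json("/mnt/landing/events/")

# Batch append into a Delta table
events.write.format("delta").mode("append").save("/mnt/delta/events")

# Streaming append into the same table
(spark.readStream.schema(events.schema).json("/mnt/landing/events/")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/delta/events/_checkpoints")
    .outputMode("append")
    .start("/mnt/delta/events"))

# Queries pick up whatever has been committed so far
spark.read.format("delta").load("/mnt/delta/events").count()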

 

The Basics – SSAS Tabular Many-to-Many

I won’t go into this in loads of detail since it’s a fairly well established Kimball modelling approach for cubes…

Essentially, what do you do if the grain of a dimension is lower than the fact grain when denormalised directly onto the fact table? If you design it that way the resulting model will double count the measures unless you add slow, unusable and complex adjustments to your query patterns. That’s the generic definition of the modelling problem, or pattern.

This requires a many-to-many dimension model using a factless fact or a bridge table. Essentially we identify the other dimension entity that bridges the lower grain dimension onto the fact, create a bridge table, and join the lower grain dimension to the fact through the bridge table. We can then hide the bridge table so users are completely unaware of this complexity, and the cube engine takes care of the measure aggregation automatically.

All well and good if you have a good grasp of this design pattern. A classic and more intuitive use case is customers having many accounts and accounts having many customers. Transactions are at an account level but customers relate to accounts at a lower level, and we don’t want to double count transactions when we aggregate a total.

So in cubes:

  • SSAS Multi-Dimensional – has built in dimension relationship type for this feature
  • SSAS Tabular – you had to get around it using DAX
  • SSAS Tabular 2016, 2017 & Azure – now has a built-in relationship feature that handles this automatically – you don’t need to use DAX anymore or create several role-playing customer dimensions.

Finally, note this is also in Power BI, but not in Excel, which is on a slower release cycle! Last I checked you still have to use DAX in Excel. Hopefully Excel Power Pivot models will catch up soon.

Here’s a simple example of how it’s modelled:

[Screenshot: the many-to-many model in SSAS Tabular]

And here it is in Excel; see that accounts total correctly.

[Screenshot: totals by account]

When we break it out by customer we see that it aggregates correctly at the lower grain for individual customers but the total does not double count. E.g. both Shaun and Sarah have 15 in account 001, but the total for account 001 is 15 because they share the same account and it doesn’t double count at the account level. Finally, the grand total is 75, again because the engine doesn’t double count the measures for the total.

[Screenshot: totals by customer and account]

Again, just by customer, we see that it aggregates properly at the lower grain but the total is 75, not 90 which would be wrong – i.e. it doesn’t double count the facts at customer level and shows the total correctly at account level.

[Screenshot: totals by customer]

Quick Tip – SSIS: Excel Drivers

So last week I was just finishing off a generic metadata import solution for MDS. Basically you map your semi-structured Excel data source file to the entities and attributes in MDS, run the solution, and it automatically imports and transforms the file and loads it through to the mapped entities. It was all working great; until Windows installed a patch! Basically it looks like they have finally nailed the coffin shut on the Jet 4 driver – to be fair, I probably shouldn’t have been using it.

So for older Excel files (2007–2010) you can use the Microsoft.ACE.OLEDB.12.0 driver. If you have Office 64-bit installed like me, your SSIS IDE may not have this driver available; but why?

Well, because the IDE requires the 32-bit driver. You can flip the debugging runtime in SSIS to 64-bit, however the IDE needs the 32-bit driver in order to validate the component at design time, and the solution can’t run in 64-bit until it’s validated; which is a tad annoying.

You can install the 32-bit (& 64-bit) drivers from here. The really annoying thing is that if you have Office 64-bit installed it won’t let you install the 32-bit drivers. This is the workaround cmd:

c:\>AccessDatabaseEngine.exe /quiet


Quick Tip – TSQL: Parameter & Variable Naming

Saw this a while ago and have used it ever since… can’t remember who from, otherwise I would credit them…

I always use @@ for parameters and @ for variables. Makes for easier intellisense and readability.

CREATE PROCEDURE [dbo].[my_proc]
( 
 @@name varchar(100), 
 @@is_enabled bit, 
 @@category char(5)
) as
BEGIN

DECLARE @dte DATETIME = GETDATE();

SELECT 
 name,
 status
FROM dbo.process
WHERE name = @@name
  AND is_enabled = @@is_enabled
  AND category = @@category
  AND rundate < @dte

END

Quick Tip – TSQL: Validate Procedure Metadata

So let’s say you’re building some sort of DB integration and you want to list only those procs that have a specific input and output definition. Here’s one way to do it… I used a table variable to hold the definitions, but obviously there might be a better way to do that!

First create a proc we can target:

CREATE PROCEDURE [dbo].[my_proc]
( 
  @@name varchar(100), 
  @@is_enabled bit, 
  @@category char(5)
) as
BEGIN

  SELECT 
    name, 
    object_id, 
    schema_id, 
    type_desc, 
    create_date 
  FROM sys.tables 

END

Some T-SQL that returns the list of procs that meet the specific inputs and outputs:

DECLARE @valid_definition TABLE
( 
  name VARCHAR(150) NOT NULL, 
  type VARCHAR(50) NOT NULL, 
  ordinal INT NOT NULL, 
  direction BIT NOT NULL --1 = input parameter, 0 = output column 
) 

  INSERT INTO @valid_definition (name, type, ordinal, direction) 
  VALUES 
    ('@@name',       'varchar'       ,1 ,1), 
    ('@@is_enabled', 'bit'           ,2 ,1), 
    ('@@category',   'char'          ,3 ,1), 
    ('name',         'nvarchar(128)' ,1 ,0), 
    ('object_id',    'int'           ,2 ,0), 
    ('schema_id',    'int'           ,3 ,0), 
    ('type_desc',    'nvarchar(60)'  ,4 ,0), 
    ('create_date',  'datetime'      ,5 ,0)

  ;WITH cte_params AS 
  ( 
    SELECT  
      pa.object_id, 
      procedure_name   = '[' + OBJECT_SCHEMA_NAME(pa.object_id) + '].[' + OBJECT_NAME(pa.object_id) + ']', 
      parameter_name   = pa.name,
      parameter_type   = TYPE_NAME(pa.user_type_id),
      parameter_length = pa.max_length,
      parameter_prec   = CASE WHEN TYPE_NAME(pa.system_type_id) = 'uniqueidentifier' THEN precision ELSE OdbcPrec(pa.system_type_id, pa.max_length, precision) END,   
      parameter_scale  = OdbcScale(pa.system_type_id, pa.scale),
      parameter_order  = pa.parameter_id 
    FROM sys.parameters pa  
  )
  SELECT DISTINCT 
    proc_signiture.procedure_name   
  FROM 
  ( 
    SELECT 
      pa.procedure_name 
    FROM sys.procedures pr 
    JOIN cte_params pa ON OBJECT_ID(SCHEMA_NAME(pr.schema_id) + '.' + pr.name) = pa.object_id 
    JOIN @valid_definition vd ON vd.name = pa.Parameter_name AND vd.type = pa.parameter_type AND vd.ordinal = pa.parameter_order AND vd.direction = 1 
    GROUP BY pa.procedure_name 
    HAVING COUNT(1) = 
    (
      SELECT COUNT(1) 
      FROM @valid_definition WHERE direction = 1
    ) 
  ) proc_signiture 
  -- get the input procedure signature 
  CROSS APPLY sys.dm_exec_describe_first_result_set (proc_signiture.procedure_name, NULL, 0) proc_metadata 
  --get the output metadata 
  JOIN @valid_definition vd ON vd.name = proc_metadata.name AND vd.type = proc_metadata.system_type_name AND vd.ordinal = proc_metadata.column_ordinal 
  GROUP BY proc_signiture.procedure_name 
  HAVING COUNT(1) = 
  (
    SELECT COUNT(1) 
    FROM @valid_definition WHERE direction = 0
  )

The Basics – T-SQL: Recursive Inheritance

This is a post regarding a SQL design pattern for applying configuration attributes to a parent-child table. There’s a bunch of obvious use cases, particularly around metadata and ETL control & configuration repositories…

Imagine your ETL has hundreds, if not thousands, of processes. It often makes sense to model process metadata in a parent-child table, since you have one root process that, when executed, executes a hierarchy of child processes. If you then have configuration options for those processes, you won’t want to maintain a configuration row for every process, right? That would be a huge maintenance headache. Ideally we would configure the root process with options that the child processes inherit unless they’re specifically overridden. This gives us the best of both worlds: the flexibility of specific configurations for child processes, but where a child has no configuration it inherits the parent’s, resulting in less configuration maintenance. This model doesn’t persist the inherited configuration; it remains sparse and derives the inheritance logically when queried, removing the need for updates or inserts.

We’ll have 2 tables:

  1. The parent child entity: process_executable
  2. The configuration attributes: process_configuration

Here they are:

CREATE TABLE [etl].[process_executable] (
 [process_executable_id] INT IDENTITY (1, 1) NOT NULL,
 [parent_process_executable_id] INT DEFAULT ((-1)) NOT NULL,
 [connection_id] INT NOT NULL,
 [process_executable_name] VARCHAR (200) NOT NULL,
 [created_date] DATETIME DEFAULT (getdate()) NOT NULL,
 [is_enabled] BIT DEFAULT ((1)) NOT NULL,
 CONSTRAINT [PK_process_executable] PRIMARY KEY CLUSTERED ([process_executable_id] ASC)
);


CREATE TABLE [etl].[process_configuration] (
 [process_configuration_id] INT IDENTITY (1, 1) NOT NULL,
 [process_executable_id] INT NULL,
 [number_of_retries] INT NULL,
 [is_incremental] BIT NULL,
 CONSTRAINT [PK_process_configuration] PRIMARY KEY CLUSTERED ([process_configuration_id] ASC)
);

Here is the solution.

First, left join the configurations in so we get all processes with and without configurations. Then recurse down the hierarchy using a recursive CTE, defaulting the configurations to the parent values where they are null.

WITH [cte_process_configuration]  
AS --join in the configuration 
(
  SELECT  
    p.[process_executable_id],  
    p.[parent_process_executable_id],  
    pc.number_of_retries, 
    pc.is_incremental 
  FROM etl.process_executable p 
  LEFT JOIN [etl].[process_configuration] pc ON p.process_executable_id = pc.process_executable_id
 ), 
cte_collapse 
AS --recurse down, taking the parent settings down to children where they don't have settings. 
(
  SELECT  
    m.process_executable_id, 
    m.parent_process_executable_id,
    m.number_of_retries,
    m.is_incremental 
  FROM [cte_process_configuration] m 
  WHERE m.parent_process_executable_id = -1 

  UNION ALL

  SELECT 
    m.process_executable_id, 
    m.parent_process_executable_id, 
    --carry down the settings where there aren't any 
    ISNULL(m.number_of_retries, pm.number_of_retries) AS number_of_retries, 
    ISNULL(m.is_incremental, pm.is_incremental) AS is_incremental 
  FROM [cte_process_configuration] m 
  JOIN cte_collapse pm ON m.parent_process_executable_id = pm.process_executable_id 
)
--show me the money! 
SELECT 
  process_executable_id,  
  number_of_retries, 
  is_incremental 
FROM cte_collapse c