Azure Databricks Walkabout

Finally got my Azure Databricks preview enabled. This is just a quick overview of how it all hooks together.

Technical Architecture Overview

Basically Databricks is the PaaS and Azure is the IaaS. Which is a smart play by Databricks. Why try and compete with the scale and resilience that Microsoft and AWS hosting IaaS. Leveraging partner IaaS to host their PaaS service allows them to do what they do well which is to focus on pushing the Databricks spark platform above and beyond what other data platforms are capable of.

Creating the Service

When you create the Databricks service it creates what appears to be a PaaS service that sits in a blade in the resource group you select when creating the service. So in my case I created a service called sibytes-databricks in my DataLakePatterns resource group.

Screen Shot 2018-01-14 at 08.45.08

At the top:

  • Managed Resource Group – this is a resource group that houses the managed IaaS – see below
  • URL – the URL to the Databricks PaaS service. This is also where the “Launch Workspace” button goes to
  • Resource Group – Just where this pass through blade to the Databricks service sit in your subscription

When you click the Launch Workspace button it takes you through a single sign-on to Databricks which is where the PaaS services operates. The first time you go through you’re asked to delegate authorisation to your azure account from the Databricks service so it can manage the IaaS as part of the service. So you’ll see this

Screen Shot 2018-01-14 at 08.19.34

NOTE: this has to be an azure domain account not a Microsoft account so you’ll need to set one up before hand. Also it’s probably wise to create a domain account specifically for this purpose rather than using a specific user account so you can lock down it’s permissions entirely to what it should have – particularly with regards data access.

Once in you’ll see the usual and in my opinion brilliantly usable PaaS service interface.

Screen Shot 2018-01-14 at 08.47.00

Remember that it’s a single sign-on delegated permission to Azure so when you manage your account in Databricks it will take straight back into Azure.

Managed IaaS

When you create your Databricks service you’ll also notice that it will also create an additional Resource Group to house the infrastructure services that Databricks creates to run the enhanced spark data platform.

Screen Shot 2018-01-14 at 08.31.10

Initially it will look like the following i.e. with no clusters created and running.

Screen Shot 2018-01-14 at 09.06.14It has:

  • A storage account – so it’s not entirely IaaS. The storage is read-only and you cannot access it. Not 100% what this is for yet but if it’s for data content storage then we might be limited to 1 account and that also means data going in and out can only be through Databricks. Or maybe it’s just for a specific types of Databricks meta data or processing storage type… Or maybe it’s for internal config, logs and security… Need to look into this further when I get into data processing and explore to see what if any other Azure storage accounts can be attached. Will blog on it…
  • Virtual Network – This is basically a secure vnet for the IaaS clusters.
  • Network Security Group – This is all the network security inbound and outbound settings that allow Azure and Databaricks to speak to each other to spin up IaaS VM’s and access data.

Creating A Cluster

Things get a little more interesting when we create a cluster. I created a 2 node cluster with a 1 driver using Standard DS3’s. In the drop downs for choosing machines you’re limited to a specific collection and currently they are all listed as beta.

Screen Shot 2018-01-14 at 08.32.10

Looking back into the Azure IaaS resource group we can see it’s spun up 3 linux machines on the VM’s we have chosen in Databricks:

Screen Shot 2018-01-14 at 08.33.20

When we stop the cluster in Databricks I was expecting the VM’s to be stopped. However that doesn’t happen. Currently what happens is that Databricks retains the cluster as a terminated cluster and Azure delete’s all of the VM’s so all you’re left with is the VNet, SGN and Storage Account with our data on it – which is what we started with plus our data.

What this means is that you’ll pay less in storage when you cluster is stopped however it can take quite a while for clusters to come and down particularly if they’re pretty large. It would be good to have an option to pause the service that keeps the VM’s persisted.

Thoughts & More to Follow

This is just a quick look and intro – more to follow particularly on data processing, storage and hooking into the PaaS service through their data provider.

How the storage accounts hook in is a big part of how good this will be because if interaction and flexibility around storage is limited then it may significantly reduce it’s appeal as an architectural option. If it is significantly locked down then we’ll be forced to use and pay for services that Microsoft and Databricks have agreed to integrate that you would otherwise may not need in order to build a viable and compelling data platform  e.g. Azure Data Warehouse…