Finer Points – USQL: Merging Datasets Part 2

In part 1 we looked at two ways of performing a data merge using U-SQL:

  1. Union of Left Outer, Inner, Right Outer Join – referred to as LIR
  2. Conditional Select on Full Outer Join – referred to as CFO

In this blog we’ll observe the job graphs and performance measures.

Setup

To give it a bit more of a grind, I've generated 16 million customers with the same detail attributes as before:

  • ID – Unique business identifier
  • Title
  • Email
  • Firstname
  • Surname

All this data is artificially fabricated. Any similarity of names and titles is purely coincidental; the email addresses are junk and don't exist, and the unique IDs are sequentially generated.

The load deterministically merges 50% of the total customer volume using ID as the merge key. I'll run both cases with a maximum compute of 10 AUs.
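The 16 million rows themselves were generated outside of these scripts. Purely as an illustrative sketch (not the generator actually used), something along these lines would fabricate a file of the same shape: comma separated with a header row, matching the extractor settings below:

# illustrative only: fabricate 16 million comma-separated customer rows with a header
{
  echo "Id,Title,Email,Firstname,Surname"
  seq 1 16000000 | awk -v OFS=',' '{print $1, "Mr", "person" $1 "@example.invalid", "First" $1, "Last" $1}'
} > person1.csv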

As a reminder here is the U-SQL…

LIR Method – Union of Left Outer, Inner and Right Outer Join

USE Patterns;
 
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
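// note: character code 44 is a comma (the source files are comma separated, despite the variable name)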
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

//  Path
DECLARE @filenumber string = "1";
DECLARE @filename string = "person";
DECLARE @inpath string = "/Person/" + @filename + @filenumber + ".csv";
DECLARE @outpath string = "/Person/output/" + @filename + @filenumber + ".csv";
 
// read data
@data =
    EXTRACT Id int,
            Title string,
            Email string,
            Firstname string,
            Surname string
    FROM @inpath
    USING Extractors.Text(delimiter : @delimiter, escapeCharacter : @escape, quoting : false, skipFirstNRows : 1);

@merge =
    //update current
    SELECT [src].[Id] AS [personid],
           [src].[Title] AS [title],
           [src].[Email] AS [email],
           [src].[Firstname] AS [firstname],
           [src].[Surname] AS [surname],
           (DateTime?) null AS [deleteddate],
           @filename + @filenumber AS [filename]
    FROM @data AS [src]
         INNER JOIN
             [dw].[person] AS [tgt]
         ON [src].[Id] == [tgt].[personid]
 
    UNION ALL
 
    //insert new
    SELECT [src].[Id] AS [personid],
           [src].[Title] AS [title],
           [src].[Email] AS [email],
           [src].[Firstname] AS [firstname],
           [src].[Surname] AS [surname],
           (DateTime?) null AS [deleteddate],
           @filename + @filenumber AS [filename]
    FROM @data AS [src]
         LEFT OUTER JOIN
             [dw].[person] AS [tgt]
         ON [src].[Id] == [tgt].[personid]
    WHERE [tgt].[personid] IS NULL
 
    UNION ALL
 
    //keep existing and logically delete
    SELECT [tgt].[personid],
           [tgt].[title],
           [tgt].[email],
           [tgt].[firstname],
           [tgt].[surname],
           (DateTime?) DateTime.Now AS [deleteddate],
           [tgt].[filename]
    FROM @data AS [src]
         RIGHT OUTER JOIN
             [dw].[person] AS [tgt]
         ON [src].[Id] == [tgt].[personid]
    WHERE [src].[Id] IS NULL;
 
//truncate table - we can't do row level operations
//it's all file based processing so we have to reload the whole table (partition)
TRUNCATE TABLE [dw].[person];
 
//insert the merged data
INSERT INTO [dw].[person]
(
    [personid],
    [title],
    [email],
    [firstname],
    [surname],
    [deleteddate],
    [filename]
)
SELECT [personid],
       [title],
       [email],
       [firstname],
       [surname],
       [deleteddate],
       [filename]
FROM @merge;

CFO Method – Conditional Select on Full Outer Join

USE Patterns;
 
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
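// note: character code 44 is a comma (the source files are comma separated, despite the variable name)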
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";
 
//  Path - increment @filenumber to load files 2 and 3 on successive runs
DECLARE @filenumber string = "1";
DECLARE @filename string = "person";
DECLARE @inpath string = "/Person/" + @filename + @filenumber + ".csv";
DECLARE @outpath string = "/Person/output/" + @filename + @filenumber + ".csv";
 
// read data
@data =
    EXTRACT Id int,
            Title string,
            Email string,
            Firstname string,
            Surname string
    FROM @inpath
    USING Extractors.Text(delimiter : @delimiter, escapeCharacter : @escape, quoting : false, skipFirstNRows : 1);
 
@merge =
        SELECT
            //select the source data if insert or update
            //select the target data if deleted 
            //we have to cast the datatypes since the if else construct returns nullable types

            (int)(issource ? [srcPersonid] : [tgtPersonid] ) AS personid,
            (string)(issource ? [srcTitle] : [tgtTitle] ) AS title,
            (string)(issource ? [srcEmail] : [tgtEmail] ) AS email,
            (string)(issource ? [srcFirstname] : [tgtFirstname] ) AS firstname,
            (string)(issource ? [srcSurname] : [tgtSurname] ) AS surname,
            (issource ? null : (DateTime?)DateTime.Now ) AS deleteddate,
            (string)(issource ? [srcFilename] : [tgtFilename] ) AS filename
        FROM
        (
    SELECT (
               // create a boolean that can be re-used in the outer 
               // query to keep the code clean
               // if update
               ([src].[Id] == [tgt].[personid] & [src].[Id] != null)
               // or if insert
               || ([tgt].[personid] == null)
               //then write source data
               ? true
               //else keep the target data
               : false
           ) AS issource,
           //source data
           [src].[Id] AS [srcPersonid],
           [src].[Title] AS [srcTitle],
           [src].[Email] AS [srcEmail],
           [src].[Firstname] AS [srcFirstname],
           [src].[Surname] AS [srcSurname],
           @filename + @filenumber AS [srcFilename],
           //target data
           [tgt].[personid] AS [tgtPersonid],
           [tgt].[title] AS [tgtTitle],
           [tgt].[email] AS [tgtEmail],
           [tgt].[firstname] AS [tgtFirstname],
           [tgt].[surname] AS [tgtSurname],
           [tgt].[filename] AS [tgtFilename]
    FROM @data AS [src]
         FULL OUTER JOIN
             [dw].[person] AS [tgt]
         ON [src].[Id] == [tgt].[personid]
    ) AS cpr;

//truncate table - we can't do row level operations
//it's all file based processing so we have to reload the whole table (partition)
TRUNCATE TABLE [dw].[person];
 
//insert the merged data
INSERT INTO [dw].[person]
(
    [personid],
    [title],
    [email],
    [firstname],
    [surname],
    [deleteddate],
    [filename]
)
SELECT [personid],
       [title],
       [email],
       [firstname],
       [surname],
       [deleteddate],
       [filename]
FROM @merge;

Job Graphs

The job graphs are as follows, showing significantly more map-reduce steps for the union of left outer, inner and right outer joins (LIR).

LIR Method

leftrightinner job graph

CFO Method

mergefullouter Job Graph

Comparison

So looking at the job graphs we can see that there is a very clear difference in the physical execution. As an aside, Azure Data Lake Analytics has some really good job graph analysis tools. The first one we'll look at is a side-by-side comparison:

 

job comparison

I'm not going to give commentary on the numbers. It should be fairly clear that the CFO (MergeFullOuter) physical job plan is significantly more efficient on all comparative measures.

Vertices are not all independent; they have dependencies based on the logic, so vertices are organised into dependency stages. The maximum degree of parallelism I allocated is 10 AUs.

I picked 10 deliberately because I already knew from previous runs that it's the maximum this particular job can make use of; increasing it any further would just waste AUs. Remember though, we pay for the reservation, not for the use, so just because it's using 10 doesn't mean it's the most efficient… It depends on the business value of your job and the cost! Because elastic MPP compute is available at a fixed cost and it scales relatively linearly, it just becomes a question of:

  • how much data?
  • when do we want it to finish?
  • how much will it cost?

Using the AU analysis we can see the following for the LIR and CFO respectively:

leftrightiner AU

 

mergefullouter AU

This is a great tool for analyzing compute. Again we can see the CFO is more efficient, needing fewer AUs for the majority of the job and a shorter compute time. Note that both jobs only make use of the full 10 AUs for a relatively small proportion of the compute, which is why the Azure AU analyzer is recommending 3 AUs for LIR and 2 for CFO. Interestingly, if I drag the AU bar down in the visual to achieve approximately the same compute time, I can see that LIR will need 8 and CFO will need 6. Essentially CFO requires fewer reserved AUs for a shorter compute time, hence a lower cost:

  • LIR  $ 0.67
  • CFO $ 0.33

Note the cost doesn't change much within 10 AUs because as the reservation comes down the compute time goes up; it's relatively linear.
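As a rough rule of thumb the bill is just AU-hours multiplied by the unit price (the rate varies by region and offer type, so treat the figure below as an assumption rather than the actual tariff):

\[ \text{cost} \approx \text{AUs reserved} \times \text{job duration (hours)} \times \text{price per AU-hour} \]

For example, at a hypothetical $2 per AU-hour, 10 AUs for 3 minutes and 5 AUs for 6 minutes are both 0.5 AU-hours, roughly $1 either way; halve the reservation and the runtime doubles, so the bill stays roughly flat.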

Conclusion

In short, write good U-SQL!

Lazy transformation is absolutely what we want from a platform that processes big data. What do I mean by this?

If you've worked with SQL Server for a career then you'll know that SQL Server compiles its SQL into query plans. The optimizer that creates the plan decides how best to execute the query based on what it knows about the data, indexes, etc. It shouldn't matter how we write the code; if the optimizer is good then it should arrive at the same query plan… In the case of SQL Server, reality is more complicated than it appears, and the code and indexes you create do matter.

We need lazy transformation even more in a big data processing platform, simply because of the scale and costs involved. The Spark platform is really good at this: there are a variety of options for coding transformations, but the end result, the query plan that eventually gets executed and reads the data, will be the same.

With Data Lake Analytics it really does seem to matter how we write the queries. Referring back to the same rowset variable multiple times shouldn't really matter; I would expect the platform to take the code and create a physically optimized map-reduce job as if it were coded in a single operation. It really does seem to matter, however, how we go about the job using different join and set operators. We cannot just be lazy about the code we write and trust the engine optimizer to create the best physical plan. Fortunately the optimization tools are pretty good and very easy to use.

 


Azure Big Data: Attach Azure Blob Storage to CentOS 7 VM

In a previous blog I covered how I set up standalone Spark 2.3 on an Azure-provisioned CentOS 7.4 VM. This is the build I'm using to experiment with and learn Spark data applications and architectures. A benefit of using an Azure VM is that I can rip it down, rebuild it or clone it; when I do this I don't want to lose my data every time, recover it and then put it back in place. Having my data in a data lake in an Azure blob storage container is ideal since I can kill and recycle my compute VMs and my data just stays persisted in the cloud. This blog covers how I can mount my blob storage container onto my CentOS 7.4 VM.

Note this is for standalone use only and for the convenience of learning and experimentation. A multi-node Spark cluster would need further consideration and configuration to achieve distributed compute over Azure blob storage.

A final note: I'm learning Linux and Spark myself, and a lot of this stuff is already out on the webz, albeit in several different places and sometimes poorly explained. Hopefully this provides a relatively layman's end-to-end write-up with the missing bits filled in that I found myself asking about.

Install Blobfuse

What is blobfuse? Well, repeating the GitHub opener…

blobfuse is an open source project developed to provide a virtual filesystem backed by the Azure Blob storage.

We need to download and install this; note the URL (….rhel/7…) is correct because we're on CentOS 7.4! Not (….rhel/6…) like I tried to do!

sudo rpm -Uvh https://packages.microsoft.com/config/rhel/7/packages-microsoft-prod.rpm
sudo yum install blobfuse
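A quick optional check that the package actually landed:

# confirm the blobfuse package is installed
rpm -q blobfuse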

Temporary Path

Blobfuse requires a temporary path. This is where it caches files locally, aiming to provide the performance of local native storage. This location obviously has to be big enough to accommodate the data that we want to use on our standalone Spark build. What better drive to use for this than the local temporary SSD storage that you get with an Azure Linux VM? Running the following we can see a summary of our attached physical storage:

df -lh

centos vm storage

Here we can see that /dev/sdb1 has 63GB available, which is plenty for me right now. It's mounted on /mnt/resource so we'll create a temp directory there. Obviously substitute your own username when assigning permissions.

sudo mkdir /mnt/resource/blobfusetmp 
sudo chown shaunryan /mnt/resource/blobfusetmp

When the machine is rebooted everything here can be lost (assume it will be). But that's fine because it's all held in our cloud storage container; this is just the cache.

In fact, if you navigate to the mounted folder and list the files:

cd /mnt/resource
ls -L

we can see a file called DATALOSS_WARNING_README.txt. If we nano it we can see the following:

datalosswarning

Create an Azure Blob Storage Account

I'm not going to cover creating an Azure storage account since it's pretty straightforward – see here.

After creating the storage account we need to create a container: click on Blobs and create a container.

storage1

storage2

Once the container is created, click on it and upload some data. I'm using the companion data files for the book Spark: The Definitive Guide; they can be found here.

storage3

Now that the storage account, container and data are up, we need to note down the following details so that we can configure the connection details for blobfuse:

  • Storage Account Name
  • Access Key 1 or 2 (doesn’t matter)
  • Container Name – we already created this; I called it datalake

These can be obtained by clicking on Access Keys in the storage account.

storage1

storage5

Configure Blob Storage Access Credentials

Blobfuse takes a parameter which is a path to a file that holds the Azure storage credentials, so we need to create this file. I created it in my home directory (i.e. /home/shaunryan or ~) for convenience. Because of its content it should be adequately secured on a shared machine, so store it wherever you want, but note the path.

cd ~
# no sudo needed: creating the file as your own user means the chmod will work
touch ~/fuse_connection.cfg
chmod 700 ~/fuse_connection.cfg

We need the following Azure storage details for the storage container that we want to mount using blobfuse:

  • account name
  • access key
  • container name

The Create an Azure Blob Storage Account section above shows where these details can be found.

Edit the new file using nano:

sudo nano ~/fuse_connection.cfg

Enter the account details as follows:

accountName myaccountname
accountKey mykeyaccount
containerName mycontainername

It should look something like this. When done, hit Ctrl+X and Y to save.

fuse_connection

Mount the Drive

So now all that's left to do is mount the drive. We need somewhere to mount it to, so create a directory of your liking. I'm using a sub-directory of a data folder in my home directory (~/data/datalake), since I might mount more than one storage container and it's just for me.

# -p also creates ~/data if it doesn't exist; no sudo needed in your own home directory
mkdir -p ~/data/datalake

We also need the path to our temp location (/mnt/resource/blobfusetmp) and the path to the fuse_connection.cfg file that holds the connection details (just fuse_connection.cfg because I created it at ~).

cd ~
blobfuse ~/data/datalake --tmp-path=/mnt/resource/blobfusetmp --config-file=fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120

So now when we list the files in this directory I should see all the files that are in my storage account, and I can load them into my Spark console. See below where I have all the data files available to work through the Definitive Guide to Spark book; I copied them from GitHub into my Azure storage account, which is now attached to my VM.

datalake mounted
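For a quick sanity check from the shell (same paths as used above), listing the mount point and asking df about it should show the container contents and a blobfuse filesystem respectively:

# list the container contents through the fuse mount
ls ~/data/datalake

# confirm blobfuse shows up as the filesystem backing the mount point
df -h ~/data/datalake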

Automate in Bash Profile

So it's all up and working until we reboot the machine, at which point the drive is unmounted and our temp location is potentially (we should assume it will be) deleted.

To remedy this we can automate the temporary folder creation and the blobfuse storage mount in the bash profile.

That way I can totally forget all this stuff and just be happy that it works; and when it doesn’t I’ll be back here reading what I wrote.

Nano the bash profile to edit it.

sudo nano ~/.bash_profile

Add the following to the end of the profile, then Ctrl+X to exit and Y to save.

# Mount Azure Storage account

if [ ! -d /mnt/resource/blobfusetmp ]
then
 echo "creating Azure Storage temporary folder /mnt/resource/blobfusetmp"
 sudo mkdir /mnt/resource/blobfusetmp
 sudo chown shaunryan /mnt/resource/blobfusetmp
 echo "created Azure Storage temporary folder /mnt/resource/blobfusetmp"
else
 echo "Azure Storage temporary folder already exists /mnt/resource/blobfusetmp"
fi

echo "Mounting Azure storage at ~/data/datalake/ using ~/fuse_connection.cfg and temporary drive /mnt/resource/blobfusetmp"
blobfuse ~/data/datalake --tmp-path=/mnt/resource/blobfusetmp --config-file=/home/shaunryan/fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120 -o nonempty

Now when we SSH in it should mount automatically. Below shows a login after a reboot and a login after an exit. The mount after the exit fails because it's already mounted, which is fine. Note the temporary storage already existed, but it may not always: I issued a reboot, so in all likelihood it wasn't down long enough to be recycled; however, it was destroyed when I shut down the VM last night and powered it up this morning.

login-mount1
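If you ever need to detach the container manually, for example before re-running the mount command by hand, the standard FUSE unmount does the job (this is plain FUSE tooling rather than anything blobfuse-specific):

# unmount the blobfuse filesystem
fusermount -u ~/data/datalake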

 

 

Azure Big Data: Spark 2.3 Centos VM Standalone

Of late I've been investigating the options for running a Spark big data platform on Azure using blob and data lake storage. So far I've poked around with the following – which I may blog about if I get time:

  • IaaS Linux VM Build (standalone and clustered)
  • HDInsight
  • Databricks
  • Spark on Azure Container Cluster (AKS preview) i.e. Kubernetes

This is a basic how-to for installing Spark 2.3 on a standalone CentOS VM in Azure: the latest and greatest build of Spark 2.3 on CentOS 7.4 (Linux), with Scala 2.11.12 and Java 8+. There are later versions of Scala, but Spark 2.3 requires Scala 2.11 at most, as covered here.

Preparing Your Client Machine

 

  1. Install bash client
  2. Create ssh rsa key – we need this before creating the Azure VM

We're setting up a Linux server to run Spark on a CentOS VM in Azure. I'm not going to bother with a Linux desktop or remote desktop, but we'll need a client bash terminal to connect to the machine in order to:

  • Administer CentOS & install software
  • Use the Spark terminal

I run macOS and Windows 10; mostly Windows 10. If you're running a Mac you don't need a bash client terminal since you have one already. Windows, however, does need a bash client.

There is a new Windows Subsystem for Linux available in the Windows 10 Fall Creators Update, but I hit some issues with it so wouldn't advise it yet. It's not just a bash client; it emulates a local Linux subsystem, which brings some irritating complications. The best experience by far I've had is with Git Bash, so go ahead and install that if you're using Windows.

Once we have bash we need to create a public/private key pair so that we can use the Secure Shell (SSH) command to securely connect to our Linux VM.

Open Git Bash and execute the following:

ssh-keygen -t rsa -b 2048

This creates a 2048-bit RSA public/private key pair that we can use to connect to our Azure VM. You'll be prompted for a filename and passphrase. See here:

gitbashkey

As we can see it says that it created the key in:

C:\Users\shaun\.ssh\id_rsa

In my case it didn't, however; demokey.pub and demokey ended up in my bash home directory instead:

C:\users\shaun\demokey.pub (this is the public key)
C:\users\shaun\demokey (no extension – this is the private key)

Review these files using notepad. Copy the private key to the respective .ssh folder and rename it to id_rsa:

C:\Users\shaun\.ssh\id_rsa

Keep a note of the public key, which looks something like the one below, because this lives on the Linux server and we need it when creating the Linux VM in Azure. Also don't forget the passphrase you entered, because we need that to log in using the ssh command.

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDBkb5GTWTIAtGhZeHNKXwbVF6WoQqb0u23D3opQc0TId9NdlWj8WnYmFu1/l4CuqdD/uzE7/JJTP2pW9mrb3/sZyygF560XGQzTmEUAGRlAexTr509Q0wB/Spekp9qGLVqkys3wQdbxjWsWI2lEhwJIvPlyzgzIAJrmeUU/NGS6rQN+tzoqntg4V2fI714W7f0YRerUICb9rveVwbDU0ieihs1B+n+ljNoJ+J3yFAKqYVcYyQIL4WYmpYgi/M1EMOyrRZK0hVySIbhGh4eI1FBOfplxEOhI8SgedK1KaemhBWs4f+zs1bntqkSCgFHJzV/eLUHDsYxTrgEK3Tn9s5X shaun@DESKTOP-CKJA9OR
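A quick way to grab the key for pasting into the portal later is to print it from Git Bash; substitute whatever filename you gave ssh-keygen:

# print the public key so it can be copied into the Azure SSH Public Key box
cat ~/demokey.pub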

Right, time to create our Linux VM.

Create Centos VM in Azure

 

Log into the Azure Portal, click Create a Resource and search for CentOS. Choose CentOS-based 7.4 and hit Create.

centOS VM.PNG

Fill in the necessaries to create your VM, choosing the most affordable and appropriate machine. For a demo/learning standalone I tend to go for about 4 CPUs and 32GB of RAM (remember Spark is an in-memory-optimised big data platform). The important bit is to copy and paste our public RSA key into the SSH Public Key input box so it can be placed on the VM when provisioned. When Azure has provisioned your VM it leaves it up and running.

centOS VM2.PNG

Connect to CentOS VM

 

So hopefully that all went well and we're now ready to connect. You can give your VM a DNS name (see the docs); however, I tend to just connect using the IP. Navigate to the VM in the portal and click the Connect button. This will show you the SSH command with the server address that we can enter into a bash client in order to connect.

connect

Enter the SSH command, enter the passphrase and we’re good to go:

connect
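For reference, the command from the portal looks something like the following; the username is whatever you chose when creating the VM and the address is the VM's public IP (both are placeholders here):

# connect to the VM over SSH (placeholder values)
ssh shaunryan@<vm-public-ip>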

Patch the OS

 

Ensure the OS is patched. The reboot will kick you out of your SSH session, so you'll need to sign back in.

sudo yum update -y
sudo reboot

Install Java 8

 

Install OpenJDK 1.8 and validate the install:

sudo yum install java-1.8.0-openjdk.x86_64
java -version

Set the following home paths in your .bash_profile so that every time we log in our paths are set accordingly. To do this we'll use the nano text editor.

sudo nano ~/.bash_profile

Add the following path statements, since they’re required by the scala config:

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export JRE_HOME=/usr/lib/jvm/jre

It should look something like this.

javapaths

To exit, press Ctrl+X; you'll be prompted to save. Now reload the bash profile.

source ~/.bash_profile

Check the java version.

java -version

javaversion

Check the Java paths:

echo $JAVA_HOME
echo $JRE_HOME

echojavapaths

Java is all done. On to Scala.

Install Scala 2.11.12

 

Spark 2.3.0 requires a Scala 2.11.x version. Note that the current Scala release is 2.12, so we'll go for the last 2.11 version, which is 2.11.12; we want the rpm package:

wget http://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.rpm
sudo yum install scala-2.11.12.rpm

We should now validate the install:

scala -version

Note the output as follows:

cat: /usr/lib/jvm/jre-1.8.0-openjdk/release: No such file or directory
 Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

This is because there is no release directory in the $JAVA_HOME path, which the scala script looks for; see a more thorough explanation here. It's not vitally necessary, but I got around it by just creating a release directory at $JAVA_HOME.

cd $JAVA_HOME
sudo mkdir release
cd ~
scala -version

output:

cat: /usr/lib/jvm/jre-1.8.0-openjdk/release: Is a directory
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

Install Spark 2.3.0

 

Final step! Install Spark. Download the 2.3.0 tgz package; we'll use wget again and download the package from a mirror URL listed on this page. I'm using the first listed mirror URL, but adjust as you see fit.

wget http://apache.mirror.anlx.net/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz

Unpack the archive and move the files into a more appropriate directory:

tar xf spark-2.3.0-bin-hadoop2.7.tgz
sudo mkdir /usr/local/spark
sudo cp -r spark-2.3.0-bin-hadoop2.7/* /usr/local/spark

We need to add some paths to our bash profile for convenience so that we don't have to set them every time we create a session. Again we'll use nano for this.

cd ~
sudo nano ~/.bash_profile

Add the following lines along with the Java paths that we added earlier.

export SPARK_EXAMPLES_JAR=/usr/local/spark/examples/jars/spark-examples_2.11-2.3.0.jar

Also put the Spark binary folder into the $PATH variable:

PATH=$PATH:$HOME/.local/bin:$HOME/bin:/usr/local/spark/bin

The file should look something like this.

spark paths

Now we can exit and save the file using Ctrl+X, reload the bash profile and check the paths.

source ~/.bash_profile
echo $PATH
echo $SPARK_EXAMPLES_JAR

Now we should be good to run the spark shell!!

spark-shell

There you have it folks enjoy!

sparkshell
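As an optional end-to-end check outside the interactive shell, the distribution also ships example jobs that can be run with the run-example script from the same bin folder we just added to the PATH:

# run the bundled SparkPi example as a sanity check
run-example SparkPi 10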

To exit spark-shell use

:quit

To close your SSH session use

exit

Finally, don’t forget to stop your VM to reduce your Azure spend. Sensitive machine details, user details and rsa keys used for this blog have since been deleted.

 

 

Azure Databricks Walkabout

Finally got my Azure Databricks preview enabled. This is just a quick overview of how it all hooks together.

Technical Architecture Overview

Basically, Databricks is the PaaS and Azure is the IaaS, which is a smart play by Databricks. Why try to compete with the scale and resilience of the IaaS that Microsoft and AWS host? Leveraging partner IaaS to host their PaaS offering allows them to do what they do well, which is to focus on pushing the Databricks Spark platform above and beyond what other data platforms are capable of.

Creating the Service

When you create the Databricks service it creates what appears to be a PaaS service that sits in a blade in the resource group you select when creating the service. So in my case I created a service called sibytes-databricks in my DataLakePatterns resource group.

Screen Shot 2018-01-14 at 08.45.08

At the top:

  • Managed Resource Group – this is a resource group that houses the managed IaaS – see below
  • URL – the URL of the Databricks PaaS service. This is also where the “Launch Workspace” button goes
  • Resource Group – just where this pass-through blade to the Databricks service sits in your subscription

When you click the Launch Workspace button it takes you through a single sign-on to Databricks, which is where the PaaS service operates. The first time you go through, you're asked to delegate authorisation from your Azure account to the Databricks service so it can manage the IaaS as part of the service. So you'll see this:

Screen Shot 2018-01-14 at 08.19.34

NOTE: this has to be an Azure AD domain account, not a Microsoft account, so you'll need to set one up beforehand. Also, it's probably wise to create a domain account specifically for this purpose rather than using a personal user account, so you can lock its permissions down to exactly what it should have, particularly with regard to data access.

Once in, you'll see the usual and, in my opinion, brilliantly usable PaaS service interface.

Screen Shot 2018-01-14 at 08.47.00

Remember that it's single sign-on with delegated permissions to Azure, so when you manage your account in Databricks it will take you straight back into Azure.

Managed IaaS

When you create your Databricks service you'll also notice that it creates an additional resource group to house the infrastructure services that Databricks needs to run the enhanced Spark data platform.

Screen Shot 2018-01-14 at 08.31.10

Initially it will look like the following, i.e. with no clusters created and running.

Screen Shot 2018-01-14 at 09.06.14

It has:

  • A storage account – so it's not entirely IaaS. The storage is read-only and you cannot access it. I'm not 100% sure what this is for yet, but if it's for data content storage then we might be limited to one account, and that also means data can only go in and out through Databricks. Or maybe it's just for specific types of Databricks metadata or processing storage… or maybe it's for internal config, logs and security… I need to look into this further when I get into data processing, and explore to see what, if any, other Azure storage accounts can be attached. Will blog on it…
  • Virtual Network – this is basically a secure VNet for the IaaS clusters.
  • Network Security Group – these are all the inbound and outbound network security settings that allow Azure and Databricks to speak to each other to spin up IaaS VMs and access data.

Creating A Cluster

Things get a little more interesting when we create a cluster. I created a 2-node cluster with 1 driver using Standard DS3s. In the drop-downs for choosing machines you're limited to a specific collection, and currently they are all listed as beta.

Screen Shot 2018-01-14 at 08.32.10

Looking back at the Azure IaaS resource group we can see it's spun up 3 Linux machines using the VM sizes we chose in Databricks:

Screen Shot 2018-01-14 at 08.33.20

When we stop the cluster in Databricks I was expecting the VMs to be stopped. However, that doesn't happen. Currently what happens is that Databricks retains the cluster as a terminated cluster and Azure deletes all of the VMs, so all you're left with is the VNet, NSG and storage account, which is what we started with, plus our data.

What this means is that you'll pay less (essentially just the storage) when your cluster is stopped; however, it can take quite a while for clusters to come up and down, particularly if they're pretty large. It would be good to have an option to pause the service that keeps the VMs persisted.

Thoughts & More to Follow

This is just a quick look and intro – more to follow particularly on data processing, storage and hooking into the PaaS service through their data provider.

How the storage accounts hook in is a big part of how good this will be, because if interaction and flexibility around storage is limited then it may significantly reduce its appeal as an architectural option. If it is significantly locked down then we'll be forced to use and pay for services that Microsoft and Databricks have agreed to integrate, which you otherwise may not need, in order to build a viable and compelling data platform, e.g. Azure Data Warehouse…

Study Notes – Databricks Cram Sheet

What's the difference between Databricks and Spark?

  • Databricks is a PaaS platform built on Spark that offers the additional features required to easily productionise Spark into an enterprise-grade integrated platform, with 10-40x performance gains. A comparison is here

Is Databricks database software?

  • No – it's a distributed calculation engine that provides an analytics, streaming, data lake and data warehouse platform over distributed storage

What distributed storage can it run on?

  • AWS S3
  • Azure Data Lake Storage (I think possibly even Blob storage – not sure yet)
  • Hadoop

What cluster managers does it support for distributing the calculation engine?

  • YARN
  • Mesos
  • Spark Standalone – built in, for dev & learning

What is it implemented in?

  • Scala

What programming languages does it support?

  • Python
  • Java
  • R
  • Scala
  • SQL

What class of use could I use it for?

  • Streaming
  • SQL Analytics
  • Data Transformation (Batch or Realtime)
  • Data Provisioning into Data Warehouse or Data Lake solution
  • Deep Learning
  • Machine Learning (Batch or Realtime)
  • Graph Analysis

What core APIs does it have?

  • MLlib – machine learning
  • Streaming
  • SQL
  • GraphX

Can I use 3rd party non-core APIs?

  • Yes

Its APIs are unified, but what does that mean?

  • It means code can be ported from streaming to batch with little modification; lots of work has been put in to minimise time to production, ease development and make it simple to migrate a solution from streaming to batch analytics, for example

Is it free?

  • Spark is free; Databricks is not

How can I use it?

  • Databricks has a cloud portal – there is a free trial
  • Databricks can be provisioned on AWS
  • We'll soon be able to provision Databricks in Azure – it's in preview

What features differentiates it as a leading data platform?

  • Unified coding model gives shorter dev cycles and time to production
  • It’s PaaS – no hardware cluster to manage, create or look after and I can easily scale it
  • Has a rich collaborative development experience allowing data engineers and data scientists to work together
  • I can run data processing and querying over S3, Azure Data Lake Storage and Hadoop HDFS with:
      • Much greater performance than other distributed storage query engines
      • Automatic Index Creation
      • Automatic Caching
      • Automatic Data Compacting
      • Transactional Support

     

  • There is no buy-in to a proprietary storage format – i.e. it just sits on S3, for example, and I can access and manage it with other processes and tools
  • Delta (2018) transactionally incorporates new batch and/or streaming data immediately for queries – no other data platform has this

 

The Basics – SSAS Tabular Many-to-Many

I won’t go into this loads since it’s a fairly well established Kimball modelling approach for cubes…

Essentially, what do you do if the grain of a dimension is lower than the fact grain when denormalised directly onto the fact table? If you design it that way, the resulting model will double count the measures without slow, unusable and complex adjustments to query patterns. That's the generic definition of the modelling problem or pattern.

This requires a many-to-many dimension model using a factless fact or bridge table. Essentially we identify the other dimension entity that bridges the lower-grain dimension onto the fact, create a bridge table, and join the lower-grain dimension to the fact through it. We can then hide the bridge table so users are completely unaware of this complexity, and the cube engine takes care of the measure aggregation automatically.

All well and good if you have a good grasp of this design pattern. A classic, more intuitive use case is customers having many accounts and accounts having many customers. Transactions are at an account level but customers relate to accounts at a lower level, and we don't want to double count transactions when we aggregate a total.

So in cubes:

  • SSAS Multi-Dimensional – has built in dimension relationship type for this feature
  • SSAS Tabular – you had to get around it using DAX
  • SSAS Tabular 2016, 2017 & Azure – now also has a built-in relationship feature that handles this automatically, so you don't need to use DAX anymore or create several role-playing customer dimensions.

Finally, note this is also in Power BI, but not in Excel, which is on a slower release cycle! Last I checked you still have to use DAX in Excel. Hopefully Excel Power Pivot models will catch up soon.

Here’s a simple example of how it’s modelled:

SSAS Tabular

And here it is in Excel; see that accounts total correctly.

by account

When we break it out by customer we see that it aggregates correctly at the lower grain for individual customers, but the total does not double count. E.g. both Shaun and Sarah have 15 in account 001, but the total for account 001 is 15 because they share the same account and it doesn't double count at the account level. Finally, the grand total is 75, again because the engine doesn't double count the measures for the total.

by customer

Again, just by customer, we see that it aggregates properly at the lower grain, but the total is 75, not 90, which would be wrong; i.e. it does not double count the facts at customer level and it shows the total correctly at account level.

by customer1