Lightning Post – SQL 2019 Big Data Clusters

SQL Server 2019 is now in CTP 2.3 and there’s this thing called SQL Server 2019 Big Data Clusters… What? No, it’s not April 1st. If that’s boggled your mind, here’s a quick post that might help…

Is this yet more bloat on the install, making it even more complicated?

No, you can still install SQL Server on-premises the same way as before. This is just an alternative way of building out a SQL Server deployment specifically for Big Data processing, combining traditional, existing SQL workloads with newer Big Data processing methods.

So for Big Data processing we need scale-out, right… multiple machines running processing nodes?

Yes, and that’s what this is. It’s a cluster of compute machines running worker nodes of SQL Server to provide a highly distributed compute service… need more grunt? Then just add nodes…

But wait… that sounds like Azure SQL Data Warehouse (previously Parallel Data Warehouse): a head node with lots of worker machines. How is this different, and how is it orchestrated?

It’s not Parallel Data Warehouse, it’s just regular SQL Server; but you heard right that it can run on Linux! That means SQL Server can run on machines in Docker containers orchestrated by Kubernetes (K8s) into a cluster. Kubernetes handles all the clustering over persistent, detachable storage on VM disks… but wait, it gets even crazier… not only does it have SQL Server nodes, it also has Spark nodes over HDFS! This means it can accommodate all the things that Spark is good at, such as streaming and easily processing big silos of data in any structure, alongside traditional SQL processing…

But SQL can’t read HDFS

Now it can. SQL Server now has a Parquet reader and can read directly off HDFS!!!
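To make that concrete, here’s a minimal sketch of what reading HDFS-resident Parquet from T-SQL looks like via an external table. The table name, columns and HDFS path are all made up, and the exact data source wiring shifted between CTPs, so treat this as the shape of the feature rather than copy-paste code:

CREATE EXTERNAL FILE FORMAT parquet_file
WITH (FORMAT_TYPE = PARQUET);

-- hypothetical external table over a Parquet directory in the cluster's HDFS
CREATE EXTERNAL TABLE [dbo].[sales_hdfs] (
 [sale_id] INT,
 [customer_id] INT,
 [sale_amount] DECIMAL (18, 2)
)
WITH (
 DATA_SOURCE = SqlStoragePool, -- the cluster's built-in HDFS storage pool
 LOCATION = '/sales/parquet',
 FILE_FORMAT = parquet_file
);

-- and from here on it's just T-SQL
SELECT TOP 10 * FROM [dbo].[sales_hdfs];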

SSMS just does SQL, and I can’t write T-SQL against Spark, so how does that work?

If you’re into big data and Spark, or have read about it, you may have heard that these engines require deeper OO and functional programming skills and use something called notebooks. For Spark we can write Python, R and Scala… and SQL Server can also now run Python and R. This all hangs together using Azure Data Studio, a richer IDE surface than SSMS that provides a way to use all this technology together and gives us notebooks that can execute Python, R and SQL! Azure Data Studio is based on the Visual Studio Code shell, which can be extended with extensions.
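On the SQL Server side, Python and R run in-database through sp_execute_external_script (which needs Machine Learning Services installed and external scripts enabled). A minimal sketch:

-- one-off setup, as an admin
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE;

-- run Python inside SQL Server: the query result arrives as a data frame
-- (InputDataSet) and whatever is assigned to OutputDataSet comes back as a rowset
EXEC sp_execute_external_script
 @language = N'Python',
 @script = N'OutputDataSet = InputDataSet',
 @input_data_1 = N'SELECT 42 AS answer';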

Spark is based on Java, so does that cause interoperability issues with SQL Server?

SQL Server can now run Java… once again… SQL Server can now run Java! No, I haven’t been smoking something.
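Java goes through the same external script mechanism as Python and R, via a Java language extension. The exact contract has moved around between CTPs, so this is only a rough sketch: here @script names a hypothetical pre-compiled class deployed where the extension can find it, rather than inline Java source:

-- rough shape only: 'JavaSamples.Ngram' is a hypothetical compiled Java class
EXEC sp_execute_external_script
 @language = N'Java',
 @script = N'JavaSamples.Ngram',
 @input_data_1 = N'SELECT [review_text] FROM [dbo].[reviews]';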

This has to be cloud, right? Since MS is all about cloud…

It can be, but it doesn’t have to be. You just need Kubernetes, so that could be on-premises or using Azure Kubernetes Service (AKS).

So how do I get data into it?

Well, you can use Azure Data Factory and other copy tools… but wait! You’re kind of not meant to! It’s fully integrated with PolyBase, which does predicate pushdown to any data source PolyBase supports. That means this is a data virtualization platform. The source data can stay where it is… I can wrangle data across SQL Server, HDFS and Oracle in a single T-SQL query and, if I want, store the result inside the cluster in SQL Server (using an insert statement) in what’s called a SQL data pool.
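Here’s a hedged sketch of what that looks like: an external table over a hypothetical Oracle source, joined in one query with the HDFS-backed Parquet table from earlier and a plain local table. Server names, credentials and object paths are all made up:

-- credential and data source for a hypothetical Oracle instance
CREATE DATABASE SCOPED CREDENTIAL oracle_cred
WITH IDENTITY = 'erp_reader', SECRET = '********';

CREATE EXTERNAL DATA SOURCE oracle_erp
WITH (LOCATION = 'oracle://oracle-host:1521', CREDENTIAL = oracle_cred);

-- external table over an Oracle table; no data is copied anywhere
CREATE EXTERNAL TABLE [dbo].[erp_orders] (
 [order_id] INT,
 [sale_id] INT,
 [order_status] VARCHAR (20)
)
WITH (LOCATION = '[XE].[ERP].[ORDERS]', DATA_SOURCE = oracle_erp);

-- one T-SQL query across Oracle, HDFS and a local SQL Server table
SELECT o.[order_id], h.[sale_amount], c.[customer_name]
FROM [dbo].[erp_orders] o
JOIN [dbo].[sales_hdfs] h ON h.[sale_id] = o.[sale_id]
JOIN [dbo].[customer] c ON c.[customer_id] = h.[customer_id]; -- hypothetical local table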

But there are other big data architectures and services in Azure; is this a replacement?

No, it’s just a different way of doing the same thing. Databricks runs Spark in Azure, but because it has the Databricks Runtime it has many more Spark features that open-source Spark doesn’t have… so you should seek to understand the features of each and pick the best one for yourselves.

So in a nutshell:

We’ve got big data engineers, data scientists and SQL BI folks, and they all want to do their thing with tools they know… well then, this is your thing…

Summary

  • It’s a highly distributed SQL Server cluster running in Kubernetes
  • It has Spark nodes
  • I can do data science, big data and traditional SQL (it’s just SQL) on it using Azure Data Studio
  • It’s a data virtualization platform and can query other stores outside the cluster using SQL and predicate pushdown

To prove this isn’t an April Fools’… Get started here…


The Basics – T-SQL: Recursive Inheritance

This is a post about a SQL design pattern for applying configuration attributes to a parent-child table. There are a bunch of obvious use cases, particularly around metadata and ETL control & configuration repositories…

Imagine your ETL has hundreds, if not thousands, of processes. It often makes sense to model process metadata in a parent-child table, since you have one root process that, when executed, executes a hierarchy of child processes. If you then have configuration options for those processes, you won’t want to maintain a configuration row for every process, right? That would be a huge maintenance headache. Ideally we would configure the root process with options that the child processes simply inherit unless they’re specifically overridden.

This gives us the best of both worlds: the flexibility of specific configurations for child processes, but where a child has no configuration it inherits the configuration of its parent, resulting in less configuration maintenance. This model doesn’t persist the inherited configuration; the table stays sparse and the inheritance is derived logically at query time, removing the need for updates or inserts.

We’ll have 2 tables:

  1. The parent child entity: process_executable
  2. The configuration attributes: process_configuration

Here they are:

CREATE TABLE [etl].[process_executable] (
 [process_executable_id] INT IDENTITY (1, 1) NOT NULL,
 [parent_process_executable_id] INT DEFAULT ((-1)) NOT NULL,
 [connection_id] INT NOT NULL,
 [process_executable_name] VARCHAR (200) NOT NULL,
 [created_date] DATETIME DEFAULT (getdate()) NOT NULL,
 [is_enabled] BIT DEFAULT ((1)) NOT NULL,
 CONSTRAINT [PK_process_executable] PRIMARY KEY CLUSTERED ([process_executable_id] ASC)
);


CREATE TABLE [etl].[process_configuration] (
 [process_configuration_id] INT IDENTITY (1, 1) NOT NULL,
 [process_executable_id] INT NULL,
 [number_of_retries] INT NULL,
 [is_incremental] BIT NULL,
 CONSTRAINT [PK_process_configuration] PRIMARY KEY CLUSTERED ([process_configuration_id] ASC)
);
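
To make the pattern concrete, here are a few made-up sample rows (assuming fresh tables, so the identities come out as 1, 2, 3): one root process with a configuration, one child with no configuration at all, and one child that overrides a single attribute:

INSERT INTO [etl].[process_executable] (parent_process_executable_id, connection_id, process_executable_name)
VALUES (-1, 1, 'root_load'),      -- process_executable_id 1: the root
       (1, 1, 'load_customers'),  -- id 2: no configuration row
       (1, 1, 'load_sales');      -- id 3: overrides is_incremental only

INSERT INTO [etl].[process_configuration] (process_executable_id, number_of_retries, is_incremental)
VALUES (1, 3, 1),    -- root: 3 retries, incremental
       (3, NULL, 0); -- load_sales: full load, retries inherited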

Here is the solution

First, left join the configurations in so that we get all processes, with and without configurations. Then recurse down the hierarchy using a recursive CTE, defaulting the configurations to the parent values where they are NULL.

WITH [cte_process_configuration]  
AS --join in the configuration 
(
  SELECT  
    p.[process_executable_id],  
    p.[parent_process_executable_id],  
    pc.number_of_retries, 
    pc.is_incremental 
  FROM etl.process_executable p 
  LEFT JOIN [etl].[process_configuration] pc ON p.process_executable_id = pc.process_executable_id
 ), 
cte_collapse 
AS --recurse down, carrying the parent settings down to children where they don't have settings.
(
  SELECT  
    m.process_executable_id, 
    m.parent_process_executable_id,
    m.number_of_retries,
    m.is_incremental 
  FROM [cte_process_configuration] m 
  WHERE m.parent_process_executable_id = -1 

  UNION ALL

  SELECT 
    m.process_executable_id, 
    m.parent_process_executable_id, 
    --carry down the settings where there aren't any 
    ISNULL(m.number_of_retries, pm.number_of_retries) AS number_of_retries, 
    ISNULL(m.is_incremental, pm.is_incremental) AS is_incremental 
  FROM [cte_process_configuration] m 
  JOIN cte_collapse pm ON m.parent_process_executable_id = pm.process_executable_id 
)
--show me the money! 
SELECT 
  process_executable_id,  
  number_of_retries, 
  is_incremental 
FROM cte_collapse c
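
Given the sample rows above, the query returns a derived configuration for every process: load_customers (id 2) inherits everything from the root, while load_sales (id 3) keeps its is_incremental override but inherits number_of_retries:

process_executable_id number_of_retries is_incremental
--------------------- ----------------- --------------
1                     3                 1
2                     3                 1
3                     3                 0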