Is this yet another set of options bloating the install and making it even more complicated?
No, you can still install SQL Server on-prem the same way as before. It's just an alternative way of building out a SQL deployment specifically for Big Data processing, combining traditional and existing SQL processing with newer Big Data processing methods.
So for Big Data processing we need scale-out, right… multiple machines running processing nodes?
Yes, and that's exactly what this is: a cluster of compute machines running SQL Server worker nodes to provide a highly distributed compute service. Need more grunt? Just add nodes…
But wait… that sounds like Azure SQL Data Warehouse (previously Parallel Data Warehouse): a head node with lots of worker machines. How is this different, and how is it orchestrated?
It's not Parallel DW, it's regular SQL Server. But you heard right that SQL Server can now run on Linux?! That means it can run in Docker containers orchestrated by Kubernetes (K8s) into a cluster. Kubernetes handles all the clustering over persistent, detachable storage on VM disks… but wait, it gets even crazier… the cluster doesn't just have SQL Server nodes, it also has Spark nodes running over HDFS! That means it can accommodate everything Spark is good at, such as streaming and easily processing big silos of data in any structure, alongside traditional SQL processing…
But SQL Server can't read HDFS…
Now it can. SQL Server now has a Parquet reader and can read directly off the cluster's HDFS storage!
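As a sketch of what that looks like in practice: you point an external table at Parquet files in the cluster's HDFS storage pool and query it like any other table. The data source name, file path and columns below are illustrative assumptions, not from the original.

```sql
-- Sketch: reading Parquet from the cluster's HDFS via an external table.
-- Path and column list are placeholders for illustration.
CREATE EXTERNAL FILE FORMAT parquet_format
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL DATA SOURCE SqlStoragePool
WITH (LOCATION = 'sqlhdfs://controller-svc/default');

CREATE EXTERNAL TABLE dbo.web_clickstream
(
    wcs_user_sk     INT,
    wcs_web_page_sk INT,
    wcs_click_date  DATE
)
WITH (
    DATA_SOURCE = SqlStoragePool,
    LOCATION    = '/clickstream',
    FILE_FORMAT = parquet_format
);

-- From here on it's just T-SQL:
SELECT TOP 10 * FROM dbo.web_clickstream;
```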
SSMS just does SQL, and I can't write T-SQL against Spark, so how does that work?
If you're into big data and Spark, or have read about them, you may have heard that these engines require deeper OO and functional programming skills and use something called notebooks. For Spark we can write Python, R and Scala… and SQL Server can now also run Python and R. This all hangs together using Azure Data Studio, a richer IDE surface than SSMS that provides a way to use all this technology together and gives us notebooks that can execute Python, R and SQL! Azure Data Studio is based on the Visual Studio Code shell, which can be extended with plugins.
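For the "SQL Server can now run Python and R" part, the in-database mechanism is `sp_execute_external_script`. A minimal sketch (assumes Machine Learning Services is installed and external scripts are enabled on the instance):

```sql
-- Sketch: running Python inside SQL Server.
-- Requires: sp_configure 'external scripts enabled', 1; RECONFIGURE;
EXEC sp_execute_external_script
    @language = N'Python',
    @script   = N'
import pandas as pd
# Whatever the script assigns to OutputDataSet comes back as a result set.
OutputDataSet = pd.DataFrame({"greeting": ["hello from python inside sql"]})
'
WITH RESULT SETS ((greeting NVARCHAR(100)));
```

The same pattern works with `@language = N'R'`, which is how notebooks in Azure Data Studio can mix T-SQL and external-language cells against the same instance.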
Spark is based on Java, so does that cause interoperation issues with SQL Server?
SQL Server can now run Java… once again: SQL Server can now run Java! And no, I haven't been smoking something.
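Java rides the same external-script mechanism. As a rough sketch only: the exact class contract and packaging changed across the SQL Server 2019 CTPs, so treat the class name, input query and columns here as placeholders.

```sql
-- Sketch: invoking a Java class from T-SQL via the extensibility framework.
-- 'MyPackage.Ngram' and the tables/columns are placeholders; the precise
-- Java-side method contract depends on the SQL Server 2019 build in use.
EXEC sp_execute_external_script
    @language     = N'Java',
    @script       = N'MyPackage.Ngram',      -- fully qualified class name
    @input_data_1 = N'SELECT review_text FROM dbo.reviews'
WITH RESULT SETS ((review_text NVARCHAR(MAX)));
```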
This has to be cloud, right? Since MS is all about cloud these days…
It can be, but it doesn't have to be. You just need Kubernetes, so that could be on-premises or in Azure Kubernetes Service (AKS).
So how do I get data into it?
Well, you can use Azure Data Factory and other copy tools… but wait! You're kind of not meant to! It's fully integrated with PolyBase, which does predicate push-down to any data source PolyBase supports. That makes this a data virtualization platform: the source data can stay where it is. I can wrangle data across SQL Server, HDFS and Oracle in a single T-SQL query and store the result in the cluster in SQL Server (eventually, using an INSERT statement) in what's called a SQL pool.
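A sketch of that single-query wrangle, assuming an Oracle source reachable via PolyBase. The server name, credential, schema, tables and columns are all illustrative assumptions; an HDFS external table would be defined the same way against the storage pool.

```sql
-- Sketch: one T-SQL statement spanning local SQL Server and Oracle.
-- All names below are placeholders for illustration.
CREATE DATABASE SCOPED CREDENTIAL oracle_cred
WITH IDENTITY = 'oracle_user', SECRET = '********';

CREATE EXTERNAL DATA SOURCE OracleServer
WITH (LOCATION   = 'oracle://oracleserver.contoso.com:1521',
      CREDENTIAL = oracle_cred);

CREATE EXTERNAL TABLE dbo.oracle_inventory
(
    product_id INT,
    stock_qty  INT
)
WITH (DATA_SOURCE = OracleServer, LOCATION = '[XE].[HR].[INVENTORY]');

-- Join local data with the remote source (PolyBase pushes predicates
-- down to Oracle) and land the result in the cluster:
INSERT INTO dbo.combined_stock
SELECT p.product_name, i.stock_qty
FROM dbo.products AS p                -- local SQL Server table
JOIN dbo.oracle_inventory AS i        -- Oracle, queried in place
  ON p.product_id = i.product_id;
```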
But there are other big data architectures and services in Azure; is this a replacement?
No, it's just a different way of doing the same thing. Databricks runs Spark in Azure, but because it has the Databricks runtime it offers many more Spark features than open-source Spark… so you should seek to understand these features and use the best fit for yourselves.
So in a nutshell:
We've got big data engineers, data scientists and SQL BI folks, and they all want to do their thing with tools they know… well then, this is your thing:
- It's a highly distributed SQL Server cluster running in Kubernetes
- It has Spark nodes
- I can do data science, big data and traditional SQL (it's just SQL) on it using Azure Data Studio
- It's a data virtualization platform and can query other stores outside the cluster using SQL and predicate push-down
To prove this isn't an April fool's… get started here…