What’s the difference between Databricks and Spark?
- Databricks is a PaaS platform built on Spark. It adds the features needed to productionise Spark as an enterprise-grade integrated platform, with claimed performance gains of 10-40x. Comparison is here
Is Databricks database software?
- No – it’s a distributed calculation engine that provides an analytics, streaming, data lake and data warehouse platform on top of distributed NoSQL storage
What distributed storage can it run on?
- AWS S3
- Azure Data Lake Storage, and possibly Azure Blob Storage (not yet confirmed)
What cluster managers does it support for distributing the calculation engine?
- Spark standalone – built in, for dev & learning
What is it implemented in?
What programming languages does it support?
What class of use could I use it for?
- SQL Analytics
- Data Transformation (Batch or Realtime)
- Data Provisioning into Data Warehouse or Data Lake solution
- Deep Learning
- Machine Learning (Batch or Realtime)
- Graph Analysis
What core APIs does it have?
- MLlib – machine learning
Can I use 3rd party non-core APIs?
Its APIs are unified but what does that mean?
- It means code can be ported from streaming to batch with little modification. A lot of work has gone into minimising time to production, easing development, and making it easy to migrate a solution from, for example, streaming to batch analytics
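To make the "unified" idea concrete, here is a minimal sketch in plain Python. It is not Spark code: the function and source names (`large_orders`, `batch_source`, `streaming_source`) are hypothetical and only illustrate the principle that the transformation logic is written once and the only thing that changes is the kind of source it is fed. In Spark the analogous switch is between the batch entry point (`spark.read`) and the streaming one (`spark.readStream`), with the DataFrame transformations in between left largely unchanged.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical event shape used only for this illustration.
Event = Dict[str, float]


def large_orders(events: Iterable[Event], threshold: float) -> Iterator[Event]:
    """Shared logic: keep only the orders whose amount exceeds the threshold."""
    for event in events:
        if event["amount"] > threshold:
            yield event


def batch_source() -> List[Event]:
    """A bounded data set that is fully available up front (batch)."""
    return [
        {"id": 1, "amount": 5.0},
        {"id": 2, "amount": 50.0},
        {"id": 3, "amount": 500.0},
    ]


def streaming_source() -> Iterator[Event]:
    """The same records, delivered one at a time (stands in for a stream)."""
    for event in batch_source():
        yield event


# The transformation is identical; only the source differs.
batch_result = list(large_orders(batch_source(), threshold=10.0))
stream_result = list(large_orders(streaming_source(), threshold=10.0))
```

Both calls go through the same `large_orders` function, so moving between the two modes means swapping the source rather than rewriting the logic.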
Is it free?
- Spark is free; Databricks is not
How can I use it?
- Databricks has a cloud portal – there is a free trial
- Databricks can be provisioned on AWS
- We’ll soon be able to provision Databricks in Azure – it’s in preview
What features differentiate it as a leading data platform?
- Unified coding model gives shorter dev cycles and time to production
- It’s PaaS – no hardware cluster to manage, create or look after and I can easily scale it
- Has a rich collaborative development experience allowing data engineers and data scientists to work together
- I can run data processing and querying over S3, Azure Data Lake Storage and Hadoop HDFS with:
  - Much greater performance than other distributed storage query engines
  - Automatic Index Creation
  - Automatic Caching
  - Automatic Data Compacting
  - Transactional Support
- There is no buy-in to a proprietary storage format – i.e. the data just sits on S3, for example, and I can access and manage it with other processes and tools
- Delta (2018) transactionally incorporates new batch and/or streaming data so that it is immediately available to queries – no other data platform has this