Profiling Data in Databricks

Saturday morning study session with a little focus on profiling data in Databricks notebooks. Here’s a quick review…

So what I’ve covered so far:

  • Spark Profiler
  • Databricks Profiler
  • Great Expectations
  • Pandas Profiler

tl;dr: Pandas Profiler and Great Expectations are my go-tos.

Spark Profiler

Spark has profiling methods on a DataFrame:

my_df.describe()
my_df.summary()

This is nothing new and you can read about it here. All in all it’s pretty limited, but you do get the results back as a DataFrame that you can use for other automations. For more advanced profiles, though, you’ll be neck deep in DataFrame code.
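Handily, describe() returns a regular DataFrame with a summary column, so the stats can feed straight into other code:

# describe() returns a small DataFrame with a 'summary' column
# ('count', 'mean', 'stddev', 'min', 'max') plus one string column
# per profiled input column
stats = my_df.describe()
counts = stats.filter("summary = 'count'")  # e.g. pull just the row counts
display(counts)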

Databricks Profiler

Databricks has introduced a richer profiler. You can either click the profile tab on the results or call it from code. The problem is usability: there doesn’t appear to be a way to get the results back as a DataFrame, e.g. for generating expectations. The GUI isn’t great either; expanding out the “show raw data” view is really limited and I can’t get that into a DataFrame either.

airportsnaFilePath = "/databricks-datasets/learning-spark-v2/flights/airport-codes-na.txt"

# Obtain airports data set
airportsna = (spark.read
  .format("csv")
  .options(header="true", inferSchema="true", sep="\t")
  .load(airportsnaFilePath))

# Render the built-in data profile for the DataFrame
dbutils.data.summarize(airportsna)

Great Expectations

Great Expectations has an auto-profiler available in the CLI tooling and it supports native Spark DataFrames, which is great. The auto-profiler does an OK job and it also creates some great expectations (see what I did there). The downside is the learning curve getting this working on Databricks.
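If you want to skip the CLI, here’s a minimal sketch of profiling a Spark DataFrame straight from a notebook. It assumes the legacy SparkDFDataset/BasicDatasetProfiler API; those module paths move around between Great Expectations releases, so check them against your installed version:

%pip install great-expectations
from great_expectations.dataset import SparkDFDataset
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# Wrap the Spark DataFrame so Great Expectations can work with it
ge_df = SparkDFDataset(airportsna)

# The profiler returns a generated expectation suite plus the validation
# results from running that suite back over the data
suite, validation_result = BasicDatasetProfiler.profile(ge_df)
print(suite)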

Pandas Profiler

Hands down this is the best quick profiler I tried. The output is awesome: hugely detailed and readable. The downside is that it’s pandas, so blowing up your cluster is a possibility! However, if your datasets are manageable then it’s great. You can either render the report in a notebook…

%pip install pandas-profiling
from pandas_profiling import ProfileReport

# toPandas() collects the whole dataset onto the driver, hence the
# cluster-blowing caveat above
profile = ProfileReport(airportsna.toPandas(), title="Pandas Profiling Report", explorative=True)
displayHTML(profile.to_html())

Or here’s a little trick to provide a download…

# Write the report to the driver's local disk...
profile.to_file('/tmp/profile_report.html')
dbutils.fs.ls("file:/tmp/")
# ...then copy it into the DBFS FileStore so it's downloadable via /files/
dbutils.fs.cp("file:/tmp/profile_report.html", "/FileStore/data_profiles/profile_report.html")
displayHTML("""
<a href='https://adb-0000000000000000.0.azuredatabricks.net/files/data_profiles/profile_report.html'>report</a>
""")

Hopefully I can get the data as a DataFrame too. Here’s a section of the report… it’s much bigger than I could fit in the screen grab.

The downside here, of course, is that we’re not using Spark DataFrames and it creates no expectations. I did try using pyspark.pandas (AKA Koalas) by simply changing the import, but alas, no joy; no full parity yet!
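For the record, this is roughly the swap I tried; treat it as a sketch of the failed experiment, assuming Spark 3.2’s to_pandas_on_spark() method:

import pyspark.pandas as ps  # swapped in for "import pandas as pd"
from pandas_profiling import ProfileReport

# to_pandas_on_spark() gives a pandas-on-Spark frame (a ps.DataFrame)
# instead of collecting everything to the driver with toPandas()
psdf = airportsna.to_pandas_on_spark()

# pandas-profiling leans on pandas internals that pyspark.pandas doesn't
# implement yet, so this currently falls over
profile = ProfileReport(psdf, title="Pandas Profiling Report")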