This could be considered “The Basics” but it depends where you are on your learning journey. First lets review some high level guidance whether to use a calculated column or measure:
- Row based calculations – if you’re a SQL whizz think.
f() OVER(PARTITION BY ... ORDER BY...)
We get the same number of rows but we can aggregate within a row over the whole table or part of the table depending on the values of the current row.
- Can be used on any Axis, including the slicer – We’ve effectively created another attribute that can be used on rows, columns or slicer (which is just another axis).
- Aggregated Calculations – Generally we use it to aggregate data across the whole model and perform calculations using aggregates. Think ratios and again if you’re a SQL whizz you’d probably use a bunch of sub queries, aggregate functions and join it together to calculate your ratio
- User Context Sensitive – These calculations will be aggregated using the calculation you’ve defined across the selections on the axis a user makes i.e. they are context sensitive based on user selections. SQL cannot do this using a single expression – this is why cubes are very analytically powerful compared to SQL but conversely DAX is much harder to fully grasp. Generally speaking DAX will offer better performance for aggregates most of the time (when done correctly).
The journey to DAX hasn’t been to bad for myself since my MDX was pretty solid. Once you get over the differences of dealing with tables it’s a home run since the fundamental of context based calculations over user selected combinations of the whole data model is a way of thinking required for both DAX and MDX. It’s that way of thinking that is the learning curve.
Comparative Column Calculations
So we’re going with column just for the learning points. I’m blogging about this particular topic since if you’re coming from SQL or MDX then it might be helpful. Typically we would be familiar with the following:
- MDX: CurrentMember – In DAX there is no such thing!
- SQL: OVER() – In DAX there is no such thing!
So imagine the scenario:
We have a table of categorised granular measures and we want to create a calculated column for each row using:
- Each row value n
- & N = Aggregate(n) for category of n
Note that aggregate may be sum, min, max, etc.
We have a data set of run times for sequential processes split across 2 parallel streams. The data is at the grain of the process execution. I have:
- Stream : the stream it executed in
- Process : name of the process
- StartDate : when it started
- EndDate : when it ended
I want to plot a Gantt Chart using a Stacked Bar Chart of each stream to compare race conditions to visualise if it could be more efficient. To do this I need to calculate:
- ExecutionDuration : Easy EndDate – StartDate (in seconds)
- StreamStartDate : This is the MIN(StartDate) of a given stream; we need it to calculate the WaitDuration. Perhaps don’t need this step but helps with understanding.
- WaitDuration : StartDate – StreamStartDate (in seconds)
So some example data might be as follows:
In order to calculate the WaitDuration my brain immediately jumps to currentMember since I can refer to the current member context for the startdate and an aggregated min() startdate over the context over the whole data set with the same stream. But there are no members and there is no current member so how to we do this in DAX. Again in multiple steps for clarity.
StreamStartDate = MINX ( FILTER ( 'ProcessExec', 'ProcessExec'[Stream] = EARLIEST ( 'ProcessExec'[Stream] ) ), 'ProcessExec'[StartTime] )
WaitDuration = DATEDIFF('ProcessExec'[StreamStartDate], 'ProcessExec'[StartDate], SECOND)
RunDuration = DATEDIFF('ProcessExec'[StartDate], 'ProcessExec'[EndDate], SECOND)
WaitDuration and RunDuration require no explanation they should be easily understood.
StreamStartDate however makes use of the Earliest function which does require explanation. I’ve seen various explanations over the years and the docs themselves have changed. Essentially it’s hard explain and this is the crux of why DAX is hard to understand. Start simple and build – this is a pretty simple example.
In English the calc is:
Get the MIN StartDate for a set of data where the Stream is equal to the Stream in the current row.
MINX takes a table and an expression to aggregate in this case min.
We’ve passed in a filtered table to <table> to contain only those rows that have a stream equal to the stream in our current row.
FILTER ( 'ProcessExec', <-- this has a table context! 'ProcessExec'[Stream] = EARLIEST ( 'ProcessExec'[Stream] ) <-- acts as where condition )
Because Column calculations start with a ROW context we switch context from the table context (passed to the filter function) to row context using the Earliest function to compare Stream values on the filter.
Column or Measure
I thought this would be a worth while example because it starts to condition your thinking away from current member. However should this be a calculated column or measure? I’ll leave that one to stew with the following points:
- Is duration a useful additive continuous measure? Yes
- Would it be cool if we could efficiently drill up and down on our Gantt chart for large data sets? Yes (ignoring the limitations in Power BI data vis!)
- Will the continuous duration measure be useful as categorical data on rows, column, slicer or any other axis? If it’s banded it might be otherwise no.
- Do we want the measure to display aggregated correctly no matter what selections the user makes with other measures? Yes
- Calculated columns persist the data into model memory, do we need to? No