Finer Points – DAX: Calculated Columns and Context

This could be considered “The Basics” but it depends where you are on your learning journey. First let’s review some high-level guidance on whether to use a calculated column or a measure:

Column:

  • Row-based calculations – if you’re a SQL whizz, think:
    f() OVER(PARTITION BY ... ORDER BY ...)

    We get the same number of rows, but we can aggregate within a row over the whole table, or part of it, depending on the values of the current row.

  • Can be used on any axis, including the slicer – we’ve effectively created another attribute that can be used on rows, columns or the slicer (which is just another axis).

Measure:

  • Aggregated calculations – generally we use a measure to aggregate data across the whole model and perform calculations using those aggregates. Think ratios; if you’re a SQL whizz you’d probably use a bunch of subqueries and aggregate functions and join it all together to calculate your ratio.
  • User context sensitive – these calculations are aggregated, using the calculation you’ve defined, across whatever selections the user makes on the axes, i.e. they are context sensitive based on user selections (see the sketch after this list). SQL cannot do this with a single expression – this is why cubes are analytically very powerful compared to SQL, but conversely DAX is much harder to fully grasp. Generally speaking, DAX will offer better performance for aggregates most of the time (when done correctly).
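
As a tiny illustration of that last bullet (completely hypothetical table and column names – they aren’t from the model used later in this post), a ratio measure re-aggregates its SUMs for whatever combination of rows, columns and slicers the user picks:

Average Sale Value =
DIVIDE (
    SUM ( 'Sales'[SalesAmount] ),      -- numerator re-aggregates over the user's current selections
    SUM ( 'Sales'[TransactionCount] )  -- denominator does the same, so the ratio stays consistent
)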

The journey to DAX hasn’t been too bad for me since my MDX was pretty solid. Once you get over the differences of dealing with tables it’s a home run, since the fundamental idea of context-based calculations over user-selected combinations of the whole data model is a way of thinking required for both DAX and MDX. It’s that way of thinking that is the learning curve.

Comparative Column Calculations

So we’re going with a column, just for the learning points. I’m blogging about this particular topic since it might be helpful if you’re coming from SQL or MDX. Typically we would be familiar with the following:

  • MDX: CurrentMember – In DAX there is no such thing!
  • SQL: OVER() – In DAX there is no such thing!

So imagine the scenario:

We have a table of categorised granular measures and we want to create a calculated column for each row using:

  • each row value n
  • N = Aggregate(n) over the category of n

Note that aggregate may be sum, min, max, etc.

Example

We have a data set of run times for sequential processes split across 2 parallel streams. The data is at the grain of the process execution. I have:

  • Stream : the stream it executed in
  • Process : name of the process
  • StartDate : when it started
  • EndDate : when it ended

I want to plot a Gantt chart, using a stacked bar chart for each stream, to compare race conditions and visualise whether the run could be more efficient. To do this I need to calculate:

  • RunDuration : easy – EndDate – StartDate (in seconds)
  • StreamStartDate : the MIN(StartDate) of a given stream; we need it to calculate the WaitDuration. We perhaps don’t need this intermediate step, but it helps with understanding.
  • WaitDuration : StartDate – StreamStartDate (in seconds)

So some example data might be as follows:

DAX Earlier

In SQL

DAX Earlier 1
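
The SQL version in the original post is a screenshot; as a rough sketch only (assuming a ProcessExec table with the columns listed above), the window-function approach would look something like this:

SELECT
    Stream,
    Process,
    StartDate,
    EndDate,
    DATEDIFF(SECOND, StartDate, EndDate) AS RunDuration,
    MIN(StartDate) OVER (PARTITION BY Stream) AS StreamStartDate,
    DATEDIFF(SECOND, MIN(StartDate) OVER (PARTITION BY Stream), StartDate) AS WaitDuration
FROM ProcessExec;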

In DAX

In order to calculate the WaitDuration my brain immediately jumps to CurrentMember, since I could refer to the current member context for the StartDate and an aggregated MIN(StartDate) over the whole data set for the same stream. But there are no members and there is no current member, so how do we do this in DAX? Again, in multiple steps for clarity.

Column: StreamStartDate

StreamStartDate =
MINX (
    FILTER (
        'ProcessExec',
        'ProcessExec'[Stream] = EARLIEST ( 'ProcessExec'[Stream] )
    ),
    'ProcessExec'[StartDate]
)

Column: WaitDuration

WaitDuration = 
DATEDIFF('ProcessExec'[StreamStartDate], 'ProcessExec'[StartDate], SECOND)

Column: RunDuration

RunDuration = 
DATEDIFF('ProcessExec'[StartDate], 'ProcessExec'[EndDate], SECOND)

WaitDuration and RunDuration require no explanation; they should be easily understood.

StreamStartDate, however, makes use of the EARLIEST function, which does require explanation. I’ve seen various explanations over the years and the docs themselves have changed. Essentially it’s hard to explain, and this is the crux of why DAX is hard to understand. Start simple and build – this is a pretty simple example.

The EARLIER and EARLIEST functions refer back to earlier (outer) row contexts – EARLIEST being the very first, i.e. the outermost.

In English the calc is:

Get the MIN StartDate for a set of data where the Stream is equal to the Stream in the current row.

MINX takes a table and an expression; it evaluates the expression for each row of the table and returns the minimum.

MINX(<table>, <expression>)

We’ve passed a filtered table into <table> so that it contains only those rows that have a stream equal to the stream in our current row.

    FILTER (
        'ProcessExec',  -- this has a table context!
        'ProcessExec'[Stream] = EARLIEST ( 'ProcessExec'[Stream] )  -- acts as the where condition
    )

Because a calculated column is evaluated in a row context, and FILTER introduces its own row context as it iterates the table passed to it, we use EARLIEST inside the filter condition to reach back to the original (outermost) row context and compare its Stream value against the row currently being filtered.
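
If EARLIEST still feels slippery, an equivalent way to write the same column (a sketch, not the code from the post) is to capture the current row’s stream in a variable before FILTER introduces its own row context; variables need a reasonably recent DAX engine, but many people find this easier to reason about:

StreamStartDate =
VAR CurrentStream = 'ProcessExec'[Stream]    -- evaluated in the calculated column's row context
RETURN
    MINX (
        FILTER ( 'ProcessExec', 'ProcessExec'[Stream] = CurrentStream ),
        'ProcessExec'[StartDate]
    )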

Column or Measure

I thought this would be a worthwhile example because it starts to condition your thinking away from CurrentMember. However, should this be a calculated column or a measure? I’ll leave that one to stew with the following points (and a sketch of a measure version after the list):

  • Is duration a useful additive continuous measure? Yes
  • Would it be cool if we could efficiently drill up and down on our Gantt chart for large data sets? Yes (ignoring the limitations in Power BI data vis!)
  • Will the continuous duration measure be useful as categorical data on rows, columns, slicer or any other axis? If it’s banded it might be; otherwise no.
  • Do we want the measure to display aggregated correctly no matter what selections the user makes with other measures? Yes
  • Calculated columns persist the data into model memory – do we need to? No
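
For what it’s worth, a measure version might look something like the sketch below – hypothetical name, untested against this model, and it deliberately mirrors the calculated column by ignoring any other filters on the process table. Nothing is persisted and it re-aggregates per user selection; whether that trade-off wins here is the question the bullets above are stewing on.

Total Wait Duration (s) =
SUMX (
    'ProcessExec',
    DATEDIFF (
        CALCULATE (
            MIN ( 'ProcessExec'[StartDate] ),
            ALLEXCEPT ( 'ProcessExec', 'ProcessExec'[Stream] )    -- keep only the stream of the current row
        ),
        'ProcessExec'[StartDate],
        SECOND
    )
)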

Finer Details – Stream Analytics & PowerBI: Feb 1st Update

So folks might have noticed the Feb 1st update to Stream Analytics. There’s a fair bit of stuff in this release. What I’m going to focus on, though, is how Stream Analytics now integrates into Power BI.

Power BI gained streaming datasets some time after the Azure Stream Analytics (ASA) integration. Folks who have worked with it might be aware that when you hooked up ASA to Power BI it just created an ordinary dataset rather than a streaming dataset. When I first did this it jumped out as something that probably needed to be consolidated… and now it has.

capture

So what does the implementation experience look like now? Well, from the Stream Analytics side nothing much appears to have changed. I have a test query that’s not doing much – it just fires some data through. I’m actually firing JSON in through an event hub, but I’ll focus on the ASA–Power BI bit. The query is as follows:

WITH customers AS (
    SELECT
        [customer].customerCode,
        [customer].firstname,
        [customer].lastname,
        [customer].addressline1,
        [customer].addressline2,
        [customer].addressline3,
        [customer].postcode,
        [customer].town,
        [customer].county,
        [customer].country,
        [customer].customerEmail,
        [transaction].transactionCode,
        [transaction].status,
        [transaction].[transactionTime],
        [transaction].[transactionAmount]
    FROM
        [evehub]
)

SELECT * INTO [TestReport] FROM customers

When creating the output everything seems as it was before. My assumption was that something would’ve changed here and that I would have to define my streaming dataset endpoint first in Power BI and choose that dataset here. But that’s not the case…

capture2

The subtle impact here is that it’s not like the API, where I can just fire JSON at a secure endpoint. I still have to authorise with an account! This means you need a Power BI account for your Stream Analytics service which, in a production environment, you probably won’t want to be a named user account. Also, if the workspace needs to be shared then it will need to be a Pro account.

So far we’ve hooked up ASA to Power BI but we haven’t done anything in Power BI. Well, it turns out we don’t have to manually define the endpoint like we do with API integration. When data starts streaming through, the dataset appears automatically under Streaming Datasets within Power BI. So when I turn on my data stream using a C# app I get the following in Power BI.

capture3

When I click the pencil to edit the data set I can see that it’s created all the attributes, typed them and defaulted history to on.

capture4

Now it seems we have some control over how these datasets are shaped in Power BI, and we can flip the history on or off. We also have an endpoint that we can use with an API, cURL or PowerShell…

capture5
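
I haven’t tried this bit yet, but pushing a row at that endpoint from PowerShell ought to look roughly like the following – the push URL is whatever the API Info panel above gives you (placeholders here), and the column names just mirror the ASA query:

# Push URL copied from the dataset's API Info panel (placeholder values)
$endpoint = "https://api.powerbi.com/beta/<workspace>/datasets/<dataset id>/rows?key=<key>"

# One row shaped like the ASA output, wrapped in an array
$rows = ConvertTo-Json @(
    @{
        customerCode      = "CUST001"
        transactionCode   = "TXN001"
        status            = "complete"
        transactionTime   = (Get-Date).ToString("s")
        transactionAmount = 9.99
    }
)

Invoke-RestMethod -Method Post -Uri $endpoint -Body $rows -ContentType "application/json"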

This means we might be able to stream data into the dataset from another API source, not just Stream Analytics. It also seems I should be able to do the following:

  • add an attribute that’s not in my ASA stream
  • remove an attribute that is in my ASA stream

Let’s see…

Add an Attribute

I created a streaming tile just totting up transactions in real time.

capture6

When I add a new column the streaming tile still displays without issue. However, when I go to create a new report, or edit the existing report I pinned to the dashboard from my dataset, I start hitting issues. To see the new column in the dataset in the report editor I have to hit refresh. After doing this my dataset and report are broken.

capture7

capture8

Editing the data set again, removing the column and hitting refresh doesn’t fix the issue. It seems I have to trash and recreate the data set.

How about ASA? Well, ASA also had a problem. It didn’t stop the job, but it failed to send the transactions, meaning the seemingly unaffected tile won’t update with any new data. Note the lack of a column name in both the Power BI and ASA exceptions.

capture10

Remove an Attribute

What happens when we remove an attribute?

Well… the same thing as we might expect from trying to add a column.

Conclusion

This is a good change and consolidates the streaming dataset features in Power BI for both APIs and Stream Analytics. We gain some further control over datasets, but it seems ASA is very tightly bound to the dataset definition and in reality we can’t do much without breaking it.

It might mean that we can stream data in from another source, not just ASA, since we now have an endpoint we can use… I haven’t tried this yet.

All in all I’m still left a bit wanting on the mashing features for real-time data in Power BI. I’m sure it will get there… Microsoft are doing a great job… We are a tough bunch to please these days 🙂

Finer Points – Azure Stream Analytics & PowerBI: Duplicates

Power BI is evolving rapidly and market competition is really hotting up, especially with Tableau. Arguably Power BI’s real-time capability puts it slightly ahead of the game in one aspect. I managed to get some detailed hands-on to see what the experience and capability are really like. There are 2 ways to stream data into Power BI:

  1. APIs
  2. Azure Stream Analytics

You can get up and running with stream analytics pretty quickly however sometimes it’s good to break stuff just to see what you’re getting yourself into.

I’m not going to go into details of how the architecture is set up since there are loads of blogs on that already. Basically I:

  1. Set up a Stream Analytics job pointing at a storage blob to read CSVs
  2. Added a query to pull apart the date and add in some system columns
  3. Fed the data into my Power BI subscription

arch

You can stream data from IoT, event hubs or storage events. I’ve used storage events here just because they’re easy to set up. Using storage for real-time is a way to batch events up into an event payload which you can stream through and then archive into a historical database or data lake.

The query is as follows:

SELECT
    BlobName,
    EventProcessedUtcTime,
    BlobLastModifiedUtcTime,
    PartitionId,
    CAST(DATEPART(year, Date) AS nvarchar(max)) AS Year,
    CAST(DATEPART(month, Date) AS nvarchar(max)) AS Month,
    CAST(DATEPART(day, Date) AS nvarchar(max)) AS Day,
    Date,
    Region,
    CAST(Sales AS BIGINT) AS Sales
INTO
    [TestReport]
FROM
    [Test]

In this scenario I’m just going to use the storage account write events to push a few simple rows of data into Power BI. I will load 2 CSV files:

  1. Sales for UK & Wales for Jan 2017 09:00:01 – 09:30:00
  2. Sales for UK & Wales for Jan 2017 09:30:01 – 10:00:00

Here is an example showing the 2nd file’s start and end; note that I’ve deliberately included data that will cause an error in Stream Analytics – “wrong” cannot be cast as a bigint:

Date,Region,Sales
2017-01-01 09:30:01,UK,5
2017-01-01 09:31:00,UK,1
2017-01-01 09:32:00,UK,2
....
2017-01-01 09:58:00,WALES,1
2017-01-01 09:59:00,WALES,4
2017-01-01 10:00:00,WALES,wrong

Step 1 – I copy file 1 up into the storage account; within a few seconds my test dataset appears and I can see data from the 1st file. Note that I’ve included BlobLastModifiedUtcTime.

time-data-1

Step 2 – I copy file 2 up into the storage account. At this point the job stops and is in a failed state, as expected. Looking in the logs we can clearly see the error:

exception-message

However, interestingly, when we look at Power BI some data (1 row) made it through before the failure ended the stream:

2-with-error

Step 3 – so I’ve picked up my failure and I correct the file, changing the value of “wrong” to 50. Note I can only do this because I have a file to change.

Date,Region,Sales
2017-01-01 09:30:01,UK,5
2017-01-01 09:31:00,UK,1
2017-01-01 09:32:00,UK,2
....
2017-01-01 09:58:00,WALES,1
2017-01-01 09:59:00,WALES,4
2017-01-01 10:00:00,WALES,50

I copy file 2 back up into my storage account, overwriting the previous copy that caused the job to fail. Then I restart the job from when it failed. Note I haven’t used TIMESTAMP BY to set the event date using a source column.

when-last-finished

The job starts successfully and picks up the corrected file. However, on checking Power BI, I now have a situation where duplicates have made it into my Power BI dataset. Looking at BlobLastModifiedUtcTime I can see that they originate from different blob events: one that caused the failure and one that did not. The numbers are different since there are 2 rows for UK in the dataset with the same time, which Power BI has aggregated. This shows that only 1 row made it through on the failure:

dupes-time

Duplicates

On perusing the documentation I came across the following:

“….When a Stream Analytics job is running, duplicate records may occasionally be noticed in the output data. These duplicate records are expected because Azure Stream Analytics output adapters don’t write the output events transactionally. This ‘duplicate record’ scenario can result if one of the following conditions occurs:

  • The Azure instance is upgraded while the job is running
  • The Stream Analytics job is upgraded or an issue occurs with connectivity or reliability to the job output
  • The Azure instance running the job has an outage

The downstream consumer of the output events needs to dedupe the events using the logical identity of the events. For example, if you are aggregating events by groups in a tumbling window, the logical identity of the event is the groups and the tumbling window’s end time. If you are running a pass-through query, you may need to carry a unique id on the event in order to dedupe…”

Fixing the Dupes

A simple observation is to make sure the queries and source data don’t invoke errors. Although the docs imply that Azure instance interruptions can cause the same issue… and outright state that there is no transactional protection of outputs.

In our case we are partitioning data into files before dumping the data into an event window for potential aggregation, e.g. every 30 minutes. Because it’s a file on Azure storage I can get in to correct the data and restart the stream.

Windowing

So does windowing help us? Well, yes and no, since it depends on the requirements. It fundamentally comes back to whether we are using the right thing for the job. With Stream Analytics we’d want to be doing some sort of windowed aggregate. If we partitioned our files at the same grain as the aggregate window then no file would commit as output events unless the operation succeeded for all the data in the file.
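
For example, something like the following (a sketch based on the earlier query – I haven’t run this variant) would aggregate sales into 30-minute tumbling windows on the event time, rather than passing rows straight through:

SELECT
    Region,
    System.Timestamp AS WindowEnd,
    SUM(CAST(Sales AS BIGINT)) AS Sales
INTO
    [TestReport]
FROM
    [Test] TIMESTAMP BY [Date]
GROUP BY
    Region,
    TumblingWindow(minute, 30)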

Dedupe

This is pretty much out of the window since we’ve plugged ASA straight into Power BI. There is no way to insert custom dedupe logic. If you know which files have failed you can filter them out, which is an ugly workaround.

Reload

All my data is in archived files so I can delete the dataset, correct my data and reload everything.

API

I can use the Power BI API to delete rows of data from my dataset using BlobLastModifiedUtcTime.

https://msdn.microsoft.com/en-us/library/dn877544.aspx

https://github.com/Microsoft/PowerBI-CSharp/tree/master/samples/webforms/get-started-web-app-asp.net

Do we care?

Considering the use case, do we really care? If we’re running an event stream we should only really care about the now, particularly if we’re using Stream Analytics to stream data through prediction functions or statistical aggregations, since they are approximate anyway, based on sample populations of data.

If, on the other hand, it’s for a BI function that just wants to see its data in a BI model as soon as possible, then we are going to care that the numbers are correct, especially in Finance. As well as real-time, we’d be archiving this data off to load a historical data model. This is where the real-time feature can get a bit misconstrued. Real-time BI is a different beast to real-time event processing. Currently, using this method, there is no way to blend your real-time feed into a richer BI model with related data sets and calculations. Even with event stream processing it would be good to visualise the event stream against the backdrop of historical aggregates.

In order to build real-time BI models we need to be able to partition data sets and point the partitions at different sources – e.g. a real-time source and a retrospective source. It’s a tried and tested method with SSAS multidim. Partitioning is on the way for PaaS SSAS Tabular, and perhaps the API gives me more options for achieving real-time BI. Or perhaps a richer architecture with Stream Analytics putting something in between its output and Power BI. The only other way to push data in real time into datasets currently is using APIs.

Right… onto the API.

MS Tabular – Simple Custom Rollup

Introduction

I have been having a bit of hands-on with SSAS Tabular just lately, and one of the aspects that I think is extremely powerful is that it provides users with the ability to mash up their own data, in various ad-hoc data sources, with data from a structured SQL Server database. More specifically, allowing users to create and control how granular attributes of a dimension roll up into a hierarchy is a valuable thing, for example an organisational structure or an accounts hierarchy. Users do need to be quite good with data modelling and DAX to leverage this, and time will tell on that one. With that in mind, I’m going to create a user-controlled custom rollup on an accounts dimension using only very basic data modelling and DAX concepts.

Essentially, within an accounting solution you might want to avoid storing numbers as positive (+ve) and negative (–ve), since in accounting terms either the business owns it or it doesn’t, i.e. it’s either an asset or a liability. Also, things come into the business and things go out, i.e. it purchases supplies and sells goods and services. None of the accounts themselves are either +ve or –ve, since in accounting we deal with credits and debits. Depending on what aspect of the business you’re evaluating, and how you’re presenting the figures, an account may be considered in credit in one context but in debit in another. For example, the Closing Balance of Stock is considered a Credit in the P&L but a Debit in the Balance Sheet.

In BI data modelling terms we deal with this by associating the aggregation operator with the reporting dimension hierarchy rather than with the number itself; thus, when the measure is combined with the hierarchy, it is aggregated within the context of that hierarchy. More specifically, if you’re familiar with SQL Server Analysis Services Multidim, we do this using a rollup operator that is stored on the accounts dimension.

This blog is a simple DAX design pattern to allow business users to create their own custom rollups on an accounts dimension and trial balance fact table from SQL Server.

Solution

I always like to present the final solution before going into the specifics of the design since knowing the final output tends to make design details a bit easier to digest. So here’s what we’re aiming for:
Custom Rollup in PowerPivot

So here we can see I’ve got a P&L hierarchy that rolls up the accounts through Gross Profit and Expenses and calculates the Profit. The accounts and figures have all been taken from a SQL Server database; however, which accounts are included, and how they roll up into Cost of Goods Sold, Sales, Gross Profit, Expenses and Net Profit, has all been created and controlled by the Excel user. The P&L rollup is created from the following table in the Excel file, which is linked into the Tabular model:

customrollup2

So in this table we can see the accounts mapping to their relevant parent P&L members, and a rollup operator that indicates how the account should be evaluated. The rollup operators should look familiar if you’re from an SSAS Multidim background like me, since the symbols are the same: + is sum, – is subtract and ~ is exclude. Notice there are many accounts that are not used here, since they are not included in the P&L, and as such they have no parents defined either. A different rollup application might have the numbers excluded from the rollup but still have them included in the hierarchy.

Implementation

The underlying SQL Database contains an Accounts dimension, Date dimension and a fact table containing account Balances. The simple star schema is as follows:

customrolloup3

From the diagram we can see that the Account dimension is simply a list of accounts and there is no hierarchy. FactBalance contains the account trial balances, which are all physically positive numbers that can conceptually be either summed or subtracted. I’ve included a very simple Date dimension as usual, since it’s always good to see how a solution works with more than 1 dimension in play, to test and understand what is going on inside the software without making it too complex.

The next step is to pull these 3 tables into SSAS Tabular. I won’t go into how this is done in detail since there are many resources that describe the process, and I’m assuming that if you’re looking to do custom rollups then you’ve already made a start with the basics.

Below is the Accounts dimension and Balance fact table that I’ve pulled into Tabular and renamed for usability.

customrollup4
The next step is to go to the Balance table tab and relate Account_Key to the Account_Key in the Account table, and Date_Key to the Date_Key in the Date table. If your database had these relationships when you imported the tables then Tabular will have created them automatically.

So, now we have our Tabular model, we’re going to create a table in Excel to roll up the Accounts dimension. Enter the following data into a spreadsheet in the workbook.

customrollup5

With cell A1 selected, click the Link Table icon on the Excel PowerPivot tab.

customrollup7

This will import the data into the PowerPivot model, linked directly to the data in the Excel sheet. Rename the new table we’ve just imported to something useful and relate the Account dimension to it using the Account Code; it should then look as follows:

customrollup7

So now we have all the data nicely imported into PowerPivot and related together, we’ll apply the custom rollup. To do this I’m going to use the way that SSAS actually applies the operators under the covers (I believe, but I’m sure someone will correct me if I’m wrong). Essentially the + operator multiplies the fact data by +1, – multiplies the fact data by –1 and ~ multiplies the fact by 0 (or null) to effectively yield null. After the operator has been applied, which adjusts the sign of the data or nulls it out, we leave it to PowerPivot to simply sum the result to give the answer. At this point you’re probably right in thinking I could’ve just used +1, –1 and 0 instead of +, – and ~; however, I just wanted it to be consistent with what we’re (or I’m) used to in SSAS.

In the Balance table we’re going to create a calculated column called PL Operator that uses the PowerPivot relationships we’ve defined to pull the operator down into the Balance table. The simple DAX formula I’ve used can be seen in the next screen capture:

customrollup8
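
The formula itself is in the capture above; as a sketch of the shape of it (the linked table and column names below are assumptions – use whatever you renamed yours to), it boils down to pulling the symbol through the relationships and translating it into a multiplier:

PL Operator =
IF (
    RELATED ( 'PL Rollup'[Operator] ) = "+", 1,
    IF ( RELATED ( 'PL Rollup'[Operator] ) = "-", -1, 0 )
)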

Once we have the operator we can create another calculated column, PL Amount, which applies the operator by multiplying PL Operator by the Balance, as shown below.

customrollup9
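
And the second column is just the multiplication described above – something like (again, column names assumed):

PL Amount =
'Balance'[PL Operator] * 'Balance'[Balance]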

So we can see the PL Amount column now has the rollup operator applied. The only thing left to do is review which columns you want to make visible in the model, drop a pivot table onto a spreadsheet and have a look. We could’ve just used 1 calculated column to achieve what we needed; I broke it down into 2 just for presentation purposes. Have a play at modifying the Excel-based rollups and refreshing the data model to see how quick and easy it is to manipulate the rollup.

Conclusion

This is a really simple way for users to consume data from a database and apply a very simple custom rollup. I really like the simplicity of it, and the fact that as a user I can just edit my rollups in my sheet, refresh the data and adjust how the accounts roll up. Also, you might have noticed there are trial balance figures for the accounts in my data that would be needed to create a balance sheet, so we could progress this further by having another rollup table to create a Balance Sheet.

I’m going to play around a bit more and see if it’s possible to create more complex rollups, and perhaps apply different rollup operators for different columns instead of just at the grain of the account. This is where I suspect things will become somewhat more complicated very quickly! Coming from an SSAS background I find it quite useful to take SSAS modelling concepts and challenge myself and PowerPivot to achieve the same end. PowerPivot modelling is a different way of thinking, and whilst it may not do what SSAS does out of the box, I’m convinced that where there is a will there is a way. Or maybe not.