Automating DW Delivery – T4 Templates & BIML

Awareness of BIML has improved somewhat in the last few years, but very little is still known about T4 Templates, which are the topic of this post.

Just a bit of fundamentals first. ETL, ELT and data provisioning platforms are predominantly schema-typed, and that isn’t going away whilst we’re on silicon chips. The primary reason for this is performance! Even in MPP architectures you’re going to hit a strongly typed schema at some point; you can’t escape the entropy of information.

The Good

 

SSIS has to size a memory pipeline for its given data feed. It does this so it can shovel the data through fast, distributed or not; and just to mention, the next version of SSIS can be distributed across multiple servers (perhaps we’ll see it in Azure one day). SSIS provides a hugely flexible and re-usable GUI to build pretty much any workflow and data feed your abilities will allow; so what’s the rub?

The Bad

 

Each SSIS executable you create has a specific use and metadata definition, with re-usable but very low-level features that are click-tastic enough to give you RSI after one large project. It’s labour intensive in the extreme and consequently error prone.

Designers cannot separate the metadata definitions from re-usable implementation patterns

 

This is a real pain… Why can’t I just build a template SSIS dimension-load package, fire in my metadata definitions for all my dimensions and implement them all with a single click of the mouse, getting all the re-use of fully tested code that any sane software engineer would expect? Well you can, but not with Visual Studio alone; enter BIML.

The BIML!

 

In short, BIML (Business Intelligence Markup Language) returns SSIS to a respectable coding experience where I can define my template SSIS patterns using readable XML, hook up a central metadata repository using C# and generate all my DW feeds. Now we’re talking… proper code re-use delivering specific and fast SSIS executables.

There’s a ton of info on BIML already; if this is new to you, stop here and go learn BIML! I have a couple of starter vids and there are loads of other tech writers covering the topic.

 

T4 Templates

 

So what’s this about T4 Templates then?

Well, BIML is only free to a point (BIML Express). It does have a fully-blown IDE (Mist) if you have the bucks. Also, a lot of good SSIS patterns for some of the more upstream transformations require relatively complicated stored procedures that can’t be created using BIML templates. You can of course write your own metadata-driven SQL wrapper using C#, which is a fair amount of legwork to say the least – it’s a long road and I have been down it.

Another option is dynamic SQL – which in my humble opinion are two words that should never go together! Why? Because SQL is terrible for text parsing / code generation, and for all the other reasons that are just too upsetting to list.

Or… you can use T4 Templates!

T4 Templates have been in Visual Studio since 2005. They’re a similar concept to BIML except the output can be any text – not just BIML. I can template out any other language, using C# to push in metadata dynamically in very much the same way that BIML works. I’m not limited to a specific language either: T4 could be used to create SQL procedures, Azure Data Factory JSON pipelines, Azure Data Lake jobs or C#.

It was included in Visual Studio specifically for templated code generation at design time or run time. It’s used by Visual Studio itself to generate code from UI designers – Entity Framework, for example. T4 is an abbreviation of:

  • Text – dynamically creates text output using template files
  • Template – a combination of text blocks & control logic written in C# or VB.net
  • Transformation – transforms the template into executable code and executes it to produce the final output
  • Toolkit – a set of assemblies packaged into Visual Studio

There are 2 types of Templates that can be used:

  • Design Time – Templates are executed at design time to generate code (a minimal sketch of this flavour follows below)
  • Run Time – Templates are compiled into classes in the application and executed when the application runs. They can receive parameters and sit within control logic.
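
A hedged, minimal sketch of the design-time flavour (the dimension list and the generated procedure below are invented purely for illustration; a real framework would pull its metadata from a central repository). Saving a .tt file like this in Visual Studio runs the transform and drops the generated .sql output alongside it:

<#@ template language="C#" #>
<#@ output extension=".sql" #>
<#
    // Illustrative metadata only – in a real framework this would come
    // from a central metadata repository rather than a hard-coded array.
    var dimensions = new[] { "Customer", "Product", "Date" };
    foreach (var dim in dimensions)
    {
#>
CREATE PROCEDURE dw.usp_Load_Dim<#= dim #>
AS
BEGIN
    -- merge staging into dw.Dim<#= dim #> (pattern body omitted for brevity)
END
GO
<#
    }
#>

The text blocks are emitted once per loop iteration with the C# expressions substituted in – exactly the control-logic-plus-template idea that BIML uses to stamp out packages.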

Using T4 Instead of BIML

 

It’s a non-starter, forget it! Whilst you could script out SSIS packages using the native XML, it’s just not (that) human readable. BIML brings the mark-up language to the party, which is more human readable by a country mile. You’d be mad to try to script out native SSIS XML whilst BIML Express is available for free.

Design or Run Time

 

On the topic of code generating code: should we be generating and executing code at design time or run time? I’ve seen various flavours of frameworks that do both, or somewhere in between.

I’m a firm believer that code generation should only occur during design time, or more specifically during a development process. The development process has all the bells and whistles required to help manage risk during a code generation exercise – e.g. accidentally loading live data into an area with unsecured access, or loading DEV data into live – not to mention all the other bugs and failures that could occur.

Do we really want dynamic code executing in live that has never run before? Also, debugging and issue resolution is an enormous pain in the neck if the code that ran isn’t available, easy to isolate and debug – dynamic run-time frameworks written in anger tend to be low on support features and overcomplicated!

Also, the arguments for dynamic run time seem to centre on circumventing bureaucratic change control that was put in place because of a poor development process. The solution to robust, agile BI development isn’t slipping in a cheeky back door to make clandestine changes; it is in fact continuous integration, which is a whole other detailed topic.

To use T4 templates with BIML frameworks we use the run-time template type, but during the development process, since the generated template classes can be called from the C# that BIML executes during package creation. In that respect they execute at the run time of the BIML build, not at the run time of the warehouse load.
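
As a hedged sketch of what that looks like (DimensionLoad.tt and its generated DimensionLoad class are assumptions for illustration – it’s the class Visual Studio generates when a template’s Custom Tool is set to TextTemplatingFilePreprocessor):

// Called from development-time C# (e.g. the C# that BIML executes while building packages).
// DimensionLoad is the class generated from the run-time template DimensionLoad.tt.
var template = new DimensionLoad();
string script = template.TransformText();                    // executes the template's control logic
System.IO.File.WriteAllText(@"DimensionLoad.sql", script);   // illustrative output path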

Automation

 

I’ve yet to come across a team fully automating the delivery of a data warehouse using metadata. The tools are available (and free) so I’m not sure what the barriers are.

I’ll start dumping my BIML and T4 Template framework assets into Git when I get a chance and hopefully show T4 working in unison with BIML to do more than just load a basic staging database layer.


Quick Tip – SSAS Degenerate Attributes

Not a huge amount of technical detail in this one… just a simple tip for SSAS, because I’ve seen it more times now than I can remember and ultimately it boils down to a fundamental misunderstanding of technical architecture and of the role SSAS plays in that architecture. A common response is that SSAS is “doing something weird”, or “SSAS is rubbish and not fit for purpose”, leading to awkward, overly complex workarounds. SSAS is awesome if you use it for what it’s meant for.

Don’t put degenerate attributes in an SSAS cube

 

A degenerate attribute has a unique attribute value for every row in a fact table. A cube is for browsing aggregates and drilling down into a narrow set of detail.

  • How is a degenerate useful for a person to look at holistically, or to analytically slice and dice measures? It isn’t. What will most likely happen is that a user will drag it onto an ad-hoc report and then get very frustrated waiting for it to return.
  • Secondly, the cube will take a very, very long time to process – particularly with column storage, and especially if it’s a string data type. Do you really want to process millions of unique values into a distinct column store? Is that going to achieve anything?

If you really must put degenerate information into a BI model:

  • Bucket them up into bandings so they can be aggregated – see the sketch after this list
  • Don’t include the leaf level detail as an attribute or column… Use drill through or a report action to navigate to low level transactions on a narrow selection of data… Cubes are for aggregates and drilling down into a narrow set of detail.
  • If they’re needed for a relationship something has gone wrong in your Kimball model. Rethink the model so it provides detail for drill through but binds on something that can be better aggregated and grouped.
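
As a hedged illustration of banding (the FactSales table and OrderValue column are invented for the example), the view or ETL feeding the model might do something like this:

SELECT
    f.OrderNumber,                          -- degenerate detail stays at the fact / drill-through level
    CASE
        WHEN f.OrderValue < 100   THEN '0 - 99'
        WHEN f.OrderValue < 1000  THEN '100 - 999'
        WHEN f.OrderValue < 10000 THEN '1,000 - 9,999'
        ELSE '10,000+'
    END AS OrderValueBand                   -- low-cardinality attribute that aggregates well
FROM dbo.FactSales AS f;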

 

Quick Tip – SSAS Tabular Referential Integrity

SSAS Tabular is a tad more loosely coupled than multidimensional. A quick and easy way to determine referential integrity violations on columns is to use the $system.DISCOVER_STORAGE_TABLES DMV. This DMV query will return a rowset with the table details and a column called RIVIOLATION_COUNT that indicates the number of rows with a referential integrity violation.

select * from $system.DISCOVER_STORAGE_TABLES

[Image: DMV result set showing the RIVIOLATION_COUNT column]

 

Finer Points – DAX : Calculated Columns and Context

This could be considered “The Basics” but it depends where you are on your learning journey. First let’s review some high-level guidance on whether to use a calculated column or a measure:

Column:

  • Row-based calculations – if you’re a SQL whizz, think:
    f() OVER(PARTITION BY ... ORDER BY...)

    We get the same number of rows but we can aggregate within a row over the whole table or part of the table depending on the values of the current row.

  • Can be used on any Axis, including the slicer – We’ve effectively created another attribute that can be used on rows, columns or slicer (which is just another axis).

Measure:

  • Aggregated Calculations – generally we use a measure to aggregate data across the whole model and perform calculations using those aggregates. Think ratios: if you’re a SQL whizz you’d probably use a bunch of sub-queries and aggregate functions and join them together to calculate your ratio (a hedged sketch follows this list).
  • User Context Sensitive – these calculations will be aggregated, using the calculation you’ve defined, across whatever selections the user makes on each axis, i.e. they are context sensitive based on user selections. SQL cannot do this with a single expression – this is why cubes are very analytically powerful compared to SQL, but conversely DAX is much harder to fully grasp. Generally speaking DAX will offer better performance for aggregates most of the time (when done correctly).
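
For contrast, a hedged sketch of a ratio measure (the Sales table and its Quantity and IsReturn columns are invented purely for illustration):

Return Rate =
DIVIDE (
    CALCULATE ( SUM ( Sales[Quantity] ), Sales[IsReturn] = TRUE () ),
    SUM ( Sales[Quantity] )
)

Dropped into a visual, this re-evaluates for every combination of attributes the user puts on rows, columns and slicers.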

The journey to DAX hasn’t been too bad for me since my MDX was pretty solid. Once you get over the differences of dealing with tables it’s a home run, since the fundamentals of context-based calculations over user-selected combinations of the whole data model are a way of thinking required for both DAX and MDX. It’s that way of thinking that is the learning curve.

Comparative Column Calculations

So we’re going with a column, just for the learning points. I’m blogging about this particular topic because if you’re coming from SQL or MDX it might be helpful. Typically we would be familiar with the following:

  • MDX: CurrentMember – In DAX there is no such thing!
  • SQL: OVER() – In DAX there is no such thing!

So imagine the scenario:

We have a table of categorised granular measures and we want to create a calculated column for each row using:

  • Each row value n
  • & N = Aggregate(n) for category of n

Note that aggregate may be sum, min, max, etc.

Example

We have a data set of run times for sequential processes split across 2 parallel streams. The data is at the grain of the process execution. I have:

  • Stream : the stream it executed in
  • Process : name of the process
  • StartDate : when it started
  • EndDate : when it ended

I want to plot a Gantt Chart using a Stacked Bar Chart of each stream to compare race conditions to visualise if it could be more efficient. To do this I need to calculate:

  • RunDuration : easy – EndDate – StartDate (in seconds)
  • StreamStartDate : this is the MIN(StartDate) of a given stream; we need it to calculate the WaitDuration. Perhaps we don’t need this intermediate step, but it helps with understanding.
  • WaitDuration : StartDate – StreamStartDate (in seconds)

So some example data might be as follows:

[Image: sample process execution data]

In SQL

[Image: the SQL version of the calculation]
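
The screenshot isn’t reproduced here, but a hedged sketch of the kind of windowed SQL meant (table name dbo.ProcessExec assumed) would be:

SELECT
    Stream,
    Process,
    StartDate,
    EndDate,
    DATEDIFF(SECOND, StartDate, EndDate) AS RunDuration,
    -- MIN over the stream partition puts the stream's first start time on every row
    DATEDIFF(SECOND,
             MIN(StartDate) OVER (PARTITION BY Stream),
             StartDate) AS WaitDuration
FROM dbo.ProcessExec;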

In DAX

In order to calculate the WaitDuration my brain immediately jumps to CurrentMember, since I can refer to the current member context for the StartDate and an aggregated min() StartDate over the whole data set with the same stream. But there are no members and there is no current member, so how do we do this in DAX? Again, in multiple steps for clarity.

Column: StreamStartDate

StreamStartDate =
MINX (
    FILTER (
        'ProcessExec',
        'ProcessExec'[Stream] = EARLIEST ( 'ProcessExec'[Stream] )
    ),
    'ProcessExec'[StartDate]
)

Column: WaitDuration

WaitDuration = 
DATEDIFF('ProcessExec'[StreamStartDate], 'ProcessExec'[StartDate], SECOND)

Column: RunDuration

RunDuration = 
DATEDIFF('ProcessExec'[StartDate], 'ProcessExec'[EndDate], SECOND)

WaitDuration and RunDuration require no explanation; they should be easily understood.

StreamStartDate, however, makes use of the EARLIEST function, which does require explanation. I’ve seen various explanations over the years and the docs themselves have changed. Essentially it’s hard to explain, and this is the crux of why DAX is hard to understand. Start simple and build – this is a pretty simple example.

The EARLIEST and EARLIER functions refer back to earlier (outer) row contexts – EARLIEST being the very first, outermost one.

In English the calc is:

Get the MIN StartDate for a set of data where the Stream is equal to the Stream in the current row.

MINX takes a table and an expression to aggregate – in this case the min.

MINX(<table>, <expression>)

We’ve passed in a filtered table to <table> to contain only those rows that have a stream equal to the stream in our current row.

    FILTER (
        'ProcessExec',  <-- this has a table context!
        'ProcessExec'[Stream] = EARLIEST ( 'ProcessExec'[Stream] ) <-- acts as where condition
    )

Calculated columns start with a row context, and FILTER introduces a new row context as it iterates over the table passed to it; the EARLIEST function reaches back to the outer (original) row context so we can compare Stream values in the filter condition.

Column or Measure

I thought this would be a worthwhile example because it starts to condition your thinking away from CurrentMember. However, should this be a calculated column or a measure? I’ll leave that one to stew with the following points (a hedged measure-based sketch follows the list):

  • Is duration a useful additive continuous measure? Yes
  • Would it be cool if we could efficiently drill up and down on our Gantt chart for large data sets? Yes (ignoring the limitations in Power BI data vis!)
  • Will the continuous duration measure be useful as categorical data on rows, columns, slicers or any other axis? If it’s banded it might be; otherwise no.
  • Do we want the measure to display aggregated correctly no matter what selections the user makes with other measures? Yes
  • Calculated columns persist the data into model memory, do we need to? No
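
For what it’s worth, a hedged sketch of the measure-based alternative over the same ProcessExec columns (the measure names are mine):

Total Run Duration =
SUMX (
    'ProcessExec',
    DATEDIFF ( 'ProcessExec'[StartDate], 'ProcessExec'[EndDate], SECOND )
)

Total Wait Duration =
SUMX (
    'ProcessExec',
    DATEDIFF (
        CALCULATE (
            MIN ( 'ProcessExec'[StartDate] ),
            ALLEXCEPT ( 'ProcessExec', 'ProcessExec'[Stream] )
        ),
        'ProcessExec'[StartDate],
        SECOND
    )
)

Nothing is persisted into the model, and both durations aggregate correctly across whatever the user selects.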

Quick Tip – Power BI Dev Tools Install Error

Installing the Power BI dev tools looks pretty straightforward. However, as most of us are probably behind a proxy server, you might come across this annoying failure.

npm install -g powerbi-visuals-tools
npm ERR! Cannot read property 'path' of null

You have to look in the npm log file for the error message that actually shows the HTTP 407 (proxy authentication required) error.

Basically, configure npm to use the HTTP registry address instead of HTTPS and pass the proxy address and Windows login in the call.

npm config set strict-ssl false
npm config set registry 'http://registry.npmjs.org'
npm --proxy http://username:password@myproxyaddress install -g powerbi-visuals-tools


Finer Points – USQL: Merging Datasets Part 1

Like a lot of Big Data platforms, Data Lake Analytics is a file-based data platform. It’s a totally different approach to RDBMS data warehouse batch processing. The architecture brute-forces the processing of complete file partitions on a distributed architecture. It just accepts that it will crunch a lot of data, but achieves performance by doing job lots on loads of computers. I’ll write more on technical architecture considerations at a later date.

This means in a structured data world we lose row level resolution of inserts, updates and deletes. So how do we do common data integrations like merge?

Most good tech platforms have multiple ways of achieving the same thing. I’m going to compare 2 different ways. I didn’t set out to rubbish one approach over the other; it’s just an exercise in scientific comparison to build my understanding of how stuff works, build up my coding-patterns toolbox and to share code and spark ideas – feedback on anything I’ve overlooked is most welcome.

I’ll split this into 2 blogs… since there’s a fair bit to get through. Part 1 will present different approaches and part 2 will put them both under load and compare the job graphs and performance differences (if any).

So the approaches are:

  1. Multiple steps using basic SQL like pattern – blogged about by Paul from Purplefrog
  2. Single hit using a Full Outer Join and conditional C# – from my own patterns

A note about Combiners – I considered and played around with a custom combiner. However, the join clause seems to be limited to an inner join; you can work around this but in the end it just feels very hacky and more hassle than it’s worth.

My aim is to understand the following facets:

  • Performance – how do the jobs compare when compiled and stress tested
  • Limitations – are there situations where one pattern works better than the other
  • Style – coding complexity, skill level and maintenance

The Demo Setup

 

Create the Target Managed Table

 

We need to create a managed table to merge the data into. We could of course do this with a file, however these types of operations are going to sit quite far up the information architecture stack – merge sets are structured, clean and conformed – so using a managed table seems appropriate.

You could entertain the idea of holding a complete Kimball star schema in Data Lake managed tables instead of using an RDBMS. There are consequences to this of course, which is another topic for another day.

CREATE DATABASE IF NOT EXISTS Patterns;
 
USE Patterns;
 
CREATE SCHEMA IF NOT EXISTS dw;
 
DROP TABLE IF EXISTS dw.person;
 
CREATE TABLE IF NOT EXISTS dw.person
(
    personid int,
    title string,
    email string,
    firstname string,
    surname string,
    deleteddate DateTime?,
    filename string,
    INDEX clx_personKey CLUSTERED(personid ASC) DISTRIBUTED BY HASH(personid)
);

The table structure shows that we’re using person details as the example, since it offers simple, intuitive data-change scenarios.

Source Data

 

I’m using 3 files as an example to load incrementally. For any solution we need to understand how we’re receiving data. To keep things simple we’re assuming a full extract on each increment, with hard deletes, that will require a full compare, i.e. there is no Change Data Capture (CDC) lower down the information architecture. The files are as follows:

person1.csv – 2 inserts

ID,Title,Email,Firstname,Surname
1,Mr,sr@qwertymail.com,Shaun,Ryan
2,Mr,dj@qwertymail.com,David,Jones

person2.csv – 1 insert, 1 update

ID,Title,Email,Firstname,Surname
1,Mr,sr@qwertymail.com,Shaun,Ryan
2,Dr,dj@qwertymail.com,David,Jones
3,Ms,sr@qwertymail.com,Stacy,Rogers

person3.csv – 2 inserts, 1 update, 1 physical delete (ID=1)

ID,Title,Email,Firstname,Surname
2,Dr,dj@qwertymail.com,David,Jones
3,Mrs,sr@qwertymail.com,Stacy,Stowman
4,Mrs,jm@qwertymail.com,Janet,Moores
5,Mr,kf@qwertymail.com,Kevin,Fullers

To do a deterministic merge of data we of course need a true unique business key which in this case is ID.

Executing the following Scripts

 

To load the files, dump them all in Data Lake or local storage at “/Person/” and run the scripts, incrementing @filenumber and inspecting the output in the dw.person table. I’m also writing the merged data out to “/Person/output/” in case you prefer to look at files rather than tables in Data Lake, and for debugging; for me the table viewer in Visual Studio is fine for a small amount of data.

I’m not going to review the data output of every approach, other than to say the data merges as you would expect. Below is the final output of all 3 incremental loads. All the code and files are up on GitHub, so to see it working give it a whirl.

[Image: final contents of dw.person after all 3 incremental loads]

1  – Multiple USQL Merge Steps

 

This is Paul’s approach, tweaked for my metadata. Basically we’re going to hit it in 3 steps using an inner, left and right join, and union all the results together.

USE Patterns;
 
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

//  Path
DECLARE @filenumber string = "1";
DECLARE @filename string = "person";
DECLARE @inpath string = "/Person/" + @filename + @filenumber + ".csv";
DECLARE @outpath string = "/Person/output/" + @filename + @filenumber + ".csv";
 
// read data
@data =
    EXTRACT Id int,
            Title string,
            Email string,
            Firstname string,
            Surname string
    FROM @inpath
    USING Extractors.Text(delimiter : @delimiter, escapeCharacter : @escape, quoting : false, skipFirstNRows : 1);

@merge =
    //update current
    SELECT [src].[Id] AS [personid],
           [src].[Title] AS [title],
           [src].[Email] AS [email],
           [src].[Firstname] AS [firstname],
           [src].[Surname] AS [surname],
           (DateTime?) null AS [deleteddate],
           @filename + @filenumber AS [filename]
    FROM @data AS [src]
         INNER JOIN
             [dw].[person] AS [tgt]
         ON [src].[Id] == [tgt].[personid]
 
    UNION ALL
 
    //insert new
    SELECT [src].[Id] AS [personid],
           [src].[Title] AS [title],
           [src].[Email] AS [email],
           [src].[Firstname] AS [firstname],
           [src].[Surname] AS [surname],
           (DateTime?) null AS [deleteddate],
           @filename + @filenumber AS [filename]
    FROM @data AS [src]
         LEFT OUTER JOIN
             [dw].[person] AS [tgt]
         ON [src].[Id] == [tgt].[personid]
    WHERE [tgt].[personid] IS NULL
 
    UNION ALL
 
    //keep existing and logically delete
    SELECT [tgt].[personid],
           [tgt].[title],
           [tgt].[email],
           [tgt].[firstname],
           [tgt].[surname],
           (DateTime?) DateTime.Now AS [deleteddate],
           [tgt].[filename]
    FROM @data AS [src]
         RIGHT OUTER JOIN
             [dw].[person] AS [tgt]
         ON [src].[Id] == [tgt].[personid]
    WHERE [src].[Id] IS NULL;
 
//optionally - output to file
//so we can save a review each iteration output
OUTPUT @merge
TO @outpath
USING Outputters.Csv();
 
//truncate table - we can't do row level operations
//it's all file based processing so we have to reload the whole table (partition)
TRUNCATE TABLE [dw].[person];
 
//insert the merged data
INSERT INTO [dw].[person]
(
    [personid],
    [title],
    [email],
    [firstname],
    [surname],
    [deleteddate],
    [filename]
)
SELECT [personid],
       [title],
       [email],
       [firstname],
       [surname],
       [deleteddate],
       [filename]
FROM @merge;

2  – Single USQL Merge Step

 

This is an approach I put together after seeing the first approach, wondering how much more complicated it would be – and what the execution differences would be – to do it in one hit using a full outer join with conditional selects.

USE Patterns;
 
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";
 
//  Path - increment @filenumber to load files 2 and 3 on successive runs
DECLARE @filenumber string = "1";
DECLARE @filename string = "person";
DECLARE @inpath string = "/Person/" + @filename + @filenumber + ".csv";
DECLARE @outpath string = "/Person/output/" + @filename + @filenumber + ".csv";
 
// read data
@data =
    EXTRACT Id int,
            Title string,
            Email string,
            Firstname string,
            Surname string
    FROM @inpath
    USING Extractors.Text(delimiter : @delimiter, escapeCharacter : @escape, quoting : false, skipFirstNRows : 1);
 
@merge =
        SELECT
            //select the source data if insert or update
            //select the target data if deleted 
            //we have to cast the datatypes since the if else construct returns nullable types

            (int)(issource ? [srcPersonid] : [tgtPersonid] ) AS personid,
            (string)(issource ? [srcTitle] : [tgtTitle] ) AS title,
            (string)(issource ? [srcEmail] : [tgtEmail] ) AS email,
            (string)(issource ? [srcFirstname] : [tgtFirstname] ) AS firstname,
            (string)(issource ? [srcSurname] : [tgtSurname] ) AS surname,
            (issource ? null : (DateTime?)DateTime.Now ) AS deleteddate,
            (string)(issource ? [srcFilename] : [tgtFilename] ) AS filename
        FROM
        (
    SELECT (
               // create a boolean that can be re-used in the outer 
               // query to keep the code clean
               // if update
               ([src].[Id] == [tgt].[personid] & [src].[Id] != null)
               // or if insert
               || ([tgt].[personid] == null)
               //then write source data
               ? true
               //else keep the target data
               : false
           ) AS issource,
           //source data
           [src].[Id] AS [srcPersonid],
           [src].[Title] AS [srcTitle],
           [src].[Email] AS [srcEmail],
           [src].[Firstname] AS [srcFirstname],
           [src].[Surname] AS [srcSurname],
           @filename + @filenumber AS [srcFilename],
           //target data
           [tgt].[personid] AS [tgtPersonid],
           [tgt].[title] AS [tgtTitle],
           [tgt].[email] AS [tgtEmail],
           [tgt].[firstname] AS [tgtFirstname],
           [tgt].[surname] AS [tgtSurname],
           [tgt].[filename] AS [tgtFilename]
    FROM @data AS [src]
         FULL OUTER JOIN
             [dw].[person] AS [tgt]
         ON [src].[Id] == [tgt].[personid]
    ) AS cpr;
 
//optionally- output to file
//so that we can save and review each iteration output
OUTPUT @merge
TO @outpath
USING Outputters.Csv();
 
//truncate table - we can't do row level operations
//it's all file based processing so we have to reload the whole table (partition)
TRUNCATE TABLE [dw].[person];
 
//insert the merged data
INSERT INTO [dw].[person]
(
    [personid],
    [title],
    [email],
    [firstname],
    [surname],
    [deleteddate],
    [filename]
)
SELECT [personid],
       [title],
       [email],
       [firstname],
       [surname],
       [deleteddate],
       [filename]
FROM @merge;

Summary

 

I’ve presented 2 alternative ways to merge data using native USQL constructs. In the second part of this posting (not done yet) I’ll compare the compiled job graphs and any performance differences.

The Basics – Azure Stream Analytics : Use GetArrayElements to Flatten Json

In this blog I’m detailing how to flatten complex JSON in Azure Stream Analytics. It’s not massively hard but it took a bit of fiddling since the online docs stop a little short. This kind of stuff is mostly a reference for myself for when I need to do it again sometime later and will probably have forgotten.

There is this blog which gives some insight but stops a little short on dealing with arrays. Basically I’m just going to cover how GetArrayElements works in a bit more detail.

Consider the following json event payload we might receive through a stream… No guesses for how I waste most of my time… and… it’s close but that’s not my real gamer tag 🙂

{  
   "id":"123e4567-e89b-12d3-a456-426655440000",
   "device":"34543FERRE55",
   "player":{  
      "firstname":"Shaun",
      "surname":"Ryan",
      "gamertag":"worlord"
   },
   "gamestats":[  
      {  
         "statid":1,
         "gametime":"2017-02-17T11:59:12.000Z",
         "event":"Kill",
         "item":"BAR M1918",
         "gamertag":"millipead"
      },
      {  
         "statid":2,
         "gametime":"2017-02-17T11:59:15.000Z",
         "event":"Suppression",
         "item":"BAR M1918",
         "gamertag":"duckwarrior"
      },
      {  
         "statid":3,
         "gametime":"2017-02-17T11:59:09.000Z",
         "event":"Kill Assist",
         "item":"Rifle Grenade",
         "gamertag":"captaincardboard"
      },
      {  
         "statid":4,
         "gametime":"2017-02-17T11:59:34.000Z",
         "event":"Heal",
         "item":"medipack",
         "gamertag":"rhubarb"
      }
   ],
   "EventProcessedUtcTime":"2017-02-17T12:00:00.384Z",
   "PartitionId":0,
   "EventEnqueuedUtcTime":"2017-02-17T12:00:00.146Z"
}

Let us imagine our game sends a payload every minute that contains all the game stats that occurred in that minute. This is completely fabricated – I’m sure that’s not how such a thing would be done – but it’s a fun example and provides an array.

So our json above is relatively semi-structured, we have:

  • Attributes in the root
  • Child entities with attributes
  • A child entity with an array

Our end game is that we want to flatten this into a denormalized data set to insert into a SQL table, for example, or to aggregate our stats, which is a more likely use case for Stream Analytics.

We want a data set that looks like this:

[Image: the flattened result set]

If we were aggregating we might thin out a few columns not needed in the grouping, but just for an example it’s fine. Roughly, the shape is sketched below.
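
Since the screenshot isn’t reproduced here, the result is roughly one row per gamestats array element with the root and player attributes repeated (Device, Firstname, Surname, Gametime and the event hub columns are also carried on every row):

ID                                    Playergamertag  Statid  Event        Item           Gamertag
123e4567-e89b-12d3-a456-426655440000  worlord         1       Kill         BAR M1918      millipead
123e4567-e89b-12d3-a456-426655440000  worlord         2       Suppression  BAR M1918      duckwarrior
123e4567-e89b-12d3-a456-426655440000  worlord         3       Kill Assist  Rifle Grenade  captaincardboard
123e4567-e89b-12d3-a456-426655440000  worlord         4       Heal         medipack       rhubarb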

Solution

So here’s the solution straight off the bat. There’s some discussion that follows if you’re interested.

SELECT
 [Stat].[id] AS [ID],
 [Stat].[device] AS [Device],
 
 [Stat].[player].[firstname] AS [Firstname],
 [Stat].[player].[surname] AS [Surname],
 [Stat].[player].[gamertag] AS [Playergamertag],
 
 [Stat].[GameStat].[statid] AS [Statid],
 [Stat].[GameStat].[gametime] AS [Gametime],
 [Stat].[GameStat].[event] AS [Event],
 [Stat].[GameStat].[item] AS [Item],
 [Stat].[GameStat].[gamertag] AS [Gamertag],
 
 [Stat].[EventProcessedUtcTime],
 [Stat].[PartitionId],
 [Stat].[EventEnqueuedUtcTime]
FROM
(
 SELECT
 [EventAlias].*,
 [GameStatsAlias].[ArrayValue] AS [GameStat]
 FROM [InEventHub] AS [EventAlias]
 CROSS APPLY GetArrayElements([EventAlias].[gamestats]) AS [GameStatsAlias]
) AS Stat

 

Finding the Solution

The best way I found to fiddle around with this stuff is to write the output to a JSON file on blob storage, since you get to see the JSON, how close you’re getting, and go from there. There is a Stream Analytics add-in for Visual Studio now that offers local debugging, but I’ve had some issues with it – namely the add-in breaking my Data Lake tools add-in.

GetArrayElements

The GetArrayElements function pivots out the array items, but it has to be applied using a CROSS APPLY so that it runs across each item in the array. Any WHERE clause should follow the CROSS APPLY. I’ve used a sub-query since it allows me to:

  1. Have intuitively readable code
  2. Break out the columns from the ArrayValue into their own aliases above; it might be possible to do that in 1 hit but I like the subquery (see 1)

If you take the sub-query alone it creates the following JSON, so it’s not hard to see how to get to the final result by looking at the intermediate result.

Query:

SELECT * FROM
( 
 SELECT
 [EventAlias].*,
 [GameStatsAlias].[ArrayValue] AS [GameStat]
 FROM [InEventHub] AS [EventAlias]
 CROSS APPLY GetArrayElements([EventAlias].[gamestats]) AS [GameStatsAlias]
) AS Stat

Creates:

[  
   {  
      "stat":{  
         "id":"123e4567-e89b-12d3-a456-426655440000",
         "device":"34543FERRE55",
         "player":{  
            "firstname":"Shaun",
            "surname":"Ryan",
            "gamertag":"worlord"
         },
         "GameStat":{  
            "statid":1,
            "gametime":"2017-02-17T11:59:12.000Z",
            "event":"Kill",
            "item":"BAR M1918",
            "gamertag":"millipead"
         },
         "EventProcessedUtcTime":"2017-02-17T12:00:00.384Z",
         "PartitionId":0,
         "EventEnqueuedUtcTime":"2017-02-17T12:00:00.146Z"
      }
   },
   {  
      "stat":{  
         "id":"123e4567-e89b-12d3-a456-426655440000",
         "device":"34543FERRE55",
         "player":{  
            "firstname":"Shaun",
            "surname":"Ryan",
            "gamertag":"worlord"
         },
         "GameStat":{  
            "statid":2,
            "gametime":"2017-02-17T11:59:15.000Z",
            "event":"Suppression",
            "item":"BAR M1918",
            "gamertag":"duckwarrior"
         },
         "EventProcessedUtcTime":"2017-02-17T12:00:00.384Z",
         "PartitionId":0,
         "EventEnqueuedUtcTime":"2017-02-17T12:00:00.146Z"
      }
   },
   {  
      "stat":{  
         "id":"123e4567-e89b-12d3-a456-426655440000",
         "device":"34543FERRE55",
         "player":{  
            "firstname":"Shaun",
            "surname":"Ryan",
            "gamertag":"worlord"
         },
         "GameStat":{  
            "statid":3,
            "gametime":"2017-02-17T11:59:09.000Z",
            "event":"Kill Assist",
            "item":"Rifle Grenade",
            "gamertag":"captaincardboard"
         },
         "EventProcessedUtcTime":"2017-02-17T12:00:00.384Z",
         "PartitionId":0,
         "EventEnqueuedUtcTime":"2017-02-17T12:00:00.146Z"
      }
   },
   {  
      "stat":{  
         "id":"123e4567-e89b-12d3-a456-426655440000",
         "device":"34543FERRE55",
         "player":{  
            "firstname":"Shaun",
            "surname":"Ryan",
            "gamertag":"worlord"
         },
         "GameStat":{  
            "statid":4,
            "gametime":"2017-02-17T11:59:34.000Z",
            "event":"Heal",
            "item":"medipack",
            "gamertag":"rhubarb"
         },
         "EventProcessedUtcTime":"2017-02-17T12:00:00.384Z",
         "PartitionId":0,
         "EventEnqueuedUtcTime":"2017-02-17T12:00:00.146Z"
      }
   }
]

This has blown out the array across the higher-level JSON data; from here we can just pick out what we need in our rows using the outer query. When you dump this out to a file it won’t create an array as a single JSON doc – it basically writes a JSON document per line, so 4 JSON documents in this case, separated by a line feed.

Incidentally, if you are using the Stream Analytics Visual Studio add-in with local debug and a sample JSON file, you have to enclose your JSON document events in an array or it errors.

Hope this makes sense / helps. No doubt I’ll have forgotten how to do this in a couple of days.