Automating DW Delivery – T4 Templates & BIML

Awareness of BIML has improved somewhat in the last few years, but still very little is known about T4 Templates, which are the topic of this post.

Just a bit of fundamentals. ETL, ELT and data provisioning platforms are predominantly schema typed, and this isn’t going away whilst we’re on silicon chips. The primary reason for this is performance! Even in MPP architectures you’re going to hit a strongly typed schema at some point; you can’t escape the entropy of information.

The Good

 

SSIS has to size a memory pipeline for its given data feed. It does this so it can shovel the data through fast, distributed or not; and just to mention, the next version of SSIS can be distributed across multiple servers (perhaps we’ll see it in Azure one day). SSIS provides a hugely flexible and re-usable GUI to build pretty much any workflow and data feed your abilities will allow; so what’s the rub?

The Bad

 

Each SSIS executable you create has a specific use and definition of metadata, with re-usable but very low-level features that are click-tastic enough to give you RSI after one large project. It’s labour intensive in the extreme and consequently error prone.

Designers cannot separate the metadata definitions from re-usable implementation patterns

 

This is a real pain… Why can’t I just build a template SSIS dimension-load package, fire in my metadata definitions for all my dimensions and implement them all with a single click of a mouse, getting all the re-use of fully tested code that any sane software engineer would expect? Well you can, but not with Visual Studio alone; enter BIML.

The BIML!

 

In short, BIML (Business Intelligence Markup Language) turns SSIS back into a respectable coding experience where I can define my template SSIS patterns using readable XML, hook up a central metadata repository using C# and generate all my DW feeds. Now we’re talking… proper code re-use delivering specific and fast SSIS executables.

There’s a ton of info on BIML already; if this is new to you, stop here and go learn BIML! I have a couple of starter vids and there are loads of other tech writers on the topic.

 

T4 Templates

 

So what’s this about T4 Templates then?

Well, BIML is only free to a point (BIML Express). It does have a fully blown IDE (Mist) if you have the bucks. Also, a lot of good SSIS patterns for some of the more upstream transformations require relatively complicated stored procedures that can’t be created using BIML templates. You can of course write your own metadata SQL wrapper using C#, which is a fair amount of leg work to say the least – it’s a long road and I have been down it.

Another option is dynamic SQL – which in my humble opinion are two words that should never go together! Why? Because SQL is terrible for text parsing / code generation, and for all the other reasons that are just too upsetting to list.

Or… you can use T4 Templates!

T4 Templates have been in Visual Studio since 2005. They’re a similar concept to BIML, except the output can be any text – not just BIML. I can template out any other language, using C# to dynamically push in metadata in very much the same way that BIML works. I’m not limited to a specific output language either: it could be used to create SQL procedures, Azure Data Factory JSON pipelines, Azure Data Lake jobs or C#.

It was included in Visual Studio specifically for templated code generation at design time or run time. It’s used by Visual Studio itself to generate code from UI designers – Entity Framework, for example. T4 is an abbreviation of:

  • Text – dynamically creates text output using template files
  • Template – a combination of text blocks & control logic written in C# or VB.net
  • Transformation – transforms the template into executable code and executes it to produce the final output
  • Toolkit – a set of assemblies packaged into Visual Studio
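
To make that concrete, here’s a minimal T4 sketch of my own (illustrative only, not from any published framework) that stamps out a stub SQL procedure for each table in a hard-coded list; in a real framework the list would come from a central metadata repository:

<#@ template language="C#" #>
<#@ output extension=".sql" #>
<#@ import namespace="System.Collections.Generic" #>
<#
    // hypothetical metadata – in a real framework this would be read
    // from a central repository (a SQL table, json file, etc.)
    var tables = new List<string> { "dw.DimCustomer", "dw.DimProduct" };

    foreach (var table in tables)
    {
#>
CREATE PROCEDURE <#= table #>_Load
AS
BEGIN
    -- generated load pattern for <#= table #> goes here
END
GO
<#
    }
#>

Saved as a .tt file in Visual Studio, the design-time engine transforms this into a .sql output file whenever the template is saved.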

There are 2 types of Templates that can be used:

  • Design Time – templates are executed at design time to generate code
  • Run Time – templates are compiled into classes in the application and executed when the application runs. They can receive parameters and sit within control logic.

Using T4 Instead of BIML

 

It’s a non-starter, forget it! Whilst you could script out SSIS packages using the native XML, it’s just not (that) human readable. BIML brings a mark-up language to the party which is more human readable by a country mile. You’d be mad to try to script out native SSIS XML whilst BIML Express is available for free.

Design or Run Time

 

On the topic of code generating code: should we be generating and executing code during design time or run time? I’ve seen various flavours of frameworks that do both or something in between.

I’m a firm believer that code generation should only occur during design time, or more specifically during a development process. The development process has all the bells and whistles required to help manage risk during a code generation exercise – e.g. accidentally loading live data into an unsecured area, or loading DEV data into live – not to mention all the other bugs and failures that could occur.

Do we really want dynamic code executing in live that has never run before? Also, debugging and issue resolution is an enormous pain in the neck if the code that ran isn’t available, easy to isolate and debug – dynamic run-time frameworks written in anger tend to be low on support features and over-complicated!

Also, the arguments for dynamic run time seem to be about circumventing bureaucratic change control that was put in place because of a poor development process. The solution to robust, agile BI development isn’t slipping in a cheeky back door to make clandestine changes; it is in fact continuous integration, which is a whole other detailed topic.

To use T4 templates with BIML frameworks we use the run-time execution type, but during the development process, since we can call the template classes from the C# that BIML executes during package creation. So in that respect they execute at the run time of BIML execution, not the run time of the warehouse load.
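
A rough sketch of that wiring (class and method names are illustrative, not from a published framework): if the template above were added to the project as a run-time (preprocessed) text template, Visual Studio compiles it into a partial class, so the C# that drives BIML package generation can parameterise it and write the output wherever it’s needed:

using System.Collections.Generic;
using System.IO;

// run-time text templates are generated as partial classes, so they can be
// extended with strongly typed parameters (replacing the hard-coded list
// inside the template itself)
public partial class StagingProcs
{
    public IEnumerable<string> Tables { get; set; }
}

public static class DwGenerator
{
    // called from the C# that BIML executes while generating packages
    public static void EmitSupportingProcs(IEnumerable<string> tables, string outputFolder)
    {
        var template = new StagingProcs { Tables = tables };

        // TransformText() runs the template's control logic and
        // returns the rendered SQL as a string
        File.WriteAllText(
            Path.Combine(outputFolder, "StagingProcs.sql"),
            template.TransformText());
    }
}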

Automation

 

I’ve yet to come across a team fully automating the delivery of a data warehouse using metadata. The tools are available (and free) so I’m not sure what the barriers are.

I’ll start dumping my BIML and T4 Template framework assets into Git when I get a chance and hopefully show T4 working in unison with BIML to do more than just load a basic staging database layer.

Quick Tip – SSAS Degenerate Attributes

Not a huge amount of technical detail in this one… just a simple tip for SSAS, because I’ve seen it more times now than I can remember and ultimately it boils down to a fundamental misunderstanding of technical architecture and what role SSAS plays in that architecture. A common response is that SSAS is “doing something weird”, or “SSAS is rubbish and not fit for purpose”, leading to awkward, overly complex workarounds. SSAS is awesome if you use it for what it’s meant for.

Don’t put degenerate attributes in an SSAS cube

 

A degenerate attribute has a unique attribute value for every row in a fact table. A cube is for browsing aggregates and drilling down into a narrow set of detail.

  • How is a degenerate attribute useful for a person to look at holistically, or to analytically slice and dice measures? It isn’t. What will most likely happen is that a user will drag it onto an ad-hoc report and then get very frustrated waiting for it to return.
  • Secondly, the cube will take a very, very long time to process, particularly with column storage and especially if it’s a string data type. Do you really want to process millions of unique values into a distinct column store? Is that going to achieve anything?

If you really must put degenerate information into a BI model:

  • Bucket them up into bandings so they can be aggregated
  • Don’t include the leaf level detail as an attribute or column… Use drill through or a report action to navigate to low level transactions on a narrow selection of data… Cubes are for aggregates and drilling down into a narrow set of detail.
  • If they’re needed for a relationship something has gone wrong in your Kimball model. Rethink the model so it provides detail for drill through but binds on something that can be better aggregated and grouped.

 

Quick Tip – Power BI Dev Tools Install Error

Installing the Power BI dev tools looks pretty straightforward. However, as most of us are probably behind a proxy server, you might come across this annoying failure.

npm install -g powerbi-visuals-tools
npm ERR! Cannot read property 'path' of null

You have to look in the log file for an error message that actually shows the HTTP 407 error.

Basically, configure npm to use the HTTP registry address instead of HTTPS and pass the proxy address and Windows login in the call.

npm config set strict-ssl false
npm config set registry 'http://registry.npmjs.org'
npm --proxy http://username:password@myproxyaddress install -g powerbi-visuals-tools

 

 

 

Finer Points – USQL: Merging Datasets Part 1

Like a lot of Big Data platforms, Data Lake Analytics is a file-based data platform. It’s a totally different approach to RDBMS data warehouse batch processing. The architecture brute-forces the processing of complete file partitions across a distributed platform. It just accepts that it will crunch a lot of data but achieves performance by doing job lots on loads of computers. I’ll write more on technical architecture considerations at a later date.

This means in a structured data world we lose row level resolution of inserts, updates and deletes. So how do we do common data integrations like merge?

Most good tech platforms have multiple ways of achieving the same thing. I’m going to compare two different ways. I didn’t set out to rubbish one approach over the other; it’s just an exercise in scientific comparison to build my understanding of how stuff works, build up my coding-patterns toolbox, and to share code and spark ideas – feedback on anything I’ve overlooked is most welcome.

I’ll split this into two posts since there’s a fair bit to get through. Part 1 will present the different approaches and part 2 will put them both under load and compare the job graphs and performance differences (if any).

So the approaches are:

  1. Multiple steps using a basic SQL-like pattern – blogged about by Paul from Purplefrog
  2. Single hit using a Full Outer Join and conditional C# – from my own patterns

A note about combiners – I considered and played around with a custom combiner. However, the join clause seems to be limited to an inner join; you can work around this but in the end it just feels very hacky and more hassle than it’s worth.

My aim is to understand the following facets:

  • Performance – how do the jobs compare when compiled and stress tested
  • Limitations – are there situations where one pattern works better than the other
  • Style – coding complexity, skill level and maintenance

The Demo Setup

 

Create the Target Managed Table

 

We need to create a managed table to merge the data into. We could of course do this with a file; however, these types of operation are going to be quite far up the information architecture stack – merge sets are structured, clean and conformed – so using a managed table seems appropriate.

You could entertain the idea of holding a complete Kimball star schema in Data Lake managed tables instead of using an RDBMS. There are consequences to this of course, which is another topic for another day.

CREATE DATABASE IF NOT EXISTS Patterns;
 
USE Patterns;
 
CREATE SCHEMA IF NOT EXISTS dw;
 
DROP TABLE IF EXISTS dw.person;
 
CREATE TABLE IF NOT EXISTS dw.person
(
    personid int,
    title string,
    email string,
    firstname string,
    surname string,
    deleteddate DateTime?,
    filename string,
    INDEX clx_personKey CLUSTERED(personid ASC) DISTRIBUTED BY HASH(personid)
);

The table structure shows that we’re using details about persons as an example, simply because it offers intuitive examples of data changes.

Source Data

 

I’m using three files as an example to incrementally load. For any solution we need to understand how we’re receiving data. To keep things simple we’re assuming a full extract on each increment, with hard deletes, which will require a full compare, i.e. there is no Change Data Capture (CDC) lower down the information architecture. The files are as follows:

person1.csv – 2 inserts

ID,Title,Email,Firstname,Surname
1,Mr,sr@qwertymail.com,Shaun,Ryan
2,Mr,dj@qwertymail.com,David,Jones

person2.csv – 1 insert, 1 update

ID,Title,Email,Firstname,Surname
1,Mr,sr@qwertymail.com,Shaun,Ryan
2,Dr,dj@qwertymail.com,David,Jones
3,Ms,sr@qwertymail.com,Stacy,Rogers

person3.csv – 2 inserts, 1 update, 1 physical delete (ID=1)

ID,Title,Email,Firstname,Surname
2,Dr,dj@qwertymail.com,David,Jones
3,Mrs,sr@qwertymail.com,Stacy,Stowman
4,Mrs,jm@qwertymail.com,Janet,Moores
5,Mr,kf@qwertymail.com,Kevin,Fullers

To do a deterministic merge of data we of course need a true unique business key which in this case is ID.

Executing the following Scripts

 

To load the files, dump them all on data lake or local storage at “/Person/” and run the scripts, incrementing @filenumber and inspecting the output in the dw.person table. I’m also writing the merged data out to “/Person/output/” if you prefer to look at files rather than tables in Data Lake, and for debugging; for me the table viewer in Visual Studio is fine for a small amount of data.

I’m not going to review the data output of every approach, other than to say the data merges as you would expect. Below is the final output of all three incremental loads. All the code and files are up on GitHub, so to see it working give it a whirl.

[Image: final output of the three incremental loads in dw.person]

1  – Multiple USQL Merge Steps

 

This is Paul’s approach, tweaked for my metadata. Basically we’re going to hit it in three steps using an inner, left and right join, and union all the results together.

USE Patterns;
 
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

//  Path
DECLARE @filenumber string = "1";
DECLARE @filename string = "person";
DECLARE @inpath string = "/Person/" + @filename + @filenumber + ".csv";
DECLARE @outpath string = "/Person/output/" + @filename + @filenumber + ".csv";
 
// read data
@data =
    EXTRACT Id int,
            Title string,
            Email string,
            Firstname string,
            Surname string
    FROM @inpath
    USING Extractors.Text(delimiter : @delimiter, escapeCharacter : @escape, quoting : false, skipFirstNRows : 1);

@merge =
    //update current
    SELECT [src].[Id] AS [personid],
           [src].[Title] AS [title],
           [src].[Email] AS [email],
           [src].[Firstname] AS [firstname],
           [src].[Surname] AS [surname],
           (DateTime?) null AS [deleteddate],
           @filename + @filenumber AS [filename]
    FROM @data AS [src]
         INNER JOIN
             [dw].[person] AS [tgt]
         ON [src].[Id] == [tgt].[personid]
 
    UNION ALL
 
    //insert new
    SELECT [src].[Id] AS [personid],
           [src].[Title] AS [title],
           [src].[Email] AS [email],
           [src].[Firstname] AS [firstname],
           [src].[Surname] AS [surname],
           (DateTime?) null AS [deleteddate],
           @filename + @filenumber AS [filename]
    FROM @data AS [src]
         LEFT OUTER JOIN
             [dw].[person] AS [tgt]
         ON [src].[Id] == [tgt].[personid]
    WHERE [tgt].[personid] IS NULL
 
    UNION ALL
 
    //keep existing and logically delete
    SELECT [tgt].[personid],
           [tgt].[title],
           [tgt].[email],
           [tgt].[firstname],
           [tgt].[surname],
           (DateTime?) DateTime.Now AS [deleteddate],
           [tgt].[filename]
    FROM @data AS [src]
         RIGHT OUTER JOIN
             [dw].[person] AS [tgt]
         ON [src].[Id] == [tgt].[personid]
    WHERE [src].[Id] IS NULL;
 
//optionally - output to file
//so we can save a review each iteration output
OUTPUT @merge
TO @outpath
USING Outputters.Csv();
 
//truncate table - we can't do row level operations
//it's all file based processing so we have to reload the whole table (partition)
TRUNCATE TABLE [dw].[person];
 
//insert the merged data
INSERT INTO [dw].[person]
(
    [personid],
    [title],
    [email],
    [firstname],
    [surname],
    [deleteddate],
    [filename]
)
SELECT [personid],
       [title],
       [email],
       [firstname],
       [surname],
       [deleteddate],
       [filename]
FROM @merge;

2  – Single USQL Merge Step

 

This is an approach I put together after seeing the first approach, wondering how much more complicated it would be, and what the execution differences would be, to do it in one hit using a full outer join with conditional selects.

USE Patterns;
 
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";
 
//  Path - increment @filenumber to load files 2 and 3 on successive runs
DECLARE @filenumber string = "1";
DECLARE @filename string = "person";
DECLARE @inpath string = "/Person/" + @filename + @filenumber + ".csv";
DECLARE @outpath string = "/Person/output/" + @filename + @filenumber + ".csv";
 
// read data
@data =
    EXTRACT Id int,
            Title string,
            Email string,
            Firstname string,
            Surname string
    FROM @inpath
    USING Extractors.Text(delimiter : @delimiter, escapeCharacter : @escape, quoting : false, skipFirstNRows : 1);
 
@merge =
        SELECT
            //select the source data if insert or update
            //select the target data if deleted 
            //we have to cast the datatypes since the if else construct returns nullable types

            (int)(issource ? [srcPersonid] : [tgtPersonid] ) AS personid,
            (string)(issource ? [srcTitle] : [tgtTitle] ) AS title,
            (string)(issource ? [srcEmail] : [tgtEmail] ) AS email,
            (string)(issource ? [srcFirstname] : [tgtFirstname] ) AS firstname,
            (string)(issource ? [srcSurname] : [tgtSurname] ) AS surname,
            (issource ? null : (DateTime?)DateTime.Now ) AS deleteddate,
            (string)(issource ? [srcFilename] : [tgtFilename] ) AS filename
        FROM
        (
    SELECT (
               // create a boolean that can be re-used in the outer 
               // query to keep the code clean
               // if update
               ([src].[Id] == [tgt].[personid] & [src].[Id] != null)
               // or if insert
               || ([tgt].[personid] == null)
               //then write source data
               ? true
               //else keep the target data
               : false
           ) AS issource,
           //source data
           [src].[Id] AS [srcPersonid],
           [src].[Title] AS [srcTitle],
           [src].[Email] AS [srcEmail],
           [src].[Firstname] AS [srcFirstname],
           [src].[Surname] AS [srcSurname],
           @filename + @filenumber AS [srcFilename],
           //target data
           [tgt].[personid] AS [tgtPersonid],
           [tgt].[title] AS [tgtTitle],
           [tgt].[email] AS [tgtEmail],
           [tgt].[firstname] AS [tgtFirstname],
           [tgt].[surname] AS [tgtSurname],
           [tgt].[filename] AS [tgtFilename]
    FROM @data AS [src]
         FULL OUTER JOIN
             [dw].[person] AS [tgt]
         ON [src].[Id] == [tgt].[personid]
    ) AS cpr;
 
//optionally- output to file
//so that we can save and review each iteration output
OUTPUT @merge
TO @outpath
USING Outputters.Csv();
 
//truncate table - we can't do row level operations
//it's all file based processing so we have to reload the whole table (partition)
TRUNCATE TABLE [dw].[person];
 
//insert the merged data
INSERT INTO [dw].[person]
(
    [personid],
    [title],
    [email],
    [firstname],
    [surname],
    [deleteddate],
    [filename]
)
SELECT [personid],
       [title],
       [email],
       [firstname],
       [surname],
       [deleteddate],
       [filename]
FROM @merge;

Summary

 

I’ve presented two alternative ways to merge data using native USQL constructs. In the second part of this post (not done yet) I’ll compare the compiled job graphs and any performance differences.

Quick Tip – SQL Azure VMs

When creating VMs for IO-intensive work, e.g. SQL Server:

Max out the allowed number of disks, even if you’re not going to use the storage, and stripe the data files across the disks:

  • You only pay for the storage you use – even though the VHD is sized
  • You get the combined IO for free

[Image: Azure portal VM disk configuration]

Finer Points – USQL: Extraction Paths

This blog is about the finer points of ingesting temporally stamped files from directory structures.

Consider the following directory structure. This kind of structure is typically created by Data Factory when ingesting data into blob storage in Azure. The temporal schedule of extraction is reflected in the directory structure in the form yyyy->mm->dd. In this example we stop at day but it is possible to go down to minute:

[Image: directory structure partitioned by yyyy/mm/dd]

Each leaf directory contains the corresponding data for that day. This data reflects a daily change data capture (CDC) extraction from some source. This means that the characteristics of the data are as follows:

  • a file in a leaf directory only contains data that has changed since the last extract
  • this approach stores historical changes across time therefore data at month or year level may contain duplicate members but some part of the data will be different
  • data at leaf level will not contain duplicates, duplicates can only occur across different days
  • the current position of the data will be distributed throughout all the leaf directories since the directory that holds the latest version for a given member will depend on when it was last changed

Each leaf directory contains the following files and data about actors. Note that the file names also contain the temporal stamp in the format yyyy-MM-dd; this is deliberate:

[Image: daily leaf directory contents – Actors_yyyy-MM-dd.csv files]

With this in mind it would be possible to create a procedure that loads a consolidated table that could either contain history or just the current version. This procedure could work in the following ways:

  1. Load all the data from the directories to provide a historical and current view of the information up to the current day – has to process all the data
  2. Incrementally load the current day merging the delta into an existing table of data – just has to process days not already processed
  3. Load a defined proportion of all the data up-to a provided point in time to create a consolidated view at a specific point in time – has to process a variable subset of data

I have patterns to do all three that I will blog about at some point. It’s just worth noting for now that option 3 has the least value. It could be of some value in specific use cases around creating snapshots for comparison; normally a full reload, an incremental of the current view, or a full historical consolidated view of the data is adequate, since the full historical view can be used to view any point in time. Also, when thinking about this problem, consider that this is a file-based, NoSQL, distributed batch processing system… row-level updates and inserts aren’t available to you! Now think about how that will affect the incremental load pattern.

I’ve used the local debug execution environment to build and test my patterns. For this to work with the structure above I’ve changed my local data lake directory path to be my source code directory. To do this in Visual Studio choose the menu [Data Lake] -> [Options and Settings…] and change the path:

[Image: Data Lake options dialog showing the local data root path]

So getting to the topic of discussion. I want to be able to:

  1. Optionally load all of my data from the entire directory structure
  2. Optionally load just 1 daily incremental leaf directory
  3. Pass the incremental snapshot date in the folder and/or file name into the data so I can determine which is the latest version of a row or where in the historical view that version should be inserted

Specific Directory

Review the following code.

//file format
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

// Path
DECLARE @path string = "/Patterns/2016/01/01/{filename:*}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 filename string
 FROM @path
 USING Extractors.Text(
 delimiter : @delimiter, 
 escapeCharacter : @escape, 
 quoting : false, 
 skipFirstNRows : 1
);

OUTPUT @data
TO "/Patterns/Test/ActorsOut.csv"
USING Outputters.Csv();

This is obviously the simplest thing you can do. In this snippet we specifically named the year, month and day in the path, which loads only the files in the targeted directory. Obviously you would want to parameterise year, month and day so the proc can be re-used for any leaf period.

The following specifies a wildcard to load all the files at the directory path. The filename is then passed into the dataset in a field called filename:

{filename:*}

This gives us an output file that looks like the following. It meets our objective of loading just one incremental partition. We’ve pushed the filename into the output data, but we haven’t extracted the date.

1,"Mr","hf@qwertymail.com","Harrison","Ford","Actors_2016-01-01"
8,"Ms","hf@qwertymail.com","Angelina","Jolie","Actors_2016-01-01"
9,"Ms","gc@qwertymail.com","Jessica","Jones","Actors_2016-01-01"
10,"Ms","ks@qwertymail.com","Nicole","Kidman","Actors_2016-01-01"

Wildcards

Review the following code. We can pass in wildcards to any part of the path using {*}. This meets our objective of loading all the data.

//file format
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

// Path
DECLARE @path string = "/Patterns/{*}/{*}/{*}/{filename:*}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 filename string
 FROM @path
 USING Extractors.Text(
    delimiter : @delimiter, 
    escapeCharacter : @escape, 
    quoting : false, 
    skipFirstNRows : 1);

OUTPUT @data
TO "/Patterns/Test/ActorsOut.csv"
USING Outputters.Csv();

The output data set includes all the data, and again we see that we’ve pushed the filename into the output data set but haven’t extracted the date. Note the duplicates appearing because this is a complete historical concatenation. Also note how easy it is to concatenate the load of multiple files with a single extract statement, which on Data Lake in the cloud will utilise multiple nodes to load the data in parallel, resulting in randomly interlaced ordering of the data in the output.

1,"Mr","hf@qwertymail.com","Harrison","Ford","Actors_2016-01-01"
2,"Mr","gc@qwertymail.com","George","Clooney","Actors_2016-01-02"
4,"Mr","rr@qwertymail.com","Roger","Rabbit","Actors_2016-01-03"
3,"Mr","ks@qwertymail.com","Gene","Wilder","Actors_2016-01-04"
7,"Mr","bp@qwertymail.com","Brand","Pitt","Actors_2016-01-05"
6,"Mr","tc@qwertymail.com","Tom","Cruise","Actors_2016-01-06"
5,"Mr","jd@qwertymail.com","Johnny","Depp","Actors_2016-01-07"
8,"Ms","hf@qwertymail.com","Angelina","Jolie","Actors_2016-01-01"
9,"Ms","gc@qwertymail.com","Jessica","Rabbit","Actors_2016-01-04"
8,"Ms","hf@qwertymail.com","Angelina","Pitt","Actors_2016-01-06"
10,"Ms","ks@qwertymail.com","Nicole","Cruise","Actors_2016-01-07"
9,"Ms","gc@qwertymail.com","Jessica","Jones","Actors_2016-01-01"
10,"Ms","ks@qwertymail.com","Nicole","Kidman","Actors_2016-01-01"

Date Parts

Review the following code. We can pass in date part variables to any part of the path. We can then pull the date into our result set in the same way as the filename. The date gets passed through as a strongly typed DateTime. This meets our objective of loading all the data and passing in the date from the directory path.

//file format
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

// Path
DECLARE @path string = "/Patterns/{date:yyyy}/{date:MM}/{date:dd}/{filename:*}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 date DateTime,
 filename string
 FROM @path
 USING Extractors.Text(
    delimiter : @delimiter, 
    escapeCharacter : @escape, 
    quoting : false, 
    skipFirstNRows : 1);

OUTPUT @data
TO "/Patterns/Test/ActorsOut.csv"
USING Outputters.Csv();

The output data set includes all the data. Note that using the date parts has achieved the goal of passing the date into the result data set from the directory path. However we can’t use this method for an incremental load because date parts effectively wildcard the directory path and will always load all of the data.

1,"Mr","hf@qwertymail.com","Harrison","Ford",2016-01-01T00:00:00.0000000,"Actors_2016-01-01"
2,"Mr","gc@qwertymail.com","George","Clooney",2016-01-02T00:00:00.0000000,"Actors_2016-01-02"
4,"Mr","rr@qwertymail.com","Roger","Rabbit",2016-01-03T00:00:00.0000000,"Actors_2016-01-03"
3,"Mr","ks@qwertymail.com","Gene","Wilder",2016-01-04T00:00:00.0000000,"Actors_2016-01-04"
7,"Mr","bp@qwertymail.com","Brand","Pitt",2016-01-05T00:00:00.0000000,"Actors_2016-01-05"
6,"Mr","tc@qwertymail.com","Tom","Cruise",2016-01-06T00:00:00.0000000,"Actors_2016-01-06"
5,"Mr","jd@qwertymail.com","Johnny","Depp",2016-01-07T00:00:00.0000000,"Actors_2016-01-07"
8,"Ms","hf@qwertymail.com","Angelina","Jolie",2016-01-01T00:00:00.0000000,"Actors_2016-01-01"
9,"Ms","gc@qwertymail.com","Jessica","Rabbit",2016-01-04T00:00:00.0000000,"Actors_2016-01-04"
8,"Ms","hf@qwertymail.com","Angelina","Pitt",2016-01-06T00:00:00.0000000,"Actors_2016-01-06"
10,"Ms","ks@qwertymail.com","Nicole","Cruise",2016-01-07T00:00:00.0000000,"Actors_2016-01-07"
9,"Ms","gc@qwertymail.com","Jessica","Jones",2016-01-01T00:00:00.0000000,"Actors_2016-01-01"
10,"Ms","ks@qwertymail.com","Nicole","Kidman",2016-01-01T00:00:00.0000000,"Actors_2016-01-01"

Filename Regex

Finally we get to my preferred method, which is to use a regex to pull the date from the filename. I prefer this because I can pass wildcards or a specific period to do an incremental, partial or full load of files and still pull the date into the result set. It also removes the dependency on the directory structure for date lineage. Regardless of directory I always want the date partition in the filename for all sorts of maintenance and administration purposes. In fact, having it broken out into a hierarchy of directory structures can be a bit of a pain with the traditional directory and data browsing tools currently available.

//file format
DECLARE @code int = 44;
DECLARE @delimiter char = (char) @code;
DECLARE @escape char = Char.Parse("œ");
DECLARE @regExPattern string = @"([0-9]{4}\-[0-9]{2}\-[0-9]{2})";

// to test full load
DECLARE @year string = "{*}";
DECLARE @month string = "{*}";
DECLARE @day string = "{*}";

// to test incremental
/*
DECLARE @year string = "2016";
DECLARE @month string = "01";
DECLARE @day string = "01";
*/

// Path
DECLARE @path string = "/Patterns/" + @year + "/" + @month + "/" + @day + "/{filename:*}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 filename string
 FROM @path
 USING Extractors.Text(
    delimiter : @delimiter, 
    escapeCharacter : @escape, 
    quoting : false, 
    skipFirstNRows : 1
);

@data =
 SELECT
 d.Id,
 d.Title,
 d.Email,
 d.Firstname,
 d.Surname,
 DateTime.Parse(Regex.Match(d.filename, @regExPattern).Groups[1].Value) AS date
 FROM @data AS d;

OUTPUT @data
TO "/Patterns/Test/ActorsOut.csv"
USING Outputters.Csv();

The example below shows the output of a full load with the date passed through from the filename using a C# regex. We can easily do an incremental load by adjusting the commented-out block of variable initialisation for @year, @month and @day. These variables would be parameters in a productionised version of the proc, allowing the same code to be used for either a full or incremental load.

1,"Mr","hf@qwertymail.com","Harrison","Ford",2016-01-01T00:00:00.0000000
2,"Mr","gc@qwertymail.com","George","Clooney",2016-01-02T00:00:00.0000000
4,"Mr","rr@qwertymail.com","Roger","Rabbit",2016-01-03T00:00:00.0000000
3,"Mr","ks@qwertymail.com","Gene","Wilder",2016-01-04T00:00:00.0000000
7,"Mr","bp@qwertymail.com","Brand","Pitt",2016-01-05T00:00:00.0000000
6,"Mr","tc@qwertymail.com","Tom","Cruise",2016-01-06T00:00:00.0000000
5,"Mr","jd@qwertymail.com","Johnny","Depp",2016-01-07T00:00:00.0000000
8,"Ms","hf@qwertymail.com","Angelina","Jolie",2016-01-01T00:00:00.0000000
9,"Ms","gc@qwertymail.com","Jessica","Rabbit",2016-01-04T00:00:00.0000000
8,"Ms","hf@qwertymail.com","Angelina","Pitt",2016-01-06T00:00:00.0000000
10,"Ms","ks@qwertymail.com","Nicole","Cruise",2016-01-07T00:00:00.0000000
9,"Ms","gc@qwertymail.com","Jessica","Jones",2016-01-01T00:00:00.0000000
10,"Ms","ks@qwertymail.com","Nicole","Kidman",2016-01-01T00:00:00.0000000

Finer Points – USQL: File Format

Azure Data Lake is a file-based data storage and analytics platform. Working with what appears to be SQL can lead a traditional SQL developer into a few traps. File format is one of them.

The following built-in extractors are available in U-SQL:

Extractors.Text()
Extractors.Csv()
Extractors.Tsv()

If you’re in the business of processing raw data files then the issue of choosing a file format should be pretty familiar territory. How do you know the file format isn’t a feature of the data itself? At that point the schema read comes crashing down like a ton of bricks. Ultimately you may have limited control over the protection of your chosen file format; how do you know that what you have chosen now won’t be compromised in the future?

The Text() extractor and outputter provide additional parameters to allow control over the file format delimiters. Below is a small experimental snippet showing the text extractor:

DECLARE @code int = 44;
DECLARE @delimiter char = (char) @code;
DECLARE @escape char = Char.Parse("œ");
DECLARE @path string = "/Patterns/2016/01/01/{filename:*}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 filename string
 FROM @path
 USING Extractors.Text(
       delimiter : @delimiter, 
       escapeCharacter : @escape, 
       quoting : false, 
       skipFirstNRows : 1,
       rowDelimiter: "\n"
);

OUTPUT @data
TO "/Patterns/Test/ActorsOut.csv"
USING Outputters.Csv();

Column Delimiter

The escape character gets us out of a hole with column delimiters. Something to know is that if you use the escapeCharacter parameter then it won’t be used to escape the quote character (quoting is true by default); that is done using another quote, e.g. “Some “”data”” with quotes”.

My go-to approach is just to always make use of the Text extractor, provide an escape character and avoid quotes for the following reasons:

  • Data Factory supports the addition of an escape character that will automatically get injected when the file is written out; this provides an easy way to get an escape character into the data
  • Data Factory may also put in escapes for quotes (I don’t know, I haven’t tried); it just seems like an overly verbose way of doing the same thing as escape characters, except I can’t choose my character and the one I’m given is very common
  • I can change the file markup in a central place without editing all my loading procedures
  • I’d rather have an escape character and not need it than waste tedious amounts of time sorting out these issues later.
  • The escapeCharacter escapes itself so you don’t have to worry too much about what you choose, though you’ll probably do yourself some favours by choosing something obscure

With large amounts of data such issues aren’t going to be found until you start churning through lots of data at which point you may have a lot of code to change and pressing deadlines because you simply chose the wrong delimiter.

You could argue that choosing something obscure for your column delimiter is just as effective. However, can you ever be 100% sure? Also, someone will have to look at this data to resolve issues. Choosing commonly supported delimiters ensures the data is easy to look at and troubleshoot.

Row Delimiter

With row delimiters we’re not so lucky. Unfortunately, because it’s a big data platform and the processing of data is split across computation nodes, it’s not possible to escape row delimiters. As a best-practice approach I now always delete row delimiters from my data in the source selection queries, since it’s a relatively cheap operation.

The only other approach is to clean in the data lake. This involves loading ragged files, looking for the ragged ends and cleansing out the delimiters for broken lines. Not impossible, but a bit more coding involved, albeit re-usable; after all, NoSQL platforms shine with semi/non-structured data. It probably depends on the type of source, since this kind of processing will be charged to your Azure subscription; if it’s cheap and easy at source outside Azure then why pay?
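
If the cleansing does have to happen in the lake, one option (a sketch of my own, along the lines of the shared assembly idea in the Centralising File Format section below) is a small C# helper compiled into an assembly and called inline in a SELECT to strip stray row delimiters from suspect columns:

// minimal sketch: a helper that could be compiled into a shared assembly,
// registered against the U-SQL database and called inline in a SELECT
namespace USQLHelper
{
    public static class Cleanse
    {
        // replaces embedded carriage returns / line feeds with spaces
        public static string StripRowDelimiters(string value)
        {
            if (string.IsNullOrEmpty(value)) return value;
            return value.Replace("\r", " ").Replace("\n", " ");
        }
    }
}

In a script that references the assembly this would be called as, for example, USQLHelper.Cleanse.StripRowDelimiters(Surname) in the SELECT list (names here are illustrative).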

Interestingly, the row delimiter can only be one character. \r\n is one of three defaults, the others being \r and \n. So custom row delimiters are limited to a single character, and the following fails for being two characters:

 USING Extractors.Text(
       delimiter : @delimiter, 
       escapeCharacter : @escape, 
       quoting : false, 
       skipFirstNRows : 1,
       rowDelimiter: "\r\n"
);

Hierarchical Formats

There are of course hierarchical formats available, such as JSON, XML and Parquet, that are supported by Data Factory and other tools. NoSQL platforms after all really shine with semi-structured data. However, at the moment they’re not yet fully supported by Data Lake Analytics.

U-SQL is highly extensible and custom extractors are available for hierarchical file formats. If traditional ETL was accused of one thing it would be that it’s too sensitive to schema changes. Hierarchical file formats compound this issue even further because they create dependencies between data entities at an asset (file) level.

If the source data by its original nature comes in as a hierarchical format then fair enough. If it’s coming from an RDBMS then taking it to a hierarchical format only to get it back to a flat format is taking three steps back and one step forward.

Centralising File Format

The code snippet above shows pushing in file format parameters using variables. We obviously have to maintain these values in every script or procedure. You could push the file markup values in using parameters, and then of course centralisation depends on what you’re using for orchestration. One way to centralise this is to pull the literal file markup values from a shared assembly:

REFERENCE ASSEMBLY Patterns.USQLHelper;

DECLARE @delimiter char = USQLHelper.FileFormat.GetDelimiter();
DECLARE @escape char = USQLHelper.FileFormat.GetEscape();

Shared assemblies can be written in C# (or any .NET language) and registered at database level. This assembly could read a JSON or XML file stored in the data lake itself. Then it’s just a matter of tweaking your global file formats, organised by source, when you hit a format issue.
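
For completeness, here’s a sketch of the kind of helper the snippet above assumes (the literal values mirror the scripts earlier in this post; equally they could be read from a json or xml config file held in the lake):

// sketch of a shared C# assembly exposing central file format values
namespace USQLHelper
{
    public static class FileFormat
    {
        // column delimiter – char code 44 is a comma, as used in the scripts above
        public static char GetDelimiter()
        {
            return (char)44;
        }

        // obscure escape character, unlikely to clash with real data
        public static char GetEscape()
        {
            return 'œ';
        }
    }
}

Once the compiled assembly is registered against the database (as referenced by REFERENCE ASSEMBLY Patterns.USQLHelper above), every script picks up a file format change from a single place.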