Finer Points – USQL: Extraction Paths 2, reading too much!

I blogged recently on loading files using parameterised wildcards. Michael Rys kindly posted some feedback in my comments regarding parameters, which was helpful and uncovered some important behaviour to consider. When working with big data we want the optimiser to be as smart as possible in deciding which files to read, since we pay for every read (literally, not just in performance!); at the same time I want flexible and elegant code implementation frameworks.

So it turns out we can use date parts and filter in the same job. We can put date parts into the extraction path and follow it with a WHERE clause in a separate statement. When this script compiles, ideally the execution will only read the extraction files covered by the WHERE clause in the U-SQL statement that follows the extract. In fact, when using a predicate against a native DateTime variable, that's exactly what it does. Essentially it doesn't execute sequentially; we should expect the compiler to choose the best parallel query plan and only read the files it needs to.

So, to provide a generic procedure that can load everything, a year, a month or a day, it might seem reasonable to do something like the following using C# types.

Note: below I've configured the setup to load February; I have 1 file in Feb and 7 files in Jan:

 /* PARAMETERS BEGIN */

/* load everything */
//DECLARE @pyear int = -1;
//DECLARE @pmonth int = -1;
//DECLARE @pday int = -1;

/* load year */
//DECLARE @pyear int = 2016;
//DECLARE @pmonth int = -1;
//DECLARE @pday int = -1;

/* load month */
DECLARE @pyear int = 2016;
DECLARE @pmonth int = 2;
DECLARE @pday int = -1;

/* load day */
//DECLARE @pyear int = 2016;
//DECLARE @pmonth int = 1;
//DECLARE @pday int = 1;

/* PARAMETERS END */

//file format
DECLARE @tabcode int = 44; // 44 is the character code for ','
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

//wildcard date
DECLARE @dateWildCard int = -1;


//system attributes
DECLARE @systemCode string = "Patterns";

// Path
DECLARE @path string = "wasb://raw@sdbgkhsdbgkhds.blob.core.windows.net/" + @systemCode + "/{date:yyyy}/{date:MM}/{date:dd}/{filename:*}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 date DateTime,
 filename string
 FROM @path
 USING Extractors.Text(delimiter : @delimiter, escapeCharacter : @escape, quoting : false, skipFirstNRows : 1);


@data =
 SELECT d.*
 FROM @data AS d
 WHERE (date.Year == @pyear OR @pyear == @dateWildCard)
 AND (date.Month == @pmonth OR @pmonth == @dateWildCard)
 AND (date.Day == @pday OR @pday == @dateWildCard);

OUTPUT @data
TO "/Patterns/Test/Out.csv"
USING Outputters.Csv();


When the job executes it loads all the files for January and February and only then applies the filter, which is not what I expected. 8 streams read all the files but the output only contains February. We effectively read 7 files in Jan that we didn't need to.
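The wildcard semantics of that WHERE clause can be sketched in Python (not U-SQL; a minimal model of the filter's logic only): a date part passes when it equals its parameter, or the parameter is the -1 wildcard.

```python
WILDCARD = -1

def part_matches(value: int, param: int) -> bool:
    # a part passes when it equals the parameter or the parameter is wildcarded
    return value == param or param == WILDCARD

def row_matches(year: int, month: int, day: int,
                pyear: int, pmonth: int, pday: int) -> bool:
    return (part_matches(year, pyear)
            and part_matches(month, pmonth)
            and part_matches(day, pday))

# the load-month setup: year 2016, month 2, day wildcarded
assert row_matches(2016, 2, 14, 2016, 2, WILDCARD)
assert not row_matches(2016, 1, 7, 2016, 2, WILDCARD)
```

The sketch only describes what rows survive, not where the filter runs; the problem above is precisely that, expressed this way, the filter runs after every file has already been read.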


The output only contains February.


So in conclusion we've read far more data than we intended. The following code, however, avoids this problem. Note the differences: I've passed the date parts into separate columns named year, month and day. This means I can use a native DateTime predicate in the WHERE clause, so the analytics engine doesn't have to load all the data and run higher-level .NET execution to determine the value of year, month and day; well, that's my guess. I need to read more on the internals underneath, if published anywhere. Effectively there is a read price to pay for running custom .NET in this scenario.

 
 /* PARAMETERS BEGIN */

/* load everything */
//DECLARE @pyear int = -1;
//DECLARE @pmonth int = -1;
//DECLARE @pday int = -1;

/* load year */
//DECLARE @pyear int = 2016;
//DECLARE @pmonth int = -1;
//DECLARE @pday int = -1;

/* load month */
DECLARE @pyear int = 2016;
DECLARE @pmonth int = 2;
DECLARE @pday int = -1;

/* load day*/
//DECLARE @pyear int = 2016;
//DECLARE @pmonth int = 1;
//DECLARE @pday int = 1;

/* PARAMETERS END */

//file format
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

//wildcard date
DECLARE @wildcard int = -1;
//datetime part filters from parameters
DECLARE @dteYear DateTime = new DateTime((@pyear == -1 ? 1 : @pyear), 1, 1);
DECLARE @dteMonth DateTime = new DateTime(1, (@pmonth == -1 ? 1 : @pmonth), 1);
DECLARE @dteDay DateTime = new DateTime(1, 1, (@pday == -1 ? 1 : @pday));

//system attributes
DECLARE @systemCode string = "Patterns";

// Path
DECLARE @path string = "wasb://raw@sfgsfdgsdfg.blob.core.windows.net/" + @systemCode + "/{year:yyyy}/{month:MM}/{day:dd}/{filename:*}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 year DateTime,
 month DateTime,
 day DateTime,
 filename string
 FROM @path
 USING Extractors.Text(delimiter : @delimiter, escapeCharacter : @escape, quoting : false, skipFirstNRows : 1);

@data =
 SELECT d.*
 FROM @data AS d
 WHERE (year == @dteYear OR @pyear == @wildcard)
 AND (month == @dteMonth OR @pmonth == @wildcard)
 AND (day == @dteDay OR @pday == @wildcard);


OUTPUT @data
TO "/Patterns/Test/Out.csv"
USING Outputters.Csv();


This time we can see that the job only reads my single February file, which is exactly what I want: the efficiency of selective reads, plus the flexibility to load any part of the temporal hierarchy of my files in blob storage.


In the code the filter variables day, month and year are still typed as DateTime; however, consider how 2017-02-18 gets split out into these variables:

  • Day 18 -> defaults month to 01 and year to 0001 -> 0001-01-18T00:00:00
  • Month 2 -> defaults day to 01 and year to 0001 -> 0001-02-01T00:00:00
  • Year 2017 -> defaults day and month to 01 -> 2017-01-01T00:00:00
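The same defaulting can be sketched with Python's datetime, standing in for the C# DateTime constructor used in the declarations:

```python
from datetime import datetime

# each part variable defaults the unused parts to their minimum values,
# mirroring new DateTime(year, month, day) with 1s in the unused slots
day_part = datetime(1, 1, 18)       # day 18   -> 0001-01-18T00:00:00
month_part = datetime(1, 2, 1)      # month 2  -> 0001-02-01T00:00:00
year_part = datetime(2017, 1, 1)    # year 2017 -> 2017-01-01T00:00:00

assert day_part.isoformat() == "0001-01-18T00:00:00"
assert month_part.isoformat() == "0001-02-01T00:00:00"
assert year_part.isoformat() == "2017-01-01T00:00:00"
```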

Note I've left the date part DateTimes in the output so you can see how they come through.


The result with this pattern is that we read only what we needed to produce the output. Perhaps the moral of the story: avoid custom .NET in the WHERE clause when filtering down the files that you want to read. Part of that solution is making sure your file path and name carry the attributes needed to perform specific predicate selections.


Finer Points – USQL: Extraction Paths

This blog is about the finer points of ingesting temporally stamped files from directory structures.

Consider the following directory structure. This kind of structure is typically created by Data Factory when ingesting data into blob storage in Azure. The temporal schedule of the extract is reflected in the directory structure in the form yyyy -> mm -> dd. In this example we stop at day, but it is possible to go down to minute.

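The yyyy/mm/dd layout can be sketched as simple path construction; `leaf_path` here is a hypothetical helper for illustration, not part of Data Factory or U-SQL:

```python
from datetime import date
from pathlib import PurePosixPath

def leaf_path(root: str, d: date) -> PurePosixPath:
    # zero-padded year/month/day directories, as Data Factory lays them out
    return PurePosixPath(root) / f"{d:%Y}" / f"{d:%m}" / f"{d:%d}"

assert str(leaf_path("/Patterns", date(2016, 1, 1))) == "/Patterns/2016/01/01"
```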

Each leaf directory contains the corresponding data for that day. This data reflects a daily change data capture (CDC) extraction from some source. This means that the characteristics of the data are as follows:

  • a file in a leaf directory only contains data that has changed since the last extract
  • this approach stores historical changes across time therefore data at month or year level may contain duplicate members but some part of the data will be different
  • data at leaf level will not contain duplicates, duplicates can only occur across different days
  • the current position of the data will be distributed throughout all the leaf directories since the directory that holds the latest version for a given member will depend on when it was last changed

Each leaf directory contains the following files and data about actors. Note the file names also contain the temporal stamp in the format yyyy-MM-dd; this is deliberate.


With this in mind it would be possible to create a procedure that loads a consolidated table that could either contain history or just the current version. This procedure could work in the following ways:

  1. Load all the data from the directories to provide a historical and current view of the information up to the current day – has to process all the data
  2. Incrementally load the current day merging the delta into an existing table of data – just has to process days not already processed
  3. Load a defined proportion of all the data up to a provided point in time to create a consolidated view at a specific point in time – has to process a variable subset of data

I have patterns to do all three, which I will blog about at some point. It's just worth noting for now that option 3 has the least value. It could be of some value in specific use cases around creating snapshots for comparison; normally a full reload, an incremental load of the current view, or a full historical consolidated view is adequate, since the full historical view can be used to view any point in time. Also, when thinking about this problem, consider that this is a file-based, distributed, NoSQL batch-processing system: row-level updates and inserts aren't available to you! Now think about how that affects the incremental load pattern.
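Since row-level updates aren't available, an incremental "merge" really means rewriting the consolidated output. A hypothetical Python sketch (illustrative names, not the actual pattern I'll blog) of keeping the current version per Id when folding a delta into existing data:

```python
def merge_current(existing, delta):
    # union the rows and keep only the latest version of each Id;
    # sorting by the extract date means later rows overwrite earlier ones
    latest = {}
    for row in sorted(existing + delta, key=lambda r: r["date"]):
        latest[row["Id"]] = row
    return list(latest.values())

existing = [{"Id": 1, "Surname": "Ford", "date": "2016-01-01"}]
delta = [{"Id": 1, "Surname": "Ford", "date": "2016-01-02"},
         {"Id": 2, "Surname": "Clooney", "date": "2016-01-02"}]
assert len(merge_current(existing, delta)) == 2
```

The entire result is then written back out as a new file set, replacing the previous consolidated output.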

I've used the local debug execution environment to build and test my patterns. For this to work with the structure above I've changed my local data lake directory path to be my source code directory. To do this in Visual Studio choose the menu [Data Lake] -> [Options and Settings…] and change the path.


So getting to the topic of discussion. I want to be able to:

  1. Optionally load all of my data from the entire directory structure
  2. Optionally load just 1 daily incremental leaf directory
  3. Pass the incremental snapshot date in the folder and/or file name into the data so I can determine which is the latest version of a row or where in the historical view that version should be inserted

Specific Directory

Review the following code.

//file format
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

// Path
DECLARE @path string = "/Patterns/2016/01/01/{filename:*}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 filename string
 FROM @path
 USING Extractors.Text(
 delimiter : @delimiter, 
 escapeCharacter : @escape, 
 quoting : false, 
 skipFirstNRows : 1
);

OUTPUT @data
TO "/Patterns/Test/ActorsOut.csv"
USING Outputters.Csv();

This is the simplest thing you can do. In this snippet we specifically named the year, month and day in the path, which loads only the files in the targeted directory. Obviously you would want to parameterise year, month and day so the procedure can be reused for any leaf period.

The following specifies a wildcard to load all the files at the directory path. The filename is then passed into the dataset in a field called filename:

{filename:*}

This gives us an output file that looks like the following. It meets our objective of loading just one incremental partition, although we've pushed the filename into the output data without extracting the date.

1,"Mr","hf@qwertymail.com","Harrison","Ford","Actors_2016-01-01"
8,"Ms","hf@qwertymail.com","Angelina","Jolie","Actors_2016-01-01"
9,"Ms","gc@qwertymail.com","Jessica","Jones","Actors_2016-01-01"
10,"Ms","ks@qwertymail.com","Nicole","Kidman","Actors_2016-01-01"

Wildcards

Review the following code. We can pass in wildcards to any part of the path using {*}. This meets our objective of loading all the data.

//file format
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

// Path
DECLARE @path string = "/Patterns/{*}/{*}/{*}/{filename:*}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 filename string
 FROM @path
 USING Extractors.Text(
    delimiter : @delimiter, 
    escapeCharacter : @escape, 
    quoting : false, 
    skipFirstNRows : 1);

OUTPUT @data
TO "/Patterns/Test/ActorsOut.csv"
USING Outputters.Csv();

The output data set includes all the data, and again we see that we've pushed the filename into the output data set but haven't extracted the date. Note the duplicates appearing because this is a complete historical concatenation. Also note how easy it is to concatenate the load of multiple files with a single EXTRACT statement; on Data Lake in the cloud this will use multiple nodes to load the data in parallel, resulting in randomly interlaced row ordering in the output.

1,"Mr","hf@qwertymail.com","Harrison","Ford","Actors_2016-01-01"
2,"Mr","gc@qwertymail.com","George","Clooney","Actors_2016-01-02"
4,"Mr","rr@qwertymail.com","Roger","Rabbit","Actors_2016-01-03"
3,"Mr","ks@qwertymail.com","Gene","Wilder","Actors_2016-01-04"
7,"Mr","bp@qwertymail.com","Brand","Pitt","Actors_2016-01-05"
6,"Mr","tc@qwertymail.com","Tom","Cruise","Actors_2016-01-06"
5,"Mr","jd@qwertymail.com","Johnny","Depp","Actors_2016-01-07"
8,"Ms","hf@qwertymail.com","Angelina","Jolie","Actors_2016-01-01"
9,"Ms","gc@qwertymail.com","Jessica","Rabbit","Actors_2016-01-04"
8,"Ms","hf@qwertymail.com","Angelina","Pitt","Actors_2016-01-06"
10,"Ms","ks@qwertymail.com","Nicole","Cruise","Actors_2016-01-07"
9,"Ms","gc@qwertymail.com","Jessica","Jones","Actors_2016-01-01"
10,"Ms","ks@qwertymail.com","Nicole","Kidman","Actors_2016-01-01"

Date Parts

Review the following code. We can pass in date part variables to any part of the path. We can then pull the date into our result set in the same way as the filename. The date gets passed through as a strongly typed DateTime. This meets our objective of loading all the data and passing in the date from the directory path.

//file format
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

// Path
DECLARE @path string = "/Patterns/{date:yyyy}/{date:MM}/{date:dd}/{filename:*}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 date DateTime,
 filename string
 FROM @path
 USING Extractors.Text(
    delimiter : @delimiter, 
    escapeCharacter : @escape, 
    quoting : false, 
    skipFirstNRows : 1);

OUTPUT @data
TO "/Patterns/Test/ActorsOut.csv"
USING Outputters.Csv();

The output data set includes all the data. Note that using the date parts has achieved the goal of passing the date into the result data set from the directory path. However we can’t use this method for an incremental load because date parts effectively wildcard the directory path and will always load all of the data.

1,"Mr","hf@qwertymail.com","Harrison","Ford",2016-01-01T00:00:00.0000000,"Actors_2016-01-01"
2,"Mr","gc@qwertymail.com","George","Clooney",2016-01-02T00:00:00.0000000,"Actors_2016-01-02"
4,"Mr","rr@qwertymail.com","Roger","Rabbit",2016-01-03T00:00:00.0000000,"Actors_2016-01-03"
3,"Mr","ks@qwertymail.com","Gene","Wilder",2016-01-04T00:00:00.0000000,"Actors_2016-01-04"
7,"Mr","bp@qwertymail.com","Brand","Pitt",2016-01-05T00:00:00.0000000,"Actors_2016-01-05"
6,"Mr","tc@qwertymail.com","Tom","Cruise",2016-01-06T00:00:00.0000000,"Actors_2016-01-06"
5,"Mr","jd@qwertymail.com","Johnny","Depp",2016-01-07T00:00:00.0000000,"Actors_2016-01-07"
8,"Ms","hf@qwertymail.com","Angelina","Jolie",2016-01-01T00:00:00.0000000,"Actors_2016-01-01"
9,"Ms","gc@qwertymail.com","Jessica","Rabbit",2016-01-04T00:00:00.0000000,"Actors_2016-01-04"
8,"Ms","hf@qwertymail.com","Angelina","Pitt",2016-01-06T00:00:00.0000000,"Actors_2016-01-06"
10,"Ms","ks@qwertymail.com","Nicole","Cruise",2016-01-07T00:00:00.0000000,"Actors_2016-01-07"
9,"Ms","gc@qwertymail.com","Jessica","Jones",2016-01-01T00:00:00.0000000,"Actors_2016-01-01"
10,"Ms","ks@qwertymail.com","Nicole","Kidman",2016-01-01T00:00:00.0000000,"Actors_2016-01-01"

Filename Regex

Finally we get to my preferred method, which is to use a regex to pull the date from the filename. I prefer this because I can pass wildcards or a specific period to do an incremental, partial or full load of files and still pull the date into the result set. It also breaks the dependency of date data lineage on the directory structure. Regardless of directory, I always want the date partition in the filename for all sorts of maintenance and administration purposes. In fact, having it broken out into a hierarchy of directories can be a bit of a pain with the traditional directory and data browsing tools currently available.

//file format
DECLARE @code int = 44;
DECLARE @delimiter char = (char) @code;
DECLARE @escape char = Char.Parse("œ");
DECLARE @regExPattern string = @"([0-9]{4}\-[0-9]{2}\-[0-9]{2})";

// to test full load
DECLARE @year string = "{*}";
DECLARE @month string = "{*}";
DECLARE @day string = "{*}";

// to test incremental
/*
DECLARE @year string = "2016";
DECLARE @month string = "01";
DECLARE @day string = "01";
*/

// Path
DECLARE @path string = "/Patterns/" + @year + "/" + @month + "/" + @day + "/{filename:*}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 filename string
 FROM @path
 USING Extractors.Text(
    delimiter : @delimiter, 
    escapeCharacter : @escape, 
    quoting : false, 
    skipFirstNRows : 1
);

@data =
 SELECT
 d.Id,
 d.Title,
 d.Email,
 d.Firstname,
 d.Surname,
 DateTime.Parse(Regex.Match(d.filename, @regExPattern).Groups[1].Value) AS date
 FROM @data AS d;

OUTPUT @data
TO "/Patterns/Test/ActorsOut.csv"
USING Outputters.Csv();

The example below shows the output of a full load, with the date passed through from the filename using a C# regex. We can easily do an incremental load by adjusting the commented-out block initialising @year, @month and @day. These variables would be parameters in a productionised version of the procedure, allowing the same code to be used for either a full or an incremental load.
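The filename-to-date step can be sketched in Python with the same pattern, standing in for the C# Regex.Match call in the script:

```python
import re
from datetime import datetime

PATTERN = re.compile(r"([0-9]{4}-[0-9]{2}-[0-9]{2})")

def date_from_filename(filename: str) -> datetime:
    # pull the yyyy-MM-dd stamp out of a name like "Actors_2016-01-01"
    match = PATTERN.search(filename)
    return datetime.strptime(match.group(1), "%Y-%m-%d")

assert date_from_filename("Actors_2016-01-01") == datetime(2016, 1, 1)
```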

1,"Mr","hf@qwertymail.com","Harrison","Ford",2016-01-01T00:00:00.0000000
2,"Mr","gc@qwertymail.com","George","Clooney",2016-01-02T00:00:00.0000000
4,"Mr","rr@qwertymail.com","Roger","Rabbit",2016-01-03T00:00:00.0000000
3,"Mr","ks@qwertymail.com","Gene","Wilder",2016-01-04T00:00:00.0000000
7,"Mr","bp@qwertymail.com","Brand","Pitt",2016-01-05T00:00:00.0000000
6,"Mr","tc@qwertymail.com","Tom","Cruise",2016-01-06T00:00:00.0000000
5,"Mr","jd@qwertymail.com","Johnny","Depp",2016-01-07T00:00:00.0000000
8,"Ms","hf@qwertymail.com","Angelina","Jolie",2016-01-01T00:00:00.0000000
9,"Ms","gc@qwertymail.com","Jessica","Rabbit",2016-01-04T00:00:00.0000000
8,"Ms","hf@qwertymail.com","Angelina","Pitt",2016-01-06T00:00:00.0000000
10,"Ms","ks@qwertymail.com","Nicole","Cruise",2016-01-07T00:00:00.0000000
9,"Ms","gc@qwertymail.com","Jessica","Jones",2016-01-01T00:00:00.0000000
10,"Ms","ks@qwertymail.com","Nicole","Kidman",2016-01-01T00:00:00.0000000