Finer Points – USQL: Extraction Paths 2, reading too much!

I blogged recently on loading files using parameterised wildcards. Michael Rys kindly posted some feedback in my comments regarding parameters which was helpful and uncovered some important behavior to consider. Effectively when considering big data we want the optimizer to be as smart as possible when deciding what files to read since we pay for it (literally not just in performance!), at the same I want flexible and elegant code implementation frameworks.

So it turns out we can use dateparts and filter in the same job. We can post dateparts into extraction paths and follow it with a where a clause in separate statement. When this script compiles the execution ideally will only read extraction files that are covered by the where clause in the USQL statement that follows the extract. In fact when using a USQL predicate with a DateTime variable that’s exactly what it does. Essentially it doesn’t execute sequentially, we should expect the compiler to choose the best parallel query plan and only read the files it needs to.

So to provide a pattern that can load everything, year, month or date it might seem reasonable to do something like the following with C# types in order to provide a generic procedure.

Note: below I’ve configured the setup to load February, I have 1 file in Feb and 7 files in Jan:

 /* PARAMETERS BEGIN */

/* load everything */
//DECLARE @pyear int = -1;
//DECLARE @pmonth int = -1;
//DECLARE @pday int = -1;

/* load year */
//DECLARE @pyear int = 2016;
//DECLARE @pmonth int = -1;
//DECLARE @pday int = -1;

/* load month */
DECLARE @pyear int = 2016;
DECLARE @pmonth int = 2;
DECLARE @pday int = -1;

/* load day */
//DECLARE @pyear int = 2016;
//DECLARE @pmonth int = 1;
//DECLARE @pday int = 1;

/* PARAMETERS END */

//file format
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

//wildcard date
DECLARE @dateWildCard int = - 1;


//system attributes
DECLARE @systemCode string = "Patterns";

// Path
DECLARE @path string = "wasb://raw@sdbgkhsdbgkhds.blob.core.windows.net/" + @systemCode + "/{date:yyyy}/{date:MM}/{date:dd}/{filename}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 date DateTime,
 filename string
 FROM @path
 USING Extractors.Text(delimiter : @delimiter, escapeCharacter : @escape, quoting : false, skipFirstNRows : 1);


@data =
 SELECT d. *
 FROM @data AS d
 WHERE (date.Year == @pyear OR @pmonth == @dateWildCard)
 AND (date.Month == @pmonth OR @pmonth == @dateWildCard)
 AND (date.Day == @pday OR @pday == @dateWildCard);

OUTPUT @data
TO "/Patterns/Test/Out.csv"
USING Outputters.Csv();


When the job executes it loads all the files for January and February after which it does the filter which is not what I expected. 8 streams read all the files but the output only contains February. We effectively read 7 files in Jan that we didn’t need to.

4

Output only contains February:

5

So in conclusion we’ve read far more data than we intended. The following code however avoids this problem. Note the differences in the following code. I’ve effectively passed the day parts into different columns day, month and year. This means I can use a native datetime predicate in the where clause. The analytics engine doesn’t have to load all the data to determine the value of year, month and day using higher level .net execution; well that’s my guess. I need to read more on the internals underneath if published anywhere. Effectively there is a read price to pay running custom .net in this scenario.

 
 /* PARAMETERS BEGIN */

/* load everything */
//DECLARE @pyear int = -1;
//DECLARE @pmonth int = -1;
//DECLARE @pday int = -1;

/* load year */
//DECLARE @pyear int = 2016;
//DECLARE @pmonth int = -1;
//DECLARE @pday int = -1;

/* load month */
DECLARE @pyear int = 2016;
DECLARE @pmonth int = 2;
DECLARE @pday int = -1;

/* load day*/
//DECLARE @pyear int = 2016;
//DECLARE @pmonth int = 1;
//DECLARE @pday int = 1;

/* PARAMETERS END */

//file format
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

//wildcard date
DECLARE @wildcard int = -1;
//datetime part filters from parameters
DECLARE @dteYear = new DateTime((@pyear==-1?1:@pyear),1,1);
DECLARE @dteMonth = new DateTime(1,(@pmonth==-1?1:@pmonth),1);
DECLARE @dteDay = new DateTime(1,1,(@pday==-1?1:@pday));

//system attributes
DECLARE @systemCode string = "Patterns";

// Path
DECLARE @path string = "wasb://raw@sfgsfdgsdfg.blob.core.windows.net/" + @systemCode + "/{year:yyyy}/{month:MM}/{day:dd}/{filename}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 year DateTime,
 month DateTime,
 day DateTime,
 filename string
 FROM @path
 USING Extractors.Text(delimiter : @delimiter, escapeCharacter : @escape, quoting : false, skipFirstNRows : 1);

@data =
 SELECT d. *
 FROM @data AS d
 WHERE (year == @dteYear OR @pyear == @wildcard)
 AND (month == @dteMonth OR @pmonth == @wildcard)
 AND (day == @dteDay OR @pday == @wildcard);


OUTPUT @data
TO "/Patterns/Test/Out.csv"
USING Outputters.Csv();


This time we can see that the job only reads my single February file which is exactly what I want. I want the efficiency of selective reads and flexibility to load any part of the temporal hierarchy of my files in my blob storage.

6

In the code the filter variables day, month and year are still typed as datetime… however consider how 2017-02-18 gets split out into these variables:

  • Day  18-> defaults month to 01 and year to 0001 -> 00010118T00:00:00.000000
  • Month 2 -> defaults day to 01 and year to 0001 -> 00010201T00:00:00.000000
  • Year  2017-> defaults day and month to 01 -> 20170101T00:00:00.000000

Note I’ve left the date part datetimes in the output so you can see how they come through:

7

The result with this pattern is that we effectively read only what we needed for the output and perhaps the moral of the story – avoid custom .net in the where clause when filtering down files that you want to read. Part of that solution is making sure your file has attributes it needs to perform specific predicate selections.