Finer Points – U-SQL: File Format

Azure Data Lake is a file-based data storage and analytics platform. Working with what appears to be SQL can lead a traditional SQL developer into a few traps. File format is one of them.

The following built-in extractors are available in U-SQL:

Extractors.Text()
Extractors.Csv()
Extractors.Tsv()

If you're in the business of processing raw data files then choosing a file format should be pretty familiar territory. How do you know your delimiter isn't also a feature of the data itself? If it is, the schema-on-read comes crashing down like a ton of bricks. Ultimately you may have limited control over the protection of your chosen file format; how do you know the format you choose now won't be compromised in the future?

The Text() extractor and outputter provide an additional parameter that gives control over the file format delimiter (Csv() and Tsv() are effectively Text() with the delimiter fixed). Below is a small experimental snippet showing the Text extractor:

DECLARE @code int = 44;                        // 44 = ASCII code for comma
DECLARE @delimiter char = (char) @code;
DECLARE @escape char = Char.Parse("œ");        // an obscure escape character
DECLARE @path string = "/Patterns/2016/01/01/{filename:*}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 filename string
 FROM @path
 USING Extractors.Text(
       delimiter : @delimiter, 
       escapeCharacter : @escape, 
       quoting : false, 
       skipFirstNRows : 1,
       rowDelimiter: "/n"
);

OUTPUT @data
TO "/Patterns/Test/ActorsOut.csv"
USING Outputters.Csv();

Column Delimiter

The escape character gets us out of a hole with column delimiters. Something to know: if you use the escapeCharacter parameter, it won't be used to escape the quote character (quoting is true by default); quotes are escaped by doubling them, e.g. "Some ""data"" with quotes".
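To make that concrete, here's a hypothetical input line and matching extract with quoting left at its default (the path and columns are illustrative, not from the snippet above):

// Hypothetical input line with quoting : true (the default):
//   1,"Some ""data"" with quotes",someone@example.com
// The doubled "" inside the quoted field is read back as a single ".
@quoted =
    EXTRACT Id int,
            Title string,
            Email string
    FROM "/Patterns/Test/QuotedInput.csv"
    USING Extractors.Csv(quoting : true);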

My go-to approach is to always use the Text extractor, provide an escape character and avoid quotes, for the following reasons:

  • Data Factory supports specifying an escape character that automatically gets injected when the file is written out, which provides an easy way to pass an escape character into the data
  • Data Factory may also support escaping quotes (I don't know, I haven't tried); quoting just seems like an overly verbose way of doing the same thing as escape characters, except I can't choose my character and the one I'm given is very common
  • I can change the file markup in a central place without editing all my loading procedures
  • I'd rather have an escape character and not need it than waste tedious amounts of time sorting out format issues later
  • The escapeCharacter escapes itself, so you don't have to worry too much about what you choose, though you'll probably do yourself a favour by choosing something obscure (see the example after this list)
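To illustrate those escaping rules, here are hypothetical input lines using œ: œ followed by the delimiter embeds a literal comma, and œœ embeds a literal œ (paths and data are made up for illustration):

// Hypothetical input with delimiter ',', escapeCharacter 'œ', quoting : false:
//   1,Smithœ, Jones,smith@example.com     -> column 2 reads "Smith, Jones"
//   2,Price in œœ units,price@example.com -> column 2 reads "Price in œ units"
@escaped =
    EXTRACT Id int,
            Name string,
            Email string
    FROM "/Patterns/Test/EscapedInput.csv"
    USING Extractors.Text(delimiter : ',', escapeCharacter : 'œ', quoting : false);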

With large volumes, such issues often aren't found until you start churning through lots of data, at which point you may have a lot of code to change and pressing deadlines, simply because you chose the wrong delimiter.

You could argue that choosing something obscure for your column delimiter is just as effective. But can you ever be 100% sure? Also, someone will eventually have to look at this data to resolve issues; choosing common, well-supported delimiters ensures the data is easy to view and troubleshoot.

Row Delimiter

With the row delimiter we're not so lucky. Because it's a big data platform and the processing of data is split across compute nodes, it's not possible to escape row delimiters. As a best practice I now always strip row delimiters from my data in the source selection queries, since it's a relatively cheap operation there.
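As a sketch of the stripping operation itself, here it is in U-SQL over an already-extracted rowset (at a relational source the equivalent would be a REPLACE in the selection query; column names follow the earlier snippet):

// Strip embedded carriage returns and line feeds from string columns
// before the data is written out to a file.
@stripped =
    SELECT Id,
           Title.Replace("\r", " ").Replace("\n", " ") AS Title,
           Email,
           Firstname,
           Surname
    FROM @data;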

The only other approach is to clean in the data lake. This involves loading ragged files, looking for the ragged ends and cleansing out the delimiters on broken lines. Not impossible, but a bit more coding involved, albeit re-usable. Whether it's worth it probably depends on the source, since this kind of processing is charged to your Azure subscription; if it's cheap and easy to do at source outside Azure, why pay?
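A sketch of that approach, with a hypothetical path and an assumed column count of three: extract each raw line as a single string column using a delimiter that shouldn't occur, then filter out (or route for repair) lines with the wrong column count:

// Load ragged lines whole by using a delimiter assumed never to appear in the data.
@raw =
    EXTRACT line string
    FROM "/Patterns/Test/RaggedInput.csv"
    USING Extractors.Text(delimiter : '\u0001', quoting : false);

// Well-formed lines have the expected column count; fragments created by a
// stray row delimiter don't, and can be stitched back together or discarded.
@clean =
    SELECT line
    FROM @raw
    WHERE line.Split(',').Length == 3;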

Interestingly, the row delimiter can only be one character. "\r\n" is one of three defaults, the others being "\r" and "\n". Custom row delimiters are therefore limited to a single character; the following fails for being two characters:

 USING Extractors.Text(
       delimiter : @delimiter, 
       escapeCharacter : @escape, 
       quoting : false, 
       skipFirstNRows : 1,
       rowDelimiter: "/r/n"
);
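Whereas a single-character custom row delimiter (a hypothetical pipe here) is accepted:

 USING Extractors.Text(
       delimiter : @delimiter,
       escapeCharacter : @escape,
       quoting : false,
       skipFirstNRows : 1,
       rowDelimiter : "|"
);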

Hierarchical Formats

There are of course hierarchical formats available, such as JSON, XML and Parquet, that are supported by Data Factory and other tools. NoSQL platforms, after all, really shine with semi-structured data. However, at the moment they're not yet fully supported by Data Lake Analytics.

U-SQL is highly extensible, and custom extractors are available for hierarchical file formats. If traditional ETL were accused of one thing, it would be that it's too sensitive to schema changes. Hierarchical file formats compound this issue even further because they create dependencies between data entities at the asset (file) level.

If the source data by its original nature comes in a hierarchical format, then fair enough. If it's coming from an RDBMS, taking it to a hierarchical format only to get it back to a flat format is taking three steps back and one step forward.

Centralising File Format

The code snippet above shows pushing file format parameters in using variables, which obviously means maintaining those values in every script or procedure. You could push the file markup values in using parameters, in which case centralisation depends on what you're using for orchestration. Another way to centralise is to pull the literal file markup values from a shared assembly:

REFERENCE ASSEMBLY Patterns.USQLHelper;

DECLARE @delimiter char = USQLHelper.FileFormat.GetDelimiter();
DECLARE @escape char = USQLHelper.FileFormat.GetEscape();

Shared assemblies can be written in C# (or any .NET language) and registered at database level. Such an assembly could read a JSON or XML file stored in the data lake itself. Then it's just a matter of tweaking your global file formats, organised by source, whenever you hit a format issue.
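A minimal sketch of what the helper might look like with hard-coded values (the class and method names simply match the calls above; a fuller version might deserialise the values from a config file in the lake):

// Hypothetical C# code-behind for the Patterns.USQLHelper assembly.
namespace USQLHelper
{
    public static class FileFormat
    {
        // Central definition of file markup; change here, not in every script.
        public static char GetDelimiter() { return (char)44; }   // comma
        public static char GetEscape() { return 'œ'; }
    }
}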
