The Basics – Azure SSAS: Authentication

Azure SSAS has hit preview and if you’re familiar with SSAS you’ll know it only works with Active Directory Integrated authentication. If you’re not an Active Directory spod this might throw you a little, since it’s an Azure PaaS service. Effectively it works in a similar way and requires you to declare an Azure Active Directory administrator account when you create the service. If you’ve created your own subscription, the account you use to log in is not an Azure AD account, so choosing that account won’t be an option.

So… a little on Azure AD. Azure AD wasn’t really intended to be a standalone Active Directory, well not yet anyway. Essentially, for cloud to work there had to be an authentication method that works well over the internet, and Microsoft went for OAuth 2, which is basically what Azure AD uses to create secure tokens. Azure AD is effectively a cloud extension to the traditional AD that we’re familiar with at least on some level; it’s the organizational account we use every day to log on to our work machines.

The good news is you get basic Azure Active Directory for free. See the screenshot below.

[Screenshot: Azure Active Directory users blade showing the two accounts]

In the screenshot there are 2 accounts:

  • Microsoft Account
  • Azure Active Directory Account

The Microsoft Account will be there already. That’s what was possibly used to create and use the subscription up till now. It’s the personal account you sign in with when you go to Office 365 or the Azure portal. The personal account username will be whatever email address you used… e.g.

joe.blog@hotmail.com

When you look at the profile of the account in the next blade of the image above, you’ll see the source of this account is declared as Microsoft Account.

The other account is an Azure Active Directory account. This can be created in the blade above using “Add User”. It will have the following username convention:

username@joebloghotmail.onmicrosoft.com

If you want to assign your own custom domain to get rid of the one given to you by Azure you can, but that isn’t free, so I’ll leave it beyond the scope of this blog. There are brilliant courses on Pluralsight for Azure AD… thoroughly recommend Pluralsight… it’s awesome.

When we look at the profile of this account, its source is declared as Azure Active Directory. To create an SSAS service we need an Azure Active Directory account to be the administrator, so go ahead and create one. This is what we’d normally sign in with as an organizational account, and you can indeed give it further privileges and use it to sign into the portal to provision and use other services.

[Screenshot: profile of the new account showing its source as Azure Active Directory]

Now that we have an Azure AD account we’re ready to go…

When creating the SSAS service you’ll be asked for this type of account as an administrator. It won’t allow you to select a Microsoft Account (personal account). Below we can see the Microsoft Account is grayed out; I can only select an Azure AD account (organizational account).

[Screenshot: Analysis Services administrator picker with the Microsoft Account grayed out]

Once the service is provisioned we can deploy to it using SQL Server Data Tools (SSDT) and connect to it using SQL Server Management Studio (SSMS). The Microsoft docs go into detail on both.
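
For reference, an Azure Analysis Services server name takes the following form (the region and server name below are placeholders for your own deployment); this is what goes into the SSDT deployment target and the SSMS connection dialog, together with the Azure AD account created above:

asazure://&lt;region&gt;.asazure.windows.net/&lt;servername&gt;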

When connecting with SSMS to a normal SSAS instance we just go straight in using our on-premises AD account with integrated authentication. Because we’re signing in with our Azure AD account we need to provide the account details that we created in Azure. In order to connect we need a new option, Active Directory Password Authentication, which requires a registry entry to enable it (Connect using SSMS). If you’re building up advanced security models in SSAS you can obviously create a number of users and use different sign-ins to test the results.

[Screenshot: SSMS connection dialog using Active Directory Password Authentication]

Unless of course your on-premises AD and Azure AD are hooked up via federation, in which case you can go straight in with Active Directory Integrated Authentication without having to sign in.

In a working environment we obviously wouldn’t want to assign individual users. It would be better to create groups, e.g. SSAS Administrators, assign the groups, and manage user access by adding users to groups in Azure Active Directory.

I hope this helps. Not all BI developers have a handle on Active Directory, especially the ins and outs with Azure. I only know what I know from app programming many moons ago and some hours spent on Pluralsight learning enough to get by and design good security models.

Finer Points – USQL: Extraction Paths 2, reading too much!

I blogged recently on loading files using parameterised wildcards. Michael Rys kindly posted some feedback in my comments regarding parameters, which was helpful and uncovered some important behavior to consider. Effectively, when considering big data we want the optimizer to be as smart as possible when deciding which files to read, since we pay for it (literally, not just in performance!); at the same time I want flexible and elegant code implementation frameworks.

So it turns out we can use dateparts and filter in the same job. We can put dateparts into extraction paths and follow them with a WHERE clause in a separate statement. When the script compiles, the execution should ideally only read the extraction files that are covered by the WHERE clause in the USQL statement that follows the extract. In fact, when using a USQL predicate with a DateTime variable, that’s exactly what it does. Essentially it doesn’t execute sequentially; we should expect the compiler to choose the best parallel query plan and only read the files it needs to.

So to provide a pattern that can load everything, a year, a month or a day, it might seem reasonable to do something like the following with C# types in order to provide a generic procedure.

Note: below I’ve configured the setup to load February; I have 1 file in Feb and 7 files in Jan:

 /* PARAMETERS BEGIN */

/* load everything */
//DECLARE @pyear int = -1;
//DECLARE @pmonth int = -1;
//DECLARE @pday int = -1;

/* load year */
//DECLARE @pyear int = 2016;
//DECLARE @pmonth int = -1;
//DECLARE @pday int = -1;

/* load month */
DECLARE @pyear int = 2016;
DECLARE @pmonth int = 2;
DECLARE @pday int = -1;

/* load day */
//DECLARE @pyear int = 2016;
//DECLARE @pmonth int = 1;
//DECLARE @pday int = 1;

/* PARAMETERS END */

//file format
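// note: 44 is the ASCII code for a comma (the files are CSVs), despite the @tabcode name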
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

//wildcard date
DECLARE @dateWildCard int = -1;


//system attributes
DECLARE @systemCode string = "Patterns";

// Path
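// {date:yyyy}/{date:MM}/{date:dd} and {filename} are file set virtual columns: their values come from the file path and surface as the date and filename columns in the EXTRACT below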
DECLARE @path string = "wasb://raw@sdbgkhsdbgkhds.blob.core.windows.net/" + @systemCode + "/{date:yyyy}/{date:MM}/{date:dd}/{filename}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 date DateTime,
 filename string
 FROM @path
 USING Extractors.Text(delimiter : @delimiter, escapeCharacter : @escape, quoting : false, skipFirstNRows : 1);


@data =
 SELECT d.*
 FROM @data AS d
 WHERE (date.Year == @pyear OR @pyear == @dateWildCard)
 AND (date.Month == @pmonth OR @pmonth == @dateWildCard)
 AND (date.Day == @pday OR @pday == @dateWildCard);

OUTPUT @data
TO "/Patterns/Test/Out.csv"
USING Outputters.Csv();


When the job executes it loads all the files for January and February and then applies the filter, which is not what I expected. 8 streams read all the files but the output only contains February. We effectively read 7 files in January that we didn’t need to.

[Screenshot: job graph with 8 streams reading all the January and February files]

Output only contains February:

[Screenshot: output preview containing February rows only]

So in conclusion we’ve read far more data than we intended. The following code, however, avoids this problem. Note the differences: I’ve effectively passed the date parts into separate columns, year, month and day, which means I can use a native DateTime predicate in the WHERE clause. The analytics engine doesn’t have to load all the data to determine the value of year, month and day using higher-level .NET execution; well, that’s my guess. I need to read more on the internals underneath if they’re published anywhere. Effectively there is a read price to pay for running custom .NET in this scenario.

 
 /* PARAMETERS BEGIN */

/* load everything */
//DECLARE @pyear int = -1;
//DECLARE @pmonth int = -1;
//DECLARE @pday int = -1;

/* load year */
//DECLARE @pyear int = 2016;
//DECLARE @pmonth int = -1;
//DECLARE @pday int = -1;

/* load month */
DECLARE @pyear int = 2016;
DECLARE @pmonth int = 2;
DECLARE @pday int = -1;

/* load day*/
//DECLARE @pyear int = 2016;
//DECLARE @pmonth int = 1;
//DECLARE @pday int = 1;

/* PARAMETERS END */

//file format
DECLARE @tabcode int = 44;
DECLARE @delimiter char = (char) @tabcode;
DECLARE @escape char = Char.Parse("œ");
DECLARE @extension string = "csv";

//wildcard date
DECLARE @wildcard int = -1;
//datetime part filters from parameters
DECLARE @dteYear DateTime = new DateTime((@pyear==-1?1:@pyear),1,1);
DECLARE @dteMonth DateTime = new DateTime(1,(@pmonth==-1?1:@pmonth),1);
DECLARE @dteDay DateTime = new DateTime(1,1,(@pday==-1?1:@pday));

//system attributes
DECLARE @systemCode string = "Patterns";

// Path
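// the path now exposes year, month and day as separate virtual columns (instead of a single date), matched by the EXTRACT below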
DECLARE @path string = "wasb://raw@sfgsfdgsdfg.blob.core.windows.net/" + @systemCode + "/{year:yyyy}/{month:MM}/{day:dd}/{filename}.csv";

@data =
 EXTRACT Id int,
 Title string,
 Email string,
 Firstname string,
 Surname string,
 year DateTime,
 month DateTime,
 day DateTime,
 filename string
 FROM @path
 USING Extractors.Text(delimiter : @delimiter, escapeCharacter : @escape, quoting : false, skipFirstNRows : 1);

@data =
 SELECT d.*
 FROM @data AS d
 WHERE (year == @dteYear OR @pyear == @wildcard)
 AND (month == @dteMonth OR @pmonth == @wildcard)
 AND (day == @dteDay OR @pday == @wildcard);


OUTPUT @data
TO "/Patterns/Test/Out.csv"
USING Outputters.Csv();


This time we can see that the job only reads my single February file, which is exactly what I want: the efficiency of selective reads and the flexibility to load any part of the temporal hierarchy of the files in my blob storage.

[Screenshot: job graph reading only the single February file]

In the code the filter variables day, month and year are still typed as DateTime… however, consider how 2017-02-18 gets split out into these variables (there’s a small sketch after the list):

  • Day 18 -> defaults month to 01 and year to 0001 -> 0001-01-18T00:00:00
  • Month 2 -> defaults day to 01 and year to 0001 -> 0001-02-01T00:00:00
  • Year 2017 -> defaults day and month to 01 -> 2017-01-01T00:00:00
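
To make those defaults concrete, the filter variables in the script above are built with the same DateTime constructor, so they line up with the extracted columns (hypothetical DECLAREs, values plugged in just for illustration):

DECLARE @exampleDay DateTime = new DateTime(1, 1, 18);    // 0001-01-18T00:00:00
DECLARE @exampleMonth DateTime = new DateTime(1, 2, 1);   // 0001-02-01T00:00:00
DECLARE @exampleYear DateTime = new DateTime(2017, 1, 1); // 2017-01-01T00:00:00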

Note I’ve left the date part datetimes in the output so you can see how they come through:

[Screenshot: output preview showing the year, month and day DateTime columns]

The result with this pattern is that we effectively read only what we needed for the output, and perhaps that’s the moral of the story: avoid custom .NET in the WHERE clause when filtering down the files that you want to read. Part of that solution is making sure your file path has the attributes it needs to perform specific predicate selections.

Finer Points – Azure Stream Analytics & PowerBI: Duplicates

Power BI is evolving rapidly and market competition is really hotting up, especially with Tableau. Arguably Power BI’s real-time capability puts it slightly ahead of the game in one aspect. I managed to get some detailed hands-on to see what the experience and capability are really like. There are 2 ways to stream data into Power BI:

  1. APIs
  2. Azure Stream Analytics

You can get up and running with Stream Analytics pretty quickly; however, sometimes it’s good to break stuff just to see what you’re getting yourself into.

I’m not going to go into the details of how the architecture is set up since there are loads of blogs on that already. Basically:

  1. Set up a Stream Analytics job pointing at a storage blob to read CSVs
  2. Added a query to pull apart the date and add in some system columns
  3. Fed the data into my Power BI subscription

[Diagram: blob storage feeding a Stream Analytics job which outputs to Power BI]

You can stream data from IoT Hub, Event Hubs or storage events. I’ve used storage events here just because they’re easy to set up. Using storage for real-time is a way to batch events up into an event payload which you can stream through and then archive into a historic database or data lake.

The query is as follows:

SELECT
BlobName,
EventProcessedUtcTime,
BlobLastModifiedUtcTime,
PartitionId,
CAST(DATEPART(year,Date) AS nvarchar(max)) AS Year,
CAST(DATEPART(month,Date) AS nvarchar(max)) AS Month,
CAST(DATEPART(day, Date) AS nvarchar(max)) AS Day,
Date,
Region,
CAST(Sales as BIGINT) AS Sales
INTO
 [TestReport]
FROM
 [Test]

In this scenario I’m just going to use the storage account write events to push a few simple rows of data into Power BI. I will load 2 CSV files:

  1. Sales for UK & Wales for Jan 2017 09:00:01 – 09:30:00
  2. Sales for UK & Wales for Jan 2017 09:30:01 – 10:00:00

Here is an example showing the start and end of the 2nd file; note that I’ve deliberately included data that will cause an error in Stream Analytics – “wrong” cannot be cast as a BIGINT:

Date,Region,Sales
2017-01-01 09:30:01,UK,5
2017-01-01 09:31:00,UK,1
2017-01-01 09:32:00,UK,2
....
2017-01-01 09:58:00,WALES,1
2017-01-01 09:59:00,WALES,4
2017-01-01 10:00:00,WALES,wrong

Step 1 – I copy file 1 up into the storage account; within a few seconds my test data set appears and I can see data from the 1st file. Note that I’ve included BlobLastModifiedUtcTime.

[Screenshot: Power BI data set showing the data from the 1st file]

Step 2 – I copy file 2 up into the storage account. At this point the job stops and is in a failed state as expected. Looking in the logs we can clearly see the error:

[Screenshot: Stream Analytics operation log showing the cast error]

However, interestingly, when we look at Power BI some data (1 row) made it through before the failure stopped the stream:

[Screenshot: Power BI showing the single row that arrived before the failure]

Step 3 – so I’ve picked up my failure and I correct the file, changing the value “wrong” to 50. Note I can only do this because I have a file to change.

Date,Region,Sales
2017-01-01 09:30:01,UK,5
2017-01-01 09:31:00,UK,1
2017-01-01 09:32:00,UK,2
....
2017-01-01 09:58:00,WALES,1
2017-01-01 09:59:00,WALES,4
2017-01-01 10:00:00,WALES,50

I copy file 2 back up into my storage account, overwriting the previous copy that caused the job to fail. Then I restart the job from when it failed. Note I haven’t used TIMESTAMP BY to set the event time from a source column.

[Screenshot: restarting the Stream Analytics job from when it last stopped]

The job starts successfully and picks up the corrected file. However, on checking Power BI I now have a situation where duplicates have made it into my Power BI data set. Looking at the BlobLastModifiedUtcTime I can see that they originate from different blob events: 1 that caused the failure and 1 that did not. The numbers are different since there are 2 rows for UK in the data set with the same time, which Power BI has aggregated. This shows that only 1 row made it through on the failure:

[Screenshot: duplicate rows in the data set with different BlobLastModifiedUtcTime values]

Duplicates

On perusing the documentation I came across the following:

“….When a Stream Analytics job is running, duplicate records may occasionally be noticed in the output data. These duplicate records are expected because Azure Stream Analytics output adapters don’t write the output events transactionally. This ‘duplicate record’ scenario can result if one of the following conditions occur:

  • The Azure instance is upgraded while the job is running
  • The Stream Analytics job is upgraded or an issue occurs with connectivity or reliability to the job output
  • The Azure instance running the job has an outage

The downstream consumer of the output events needs to dedupe the events using the logical identity of the events. For example, if you are aggregating events by groups in a tumbling window, the logical identity of the event is the groups and the tumbling window’s end time. If you are running a pass through query, you may need to carry a unique id on the event in order to dedupe…”
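
For what it’s worth, carrying a logical identity through a pass-through query is straightforward; here’s a rough sketch against the query earlier (the EventKey construction is my own illustration, and as the Dedupe section below notes, Power BI itself gives you nowhere to use it):

SELECT
 -- EventKey identifies the logical row regardless of which blob event delivered it
 CONCAT(CAST(Date AS nvarchar(max)), '|', Region) AS EventKey,
 Date,
 Region,
 CAST(Sales AS BIGINT) AS Sales
INTO
 [TestReport]
FROM
 [Test]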

Fixing the Dupes

A simple observation is to make sure the queries and source data don’t cause errors, although the docs imply that Azure instance interruptions can cause the same issue… and state outright that there is no transactional protection on outputs.

In our case we are partitioning data into files before dumping the data into an event window for potential aggregation, e.g. every 30 minutes. Because it’s a file on Azure storage I can get in to correct the data and restart the stream.

Windowing

So does windowing help us? Well, yes and no, since it depends on the requirements. It fundamentally comes back to whether we are using the right thing for the job. With Stream Analytics we’d want to be doing some sort of windowed aggregate. If we partitioned our files at the same grain as the aggregate window then no file would commit as output events unless the operation succeeded for all the data in the file.
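
As a rough sketch of that idea (reusing the input, output and column names from the query above, and assuming the Date column is suitable to use as the event time), a 30-minute tumbling window aggregate would look something like this:

SELECT
 Region,
 System.Timestamp AS WindowEnd,
 SUM(CAST(Sales AS BIGINT)) AS Sales
INTO
 [TestReport]
FROM
 [Test] TIMESTAMP BY Date
GROUP BY
 Region,
 TumblingWindow(minute, 30)

With each source file cut at the same 30-minute grain, a bad row anywhere in the file would fail the window before its aggregate is emitted, which is the behavior the paragraph above is after.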

Dedupe

This is pretty much out of the window since we’ve plugged it straight into Power BI. There is no way to insert custom dedupe logic. If you know which files have failed you can filter them out, which is an ugly workaround.

Reload

All my data is in archived files so I can delete the data set, correct my data and reload everything.

API

I can use the Power BI API to delete rows of data from my data set using the BlobLastModifiedUtcTime.

https://msdn.microsoft.com/en-us/library/dn877544.aspx

https://github.com/Microsoft/PowerBI-CSharp/tree/master/samples/webforms/get-started-web-app-asp.net

Do we care?

Considering the use case, do we really care? If we’re running an event stream we should only really care about the now, particularly if we’re using Stream Analytics to stream data through prediction functions or statistical aggregations, since those are approximate anyway, being based on sample populations of data.

If on the other hand it’s for a BI function that just wants to see its data in a BI model as soon as possible then we are going to care that the numbers are correct, especially in Finance. As well as real-time we’d be archiving this data off to load a historical data model. This is where the real-time feature can get a bit misconstrued. Real-time BI is a different beast to real-time event processing. Currently, using this method, there is no way to blend your real-time feed into a richer BI model with related data sets and calculations. Even with event stream processing it would be good to visualize the event stream against the backdrop of historical aggregates.

In order to build real-time BI models we need to be able to partition data sets and point them at different sources, e.g. a real-time source and a retrospective source. It’s a tried and tested method with SSAS multi-dim. Partitioning is on the way for PaaS SSAS tabular, and perhaps the API gives me more options for achieving real-time BI. Or perhaps a richer architecture with Stream Analytics putting something in between its output and Power BI. The only other way to push data in real time into data sets currently is using APIs.

Right… onto the API.