Quick Tip – SSIS: Excel Drivers

So last week I was just finishing of a generic metadata import solution for MDS. Basically you map your Excel semi-structured data source file to the entities and attributes in MDS run the solution and it automatically imports, transforms the file and loads it through to the mapped entities. It was all working great; until Windows installed a patch! Basically it looks like they have finally nailed the coffin shut on the Jet 4 driver – to be fair I probably shouldn’t have been using it.

So for older xls files (2007 – 2010) you can use the Microsoft.ACE.OLEDB.12.0 driver. If you have office 64 bit installed like me your SSIS IDE may not have this driver available in SSIS; but why?

Well because the IDE requires the 32 bit driver. You can flip the debugging run time in SSIS to be 64 bit however the IDE requires the 32 bit in order to design time validate the component and the solution can’t run in 64 bit until it’s validated; which is a tad annoying.

You can install the 32 (& 64) bit drivers from here. The really annoying thing is that if you have office 64 bit installed it won’t let you install the 32 drivers. This is the workaround cmd:

c:\>AccessDatabaseEngine.exe /quiet

 

 

 

Advertisements

Quick Tip – TSQL: Parameter & Variable Naming

Saw this a while ago and used it ever since… can’t remember who from otherwise would credit…

I always use @@ for parameters and @ for variables. Makes for easier intellisense and readability.

CREATE PROCEDURE [dbo].[my_proc]
( 
 @@name varchar(100), 
 @@is_enabled bit, 
 @@catageory char(5)
) as
BEGIN

DECLARE @dte DATETIME = GETDATE();

SELECT 
 name,
 status
FROM dbo.process
WHERE name = @@name
  AND is_enabled = @@is_enabled
  AND category = @@category
  AND rundate < @dte

END

Quick Tip – TSQL: Validate Procedure Metadata

So lets say you’re building some sort of DB integration and you want list only those procs that have a specific input and output definition. Here’s one way to do it… I used a table variable to hold the definitions but obviously there might be a better way to do that!

First create a proc we can target:

CREATE PROCEDURE [dbo].[my_proc]
( 
  @@name varchar(100), 
  @@is_enabled bit, 
  @@catageory char(5)
) as
BEGIN

  SELECT 
    name, 
    object_id, 
    schema_id, 
    type_desc, 
    create_date 
  FROM sys.tables 

END

Some T-SQL that returns the list procs that meets the specific inputs and outputs:

DECLARE @valid_definition TABLE
( 
  name VARCHAR(150) NOT NULL, 
  type VARCHAR(50) NOT NULL, 
  ordinal INT NOT NULL, 
  direction BIT NOT NULL --1=in and 2=out 
) 

  INSERT INTO @valid_definition (name, type, ordinal, direction) 
  VALUES 
    ('@@name',       'varchar'       ,1 ,1), 
    ('@@is_enabled', 'bit'           ,2 ,1), 
    ('@@catageory',  'char'          ,3 ,1), 
    ('name',         'nvarchar(128)' ,1 ,0), 
    ('object_id',    'int'           ,2 ,0), 
    ('schema_id',    'int'           ,3 ,0), 
    ('type_desc',    'nvarchar(60)'  ,4 ,0), 
    ('create_date',  'datetime'      ,5 ,0)

  ;WITH cte_params AS 
  ( 
    SELECT  
      pa.object_id, 
      procedure_name   = '[' + OBJECT_SCHEMA_NAME(pa.object_id) + '].[' + OBJECT_NAME(pa.object_id) + ']', 
      parameter_name   = pa.name,
      parameter_type   = TYPE_NAME(pa.user_type_id),
      parameter_length = pa.max_length,
      parameter_prec   = CASE WHEN TYPE_NAME(pa.system_type_id) = 'uniqueidentifier' THEN precision ELSE OdbcPrec(pa.system_type_id, pa.max_length, precision) END,   
      parameter_scale  = OdbcScale(pa.system_type_id, pa.scale),
      parameter_order  = pa.parameter_id 
    FROM sys.parameters pa  
  )
  SELECT DISTINCT 
    proc_signiture.procedure_name   
  FROM 
  ( 
    SELECT 
      pa.procedure_name 
    FROM sys.procedures pr 
    JOIN cte_params pa ON OBJECT_ID(SCHEMA_NAME(pr.schema_id) + '.' + pr.name) = pa.object_id 
    JOIN @valid_definition vd ON vd.name = pa.Parameter_name AND vd.type = pa.parameter_type AND vd.ordinal = pa.parameter_order AND vd.direction = 1 
    GROUP BY pa.procedure_name 
    HAVING COUNT(1) = 
    (
      SELECT COUNT(1) 
      FROM @valid_definition WHERE direction = 1
    ) 
  ) proc_signiture 
  -- get the input procedure signiture 
  CROSS APPLY sys.dm_exec_describe_first_result_set (proc_signiture.procedure_name, NULL, 0) proc_metadata 
  --get the output metadata 
  JOIN @valid_definition vd ON vd.name = proc_metadata.name AND vd.type = proc_metadata.system_type_name AND vd.ordinal = proc_metadata.column_ordinal 
  GROUP BY proc_signiture.procedure_name 
  HAVING COUNT(1) = 
  (
    SELECT COUNT(1) 
    FROM @valid_definition WHERE direction = 0
  )

The Basics – T-SQL: Recursive Inheritance

This is a post regarding a SQL design pattern for applying configuration attributes to a parent child table. There’s a bunch of obvious use cases particularly around metadata and ETL control & configuration repositories…

Imagine if your ETL has hundreds if not thousands of processes. Often it makes sense to model process meta data in parent child table since you have 1 root process that when executed executes a hierarchy of child process. If then you have configuration options for those processes you won’t want to maintain a configuration row for every process right? that would be a huge maintenance headache. Ideally we would configure the root process with options that the child processes just inherit unless they’re specifically overridden. This gives us the best of both worlds in terms of being able to have the flexibility of specific configurations for child processes but where they don’t have configurations then inherit the configurations from the parent resulting in less configuration maintenance. This model doesn’t persist the inherited configuration but remains sparse and derives the inheritance logically when queried removing the need for updates or inserts.

We’ll have 2 tables:

  1. The parent child entity: process_executable
  2. The configuration attributes: process_configuration

Here they are:

CREATE TABLE [etl].[process_executable] (
 [process_executable_id] INT IDENTITY (1, 1) NOT NULL,
 [parent_process_executable_id] INT DEFAULT ((-1)) NOT NULL,
 [connection_id] INT NOT NULL,
 [process_executable_name] VARCHAR (200) NOT NULL,
 [created_date] DATETIME DEFAULT (getdate()) NOT NULL,
 [is_enabled] BIT DEFAULT ((1)) NOT NULL
 CONSTRAINT [PK_process_executable] PRIMARY KEY CLUSTERED ([process_executable_id] ASC)
);


CREATE TABLE [etl].[process_configuration] (
 [process_configuration_id] INT IDENTITY (1, 1) NOT NULL,
 [process_executable_id] INT NULL,
 [number_of_retries] INT NULL,
 [is_incremental] BIT NULL,
 CONSTRAINT [PK_process_configuration] PRIMARY KEY CLUSTERED ([process_configuration_id] ASC)
);

Here is the solution

First left join the configurations in so we get all process with and without configurations. Then recurse down the hierarchy using the a recursive CTE defaulting the configurations to the parent values where they are null.

WITH [cte_process_configuration]  
AS --join in the configuration 
(
  SELECT  
    p.[process_executable_id],  
    p.[parent_process_executable_id],  
    pc.number_of_retries, 
    pc.is_incremental 
  FROM etl.process_executable p 
  LEFT JOIN [etl].[process_configuration] pc ON p.process_executable_id = pc.process_configuration_id
 ), 
cte_collapse 
AS --recurse  down tacking the parent settings down to children where they don't have settings. 
(
  SELECT  
    m.process_executable_id, 
    m.parent_process_executable_id,
    m.max_threads,
    m.number_of_retries,
    m.is_incremental 
  FROM [cte_process_configuration] m 
  WHERE m.parent_process_executable_id = -1 

  UNION ALL

  SELECT 
    m.process_executable_id, 
    m.parent_process_executable_id, 
    --carry down the settings where there aren't any 
    ISNULL(m.number_of_retries, pm.number_of_retries) AS number_of_retries, 
    ISNULL(m.is_incremental, pm.is_incremental) AS is_incremental 
  FROM [cte_process_configuration] m 
  JOIN cte_collapse pm ON m.parent_process_executable_id = pm.process_executable_id 
)
--show me the money! 
SELECT 
  process_executable_id,  
  number_of_retries, 
  is_incremental 
FROM cte_collapse c

 

Quick Tip – C# Batch Parallel Processing

There’s a fair few creative solutions around for executing data processing tasks in parallel in SQL and SSIS. As good as they are they’re not really necessary if you use Parallel.ForEach in the C# library System.Threading.Tasks. This is a parallel loop that can be used to iterate items in a collection and run some code for each item. All the thread allocation and loop tracking control is handled internally without having to do anything.

The only other thing you could do is pass it a parameter to specify how many concurrent executions you want making it adjustable and easy to configure at runtime:

int maxP = 5;

Parallel.ForEach(
collectionOfProcesses,

new ParallelOptions { MaxDegreeOfParallelism = maxP},
(currentProcess) =>
{

//Excecute SSIS Package
//Execute SQL Proc
//Azure Data Lake Analytics Job

  Execute(currentProcess);
});

The Basics – BIML : Preview Pain & Multiple File Dependencies

BIML Express 2017 has been formally release and a great new feature is the preview pain that lets you see how you’re BIML will be rendered during development. For metaprogramming this is awesome and saves a lot of pain. You can read about it here and here.

pvpain.png

So what if you have multiple dependent BIML files? e.g. I have a file that retrieves my table meta data to build table def BIML and a file that consumes the table def BIML to create my packages. This way I can use the same table definition BIML file for multiple package patterns. I just thought this would be too much for it to cope with for the 1st release – but it does in fact work which is super awesome.

The trick is the preview pain operates on all the open files. So all the dependent files have to be open to fully rendered the desired result in the preview pain.

So with just my package definition BIML file open no package BIML code is rendered in the preview… this is because there are no table defs to iterate to create the packages.

pvpainpkg

If I open both the table definition BIML Script and the packages definition script then we’re all good.

Table definition BIML

pvpaintbl

Packages definition BIML Preview now has packages!

pvpainpkgwrk

Auto DW – Metaprogramming

This is a high level consideration on using metaprogramming to build Automated Data Provisioning frameworks.

Once you’ve mastered writing code and applying solution patterns to solve real world problems a next natural progressive step is to write code that writes code! A lot of implementation technologies might even dip into this without you being aware or you may just start dipping into a natural innovative way so solve a certain problem. The topic is part of an advanced knowledge domain called Metaprogramming; the wiki-post discusses it pro’s and challenges. Kathleen Dollard has a great course on Pluralsight called Understanding Metaprogramming.

My own experience and perhaps the most common is that you’ll start metaprogramming before you’ve given the topic it’s full attention. I don’t remember getting out bed and thinking… “today I will do some metaprogramming”. What happened is that chasing the benefits as a result of experiencing pain provided the motivation. The next thing to say about code writing code is that it can go fantastically well or horrifically bad. Without giving the knowledge domain the respect that it deserves chances are it will be the latter.

Another fundamental software engineering trap I learnt the hard way is don’t program generic solutions to very specific problems you’ll pay for it in complexity and performance. The temptation can be quite strong because we’re taught to abstract and conquer particularly if the problem looks the same – but is it? This is particularly relevant for data provisioning platforms because (not exhaustive):

  1. Performance is high on the agenda. We attempt to routinely and frequently move and change tons of data; performance is crucial for success and it’s directly related to costs
  2. The repetition is obvious and can appear to constitute a large proportion of man hours; to an economist it seems to be the same solution over and over again e.g. stage data, build snapshot fact, build type 1 dimension, build type 2 dimension, etc…
  3. The content (the data) is fluid and dynamic over a large temporal period. Beyond it’s schema definition it has a low level and often an incomplete persistence of semantics that can be illusive that are a product of real world economic and human behaviour
  4. The expectations and requirements are also fluid and dynamic; they seek recover semantic meaning or information from data using reports, interactive data visualisation tools, semantic layers or other system to system interfaces.

So bringing this all into context:

  • Design patterns are common but the solutions are never the same. A type 2 dimension is a design pattern not a solution. This isn’t helped by teams running bad agile delivery. A type 2 dimension is not a backlog story and neither is a fact table.
  • The solution is to provide specific information to meet a business requirement. Not only is it different on every single project, it’s different in the same project over time. The business has to respond to it’s market which in turn motivates expectation and influences human behaviour which  in turn churns the raw solution content; the data. A static solution is not a solution.
  • The solution content is the data which is also different on every single implementation and within an implementation over time. It has features and they can all change either explicitly or implicitly
  • Performance in data platforms rely on issuing the correct amount of physical computing resources at exactly the right time. What this means is that a physical implementation needs to know about the features of the data very explicitly in order to allocate the correct amount of resources. Get it wrong in an on premise architecture and a job hogs limited resources causing other processes to suffer. Get it wrong on cloud MPP architecture and you’ll pay too much money. This is not going away; why? because information has entropy and you can’t cheat the law of physics.

In Conclusion

Building a generic solution to solve the problem of repetition in Data Platform delivery isn’t the answer. The data is specific, the requirements are specific and if you take this approach the solution is abstract leading to overly complicated and poor performing technical architectures. At their very worst the try to shoe horn the specifics into an architecture that hinders the goal and completely misses the requirements. I’d stick my neck out based on my own experience and state that 2 solutions are never the same; even in the same industry using the same operational systems.

Be very wary of magical all singing and all dancing products that claim to be a generic solution to data provisioning. AI is long way off being able derive specific semantics about the real world based on data. It’s just not possible right now… a lot of AI is approximate based on population statistics; the features of data and information are very specific.

Metaprogamming solves the problem of repetition but delivers specific solution artefacts that don’t sacrifice what Data Platforms implementations need in order to succeed which is:

  • Perform within their budget
  • Meet the business requirements

We aim to solve the repetition problem (and a whole host secondary problems) during the development process and recognise that there is the following:

  • Specific metadata about the technical features of the raw data
  • Specific metadata about the technical features of the deliverables
  • Generic implementation patterns

Development frameworks can collect the metadata specifics and combine them with generic implementation patterns to automatically generate the code of our specific solution artefacts. No product or framework however can do the following:

  • Semantically analyse the data to determine the code required to perform the transformations required to meet the information requirement. This requires real intelligence i.e. a human! It can also be extremely hard if the data and requirements are particularly challenging – this is where the real business value sits in your solution
  • Decide what are the best design patterns to use and how to construct them into a into a solution to meet the requirements. This requires knowledge and experience – An experienced solution architect

There are number of technical ways to achieve metaprogramming. I generally work in the Microsoft data platform space. Here are some I’ve used before I knew about metaprogramming:

  • XML/XSLT creating JavaScript!! Not data platform and a long time ago. Wouldn’t recommend it
  • SQL creating SQL (Dynamic SQL)
  • C# creating SSIS and SQL
  • T4 Templates
  • BIML (a Varigence creation)

I’ve built a few automated development frameworks using the above. Some of them were awful. I found myself neck deep in some crazy code maintenance and debugging hell which motivated me to learn more about the in’s and out’s of metaprogramming. I strongly recommend Kathleen’s course Understanding Metaprogramming if you’re heading down this road since it goes into detail about the approaches and the best classes of solutions for code generating code. Now I only use to the following:

The way that BIML works is actually a very similar T4 Templates it’s just that BIML brings a really useful mark-up language and development IDE to the party for scripting the creation of SSIS packages and database objects. They have also just released their automated development framework called BIML Flex if you don’t have the bandwidth/time to build your own.

As it turns out tackling metadata as a technical requirement during the development cycle lends itself to solving other common difficult problems in the data provisioning space which is integrating the following:

  • Data Catalog
  • Data Lineage
  • Operational Logging
  • Operational Auditing

Because the metadata collection is required and the assets are created from it, integrating these data platform features becomes a by-product of the development process itself. It’s a very proactive and effective solution. Retrospective solutions in this space can never keep up to the pace of change or are too pervasive requiring constant maintenance and support over and above the solution itself.