MSBI (SSIS/SSRS/SSAS) Online Training

Thursday, October 17, 2013

SSIS - Slowly Changing Dimension Type 1 Implementaion with MergeJoin Vs Lookup Vs SCD Transformation

Hi All,
I would like to differentiate the SCD implementation design patterns with different design advantages/disadvantages in terms of performance. Enjoy !!

I.                    Slowly Changing Dimension Type 1 with Merge Join.
Right from the start the design is a bit different. Instead of adding your source query to an OLEDB Source component and then connecting it to the lookup or the SCD component, we create 2 OLEDB Source components. 
The first one is the same as in the previous two posts, it connects to our source table (UserUpdate).  However we have to modify the query a little bit to get it to work with the Merge Join component.  So lets take a look at that query.
SELECT
    [UserAK]
    ,[ScreenName]
    ,[AccountCreateDate]
    ,[TimeZone]
    ,[Language]
    ,[GeographySK]
    ,CHECKSUM( 
        [ScreenName]
        ,[AccountCreateDate]
        ,[TimeZone]
        ,[Language]
        ,[GeographySK] 
    ) as Hash_CD
FROM
    DimUserUpdate
Order By UserAK

The addition of the ORDER BY clause is essential when using the Merge Join design pattern.

The second OLEDB Source component will query against our destination table, essentially replacing the lookup  component from the Lookup Conditional Split design pattern.  As in the lookup query we only need to bring back the UserAK, UserSK and the Hash_CD.  But just like the source query above we need to add the ORDER BY clause because we are going to use the Merge Join.

SELECT [UserSK]
      ,[UserAK]
      ,[Hash_CD]
  FROM [Demo].[dbo].[DimUser]
Order By UserAK 
Simply using the ORDER BY clause is not enough for SSIS to know that the datasets you are intending to enter into the Merge Join is not enough. You have to tell SSIS that the data is sorted, and on what column the sort is happening.  I agree it’s a little redundant, but that’s how it is.  You must do this for each OLEDB Source.  To do this right click on the OLEDB Source and select Show Advanced Editor from the menu.
Once inside the advanced editor select the Input and Output Properties Tab along the top of the window. Then under the Common Properties for the OLEDB Source Output change the IsSorted property to True.
That takes care of telling SSIS that the data is sorted, but now we need to tell SSIS what column it is sorted on.  To do that we drill down OLEDB Source Output to the Output Columns and select the column, UserAK, (or columns if your query is ordered by more than one column) and change its SortKeyPosition to 1
Now do this for the second OLEDB source and then we’ll be ready for the Merge Join component.  Please note that if you have multiple column in your order by clause they must match in both queries, and you must set the SortKeyPosition in the same order for both components.  Once the the source components are configured drag in the Merge Join component.
Now drag the first source component output connection to the Merge Join component. Once you’ve connected this a selection box will appear.  It will ask you to tell it what side of the join this output belongs on.  This is important, because we are basically going to be creating a Left Outer Join inside the component, so knowing which side of the join each portion belongs to is obviously essential.
When you add the second output from the other OLEDB Source you won’t be prompted again, it will just make it the input for the other side of the join. Now double click on the Merge Join component and open the editor. 
The first property to set is the Join Type.  Your options are Left Outer Join, Inner Join, or Full Outer Join.  We are going to be using the Left Outer Join option.  Below the Join Type is the columns from our two queries each on the side we specified when hooking up the component to the Merge Join.  The keys are already linked now we need to add the columns we want to return from both sides of the join.  Make sure to both the Hash_CD columns to make comparing them easier later on in the package.
The next step will be to add a conditional split to the package to determine which rows are new rows to be inserted and which rows need to be evaluated for updates.  To do this requires only one output to be created and we’ll call it NewRecords.  Here is the expression you need to add to the editor.
If this doesn’t make immediate sense let me explain.  Since we did a left outer join if there were no matches between the two queries on the UserAK it means that the records without a match are new records and as a result the UserSK would evaluate to NULL.  Now hook up the NewRecords output to your OLEDB Destination, set the connection to the destination table, confirm the mappings and the new records will get inserted correctly.
Now on to the second Conditional Split. Here we need to evaluate if the records that matched on UserAK have had a change in any of the columns that we are tracking changes on.  To do this we will use the Checksum values that we’ve created in our source query.  We need to write 2 SSIS expressions, 1 for each output, to determine if the records have changed.  Here are the expressions:
We then connect the Update output to the OLEDB Command component to update the rows that need to be updated.
Inside the OLEDB Command editor the first tab you are shown is the connection manager tab. Simply connect to the destination database and then select the component properties tab.
At the bottom of this tab, under Custom Properties is the SQLCommand property. Here you will write your update statement. To do this you will have to map all of the columns to be updated in the destination to the columns coming through in the data flow. The query will look like you are mapping them to parameter values (?). Notice that even the where clause is set using the parameter indicator.
UPDATE DimUser
 SET
     [ScreenName] = ?
    ,[AccountCreateDate] = ?
    ,[TimeZone] = ?
    ,[Language] = ?
    ,[GeographySK] = ? WHERE [UserSK] =?
Once the query is complete go to the Column Mapping tab.
No it is just a matter of correctly mapping the parameter values to the columns in your data flow (input columns). Make sure to pay attention to your column order in the update query to map the parameter values to the appropriate input column. Remember that the last parameter value is for the where clause and it is why we brought the UserSK value from the Lookup query to begin with.
When your done the Data flow should look something like this.
I hope this helps some of you looking to try different ways to update SCD1 Dimensions.

II. Slowly Changing Dimension Type 1 with Lookup and Conditional Split
In below post I talked about using the SCD component that comes with SSIS to load a Dimension.  This method is ok for loading small tables (<50 span=""> The Lookup and Conditional Split design pattern performs much better.  The main reason for the enhanced performance is the lookup component.  The lookup component executes its query once and stores the results in cache where as the SCD component queries the destination table for each row that comes from the source.  While there is no easy to configure wizard to setting up this design pattern, it isn’t too difficult to do manually.
With the SCD component we didn’t have to do any additional work in our source query to make comparing the columns quick and easy.  With the Lookup and Conditional Split we need to add a checksum to our query.  Here is the query that includes the checksum:
SELECT
    [UserAK]
    ,[ScreenName]
    ,[AccountCreateDate]
    ,[TimeZone]
    ,[Language]
    ,[GeographySK]
    ,CHECKSUM( 
        [ScreenName]
        ,[AccountCreateDate]
        ,[TimeZone]
        ,[Language]
        ,[GeographySK] 
    ) as Hash_CD
FROM
    DimUserUpdate
Now that we have our source query configured correctly and added the Lookup component and connected the two, lets configure the Lookup.
On the General Tab you can configure the Cache mode, the connection type and the way to handle no matches.  This third setting is very important when configuring the lookup for Slowly Changing Dimensions.  You want to set the option to Redirect Rows to No Match output.  This allows you to insert rows that don’t already exist in you destination table.
On the Connection Tab you obviously set the connection to your database and can either choose to select a table or use a SQL query.  I suggest you always use a SQL query and bring back only the rows you need for the lookup.  Doing this will save space in memory, and if you have a very wide table with millions of rows this could cause your lookup to run slowly or even fail.  In this query I am only bringing back the UserAK (business key) UserSK(surrogate key) and the Checksum value.  With regards to the checksum it is entirely up to you (or the data architect if that isn’t you as well) on whether or not you store the checksum value.  In this example I am not storing the value.  Here is the query:
SELECT
    UserSK
    ,UserAK
    ,CHECKSUM( 
        [ScreenName]
        ,[AccountCreateDate]
        ,[TimeZone]
        ,[Language]
        ,[GeographySK] 
    ) as Hash_CD
FROM dbo.DimUser 
The Columns Tab is pretty easy to configure.  The two boxes on the upper half of the editor window represent the source query (left) and the lookup query (right).  To configure just drag the business key (UserAK) from the source query on to the busness key (UserAK) from the lookup query.  Then put a check mark next to the columns from the Lookup that you want to bring back into the dataflow (UserSK and Hash_CD).  Since the column Hash_CD exists in both the lookup and the source queries make sure to set the Output Alias (LKP_Hash_CD) so it is easy to differentiate between the two.  This is all you need to do to configure the lookup.
From the lookup we should have to outputs:
  1. Lookup Match Output
  2. Lookup No Match Output
The Lookup No Match Output will contain all of our new records.  We can map this output directly to our OLEDB Destination with no further work. 
The Lookup Match Output contains all the records that had matches in the destination table, so the next step will be to determine if the records coming are different from the records that already exists.  To do this we will use the Conditional Split transform.
To configure this transform we will use the two Hash_CD values two create two different outputs from the Conditional Split.  Fist we will configure the NoChange Outupt.  First name the output and then add the SSIS expression to compare the two values.  In this instance we want to send all the matching Hash_CD values to this output.  Here is the expression :
LKP_Hash_CD == Hash_CD
The next output will be the Change Output, and will contain all the records where the Hash_CD values didn’t match. Here is that expression:
LKP_Hash_CD != Hash_CD
That is it for the Conditional Split configuration. Now we need to set up the package to perform the updates.  There are a couple of methods to do this, but we are going to use the OLEDB Command to do the updates.

As you can see we use the Change Output to connect to our OLEDB Command component. 

Inside the OLEDB Command editor the first tab you are shown is the connection manager tab. Simply connect to the destination database and then select the component properties tab.
At the bottom of this tab, under Custom Properties is the SQLCommand property.  Here you will write your update statement.  To do this you will have to map all of the columns to be updated in the destination to the columns coming through in the data flow.  The query will look like you are mapping them to parameter values (?).  Notice that even the where clause is set using the parameter indicator.
UPDATE DimUser
 SET
     [ScreenName] = ?
    ,[AccountCreateDate] = ?
    ,[TimeZone] = ?
    ,[Language] = ?
    ,[GeographySK] = ?
 WHERE [UserSK] =?
Once the query is complete go to the Column Mapping tab.
No it is just a matter of correctly mapping the parameter values to the columns in your data flow (input columns).  Make sure to pay attention to your column order in the update query to map the parameter values to the appropriate input column. Remember that the last parameter value is for the where clause and it is why we brought the UserSK value from the Lookup query to begin with. 
Once all the mapping is done click OK and you are now ready to handle Type 1 changes in your Slowly Changing Dimension.











III. Slowly Changing Dimension Type 1 Changes using SSIS SCD Component
The Slowly Changing Dimension Component included with SSIS is one of the methods you can use to manage slowly changing dimensions, and its pretty easy to use.  The problem with the SCD component is that it performs pretty badly, especially as the number of rows in your table grows.  I would say that any Dimension with over 50,000 records in it would be too big for this component.  For small dimensions it will work just fine.
Once you have configured your source component and placed the SCD component in you design pane and hooked the two components up, double click on the SCD component to open the editor.
The first step is to connect to your destination table.  Then you need to select the column or columns in the destination table that match the key or keys from your source table.  In this instance we are choosing the UserAK column which is the Primary Key in our source Table and the Alternate Key in the destination table.  Once the Key columns are selected click NEXT.
On this screen you tell the wizard which of you non key columns are you going to update.  You do this by selecting one of the options from the dropdown menu under the Change Type column next to each Dimension Column.  There are three options:
  1. Fixed Attributes, which means that the data in these columns won’t ever change, even if a change comes through from the source. 
  2. Changing Attributes which corresponds to a Type 1 change.
  3. Historical Attributes which corresponds to a Type 2 change.
Since that we are only worried about Type 1 changes we are going to select the Changing Attribute option.  Once all the columns are configured as you would like them click next.
Here we will configure how to handle the Fixed and Changing attributes. The first options determines how we want to handle updates that come for fixed attributes, either fail the transform or not. The next options allows you to update columns in historical records as well as the current record that are changing attributes. Once done here click next.
On the next screen you configure how to handle inferred members, for our purposes we will just leave this option disabled. Click Next and Finish.  The wizard will now add both an OLEDB Destination component and OLEDB Command component to handle the inserts and the updates and configure them for you.
Please don’t let the relative ease of setting up the SCD component drive your decision to use it.  If you expect your dimension to get fairly large there are other design patterns that you can use that will produce much better results.