Hi All,
I would like to differentiate the SCD implementation
design patterns with different design advantages/disadvantages in terms of
performance. Enjoy !!
I.
Slowly
Changing Dimension Type 1 with Merge Join.
Right from
the start the design is a bit different. Instead of adding your source query to
an OLEDB Source component and then connecting it to the lookup or the SCD
component, we create 2 OLEDB Source components.
The first one
is the same as in the previous two posts, it connects to our source table
(UserUpdate). However we have to modify the query a little bit to get it
to work with the Merge Join component. So lets take a look at that query.
SELECT
[UserAK]
,[ScreenName]
,[AccountCreateDate]
,[TimeZone]
,[Language]
,[GeographySK]
,CHECKSUM(
[ScreenName]
,[AccountCreateDate]
,[TimeZone]
,[Language]
,[GeographySK]
) as Hash_CD
FROM
DimUserUpdate
Order By
UserAK
The addition
of the ORDER BY clause is essential when using the Merge Join design pattern.
The second
OLEDB Source component will query against our destination table, essentially
replacing the lookup component from the Lookup Conditional Split design
pattern. As in the lookup query we only need to bring back the UserAK,
UserSK and the Hash_CD. But just like the source query above we need to
add the ORDER BY clause because we are going to use the Merge Join.
SELECT [UserSK]
,[UserAK]
,[Hash_CD]
FROM [Demo].[dbo].[DimUser]
Order By
UserAK
Simply using
the ORDER BY clause is not enough for SSIS to know that the datasets you are
intending to enter into the Merge Join is not enough. You have to tell SSIS
that the data is sorted, and on what column the sort is happening. I
agree it’s a little redundant, but that’s how it is. You must do this for
each OLEDB Source. To do this right click on the OLEDB Source and select
Show Advanced Editor from the menu.
Once inside
the advanced editor select the Input and Output Properties Tab along the top of
the window. Then under the Common Properties for the OLEDB Source Output change
the IsSorted property to True.
That takes
care of telling SSIS that the data is sorted, but now we need to tell SSIS what
column it is sorted on. To do that we drill down OLEDB Source Output to
the Output Columns and select the column, UserAK, (or columns if your query is
ordered by more than one column) and change its SortKeyPosition to 1
Now do this
for the second OLEDB source and then we’ll be ready for the Merge Join
component. Please note that if you have multiple column in your order by
clause they must match in both queries, and you must set the SortKeyPosition in
the same order for both components. Once the the source components are
configured drag in the Merge Join component.
Now drag the
first source component output connection to the Merge Join component. Once
you’ve connected this a selection box will appear. It will ask you to
tell it what side of the join this output belongs on. This is important,
because we are basically going to be creating a Left Outer Join inside the
component, so knowing which side of the join each portion belongs to is
obviously essential.
When you add
the second output from the other OLEDB Source you won’t be prompted again, it will
just make it the input for the other side of the join. Now double click on the
Merge Join component and open the editor.
The first
property to set is the Join Type. Your options are Left Outer Join, Inner
Join, or Full Outer Join. We are going to be using the Left Outer Join
option. Below the Join Type is the columns from our two queries each on
the side we specified when hooking up the component to the Merge Join.
The keys are already linked now we need to add the columns we want to return
from both sides of the join. Make sure to both the Hash_CD columns to
make comparing them easier later on in the package.
The next step
will be to add a conditional split to the package to determine which rows are
new rows to be inserted and which rows need to be evaluated for updates.
To do this requires only one output to be created and we’ll call it
NewRecords. Here is the expression you need to add to the editor.
If this
doesn’t make immediate sense let me explain. Since we did a left outer
join if there were no matches between the two queries on the UserAK it means
that the records without a match are new records and as a result the UserSK
would evaluate to NULL. Now hook up the NewRecords output to your OLEDB
Destination, set the connection to the destination table, confirm the mappings
and the new records will get inserted correctly.
Now on to the
second Conditional Split. Here we need to evaluate if the records that matched
on UserAK have had a change in any of the columns that we are tracking changes
on. To do this we will use the Checksum values that we’ve created in our
source query. We need to write 2 SSIS expressions, 1 for each output, to
determine if the records have changed. Here are the expressions:
We then
connect the Update output to the OLEDB Command component to update the rows
that need to be updated.
Inside the
OLEDB Command editor the first tab you are shown is the connection manager tab.
Simply connect to the destination database and then select the component
properties tab.
At the bottom
of this tab, under Custom Properties is the SQLCommand property. Here you will
write your update statement. To do this you will have to map all of the columns
to be updated in the destination to the columns coming through in the data
flow. The query will look like you are mapping them to parameter values (?).
Notice that even the where clause is set using the parameter indicator.
UPDATE DimUser
SET
[ScreenName] = ?
,[AccountCreateDate] =
?
,[TimeZone] = ?
,[Language] = ?
,[GeographySK] = ? WHERE [UserSK] =?
Once the
query is complete go to the Column Mapping tab.
No it is just
a matter of correctly mapping the parameter values to the columns in your data
flow (input columns). Make sure to pay attention to your column order in the
update query to map the parameter values to the appropriate input column.
Remember that the last parameter value is for the where clause and it is why we
brought the UserSK value from the Lookup query to begin with.
When your
done the Data flow should look something like this.
I hope this
helps some of you looking to try different ways to update SCD1 Dimensions.
II. Slowly Changing Dimension Type 1 with Lookup and
Conditional Split
In below post I talked about using the SCD
component that comes with SSIS to load a Dimension. This method is ok for
loading small tables (<50 span=""> The Lookup
and Conditional Split design pattern performs much better. The main
reason for the enhanced performance is the lookup component. The lookup
component executes its query once and stores the results in cache where as the
SCD component queries the destination table for each row that comes from the
source. While there is no easy to configure wizard to setting up this
design pattern, it isn’t too difficult to do manually. 50>
With the SCD
component we didn’t have to do any additional work in our source query to make
comparing the columns quick and easy. With the Lookup and Conditional
Split we need to add a checksum to our query. Here is the query that
includes the checksum:
SELECT
[UserAK]
,[ScreenName]
,[AccountCreateDate]
,[TimeZone]
,[Language]
,[GeographySK]
,CHECKSUM(
[ScreenName]
,[AccountCreateDate]
,[TimeZone]
,[Language]
,[GeographySK]
) as Hash_CD
FROM
DimUserUpdate
Now that we
have our source query configured correctly and added the Lookup component and
connected the two, lets configure the Lookup.
On the
General Tab you can configure the Cache mode, the connection type and the way
to handle no matches. This third setting is very important when
configuring the lookup for Slowly Changing Dimensions. You want to set
the option to Redirect Rows to No Match output. This allows you to insert
rows that don’t already exist in you destination table.
On the
Connection Tab you obviously set the connection to your database and can either
choose to select a table or use a SQL query. I suggest you always use a
SQL query and bring back only the rows you need for the lookup. Doing
this will save space in memory, and if you have a very wide table with millions
of rows this could cause your lookup to run slowly or even fail. In this
query I am only bringing back the UserAK (business key) UserSK(surrogate key)
and the Checksum value. With regards to the checksum it is entirely up to
you (or the data architect if that isn’t you as well) on whether or not you
store the checksum value. In this example I am not storing the
value. Here is the query:
SELECT
UserSK
,UserAK
,CHECKSUM(
[ScreenName]
,[AccountCreateDate]
,[TimeZone]
,[Language]
,[GeographySK]
) as Hash_CD
FROM dbo.DimUser
The Columns
Tab is pretty easy to configure. The two boxes on the upper half of the
editor window represent the source query (left) and the lookup query
(right). To configure just drag the business key (UserAK) from the source
query on to the busness key (UserAK) from the lookup query. Then put a
check mark next to the columns from the Lookup that you want to bring back into
the dataflow (UserSK and Hash_CD). Since the column Hash_CD exists in
both the lookup and the source queries make sure to set the Output Alias (LKP_Hash_CD)
so it is easy to differentiate between the two. This is all you need to
do to configure the lookup.
From the
lookup we should have to outputs:
- Lookup Match Output
- Lookup No Match Output
The Lookup No
Match Output will contain all of our new records. We can map this output
directly to our OLEDB Destination with no further work.
The Lookup
Match Output contains all the records that had matches in the destination
table, so the next step will be to determine if the records coming are different
from the records that already exists. To do this we will use the
Conditional Split transform.
To configure
this transform we will use the two Hash_CD values two create two different
outputs from the Conditional Split. Fist we will configure the NoChange
Outupt. First name the output and then add the SSIS expression to compare
the two values. In this instance we want to send all the matching Hash_CD
values to this output. Here is the expression :
LKP_Hash_CD == Hash_CD
The next
output will be the Change Output, and will contain all the records where the
Hash_CD values didn’t match. Here is that expression:
LKP_Hash_CD != Hash_CD
That is it
for the Conditional Split configuration. Now we need to set up the package to
perform the updates. There are a couple of methods to do this, but we are
going to use the OLEDB Command to do the updates.
As you can
see we use the Change Output to connect to our OLEDB Command component.
Inside the OLEDB Command editor the first tab you are shown is the connection manager tab. Simply connect to the destination database and then select the component properties tab.
At the bottom
of this tab, under Custom Properties is the SQLCommand property. Here you
will write your update statement. To do this you will have to map all of
the columns to be updated in the destination to the columns coming through in
the data flow. The query will look like you are mapping them to parameter
values (?). Notice that even the where clause is set using the parameter
indicator.
UPDATE DimUser
SET
[ScreenName] = ?
,[AccountCreateDate] =
?
,[TimeZone] = ?
,[Language] = ?
,[GeographySK] = ?
WHERE [UserSK]
=?
Once the
query is complete go to the Column Mapping tab.
No it is just
a matter of correctly mapping the parameter values to the columns in your data
flow (input columns). Make sure to pay attention to your column order in
the update query to map the parameter values to the appropriate input column.
Remember that the last parameter value is for the where clause and it is why we
brought the UserSK value from the Lookup query to begin with.
Once all the
mapping is done click OK and you are now ready to handle Type 1 changes in your
Slowly Changing Dimension.
III. Slowly Changing Dimension Type 1 Changes using SSIS
SCD Component
The Slowly
Changing Dimension Component included with SSIS is one of the methods you can
use to manage slowly changing dimensions, and its pretty easy to use. The
problem with the SCD component is that it performs pretty badly, especially as
the number of rows in your table grows. I would say that any Dimension
with over 50,000 records in it would be too big for this component. For small
dimensions it will work just fine.
Once you have
configured your source component and placed the SCD component in you design
pane and hooked the two components up, double click on the SCD component to
open the editor.
The first
step is to connect to your destination table. Then you need to select the
column or columns in the destination table that match the key or keys from your
source table. In this instance we are choosing the UserAK column which is
the Primary Key in our source Table and the Alternate Key in the destination
table. Once
the Key columns are selected click NEXT.
On this
screen you tell the wizard which of you non key columns are you going to
update. You do this by selecting one of the options from the dropdown
menu under the Change Type column next to each Dimension Column. There are three
options:
- Fixed Attributes,
which means that the data in these columns won’t ever change, even if a
change comes through from the source.
- Changing Attributes
which corresponds to a Type 1 change.
- Historical Attributes
which corresponds to a Type 2 change.
Since that we
are only worried about Type 1 changes we are going to select the Changing
Attribute option. Once all the columns are configured as you would like
them click next.
Here we will
configure how to handle the Fixed and Changing attributes. The first options
determines how we want to handle updates that come for fixed attributes, either
fail the transform or not. The next options allows you to update columns in
historical records as well as the current record that are changing attributes. Once done here
click next.
On the next
screen you configure how to handle inferred members, for our purposes we will
just leave this option disabled. Click Next and Finish. The wizard will
now add both an OLEDB Destination component and OLEDB Command component to
handle the inserts and the updates and configure them for you.
Please don’t let the relative ease of setting up
the SCD component drive your decision to use it. If you expect your
dimension to get fairly large there are other design patterns that you can use
that will produce much better results.