SSIS - Using Checksum to Load Data into Slowly Changing Dimensions
From SQLServerPedia
|
Using Checksum to Load Data into Slowly Changing Dimensions SQL Server Integration Services (SSIS) offers a wizard to help configure the loading of data into slowly changing dimensions. But you can realize significant performance gains by using the checksum method instead of the wizard, and gain more control as well. This article explains how.
Introduction to SSISIn the Microsoft business intelligence world, most people use SSIS to extract, transform and load (ETL). Usually ETL is used to populate a DataMart or OLAP database which is then used as a source to build a multi-dimensional cube using SQL Server Analysis Services (SSAS) or to report on data using SQL Server Reporting Services (SSRS). To create an ETL process using SSIS, you pull data from one or many sources (usually an OLTP database) and summarize it. Then you put it into a snowflake or star schema in your DataMart. You build your Dimension tables and Fact tables, and you can then build your Dims and Measures in SSAS. When building and populating your Dimensions in SSIS, you pull data from a source table, and then move it to your dimension table (in the most basic sense). You can handle situations in different ways (SCD type 1, SCD type 2, etc.); for instance, you can update records or mark them as old and add new records. The basic difficulty with dimension loading occurs when you grab data from the source, check whether you already have it in your destination dimension table, and then either insert, update or ignore it. The Slowly Changing Dimension (SCD) WizardSSIS provides a built-in transformation, the Slowly Changing Dimension (SCD) Wizard, to help configure the loading of data into slowly changing dimensions. The wizard enables you to choose what columns from your source are “business keys” (columns that are the unique key or primary key in your source table), along with the other columns you want and whether they change or are historical columns. Then you choose what should happen when data is different, such as to update the records directly or add date columns. When you complete the SCD wizard, some things happen behind the scenes. You can see that the INSERT, UPDATE transformations are there and they just “work” when you run your package. Unfortunately, most people fail to check the settings of the transformations that the SCD Wizard creates. You should tweak some settings, depending on your environment, the size of you source and destination data, etc. Moreover, experience teaches that the SCD Wizard is good for smaller dimensions (fewer than 10,000 records). For larger dimensions, the wizard slows down dramatically. For example, the SCD wizard will take 15-20 minutes to process a dimension with two million rows. Checksum: An Alternative to the SCD WizardFortunately, you have alternatives to using the SCD Wizard. The fastest and the most reliable option is to use the checksum SCD method. The checksum method works much like the SCD wizard, but you do it yourself. See the following steps to do so.
First, you need to get a checksum transformation. You can download one here: http://www.sqlis.com/post/Checksum-Transformation.aspx) The basic layout of your package to populate a dimension will look like this:
Next, add a BIGINT column called RowChecksum to your dimension destination table. You should allow nulls, and set the default value of the column to 0. In your Get Source Data source, add the RowChecksum column to the result query. In your Checksum Transformation, add the columns that you want your checksum to be created on. Add all the columns that are ‘’’not’’’ your business keys.
In your lookup transformation, your query should grab from your destination—the RowChecksum column and the business keys. You should map the business keys as the lookup columns. Then add the dimension key as a new column on the output, and add the existing checksum column as a new column as well.
If you do a lookup and the record already exists, it’s going to come out the valid output (the green line in the graphic), which you should connect to your conditional transformation. You will need to take the error output from the lookup to your insert destination. You can think of it this way. When you do a lookup and you can’t find a match, then it becomes an error condition. Since the row doesn’t already exist in your destination, you INSERT it. On the conditional transformation, you need to check whether the existing checksum is equal to the new checksum.
In your INSERT statement, you should make sure that you set the new checksum column from the lookup output to the RowChecksum column in your table. In the UPDATE statement, you want to perform the UPDATE on the record that matches your business keys, and set RowChecksum to the new checksum value.
If your destination table key column isn’t an identity column, you need to create a new ID before your INSERT transformation. You can probably grab the MAX key + 1 in the Control Flow tab, dump it into a variable, and use that variable in the Data Flow tab. You can use a script component to add one each time, or you can get a rownumber transformation as well. Note: Both the trash and rownumber transformations are on the SQLIS website, along with the checksum transformation.
You will probably want to uncheck the Table Lock checkbox in the Insert Destination. It is not uncommon to have the INSERT and UPDATE transformations lock up on each other. Basically what happens is that they run into a race condition: the CPU on your server will remain at a low level, but the package will run forever and never error out. This usually happens with dimension tables that are extremely large.
After you have set up your custom slowly changing dimension data flow, you should see much better performance. For example, it is not uncommon to see 2.5 million rows process in about two minutes. SummaryThe checksum method is good for large dimension tables. A rule of thumb is to use the checksum method for all dimension populations, small or large. It should be easier to manage when everything is standardized. By no means is this post an extensive list of everything that you would run across; it is a basic overview working with the checksum method. By using the checksum method instead of the SCD Wizard that comes with SSIS, you can get huge performance gains, and you will also have more control of what is going on. Author CreditsThis wiki article was adapted from a blog post by Steve Novoselac. Steve is a Microsoft development enthusiast that started as an ASP developer, then .NET, all along the way using SQL 2000/2005 and related technologies. He is currently the Business Intelligence manager at Trek Bicycle Corporation, in Waterloo, WI. Steve love's SQL Server and the BI tools (SSAS, SSRS, SSIS), and learning and sharing things from these areas. His blog postings will cover mostly topics about the Microsoft BI tools from a BI developer perspective, but also delve into topics and discussion around these tools that cover the end-user and SQL Server DBA perspectives. His online presences include:
Related ReadingFor more information about Slowly Changing Dimensions, check out these articles:
|
