Archive for November, 2008

Happy Thanksgiving from SQLServerPedia!

Thursday, November 27th, 2008

Have a great Thanksgiving and don’t each *too* much turkey.

NonEmpty() and that all-important second parameter

Thursday, November 27th, 2008

Here's a question which comes up all the time - it was asked at Mosha's MDX seminar last week, and a friend of mine asked me about it recently too - what does the NonEmpty function do if you don't specify the second parameter?

Let's take a look at some example queries. I think everyone knows that you can use NON EMPTY before an axis definition to remove all the empty tuples on that axis, as with:

SELECT [Measures].[Internet Sales Amount] ON 0,
NON EMPTY
[Date].[Date].[Date].MEMBERS
ON 1
FROM [Adventure Works]
WHERE([Product].[Subcategory].&[1])

The problem comes when people assume that you can use the NonEmpty() function in the following way to get the same result:

SELECT [Measures].[Internet Sales Amount] ON 0,
NONEMPTY(
[Date].[Date].[Date].MEMBERS
)
ON 1
FROM [Adventure Works]
WHERE([Product].[Subcategory].&[1])

In a lot of cases you might not see any obvious differences between what the two uses return, but if you run the query above you can see a lot of empty rows returned so they clearly aren't the same. So what's happening? If you clear the cache, rerun this second query and then run a Profiler trace you can get a hint:

NEProfiler

Why are the Reseller Sales measure group partitions being hit? Because the Reseller Sales Amount measure is the default measure on the Adventure Works cube, and since we didn't specify a measure in the second parameter for NonEmpty() it's using the default measure to decide which dates have values or not. To fix this we can explicitly tell AS which measure to use:

SELECT [Measures].[Internet Sales Amount] ON 0,
NONEMPTY(
[Date].[Date].[Date].MEMBERS
,[Measures].[Internet Sales Amount])
ON 1
FROM [Adventure Works]
WHERE([Product].[Subcategory].&[1])

The moral here is always, always, always specify a measure in the second parameter for NonEmpty() whenever you use it. If you don't you may get unexpected results back and you may also get poor performance, for example if the default measure comes from a very large measure group.

Oh, and as a bonus tip, don't ever use NonEmptyCrossjoin() with AS2005 or later. It's difficult to use and frankly unpredictable in what it does sometimes; you can always do whatever you want with NonEmpty or Exists (when specifying a measure group in the third parameter) much more reliably and just as fast.

For more information on this topic, have a look at this old-but-good blog post from Mosha:
http://sqljunkies.com/WebLog/mosha/archive/2006/10/09/nonempty_exists_necj.aspx

NonEmpty() and that all-important second parameter

Thursday, November 27th, 2008

Here's a question which comes up all the time - it was asked at Mosha's MDX seminar last week, and a friend of mine asked me about it recently too - what does the NonEmpty function do if you don't specify the second parameter?

Let's take a look at some example queries. I think everyone knows that you can use NON EMPTY before an axis definition to remove all the empty tuples on that axis, as with:

SELECT [Measures].[Internet Sales Amount] ON 0,
NON EMPTY
[Date].[Date].[Date].MEMBERS
ON 1
FROM [Adventure Works]
WHERE([Product].[Subcategory].&[1])

The problem comes when people assume that you can use the NonEmpty() function in the following way to get the same result:

SELECT [Measures].[Internet Sales Amount] ON 0,
NONEMPTY(
[Date].[Date].[Date].MEMBERS
)
ON 1
FROM [Adventure Works]
WHERE([Product].[Subcategory].&[1])

In a lot of cases you might not see any obvious differences between what the two uses return, but if you run the query above you can see a lot of empty rows returned so they clearly aren't the same. So what's happening? If you clear the cache, rerun this second query and then run a Profiler trace you can get a hint:

NEProfiler

Why are the Reseller Sales measure group partitions being hit? Because the Reseller Sales Amount measure is the default measure on the Adventure Works cube, and since we didn't specify a measure in the second parameter for NonEmpty() it's using the default measure to decide which dates have values or not. To fix this we can explicitly tell AS which measure to use:

SELECT [Measures].[Internet Sales Amount] ON 0,
NONEMPTY(
[Date].[Date].[Date].MEMBERS
,[Measures].[Internet Sales Amount])
ON 1
FROM [Adventure Works]
WHERE([Product].[Subcategory].&[1])

The moral here is always, always, always specify a measure in the second parameter for NonEmpty() whenever you use it. If you don't you may get unexpected results back and you may also get poor performance, for example if the default measure comes from a very large measure group.

Oh, and as a bonus tip, don't ever use NonEmptyCrossjoin() with AS2005 or later. It's difficult to use and frankly unpredictable in what it does sometimes; you can always do whatever you want with NonEmpty or Exists (when specifying a measure group in the third parameter) much more reliably and just as fast.

For more information on this topic, have a look at this old-but-good blog post from Mosha:
http://sqljunkies.com/WebLog/mosha/archive/2006/10/09/nonempty_exists_necj.aspx

Congratulations to PASS Board Winners!

Wednesday, November 26th, 2008

Congratulations to the PASS Board winners for 2009/2010:

Our thanks go out to everyone in the community who ran and who voted.  The elections were exciting this year for the amount of passion involved: everybody really wanted to do the best thing for the community at large.

For evidence of that, check out the PASS candidate video.  These folks really put their heart in it.

SQL Server 2008 Upgrade Technical Reference Guide

Tuesday, November 25th, 2008

The new upgrade guide for SQL Server 2008 is out. Please respect the environment and do not print the 7mb 490 page document but be sure to read cover to cover several times. :) I have just cracked it open but the contributors is the who's who of the SQL Server community.


Contributing writers from Microsoft: Arvind Rao, George Huey, Richard Waymire, Siva Harinath, Edward Melomed, Deepika Mistry, Fernando Caro, Goldie Chaudhuri, Max Verun, Vijay Tandra Sistla, Tom Michaels, Justin Erickson, Devendra Tiwari, Jingwei Lu, Fernando Azpeitia Lopez, Ketan Duvedi, Lukasz Pawlowski, David Noor, Matt Masson, Karandeep Anand



Contributing writers from Solid Quality Mentors: Ron Talmage, Aaron Johal, Steven Abraham, Allan Hirt, Herbert Albert, Antonio Soto, Greg Low, Joe Webb, Craig Utley, Dejan Sarka, Larry Barnes, Pablo Ahumada



Technical reviewers from Microsoft: Rebecca Laszlo, Saket Suman, Paul Mestemaker, Vishal Anand, Leo Giakoumakis, Alejandro Hernandez Saenz, Tom Michaels, Bob Ward, Lindsey Allen, Sanjay Mishra, Umachandar Jayachandran, Mike Wachal, Richard Tkachuk, Donald Farmer, Ritu Kothari, Rakesh Parida, Prash Shirolkar, Dave Sell, Craig Guyer, Denny Lee, Peter Scharlock, Yinyin Gao, Rahul Sakdeo, Eliza Tobias, Hajnalka Sarvari



Contributing editors from Solid Quality Mentors: Kathy Blomstrom



Contributing editors from Microsoft: Jen Witsoe, Suzanne Bonney, Megan Bradley, Tresy Kilbourne, Bronwyn McNutt


 

Haha, now when these folks do a vanity google search they will find themselves here. Hopefully, I can some up with something interesting enough to bring them back. :) However, this is like getting a free hardcore book, so read up.

 

 

Free SQL Server training from MSFT!

Tuesday, November 25th, 2008
 
Stay on top of the game by using this free online courses from MSFT. There are plenty of courses to choose.
Locations of visitors to this page

Free SQL Server training from MSFT!

Tuesday, November 25th, 2008
 
Stay on top of the game by using this free online courses from MSFT. There are plenty of courses to choose.
Locations of visitors to this page

All about SQL Server Statistics

Sunday, November 23rd, 2008
 
I thought I had pretty good understanding of SQL Server Statistics but wanted to explore and find out if there was something I missed out and didn't heard of. I did catch few extra points I didn't know of earlier and I am listing most of the links if any one are interested.
 
By Author: Eric N. Hanson and Contributor: Lubor Kollar
 
By Arun Marathe; Revised by Shu Scott
 
By Dan Cherry
 
By Rob Carrol
 
By Conor Cunningham
 
Chris Skorlinski
 
Stefano
 
Kimberly Tripp
 
 
Kevin Kline
 
Wesley B
 
Paul Randal
 
 
By Ian Jose
 
BOL Links
Locations of visitors to this page

All about SQL Server Statistics

Sunday, November 23rd, 2008
 
I thought I had pretty good understanding of SQL Server Statistics but wanted to explore and find out if there was something I missed out and didn't heard of. I did catch few extra points I didn't know of earlier and I am listing most of the links if any one are interested.
 
By Author: Eric N. Hanson and Contributor: Lubor Kollar
 
By Arun Marathe; Revised by Shu Scott
 
By Dan Cherry
 
By Rob Carrol
 
By Conor Cunningham
 
Chris Skorlinski
 
Stefano
 
Kimberly Tripp
 
 
Kevin Kline
 
Wesley B
 
Paul Randal
 
 
By Ian Jose
 
BOL Links
Locations of visitors to this page

SSIS – Slowly Changing Dimensions with Checksum

Saturday, November 22nd, 2008

In the Microsoft Business Intelligence world, most people use SQL Server Integration Services (SSIS) to do their ETL. Usually with your ETL you are building a DataMart or OLAP Database so then you can build a multi-dimensional cube using SQL Server Analysis Services (SSAS) or report on data using SQL Server Reporting Services (SSRS).

When creating the ETL using SSIS, you pull data from one or many sources (usually an OLTP database) and summarize it, put it into a snowflake or star schema in your DataMart. You build you Dimension tables and Fact tables, which you can then build your Dims and Measures in SSAS.

When building and populating your Dimensions in SSIS, you pull data from a source table, and then move it to your dimension table (in the most basic sense).

You can handle situations in different ways (SCD type 1, SCD type 2, etc) - basically, should you update records, or mark them as old and add new records, etc.

The basic problem with dimension loading comes in with grabbing data from the source, checking if you already have it in your destination dimension table, and then either inserting, updating, or ignoring it.

SSIS has a built in transformation, called the “Slowly Changing Dimension” wizard. You give it some source data, you go through some wizard screens, choosing what columns from your source are “business keys” (columns that are the unique key/primary key in your source table) and what other columns you want. You choose if the other columns are changing or historical columns, and then choose what should happen when data is different, update the records directly, or add date columns, etc.

Once you get through the SCD wizard, some things happen behind the scenes, and you see the insert, update transformations are there and things just “work” when you you run your package.

What most people fail to do, is tweak any settings with the automagically created transformations. You should tweak some things depending on your environment, size of source/destination data, etc.

What I have found in my experience though, is that the SCD is good for smaller dimensions (less than 10,000 records). When you get a dimension that grows larger than that, things slow down dramatically. For example, a dimension with 2 million, or 10 million or more records, will run really slow using the SCD wizard - it will take 15-20 minutes to process a 2 million row dimension.

What else can you do besides use the SCD wizard?

I have tried a few other 3rd party SCD “wizard” transformations, and I couldn’t get them to work correctly in my environment, or they didn’t work correctly. What I have found to work the best, the fastest and the most reliable is using a “checksum SCD method”

The checksum method works similarly to the SCD wizard, but you basically do it yourself.

First, you need to get a checksum transformation (you can download here: http://www.sqlis.com/post/Checksum-Transformation.aspx)

The basic layout of your package to populate a dimension will be like this:


What you want to do, is add a column your dimension destination table called “RowChecksum” that is a BIGINT. Allow nulls, you can update them all to 0 by default if you want, etc.

In your “Get Source Data” Source, add the column to the result query. In your “Checksum Transformation”, add the columns that you want your checksum to be created on. What you want to add is all the columns that AREN’T your business keys.

In your “Lookup Transformation”, your query should grab from your destination the checksum column and the business key or keys. Map the business keys as the lookup columns, add the dimension key as a new column on the output, and the existing checksum column as a new column as well.

So you do a lookup, and if the record already exists, its going to come out the valid output (green line) and you should tie that to your conditional transformation. You need to take the error output from the lookup to your insert destination. (Think of it this way, you do a lookup, you couldn’t find a match, so it is an “error condition, the row doesn’t already exist in your destination, so you INSERT it).

On the conditional transformation, you need to do a check if the existing checksum == the new checksum. If they equal, you don’t need to do anything. You can just ignore that output, but I find it useful to use a trash destination transform, so when debugging you can see the row counts.

If the checksums don’t equal, you put that output the OLEDB Command (where you do your update statement).

Make sure in your insert statement, you set the “new checksum” column from the lookup output to the RowChecksum column in your table. In the update, you want to do the update on the record that matches your business keys, and set the RowChecksum to the new checksum value.

One thing you might also run into is this. If your destination table key column isn’t an identity column, you will need to create that new id before your insert transformation. Which probably consists of grabbing the MAX key + 1 in the Control Flow tab, dumping into a variable, and using that in the Data Flow tab. You can use a script component to add 1 each time, or you can get a rownumber transformation as well (Both the trash and rownumber transformations are on the SQLIS website - same as the checksum transformation)>

After getting your new custom slowly changing dimension data flow set up, you should see way better performance, I have seen 2.5 million rows process in around 2 minutes.

One other caveat I have seen, is in the Insert Destination, you probably want to uncheck the “Table Lock” checkbox. I have seen the Insert and Update transformations lock up on each other. Basically what happens is the run into some type of race condition. The cpu on your server will just sit low, but the package will just run forever, never error out, just sit there. It usually happens with dimension tables that are huge.

Like I said earlier, the checksum method is good for large dimension tables. I use the checksum method for all my dimension population, small or large, I find it easier to manage when everything is standardized.

By no means is this post an extensive list of everything that you would run across, but more of a basic overview of how it works with using the checksum method.

In any event, you can get huge performance gains by using the checksum method and not using the SCD Wizard that comes with SSIS, and you will also feel like you have more control of what is going on, because you will. Happy ETL’ing :)