Archive for the ‘SQL Server 2005’ Category

Replication Gotcha – Including An Article In Multiple Publications

Tuesday, August 31st, 2010

When administering replication topologies it's common to group articles into publications based on roles that subscribers fulfill. Often you'll have multiple subscriber roles and therefore multiple publications, and in some cases a subset of articles are common between them. There's nothing to prevent you from adding the same article to more than one publication but I wanted to point out how this can potentially lead to major performance problems with replication.

Let's start with a sample table:

CREATE TABLE [dbo].[ReplDemo]
   
(
     
[ReplDemoID] [int] IDENTITY(1, 1) NOT FOR REPLICATION
                        
NOT NULL ,
     
[SomeValue] [varchar](20) NOT NULL ,
     
CONSTRAINT [PK_ReplDemo] PRIMARY KEY CLUSTERED ( [ReplDemoID] ASC )
       
ON [PRIMARY]
   
)
ON  [PRIMARY]
GO

Now let's pretend that we need this table replicated to two subscribers which have different roles. We'll create one publication for each role and add the table to both publications:

-- Adding the transactional publication
EXEC sp_addpublication @publication = N'ReplDemo Publication A',
   
@description = N'Publication to demonstrate behavior when same article is in multiple publications',
   
@sync_method = N'concurrent', @retention = 0, @allow_push = N'true',
   
@allow_pull = N'true', @allow_anonymous = N'false',
   
@enabled_for_internet = N'false', @snapshot_in_defaultfolder = N'true',
   
@compress_snapshot = N'false', @ftp_port = 21, @ftp_login = N'anonymous',
   
@allow_subscription_copy = N'false', @add_to_active_directory = N'false',
   
@repl_freq = N'continuous', @status = N'active',
   
@independent_agent = N'true', @immediate_sync = N'false',
   
@allow_sync_tran = N'false', @autogen_sync_procs = N'false',
   
@allow_queued_tran = N'false', @allow_dts = N'false', @replicate_ddl = 1,
   
@allow_initialize_from_backup = N'false', @enabled_for_p2p = N'false',
   
@enabled_for_het_sub = N'false'
GO
EXEC sp_addpublication_snapshot @publication = N'ReplDemo Publication A',
   
@frequency_type = 1, @frequency_interval = 0,
   
@frequency_relative_interval = 0, @frequency_recurrence_factor = 0,
   
@frequency_subday = 0, @frequency_subday_interval = 0,
   
@active_start_time_of_day = 0, @active_end_time_of_day = 235959,
   
@active_start_date = 0, @active_end_date = 0, @job_login = NULL,
   
@job_password = NULL, @publisher_security_mode = 1
GO
-- Adding the transactional articles
EXEC sp_addarticle @publication = N'ReplDemo Publication A',
   
@article = N'ReplDemo', @source_owner = N'dbo',
   
@source_object = N'ReplDemo', @type = N'logbased', @description = N'',
   
@creation_script = N'', @pre_creation_cmd = N'drop',
   
@schema_option = 0x00000000080350DF,
   
@identityrangemanagementoption = N'manual',
   
@destination_table = N'ReplDemo', @destination_owner = N'dbo', @status = 8,
   
@vertical_partition = N'false',
   
@ins_cmd = N'CALL [dbo].[sp_MSins_dboReplDemo]',
   
@del_cmd = N'CALL [dbo].[sp_MSdel_dboReplDemo]',
   
@upd_cmd = N'SCALL [dbo].[sp_MSupd_dboReplDemo]'
GO

-- Adding the transactional publication
EXEC sp_addpublication @publication = N'ReplDemo Publication B',
   
@description = N'Publication to demonstrate behavior when same article is in multiple publications',
   
@sync_method = N'concurrent', @retention = 0, @allow_push = N'true',
   
@allow_pull = N'true', @allow_anonymous = N'false',
   
@enabled_for_internet = N'false', @snapshot_in_defaultfolder = N'true',
   
@compress_snapshot = N'false', @ftp_port = 21, @ftp_login = N'anonymous',
   
@allow_subscription_copy = N'false', @add_to_active_directory = N'false',
   
@repl_freq = N'continuous', @status = N'active',
   
@independent_agent = N'true', @immediate_sync = N'false',
   
@allow_sync_tran = N'false', @autogen_sync_procs = N'false',
   
@allow_queued_tran = N'false', @allow_dts = N'false', @replicate_ddl = 1,
   
@allow_initialize_from_backup = N'false', @enabled_for_p2p = N'false',
   
@enabled_for_het_sub = N'false'
GO
EXEC sp_addpublication_snapshot @publication = N'ReplDemo Publication B',
   
@frequency_type = 1, @frequency_interval = 0,
   
@frequency_relative_interval = 0, @frequency_recurrence_factor = 0,
   
@frequency_subday = 0, @frequency_subday_interval = 0,
   
@active_start_time_of_day = 0, @active_end_time_of_day = 235959,
   
@active_start_date = 0, @active_end_date = 0, @job_login = NULL,
   
@job_password = NULL, @publisher_security_mode = 1
GO
-- Adding the transactional articles
EXEC sp_addarticle @publication = N'ReplDemo Publication B',
   
@article = N'ReplDemo', @source_owner = N'dbo',
   
@source_object = N'ReplDemo', @type = N'logbased', @description = N'',
   
@creation_script = N'', @pre_creation_cmd = N'drop',
   
@schema_option = 0x00000000080350DF,
   
@identityrangemanagementoption = N'manual',
   
@destination_table = N'ReplDemo', @destination_owner = N'dbo', @status = 8,
   
@vertical_partition = N'false',
   
@ins_cmd = N'CALL [dbo].[sp_MSins_dboReplDemo]',
   
@del_cmd = N'CALL [dbo].[sp_MSdel_dboReplDemo]',
   
@upd_cmd = N'SCALL [dbo].[sp_MSupd_dboReplDemo]'
GO

After creating the publications we create our subscriptions, take & apply the snapshot, and we're ready to start making changes so we execute this simple insert statement:

INSERT  INTO dbo.ReplDemo
       
( SomeValue )
VALUES  ( 'Test' )

Here's the million dollar question: How many times does this insert statement get added to the distribution database? To find out we'll run the following statement on the distributor (after the log reader agent has done it's work, of course):

SELECT  MSrepl_commands.xact_seqno ,
       
MSrepl_commands.article_id ,
       
MSrepl_commands.command_id ,
       
MSsubscriptions.subscriber_id
FROM    distribution.dbo.MSrepl_commands AS [MSrepl_commands]
       
INNER JOIN distribution.dbo.MSsubscriptions AS [MSsubscriptions] ON MSrepl_commands.publisher_database_id = MSsubscriptions.publisher_database_id
                                                             
AND MSrepl_commands.article_id = MSsubscriptions.article_id
       
INNER JOIN distribution.dbo.MSarticles AS [MSarticles] ON MSsubscriptions.publisher_id = MSarticles.publisher_id
                                                             
AND MSsubscriptions.publication_id = MSarticles.publication_id
                                                             
AND MSsubscriptions.article_id = MSarticles.article_id
WHERE   MSarticles.article = 'ReplDemo'
ORDER BY MSrepl_commands.xact_seqno ,
       
MSrepl_commands.article_id ,
       
MSrepl_commands.command_id

Here's the output of the statement:

Query Results

That's one row for each publication the table article is included in. Now imagine that an update statement affects 100,000 rows in the table. In this example that would turn into 200,000 rows that will be inserted into the distribution database and need to be cleaned up at a later date. It's not hard to see how this could lead to performance problems for tables that see a high volume of insert\update\delete activity.

Workarounds
Two workarounds for this behavior come to mind:

  1. Modify data using stored procedures, then replicate both their schema and execution. This won't help for insert statements and is useless if you're only updating\deleting a single row each time the procedure executes. This also assumes that all dependencies necessary for the stored procedure(s) to execute exist at the subscriber
  2. Limit table articles to one publication per article. If you're creating publications from scratch then place table articles that would otherwise be included in multiple publications into their own distinct publication. If you're working with existing publications that already include the table article then subscribe only to the article(s) that you need rather than adding the article to another publication. (Subscribing to individual articles within a publication can get tricky - I'll demonstrate how to do this in a future post)

Find Tables and Columns by Data Type

Tuesday, August 24th, 2010

old yellow eyesA couple weeks back someone asked me some questions about data types.  Apparently, while implementing some financial data, each developer on the project had chosen their own data type.  This was a bit of a problem that can lead to serious problems.  When financial data is rounded unexpectedly – bad things can sometimes happen and it usually doesn’t involve plots for half pennies.

Anyways, I advised that it might be a good idea to get all of the data types on these columns aligned. To do this I provided them with a couple scripts that are below that they used to find the occurrences of this information.  These scripts both use the sys.columns and sys.types catalog views.

This first script, pulls out all of the columns that are in the database using decimal, numeric, or money data type.  It includes a filter where scale is not equal to zero – which should have included all of the financial data.

SELECT OBJECT_NAME(c.object_id) as table_name
    , c.name
    , t.name
    , c.precision
    , c.scale
FROM sys.columns c
    INNER JOIN sys.types t ON c.user_type_id = t.user_type_id
WHERE t.name IN ('decimal','numeric','money')
AND c.scale <> 0
ORDER BY 1, 2

The second script lists all of the columns of a specific name with their table and data type information.

SELECT OBJECT_NAME(c.object_id) as table_name
    , c.name
    , t.name
    , c.precision
    , c.scale
FROM sys.columns c
    INNER JOIN sys.types t ON c.user_type_id = t.user_type_id
WHERE c.name = 'ListPrice'
ORDER BY 1, 2

Through the use of these and similar scripts, columns and the tables that they are in can be easily determined.  And this information can be found using either the data type or the name of the column.

Related posts:

  1. Index Those Foreign Keys
  2. Find Tables with Forwarded Records
  3. Webcast Next Week – Performance Impacts Related to Different Function Types

Replication Gotcha: Blank XML

Tuesday, August 24th, 2010

Transactional replication in SQL Server 2005\2008 can handle the XML datatype just fine with few exceptions - one in particular being when the XML value is blank. I'll save the argument about whether or not a blank (or empty string if you prefer) value is well formed XML for another day because the point is that SQL Server allows it. Consider the following table:

CREATE TABLE [dbo].[XMLReplTest]
   
(
     
[XMLReplTestID] [int] IDENTITY(1, 1) NOT FOR REPLICATION
                           
NOT NULL ,
     
[SomeXML] [xml] NOT NULL ,
     
CONSTRAINT [PK_XMLReplTest] PRIMARY KEY CLUSTERED
       
( [XMLReplTestID] ASC ) ON [PRIMARY]
   
)
ON  [PRIMARY]
GO

Execute the following statement and you'll see that SQL Server handles it just fine:

INSERT  INTO dbo.XMLReplTest
       
( SomeXML )
VALUES  ( '' )

Now let's add this table to a transactional replication publication:

-- Adding the transactional publication
EXEC sp_addpublication @publication = N'XML Replication Test',
   
@description = N'Sample publication to demonstrate blank XML gotcha',
   
@sync_method = N'concurrent', @retention = 0, @allow_push = N'true',
   
@allow_pull = N'true', @allow_anonymous = N'false',
   
@enabled_for_internet = N'false', @snapshot_in_defaultfolder = N'true',
   
@compress_snapshot = N'false', @ftp_port = 21, @ftp_login = N'anonymous',
   
@allow_subscription_copy = N'false', @add_to_active_directory = N'false',
   
@repl_freq = N'continuous', @status = N'active',
   
@independent_agent = N'true', @immediate_sync = N'false',
   
@allow_sync_tran = N'false', @autogen_sync_procs = N'false',
   
@allow_queued_tran = N'false', @allow_dts = N'false', @replicate_ddl = 1,
   
@allow_initialize_from_backup = N'false', @enabled_for_p2p = N'false',
   
@enabled_for_het_sub = N'false'
GO
EXEC sp_addpublication_snapshot @publication = N'XML Replication Test',
   
@frequency_type = 1, @frequency_interval = 0,
   
@frequency_relative_interval = 0, @frequency_recurrence_factor = 0,
   
@frequency_subday = 0, @frequency_subday_interval = 0,
   
@active_start_time_of_day = 0, @active_end_time_of_day = 235959,
   
@active_start_date = 0, @active_end_date = 0, @job_login = NULL,
   
@job_password = NULL, @publisher_security_mode = 1
GO

-- Adding the transactional articles
EXEC sp_addarticle @publication = N'XML Replication Test',
   
@article = N'XMLReplTest', @source_owner = N'dbo',
   
@source_object = N'XMLReplTest', @type = N'logbased', @description = N'',
   
@creation_script = N'', @pre_creation_cmd = N'drop',
   
@schema_option = 0x00000000080350DF,
   
@identityrangemanagementoption = N'manual',
   
@destination_table = N'XMLReplTest', @destination_owner = N'dbo',
   
@status = 8, @vertical_partition = N'false',
   
@ins_cmd = N'CALL [dbo].[sp_MSins_dboXMLReplTest]',
   
@del_cmd = N'CALL [dbo].[sp_MSdel_dboXMLReplTest]',
   
@upd_cmd = N'SCALL [dbo].[sp_MSupd_dboXMLReplTest]'
GO

Assume we've created the publication, added a subscriber, taken & applied the snapshot, and we're ready to start changing data. Let's throw a monkey wrench into the works by executing the insert statement with the blank XML again and watch what happens to the log reader agent:

Log Reader Agent Error

That's not a very nice error (or resolution)! I've been able to reproduce this behavior in SQL 2005 & 2008 but I have not tried it in 2008 R2. I've entered a Connect bug report so hopefully this is fixed in a forthcoming cumulative update. In the meantime there is a simple workaround - add a check constraint. Since we're working with the XML datatype the only option for checking length with a scalar function is DATALENGTH. The DATALENGTH for a blank xml value is 5 so we want to check that any inserted or updated value is greater than 5:

ALTER TABLE dbo.XMLReplTest ADD CONSTRAINT
  
CK_XMLReplTest_SomeXML CHECK (DATALENGTH(SomeXML) > 5)
GO

If you are affected by this behavior please consider taking a moment to go vote for it on Connect.

Video: Disk Tuning and Optimization for SQL Server

Thursday, July 29th, 2010

This video focuses on accelerated Disk Optimization and improved productivity with SQL Query Tuning.

Video created Dec. 18, 2006.

Free one day training on Storage and Virtualization with Mr.Denny

Sunday, July 25th, 2010

Free training is good, if it is on a special topic important for every DBA then better, if it is coming from an industry expert on the subject then much better, if it is an all day event then excellent, if it is from Mr.Denny then excellent * 2, if you can get all of the above and get cool swag (xbox 360) then excellent * 3.

Why wait? Register for this session on Storage and Virtualization for the DBA session by Mr.Denny at Microsoft Office in Irvine, CA on Aug 9th 2010.

Yes, the event is on a monday but you will get good training without pluff on the topic. And its about time to burn that vacation day and use it wisely to learn something about the topic if you are around the area. Talk to your manager/supervisor to register for this event as the seats are limited and will fill up as first come first served basis.

Note that this event is priced at $395 at PASS conference and you are getting this for FREE! It’s a win, win!

Going into 2011, I want to focus on Storage and High availability side in SQL Server and its a good opportunity for me to attend this session and its worth taking a 7 hour drive. :-)

Digg This  Reddit This  Stumble Now!  Buzz This  Vote on DZone  Share on Facebook  Bookmark this on Delicious  Kick It on DotNetKicks.com  Shout it  Share on LinkedIn  Bookmark this on Technorati  Post on Twitter  Google Buzz (aka. Google Reader)  

SQL University: Introduction to Indexes, Part the Third

Friday, July 23rd, 2010

Nice to see most of you have managed to fight your way through the shoggoths outside to attend another lecture at the Miskatonic branch of SQL University. This will be the third and final part of the introduction to indexes lecture. Please, if you’re going mad, step out into the hall. Our previous two lectures introduced the concept of indexes and then talked about two types of indexes, clustered and nonclustered. This lecture will cover the concept of statistics as they relate to indexes.

If you followed the previous lecture then you know that indexes are stored in a Balanced Tree or B-Tree structure. You know that this storage mechanism is intended to provide fast retrieval of data. But, how can the query engine inside SQL Server know which index has the information it wants? How can the query engine know that a given index will provide a mechanism to speed up the query that is being run? Statistics.

Statistics are information generated defining the selectivity of a column or columns in the database. They may or may not be part of an index. Columns alone can have statistics and indexes can have statistics. The query optimizer uses statistics to determine how it will access data. You may have the greatest index in the world, but if your statistics are wrong, that index won’t get used. By default statistics are automatically created and automatically updated. The automatic update of statistics is based on a formula that depends on changes in the data and changes in structure, as outlined in this Microsoft documentation. In general, it’s a very good practice to leave this alone because SQL Server does a great job of maintaining statistics.

But what are statistics really?

Statistics are basically a sampling of data stored in such a way as to make it possible for the optimizer to understand what the data looks like without actually accessing it. It is, absolutely, meta-data, data about data. The sampling of the data is, usually, a scan across the data, dropping in every certain number of rows as a percentage of the data in the table, but it can be a full scan, where it looks at all the data in the table. This meta data that is collected is broken down into several pieces. Statistics are represented through:

  • The histogram, which provides a picture of the data in a column
  • The header, which provides information about the statistics or meta-data about meta-data
  • Density information, which stores the distribution of the data across the columns being tracked by the statistics

In short, a lot of information. To start understanding what this all means, let’s see how to retrieve a set of statistics:

DBCC SHOW_STATISTICS ("Sales.SalesOrderDetail", IX_SalesOrderDetail_ProductID);

This query uses tables stored in the AdventureWorks2008R2 to display statistics information. Just a side note, so many DBCC functions have been replaced by dynamic management views that I honestly expected one for statistics too, but it’s not there. Running the query results in the following set of information:

The first result set at the top is the header. Most of the information you see there should make sense to you based on what we’ve been talking about. A couple of interesting points worth noting are the “Rows Sampled” column which shows how many of the rows were looked at while the stats were being built, Steps, which gives you an indication of the size of the histogram, and Density, which shows the selectivity of the statistics, but that column is not the one used by the optimizer. The main thing I use header information for is to see when the statistics were last updated.

The second set of data shows the density of the columns used to define the index. You can see that each column is measured as it gets added to the index. This shows you how a compound index makes the data more and more selective. If those numbers are at all big, as opposed to vanishingly small like those presented, you have a selectivity isssue with your index. The less selective the index is, the less likely it will be used.

Finally, the last data set is the histogram. What I have displayed is actually just the first seven of two hundred rows. A histogram is basically a small snapshot of the data in your index. Specifically, the data in the first column of your index. That’s worth knowing. If you choose to, for some crazy reason, build an index with a bit column as the leading edge (that means the first column) and you’re surpised that your index isn’t getting used, this is where you can go to understand why. The histogram is one of the main tools that SQL Server uses to determine which operations it will use to retrieve your data. Here’s how it works. First you get the RANGE_HI_KEY which shows the upper bound column value, the top of the range of rows, that the this entry in  the histogram represents. Then you see the RANGE_ROWS that displays the number of rows within the range represented by the this entry in the histogram (except, for this entry, there is no range, it’s a single value). Next is the EQ_ROWS that tells you how many rows are equal to the value represented by the RANGE_HI_KEY. That number is large here because there is no range of rows, this value is represented by approximately 2838.166 rows inside the data. Following that is DISTINCT_RANGE_ROWS which shows you the distinct number of rows within each stepped range of values. In this case, again, because we’re not dealing with a range, that number is zero. Finally, you can see the AVG_RANGE_ROWS displaying the average number of duplicate values within the range.

Scrolling down within that set of data you can see what a range of values looks like:

In this case the range covers two values with the upper limit being 805 and the previous upper limit being 801.

In short, you can see what the optimizer sees and determine if you should have more up to date statistics. For example, if I query the table and retrieve ProductID values that are between 802 and 805, which would represent step 72 in the histogram, I get 246 rows, not 442, which is what I should see. Since the statistics are a little old on the table, they haven’t been updated since June 23rd, I’ll update them. You can use sp_updatestats, which will sample the data and generate a histogram with some guesses, or you can do this:

UPDATE STATISTICS Sales.SalesOrderDetail WITH FULLSCAN

Now when I run DBCC SHOW_STATISTICS, the header information shows that the rows sampled equal the rows in the table. My new histogram has almost exactly the same distribution, but the row counts are incredibly accurate. In my case the number of RANGE_ROWS is equal to 200 and the number of EQ_ROWS is equal to 46. Which equals the 246 rows that we got from querying the data directly. That represents a very accurate set of statistics.

So how do you know when to update your statistics? It’s really hard to say. If you start seeing different execution plans than you did previously, or if you note that the date on the statistics seems old, you might want to update stats. How do you determine if you should use a full scan or sampled statistics? Again, hard to say. In most circumstances a sampled set of statistics should work, but in some areas, if the data distribution is off, you may want to run a full scan.

There is clearly more and more to discuss about statistics and how to maintain them. There is a lot more to discuss about indexes. I didn’t even get to index fragmentation. Today you should remember:

  • Statistics are used to determine which indexes work well for a query
  • Statistics are composed of multiples sets of data, the header, density information, and the histogram
  • Histograms have a maximum of 200 steps
  • Histograms show the data distribution
  • Despite automatic maintenance, statistics can be out of date
  • The sampled update of statistics doesn’t always create the best set of data

For more information on statistics, I recommend Microsoft’s Books On Line for SQL Server. It’s a great resource.

But I hear the whipoorwills kicking up and I see a funny looking fog developing outside. It’s getting awfully dark for mid-day. One of the other professors stepped on some squealing white… thing… on his way into today and a one of the TA’s went stark raving mad at something they saw in the rectory. Time to end class. Best of luck on your journey back to your dorms. I’m locking myself in my office.

While I’m there. I might churn out one or two more entries on indexes. Look for them in the SQL University E-Book (no doubt to be published posthumously)


Webcast Next Week – Using XML to Query Execution Plans

Wednesday, July 21st, 2010

I’ll be speaking for the SQL PASS Database Administration Virtual Chapter next week.  It’ll be on July 28 at 12 PM Eastern time.  The topic will be Using XML to Query Execution Plans.  The abstract for the event is the following:

SQL Server stores its execution plans as XML in dynamic management views. The execution plans are a gold mine of information. From the whether or not the execution plan will rely on parallelism to what columns are requiring a key lookup after a non-clustered index seek. Through a the use of XML this information can be available at your fingertips to help determine the value and impact of an index and guide you in improving the performance of your SQL Server databases. In this session we’ll look at how you can begin to understand and query the structure of the execution plans in the procedure cache. Also, we’ll review how to uncover some potential performance issues that may be lurking in your SQL Server.

If you register for the event ahead of time you will be entered into a drawing brought to us from CA, our event sponsor.  You can get to the event here.

Related posts:

  1. Webcast Today – Using XML to Query Execution Plans
  2. Webcast Next Week – Performance Impacts Related to Different Function Types
  3. Webcast Today – Performance Impacts Related to Different Function Types

SQL University: Introduction to Indexes, Part the Second

Wednesday, July 21st, 2010

Welcome once more to the Miskatonic branch of SQL University. Please try to concentrate. I realize the whipoorwills singing outside the window in a coordinated fashion that sounds almost like laboured breathing can be distracting, but we’re talking about indexes here.

We left last class with a general idea what an index is, now it’s time for some specifics. There are several different kinds of indexes, as we talked about last class. But the two you’re probably going to work with the most are clustered, non-clustered. Each of these indexes is stored in a structure called a B-Tree, a balanced tree, not a binary tree. That’s a very important distinction.

A B-Tree is a double-linked list that is defined by the keys of the indexes on the top and intermediate pages, and at the leaf level by the data itself in the case of clustered indexes. Some of you no doubt think I’m quoting from De Vermis Mysteriis. Basically, for our purposes, a B-Tree consists of a series of pages. There is a top page, or root page, that defines the beginning of the index key. It points to a series of intermediate pages. Each intermediate page contains a range, a previous and a next value. These all point to each other, hence, double linked. The idea is that SQL Server can quickly identify which intermediate page has the pointers down to the leaf node, the final node in the stack. The values of these pointers are defined by the key of the index, the column or columns that you define when you create the index. There are always at least two levels, leaf & root, but there can be more, depending on the amount of data and the size of the keys. Just remember, the size of the key, which refers both to the data types in the key and the number of columns, determines how many key values can get on a page, the more key values on a page, the faster access will be, the fewer key values, the more pages that have to be read, and therefore, the slower the performance.

In general the purpose is to be able to quickly navigate to a leaf or set of leaf pages. When a B-Tree is used and the query engine is able to navigate quickly down to the leaf needed, that is an index seek. But when the B-Tree has to be moved through, in whole or in part, scanning for the values, you’re looking at an index scan. Obviously, in most cases, a seek will be faster than a scan becuase it’s going to be accessing fewer pages to get to the leaf needed to satsify the query. Just remember, that’s not always true.

Let’s get on to the indexes. It’s already been mentioned, but it bears repeating, the principle difference between a clustered and non-clustered index is what is at the leaf level. In a non-clustered index, it’s simply the key values and an values added through the use of the INCLUDE option along with a lookup value to either the clustered index key or an identifier within a table. In a clustered index, the data is stored down at the leaf. This is why people will frequently refer to a clustered index as being “better” than a non-clustered index, because you’re always going directly to the data when you’re looking information up within a clustered index. But, as with the scans vs. seek argument, this is not always true either.

I mentioned that a non-clustered index refers back to the clustered index, if there is one on the table. Because the data is stored at the leaf level of the clustered index, when you need to retreive other columns after performing a seek on a non-clustered index, you must go and get those columns from the clustered index. This is known as a key lookup, or in older parlance, a bookmark lookup. This operation is necessary when data not supplied by the non-clustered index, but can be very expensive because you’ve just added extra reads to your query.

What if there isn’t a clustered index on the table? What does the non-clustered index use to find other columns? If the table doesn’t have a clustered index, then that table is referred to as a heap. It’s called a heap because the data is simply stored in a pile, with no logical or physical ordering whatsoever. With a heap, SQL Server takes it on itself to identify the leaf level storage and creates a row id value for all the rows in the table. This row id can be used by the non-clustered index to find the data. That is referred to by the completely arcane and incomprehensible term, row id lookup. You might be thinking, hey, that means I don’t have to create a clustered index because SQL Server will create one for me. You’d be wrong. Maintaining the row id is an expensive operation  and it doesn’t help in retrieving the data in an efficient manner. It’s just necessary for SQL Server to get the data back at all. In general, this is something to be avoided.

A non-clustered index doesn’t necessarily have to perform a lookup. If all the columns referred to in a query are stored within a non-clustered index, either as part of the key or as INCLUDE columns at the leaf, it’s possible to get what is called a “covering” query. This is a query where no lookup is needed. Indexes that can provide a covering query everything it needs are referred to as covering indexes. A covering query is frequently one of the fastest ways to get at data. This is because, again, depending on the size of the keys and any INCLUDE columns, a non-clustered index will have more information stored on the page than a clustered index will and so fewer pages will have to be read, making the operation faster.

By and large, a good guideline is to put a clustered index on all tables. SQL Server works extremely well with clustered indexes, and it provides you with a good access mechanism to your data. If you don’t put a clustered index on the table, SQL Server will create and maintain a row ID anyway, but as I said before, this doesn’t save much work on the server and it doesn’t provide you with any performance enhancement.

That’s a basic introduction to the three concepts of the clustered index, the non-clustered index and the heap. The points I’d like you to remember are:

  • Indexes are stored in Balanced Trees
  • Balanced Trees have, generally, three levels, root page, intermediate page, and leaf page
  • In clustered indexes, data is stored at the leaf page
  • In non-clustered indexes, a pointer is maintained back to the clustered index or the row id
  • A heap is a table without a clustered index

Remember those things and you can really begin to dig down on how indexes work. Understanding how they work will assist you in designing them for your database and your queries.

Next class we’ll go over statistics.

I wouldn’t walk back to your dorm by way of the shore. I’ve seen some rather odd looking people near the docks lately that didn’t give me a good feeling. See you next time… maybe.


SQL University: Introduction to Indexes, Part the First

Monday, July 19th, 2010

Right, all eldritch tomes are to be closed and Elder Signs are to be put away during this course.

Welcome to the History department here at the Miskatonic branch of SQL University. Why the History department? Well, first, because I like history and have frequently thought I would enjoy teaching it. Second, because I needed a hook upon which to hang part of the story I want to tell. What story is that you ask? Why, the story of the Dewey Decimal System. We are interested in studying history and historians must classify our subjects carefully. For advanced students we’ll be covering the Library of Congress Classification System and the…

Right, I give, this is the introductory class on indexes. If you thought we were covering something exciting and sexy like PowerShell, you’re in the wrong room.

Indexes… indexes…. There are, of course, different kinds of indexes. I’m sure that some of you, glancing ahead in your books, are already thinking, “yeah, two.” And you would, of course, be ABSOLUTELY WRONG! That’s why you’re in this class, because you don’t know. There are a large number of different kinds of indexes. Most people think of the standard indexes, of which there are two, clustered and non-clustered. But when pressed they can usually come up with the Full-Text index and possibly even the XML index. But that leaves out Spatial indexes, filtered indexes… more. Microsoft’s documentation lists eight different indexes:

  • clustered
  • non-clustered
  • unique
  • indexes with included columns
  • Full-Text
  • Spatial
  • Filtered
  • XML

But I’ve seen other people count them other ways and arrive at varying amounts. Is a compound index a different kind of index? If it’s not, is unique really a different kind of index? Things to think about.

Why so many? What the heck is an index good for? They must be useful critters or Microsoft wouldn’t have put so many different sorts (however many that is) into SQL Server. I started off talking about the Dewey Decimal System for a reason. An index, any of the indexes we’re going to talk about, is primarily meant, like the DDS, as a mechanism to make finding things easier. That’s all it is. Pretty simple, right? Wrong. You clearly haven’t spent time with SQL Server indexes or the DDS. It’s really complicated. But, just like the DDS, learning how indexes work will make using them much easier.

Remember, the main purpose of a database, despite what your DBA may secretly feel in his or her heart, is not to keep, store and protect data. No, the main purpose of a database is to feed that data up to your business users, whoever they may be, in a timely and accurate fashion. That’s where indexes come in. They will help your queries get the data out to your users faster. Think about your data like a really huge library and your data like a bunch of books. The index acts like the DDS as a mechanism to speed you through the library and quickly and easily retrieve the book that you want.

Enough comparisons, since this is introductory, I just wanted to get the idea of indexes into your head. In the next installment I’ll take on two (or four, depends on how you count them) different kinds of indexes, starting with the standard two that you expected me to cover, clustered and non-clustered indexes. I’ll also introduce the concept of a heap and we’ll talk about what the heck a B-Tree is.

See you next class, probably. Be careful crossing the quad, I’ve heard Wilbur Whately is back on campus and we all remember what happened last time.


Free MSDN Ultimate Subscriptions for a good cause

Saturday, July 17th, 2010

Few weeks ago, all of the MVP’s were given 3 Visual Studio 2010 Ultimate with MSDN activation cards to celebrate the launch of Visual Studio 2010 and .Net Framework 4. I wasn’t really sure how to make good use of them and wondering what’s the best way to share these cards. Arnie Rowland (Blog|Twitter) had a great idea to share these to underemployed or unemployed folks who need a helping hand and in return you have to spend some quality time to help a non-profit organization. If you think you are eligible for this offer, contact Arnie directly and you can read his entire post here.

I am also pitching in 2 of my cards for this great cause. Thanks Arnie for taking the initiative on this and making an impact in the community.

Digg This  Reddit This  Stumble Now!  Buzz This  Vote on DZone  Share on Facebook  Bookmark this on Delicious  Kick It on DotNetKicks.com  Shout it  Share on LinkedIn  Bookmark this on Technorati  Post on Twitter  Google Buzz (aka. Google Reader)