Author Archive

SQL University: Advanced Indexing – Indexing Strategies

Friday, November 11th, 2011

Right, I know it’s Friday and everyone’s tired and looking forward to the weekend, but I do need to finish off this indexing section and I’ll try to keep this short and interesting and hopefully keep everyone awake.

There’s no shortage of information available on how to create indexes. Hell, I’ve written a copious amount myself. Most of these many articles however are written from the point of indexing single queries. What you chose for a where clause, what has to go into the include to create the perfect index for this query. Now that’s all well and good, but I’ve never met a system that had only one query per table (maybe there is such a system out there, but I’ve never found it)

So what I’m going to try to do today is address the topic of a strategy for indexing. How to approach indexing, not for a single query, but for the system as a whole. I won’t be able to cover this in-depth, this is material worthy of an entire book chapter, if not an entire book, but I can at least touch on the essential portions.

Now, there’s two main positions that we could be in when considering indexing strategies for an entire system
1) A brand new system that’s still in development
2) An existing system that’s being used actively.

One at a time…

Indexing strategies for a brand new system

Start by choosing a good clustered index. What makes a good clustered index? Well, it depends :-)

The clustered index is the base, it will affect each and every nonclustered index, and it’s not trivial to change once the system is in use, so chose carefully. I’m not saying another word on the subject of a clustered index, not today.

Once that’s done…

Design any unique constraints, unique indexes or primary key constraints that are required by the database design. If the primary key should go on the column(s) that were chosen as the clustering key, great, the primary key gets created clustered. If not, then the clustered index goes on the column(s) chosen for the clustered index and the primary key gets created as nonclustered.

Index the foreign keys. These may end up not being the final indexes on these columns, but it’s an excellent place to start and it’s something that’s left out far, far too often.

That’s the absolute bare minimum that must be done. That can all be done with just the database design. The rest is going to require some knowledge of the queries that will be running against the server.

Speak to the developers, see what queries they’re going to be sending to the database, speak to the business analysts (or users) and see what think will be the commonly used aspects. Bear in mind that both the developers and business analysts (or users) may well be wrong. Not intentionally wrong, but wrong because they’re looking at things from a different perspective.

To give an example of that, a system I worked on some time back had custom security built-in to the DB (tables storing access rights to various sets of data). The users swore that the most accessed portion of the system was the address book. The developers claimed that the account balances procedure would have the heaviest impact. A trace showed that the custom security ran far more frequently than anything else.

Hence, if you can trace a workload (from automated testing, user testing or acceptance testing is the best) you should. Combine that with what the developers and business analysts say and that should be reasonably accurate.

Find the most critical queries, the ones that are going to run often. These may be ‘housekeeping’ queries like the custom security, or maybe they’ll be queries run when the user opens the app. It’s going to differ for everyone.

Create a minimal set of indexes to support the most critical, most frequent queries. Do not, at this point, try to index everything. It’s going to be a waste of time without accurate stats of how the users really use the system. Create just a minimum of indexes to start with. You want enough so that the system runs acceptably but not so many that you’ll be cleaning unused indexes off for months after the system goes live.

I’m not going to go into detail here on how to create indexes, see all the links that I gave in the first part of this series. There are, however, a few things to keep in mind

Once done, come back to the indexes after the system is in use and re-evaluate.

Indexing strategies for an established system

This one depends heavily on how badly the system is indexed.

If tables are missing clustered indexes or primary keys that should be the first priority. This is harder than for a new system as, in the absence of enforced constraints, duplicate values could have crept into the supposedly unique columns.

If primary keys can’t easily be added (due to errant data), the clustered indexes should still be considered. All the same considerations as for a new system apply here. It’s worth noting that adding clustered indexes to huge tables is not a quick exercise.

There are two factors to fixing the indexing for an existing system

  • creating or widening indexes to support the current workload
  • removing indexes that are not used

Creating or widening indexes

This should be done based on the workload. Either SQLTrace or Extended Events can be used to capture the queries running against the server. This can be examined manually or it can be submitted to the Database Tuning Adviser (DTA).

If using DTA, that must not be the entire story. DTA recommendations have to be carefully examined and tested and implemented only if they make sense (and improve performance). Be wary of accidentally creating redundant indexes this way (How can you tell if an index is REALLY a duplicate?)

If manually tuning, the same cautions as mentioned in the ‘Indexing a new system’ apply.

Also, be aware that you can’t create perfect indexes for all queries (except maybe in a data warehouse, but likely not even then). Index for the important (frequent or high priority) queries, create indexes that can support multiple queries. Tune the system, not the individual query. See Indexes: just because you can, doesn’t mean you should! (Kimberly Tripp).

If the less important, less frequently run queries aren’t quite as optimal as they could be (but are still in the acceptable range), that’s fine. Be very careful of over-indexing.

Removing indexes

To be honest, this is fraught with peril. Telling that an index is unused is not as easy as it may seem.

Start with the sys.dm_db_index_usage_stats DMV. Indexes that have no seeks, no scans and just updates, or ones that don’t appear in there at all may appear to be unused. Whether they are really unused however is another question.

The sys.dm_db_index_usage_stats DMV is cleared by a restart of SQL or any time the database is closed (beware auto_close). So if the SQL instance has only been running for three days, then the best that can be said about indexes that appear in sys.dm_db_index_usage_stats with no seeks and no scans is that they haven’t been used in three days.

Before deciding to drop any indexes monitor that DMV over a period of time. How long? Depends on your application. If it’s an app that has a steady and consistent usage, maybe not long. If it’s an app that has radically different usage patterns at different times of the month/year, then long enough that you capture them all.

Also, make sure you keep documentation and preferably scripts of any indexes dropped, so that you can easily recreate them should it be necessary.

In conclusion

Comprehensive indexing strategies are not exactly easy things to write short posts on, but I hope this has given some idea on how to fit all the pieces together. For another view on this topic, be sure to watch Kimberly’s video on indexing strategies: http://technet.microsoft.com/en-us/sqlserver/gg545006.aspx

Oh, I almost forgot the answer to Wednesday’s homework (for anyone that’s still awake). There are likely many answers, but this one satisfies the requirements:

CREATE NONCLUSTERED INDEX idx_HomeworkAnswer
ON dbo.CallLog (Severity, LastUpdateDate, CallStatus)
INCLUDE (AssignedTo, LogDate)
WHERE LastUpdateDate IS NOT NULL AND Severity IN (1,2)

SQL University: Advanced Indexing – Filtered Indexes

Wednesday, November 9th, 2011

Welcome back to day 2 of Advanced Indexing. Today we’re going to look at a feature that was added in SQL Server 2008 – filtered indexes.

In versions previous, indexes were always on the entire table. An index would always have the same number of rows as the table it was built on did (which is why COUNT(*) can just scan the smallest index on the table)

With filtered indexes, it’s possible to have an index that’s built on a subset of the rows in the table. The definition for a filtered index contains a WHERE clause predicate that determines if a row in the table will be in the index or not.

This can be a major advantage on really large tables where most queries are only interested in a small fraction of the table. A normal index would be based on the entire table regardless of the fact that most of the table is of no interest, meaning the index would be larger than necessary, deeper than necessary and take up more space than would be ideal. With a filtered index on just the interesting portion of the table, the index size is kept to a minimum, meaning it’s shallower than an index on the entire table and hence more efficient.

A simple example of a filtered index would be

CREATE NONCLUSTERED INDEX idx_Example
ON Account (AccountNumber)
WHERE Active = 1;

There are two main uses of a filtered index:
1) Enforcing moderately complex uniqueness requirements
2) Supporting queries that filter on common subsets of a table

Filtered indexes and unique columns

One very interesting use of filtered indexes is in enforcing uniqueness over portions of a table. One requirement that come up again and again is to have a nullable column that must have unique entries in it, but whose entries are optional. Basically, the column must be unique or null. Sounds easy, but the problem is that a unique index allows only one null. So much for nulls not being equal to anything including other nulls.

Prior to SQL 2008 implementing such a constraint required computed columns, indexed views or triggers. With SQL 2008′s filtered indexes, it’s trivial.

CREATE UNIQUE NONCLUSTERED INDEX idx_SomeTable_SomeColumn
ON SomeTable (SomeColumn)
WHERE SomeColumn IS NOT NULL;

It has to be a unique index not a unique constraint as indexes can be filtered, constraints cannot.

This can be extended to various forms of moderately complex unique requirements and is certainly an improvement over using indexed views or complex calculated columns (or just trusting the application to do things right).

Supporting queries

The really interesting use of filtered indexes though is for supporting queries. Here filtered indexes are very useful in cases where a queries against a table frequently include a specific filter.

A couple common cases of this are table that flag rows as active or inactive and most queries are interested in only the active rows, or in a database design where deletes are logical (an IsDeleted column) and almost every query filters for rows not marked as deleted.

Let’s have a look at a couple examples here. I’m not using AdventureWorks because the database design doesn’t include these kinds of patterns. The table design is given at the end of the post and a SQLDataGenerator project file is attached.

First let’s look at a simple example. This table stores support calls, and this query is looking for all recent open calls logged to one of the support people.

SELECT CallID, LogDate, AssignedTo
FROM dbo.CallLog AS cl
WHERE CallStatus = 'Open'
AND AssignedTo = 42
AND LogDate > DATEADD(ww,-1,GETDATE());

Now, we could create a normal nonclustered index with CallStatus, AssignedTo and LogDate in the key, but let’s say that while the AssignedTo and LogDate filters change for this query, the filter is always, always, always for CallStatus = ‘Open’. This table has around 200 open calls and 100000 closed calls. Creating an index with the closed calls as well is just wasting space and time, no one’s interested. So, what I can do is this:

CREATE NONCLUSTERED INDEX idx_CallLog_AssignedToLogDate
ON dbo.CallLog (AssignedTo, LogDate)
WHERE CallStatus = 'Open';

FilteredIndex1

One thing to note here is that if the query filter exactly matches the index filter, the index doesn’t need to have that column as either a key or include column. It’s not being selected and the filter is entirely taken care of with the indexes filter.

It’s worth noting that there’s bug relating to this, specifically around filtered indexes with an IS NULL filter. See http://connect.microsoft.com/SQLServer/feedback/details/454744/filtered-index-not-used-and-key-lookup-with-no-output/ The bug is still unfixed in the latest CTP of SQL Server 2012.

Let’s have a look at a second example, where the query’s filter is a subset of the index’s filter.

Let’s say that a very frequent query is for the urgent or urgent and high priority calls (severity 1 and 2). So these are two common queries:

SELECT CallID, LogDate, CallStatus, Severity
FROM dbo.CallLog AS cl
WHERE Severity < 3 -- urgent and high
AND AssignedTo = 1;

SELECT CallID, LogDate, CallStatus, Severity
FROM dbo.CallLog AS cl
WHERE Severity = 1  -- urgent
AND AssignedTo = 1;

So, given that, I can create a filtered index on the larger of those ranges

CREATE NONCLUSTERED INDEX idx_CallLog_Severity
ON dbo.CallLog (AssignedTo,    Severity)
INCLUDE (CallStatus, LogDate)
WHERE Severity < 3

FilteredIndex2

In this case the filtered index can be used, but the column that’s being filtered on must also be in the index key, because the second query is filtering for a subset of the rows that the index contains.

An examination of the properties of the index seeks shows that for the first query there’s only one seek predicate – AssignedTo, whereas for the second query there are two seek predicates – AssignedTo and Severity

FilteredIndex3 FilteredIndex4

There are some limitations around filtered indexes and the matching of the filters. There are cases where a query and a filtered index have predicates that are logically equivalent, but where the filtered index can’t be used.

An example of this is not hard to generate. Let’s try a table that has an IsDeleted bit column (defined as not nullable)

CREATE TABLE Users (
UserID INT IDENTITY PRIMARY KEY,
UserName VARCHAR(50),
DepartmentID INT NOT NULL,
IsDeleted BIT NOT NULL DEFAULT 0
);

CREATE NONCLUSTERED INDEX idx_Users_DepartmentID
ON dbo.Users (DepartmentID)
INCLUDE (UserName)
WHERE IsDeleted = 0;

The IsDeleted column is a non-nullable bit column. Hence it can only have two possible values, 0 and 1. Hence, these two queries are completely equivalent in their results

SELECT UserName, DepartmentID
FROM dbo.Users
WHERE IsDeleted = 0 AND DepartmentID = 3;

SELECT UserName, DepartmentID
FROM dbo.Users
WHERE IsDeleted != 1 AND DepartmentID = 3;

FilteredIndex7

The first one uses the filtered index, the second does not. The second one scans the clustered index because despite the filter essentially being the same as the index filter, it’s not the same and the query hence cannot use the filtered index.

Another limitation has to do with parametrisation. If the query is passed in a parametrised form, or is subject to simple or forced parametrisation, then by the time that the optimiser gets the query there may not be sufficient information to tell that a filtered index is usable or not.

If we imagine the case of the table with the IsDeleted column again, a query that has a filter IsDeleted = 0 is definitely capable of using a filtered index that has the predicate IsDeleted = 0, but if the query gets parametrised and arrives at the optimiser in the form IsDeleted = @p1, there’s no way that query can match the filtered index because the value of @p1 on future executions could be 0, 1, NULL or 42, and in any case other than 0, if the cached plan used the filtered index it would produce incorrect results.

We can see this by setting the database parametrisation to forced and re-running an earlier example.

With parametrisation simple and the index with a filter on severity < 3, these two queries produce different execution plans

SELECT CallID, LogDate, CallStatus, Severity
FROM dbo.CallLog AS cl
WHERE Severity = 1
AND AssignedTo = 1;

SELECT CallID, LogDate, CallStatus, Severity
FROM dbo.CallLog AS cl
WHERE Severity = 4
AND AssignedTo = 1;

FilteredIndex5

However if the database is set for forced parameterisation, then the query is only seen in its parametrised form, and both queries have the same plan, one that does not use the filtered index

ALTER DATABASE Testing SET PARAMETERIZATION FORCED;
GO

SELECT CallID, LogDate, CallStatus, Severity
FROM dbo.CallLog AS cl
WHERE Severity = 1
AND AssignedTo = 1;

SELECT CallID, LogDate, CallStatus, Severity
FROM dbo.CallLog AS cl
WHERE Severity = 4
AND AssignedTo = 1;

FilteredIndex6

Right, so, homework for today. Given the table design below and the following queries, design one filtered index that both queries can use effectively (a filtered index that doesn’t filter out any rows is not an acceptable answer). Assume those are fixed queries that are frequently run with exactly that structure and exactly those values.

CREATE TABLE CallLog (
CallID INT IDENTITY PRIMARY KEY,
CallStatus CHAR(6) NOT NULL,
LogDate DATETIME NOT NULL,
LastUpdateDate DATETIME,
Title VARCHAR(500),
Severity TINYINT,
AssignedTo INT,
UserID INT
);

SELECT CallID, CallStatus, AssignedTo
FROM dbo.CallLog
WHERE CallStatus = 'Open' AND LastUpdateDate IS NOT NULL AND Severity = 1;

SELECT CallID, LogDate, LastUpdateDate FROM dbo.CallLog
WHERE LastUpdateDate < DATEADD(dd,7,GETDATE()) AND Severity IN (1,2);

FilteredIndexes (SQL DataGenerator project)

Answers for Monday’s homework:

1) Yes, this can use an index for both filter and group by. Index key columns would be (TransactionType, ReferenceOrderID, ProductID), in that order, and index include columns would be (Quantity, ActualCost)

2) No, because of the inequality we can use an index to support filtering or aggregating but not both. So there will either be a plan with an index seek and a hash aggregate (or sort and stream aggregate) or a plan with an index scan and a stream aggregate, but there’s no way to get a seek and stream aggregate without a sort

SQL University: Advanced indexing – Sorting and Grouping

Monday, November 7th, 2011

Good day everyone and welcome to another week of SQL University. I know we’re getting close to the end of the year and everyone’s looking forward to a nice long vacation soaking up the sun at the beach, but a little bit of attention would be nice. Thank you.

This week is Advanced Indexing, and I mean advanced, none of that selectivity, SARGable, predicate stuff that gets repeated all over the place. If you need a refresher on the basics before we get started, the following can be considered pre-requisite reading for this course

There’s also some additional background material available for more enthusiastic students:

Right, now that the admin has been handled, let’s get straight into things. Nothing like starting at the deep end…

Most people would rightly associate indexes with where clause predicates and joins, after all, the main usage of an index is to reduce the rows in consideration for a query as fast as possible. However there’s another portion of your queries that indexes can, if appropriately designed, help with – grouping and sorting.

Sorting is an extremely expensive operation, especially on large numbers of rows. For the academics in the audience, the algorithmic complexity of sorting is above linear, the time to sort a set of data increases faster than the number of items in the list. The common sorting algorithms have an average time complexity of O(n log n). It’s better than O(n2), but it can still hurt at the higher row counts.

O(n^2) O(n log n)

O(n2) on the left, O(n log n) on the right (Thanks to QuickMath)

Right, the non-academics can wake up now.

The other reason that sorting hurts is that it needs a memory grant to do the sort. If there’s a shortage of memory the query could have to wait a while for the memory to be granted and, if the optimiser mis-estimates the number of rows to be sorted, the memory grant could be insufficient and the sort would have to spill into TempDB. You don’t want that happening.

Finally, sort is a blocking operator in the execution plan (all rows must have been fetched from the source before any output can be returned), and so the client won’t start seeing rows returned until the entire sort has been completed. This can make the query feel like it takes longer than it really does.

Grouping and aggregation are much the same. To aggregate one set of values based on another set of values, SQL has to get all the like values of the grouping columns together so that it can do the aggregation. That sounds suspiciously like a sort doesn’t it?

SQL doesn’t always sort to do aggregations, but the alternative – hash tables – isn’t exactly free (homework exercise – read up on hash tables)

So for both sorting and grouping, the query processor’s job would be a lot easier if there was some way that it could get the data ordered by the sorting or grouping columns without having to do the work of actually sorting. Sounds impossible? No.

Enter indexes. The b-tree structure of an index makes it great for quickly finding matching rows, but today we’re more interested in the leaf-level of the index than the upper levels. The leaf level of an index is logically ordered by the index key (not physically ordered, let’s stop that myth right here please).

The leaf level of an index is logically ordered by the index key columns. So, if the index leaf level is read, it can return the data in the order of the index key. If that order is what the query processor needs in order to process a sort or grouping, then there’s no need for an expensive sort to be done, the underlying order can be leveraged.

Now, before someone misquotes me and goes off crying out that data is always returned in the order of the index used, no it is not. If there’s no order by on a query, there’s no guarantee of order regardless of indexes. However if there is an order by, the query optimiser and query processor may be able to utilise the underlying index order and avoid the cost of actually sorting the data.

Ok, enough theory for now, let’s look at some practical examples. We’ll use AdventureWorks here and I’m going to use the TransactionHistoryArchive table (in the Product schema)

Firstly a simple (and silly) example:

SELECT ProductID, TransactionDate, ActualCost
 FROM Production.TransactionHistoryArchive
 ORDER BY ActualCost

This one’s not very realistic, but it’s a nice simple one to start with. Give me all the rows in the table ordered by the ActualCost column. There’s no filter here so many would say that indexes can’t help here, and it’s true that an index can’t help with finding rows (because all are required), but it can help.

SortExample1

Table ‘TransactionHistoryArchive’. Scan count 1, logical reads 1419, physical reads 0.

SQL Server Execution Times:
CPU time = 421 ms,  elapsed time = 4855 ms.

421ms of CPU time, with an estimated cost of 93% for the sort. Let’s see if we can make this any better.

If I create an index on ActualCost, that gives SQL the option of scanning the index (scan, because there’s no seek predicate) to retrieve the data in the order of the ActualCost column. I’m going to have to make it a covering index, as there is no way at all that SQL will willingly do key lookups for every single row of the table.

CREATE NONCLUSTERED INDEX idx_TransactionHistoryArchive_ACtualCost
 ON Production.TransactionHistoryArchive (ActualCost)
 INCLUDE (ProductID, TransactionDate)

Let’s see how that’s changed the query’s execution.

SortExample2

Table ‘TransactionHistoryArchive’. Scan count 1, logical reads 682, physical reads 0.

SQL Server Execution Times:
CPU time = 47 ms,  elapsed time = 4073 ms.

The reads have dropped slightly, because we’re now scanning a nonclustered index which is smaller than the clustered index, but that’s not the main point here. The CPU usage has dropped by around a factor of 8. 421ms down to 47ms. If this was a critical query that ran several times a minute then this change could make a nice improvement to throughput and overall CPU usage.

That was a silly example, let’s try for something a little more complex:

SELECT  ProductID ,
TransactionDate ,
TransactionType ,
Quantity ,
ActualCost
FROM Production.TransactionHistoryArchive
WHERE TransactionType = 'S'
AND ReferenceOrderID = 51739
ORDER BY TransactionDate

If we look at the execution plan, there’s already an index in use. There’s an index on ReferenceOrderID, but it’s clearly not as good as it could be.

SortExample3

Table ‘TransactionHistoryArchive’. Scan count 1, logical reads 222, physical reads 0.

SQL Server Execution Times:
CPU time = 0 ms,  elapsed time = 1 ms.

Well, it’s not exactly taking ages on the CPU (it’s a tiny resultset), but this can still be better than it is. The key lookup is there because the existing index is only on two columns – ReferenceOrderID and ReferenceOrderLineID. I can’t modify that index for this query without potentially breaking some other query’s use of it, so I’ll create a new index.

CREATE NONCLUSTERED INDEX idx_TransactionTypeReferenceOrderID
ON Production.TransactionHistoryArchive (TransactionType, ReferenceOrderID)
INCLUDE (ProductID, TransactionDate, Quantity, ActualCost)

Why TransactionType first? Come back on Friday and I’ll be discussing that.

Plan’s now much simpler, the key lookup has gone, and now the sort is the majority of the cost

SortExample4

The key to getting rid of that sort is to understand that a multi-column index is sorted by the entire key. So, if we have an index on Col1, Col2, then for rows where Col1 has the same value, those rows are listed within the index ordered by Col2. In other words, if we read the leaf level of that index it would look like this:

Col1      Col2
A         1
A         3
A         8
B         4
B         8
B         9
B         14
C         0
\C         1

And so on and so on. Hence, if I filtered that by Col1 = B, the resulting rows could be returned ordered by Col2 with no additional work.

So in the above case, if I add the sort column (TransactionDate) as a key column in the index (it’s currently an Include column), then once the filter on TransactionType and ReferenceOrderID is done, the qualifying rows can be read in order of TransactionDate.

Let’s drop the index we created and create one with TransactionDate as an additional key column.

CREATE NONCLUSTERED INDEX idx_TransactionTypeReferenceOrderID
ON Production.TransactionHistoryArchive (TransactionType, ReferenceOrderID, TransactionDate)
INCLUDE (ProductID, Quantity, ActualCost)

SortExample5

Success, the sort has gone!

One last example, then we’ll take a brief look at group by before finishing up for the day.

SELECT  ProductID ,
TransactionDate ,
TransactionType ,
Quantity ,
ActualCost
FROM Production.TransactionHistoryArchive
WHERE ActualCost > 2500
ORDER BY TransactionDate

Can I use the same technique here to remove the need for a sort?

Let’s look at that simplistic index that I described earlier and see how the inequality would play out.

Col1      Col2
A         1
A         3
A         8
B         4
B         8
B         9
B         14
C         0
C         1

If I filter that for Col1 > ‘A’, are the resultant rows ordered by Col2?

No, because the filter on Col1 is an inequality, a filter that returns multiple different values of Col1, the results aren’t ordered by the second column and hence we can’t  use an index here to both support the filter and the order.

Given this one, we could create an index to support the filter and have SQL sort the rows that qualified or we could create an index to support the order by and have SQL scan that and filter out rows that don’t match.

In general, the first option is the one that will be best in the majority of cases. The primary use of an index is to locate rows and filter resultsets. There are cases where the alternative may be appropriate, if the vast majority of the rows in the table qualify for the filter then it may be more optimal to support the sort and let SQL scan and filter. Definitely not the usual case though.

That about wraps up sorts, now for a quick look at group by.

As mentioned earlier, SQL has two ways to process grouping, it can do what is called a Stream Aggregate or it can use a hash table and do a Hash Aggregate. Stream Aggregate requires that the resultset be sorted in the order of the grouping columns. Well, given that fact and that we’ve spend a lot of time showing how to use indexes to support a sort, this should be quick and easy.

Let’s dive straight into some examples, because the theory is the same as for the sort described earlier

SELECT ProductID, SUM(ActualCost) AS TotalCostPerProduct
FROM Production.TransactionHistoryArchive
GROUP BY ProductID

This currently executes as a hash aggregate.

Grouping1

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0.

Table ‘TransactionHistoryArchive’. Scan count 1, logical reads 1419, physical reads 0.

SQL Server Execution Times:

CPU time = 47 ms,  elapsed time = 46 ms.

If we want a stream aggregate, then we need to get the rows entering the aggregation ordered by ProductID. SQL is not going to willingly sort the entire resultset (the hash aggregate is cheaper), so the only way we’re going to get a stream aggregate (other than with a hint) is by adding an index so that the data can be read from the index already ordered. There’s no filter here so it’s a straightforward index

CREATE NONCLUSTERED INDEX idx_TransactionHistoryArchive_ProductID
ON Production.TransactionHistoryArchive (ProductID)
INCLUDE (ActualCost)

Grouping2

Table ‘TransactionHistoryArchive’. Scan count 1, logical reads 482, physical reads 0.

SQL Server Execution Times:

CPU time = 31 ms,  elapsed time = 34 ms.

Not a massive improvement, but the principle is there.

Now, for homework, take these two queries and see firstly if it is possible to create an index to support both the filter and the group by and, if so, identify what that index is.

Question 1

SELECT  ReferenceOrderID
ProductID ,
SUM(Quantity) AS TotalQuantity ,
SUM(ActualCost) AS TotalCost
FROM Production.TransactionHistoryArchive
WHERE TransactionType = 'S'
 GROUP BY ReferenceOrderID, ProductID

Homework1

Question 2

SELECT ReferenceOrderID ,MIN(ActualCost)|
FROM Production.TransactionHistoryArchive AS tha
WHERE TransactionDate > '2004-01-01'
GROUP BY ReferenceOrderID

Homework2

Edit: And (as a late clarification) assume that the filter on transaction date is not always the same date.

Enough for today. Same time, same place Wednesday for a look at indexes on part of a table.

24 Hours of PASS Questions

Tuesday, September 27th, 2011

I finally found the time to work through the questions from the 24 Hours of PASS session that I did. Thanks to everyone that attended the event

Q1: Can you filter execution plans for sort warnings?

No. The sort and hash warnings don’t appear in the execution plan. You’d have to trace for the hash and sort warning events and correlate that with either batch/statement started and completed events or with the run-time plan events.

Q2: To get the query executions can you just add the statement completed or batch completed events?

The Statement Completed and Batch completed events have durations and can be filtered on that duration. The problem however is that the execution plan events (eg showplan all, showplan xml, statistics profile, statistics xml) have no duration column. Hence the execution plan events can’t be filtered on duration even though the statement_completed and batch_completed events can.

Q3: If you run a Profiler trace and all the plans are being pulled from cache will that mean that no ShowPlan event data will be shown in the trace?

Depends which event is being traced. There are events for query compile (showplan all for query compile and showplan xml for query compile) that only fire when the query compiles, so those will not fire if the plan is being pulled from cache. The other execution plan events are all fired each time the query executes.

Q4: Why can you sometimes get nulls or blank strings for the query plan from sys.dm_exec_query_plan

From Books Online:

Under the following conditions, no Showplan output is returned in the query_plan column of the returned table for sys.dm_exec_query_plan:

  • If the query plan that is specified by using plan_handle has been evicted from the plan cache, the query_plan column of the returned table is null.
  • Some Transact-SQL statements are not cached, such as bulk operation statements or statements containing string literals larger than 8 KB in size. XML Showplans for such statements cannot be retrieved by using sys.dm_exec_query_plan unless the batch is currently executing because they do not exist in the cache.

Q5: Do the execution plans from the DMVs contain execution information?

No. The plans extracted from cache contain compile-time information only.


If you enjoyed this and are going to be at PASS Summit this year, it’s not too late to sign up for the all-day precon (All about Execution Plans) that Grant Fritchey (blog|twitter) and I are doing.

Compiles and recompiles

Tuesday, August 16th, 2011

I want to spend some time over the next few months looking at query compilation and the plan cache, and there’s a couple concepts that I want to get cleared up beforehand. The first of those is around two terms that are often used interchangeably, compile and recompile.

Compile

A compile occurs when a query is given to the query optimiser and, when it does a lookup into the plan cache, no matching plan is found. The optimism must then compile the query, generating an execution plan, must add that plan to the plan cache (in most cases) and then must pass that plan onto the query execution engine so that the query can be executed. (http://technet.microsoft.com/en-us/library/Cc966425)

Recompile

A recompile is something slightly different. For a recompile, the optimiser must find a matching plan when it queries the plan cache, must hand that cached plan over to the query execution engine and then while doing validation checks the execution engine must determine that then query plan is no longer valid and request the optimiser to partially or completely recompile the query. (http://technet.microsoft.com/en-us/library/Cc966425)

Subtle difference. Both cases result in the optimiser generating an execution plan, but the reasons can be different. Also worth noting is that a compile results in a new plan in the cache, a recompile simply replaces an existing plan.

Another difference since SQL 2005 – a compile is for the entire batch, but a recompile can be for just a single statement within the batch.


Now the theory’s dealt with, let’s look at some examples and see how we can track these two events and try and get a better understanding of which occurs when and how they look.

The tools I’m going to use to track these are performance monitor with the compiles/sec and recompiles/sec counters and SQL Profiler with the event SP:StmtRecompile event (there’s no profiler event for compilation). I’ll also check what’s in the plan cache after each test.

The first one’s going to be very simplistic, a query run against an empty plan cache.

DBCC FREEPROCCACHE
GO

EXEC dbo.OutStandingTotalByStatus
GO

What we get from that is a non-zero value for SQL Compilations/sec (perfmon) and the following from profiler (The SQL Recompiles/sec remains 0)

Compiles1

and the plan cache now contains one plan with one use. (for more info on how the CacheMiss and CacheInsert events work, see http://sqlinthewild.co.za/index.php/2010/07/27/hit-and-miss/ and http://sqlinthewild.co.za/index.php/2010/08/31/come-and-gone/)

Compiles2

In this case, I hope it was clear, we had a compile occur (empty plan cache before, new plan added to cache).

Now what happens if, with no clearing of the cache nor anything else being done, I mark that procedure for recompile and run it again?

EXEC sp_recompile OutStandingTotalByStatus
GO

EXEC dbo.OutStandingTotalByStatus
GO

This is interesting. We don’t see a non-zero value for SQLRecompiles/sec (as might have been expected), instead, as with the previous case we get a spike for SQLCompiles/sec. The profiler output shows why

Compiles3

Note the SP:CacheRemove being run for the stored procedure’s plan. So sp_recompile does not mark a plan as invalid and needing recompilation. It removes it from the cache entirely. The next time that procedure runs SQL does a cache lookup, doesn’t find a matching plan and compiles a new one (compile, not recompile)

Onwards… What about if I alter the procedure?

I’m not going to clear the cache first, the procedure has a plan in there from the last test, we can use that to see the effects.

ALTER PROCEDURE OutStandingTotalByStatus
AS
-- This has changed!
SELECT 1;

SELECT o.OrderStatus, SUM(UnitPrice*Quantity) AS TotalOutstanding
FROM dbo.Orders o
INNER JOIN dbo.OrderDetails od ON o.OrderID = od.OrderID
WHERE ShippingDate IS NULL
GROUP BY OrderStatus;
GO

EXEC dbo.OutStandingTotalByStatus
GO

Again perfmon shows a bump in Compiles/sec and the recompiles/sec is still at 0. Again, profiler shows why.

Compiles4

Same as when I ran the sp_recompile, the ALTER PROCEDURE triggered a SP:CacheRemove. The plan was gone from cache at the point the stored procedure ran again, so this was considered a compile, not a recompile.

btw, for those curious, if I query the plan cache between altering the procedure and running it again, the procedure’s plan really has gone.

Let’s try something more complex, an alteration of one of the base tables. I’m going to clear the cache first, for reasons that will later be clear

DBCC FREEPROCCACHE
GO

EXEC dbo.OutStandingTotalByStatus
GO

ALTER TABLE dbo.OrderDetails ADD Filler CHAR(10)
GO

EXEC dbo.OutStandingTotalByStatus
GO

Ha! This time perfmon shows a non-zero value for SQLRecompiles/sec. The profiler trace has a different form too

Compiles5

There’s no SP:CacheRemove. No SP:CacheInsert either.

The alteration of the table did not result in SQL going through the plan cache and removing plans for dependant objects. That makes sense if you think about it. There could be millions of plans in cache, for SQL to check each and every one to see if it has a dependency on the table just changed would be terribly time consuming, so it’s not done. What happens instead is that the next time the procedure runs SQL looks in the cache, finds the procedure’s plan (the SP:CacheHit), does the pre-execution validations and finds that one of the base objects has changed since that plan was compiled. The plan is then sent to the optimiser to recompile that plan (the SP:Recompile event)

Here’s the other interesting thing, from a query of the plan cache

Compiles6

The usecounts is 2, not 1. I ran that procedure once before altering the table, once after. The alter of the table forced SQL to recompile the plan on the second execution, but the usecount for the plan reflects both executions, the one from before the recompile and the one from after.

In all the previous cases, when the plan was actually removed from cache and then compiled and inserted the usecount was only 1 because when the plan was removed all related information was removed with it. With a recompile it’s not removed, so the usecounts aren’t reset to 0.

This is important to note. There are lots of articles (and books even) that say to check plan usecounts as a low usecount is a sign of frequent recompiles. As we can now see, it’s not. It’s a sign of frequent compiles, of the plans being removed from cache and later re-added, recompiles don’t affect the usecount (well, at least not in this case, there may be cases where it does)

Now try a couple more common cases, see what they show, starting with rebuilding an index.

DBCC FREEPROCCACHE
GO

EXEC dbo.OutStandingTotalByStatus
GO

ALTER INDEX PK_OrderDetails ON dbo.OrderDetails REBUILD
GO

EXEC dbo.OutStandingTotalByStatus
GO

Any guesses?

Compiles7

It’s a recompile (and the perfmon counters confirm that). As with the previous case, the usecount shown in sys.dm_exec_cached_plans is 2 for the procedure’s plan, not 1

Lastly, a statistics update. Can anyone guess what’s going to happen here? Anyone?

DBCC FREEPROCCACHE
GO

EXEC dbo.OutStandingTotalByStatus
GO

-- Insert a few more rows into OrderDetails at this point

UPDATE STATISTICS dbo.OrderDetails WITH FULLSCAN
GO

EXEC dbo.OutStandingTotalByStatus
GO

Compiles9

Yup, it’s a recompile.

It is worth nothing that (in my tests anyway), if at least one row isn’t changed, the stats update has no effect. Perhaps an optimisation that checks whether there have been modifications or not before updating? I’m not sure what’s causing that.

There is a lot more I could test, SET option changes, memory pressure, configuration changes, etc, but I think I’ve done enough that anyone interested can play themselves for further info.

Ok, but what’s the use? Was this just a dive into internals with no practical usage? I don’t think so.

One thing that’s important when investigating SQL problems is to understand what the various counters, events and columns are saying and what they mean. If one is faced with a server with very high SQLCompiles/sec, it’s a waste of time to go looking at whether automatic or manual stats updates are happening too often. Similarly if the SQLRecompiles/sec is high, investigating the usage of WITH RECOMPILE and OPTION(RECOMPILE) is completely off track and is going to waste time and not help with solving the problem.

That’s why I believe it’s important to understand what the various counters, events and columns actually means and what occurrences within SQL affects them. None of us have time to spend a couple days busy investigating something that’s not related to what’s actually wrong with the server based on an incorrect understanding of what SQL is showing us.

All about Execution Plans

Tuesday, August 9th, 2011

Coming to the PASS Summit in October this year? Excellent!

I say excellent, because Grant (blog|twitter) and I are presenting a full-day seminar on the Monday, all about execution plans. Excited already? If not, let me give you a taste of what we’ll be discussing.

Grant will be kicking things off with a look at how the optimiser works. Not 500 level deep internals (that’d take all day by itself), but an overview of how the optimiser turns this

SELECT t.Title, u.UserName, u.UserLevel, t.LastModified, t.LastPoster
FROM Threads t
INNER JOIN Posts p on t.ThreadID = p.ThreadID
INNER JOIN Users u on p.Poster = u.UserID
WHERE t.ThreadID = 42

into this

QueryIntoExecPlan

as well as some DMVs that give a view on what the optimiser is doing.

After that, I’ll be digging into what the plan cache is, how it’s structured and how plans are matched to incoming queries. I’ll also cover how you might go about monitoring SQL’s usage of the cache.

After that we’ll discuss the myriad and varied ways to get execution plans out of SQL and what the various different options for that return and what some of the terms involved really mean (estimated execution plans anyone?).

Once all that’s out there (which will probably take the entire morning) it’ll be onto the fun of reading the execution plans, what those various icons represent and what all their arcane properties are telling you. Maybe, just maybe we’ll also have a look at the raw XML form of the plan, just for fun.

And because theory’s not very useful without something to apply it to, we’ll be ending off the day by seeing how you can use the information from the execution plan to tune your queries. After all, everyone wants better performance, right?

Still not completely convinced? Then check out the 24 hours of PASS coming in September. Both Grant and I are doing a pre-con preview. I’m doing extracting execution plans and Grant’s doing a portion of the reading execution plans section.

Right, now that everyone’s dying to attend this, I’m going to toss out a little bit of a challenge. If you’re coming to our precon and you have an execution plan that you don’t understand (maybe a strange operator, maybe a property with a value you’ve never seen), mail it to me (gail@<blog domain>) with a note as to what you want clarifying and we might discuss it during the pre-con. Some rules, just to keep things sane

  • No confidential stuff. I don’t want legal trouble, neither do you.
  • If the exec plan takes 5 minutes to open in SSMS, I’m deleting it.
  • If it’s multiple pages in SSMS, I’m deleting it.
  • I don’t promise to look at anyone’s plans, it depends how we’re doing on time.

All about Execution Plans

Tuesday, August 9th, 2011

Coming to the PASS Summit in October this year? Excellent!

I say excellent, because Grant (blog|twitter) and I are presenting a full-day seminar on the Monday, all about execution plans. Excited already? If not, let me give you a taste of what we’ll be discussing.

Grant will be kicking things off with a look at how the optimiser works. Not 500 level deep internals (that’d take all day by itself), but an overview of how the optimiser turns this

SELECT t.Title, u.UserName, u.UserLevel, t.LastModified, t.LastPoster
FROM Threads t
INNER JOIN Posts p on t.ThreadID = p.ThreadID
INNER JOIN Users u on p.Poster = u.UserID
WHERE t.ThreadID = 42

into this

QueryIntoExecPlan

as well as some DMVs that give a view on what the optimiser is doing.

After that, I’ll be digging into what the plan cache is, how it’s structured and how plans are matched to incoming queries. I’ll also cover how you might go about monitoring SQL’s usage of the cache.

After that we’ll discuss the myriad and varied ways to get execution plans out of SQL and what the various different options for that return and what some of the terms involved really mean (estimated execution plans anyone?).

Once all that’s out there (which will probably take the entire morning) it’ll be onto the fun of reading the execution plans, what those various icons represent and what all their arcane properties are telling you. Maybe, just maybe we’ll also have a look at the raw XML form of the plan, just for fun.

And because theory’s not very useful without something to apply it to, we’ll be ending off the day by seeing how you can use the information from the execution plan to tune your queries. After all, everyone wants better performance, right?

Still not completely convinced? Then check out the 24 hours of PASS coming in September. Both Grant and I are doing a pre-con preview. I’m doing extracting execution plans and Grant’s doing a portion of the reading execution plans section.

Right, now that everyone’s dying to attend this, I’m going to toss out a little bit of a challenge. If you’re coming to our precon and you have an execution plan that you don’t understand (maybe a strange operator, maybe a property with a value you’ve never seen), mail it to me (gail@<blog domain>) with a note as to what you want clarifying and we might discuss it during the pre-con. Some rules, just to keep things sane

  • No confidential stuff. I don’t want legal trouble, neither do you.
  • If the exec plan takes 5 minutes to open in SSMS, I’m deleting it.
  • If it’s multiple pages in SSMS, I’m deleting it.
  • I don’t promise to look at anyone’s plans, it depends how we’re doing on time.

SQLSkills Immersion training in London

Friday, July 22nd, 2011

Just short of the winter solstice, I bailed out of a freezing cold Johannesburg for warmer climates; well, actually for London where the weather could almost have been mistaken for a South African winter, except much wetter.

However I wasn’t going to London for the weather (fortunately), nor even for some sightseeing; I was going to London for some SQL training. Not just any SQL training, but some seriously deep training from among the best trainers in the world, one of Paul Randal and Kimberly Tripp’s five day immersion courses, and immersion it most certainly was.

It gives a good feel for the intensity and depth of the course when, after all the introductions and housekeeping clip_image002on the Monday morning were done, Paul started off by diving straight into the structure of pages and rows. Start slowly, where’s the fun in that? By lunchtime I had on the order of 20 pages of notes. The reasoning behind starting with that is that the internal structures come up again and again and again in the later material, so a firm understanding of how things are put together at the lowest level makes it easier to understand the features that are built on top.

The material covered is detailed on the SQLSkills website (http://www.sqlskills.com/T_ImmersionInternalsDesign.asp), so I’m not going clip_image004to waste space by listing it all again. The material was all covered in incredible detail and any point could be and frequently was expanded on if any of the students asked.

On the subject of questions, there was not a single one, over all five days, that Paul and Kimberly could not answer with at most a couple of minutes of research (mostly for things like kb article numbers). Questions and comments were encouraged and often the discussions were as valuable as the main course material.

By the Friday, all the students were looking a little worn out. Paul, of course, was still as vibrant as ever, to the point of heckling (in good fun naturally) someone who crawled in late on the Friday morning.

All in all that was a fantastic experience, and that was just week one out of the four. I’d really like to thank Paul and Kimberly for coming across to London and giving people who can’t make it to the US classes an opportunity to take one of their Immersion Courses. I certainly hope that they plan to make a return visit soon.

clip_image006

Converting OR to Union

Tuesday, July 5th, 2011

When I looked at indexing for queries containing predicates combined with OR, it became clear that the are some restrictive requirements for indexes for the optimiser to consider using the indexes for seek operations.

  • Each predicate (or set of predicates) combined with an OR must have a separate index
  • All of those indexes must be covering, or the row count of the concatenated result set low enough to make key lookups an option, as the optimiser does not apparent to consider the possibility of doing key lookups for a subset of the predicates before concatenating the result sets.

So what can be done if it’s not possible to meet those requirements?

The standard trick is to convert the query with ORs into multiple queries combined with UNION. The idea is that since OR predicates are evaluated separately and the result sets concatenated, we can do that manually by writing the queries separately and concatenating them using UNION or UNION ALL. (UNION ALL can only be safely used if the predicates are known to be mutually exclusive)

CREATE TABLE Persons (
PersonID INT IDENTITY PRIMARY KEY,
FirstName    VARCHAR(30),
Surname VARCHAR(30),
Country CHAR(3),
RegistrationDate DATE
)

CREATE INDEX idx_Persons_FirstName ON dbo.Persons (FirstName) INCLUDE (Surname)
CREATE INDEX idx_Persons_Surname ON dbo.Persons (Surname) INCLUDE (FirstName)
GO

-- Data population using SQLDataGenerator

SELECT FirstName, Surname
FROM dbo.Persons
WHERE FirstName = 'Daniel' OR Surname = 'Barnes'

SELECT FirstName, Surname
FROM dbo.Persons
WHERE FirstName = 'Daniel'
UNION
SELECT FirstName, Surname
FROM dbo.Persons
WHERE Surname = 'Barnes'

In this case, the OR can be replaced with a UNION and the results are the same. The Union form is slightly less efficient according to the execution plan’s costings (60% compared to the OR at 40%), and the two queries have the same general form, with two index seeks and some form of concatenation and remove duplicates.

OrResult1
OrExecPlan1

So in that case it worked fine, although the original form was a little more efficient

Some care does need to be taken however, as the query with OR and the query with UNION may not always be equivalent, and it has to do with the elimination of duplicate rows.

In an OR, if a row qualifies for both of the predicates, it’s only returned once. That should be obvious, it’s how things should work, we don’t want to see the row multiple times just because it qualifies for more than one of the OR predicates. If we change that to UNION ALL then the row will be returned twice, it appears in both queries that are concatenated, and UNION ALL means combine without eliminating duplicates.

SELECT FirstName, Surname
FROM dbo.Persons
WHERE FirstName = 'Herman' OR Surname = 'Anderson'

SELECT FirstName, Surname
FROM dbo.Persons
WHERE FirstName = 'Herman'
UNION ALL
SELECT FirstName, Surname
FROM dbo.Persons
WHERE Surname = 'Anderson'

OrResult2a

In that example, Herman Anderson appears once in the results of the OR query and twice in the results of the UNION ALL. That’s because it qualifies for both predicates. The OR eliminated the duplication, the UNION ALL does not.

So change that UNION ALL to UNION so that the elimination of duplicate rows is done, the row appears only once and life is good again. Or is it?

SELECT FirstName, Surname
FROM dbo.Persons
WHERE FirstName = 'Alfred' OR Surname = 'Hickman'

SELECT FirstName, Surname
FROM dbo.Persons
WHERE FirstName = 'Alfred'
UNION
SELECT FirstName, Surname
FROM dbo.Persons
WHERE Surname = 'Hickman'
ORDER BY FirstName, Surname

OrResult2b

This time, Alfred Hickman appears twice in the results from the OR, but only once in the output from the UNION

The difference comes in how the duplicates are eliminated. With an OR, SQL does the elimination of duplicates based on the key value regardless of what may be in the select list. With an UNION, SQL does the elimination of duplicates based on the select list, regardless of what the key value may be and in the above example there were two rows in the table with the value ‘Alfred Hickman’. So with UNION you can lose rows if they are duplicated in the table.

The solution’s fairly simple, if converting an OR into a UNION, ensure that the key column(s) are in the select list, then the duplicate elimination done by the UNION will only remove rows that were part of both result sets, instead of also removing ones that really do appear twice in the table.

So in conclusion, if you’re replacing a query using OR with a query using UNION, be careful with the finer details around duplicates. If you know the conditions are mutually exclusive, use UNION ALL. If you don’t, use UNION and ensure that the table’s key column(s) are present in the select list so that the UNION doesn’t remove rows that you don’t want it to remove.

Are all updates split into delete-insert?

Tuesday, June 21st, 2011

This should be another quick one.

Earlier I saw a forum post where someone asserted that SQL always executes an update as a delete of the row and then an insert of the row with the new values. Now I need another excuse to play with undocumented features, so let’s see if that’s true (or at least true for all the common cases I try)

First up, a heap, no indexes at all, an update that does not change the size of the row.

CREATE TABLE TestingUpdate1 (
ID INT IDENTITY,
SomeString CHAR(50)
)

INSERT INTO TestingUpdate1 (SomeString)
VALUES
('One'),('Two'),('Three'),('Four'),('Five'),('Six'),('Seven'),('Eight'),('Nine')

CHECKPOINT -- truncate the log, DB is in simple recovery.

UPDATE TestingUpdate1
SET SomeString = 'NotFour'
WHERE ID = 4 -- one row

SELECT Operation, Context, AllocUnitName, [Transaction Name], Description FROM fn_dblog(NULL, NULL) AS TranLog

DROP TABLE TestingUpdate1

UpdateHeapFixedSize

The log operation here is Modify Row. so in this case, the update was done as an in-place update.

Second test, a heap, no indexes at all, and an update that changes the row size (however plenty free space on the page, so no issues with forwarding pointers here)

CREATE TABLE TestingUpdate2 (
ID INT IDENTITY,
SomeString VARCHAR(50)
)

INSERT INTO TestingUpdate2 (SomeString)
VALUES
('One'),('Two'),('Three'),('Four'),('Five'),('Six'),('Seven'),('Eight'),('Nine')

CHECKPOINT -- truncate the log, DB is in simple recovery.

UPDATE TestingUpdate2
SET SomeString = 'NotFour'
WHERE ID = 4 -- one row

SELECT Operation, Context, AllocUnitName, [Transaction Name], Description FROM fn_dblog(NULL, NULL) AS TranLog

DROP TABLE TestingUpdate2

UpdateHeapChangingSize

Again we have a modify row, so that again was an in-place update.

Last on the heap, let’s see if things change when there’s lots and lots of rows (I’m going to use an update that does not change the row size, I don’t want to involve forwarding pointers in the discussion here)

CREATE TABLE TestingUpdate3 (
ID INT IDENTITY,
SomeString CHAR(50)
)

INSERT INTO TestingUpdate3 (SomeString)
SELECT TOP (1000000) ' ' FROM msdb.sys.columns a CROSS JOIN msdb.sys.columns b

CHECKPOINT -- truncate the log, DB is in simple recovery.

UPDATE TestingUpdate3
SET SomeString = 'Something'

SELECT Operation, Context, AllocUnitName, [Transaction Name], Description FROM fn_dblog(NULL, NULL) AS TranLog

DROP TABLE TestingUpdate3

UpdateHeapLotsOfRows

Still the modify row operation, so it’s not a case that lots of rows will cause SQL to split the update into a delete-insert pair.

Moving on, let’s try a table with a clustered index, an update of a non-key column that does not change the size of the row. Anyone willing to bet what we’ll see?

CREATE TABLE TestingUpdate4 (
ID INT IDENTITY,
SomeString CHAR(50)
)

CREATE CLUSTERED INDEX idx_ID ON TestingUpdate4 (ID)

INSERT INTO TestingUpdate4 (SomeString)
VALUES
('One'),('Two'),('Three'),('Four'),('Five'),('Six'),('Seven'),('Eight'),('Nine')

CHECKPOINT -- truncate the log, DB is in simple recovery.

UPDATE TestingUpdate4
SET SomeString = 'NotFour'
WHERE ID = 4 -- one row

SELECT Operation, Context, AllocUnitName, [Transaction Name], Description FROM fn_dblog(NULL, NULL) AS TranLog

DROP TABLE TestingUpdate4

The log records are a little more complex, there’s a few more than in the previous case, but the operation for the update is still the same – LOP_Modify_Row

UpdateClusterFixedSize

Second test with a clustered index, an update of a non-key column that does change the size of the row.

CREATE TABLE TestingUpdate5 (
ID INT IDENTITY,
SomeString VARCHAR(50)
)

CREATE CLUSTERED INDEX idx_ID ON TestingUpdate5 (ID)

INSERT INTO TestingUpdate5 (SomeString)
VALUES
('One'),('Two'),('Three'),('Four'),('Five'),('Six'),('Seven'),('Eight'),('Nine')

CHECKPOINT -- truncate the log, DB is in simple recovery.

UPDATE TestingUpdate5
SET SomeString = 'NotFour'
WHERE ID = 4 -- one row

SELECT Operation, Context, AllocUnitName, [Transaction Name], Description FROM fn_dblog(NULL, NULL) AS TranLog

DROP TABLE TestingUpdate5

Still have a modify row operation here. Still an in-place update.

UpdateClusterChangingSize

Lastly an update of the clustered index key value.

CREATE TABLE TestingUpdate6 (
ID INT,
SomeString CHAR(50)
)

CREATE CLUSTERED INDEX idx_ID ON TestingUpdate6 (ID)

INSERT INTO TestingUpdate6 (ID, SomeString)
VALUES
(1,'One'),(2,'Two'),(3,'Three'),(4,'Four'),(5,'Five'),(6,'Six'),(7,'Seven'),(8,'Eight'),(9,'Nine')

CHECKPOINT -- truncate the log, DB is in simple recovery.

UPDATE TestingUpdate6
SET SomeString = 'NotFour',
ID = 42
WHERE ID = 4 -- one row

SELECT Operation, Context, AllocUnitName, [Transaction Name], Description FROM fn_dblog(NULL, NULL) AS TranLog

DROP TABLE TestingUpdate6

SplitUpdate

Now we do have a split update. We’ve got a delete_rows and an insert_rows operation in the log. This was not done as an in-place update

So what can we conclude here? Does SQL do all updates as split updates?

It should be clear that, for cases where the index key is not changed, SQL can do updates as in-place updates. I’m not going to try and claim that it always will, that would be silly, there are lots of scenarios that I haven’t looked at (page splits and forwarded rows being among the most obvious), but it can and will do in-place updates.

For updates that change the key values, SQL will not do those as in-place updates. Paul explained that in one of his debunking posts a while back – http://sqlskills.com/BLOGS/PAUL/post/Do-changes-to-index-keys-really-do-in-place-updates.aspx