Index Selectivity and Column Order
Which Field Goes First in an Index?
There’s a common piece of advice given about columns in an index key that says that the most selective column should go first. It's correct, but the problem is that it’s often given without any explanation as to why the most selective column should go first, nor are the other considerations for index key order mentioned.
This can lead to misunderstandings like, in the extreme case, where one person after hearing that advice went and added the primary key column as the leading column of every single nonclustered index (because it’s highly selective), and then wondered why his database performance decreased dramatically.
The comment about selectivity is because of the way SQL keeps statistics on indexes (see Gail Shaw's post on statistics for more information about statistics). SQL only keeps the histogram for the first column of the index. That means that it only knows the actual distribution of values of the first column. If the first column is not selective, the index may not be used. However, that’s not the whole story.
SQL also, in addition to the histogram, keeps density values for all of the left-based subsets of the index keys. So, for a 3 column index key, SQL knows the density of the first column, of the first and second and of all three. The density is, in a nutshell, a value that shows how unique the set of columns is. It’s 1/(distinct values). The value can be seen for any index using DBCC Show_Statistics with the DENSITY_VECTOR option.
This means, while SQL only knows the actual data distribution of the first column, it does know, on average, how many rows will be returned by an equality match on any left-based subset of the index keys.
Therefore, a good rule for the order of columns in an index key is to put the most selective columns first, when all other considerations are equal.
Examples Using Equality Predicates
What does that mean in practice? Well, let’s consider a hypothetical table:
CREATE TABLE ConsideringIndexOrder ( ID INT IDENTITY, SomeString VARCHAR(100), SomeDate DATETIME DEFAULT GETDATE() ); CREATE INDEX IX_ID_SomeString_SomeDate ON ConsideringIndexOrder ( ID, SomeDate, SomeString );
Say this table has 10000 rows. It’s a heap, no clustered index. Let’s further say that there are 100 different values for Somestring, and 5000 different values for SomeDate. ID, since it’s an identity, is unique.
There’s a single nonclustered index on all three columns, in the order ID, SomeDate, SomeString.
Considering equalities first, this index can only used to seek for queries of the following forms:
In other words, it’s only useful when the columns in the where/join are a left-based subset of the index key columns.
It cannot be used to seek on by a query that only filters on SomeDate, only on SomeString or the combination of the two. It’s like using the phonebook (which is an index ordered by surname, first initial) to find all the people whose firstname starts with M. It’s possible, but only by reading the entire book, i.e. by doing a scan.
It’s useless to have the most selective column of the index on the left if very few queries filter on it. Queries that don’t filter on it, but do filter on the other columns of the index will have to scan, and scans are expensive.
If it turns out that there’s more than one column that can be first, based on the queries that run against the table, or if all queries do equality matches on two or more columns, then put the most selective one first so that SQL has more chance to know that the index is useful. Otherwise, put the columns that are more often filtered on earlier in the index, so that the index can be used by more queries.
Let’s take a look at some query scenarios based on the hypothetical table above to see how that index will be used.
Scenario 1: Equality Match on the ID Column
This one’s the simplest. Since it’s a direct equality match on the leading column of an index, it’s a single seek operation to find the row.
Scenario 2: Equality Match on the ID and Date Columns
This one’s also simple. Since it’s a direct equality match on a left-based subset of the index columns, it’s a single seek operation to find the row.
Scenario 3: Equality Match on the ID and String Columns
This one’s a little harder. Only the filter on ID can be done as part of the index seek because the string column is not the second column in the index. The date is, and this query’s not filtering on the date. Hence this will require a seek operation on ID and then the string column checked to see if it matches. It’s no longer a single seek operation, even though the second predicate is done by the seek operator. SQL has to seek to find the matching ID and then compare the value of the string column to see if it matches.
Scenario 4: Equality Match on the Date and String Columns
In this case, SQL can’t do a seek at all. The leading column of the index is not used in the where clause and, as such, the only way to satisfy this query is to scan. In fact, SQL decides to do a table scan and evaluate each of the rows in the table against the values specified for the date and string columns.
Examples Using Inequality Predicates
One thing to note straight off is that the selectivity of a column is much less important for inequality predicates than it was for equality. For equality predicates, the selectivity alone can give a reasonable idea of the number of rows a particular predicate will return. That’s not the case with inequalities. Also, with inequality predicates, the order of columns in the index becomes very important.
One of the most important considerations with inequality predicates is the number of rows that the predicate will return. An identity column may be highly selective, but if the filter is for all rows > 0 and the identity values start t one, then an index on that column is not going to be very useful.
The other consideration when there are inequality predicates is that only that column and columns to the left of it in the index key can be used for index seeks. Any columns to the right of the column with the inequality is no longer eligible for seeking.
To explain with an example, consider our hypothetical table from the previous post (with one small change):
CREATE TABLE ConsideringIndexOrder ( ID INT, SomeString VARCHAR(100), SomeDate DATETIME DEFAULT GETDATE() ); CREATE INDEX IX_ID_SomeString_SomeDate ON ConsideringIndexOrder ( ID, SomeDate, SomeString );
The same as previously, there’s a single nonclustered index on all three columns, in the order ID, SomeDate, SomeString.
If there’s an inequality predicate, then then the index is only fully seekable for the following queries:
If there’s another predicate, equality or inequality, on a column further to the right in the index, that cannot be executed as part of the index seek, and will be done as a second step, just as happened with equalities when the predicates were not left-based subsets of the index columns.
So, what does that mean for index columns order? Quite simply, if queries are always going to filter with one or more equality predicates and one or more inequality predicates, the columns used for the inequalities must appear further to the right in the index than the equalities.
That’s great when there’s only one inequality predicate, but what happens when there’s more than one? If there are going to be more than one inequality predicate, the one that is likely to return fewer rows should go earlier in the index. This is not to say the most selective one, but the one that will be queried with a more selective range.
Using the above table as an example, if a typical query will run with an inequality on the ID column that on average will return 1000 rows and with an inequality on the date column that will on average return 100 rows, then the date column should go before the ID in the index (assuming that’s the only query)
Let’s take a look at some query scenarios based on the hypothetical table above to see how that index will be used with some inequality predicates.
Scenario 1: Inequality Predicate on the ID Column
This is probably the simplest of the inequalities. Since ID is the leading column of the index, SQL does a seek to find the beginning of the range (or the first row in the table if applicable) and then reads along the leaf pages of the index until it reaches the end of the range. Those rows are then returned.
Scenario 2: Equality Match on the ID Column and Inequality on the Date Column
This one’s also fairly easy. SQL seeks to find a matching ID and the start of the range and then reads along the index to find the rest of the rows.
Scenario 3: Inequality Match on Both the ID and Date Columns
In this case, only one of the predicates can be used as a seek predicate, the other will be executed as a predicate, meaning that each row that the seek retrieves has to be compared against that predicate. Since the index starts with ID, it’s the inequality on ID that will be picked for the seek. If there was a second index that started with date, that one might be picked instead.
While both columns are mentioned in the seek predicate, note that there’s also a predicated on the SomeDate column, which is not present in the simple index seeks.
This article was adapted from a series of blog posts by Gail Shaw.
Gail is a database consultant from Johannesburg, South Africa, specialising in performance tuning and database optimisation. Before moving to consulting she worked at a large South African investment bank and was responsible for the performance of the major systems there.
Gail was awarded MVP for SQL Server in July 2008 and spoke at both TechEd South Africa and the PASS Community Summit in Seattle in the same year. She is a frequent poster on the SQLServerCentral forums and has written three articles so far for the same site.
Her online presences include:
More SQLServerPedia Articles on Indexes
How SQL Server Indexes Work
Types of Indexes in SQL Server
Best Practices on How to Design Database Indexes
Maintaining Indexes for Top Performance
Indexes need regular maintenance in order to perform well.