Clustered index impacts on writes vs reads

Filed under: Database Design — Brent Ozar at 11:59 am on Thursday, May 29, 2008

After today’s webcast, “What a Cluster! Clustered Indexes” we got an email from David Holt at Kaye Scholer LLP. He writes:

I just attended the webcast “What A Cluster! Clustered Indexes”; thanks for the info!

Clustered indexes obviously impact writes as well as reads, and there can be a conflict between the needs of the two; i.e., if a lot of different people are writing to the same table then you don’t want them to block each other, yet you want simultaneous queries (selects) on that table to be efficient. That seems to argue that the clustered index should make writes as efficient as possible, even if it’s not the most effective clustered index for reads. Would you agree?

Uh oh - you know it’s dangerous when someone from a law firm asks if you would agree, and it’s even more dangerous when that answer is no. I’ll take my life into my own hands and answer anyway.

When designing an index strategy for a table, it helps to know the read/write mix and any time ranges for each. I can illustrate with two examples of opposite ends of the spectrum.

Let’s say our company has a web filtering database that logs all employee web activity to a database table, including the user info, date/time, and the URL they were surfing. Every now and then, they’ll run reports to see what kinds of web sites the users are hitting, and sometimes they’ll audit the web history of a particular individual.

Example #1: The Web History Table

The web activity table would be extremely write-intensive, since we have a lot of company employees who sit around surfing the web all day long. (Like you right now, come to think of it - get back to work!) A write performance reduction as small as 10% might have a significant impact on our ability to keep up with our surf-happy employees. Plus, we store the data on a SATA RAID5 array with slow writes to begin with.In a case like this, we could make an argument for a clustered key using an identity field or using a datestamp field. That would keep write speeds quick at the cost of making reports slower. However, at report time, these users don’t really care how long the report takes to run - they just want to see it sooner or later.

Example #2: The IP Subnet Table

Our web filtering application would store a list of IP address subnets - network TCP/IP address ranges that identify workstations on our network. We would not do very frequent inserts into this table, because our network isn’t expanding like crazy.

When we query this table, the queries use a lot of range queries because they want to group users together by location. The IP address subnets are in similar ranges. The fields on this table include a lot of things that we want to report on, like geographic location and company department.

This table could have a completely different primary key design because it’d have different design objectives: faster queries instead of faster inserts.

Data warehouses are another great example of this kind of design: nightly loads happen during one window (nighttime), and end users do queries during a different window (daytime). Depending on our loads, we might design our primary keys differently. If we have plenty of time at night to load our files, and our users want faster responses on their daytime queries, then our primary keys wouldn’t be designed for fast inserts.

There’s yet another consideration in this design: the number of other indexes on the table. If a table has ten wide covering indexes, then the arrangement of the primary key may not make as much of a difference as you’d think, because there’s so much overhead in maintaining the indexes.

Bottom Line: Measure Every Change

This is why I drove home the Measure, Change, Measure, Revert point in the presentation: the only way to find out for sure is to get a complete picture of your query loads, design a replayable test (like a series of queries) and keep measuring the results.  If possible, replicate the production system as closely as possible in development, because the storage systems will affect your results.

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a comment

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>