Clustered index impacts on writes vs reads

Filed under: Database Design — Brent Ozar at 11:59 am on Thursday, May 29, 2008

After today’s webcast, “What a Cluster! Clustered Indexes” we got an email from David Holt at Kaye Scholer LLP. He writes:

I just attended the webcast “What A Cluster! Clustered Indexes”; thanks for the info!

Clustered indexes obviously impact writes as well as reads, and there can be a conflict between the needs of the two; i.e., if a lot of different people are writing to the same table then you don’t want them to block each other, yet you want simultaneous queries (selects) on that table to be efficient. That seems to argue that the clustered index should make writes as efficient as possible, even if it’s not the most effective clustered index for reads. Would you agree?

Uh oh - you know it’s dangerous when someone from a law firm asks if you would agree, and it’s even more dangerous when that answer is no. I’ll take my life into my own hands and answer anyway.

When designing an index strategy for a table, it helps to know the read/write mix and any time ranges for each. I can illustrate with two examples of opposite ends of the spectrum.

Let’s say our company has a web filtering database that logs all employee web activity to a database table, including the user info, date/time, and the URL they were surfing. Every now and then, they’ll run reports to see what kinds of web sites the users are hitting, and sometimes they’ll audit the web history of a particular individual.

Example #1: The Web History Table

The web activity table would be extremely write-intensive, since we have a lot of company employees who sit around surfing the web all day long. (Like you right now, come to think of it - get back to work!) A write performance reduction as small as 10% might have a significant impact on our ability to keep up with our surf-happy employees. Plus, we store the data on a SATA RAID5 array with slow writes to begin with.In a case like this, we could make an argument for a clustered key using an identity field or using a datestamp field. That would keep write speeds quick at the cost of making reports slower. However, at report time, these users don’t really care how long the report takes to run - they just want to see it sooner or later.

Example #2: The IP Subnet Table

Our web filtering application would store a list of IP address subnets - network TCP/IP address ranges that identify workstations on our network. We would not do very frequent inserts into this table, because our network isn’t expanding like crazy.

When we query this table, the queries use a lot of range queries because they want to group users together by location. The IP address subnets are in similar ranges. The fields on this table include a lot of things that we want to report on, like geographic location and company department.

This table could have a completely different primary key design because it’d have different design objectives: faster queries instead of faster inserts.

Data warehouses are another great example of this kind of design: nightly loads happen during one window (nighttime), and end users do queries during a different window (daytime). Depending on our loads, we might design our primary keys differently. If we have plenty of time at night to load our files, and our users want faster responses on their daytime queries, then our primary keys wouldn’t be designed for fast inserts.

There’s yet another consideration in this design: the number of other indexes on the table. If a table has ten wide covering indexes, then the arrangement of the primary key may not make as much of a difference as you’d think, because there’s so much overhead in maintaining the indexes.

Bottom Line: Measure Every Change

This is why I drove home the Measure, Change, Measure, Revert point in the presentation: the only way to find out for sure is to get a complete picture of your query loads, design a replayable test (like a series of queries) and keep measuring the results.  If possible, replicate the production system as closely as possible in development, because the storage systems will affect your results.

How can I load image data into a SQL Server image datatype?

Filed under: Programming, SQL Server 2005, Transact-SQL (T-SQL) — Ari Weil at 11:17 am on Thursday, May 29, 2008

There are a number of articles (like this one) explaining how to do this for SQL Server 2000, but it’s so much easier in SQL Server 2005 that I’ll post the 2005-compatible code here. You should read this Microsoft Research article to get the Microsoft best practices justification for storing various file sizes in the database versus on the filesystem. Note, the following code sample is extremely simplistic and various iterations of similar logic are available from Microsoft and others.

if exists(select * from sysobjects where type=N'U' and name=N'picTable')
drop table dbo.picTable;
go
create table dbo.picTable(FileName sysname, Document image)
go
if exists(select * from sysobjects where type=N'P' and name=N'spLoadImagesFromLocalFolder')
drop procedure dbo.spLoadImagesFromLocalFolder;
go
create procedure [dbo].[spLoadImagesFromLocalFolder](@directory sysname,@fileextention nvarchar(10))
as
begin
declare @cmd varchar(8000),@fname nvarchar(200),@sqlstmt nvarchar(500),@ParmDefinition nvarchar(500),@cnt int

if object_id(’tempdb..#filestoload’) is not null
begin
drop table #filestoload
end
create table #filestoload ( RowNum int identity, fname varchar(8000) null)

SET @cmd = ‘dir ‘ + QUOTENAME(@directory,’”‘) + ‘ /T:W /A:-D /b’
insert into #filestoload ([fname])
exec [master].[dbo].[xp_cmdshell] @cmd
/* print @cmd */
delete from #filestoload where fname is null or lower(fname) not like N’%.’+lower(@fileextention)
select @fname = ”

select @cnt=count(*) from #filestoload
set @ParmDefinition = N’@fname nvarchar(200)’;
while @cnt>0
begin
select @fname = fname from #filestoload where RowNum > @cnt
set @sqlstmt=N’INSERT INTO picTable(FileName, Document) SELECT ”’+@fname+N”’ AS FileName,* FROM OPENROWSET(BULK N”’+@directory+@fname+N”’, SINGLE_BLOB) AS Document’
/* print @sqlstmt */
exec sp_executesql @sqlstmt,@ParmDefinition,@fname=@fname
select @cnt=@cnt-1;
end
drop table #filestoload;
end
go
exec [dbo].[spLoadImagesFromLocalFolder] N’C:\’,N’JPG’

Hope that helps!

Does SQL Server support PCI Compliance Standards features including periodic changing of keys, destruction of old keys, split knowledge and establishment of dual control of keys, and prevention of unauthorized substitution of keys?

Filed under: Administration, Katmai, Product, Security — Ari Weil at 9:11 am on Thursday, May 29, 2008

SQL Server 2008 has the encryption key management feature’s you’re looking for…

SQL Server 2008 Extensible Key Management enables the encryption keys that protect the database files to be stored in an off-box device such as a smartcard, USB device, or EKM/HSM module. This also enables data protection from database administrators (except members of the sysadmin group). Data can be encrypted by using encryption keys that only the database user has access to on the external EKM/HSM module.

Extensible Key Management also provides the following benefits:

* Additional authorization check (enabling separation of duties).
* Higher performance for hardware-based encryption/decryption.
* External encryption key generation.
* External encryption key storage (physical separation of data and keys).
* Encryption key retrieval.
* External encryption key retention (enables encryption key rotation).
* Easier encryption key recovery.
* Manageable encryption key distribution.
* Secure encryption key disposal..

See the full article Understanding Extensible Key Management for details.

What happened to Notification Services in SQL Server 2008?

Filed under: Katmai — Ari Weil at 7:53 am on Thursday, May 29, 2008

The quick answer is that it’s been dropped. Microsoft posted the following in the 2008 Read me documentation:

5.1 SQL Server Notification Services Removed from SQL Server 2008

Notification Services is not a component of SQL Server 2008, but will continue to be supported as part of the SQL Server 2005 product support cycle. In future releases, support for key notification scenarios will be incorporated into Reporting Services.

Clustered Indexes & SQL 2005’s Performance Dashboard

Filed under: Administration, Database Design, SQL Server 2005 — Brent Ozar at 6:53 am on Thursday, May 29, 2008

In today’s installment of the Quest Pain of the Week webcasts, Jason Hall and I will be discussing a reader’s question about finding missing clustered indexes, their impact on performance, and how to be proactive about monitoring performance to find those types of design issues.

MSSQLTips.com has a great post called “Finding primary keys and missing primary keys in SQL Server.”  That post includes two snippets of code, one for each of the popular SQL versions:

Finding missing primary keys in SQL 2005:

SELECT c.nameb.name
FROM sys.tables b
INNER JOIN sys.schemas c ON b.schema_id c.schema_id
WHERE b.type ‘U’
AND NOT EXISTS
(
SELECT a.name
FROM sys.key_constraints a
WHERE a.parent_object_id b.OBJECT_ID
AND a.schema_id c.schema_id
AND a.type ‘PK’ )

Finding missing primary keys in SQL 2000:

SELECT c.namea.name
FROM sysobjects a
INNER JOIN sysusers c ON a.uid c.uid
WHERE xtype ‘U’
AND NOT EXISTS
(
SELECT b.name
FROM sysindexes b
WHERE a.id b.id
AND (b.status 2048)<>0)

You can read their post for similar queries on finding tables WITH primary keys too.

Also in today’s webcast, I’ll be showing a few screenshots from Microsoft’s SQL Server Performance Dashboard Reports.  They’re a free set of SSRS reports that are integrated into SQL Server Management Studio.  The reports I’ll be showing in the webcast are somewhat limited in that they only show data that’s currently in the plan cache.  If your server is under memory pressure, it may not retain this data for too long, and you would have to revisit these reports pretty often to get a good picture of what’s going on under the hood.

For more information about the dashboard and how to use it, check out Kevin Kline’s Tool Time article “SQL Server 2005 Dashboard Reports” in SQL Server Magazine’s August 2007 issue.

Backing Up and Restoring Replicated Databases

Filed under: Administration, Backup and Restore, Replication — Jason at 8:04 am on Friday, May 23, 2008

At the last user group that I attended, a question was raised adbout restoring databases involved in replication and maintaining the replication settings.  I mentioned that the RESTORE command has a KEEP_REPLICATION option that can be used in this scenario but thought I’d go into a little more detail here.

 Typically when you restore a SQL Server database, replication settings are removed during the restore process.  This allows you to do a database recovery to a dev/test/DR environment and not have to worry about replication settings causing issues in a non replicated environment.  In many scenario’s however you may need to restore a replicated database and maintain these replication settings so that you can re-initialize replication after the restore.  The KEEP_REPLICATION option allows replication settings to be maintained after a database is recovered.  Syntax is below:

 RESTORE DATABASE northwind FROM DISK=’d:\backups\northwind.bak’ WITH KEEP_REPLICATION

For more information, Microsoft has an MSDN article that goes through the entire Backup and Recovery strategy for databases involved in replication.  There is quite a bit more work here than you may initially think, so be prepared.

http://msdn.microsoft.com/en-us/library/ms151152.aspx

If anyone has any real-world scenarios around backing up and restoring replicated databases and some of the challenges that you’ve faced, I’d love to hear them in the comments section.

SQL Server Reporting Services, a DBA’s tool?

Filed under: Administration, I'm a Newbie, Reporting Services — Jason at 6:55 am on Wednesday, May 21, 2008

Traditionally, SQL Server Reporting Services has been thought of as strictly a Business Intelligence tool that helps organizations make business decisions based on the data they have stored.  Over the past several years I have been preaching that there is a great deal of utility that Reporting Services can provide above and beyond what is strictly thought of as Business Intelligence. 

As a DBA, one of the most tedious and repetitive tasks that you must do on a daily, sometimes hourly, basis is to deliver reports to colleague’s and/or managers.  A manager might as for something simple such as, “Show me how much disk space is available on each drive on server X?” or “Show me the status of all of our scheduled jobs on server Y?”  What might appear a trivial tasks involves several steps, many I’m sure you’ve gone through already today:

  1. Build a query that returns the necessary data.
  2. Run the query in Management Studio and validate the result set.
  3. Copy and paste the result set into Microsoft Word or Excel
  4. Organize and format the data so that it looks presentable
  5. Compose an e-mail to the requestor with the file you created as the attachment.

What this presentation goes into is an approach for automating these tasks using Reporting Services.  If you build these reports into SSRS they can first of all be created with a single mouse click, and even better they can be automatically delivered on any given schedule.

 If you’re still interested in learning more about this topic, I have posted the presentation materials (including the slide deck, sample RDL files, and sample SQL Scripts) to Quest’s SQL Server Community Site (http://sqlserver.quest.com/thread.jspa?threadID=6781) where anyone can download them and ask any follow up questions.

Happy Reporting!!!

WHY ISN”T AUTO_UPDATE STATISTICS RUNNING?

Filed under: Internals and Architecture, Tuning and Optimization — KKline at 10:07 am on Monday, May 19, 2008

I find that auto_update statistics in SQL Server is a really good thing.  Without it, many third-party applications would simply fall over from lack of preventative maintenance.  With it, they are able to run for extraordinarily long periods of time without really needing a full-time DBA to check up on the databases.

 

Having said all of that, auto_update statistics is a mystery for many of us.  This is reflected in the fact that I get lots of questions about auto_update statistics when I speak publicly (which seems to be at least once per month these days).  The most common question is “why isn’t auto_update statistics running?”

 

There are many elements to the answer to this question.  Naturally, you’re not going to have any new statistics in a database that is marked as readonly.  However, you might not know that approximately 20% of a table needs to change before triggering an auto_update statistics run for the given table.  If the table is really big, it’s possible that you’re simply not changing enough data.

 

Also remember that statistics are not the same in SQL Server 2005 as they are in SQL Server 2000 and that they must be upgraded using the sp_updatestats stored procedure after you migrate from version 2000 to 20005.  Otherwise, the old statistics are unusable by the query engine.  Once you’ve run sp_updatestats, the engine will maintain them automatically.  Be sure to check the article “After Upgrading the Database Engine” at http://msdn2.microsoft.com/en-s/library/ms143695.aspx to make sure you covered all the bases.

 

Thanks to Connor Cunningham and Simon Sabin (http://sqlblogcasts.com/blogs/simons) for help on this post.

Consolidation Planning: Perfmon Statistics

Filed under: Administration, SQL — Brent Ozar at 10:29 am on Wednesday, May 14, 2008

In this Thursday’s webcast about SQL Server Consolidation, I’ll be covering what Perfmon statistics to gather when planning for consolidation. (I’ve got a blog post describing Performance Monitor in detail if you haven’t used it before.)

Here’s my short list of counters specifically for consolidation projects:

  • Physical Disk - Avg. Disk Queue Length - shows how many things are waiting on the I/O subsystem
  • Physical Disk - Avg Disk sec/Read - shows how long reads are taking
  • Physical Disk - Avg Disk sec/Write - shows how long writes are taking
  • Processor - % Processor Time - shows how busy the server’s CPUs are
  • System - Processor Queue Length - shows how many things are waiting on CPU power

When I’m gathering these statistics and analyzing the results, I’m most concerned about times when these counters show huge spikes. I’m looking for maxed-out CPUs and very heavy drive access. I don’t care about averages - I want the very worst times of the day.

Before a consolidation project, if a single server has a sudden, massive load on its processors, that load is confined to that single server. One server’s load doesn’t suddenly bring down any other applications.

After consolidation, however, more databases & applications live on the same hardware, and the application that used to live by itself now impacts other applications.

In a perfect world, when I consolidate multiple servers together, their extreme load times never overlap. These would be examples of good consolidation candidates:

  • Sales force application with heavy use from 8am-10am (morning reports)
  • Reporting application with heavy use from 6pm-9pm (as store closing info comes in)
  • Web filtering application with heavy use from noon-5pm (as people get bored and start surfing the web)

All of these applications will have load around the clock, but they don’t have knock-the-box down load at any given time when other apps need power too.

Even if the times don’t overlap, though, we have to watch out for apps that will consume all available power no matter what. An extreme example is a data warehouse that runs nightly loads. During that time window, the server is under very heavy load. There are long time windows where the server’s CPU load may average 100% and its I/O subsystems are responding slower. If we consolidated that server with a few others, the data warehouse load will still use all of the power it can get, and other applications will suffer. Plus, the data warehouse’s backup maintenance window might be during the day - and we can’t slow down our other apps during the day. (Yes, data warehouses are a bad candidate for consolidation, and yes, that’s why I picked them as an example.)

Notice that I didn’t include any memory counters in my Perfmon list. When consolidating multiple SQL Servers together onto the same server with different instances, you can segregate their memory use, so you don’t have to worry about them competing for the same memory. On the other hand, if you’re consolidating into a single instance, you can’t segregate applications into separate memory spaces inside SQL anyway, so there’s no use planning for something you can’t change.

That’s not to say you don’t have to plan for memory during consolidation, but it’s just easier than CPU and storage planning.

If there’s other Performance Monitor counters you like to gather before consolidation, post a comment here. Thanks!

MIRRORING THROTTLES BACKLOG

Filed under: Uncategorized — KKline at 2:11 am on Friday, May 2, 2008

For some reason, the title of this blog post makes me grin because it evokes a mental image of one guy trying to comically strangle another guy.  But I digress…

 

Speaking at a user group meeting recently, an attendee asked whether the mirroring process had any capability to speed up or slow down its workload according to the amount of work waiting on it.  I cast around for an answer to this and my friend, Kalen Delaney, already had the answer.  She already has an answer for most any question on SQL Server, but again I digress.  She got her answer from Peter Byrne, a Microsoft program manager and member of the SQL Server Storage Engine team.

 

Peter says…”There is a throttling heuristic used by mirroring to try to keep the backlog from getting too large on the mirror.  Essentially, at commit time SQL Server may pause briefly if it determines the REDO queue on the mirror is getting large enough that failover time would be significantly affected.”

 

You can find more details about this process in the white paper:

 http://www.microsoft.com/technet/prodtechnol/sql/2005/technologies/dbm_best_pract.mspx

 

And while we’re at it, don’t forget to investigate Kalen’s new DVD at http://sqlblog.com/blogs/kalen_delaney/archive/2008/01/24/did-you-know-my-dvd-is-now-orderable.aspx#4722.  Not only is it a great value at any price with fantastic content, it’s only $19.95 plus $3 for S/H, which, when you think of it is incredibly cheap.  Taking a MOC (Microsoft Official Curriculum) class is measured in the thousands of dollars.  Kalen’s content, which is some of the best available, is so affordable!

Technorati Tags: ,