Consolidation Planning: Perfmon Statistics

Filed under: Administration, SQL — Brent Ozar at 10:29 am on Wednesday, May 14, 2008

In this Thursday’s webcast about SQL Server Consolidation, I’ll be covering what Perfmon statistics to gather when planning for consolidation. (I’ve got a blog post describing Performance Monitor in detail if you haven’t used it before.)

Here’s my short list of counters specifically for consolidation projects:

  • Physical Disk - Avg. Disk Queue Length - shows how many things are waiting on the I/O subsystem
  • Physical Disk - Avg Disk sec/Read - shows how long reads are taking
  • Physical Disk - Avg Disk sec/Write - shows how long writes are taking
  • Processor - % Processor Time - shows how busy the server’s CPUs are
  • System - Processor Queue Length - shows how many things are waiting on CPU power

When I’m gathering these statistics and analyzing the results, I’m most concerned about times when these counters show huge spikes. I’m looking for maxed-out CPUs and very heavy drive access. I don’t care about averages - I want the very worst times of the day.

Before a consolidation project, if a single server has a sudden, massive load on its processors, that load is confined to that single server. One server’s load doesn’t suddenly bring down any other applications.

After consolidation, however, more databases & applications live on the same hardware, and the application that used to live by itself now impacts other applications.

In a perfect world, when I consolidate multiple servers together, their extreme load times never overlap. These would be examples of good consolidation candidates:

  • Sales force application with heavy use from 8am-10am (morning reports)
  • Reporting application with heavy use from 6pm-9pm (as store closing info comes in)
  • Web filtering application with heavy use from noon-5pm (as people get bored and start surfing the web)

All of these applications will have load around the clock, but they don’t have knock-the-box down load at any given time when other apps need power too.

Even if the times don’t overlap, though, we have to watch out for apps that will consume all available power no matter what. An extreme example is a data warehouse that runs nightly loads. During that time window, the server is under very heavy load. There are long time windows where the server’s CPU load may average 100% and its I/O subsystems are responding slower. If we consolidated that server with a few others, the data warehouse load will still use all of the power it can get, and other applications will suffer. Plus, the data warehouse’s backup maintenance window might be during the day - and we can’t slow down our other apps during the day. (Yes, data warehouses are a bad candidate for consolidation, and yes, that’s why I picked them as an example.)

Notice that I didn’t include any memory counters in my Perfmon list. When consolidating multiple SQL Servers together onto the same server with different instances, you can segregate their memory use, so you don’t have to worry about them competing for the same memory. On the other hand, if you’re consolidating into a single instance, you can’t segregate applications into separate memory spaces inside SQL anyway, so there’s no use planning for something you can’t change.

That’s not to say you don’t have to plan for memory during consolidation, but it’s just easier than CPU and storage planning.

If there’s other Performance Monitor counters you like to gather before consolidation, post a comment here. Thanks!

MIRRORING THROTTLES BACKLOG

Filed under: Uncategorized — KKline at 2:11 am on Friday, May 2, 2008

For some reason, the title of this blog post makes me grin because it evokes a mental image of one guy trying to comically strangle another guy.  But I digress…

 

Speaking at a user group meeting recently, an attendee asked whether the mirroring process had any capability to speed up or slow down its workload according to the amount of work waiting on it.  I cast around for an answer to this and my friend, Kalen Delaney, already had the answer.  She already has an answer for most any question on SQL Server, but again I digress.  She got her answer from Peter Byrne, a Microsoft program manager and member of the SQL Server Storage Engine team.

 

Peter says…”There is a throttling heuristic used by mirroring to try to keep the backlog from getting too large on the mirror.  Essentially, at commit time SQL Server may pause briefly if it determines the REDO queue on the mirror is getting large enough that failover time would be significantly affected.”

 

You can find more details about this process in the white paper:

 http://www.microsoft.com/technet/prodtechnol/sql/2005/technologies/dbm_best_pract.mspx

 

And while we’re at it, don’t forget to investigate Kalen’s new DVD at http://sqlblog.com/blogs/kalen_delaney/archive/2008/01/24/did-you-know-my-dvd-is-now-orderable.aspx#4722.  Not only is it a great value at any price with fantastic content, it’s only $19.95 plus $3 for S/H, which, when you think of it is incredibly cheap.  Taking a MOC (Microsoft Official Curriculum) class is measured in the thousands of dollars.  Kalen’s content, which is some of the best available, is so affordable!

I need to know whether a 3rd party application is leading to out of memory errors in my SQL Server…and I need to know right now.

Filed under: Administration, Internals and Architecture, SQL Server 2005, Tuning and Optimization — Ari Weil at 12:57 pm on Thursday, April 24, 2008

First things first - depending on the version of SQL Server that you’re using, SQL Server could be creating this problem for itself. See these articles if you’re running SQL Server 2005 pre-SP2:

So what’s going on? First we need to address two concepts: Virtual Address Space and Working Set. Windows uses virtual address space (VAS) to allocate memory to applications (that’s as deep as I’m going here). A program’s working set is the collection of VAS pages that have been recently referenced. When a process is created, it is assigned a minimum and maximum working set size. As the working set size increases, thereby increasing memory demand, the virtual memory manager will attempt to keep up to the maximum working set size (memory) resident without exceeding the maximum threshold.

One common misconception is that there is as much virtual memory available to a process as there is physical memory (RAM) on a machine. Not so (or I wouldn’t be going there); you can get much more depth by reading Slava OKs’s blog on VAS. One very important point from a SQL Server perspective, however, is that when you’re configuring min and max server memory, you have to take into account the MemToLeave area. MemToLeave is VAS that’s left unused when SQL Server starts so that external components have address space to consume when they’re loaded. Typically, the majority of memory allocations from MemToLeave are from non-SQL Server memory consumers like COM objects, extended stored procedures, and linked servers. You can use DBCC MEMORYSTATUS to see these, but bear in mind that what’s reported will only reflect allocations made by SQL Server directly (SQL Server has no knowledge of outside memory requests).

So, when troubleshooting out of memory errors, do your homework.

  • Check the list above to see whether your SQL Server is suffering from a known SQL Server bug, and understand (at least a little bit about) VAS and memory allocations.
  • You can use Performance Monitor (Task Manager won’t help here) to get an idea of how much of VAS is currently consumed inside of your process by looking at the virtual bytes counter.
  • If you believe that a 3rd party application is loading DLLs into SQL Server’s VAS, or you want to see if your own CLR or external procedures are doing so use SQL Server’s DMV sys.dm_os_loaded_modules to see what’s been loaded into SQL Server’s address space.

Oh, and by the way, for anyone who’s used to configuring the working set size parameter with sp_configure, you need to be aware that it’s been deactivated in SQL Server 2005.

Hope that helps.

NUMA AND TCP PORT AFFINITY

Filed under: Uncategorized — KKline at 10:45 am on Thursday, April 24, 2008

If you’ve ever taken the time to read http://www.tpc.org/results/FDR/TPCC/hp_orca1tb_win64_fdr.pdf, you might find a bit in there that’s confusing. “Microsoft SQL Server 2005 Enterprise Edition 64-bit Service Pack 1 was configured to utilize “soft NUMA”, a feature that allows network connections to be affined to specific groups of CPUs (this is independent of the hardware NUMA features of the HP Integrity  Superdome). SQL Server was configured with 33 SoftNuma nodes. 31 of these nodes were configured with 2 CPUs each. The 32nd SoftNuma node as configured with one CPU and used for the checkpoint process. The 33rd SoftNuma node, also with one processor, was reserved for the Log Writer thread. A script doing continuous checkpoints of 1700 seconds (28 minutes, 20 seconds) was started on one of the server after steady state was reached with a connection port that connected to the 33rd Numa Node.  This allowed the main checkpoint process to run on that single processor, which directed the processing of the checkpoint tasks that SQL assigned to each of the other 31 SoftNuma Nodes.” 

What’s confusing about this is it’s reference to the checkpoint process, since the BOL also says that you can configure I/O affinity and CPU but not specific threads.

 

As it turns out, this issue is explained in the BOL article “How to” Map TCP/IP Ports to NUMA Nodes”.  The “connection process” mentioned above is not the more commonly know background checkpoint that flushes dirty pages to disk, but is a separate connection affinitized (is that even a word?) to the 32node which issues a CHECKPOINT command. 

 

Thanks to Eladio Rincon (Solid Quality Mentors) and Peter Byrne (Microsoft) for working this one out.

After a recent upgrade to SQL Server 2005 my linked server export to Oracle 10g no longer gives me the option of deleting first or appending. Any clues?

Filed under: Administration, SQL Server 2005 — Ari Weil at 7:15 am on Tuesday, April 22, 2008

More: Since there are 3 machines in this mix so I’m not sure which one I should be checking. Computer A is running Windows 2000 and SQL Server 2000; computer B is running Windows XP Professional and Oracle Server 10.2.0.2; computer C is running Windows XP Professional with SQL Server 2005 client tools (Management Studio) and Oracle 10.2.0.2 client. I am using Computer C to transfer data from Computer A to Computer B. To confuse me even more, using computer D, which has SQL Server 2000 client tools, I can transfer the data from Computer A to Computer B with no problem. Doesn’t that mean A and B are configured correctly?

You need to address computer C in the scenario you outlined as that version of MDAC/ODBC/Oracle client is being used. There is likely a mismatch between the SQL Server 2005 tools and your linked server. I would try the following:
1) ensure computer C is running the latest version of MDAC and an ODBC driver that fully supports Oracle 10g
2) check that you’ve got the latest version of the Oracle client on the linked SQL Server machine. Some customers have said that they require patch 5203839
3) ensure you’ve configured the connection to “Allow inprocess” in the provider options

Finally, check out these MSDN links to ensure your Microsoft software is completely up-to-date for this scenario:
http://support.microsoft.com/kb/280106
http://msdn2.microsoft.com/en-us/library/ms190618.aspx
http://msdn2.microsoft.com/en-us/library/ms189063.aspx

Hope that helps.

Implementing a Hash Partition on SQL Server 2005

Filed under: Database Design, Programming, SQL, SQL Server 2005, Transact-SQL (T-SQL) — KKline at 2:54 pm on Monday, April 21, 2008

One of the best things about becoming a Microsoft MVP is meeting other MVPs.  I bring this up because last week was the annual MVP Summit in Seattle, WA.  I was really looking forward to meeting Steve Kass.  Steve Kass is one of the smartest SQL Server MVPs I’ve encountered, especially when it comes to SQL questions.  A while back, I noticed that Steve made an interesting recommendation for a hash function that you could use for partitioning that I thought was worth noting.  A hash function would be very useful if you wanted to implement your own variation of a range partition using a hash function rather than the standard sort of range partitioning where colA values of A-H go to partition 1, values of I-P got to partition 2, and so forth.

 

Steve notes that you could use the following for hashing something small in size:

   CAST(    SUBSTRING(      HASHBYTES(’SHA1′,         CAST(my_col AS NVARCHAR(appropriate_size))),8,1) AS tinyint) 

This is just an off the cuff recommendation from Steve and might need some fine tuning, for example, the CAST might throw off persistence.  However, it’s a good start.

 

Thanks, Steve, for sharing this and thanks, readers, for sharing any improvements you might develop out in the field.

I have two Windows 2005 instances running on Windows 2003. Connections to one instance start failing with SSPI context errors and I spot something in the log files aboust deregistering the SPN.

Filed under: Administration, Internals and Architecture, SQL Server 2005 — Ari Weil at 8:57 am on Monday, April 21, 2008

There’s either a network issue with regard to that server’s DNS, or you’re having an issue that’s addressed by KB article 319723, particularly section 3: Configure the SQL Server service to create SPNs dynamically.

Let’s discuss how SQL Server integrated security works to understand what’s happening. When the SQL Server driver attempts to connect it tries to resolve the fully-qualified domain name of the SQL Server using WinSock APIs (look up gethostbyname and gethostbyaddr for details) to determine the Service Principal Name (SPN). SQL Server is using Kerberos over TCP-IP sockets via Security Support Provider Interface (SSPI) to resolve the SPN. If there is a problem verifying the fully-qualified domain name from the client or server side during this exchange, the connection fails. You can see this yourself when/if the problem occurs using NSLOOKUP from the client machine.

Hope that helps.

NASTY RUMORS ABOUT MAXDOP

Filed under: Internals and Architecture, Tuning and Optimization — KKline at 11:53 am on Tuesday, April 15, 2008

You’d think an article called “NASTY RUMORS ABOUT MAXDOP” would have something to do with Britney Spears or maybe Robert Downy Jr, but in that case it’d be total fiction (at least, it would be coming from my pen).

 

So, I was en route to the 2008 Microsoft MVP Summit yesterday and I had a chance to catch up on my reading.  You may have heard some rumors that you should only ever set MAXDOP (maximum degrees of parallelism) to an even number.  I can neither confirm nor deny these rumors since I haven’t had time to test this yet myself.  But I’m curious if any readers out there have definitive information one way or the other.

 

Evidently, there are two roles involved in parallel processing, a writer and a consumer.  Naturally, when two roles are at work, the MAXDOP setting doesn’t have to share resources when it is set in increments of two.

 

Hope this helps,

 

-Kevin

Wondering whether to take a full or differential backup? See what percentage of your data has changed using native SQL Server tools!

Filed under: Administration, Backup and Restore, Internals and Architecture, Transact-SQL (T-SQL) — Ari Weil at 9:22 am on Tuesday, April 15, 2008

If you’ve never read Paul Randal’s blog on the SQLSkills website you should check it out. Paul recently posted a stored procedure that can be used to tell which percentage of a database’s data has changed so you can determine which type of backup to take. The procedure can be scheduled and its output analyzed over time to strengthen your backup and recovery strategy (because we know that you already have one, you just need to improve upon it ;-) ).  The cool thing about code like this is that it can be bundled into other maintenance procedures and it can integrate with monitoring and maintenance tools.  For instance, you could run this procedure on a schedule with predefined diff thresholds that would flag the  appropriate type of backup job for execution.  You could then use enterprise monitoring tools like MOM or SCOM to determine the best maintenance windows to guarantee the backups run at the right time, and voila!  You’ve got the beginnings of an automated backup and recovery strategy.

The Publishing Industry Takes Another Hit

Filed under: Uncategorized — KKline at 11:30 am on Friday, April 11, 2008

There’s a lot of talk at the moment that Amazon is forcing self-publishers to use its own publishing centers in order to stay listed with the powerhouse Internet sales site.  For example, this blog post Has Amazon Gone Mad by Rick Jelliffe rather aptly describes the situation.  The Writer’s Weekly first broke the story here, but it’s now been taken up by major media such as the Wall Street Journal.  Blogger M. David Peterson points out that there are alternatives - simply print enough copies ahead of time and ship them to the Amazon printing centers.

This certainly isn’t the death knell of publishing. But it is a telling sign.  When I first started writing books, an author could reasonably expect to sell the first printing of their book, probably about 5000 books.  This was usually enough sales to cover the advance that the publisher paid them and maybe a bit extra to cover a celebratory round of beers with his/her buddies.  If the book was good and the the stars were in proper alignment, the author would be lucky to get addition printings of their book out the door and actually make a little money on the project.  Nowadays, it seems like the first printing of a new title is only around 3000 and fewer books seem to be ascending to the level of “strong seller”. 

Of course, there is always a place for titles like Word 2007 for Dummies.  Those sort of everygreen titles will always sell.  But it appears that the more niche your content is, the worse it will do as a book.  The obvious reason for this is the Internet.  Why would someone spend $40 for a book on SQL Server query tuning (as an example, I’m not knocking any specific book), when you could simply subscribe to the RSS feeds at sites like SQLblog, SQLMag, or SQL-Server-Performance and get nearly as much content?  For that matter, the power of a good Google search (and not Windows Live Search, imo) enables you to pull valuable content from all three of these sites in short order.  Not only is your search more taylored to your specific needs, it’s also more likely to be up-to-date with the latest versions, service packs, and nuances.

Authors like me are also concerned because the immediate alternative that comes to mind is writing and publishing an e-book.  However, e-books seem to hold even less promise no than in years past.  At least when you buy a book, you have a physical object that you own.  You can take it on a plane and read it during that “turn off all electronic devices” stage of the flight.  You can take it to the bathroom.  You can share it with your team mates at work (hopefully, not right after taking it to the bathroom).  E-books, though, have all the limitations of a book with none of the advantages of the internet. 

I believe that as the hard-print media (book publishers, magazine publishers, newsletters, journals, and newspapers) continue to see shrinking audiences, the key to survival is tapping in to blogging and expert opinion.  For example, the New York Times is doing extremely well with their Freakonomics blog, based upon the eponymous book (and a personal favorite of mine).

Next Page »