Consolidation – Calculating Combined CPU

I got a consolidation query from Andrey.  He asked:

“I have a question related to SQL Server consolidation. You know that the physical memory on any machine is limited and SQL Server tends to consume all available memory. Let’s say I am consolidation two SQL Servers. The first one consumes 1GB of memory and the second one is consuming 1GB of memory. I tempt to say that we need 2GB when consolidating. BUT what if I have only 1.5GB available on the target server? Will the performance suffer if I consolidate two servers and make then only 1.5 GB available? Do you have any ideas on how to estimate memory requirements in this case? The same applies to CPU and IO as well.”

That question comes up all the time when we talk about consolidation.  First, let’s talk about memory since that one’s the easiest to measure.

Memory is so cheap now (under $500 for 8gb of server-quality memory) that when people consolidate, they should throw a lot of memory into the server.  You’re probably consolidating old servers that had less memory, so it’s easy to take 4 servers with 4gb ram each (16gb total) and consolidate them into one server with 32gb of ram.  When you’re saving tens of thousands of dollars on SQL Server licensing, it’s easy to argue for $2,000 of memory (32gb) to help guarantee the project’s success.

That’s my personal real-world answer to memory issues during consolidation: you throw a lot of memory at the problem, because it’s cheaper to buy memory than it is to spend a lot of time analyzing the applications and the memory use.  And it’s way cheaper to buy this memory than it is to roll back a failed consolidation project.

But that doesn’t answer your question, because a cheap company might not do that, and I wanna make sure I answer your question.

Here’s the problem with consolidation and memory: before, when you had two separate servers, they had their own cache.  When you combine the two servers into one, they will fight over the same cache.  If one of the users runs a big query, they will use ALL of the memory on the server, no matter how much memory the server has.  If you’re really worried about memory requirements when you’re consolidating, you consolidate into two or more instances on the same SQL Server.  That way, you can specify the maximum memory for each instance, and then you guarantee that each application will get enough memory.

Well, that’s not really true either: you’re not guaranteeing that the application will get ENOUGH, you’re just guaranteeing that it will get its own minimum share.  To me, one of the selling points of consolidation is giving each application the chance to grab more peak resources when necessary, pooling resources so that everyone runs faster.  Ask your CIO about utilization percentages, and she’ll say she wants them as high as humanly possible – and you can’t get that if you break up your databases into individual instances and hard-code their memory requirements.

So that’s my memory answer: you either buy more memory than the servers initially had, or you can separate them onto different instances.  I vote for the former.

CPU consolidation is the same kind of problem.  One bad query can still slow down the entire server.  If you write a query that takes 100% CPU for 30 seconds, and you consolidate its databases onto a server that has 5 times as much CPU power, you’re still going to take 100% CPU for 6 seconds.  (That’s not necessarily exactly right – depends on whether the query can be parallelized – but you get the idea.)

In theory, we can fix that by consolidating the servers into separate instances on the same server, and setting the CPU affinity mask for each instance so that they’re guaranteed their own CPU power.

In practice, that’s a huge pain in the butt.  It’s easy for two instances on one server, but as soon as you start talking about 3 and 4 node clusters, it’s a mess.  You have to make a spreadsheet to track which CPUs will be used by which servers on which instances.  When there’s a failover between cluster nodes, you have to stay on top of which instances can fail onto which nodes.  I don’t know about you, but when a node fails, I’ve got enough things on my mind already.

Now that we’ve covered CPU and memory, let’s talk about IO.  You can see that CPU and memory calculations are a mess – once you get to this point, people throw in the towel and don’t even think about IO.  The sad truth is that DBAs treat their storage as a black box: they don’t do much storage benchmarking to see if they’re getting enough IO throughput, and they don’t consider this during consolidation.  Getting more storage throughput means working closely with your SAN team to test multiple HBAs, multipathing software, and array configuration.  It’s a lot of work – it does pay off, but it is a heck of a lot of work.

If any of our readers have done storage throughput benchmarking as part of a consolidation project, I’d love to hear about it in the comments.  (Or if you haven’t, that’s good feedback too.)

Tags:

Leave a Reply