DBAs Behaving Badly (7/10): Manual Administration

Imagine for a moment it's late one Friday afternoon, and like a scene from Office Space, you try and avoid the boss on the way out the door. Before you make it out, he grabs you and tells you that you've just got another 10,000 SQL Servers to manage ....

"I'm gonna need you to go ahead and come in tomorrow..."

Office_space_bobbleheads
"Oh.... and I'm gonna need you to come in on Sunday as well ... mmmkay?"



I can joke about this, but for some of you this is probably all too real. In that context, let's examine some common worst practices, beginning with ....

Manual Administration

One of the common things I come across are DBAs who start with a very small number of servers to manage, and they manage them using a very manual technique. By that I mean using Management Studio wizards, for example, to backup databases, check logs and so forth...

Sometimes those DBAs are part time or “accidental” DBAs, for example, developers that have taken on a DBA role, or sometimes they just fall into the process of doing things in a manual manner.

Now managing one or two servers like that might be fine, but the problems begin as the number of servers increase, and unless the technique changes, those DBAs often end up running around in circles responding to whoever screams the loudest, which is not cool.

The fact is, there is simply no substitute for automation - automation enables more things to be achieved with fewer mistakes in a given amount of time. And what that means, is that there’s more time in the day for more pleasurable things – training, reading the newspaper, beer, golf or whatever floats the boat.

Essentially, if anything needs to be done more than once, it should be scripted, and created in a manner that enables it to be scheduled for execution, with a monitoring and alerting component placed over the top.

The other issue at play here is that during disaster recovery situations, scripts are absolutely crucial in getting things restored as quickly as possible – The alternative, that is running around manually poking things while under pressure, is rarely a good look.

Let’s continue by taking a look at the importance of alerts.

Not defining Alerts

One of the things that becomes very clear when dealing with a large number of servers is that you end up “managing by exception”. What that means is that you spend all of your time dealing with the things that go wrong, and the more servers you have, the more things that go wrong, increasing the need for proactive management.

Now, there’s two approaches to discovering things that go wrong; The first is to sit back and wait for people to start screaming, and the second is to define alerts, which help us discover problems before users do.

Alerts enable us to define the acceptable operating conditions, and to be notified when something falls outside that range – and that’s really the only efficient option for large environments.

For example, at a minimum, we should define alerts for SQL Agent job failures and high severity errors, but also for performance conditions. For example, we could define an alert for when batches/sec or CPU utilization reaches a critical level, at which performance is known to degrade. That would enable us to take actions, such as shutting down non essential processes like ad hoc reports or batch archiving.

One other thing I wanted to mention before moving on was what type of errors we should be creating alerts for. The classic alert setup is for errors of severity 19 +, but as per this post from Paul Randal earlier this year, such a scheme often misses important events such as error 825.

One of the many really cool things that Tibor Karaszi has done is create a SQL Agent Alerts Management Pack, which is essentially a T-SQL script to create alerts for all of the major errors and events that we should be looking out for, including error 825. So in the absence of anything else, this is a great starting point.

Before finishing, let’s cover off one final thing, and that’s the creation of a task list.

No Task/Check List

Regardless of the level of automation or 3rd party monitoring, one of the things I think is important is having a basic checklist of daily, weekly and monthly tasks.

Checklists are essential in many industries, aviation for example, and provide the framework within which we can automate jobs, bring newly hired DBAs up to speed, and various other things. So included here is a checklist I like to use myself.

Checklist

Obviously a lot of these checks can be automated – The important thing is having a documented starting point from which to automate.

In closing, it would be silly of me to write this post without talking about my favorite new feature in SQL Server 2008 ...

Policy Based Management (to the rescue)

Imagine for a moment you’re the new DBA of a company with thousands of servers across development, test and production and you’re asked to ensure all servers adhere to best practices.

If you were handy with scripting languages such as powershell, that would certainly help, but without those skills, you’d be basically manually checking and fixing each server – A massive task, and by the time you finished, how could you be certain that none of the servers had reverted back to a poor configuration?

Policy based management is purpose built to address this problem.

Essentially what we can do is to start moving from “exception” based management to “intent” based management – in other words – “make this production server like all those other production servers". Best of all, when something happens that causes it to deviate from the intended state, we can either prevent the change (depending on the type of change made), or be alerted to it.

For those not familiar with this new feature, check out this interview with Dan Jones. It's the only way to fly.

Cheers,
Rod.