DBA’s Behaving Badly (1/10): Backups

... In this series of blog posts, I'll examine DBA best practices through the lens of the very opposite; worst practices ...

Which is worse; Not taking backups, not verifying their validity, or failing to practice various recovery scenarios? I'd argue that they're all as bad as each other, but let's start with the first one, not taking them at all.

No Backups

Let's not waste too much time here; If you're not taking backups, perhaps database administration is not for you. You might be better suited to a career in the finance sector.

Some less obvious variations on not taking backups include;

  • Not taking transaction log backups of databases in full recovery mode,
  • Backing up at the file system level on the assumption that this will result in a valid SQL Server backup,
  • Using SAN magic such as block replication as your primary (and only) backup strategy


No Verification

You're taking backups right? Cool. You can go back to sleep now...

I don't need to point out the countless stories of backup tapes being recalled for restoration only to discover that the backups were never completed successfully in the first place (despite the "success" message from the backup software). Until you actually restore the backups, how do you know they worked? And by "restore", I don't mean restoring over the top of the production database!

In summary, here are a number of things you can/should do to verify your backups actually worked;

  • Do you have a test environment? Why not automate the restoration of the production backup to the test server on a nightly basis? Apart from constant backup verification, there are numerous other advantages to this strategy including periodic test data refresh, the ability to offload DBCC checks to another enviornment, and an additional disaster recovery point,
  • If there is no test environment (hello?), or there are security/disk space concerns with moving production data out of production, then at the very least, you should use the RESTORE VERIFYONLY command to check the backup validity. Books Online contains a complete description of this command, but in summary, it will read and verify the backup file, doing everything a restore does without actually restoring the database,
  • Without transaction log backups, point in time restoration is simply not possible. Unless you're happy to loose all changes since the last full backup, transaction log backups are required, and as a result, the validty of each and every one is crucial. Arguably the best way of verifying this is via transaction log shipping. In addition to verifying log backup validy, this method enables numerous other benefits including maintaining a warm standby server for recovery and/or reporting purposes. Better still, the log shipping destination can be in a physically separate environment for protection from large scale geographical disaster


No Recovery Simulation

Ok, so you take backups and you verify them. Very good. But what about simulating disasters and practicing recovery? In a real disaster situation, you'll probably have people at your desk sweating on you to recover the database to avoid loss of data. By the time you need to restore a database in a real emergency, you'll want to know the restore process back to front. Like Hugh Jackman in the hacking scene in Swordfish, you should be ready to spring into action and restore databases at any moment, and under the most intense pressure.

Swordfish

The whole purpose of backups is for restoration. Despite this obvious fact, it's quite astonishing that most backup "strategies" are designed without any regard for the situations in which they may be required for restoration. Good backup strategies start with service level agreements for restoration time and acceptable data loss, and work backwards to derive the backup design. Classic mistakes here include;

  • Designing the backup strategy around minimizating the backup impact. This is fine, so long as the SLA for restoration time and data loss can be met (you do have an SLA right? - more on that in a later post),
  • Backing up to tape and shipping the tapes offsite. What happens in a recovery situation? How long will it take to retrieve the tapes from offsite? Offsite tape backups are fine, so long as the most recent backups are also on disk for local restoration,
  • Not anticipating, documenting and simulating various failure scenarios. We'll spend more time on this in the next post, but in summary, every DBA that may be required to perform a restore should know exactly where the backups are and how to restore them to the appropriate point in time. To ensure this happens, a variety of failure scenarios should be simulated in a test environment, ideally at a random and unanounced time, with each DBA taking it in turns to practice recovery. Their success or failure in this regard should ideally be part of their employment KPI's

In the next post, I'll examine a closely related category of worst practices; Disaster recovery planning (or lack thereof)