DBAs Behaving Badly (4/10): Storage Configuration
Performance tuning SQL Server applications involves finding and addressing performance bottlenecks. While there will always be a bottleneck somewhere, the goal is to reduce the bottlenecks until application performance meets or exceeds the usage requirements, typically defined in a service level agreement (SLA).
Although it’s undeniable that the largest performance gains usually come from good application design, inadequate hardware makes resolving performance problems much more difficult. Poorly designed storage systems account for arguably the largest percentage of hardware-based performance problems for SQL Server solutions, and fixing them is usually more complicated than a simple memory or CPU upgrade. It follows that a well-designed storage system removes the biggest hardware-based performance obstacle, and that storage design should therefore lead the way in sizing servers for use in SQL Server environments.
Let’s take a look at three common storage configuration problems, beginning with the capacity-centric design.
Capacity-centric design
When designing a storage system for SQL Server, one can take a capacity-centric or a performance-centric approach. To highlight the difference between these approaches, consider a requirement for a 200GB database. How many disks do you need? A storage system with a capacity-centric design would possibly have three 73GB disks, or perhaps more in a RAID configuration. In contrast, a system designed using the performance-centric approach would be vastly different, and wouldn’t even consider the space requirement until much later in the process.
Performance-centric designs start with consideration of the expected workload, and the IO per second (IOPS) capacity of the disks. For example, let’s assume that our application produces a peak load of 1200 reads and 400 writes per second, figures obtained during a profiling exercise in a test environment. Using these figures, we’d calculate the required number of disks as follows;
Required # Disks = (Reads/sec + (Writes/sec * RAID adjuster)) / Disk IOPS
For our example, assuming an average IOPS of 125 and a RAID 10 design, this equates to;
Required # Disks = (1200 + (400 * 2)) / 125
... giving us a total of 16 disks. You’ll note that we haven’t addressed storage capacity requirements yet. This is a deliberate decision to ensure the storage system is designed for throughput and performance as the highest priority. In this example, assuming we used 73GB disks, we’d have a total available capacity of 1.1TB. Usable space, after RAID 10 is implemented, would come down to around 500GB, more than enough for our 200GB database.
A common attribute of storage systems designed using a capacity-centric approach are throughput bottlenecks. For example, in the above scenario, while three disks would meet the capacity requirements of our 200GB database, the throughput would be severely constrained due to the low spindle (disk) count. In contrast, the performance-centric approach exceeds both the capacity and throughout requirements as a consequence of the design process.
In summary, spindle count is vitally important, and chapter 2 of my soon to be published book is dedicated to this topic, including the impact (good and bad) of storage area networks (SANs) in this regard.
Not validating the I/O chain
There are a vast number of components in a storage system’s I/O chain. The operating system, I/O drivers, virus scanners, storage controllers, read cache, write cache, switches, and various other components all pass data to and from SQL Server, and each one of these components has the potential to corrupt data along the way.
One of the really ugly aspects of being a DBA is dealing with physical data corruption. The only thing worse than finding corruption is then discovering the backups are also corrupt (or don’t exist!). Given the critical importance of data, it’s quite scary the number of DBAs who put a system into production without validating the numerous components in the I/O chain.
One of the inevitable outcomes of data corruption is the heated blame game that usually follows. The DBA blames the storage system, the storage administrator blames SQL Server and so on. Worst of all, this usually happens while production is down and people are screaming.
SQL Server has a number of specific I/O requirements, detailed here. The best way of validating them is using the SQLIOSIM tool before the system goes into production. Available as a free download, SQLIOSIM simulates the various I/O patterns used by SQL Server and ensures the data is written and read in a consistent manner. If any of the components in the I/O chain alters the data in any adverse way, SQLIOSIM will detect it, providing you the opportunity to correct the fault in a calm and prepared manner, before the system is implemented in a production capacity. In a similar manner, SQLIOSIM can be used to validate any modifications to the storage system (ideally in a identical test environment first), such as the upgrade of the firmware, drivers and so forth.
Unaligned partitions
A frequently overlooked configuration task is not aligning disk partitions. As shown in this whitepaper from the SQLCAT team, the latency impact of unaligned partitions can be as high as 30% or more. I’m not going to go into any more detail on creating aligned partitions, other than to say it’s a very simple and quick process which has long lasting and significant benefits; if you’re not familiar with the alignment problem, I urge you to read this whitepaper, which includes discussions around this graph...
Source: Disk Partition Alignment Best Practices for SQL Server, Microsoft Corporation
In the next post in this series, we’ll continue with the storage theme, but from a data/log/backup file configuration perspective.