Archive for the ‘Database Design’ Category

The SQL Doctor Is In

Wednesday, April 29th, 2009

Denis Gobo dropped me an email the other day to mention that I should have included a book (Pro SQL Server 2008 Relational Database Design and Implementation) by Louis Davidson et al., in my Technical Library. No question this is a must read for anyone interested in database design  and since Denis also offered a book review it made it real easy to get this added quickly.

I have known Louis for about five years now, and known of Louis for even longer. I still remember listening to an interview he did for SQLDownUnder a few years back (number twelve on this page). Listen for the part where he says one of top errors in database design is “people not listening to me”. When I heard that podcast while jogging I knew I could learn a lot from him.

Truthfully I want to have a shelf dedicated to database design, but the widget needs at least six books so I have some reading to do before I can comfortably recommend others. I need two or three more, so if you have a book to recommend just drop me an email.

How can I encrypt my database?

Monday, October 13th, 2008

This is a very general question that someone could certainly write a book on but I’ll go at answering it by providing a few options and explaining the pro’s and con’s of each solution.  The following list is by no means comprehensive.

The first question you need to ask yourself before going on this journey is where does the encryption need to occur and who are you protecting the data from.  These two questions will drive the amount of effort and complexity required in building a solution.

1)  In database encryption.  The database is encrypted in the database and is viewable to no-one unless they have access to the encryption key.  The data could be encrypted at the application level by first running the data through an encryption algorithm before inserting into the database or the data can be encrypted using SQL Server 2005’s column level encryption or SQL Server 2008’s transparent data encryption.  Using in database encryption is the most secure option in that it can be leveraged to protect sensitive data from anyone that gains access to the database engine itself (whether valid or invalid access) however it has significant performance and scalability caveats.  Encrypting at the application level basically ensures that SQL Server will not be able properly index the data and therefore query perofmance may be dreadful with large data sets.  Leveraging column level encryption or TDE also will introduce query processing overhead and adversely affects the performance of your queries but not to the same extent.   Another drawback of In Database encrption is that it renders backup compression (whether through SQL Server 2008 or a third party backup product or winzip) obsolete.  Encrypted data is not compressible, therefore the size of your encrypted database will roughly equal the size of your backup files regardless of the compression technology used.

2)  If you do not need to secure data inside of the database but want to secure data once it leaves the database, a good solution would be to leverage a third party backup solution such as LiteSpeed.  These solutions will compress and also encrypt backup data in memory so that it is secured as soon as it leaves the database.  This is an attractive option to companies that store backups on network devices and are worried about internal or external users gaining access to network resources and therefore having direct access to backup files.  If you think that native backup files are secured, try opening a small database backup file in notepad.  You will see that all of the data is stored in plain text.  If you have a copy of pubs lying around from the SQL 2000 days you can back it up, open the backup file in notepad, search for “Smith” and you will see Smith’s social security number right there.  This method of database encryption will not secure the data in the database, so anyone who gains database access has free reign on your data, but thats what DBA’s are for right???

3)  A third option, and the last that we will discuss here, is to leverage encryption at the tape level.  Most organizations take their disk based backups and eventually migrate them to tape, where the backups files are then encrypted and eventually moved to offsite storage.  This should be a bare minimum for any SQL Server as once a tape leaves offsite anything could happen to it.  There have been many major news stories recently discussing companies that have had backup tapes lost or stolen and without leveraging at a minimum the tape systems encryption, that data is available to anyone that may “find” a backup tape.  Tape level encryption does nothing to protect the database info inside of the database or the backup data inside of a companies network, but it does protect the data once it is transported offsite.  The nice advantage of this as well as the previous option is that they impose no overhead on standard database workload, they simply may add some overhead to your backup time depending on the level of encyption that you are performing.

Hopefully this brief rundown of database encryption options has given you enough info to get started decideing which implentation strategy is best for you, and as always if you have any further questions, feel free to submit them.

Multiple filegroups: yay or nay?

Monday, September 29th, 2008

Stephanie wrote in with two questions:

  1. Is there a gain in getting 2 LDF (log) files for one database?  My opinion is no, because log file access is sequential, so no performance gain can be obtained and this will complicate maintenance.
  2. Is there a gain in separating the data & indexes onto 2 specific filegroups?  My opinion is no, because the way SQL Server handles indexes (as opposed to Oracle) does not help if you split them from the data.

First things first, multiple log files – I agree that you don’t want to do two LDF files for a single database unless there’s special circumstances.  At one shop, we had a pre-grown full-size log file for each database (with autogrow turned off), but we kept an “emergency” log file for each database on an empty drive.  The “emergency” log files were 1mb, and set to autogrow.  If a database completely ran out of log space in its normal pre-grown log files, then it would start using that “emergency” log file and growing it.  That was better than stopping the database completely, and gave us enough time to kill transactions and roll things back.

That certainly wasn’t a best practice, and it didn’t buy us any performance, but we had to do it based on the unpredictable nature of those applications and the way they would sometimes load huge amounts of data in a single transaction.  In that case, multiple LDF files served a purpose.  (When our Microsoft contact found out what we were doing, they hit the roof and told us to just go buy enough disk to be done with it, and the business guys finally decided to shell out the dough.)

Adding a second file adds complexity, like you said, and the same holds true whether it’s a log file or a data file, which segues into your second question.  Adding a second data file for indexes adds more design overhead – suddenly the DBA has to pay attention with where they’re placing objects.

In the days of my youth (ahh, for the sweet days of 2004), I liked this idea for my 1tb data warehouse because I could rebuild indexes faster with less impact on the end users.  Now that I’m old (okay, maybe just older) and SQL Server 2005 Enterprise Edition lets us rebuild most indexes online, I don’t bother with that anymore.

Plus, when I started managing storage area networks, I had a different reason.  Storage is getting faster and faster every year, but DBAs aren’t getting any more hours in their day.  Keeping data and indexes on different filegroups requires careful attention during development.  All it takes is one knucklehead adding an index without paying attention, and bam, you’ve got indexes on the data filegroup.  That means you have to introduce a new task into your daily/weekly/monthly routine: running a query to identify which objects are on which filegroups.

What a pain.

I’ve struggled with finding and hiring good DBAs, and if I have a choice between making my database 5% faster or making my DBAs 5% faster, I choose that latter option.  I can get 5% more storage performance by adding a few more drive spindles, but it’s harder – and more expensive – to get 5% more good DBA time.

On the other hand, if I was in a shop where my DBAs had plenty of time (or worked for peanuts, whate, then I would take a sandbox system, set it up the same way as production, and do performance benchmarking to see how much of a difference a separate index filegroup made on my exact hardware, storage and application.

You’re asking me, though, so I bet you don’t have the time to do that research either.  In that case, keep it simple!

Can SELECT Statements Cause Blocking to Occur in SQL Server?

Friday, September 26th, 2008

The following question was posed to me during a call today and its one that I’ve heard quite a bit so I figured it warranted a blog post.

The answer is, absolutely!  SELECT statements acquire a shared lock on the tables being accessed.  This shared lock will not affect other SELECT statements hitting the same table, but if someone tries to modify data in the table (via an UPDATE statement for example), the UPDATE statement will be blocked.  When analyzing locks using sp_lock the (S) symbol indicates a shared lock is on an object.

This default behavior can be modified by using one of many query hints.  Using the NOLOCK hint on a SELECT statement will force SQL Server to read data from the table without creating a shared lock on it.  When using this hint, you run the risk of reading uncommited data from the database but in cases where reading data with 100% accuracy is not required, NOLOCK can dramatically reduce blocking and improve the performance of your SELECT queries.  Another option is to use the READ UNCOMMITED isolation level when running your transactions, which conceptually does the same thing.

One additional caveat about both methods.  If your SELECT statement is running, and it expects to read a page that has been deleted by a transaction that is currently executing, SQL Server may deadlock this transaction.  There is an interresting post on this topic that can be found here.

Error Message: Ad hoc update to system catalogs is not supported.

Wednesday, September 24th, 2008

I ran into an interresting situation today while working with a customer and thought it worthwhile to blog on the subject since my internet searching proved lengthy on the subject.  While trying to run a RECONFIGURE statement after making an sp_configure change I received the following message:

Msg 5808, Level 16, State 1, Line 2
Ad hoc update to system catalogs is not supported.

I’m running on SQL Server 2005 SP2+ so my first thought was… obviously, but I’m not trying to update system catalogs, I’m trying to make an sp_configure change.  After doing some digging I found that the culript was the sp_configure ‘allow updates’ parameter.  This configuration in SQL Server 2000 allowed or dissalowed direct system table updates.  In SQL Server 2005, this configuration item still exists but it is obsolete since direct access to system tables in always prohibited.  While the configuration item is obsolete, having it set to 1 in SQL Server 2005 requires you to run the RECONFIGURE statement using WITH OVERRIDE, otherwise you will get the message above.

If you ever see the message above when trying to run RECONFIGURE you will either need to run RECONFIGURE WITH OVERRIDE, or first run:

EXEC sp_configure ‘allow updates’, 0
RECONFIGURE

I find it confusing that an “obsolete” configuration parameter can have an effect on instance behavior, but oh well.  I’d love to hear comments if anyone has any insight into this or similar issues.

Thanks to Jasper Smith, who posted here and pointed me in the right direction.

How Can I Determine Which Tables and Indexes Are Not Being Used in My Database?

Wednesday, August 27th, 2008

Q: I have inherited a legacy applications of which I am certain there are many unneccesary tables and indexes.  I am slightly paranoid about ripping them out however as I am not 100% sure that they are not used.  Is there any way that I can determine which indexes and tables are safe to drop?

A: As long as your application is running on SQL Server 2005+, you have access to a Dynamic Management View (DMV) that displays exactlky this information.  Querying from sys.dm_db_index_usage_stats will tell you how many times a given index/table has been scanned, seeked, looked up, updated, and even gives you the last date when an operation happened.  Browsing through this DMV will allow you to determine with confidence any index/table that has never been touched (since SQL Server was last restarted).  There are many exampes of some elegant queries that people have written that can be found by searching the DMV name but its a fairly straight forward one to understand.

One got-chya, the data in sys.dm_db_index_usage_stats will show tables/indexes across all databases where sys.indexes only shows you data from the database you are in.  This creates some challenging hurdles if you are trying to join on those tables to lookup the index name.  It’s best to keep it simple and filter dm_db_index_usage_stats to only show you data from the database you are in.

A fairly simple example with explanation can be found here.

How do I query DB2 data from SQL Server?

Wednesday, August 20th, 2008

Q:  We use DB2 Connect to access DB2 from MS Access. Do we need this for SQL Server also or does SQL Server have a built-in interface or do we need something else?

A: No you do not need to use DB2 Connect in able to access DB2 data within SQL Server.  SQL Server supports linked servers which allow you to register a remote data source in SQL Server and execute queries against it.  Linked servers can be remote SQL Server instances or instances of Oracle, DB2, Sybase, MySQL, etc…  As long as there is a supported OLEDB driver for the data source you can add it as a linked server.  There is a blog post you can read here:

http://blogs.msdn.com/dotnetinterop/archive/2006/01/20/defining-a-db2-as-a-linked-server.aspx

that walks you through the process of setting up a linked server to DB2 and you can also look up Linked Servers in books online for more information.

Clustered index impacts on writes vs reads

Thursday, May 29th, 2008

After today’s webcast, “What a Cluster! Clustered Indexes” we got an email from David Holt at Kaye Scholer LLP. He writes:

I just attended the webcast “What A Cluster! Clustered Indexes”; thanks for the info!

Clustered indexes obviously impact writes as well as reads, and there can be a conflict between the needs of the two; i.e., if a lot of different people are writing to the same table then you don’t want them to block each other, yet you want simultaneous queries (selects) on that table to be efficient. That seems to argue that the clustered index should make writes as efficient as possible, even if it’s not the most effective clustered index for reads. Would you agree?

Uh oh – you know it’s dangerous when someone from a law firm asks if you would agree, and it’s even more dangerous when that answer is no. I’ll take my life into my own hands and answer anyway.

When designing an index strategy for a table, it helps to know the read/write mix and any time ranges for each. I can illustrate with two examples of opposite ends of the spectrum.

Let’s say our company has a web filtering database that logs all employee web activity to a database table, including the user info, date/time, and the URL they were surfing. Every now and then, they’ll run reports to see what kinds of web sites the users are hitting, and sometimes they’ll audit the web history of a particular individual.

Example #1: The Web History Table

The web activity table would be extremely write-intensive, since we have a lot of company employees who sit around surfing the web all day long. (Like you right now, come to think of it – get back to work!) A write performance reduction as small as 10% might have a significant impact on our ability to keep up with our surf-happy employees. Plus, we store the data on a SATA RAID5 array with slow writes to begin with.In a case like this, we could make an argument for a clustered key using an identity field or using a datestamp field. That would keep write speeds quick at the cost of making reports slower. However, at report time, these users don’t really care how long the report takes to run – they just want to see it sooner or later.

Example #2: The IP Subnet Table

Our web filtering application would store a list of IP address subnets – network TCP/IP address ranges that identify workstations on our network. We would not do very frequent inserts into this table, because our network isn’t expanding like crazy.

When we query this table, the queries use a lot of range queries because they want to group users together by location. The IP address subnets are in similar ranges. The fields on this table include a lot of things that we want to report on, like geographic location and company department.

This table could have a completely different primary key design because it’d have different design objectives: faster queries instead of faster inserts.

Data warehouses are another great example of this kind of design: nightly loads happen during one window (nighttime), and end users do queries during a different window (daytime). Depending on our loads, we might design our primary keys differently. If we have plenty of time at night to load our files, and our users want faster responses on their daytime queries, then our primary keys wouldn’t be designed for fast inserts.

There’s yet another consideration in this design: the number of other indexes on the table. If a table has ten wide covering indexes, then the arrangement of the primary key may not make as much of a difference as you’d think, because there’s so much overhead in maintaining the indexes.

Bottom Line: Measure Every Change

This is why I drove home the Measure, Change, Measure, Revert point in the presentation: the only way to find out for sure is to get a complete picture of your query loads, design a replayable test (like a series of queries) and keep measuring the results.  If possible, replicate the production system as closely as possible in development, because the storage systems will affect your results.

Clustered Indexes & SQL 2005’s Performance Dashboard

Thursday, May 29th, 2008

In today’s installment of the Quest Pain of the Week webcasts, Jason Hall and I will be discussing a reader’s question about finding missing clustered indexes, their impact on performance, and how to be proactive about monitoring performance to find those types of design issues.

MSSQLTips.com has a great post called “Finding primary keys and missing primary keys in SQL Server.”  That post includes two snippets of code, one for each of the popular SQL versions:

Finding missing primary keys in SQL 2005:

SELECT c.name, b.name
FROM sys.tables b
INNER JOIN sys.schemas c ON b.schema_id = c.schema_id
WHERE b.type = 'U'
AND NOT EXISTS
(SELECT a.name
FROM sys.key_constraints a
WHERE a.parent_object_id = b.OBJECT_ID
AND a.schema_id = c.schema_id
AND a.type = 'PK' )

Finding missing primary keys in SQL 2000:

SELECT c.name, a.name
FROM sysobjects a
INNER JOIN sysusers c ON a.uid = c.uid
WHERE xtype = 'U'
AND NOT EXISTS
(SELECT b.name
FROM sysindexes b
WHERE a.id = b.id
AND (b.status & 2048)<>0)

You can read their post for similar queries on finding tables WITH primary keys too.

Also in today’s webcast, I’ll be showing a few screenshots from Microsoft’s SQL Server Performance Dashboard Reports.  They’re a free set of SSRS reports that are integrated into SQL Server Management Studio.  The reports I’ll be showing in the webcast are somewhat limited in that they only show data that’s currently in the plan cache.  If your server is under memory pressure, it may not retain this data for too long, and you would have to revisit these reports pretty often to get a good picture of what’s going on under the hood.

For more information about the dashboard and how to use it, check out Kevin Kline’s Tool Time article “SQL Server 2005 Dashboard Reports” in SQL Server Magazine’s August 2007 issue.

Implementing a Hash Partition on SQL Server 2005

Monday, April 21st, 2008

One of the best things about becoming a Microsoft MVP is meeting other MVPs.  I bring this up because last week was the annual MVP Summit in Seattle, WA.  I was really looking forward to meeting Steve Kass.  Steve Kass is one of the smartest SQL Server MVPs I’ve encountered, especially when it comes to SQL questions.  A while back, I noticed that Steve made an interesting recommendation for a hash function that you could use for partitioning that I thought was worth noting.  A hash function would be very useful if you wanted to implement your own variation of a range partition using a hash function rather than the standard sort of range partitioning where colA values of A-H go to partition 1, values of I-P got to partition 2, and so forth.

 

Steve notes that you could use the following for hashing something small in size:

   CAST(    SUBSTRING(      HASHBYTES(’SHA1′,         CAST(my_col AS NVARCHAR(appropriate_size))),8,1) AS tinyint) 

This is just an off the cuff recommendation from Steve and might need some fine tuning, for example, the CAST might throw off persistence.  However, it’s a good start.

 

Thanks, Steve, for sharing this and thanks, readers, for sharing any improvements you might develop out in the field.


Fatal error: Call to undefined function SEO_pager() in /home/ssp/webapps/htdocs/blog/wp-content/themes/newblogcity/archive.php on line 47