Archive for the ‘Database Design’ Category

Database Design Process

Monday, January 18th, 2010

Buck Woody recently asked a question: how do you design a database? He outlined the process he follows and asked four questions about how each of us does our work:

  1. What process do you follow?
  2. How important are the business requirements?
  3. What tool do you use to create the design, do you need it to diagram, do you even care about diagrams?
  4. What’s your biggest pain-point about designing?

Funny enough, I haven’t done a full-on database design in over a year. My company just finished about six years of very hard-core engineering work, designing and building, or redesigning and building, the majority of our data input and collection systems. Back then, I was doing lots of design work. Now we’re largely in maintenance mode for most of those systems, and the few new major projects fall into two camps: some use CRM, which means no database design, and others use NHibernate to build the database on the fly from the object model, in the hope that database design will never, ever matter. (The NHibernate work, by the way, is still in super-double-secret-probation development mode; I haven’t seen what they’re producing.) All that means we’re doing very little design work now. That will change.

Process

The process I follow isn’t radically different from Buck Woody’s. I get the business requirements, which are hopefully written on something more substantial than a cocktail napkin, and with more detail than will fit on one, and I read them. Or I meet with the business people and ask lots and lots of questions, which I’ll probably do even if I have written requirements. Sometimes, especially at my current company, I’ll get a full logical diagram from another team. I’ll still basically do the same design work, even if I have a logical model, but I’ll use it as a reference point to double-check my understanding of the requirements. From there:

  1. Identify the nouns. Figure out what kinds of objects, or things, or widgets we need to store.
  2. Figure out the relationships between the widgets. What’s a parent and what’s a child, and which of the widgets are related to many other widgets or to just one other widget, etc.
  3. Lay out the attributes for each widget, meaning the columns in the table. Big points here include identifying the correct data type and figuring out which attributes are required and which are not. I also work out which attributes come from pick lists, which means look-up tables.
  4. Identify the natural key. I’m with Buck: most of the time I use an alternate, artificial key (yeah, frequently an identity column), but I want to know the natural key so that I can put a unique constraint on the table in addition to the primary key. This is a vital, but often missed, step in terms of the business processes being modeled.
  5. Figure out where I’m going to put the clustered index. Usually, but not always, this will be the primary key, but I decide it as a fundamental part of the design. That means thinking about the most common access path for the data, because that’s where the cluster needs to be.
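
As a rough illustration of steps 1 through 4, here is a minimal sketch in Python using the standard-library sqlite3 module. SQLite is only a stand-in so the example can run anywhere; the table and column names are hypothetical and do not come from any real design. In SQL Server the artificial key would typically be an identity column, and the clustered index choice from step 5 is not shown because SQLite does not expose it in the same way.

```python
import sqlite3

# Hypothetical schema for illustration only; none of these names come from the post.
SCHEMA = """
CREATE TABLE widget_type (                 -- look-up table backing a pick list
    widget_type_id INTEGER PRIMARY KEY,
    type_name      TEXT NOT NULL UNIQUE
);
CREATE TABLE widget (
    widget_id      INTEGER PRIMARY KEY,    -- artificial key
    serial_number  TEXT NOT NULL,          -- natural key (assumed for the example)
    widget_type_id INTEGER NOT NULL        -- required attribute from the pick list
        REFERENCES widget_type (widget_type_id),
    description    TEXT,                   -- optional attribute
    UNIQUE (serial_number)                 -- natural key enforced alongside the primary key
);
"""


def build_database():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn, serial_number, widget_type_id):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO widget (serial_number, widget_type_id) VALUES (?, ?)",
                (serial_number, widget_type_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_database()
conn.execute("INSERT INTO widget_type (widget_type_id, type_name) VALUES (1, 'gear')")

first = try_insert(conn, "SN-001", 1)      # new natural key, valid pick-list value
duplicate = try_insert(conn, "SN-001", 1)  # same natural key again
orphan = try_insert(conn, "SN-002", 99)    # pick-list value that does not exist
print(first, duplicate, orphan)
```

The unique constraint is what keeps the artificial key honest: without it, the second insert would succeed and the table would hold the same real-world widget twice under two different identity values.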

How Important Are the Business Requirements

Buck, I respect you and I like you and I hate to do this, but you’re kidding with this question, right? No requirements, no database. The requirements are all. The most important reason you want the requirements written down is that it means the business at least thought them through, all the way, one time. It’s your best bet that you’re going to deliver something that at least resembles what the business wants and needs. I live and breathe by the requirements. When someone from the business corrects my design, “Oh, that widget is related to this thingie, not that whatchamacallit,” I get them to change the requirements to reflect that new information. This way, when someone wonders why I did what I did, they’ll always be able to see the requirements the way I saw them.

Tools

I use Embarcadero’s ER/Studio. I know there are other ER tools on the market, but I fell in love with this one on the day I needed to change the data type on a column that was in about 30 different tables and I did it in a couple of minutes using their scripting tool. I’ve been using it ever since, and we’re talking well more than ten years now. It’s just great. I only use it for the initial design and for documentation after the design. Once the design is done the first time and a physical database is constructed, I don’t work from the ER diagram to do builds and deployments; I work from source control. Do I have to do it like this? No, but I really enjoy the power of an ER tool while I’m doing the design because it lets you make a lot of changes quickly and easily, without having to rebuild a database over and over.

Biggest Pain Point

The largest pain point has to be changing requirements. Change happens. I embrace it. I’ve worked on agile projects and I like the general approach and mindset. And yet, changing databases is hard. It’s not so bad when you’re in the strict, isolated, ER-diagram-only stage, but as soon as you’ve built the database, even one time, change gets difficult. It’s manageable if you work closely with the design, test and development teams and get their buy-in, early, that test data must be tossed and rebuilt with each fundamental design change. But that’s a hard agreement to get, and so you end up spending a lot of time trying to retain the test data AND fundamentally redesign the data structure. This is not an enjoyable combination.

That’s about it. I would say the one thing I try to do, and it’s not easy, is be open to doing silly stuff. I try, and I don’t always succeed, to let the business or developers or logical modelling team make silly choices after I carefully tell them why they’re silly and the likely outcome. I do this because fighting them on every single silly decision is a losing proposition. Plus, it saves you for the fights against the stupid, as opposed to silly, things that come along occasionally.


Bitemporal Data

Tuesday, August 4th, 2009

Any IT pro with more than a year or two of experience will have faced the challenges of version control for an application, but what if you have to implement version control for data?

The most common way to tackle this problem is to implement something called “bitemporal data”. Under this method, each row in a table carries two time values: the valid time (when the fact is true in the real world) and the transaction time (when the row was recorded in the database). Since two distinct time values are stored, we get the term bitemporal.
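
To make the two time dimensions concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in database. The table, the column names, the dates and the prices are all hypothetical, and storing each time dimension as a from/to period is just one common way to lay it out, not necessarily the approach taken in the references below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product_price (
        product       TEXT NOT NULL,
        price         REAL NOT NULL,
        valid_from    TEXT NOT NULL,  -- valid time: when the price applies in the real world
        valid_to      TEXT NOT NULL,  -- exclusive end of the valid period
        recorded_at   TEXT NOT NULL,  -- transaction time: when the row was entered
        superseded_at TEXT NOT NULL   -- when the database stopped believing this row
    )
    """
)

FOREVER = "9999-12-31"  # placeholder for an open-ended period

conn.executemany(
    "INSERT INTO product_price VALUES (?, ?, ?, ?, ?, ?)",
    [
        # Entered on 1 January: the price is 10.00 from then on.
        # Superseded on 1 March, when the correction below was entered.
        ("widget", 10.0, "2009-01-01", FOREVER, "2009-01-01", "2009-03-01"),
        # Entered on 1 March: the price was really 12.00 from 1 February,
        # so the old price is restated as ending on that date.
        ("widget", 10.0, "2009-01-01", "2009-02-01", "2009-03-01", FOREVER),
        ("widget", 12.0, "2009-02-01", FOREVER, "2009-03-01", FOREVER),
    ],
)


def price_as_of(product, valid_on, known_on):
    """Price that applied on valid_on, according to what the database held on known_on."""
    row = conn.execute(
        """
        SELECT price FROM product_price
        WHERE product = ?
          AND valid_from <= ? AND ? < valid_to
          AND recorded_at <= ? AND ? < superseded_at
        """,
        (product, valid_on, valid_on, known_on, known_on),
    ).fetchone()
    return row[0] if row else None


before_fix = price_as_of("widget", "2009-02-15", "2009-02-20")
after_fix = price_as_of("widget", "2009-02-15", "2009-03-05")
print(before_fix, after_fix)
```

Nothing is ever updated in place or deleted: the correction is recorded as new rows, and the original row is kept so that a report run in February can still be reproduced later.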

A great place to start is Adam Machanic’s excellent article at http://www.simple-talk.com/sql/t-sql-programming/a-primer-on-managing-data-bitemporally/. In addition, I encourage you to check out Adam’s book “Expert SQL Server 2005 Development” at http://www.apress.com/book/view/9781590597293.

In addition, Dejan Sarka pointed out that you can get very comprehensive information on supporting temporal data from a book entitled “Developing Time-Oriented Database Applications in SQL” by Richard T. Snodgrass. Mr. Snodgrass was kind enough to publish the book in PDF format and make it available for free download at http://www.cs.arizona.edu/~rts/publications.html. (Note that the book is quite old now, dating from 1999, so the T-SQL code does not include any SQL Server 2005 or 2008 enhancements. But the basic SQL is still solid.)

Cheers!

-Kevin

kekline @ twitter

P.S. Check out my new site – http://kevinekline.com

SQL Server VLDB Partitioning Tutorial

Tuesday, July 14th, 2009

Tables with hundreds of millions or billions of rows need a way to manage that data more easily.  SQL Server 2005 introduced partitioning, and it’s been improved ever since.  Learn how to implement it in this ten-minute training video.

For more about SQL Server table partitioning and other VLDB tips, check out our Very Large Databases wiki article.

The SQL Doctor Is In

Wednesday, April 29th, 2009

Denis Gobo dropped me an email the other day to mention that I should have included a book (Pro SQL Server 2008 Relational Database Design and Implementation) by Louis Davidson et al. in my Technical Library. No question this is a must-read for anyone interested in database design, and since Denis also offered a book review, it was really easy to get this added quickly.

I have known Louis for about five years now, and known of Louis for even longer. I still remember listening to an interview he did for SQLDownUnder a few years back (number twelve on this page). Listen for the part where he says one of the top errors in database design is “people not listening to me”. When I heard that podcast while jogging, I knew I could learn a lot from him.

Truthfully I want to have a shelf dedicated to database design, but the widget needs at least six books so I have some reading to do before I can comfortably recommend others. I need two or three more, so if you have a book to recommend just drop me an email.

How can I encrypt my database?

Monday, October 13th, 2008

This is a very general question that someone could certainly write a book on, but I’ll answer it by providing a few options and explaining the pros and cons of each solution.  The following list is by no means comprehensive.

The first question you need to ask yourself before going on this journey is where does the encryption need to occur and who are you protecting the data from.  These two questions will drive the amount of effort and complexity required in building a solution.

1)  In-database encryption.  The data is encrypted in the database and is readable by no one unless they have access to the encryption key.  The data can be encrypted at the application level, by running it through an encryption algorithm before inserting it into the database, or it can be encrypted using SQL Server 2005’s column-level encryption or SQL Server 2008’s transparent data encryption (TDE).  In-database encryption is the most secure option, in that it can be used to protect sensitive data from anyone who gains access to the database engine itself (whether that access is valid or not); however, it has significant performance and scalability caveats.  (TDE is the exception on the security side: it encrypts the data and log files on disk, so it protects data at rest but not from users who can query the database.)  Encrypting at the application level basically ensures that SQL Server will not be able to index the data properly, so query performance may be dreadful with large data sets.  Column-level encryption and TDE also introduce query processing overhead and adversely affect the performance of your queries, but not to the same extent.  Another drawback of in-database encryption is that it renders backup compression (whether through SQL Server 2008, a third-party backup product, or WinZip) ineffective.  Encrypted data is not compressible, so the size of your backup files will roughly equal the size of your encrypted database regardless of the compression technology used.

2)  If you do not need to secure data inside the database but want to secure it once it leaves the database, a good solution is to use a third-party backup product such as LiteSpeed.  These products compress and encrypt backup data in memory, so it is secured as soon as it leaves the database.  This is an attractive option for companies that store backups on network devices and worry about internal or external users gaining access to network resources and therefore having direct access to backup files.  If you think that native backup files are secure, try opening a small database backup file in Notepad.  You will see that all of the data is stored in plain text.  If you have a copy of pubs lying around from the SQL 2000 days, you can back it up, open the backup file in Notepad, search for “Smith” and you will see Smith’s social security number right there.  This method will not secure the data in the database, so anyone who gains database access has free rein over your data, but that’s what DBAs are for, right?

3)  A third option, and the last that we will discuss here, is to encrypt at the tape level.  Most organizations take their disk-based backups and eventually migrate them to tape, where the backup files are encrypted and eventually moved to offsite storage.  This should be the bare minimum for any SQL Server, because once a tape goes offsite anything could happen to it.  There have been many major news stories recently about companies that have had backup tapes lost or stolen; without at least the tape system’s encryption, that data is available to anyone who may “find” a backup tape.  Tape-level encryption does nothing to protect the data inside the database or the backup data inside a company’s network, but it does protect the data once it is transported offsite.  The nice advantage of this option, as well as the previous one, is that it imposes no overhead on the standard database workload; it may simply add some overhead to your backup time, depending on the level of encryption you are performing.
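
The claim in option 1 that encrypted data does not compress is easy to check for yourself. The following Python sketch is only an illustration: the "cipher" is a toy XOR keystream built from SHA-256, standing in for a real encryption algorithm, and the sample rows and key are made up. It is not how SQL Server encrypts anything, and it should never be used to protect real data.

```python
import hashlib
import zlib


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR the data with a SHA-256 counter-mode keystream.

    Illustration only: a stand-in for a real cipher. Applying it twice
    with the same key returns the original bytes.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        keystream = hashlib.sha256(key + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(d ^ k for d, k in zip(chunk, keystream))
    return bytes(out)


# Repetitive rows, the kind of data that normally compresses very well.
plaintext = b"".join(
    b"customer_id=%06d;state=CA;status=ACTIVE\n" % i for i in range(5000)
)
ciphertext = toy_encrypt(plaintext, key=b"hypothetical-key")

plain_ratio = len(zlib.compress(plaintext)) / len(plaintext)
cipher_ratio = len(zlib.compress(ciphertext)) / len(ciphertext)

print(f"plaintext compresses to  {plain_ratio:.0%} of its original size")
print(f"ciphertext compresses to {cipher_ratio:.0%} of its original size")
```

Compression works by finding repeated patterns, and the whole point of encryption is to leave none. That is why the products in option 2 compress first and encrypt second.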

Hopefully this brief rundown of database encryption options has given you enough info to start deciding which implementation strategy is best for you. As always, if you have any further questions, feel free to submit them.

Multiple filegroups: yay or nay?

Monday, September 29th, 2008

Stephanie wrote in with two questions:

  1. Is there a gain in getting 2 LDF (log) files for one database?  My opinion is no, because log file access is sequential, so no performance gain can be obtained and this will complicate maintenance.
  2. Is there a gain in separating the data & indexes onto 2 specific filegroups?  My opinion is no, because the way SQL Server handles indexes (as opposed to Oracle) does not help if you split them from the data.

First things first, multiple log files – I agree that you don’t want two LDF files for a single database unless there are special circumstances.  At one shop, we had a pre-grown, full-size log file for each database (with autogrow turned off), but we kept an “emergency” log file for each database on an empty drive.  The “emergency” log files were 1 MB and set to autogrow.  If a database completely ran out of log space in its normal pre-grown log file, it would start using that “emergency” log file and growing it.  That was better than stopping the database completely, and it gave us enough time to kill transactions and roll things back.

That certainly wasn’t a best practice, and it didn’t buy us any performance, but we had to do it based on the unpredictable nature of those applications and the way they would sometimes load huge amounts of data in a single transaction.  In that case, multiple LDF files served a purpose.  (When our Microsoft contact found out what we were doing, they hit the roof and told us to just go buy enough disk to be done with it, and the business guys finally decided to shell out the dough.)

Adding a second file adds complexity, like you said, and the same holds true whether it’s a log file or a data file, which segues into your second question.  Adding a second data file for indexes adds more design overhead – suddenly the DBA has to pay attention to where they’re placing objects.

In the days of my youth (ahh, for the sweet days of 2004), I liked this idea for my 1 TB data warehouse because I could rebuild indexes faster with less impact on the end users.  Now that I’m old (okay, maybe just older) and SQL Server 2005 Enterprise Edition lets us rebuild most indexes online, I don’t bother with that anymore.

Plus, when I started managing storage area networks, I had a different reason.  Storage is getting faster and faster every year, but DBAs aren’t getting any more hours in their day.  Keeping data and indexes on different filegroups requires careful attention during development.  All it takes is one knucklehead adding an index without paying attention, and bam, you’ve got indexes on the data filegroup.  That means you have to introduce a new task into your daily/weekly/monthly routine: running a query to identify which objects are on which filegroups.

What a pain.

I’ve struggled with finding and hiring good DBAs, and if I have a choice between making my database 5% faster or making my DBAs 5% faster, I choose the latter option.  I can get 5% more storage performance by adding a few more drive spindles, but it’s harder – and more expensive – to get 5% more good DBA time.

On the other hand, if I were in a shop where my DBAs had plenty of time (or worked for peanuts, whatever), then I would take a sandbox system, set it up the same way as production, and do performance benchmarking to see how much of a difference a separate index filegroup made on my exact hardware, storage and application.

You’re asking me, though, so I bet you don’t have the time to do that research either.  In that case, keep it simple!

Can SELECT Statements Cause Blocking to Occur in SQL Server?

Friday, September 26th, 2008

The following question was posed to me during a call today, and it’s one that I’ve heard quite a bit, so I figured it warranted a blog post.

The answer is, absolutely!  Under the default READ COMMITTED isolation level, SELECT statements acquire shared locks on the data being read.  A shared lock will not affect other SELECT statements hitting the same table, but if someone tries to modify the locked data (via an UPDATE statement, for example), the UPDATE statement will be blocked until the shared lock is released.  When analyzing locks using sp_lock, a mode of S indicates a shared lock on an object.

This default behavior can be modified by using one of many query hints.  Using the NOLOCK hint on a SELECT statement will force SQL Server to read data from the table without taking shared locks on it.  When using this hint, you run the risk of reading uncommitted data from the database, but in cases where reading data with 100% accuracy is not required, NOLOCK can dramatically reduce blocking and improve the performance of your SELECT queries.  Another option is to use the READ UNCOMMITTED isolation level when running your transactions, which conceptually does the same thing.

One additional caveat about both methods:  if your SELECT statement is running and expects to read a page that has been deleted by a transaction that is currently executing, SQL Server may deadlock this transaction.  There is an interesting post on this topic that can be found here.

Error Message: Ad hoc update to system catalogs is not supported.

Wednesday, September 24th, 2008

I ran into an interesting situation today while working with a customer and thought it worthwhile to blog about, since my internet searching on the subject proved lengthy.  While trying to run a RECONFIGURE statement after making an sp_configure change, I received the following message:

Msg 5808, Level 16, State 1, Line 2
Ad hoc update to system catalogs is not supported.

I’m running on SQL Server 2005 SP2+, so my first thought was… obviously, but I’m not trying to update system catalogs, I’m trying to make an sp_configure change.  After doing some digging I found that the culprit was the sp_configure ‘allow updates’ parameter.  In SQL Server 2000, this setting allowed or disallowed direct system table updates.  In SQL Server 2005, the configuration item still exists but is obsolete, since direct access to system tables is always prohibited.  Even though the configuration item is obsolete, having it set to 1 in SQL Server 2005 requires you to run the RECONFIGURE statement using WITH OVERRIDE; otherwise you will get the message above.

If you ever see the message above when trying to run RECONFIGURE you will either need to run RECONFIGURE WITH OVERRIDE, or first run:

EXEC sp_configure 'allow updates', 0;
RECONFIGURE;

I find it confusing that an “obsolete” configuration parameter can have an effect on instance behavior, but oh well.  I’d love to hear comments if anyone has any insight into this or similar issues.

Thanks to Jasper Smith, who posted here and pointed me in the right direction.