Filtered Stats to Counter Data Skew Issues

There are a handful of things that can make SQL Server choose a “bad plan”. They usually revolve around variable sized result sets. Some problematic scenarios are parameter sniffing, data skew, local variables, table variables, UDF’s or complex calculations in the where clause. All of these and more are detailed in my favorite and recently updated white paper.

We are going to be talking about data skew in this post. Picture a table where you have 1 billion  rows where IsShipped = 1 but 50k rows where IsShipped=NULL. Now, let’s say you have a query that returns orders that are not shipped(NULL). During compilation, the optimizer asks the stats how many rows are IsShipped=NULL. If the stats sampling touched, few, if any of the pages, where IsShipped = NULL, then SQL Server will guess that only 1 row will be returned. This may surpass the key lookup threshold(Part 1, Part 2) and a scan would not be optimal but better than a key lookup. Get it? We will look at an example soon if not.

In the past, your options have been limited. Update stats with fullscan was usually the best option unless the table was just too large and volatile. If it was, you probably had to use index hints or OPTIMIZE FOR the lesser of the two evils. You have covering indexes and is probably a good solution for the example above unless the result set has to be very wide.

Filtered statistics are a significant improvement and can be used nicely when dealing with this problem. They are smaller so cheaper to maintain. Their scope can be very narrow so they are more accurate.

There are a couple of caveats. I wrote about the need to use full scan on very selective stats here and Kimberly Tripp wrote about the fact that they will not auto update until the entire table crosses the threshold.

Let’s look at an example.

CREATE TABLE filtered_stats        
(        
c1 INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,        
c2 datetime DEFAULT GETDATE(),        
c3 CHAR(500),        
c4 INT        
)        
--lets add 1 million rows with c4=1        
--lets add 1 row with c4 = 2         
--Lets add a million more with c4 =3         
--Finally, lets add 50 more with c4 =2 so the estimates are skewed.         
DECLARE @ctr INT = 1000000        
WHILE @ctr > 0        
BEGIN        
   INSERT INTO filtered_stats(c3, c4)        
   VALUES        
   ('blah', 1)        
   SELECT @ctr = @ctr-1        
END        
GO        
INSERT INTO filtered_stats(c3, c4)        
   VALUES        
   ('blah', 2)        
DECLARE @ctr INT = 1000000        
WHILE @ctr > 0        
BEGIN        
   INSERT INTO filtered_stats(c3, c4)        
   VALUES        
   ('blah', 3)        
   SELECT @ctr = @ctr-1        
END        
GO        
DECLARE @ctr INT = 50        
WHILE @ctr > 0        
BEGIN        
   INSERT INTO filtered_stats(c3, c4)        
   VALUES        
   ('blah', 2)        
   SELECT @ctr = @ctr-1        
END        
--Let's add a clustered index        
CREATE CLUSTERED INDEX ix1 ON filtered_stats(c2)        
--This index will be used for the key lookups        
CREATE INDEX ix2 ON filtered_stats(c4)  
--The estimated rows returned are 1 but the actual rows are 51. 
--This is a bad estimate due to uneven data distribution.       
SELECT c1, c2        
FROM filtered_stats        
WHERE c4=2 

 

underestimate

--Lets create narrow and more accurate filtered stats
--You still have to do full scan but you are touching just a fraction of the data – NOTE: Full scan is needed based on my observations here.
CREATE STATISTICS fstats ON filtered_stats(c4) WHERE c4 = 2 WITH fullscan

--This should estimate 51 no matter how many times you update stats since they are more accurate
SELECT c1, c2
FROM filtered_stats
WHERE c4=2

 

image

This is not bullet proof but I say it is an improvement. The downside include the fact that filtered stats are manually created. Manual stats updates will need to be done with a full scan. You may also say that a covering index(or filtered covering index) would kill this. That is true but there may be an actual need to return all columns of a table.

A quick wrap up: Put it in the tool box. It will be like that $50 wrench that you use once a year but it saves you 4 hours every time you touch it.