Author Archive

Five things SQL Server should ban forever

Wednesday, May 12th, 2010

Here's my list of what needs to go.

Ok, everyone seems to be creating a list of five things that SQL Server should get rid of, so I'll do one of those as well.  (I already did a "five things that SQL Server needs", but apparently I jumped the gun on that one. Oh well).  Actually I'm fairly happy with everything in the product right now, and really don't have anything that's hanging around driving me crazy.  Having said that, sure here's a few that I wouldn't miss:


Model Database
- Does anyone really use this thing?   The ability to make changes to it so that when you create a new database it is "modeled" after a template, so to speak,  is a good idea I guess, but I never seem to use it.

Database Diagrams - This facility is weak, so getting rid of it is no hair off of my <backside>.  Make it really slick or just trash it.

Backup Devices - I can't remember the last time I actually created a device for a backup.  Not needed.

OLE Automation - What I thought was once cool, back in my wilder and more daring days, is a garbage. This stuff needs to go. Can't believe that it's still around.  What's worse - I can't believe that it's not deprecated! I'll always appreciate that piece of ingenious code by the late, great Ken Henderson on programming the Microsoft Word API to spell-check before inserts into a table, but I'd never use something like that in a production setting.

Auto Shrink - Probably the worst 'feature' eva!  I think they leave it in just so that they can warn beginning DBA- types to go shut it off, and fill up chapters in books and articles.  It's gotta go!!!

 

Lee

 

-----------------------

"He's not a man - a machine. A Terminator. A Cyberdyne Systems Model 101"

terminator010

 

face

Five things SQL Server should add

Tuesday, May 11th, 2010

Here's my list...enjoy!

Super-stealth DBA/SSIS expert Jamie Thomson came up with a great list entitled "Five things SSIS should drop", and Aaron Bertrand offered the blog "5 things SQL Server should drop"; I thought I'd throw in my .02 cents and give my own list...Five things SQL Server should add.  Here they are and in no particular order:

  1. Object Restores from a backup - How many times have you wanted to just restore a proc, a table, or some other some other object in SQL Server from a backup?  I have.  Who cares if a table is "transactionally consistent"!  Just give me the table as-is and let me fix it.  We had table restores in SQL Server 6.5 but they removed them, remember? Bring 'em back!!!
  2. More minimally-logged operations - With the advent of the very popular document and key-value pair databases roaming the landscape (MongoDB, Cassandra, Hadoop, et. al.), we need to add more minimally or better yet completely non-logged operations at some point in SQL Server to compete with these guys. The new INSERT WITH (TABLOCK) is great - so what's next?  Again, just let me decide if I want to send the data without logging, and let me be responsible to see that the rows got there.  I already check for them anyway.  Some operations just don't need to be logged.  For example, do you think uploading a comment on facebook needs to be logged?  Uh no.  Just retry if it fails.  Give me the option to be "reasonably sure" rather than 100% positive at my discretion.
  3. Linked Servers with optimized distributed transactions - If we're going to scale out, let's scale out the right way. If you have a multi-server architecture that uses the MSDTC, you automatically have a non-optimized query. Congratulations.  Give me a query optimizer that can look at both sides of the fence and make heads or tails of query optimization.  Also, give me high-performance linked servers. MSDTC code hasn't been changed in 10-15 years;  Let's chuck it and redo the whole process.  (Either that or just get rid of it entirely).
  4. "Shared-something" or "Shared Everything" Architecture - Wouldn't it be cool to have one database file across multiple servers? Wouldn't it be cool to have a pool of RAM, disk, CPU that could be dynamically moved around a farm of servers?  Wouldn't it be nice to have a cluster that does more than just "high availability"?  Give me the processing power of RAIW. (Redundant Array of Inexpensive Workstations).  It won't work?  Uh, don't tell the boys from Terradata - they have it and it's great.  And, maybe we will as well someday soon ;)
  5. Add Ctl-B to SSMS - This was the greatest feature ever for those of us who used it in SQL 7 and 2000 Enterprise Manager.  Who took it out and why?  Huh? CTL + B was the shortcut that grabbed the query results window; when you wanted to move it, it snagged the bar between the results pain and the query window allowed you to slide it up or down with a mouse. Bring it back ASAP!!!

 

Thanks for reading!

Lee

 

-------------------------

Buford T. Justice: Well, thank you, Mr. Bandit. And as the pursuer, may I say you're the ***dam*dest pursuee I've ever pursued. Now that the mutual bullsh*t is over, WHERE ARE YOU, YOU SOMB*TCH?

chestram2

 

face

Table Partitioning and LOCK_ESCALATION in SQL Server 2008

Tuesday, May 11th, 2010

INTRODUCTION

The folks at SQL Cat (SQL Customer Advisory Team) post some really great stuff, and had a blog a couple of months back [1] on partitions and locking.  I haven’t played much with partitions yet in SQL 2005 or 2008 -  haven’t really had a need to up to this point, and haven’t randomly geeked with the technology since I’ve been busy geeking with other stuff.   But I knew that I would eventually come around to them, and now is that time - the client where I am located is wanting to use the technology in a consolidation effort in their data warehouse.  For their implementation, they want to deploy partitioning to segment country codes, and specifically, they want to make sure that DML on one country will not affect another country.  For example,  they want to know if inserts or deletes that are occurring for China affect the people who are querying data from Indonesia.

The goal of my post is to expand what SQLCat did by doing the following:

  1. Change the INSERT to a DELETE statement (theoretically, changing DML type won’t matter)
  2. Increase the percentage of rows to be affected to those of all of the rows in the table
  3. Investigate what SQL Server Profiler captures during this operation
  4. Test a concurrent set of DML actions to the partitions
  5. Test it using a heap table

BACKGROUND, AND AN OPINION

Maybe I haven’t wanted to jump into the feature due to prior experiences.  I started working with SQL Server 6.5, so I am familiar with and have used all of the various flavors that Microsoft has offered to scale through the years, and frankly I was never impressed with any of them.  Federated servers, horizontal partitioning, distributed partitioned views, vertical partitioning, anything using the MSDTC, constraints to direct rows, etc.  I’ve never liked, all of which I believed were a small step short of a hack. Personal opinion here, so take it FWIW.  I am very skeptical when using “scale” and “SQL Server” in the same sentence; I am, however, very impressed thus far with partitions and table partitioning. Let’s gen up a scenario and do this thing.


SQL CAT’S CODE

We’ll set up the tables and indexes using the following, somewhat similar to their script but slightly modified:

USE tempdb
GO
BEGIN TRY
    DROP TABLE dbo.Data
END TRY

BEGIN CATCH
END CATCH
GO

CREATE PARTITION FUNCTION pf_year(INT) 
AS RANGE RIGHT FOR VALUES (2007,2008,2009,2010)
GO
CREATE PARTITION SCHEME ps_year 
AS PARTITION pf_year ALL TO ([PRIMARY])
GO
CREATE TABLE Data 
(    
    [year] INT NOT NULL
    , UpdateValue FLOAT NOT NULL 
    , PayLoad CHAR(200) NOT NULL
) ON ps_year([year])

GO
-- drop index data.cix
CREATE CLUSTERED INDEX CIX ON Data(Year) ON ps_year(Year)
GO

Notice that I have four years in my partition function, which will be really five logical partitions in SQL Server.  We can verify this by running the following query, as this shows our partition ids for our object_id for our “Data” table.  This is very important, because we will look into SQL Profiler to make sure that DML for one partition doesn’t escalate locks to one of the other partitions; if it did, we may experience accessibility/concurrency problems for those other users.  In the screenshot below we can see the partition_id column, and this is what we’ll search for to verify either way.

SELECT * FROM sys.partitions
WHERE OBJECT_ID = OBJECT_ID('Data')


image

Figure 1 – Table with our partitions set up


YES, WE WILL NEED SOME DATA

Let’s populate the table now.  For this test I am deleting data.  What I am going to do is kill two birds with one stone: I am going to check for both locking at the partition level, and am also going to skew the table values just to verify that it has affect on SQL Server by inserting quite a bit more data in the partition that I am going to delete. (Remember that SQL Server will try to escalate up to a table lock if the number of rows being affected is percentage-wise is high compared to the number of rows in the table. SQL Cat used 10,000 rows for all partitions).  Run the following and you will see [rows] in the above query populated.

--/* populate the table */
TRUNCATE TABLE Data
GO
DECLARE @y INT = 2007
WHILE (@y < 2011)
BEGIN
    INSERT Data
    SELECT @y, 0, REPLICATE('X', 200) FROM 
        (SELECT TOP (CASE WHEN @y = 2008 then 1000000 else 10000 end) o1.id
        FROM sys.syscolumns o1
        CROSS JOIN sys.syscolumns o2
        CROSS JOIN sys.syscolumns o3
        ) AS BIG

    SET @y +=1
END

 

image

Figure 2 – Screenshot while table is being populated

image

Figure 3 – Screenshot while table is being populated

 

This is cool – you can see each partition filling up from the above statement.  Because I can use expressions in TOP, I insert more records for year 2008 than the others to get the data skewed via the CASE statement; again, my thinking here is that we will verify that partitions don’t bother each other as they did in their test, but also make sure that deleting a large chunk in one partition will also not affect the entire table.

I am going to use a read committed isolation level since that is what they use at the client (the SQL Server default) and leverage the new feature in SQL Server 2008 to set the lock escalation to AUTO at the table level.  These are worth changing up, by the way, to see the various affects on locking when setting the LOCK_ESCALATION for a given table.

--SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
--SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
--SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

ALTER TABLE Data SET (LOCK_ESCALATION = AUTO)
--ALTER TABLE Data SET (LOCK_ESCALATION = DISABLE)
GO

 

ARE WE THERE YET?

Alright, it’s time to run the script and look at this thing.  As mentioned, SQL Cat did not fire up SQL Server Profiler, so here we will so that we can 1) see lock escalations and 2) view the types of locks.  They also mentioned that you can add the Lock:Escalation event, but I am also going to add Lock:Acquired and Lock:Release as well, since these will give us additional information to get a better picture of the process.  Here’s the code:

BEGIN TRAN
    DELETE Data
    WHERE YEAR=2008

    SELECT 
          lck.request_session_id AS session_id
        , resource_type
        , CASE resource_type 
            WHEN 'OBJECT' THEN    
                OBJECT_NAME(lck.resource_associated_entity_id)
            WHEN 'KEY' THEN OBJECT_NAME(p.object_id)
            WHEN 'PAGE' THEN OBJECT_NAME(p.object_id)
            WHEN 'RID' THEN OBJECT_NAME(p.object_id)
            WHEN 'HOBT' THEN OBJECT_NAME(p.object_id)
            ELSE    
                CAST(resource_associated_entity_id AS NVARCHAR(MAX))
            END 
          AS object_name
        , lck.request_mode
        , i.name AS index_name
    FROM sys.dm_tran_locks lck WITH (NOLOCK)
    LEFT JOIN sys.partitions p WITH (NOLOCK) ON p.hobt_id = lck.resource_associated_entity_id 
    LEFT JOIN sys.indexes i WITH (NOLOCK) ON p.object_id = i.object_id AND 
         p.index_id = i.index_id 
    LEFT JOIN sys.databases db WITH (NOLOCK) ON lck.resource_database_id = db.database_id 
    WHERE lck.request_mode <> 'Sch-S' 
    AND lck.request_session_id = @@SPID
    AND lck.resource_type <> 'DATABASE'
    
COMMIT TRAN


First I’m going to run this with LOCK_ESCALATION = AUTO. Notice in screenshots and that I have something similar to SQLCat; in the profiler shot I see that there is a HOBT lock type at the end. 

image

Figure 4. Profiler with LOCK_ESCALATION = AUTO

 

image

Figure 5. Profiler with LOCK_ESCALATION = DISABLE


The difference here is that HOBT is not seen in Figure 5, but it is in Figure 4, so I’m reasonably sure that we did get the HOBT granularity (Heap or B-Tree) for the first run.  Good deal.


INTERESTING FIND

While the above output was expected, the below two screenshots from the output of the queries for my two tests are somewhat strange and unexpected – they’re the same.  With AUTO and DISABLE, I get both HOBT granularity. After running this a dozen times or more, it appears that with tables with one very large partition over the others, you get the HOBT locking and not row-level locks even with DISABLE.  I retested this over and over as well, starting back with 10K rows for each partition, and then upping the 2008 partition at various increments. Only when I got to about 500,000 rows did it exhibit this behavior, and you could see that the page locks went away and were replaced by the output in Figure 6.

Something to test further for sure; maybe I’ll ping the SQLCat team to comment on this.  What I am finding is that ALTER TABLE seems to be working inconsistently, because every time that I drop the table it seems to work OK.  I’ll follow up here.

image

Figure 6. With LOCK_ESCALATION = AUTO

 

image

Figure 7. With LOCK_ESCALATION = DISABLE

 

TESTING A CONCURRENT INSERT WITH DELETE

Here’s a test that they did not do – I am going to run the above DELETE, but at the same time run a large INSERT into another partition. Set up the first query, and then run this one at the same time in another window.

BEGIN TRAN
    
    DECLARE @Counter int = 0
    BEGINNING:

    INSERT Data SELECT 2007,0,' '
    SELECT @Counter +=@@ROWCOUNT
    IF @Counter < 50000
        GOTO BEGINNING    
    
    SELECT 
         p.partition_id 
        ,lck.request_session_id AS session_id
        , resource_type
        , CASE resource_type 
            WHEN 'OBJECT' THEN    
                OBJECT_NAME(lck.resource_associated_entity_id)
            WHEN 'KEY' THEN OBJECT_NAME(p.object_id)
            WHEN 'PAGE' THEN OBJECT_NAME(p.object_id)
            WHEN 'RID' THEN OBJECT_NAME(p.object_id)
            WHEN 'HOBT' THEN OBJECT_NAME(p.object_id)
            ELSE    
                CAST(resource_associated_entity_id AS NVARCHAR(MAX))
            END 
          AS object_name
        , lck.request_mode
        , i.name AS index_name
    FROM sys.dm_tran_locks lck WITH (NOLOCK)
    LEFT JOIN sys.partitions p WITH (NOLOCK) ON p.hobt_id = lck.resource_associated_entity_id 
    LEFT JOIN sys.indexes i WITH (NOLOCK) ON p.object_id = i.object_id AND          
          p.index_id = i.index_id 
    LEFT JOIN sys.databases db WITH (NOLOCK) ON lck.resource_database_id = db.database_id 
    WHERE lck.request_mode <> 'Sch-S' 
    AND lck.request_session_id = @@SPID
    AND lck.resource_type <> 'DATABASE'
    
COMMIT TRAN
   


No problems here – we were able to delete a big chunk out of one partition while inserting into another, and both partitions display HOBT locking granularity. Very nice.

image

Figure 8.  Concurrent INSERT and UPDATE


YET ANOTHER INTERESTING TEST

Let’s try a heap with the above test.  Drop and recreate the table without the clustered index, and run with 10K rows for each partition. Here’s what I come up with:

image

Figure 9.  Heap – it doesn’t work.

 

CONCLUSION

From my test, partitioning using ALTER TABLE and setting the LOCK_ESCALATION looks to be good news for concurrency in a large table.  I found no problems or collisions from doing various DML to the partitions at the same time.  One thing that I could not explain was why I got HOBT granularity locks from time-to-time with a large amount of data in one partition;  I noticed this when I did not DROP the target table.  This is worth looking into a bit more.

Thanks for reading,

Lee Everest, M.S.

 

----------------------------

“Redo those buttons. Dress that belt buckle. Straighten that cap. And ***damnit tuck up those pajamas!”

smile5

 

Refs: [1]  Kejser, 2008. Enabling Partition Level Locking in SQL Server. Retrieved on 5/10/10 from
http://sqlcat.com/msdnmirror/archive/2010/03/03/enabling-partition-level-locking-in-sql-server-2008.aspx

facebook: face

Table Partitioning and LOCK_ESCALATION in SQL Server 2008

Tuesday, May 11th, 2010

INTRODUCTION

The folks at SQL Cat (SQL Customer Advisory Team) post some really great stuff, and had a blog a couple of months back [1] on partitions and locking.  I haven’t played much with partitions yet in SQL 2005 or 2008 -  haven’t really had a need to up to this point, and haven’t randomly geeked with the technology since I’ve been busy geeking with other stuff.   But I knew that I would eventually come around to them, and now is that time - the client where I am located is wanting to use the technology in a consolidation effort in their data warehouse.  For their implementation, they want to deploy partitioning to segment country codes, and specifically, they want to make sure that DML on one country will not affect another country.  For example,  they want to know if inserts or deletes that are occurring for China affect the people who are querying data from Indonesia.

The goal of my post is to expand what SQLCat did by doing the following:

  1. Change the INSERT to a DELETE statement (theoretically, changing DML type won’t matter)
  2. Increase the percentage of rows to be affected to those of all of the rows in the table
  3. Investigate what SQL Server Profiler captures during this operation
  4. Test a concurrent set of DML actions to the partitions
  5. Test it using a heap table

BACKGROUND, AND AN OPINION

Maybe I haven’t wanted to jump into the feature due to prior experiences.  I started working with SQL Server 6.5, so I am familiar with and have used all of the various flavors that Microsoft has offered to scale through the years, and frankly I was never impressed with any of them.  Federated servers, horizontal partitioning, distributed partitioned views, vertical partitioning, anything using the MSDTC, constraints to direct rows, etc.  I’ve never liked, all of which I believed were a small step short of a hack. Personal opinion here, so take it FWIW.  I am very skeptical when using “scale” and “SQL Server” in the same sentence; I am, however, very impressed thus far with partitions and table partitioning. Let’s gen up a scenario and do this thing.


SQL CAT’S CODE

We’ll set up the tables and indexes using the following, somewhat similar to their script but slightly modified:

USE tempdb
GO
BEGIN TRY
    DROP TABLE dbo.Data
END TRY

BEGIN CATCH
END CATCH
GO

CREATE PARTITION FUNCTION pf_year(INT) 
AS RANGE RIGHT FOR VALUES (2007,2008,2009,2010)
GO
CREATE PARTITION SCHEME ps_year 
AS PARTITION pf_year ALL TO ([PRIMARY])
GO
CREATE TABLE Data 
(    
    [year] INT NOT NULL
    , UpdateValue FLOAT NOT NULL 
    , PayLoad CHAR(200) NOT NULL
) ON ps_year([year])

GO
-- drop index data.cix
CREATE CLUSTERED INDEX CIX ON Data(Year) ON ps_year(Year)
GO

Notice that I have four years in my partition function, which will be really five logical partitions in SQL Server.  We can verify this by running the following query, as this shows our partition ids for our object_id for our “Data” table.  This is very important, because we will look into SQL Profiler to make sure that DML for one partition doesn’t escalate locks to one of the other partitions; if it did, we may experience accessibility/concurrency problems for those other users.  In the screenshot below we can see the partition_id column, and this is what we’ll search for to verify either way.

SELECT * FROM sys.partitions
WHERE OBJECT_ID = OBJECT_ID('Data')


image

Figure 1 – Table with our partitions set up


YES, WE WILL NEED SOME DATA

Let’s populate the table now.  For this test I am deleting data.  What I am going to do is kill two birds with one stone: I am going to check for both locking at the partition level, and am also going to skew the table values just to verify that it has affect on SQL Server by inserting quite a bit more data in the partition that I am going to delete. (Remember that SQL Server will try to escalate up to a table lock if the number of rows being affected is percentage-wise is high compared to the number of rows in the table. SQL Cat used 10,000 rows for all partitions).  Run the following and you will see [rows] in the above query populated.

--/* populate the table */
TRUNCATE TABLE Data
GO
DECLARE @y INT = 2007
WHILE (@y < 2011)
BEGIN
    INSERT Data
    SELECT @y, 0, REPLICATE('X', 200) FROM 
        (SELECT TOP (CASE WHEN @y = 2008 then 1000000 else 10000 end) o1.id
        FROM sys.syscolumns o1
        CROSS JOIN sys.syscolumns o2
        CROSS JOIN sys.syscolumns o3
        ) AS BIG

    SET @y +=1
END

 

image

Figure 2 – Screenshot while table is being populated

image

Figure 3 – Screenshot while table is being populated

 

This is cool – you can see each partition filling up from the above statement.  Because I can use expressions in TOP, I insert more records for year 2008 than the others to get the data skewed via the CASE statement; again, my thinking here is that we will verify that partitions don’t bother each other as they did in their test, but also make sure that deleting a large chunk in one partition will also not affect the entire table.

I am going to use a read committed isolation level since that is what they use at the client (the SQL Server default) and leverage the new feature in SQL Server 2008 to set the lock escalation to AUTO at the table level.  These are worth changing up, by the way, to see the various affects on locking when setting the LOCK_ESCALATION for a given table.

--SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
--SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
--SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

ALTER TABLE Data SET (LOCK_ESCALATION = AUTO)
--ALTER TABLE Data SET (LOCK_ESCALATION = DISABLE)
GO

 

ARE WE THERE YET?

Alright, it’s time to run the script and look at this thing.  As mentioned, SQL Cat did not fire up SQL Server Profiler, so here we will so that we can 1) see lock escalations and 2) view the types of locks.  They also mentioned that you can add the Lock:Escalation event, but I am also going to add Lock:Acquired and Lock:Release as well, since these will give us additional information to get a better picture of the process.  Here’s the code:

BEGIN TRAN
    DELETE Data
    WHERE YEAR=2008

    SELECT 
          lck.request_session_id AS session_id
        , resource_type
        , CASE resource_type 
            WHEN 'OBJECT' THEN    
                OBJECT_NAME(lck.resource_associated_entity_id)
            WHEN 'KEY' THEN OBJECT_NAME(p.object_id)
            WHEN 'PAGE' THEN OBJECT_NAME(p.object_id)
            WHEN 'RID' THEN OBJECT_NAME(p.object_id)
            WHEN 'HOBT' THEN OBJECT_NAME(p.object_id)
            ELSE    
                CAST(resource_associated_entity_id AS NVARCHAR(MAX))
            END 
          AS object_name
        , lck.request_mode
        , i.name AS index_name
    FROM sys.dm_tran_locks lck WITH (NOLOCK)
    LEFT JOIN sys.partitions p WITH (NOLOCK) ON p.hobt_id = lck.resource_associated_entity_id 
    LEFT JOIN sys.indexes i WITH (NOLOCK) ON p.object_id = i.object_id AND 
         p.index_id = i.index_id 
    LEFT JOIN sys.databases db WITH (NOLOCK) ON lck.resource_database_id = db.database_id 
    WHERE lck.request_mode <> 'Sch-S' 
    AND lck.request_session_id = @@SPID
    AND lck.resource_type <> 'DATABASE'
    
COMMIT TRAN


First I’m going to run this with LOCK_ESCALATION = AUTO. Notice in screenshots and that I have something similar to SQLCat; in the profiler shot I see that there is a HOBT lock type at the end. 

image

Figure 4. Profiler with LOCK_ESCALATION = AUTO

 

image

Figure 5. Profiler with LOCK_ESCALATION = DISABLE


The difference here is that HOBT is not seen in Figure 5, but it is in Figure 4, so I’m reasonably sure that we did get the HOBT granularity (Heap or B-Tree) for the first run.  Good deal.


INTERESTING FIND

While the above output was expected, the below two screenshots from the output of the queries for my two tests are somewhat strange and unexpected – they’re the same.  With AUTO and DISABLE, I get both HOBT granularity. After running this a dozen times or more, it appears that with tables with one very large partition over the others, you get the HOBT locking and not row-level locks even with DISABLE.  I retested this over and over as well, starting back with 10K rows for each partition, and then upping the 2008 partition at various increments. Only when I got to about 500,000 rows did it exhibit this behavior, and you could see that the page locks went away and were replaced by the output in Figure 6.

Something to test further for sure; maybe I’ll ping the SQLCat team to comment on this.  What I am finding is that ALTER TABLE seems to be working inconsistently, because every time that I drop the table it seems to work OK.  I’ll follow up here.

image

Figure 6. With LOCK_ESCALATION = AUTO

 

image

Figure 7. With LOCK_ESCALATION = DISABLE

 

TESTING A CONCURRENT INSERT WITH DELETE

Here’s a test that they did not do – I am going to run the above DELETE, but at the same time run a large INSERT into another partition. Set up the first query, and then run this one at the same time in another window.

BEGIN TRAN
    
    DECLARE @Counter int = 0
    BEGINNING:

    INSERT Data SELECT 2007,0,' '
    SELECT @Counter +=@@ROWCOUNT
    IF @Counter < 50000
        GOTO BEGINNING    
    
    SELECT 
         p.partition_id 
        ,lck.request_session_id AS session_id
        , resource_type
        , CASE resource_type 
            WHEN 'OBJECT' THEN    
                OBJECT_NAME(lck.resource_associated_entity_id)
            WHEN 'KEY' THEN OBJECT_NAME(p.object_id)
            WHEN 'PAGE' THEN OBJECT_NAME(p.object_id)
            WHEN 'RID' THEN OBJECT_NAME(p.object_id)
            WHEN 'HOBT' THEN OBJECT_NAME(p.object_id)
            ELSE    
                CAST(resource_associated_entity_id AS NVARCHAR(MAX))
            END 
          AS object_name
        , lck.request_mode
        , i.name AS index_name
    FROM sys.dm_tran_locks lck WITH (NOLOCK)
    LEFT JOIN sys.partitions p WITH (NOLOCK) ON p.hobt_id = lck.resource_associated_entity_id 
    LEFT JOIN sys.indexes i WITH (NOLOCK) ON p.object_id = i.object_id AND          
          p.index_id = i.index_id 
    LEFT JOIN sys.databases db WITH (NOLOCK) ON lck.resource_database_id = db.database_id 
    WHERE lck.request_mode <> 'Sch-S' 
    AND lck.request_session_id = @@SPID
    AND lck.resource_type <> 'DATABASE'
    
COMMIT TRAN
   


No problems here – we were able to delete a big chunk out of one partition while inserting into another, and both partitions display HOBT locking granularity. Very nice.

image

Figure 8.  Concurrent INSERT and UPDATE


YET ANOTHER INTERESTING TEST

Let’s try a heap with the above test.  Drop and recreate the table without the clustered index, and run with 10K rows for each partition. Here’s what I come up with:

image

Figure 9.  Heap – it doesn’t work.

 

CONCLUSION

From my test, partitioning using ALTER TABLE and setting the LOCK_ESCALATION looks to be good news for concurrency in a large table.  I found no problems or collisions from doing various DML to the partitions at the same time.  One thing that I could not explain was why I got HOBT granularity locks from time-to-time with a large amount of data in one partition;  I noticed this when I did not DROP the target table.  This is worth looking into a bit more.

Thanks for reading,

Lee Everest, M.S.

 

----------------------------

“Redo those buttons. Dress that belt buckle. Straighten that cap. And ***damnit tuck up those pajamas!”

smile5

 

Refs: [1]  Kejser, 2008. Enabling Partition Level Locking in SQL Server. Retrieved on 5/10/10 from
http://sqlcat.com/msdnmirror/archive/2010/03/03/enabling-partition-level-locking-in-sql-server-2008.aspx

facebook: face

Table Partitioning and LOCK_ESCALATION in SQL Server 2008

Tuesday, May 11th, 2010

INTRODUCTION

The folks at SQL Cat (SQL Customer Advisory Team) post some really great stuff, and had a blog a couple of months back [1] on partitions and locking.  I haven’t played much with partitions yet in SQL 2005 or 2008 -  haven’t really had a need to up to this point, and haven’t randomly geeked with the technology since I’ve been busy geeking with other stuff.   But I knew that I would eventually come around to them, and now is that time - the client where I am located is wanting to use the technology in a consolidation effort in their data warehouse.  For their implementation, they want to deploy partitioning to segment country codes, and specifically, they want to make sure that DML on one country will not affect another country.  For example,  they want to know if inserts or deletes that are occurring for China affect the people who are querying data from Indonesia.

The goal of my post is to expand what SQLCat did by doing the following:

  1. Change the INSERT to a DELETE statement (theoretically, changing DML type won’t matter)
  2. Increase the percentage of rows to be affected to those of all of the rows in the table
  3. Investigate what SQL Server Profiler captures during this operation
  4. Test a concurrent set of DML actions to the partitions
  5. Test it using a heap table

BACKGROUND, AND AN OPINION

Maybe I haven’t wanted to jump into the feature due to prior experiences.  I started working with SQL Server 6.5, so I am familiar with and have used all of the various flavors that Microsoft has offered to scale through the years, and frankly I was never impressed with any of them.  Federated servers, horizontal partitioning, distributed partitioned views, vertical partitioning, anything using the MSDTC, constraints to direct rows, etc.  I’ve never liked, all of which I believed were a small step short of a hack. Personal opinion here, so take it FWIW.  I am very skeptical when using “scale” and “SQL Server” in the same sentence; I am, however, very impressed thus far with partitions and table partitioning. Let’s gen up a scenario and do this thing.


SQL CAT’S CODE

We’ll set up the tables and indexes using the following, somewhat similar to their script but slightly modified:

USE tempdb
GO
BEGIN TRY
    DROP TABLE dbo.Data
END TRY

BEGIN CATCH
END CATCH
GO

CREATE PARTITION FUNCTION pf_year(INT) 
AS RANGE RIGHT FOR VALUES (2007,2008,2009,2010)
GO
CREATE PARTITION SCHEME ps_year 
AS PARTITION pf_year ALL TO ([PRIMARY])
GO
CREATE TABLE Data 
(    
    [year] INT NOT NULL
    , UpdateValue FLOAT NOT NULL 
    , PayLoad CHAR(200) NOT NULL
) ON ps_year([year])

GO
-- drop index data.cix
CREATE CLUSTERED INDEX CIX ON Data(Year) ON ps_year(Year)
GO

Notice that I have four years in my partition function, which will be really five logical partitions in SQL Server.  We can verify this by running the following query, as this shows our partition ids for our object_id for our “Data” table.  This is very important, because we will look into SQL Profiler to make sure that DML for one partition doesn’t escalate locks to one of the other partitions; if it did, we may experience accessibility/concurrency problems for those other users.  In the screenshot below we can see the partition_id column, and this is what we’ll search for to verify either way.

SELECT * FROM sys.partitions
WHERE OBJECT_ID = OBJECT_ID('Data')


image

Figure 1 – Table with our partitions set up


YES, WE WILL NEED SOME DATA

Let’s populate the table now.  For this test I am deleting data.  What I am going to do is kill two birds with one stone: I am going to check for both locking at the partition level, and am also going to skew the table values just to verify that it has affect on SQL Server by inserting quite a bit more data in the partition that I am going to delete. (Remember that SQL Server will try to escalate up to a table lock if the number of rows being affected is percentage-wise is high compared to the number of rows in the table. SQL Cat used 10,000 rows for all partitions).  Run the following and you will see [rows] in the above query populated.

--/* populate the table */
TRUNCATE TABLE Data
GO
DECLARE @y INT = 2007
WHILE (@y < 2011)
BEGIN
    INSERT Data
    SELECT @y, 0, REPLICATE('X', 200) FROM 
        (SELECT TOP (CASE WHEN @y = 2008 then 1000000 else 10000 end) o1.id
        FROM sys.syscolumns o1
        CROSS JOIN sys.syscolumns o2
        CROSS JOIN sys.syscolumns o3
        ) AS BIG

    SET @y +=1
END

 

image

Figure 2 – Screenshot while table is being populated

image

Figure 3 – Screenshot while table is being populated

 

This is cool – you can see each partition filling up from the above statement.  Because I can use expressions in TOP, I insert more records for year 2008 than the others to get the data skewed via the CASE statement; again, my thinking here is that we will verify that partitions don’t bother each other as they did in their test, but also make sure that deleting a large chunk in one partition will also not affect the entire table.

I am going to use a read committed isolation level since that is what they use at the client (the SQL Server default) and leverage the new feature in SQL Server 2008 to set the lock escalation to AUTO at the table level.  These are worth changing up, by the way, to see the various affects on locking when setting the LOCK_ESCALATION for a given table.

--SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
--SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
--SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

ALTER TABLE Data SET (LOCK_ESCALATION = AUTO)
--ALTER TABLE Data SET (LOCK_ESCALATION = DISABLE)
GO

 

ARE WE THERE YET?

Alright, it’s time to run the script and look at this thing.  As mentioned, SQL Cat did not fire up SQL Server Profiler, so here we will so that we can 1) see lock escalations and 2) view the types of locks.  They also mentioned that you can add the Lock:Escalation event, but I am also going to add Lock:Acquired and Lock:Release as well, since these will give us additional information to get a better picture of the process.  Here’s the code:

BEGIN TRAN
    DELETE Data
    WHERE YEAR=2008

    SELECT 
          lck.request_session_id AS session_id
        , resource_type
        , CASE resource_type 
            WHEN 'OBJECT' THEN    
                OBJECT_NAME(lck.resource_associated_entity_id)
            WHEN 'KEY' THEN OBJECT_NAME(p.object_id)
            WHEN 'PAGE' THEN OBJECT_NAME(p.object_id)
            WHEN 'RID' THEN OBJECT_NAME(p.object_id)
            WHEN 'HOBT' THEN OBJECT_NAME(p.object_id)
            ELSE    
                CAST(resource_associated_entity_id AS NVARCHAR(MAX))
            END 
          AS object_name
        , lck.request_mode
        , i.name AS index_name
    FROM sys.dm_tran_locks lck WITH (NOLOCK)
    LEFT JOIN sys.partitions p WITH (NOLOCK) ON p.hobt_id = lck.resource_associated_entity_id 
    LEFT JOIN sys.indexes i WITH (NOLOCK) ON p.object_id = i.object_id AND 
         p.index_id = i.index_id 
    LEFT JOIN sys.databases db WITH (NOLOCK) ON lck.resource_database_id = db.database_id 
    WHERE lck.request_mode <> 'Sch-S' 
    AND lck.request_session_id = @@SPID
    AND lck.resource_type <> 'DATABASE'
    
COMMIT TRAN


First I’m going to run this with LOCK_ESCALATION = AUTO. Notice in screenshots and that I have something similar to SQLCat; in the profiler shot I see that there is a HOBT lock type at the end. 

image

Figure 4. Profiler with LOCK_ESCALATION = AUTO

 

image

Figure 5. Profiler with LOCK_ESCALATION = DISABLE


The difference here is that HOBT is not seen in Figure 5, but it is in Figure 4, so I’m reasonably sure that we did get the HOBT granularity (Heap or B-Tree) for the first run.  Good deal.


INTERESTING FIND

While the above output was expected, the below two screenshots from the output of the queries for my two tests are somewhat strange and unexpected – they’re the same.  With AUTO and DISABLE, I get both HOBT granularity. After running this a dozen times or more, it appears that with tables with one very large partition over the others, you get the HOBT locking and not row-level locks even with DISABLE.  I retested this over and over as well, starting back with 10K rows for each partition, and then upping the 2008 partition at various increments. Only when I got to about 500,000 rows did it exhibit this behavior, and you could see that the page locks went away and were replaced by the output in Figure 6.

Something to test further for sure; maybe I’ll ping the SQLCat team to comment on this.  What I am finding is that ALTER TABLE seems to be working inconsistently, because every time that I drop the table it seems to work OK.  I’ll follow up here.

image

Figure 6. With LOCK_ESCALATION = AUTO

 

image

Figure 7. With LOCK_ESCALATION = DISABLE

 

TESTING A CONCURRENT INSERT WITH DELETE

Here’s a test that they did not do – I am going to run the above DELETE, but at the same time run a large INSERT into another partition. Set up the first query, and then run this one at the same time in another window.

BEGIN TRAN
    
    DECLARE @Counter int = 0
    BEGINNING:

    INSERT Data SELECT 2007,0,' '
    SELECT @Counter +=@@ROWCOUNT
    IF @Counter < 50000
        GOTO BEGINNING    
    
    SELECT 
         p.partition_id 
        ,lck.request_session_id AS session_id
        , resource_type
        , CASE resource_type 
            WHEN 'OBJECT' THEN    
                OBJECT_NAME(lck.resource_associated_entity_id)
            WHEN 'KEY' THEN OBJECT_NAME(p.object_id)
            WHEN 'PAGE' THEN OBJECT_NAME(p.object_id)
            WHEN 'RID' THEN OBJECT_NAME(p.object_id)
            WHEN 'HOBT' THEN OBJECT_NAME(p.object_id)
            ELSE    
                CAST(resource_associated_entity_id AS NVARCHAR(MAX))
            END 
          AS object_name
        , lck.request_mode
        , i.name AS index_name
    FROM sys.dm_tran_locks lck WITH (NOLOCK)
    LEFT JOIN sys.partitions p WITH (NOLOCK) ON p.hobt_id = lck.resource_associated_entity_id 
    LEFT JOIN sys.indexes i WITH (NOLOCK) ON p.object_id = i.object_id AND          
          p.index_id = i.index_id 
    LEFT JOIN sys.databases db WITH (NOLOCK) ON lck.resource_database_id = db.database_id 
    WHERE lck.request_mode <> 'Sch-S' 
    AND lck.request_session_id = @@SPID
    AND lck.resource_type <> 'DATABASE'
    
COMMIT TRAN
   


No problems here – we were able to delete a big chunk out of one partition while inserting into another, and both partitions display HOBT locking granularity. Very nice.

image

Figure 8.  Concurrent INSERT and UPDATE


YET ANOTHER INTERESTING TEST

Let’s try a heap with the above test.  Drop and recreate the table without the clustered index, and run with 10K rows for each partition. Here’s what I come up with:

image

Figure 9.  Heap – it doesn’t work.

 

CONCLUSION

From my test, partitioning using ALTER TABLE and setting the LOCK_ESCALATION looks to be good news for concurrency in a large table.  I found no problems or collisions from doing various DML to the partitions at the same time.  One thing that I could not explain was why I got HOBT granularity locks from time-to-time with a large amount of data in one partition;  I noticed this when I did not DROP the target table.  This is worth looking into a bit more.

Thanks for reading,

Lee Everest, M.S.

 

----------------------------

“Redo those buttons. Dress that belt buckle. Straighten that cap. And ***damnit tuck up those pajamas!”

smile5

 

Refs: [1]  Kejser, 2008. Enabling Partition Level Locking in SQL Server. Retrieved on 5/10/10 from
http://sqlcat.com/msdnmirror/archive/2010/03/03/enabling-partition-level-locking-in-sql-server-2008.aspx

facebook: face

Popular posts on this site from Google! Analytics

Monday, May 10th, 2010

SSIS-related posts getting the most traffic by far.

I was looking at Google Analytics the other day on my blog and wanted to know what topics and pages generate the most traffic;  I sort of had a good idea but wanted to verify by checking over the past year or so. Here’s the list of the top five pages on this site (not including off-topic and default.aspx page) for distinct hits.  Kind of interesting I guess.

  1. Use variables in the SSIS data flow tab
     
  2. Reason Token Based Server Access Violation Failed Error

  3. sp_executesql string size in SQL Server 2005, 2008 

  4. Create a custom assembly in SSIS 

  5. SSIS Pivot, a simple and easy explanation

It turns out that three of the top five posts (60% for those who are math-challenged) that gain the most page views are SSIS-related posts.  I think it’s cool to know that so many people are using SSIS and are looking for assistance and examples to help them in their shops.  When I need some help, a blog post with a generic and straightforward example with screen shots is imperative, and I have tried to make my examples easy to read without a lot of clutter as well.  I'm looking forward to doing more SSIS posts in the future alongside some of our great SSIS practitioners and experts, such as SQL Sever MVP Jamie Thomson and SQL Server MVP Tim Mitchell;  it's guys like these who give us all great ideas and motivation to go explore SSIS further.

I hope that I have helped someone out along the way and, as always, I appreciate your visiting my blog and for the great feedback that you send.  This blog is my scratchpad for jotting stuff down, but truth be told I do this for you guys, not myself.

Thank you,

Lee

------------------------------------

 


face

MongoDB vs. SQL Server – INSERT comparison

Sunday, April 4th, 2010

Geeking with MongoDB and .Net.

Time for another off-topic post?  Well, maybe not completely I suppose.  I had a .Net expert contact me this week and ask me if I had done any performance comparisons with a file or document-based database - often termed "NOSQL" databases - to SQL Server.   Geeking with all of this other stuff on my plate, I promptly responded no but would check into it.  Knowing very little about these systems, it wasn't terribly difficult to guess that for some DML operations one of these databases (Cassandra, MongoDB, others) would probably beat SQL Server hands-down based on what they were.  Let's check it out nonetheless.

Using the Sam Corder (samus) driver I ran the following simple test in .Net, inserting 500K rows into MongoDB, and then running the TSQL that follows, inserting 500K rows. 

public static void Main(string[] args)
        {
            using (var mongo = new Mongo())
            {
                mongo.Connect();
                var db = mongo.GetDatabase("Data");

                var DataTest = db.GetCollection("DataTest");
             
                for (int i = 1; i < 500001; i++)
                {
                    var Data = new Document();
                
                    Data["GUID"] = new Guid();
                    Data["value"] = i;
                    DataTest.Insert(Data);
                }
        }        

DROP TABLE MongoCompare
GO
CREATE TABLE MongoCompare
    (guid uniqueidentifier
    ,value int
    )
GO

DECLARE @id int = 1
WHILE (@id < 500001)
BEGIN
    INSERT INTO MongoCompare VALUES (newid(), @id)
    SET @id+=1
END
GO

 

Using SQL Server 2008 Development Edition, the TSQL script took about 1.5 minutes, and the MongoDB took roughly 30 seconds (.5 minutes) to complete, which I sort of expected. We know that implicit transactions fire for each row in an RDBMS, and MongoDB, a non-relational document based system, has no ACID properties to keep things "transactionally-consistent" so to speak, although you could create this in the business logic layer of a rich or web client for sure, which would undoubtedly make the comparison a little closer I would say.  I should also mention that Perfmon CPU utilization went nutso during the MongoDB insert.

It's funny and amusing to read all of the comments around this technology these days, by the way; competition from either MySQL, Oracle, DB2, or even these NOSQL databases, such as Hadoop, HBase, Cassandra, Hypertable, etc. will probably make our product even better in some form or fashion.

Need to do some more stuff on this in the future. Who knows - some department or team somewhere down the road may have one of these databases that I'll have to source and merge with data from SQL Server.

Lee

 

------------------------

easter1

 

 

 

face

Interesting TSQL query plan for NOT IN vs. NOT EXISTS

Friday, April 2nd, 2010

Ran across this one the other day.

I was searching for something the other day on Google!, can’t remember what right now, and happened upon an old post [1] (old for the internet – 1998) entitled The EXISTS Flaw.  This referenced a C.J. Date from September, 1989 and his postulate on a flaw with the EXISTS syntax of the structured query language.  I’ve been using SQL and TSQL now for about 15 years and I never heard of it…glad I learn something new every day, right?.  So I look at this posting and right off determined that the author 1) isn’t comparing apples-to-apples in his two queries, and 2) is actually witnessing how each operator – IN and EXISTS – is supposed to work, by definition.  That is beside the point, because this particular blog posting is not to engage anyone in a debate (I might save that one for another occasion) but to instead look into the behavior of why NOT IN works as it does.

While checking this out I discovered something interesting when you use NOT IN that I wanted to share. Run the following to create the table and insert the rows which he has done.

DROP TABLE sp
go
CREATE TABLE sp
(sno char(2)
,pno char(2)
,qty int
)
GO

INSERT sp VALUES ('S1', 'P1', NULL)
INSERT sp VALUES ('S2', 'P1', 200)
INSERT sp VALUES ('S3', 'P1', 1000) 
 

The question here is: “Find supplier numbers for suppliers who are known to supply part P1, but not in a quantity of 1,000”. 

SELECT DISTINCT spx.sno
FROM sp spx
WHERE spx.pno = 'P1'
AND 1000 NOT IN
    ( SELECT spy.qty
    FROM   sp spy
    WHERE spy.sno = spx.sno
    AND      spy.pno = 'P1');
 
SELECT DISTINCT spx.sno
FROM sp spx
WHERE spx.pno = 'P1'
AND NOT EXISTS
    ( SELECT spy.qty
    FROM sp spy
    WHERE spy.sno = spx.sno
    AND spy.pno = 'P1'
    AND spy.qty = 1000);

 

image

Now, look at the result set.  You will notice that a S2 returns for the first query, but that suppliers S1 and S2 return in the second query. What’s up with this?  The author states that EXISTS does not return the correct result set.  Let us ask now, does NULL not equal 1000?  Without question yes, NULL does not equal 1000, and in fact NULL equals nothing, not even NULL.  So in this light you could make a case that in fact the first query is incorrect because our question wants to know who carries part P1 in a quantity other than 1000, and NULL is not a quantity equal to 1000.

What I find interesting here, however, is to understand why the first query returns only S2 and not why the second returns both S1 and S2, and we do this with none other than a query execution plan.  Let’s look at them both:

 

image

Figure 1. Execution Plan (est) for query 1

image

Figure 2. Execution Plan (est) for query 2

For Figure 2, notice how for the predicates we find that qty=1000 and pno=’P1’ as expected. Now look at the same plan in Figure 1 and check out the differences – NOT IN also includes pno=’P1’ , but it’s interesting that it includes the compound predicate AND (qty IS NULL OR qty = 1000).  I am quite certain here that in my statement I have nothing that tells NOT IN to disregard NULLs.  This I cannot explain.  Why does NOT IN handle this as such?  Why is NOT IN doing me a favor and checking for (and throwing out of the recordset) the NULL? 

Additionally, it would seem that the IN and NOT IN work differently.  If you remember, IN actually flattens out a TSQL statement to a series of ORs (Figure 3), but a NULL won’t be handled as it appears to be under the NOT IN scenario because nothing can equal NULL. If you were to add “, NULL” to the SELECT list you would not get sno equal to 1.

 

image

Figure 3.  IN flattens out to OR



SELECT * FROM sp
WHERE qty IN (200, 1000)

 

Notice when comparing Figure 1 to Figure 3 the vast difference in behavior for an IN and NOT IN. Wouldn’t one expect that IN would also include the following, since NOT IN adds this caveat?  Makes sense to me…

OR [Test].[dbo].[sp].[qty] IS NOT NULL

 

Try some of these experiments with IN and NOT IN, and NOT EXISTS and see what you can come up with.  I find that geeking with these peculiarities is a lot of fun.

Thanks for reading!

Lee

 

------------------------

“I haven't felt this awful since we saw that Ronald Reagan film.”

easter2010

 

face

[1]  FFE Software (1998). The Exists Flaw. Retrieved 4/2/10 from http://firstsql.com/iexist2.htm

Demystifying the SSIS Script Component, Part 2

Monday, March 29th, 2010

A tutorial on the Script Component, continued.

Introduction

The Script Component allows you to 1) connect to source data, 2) transform data in the pipeline on a row-by-row basis, or 3) build a custom data flow destination.  In part one we looked at two examples for using it as a source.  This time let’s look at the second option – transformation – and explore how this functionality works. Remember that when you drag this task to the data flow window in SSIS (again, I am using SQL Server Business Intelligence Development Studio, or BIDS) you will be prompted to choose how you want the component to function. I have two more examples here for your consideration, both of which are straightforward and easy to understand. 

I appreciate the emails and comments on the first part, by the way, on this series.  Unfortunately, this one is a bit more long-winded – apologies in advance – but hopefully you will be able to follow along and glean some useful information here as well.

Who am I? Why am I here?

The goal of using the Script Component (or Script Task for that matter) is to allow the ETL developer a way to accomplish a task when a delivered task won’t work.  This is a very important alternative that we have at our disposal; several of the popular ETL tools within the industry have dozens, even hundreds, of canned or ‘baked-in’ tasks.  Unfortunately our product isn’t that mature yet, but we do have this outlet available to us, and until such time that we get the full bevy of ETL tasks I find this option worthwhile. Personal opinion? Yes, so take it for what it’s worth.

When I think of a transformation using the Script Component (let’s call it SC from here on out) the first thing that pops into my mind is Regex. Verifying formats of names, addresses, phone numbers, zip codes, IP addresses,
et. al. is an excellent way to use the SC, and there are many examples on Google! demonstrating this.  Here, let’s try something different. I am going to use a text file as a source and then do some stuff in c#, the first using a simple method, and the second using the Extension Methods, a feature in .Net 3.5/4.0 that is very cool.  Again, both are  easy for you to follow along and for the most part uneventful.  If you will, get used to wiring-up the SC first, and from there you may begin to go into some more complicated transformations.  Let’s get started. 

Example 1 – A simple Script Component Transformation

First, save a text file and include some data similar to what I have in Figure 1. Open up BIDS, add a new project, add a Data Flow Task to the Control Flow tab, switch to the Data Flow tab, add a Flat File Source and a connection to the text file, connect to it, and then add a Script Component.  For the text file, leave the column names as they are - we’ll change them later, but I went ahead and created the data-types now. It’s personal preference, so do it how you normally edit transformations and/or what order you are comfortable with.

Double-click on the SC and you’ll get the prompt to choose the behavior on the SC. Go ahead and make sure that it’s on “Transformation” – Figure 2 - and click OK. Open up the SC, and navigate to the “Input Columns” choice on the left nav. There you can click all three columns (Column 0, 1, and 2) to add them so that they are included in the source. At this point make sure that you have all of them selected. See Figure 3.

image 

Figure 1. Example 1 data in a text file

 

image

Figure 2. The Transformation option in the SC

 

image

Figure 3. Including the input columns in the Script Component

OK, here’s where the fun part starts.  Navigate to “Inputs and Outputs” on the left nav.  I would begin here by renaming the input and output buffers to simple names as I have (INPUT and OUTPUT), and add columns to the output portion has shown. (Quick point: you can’t name the inputs and outputs the same or you’ll get an error later, so notice that I left my input columns the default but renamed the outputs). For the output, I created four columns: outKey, outValue1, outValue2, and outGreatest. Make sure to change the data types as well. I have outKey as string, or string [DT_STR], and all others unsigned integer (four-byte signed integer [DT_I4]). Note in Figure 4 that I start by passing three columns in, but end up with four in my output buffer.

image

Figure 4. Adding the inputs and outputs to the Input and Output buffers

 

The anticipation is almost unbearable!

In the Script Transformation Editor, choose “Script”, and then click on Edit Script so that you get to the Visual Studio Tools for Applications (VSTA) window. First off and before we do anything else, because we’ll be accessing features in .Net 3.5 in example 2, go ahead and go to Project >> Properties, and change the Target Framework to .Net Framework 3.5 (Figure 5).  Unless there is some pressing reason, always do this so you can get the latest .Net libraries.

image 

Figure 5.  Change to the .Net 3.5 framework

 

Now paste the following code:

public override void INPUT_ProcessInputRow(INPUTBuffer Row)
   {
       Row.outKey = Row.Column0;
       Row.outValue1 = Row.Column1;
       Row.outValue2 = Row.Column2;
       Row.outGreatest = Greatest(Row.Column1, Row.Column2);
   }

   static public int Greatest(int val1, int val2)
   {
       return (val1 > val2 ? val1 : val2);
   }

Let’s study this for a moment.  In a nutshell, a row has already been added for us and the get and set accessors as well in the BufferWrapper.cs and ComponentWrapper.cs files; all that is needed in our code is to carry out the chosen transformation and map input columns to output columns.  Sounds somewhat difficult? No, not really. In my opinion this is more straightforward than the source SC.  My Row.outputcolumn was named as such so that you can see how I get the values from the source text file (Row.Column 0 through Column2) to the output buffer columns.  If you can see this mapping, then you’re in good shape.

Our transformation here is a simple comparison method that returns the greatest of two integers, somewhat cheesy, yeah, but again we’re not going for a difficult transformation here.  Essentially, we are passing through the three columns that we started with, but we add a new one called outGreatest, a value that we get for each row after we take the two numeric values and send them to our ternary function called Greatest().

Add an OLE destination, and then rename the table to SC_Transform_1.  You will have several columns available, but go ahead and highlight and delete the original input rows, as I have in Figure 6.  Don’t worry – SSIS won’t give you that warning that you have columns that need to be removed for better performance!  Also note how I have named mine.  Finally, in Figure 7, I have connected them in the mappings to my out columns.  Finish and close up, and then run the package. Your result set should match what is in Figure 8.

image

Figure 6. Create a destination, but lose the extra columns

 

image

Figure 7.  Mapping the inputs to outputs

 

image

Figure 8.  Final output for Example 1

 

Example 2 – Another Script Component Transformation

The steps for this one are about the same as Example 1.  First up, create a new package, paste the two connections that you have from the first example, and then paste the data flow task from the control flow of the first package to the control flow of the second package. You can paste connections one at a time, but when you copy the task on the control flow, all of the underlying data flow tasks come along as well. Switch to the Data Flow tab of package 2 and notice the red x’s in these. We’ll fix these in a minute.  Rename the text file source as I have in the below Figure 9.

image

Figure 9.  Copying the contents of our first package

 

Create a new source file and add data as it looks in Figure 10. Mine has three rows, and seven columns, including one “key” and six columns of integer data, all comma-separated. Connect to your text file in Flat File connection manager that you just copied over from Example 1, and point it to the file below.

image

Figure 10.  Data for our second package

Connect the new text file to the script component, point to it, and then open up the SC. This time, forget about passing through all of the columns and lets just choose to bring in the key, the Min, Max, and Avg columns. Notice the double type for the average output. (See Figure 11).

image

Figure 11.  Creating the output columns and assigning their data types

 

Open the SC Edit Script button and add Using System.Linq; to your code; if not you won’t be able to use the Extension Methods that I mentioned previously. After you have added this, add the following code.  You’ll notice that I’ve wrapped it so it doesn’t go off my page here - c# like c or c++ doesn’t care if you have CRs in your code:

public override void INPUT_ProcessInputRow(INPUTBuffer Row)
{
    Row.outKey = Row.Column0;
    Row.outMin = (new int[] { Row.Column1, Row.Column2, Row.Column3, Row.Column4,
                Row.Column5, Row.Column6 }).Min();
    Row.outMax = (new int[] { Row.Column1, Row.Column2, Row.Column3, Row.Column4, 
|               Row.Column5, Row.Column6 }).Max();
    Row.outAvg = (new double[] { Row.Column1, Row.Column2, Row.Column3, Row.Column4, 
                Row.Column5, Row.Column6 }).Average();        
} 

Create a new destination table similar to the first example, and remove the extra columns from the output buffer.

image

Figure 12. Destination table in SQL Server for Example 2

Run your package and verify the results.

image

Figure 13. Results for Example 2

 

Conclusion

The script component as a transformation is simple to use, yet very powerful. Follow these examples to get up and running, and then add your own super-stealthy transformations to your components. I chose these particular methods because both demonstrated the ease which you can add to your data flows.  In the first one, we used a greater than ternary function to grab the greatest of two numbers. In the second, one which I particularly liked, we deployed the new Extension Methods in .Net. 3.5.  This operation gave us values (min, max, avg)  across a row of data, not a straightforward operation in TSQL I might add.

Continue experimenting with the Script Component, and look for the concluding Part 3 on this blog series soon. If you have any questions or something that you would like to show me or share, feel free to ping me.

Thanks for reading,

Lee

 

----------------------------

Samir: “No, not again. I... why does it say paper jam when there is no paper jam? I swear to ***, one of these days, I just kick this piece of s*** out the window.”

surfing

 

 

 face

Lee

Use RAISERROR over PRINT for returning messages quickly

Thursday, March 25th, 2010

Get messages while the batch is running.

Coming back to data warehousing in July 2009 has been a lot of fun for me. I haven’t done this stuff since I left Transamerica Life in 2005 I think it was, and realize that I enjoy working with facts, dimensions, etc. designing and building data marts and warehouses, and adding some value to the business. One thing that I haven’t missed is the long processing times; many routines or batches or processes, however you may refer to them, take hours or even days to complete. The quick turnaround in OLTP is definitely one thing I like about fast-paced transaction oriented database systems.

For long-running processes written in TSQL, I see a lot of folks adding PRINT statements to their work. Why they do this I have no idea – because PRINT is definitely not the way for really any type of TSQL coding that I can think of. I’d rather use RAISERROR. Even Books Online lists these reasons for using RAISERROR over PRINT:

RAISERROR can also be used to return messages. RAISERROR has these advantages over PRINT:

  • RAISERROR supports substituting arguments into an error message string using a mechanism modeled on the printf function of the C language standard library.
  • RAISERROR can specify a unique error number, a severity, and a state code in addition to the text message.
  • RAISERROR can be used to return user-defined messages created using the sp_addmessage system stored procedure.

Unfortunately, they left out the most important reason – RAISERROR has the ability to ‘print’ the message while the batch is still running, without having to wait for the entire batch to complete. I looked over code from a 5-6 year SQL guy this week, and was surprised that he had all of these PRINT statements in his code. What if the process takes 36 hours to complete? You can’t know where a process is during the middle of it with PRINT. In this case you would need to have a logging process to see which procedure/script the process was executing. I always thought that this was Database 201, second-year stuff, maybe not.  Let’s review it to be sure that we have it down:

Here’s an example:

BEGIN
    WAITFOR DELAY '00:00:03.00'

    PRINT 'DELAY 1 HAS ENDED'

    WAITFOR DELAY '00:00:03.00'

    PRINT 'DELAY 2 HAS ENDED'
END

 

Note that you will not see the message until the entire script has completed. (If you need to go into text mode in SSMS by doing a CTL + T).  Here is the output:

DELAY 1 HAS ENDED
DELAY 2 HAS ENDED

 

Let’s run RAISERROR now. Note that I use the WITH NOWAIT option, so that it returns the result to the client…without waiting. You can see that, in the picture below, at 3 seconds after execution the first RAISERROR fires as the script keeps running, and at the end of the second raiserror the ‘DELAY 2’ message will run.

BEGIN
    WAITFOR DELAY '00:00:03.00'

    RAISERROR ('DELAY 1 HAS ENDED', 10,1) WITH NOWAIT

    WAITFOR DELAY '00:00:03.00'

    RAISERROR ('DELAY 2 HAS ENDED', 10,1) WITH NOWAIT

END

 

image


Keep this in mind the next time you start adding PRINT statements to your code. You’re probably better served to use RAISERROR. Maybe I’ll blog another one on the printf style messaging for RAISERROR, familiar if you had c or c++ in school. There are some neat things that you can do using TRY CATCH in this instance as well.

Thanks for reading,

Lee

 

------------------------

Tom Smykowski: Well-well look. I already told you: I deal with the * damn customers so the engineers don't have to. I have people skills; I am good at dealing with people. Can't you understand that? What the hell is wrong with you people?

stirthepot

 

 

 face