SQL Server Data Mining in the Cloud Review
The SQL Server 2008 Data Mining team is showing off some slick new technology at the KDD 2008 show in Las Vegas. You didn’t get your trip to Vegas approved, you say? Well, it pays to read Jamie’s Junk, also known as “The official blog of the SQL Server Analysis Services Development Manager(tm).” He wrote a blog entry called KDD 2008 and Incredibly Awesome SQL 2008 Data Mining Demos, and I spent some time working with it to find out whether he was blowing things out of proportion.
Install the Free Excel Data Mining Plugin
At http://www.sqlserverdatamining.com/cloud/ you can play with a browser-based SQL Server Analysis Services server. Upload a CSV file or use one of their samples, and analyze away. You’ll get a quick feel for some amazing technology, but for me, the real winner is their Excel 2007 Table Analysis Tools for the Cloud plugin, and I’ll show you how to use it. (Go download that and install it fast before they figure out how popular it’s going to get.)
You get all of the power of a SQL Server Analysis Services server under your desk, but without any of the licensing, training or setup costs associated with SSAS. It’s data mining for the rest of us.
Cleaning Up the Source Data
Let’s start with a simple file that I have to analyze all the time: results from Performance Monitor. We’ve written a lot here about how to analyze Performance Monitor results to see where your SQL Server bottlenecks are, but one of the tough steps is to figure out when your server is undergoing peak loads, and what’s causing those loads.
At right, check out the raw results of Perfmon: a plain CSV file. We’re going to clean it up a little, but I want to show you the “before” to understand what we need to clean up. We’re not going to do any heavy lifting, just a couple of tweaks.
We want the headers to be readable by human beings, because we’re going to use them as labels. In cell A1, change the label to “Date”, and in the rest of the cells in row 1, do a search & replace to strip out the \\MYSERVERNAME\ prefix. (If you have multiple servers in your Perfmon data, leave the server names in, of course.)
Format the date column, A, to show the date and time instead of the number format. That’s the timestamp when the performance data was captured. To enhance our analysis, we want to add a couple of date calculations. SQL Server performance often depends on the day of the week and the time of day – weekends might have lower performance needs than weekdays, and we might have daily jobs at a certain time that stress the system. Add two new columns for Day of Week (using the WeekDay function, e.g. =WEEKDAY(A2)) and Hour of Day (using the Hour function, e.g. =HOUR(A2)), and fill them down the whole column.
Finally, we need to tell Excel that this huge amount of numbers is actually a table. Click on any cell that has data (not an empty cell off to the side), click on the Home tab in the Office Ribbon at the top of the screen, and click on the Format as Table button in the ribbon. Click on any color style that pops up – I’m partial to Table Style Medium 2 myself. Excel will highlight the entire data range to confirm; make sure the “My table has headers” box is checked, then click OK.
Now would be a good time to save our work – save the file as an Excel file, not a CSV file.
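By the way, if you find yourself cleaning up a lot of Perfmon logs, this whole step scripts nicely. Here’s a minimal pandas sketch of the same cleanup – the file names and MYSERVERNAME are placeholders for your own capture, and writing .xlsx requires the openpyxl package:

```python
import pandas as pd

# Placeholder file name - point this at your own Perfmon CSV export.
df = pd.read_csv("perfmon.csv")

# Rename the first column and strip the \\MYSERVERNAME\ prefix from the
# counter headers so they work as human-readable labels.
df.columns = ["Date"] + [c.replace("\\\\MYSERVERNAME\\", "")
                         for c in df.columns[1:]]

# Parse the capture timestamp, then add the two derived columns.
df["Date"] = pd.to_datetime(df["Date"])
df["Day of Week"] = df["Date"].dt.dayofweek  # Monday=0 here; Excel's WEEKDAY defaults to Sunday=1
df["Hour of Day"] = df["Date"].dt.hour

# Save as a real Excel file, ready for the Format as Table step.
df.to_excel("perfmon_clean.xlsx", index=False)
```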
Connecting To SQL Server Data Mining in the Cloud
Assuming the “Table Analysis Tools for the Cloud” add-in has been installed successfully, the ribbon will change. Click on the “Analyze (in the Cloud)” tab, and SQL Server Analysis Services options will appear.
Time to make the magic happen.
Let’s find out what kinds of load we have on this server by clicking on the Detect Categories button in the ribbon. Accept the defaults, click Run, and a warning will pop up:
“Table Analysis Tools for the Cloud needs to connect to the Internet to process your request. Your spreadsheet data will be uploaded for analysis using a secure protocol, then destroyed as the analysis completes. The operation cannot complete without your permission. To learn more about the privacy of your data, read our Privacy Statement.”
Time to take a quick break and talk privacy. This Excel add-in is going to connect to Microsoft’s servers, upload your data, run some analysis, and then send your data back to you via the internet.
Your data. Someone else’s servers. The internet. Stop to think about that for a second, especially if you’re:
- A public company or subject to SOX compliance
- Dealing with health care data
- Dealing with financial data
- Analyzing your passwords spreadsheet
- Slicing and dicing your little black book
This service may be secure in the future, but right now we’re talking about a demo, not an officially sanctioned Microsoft product with a EULA. Be smart – only use this for dummy demo data for now.
Breaking Your Perfmon Counter Results Into Categories
Click Run, and the magic happens: the data is uploaded and analyzed. It may take a few minutes depending on the size of your data source. (I have no idea what kind of server horsepower is on the other end.)
The result is a list of categories. In the screenshot at right, I have 10 categories of load. About 60% of the performance monitoring times fell into the first two categories. Let’s go through the categories individually to see what they have in common. The category details are on the bottom half of the spreadsheet, and you can switch from one category to the next by clicking on the filter icons in the Category header of the spreadsheet.
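If you’re wondering what’s actually happening server-side, Detect Categories builds a clustering model – it groups rows whose counters look similar. Here’s a rough analogue of that idea (not the add-in’s actual algorithm) using scikit-learn’s KMeans on the cleaned file from earlier; the file name and the choice of ten clusters are assumptions to mirror my results:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder file from the cleanup step.
df = pd.read_excel("perfmon_clean.xlsx")

# Cluster on the numeric counters only; standardize first so a counter
# measured in milliseconds doesn't drown out one measured in percent.
counters = df.drop(columns=["Date"])
scaled = StandardScaler().fit_transform(counters)

# Ten clusters, to mirror the ten categories the add-in found here.
df["Category"] = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(scaled)

# What share of the monitored time lands in each category?
print(df["Category"].value_counts(normalize=True).sort_values(ascending=False))
```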
My interpretation of the results is:
- Category 1 – Low Activity with <8 Connections – during these times, we have fewer than 8 SQL connections, a lot of available memory, and low disk activity. What’s interesting to me is that there’s not a consistent Hour of Day finding for this one.
- Category 2 – Low Activity 8-13 Connections – during these times, we have a few active users and some minimal storage activity, but we’re still in great shape on memory.
- Category 3 – Low Activity with C Drive Load – for some reason, about 11% of the time our server has low SQL Server activity, but the operating system drive is seeing some activity. Maybe another application is running on our SQL Server, or maybe someone is remote desktopping into the system to do things.
- Category 4 – Sundays 2pm-6pm – some kind of job must be running during this time frame, because SQL Server connections stay low but I/O goes up significantly.
- Category 5 – High Connections, Heavy Load – about 5% of the time, this server has a high number of connections, heavy TempDB drive access, and high CPU loads.
Below category 5, each category represents 3% or less of the server’s time, so those wouldn’t concern me right away.
Bottom line: this server has very peaky demand – the load comes in short bursts, and it doesn’t appear to be focused on a single time of the workday. My priorities as a DBA would be:
- Find out what’s happening on Sundays from 2pm-6pm, and ask the application owners if they want help making that code run faster
- Look at the types of queries that are running in Category 5, the 5% of the time when the server is under heavy load. My guess is that this is a new server with a new application, and not a lot of users are hitting it yet. When the app catches on and gains users, this server won’t be able to handle the load, because the individual queries are tough on TempDB. We need to find out why those queries are hammering TempDB, and either optimize them or speed up the TempDB drive. (A quick way to profile those heavy-load windows is sketched after this list.)
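As a first step on that second priority, it helps to profile what the counters look like inside each category. Once the add-in writes the Category column back into the table, you can eyeball it with Excel’s filters – or script the comparison. A hedged sketch, assuming you’ve exported the categorized table to a placeholder file and that the heavy-load category is labeled “Category 5” as in my results:

```python
import pandas as pd

# Placeholder file: the categorized table saved back out of Excel.
df = pd.read_excel("perfmon_categorized.xlsx")

# Average counter values per category, so the heavy-load profile is
# easy to compare against the quiet ones.
profile = df.groupby("Category").mean(numeric_only=True)

# How does the heavy-load category compare to the overall averages?
heavy = profile.loc["Category 5"]          # label assumed from my results
overall = df.mean(numeric_only=True)
print((heavy / overall).sort_values(ascending=False).head(10))
```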
Using SQL Server data mining in the cloud, I was able to figure this out in a matter of minutes – instead of manually slicing and dicing the numbers in Excel or pushing this data into my own SQL Server Analysis Services server. This would save me time and money, making me more productive and a better DBA.
But wait – there’s more. That’s just one button in the ribbon.
Analyze Key Influencers of Performance Bottlenecks
Let’s find out what else seems to be happening in the system at the same time our TempDB usage is high. Back on the source data tab of the spreadsheet, click the Analyze Key Influencers button in the Analyze (in the Cloud) ribbon tab. Choose “Physical Disk 2_T Disk Time” as the data to analyze, because in this system, TempDB is isolated on the T drive.
Click on the “Choose columns to be used for analysis” link and uncheck the other counters for “Physical Disk 2_T”, because we know they’ll be high at the same time Disk Time is high. Those aren’t influencers. Then click Run.
When the results come back, scroll over to the far right. You’ll notice several sets of colored bar graph results. Scroll down to find what favors high TempDB activity – indicated by a high number in the “Favors” column.
In our particular example, we can see that Category 8 has the highest TempDB load. Innnnteresting – earlier, I’d discarded the categories after Category 5 because they represented a low amount of time, but it looks like we’ll have to go back and revisit Category 8. Category 8 was Friday mornings from midnight to 4am, so we might have something going on then.
Other factors include high activity on our log drive (L), medium CPU load (11-34%), and so on. I bet TempDB’s logs are also on the L drive, shared with the rest of the databases, and I would have to verify that before further investigation. If TempDB’s logs are on that L drive, then it would make sense that they’d have high activity when the TempDB data is being written.
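For the curious, Key Influencers is essentially asking: for each input column, how well do its values separate the high Disk Time rows from the low ones? A crude stand-in for that idea is a mutual-information score against a binned target. The sketch below assumes the cleaned placeholder file, the counter name from my capture, and that the counter has enough distinct values to split into three buckets:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_excel("perfmon_clean.xlsx")

# Placeholder counter name - match it to your own column header.
target = "Physical Disk 2_T Disk Time"

# Bucket the continuous target into low/medium/high, roughly the way
# the add-in discretizes columns before hunting for influencers.
y = pd.qcut(df[target], q=3, labels=["low", "medium", "high"])

# Score every other numeric column; in practice you'd also drop the
# sibling counters for the T drive, as described above.
X = df.drop(columns=["Date", target]).select_dtypes("number")
scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(scores.sort_values(ascending=False).head(10))
```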
In Two Words: Incredibly Awesome
I’ve barely scratched the surface of what this thing can do, and already I can see that Jamie was right – this thing is incredibly awesome. This demo just showed Perfmon data – something interesting to me personally, but not to most of humanity. Imagine what it can do in the hands of an accounting analyst or a sales manager with actual financial data, people who could raise sales or cut costs. Very impressive.
I want to loudly congratulate the SQL Server Data Mining team for building something so cool and really pushing the edge of cloud analytics. I get genuinely excited about where cloud-based databases are going, and I’m thrilled that Microsoft might bring this to market. The faster we can analyze data with less setup time and less fixed infrastructure costs, the more powerful the DBA will become.