Data Mining the StackOverflow Database
StackOverflow released a public dump of their database this morning. Jeff Atwood and the guys believe that if you, the community, are putting the work into this huge body of knowledge, then you should be able to have rights to use it.
This is a great dataset to show off one of my favorite toys from the Microsoft SQL Server Data Mining team. In this fifteen minute video, I’ll walk you through data mining the StackOverflow user list to find out more about the users and see what makes the rockstar high-reputation users different from the worker bees like me.
Microsoft’s Free Data Mining Tools
For today’s demo, I’m using SQL Server Analysis Services installed on my desktop. Relax – it’s really easy. Literally just install SQL Server 2005 or 2008 Developer Edition, check the box for Analysis Services, and use the defaults. You don’t have to know what you’re doing in order to get it up and running, and it just runs in the background as a service. After you’re done playing around, you can stop the service and set it to manual to prevent it from sapping your system resources. Go into Control Panel, Administrative Tools, double-click on the SQL Server Analysis Services service, and change the startup type to Manual.
Depending on your version of SQL Server, you’ll need one of these free plugins from Microsoft:
- Microsoft SQL Server 2005 Data Mining Add-Ins for Office 2007
- Microsoft SQL Server 2008 Data Mining Add-Ins for Office 2007
- And you can download the StackOverflow users spreadsheet shown in the video
If you want to avoid the whole SQL Server Analysis Services thing altogether, you can also use Microsoft’s free SQL Server Data Mining in the Cloud plugin. Be aware that it’s a technical preview, not a fully supported & released product. Their cloud servers can (and do) go down. Also know that your data is going into the cloud, which has its own ramifications as I’ve discussed in my previous cloud data mining tutorial.
What’s Coming Next: SQL Server 2008 R2 with BI in Excel
In the next version of SQL Server, Microsoft will deliver business intelligence to end users through Excel. At the Professional Association for SQL Server Summit last November, Donald Farmer demoed slicing and dicing of huge spreadsheets with real-time analytics that previously would have required some pretty hefty hardware.
Excel 2007 has a million-row limit, but the forthcoming version will not. Some of the StackOverflow export tables like Votes have more than a million rows, so we can’t yet data mine those using Excel as a front end, but we can play with the Users table today.
Subscribing or Downloading the Podcast
If you have an MP3 player or a portable video player and you want to download our videos automatically, you can subscribe to our podcast feeds here:
- MP4 (Apple) Video Feed
- WMV (Microsoft) Video Feed
- MP3 Audio-Only Feed
- Zune One-Click Subscribe for Video
- Zune One-Click Subscribe for MP3
You can also download this video to watch it later: