Data Mining the StackOverflow Database

StackOverflow released a public dump of their database this morning. Jeff Atwood and the guys believe that if you, the community, are putting the work into this huge body of knowledge, then you should have the right to use it.

This is a great dataset to show off one of my favorite toys from the Microsoft SQL Server Data Mining team. In this fifteen-minute video, I’ll walk you through data mining the StackOverflow user list to find out more about the users and see what makes the rockstar high-reputation users different from the worker bees like me.


Microsoft’s Free Data Mining Tools

For today’s demo, I’m using SQL Server Analysis Services installed on my desktop. Relax – it’s really easy. Literally just install SQL Server 2005 or 2008 Developer Edition, check the box for Analysis Services, and use the defaults. You don’t have to know what you’re doing in order to get it up and running, and it just runs in the background as a service. After you’re done playing around, you can stop the service and set it to manual to prevent it from sapping your system resources. Go into Control Panel, Administrative Tools, Services, double-click on the SQL Server Analysis Services service, click Stop, and change the startup type to Manual.
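If you would rather script that cleanup than click through the Services window, here is a minimal Python sketch that builds the equivalent sc.exe commands. It assumes a default (unnamed) Analysis Services instance, which I believe registers under the service name MSSQLServerOLAPService; a named instance uses a different name, so check the Services window first. The function names are made up for illustration.

```python
import subprocess

# Assumption: a default SSAS instance registers under this service name.
# Named instances use a different name; confirm it in the Services window.
DEFAULT_SSAS_SERVICE = "MSSQLServerOLAPService"


def ssas_manual_commands(service_name=DEFAULT_SSAS_SERVICE):
    """Return the sc.exe commands that stop a service and set it to manual start."""
    return [
        ["sc", "stop", service_name],
        # sc.exe expects "start=" and its value as two separate tokens.
        ["sc", "config", service_name, "start=", "demand"],
    ]


def apply_commands(commands):
    """Run each command in order and report its exit code (Windows, elevated prompt)."""
    for command in commands:
        result = subprocess.run(command, check=False)
        print(" ".join(command), "-> exit code", result.returncode)


if __name__ == "__main__":
    # Print the commands only; call apply_commands() yourself once they look right.
    for command in ssas_manual_commands():
        print(" ".join(command))
```

Run as-is, the script only prints the two commands so you can check them. Calling apply_commands(ssas_manual_commands()) from an elevated prompt on Windows actually runs them; the stop command returns a nonzero exit code if the service is already stopped, which is harmless here.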

Depending on your version of SQL Server, you’ll need one of these free plugins from Microsoft:

If you want to avoid the whole SQL Server Analysis Services thing altogether, you can also use Microsoft’s free SQL Server Data Mining in the Cloud plugin.  Be aware that it’s a technical preview, not a fully supported & released product.  Their cloud servers can (and do) go down.  Also know that your data is going into the cloud, which has its own ramifications as I’ve discussed in my previous cloud data mining tutorial.

What’s Coming Next: SQL Server 2008 R2 with BI in Excel

In the next version of SQL Server, Microsoft will deliver business intelligence to end users through Excel. At the Professional Association for SQL Server Summit last November, Donald Farmer demoed slicing and dicing of huge spreadsheets with real-time analytics that previously would have required some pretty hefty hardware.

Excel 2007 has a limit of roughly a million rows per worksheet (1,048,576 to be exact), but the forthcoming version will not. Some of the StackOverflow export tables like Votes have more than a million rows, so we can’t yet data mine those using Excel as a front end, but we can play with the Users table today.
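To check ahead of time which export tables will fit, a rough Python sketch like the one below can count the rows in one file from the dump and compare the total against the Excel 2007 worksheet limit. It assumes each table ships as an XML file with one row element per record; that layout, the tag name, and the function names are assumptions for illustration, so adjust them to match the files you actually download.

```python
import io
import xml.etree.ElementTree as ET

EXCEL_2007_MAX_ROWS = 1_048_576  # worksheet row limit in Excel 2007


def count_rows(source, row_tag="row"):
    """Stream through an XML file (path or file object) and count row elements."""
    count = 0
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == row_tag:
            count += 1
        element.clear()  # drop the element's contents once it has been counted
    return count


def fits_in_excel(row_count, header_rows=1):
    """True if the data rows plus header rows fit on one Excel 2007 worksheet."""
    return row_count + header_rows <= EXCEL_2007_MAX_ROWS


if __name__ == "__main__":
    # Tiny made-up sample standing in for a real export file.
    sample = io.BytesIO(b'<users><row Id="1"/><row Id="2"/><row Id="3"/></users>')
    rows = count_rows(sample)
    print(rows, "rows; fits in Excel 2007:", fits_in_excel(rows))
```

Point count_rows at the path of a real export file instead of the in-memory sample to test a table before you try to open it in Excel.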

Subscribing or Downloading the Podcast

If you have an MP3 player or a portable video player and you want to download our videos automatically, you can subscribe to our podcast feeds here:

You can also download this video to watch it later:


3 Responses to “Data Mining the StackOverflow Database”

  1. Francois Says:

    Try datamining it with Qlikview

  2. Brent Ozar Says:

    I’ve got a better idea – since it’s open source data and I’ve never heard of Qlikview, why don’t YOU data mine it with Qlikview and tell us your experiences? You can show us how it works and show the power of it.

  3. Podcast #57 - Blog - Stack Overflow Says:

    [...] We’re excited to see what the community can do with this data; Brent Ozar put together a data mining video to get people [...]
