Suppose you’re invited to join a fantasy football league, but you know absolutely nothing about any of the players (maybe that’s why you were invited). How do you know which quarterback to draft? Do you watch a video of every game from last season? You certainly could, but that probably wouldn’t be the most efficient method. No, you’d look at their stats to see which one is the best available.
SQL Server uses statistics based on your data to determine which execution plan is the most optimal for a query. And similar to your fantasy football draft, the sample size those stats are based on (one game or an entire season) and the age of those stats (last season or five years ago) can make a huge difference in how well a query runs (or how well your team does).
Let’s start with some vocabulary.
Cardinality – a measure of the uniqueness of a column or combination of columns (as in a composite index). It’s calculated as the number of distinct values divided by the total number of rows. So a high cardinality value means a highly unique column or index.
Density – a measure of the number of duplicates in a column or combination of columns (as in a composite index). The density is calculated as 1 / number of distinct values. So a high density value means a lot of duplicates, and therefore lower cardinality.
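To make those two definitions concrete, here’s a quick sketch that computes both values for a single column. It assumes the AdventureWorks Sales.SalesOrderHeader table that we’ll use later in this post:

```sql
-- Compute cardinality and density for one column by hand.
-- Assumes the AdventureWorks Sales.SalesOrderHeader table.
SELECT
    COUNT(DISTINCT TotalDue) * 1.0 / COUNT(*) AS cardinality, -- distinct values / total rows
    1.0 / COUNT(DISTINCT TotalDue)            AS density      -- 1 / distinct values
FROM Sales.SalesOrderHeader;
```

Notice the two values move in opposite directions: the more duplicates a column has, the higher its density and the lower its cardinality.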
So statistics help the query optimizer generate the best execution plan based on the cardinality and density of a particular column or index. If the filter predicate of a query is based on a column with a high density value, the optimizer might choose to do a full scan. If an index has a high cardinality value, chances are it will perform a seek operation.
How are statistics created?
Any time you create an index in a database, SQL Server will automatically create statistics for the key values in that index. These stats are given the same name as your index. SQL Server also creates statistics automatically (provided autostats is turned on) if you query a column that doesn’t already have statistics on it. These auto-created statistics are given a name that starts with “_WA” (pop quiz: what does the WA stand for?). And finally, if you don’t want to create an index on a particular column or column combination, but you still think statistics on that would be beneficial, you can always create them yourself using the CREATE STATISTICS command.
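For that last case, the syntax looks something like this (the statistics name, column, and sample size here are just illustrative):

```sql
-- Manually create statistics on a column you haven't indexed,
-- sampling 50 percent of the table. Names are illustrative.
CREATE STATISTICS st_SalesOrderHeader_TotalDue
ON Sales.SalesOrderHeader (TotalDue)
WITH SAMPLE 50 PERCENT;
```

You can also specify WITH FULLSCAN if you want the statistics built from every row rather than a sample.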
How are statistics maintained?
Hopefully you’re already familiar with the UPDATE STATISTICS command. This is obviously the best way to help SQL Server maintain those statistics: proactively. But there’s another way: reactively. You have two options in your database properties related to the maintenance of statistics, “Auto Update Statistics” and “Auto Update Statistics Asynchronously”.

With the former, if a query is executed and SQL Server sees that the statistics it should use to generate an execution plan are stale, SQL Server will first update those statistics and then generate the execution plan. Not only does this take time, which impacts the performance of that query, but the sample size that SQL Server uses to update the statistics might not be sufficient to provide a true picture of your data distribution. This might cause the query optimizer to choose a less optimal plan.

The latter option, “Auto Update Statistics Asynchronously”, will also trigger an auto-update stats process, but it won’t make the query wait for it to finish. This may seem like an improvement, in that the query isn’t impacted, but in this case the optimizer will base its execution plan on stale statistics. That’s not good either. Generally speaking, it’s recommended to have Auto Create Stats and Auto Update Stats on in your database.
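In T-SQL, the proactive approach and the two database settings look something like this (the database name is just a placeholder for your own):

```sql
-- Proactive: update statistics on a table yourself, reading every row.
UPDATE STATISTICS Sales.SalesOrderHeader WITH FULLSCAN;

-- Reactive: the two database-level options discussed above.
-- (Database name is illustrative.)
ALTER DATABASE AdventureWorks SET AUTO_UPDATE_STATISTICS ON;
ALTER DATABASE AdventureWorks SET AUTO_UPDATE_STATISTICS_ASYNC OFF;
```

Running UPDATE STATISTICS on your own schedule, with a sample size you choose, sidesteps both of the auto-update drawbacks described above.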
So you’re probably wondering how SQL Server decides whether statistics need to be updated. Prior to SQL Server 2005, this was based on a row modification counter that SQL Server tracked internally. When the counter reached a certain threshold, the statistics were deemed stale. The biggest problem with this was that the counter applied to the entire table. So it didn’t matter if the modifications were all on ColumnA, SQL Server would invalidate all of the statistics for that table. Starting with SQL Server 2005, the process got a little better. Now a counter is maintained for each column, so SQL Server is a little more selective about which statistics it renders invalid.
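If you’re on a newer build (SQL Server 2008 R2 SP2 or 2012 SP1 and later), you can actually peek at the modification counter behind each statistics object:

```sql
-- Inspect the per-statistics modification counters for a table.
-- Requires SQL Server 2008 R2 SP2 / 2012 SP1 or later.
SELECT s.name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('Sales.SalesOrderHeader');
```

A high modification_counter relative to rows is a good hint that the next query touching that column will trigger an auto-update.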
Can I look at them?
Sure! You can use the DBCC SHOW_STATISTICS command or view them in Management Studio. In SSMS, right-click on whatever statistics you’re interested in and select Properties. Then go to Details. What you see should look something like this:
From this window we can see the density of the columns included in the index, the number of rows in the table and the number of rows sampled to generate the statistics. We then see a histogram which gives us a picture of how our data is distributed. Each step in the histogram is a bucket based on the values in that column. We can see how many rows have that exact value (EQ_ROWS) and how many rows fall between that value and the next (RANGE_ROWS). We can also see how many distinct values are in that bucket (DISTINCT_RANGE_ROWS).
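The DBCC SHOW_STATISTICS command returns the same three pieces of information, the header, the density vector, and the histogram, as text result sets. For example (the statistics name here is illustrative; use whatever name appears under your table’s Statistics folder):

```sql
-- Show the header, density vector, and histogram for a
-- statistics object. The stats name is illustrative.
DBCC SHOW_STATISTICS ('Sales.SalesOrderHeader', st_SalesOrderHeader_TotalDue);
```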
So, for example, look at the rows I’ve highlighted in the picture above. Here we see a bucket where the high-value is 15.4479. We know that there are 201 rows with that exact value in our table. We also know that there are 21 records where TotalDue > 14.2987 and TotalDue < 15.4479. And all of those rows have the same value because DISTINCT_RANGE_ROWS = 1. Don’t believe me? Take a look.
select TotalDue, COUNT(*)
from sales.SalesOrderHeader
where TotalDue > 14.2987
  and TotalDue < 15.4479
group by TotalDue
So there you have it, your crash course in SQL Server statistics. If you’d like to know more, here are some links to some very smart people:
Kimberly Tripp is an indexing guru and has some good information about auto stats.
Rob Carrol tells us all about stats in SQL Server 2005.
And of course there’s always the fine folks at MSDN.