Composite indexes – Does column order matter?


When it comes to creating composite indexes, a common bit of advice we’re given is to put the most selective column first.  And I’ve always just taken this at face value.  But recently I got to thinking, is this still true?  I’d never actually tested this theory myself.  Does this rule still apply?

So let’s take a look.  I started by creating a test table in AdventureWorks using existing data.  My goal was to create a table with one high-cardinality column and at least one very low-cardinality column.

USE AdventureWorks;
GO

SET NOCOUNT ON

CREATE TABLE myBigTable (
ID int IDENTITY(1,1),
SURNAME VARCHAR(40),
LOCATION CHAR(3))

INSERT INTO myBigTable (SURNAME, LOCATION)
SELECT LastName, StateProvinceCode FROM Person.Contact CROSS JOIN Person.StateProvince ;
GO

SELECT    COUNT(distinct ID) AS DistinctIDs,
 COUNT(DISTINCT surname) AS DistinctNames,
 COUNT(DISTINCT location) AS DistinctLocs
FROM myBigTable;
GO

Perfect.  ID is unique and location is very low cardinality.  I’ll use those 2 columns in my test indexes.

CREATE INDEX i_highcard ON myBigTable (ID, location);  -- high cardinality first
CREATE INDEX i_lowcard ON myBigTable (location, ID);    -- low cardinality first
GO

Before we go any further, let’s take a look at the physical size of the indexes, just to see if there’s any difference there.

select name, index_id from sys.indexes where object_id = OBJECT_ID('myBigTable')
SELECT * FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('myBigTable'), NULL , NULL, 'LIMITED');
GO

Judging by the fragment_count and page_count values, I’d say there’s no difference in size.  Now, on to query performance.  First we’ll query the table for a particular ID value.  I’ll use the INDEX hint to force the optimizer to use each index.

SET STATISTICS IO ON
SELECT * FROM myBigTable WITH (INDEX = i_highcard) WHERE ID = 77321
SELECT * FROM myBigTable WITH (INDEX = i_lowcard) WHERE ID = 77321

Check out those execution plans.  When we query based on the first column in the index, the optimizer is able to do an index seek.  But when we query based on the second column in the index, as in the second query, the optimizer doesn’t know where to enter the index, so it’s forced to do an index scan.  (In fact, the optimizer even suggests we create an index on ID.)  We can also see the impact of these different execution plans in the IO statistics.

From the first query:
Table ‘myBigTable’. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

And from the second query:

Table ‘myBigTable’. Scan count 5, logical reads 9563, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

That’s a lot more IO the database has to do.  And we all know, when it comes to IO, less is more.

Now, those results were pretty predictable.  But what about when both indexed columns are in the where clause?  Based on what I’ve been told, the index with the high-cardinality column first will be more efficient.  But I want to see that for myself.

SELECT * FROM myBigTable WITH (INDEX = i_highcard) WHERE ID = 77321 AND LOCATION = '04'
SELECT * FROM myBigTable WITH (INDEX = i_lowcard) WHERE ID = 77321 AND LOCATION = '04'

First, the IO stats:

Table ‘myBigTable’. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘myBigTable’. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Exactly the same.  What about the execution plans?

Also the same.  Even down to the cost.

That surprises me.  I really thought the i_highcard index would be more efficient.  So, I’m back to my original question, when it comes to composite indexes, does column order matter?  And I guess the answer is:  it depends.  SQL Server only maintains statistics on the first column in an index, so having a highly selective first column can be more efficient if you’re using that column in your WHERE clauses.  But if you’re using both columns as filter criteria, it doesn’t seem to matter.  This another reason is why it’s extremely important to know your application and how it accesses the data.

Also recommended:


Leave a Reply to Javier Villegas Cancel reply

Your email address will not be published.

6 thoughts on “Composite indexes – Does column order matter?

  • Bruce zikmund

    Yes – I agree that on composite indexes, the choice of ordering the most selective first doesn’t normally impact performance. This is contrary to what is often given as advice. The reason, I believe, is in the way a btree works. a btree guarantees a fixed amount of logical I/o (the tree depth) regardless of column order in the index. In fact, the least selective column might be the best choice because it will cluster data by that column. So if you have a column value that often is selected off a low cardinality value (e.g. where col = 0 on a col which has only a few values), putting that colum first clusters all of the col = 0 near each other. This in turn improves the chances of a block cache hit on the index access.

    Moral: don’t blindly believe everything you read – challenge and validate when something is questionable in your mind. Too many people have forgotten about the principles behind the scientific method. Thanks for taking the time to collect and document your empirical evidence. If everyone did this, there would be a lot leas misinformation running around. That is something I greatly admire about Tom Kyte and Jonathan Lewis in the oracle apace – true computer scientists.

    Bz

  • Erik Grob

    Hi Colleen,

    One thing that I’d suggest would be to perform the test on a Table and not a Heap – (1) We’ll get more predictable and repeatable results testing things out in a “lab environment” when we have a clustered index and (2) it’s good to enforce best practices in the postings we create, so that others don’t run into issues resulting in [the specific case] the use of HEAPS.

    Heaps are great for loading large amounts of data into an ETL staging table, but when performing due diligence or doing lab work, tables with a clustered index is something I would recommend.

    Lastly, I agree with Bruce. In fact, I just answered an interview question with – ‘It depends’, regarding column ordering and index usage, and I referenced the same reasoning behind my answer. SQL Server is constantly making decisions on the fastest route to take, and complexity and page allocation of branches in a b-tree index is something taken into consideration.

  • Raghav

    This was an interesting research. But, to make sure I exteneded the query by adding a range search and there it is. There is a big difference between these 2 approaches. Column order does matter !!!

    • Colleen M. Morrow

      Raghav – Good point, thanks for taking it a step further. If we extend the location predicate to a range, we still have an index seek in both cases, but the seek using the i_lowcard index will include a predicate in addition to the seek_predicate, making that index seek less optimal.