Querying a single table at a time will only get you so far. Eventually, you’re going to need to combine that data, to “join” it, to get more meaningful information. Joins put the relationships into relational databases. Today I want to talk about the various types of joins that SQL Server performs. I don’t mean LEFT OUTER JOINs, or RIGHT OUTER JOINS, or INNER JOINS. What I’m going to discuss are the join mechanisms that go on behind the scenes.
There are three basic types of joins: nested loop joins, merge joins, and hash joins.
Nested Loop Joins
The nested loop is an iterative join. In a nested loop you have 2 sets of input, an outer table and an inner table. SQL Server reads through each row in the outer table (the outer loop) and for every row in the outer table it searches for a matching row in the inner table (the inner loop, hence the term “nested loop”). Nested loop joins work best when the outer table is small and the inner table is larger and indexed, though SQL Server will create a temporary index on the inner table if it will help execution. This is called a temporary index nested loop join.
Let’s take a look at an explain plan to see the nested loop in action.
SET statistics profile on; USE AdventureWorks; GO SELECT p.ProductID, p.Name, pch.StartDate, pch.EndDate, pch.StandardCost FROM Production.Product p LEFT JOIN Production.ProductCostHistory pch ON p.ProductID = pch.ProductID WHERE p.ProductID < 100
Here we see the outer table, Product, circled in green and the inner circle, ProductCostHistory, circled in purple. SQL Server performs a clustered index seek on Product to get the rows that meet the where condition. For every record that meets that condition, it does another clustered index seek into the ProductCostHistory table to find matching records.
We can also see this in the output from SET STATISTICS PROFILE.
This makes it very easy to see at a glance that SQL Server performs one seek on Product and four seeks on ProductCostHistory, once for every row returned from Product.
In a merge join, SQL Server reads through and compares two input tables at the same time. The catch with merge joins, however, is that both of those inputs must be sorted. SQL Server will read a row from each input. If the rows match, the records are returned. If they don’t SQL Server discards the row with the lower value and pulls in another row from that input table. This goes on until all rows have been processed. A merge join is great with larger data sets that have clustered indexes on the JOIN values because the data is already sorted. However, SQL Server will sort the data on the fly if it’s the most efficient method for getting the data. This sorting, however, can be a costly operation.
We can force a sort operation in our first example by removing the WHERE condition.
SET statistics profile on; USE AdventureWorks; GO SELECT p.ProductID, p.Name, pch.StartDate, pch.EndDate, pch.StandardCost FROM Production.Product p LEFT JOIN Production.ProductCostHistory pch ON p.ProductID = pch.ProductID
The last, and most expensive method for joining tables is the hash join. These are your biggest joins, the heavyweights. The hash join takes place in two phases, the build phase and the probe phase. In the build phase, SQL Server reads through all the rows in the first input table and builds a hash table on the columns used in the join predicate. In the probe phase, the second table (the probe input) is scanned and for each row the hash value for that row is computed. SQL Server scans the appropriate bucket in the hash table and if a match is found, the row is returned. If not, the row is discarded.
If the entire hash table can fit into the available memory, it’s called an in-memory hash join. If it can’t fit into memory, you have what’s called a grace hash join. In a grace hash join, both tables are scanned and hashed and partitioned into multiple files. Because the inputs are hashed before the partitioning takes place, matching values will always be placed into the same partitions. Once this process is complete, a normal hash join process is performed on each partition. So basically you’re taking a large dataset and breaking it down into multiple, smaller datasets and joining them.
Here we force a hash join by dropping the clustered index on the ProductCostHistory table. Without that index, that input is no longer pre-sorted, so SQL Server decides it’s more efficient to do a hash join than to perform the sort operation required for a merge join.
SET statistics profile on; USE AdventureWorks; GO IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[Production].[ProductCostHistory]') AND name = N'PK_ProductCostHistory_ProductID_StartDate') ALTER TABLE [Production].[ProductCostHistory] DROP CONSTRAINT [PK_ProductCostHistory_ProductID_StartDate] GO SELECT p.ProductID, p.Name, pch.StartDate, pch.EndDate, pch.StandardCost FROM Production.Product p LEFT JOIN Production.ProductCostHistory pch ON p.ProductID = pch.ProductID
For more information on the various types of joins and their inner workings, check out Craig Freedman’s blog. He did an entire series on logical and physical joins and how SQL Server implements them. Really good stuff.