ABSTRACT
This article shows that adding a TOP clause to the SQL SELECT query may
greatly increase its execution time. It also shows very effective solution to
fix this problem.
INTRODUCTION
Lately I was conducting performance tests for a new project. I built sample
database, filled it with data and began testing. The database schema is not
important here. All you should know is that I had about 3M of records in main
table [Articles] and about 11M of records in tables which I joined. I used this
naive query at first:
SELECT
[Articles].Id
,CountryCategories.Name
,CityCategories.Name
FROM [Articles]
INNER JOIN CategoryCountry2Articles
ON [Articles].Id = CategoryCountry2Articles.IdArticle
INNER JOIN CountryCategories
ON CountryCategories.Id = CategoryCountry2Articles.IdCountry
INNER JOIN CategoryCity2Articles
ON [Articles].Id = CategoryCity2Articles.IdArticle
INNER JOIN CityCategories
ON CityCategories.Id = CategoryCity2Articles.IdCity
WHERE CountryCategories.Name = 'country1'
AND
CityCategories.Name = 'city4'
This query took 00:04 (seconds)
to execute. It returned far too much records, so I thought it would be a good
idea to use TOP N clause to limit them. After all, limiting the number of
records to be returned should decrease execution time. And certainly it should
not increase it. Query optimizer however is odd enough to refute my
assumptions. By adding a TOP clause to aforementioned query:
SELECT TOP(5)
[Articles].Id,
(…)
query execution time jumped to 01:14! How is that possible?
EXPLANATION
Execution plans for both queries may help to explain it. For base query:
http://imageshack.us/a/img195/5361/sqltop1.png
and for query with TOP clause:
http://imageshack.us/a/img521/9523/sqltop2.png
The main difference between those two execution plans is that in first plan
optimizer uses Hash Matches, while in second it uses Nested Loops. Both symbols
generally describe the same operation (join). However Hash Matches are
efficient with large sets, while Nested Loops only with small sets. This is why
the latter query performs so badly. It is the query optimizer that makes wrong
decision.
But let’s assume that for some reason you do not want to abandon the
idea of using TOP clause here. There are
two ways to induce query optimizer to take right decision.
SOLUTIONS
First solution is to use explicit hash joins in your query:
SELECT TOP 5
[Articles].Id
,CountryCategories.Name
,CityCategories.Name
FROM [Articles]
INNER HASH
JOIN CategoryCountry2Articles
ON [Articles].Id = CategoryCountry2Articles.IdArticle
INNER HASH
JOIN CountryCategories
ON CountryCategories.Id = CategoryCountry2Articles.IdCountry
INNER HASH
JOIN CategoryCity2Articles
ON [Articles].Id = CategoryCity2Articles.IdArticle
INNER HASH
JOIN CityCategories
ON CityCategories.Id = CategoryCity2Articles.IdCity
WHERE CountryCategories.Name = 'country1'
AND
CityCategories.Name = 'city4'
Execution time: 00:07. Execution
plan:
http://imageshack.us/a/img812/2150/sqltop3.png
As you can see, this time Hash Matches are used.
Second and most effective solution is to use SQL variable:
DECLARE @topCount INT
SET @topCount = 5
SELECT TOP (@topCount)
(...)
Execution time: 00:00.2 (0,2
sec).
Quick look at the execution plan:
http://imageshack.us/a/img802/6742/sqltop4.png
It shows that query optimizer did not only choose to use Hash Matches, but
it also shows high degree of parallelism in the query. This is why it is so
incredibly efficient.
I am not able to explain why using variable pushes query optimizer towards
those decisions. But in case you are struggling with the same problem as I did,
try this solution. I hope it helps.