LOOP JOIN and other fun query optimizations

by: Per Bjurström

Consider this SQL Server query that is part of EPiServer CMS 5:

SELECT 
tblPageLanguage.fkPageID,
tblLanguageBranch.LanguageID as LanguageBranch
FROM 
tblPage
JOIN 
tblPageLanguage ON tblPage.pkID=tblPageLanguage.fkPageID
LEFT JOIN
tblLanguageBranch ON 
tblLanguageBranch.pkID=tblPageLanguage.fkLanguageBranchID WHERE tblPageLanguage.ExternalURL=@LoweredUrl AND tblPage.Deleted=0 AND tblPageLanguage.PendingPublish=0 AND tblPageLanguage.StartPublish <= GetDate() AND (
tblPageLanguage.StopPublish IS NULL OR
tblPageLanguage.StopPublish >= GetDate()
) ORDER BY tblPageLanguage.Changed DESC

It queries the database to find pages with a specific simple URL, for example "/products". We have a non-unique index on tblPageLanguage.ExternalURL and we know that the resultset on this predicate will be very small because URL's has to be unique per site/language, but not universally unique. So the query should run very fast, right? But, when running it I get response times around 1 second when tracing in Query Profiler.

This site has 4 million pages so it's a very large site with generated content in a lab environment, not your average demo site, so don't go crazy that this will affect your site. The hardware is a HP 2xQuadCore server with 4GB RAM and runs IIS and SQL Server on the same box (so its not an optimal configuration). And it's 32-bit.

Looking at the query plan in Query Analyzer it actually does two Clustered Index Scans, which is more or less the same as Table Scans which involves scanning through ALL rows in the two tables. The first scan is for finding all pages in tblPage where Deleted=0 which of course will be the majority of all pages and the second scan to find pages with all the specified predicates on tblPageLanguage. Then these two costly scans ends up in a Hash Match to join the two result sets into one, you can see by the thickness on the top arrow below that there is a lot of data to merge. This image is a smaller part of the actual query plan:

 

queryplan

 

So, our index is actually not used and we get two costly scans of all rows in the biggest tables we got (well, almost). SQL Server has a lot of statistics on tables and columns to be able to select the best query plan at all times, and the statistics in this case are up to date for the columns in this query. I am guessing the reason for this behaviour is that SQL Server has no way of knowing that the ExternalURL predicate will return a very small resultset so it opts for scanning the table and applying all the predicates as it goes, statistics are statistics.

Well, one alternative could be to use a sub query to make sure the ExternalURL get evaluated separately in the FROM clause, like this:

SELECT .. 
FROM 
(
  SELECT .. FROM tblPageLanguage 
  WHERE ExternalURL=@LoweredExternalURL
) AS Pages
INNER JOIN tblPage ON tblPage.pkID=Pages.fkPageID
....

But, SQL Server is smart and expands sub queries into the outer query so this actually has no effect on the query plan. Normally a good behaviour.

Another alternative would be to split this query in two parts using a table variable to hold the data, but that would require the need of a temporary table and that seems like too much overhead for nothing.

Next alternative is using performance hints (which is a strange name because it's more than hints). Take a look at this query:

SELECT 
tblPageLanguage.fkPageID,
tblLanguageBranch.LanguageID as LanguageBranch
FROM 
(
   SELECT fkPageID,fkLanguageBranchID
   FROM tblPageLanguage
   WHERE tblPageLanguage.ExternalURL=@LoweredUrl
) AS Pages
INNER LOOP JOIN 
tblPage ON tblPage.pkID = Pages.fkPageID
INNER LOOP JOIN
tblPageLanguage ON tblPageLanguage.fkPageID=Pages.fkPageID 
AND tblPageLanguage.fkLanguageBranchID=Pages.fkLanguageBranchID INNER LOOP JOIN tblLanguageBranch ON tblLanguageBranch.pkID=Pages.fkLanguageBranchID WHERE tblPage.Deleted=0 AND tblPageLanguage.PendingPublish=0 AND tblPageLanguage.StartPublish<=GetDate() AND ( tblPageLanguage.StopPublish IS NULL OR tblPageLanguage.StopPublish>=GetDate() ) ORDER BY tblPageLanguage.Changed DESC

It forces SQL Server to use LOOP joins on the Pages subquery, we know in advance that this will be the best option because the Pages subquery will return a very small resultset (most often 0 or 1 rows and in some rare cases a few more). We also separate the predicates on tblPageLanguage into two different tables, one is the main ExternalURL predicate in the sub query and the others are moved to a loop join back to the original tblPageLanguage (that will hit the primary key fkPageID, fkPageLanguageBranchID so the cost is minimal).

This query takes 1ms to run, compare that to the original 1s.

PS. I tried to tweak the TSQL indenting to fit this site, it's not the original indenting style we use DS.

 

26 March 2008


Comments

Post a comment    
User verification Image for user verification  
EPiTrace logger