Question

For the DB gurus out there, I was wondering if there is any functional/performance difference between Joining to the results a SELECT statement and Joining to a previously filled table variable. I'm working in SQL Server 2008 R2.

Example (TSQL):

-- Create a test table
DROP TABLE [dbo].[TestTable]
CREATE TABLE [dbo].[TestTable](
    [id] [int] NOT NULL,
    [value] [varchar](max) NULL
) ON [PRIMARY]
-- Populate the test table with a few rows
INSERT INTO [dbo].[TestTable]
SELECT 1123, 'test1'

INSERT INTO [dbo].[TestTable]
SELECT 2234, 'test2'

INSERT INTO [dbo].[TestTable]
SELECT 3345, 'test3'

-- Create a reference table
DROP TABLE [dbo].[TestRefTable]
CREATE TABLE [dbo].[TestRefTable](
    [id] [int] NOT NULL,
    [refvalue] [varchar](max) NULL
) ON [PRIMARY]
-- Populate the reference table with a few rows
INSERT INTO [dbo].[TestRefTable]
SELECT 1123, 'ref1'

INSERT INTO [dbo].[TestRefTable]
SELECT 2234, 'ref2'

-- Scenario 1: Insert matching results into it's own table variable, then Join

-- Create a table variable 
DECLARE @subset TABLE ([id] INT NOT NULL, [refvalue] VARCHAR(MAX))

INSERT INTO @subset
SELECT * FROM [dbo].[TestRefTable] 
WHERE [dbo].[TestRefTable].[id] = 1123

SELECT  t.*, s.*
FROM [dbo].[TestTable] t
JOIN @subset s
ON t.id = s.id 

-- Scenario 2: Join directly to SELECT results
SELECT t.*, s.*
FROM [dbo].TestTable t
JOIN (SELECT * FROM [dbo].[TestRefTable] WHERE id = 1123) s
ON t.id = s.id 

In the "real" world, the tables and table variable are pre-defined. What I'm looking at is being able to have the matched reference rows available for further operations, but I'm concerned that the extra steps will slow the query down. Are there technical reasons as to why one would be faster than the other? What sort of performance difference may be seen between the two approaches? I realize it is difficult (if not impossible) to give a definitive answer, just looking for some advice for this scenario.

Was it helpful?

Solution

The database engine has an optimizer to figure out the best way to execute a query. There is more under the hood than you probably imagine. For instance, when SQL Server is doing a join, it has a choice of at least four join algorithms:

  • Nested Loop
  • Index Lookup
  • Merge Join
  • Hash Join

(not to mention the multi-threaded versions of these.)

It is not important that you understand how each of these works. You just need to understand two things: different algorithms are best under different circumstances and SQL Server does its best to choose the best algorithm.

The choice of join algorithm is only one thing the optimizer does. It also has to figure out the ordering of the joins, the best way to aggregate results, whether a sort is needed for an order by, how to access the data (via indexes or directly), and much more.

When you break the query apart, you are making an assumption about optimization. In your case, you are making the assumption that the first best thing is to do a select on a particular table. You might be right. If so, your result with multiple queries should be about as fast as using a single query. Well, maybe not. When in a single query, SQL Server does not have to buffer all the results at once; it can stream results from one place to another. It may also be able to take advantage of parallelism in a way that splitting the query prevents.

In general, the SQL Server optimizer is pretty good, so you are best letting the optimizer do the query all in one go. There are definitely exceptions, where the optimizer may not choose the best execution path. Sometimes fixing this is as easy as being sure that statistics are up-to-date on tables. Other times, you can add optimizer hints. And other times you can restructure the query, as you have done.

For instance, one place where loading data into a local table is useful is when the table comes from a different server. The optimizer may not have full information about the size of the table to make the best decisions.

In other words, keep the query as one statement. If you need to improve it, then focus on optimization after it works. You generally won't have to spend much time on optimization, because the engine is pretty good at it.

OTHER TIPS

This would give the same result?

SELECT t.*, s.*
FROM dbo.TestTable AS t
JOIN dbo.TestRefTable AS s ON t.id = s.id AND s.id = 1123

Basically, this is a cross join of all records from TestTable and TestRefTable with id = 1123.

Joining to table variables will also result in bad cardinality estimates by the optimizer. Table variables are always assumed by the optimizer to contain only a single row. The more rows it actually has the worse that estimate becomes. This causes the optimizer to assume the wrong number of rows for the table itself, but in other places, for operators that might then join to that result, it can result in wrong estimations of the number executions for that operation.

Personally I think Table parameters should be used for getting data into and out of the server conveniently using client apps (C# .Net apps make good use of them), or for passing data between Stored Procs, but should not be used too much within the proc itself. The importance of getting rid of them within the Proc code itself increases with the expected number of rows to be carried by the parameter.

Sub Selects will perform better, or immediately copying into a temp table will work well. There is overhead for copying into the temp table, but again, the more rows you have the more worth it that overhead becomes because the estimates by the optimizer get worse and worse.

In general a derived table in the query is probably going to be faster than joining to a table variable because it can make use of indexes and they are not available in table variables. However, temp tables can also have indexes creted and that might solve the potential performance difference.

Also if the number of table variable records is expected to be small, then indexes won't make a great deal of difference anyway and so there would be little or no differnce.

As alawys you need to test on your own system as number of records and table design and index design havea great deal to do with what works best.

I'd expect the direct Table join to be faster than the Table to TableVariable, and use less resources.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top