Question

I have two database servers connected via a linked server. Both run SQL Server 2008 R2, and the linked server connection is a regular "SQL Server" link that uses the current login's security context. Both servers are in the same datacentre, so the network connection shouldn't be an issue.

I use the following query to check which values of the identifier column exist remotely but not locally.

SELECT 
    identifier 
FROM LinkedServer.RemoteDb.schema.[TableName]

EXCEPT

SELECT DISTINCT
    identifier 
FROM LocalDb.schema.[TableName] 

Both tables have a non-clustered index on the identifier column. The local table holds around 2.6M rows, the remote one only 54. Yet, looking at the query plan, 70% of the cost is attributed to "executing remote query". Also, in the complete query plan, the number of estimated local rows is 1 instead of 2,695,380 (which is the estimated row count when running only the SELECT after EXCEPT on its own). [Execution plan screenshot not reproduced here.] And indeed, executing this query takes a long time.

It makes me wonder: why is this? Is the estimate "just" way off, or are remote queries over linked servers really that expensive?


Solution

The plan you have at the moment looks optimal to me.

I don't agree with the assertion in the other answers that it is sending the 2.6M rows to the remote server.

The plan looks to me as though, for each of the 54 rows returned from the remote query, it performs an index seek into your local table to determine whether that row is matched or not. This is pretty much the optimal plan.

Replacing it with a hash join or merge join would be counterproductive given the relative sizes of the tables, and adding an intermediate #temp table just adds a step that doesn't give you any advantage.
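To make the shape of that plan concrete: the EXCEPT is logically close to the NOT EXISTS rewrite below, which would typically produce exactly this pattern of 54 outer rows each driving a seek into the local index. This is only an illustrative sketch; it assumes identifier is non-nullable on both sides (with NULLs, EXCEPT and NOT EXISTS behave differently).

SELECT DISTINCT
    r.identifier
FROM LinkedServer.RemoteDb.schema.[TableName] AS r
WHERE NOT EXISTS (
    -- each of the 54 remote rows probes the local index on identifier
    SELECT 1
    FROM LocalDb.schema.[TableName] AS l
    WHERE l.identifier = r.identifier
)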

OTHER TIPS

Connecting to a remote resource is expensive. Period.

One of the most expensive operations in any programming environment is network IO (though disk IO tends to dwarf it).

This extends to remote linked servers. The calling server first needs to establish a connection, then a query has to be executed on the remote server, the results returned, and the connection closed. All of this takes time over the network.


You should also structure your query so that you transfer the minimum amount of data across the wire. Don't expect the database to optimize this for you; one way to force the filtering to happen remotely is shown below.
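For example, OPENQUERY ships its query text to the linked server for execution there, so only the remote result set crosses the wire. A sketch, assuming the linked-server login can see RemoteDb (the inner query runs in the remote server's own context):

SELECT identifier
FROM OPENQUERY(LinkedServer,
    'SELECT identifier FROM RemoteDb.schema.TableName') AS remote_rows

EXCEPT

SELECT identifier
FROM LocalDb.schema.[TableName]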

If I were writing this query, I would select the remote data into a table variable (or a temp table) and then use that in conjunction with the local table. This ensures that only the data that actually needs to be transferred is.
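A minimal sketch of the table-variable variant (assuming identifier is an INT; adjust the type to match the real column; a temp-table version appears in a later answer):

DECLARE @RemoteIdentifiers TABLE (identifier INT)

-- pull the 54 remote rows across the wire once
INSERT INTO @RemoteIdentifiers (identifier)
SELECT identifier
FROM LinkedServer.RemoteDb.schema.[TableName]

-- from here on the comparison is purely local
SELECT identifier FROM @RemoteIdentifiers
EXCEPT
SELECT identifier FROM LocalDb.schema.[TableName]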

The query you are running can easily be sending 2.6M rows to the remote server in order to process the EXCEPT clause.

I am not an expert, but with UNION, EXCEPT, or INTERSECT you don't have to use DISTINCT: these operators already return distinct rows. Depending on the values in LocalDb.schema.[TableName], dropping it can improve performance.

SELECT 
    identifier 
FROM LinkedServer.RemoteDb.schema.[TableName]

EXCEPT

SELECT 
    identifier 
FROM LocalDb.schema.[TableName]

Oded is right: the performance problem is caused by sending the 2.6M rows to your remote server.

To fix this, you can force the remote data (the 54 rows) to be sent to you instead, by using a temp table or a table variable.

Using a temporary table

SELECT  identifier 
INTO    #TableName
FROM    LinkedServer.RemoteDb.schema.[TableName]

SELECT  identifier
FROM    #TableName
EXCEPT
SELECT  identifier 
FROM    LocalDb.schema.[TableName] 

DROP TABLE #TableName

I think you are better off replicating the remote table to the server you are querying from and then running all your SQL locally.
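Short of setting up full replication, a scheduled job that refreshes a local copy gives much of the same benefit. A sketch, using a hypothetical staging table LocalDb.schema.[TableName_RemoteCopy] with a matching identifier column:

-- refresh the local copy (run on a schedule, e.g. via SQL Server Agent)
TRUNCATE TABLE LocalDb.schema.[TableName_RemoteCopy]

INSERT INTO LocalDb.schema.[TableName_RemoteCopy] (identifier)
SELECT identifier
FROM LinkedServer.RemoteDb.schema.[TableName]

-- all subsequent comparisons run entirely locally
SELECT identifier FROM LocalDb.schema.[TableName_RemoteCopy]
EXCEPT
SELECT identifier FROM LocalDb.schema.[TableName]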

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange