Pregunta

I have two SQL Servers on two different machines somewhere in the cloud (maybe Azure).

One is Microsoft SQL Server 2012 (SP3-CU10) (KB4025925) - 11.0.6607.3 (X64) Jul 8 2017 16:43:40 Copyright (c) Microsoft Corporation Standard Edition (64-bit) on Windows NT 6.3 (Build 9600: ) (Hypervisor)

On this server there is a link to the second server.

The second server (aae-sqldw-02) is Microsoft SQL Server 2016 (SP1-CU15-GDR) (KB4505221) - 13.0.4604.0 (X64) Jun 15 2019 07:56:34 Copyright (c) Microsoft Corporation Enterprise Edition: Core-based Licensing (64-bit) on Windows Server 2016 Datacenter 10.0 (Build 14393: ) (Hypervisor)

On the first server we are running a "simple" query:

TRUNCATE TABLE [dbo].[LocalTable]

INSERT INTO [dbo].[LocalTable]
    ([DatabaseName]
    ,[SalesContractNumber]
    ,... 60 columns
    )
SELECT
    convert(varchar(128), DatabaseName) collate Latin1_General_CI_AS
    ,convert(varchar(60), SalesContractNumber) collate Latin1_General_CI_AS
    ,... 60 columns
FROM [aae-sqldw-02].[Fin_DWH].[dbo].[RemoteView]
WHERE DatabaseName = 'somename'

This query sometimes fails with an error:

Cannot fetch a row from OLE DB provider "SQLNCLI11" for linked server "aae-sqldw-02".

or with this error:

Cannot fetch the rowset from OLE DB provider "SQLNCLI11" for linked server "aae-sqldw-02". .

I know that the second server is under a very heavy load most of the day. It literally maxes out its disk IO (255MB/sec). The brute-force solution is to simply move it to a more expensive plan with more IO. This change needs a lot of bureaucracy and will take a long time. Besides, there is no guarantee that the next tier will be enough.

server load

Is there anything I can do with the given resources right now?

When a query completes successfully, it can take between 1-3 hours. The query returns about 3M rows, about 4GB of data, so not too much.

When a query fails with Cannot fetch a row, last few times it failed after 9294 seconds (2.5 hours), 12326 seconds (3.5 hours).

When a query fails with Cannot fetch the rowset, it failed after 606 seconds, 611 seconds.

So, 600 seconds suggest some default 10min timeout (for connection?) In those cases when connection succeeded, it started fetching the data, but failed in the process. Maybe the linked server could not send the next row fast enough and some other timeout kicked in.

When a query succeeded, it took 3841 seconds last time.

Here are the settings for the linked server:

EXEC master.dbo.sp_addlinkedserver @server = N'aae-sqldw-02', @srvproduct=N'SQL Server'
EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'aae-sqldw-02',@useself=N'True',@locallogin=NULL,@rmtuser=NULL,@rmtpassword=NULL
GO

EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'collation compatible', @optvalue=N'false'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'data access', @optvalue=N'true'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'dist', @optvalue=N'false'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'pub', @optvalue=N'false'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'rpc', @optvalue=N'true'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'rpc out', @optvalue=N'true'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'sub', @optvalue=N'false'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'connect timeout', @optvalue=N'0'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'collation name', @optvalue=null
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'lazy schema validation', @optvalue=N'false'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'query timeout', @optvalue=N'0'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'use remote collation', @optvalue=N'true'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'remote proc transaction promotion', @optvalue=N'true'

What do you think, would it make any difference if I explicitly set the query timeout option to something like 5 hours? Can it make things worse?

Obviously, the proper fix would be to look at what is going on at the server and optimize the queries to reduce the overall load, but is there anything I can do at a higher server / database level, so that the query completes, even if it takes a really long time?

We need to run this query once a week and right now we have to retry it several times until it completes successfully.

¿Fue útil?

Solución

Per our discussion in the comments, I think this Microsoft doc might be what you're looking for, except I have a feeling this is only applicable to an on-prem instance and you won't be able to adjust this in Azure.

I've also found a tangentially related StackOverflow post where the accepted answer was to increase the DTUs for a highly intensive IO workload, even if you just temporarily scaled up to run that query and scaled back down when it was done. (Again scaling up to an NVMe based tier might payoff momentously here.)

Unfortunately the errors you're receiving don't have a lot of concrete information out there, and have quite varying causes. The only one that I found that might be relevant to your case is this StackExchange post where the issue was due to deadlocks happening across the linked server. Perhaps you're running into the same issue? (Which theoretically could be a timing thing resulting from your server crawling for other concurrently running queries when the IO is maxed out.)

Outside of that, the only other option I think you have is tuning the query itself to improve performance. Even being throttled at ~250 MB/s of IO here, 4 GB of data should be processed in about 16 seconds + whatever bottleneck might be added for the network latency. But 1+ hours definitely seems off, even for 3 million rows (deadlocks seeming more suspect perhaps). For the error you're getting which isn't directly a timeout error itself, I would consider talking to an Azure rep to see if there's more to your underlying issue.

Licenciado bajo: CC-BY-SA con atribución
No afiliado a dba.stackexchange
scroll top