How can SQL Server email an error message when Replication publisher fails to connect to a subscriber system

StackOverflow https://stackoverflow.com/questions/1824939

Question

We have several remote locations where we have set up SQL Server 2005 replication. Sometimes the publisher fails to replicate due to various reasons like

1) network problems,

2) improper shut downs of subscriber,

3) change in domain passwords,

4) change in SQL passwords,

5) failure to switch on the subscriber system.

Is there any way to make SQL Server email the admin when this happens so he can check?

Thanks, Chak.


Solution

The way I typically handle this is by modifying the SQL Server Agent job(s) responsible for starting and running the replication agents (depending on your replication topology, you'll have a variety of them, potentially in different places). Simply add a job step to the appropriate agent job(s) (i.e. Log Reader Agent, Distribution Agent, Merge Agent, Queue Reader Agent, etc.) after the "run agent" step. Have the new step execute either whenever the "run agent" step finishes or only when it fails, depending on whether or not you are using a continuous schedule.

For example, if I have a transactional uni-directional push publication set up, the distribution agent will be running at the distributor. If I connect to the distributor and find the SQL Server Agent job responsible for running the distribution agent for this publication, I can modify the job and add a step that emails a particular group when the "run agent" step fails or completes. If I am using a continuous replication schedule, I will simply have the email step run whenever the "run agent" step finishes (as I want to be notified if the agent stops for any reason). If I am using a non-continuous schedule, I may instead have the email step run only on failure of the "run agent" step. You can even configure this "email" step to send an email, pause for a bit, then try restarting the agent automatically (by simply configuring the step to "go to step 1" on success).

Here's a screen shot that depicts what the job steps look like for a distribution agent configured as I outline above:

[Screenshot: distribution agent configured with notify, pause, restart step]

You'll notice in the pic above that I've added a step called "Notify, pause, retry" which is executed any time the agent stops (success or failure - this is intentional, as I am using a continuous replication schedule and simply want to know whenever the distribution agent isn't running for whatever reason). This step basically sends an email to a specific group, waits for a minute or two, then starts the agent up again. You can add code to do whatever you like, including logging, restarting only a certain number of times in a certain time slice, etc. It's easily scripted and repeatable for any number of agents, publications, etc. (I have scripts to ensure any new replication agent in any type of topology includes this type of configuration - then it's simply a matter of adding them to a release tool or scheduling their execution, depending on how you deploy in your environment).
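A rough sketch of how such a step could be scripted is below. It is not the author's actual script: the job name, mail profile, and recipient address are hypothetical placeholders, it assumes Database Mail is already configured, and it assumes the "run agent" step is step 2 of the agent job (check the step list of your own job first).

```sql
USE [msdb];
GO

-- Hypothetical job name: use the name of your own distribution agent job.
DECLARE @job_name sysname;
SET @job_name = N'MYPUBLISHER-MyDb-MyPublication-MYSUBSCRIBER-1';

-- Body of the new step: send an email, then wait two minutes.
-- The profile name and recipient are placeholders.
DECLARE @command nvarchar(max);
SET @command = N'
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = N''ReplicationAlerts'',
    @recipients   = N''dba-team@example.com'',
    @subject      = N''Distribution agent stopped'',
    @body         = N''The distribution agent job stopped. It will be restarted in 2 minutes.'';
WAITFOR DELAY ''00:02:00'';';

-- Append the step to the job. On success it jumps back to step 1,
-- which starts the agent again; on failure the job quits with failure.
EXEC msdb.dbo.sp_add_jobstep
    @job_name           = @job_name,
    @step_name          = N'Notify, pause, retry',
    @subsystem          = N'TSQL',
    @database_name      = N'msdb',
    @command            = @command,
    @on_success_action  = 4,   -- 4 = go to step @on_success_step_id
    @on_success_step_id = 1,
    @on_fail_action     = 2;   -- 2 = quit the job reporting failure

-- Find the id the new step was given.
DECLARE @notify_step_id int;
SELECT @notify_step_id = s.step_id
FROM msdb.dbo.sysjobsteps AS s
JOIN msdb.dbo.sysjobs AS j ON j.job_id = s.job_id
WHERE j.name = @job_name
  AND s.step_name = N'Notify, pause, retry';

-- Route the "run agent" step (assumed to be step 2) to the new step
-- whether it succeeds or fails, as suits a continuous schedule.
EXEC msdb.dbo.sp_update_jobstep
    @job_name           = @job_name,
    @step_id            = 2,
    @on_success_action  = 4,
    @on_success_step_id = @notify_step_id,
    @on_fail_action     = 4,
    @on_fail_step_id    = @notify_step_id;
```

For a non-continuous schedule, leave the success action of the "run agent" step alone and change only `@on_fail_action` and `@on_fail_step_id`, so the email goes out on failure only.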

OTHER TIPS

As far as detecting agent problems goes, you want to know when the Log Reader and Distribution agents are stopped. Mine is also continuous replication like chadhoc's, but I find it easier to use an Alert to tell me if the agents are stopped.

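USE [msdb]
GO
-- Fire when the number of running Distribution agents drops to 0.
-- MYDATABASE here is the named instance; a default instance uses
-- the object name 'SQLServer:Replication Agents' instead.
EXEC msdb.dbo.sp_add_alert
    @name=N'Distribution agent stopped',
    @message_id=0,
    @severity=0,
    @enabled=1,
    @delay_between_responses=2160,  -- seconds between repeat notifications
    @include_event_description_in=1,
    @category_name=N'[Uncategorized]',
    @performance_condition=N'MSSQL$MYDATABASE:Replication Agents|Running|Distribution|=|0',
    @job_id=N'00000000-0000-0000-0000-000000000000'
GO
-- Attach an existing operator to the new alert (method 1 = email).
-- sp_update_notification only changes a notification that already exists.
EXEC msdb.dbo.sp_add_notification
    @alert_name=N'Distribution agent stopped',
    @operator_name=N'Amit',
    @notification_method = 1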

Validation/sync errors are not so simple to detect. You can set up a nightly job to run sp_publication_validation and set up another alert on "Validation Failed".
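As a minimal sketch, the command for such a nightly job step might look like the following for a transactional publication. The database and publication names are hypothetical, and the step would run at the publisher in the publication database.

```sql
-- Hypothetical names: replace with your published database and publication.
USE [MyPublishedDb];
GO

-- Ask replication to validate every article in the publication.
-- The results are reported by the distribution agent the next time it runs.
EXEC sp_publication_validation
    @publication   = N'MyPublication',
    @rowcount_only = 1,   -- 1 = compare row counts only
    @full_or_fast  = 2;   -- 2 = try a fast count, fall back to a full count
```

Row-count-only validation is the cheapest option; a checksum-based check catches more differences but costs more on large tables.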

I don't know much about replication specifically, but sp_readerrorlog is a very useful stored procedure that lets you read the SQL Server error log from within the database instance. If required, it may allow you to respond more appropriately based on the specific error messages, rather than just the SUCCESS/FAIL branches of an agent job. You can of course send an email directly from a stored proc too, customizing the recipients based on who can best respond to the error (or the time of day - day/night shift coordinators, for example).
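A sketch of that idea follows, with several assumptions: sp_readerrorlog is undocumented, so its parameters and three-column result set (as used here) may differ between versions; the search string, mail profile, and recipient are placeholders; and replication agent errors may not all reach the SQL Server error log, so this mainly suits things like failed logins after a password change.

```sql
-- Temp table shaped like the result set sp_readerrorlog is assumed to return.
CREATE TABLE #errlog
(
    LogDate     datetime,
    ProcessInfo nvarchar(100),
    [Text]      nvarchar(max)
);

-- 0 = current log file, 1 = SQL Server error log, then a search string.
INSERT INTO #errlog (LogDate, ProcessInfo, [Text])
EXEC sp_readerrorlog 0, 1, N'Login failed';

-- Only mail out if something matched within the last hour.
IF EXISTS (SELECT 1 FROM #errlog WHERE LogDate >= DATEADD(hour, -1, GETDATE()))
BEGIN
    -- Hypothetical profile and recipient; Database Mail must be configured.
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = N'ReplicationAlerts',
        @recipients   = N'dba-team@example.com',
        @subject      = N'Login failures found in the SQL Server error log',
        @body         = N'Recent "Login failed" entries were found. Check whether a replication password was changed.';
END

DROP TABLE #errlog;
```

Choosing recipients by error text or time of day would be a matter of branching on the rows in `#errlog` before calling `sp_send_dbmail`.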

Also, perhaps it would be more appropriate to send an email when the job succeeds rather than when it fails, considering that your potential failures include loss of network connectivity? You might want to set up an Exchange rule on your end to monitor this inbox and fire off an error notification to your admin if it doesn't receive an expected success message. Humans are very good at filtering out constant stimulus, so the lack of a success message could easily be missed. Exchange, on the other hand, is always (usually) vigilant.

Licensed under: CC-BY-SA with attribution