Question

I inherited a five-node PostgreSQL cluster running PostgreSQL 8.4 and Slony 1.2.21. We are in the process of migrating the application to all new code and have wanted to do as little maintenance on the cluster as possible. Yesterday we decided to take two nodes that were not being used out of the cluster. We used a slonik script to DROP NODE for the two nodes. This seemed to work correctly, and we shut the nodes down today. However, I noticed this morning that our master database, where we collect writes, is not replicating the changes to the rest of the servers. I have tried everything I can think of, but nothing seems to work.
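The removal itself was a plain slonik DROP NODE script along these lines (the node IDs 22 and 23 and the conninfo are placeholders here, not our real values):

cluster name = ads;
node 25 admin conninfo = 'dbname=ads host=master user=slony';
drop node (id = 22, event node = 25);
drop node (id = 23, event node = 25);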

When I run a query to check the replication status, I see that events have not been acknowledged since yesterday. The st_last_received value has not changed at all.
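The output below comes from Slony's sl_status view; the query was essentially:

select st_origin, st_received, st_last_event, st_last_event_ts,
       st_last_received, st_last_received_ts, st_last_received_event_ts,
       st_lag_num_events, now()
from _ads.sl_status;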

 st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |              now              
-----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+-------------------------------
        25 |          24 |      26196903 | 2016-11-29 17:39:06.859051 |         26187885 | 2016-11-29 12:51:45.396619 | 2016-11-28 11:11:48.909855 |              9018 | 2016-11-29 17:39:07.247598-05
        25 |          27 |      26196903 | 2016-11-29 17:39:06.859051 |         26187885 | 2016-11-28 11:11:49.203193 | 2016-11-28 11:11:48.909855 |              9018 | 2016-11-29 17:39:07.247598-05
        25 |          26 |      26196903 | 2016-11-29 17:39:06.859051 |         26187885 | 2016-11-28 11:11:50.253235 | 2016-11-28 11:11:48.909855 |              9018 | 2016-11-29 17:39:07.247598-05

I first restarted the slon daemons on all the nodes, and have subsequently done this multiple times. I have set the log level to debug level 4 and combed through the logs without finding a single issue.
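For reference, the daemons are launched along these lines on each node (the conninfo and log path here are placeholders, not our real values; -d 4 is the switch that produces the level-4 debug output):

slon -d 4 ads 'dbname=ads host=node25 user=slony' >> /var/log/slony/node25.log 2>&1 &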

I have looked through all of the sl_ tables for anything that might tell me why it is not working.
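By that I mean queries of this sort against the standard Slony catalogs in the _ads schema, working backwards from the most recent events and confirmations:

select * from _ads.sl_event order by ev_seqno desc limit 10;
select * from _ads.sl_confirm order by con_seqno desc limit 10;
select * from _ads.sl_node;
select * from _ads.sl_path;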

Our configuration for the important replication set is as follows.

select * from _ads.sl_set;
 set_id | set_origin | set_locked |   set_comment   
--------+------------+------------+-----------------
      1 |         25 |            |  mgt tables


select * from _ads.sl_subscribe ;
 sub_set | sub_provider | sub_receiver | sub_forward | sub_active 
---------+--------------+--------------+-------------+------------
       1 |           25 |           26 | t           | t
       1 |           25 |           27 | t           | t
       2 |           25 |           27 | t           | t
       1 |           25 |           24 | t           | t


select * from _ads.sl_listen ;
 li_origin | li_provider | li_receiver 
-----------+-------------+-------------
        24 |          24 |          25
        26 |          26 |          25
        27 |          27 |          25
        27 |          25 |          26
        26 |          25 |          27
        27 |          25 |          24
        24 |          25 |          27
        24 |          25 |          26
        26 |          25 |          24
        26 |          24 |          25
        27 |          26 |          25
        26 |          27 |          25
        24 |          26 |          25
        27 |          24 |          25
        24 |          27 |          25
        25 |          25 |          24
        25 |          25 |          26
        25 |          25 |          27

Any advice, assistance, or ideas on where to look would be greatly appreciated. I am in full-on panic mode now.


Solution

OK, this was a nightmare, but we resolved the problem. We had been able to remove the two servers using the slonik DROP NODE command, but afterwards the slon daemons were segfaulting repeatedly. We had to DROP NODE on all the slaves and then add and subscribe them again. This dropped the replicated tables and then ran a copy set to restore them. We were very hesitant to run DROP NODE on the slaves because it is a destructive operation, but replication is now running as expected again.
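A minimal sketch of the per-slave rebuild, using node 26 as the example (the conninfo strings here are placeholders for our real connection details):

cluster name = ads;
node 25 admin conninfo = 'dbname=ads host=master user=slony';
node 26 admin conninfo = 'dbname=ads host=slave26 user=slony';

drop node (id = 26, event node = 25);
store node (id = 26, comment = 'slave 26', event node = 25);
store path (server = 25, client = 26, conninfo = 'dbname=ads host=master user=slony');
store path (server = 26, client = 25, conninfo = 'dbname=ads host=slave26 user=slony');
subscribe set (id = 1, provider = 25, receiver = 26, forward = yes);

The SUBSCRIBE SET is what triggers the copy set that re-copies the tables; nodes 24 and 27 were handled the same way, with 27 additionally re-subscribing to set 2.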

I am so looking forward to turning off this service permanently.

Licensed under: CC-BY-SA with attribution