MySQL PXC node failing to receive state

https://dba.stackexchange.com/questions/219841

12-01-2021

Question

I have three nodes that I want to set up as a Percona XtraDB Cluster (PXC). I have bootstrapped the first node and joined the second node, but for some reason I cannot join the third node. The configuration is identical on all nodes, since I simply copied and pasted it:

[mysqld]
# Galera
wsrep_cluster_address = gcomm://10.1.5.100,10.1.5.101,10.1.5.102
wsrep_cluster_name = db-test
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_provider=/usr/lib64/galera3/libgalera_smm.so
wsrep_provider_options = "gcache.size=256M"
wsrep_slave_threads = 16 # 2~3 times with CPU
wsrep_sst_auth = "sstuser:sstPwd#123"
wsrep_sst_method = xtrabackup-v2

I am running the nodes on CentOS 7.x. Below is the status of the two PXC nodes already up and running:

| wsrep_ist_receive_seqno_end      | 0                                       |
| wsrep_incoming_addresses         | 10.1.5.100:3306,10.1.5.101:3306 |
| wsrep_cluster_weight             | 2                                       |
| wsrep_desync_count               | 0                                       |
| wsrep_evs_delayed                |                                         |
| wsrep_evs_evict_list             |                                         |
| wsrep_evs_repl_latency           | 0/0/0/0/0                               |
| wsrep_evs_state                  | OPERATIONAL                             |
| wsrep_gcomm_uuid                 | 8d59ca0f-cd35-11e8-863c-d79869fa6d80    |
| wsrep_cluster_conf_id            | 4                                       |
| wsrep_cluster_size               | 2                                       |
| wsrep_cluster_state_uuid         | ac97f711-cad5-11e8-8f39-be9d0594cdb9    |
| wsrep_cluster_status             | Primary                                 |
| wsrep_connected                  | ON                                      |
| wsrep_local_bf_aborts            | 0                                       |
| wsrep_local_index                | 0                                       |
| wsrep_provider_name              | Galera                                  |
| wsrep_provider_vendor            | Codership Oy <info@codership.com>       |
| wsrep_provider_version           | 3.31(rf216443)                          |
| wsrep_ready                      | ON                                      |
+----------------------------------+-----------------------------------------+
71 rows in set (0.01 sec)

Below is the error from the error log of the third node failing to join:

backup-v2|10.1.5.102:4444/xtrabackup_sst//1
2018-10-11T09:20:03.278884-00:00 2 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 2) (Increment: 1 -> 3)
2018-10-11T09:20:03.278997-00:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-10-11T09:20:03.279155-00:00 2 [Note] WSREP: Assign initial position for certification: 69, protocol version: 4
2018-10-11T09:20:03.279626-00:00 0 [Note] WSREP: Service thread queue flushed.
2018-10-11T09:20:03.280052-00:00 2 [Note] WSREP: Check if state gap can be serviced using IST
2018-10-11T09:20:03.280145-00:00 2 [Note] WSREP: Local state seqno is undefined (-1)
2018-10-11T09:20:03.280445-00:00 2 [Note] WSREP: State gap can't be serviced using IST. Switching to SST
2018-10-11T09:20:03.280510-00:00 2 [Note] WSREP: Failed to prepare for incremental state transfer: Local state seqno is undefined: 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():549. IST will be unavailable.
2018-10-11T09:20:03.287673-00:00 0 [Note] WSREP: Member 1.0 (db-test-3.pd.local) requested state transfer from '*any*'. Selected 0.0 (db-test-2.pd.local)(SYNCED) as donor.
2018-10-11T09:20:03.287850-00:00 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 69)
2018-10-11T09:20:03.288073-00:00 2 [Note] WSREP: Requesting state transfer: success, donor: 0
2018-10-11T09:20:03.288225-00:00 2 [Note] WSREP: GCache history reset: ac97f711-cad5-11e8-8f39-be9d0594cdb9:0 -> ac97f711-cad5-11e8-8f39-be9d0594cdb9:69
2018-10-11T09:20:38.988120-00:00 0 [Warning] WSREP: 0.0 (db-test-2.pd.local): State transfer to 1.0 (db-test-3.pd.local) failed: -32 (Broken pipe)
2018-10-11T09:20:38.988274-00:00 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():766: Will never receive state. Need to abort.
2018-10-11T09:20:38.988366-00:00 0 [Note] WSREP: gcomm: terminating thread
2018-10-11T09:20:38.988493-00:00 0 [Note] WSREP: gcomm: joining thread
2018-10-11T09:20:38.988942-00:00 0 [Note] WSREP: gcomm: closing backend
2018-10-11T09:20:38.995070-00:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,8d59ca0f,3)
memb {
        d3167260,0
        }
joined {
        }
left {
        }
partitioned {
        8d59ca0f,0
        e3def063,0
        }
)
2018-10-11T09:20:38.995334-00:00 0 [Note] WSREP: Current view of cluster as seen by this node
view ((empty))
2018-10-11T09:20:38.996612-00:00 0 [Note] WSREP: gcomm: closed
2018-10-11T09:20:38.996837-00:00 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Terminated
        2018-10-11T09:20:47.767946+00:00 WSREP_SST: [ERROR] Removing /var/lib/mysql//xtrabackup_galera_info file due to signal
        2018-10-11T09:20:47.788109+00:00 WSREP_SST: [ERROR] Removing  file due to signal
        2018-10-11T09:20:47.808425+00:00 WSREP_SST: [ERROR] ******************* FATAL ERROR ********************** 
        2018-10-11T09:20:47.818240+00:00 WSREP_SST: [ERROR] Error while getting data from donor node:  exit codes: 143 143
        2018-10-11T09:20:47.828411+00:00 WSREP_SST: [ERROR] ****************************************************** 
        2018-10-11T09:20:47.840006+00:00 WSREP_SST: [ERROR] Cleanup after exit with status:32

And below is the error from the node that was chosen as the donor:

2018/10/11 09:20:38 socat[22418] E connect(5, AF=2 10.1.5.102:4444, 16): No route to host
        2018-10-11T09:20:38.805798+00:00 WSREP_SST: [ERROR] ******************* FATAL ERROR ********************** 
        2018-10-11T09:20:38.818683+00:00 WSREP_SST: [ERROR] Error while sending data to joiner node:  exit codes: 0 1
        2018-10-11T09:20:38.832059+00:00 WSREP_SST: [ERROR] ****************************************************** 
        2018-10-11T09:20:38.846813+00:00 WSREP_SST: [ERROR] Cleanup after exit with status:32
2018-10-11T09:20:38.985060-00:00 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.5.102:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.23-23-57'  --binlog 'db-test-2-bin' --gtid 'ac97f711-cad5-11e8-8f39-be9d0594cdb9:69' : 32 (Broken pipe)
2018-10-11T09:20:38.985552-00:00 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.5.102:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.23-23-57'  --binlog 'db-test-2-bin' --gtid 'ac97f711-cad5-11e8-8f39-be9d0594cdb9:69' 
2018-10-11T09:20:38.990613-00:00 0 [Warning] WSREP: 0.0 (db-test-2.pd.local): State transfer to 1.0 (db-test-3.pd.local) failed: -32 (Broken pipe)
2018-10-11T09:20:38.990815-00:00 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 69)
2018-10-11T09:20:38.997784-00:00 0 [Note] WSREP: declaring e3def063 at tcp://10.1.5.100:4567 stable
2018-10-11T09:20:38.997807-00:00 0 [Note] WSREP: Member 0.0 (db-test-2.pd.local) synced with group.
2018-10-11T09:20:38.998230-00:00 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 69)
2018-10-11T09:20:38.998277-00:00 0 [Note] WSREP: forgetting d3167260 (tcp://10.1.5.102:4567)
2018-10-11T09:20:38.998806-00:00 13 [Note] WSREP: Synchronized with group, ready for connections
2018-10-11T09:20:38.999112-00:00 13 [Note] WSREP: Setting wsrep_ready to true
2018-10-11T09:20:38.999198-00:00 13 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-10-11T09:20:39.003491-00:00 0 [Note] WSREP: Node 8d59ca0f state primary
2018-10-11T09:20:39.005025-00:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,8d59ca0f,4)
memb {
        8d59ca0f,0
        e3def063,0
        }
joined {
        }
left {
        }
partitioned {
        d3167260,0
        }
)
2018-10-11T09:20:39.005270-00:00 0 [Note] WSREP: Save the discovered primary-component to disk
2018-10-11T09:20:39.009691-00:00 0 [Note] WSREP: forgetting d3167260 (tcp://10.1.5.102:4567)
2018-10-11T09:20:39.010097-00:00 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2018-10-11T09:20:39.011037-00:00 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: eb0b1f21-cd36-11e8-8ac8-c60fb82759c9
2018-10-11T09:20:39.019171-00:00 0 [Note] WSREP: STATE EXCHANGE: sent state msg: eb0b1f21-cd36-11e8-8ac8-c60fb82759c9
2018-10-11T09:20:39.021665-00:00 0 [Note] WSREP: STATE EXCHANGE: got state msg: eb0b1f21-cd36-11e8-8ac8-c60fb82759c9 from 0 (db-test-2.pd.local)
2018-10-11T09:20:39.021786-00:00 0 [Note] WSREP: STATE EXCHANGE: got state msg: eb0b1f21-cd36-11e8-8ac8-c60fb82759c9 from 1 (db-test-1.pd.local)
2018-10-11T09:20:39.021861-00:00 0 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 3,
        members    = 2/2 (primary/total),
        act_id     = 69,
        last_appl. = 0,
        protocols  = 0/9/3 (gcs/repl/appl),
        group UUID = ac97f711-cad5-11e8-8f39-be9d0594cdb9
2018-10-11T09:20:39.021999-00:00 0 [Note] WSREP: Flow-control interval: [141, 141]
2018-10-11T09:20:39.022058-00:00 0 [Note] WSREP: Trying to continue unpaused monitor
2018-10-11T09:20:39.022774-00:00 17 [Note] WSREP: REPL Protocols: 9 (4, 2)
2018-10-11T09:20:39.023163-00:00 17 [Note] WSREP: New cluster view: global state: ac97f711-cad5-11e8-8f39-be9d0594cdb9:69, view# 4: Primary, number of nodes: 2, my index: 0, protocol version 3
2018-10-11T09:20:39.023209-00:00 17 [Note] WSREP: Setting wsrep_ready to true
2018-10-11T09:20:39.023256-00:00 17 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 1) (Increment: 3 -> 2)
2018-10-11T09:20:39.023373-00:00 17 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-10-11T09:20:39.023540-00:00 17 [Note] WSREP: Assign initial position for certification: 69, protocol version: 4
2018-10-11T09:20:39.023832-00:00 0 [Note] WSREP: Service thread queue flushed.
2018-10-11T09:20:44.480289-00:00 0 [Note] WSREP:  cleaning up d3167260 (tcp://10.1.5.102:4567)

When I bootstrap the third node as its own cluster, it runs just fine. But when I stop the first two nodes in the other cluster and try to have them join the new cluster, they fail to join. I can ping and telnet to the first two cluster nodes from the third node and vice versa. I even tried stopping all nodes and bootstrapping the cluster from scratch, and that did not help.

What is really going on here?

Solution

First of all, thanks for providing enough debug information; not everybody does that.

Your SST (the full data copy to the joining node) is failing. On the donor, socat fails with a "No route to host" error when connecting to 10.1.5.102:4444, which tells you that the new host is unreachable from the donor on that port. This is not really a cluster configuration issue but an OS/network one: the port may be closed, a firewall may be up, or there may be some other network problem. Try to ping the joining host from the donor, or run a test netcat on port 4444, to debug the breakage. Once the host is reachable, your SST should succeed and the node should join the cluster. Usually the cause is some silly mistake, such as a firewall blocking one of the ports in use, wrong datadir permissions, or the wrong user.
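As a rough way to run that reachability test, here is a minimal Python sketch that attempts a TCP connection to each port and classifies the result. The helper name `check_port` and the `PXC_PORTS` table are made up for illustration; port 4444 comes from the log above, while 3306, 4567 and 4568 are the usual PXC defaults and are an assumption about this setup.

```python
import errno
import socket

# Port 4444 appears in the SST log above; the others are the usual PXC
# defaults and are an assumption here. Adjust if your setup overrides them.
PXC_PORTS = {
    3306: "MySQL client",
    4444: "SST",
    4567: "group communication",
    4568: "IST",
}


def check_port(host, port, timeout=3.0):
    """Attempt a TCP connect to host:port and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: packets are probably being dropped silently.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on this port.
        return "refused"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            # Same class of failure as socat's "No route to host".
            return "unreachable"
        return "error: {}".format(exc)


if __name__ == "__main__":
    joiner = "10.1.5.102"  # the joining node from the logs above
    for port, role in PXC_PORTS.items():
        print("{}:{} ({}): {}".format(joiner, port, role, check_port(joiner, port)))
```

Run it on the donor while the joiner is attempting to start. Port 4444 is normally only listening on the joiner during an SST, so "refused" outside of a transfer is not necessarily a problem, whereas "unreachable" or "timeout" points at a firewall or routing issue like the one in the donor log.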

If this is a test setup, you can also try changing the SST method to a different one to help debugging; a method that only uses the MySQL port (such as mysqldump) is simpler to get working.

Licensed under: CC-BY-SA with attribution
