How to restart MariaDB Galera cluster?

https://dba.stackexchange.com/questions/151941

04-10-2020
|

Question

After all node have been crashed I try to recover the cluster but without success. I have only 2 nodes.

As documentation says I set a parameter on one of the node:

set global wsrep_provider_options="pc.bootstrap=true";

And then try to start first node:

systemctl start mariadb

After that I get an error:

Oct 11 16:11:12 proxy1 sh[2367]: 2016-10-11 16:11:12 140291677038720 [Note] /usr/sbin/mysqld (mysqld 10.1.18-MariaDB) starting as process 2402 ...
Oct 11 16:11:15 proxy1 sh[2367]: WSREP: Recovered position b6c1dc93-8fa7-11e6-933e-e64cd44e3be0:141
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] /usr/sbin/mysqld (mysqld 10.1.18-MariaDB) starting as process 2434 ...
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: Read nil XID from storage engines, skipping position init
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: wsrep_load(): Galera 25.3.18(r3632) by Codership Oy <info@codership.com> loaded successfully.
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: CRC-32C: using hardware acceleration.
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: Found saved state: b6c1dc93-8fa7-11e6-933e-e64cd44e3be0:-1
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.0.41; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false;
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140046790919936 [Note] WSREP: Service thread queue flushed.
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: Assign initial position for certification: 141, protocol version: -1
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: wsrep_sst_grab()
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: Start replication
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: Setting initial position to b6c1dc93-8fa7-11e6-933e-e64cd44e3be0:141
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: protonet asio version 0
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: Using CRC-32C for message checksums.
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: backend: asio
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: gcomm thread scheduling priority set to other:0
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: restore pc from disk failed
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: GMCast version 0
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: (30a7b2e6, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: (30a7b2e6, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: EVS version 0
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: gcomm: connecting to group 'test_cluster', peer '192.168.0.41:,192.168.0.42:'
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: (30a7b2e6, 'tcp://0.0.0.0:4567') connection established to 30a7b2e6 tcp://192.168.0.41:4567
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Warning] WSREP: (30a7b2e6, 'tcp://0.0.0.0:4567') address 'tcp://192.168.0.41:4567' points to own listening address, blacklisting
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: (30a7b2e6, 'tcp://0.0.0.0:4567') connection established to 30a7b2e6 tcp://192.168.0.41:4567
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: (30a7b2e6, 'tcp://0.0.0.0:4567') connection established to 1ef15511 tcp://192.168.0.42:4567
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: (30a7b2e6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: declaring 1ef15511 at tcp://192.168.0.42:4567 stable
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Warning] WSREP: no nodes coming from prim view, prim not possible
Oct 11 16:11:15 proxy1 mysqld[2434]: 2016-10-11 16:11:15 140047023368320 [Note] WSREP: view(view_id(NON_PRIM,1ef15511,2) memb {
Oct 11 16:11:15 proxy1 mysqld[2434]: 1ef15511,0
Oct 11 16:11:15 proxy1 mysqld[2434]: 30a7b2e6,0
Oct 11 16:11:15 proxy1 mysqld[2434]: } joined {
Oct 11 16:11:15 proxy1 mysqld[2434]: } left {
Oct 11 16:11:15 proxy1 mysqld[2434]: } partitioned {
Oct 11 16:11:15 proxy1 mysqld[2434]: })
Oct 11 16:11:18 proxy1 mysqld[2434]: 2016-10-11 16:11:18 140047023368320 [Note] WSREP: (30a7b2e6, 'tcp://0.0.0.0:4567') turning message relay requesting off
Oct 11 16:11:19 proxy1 mysqld[2434]: 2016-10-11 16:11:19 140047023368320 [Note] WSREP: (30a7b2e6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.42:4567
Oct 11 16:11:20 proxy1 mysqld[2434]: 2016-10-11 16:11:20 140047023368320 [Note] WSREP: forgetting 1ef15511 (tcp://192.168.0.42:4567)
Oct 11 16:11:20 proxy1 mysqld[2434]: 2016-10-11 16:11:20 140047023368320 [Note] WSREP: (30a7b2e6, 'tcp://0.0.0.0:4567') turning message relay requesting off
Oct 11 16:11:20 proxy1 mysqld[2434]: 2016-10-11 16:11:20 140047023368320 [Warning] WSREP: no nodes coming from prim view, prim not possible
Oct 11 16:11:20 proxy1 mysqld[2434]: 2016-10-11 16:11:20 140047023368320 [Note] WSREP: view(view_id(NON_PRIM,30a7b2e6,3) memb {
Oct 11 16:11:20 proxy1 mysqld[2434]: 30a7b2e6,0
Oct 11 16:11:20 proxy1 mysqld[2434]: } joined {
Oct 11 16:11:20 proxy1 mysqld[2434]: } left {
Oct 11 16:11:20 proxy1 mysqld[2434]: } partitioned {
Oct 11 16:11:20 proxy1 mysqld[2434]: 1ef15511,0
Oct 11 16:11:20 proxy1 mysqld[2434]: })
Oct 11 16:11:25 proxy1 mysqld[2434]: 2016-10-11 16:11:25 140047023368320 [Note] WSREP:  cleaning up 1ef15511 (tcp://192.168.0.42:4567)
Oct 11 16:11:46 proxy1 mysqld[2434]: 2016-10-11 16:11:46 140047023368320 [Note] WSREP: view((empty))
Oct 11 16:11:46 proxy1 mysqld[2434]: 2016-10-11 16:11:46 140047023368320 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
Oct 11 16:11:46 proxy1 mysqld[2434]: at gcomm/src/pc.cpp:connect():162
Oct 11 16:11:46 proxy1 mysqld[2434]: 2016-10-11 16:11:46 140047023368320 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
Oct 11 16:11:46 proxy1 mysqld[2434]: 2016-10-11 16:11:46 140047023368320 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1380: Failed to open channel 'test_cluster' at 'gcomm://192.168.0.41,192.168.0.42': -110 (Connection timed out)
Oct 11 16:11:46 proxy1 mysqld[2434]: 2016-10-11 16:11:46 140047023368320 [ERROR] WSREP: gcs connect failed: Connection timed out
Oct 11 16:11:46 proxy1 mysqld[2434]: 2016-10-11 16:11:46 140047023368320 [ERROR] WSREP: wsrep::connect(gcomm://192.168.0.41,192.168.0.42) failed: 7
Oct 11 16:11:46 proxy1 mysqld[2434]: 2016-10-11 16:11:46 140047023368320 [ERROR] Aborting
Oct 11 16:11:47 proxy1 systemd[1]: mariadb.service: main process exited, code=exited, status=1/FAILURE
Oct 11 16:11:47 proxy1 systemd[1]: Failed to start MariaDB database server.
-- Subject: Unit mariadb.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit mariadb.service has failed.
-- 
-- The result is failed.
Oct 11 16:11:47 proxy1 systemd[1]: Unit mariadb.service entered failed state.
Oct 11 16:11:47 proxy1 systemd[1]: mariadb.service failed.
Oct 11 16:11:47 proxy1 polkitd[570]: Unregistered Authentication Agent for unix-process:2360:148848 (system bus name :1.15, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)

How to recover a cluster?

Solution

MariaDB Galera Cluster:
Solution 1:
1) I've changed safe_to_bootstrap parameter to 1 on one of the node in the file /var/lib/mysql/grastate.dat:

safe_to_bootstrap: 1

2) After that I killed all mysql processes:

killall -KILL mysql mysqld_safe mysqld mysql-systemd

3) And started a new cluster:

galera_new_cluster

4) All other nodes I reconnected to the new one:

systemctl restart mariadb

P.S. to install killall on CentOS use psmisc:

sudo yum install psmisc

Solution 2:
Another way to restart a MariaDB Galera Cluster is to use --wsrep-new-cluster parameter.

1) Kill all mysql processes:

killall -KILL mysql mysqld_safe mysqld mysql-systemd

2) On the most up to date node start a new cluster:

/etc/init.d/mysql start --wsrep-new-cluster

3) Now other nodes can be connected:

service mysql start --wsrep_cluster_address="gcomm://192.168.0.101,192.168.0.102,192.168.0.103" \
--wsrep_cluster_name="my_cluster"

Percona XtraDB Cluster:
Solution 1:
In case you can connect to the most up to date node then you can setup the node to bootstrap with the next SQL:

SET GLOBAL wsrep_provider_options='pc.bootstrap=true';

Solution 2:
In case if all your nodes are dead and can not be started, you can stop the old one cluster and run a new one. You must stop all the cluster nodes because they have an information about old nodes in the old cluster.

1) Kill all the mysql processes on all nodes:

killall -KILL mysql mysqld_safe mysqld mysql-systemd

2) Start a new cluster on the most up to date node:

systemctl start mysql@bootstrap.service

3) Start other nodes:

systemctl start mysql

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange