Aurora MySQL 5.7 randomly fails

https://dba.stackexchange.com/questions/213030

07-01-2021

Question

This is the fifth time it has happened. It occurs about once a week (Tuesday or Wednesday, between 03:00 and 07:00 UTC+0). The console shows the instance as available, but it is inaccessible. We waited to see whether the instance would recover on its own, but after ~30 min nothing had changed. I then rebooted it manually, and it came back online after the reboot (~5 min).

It would be helpful to know what actually went wrong. This is only a dev server with few users and records.

Engine: Aurora MySQL 5.7.12
DB instance class: db.t2.small
Backup time: 16:00-16:30 UTC+0
Maintenance time: sun:17:00-sun:17:30 UTC+0

Below is the complete list of logs available after rebooting the instance.

error/mysql-error-running.log.2018-07-24.03 Tue Jul 24 11:14:06 GMT+800 2018    11.8 kB
error/mysql-error-running.log.2018-07-24.04 Tue Jul 24 11:30:00 GMT+800 2018    285.5 kB
error/mysql-error-running.log.2018-07-24.05 Tue Jul 24 12:30:00 GMT+800 2018    31.1 kB
error/mysql-error-running.log.2018-07-24.06 Tue Jul 24 13:30:00 GMT+800 2018    31.8 kB
error/mysql-error-running.log.2018-07-24.07 Tue Jul 24 14:30:00 GMT+800 2018    32.9 kB
error/mysql-error-running.log.2018-07-24.08 Tue Jul 24 15:30:00 GMT+800 2018    29 kB
error/mysql-error-running.log.2018-07-24.09 Tue Jul 24 16:30:00 GMT+800 2018    32.1 kB
error/mysql-error-running.log.2018-07-24.10 Tue Jul 24 17:30:00 GMT+800 2018    27.5 kB
error/mysql-error-running.log.2018-07-24.11 Tue Jul 24 18:30:00 GMT+800 2018    31.7 kB
error/mysql-error-running.log.2018-07-24.12 Tue Jul 24 19:30:00 GMT+800 2018    27.1 kB
error/mysql-error-running.log.2018-07-24.13 Tue Jul 24 20:30:00 GMT+800 2018    22.4 kB
error/mysql-error-running.log.2018-07-24.14 Tue Jul 24 21:30:00 GMT+800 2018    22.8 kB
error/mysql-error-running.log.2018-07-24.15 Tue Jul 24 22:30:00 GMT+800 2018    24.7 kB
error/mysql-error-running.log.2018-07-24.16 Tue Jul 24 23:30:00 GMT+800 2018    24.7 kB
error/mysql-error.log   Wed Jul 25 00:34:45 GMT+800 2018    2.6 kB
external/mysql-external.log Wed Jul 25 00:30:00 GMT+800 2018    7.6 kB
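One quick way to decide which log to read first is to compare file sizes. The sketch below parses the rotated error-log rows from the listing above and flags any file much larger than the median. The 3x threshold and the function names are arbitrary choices made for illustration, not anything RDS provides.

```python
from statistics import median

# Rotated error-log rows copied from the listing above.
LISTING = """\
error/mysql-error-running.log.2018-07-24.03 Tue Jul 24 11:14:06 GMT+800 2018    11.8 kB
error/mysql-error-running.log.2018-07-24.04 Tue Jul 24 11:30:00 GMT+800 2018    285.5 kB
error/mysql-error-running.log.2018-07-24.05 Tue Jul 24 12:30:00 GMT+800 2018    31.1 kB
error/mysql-error-running.log.2018-07-24.06 Tue Jul 24 13:30:00 GMT+800 2018    31.8 kB
error/mysql-error-running.log.2018-07-24.07 Tue Jul 24 14:30:00 GMT+800 2018    32.9 kB
error/mysql-error-running.log.2018-07-24.08 Tue Jul 24 15:30:00 GMT+800 2018    29 kB
error/mysql-error-running.log.2018-07-24.09 Tue Jul 24 16:30:00 GMT+800 2018    32.1 kB
error/mysql-error-running.log.2018-07-24.10 Tue Jul 24 17:30:00 GMT+800 2018    27.5 kB
error/mysql-error-running.log.2018-07-24.11 Tue Jul 24 18:30:00 GMT+800 2018    31.7 kB
error/mysql-error-running.log.2018-07-24.12 Tue Jul 24 19:30:00 GMT+800 2018    27.1 kB
error/mysql-error-running.log.2018-07-24.13 Tue Jul 24 20:30:00 GMT+800 2018    22.4 kB
error/mysql-error-running.log.2018-07-24.14 Tue Jul 24 21:30:00 GMT+800 2018    22.8 kB
error/mysql-error-running.log.2018-07-24.15 Tue Jul 24 22:30:00 GMT+800 2018    24.7 kB
error/mysql-error-running.log.2018-07-24.16 Tue Jul 24 23:30:00 GMT+800 2018    24.7 kB
"""

def parse_sizes(listing):
    """Return {file name: size in kB}; the size is the second-to-last token."""
    sizes = {}
    for line in listing.splitlines():
        tokens = line.split()
        if len(tokens) >= 3 and tokens[-1] == "kB":
            sizes[tokens[0]] = float(tokens[-2])
    return sizes

def unusually_large(sizes, factor=3.0):
    """Names of files larger than `factor` times the median size."""
    typical = median(sizes.values())
    return [name for name, kb in sizes.items() if kb > factor * typical]

if __name__ == "__main__":
    for name in unusually_large(parse_sizes(LISTING)):
        print(name)
```

A much larger file only suggests where the server was logging most heavily; it does not by itself identify the cause.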

external/mysql-external.log

/rdsdbbin/oscar/bin/mysqld, Version: 5.7.12 (MySQL Community Server (GPL)). started with:
Tcp port: 3306 Unix socket: /tmp/mysql.sock
Time,ServerHost,User,UserHost,Command,Payload
/rdsdbbin/oscar/bin/mysqld, Version: 5.7.12 (MySQL Community Server (GPL)). started with:
Tcp port: 3306 Unix socket: /tmp/mysql.sock
Time,ServerHost,User,UserHost,Command,Payload
/rdsdbbin/oscar/bin/mysqld, Version: 5.7.12 (MySQL Community Server (GPL)). started with:
Tcp port: 3306 Unix socket: /tmp/mysql.sock
Time,ServerHost,User,UserHost,Command,Payload
----------------------- END OF LOG ----------------------

error/mysql-error-running.log.2018-07-24.03 shows: https://pastebin.com/ywmXLR5g.

error/mysql-error-running.log.2018-07-24.04 shows: https://pastebin.com/g1dkR6rj.

error/mysql-error-running.log.2018-07-24.18 shows: https://pastebin.com/g0aAXfaT.

All other logs show nothing (see photo).

Event Logs

July 24, 2018 at 11:14:14 AM UTC+8  DB instance restarted
July 24, 2018 at 11:13:31 AM UTC+8  Error restarting mysql: Engine bootstrap failed with no mysqld process running...
July 24, 2018 at 11:12:01 AM UTC+8  Recovery of the DB instance is complete.
July 24, 2018 at 11:04:26 AM UTC+8  Recovery of the DB instance has started. Recovery time will vary with the amount of data to be recovered.
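The events above are reported in UTC+8, while the failure window, backup window, and maintenance window in the question are given in UTC+0. A minimal sketch of the conversion, assuming the timestamps are exactly as printed in the event log (the short labels and helper names are made up for illustration):

```python
from datetime import datetime, timedelta, timezone

UTC8 = timezone(timedelta(hours=8))

# Event timestamps copied from the event log above (all UTC+8),
# listed oldest first. The labels are shortened paraphrases.
EVENTS = [
    ("July 24, 2018 at 11:04:26 AM", "Recovery started"),
    ("July 24, 2018 at 11:12:01 AM", "Recovery complete"),
    ("July 24, 2018 at 11:13:31 AM", "Engine bootstrap failed"),
    ("July 24, 2018 at 11:14:14 AM", "DB instance restarted"),
]

def to_utc(stamp):
    """Parse a console-style UTC+8 timestamp and convert it to UTC."""
    local = datetime.strptime(stamp, "%B %d, %Y at %I:%M:%S %p")
    return local.replace(tzinfo=UTC8).astimezone(timezone.utc)

def in_window(moment, start_hour, end_hour):
    """True if the hour of `moment` falls in [start_hour, end_hour)."""
    return start_hour <= moment.hour < end_hour

if __name__ == "__main__":
    for stamp, label in EVENTS:
        utc = to_utc(stamp)
        print(
            utc.strftime("%A %Y-%m-%d %H:%M:%S UTC"),
            label,
            "in 03:00-07:00 window:", in_window(utc, 3, 7),
            "in 16:00-17:00 backup hour:", in_window(utc, 16, 17),
        )
```

Converted this way, all four events fall on Tuesday shortly after 03:00 UTC, inside the reported failure window and well away from the 16:00-16:30 backup window and the Sunday maintenance window.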

CPU Utilization (07-24-2018)

CPU Utilization (07-11-2018 to 07-24-2018)

Solution

Special thanks to @WilsonHauck. After 4 weeks of monitoring, I can confirm that manually upgrading Aurora to the latest version solved the issue.

Version 2.01.1 includes several bug fixes addressing unexpected restarts. See https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/AuroraMySQL.Updates.20Updates.html

To manually upgrade your Aurora:

Go to RDS - AWS Console
Navigate to Clusters
Select your cluster
Click Actions >> Upgrade now
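The console steps above can probably also be scripted. The sketch below assumes that "Upgrade now" corresponds to applying a pending `system-update` maintenance action through the RDS API; that mapping is an assumption, not something the answer confirms. The function takes an already-created boto3 RDS client, and its name and the cluster ARN are placeholders.

```python
def upgrade_cluster_now(rds_client, cluster_arn, action="system-update"):
    """Apply a pending maintenance action to the cluster immediately.

    Returns the API response, or None if no such action is pending.
    Assumption: the console's "Upgrade now" maps to this API call.
    """
    pending = rds_client.describe_pending_maintenance_actions(
        ResourceIdentifier=cluster_arn
    )
    # Collect the names of all actions currently pending for the resource.
    pending_actions = [
        detail["Action"]
        for resource in pending.get("PendingMaintenanceActions", [])
        for detail in resource.get("PendingMaintenanceActionDetails", [])
    ]
    if action not in pending_actions:
        return None
    return rds_client.apply_pending_maintenance_action(
        ResourceIdentifier=cluster_arn,
        ApplyAction=action,
        OptInType="immediate",
    )
```

Typical usage would be `upgrade_cluster_now(boto3.client("rds"), cluster_arn)`, where `cluster_arn` is the ARN of your own cluster. As with the console action, applying the update immediately will most likely restart the instances, so it is best run outside working hours.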

OTHER TIPS

Had this problem with Aurora 5.7 as well over the weekend, and right at the point of migration, to boot! AWS support said to disable "Performance Insights", as there is a "software defect" that internal teams are "actively" working on. No restarts so far.
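If you prefer to turn off Performance Insights from a script rather than the console, a minimal sketch with boto3 might look like the following. The function name and instance identifier are placeholders, and applying the change immediately is a choice made here, not part of the advice from AWS support.

```python
def disable_performance_insights(rds_client, instance_id):
    """Turn off Performance Insights on one DB instance.

    `rds_client` is expected to be a boto3 RDS client;
    `instance_id` is the DB instance identifier (placeholder).
    """
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        EnablePerformanceInsights=False,
        ApplyImmediately=True,  # assumption: apply now, not at next window
    )
```

Typical usage would be `disable_performance_insights(boto3.client("rds"), "my-instance")`, repeated for each instance in the cluster.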

As far as performance is concerned, compared to our instance-based clustered Percona servers, RDS Aurora MySQL is noticeably slower: about 10% slower (based on replicated data and benchmarking similar long-running reports), regardless of RDS server type or size (we tried bigger instance types just to be sure, with the same result).

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange