EBS Raid-0: 9 out of 20 EBS volumes “impaired”. Now what? [closed]
18-06-2021
Question
I have an EC2 instance (10GB network, AMI: EC2 CentOS 5.5 GPU HVM AMI (Driver 260.19.29) (ami-42a2532b)) with 20 EBS volumes attached in RAID-0. As a result of last night's AWS outage, 9 of these volumes have been marked "impaired, possible data inconsistency" and their I/O has been disabled. The instance is now stopped. The volumes are awaiting "Enable I/O".
Additionally, the small EBS volume that is not part of the raid array and that has the root partition was also impaired.
AWS recommends enabling I/O on the impaired volumes and then running fsck on them, but of course that doesn't apply to EBS volumes used as members of a RAID array.
What would be the safest way to proceed in order to try to recover that array? I understand I might lose it all, and that's why we have contingency plans (just much more work and time to recover), but I'd rather give myself the best chance and try to recover or repair the array. So what seems to be the safest sequence of actions?
Thanks.
Solution
Wanted to give an update and close this question. Essentially everything went fine and I didn't have any data corruption. fsck ran clean, the parallel DB that uses this array started just fine, and all is good.
Here are some commands that helped me gather data as I gingerly walked through the minefield:
mdadm --detail /dev/md0 >md0_detail
Get an overview of the raid array.
mdadm --examine /dev/sd[fghijklmnopqrstuvwxy] > examine_sd
Examine each component of the raid array.
grep -i checksum examine_sd
Verify that all checksums are correct.
mount -o noatime /dev/md0 /data
Since the low-level tests looked good, tried to mount the raid device.
Notes:
- The actual fs used by the device is ext4 (journalled)
- The mount went fine, and going through the filesystem, everything seemed to be where it should be.
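With 20 members, eyeballing the grep output is easy to get wrong, so the checksum check done before mounting could be scripted. The following is only a rough sketch, not what I actually ran: the function names are made up for illustration, it assumes the examine_sd file produced above, and it assumes mdadm's usual "Checksum : <hex> - correct" and "Events : <n>" line format. Comparing the event counters is an extra check that was not part of my original procedure.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of mdadm).

# Print the number of Checksum lines that are NOT reported as "correct".
count_bad_checksums() {
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    grep -i 'checksum' "$1" | grep -vci -- '- correct' || true
}

# Print how many distinct event counters appear across all members.
# A value of 1 means every member agrees on the array's last update.
count_distinct_event_counts() {
    grep -E '^[[:space:]]*Events[[:space:]]*:' "$1" \
        | awk -F: '{ gsub(/[[:space:]]/, "", $2); print $2 }' \
        | sort -u \
        | wc -l \
        | tr -d '[:space:]'
}

# Example use against the file created earlier:
#   count_bad_checksums examine_sd
#   count_distinct_event_counts examine_sd
```

A result of 0 bad checksums and 1 distinct event count would match what I saw by hand; anything else would be a reason to stop before mounting.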
Further actions:
umount /data
Unmount the raid array before performing fsck.
fsck /dev/md0
It all came out clean, no problem whatsoever.
mount -o noatime /dev/md0 /data
Finally, mount the device for good.
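For anyone repeating this, a more cautious variant would be a read-only pass first (fsck -n, which for ext filesystems opens the device read-only and answers "no" to every question), followed by a look at the exit status before letting fsck change anything. This is a sketch under those assumptions rather than what I ran; describe_fsck_status is a made-up helper that decodes the exit-status bitmask documented in fsck(8).

```shell
#!/bin/sh
# Hypothetical helper: turn an fsck exit status into readable labels.
# Bit values follow the fsck(8) man page.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "clean"
        return 0
    fi
    out=""
    [ $((status & 1)) -ne 0 ]   && out="$out errors-corrected"
    [ $((status & 2)) -ne 0 ]   && out="$out reboot-needed"
    [ $((status & 4)) -ne 0 ]   && out="$out errors-uncorrected"
    [ $((status & 8)) -ne 0 ]   && out="$out operational-error"
    [ $((status & 16)) -ne 0 ]  && out="$out usage-error"
    [ $((status & 32)) -ne 0 ]  && out="$out canceled"
    [ $((status & 128)) -ne 0 ] && out="$out library-error"
    # Strip the leading space left by the first appended label.
    echo "${out# }"
}

# Example use on the unmounted array (read-only pass, nothing is modified):
#   fsck -n /dev/md0
#   describe_fsck_status $?
```

If the read-only pass reports anything other than clean, that is the moment to snapshot or copy what you can before running a repairing fsck.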