Replacing failed disks in an MD RAID on Ubuntu Server

Hello everyone! It's been a while since my last blog update. This is one I have been wanting to write, but I wanted to wait until I actually had to do it so I could show real-world examples, and boy, is this one for the record books.

So, my secondary KVM server has a 5-disk hot-swappable chassis that I bought on NewEgg about 7 years ago. It holds 5 SATA disks, and the chassis connects them to the 5 SATA ports on the motherboard. This allows me to hot swap the hard drives if they ever fail, and well, two of them did about a month ago. The system is set up as a RAID-5: four of the disks are active members of the array and the 5th disk is a hot spare. Well, disks 4 and 5 failed together. Basically, disk 4 failed, and while disk 5 was rebuilding to take its place, it failed too. Luckily the array was still good (a RAID-5 keeps running with one member missing), but now I need to replace the failed disks.

I bought 2 new 2TB disks from NewEgg and started by pulling the failed disks out of the chassis. Unfortunately, the system does not automatically detect drives being installed or removed, so I had to run the following commands to rescan the SCSI hosts and get the change recognized by the system.

sudo -i
echo "0 0 0" >/sys/class/scsi_host/host0/scan
echo "0 0 0" >/sys/class/scsi_host/host1/scan
echo "0 0 0" >/sys/class/scsi_host/host2/scan
echo "0 0 0" >/sys/class/scsi_host/host3/scan
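Since the same four lines come up every time a disk is pulled or inserted, they can be wrapped in a small loop. This is only a sketch: the function name `rescan_scsi_hosts` is made up for illustration, and it assumes every entry under /sys/class/scsi_host should be rescanned with the same "0 0 0" value used above.

```shell
# Hypothetical helper: write "0 0 0" to the scan file of every SCSI host.
# The optional first argument overrides the sysfs directory (useful for a dry run).
rescan_scsi_hosts() {
    root="${1:-/sys/class/scsi_host}"
    for scan in "$root"/host*/scan; do
        # Skip the unexpanded pattern when no host directories exist
        [ -e "$scan" ] || continue
        echo "0 0 0" > "$scan"
    done
}
```

Run it from the root shell as `rescan_scsi_hosts`. It also picks up hosts beyond host3 if the machine has more controllers.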

I then listed the /dev directory to make sure that /dev/sdd and /dev/sde were no longer showing up, since the disks had been removed. I also checked the RAID configuration to make sure that they were no longer listed:

mdadm -D /dev/md0
mdadm -D /dev/md1

Both arrays no longer listed the failed disks, so I’m ready to physically add the new disks.
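The /dev check can also be scripted instead of read off a directory listing. A minimal sketch, assuming the device names from this post; the function name `disks_gone` is hypothetical:

```shell
# Hypothetical helper: succeeds only if none of the given paths
# exists as a block device node.
disks_gone() {
    for dev in "$@"; do
        if [ -b "$dev" ]; then
            echo "$dev is still present" >&2
            return 1
        fi
    done
    return 0
}
```

For example, `disks_gone /dev/sdd /dev/sde && echo "safe to continue"` prints the message only when both device nodes have disappeared.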

I installed the new disks. Now I need to re-scan the bus for Linux to see the disks:

echo "0 0 0" >/sys/class/scsi_host/host0/scan
echo "0 0 0" >/sys/class/scsi_host/host1/scan
echo "0 0 0" >/sys/class/scsi_host/host2/scan
echo "0 0 0" >/sys/class/scsi_host/host3/scan

I then listed the /dev directory and I can now see the new disks, sdd and sde.

I then need to make sure that they have the correct format and partition layout to work with my existing array. For this I used the sfdisk command to copy a partition layout and then apply it to the new disks:

sfdisk -d /dev/sda > partitions.txt
sfdisk /dev/sdd < partitions.txt
sfdisk /dev/sde < partitions.txt
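Before adding the partitions to the arrays, it can be worth confirming that the copied layout really matches the source disk. The sketch below compares two `sfdisk -d` dumps after removing the lines that name the device. The function names are made up for illustration, and it assumes the dump format of a reasonably recent sfdisk (a `label-id:` line, a `device:` line, then one line per partition).

```shell
# Hypothetical helper: strip the device-specific parts of an `sfdisk -d` dump.
# $1 is the device the dump came from, e.g. /dev/sda; the dump is read from stdin.
normalize_layout() {
    sed -e '/^label-id:/d' -e '/^device:/d' -e "s|^$1|DISK|"
}

# Hypothetical helper: succeeds when both disks report the same partition layout.
same_layout() {
    a=$(sfdisk -d "$1" | normalize_layout "$1")
    b=$(sfdisk -d "$2" | normalize_layout "$2")
    [ "$a" = "$b" ]
}
```

As root, `same_layout /dev/sda /dev/sdd && echo "layouts match"` would then confirm the copy for the first new disk.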

If I do another listing of the /dev directory I can see the new drives have the partitions. I’m now ready to add the disks back to the array:

mdadm --add /dev/md0 /dev/sdd2
mdadm --add /dev/md1 /dev/sdd3
mdadm --add-spare /dev/md0 /dev/sde2
mdadm --add-spare /dev/md1 /dev/sde3

I then check the status of the array to make sure it is rebuilding:

mdadm -D /dev/md0
mdadm -D /dev/md1

The system showed it was rebuilding the arrays, and at the current rate it was going to take about a day.
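For a rebuild that runs this long, a one-line summary is handier than the full `mdadm -D` output. The kernel also reports rebuild progress in /proc/mdstat, and the sketch below pulls out just the array name, the percentage done, and the estimated time to finish. The function name is hypothetical, and the parsing assumes the usual mdstat layout with a `recovery = N%` or `resync = N%` line under each rebuilding array.

```shell
# Hypothetical helper: print "<array> <percent> <finish>" for each array
# that is currently rebuilding. Reads /proc/mdstat unless a file is given.
md_rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { array = $1 }
        /(recovery|resync) *=/ {
            pct = ""; fin = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/) pct = $i
                if ($i ~ /^finish=/) fin = substr($i, 8)
            }
            # Skip lines without a percentage, such as a delayed resync
            if (pct != "") print array, pct, fin
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling `md_rebuild_progress` prints nothing once every array has finished syncing.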

The next day I went to check the status, and lo and behold, I found out that disk 5 (sde) had failed and was no longer reporting in. I got a bad disk shipped to me. So I contacted NewEgg and they sent me out a replacement as soon as I sent them the failed disk. Luckily it was the hot spare, so removing it or adding it back didn’t have any impact on the system, but I did run the following commands to remove the spare from the arrays and then re-scanned the bus so that the disk was fully removed from the server:

sudo mdadm --remove /dev/md0 /dev/sde2
sudo mdadm --remove /dev/md1 /dev/sde3
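# "sudo echo ... > file" fails for a normal user because the redirect runs without sudo, so write through tee
echo "0 0 0" | sudo tee /sys/class/scsi_host/host0/scan >/dev/null
echo "0 0 0" | sudo tee /sys/class/scsi_host/host1/scan >/dev/null
echo "0 0 0" | sudo tee /sys/class/scsi_host/host2/scan >/dev/null
echo "0 0 0" | sudo tee /sys/class/scsi_host/host3/scan >/dev/null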
sudo mdadm -D /dev/md0
sudo mdadm -D /dev/md1

mdadm reported that there was no longer a spare available, and the listing of the /dev directory no longer showed /dev/sde. A week later, I got my new spare from NewEgg, installed it, and ran the following:

sudo -i
echo "0 0 0" >/sys/class/scsi_host/host0/scan
echo "0 0 0" >/sys/class/scsi_host/host1/scan
echo "0 0 0" >/sys/class/scsi_host/host2/scan
echo "0 0 0" >/sys/class/scsi_host/host3/scan
ls /dev
sfdisk /dev/sde < partitions.txt
ls /dev
mdadm --add-spare /dev/md0 /dev/sde2
mdadm --add-spare /dev/md1 /dev/sde3
mdadm -D /dev/md0
mdadm -D /dev/md1

This added the disk and then added it as a hot spare for the arrays. Since it’s a hot spare, it does not need to resync.

And there you have it: how to replace the disks in an MD RAID on Ubuntu.
