Posts Tagged by High Availability

The RAID5 Write Hole

The latest edition of the venerable UNIX and Linux System Administration Handbook (Nemeth et al.) has a good section discussing the “RAID5 Write Hole”:

Finally, RAID 5 is vulnerable to corruption in certain circumstances. Its incremental updating of parity data is more efficient than reading the entire stripe and recalculating the stripe’s parity based on the original data. On the other hand, it means that at no point is parity data ever validated or recalculated. If any block in a stripe should fall out of sync with the parity block, that fact will never become evident in normal use; reads of the data blocks will still return the correct data.

Only when a disk fails does the problem become apparent. The parity block will likely have been rewritten many times since the occurrence of the original desynchronization. Therefore, the reconstructed data block on the replacement disk will consist of essentially random data.
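The failure mode described in the quote can be sketched with a toy three-disk stripe. This is only an illustration under simplifying assumptions (two data blocks, one parity block, four-byte blocks held in memory); the names xor_blocks and incremental_parity are made up for this example and do not come from any real RAID implementation.

```python
from functools import reduce
from operator import xor


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def incremental_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Small-write parity update: fold the data change into the old parity
    without reading the other data blocks in the stripe."""
    return xor_blocks(old_parity, old_data, new_data)


# A stripe with two data blocks and one parity block.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# Normal write to d0: data and parity are both updated.
new_d0 = b"CCCC"
parity = incremental_parity(parity, d0, new_d0)
d0 = new_d0
assert parity == xor_blocks(d0, d1)   # same result as a full recalculation
assert xor_blocks(d0, parity) == d1   # d1 could be rebuilt if its disk failed

# Write hole: the next write to d0 reaches the disk, but the matching
# parity update is lost (for example, power fails between the two writes).
d0 = b"DDDD"                          # parity is now stale

# Ordinary reads never consult parity, so d0 and d1 still read back fine.
# The damage only shows when disk 1 fails and is rebuilt from d0 and parity.
rebuilt_d1 = xor_blocks(d0, parity)
assert rebuilt_d1 != d1               # the rebuilt block is not the original data
```

Nothing in the normal read or write path ever compares the parity block against the data blocks, which is why the stale parity goes unnoticed until a rebuild.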

Further reading is available in the BAARF archive (Battle Against Any Raid 5), including why RAID10 and RAID3 should be chosen over RAID5. And then there’s ZFS and RAID-Z, which is designed to avoid the write hole.

Troubleshooting Linux HA (High Availability)

When Linux HA (High Availability) is set up, each machine will have its own physical address, and exactly one machine should also hold the virtual address. This can be checked by running ip addr on each machine:

machine 1
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:09:3d:12:af:77 brd ff:ff:ff:ff:ff:ff
    inet 999.99.133.12/23 brd 211.29.133.255 scope global eth0

machine 2
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:09:3d:12:ba:ef brd ff:ff:ff:ff:ff:ff
    inet 999.99.133.13/23 brd 211.29.133.255 scope global eth0
    inet 999.99.133.19/23 brd 211.29.133.255 scope global secondary eth0:1

If this isn’t the case, run hb_takeover on the appropriate machine (which one depends on the status of the underlying application), e.g. /usr/lib64/heartbeat/hb_takeover
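The manual check above could be scripted along these lines. This is a rough sketch, not part of Heartbeat: the function names are invented for the example, the sample text is the machine 2 output shown earlier, and it assumes the virtual address is 999.99.133.19 as in that (sanitized) listing. Addresses are compared as plain strings because the sanitized addresses are not valid IPv4.

```python
import re


def addresses_on_host(ip_addr_output: str) -> list[str]:
    """Return the IPv4 addresses found on 'inet' lines of ip addr output,
    without the prefix length. 'inet6' lines are ignored."""
    return re.findall(r"^\s*inet\s+([0-9.]+)", ip_addr_output, flags=re.MULTILINE)


def holds_virtual_address(ip_addr_output: str, virtual_ip: str) -> bool:
    """True if this machine's ip addr output lists the virtual address."""
    return virtual_ip in addresses_on_host(ip_addr_output)


# Sample input: the machine 2 listing from this post.
MACHINE_2 = """\
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:09:3d:12:ba:ef brd ff:ff:ff:ff:ff:ff
    inet 999.99.133.13/23 brd 211.29.133.255 scope global eth0
    inet 999.99.133.19/23 brd 211.29.133.255 scope global secondary eth0:1
"""

VIRTUAL_IP = "999.99.133.19"  # assumption: taken from the listing above

if holds_virtual_address(MACHINE_2, VIRTUAL_IP):
    print("this machine holds the virtual address")
else:
    print("virtual address missing here; check the other node")
```

On a live machine the text would come from the real command instead of a pasted string, for example subprocess.run(["ip", "addr"], capture_output=True, text=True).stdout. Whether to run hb_takeover afterwards is still a judgment call that depends on the state of the application.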