[ale] LVM2 thinpools, thinLV, gluster, RAID, and a big warning

Jim Kinney jim.kinney at gmail.com
Wed Aug 16 17:33:24 EDT 2017


I just lost 62TB of data in a drive array.

"But Jim!?", you ask knowing Jim isn't usually totally stupid enough to
paint himself into a corner and set up the conditions for a data loss
of any size, much less one of 62TB, "How did this happen?"

Step 1. Use a known good, respected RAID card from LSI (2108, SAS2,
8-lane external, RAID6 support - have 2 others in production)
Step 2. Use a mix of known decent hard drives for their size and class
(Toshiba, HGST in 4TB SAS2) - 28 total.
Step 3. RAID the pile into a RAID6 array inside a redundant SAS2 JBOD
box
Step 4. Use LVM2 to then split the 95TB space into multiple 10TB
thinpools, each thinpool holding a single thin logical volume of 10TB.
Step 5. Build another one of these systems to match
Step 6. Use Gluster to create nearly 100TB of redundant storage split
between the two machines.
Step 7. Use the storage cluster for a bit over a year until a single
drive dies early.
Step 8. Replace the failed drive with a new one
Step 9. Watch in horror as the RAID subsystem locks up and the thinpool
metadata gets scrambled
Step 10. Diagnose the failure:

Step 0 {
Weeks of study into the complete workings of LVM, RAID and Gluster
prior to the hardware purchase. The RAID card was already in use in 2
other systems with great success. Dead drives were easily popped out,
new ones slammed in, auto-recovery commences and a day or two later
all is fine.

Gluster is a pretty useful way to provide failover ability in a storage
cluster. Not too hard to set up. Has a few performance gotchas but
overall, plenty of capability for the need.

LVM2 is a pretty well known and understood entity even with the
addition of thinpool (recommended by gluster as a way to easily extend
space to bricks "on the fly"). A thin pool stores physical extents for
logical volumes and allocates them only when written. Thin logical
volumes are sparse and can have a virtual size larger than the actual
space. A thin pool can be expanded without touching the thin volumes.
Yeah, sounds like a great match.
}
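
For anyone who wants the thin pool idea in concrete form, here is a toy
model in Python. It is only a sketch of the concept described above
(allocate on first write, virtual size larger than real space, grow the
pool without touching the volumes). The class and method names are made
up for illustration and have nothing to do with how LVM or dm-thin is
actually implemented.

```python
class ThinPool:
    """Toy thin pool: hands out a physical extent only on first write."""

    def __init__(self, physical_extents):
        self.physical_extents = physical_extents  # real space backing the pool
        # (volume name, virtual extent) -> physical extent.
        # This dict stands in for the thin pool metadata.
        self.mapping = {}

    def allocated(self):
        return len(self.mapping)

    def extend(self, extra_extents):
        # Growing the pool leaves every existing mapping untouched.
        self.physical_extents += extra_extents

    def write(self, volume_name, virtual_extent):
        key = (volume_name, virtual_extent)
        if key not in self.mapping:
            if self.allocated() >= self.physical_extents:
                raise RuntimeError("thin pool is full")
            self.mapping[key] = self.allocated()  # next unused physical extent
        return self.mapping[key]


class ThinVolume:
    """Toy thin volume: a virtual size plus a pointer to its pool."""

    def __init__(self, name, pool, virtual_extents):
        self.name = name
        self.pool = pool
        self.virtual_extents = virtual_extents

    def write(self, virtual_extent):
        if not 0 <= virtual_extent < self.virtual_extents:
            raise IndexError("write beyond end of volume")
        return self.pool.write(self.name, virtual_extent)


pool = ThinPool(physical_extents=4)
vol = ThinVolume("brick1", pool, virtual_extents=10)  # 10 virtual, 4 real
vol.write(0)
vol.write(7)
vol.write(7)  # rewriting an extent allocates nothing new
print(pool.allocated())
pool.extend(4)
print(pool.physical_extents)
```

In this toy, everything that ties a volume to its data lives in
`mapping`. Throw that dict away and the extents are still "on disk" but
nothing knows which volume they belong to, which is the position a
scrambled thinpool metadata LV leaves you in.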

I didn't read the section in the docs about deleting thin pools and thin
volumes, as I was _creating_ them. Otherwise I would have seen the blurb:
"vgcfgbackup does not back up thin pool metadata."

Hmm. That is not good. Apparently there is only a SINGLE copy of
thinpool metadata and there's no way to back it up. WTF?!?!?

LVM2 does have a sequence of backup and archival of metadata but it's
not useful for thinpool. It looks like it might be possible to do
something funky-ugly like swap the thinpool metadata with/into a new LV,
take a snapshot of the new LV, then replay the snapshot back to the
original thinpool, but no one actually has any real "yeah, this works"
process mapped out anywhere.

There are a few tools to do some repair (lvconvert --repair VG/thinpool)
and they have greatly encouraging words like "If the repair does not
work, the thin pool LV and its thin LVs are lost." Yep. That sounds
more like reality.

Apparently, LSI MegaRAID needs a firmware update to play nice with the
kernels in CentOS 7. Hmm. That was updated when the card went in. So
the new firmware causes behavior different from the documented drive
replacement methodology. Nice.

The only bright spot in this is the other half of the storage cluster
mirror is doing just fine. But now I get to format 90+TB and copy back
over 60+TB and then tell gluster to take over and resync everything. At
least I have a 40Gbps ethernet connection between the two machines. (5
days of file copies with no one using the remaining mirror - not going
to happen).
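
A quick back-of-the-envelope check on those numbers, as a Python
sketch. The assumptions are mine: exactly 60TB (the real figure is
"60+"), decimal units, and no protocol overhead. The function names are
just illustrative.

```python
def transfer_hours(terabytes, gigabits_per_second):
    """Hours to move the data if the link were the only limit."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 3600


def implied_rate_mb_per_s(terabytes, days):
    """Average throughput implied by finishing the copy in `days`."""
    return terabytes * 1e12 / (days * 86400) / 1e6


line_rate_hours = transfer_hours(60, 40)
five_day_rate = implied_rate_mb_per_s(60, 5)
print(f"{line_rate_hours:.1f} hours at full 40Gbps line rate")
print(f"{five_day_rate:.0f} MB/s average implied by a 5 day copy")
```

Under those assumptions the link alone could move the data in a few
hours, while a 5 day copy works out to roughly 139 MB/s sustained, so
the estimate presumably reflects file-level copy speed and the disks
rather than the network.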

My new mantra: Always have enough decent Scotch on hand to handle any
occasion.

The warning: Always break things before you depend on them. It's easier
to fix stuff that you already know the insides of when you have a total
failure. Read ALL the docs. When it's infrastructure level stuff, get
real experience breaking things every way possible and then fixing them
before you get stuck having to do it live. Backups/duplicates are a
crutch that is essential to have.
-- 
James P. Kinney III

Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain

http://heretothereideas.blogspot.com/