Explain about RAID5 please

I keep hearing about the danger of having a second drive fail during a RAID5 rebuild after the first drive fails. For a big multi-drive array, that problem is obvious with a little bit of math. For a smaller array, if you want to keep the server accepting traffic through the rebuild, that will slow the rebuild a lot and stressfully pound the heck out of the remaining drives, which doesn't sound good either.

But I've heard not to do it even with a small array and even if you're willing to take the server offline for rebuild. Is that really so bad? I'm imagining the common 4x4TB or 4x6TB servers (Hetzner SX-61 etc). To do a rebuild you basically have to make one sequential read pass through each of the surviving drives. Not much seeking and not that many hours of read operations. In fact the same reading operation on each drive that a raid-1 or raid-10 would require. Is it that bad? Should 4-drive servers be set up some other way, or even avoided? Thanks.
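
As a rough sanity check on the "not that many hours" intuition, here is a back-of-the-envelope rebuild-time estimate. The 150 MB/s sustained throughput is an assumption for a 6 TB SATA drive; a rebuild under live traffic will be slower than this.

    # rough lower bound for one full sequential pass over a 6 TB member
    awk 'BEGIN { bytes = 6e12; rate = 150e6; printf "%.1f hours\n", bytes / rate / 3600 }'
    # prints about 11.1 hours at these assumed numbers; add contention and it stretches fast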

Comments

  • HyperFilter_Official Member, Patron Provider

    If you want to use RAID-5, I'd recommend RAID-6 instead, so you get "double redundancy" compared to RAID-5: two drives can fail before everything dies. :)

    In my opinion, the best setup is RAID 10, as it gives you both read/write speed and redundancy, with double the space of a plain mirror (a bit less than RAID-5), while being less problematic and less slow during rebuilds as well.

    And yes, RAID-6 rebuilding is slower than RAID-5, but it is more reliable/redundant. And yes, on four drives RAID-6 gives you roughly 2x read speed versus roughly 3x for RAID-5, but reliability matters more in this scenario.

    But hey! This is just me. :)

  • On a 4 drive system, raid 6 and raid 10 eat half your disk space, so you end up with 1.5x the cost per TB of raid 5. That's significant! Might be better to use raid-6 with 8 or so drives.
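
    To make that arithmetic explicit (the 4 x 4 TB figures are just the example from the original post; filesystem and metadata overhead are ignored):

    awk 'BEGIN { d=4; s=4; printf "raid5: %d TB   raid6: %d TB   raid10: %d TB\n", (d-1)*s, (d-2)*s, d*s/2 }'
    # 12 TB vs 8 TB usable, i.e. raid-5 stores 1.5x as much on the same hardware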

  • It's not about another drive failure. It's about another sector failure. That's a totally different thing.

    If you have an unreadable sector on a RAID-5 array, it will simply be marked as a bad sector, and the controller will write the correct data, calculated from the other drives, back into a spare sector. So even if such an array appears to work just fine, that doesn't mean it has never hit such a problem; it's just that the problem is fixable, and actually gets fixed silently.

    The problem appears when you have a full drive failure and start rebuilding the array. Now if an unreadable sector is found, no data can be recovered for that particular position. And you lose your whole array as a result.

    A second drive failure may not be common, but an unreadable sector is very likely to happen. With larger drives, it happens even more often.

    That being said, RAID-5 will still protect your data from unreadable sectors to some extent, as I said above.

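    A rough model of why that matters for the 4 x 6 TB case discussed above: a rebuild has to read the three surviving drives end to end, and the often-quoted consumer-drive spec is one unrecoverable read error per 1e14 bits. Both that spec and the independence assumption are rough (real drives frequently do much better), so treat this as an illustration, not a prediction:

    # chance of hitting at least one URE while reading 3 surviving 6 TB drives
    awk 'BEGIN { bits = 3 * 6e12 * 8; p = 1e-14; printf "%.0f%%\n", 100 * (1 - exp(-bits * p)) }'
    # prints about 76% under these (pessimistic) assumptions
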
  • So the question is more like, how important do you think your data is?

    All useless crap -- RAID-0

    Something I'd like to keep for a long time -- RAID-5

    Super important -- ZFS RAID-Z3 + geo-remote 3x backups

  • willie Member
    edited January 2017

    Why do you lose the whole array from an unreadable sector, instead of just losing that sector? You have millions of files; having one of them partially clobbered doesn't sound like the end of the world. I've heard that the majority of data written to computer disks these days is never read back (stuff like CCTV footage and logs). A sector failure in a video archive means a dropped frame or something like that, even if you do read it back.

    Is there some reason the raid system can't just keep rebuilding after skipping the bad sector?

  • msg7086 said:

    So the question is more like, how important do you think your data is?

    Yes, that's about right. But I keep hearing raid-5 is "run away!" once the amount of data is large enough, even if the # of drives is small. The issue of bad sectors is relevant but there's some part of the picture that I'm missing.

  • pbgben Member, Host Rep

    @willie said:
    I keep hearing about the danger of having a second drive fail during a RAID5 rebuild after the first drive fails. For a big multi-drive array, that problem is obvious with a little bit of math. For a smaller array, if you want to keep the server accepting traffic through the rebuild, that will slow the rebuild a lot and stressfully pound the heck out of the remaining drives, which doesn't sound good either.

    But I've heard not to do it even with a small array and even if you're willing to take the server offline for rebuild. Is that really so bad? I'm imagining the common 4x4TB or 4x6TB servers (Hetzner SX-61 etc). To do a rebuild you basically have to make one sequential read pass through each of the surviving drives. Not much seeking and not that many hours of read operations. In fact the same reading operation on each drive that a raid-1 or raid-10 would require. Is it that bad? Should 4-drive servers be set up some other way, or even avoided? Thanks.

    And then you have to worry about the raid controller going nutz and corrupting the config = no more data.

  • pbgben said: And then you have to worry about the raid controller going nutz and corrupting the config = no more data.

    Probably software raid in the case of these cheap 4 drive systems.

  • rm_ IPv6 Advocate, Veteran

    msg7086 said: The problem appears when you have a full drive failure and start rebuilding the array. Now if an unreadable sector is found, no data can be recovered for that particular position. And you lose your whole array as a result.

    You can still recover it: by manually rewriting the unreadable sector(s), after which they usually become readable again, and then restarting/resuming the rebuild. However, that means some data corruption (even if minor), and it can be quite a messy and involved process. And I'm not sure whether hardware RAID controllers even let you access the disks directly to try that; likely not. You could take the affected drive out and do it on another computer, but as you understand that's not an option on a rented server in a DC.

    To minimize the risk of unreadable sectors during a rebuild, it's advised to do regular "patrol reads" (basically reading through all disks). In mdadm software RAID terms, that means scheduling the "check" action (in fact Debian auto-schedules that for you monthly). That way, if an unreadable sector appears, it can get rewritten from the other drives while the array is still fully up.
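
    For mdadm specifically, a minimal sketch of both things described above. md0, sdb and the sector number are placeholders, and hdparm --write-sector zeroes the sector, so only use it on an LBA you already know is unreadable:

    # trigger a "patrol read" / consistency check by hand
    echo check | sudo tee /sys/block/md0/md/sync_action
    cat /proc/mdstat                              # watch progress
    cat /sys/block/md0/md/mismatch_cnt            # mismatches found by the last check

    # Debian/Ubuntu ship a helper that does the same for every array
    sudo /usr/share/mdadm/checkarray --all

    # manually forcing a known-bad SATA sector to reallocate (destroys that sector's data)
    sudo hdparm --read-sector 123456789 /dev/sdb
    sudo hdparm --yes-i-know-what-i-am-doing --write-sector 123456789 /dev/sdb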

  • Thanks. Should an unreadable sector that gets rewritten like that show up in SMART? That would mean it's time to ask for a replacement drive, I think (unrecoverable ECC error or whatever it's called).

    Is data corruption something that actually happens on modern drives? That is, you get actual wrong data rather than a read failure. I'd hoped that drive ECC could detect errors reliably enough that silently wrong data is extremely unlikely, even if the amount it can correct is smaller.

  • mailcheap Member, Host Rep

    Like @rm_ said, the automatic mdadm check is important; make sure it's configured properly and actually runs (in Debian, /etc/cron.d/mdadm). Another thing to do is active disk monitoring with SMART: do a short test daily and check the non-medium error count (above 0: replace the drive) and the error counter log (4-5 corrected errors are okay). A sketch of such a schedule follows after this comment.

    SMART output from a disk that was soon to fail completely

    Error counter log:
               Errors Corrected by           Total   Correction     Gigabytes    Total
                   ECC          rereads/    errors   algorithm      processed    uncorrected
               fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
    read:          0      285         0       285     564165      36486.352           0
    write:         0     2319         0      2319    1645783        548.091           0
    verify:        0        0         0         0     400050          0.000           0
    
    Non-medium error count:       37
    

    SMART output from a healthy disk

    Error counter log:
               Errors Corrected by           Total   Correction     Gigabytes    Total
                   ECC          rereads/    errors   algorithm      processed    uncorrected
               fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
    read:          0        4         0         4     140735      36484.116           0
    write:         0        0         0         0     377682        540.429           0
    verify:        0        0         0         0       4867          0.000           0
    
    Non-medium error count:        0
    

    Best Regards,


    Pavin.

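    For reference, a daily-short/weekly-long self-test schedule like Pavin describes can be set via smartd. This is a minimal sketch; the schedule regex and the service name are assumptions based on the stock smartd.conf examples, so check smartd.conf(5) on your own system:

    # short self-test every day at 02:00, long test Saturdays at 03:00, mail root on trouble
    echo 'DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m root' | sudo tee -a /etc/smartd.conf
    sudo systemctl restart smartd    # the unit may be called smartmontools on Debian/Ubuntu
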
  • I don't see "non-medium error count" on Toshiba or Seagate Constellation disks (the two kinds that are in my servers) but I do see quite a lot of ECC and other numbers of possible concern on the Seagates:

    $ sudo smartctl -x /dev/sda
    Model Family:     Seagate Constellation ES.3
    Device Model:     ST1000NM0033-9ZM173
    User Capacity:    1,000,204,886,016 bytes [1.00 TB]
    ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
      1 Raw_Read_Error_Rate     POSR--+  078   063   044    -    82754743
      7 Seek_Error_Rate         POSR--   093   060   030    -    2000290303
    195 Hardware_ECC_Recovered  -O-RC-   053   015   000    -    82754743
    196 Reallocated_Event_Count -O--CK   000   000   000    -    32125
    
    
    $ sudo smartctl -x /dev/sdb
      1 Raw_Read_Error_Rate     POSR--+  083   063   044    -    234778050
      7 Seek_Error_Rate         POSR--   093   060   030    -    1982716315
    195 Hardware_ECC_Recovered  -O-RC-   054   014   000    -    234778050
    196 Reallocated_Event_Count -O--CK   000   000   000    -    15869
    

    The two drives both have around 20K power-on hours and all the numbers above are fairly similar between the drives, so maybe it's ok, I never know how to tell.

    All drives report passing self-tests. Hetzner (Toshiba) ran long self-tests before delivering the server but Online (Seagate) only ran short ones. I'm running the long ones now. I hadn't thought of running a daily short test instead of just monitoring the stats: sounds like a good idea, if it doesn't stress the drives?

    Thanks!
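
    For anyone following along, the long tests mentioned here are just the following (device names are examples):

    sudo smartctl -t long /dev/sda        # kicks off a full-surface self-test in the background
    sudo smartctl -l selftest /dev/sda    # later: look for "Completed without error"
    sudo smartctl -H /dev/sda             # overall health verdict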

  • @willie said:
    Thanks. Should an unreadable sector that gets rewritten like that show up in SMART? That would mean it's time to ask for a replacement drive, I think (unrecoverable ECC error or whatever it's called).

    Is data corruption something that actually happens on modern drives? That is, you get actual wrong data rather than a read failure. I'd hoped that drive ECC could detect errors reliably enough that silently wrong data is extremely unlikely, even if the amount it can correct is smaller.

    1. Sometimes yes. No.

    2. Yes. Yes.

    UREs (unrecoverable read errors) happen quite often. Some say you can hit 1 URE for every 12.5TB of data read. If you take that as a reason to ask for a replacement drive, your provider may not be happy. Usually the disk will try to overwrite the sector itself and see if that works -- if it does, nothing happens; if not, you'll see the reallocated sector count increase.

    We have some RAID-5 arrays in our lab DC, and we do observe that some drives carry a positive reallocated sector count even though a re-scan of those bad sectors indicates no problem; and these arrays have never failed.

    Data corruption does happen, and it is called bit rot. In most cases it should be detected and fixed by ECC, but if multiple bit errors happen at once, it's possible for bad data to reach your actual files -- that's why people sometimes recommend checksumming filesystems like ZFS. Yes, it's extremely unlikely, but it does happen.
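
    The 12.5 TB figure above follows directly from the commonly quoted consumer-drive spec of one unrecoverable read error per 1e14 bits (enterprise drives are usually specced an order of magnitude better, around 1e15):

    awk 'BEGIN { printf "%.1f TB of reads per expected URE\n", 1e14 / 8 / 1e12 }'
    # 12.5 TB; at a 1e15 spec it would be 125 TB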

  • @willie - since this thread is now getting into SMART territory, here are a couple of useful (previous) threads where there is some relevant information around the SMART values etc.

    https://www.lowendtalk.com/discussion/93554/trying-to-understand-smartctl

    https://lowendtalk.com/discussion/100362/should-these-i-o-errors-concern-me

    I'm not sure about attributes 195/196 (I don't see them on my Seagates) but 197 and 198 are the ones to watch (someone please corroborate!).

    Also, I think 195 is "benign" (internal drive ECC correction stats), but 196 should be concerning. I'm a bit surprised at such a high reallocation count (this one says event, not sector; I'm not sure what the difference is or how to interpret it) and I would be a bit wary.

    Also, running an offline and a short test frequently shouldn't really hurt things, but beware that they will (typically) NOT detect the kinds of drive failures that you want to know about (like reallocated sectors). I'd say a once-a-month long test is a good idea (unless you're happier doing it more often, like weekly during off hours for your server).

    My $.02.

  • mailcheap Member, Host Rep

    @willie said:
    I don't see "non-medium error count" on Toshiba or Seagate Constellation disks (the two kinds that are in my servers) but I do see quite a lot of ECC and other numbers of possible concern on the Seagates:

    $ sudo smartctl -x /dev/sda
    Model Family:     Seagate Constellation ES.3
    Device Model:     ST1000NM0033-9ZM173
    User Capacity:    1,000,204,886,016 bytes [1.00 TB]
    ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
      1 Raw_Read_Error_Rate     POSR--+  078   063   044    -    82754743
      7 Seek_Error_Rate         POSR--   093   060   030    -    2000290303
    195 Hardware_ECC_Recovered  -O-RC-   053   015   000    -    82754743
    196 Reallocated_Event_Count -O--CK   000   000   000    -    32125
    
    
    $ sudo smartctl -x /dev/sdb
      1 Raw_Read_Error_Rate     POSR--+  083   063   044    -    234778050
      7 Seek_Error_Rate         POSR--   093   060   030    -    1982716315
    195 Hardware_ECC_Recovered  -O-RC-   054   014   000    -    234778050
    196 Reallocated_Event_Count -O--CK   000   000   000    -    15869
    

    The two drives both have around 20K power-on hours and all the numbers above are fairly similar between the drives, so maybe it's ok, I never know how to tell.

    All drives report passing self-tests. Hetzner (Toshiba) ran long self-tests before delivering the server but Online (Seagate) only ran short ones. I'm running the long ones now. I hadn't thought of running a daily short test instead of just monitoring the stats: sounds like a good idea, if it doesn't stress the drives?

    Thanks!

    Errors corrected by ECC is a good stat if there's no non-medium error count. A few corrected read errors are okay, but write errors are almost certainly bad.

    Mostly I keep these SMART stats (run daily) as a secondary information source for when mdadm reports any problems. Configure SMART & mdadm with email alerts (with an mdadm full check every month). Monthly SMART long tests aren't really necessary when the monthly mdadm check is configured.

    Pavin.

  • Thanks. I see my Hetzner server is set up to run a checkarray weekly, but I'll add some more checks (daily SMART short tests, and mdadm scans). The thing that scared me about the Seagate ECC attribute #195 was the "value" field of 053. That field is usually normalized to 100 when the drive is completely healthy, and decreases as errors are found. OTOH the Toshiba drives don't report attribute 195 at all.

    I'm likely to cancel the server with the Seagate drives at the end of February anyway so maybe I'll just live with the issue and be sure to back up anything important from it. It's an online.net promo server and those would seem to be super scarce and wonderful, but in reality there's comparably priced servers at Hetzner.

  • Just run RAID 50 past 4 or 5 drives. (Two RAID 5 arrays in RAID 0)
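
    With mdadm that is simply two RAID-5 legs striped together; device names below are placeholders, and this obviously needs eight spare disks:

    sudo mdadm --create /dev/md1 --level=5 --raid-devices=4 /dev/sd[b-e]
    sudo mdadm --create /dev/md2 --level=5 --raid-devices=4 /dev/sd[f-i]
    sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/md1 /dev/md2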

  • mailcheap Member, Host Rep

    @willie said:
    Thanks. I see my Hetzner server is set up to run a checkarray weekly, but I'll add some more checks (daily SMART short tests, and mdadm scans). The thing that scared me about the Seagate ECC attribute #195 was the "value" field of 053. That field is usually normalized to 100 when the drive is completely healthy, and decreases as errors are found. OTOH the Toshiba drives don't report attribute 195 at all.

    I'm likely to cancel the server with the Seagate drives at the end of February anyway so maybe I'll just live with the issue and be sure to back up anything important from it. It's an online.net promo server and those would seem to be super scarce and wonderful, but in reality there's comparably priced servers at Hetzner.

    Didn't know Online used Seagates; I have a Kimsufi box w/ HGST. But then again, I had 2 new HGSTs fail (though not completely) on me last year. The SMART data above is from one of those.

    Just to add: mdadm always takes precedence over SMART in a RAID config, and it'll be the one that reports problems first, since it's managing reads/writes in real time (and keeps count of bad sectors). In the SMART data above for the soon-to-fail disk, mdadm was the first to remove the disk due to too many write errors, whereas SMART tests continued to pass. If Sundays are low-load and the RAID has parity, feel free to run that mdadm checkarray cron every Sunday instead of just the first Sunday (a cron sketch follows after this comment). This way, even if a disk fails, you can be sure the rebuild won't encounter any bad sectors; it also keeps count of the bad sectors that had to be rewritten, which in turn gives mdadm more data on when to kick a potentially bad drive out of the array -- the sooner the better!

    Best Regards,


    Pavin.
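
    Debian's stock /etc/cron.d/mdadm only runs checkarray on the first Sunday of the month; something along these lines makes it every Sunday (compare against the stock file on your own system first, as the exact line shipped there may differ):

    echo '57 0 * * 0 root /usr/share/mdadm/checkarray --cron --all --idle --quiet' | sudo tee /etc/cron.d/mdadm-weekly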

  • Haha, this is one of those classic issues where many just repeat whatever happens to be the current dogma and rather few actually have non-trivial knowledge. Then add some religious sauce to the dish, things like "raid controllers are evil, use linux sw raid!".

    I'll start with raid 10, which is very often misunderstood, for example as "it's fast yet safe, hehe". Nope. As for safety, it's basically a pimped-up raid 1, which is more on the "a reasonable basis" side than on the "safe" side.

    raid 10 can be considered the answer to one specific problem, and that's seriously large databases.
    Looking at how things work, one finds that pretty much all modern OSs have quite good and halfway smart caching, and that pretty much everything is read and written serially or quasi-serially (thanks to those smart OSs). But there is one big exception: large DBs. By large I mean "larger than any reasonable cache and even larger than your RAM". Looking at seriously large production DBs, one will quite often find that even the index part is larger than the OS (or controller) cache and quite often larger than available RAM.
    That's what raid 10 is for. Random access and fast, please. Plus reasonably safe.
    As for safety one would prefer raid 6, but for a large DB raid 10 is the only sensible option. Reason: both 1 and 0 are basically just mirroring and striping, albeit for different purposes; no raid (parity) magic there. Some people have used raid 50 or 60 (or even "500" or "600") and achieved nice performance results, but when a disk goes belly up they are fucked.
    As for safety, you'll find quite a few large DBs writing command shadows to raid 6 for added safety.

    Raid 5 can be seen as a more efficient form of raid 1. The cost ratio is more attractive: rather than losing 50% for safety, you can drive that down to much smaller costs. But you pay elsewhere, namely at the failure end. Replacing a failed disk is far more expensive in both time and resources in raid 5 than in raid 1. Plus you become vendor dependent, which can be nasty: it's not simply "I had an LSI controller and need another LSI controller"; it often comes down to the exact controller family or firmware revision, or an xx08 from Dell not working with a disk from an xx08 in an IBM server. And, many don't know this, the same is true for some sw raids, albeit far less painfully than with hw controllers.
    Looking at all that, I'd advise against raid 5 except maybe for home use.

    raid 6 is similar but adds another "safety disk" to the mix (everyone knows that) and also another algorithm (Galois-field based), which is a point worth understanding. Reason: raid 5 is XOR based, and hence every OS can do it easily and almost for free (unless the load is very high); it basically comes down to pushing all the bytes through a single-clock-cycle op.
    With raid 6 one should prefer a hardware controller, because the Galois field computations don't come for free and, to make things worse, they like to trash the cache.
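
    A toy illustration of the XOR part, on three one-byte "blocks" (real arrays do this per stripe chunk, but the principle is the same):

    d0=$((0xA7)); d1=$((0x3C)); d2=$((0x55))
    p=$(( d0 ^ d1 ^ d2 ))            # parity block written to the 4th disk
    rebuilt=$(( d0 ^ d2 ^ p ))       # pretend the disk holding d1 died: XOR the survivors + parity
    printf 'lost %#04x, rebuilt %#04x\n' "$d1" "$rebuilt"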

    That said, while raid 5 might be used as a cheaper raid 1, raid 6 is about quite serious resilience. If you need that, you pay your price: you get a hw controller pair (you definitely want to keep a twin spare) and you'll wait (and I mean hours and hours) when a disk goes belly up.
    Unfortunately that's something many production environments can't afford. They need full performance, always. That's where you enter raid 60 (or, depending on your needs, raid 50). Given that you need, and your chassis can hold, a rather large number of disks, say 18, raid 60 will often be the best solution, albeit the most expensive one. Also be sure to have a hot spare.
    Arriving here, I can also explain the need for a good (BBU'd!) cache: it gives the controller some breathing room, which significantly improves performance on rebuilds under load.

    All said, the whole raid question is a rather complex zoo where you need to know your usage profile, your OS, and raid itself to come to the right choice.

    For the average semi-professional user or for small offices, I still argue that a well-kept raid 5 (or 51 for companies) is the best solution for most needs. It's cheap, and maintaining it properly keeps it running. Note: by maintaining I also mean keeping an eye on SMART, e.g. by including a SMART query in all backup cycles, and proactively replacing disks early, as soon as they start to become less reliable.
    Note, however, that this also depends on decoupling your risks, particularly by doing backups.

    Finally, a word regarding ZFS. To put it short and simple, I personally don't like it. It's super fat, and somewhere a price has to be paid; as usual, the price for surface comfort is paid in the layers beneath the surface. While ZFS might be a great thing for cloud providers, it's unattractive and expensive for almost all other needs, except for some half-cooked "expert" amateurs.

    All said, I haven't seen many drive arrays (or disks) fail over a few decades, provided they were reasonably maintained. And looking at prices and price differences, I'd also suggest staying away (in most scenarios) from expensive enterprise drives; it seems more promising to me to buy more cheap (but still reasonable-quality) drives and to replace them a little earlier.

  • Thanks bsdguy, I'll try to digest that but it seems to make sense. In a nonstop production environment you obviously need online db replication and entire spare servers (geo separated even), so RAID level becomes less important. So I'm only asking from the perspective of a cheapskate LET packrat hoarding a pile of data for personal use.

    That is, I don't mind extended downtime for rebuild if a drive goes out and I don't mind the performance hit of mdadm or maybe ZFS, but I don't want to take excessive chances of losing data. (Smaller chances are ok, since the tiny fraction that's really important should be multi-replicated etc).

    The story I keep hearing is that RAID 5 will trash a whole array if any single sector becomes unreconstructable, and that's the part I don't understand. Obviously I don't want any sectors corrupted, but that's nowhere near as bad as losing a whole array.

    Do you know if it's possible to set up Ceph so that the redundancy level is below 1:1? That is, as a distributed raid-6? Maybe that's a worthwhile approach given the low end server configs generally available on LET.

    Right now both of my servers have 2 drives and raid-1, fwiw. I'm inquiring about raid-5 because of the 3-drive and 4-drive servers out there that give much more usable space with raid-5 than with raid-1. With >4 drives raid-6 starts looking good, so it's mostly the 3 or 4 drive systems where there's a real question.

    Msg7086 and others, thanks also for the informative posts. It's disturbing to hear URE and particularly bit rot are so frequent. I know some tape drives use very long ECC patterns to prevent URE, but I guess that's not possible with a disk.

    I'll re-read all the posts in the thread in the next few days when I'm less sleepy, but it's already been very helpful. In particular I'll put regular drive checks onto a cron task as suggested, and so on. Much appreciated.

  • @willie said:
    The story I keep hearing is that RAID 5 will trash a whole array if any single sector becomes unreconstructable, and that's the part I don't understand. Obviously I don't want any sectors corrupted, but that's nowhere near as bad as losing a whole array.

    Depending on the specific circumstances (e.g. controller) that may (rarely) be the case but as a general statement it's bullshit.

    So, yes, raid 5 is a good solution for the situation you describe (a dedi with 3+ equal-sized disks) or for general storage (i.e. non-DB).

    As for Ceph, that's a whole different playing field. Summarizing it somewhat brutally, one might consider it something like raid-over-machines. Unless you have seriously large amounts of data, plenty of resources, and seriously high resilience demands, I'd suggest staying away.

  • RAID-over-machines sounds like exactly the right thing when using cheap servers with 2 drives! Three Hetzner auction servers with 2x 3TB each give 6 drives for under $100 a month, so if Ceph can turn those into a raid-6 with 12TB usable, that seems worth looking into. I don't have enough data to want to pursue that just yet, but it's well below the amounts that some other LET users are already maintaining. I won't worry about it for now though. Until recently I had just thought of Ceph as ultra-resilient (all data triple-replicated etc.) rather than RAID-like.
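
    For what it's worth, the RAID-6-like mode in Ceph is an erasure-coded pool. A minimal sketch follows; the profile/pool names and PG counts are made up for illustration, and k=4/m=2 over 6 OSDs gives roughly 2/3 of the raw space as usable:

    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
    ceph osd pool create ecpool 64 64 erasure ec42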

  • Think again, looking at the available bandwidth.
    And no, Ceph can't do magic. The law of the weakest link still holds true.

    That said, I do indeed know of people who used Ceph to shore up a bunch of lousy (as in unreliable storage) machines, but that was on company LANs with plenty of bandwidth.

    If it's about resilient storage, I'd rather go with one server with multiple drives in a raid 5 or 6 and use it as a frequently used backup server for a farm of cheap servers and VPSes.

  • Yeah, Hetzner SX-61 (4x 6tb for 70 euro/m + 1mo setup) is probably the logical step up from those auction servers, especially if raid 5 is ok, or if some of the files (or even parity stripes) can be stashed in a cold archive somewhere. Hopefully I won't reach that level of data hoarding any time soon.

  • @willie

    If my posts have been helpful for you, great. I will, however, stop that now as certain super-cool guys here (like @WSS) molest me for exactly that. May the cool 1-liners provide good advice.

  • OK, thanks. This should keep me busy for a while. WSS seems cool with it either way.

  • WSS Member

    @willie said:
    OK, thanks. This should keep me busy for a while. WSS seems cool with it either way.

    Just don't get too helpful or I guess you'll get molested. I can pencil you in for Thursday.
