RAID rebuild times

Kassem Member

According to this https://blog.shi.com/next-generation-infrastructure/best-raid-configuration-no-raid-configuration/

HDD RAID rebuild times can go up to 6 months!

So I was wondering: what real-world RAID rebuild times have you encountered? Are you still using RAID for your HDDs, or just for SSDs now?

If not using RAID for HDDs, what are you using?

Just Lurking

Comments

  • lentro Member, Provider

    Interesting article. Yes, I do use RAID. I'm a tiny host, so I've only ever dealt with such a situation once, almost three years ago, but I rebuilt the array idle over two days instead of under load....

    Realistically, I think RAID 10 for performance plus the 3-2-1 rule for backups is the way to go for things like image/web hosting. For backup storage, RAID 5/6 is still useful. I personally use ZFS, given its built-in checksums.

  • sparek Member

    Certainly can't speak for everyone. But for us:

    1) We are no longer provisioning new servers with spinning disks. We do still have quite a few servers with spinning disks, but SSD prices together with the vast improvement of performance - that's where we've gone.

    2) We avoid the parity bit of RAID5 (and RAID6) and stick solely to RAID1 or RAID10. I think the remaining spinning disk servers we have are all RAID10 - for the performance boost. And we top our servers out at 2TB. For us, it just doesn't make a lot of sense to go over 2TB, because you start running into other bottlenecks if you put that many accounts on a single server, which leads to the too many eggs in one basket scenario. For us, having many baskets with fewer eggs is a better setup.

    Our SSD servers are RAID1 to guard against a single drive failure.

    I haven't dealt much with RAID5 (or RAID6), but I would imagine the parity computation overhead is rather significant. And for us, it's just better to have more servers than one huge setup.

    But again... that's us. Doesn't mean RAID5 or single (or low number of) servers with huge setups isn't a viable path for others.

  • TimboJones Member

    @sparek said: Certainly can't speak for everyone. But for us: […]

    And the rebuild times with your SSDs?

  • raindog308 Administrator, Moderator
    Thanked by 3: codelock, seriesn, chip

    For LET support, please visit the support desk.

  • jsg Member

    Some relevant information is missing, in particular the type and model of the RAID controller, as well as the amount and type of its cache memory.
    But I agree with their bottom line, which is (in my slightly different version): stay away from RAID 5 and 6 with spindles over 1 to 1.5 TB and with SSDs over 4 to 6 TB. I also suggest keeping the array size reasonable (4+1 or 2 is much better than 12+1 or 2), because while (usually) just one disk needs to be resilvered, all disks are involved in the checksum calculation and writing.

    But there are some buts, e.g. the RAID controller make and model playing a very important role, as well as the role of the array. If, for instance, you've put a DB on a RAID 5 or 6, that was a bad decision in the first place, while, say, some (mostly read) file storage is far less critical.

    As for erasure coding, I'm a big fan of it for large volumes (say north of 50 to 100 TB) but usually advise against it, because most forms approach RAID 1 in terms of space used, and those that don't are complicated and not "plug and play".
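jsg's point about keeping arrays narrow can be quantified: to resilver one failed member of an n-disk RAID 5, every surviving disk must be read in full, so total rebuild I/O grows with array width even though only one disk gets written. A rough sketch of the arithmetic (a simplified model that ignores controller overhead):

```python
def raid5_rebuild_io_tb(disk_tb, n_disks):
    """Data moved during a RAID 5 rebuild: read n-1 surviving disks, write 1."""
    return disk_tb * (n_disks - 1), disk_tb  # (TB read, TB written)

# jsg's examples: a 4+1 array vs a 12+1 array, here with 4TB drives.
for n in (5, 13):
    read_tb, write_tb = raid5_rebuild_io_tb(4, n)
    print(f"{n} disks: read {read_tb} TB to resilver {write_tb} TB")
```

The wide array reads three times as much data to resilver the same disk, which is one reason its rebuild window (and exposure to a second failure mid-rebuild) is so much longer.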

    The problem with democracy is that by definition > 85% of the voters are not particularly intelligent.

  • sparek Member

    @TimboJones said:
    And the rebuild times with your SSD's?

    Can't say I've ever actually timed it. Definitely nothing scientific. But less than a day for 1TB, under load. 6 hours maybe? Could be even less than that. We haven't had a lot of SSD failures, but 20 years of spinning disks versus 3 years of SSDs means we've had many more spinning disk failures than SSD failures. An SSD failing and rebuilding hasn't been of any consequence, so the timing doesn't really stand out.

    Spinning disks? Those, I agree, take much longer; it's probably at least 24 hours for 1TB. And it's a pain. You only have so much disk I/O to play with, and the vast majority of it is being spent grabbing data from the mirrored drive, so any other disk activity slows to a crawl.

    SSDs are just a lot faster than spinning disks. But if you need 10, 20, 40TB in one server, that's really, really expensive for SSD.

    If you have a need for that much disk space, I guess what are you going to do?

    I might question whether a RAID10 setup would be better, avoiding the parity overhead of RAID5. But you may want to minimize the number of disks you have to have in the system to achieve that much disk space.
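sparek's ballpark figures line up with simple throughput arithmetic: an idle mirror rebuild is bounded by sequential drive speed, and the gap between that ideal and the observed time is the cost of rebuilding under production load. The speeds below are illustrative assumptions, not measurements:

```python
def mirror_rebuild_hours(capacity_tb, seq_mb_s):
    """Ideal RAID 1 rebuild time: a straight copy at sequential speed."""
    return capacity_tb * 1_000_000 / seq_mb_s / 3600  # TB -> MB, seconds -> hours

print(f"1TB HDD at ~150 MB/s: {mirror_rebuild_hours(1, 150):.1f} h ideal")
print(f"1TB SATA SSD at ~500 MB/s: {mirror_rebuild_hours(1, 500):.1f} h ideal")
# The distance from ~1.9h ideal to an observed ~24h on HDD is rebuild I/O
# competing with production traffic on a drive with limited IOPS.
```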

    Thanked by 1: TimboJones
  • Kassem Member

    @lentro said: Interesting article, yes I do use RAID. I'm a tiny host so I've only ever dealt with such a situation once almost three years ago, but I rebuilt it idle for two days instead of under load....

    What was the size? You left it idle for two days; does that mean you had a backup server, or was it just not critical to access this data for two days?

    @sparek said: 2) We avoid the parity bit of RAID5 (and RAID6) and stick solely to RAID1 or RAID10. I think the remaining spinning disk servers we have are all RAID10 - for the performance boost. And we top our servers out at 2TB. For us, it just doesn't make a lot of sense to go over 2TB, because you start running into other bottlenecks if you put that many accounts on a single server, which leads to the too many eggs in one basket scenario. For us, having many baskets with fewer eggs is a better setup.

    This would work for many small sites, but it doesn't work when you need tons of data that you want to be sure you won't lose in case 1 or 2 disks (HDDs) fail.

    I think Ceph is the viable answer for mass HDD storage: replicas on independent hardware, so if one goes out it's not a big deal. It costs more, though. Any providers here using Ceph in production?


  • raindog308 Administrator, Moderator

    @sparek said: Spinning disks? I agree are much longer, it's probably at least 24 hours for 1TB

    Assume you mean under heavy load.

    I replaced some 6TB WD Blacks with WD Golds earlier this year (as the Blacks were >5 years of age and SMART was getting iffy). I didn't keep records, but using software RAID-1 (i5-4690 CPU @ 3.50GHz, Debian 9, 6Gbps consumer-motherboard SATA), the resync time was nothing like that... well less than a day for 6TB. They were not entirely idle, but not heavily used either.
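For Linux software RAID like raindog308's, resync progress and the kernel's ETA are exposed in /proc/mdstat. A minimal parser for that progress line (the sample below mimics the kernel's output format; treat the exact field layout, and hence the regex, as an assumption to verify against your kernel):

```python
import re

def parse_resync(mdstat_text):
    """Pull percent done, ETA (minutes), and speed (KB/s) from /proc/mdstat."""
    m = re.search(
        r"resync\s*=\s*([\d.]+)%.*?finish=([\d.]+)min\s+speed=(\d+)K/sec",
        mdstat_text,
    )
    if not m:
        return None  # no resync in progress
    return {"percent": float(m.group(1)),
            "eta_min": float(m.group(2)),
            "speed_kb_s": int(m.group(3))}

sample = """md0 : active raid1 sdb1[1] sda1[0]
      5860390464 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  7.3% (428800512/5860390464) finish=612.5min speed=147742K/sec
"""
print(parse_resync(sample))
```

On a live box you would read the real file instead, e.g. `parse_resync(open("/proc/mdstat").read())`.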


  • @raindog308 said: Assume you mean under heavy load. I replaced some 6TB WD Blacks with WD Golds earlier this year […] the resync time was nothing like that... well less than a day for 6TB.

    Yeah, I think 24 hours for 10 TB might be more in the ballpark. It gets confusing when talking about RAID 10 versus RAID 1, and array capacity versus drive capacity.

    Rebuild rates can be changed, but often default to 30%, IIRC, for LSI cards.

  • jsg Member

    @raindog308 said: I replaced some 6TB WD Blacks with WD Golds earlier this year […] the resync time was nothing like that... well less than a day for 6TB.

    Warning: RAID 1 rebuilds are very different from RAID 5 or 6 rebuilds, and much, much faster.

    Thanked by 1: raindog308

  • Jarry Member

    @jsg said:

    Warning: Raid 1 rebuilds are very different from Raid 5 or 6 rebuilds and much, much faster.

    Not sure one can talk about "rebuilding" a RAID 1 array, as it is basically just copying one disk to another. Well, a little more than that, but no parity is calculated, which is a CPU-intensive task requiring multiple read/write ops...
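Jarry's distinction can be made concrete: a RAID 1 rebuild is a straight copy, while RAID 5 reconstruction must read every surviving member of the stripe and XOR them together to regenerate the missing block, which is exactly why all disks get dragged into the rebuild. A toy illustration with byte strings:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID 5 parity over one stripe)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks in a stripe, plus the parity block computed from them.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The drive holding data[1] dies: rebuild its block from survivors + parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True
```

RAID 1 needs none of this: the rebuild reads one disk and writes the other.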
