Is Raid 5 acceptable again now with SSDs?
Let's get this out of the way first: RAID 5 isn't acceptable on modern high-capacity spindle drives. It's not, don't argue that it is, because it's well documented that it's not. The high failure rate during rebuilds, along with slow drive access and horrendously slow rebuild times, makes it clearly not acceptable. RAID 6 is perhaps a little better since you get a second drive of redundancy, but it's still plagued by the slow, risky rebuild and subpar write speeds.
With that said, what about RAID 5 using SSDs? Personally I haven't done it, but SSDs are stupid cheap right now and the thought is tempting. I'm sure a number of providers here have tinkered with it or run their whole "ship" off of it. What has your testing shown?
What's a typical rebuild time vs. spindles, and what read/write speed gains have you seen?
Thanks!
Comments
RAID-1 is much better than RAID-5 with SSDs
RAID5 kills SSDs faster, due to the amount of writes RAID5 performs during parity operations. Generally you'll want an SLC SSD to mitigate that.
Go with RAID 1
And if you have 4 disks, then go for RAID 10
Of course RAID 1 is better than RAID 5, and a 4-drive RAID 10 is better than RAID 5. But what about 5-8 drive arrays? Does anyone have real-world experience with larger SSD arrays in RAID 5?
I cannot think of a single use case for RAID 5 on SSDs
When you consider that chassis with 3 drive bays essentially do not exist (it's 2, 4, then more), the only possible reason I can imagine is that someone has spent all their pocket money, can only afford 3 drives, needs a bit more storage than RAID 1 provides, and is not brave enough for RAID 0.
RAID 5 for spinners is fine if the use case is high storage within a limited number of drive bays, e.g. 4, when the cost of going up a U (plus additional power and a significant increase in chassis cost) does not outweigh the benefits, especially in a market where everyone wants 1TB for $3.
So I am not arguing with you, I am just saying your analysis of spinner use in RAID 5 is wrong, although as I type this I also appreciate you probably just did not want to discuss that at all, sorry.
tl;dr the gains and losses will be the same as spinners in percentage terms, but if you're going for a 4-bay server just buy bigger drives and use RAID 10. RAID 5 on SSDs is only for those who cannot afford 4 drives but need more storage than RAID 1 provides.
https://www.lowendtalk.com/discussion/160169/its-my-birthday-crazy-offers-lifetimes-directadmin-hosting-in-germany-and-los-angeles
Ask Mike how it works out for him...
Well, he rents the machines, so he does not pay for replacements if he wears them out faster. The result is more storage for him.
I suspect RAID 5 with decent SSDs, given that storage bonus, is not bad for shared hosting.
Sorry to spoil the mass of the church of "everyone knows".
Raid 6 is not Raid 5 plus another XOR. It is Raid 5 plus a considerably more compute-intensive algorithm (Galois field arithmetic), so Raid 6 will rebuild slower than Raid 5. And btw, even for the simple Raid 5 XOR, the processors usually used on (not ultra-cheap) Raid cards have hardware support.
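To make the XOR mechanism concrete, here's a minimal Python sketch (byte values invented purely for illustration) of how Raid 5 parity works: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt by XORing whatever survives.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Three data blocks of one stripe on a hypothetical 4-disk Raid 5 array
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\x0e"
parity = xor_blocks([d0, d1, d2])

# "Lose" d1 and rebuild it from the remaining blocks plus parity
rebuilt = xor_blocks([d0, d2, parity])
assert rebuilt == d1
```

The same XOR recovers any one of the four blocks, which is exactly why a single additional drive failure (or unreadable sector) during a rebuild is fatal.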
The difference between spindles and SSDs mainly comes down to price, size, and the nature of SSDs (and NVMe drives), which however can be addressed properly. Now to price... looking for a reasonably good-quality drive (2.5 million hrs MTBF, etc.), one finds that the options boil down to paying about 10 times the price of a decent spindle. Example: a 12 TB spindle is about 400€, a 12 TB SSD about 4000€.
Probably the biggest error I see being made again and again is to play that game without context. What are you after? A large general storage array? An array for your database? or ...?
For one of the classical use cases, large general (non-specific) storage of say 100 TB, one needs a very, very well-filled purse to go the SSD route.
Other relevant questions are how serious your need is to have your storage array online at full speed, how you do your backup, etc. Raid 5 + hot spare can be an excellent and reliable solution and Raid 6 is not per se better, nor is Raid 10 always the best answer.
So, my answer to OP is: Raid 5 with SSD was never not acceptable but it was rarely the best solution.
One of three SSDs in a dedi, RAID 5 on an LSI MegaRAID, moderate workload, only hosting.
~120TB written in about 6 years :-D
Back in the day it was cheap for nearly 600GB of hardware-cached, RAIDed SSD ;-)
Huh? RAID5 has lower write overhead than RAID1. The worst case is a 3-disk array, where the overhead is 50%, and it goes down with larger arrays.
Although technically correct, I am unsure if this matters in this day and age, with CPUs being as advanced as they are. On NVMe I can see this making a difference (it might become CPU-bottlenecked), but on spinning rust it's most likely IO-limited. I have no clue about "normal" SATA SSDs.
You'll be fine with raid 5 and some proper SSDs. But sure if you're after consumer hardware, then I agree - raid 5 might not be the best choice.
Any decent enterprise-grade drive has enough endurance to run in a raid 5 setup, for 5-6 years with even rather decent workloads on them.
In X company we did 20x 960GB (or 1920GB in some cases) SSDs in RAID 6, and we'd still throttle on IO when a drive would rebuild... which, I guess, makes sense :')
With that said - I can't remember when I last saw a raid 5 environment.
I'm thinking about either 3, 5, or 6+ consumer 256GB or 512GB SSDs in a RAID 5 with a hot spare that backs up to a spindle drive nightly. Run simple Linux software RAID (I find modern CPUs have no problem with RAID 5). Couple that with a cheap 10Gbps PCI card for a pretty decent and quick NAS box.
Since SSDs are dirt cheap right now and incredibly easy to just tack inside a case, it seems actually viable these days. This wouldn't be for production or business use. I've found that high-quality KVMs can be had pretty cheaply these days, and object storage solutions fill the gap nicely for big storage needs. Running a dedicated server is all but unnecessary unless you have a specific security need.
However, I'm finding that my local NAS solution is lacking the I/O speeds I need locally to satisfy my changing use cases. SSDs are a good fit, but I want to be able to keep tossing more in and keep growing as needed.
use zfs.
I have about five only slightly used 250GB ssds laying around right now, interested? 😁
though shipping can be costly depending on your location and may be a dealbreaker...
The processor is just one problem; the other is memory. The processor is a problem because the Galois field GF(2^8) calculation is significantly more compute-intensive than the simple XOR operation. And memory is a problem because the block sizes are considerably larger than the cache lines. One classical approach to solving that is to use ASICs, which typically are but "bent" standard processors (typically Arm, sometimes PowerPC) with a different cache control and structure, e.g. in the form of fewer, larger lines.
A modern x86 can of course do those calculations too, but that's a waste of both computing and electrical power, the latter of which also creates BBU power backup problems.
@Zerpy @sureiam
Let me introduce the other enemies, URE and BER, about both of which there is lots of talk, lots of misunderstanding, and little tangible, let alone reliable, information. Typical numbers that get thrown around are 10^14 for consumer spindles, 10^15 for enterprise spindles, 10^15 for consumer SSDs, and 10^16 for enterprise SSDs.
But there are ugly buts, one important one being the fact that those numbers are (a) highly likely wrong and misunderstood, and (b) statistical values.
So no, an enterprise spindle is not likely to have a URE about every 125 TB. Its actual URE rate is more likely to be good for 1250 TB - but an error may also happen after just 1.25 TB.
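That 125 TB figure is just the expected value implied by a 10^15 rate. A common back-of-envelope model (a per-bit Poisson process; the drive sizes below are invented for illustration) shows what a rebuild actually faces at the quoted rate versus a more realistic one:

```python
import math

def rebuild_survival_probability(bytes_read, ure_rate_bits=1e15):
    """Probability of reading `bytes_read` with no URE, treating the quoted
    rate (one error per `ure_rate_bits` bits read) as a per-bit Poisson rate."""
    bits = bytes_read * 8
    return math.exp(-bits / ure_rate_bits)

# Rebuilding a 5x12TB Raid 5 array means reading the 4 surviving drives: 48 TB
tb = 1e12
print(f"at 10^15: {rebuild_survival_probability(48 * tb):.1%}")        # ~68%
print(f"at 10^17: {rebuild_survival_probability(48 * tb, 1e17):.1%}")  # ~99.6%
```

Taking the 10^15 spec at face value makes big rebuilds look like coin flips; at 10^17 the "Raid 5 always dies during rebuild" story largely evaporates, which is exactly the point.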
Another issue is the story about Raid 5 completely failing during a rebuild due to URE. That story is highly likely wrong and based on a few worst case experiences (keep in mind, it's based on statistics).
Plus, of course, drive manufacturers have come up with their own solutions, which unfortunately are usually proprietary. The important takeaway is that one can considerably enhance protection against the much-feared "Raid 5 rebuild doesn't work beyond 12.5 TB" story. The sad part is that there is also a legal reality in which enterprises are much more likely to sue a drive manufacturer, and to pull it through, than the average Joe Consumer is. The result is that enterprise-grade drives are in fact really more reliable than consumer drives: in the former, the stated URE is highly likely a worst case (read: the flat end of a bell curve), while in the latter it's highly likely a positive outlook (read: the 85% center of a bell curve).
Translation into reality: actual UREs of enterprise spindles are highly likely more like 10^17, and the total loss of a whole Raid 5 array is extremely unlikely (under normal load and with a proper controller).
I've seen excited stories about SSDs being about 100 times more reliable than spindles. That may or may not be the case but it's largely theory because (enterprise grade) SSDs (to not even talk about NVMes) also are 10+ times more expensive than spindles - and keep in mind what we were talking about in the first place: large storage (You'll probably not run your local 1 TB drive in a Raid 5 or 6).
But we are told, there is a saviour, Raid 6. Well, sorry no, not really. Raid 6 doesn't protect you from UREs and BER problems plus Raid 6 comes with "embedded Raid 5" as one of its mechanisms.
I personally almost always run Raid 5 - plus - I do backups. And that, backups, is the real protection against Raid failure plus it's considerably cheaper than throwing disks at URE (which Raid 6 does).
As for ZFS, I'm looking at it with the eyes of a security developer: ZFS goes against KISS in a big way. Throwing additional layers and complexity at problems that are about reliability and availability is a big no-no.
I've done RAID 6 across 4 smaller spindle drives where data storage was more important than speed. Where RAID 10 gives 1 drive failure, RAID 6 gave 6. In hindsight RAID 10 would have been better, but it's also lasted 5 years now off of used Hitachi drives, soooo who's to say what was better.
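For a 4-drive array the trade-off is easy to tabulate. This sketch (the 4 TB drive size is illustrative) compares usable capacity against the number of failures each level is guaranteed to survive:

```python
def array_properties(level, disks, disk_tb):
    """Usable capacity (TB) and guaranteed failure tolerance per RAID level.
    RAID 10 can sometimes survive 2 failures if they hit different mirror
    pairs, but only 1 is guaranteed."""
    if level == "raid5":
        return (disks - 1) * disk_tb, 1   # one disk's worth of parity
    if level == "raid6":
        return (disks - 2) * disk_tb, 2   # two disks' worth of parity
    if level == "raid10":
        return disks // 2 * disk_tb, 1    # half the disks are mirrors
    raise ValueError(level)

for level in ("raid5", "raid6", "raid10"):
    cap, tol = array_properties(level, disks=4, disk_tb=4)
    print(f"{level}: {cap} TB usable, survives {tol} failure(s) guaranteed")
```

On 4 drives, RAID 6 and RAID 10 give the same usable capacity, but RAID 6 guarantees surviving any 2 drive failures where RAID 10 only guarantees 1.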
ZFS confuses and frustrates me. I feel like with that kind of effort I would consider minio s3 before ZFS.
+1 for @jsg comment.
RAID should never be your backup plan, just get an HDD for backups.
I wouldn't use raid 5....
Watched a raid 5 nuke itself during rebuild after another drive had failed.
Minio is great, especially when using distributed mode. I currently have Minio in distributed mode across 4 servers and 4 drives. Couple that with the Minio mc client, which has the ability to "watch" a folder for changes and additions. I'll take Minio's redundancy for backup purposes any day over RAID 5 or ZFS.
GF(2^8) computation doesn't require much memory at all. For a 4 disk RAID6, only multiply by 2 is needed, which is only slightly slower than XOR.
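For the curious, here is roughly what that multiply-by-2 looks like in Python, using the 0x11d polynomial the Linux RAID 6 code uses. It really is just a shift plus a conditional XOR:

```python
def gf_mul2(x):
    """Multiply a byte by 2 in GF(2^8), RAID 6 polynomial 0x11d."""
    x <<= 1
    if x & 0x100:       # overflow past 8 bits: reduce by the polynomial
        x ^= 0x11d
    return x & 0xff

def raid6_pq(data_bytes):
    """P (XOR) and Q (GF(2^8)) parity for one byte position across data disks,
    computed with Horner's rule: Q = d0 + 2*d1 + 2^2*d2 + ..."""
    p = q = 0
    for d in reversed(data_bytes):
        p ^= d
        q = gf_mul2(q) ^ d
    return p, q

p, q = raid6_pq([0x01, 0x02])   # two data disks -> a 4-disk RAID 6
assert (p, q) == (0x03, 0x05)   # P = d0^d1, Q = d0 ^ gf_mul2(d1)
```

With only a multiply-by-2 per data byte, the inner loop is barely heavier than the plain XOR that produces P.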
I don't get your cache line point. Cache lines are only 64 bytes wide. Any I/O block size should be significantly larger, regardless of RAID or not, otherwise you're doing something horribly wrong. GF(2^8) is byte granular, so cannot possibly cross cache lines or otherwise be affected by it.
Intel's Ice Lake and Tremont processors include the GFNI instruction set, which should make GF(2^8) roughly as fast as XOR.
This is what people often forget while considering a RAID setup for their server.
You need to know what you are going to do with the server.
GF(2^8) calculations are multiply and mod (addition and subtraction are basically xor, btw).
A 4-disk Raid 6 array is quite unrealistic. Usually Raid 6 arrays are considerably larger (typically 8+ disks), and then GF(2^8) is computationally expensive.
As for cache lines, you simply misunderstood. Yes, the calculation itself is 8-bit granular, but the amount of data to be dealt with is usually n x sector_size, and those data come over the bus (PCIe) and are temporarily stored in memory. But the calculation doesn't happen in memory, so an optimized engine (a) computes along the current cache line, and (b) loads new lines and stores processed lines. For that, one wants less granularity and longer lines. The trick is basically to have computing time and load-store time nicely balanced (and fast, of course).
Yay, so let's use those power-hungry, rather expensive Intel processors as disk controllers. Brilliant. While we're at it, let's also use 40-ton trucks when we need some bread from the bakery.
Lol, whoops, I meant 2 drive failures. But yeah, the system has been extremely reliable. Taking it offline soon because it's been running for 5 years and upgrades are required, but VPSes have gotten so affordable that it's just not worth having my own dedicated server anymore.
Right, I've heard great things, and the ability to keep growing with more servers (or mini PCs, in my mind) is really tempting. I haven't actually set one up yet, though; everything I've read makes it seem very viable. Also, every backup solution and worthwhile application works with S3 these days, so that's a huge plus.
We have run in-house servers on RAID 5 for 10+ years. We were lucky, as many are. The server was needed for small database-driven software at a small company. RAID 5 is perfect in this scenario - low cost, enough fault tolerance, and enough I/O throughput.
For a shared hosting server with NVMe, software RAID 1 is good; for HDD, RAID 1 is no longer an option, and I would prefer RAID 10. Different scenario, different RAID. You get the idea.
This is no different from software RAID5.
CPUs can pull from memory way faster than disks can DMA to it.
I think you may be confused. Cache line size is not important, but an optimized GF(2^8) implementation does have to care about cache blocking (aka loop tiling). This is mostly because the computation is often fast enough to saturate the L2 cache bandwidth. The cache blocking size is completely selectable by the algorithm, so as long as the sector size is larger than it (which it typically will be), it's not an issue.
Ignoring the unnecessary snark, you'll find that most servers today already have an Intel CPU in them. Also, in all practical cases, all data has to pass through the CPU at one point anyway, so it's not like it's making unnecessary round trips.
As for hardware controllers, I suspect many controllers have rather poor implementations of RAID6. As such, I wouldn't be surprised if a hardware RAID6 implementation is much more of a bottleneck than a software implementation would be.
So? Raid Controller do both.
That's BS. In fact, some modern processors use a variant of PCIe even for inter-core communication. Moreover, your assumptions are questionable. Not every disk read translates to a real disk read; it may well be a disk cache read. Plus, and more importantly, Raid controllers hardly ever (wastefully) use x86. Usually they use Arm- or PowerPC-based MCUs, for various reasons, one of which is the difference between $5 and $100.
Well, frankly, it seems that actually you are out of your depth and talking from a pure x86 perspective. The reality of MCUs, however, is quite different.
You miss the point again. The problem is this: there are, say, 4 KB of data which (a) need to be striped (which is virtually cost-free) and (b) pushed through 2 algorithms. How fast that can be done depends on diverse factors, one important one of which is keeping L1 load/store in balance with computing. If you compute faster than you can read from/write back to memory, you are wasting; if you compute slower, you are also wasting. So you want a good balance. Btw, many MCUs have just 1 level of cache, or even none. Considering that memory access is about 100 times slower than cache access, keeping the balance is a significant part of the whole mechanism.
No. Look up DMA and the PCIe bus.
Yes and no. I agree with your suspicion that Raid controllers might have suboptimal implementations or even errors. But no regarding the software implementation, because Raid controllers do have software implementations. It's just that they have processors or MCUs that are optimized for the job, e.g. by cache design or by hardware optimized for mul-mod.
I wasn't aware that people actually still use hardware RAID controllers with SSDs considering the myriad of issues they have, like these. My apologies.
Isn't the issue that RAID 5 requires an erase operation before writing?
It's the other way around: we use hardware RAID exclusively, precisely because of the myriad of issues software RAID implementations have. Not to mention that a cached hardware RAID will provide much, much higher performance than any software solution ever will.
I believe cached hardware RAID with battery backup is very reliable. With that said, I've heard of about an equal number of RAID controllers failing vs. drives, even in RAID 1. But if you are running anything other than a reliable Linux-based platform, RAID controllers are the way to go.
Linux software RAID, though, is IMO extremely reliable and a great option.