ZFS write IOPS

nowthisisfun · November 2020

Hi guys, I am running out of space on my 2 TB NVMe SSD server and want to upgrade to 4 TB, but 4 TB of SSD storage is beyond my budget. I use the server as a file server for a service with ~10k concurrent users, so load reaches around 1k IOPS at times (7k during backup, but a slow backup is not a problem). Would ZFS on a server with something like 16 GB RAM, a 512 GB SSD and a 4 TB HDD be able to achieve the same IOPS (around 1k)? This would be much cheaper, as this config is available on Hetzner auctions. Thanks

dfroe · November 2020

It depends. Of course you cannot expect similar IOPS from an HDD array as from an SSD.

If you use a couple of GB of the SSD as ZIL and the remaining space as L2ARC, you may see similar results in certain use cases.

However for read performance I doubt that all your regularly accessed data will fit into 500 GB of L2ARC. And obviously reading data from HDD will be much slower than from SSD.

Regarding write performance, it depends on whether you are optimizing for async or sync write operations. Async writes will always go into RAM first without involving the ZIL, but for large write operations your sustainable rate will be limited by HDD throughput. For sync operations a ZIL on SSD will help, but remember to mirror it - and of course only use ECC RAM, especially with ZFS, if your data is important to you.
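From the application's side, the async/sync distinction above looks roughly like the following. This is a minimal Python sketch and is not specific to ZFS; the function names are made up for illustration. On a dataset with the default sync=standard, it is the fsync-style call that has to be committed through the ZIL before it returns.

```python
import os
import tempfile


def write_async(path: str, data: bytes) -> None:
    """Buffered write: returns once the OS has accepted the data."""
    with open(path, "wb") as f:
        f.write(data)


def write_sync(path: str, data: bytes) -> None:
    """Write and block until the OS reports the data is on stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # wait for the kernel to commit it to the device


def demo() -> tuple:
    """Write the same payload both ways and read it back (hypothetical helper)."""
    payload = b"example payload"
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        for name, writer in (("async.bin", write_async), ("sync.bin", write_sync)):
            path = os.path.join(tmp, name)
            writer(path, payload)
            with open(path, "rb") as f:
                results.append(f.read())
    return tuple(results)
```

Both files end up with the same content; the difference is only in when the call returns, which is why sync-heavy workloads are the ones that benefit from a fast log device.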

Designing storage systems can be quite complex and depends on a lot of factors. But if your question is whether a system with 4 TB HDD + 512 GB SSD can typically reach the performance of a 2 TB NVMe system, then the answer is: most likely not.

Falzo · November 2020

I agree with @dfroe but want to add that besides all these options your achievable IOPS also depend heavily on the blocksize, which in the end also means on the filesize.

technically you will have an IO limit for your SSD and a bandwidth limit of course. if you have a large blocksize you obviously won't need as many IOs to reach your bandwidth limit, and with tons of small files / a small blocksize you might run out of IOPS before you even get to the bandwidth limit...

so it also depends on your specific workload in terms of number and size of files.
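The relationship between the two limits can be written down in a few lines. This is only a sketch: the device figures (2000 IOPS, 200 MiB/s) are hypothetical and not measurements from any drive in this thread.

```python
KIB = 1024
MIB = 1024 * KIB


def achievable_iops(iops_limit: float, bandwidth_limit_bytes: float,
                    block_size_bytes: int) -> float:
    """IOPS is capped by whichever limit is hit first:
    the device's IOPS limit, or bandwidth divided by block size."""
    return min(iops_limit, bandwidth_limit_bytes / block_size_bytes)


# Hypothetical device: 2000 IOPS limit, 200 MiB/s bandwidth limit.
for block_size in (4 * KIB, 64 * KIB, 512 * KIB, 1 * MIB):
    iops = achievable_iops(2000, 200 * MIB, block_size)
    print(f"{block_size // KIB:>5} KiB blocks: {iops:.0f} IOPS")
```

With small blocks the IOPS limit is what you hit; with large blocks the bandwidth limit takes over and the IOPS figure halves each time the block size doubles.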

PS: practically speaking I would guess you won't be satisfied with zfs.

SplitIce · November 2020

Moving from a 2TB NVMe to a 4TB HDD is a considerable IOPS decrease.

Typical IOPS for spinning rust is in the range of ~50/sec, so you aren't going to be achieving 1-7k regardless of how you use your 512 GB drive (write cache, ZIL, etc.), unless perhaps you only have a very small active dataset.

Although your backup job will mean you don't

rcxb · November 2020

@SplitIce said:
Typical IOPS for spinning rust is in the range of ~50/sec

That seems extremely low. Here's what a quick search turned up:

The HGST Ultrastar He6 averages 204 IOPS at QD256, while the 7K4000 delivers 215 IOPS.

Source: https://www.tweaktown.com/reviews/6211/hgst-ultrastar-he6-6tb-helium-enterprise-hdd-review/index.html

Levi · November 2020

@rcxb said: That seem extremely low.

Take note of the HDD cache and overall system RAM. Large HDDs tend to have massive caches, and RAID cards add tons of caching on top.

SplitIce · November 2020

@rcxb I measured some drives to get that number (got 45-55 on average). Of course fast enterprise drives will do more; it's the scale that matters (factor of 100!)
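For a rough sense of where such numbers come from, a common back-of-the-envelope estimate for a single outstanding random request is one I/O per average seek plus half a rotation. The seek times below are illustrative assumptions, not specs of any drive mentioned here, and the model ignores on-drive caching and command queueing (a figure quoted at QD256 lets the drive reorder requests, so it is not directly comparable).

```python
def hdd_random_iops(rpm: int, avg_seek_ms: float) -> float:
    """Estimate random IOPS at queue depth 1 for a spinning disk."""
    # On average the platter has to turn half a revolution to reach the sector.
    rotational_latency_ms = 0.5 * 60_000 / rpm
    service_time_ms = avg_seek_ms + rotational_latency_ms
    return 1000 / service_time_ms


# Assumed seek times, for illustration only.
print(f"7200 rpm, 8.5 ms seek: {hdd_random_iops(7200, 8.5):.0f} IOPS")
print(f"5400 rpm, 12 ms seek:  {hdd_random_iops(5400, 12.0):.0f} IOPS")
```

Either way the result stays in the tens to low hundreds, which is the order-of-magnitude gap to SSDs that matters for this question.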

nowthisisfun · November 2020

@dfroe @Falzo Thanks for the reply! The average file size is around 3 MB and the max (99th percentile) file size is 130 MB. The write volume per day is around 50 GB max, so there is plenty of time to flush to disk (practically no users from 12 PM to 6 AM). Also, frequently accessed data will be around 100 GB max, so I am optimistic that I can pull this off. I am thinking of trying lvmcache, will update here once that is done.
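Those figures can be turned into an average write rate with some simple arithmetic. This sketch assumes the daily volume is in gigabytes, uses decimal units, and treats the length of the active window as a free parameter (the 8-hour value is hypothetical). It says nothing about bursts, which are what the IOPS question is really about.

```python
def avg_write_rate_mb_s(daily_gb: float, active_hours: float) -> float:
    """Average write rate in MB/s if the daily volume is spread over active_hours."""
    return daily_gb * 1000 / (active_hours * 3600)


def files_per_day(daily_gb: float, avg_file_mb: float) -> float:
    """Approximate number of files written per day."""
    return daily_gb * 1000 / avg_file_mb


print(f"spread over 24 h: {avg_write_rate_mb_s(50, 24):.2f} MB/s")
print(f"spread over 8 h:  {avg_write_rate_mb_s(50, 8):.2f} MB/s")
print(f"files per day:    {files_per_day(50, 3):.0f}")
```

Even compressed into a short window the average is only a couple of MB/s, well within what a single HDD can sustain sequentially; whether the peaks fit is a separate matter.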

Falzo · November 2020

@nowthisisfun I actually use ssd cached zfs on my home server, and though this is a quite old 120G ssd it caches a 4x4TB striped mirrored zpool. just ran fio via yabs on it for you:

fio Disk Speed Tests (Mixed R/W 50/50):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ---- 
Read       | 7.58 MB/s     (1.8k) | 118.95 MB/s   (1.8k)
Write      | 7.62 MB/s     (1.9k) | 119.57 MB/s   (1.8k)
Total      | 15.21 MB/s    (3.8k) | 238.52 MB/s   (3.7k)
           |                      |                     
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ---- 
Read       | 218.53 MB/s    (426) | 217.69 MB/s    (212)
Write      | 230.14 MB/s    (449) | 232.19 MB/s    (226)
Total      | 448.68 MB/s    (875) | 449.88 MB/s    (438)

so as you can see this more or less is all cached, for large blocksizes the bandwidth clearly limits the available IOps but for smaller blocksizes it manages to hit more than 1k.

keep in mind this is just a single testfile that easily fits in the cache (probably even ARC)... also the underlying quasi raid-10 might reach 200-250 IOPS standalone as well.

workloads like yours are much more stressful and constant so they will be more problematic; you will probably only find out by thorough testing.

I would be very interested in your findings with LVM cache...

simlev · January 2021

I bought a new (spinning platter) HDD for storage, and am quite confused because I'm getting dramatically better benchmark results with zfs. Maybe I'm not performing the right test (I'm showing yabs here for reference), or I'm not interpreting it correctly, but with these numbers I don't see the benefit of adding a ssd cache or log. I must be missing something, can anyone enlighten me?
zfs:

HDD zfs fio Disk Speed Tests (Mixed R/W 50/50):
    ---------------------------------
        Block Size | 4k            (IOPS) | 64k           (IOPS)
          ------   | ---            ----  | ----           ----
        Read       | 11.81 MB/s    (2.9k) | 165.51 MB/s   (2.5k)
        Write      | 11.81 MB/s    (2.9k) | 166.38 MB/s   (2.5k)
        Total      | 23.63 MB/s    (5.9k) | 331.90 MB/s   (5.1k)
                   |                      |
        Block Size | 512k          (IOPS) | 1m            (IOPS)
          ------   | ---            ----  | ----           ----
        Read       | 280.03 MB/s    (546) | 286.98 MB/s    (280)
        Write      | 294.91 MB/s    (576) | 306.09 MB/s    (298)
        Total      | 574.95 MB/s   (1.1k) | 593.08 MB/s    (578)

xfs:

HDD xfs fio Disk Speed Tests (Mixed R/W 50/50):
    ---------------------------------
        Block Size | 4k            (IOPS) | 64k           (IOPS)
          ------   | ---            ----  | ----           ----
        Read       | 898.00 KB/s    (224) | 11.93 MB/s     (186)
        Write      | 935.00 KB/s    (233) | 12.46 MB/s     (194)
        Total      | 1.83 MB/s      (457) | 24.39 MB/s     (380)
                   |                      |
        Block Size | 512k          (IOPS) | 1m            (IOPS)
          ------   | ---            ----  | ----           ----
        Read       | 54.11 MB/s     (105) | 69.31 MB/s      (67)
        Write      | 57.05 MB/s     (111) | 73.93 MB/s      (72)
        Total      | 111.17 MB/s    (216) | 143.24 MB/s    (139)

ext4:

HDD ext4 fio Disk Speed Tests (Mixed R/W 50/50):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 845.00 KB/s    (211) | 11.15 MB/s     (174)
Write      | 884.00 KB/s    (221) | 11.72 MB/s     (183)
Total      | 1.72 MB/s      (432) | 22.87 MB/s     (357)
           |                      |
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 50.53 MB/s      (98) | 63.97 MB/s      (62)
Write      | 53.37 MB/s     (104) | 68.78 MB/s      (67)
Total      | 103.91 MB/s    (202) | 132.75 MB/s    (129)

btrfs:

HDD btrfs fio Disk Speed Tests (Mixed R/W 50/50):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 2.26 MB/s      (565) | 13.88 MB/s     (216)
Write      | 2.28 MB/s      (570) | 14.50 MB/s     (226)
Total      | 4.54 MB/s     (1.1k) | 28.39 MB/s     (442)
           |                      |
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 19.43 MB/s      (37) | 24.70 MB/s      (24)
Write      | 20.83 MB/s      (40) | 26.90 MB/s      (26)
Total      | 40.26 MB/s      (77) | 51.60 MB/s      (50)

Test setup:

Debian system with zfs on root and a couple of other zfs pools. I've created a few 1.2 TB partitions on the new HDD, made one a zfs pool and formatted the rest with different filesystems. Mounted each one at a time in /mnt/test and ran cd /mnt/test/; yabs.sh -ig.

What bothers me is that those benchmark results are close to those for the zfs root partition, which sits on a decent NVMe. I had to check that the test files are actually created in the mounted directory and that the new HDD's activity LED is on for the duration of the test.

NVMe zfs fio Disk Speed Tests (Mixed R/W 50/50):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 8.18 MB/s     (2.0k) | 227.25 MB/s   (3.5k)
Write      | 8.21 MB/s     (2.0k) | 228.45 MB/s   (3.5k)
Total      | 16.39 MB/s    (4.0k) | 455.70 MB/s   (7.1k)
           |                      |
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 787.54 MB/s   (1.5k) | 140.10 MB/s    (136)
Write      | 829.38 MB/s   (1.6k) | 149.43 MB/s    (145)
Total      | 1.61 GB/s     (3.1k) | 289.54 MB/s    (281)

Falzo · January 2021

@simlev as discussed above, the answer to your confusion is simply 'caching'.

depending on your system's RAM, zfs takes a large portion of it and uses that for its own caching algorithm, which works pretty well, as you can see in the bench result.

while linux itself uses free ram for caching file access, the important difference for the benchmark is that fio is bypassing the regular system's filecache but not zfs ARC (because it can't).

so obviously you are comparing rather uncached results of ext/xfs/btrfs vs the cached results of zfs.
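How much a cache can inflate a benchmark is easy to see with a simple weighted-latency model. This is a sketch with assumed latencies (0.05 ms for a cache hit, 15 ms for a random HDD access) and a single outstanding request; it is not a model of how ARC itself decides what to keep.

```python
def effective_iops(hit_ratio: float, cache_latency_ms: float,
                   disk_latency_ms: float) -> float:
    """IOPS for one outstanding request when a fraction of I/Os hit the cache."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    avg_latency_ms = (hit_ratio * cache_latency_ms
                      + (1.0 - hit_ratio) * disk_latency_ms)
    return 1000 / avg_latency_ms


# Assumed latencies, for illustration only.
for ratio in (0.0, 0.9, 0.99, 1.0):
    print(f"hit ratio {ratio:.2f}: {effective_iops(ratio, 0.05, 15.0):.0f} IOPS")
```

A test file that fits entirely in RAM sits at the far end of that range, while a real workload with a larger working set lands somewhere in between.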

in real world scenarios both numbers might not be ideal, because your workloads even on ext/xfs/etc. will benefit from the regular filecache and therefore probably perform a bit better than what you measured. it still heavily depends on the workload, lots of small files vs few big ones and so on.

also keep in mind that while the performance of zfs ARC might seem superior, it can also lead to problems with ram usage, as the system itself can't reclaim the used memory as easily as it can from its own filecaching.

if the last test is supposed to show NVMe performance, then I agree that seems rather weird/wrong. or it just tells you that zfs on NVMe is maybe not the best choice anyway 🤷‍♂️

I'd go with pure ext4 for the nvme part, but leave a small partition if you can afford that and use it as L2ARC for your zfs array. that should also allow you to restrict the regular ARC size a bit more, so you don't run into memory problems (unless you can afford the RAM anyway)

Erisa · January 2021

Edit to note that I wrote this before the above reply existed, read that too

@simlev
As one possibility, ZFS operates an in-memory cache (ARC) which could explain what's causing these speeds.
On the writes front, when the writes are delivered as "asynchronous" (the application saying "I don't care when this data gets there, just write it, thanks") they'll buffer in memory before flushing to disk - causing applications to think they completed really fast.
For reads, anything recently written, recently read or commonly read is placed in the ARC and is read from there at near-instant speeds.
It may be worth trying things closer to actual workloads or testing for a longer time than YABS does.

Fwiw to disable cache (for testing) you can do zfs set primarycache=none pool and zfs set secondarycache=none pool (though setting =metadata is more realistic outside of testing, if you don't want caching, e.g. for large, infrequently accessed things that would waste space there).
And to disable the buffering of asynchronous writes you can do zfs set sync=always pool (there are legitimate uses for this, though crash consistency isn't one, because those writes won't be async).
Reversing these is a case of primarycache=all, secondarycache=all and sync=standard
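To avoid forgetting the restore step, the property changes above can be kept together in a small script. This sketch only builds and prints the zfs set command lines; it does not run them. The dataset name "pool" is taken from the commands above, and the dictionary and function names are made up for illustration.

```python
from typing import Dict, List

# Property values as given in the post above.
NO_CACHE: Dict[str, str] = {"primarycache": "none", "secondarycache": "none"}
SYNC_ALWAYS: Dict[str, str] = {"sync": "always"}
RESTORE: Dict[str, str] = {
    "primarycache": "all",
    "secondarycache": "all",
    "sync": "standard",
}


def zfs_set_commands(dataset: str, settings: Dict[str, str]) -> List[List[str]]:
    """Build one 'zfs set prop=value dataset' argument list per property."""
    return [["zfs", "set", f"{prop}={value}", dataset]
            for prop, value in settings.items()]


# Print the commands for review instead of executing them.
for label, settings in (("no cache", NO_CACHE),
                        ("sync always", SYNC_ALWAYS),
                        ("restore", RESTORE)):
    print(f"# {label}")
    for cmd in zfs_set_commands("pool", settings):
        print(" ".join(cmd))
```

The argument lists could be passed to subprocess.run as root once reviewed; testing the cache and sync settings separately keeps the two effects apart.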

It's also (less likely) possible ZFS decided in its infinite wisdom to use your root pool or any other SSD pool as its log (not L2ARC read cache). I don't know how to override the default log or even where it's stored (the SLOG you add is an alternate location to the default), but maybe you could use a tool to check disk I/O during the test? (I open glances in another tab because I am not clever enough to learn anything more complex)

Disclaimer: I am not a professional at any of this, I just use ZFS a lot for personal things and picked up a couple of pieces of information from observing, experimenting and googling.

marvel · January 2021

This entire topic makes no sense at all. Running ZFS with two spinning disks and then starting to talk about SLOG and L2ARC.

Just rent something with a 2x4 TB SSD in RAID1 and be done with it. Do you really want to do all this to save a couple of bucks? Please don't.

ZFS requires a lot more knowledge, especially when you try to run it on Linux. Listen to Linus, don't use ZFS on Linux!

nullnothere · January 2021

@marvel said: 2x4 TB SSD in RAID1 and be done with it. Do you really want to do all this to save a couple of bucks?

Now now... it's not really "a couple of bucks" more to go from OP's config to the 2 x 4 TB SSD. Please be fair.

@marvel said: ZFS requires a lot more knowledge

Agreed - as do many things to do right, especially when you are trying to push beyond some typical boundaries.

@marvel said: Listen to Linus, don't use ZFS on Linux!

His argument was primarily against the licensing (and then some more but let's not digress).

Now that we're at 2.0+, there's really not that much of a difference to using it properly on Linux as well.

Irrespective of the OS (and some innards including the FPU restrictions), ZFS does require some understanding to use well because it isn't just a file system (when you compare it to ext4 etc.). So it does take some thoughtful tuning to get good mileage but it is really worth it when you consider the protections it offers.

Separately, going back to OP's post, without really being able to qualify the kind of access patterns (reads vs writes and how much of this is "hot" vs "cold" data) it is really tricky to pinpoint a good solution to the problem of IOPS.

Also, 10k concurrent users and 16GB of RAM seems (IMHO) very poor performance wise if there is likely to be a large spread in the data access (i.e. non-cacheable). At some level, irrespective of the FS, the limitation here is going to be the disk and there's no magical way to surpass the hard IOPS limitation there. With some tuning for a specific kind of access pattern, the performance can be improved to a point.

I really don't know what else to add for now - all the good folks have already added all there is above and well before me. I just rest my case.

simlev · January 2021

Just a quick follow-up after @Falzo's and @Erisa's informative answers. Cache it is; disabling it with zfs set primarycache=none pool and zfs set secondarycache=none pool yields worse results than other filesystems:

HDD zfs (no cache) fio Disk Speed Tests (Mixed R/W 50/50):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 250.00 KB/s     (62) | 4.04 MB/s       (63)
Write      | 269.00 KB/s     (67) | 4.32 MB/s       (67)
Total      | 519.00 KB/s    (129) | 8.37 MB/s      (130)
           |                      |
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 42.04 MB/s      (82) | 57.29 MB/s      (55)
Write      | 43.88 MB/s      (85) | 61.26 MB/s      (59)
Total      | 85.92 MB/s     (167) | 118.55 MB/s    (114)

After resetting primarycache and secondarycache, I thought I'd try zfs set sync=always pool as well; for 1.5 hours it was stuck at Generating fio test file..., but then completed.

HDD zfs (no async) fio Disk Speed Tests (Mixed R/W 50/50):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 536.00 KB/s    (134) | 5.18 MB/s       (81)
Write      | 567.00 KB/s    (141) | 5.43 MB/s       (84)
Total      | 1.10 MB/s      (275) | 10.62 MB/s     (165)
           |                      |
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ----
Read       | 27.74 MB/s      (54) | 51.44 MB/s      (50)
Write      | 29.77 MB/s      (58) | 54.62 MB/s      (53)
Total      | 57.52 MB/s     (112) | 106.06 MB/s    (103)

I hadn't noticed before, but 3 GB more RAM is in use during the test, and released immediately afterwards. The current limit (c, c_max) is half the system RAM.
I wish I hadn't put all of the NVMe into rpool, given that I only found out later that pools cannot be shrunk. It seems the only option would be (booting from USB) to take a snapshot, store it somewhere else, recreate the original pool and reimport the snapshot.
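The ARC figures mentioned above (c, c_max) can be read programmatically. This is a sketch that assumes OpenZFS on Linux, where the counters are exposed as a text file at /proc/spl/kstat/zfs/arcstats with name, type and value columns; the helper names are made up for illustration.

```python
import os
from typing import Dict

ARCSTATS_PATH = "/proc/spl/kstat/zfs/arcstats"  # assumption: OpenZFS on Linux


def parse_arcstats(text: str) -> Dict[str, int]:
    """Parse arcstats text into {name: value}, skipping the header lines."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three columns: name, type, numeric value.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_limit_fraction(stats: Dict[str, int], total_ram_bytes: int) -> float:
    """Fraction of system RAM that the ARC is allowed to grow to (c_max)."""
    return stats["c_max"] / total_ram_bytes


if __name__ == "__main__" and os.path.exists(ARCSTATS_PATH):
    with open(ARCSTATS_PATH) as f:
        arc = parse_arcstats(f.read())
    gib = 1024 ** 3
    print(f"size={arc['size'] / gib:.1f} GiB  "
          f"c={arc['c'] / gib:.1f} GiB  c_max={arc['c_max'] / gib:.1f} GiB")
```

Sampling this before, during and after a benchmark run shows how much of the test file ended up in the ARC.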
