GlusterFS with erasure coding over WAN

kisiel · September 2018

Hi everyone,

I am thinking of ten cheap storage VPSes at different providers, bound together into one storage cluster using erasure coding, say 10-5.
Ultimately I'd move to 15-10 with 1TB each, which would effectively store 10TB of data with only 5TB of overhead, withstand five VPSes going down at any time, and make heterogeneous growth of the cluster easy.

I know already that performance is going to be bad but can anyone guess how bad?
I am testing Tahoe-LAFS in a 10-5 setup and I get 800 kB/s to 1.2 MB/s writes to the cluster.
I haven't tested reads yet, nor have I started searching for bottlenecks, but as a rule of thumb, is GlusterFS going to be better or worse?
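
For reference, the capacity arithmetic behind those two layouts can be sketched in a few lines of shell. This assumes "10-5" means 10 shares in total with any 5 sufficient to rebuild the data (and "15-10" means 15 in total, 10 needed), which matches the 10TB usable / 5TB overhead figures above. `ec_summary` is a made-up helper name, not part of any tool.

```shell
#!/bin/sh
# ec_summary TOTAL NEEDED SIZE_TB
#   TOTAL   = number of nodes, each holding one share (assumption)
#   NEEDED  = shares required to rebuild the data
#   SIZE_TB = raw capacity per node, in TB
ec_summary() {
    total=$1
    needed=$2
    size_tb=$3
    usable=$((needed * size_tb))               # space left for actual data
    overhead=$(((total - needed) * size_tb))   # space spent on redundancy
    tolerated=$((total - needed))              # nodes that may be down at once
    echo "usable=${usable}TB overhead=${overhead}TB tolerated_failures=${tolerated}"
}

ec_summary 10 5 1    # usable=5TB overhead=5TB tolerated_failures=5
ec_summary 15 10 1   # usable=10TB overhead=5TB tolerated_failures=5
```

One caveat, as far as I know: GlusterFS dispersed volumes want the brick count to be greater than twice the redundancy, so a 10-brick volume with redundancy 5 may be rejected. Worth checking the docs before planning around it.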

dragon2611 · September 2018

I'd be more inclined to look at LizardFS, although the master might be a SPOF unless you set up some kind of automatic failover. (You can have shadow masters, but they don't fail over automatically unless you script it; there is some support in 3.13RC1.)

jlay · September 2018

Gluster is very sensitive to bandwidth and latency, so I don't think things will improve much. For WAN/internet use, they intend for you to use geo-replication volumes.

I've always gotten the best performance by putting GlusterFS on its own subnet/interface, as it'll gladly eat all the bandwidth you can give it. As a result, things will probably get worse as you add nodes.

For my storage pools I've just been getting a few low end dedicated servers that have private networks. Similar amount of storage (if not more), better performance, and just as likely to keep running (full-DC network/power and multi-server outages aren't that common).

kisiel · September 2018

@dragon2611 said:
I'd be more inclined to look at LizardFS, although the master might be a SPOF unless you set up some kind of automatic failover. (You can have shadow masters, but they don't fail over automatically unless you script it; there is some support in 3.13RC1.)

Do you have more experience with LizardFS? It looks like it requires some serious hardware for the master server. I wonder if, in my use case (backup of an overgrown photo library), it would get by with far less.

dragon2611 · September 2018

@kisiel said:

@dragon2611 said:
I'd be more inclined to look at LizardFS, although the master might be a SPOF unless you set up some kind of automatic failover. (You can have shadow masters, but they don't fail over automatically unless you script it; there is some support in 3.13RC1.)

Do you have more experience with LizardFS? It looks like it requires some serious hardware for the master server. I wonder if, in my use case (backup of an overgrown photo library), it would get by with far less.

Mine's running on 4 vCPU/4 GB RAM, but it's also idle most of the time.
I'm also not doing much with it other than basic replication.

One of my chunkservers is also a Synology NAS (I managed to cross-compile it using spksrc).

Then again, anything I care about is stored/backed up elsewhere as well (i.e. it's not my only copy).

AlyssaD · September 2018

GlusterFS over WAN is not a recommended idea.

I highly suggest you try setting up GlusterFS on a few DigitalOcean or Vultr VMs to see how it works. In my testing in the past I was only getting about 2 MB/s write speed to the volume, which made it utterly worthless.

jlay · September 2018

@AlyssaD said:
GlusterFS over WAN is not a recommended idea.

I highly suggest you try setting up GlusterFS on a few DigitalOcean or Vultr VMs to see how it works. In my testing in the past I was only getting about 2 MB/s write speed to the volume, which made it utterly worthless.

That's way lower than what it should be! I used it with a few DO servers and attached volumes, and hit ~200 MB/s last I tested. The type of volume you make (dispersed, replica, stripe, etc) and a few other things play a big part (eg: using the private network, XFS/EXT4 options, etc).

FHR · September 2018

Use GlusterFS only within a single datacenter. You'll thank me later.

jsg · September 2018

I think that whole approach is mistaken. To do it at all and keep the pain level relatively low you'd need to go with good quality providers but then you might as well buy 3 dedis for the same money.

The point you seem to grossly misunderstand is that (a) a DFS should be close to the hardware and (b) a VPS is the opposite of that. Plus, VPSs rarely have the resources (processor, memory) needed.

If you for whatever weird reason absolutely want to stick with your many VPSs approach -and- have at least some geo-diversity you should look at providers with multiple (preferably own) data centers and their own network (like e.g. OVH).

But again, the VPS route is a bad one anyway because not only does it (usually) have too few resources but it also does not give you (or your DFS) any sensible amount of control over them.

Why don't you turn the problem upside down: rent 2 or 3 well-connected servers, each with 6 or so fat drives in a ZFS pool with GlusterFS on top, then rent out high-quality storage VPSs with 3+ TB of good-quality, geo-diverse storage and make money instead of throwing it at a VPS-based toy GlusterFS?

kisiel · September 2018

@jsg
The idea is to store a somewhat rapidly growing photo library offsite (as a last-line-of-defence backup), so I don't have a massive budget to drop into it.
My baseline is, say, Wasabi at $5/TB a month. If Wasabi goes out of business, I need the data replicated somewhere else. So that's $10/TB tops in total, preferably more like $7.
Tahoe-LAFS on top of cheap VPSes meets most of the criteria, except for speed.

I am somewhat aware of other options, but each one of them struggles, mostly on price.

It looks like I want something that does not exist.

aaraya1516 · September 2018

kisiel said: I haven't tested reads yet nor actually started searching for any bottlenecks, but as a rule of thumb, is it going to be better or worse?

I ran these benchmarks on my setup. I wasn't sure how to benchmark the read, but I'll rerun with something else if needed.

sudo dd if=/dev/zero of=sb-io-test bs=64k count=16k conv=fdatasync
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 97.2173 s, 11.0 MB/s

sudo dd if=sb-io-test of=/dev/null bs=64k count=16k conv=fdatasync
dd: fsync failed for '/dev/null': Invalid argument
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 66.4884 s, 16.1 MB/s

Setup:
2 Dedicated machines (Colo really, in same DC). Proxmox 5 ZFS mirrored stripe (4 x 1TB), with NVMe on PCIe for caching. (this is the only thing that fits, after ZFS, on my old machines. It's not "Best Practice" but it works)
1 KVM GFS Brick Host on each (Full disk LUKS encrypted)
1 KVM GFS Client hosted on one of the dedicated machines. The benchmark is from this VM.

Edit: You can probably get this performance from top-level providers; 6 of the 10 drives are consumer-grade non-NAS drives. Also, I forgot to add that the bricks are on LUKS-encrypted VMs.
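
On benchmarking reads: the `fsync failed for '/dev/null'` message comes from `conv=fdatasync`, which asks dd to sync the output file, and /dev/null cannot be synced. A minimal sketch of a read test without it is below. `read_bench` is a hypothetical helper name; the cache-dropping step is Linux-specific, needs root, and is left as a comment. Without it, the page cache may serve part of the read from RAM.

```shell
#!/bin/sh
# read_bench FILE [BLOCK_SIZE]
# Reads FILE sequentially and discards the data; dd prints its
# throughput summary on stderr when it finishes.
read_bench() {
    file=$1
    bs=${2:-64k}    # default matches the block size of the write test
    # No conv=fdatasync here: it only applies to the output file.
    dd if="$file" of=/dev/null bs="$bs"
}

# Suggested use on the client, with the file left by the write test:
#   sync
#   echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop the page cache first
#   read_bench sb-io-test 64k
#   read_bench sb-io-test 1M
```

Running it with a couple of block sizes also answers the "how do larger blocks do" question for reads.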

jsg · September 2018

@aaraya1516

With GFS nodes based on ZFS with NVMe cache? Let that be a lesson that one doesn't use VPSs for that kind of a job.
That said GFS isn't exactly a speed demon in many situations (e.g. lots of small files) plus it's very sensitive about configuration.

Just out of curiosity: What's the network speed between those two dedis?

jlay · September 2018

@aaraya1516 said:

kisiel said: I haven't tested reads yet nor actually started searching for any bottlenecks, but as a rule of thumb, is it going to be better or worse?

I ran these benchmarks on my setup. I wasn't sure how to benchmark the read, but I'll rerun with something else if needed.
sudo dd if=/dev/zero of=sb-io-test bs=64k count=16k conv=fdatasync
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 97.2173 s, 11.0 MB/s
sudo dd if=sb-io-test of=/dev/null bs=64k count=16k conv=fdatasync
dd: fsync failed for '/dev/null': Invalid argument
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 66.4884 s, 16.1 MB/s
Setup:
2 Dedicated machines (Colo really, in same DC). Proxmox 5 ZFS mirrored stripe (4 x 1TB), with NVMe on PCIe for caching. (this is the only thing that fits, after ZFS, on my old machines. It's not "Best Practice" but it works)
1 KVM GFS Brick Host on each (Full disk LUKS encrypted)
1 KVM GFS Client hosted on one of the dedicated machines. The benchmark is from this VM.

Edit: You can probably get this performance from top-level providers; 6 of the 10 drives are consumer-grade non-NAS drives. Also, I forgot to add that the bricks are on LUKS-encrypted VMs.

I haven't tested with LUKS, but assuming those dedicated servers are on a LAN, they should perform much better. Perhaps it's the block size; how do larger blocks (e.g. 1M or 50M) do?

For GlusterFS to perform well it needs low latency, lots of bandwidth, and a decent bit of CPU time. If you have all of that, you should be able to nearly max out your disk I/O (for example, ~200 MB/s on most HDDs). At work I was testing with a few VMs and SSD-backed AoE (ATA over Ethernet) bricks on a 10G network and hit about 700 to 800 MB/s after adjusting a bunch of volume parameters (e.g. threads).
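
As an illustration of the "threads" kind of volume parameter: GlusterFS exposes event-thread counts as volume options that are changed with `gluster volume set`. The sketch below only prints the commands rather than running them. The volume name `myvol` and the value 4 are placeholders, not recommendations, and the thread does not say which options were actually changed in the test described above.

```shell
#!/bin/sh
# print_thread_tuning VOLNAME [THREADS]
# Dry run: prints the 'gluster volume set' commands for the client and
# server event-thread options instead of executing them.
print_thread_tuning() {
    vol=$1
    threads=${2:-4}    # placeholder value, tune and measure for your setup
    for key in client.event-threads server.event-threads; do
        echo "gluster volume set $vol $key $threads"
    done
}

print_thread_tuning myvol 4
```

Printing first makes it easy to review the commands, then paste them on a node once the values look sane. Benchmark before and after each change, since the right numbers depend on CPU count and workload.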

aaraya1516 · September 2018

jlay said: I haven't tested with LUKS, but assuming those dedicated servers are on a LAN, they should perform much better. Perhaps it's the block size, how do larger blocks (eg: 1M or 50M) do?

I haven't tuned it for performance. At one point I was doing geo-replication, but that overseas VPS went bad. The last time I configured it was about a year ago. It runs my email backend, so I/O was never really pushed to the max. I'll try increasing the block size. I do all of this on a 100Mbps LAN, but through a tinc VPN tunnel.

jsg said: Just out of curiosity: What's the network speed between those two dedis?

I have it limited to 100Mbps. I'm also not sure what lesson you're referring to, my VMs run the GFS. The VMs have decent I/O for being on ZFS.
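
A quick sanity check on that 100Mbps limit: dividing by 8 gives the ceiling in MB/s before any protocol or VPN overhead. `link_ceiling_mb` is just an illustrative helper name.

```shell
#!/bin/sh
# link_ceiling_mb MBIT_PER_S
# Converts a link speed in Mbit/s to the theoretical maximum in MB/s
# (decimal megabytes, the same unit dd reports), ignoring all overhead.
link_ceiling_mb() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 8 }'
}

link_ceiling_mb 100     # 12.5
link_ceiling_mb 1000    # 125.0
```

So 100 Mbit/s works out to 12.5 MB/s, and the 11.0 MB/s write result from the dd test is close to what the link allows once overhead is taken off. The 16.1 MB/s read is above that ceiling, so presumably part of it never crossed the limited link (local brick or cache); the thread does not say which.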

jsg · September 2018

@aaraya1516 said:
I have it limited to 100Mbps. I'm also not sure what lesson you're referring to, my VMs run the GFS. The VMs have decent I/O for being on ZFS.

Ok, if you consider 20 MB/s decent performance ...

aaraya1516 · September 2018

jsg said: Ok, if you consider 20 MB/s decent performance ...

20MB/s is what the GFS gets when mounted on a third VM. I get 857MB/s on ZFS.

jlay · September 2018

@aaraya1516 said:

jlay said: I haven't tested with LUKS, but assuming those dedicated servers are on a LAN, they should perform much better. Perhaps it's the block size, how do larger blocks (eg: 1M or 50M) do?

I haven't tuned it for performance, at one point I was doing geo-replication but that overseas VPS went bad. Last time I configured it was about a year ago. It runs my email backend, so I/O was never really pushed to the max. I'll try out increasing the block size. I do all of this on 100Mbps LAN, but through a TINC VPN tunnel.

jsg said: Just out of curiosity: What's the network speed between those two dedis?

I have it limited to 100Mbps. I'm also not sure what lesson you're referring to, my VMs run the GFS. The VMs have decent I/O for being on ZFS.

I was thinking more for the block size of your dd test, but given the other explanations, your performance makes sense now. Larger block sizes with dd tests usually perform better, to a degree. Give those bad boys some more bandwidth and you'll be off to the races!

In my experience, GlusterFS doesn't really care if it's in a VM. As long as you have the network bandwidth, CPU time, and disk I/O to back it, it runs just as well virtualized as baremetal for most intents and purposes. As long as you use virtio and have reasonable settings everywhere, it should do fine. Things like readahead, I/O scheduler, GlusterFS volume settings, and sysctl TCP params have more of an impact on performance than baremetal/virtualized.

I've very nearly hit 1GB/s write speeds on GlusterFS with three/five VMs and a dispersed volume on a fairly convoluted SSD-backed storage setup (ATA over ethernet on top of ZFS on separate storage nodes) connected at 20Gbit. bonnie++ tests showed perfectly acceptable latency.

I actually enjoy the idea of running GlusterFS in a VM, as it makes moving the 'node' around a lot easier. Can move the system as a unit, rather than deal with moving the brick and the data trickery that follows. With a VM you could just live-migrate to another physical host. Given all of that and how it can still perform well, seems like a win to me.

aaraya1516 · September 2018

@jlay said:
I actually enjoy the idea of running GlusterFS in a VM, as it makes moving the 'node' around a lot easier. Can move the system as a unit, rather than deal with moving the brick and the data trickery that follows. With a VM you could just live-migrate to another physical host. Given all of that and how it can still perform well, seems like a win to me.

This is exactly why I did it. I back up the VMs monthly, and if I have a catastrophic failure, I just restore the VMs to some other Proxmox instance.

Your suggestions spurred me to get another replicated storage going and test things out. I found that the LUKS encrypted VMs are the ultimate bottleneck.

vish · September 2018

I read about how BeeGFS does parallel reads/writes. Has anyone ever used it?

jlay · September 2018

@aaraya1516 said:

@jlay said:
I actually enjoy the idea of running GlusterFS in a VM, as it makes moving the 'node' around a lot easier. Can move the system as a unit, rather than deal with moving the brick and the data trickery that follows. With a VM you could just live-migrate to another physical host. Given all of that and how it can still perform well, seems like a win to me.

This is exactly why I did it. I back up the VMs monthly, and if I have a catastrophic failure, I just restore the VMs to some other Proxmox instance.

Your suggestions spurred me to get another replicated storage going and test things out. I found that the LUKS encrypted VMs are the ultimate bottleneck.

Nice, that probably helps speed to recovery quite a bit. Restore the VM somewhere else, let it heal what's changed, and bam - back in business!

Encryption hurting performance makes sense, I suppose! I don't know enough about how LUKS works behind the scenes to really dive in.

@vish said:
I read about how BeeGFS does parallel reads/writes. Has anyone ever used it?

I can't say I've used BeeGFS, but GlusterFS is reportedly able to do parallelism. I haven't researched this specifically, but my performance testing seems to back it. It scales linearly as you add bricks/nodes.

edit:
Some cursory research shows it does do parallelism - this is all due to their elastic hashing algorithmic model (not relying on metadata lookups, it's calculated):

http://moo.nac.uci.edu/~hjm/fs/An_Introduction_To_Gluster_ArchitectureV7_110708.pdf

(3.3, page 14)

jsg · September 2018

@jlay said:
Encryption hurting performance makes sense, I suppose! I don't know enough about how LUKS works behind the scenes to really dive in.

No, it doesn't. At least not when it's properly set up (e.g. initramfs containing aes_ni module) and your processor isn't much older than 10 years.

mfs · September 2018

I'm mildly curious about BeeGFS performance as well. I fiddled with Tahoe-LAFS some years ago and found it slower than pretty much anything else; I haven't touched it since, even though I did like some of its quirks. I second all of jlay's observations, and I'd only suggest giving the GlusterFS docs a thorough read before doing anything. I've never relied on GlusterFS itself for transport and at-rest encryption (GlusterFS transparent disk encryption is conceptually different from, and to a certain degree better than, encrypted bricks/disks, but it had its caveats), so I've had LUKS or plain dm-crypt covering the latter requirement. I've never found LUKS interacting in unwarranted ways with the rest.

jlay said: http://moo.nac.uci.edu/~hjm/fs/An_Introduction_To_Gluster_ArchitectureV7_110708.pdf

An interesting read (with tests) is at http://moo.nac.uci.edu/~hjm/fhgfs_vs_gluster.html as well

jlay · September 2018

@jsg said:

@jlay said:
Encryption hurting performance makes sense, I suppose! I don't know enough about how LUKS works behind the scenes to really dive in.

No, it doesn't. At least not when it's properly set up (e.g. initramfs containing aes_ni module) and your processor isn't much older than 10 years.

I get that you can use CPU extensions to speed up encryption, I'm quite aware of that. I do a lot of optimization at work and this ties into it quite a bit on more layers than most imagine.

The thing is, I don't know how transparent LUKS is on the disk/filesystem layers (which can vary on implementation) in the sense of disk I/O (eg: blocks) or what impact it may have on the GlusterFS xlators. This could really mess with performance due to inherent inefficiencies with GlusterFS that just happen to occur due to encryption being present (eg: excessive healing or some kind of perpetual race conditions).

Until this is properly investigated and defined (such as where LUKS is implemented), I'd say it could absolutely impact performance with Gluster. For example, doing it on a PV could be transparent while doing it on the LV could be bad, because of how Gluster does self-healing and bitrot detection.

@mfs said:
I'm mildly curious about BeeGFS performance as well. I fiddled with Tahoe-LAFS some years ago and found it slower than pretty much anything else; I haven't touched it since, even though I did like some of its quirks. I second all of jlay's observations, and I'd only suggest giving the GlusterFS docs a thorough read before doing anything. I've never relied on GlusterFS itself for transport and at-rest encryption (GlusterFS transparent disk encryption is conceptually different from, and to a certain degree better than, encrypted bricks/disks, but it had its caveats), so I've had LUKS or plain dm-crypt covering the latter requirement. I've never found LUKS interacting in unwarranted ways with the rest.

jlay said: http://moo.nac.uci.edu/~hjm/fs/An_Introduction_To_Gluster_ArchitectureV7_110708.pdf

An interesting read (with tests) is at http://moo.nac.uci.edu/~hjm/fhgfs_vs_gluster.html as well

Awesome, I'll check that link out. Thank you!

jsg · September 2018

As BeeGFS came up and as I'm involved in a project with a need that seemed to be addressed by a DFS I did some research and can say the following:

Most of them are Linux-only, some even with a clear distro preference. If you are a Linux-only shop that's fine, but in our case it wasn't. I'd strongly suggest only considering a DFS that supports at minimum Linux, FreeBSD, and OpenSolaris on the server side, and the same plus Mac and Windows on the client side.

Most are GPL infested. Again, for some that doesn't matter but in professional circles I see strongly increasing objections especially for v.3. Unfortunately that leaves only a few options.

(In our case) any java use was a no-go which also hampers BeeGFS (and a few others).

Another issue is that quite a few DFSs have been developed within and for the context of large-scale, data-intensive scientific computing. Unfortunately, that also means that quite a few DFSs have a focus that may seem similar but is actually quite different from what is typically needed in the hosting context.

For our context, only the little-known XtreemFS passed the filters, but we didn't have enough confidence in it, and our needs were (thankfully) quite specific. So we decided to create our own solution, which is not a DFS but gives us all the DFS features that are important to us (centered around fault-tolerant, geo-diverse, and preferably fast storage).

For people who just need a "general", not-too-much-fuss, fault-tolerant Linux DFS and who don't care about license issues (i.e. have no problem with the GPL), XtreemFS might be worth a look. Secondary candidates are GlusterFS and LizardFS. For larger and/or safety- and performance-sensitive projects, I'd advise thinking about building your own solution.

jlay · September 2018

@mfs said:
An interesting read (with tests) is at http://moo.nac.uci.edu/~hjm/fhgfs_vs_gluster.html as well

I finally had a moment to give that a look, it is pretty interesting! It's fairly out of date (and they admittedly say so). They don't make any mention of things like volume parameters. In my experience that has made worlds of difference, especially in regards to latency.

Would be nice to see them do some volume parameter tuning and compare to GlusterD2 (re-write in Go) that has some improvements like lookup caching. That page seems to paint Gluster in a slightly worse light than I've seen recently haha
