VPS Fallback cluster

Hi all,

So I have a few client projects coming up that require really high uptime and can't afford any data loss.

I was thinking about a setup where, if the main server goes down for some reason, I could switch over to a different server. But that setup requires the data to always be accessible on BOTH the main and backup VM nodes, via either a backup of each VM every 24 hours or a SAN/NAS.

Which setup is best to use in this case, or what are your thoughts on this?

Note: one of the companies is an enterprise, so I can't send data unencrypted over the interwebs; preferably everything goes through an internal link.

Comments

  • Neoon Community Contributor, Veteran
    edited June 2018

    SAN/NAS for HA? The fuck? If the SAN fails, you'll be fucked.

    Your HA is gone, single point of failure.

    The fewer components the setup has, the more reliable it is.

    Get a few decent dedis.

  • Heartbeat with periodic drive cloning? Normally, I'd suggest AWS, but...

  • @Neoon said:
    SAN/NAS for HA? The fuck? If the SAN fails, you'll be fucked.

    Your HA is gone, single point of failure.

    The fewer components the setup has, the more reliable it is.

    Get a few decent dedis.

    Yup, just realised it. I was thinking about rsyncing 2 network storage points, but that would be stupid.

    My current plan is to get 4 servers: 2 master nodes where all the VPSes are running, one node which backs up all VPSes every 12 hours (they can stay online during this backup), and one fallback server which can import all the VPSes of a failed node within 15 minutes.
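
    For reference, the backup part of that plan is simple enough to script; a rough sketch (the VM names, image paths and backup-node address are placeholders, and in practice you would snapshot each VM first so the copy stays consistent while it keeps running):

    ```python
    #!/usr/bin/env python3
    """Sketch of the 12-hour backup loop from the plan above."""
    import subprocess
    import time

    VMS = ["web01", "web02", "db01"]               # placeholder VM names
    BACKUP_TARGET = "backup-node:/srv/vm-backups"  # placeholder backup node, reached over the internal link
    INTERVAL = 12 * 60 * 60                        # every 12 hours

    def backup_vm(name: str) -> None:
        image = f"/var/lib/vms/{name}.qcow2"       # placeholder image path
        # rsync over SSH keeps the transfer encrypted on the wire;
        # --partial/--inplace keep re-runs cheap for large images.
        subprocess.run(
            ["rsync", "-a", "--partial", "--inplace",
             image, f"{BACKUP_TARGET}/{name}.qcow2"],
            check=True,
        )

    if __name__ == "__main__":
        while True:
            for vm in VMS:
                backup_vm(vm)
            time.sleep(INTERVAL)
    ```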

  • drserver Member, Host Rep

    What kind of application is in question?

  • @drserver said:
    What kind of application is in question?

    Large webshop, workspaces/hosted Exchange, and a DB server for one company.

    The other one is a pretty large web development & design agency (±660 web hosting clients which can't afford downtime).

  • nik Member, Host Rep
    edited June 2018

    You should first check what the actual architecture is, then you could put all assets on object storage (so they are accessible and can be managed by n servers) and the databases could be set up in multi-master. This way you could use 2 load balancers (2 servers in 2 different locations), 2 app servers that are connected to 2 db servers and object storage.

    With this architecture, one datacenter could theoretically burn down completely while your projects stay up.
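
    For the asset side, this is roughly what "all assets on object storage" looks like from the app servers; a minimal sketch assuming an S3-compatible store and boto3 (endpoint, bucket and credentials are placeholders):

    ```python
    import boto3

    # Any S3-compatible object store works here (AWS S3, Ceph RGW, MinIO, ...);
    # the endpoint, bucket and credentials below are placeholders.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://objects.example.com",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    BUCKET = "site-assets"

    def store_asset(local_path: str, key: str) -> str:
        """Upload an asset once; every app server can then serve it by key."""
        s3.upload_file(local_path, BUCKET, key)
        return f"https://objects.example.com/{BUCKET}/{key}"

    def fetch_asset(key: str, local_path: str) -> None:
        """Pull an asset on whichever app server happens to handle the request."""
        s3.download_file(BUCKET, key, local_path)
    ```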

  • drserver Member, Host Rep

    FoxelVox said: Large webshop, workspaces/hosted Exchange, and a DB server for one company.

    The other one is a pretty large web development & design agency (±660 web hosting clients which can't afford downtime).

    This looks like a private cloud project. Check OVH Private Cloud; they are fairly priced.

  • @nik said:
    You should first check what the actual architecture is, then you could put all assets on object storage (so they are accessible and can be managed by n servers) and the databases could be set up in multi-master. This way you could use 2 load balancers (2 servers in 2 different locations), 2 app servers that are connected to 2 db servers and object storage.

    With this architecture, one datacenter could theoretically burn down completely while your projects stay up.

    Thanks, yup, I suggested that to them, BUT the web design agency wants a control panel for their clients. So I would need to develop a CP for them that works with a cluster setup, unless Plesk or cPanel supports this?

  • drserver Member, Host Rep

    nik said: This way you could use 2 load balancers (2 servers in 2 different locations), 2 app servers that are connected to 2 db servers and object storage.

    S3 was down for 6 hours in 2017, so use 2 object storage zones.

    If you are referring to AWS here, this setup will give you only 99.9%.

    To get that last nine you need to do 3 AZs.

    Thanked by 1 FrankZ
  • nik Member, Host Rep

    @drserver said:

    nik said: This way you could use 2 load balancers (2 servers in 2 different locations), 2 app servers that are connected to 2 db servers and object storage.

    S3 was down for 6 hours in 2017, so use 2 object storage zones.

    If you are referring to AWS here, this setup will give you only 99.9%.

    To get that last nine you need to do 3 AZs.

    I never mentioned AWS or S3, so I am not sure what your point is.

  • drserver Member, Host Rep
    edited June 2018

    nik said: I never mentioned AWS or S3, so I am not sure what your point is.

    I took S3 as an example since it is the most popular object storage.

    To be 100% up, you need a minimum of 3 locations. And depending on what kind of service software you have, most of them will not re-elect masters if there are not at least 4 nodes.

    My point is that you cannot be 100% sure in any way.

  • willie Member
    edited June 2018

    FoxelVox said: Large webshop, workspaces/hosted Exchange, and a DB server for one company.

    You need db replication: work on that first.

    There is no such thing as 100% anything. Google SRE now treats reliability figures as percentage of requests that are allowed to fail. E.g. 99.99% reliability of some Google service means that out of a trillion requests coming into that service from around the world, they can fail 0.01%, not that the entire service can be simultaneously down 0.01% of the time (that would be unthinkable for something like gmail, which is spread across 1000s of machines in dozens or hundreds of data centers). In fact at any given moment there is probably someplace in the world where Google is having a localized outage, with any requests failing in those places counting against the 0.01% error budget.

    With two geo-separated servers, replication, and good monitoring, you can probably get 99.99% by old-school methods, but it may be harder to do much better than that. Anywhere you go there will probably also be regional internet failures once in a while that have nothing to do with your server.

    In fact what almost everyone in your segment does is use AWS, ELB etc. That has downtime, it's probably not the most reliable approach you can find, but it's the big brand and it's expensive and has lots of capacity, so if something goes wrong they think ok, they did the best they could.

    The Google SRE book is interesting reading:

    https://landing.google.com/sre/book/index.html
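
    To put numbers on those figures (and the ones being thrown around in this thread), the arithmetic is just (1 - availability) times the period; a quick sketch:

    ```python
    # Allowed downtime per year for a few of the uptime figures in this thread.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for availability in (0.999, 0.9999, 0.99992, 0.99999):
        budget = (1 - availability) * MINUTES_PER_YEAR
        print(f"{availability:.3%} uptime -> {budget:.1f} minutes of downtime per year")

    # 99.900% uptime -> 525.6 minutes of downtime per year
    # 99.990% uptime -> 52.6 minutes of downtime per year
    # 99.992% uptime -> 42.0 minutes of downtime per year
    # 99.999% uptime -> 5.3 minutes of downtime per year
    #
    # The same idea works as a request error budget: 0.01% of a trillion
    # requests is 10**8 requests that are allowed to fail.
    ```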

  • @willie said:

    FoxelVox said: Large webshop, workspaces/hosted Exchange, and a DB server for one company.

    You need db replication: work on that first.

    There is no such thing as 100% anything. Google SRE now treats reliability figures as percentage of requests that are allowed to fail. E.g. 99.99% reliability of some Google service means that out of a trillion requests coming into that service from around the world, they can fail 0.01%, not that the entire service can be simultaneously down 0.01% of the time (that would be unthinkable for something like gmail, which is spread across 1000s of machines in dozens or hundreds of data centers). In fact at any given moment there is probably someplace in the world where Google is having a localized outage, with any requests failing in those places counting against the 0.01% error budget.

    With two geo-separated servers, replication, and good monitoring, you can probably get 99.99% by old-school methods, but it may be harder to do much better than that. Anywhere you go there will probably also be regional internet failures once in a while that have nothing to do with your server.

    In fact what almost everyone in your segment does is use AWS, ELB etc. That has downtime, it's probably not the most reliable approach you can find, but it's the big brand and it's expensive and has lots of capacity, so if something goes wrong they think ok, they did the best they could.

    The Google SRE book is interesting reading:

    https://landing.google.com/sre/book/index.html

    They're aiming for 99.992% uptime or higher.

    DB replication is not the issue here; we're offloading all SQL requests through multiple servers in multiple locations around Europe. It's mainly the site and the workspaces that are sensitive to downtime.
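
    For context, that SQL offloading is basically read/write splitting; a rough sketch with PyMySQL (hosts and credentials are placeholders, not our actual setup):

    ```python
    import random
    import pymysql

    # Placeholder hosts: one primary plus read replicas in a couple of EU locations.
    PRIMARY = dict(host="db-primary.example.net", user="app", password="secret", database="shop")
    REPLICAS = [
        dict(host="db-ams.example.net", user="app", password="secret", database="shop"),
        dict(host="db-fra.example.net", user="app", password="secret", database="shop"),
    ]

    def run_query(sql: str, params=()):
        """Send writes to the primary, spread reads over the replicas."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        target = random.choice(REPLICAS) if is_read else PRIMARY
        conn = pymysql.connect(**target)
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                if is_read:
                    return cur.fetchall()
            conn.commit()
        finally:
            conn.close()
    ```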

  • drserver Member, Host Rep

    FoxelVox said: They're aiming for 99.992% uptime or higher.

    DB replication is not the issue here; we're offloading all SQL requests through multiple servers in multiple locations around Europe. It's mainly the site and the workspaces that are sensitive to downtime.

    Yeah,

    Go with AWS. Super robust network, enormous capacity, and some will say the cloud industry standard.

  • Zerpy Member

    @FoxelVox said:
    They're aiming for 99.992% uptime or higher.

    Good luck.
    Hopefully, they'll pay you millions.

    Like seriously - sure you might be able to maintain that uptime with luck over a year or more, but eventually there will be something that drops below that uptime, regardless of how much redundancy you've tried to put into it.

    Stuff fails - businesses should understand that as well. If they can't live with 42 minutes of downtime per year, it's not like there can be a whole lot of maintenance in that window anyway, and they'll probably screw up themselves and cause more than 42 minutes of downtime in the course of a year.

    If they require 99.992% or higher uptime on let's say, hosted Workspaces - maybe they should stop using hosted Workspaces and use local machines for doing their work.

    I'm even doubting if their own internet connection will have the same uptime :-D

    Have you asked the client for their actual requirements, or is this just a "We want 99.992% uptime" because they can say it?

    From my experience, even huge enterprise customers want "100% uptime" (or 99.99xxxx), but in reality they're fine with more downtime in the end, since it turns out to not be as super critical as first stated.

    Also, huge enterprises tend to understand that things fail.

  • willie Member
    edited June 2018

    I worked at an old-school site with a 99.99% SLA, which we met for several years. Yes, we spent millions on it, but it could be done for a lot less now. We had two colo racks in separate DCs with DB replication, and we did have an unexpected failover or two. Perhaps more usefully, we were able to do software upgrades with no downtime, by intentional failover.

    We also had a massive problem with one of the data centers, not causing a crash per se, but we had to completely get out of that data center on less than 48 hours notice (maybe less than 24, not sure). We could never have built a new hardware stack in another DC in that amount of time, and we couldn't incur hours or days of downtime by unplugging the original stack and moving it. Fortunately we were able to just switch over to our existing secondary stack, keeping the site up while we spent the next few days frantically moving the primary stack.

    99.992% is doable by fairly conventional means (two servers, replication, anycast, and failover). 99.999% is probably a lot harder.

    I do think that high-availability web hosting is an under-served market and it would be great if there were more offerings in it. Outages have elevated chances of happening at the times of heaviest traffic, which is when you want them the least. If your webshop is down for a few hours on Black Friday, "I'm losing millions!!!!" is not that much of an exaggeration (you're at least losing thousands, even for a small shop). Once that happens to you, going from a $7/month LAMP VPS to a $100/month HA setup suddenly seems worth the cost.

  • raindog308 Administrator, Veteran

    Neoon said: SAN/NAS for HA? The fuck? If the SAN fails, you'll be fucked.

    Your HA is gone, single point of failure.

    Dude, they're widely used.

    But not at LET prices.

    Thanked by 1 FHR
  • @Zerpy said:

    @FoxelVox said:
    They're aiming for 99.992% uptime or higher.

    Good luck.
    Hopefully, they'll pay you millions.

    Like seriously - sure you might be able to maintain that uptime with luck over a year or more, but eventually there will be something that drops below that uptime, regardless of how much redundancy you've tried to put into it.

    Stuff fails - businesses should understand that as well. If they can't live with 42 minutes of downtime per year, it's not like there can be a whole lot of maintenance in that window anyway, and they'll probably screw up themselves and cause more than 42 minutes of downtime in the course of a year.

    If they require 99.992% or higher uptime on let's say, hosted Workspaces - maybe they should stop using hosted Workspaces and use local machines for doing their work.

    I'm even doubting if their own internet connection will have the same uptime :-D

    Have you asked the client for their actual requirements, or is this just a "We want 99.992% uptime" because they can say it?

    From my experience, even huge enterprise customers want "100% uptime" (or 99.99xxxx), but in reality they're fine with more downtime in the end, since it turns out to not be as super critical as first stated.

    Also, huge enterprises tend to understand that things fail.

    Let's just say it's a 4-figure plan, and it's not like it's a hard requirement, but that's how high they want to aim.

    I mean, say one node fails and the data from 12 hours before is ready to go within an hour; that means you won't have much downtime at all. 99.72% (about 1 day a year) must be manageable.

  • JamesF Member, Host Rep

    OVH public cloud offers 99.999% uptime and you can use cPanel. You could split the sites up across different DCs and different projects?

    You could look at Google Cloud, Azure, and AWS.

  • Zerpy Member

    @experttechit said:
    OVH public cloud offers 99.999% uptime

    They really don't. Their page might say it, but the actual uptime over the last 2 years has been lower, due to their pretty much full network outages in their DCs - 99.999% allows for a little over 5 minutes of downtime per year.

    I'm an OVH fanboy, but we should also be realistic... they have had their fair share of outages over the last couple of years.

    If I wanted to guarantee 99.99% uptime (or more), I'd never rely on a single hosting provider (even if multi-datacenter setup).

  • @drserver I just sent you a PM, it is quite urgent.

  • drserver Member, Host Rep

    DanSummer said: @drserver I just sent you a PM, it is quite urgent.

    ty. got it

  • black Member

    Database replication, then fail-over via DNS using something like https://github.com/blackdotsh/UptimeFlare. I use my own monitoring to trigger the fail-over, but Uptime Robot works fine in most cases.
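
    The trigger itself doesn't need to be much; a sketch of the DNS flip, assuming Cloudflare-managed DNS (the token, zone/record IDs, hostnames and IPs are placeholders):

    ```python
    import requests

    # Placeholders for this sketch.
    CF_TOKEN = "api-token"
    ZONE_ID = "zone-id"
    RECORD_ID = "record-id"
    PRIMARY_IP = "203.0.113.10"
    BACKUP_IP = "198.51.100.20"

    def site_is_up(url: str = "https://example.com/health") -> bool:
        """Basic HTTP health probe; real monitoring would check more than one thing."""
        try:
            return requests.get(url, timeout=5).status_code == 200
        except requests.RequestException:
            return False

    def point_dns_at(ip: str) -> None:
        # Cloudflare v4 API: update an existing A record to the given IP.
        requests.put(
            f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
            headers={"Authorization": f"Bearer {CF_TOKEN}"},
            json={"type": "A", "name": "example.com", "content": ip, "ttl": 60, "proxied": False},
            timeout=10,
        ).raise_for_status()

    if __name__ == "__main__":
        if not site_is_up():
            point_dns_at(BACKUP_IP)  # fail over; flip back to PRIMARY_IP once healthy again
    ```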

  • drserver Member, Host Rep

    AWS Aurora (RDS) as the HA DB, S3 as object storage, ELB or NLB or ALB, whatever you like, then nodes behind it with EBS volumes.

    Do 3 AZs for worker nodes, and use 2 AWS regions for buckets and Aurora.

    Cloudflare load balancer to balance between regions, or even between clouds if you go multi-cloud.

    As I said before, some services, for example ZooKeeper, need a minimum of 4 nodes to elect a new master in case of an issue.

    That is how you can get up to 99.99%.
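
    Back-of-the-envelope availability math for why the extra AZs help, assuming independent failures (optimistic) and made-up per-tier numbers:

    ```python
    # Availability algebra, under the (optimistic) assumption of independent failures:
    #   components in series  -> multiply availabilities
    #   redundant copies      -> 1 - product of unavailabilities

    def series(*parts: float) -> float:
        a = 1.0
        for p in parts:
            a *= p
        return a

    def redundant(per_copy: float, copies: int) -> float:
        return 1 - (1 - per_copy) ** copies

    az = 0.999                               # made-up availability of one AZ's worker fleet
    workers = redundant(az, 3)               # 3 AZs as suggested above -> 0.999999999
    stack = series(workers, 0.9999, 0.9995)  # plus made-up LB and DB tiers in series
    print(f"workers over 3 AZs: {workers:.7%}")
    print(f"whole stack:        {stack:.4%}")   # the serial tiers end up dominating
    ```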

  • wavecomas Member, Host Rep

    An elegant solution is to have 2 dedi or colo servers with VMware ESXi. You can replicate the main site with a 5-minute RPO. If the main site goes down, you can switch immediately to the secondary. The only minus is that VMware costs some bucks: 3000-4000 USD for the Essentials Plus package. But it's worth it.
