How to build an HA app/DB setup?

raindog308 · June 2016

I would like to build a highly available app/DB configuration using two nodes.

I have a database-driven web site (call it example.com). I would like a setup where my browser goes to RRDNS or something and the site keeps on working regardless if a node goes down. I don't want to build a separate DB tier.

Nodes: From two different reputable LowEndProviders in different DCs (Seattle and Las Vegas). Well, now you can guess who the providers are.

Hardware: one is 2GB, one is 4GB (or will be once @Francisco finishes satiating his legendary appetites among the fleshpots of Las Vegas and gets around to setting his new nodes up). Only budgeting 30GB of disk on each side as my 2GB node was a special and has 40GB of total disk. I don't know how much disk the 4GB one has because @Francisco is too busy snorting cocaine off the stripper's hip bones instead of setting his new nodes up.

OS: Deb8, Cent7, and OpenBSD are my faves. LowEndDancers in hot tubs are @Francisco's.

So I need some advice:

Is RRDNS the best choice here? Is wasting his life amid the wanton excesses of Sin City @Francisco's? I believe all modern browsers understand RRDNS natively at this point. I'm OK with a brief outage if a node went down, which I anticipate to be rare. However, if a provider does some maintenance for a few hours, then I want the other node to recover when it comes back up. I could also use some kind of clustering failover but I think I'd need a VIP for that.
For the app files, binaries, and some import directories, I'm thinking of using glusterfs to keep them in sync. DRBD is another option but I've read that GlusterFS is better. I'm happy to err on the side of better recoverability if something goes wrong, which is how I read GlusterFS vs. DRBD. There's also CephFS but I think that requires a minimum of 3 nodes.
BTW, since I'm building an HA clustered file system-ish thing, is there any way I could share that out to other clients? e.g., my own NAS in the sky and share it via NFS, etc.? So they'd mount someserver:/some/path local over NFS and it keeps working regardless if a node goes down? I don't recall NFS supporting RRDNS...
The DB, which is MySQL/Maria/Percona (if it's still free) or could be Postgres I guess. Percona XtraDB Cluster seems to get good press...? I only have two nodes, so I could also use MySQL's native multi-master. From everything I know about DBs, I want to do that replication at the DB level instead of at the glusterfs/DRBD/etc. level so transactional integrity is preserved.

I could also use some kind of MySQL HA cloud service. Unfortunately these are really expensive (amazon RDS, etc.) and usually are just running on a single VM assigned to you anyway. There are more HA things like Amazon's SimpleDB, Azure Tables, etc. but I'd prefer not to lock myself into something proprietary. Google's Cloud SQL might work but it's more cost and since I need two nodes for the app anyway...

Any advice welcome!

hzr · June 2016

What are you running for the actual site itself? I like RethinkDB's HA (technical https://aphyr.com/posts/329-jepsen-rethinkdb-2-1-5) if it's less relational, or Postgres otherwise.

I use Ceph for file storage, but there's also https://github.com/coreos/torus which came out somewhat-recently.

Francisco · June 2016

You're setting yourself up for hardcore split-brain.

If there's any sort of route issue between the two places you're going to have both instances claiming to be the master, there's a reason people generally use an odd number of nodes for quorum.

Francisco

raindog308 · June 2016

Francisco said: If there's any sort of route issue between the two places you're going to have both instances claiming to be the master, there's a reason people generally use an odd number of nodes for quorum.

Quorum of...glusterfs? Hmmm. I do have a storage box I could add a partition and make a third node.

rincewind · June 2016

Percona is a good option but, if you are flexible on your query language, also consider CockroachDB, ActorDB, RQLite.

For a distributed file-system, look at LizardFS - did well in a benchmark against Gluster and Ceph. Not sure if you need a distributed file-system, though. If it's just the app and some static files, you could use inotify to watch a directory for changes and trigger rsync. For instance, incron.

Another option is to store your files in the DB as a BLOB column in a table, and skip the need for a separate replicated layer.

+1 for odd number of nodes in any distributed setup.
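
A minimal sketch of the majority rule behind the "odd number of nodes" advice in this thread. The function name `has_quorum` is made up for illustration; real cluster software (Gluster, Galera, etcd) has its own quorum settings, and this only shows the arithmetic.

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if a partition that can see `reachable` members (itself included)
    holds a strict majority of the `cluster_size` configured members."""
    return reachable > cluster_size // 2


# Two nodes with the link between them cut: each side sees only itself,
# so neither side has a majority and neither can safely claim to be master.
print(has_quorum(1, 2))

# Three nodes split 2/1: exactly one side keeps a majority.
print(has_quorum(2, 3))
print(has_quorum(1, 3))
```

With two nodes, a route problem leaves both sides below a majority; if they keep accepting writes anyway, that is the split-brain case. A third node, even a small one, lets exactly one side carry on.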

agoldenberg · June 2016

For database I've always been a fan of Galera. Great multi master replication but requires at least 3 nodes.

PrincessOfCats · June 2016

MySQL DB Replication

I've tried a few for a while now, and I find that Percona XtraDB Cluster is the easiest to set up and maintain.

A few things:

Use xtrabackup/xtrabackup-v2 to sync (xtrabackup must be installed from percona repositories)
Unless you want someone snooping on your sync traffic, use a VPN in between or use socat (built into Percona XtraDB Cluster). I personally find using a VPN (tinc?) is easier most of the time.
Minimum 3 nodes for no split-brain

Failover

Use an uptime monitoring service with support for webhooks or something similar and a DNS service that has an API.
For rage4, this is pretty easy, you just create a webhook so that the uptime monitoring service (I used uptimerobot) can call the webhook. On calling the webhook, the record is set to an offline state (removed temporarily).

This can probably be done for other DNS providers as well if they have an API.

TTL of the records must be set to low in order for it to be effective.
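
A rough sketch of the decision a failover webhook has to make, leaving the DNS provider's API out entirely since those differ per provider. The function name `records_to_publish` and the "never publish an empty set" rule are assumptions for illustration, not something rage4 or uptimerobot provide; the IPs are documentation addresses.

```python
from typing import Dict, List


def records_to_publish(health: Dict[str, bool]) -> List[str]:
    """Return the A-record IPs that should stay in the round-robin set.

    `health` maps each node IP to its latest check result. If every node
    looks down, keep all of them: the monitor itself may be the broken
    part, and an empty record set would guarantee an outage.
    """
    healthy = sorted(ip for ip, ok in health.items() if ok)
    return healthy if healthy else sorted(health)


# One node failed its check, so only the other stays in DNS.
print(records_to_publish({"192.0.2.10": True, "198.51.100.20": False}))
```

The result would then be pushed to whatever DNS API is in use. How quickly clients notice still depends on the record TTL, as noted above.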

App Files

Doesn't really matter what you use, most will require you to have 3 nodes if you don't want split brain. Personally, I prefer gluster for ease of setup.

Note that some systems do not have encryption between the nodes by default. Unless you want someone snooping on your files, you should probably stick the traffic through a VPN. GlusterFS does not have encryption by default, though it does have support for it.

NAS-ey stuff

Glusterfs can be mounted using the glusterfs client and a few other ways.

For HA, I would recommend mounting with the glusterfs client. See http://blog.gluster.org/category/high-availability/ for HA mounting

bookstack · June 2016

I am not sure what kind of application you are building, but a distributed system across data centers with low end servers may invite lots of unnecessary complexity. Network partitions and high latency will be difficult to tackle.

If you really want to go that route, you need at least 3 nodes to create a quorum. And most consensus protocols, like ZooKeeper's and Raft, are quite chatty.

raindog308 · June 2016

I have gluster up and running. Slow as hell but considering I'm keeping three replicas (New York, Las Vegas, Seattle) spread out over the continent, I guess I should not be surprised.

# time for i in `seq -w 1 100` ; do cp -rp /var/log/messages ./copy-test-$i ; done

real    3m33.582s
user    0m0.070s
sys     0m0.257s
# time rm -f copy-test-*

real    0m27.918s
user    0m0.003s
sys     0m0.016s

Wouldn't mind improving speed a bit...I guess my options are:

(a) drop down to two replicas, maybe closer

(b) put everyone in the same DC and...gulp...trust one provider...

Maounique · June 2016

Considering such events will be very rare, I think a distributed FS is much more trouble than it's worth. The latency it introduces ALL the time will cost you far more overall than an occasional outage.
Put up a secondary site with a replicated DB, and when the main site comes back up, manually sync and switch back. The goal here is to avoid extended downtime, not to build an unattended setup that keeps running fully automated (albeit very slowly) after our deaths. And that's even if we treat the extra complexity as generally trouble-free, rather than as a source of countless SPOFs and corruption risks that even the best of us can't always anticipate and prepare for.

raindog308 · June 2016

@Maounique - yes, replication is done at the DB layer but there is non-DB stuff.

BTW, doing a two-node gluster setup (replicas) with both nodes at Vultr NJ:

real    0m2.345s
user    0m0.041s
sys     0m0.156s

That's without even getting into the private networking.
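
To put the two timings side by side, here is a small worked calculation. It assumes the 2.345s figure comes from the same 100-file copy loop as the earlier 3m33s run, which the post implies but does not state; the helper name `to_seconds` is made up.

```python
import re


def to_seconds(real: str) -> float:
    """Convert a `time` value such as '3m33.582s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", real)
    if m is None:
        raise ValueError(f"unrecognised time format: {real!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


FILES = 100  # the loop copied 100 files
wan = to_seconds("3m33.582s")  # three replicas: New York, Las Vegas, Seattle
lan = to_seconds("0m2.345s")   # two replicas, both at Vultr NJ (assumed same test)

print(f"cross-country: {wan / FILES:.3f} s per file")
print(f"same DC:       {lan / FILES:.3f} s per file")
print(f"slowdown:      {wan / lan:.0f}x")
```

That works out to roughly 2.1 seconds per file across the continent against about 0.02 seconds in one DC, a difference of around 90x. The two runs also differ in replica count, so not all of the gap is distance.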

amhoab · June 2016

You'll rarely, if ever, see shared storage used as a sync mechanism for code across multiple nodes in a "real" production environment. "Real" as in something with a knowledgeable ops team for a site that gets considerable traffic.

There are many ways to do this, but a low-end way would be to use Jenkins or even your laptop to do a code build, and rsync the files over to your instances. Once you script this out, it's not so bad to do locally, if needed. I like the approach of creating a build artifact that includes all code and dependencies, pushing that to S3 or similar, and using a CM tool like Chef or Ansible to make sure the instance is set up correctly and has the proper artifact.

I usually use load balancers, but rrdns should be fine for this. You just need to run health checks on your nodes to ensure that downed hosts aren't getting traffic.
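
A minimal sketch of the kind of health check meant here, using only the standard library. It only proves that something accepts TCP connections on the port; a real check would probably also fetch a page and look at the response. The function name `is_up` and the default port and timeout are assumptions.

```python
import socket


def is_up(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

Run from cron or a monitoring box against each node, the result can decide which hosts stay in the round-robin record set.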

Regarding your database, Galera (which Percona XtraDB Cluster uses), is great, but it requires quorum for writes, and isn't intended for distant neighbors.

I'd suggest doing a standard MySQL master/slave setup, and perhaps using MaxScale on your frontends to help failover your databases in case of failure.

Hope that helps.

raindog308 · June 2016

Thanks @amhoab

Alas, LET blocked my post but here it is:

http://pastebin.com/pZbepN3g

rincewind · June 2016

Gluster+Galera+Tinc+RabbitMQ? Looks too complicated.

Here's a potential alternative: Just use Dropbox and Inotify.

Distributed job queue

View Dropbox as a message queue, where messages are files.

Heartbeats: Create a folder under dropbox that each VPS writes/touches at regular intervals - one file per VPS where filename is hostname/IP. A leader can watch the folder to track which boxes are alive and use it for scheduling jobs
Job submit: Create another folder where jobs are submitted as JSON files. Filenames can be prefixed either by VPS hostname or by job type. All your boxes watch this folder and pick up jobs directed at them, deleting the file when done. Think in terms of either individual job queues per VPS, or a pub-sub pattern. The leader creates jobs by writing into the folder.
You could implement leader elections on top of this message queue, if you want to get fancy.

Dropbox would give a reliable backbone, and is the only distributed component - everything else runs locally. Dropbox could potentially rate-limit updates - so maybe heartbeats every hour and go for a coarse-grained HA solution.

Rsync is better for bulk data transfers. Dropbox only for command and control.
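
A sketch of the heartbeat folder described above, assuming the folder is synced by Dropbox (or anything else) and that the clocks on all boxes roughly agree. The function names are made up. The timestamp is written into the file body rather than read from the file's modification time, since it is not certain how a sync service treats mtimes.

```python
import time
from pathlib import Path
from typing import List, Optional


def beat(folder: Path, hostname: str, now: Optional[float] = None) -> None:
    """Write this host's heartbeat file: one file per VPS, named by hostname."""
    now = time.time() if now is None else now
    folder.mkdir(parents=True, exist_ok=True)
    (folder / hostname).write_text(f"{now:.0f}\n")


def alive_hosts(folder: Path, max_age: float, now: Optional[float] = None) -> List[str]:
    """Hosts whose last heartbeat is no older than `max_age` seconds."""
    now = time.time() if now is None else now
    alive = []
    for path in folder.iterdir():
        try:
            stamp = float(path.read_text().strip())
        except (OSError, ValueError):
            continue  # half-synced or unrelated entry: treat as not alive
        if now - stamp <= max_age:
            alive.append(path.name)
    return sorted(alive)
```

Each VPS would call `beat()` from cron, and the leader would call `alive_hosts()` with a `max_age` of a couple of heartbeat intervals before handing out jobs.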

Database

Instead of replicating the entire DB, just focus on critical tables, write them into SQLite files and store them in Dropbox.

PS: Watchman is a sophisticated Inotify service. More recent than incron

jcaleb · June 2016

I set up MySQL's built-in master-master replication 2 years ago on 2 LEBs in different states. When I simulated one node going down, it would sync back. But after leaving the DBs alone for 2-3 weeks, they were no longer in sync.

Jonchun · June 2016

@raindog308 said:
I have gluster up and running. Slow as hell but considering I'm keeping three replicas (New York, Las Vegas, Seattle) spread out over the continent, I guess I should not be surprised.

> # time for i in `seq -w 1 100` ; do cp -rp /var/log/messages ./copy-test-$i ; done
> 
> real  3m33.582s
> user  0m0.070s
> sys   0m0.257s
> # time rm -f copy-test-*
> 
> real  0m27.918s
> user  0m0.003s
> sys   0m0.016s
>

Wouldn't mind improving speed a bit...I guess my options are:

(a) drop down to two replicas, maybe closer

(b) put everyone in the same DC and...gulp...trust one provider...

What about picking a location with multiple big DCs? That way speeds will be significantly better and you can still handle an outage at one provider.

raindog308 · June 2016

I really need to properly write this up.

My Needs

I have ~15 VPSes and ~6 home boxes that do a variety of things. I want to manage their work all in one place. There are no life support units or missile defense batteries in play here - it's stuff like

nightly backups, versioning and tarsnap backups for some critical stuff like my password vault, etc.
schlepping seedbox stuff, converting, notifying, etc.
tons of minor life things like dumping my Google contacts and harmonizing them with our family addressbook database, checking if our standing lottery tickets hit the powerball, untarring and scanning my shared hosting backups intensively (on my own node) to see if my less-technical family members have been hacked again, etc.
system stuff like expiring old backups, ansible jobs to keep all systems in policy, etc.
information-gathering stuff like gathering stats from my TOR relays, reporting which nodes need updates on packages, etc. I also find it helpful to just metric a lot of things in life - e.g., I may never care what my Dropbox usage is every Monday and the usage of every folder inside, but if I do care, I have the data.
I am starting a web service to do some data processing of screenshots of a popular iOS game. In a month or two I'll post it in the subreddit and there will likely be traffic. Watching and managing that.

And of course I want to report on all of that - which jobs failed/succeeded, which ones ran longer than expected, any that started but never finished, etc. Ultimately, job dependencies and groups.

It's important to realize these are not 16-core enterprise VMs with 512GB of RAM that don't care about a gig for some big enterprise system. They're as small as 128MB LEBs.

Today

I have a system today, written in Python:

a directory full of scripts and libraries is kept in sync on all nodes via nightly rsync
crontabs on those nodes fire those scripts, and store the results and any messages in local sqlite files
a master node goes around to each node in the morning and gathers those sqlites, consolidates them, and emails me a morning report. It also audits backups vs. the inventory db (did everyone back up), checks that every database on every box backed up, etc.
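
For readers wondering what the morning consolidation step could look like, here is a sketch using the standard `sqlite3` module. The table name, columns and function name are hypothetical, since the post does not show the schema; it also assumes every node file uses the same table layout.

```python
import sqlite3
from pathlib import Path
from typing import Iterable

# Hypothetical schema shared by the master and every node file.
SCHEMA = """
CREATE TABLE IF NOT EXISTS job_results (
    node       TEXT NOT NULL,
    job        TEXT NOT NULL,
    started_at TEXT NOT NULL,
    status     TEXT NOT NULL,
    message    TEXT,
    PRIMARY KEY (node, job, started_at)
)
"""


def consolidate(master_db: Path, node_dbs: Iterable[Path]) -> int:
    """Copy rows from each node's file into the master; return rows added."""
    con = sqlite3.connect(str(master_db))
    try:
        con.execute(SCHEMA)
        con.commit()
        before = con.total_changes
        for node_db in node_dbs:
            con.execute("ATTACH DATABASE ? AS src", (str(node_db),))
            # INSERT OR IGNORE skips rows already gathered on an earlier run.
            con.execute(
                "INSERT OR IGNORE INTO job_results "
                "(node, job, started_at, status, message) "
                "SELECT node, job, started_at, status, message "
                "FROM src.job_results"
            )
            con.commit()  # ATTACH/DETACH cannot run inside an open transaction
            con.execute("DETACH DATABASE src")
        return con.total_changes - before
    finally:
        con.close()
```

Because of the primary key and `INSERT OR IGNORE`, running the gather twice on the same files adds nothing the second time.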

It works fine but there are some things I'd like to fix:

if I want to change a host's execution time, or add a job, I have to go to that host and edit the crontab. This can be automated with ansible, but what you get into is having a DB of config variables and a script that spits out ansible rules that are then executed...which is kind of like having a central scheduling system :-)
there are things I want to manage fleet-wide like spreading out backups so the backup servers aren't overloaded.
inter-server dependencies are very hard. Not impossible, but it's every server handing off on its own, and there's no central coordination.

Requirements

central highly available, resilient brain that keeps track of what jobs are scheduled, who's running them, reruns them, etc.
the brain specifies schedules in cron-compatible format (using the croniter package). Dependencies are a second-round enhancement once "replacing what I've got today" is in place. Each schedule describes what job to run and where: on either a specific node or a group of nodes.
brain notifies node to run a job and the node does so in a relatively short window of time.
node makes results (succeeded/failed, any warnings/errors, any data output) available to brain
various reports are available but generating those is just another job

Brain Design

So the major problems are:

brain can't go down, which for all practical purposes means multiple nodes from different providers in different parts of the globe. I'd really rather not get into failover IP type things...it's 2016 and I want all nodes to be active. I also want the cluster to be "self-healing" in the sense that a node can drop out and come back and it just works.
there are different ways to have the brain cluster talk amongst itself. Since I'm a DBA by trade, my plan is to use an HA database they all have access to in order to do this. This is convenient because I need to record job results, etc. There are other ways - they could open a connection to each and talk amongst themselves, etc. but since I need an HA DB, I may as well use it.
The major coordination problem is "who is going to go through the schedules to fire off jobs because we don't want two people doing this". I'd also like to spread the work around since I'm paying these providers :-) The brains bid (random 1-1000000 number) for the next 5 minute block of time, and every 5 minutes, each checks to see who won the bid for this 5-minute block and takes the role. If a node drops out, it can't win the next bid and someone else will. Whoever has the role will notice if the previous failed to schedule something and will do so. Small risk that previous winner will still be running. Tied bids are ignored - if there's only 2 nodes and they tie (very small chance), no one will run for a couple minutes.
There are also MySQL GET_LOCK() based systems. Still thinking on this part. It's the fun part.
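
The bidding scheme above can be sketched roughly as follows. This leaves out the shared database where the bids would be stored, and it reads "tied bids are ignored" as "tied bids are discarded and the highest remaining bid wins", which is one interpretation of the description. All function names are made up.

```python
import random
from collections import Counter
from typing import Dict, Optional

BLOCK_SECONDS = 300  # five-minute scheduling blocks


def block_start(ts: float) -> int:
    """Epoch second at which the 5-minute block containing `ts` begins."""
    return int(ts) - int(ts) % BLOCK_SECONDS


def make_bid(rng: random.Random) -> int:
    """A brain's bid for the next block: a random number from 1 to 1000000."""
    return rng.randint(1, 1_000_000)


def pick_winner(bids: Dict[str, int]) -> Optional[str]:
    """Highest non-tied bid wins; None if there are no untied bids."""
    counts = Counter(bids.values())
    unique = {node: bid for node, bid in bids.items() if counts[bid] == 1}
    if not unique:
        return None
    return max(unique, key=unique.get)


print(pick_winner({"ny": 5, "lv": 9, "sea": 2}))  # lv holds the highest bid
print(pick_winner({"ny": 7, "lv": 7}))            # two-node tie: nobody runs
```

Every brain evaluates the same rule over the same stored bids, so they all agree on the winner without talking to each other directly. A brain that dropped out never placed a bid for the block and so cannot win it.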

So how do we get that HA brain? I want to be able to buy brains anywhere. So the real problem is the HA MySQL database they'll talk to. Some options:

Pick a premium host and just trust that they'll be up, and host MySQL on that (brains can access via firewalled 3306, etc.). E.g, an Amazon or an Azure. Or a WiredTree or a KnownHost. The problem here is the cost, and really even these great providers do sometimes go down.
Pick a mid-market host and just trust that they'll be up. DO, Linode, Vultr. The world really won't end if it's down for a while, and I'm so neurotic about backups that I'm confident I wouldn't lose any data, just time.
Construct a failover cluster. This can be done at places with a floating IP pretty easily. This would be just for the DB tier. Of course, now you're into at least $20 (two 1024MB VMs at Vultr or DO). Still a SPOF at the DC level.
Construct a Percona XtraDB cluster. I did this yesterday - Vultr in NJ, two BuyVM KVM nodes in Vegas. Works very well so far, but I admit I'm just playing now. Not sure what it'd be like if one node needed to be rebuilt, or was offline for a period of time, etc. I also setup Gluster but that probably isn't needed. The problem here is that Percona requires at least three nodes.
Use a cloud SQL service. Amazon RDS is actually not expensive - 3yr prepay for their micro with 3GB of storage and maybe 30GB of bandwidth is < $9/mo. If I didn't have to run MySQL on the brains, they could probably be 128MB LEBs :-) Google cloud SQL with their micro, same 3/30 is $13.77. It's hard to justify $20/mo when you can get $9/mo from Amazon...there's also Azure SQL but then I'd have to get into python talking to MSSQL and it's kind of hacky I suspect.
Use a cloud NoSQL service. Amazon SimpleDB is virtually free (in fact if I sign up for the free tier, it would be free for a year), but it doesn't promise strong consistency. DynamoDB does, but it is kind of a pain. Both of these would eat a lot of network bandwidth I think because my brains are going to be chatty coordinating amongst themselves. I could possibly do the coordination brain-to-brain and then only record results in the DB.
Use a LowEnd SQL service- e.g., BuyVM offloaded MySQL. But I think I want something with more of a guarantee (even though I'm a BuyVM fanboi).

I'm becoming more and more attracted to some kind of cloud database I don't have to worry about. Or just saying "we're going to assume this Vultr node over here will always be up".

Talking to Nodes

OK, so now how to tell the nodes to do work? Lots of options:

(a) Crudest: brain sftps a digitally signed message to the node. SFTP is nice because the node can chroot it. Node runs a script out of cron every X minutes and notices the incoming job order. Verifies the master's sig and then does the work, putting a result file in the same place. Brain has some idea that this job on that node usually takes X minutes, and starts checking back, eventually complaining if it hasn't heard back after X * {some number} minutes.

(b) Still crude: same thing except node sftps result to master. But that's an sshd_config/pubkey management headache.

(c) node runs a daemon and brain talks to it. Talk is either via queues (RabbitMQ) or some other protocol I dream up. This would have to be encrypted. This is the fanciest option by far but also the most fun to write.

(d) @rincewind's Dropbox idea is intriguing. The pain with Dropbox is that you have to manually register each node...I'm really addicted to "reinstall in the provider's CP, set the root password there, then run this script that takes the node from distro to my config in one command". Also there are no guarantees with Dropbox in terms of time. Not sure if there are "maximum number of clients" limits. Also, I'd have to write a job to regularly exclude all other directories...my recollection is if you add one, it's included everywhere by default.

(e) nodes run some kind of inotify-derived package that watches a directory and fires off a script once work arrives via SFTP.

(f) nodes could ask for work via RESTful API on a regular basis (RR DNS to the brains), and then report back. I kind of like this idea too but only because it'd be fun to write a RESTful API.
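
A sketch of the message handling in option (a): the brain signs a job order, the node verifies it, and the brain decides when to start complaining. Note the swap: a true digital signature needs a public-key library, so this uses HMAC-SHA256 from the standard library with a shared secret as a stand-in. The function names, message layout and the default factor of 3 are all assumptions.

```python
import hashlib
import hmac
import json
from typing import Optional


def sign_order(order: dict, key: bytes) -> bytes:
    """Serialise a job order and prepend a hex HMAC-SHA256 tag plus newline."""
    body = json.dumps(order, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, body, hashlib.sha256).hexdigest().encode("ascii")
    return tag + b"\n" + body


def verify_order(blob: bytes, key: bytes) -> Optional[dict]:
    """Return the order if the tag matches the body, otherwise None."""
    tag, sep, body = blob.partition(b"\n")
    if not sep:
        return None
    expected = hmac.new(key, body, hashlib.sha256).hexdigest().encode("ascii")
    if not hmac.compare_digest(tag, expected):
        return None
    return json.loads(body)


def is_overdue(sent_at: float, now: float, usual_minutes: float,
               factor: float = 3.0) -> bool:
    """True once the brain has waited longer than usual_minutes * factor."""
    return now - sent_at > usual_minutes * 60 * factor
```

The signed blob is what would be dropped into the node's chrooted SFTP directory; the cron script on the node calls `verify_order()` and ignores anything that returns `None`. With a shared secret, any node that can verify orders can also forge them, which is the main reason to prefer real public-key signatures here.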

As you can see, this is a mix of actual need and goofy fun.

OK, that is my world. Thoughts?

@hzr @bookstack @goldenberg @ALinuxNinja @amhoab @JonChun come back and critique.

PrincessOfCats · June 2016

I would say go either cloud DB hosting (Azure/AWS/Rackspace/etc) or host it yourself with percona xtradb.

Let's take a look at the options...:

a) Pick a premium host

Probably not the best idea, putting your eggs in one basket. For all you know, a hacked solusvm or WHMCS, maybe even a DDOS not directed at you but your neighbors would give you trouble.

b) Pick a mid-market host

Same issue as above

c) Failover cluster

You now have various queries dropping out while the DB is switching over. Why bother when you can make it an HA cluster with a bit more work?

d) Percona XtraDB Cluster

Three nodes minimum for proper functionality. An idea just occurred to me that if the memory usage is not too high, you can simply host it on the same server as the app server. Then, there will be no issue of the app not being able to reach the DB because the DB server is offline, as theoretically, the only time the local DB server would not be reachable is if the whole VPS is offline.

e) Cloud SQL

While most should be reliable, if the cloud fails, there is no backup DB server, nor can you do anything on your side. With Percona XtraDB, you can continue adding app servers with Percona even as the nodes continue dropping on the floor as long as you have nodes online.

f) Cloud NoSQL

No experience with NoSQL.

g) LowEndSQL

Same issue as a)

vimalware · June 2016

I haven't fully comprehended your large post.... but take a look at etcd for storing key-value type data in a single 'source of truth'. 3 low-ram KVMs perhaps?

It has quorum negotiation built-in, if I understand correctly.
http://thenewstack.io/about-etcd-the-distributed-key-value-store-used-for-kubernetes-googles-cluster-container-manager/

SplitIce · June 2016

@agoldenberg said:
For database I've always been a fan of Galera. Great multi master replication but requires at least 3 nodes.

Second that for RDBMS'es. Been using it in production for ~3 years. Great system.

yomero · June 2016

Galera == Percona XtraDb Cluster right? (At least the base is?)

raindog308 · June 2016

yomero said: Galera == Percona XtraDb Cluster right? (At least the base is?)

Yes.

I just noticed Google Cloud DataStore, which is free (or would be nearly so in this case given the small amounts of data). Eventual consistency but strong consistency can be forced.

It occurred to me another way to pick a master is to let the DBMS do it, if you assume the DBMS is always available. A stored procedure/DBMS job could fire every minute, monitoring heartbeats and picking a master as needed.

quicksilver03 · June 2016

I believe that Rundeck can be easily adapted to be your "brain". High availability is never trivial, but they seem to suggest the same strategies that you have identified already.

http://rundeck.org/docs/administration/scaling-rundeck.html

eva2000 · June 2016

@raindog308 for some of your tasks you might want to look into whether you can use Amazon Lambda, which would do away with the need for many servers just to run your logistics-related code/scripts.

I use Amazon Lambda in conjunction with Amazon SNS notifications for AWS Route53 DNS health checks on my geo DNS latency cluster of servers: Lambda sends the Route53 health check alerts to my Centmin Mod Slack channel on desktop and mobile devices.

raindog308 · June 2016

quicksilver03 said: I believe that Rundeck can be easily adapted to be your "brain". High availability is never trivial, but they seem to suggest the same strategies that you have identified already.

Yeah, I've been looking at Rundeck. Its HA strategy is still very immature. But the overall interface is so nice I may use it. Although the brains need to be pretty heavy (it's a java app), I believe it uses ssh to talk to nodes which is awesome.

eva2000 said: if it's possible to use Amazon Lambda

I was just looking at that yesterday :-) Along with SQS which is very handy.

I think the options at this point are:

Use some kind of pre-made brain system like RunDeck. Then the problem becomes assuring HA underneath, which at this point would be a VIP/failover type solution. Also the DB needs to be HA so either RDS or roll my own.
Write my own brain. One neat thing I found is etcd which has a nice python interface. It implements a globally consistent key/value store and has an "elect a leader" API call that implements the RAFT protocol underneath.
Rethink things into an event-oriented framework by using something like Amazon Lambda.

raindog308 · June 2016

Been working a lot on this and I think I have a final plan.

three 1GB VMs, spread geographically. At the moment my plan is:

(1) Vultr in NY (because it's the only DC that supports block storage atm)

(2) BuyVM in Las Vegas.

(3) DO in London...probably. If the transatlantic lag is too much I may go to Toronto or SF.

Percona XtraDB cluster in active-active-active
etcd as a way of electing one brain out of three to be the active scheduler
my own job scheduling code
may or may not include gluster...it's not strictly necessary

Some reasons:

I decided not to do Amazon RDS because while it's free for a year on the free tier, eventually it's going to cost $10/mo. Also I'm that sort of "backup MySQL every 15 minutes" person and the network costs would eat me alive. I'm also skeptical on network lag...if I use Percona, all reads are local and RW is 1/3 local. With RDS, every read and write goes across the network to US-West.
I liked RunDeck but found it too immature. You can only have one active scheduler and while there is the sense of failover/takeover, I had a hard time finding docs or getting questions answered. Small community I think...nice product but needs a bit more development. Also, it's Java and does not run on 512MB without a ton of swap...ran fine on 1024MB but that was without really running it hard, so I think it's designed more for a big box.
etcd is awesome. Very easy to use. Once you have a cluster running, it's trivial to write something that either reads the cluster's own internal leader election, or use its compare-and-swap global key functionality to create a leader election without the hassle of implementing your own RAFT/Paxos/etc. In my testing it's been bulletproof - drop a node out, put it back in, kill the leader, etc. all works great.
Percona has been rock solid so far in my testing. So each node can be both RW and RO, which makes using something like RR DNS and running a RESTful gateway really easy. I really don't care for the garbage that is the MySQL world and I keep finding little gotchas (e.g., how will the event scheduler behave with multiple engines? These questions don't come up in real RDBMS engines) but there isn't an alternative...Pg doesn't really do clustering all that well. And while I'm an Oracle-certified DBA, it wants a lot more than 1GB of RAM :-)
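
To illustrate the compare-and-swap leader election mentioned in the etcd point, here is a sketch against an in-memory stand-in. `TinyStore` is not etcd and does not use any etcd client API; it only mimics the two properties the technique relies on, an atomic compare-and-swap and keys that expire after a TTL. All names are made up.

```python
from typing import Dict, Optional, Tuple


class TinyStore:
    """In-memory stand-in for a consistent key/value store with TTLs."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at)

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            return None  # missing or expired
        return entry[0]

    def compare_and_swap(self, key: str, expected: Optional[str], new: str,
                         ttl: float, now: float) -> bool:
        """Set `key` to `new` only if its current value equals `expected`."""
        if self.get(key, now) != expected:
            return False
        self._data[key] = (new, now + ttl)
        return True


def try_lead(store: TinyStore, me: str, ttl: float, now: float) -> bool:
    """Claim the leader key if it is free, or renew it if we already hold it."""
    return (store.compare_and_swap("leader", None, me, ttl, now)
            or store.compare_and_swap("leader", me, me, ttl, now))


store = TinyStore()
print(try_lead(store, "ny", ttl=60, now=0))    # key is free, ny claims it
print(try_lead(store, "lv", ttl=60, now=10))   # ny still holds it
print(try_lead(store, "lv", ttl=60, now=100))  # ny never renewed, key expired
```

Each brain calls `try_lead()` more often than the TTL. The current leader keeps renewing; if it dies, the key expires and whichever brain asks next takes over. Against a real cluster the store would be replaced by the client library's equivalent calls.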

yomero · June 2016

Looking forward to your progress on this

deadbeef · June 2016

@raindog308 said:

three 1GB VMs, spread geographically.

Percona XtraDB cluster in active-active-active

Latency will kill your writes.

yomero · June 2016

deadbeef said: Latency will kill your writes.

Probably that's not his priority?

vimalware · June 2016

You'll end up re-implementing Google Borg - > kubernetes

Next up: Mesos+/Marathon for job scheduling?

deadbeef · June 2016

@yomero said:

deadbeef said: Latency will kill your writes.

Probably that's not his priority?

I too build things just to look at them, but are you sure he won't actually want to use it?
