Galera Extreme Geography Test (or, The Problem With Queues)

raindog308 Administrator, Veteran

Following up on this thread, I put Galera to the test a bit.

I'm specifically testing the job queue part of my application. Clients talk to one of three clustered nodes. Each node runs a web server (a golang app) that provides an API. Underneath, the web server is talking to a local MariaDB 10.1 DB, multi-mastered to the others via Galera.

I learned:

  • job queues are a "known hard" problem in database design.
  • bad schema design cannot be overcome by horsepower in all cases

For my testing, I set up two sets of nodes. In the first, all three nodes were in one DC. In the second, they were spread around the globe.

Client Setup

I built a client that does the following:

  • picks a random brain (1 of 3)
  • performs one of three operations:

    • a count on its queue
    • a retrieve from its queue (I preseeded thousands of fake test jobs)
    • makes a result and adds it to a results queue
  • lather, rinse, repeat 100 times

Retrievals are sub-kilobyte JSON but results are often multi-kilobytes due to logging.

I ran 8 of these workers simultaneously on a VM in Vultr's Silicon Valley DC.
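For flavor, each worker looks roughly like the sketch below. The node addresses, URL paths, and JSON payloads are invented for illustration; the real client speaks my app's actual API.

package main

import (
    "bytes"
    "log"
    "math/rand"
    "net/http"
    "sync"
)

// worker is a rough sketch of one test client: 100 iterations, each against
// a randomly chosen node, doing one of the three operations described above.
func worker(id int, nodes []string) {
    for i := 0; i < 100; i++ {
        base := nodes[rand.Intn(len(nodes))] // pick a random brain (1 of 3)

        var resp *http.Response
        var err error
        switch rand.Intn(3) {
        case 0: // count the jobs waiting in its queue
            resp, err = http.Get(base + "/queue/test/count")
        case 1: // retrieve one of the preseeded fake jobs (sub-KB JSON)
            resp, err = http.Get(base + "/queue/test/next")
        default: // make a result and add it to the results queue (often multi-KB)
            body := bytes.NewBufferString(`{"job_id": 42, "status": "done", "log": "..."}`)
            resp, err = http.Post(base+"/results", "application/json", body)
        }
        if err != nil {
            log.Printf("worker %d: %v", id, err) // for the test, just record the failure
            continue
        }
        resp.Body.Close()
    }
}

func main() {
    nodes := []string{"http://node1:8080", "http://node2:8080", "http://node3:8080"}
    var wg sync.WaitGroup
    for w := 0; w < 8; w++ { // 8 simultaneous workers
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            worker(id, nodes)
        }(w)
    }
    wg.Wait()
}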

System Config/Performance

On the DB side, even the smallest 512MB nodes didn't go below ~200MB free RAM under load. And they're doing other stuff - mail, DNS, etc.

MySQL was definitely busy but not insane - maybe topping out at 60-ish percent. I never saw my golang web server near the top of top - go figure, a language built by Google is really, really good at highly concurrent web services. I'm using julienschmidt's httprouter.
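As a rough sketch of the server side (the handler path, table, and column names here are invented, not my real API or schema), an httprouter endpoint backed by the node-local MariaDB looks something like this:

package main

import (
    "database/sql"
    "fmt"
    "log"
    "net/http"

    _ "github.com/go-sql-driver/mysql"
    "github.com/julienschmidt/httprouter"
)

var db *sql.DB

// countJobs answers the "count on its queue" API call from the local DB node.
func countJobs(w http.ResponseWriter, r *http.Request, ps httprouter.Params) {
    var n int
    err := db.QueryRow(
        "SELECT COUNT(*) FROM job_meta WHERE queue = ? AND status = 'ready'",
        ps.ByName("queue")).Scan(&n)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    fmt.Fprintf(w, `{"count": %d}`, n)
}

func main() {
    var err error
    // each web server talks only to the MariaDB instance on its own node;
    // Galera handles replication to the other two
    db, err = sql.Open("mysql", "app:secret@tcp(127.0.0.1:3306)/jobs")
    if err != nil {
        log.Fatal(err)
    }
    router := httprouter.New()
    router.GET("/queue/:queue/count", countJobs)
    log.Fatal(http.ListenAndServe(":8080", router))
}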

For the DB servers, some notes:

  • tx_isolation is set to READ-COMMITTED

  • innodb_buffer_pool_size is tuned down to 32MB because these are small boxes. I also tested with 128M. It sped up the LAN scenario but didn't change the WAN scenario much.

  • for the WAN scenario, I set all the wsrep_provider_options to the max I could find in terms of tolerance and window, essentially taking the Galera docs' WAN tuning recommendations (a rough sketch follows this list). Honestly, I didn't see a lot of difference.
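Something along these lines; the wsrep values are the sort of thing the Galera WAN-tuning docs suggest, not necessarily the exact numbers I ended up running:

[mysqld]
transaction-isolation   = READ-COMMITTED
innodb_buffer_pool_size = 32M    # also tested with 128M

# WAN scenario only: stretch the evs timeouts/windows per the Galera WAN docs
wsrep_provider_options = "evs.keepalive_period=PT3S; evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M; evs.install_timeout=PT1M; evs.send_window=512; evs.user_send_window=512"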

Scenario 1:

3 nodes, all at DO SFO2 on 512M VMs. A typical result:

start time          : Fri Oct 21 15:30:53 PDT 2016
end time            : Fri Oct 21 15:30:59 PDT 2016
start epoch         : 1477089053
end epoch           : 1477089059
elapsed             : 6 seconds

Total Errors: 16
Error 1317 Query execution was interrupted        : 3
Error 1213 Deadlock found when trying to get lock : 3
Error 1205 Lock wait timeout exceeded             : 0
Error 1028 Sort aborted: Query exec interrupted   : 10

That is, of the 800 executions, 784 succeeded and 16 failed with errors. For this test, I just let them fail...in the real world, the client's config has the list of DB nodes, so it'll retry through each node.

I was very disappointed because the nodes here are in the same DC talking over a private network. There's no way to make it faster unless I went to a single-node/failover kind of config. I was expecting that if there was some locking, the results would slow down...and truth be told, if 800 operations took 60 seconds that would be fine for my needs. But it appears that contention was sufficient that galera needed to start rejecting commits...bummer...

I ran some tests wherein the client interjected a random 1...1000 millisecond pause after each operation to spread them out a bit, but it didn't make a difference.

So if it doesn't work in a single DC, no point in trying multiple DCs. But since I already have it built:

Scenario 2:

3 nodes: Frankfurt (DO 512M), Las Vegas (BuyVM 1GB), Tokyo (Vultr 768M).

Global, baby! Worst inter-node ping was on the order of 230-250ms.

Weirdly...this ran very similarly to Scenario 1. Maybe the problem isn't in the latency...

Intermission: Lessons Are Learned

So then I went and read a bit. This article was illuminating because it nailed every single problem I saw!

http://www.codeproject.com/Articles/110931/Building-High-Performance-Queue-in-Database-for-st

I changed my DDL to more closely match the second section of code (alas, MySQL lacks the SQL support to do what SQL Server can as he later shows).

This article also had a neat UPDATE trick...and also points out that you probably have a database-backed queue without knowing it:

https://blog.engineyard.com/2011/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you
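The general shape of that claim-a-row idea (my exact SQL differs, and job_meta, owner, etc. are made-up names) is to grab a job with one small UPDATE instead of a SELECT ... FOR UPDATE, then read back only what you claimed:

package queue

import "database/sql"

// claimJob takes one ready job for this worker with a single small UPDATE,
// then reads back the id it claimed. Table and column names are illustrative.
// Assumes a worker has at most one 'taken' job at a time.
func claimJob(db *sql.DB, workerID string) (int64, error) {
    res, err := db.Exec(
        `UPDATE job_meta
            SET status = 'taken', owner = ?
          WHERE status = 'ready' AND owner IS NULL
          ORDER BY id
          LIMIT 1`, workerID)
    if err != nil {
        return 0, err
    }
    if n, _ := res.RowsAffected(); n == 0 {
        return 0, sql.ErrNoRows // nothing ready
    }
    var id int64
    err = db.QueryRow(
        `SELECT id FROM job_meta WHERE owner = ? AND status = 'taken'`,
        workerID).Scan(&id)
    return id, err
}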

So I reran Scenario 1:

start time          : Sat Oct 22 10:35:59 PDT 2016
end time            : Sat Oct 22 10:36:04 PDT 2016
start epoch         : 1477157759
end epoch           : 1477157764
elapsed             : 5 seconds

Total Errors: 0
Error 1317 Query execution was interrupted        : 0
Error 1213 Deadlock found when trying to get lock : 0
Error 1205 Lock wait timeout exceeded             : 0
Error 1028 Sort aborted: Query exec interrupted   : 0

Woot! Running it in a continuous loop, I was eventually able to spot a single pair of 1213 deadlocks but that was far beyond any kind of performance I'll need.

And then rerunning Scenario 2:

start time          : Sat Oct 22 10:52:45 PDT 2016
end time            : Sat Oct 22 10:53:23 PDT 2016
start epoch         : 1477158765
end epoch           : 1477158803
elapsed             : 38 seconds

Total Errors: 2
Error 1317 Query execution was interrupted        : 0
Error 1213 Deadlock found when trying to get lock : 2
Error 1205 Lock wait timeout exceeded             : 0
Error 1028 Sort aborted: Query exec interrupted   : 0

Another run's time was similar with:

Total Errors: 1
Error 1317 Query execution was interrupted        : 0
Error 1213 Deadlock found when trying to get lock : 1
Error 1205 Lock wait timeout exceeded             : 0
Error 1028 Sort aborted: Query exec interrupted   : 0

And in some runs there were no errors at all.

I'm OK with rare, documented errors, and a 1213 deadlock is how galera resolves conflict in its "optimistic locking" model. It's when it gets so bad it cascades into other errors (sort aborted, etc.) that it becomes difficult to determine what's happening.
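For completeness, a minimal sketch of the retry-through-each-node idea mentioned earlier (not my actual client code; it just shows treating 1213 and 1317 as retryable and moving on to the next node):

package queue

import (
    "database/sql"

    "github.com/go-sql-driver/mysql"
)

// execWithFailover tries a statement against each configured node in turn,
// moving on when Galera rejects it with a deadlock (1213) or an interrupted
// query (1317). A sketch only, not the production client.
func execWithFailover(dbs []*sql.DB, query string, args ...interface{}) error {
    var lastErr error
    for _, db := range dbs { // one *sql.DB per cluster node, in config order
        _, err := db.Exec(query, args...)
        if err == nil {
            return nil
        }
        if myErr, ok := err.(*mysql.MySQLError); ok &&
            (myErr.Number == 1213 || myErr.Number == 1317) {
            lastErr = err
            continue // transient conflict: try the next node
        }
        return err // anything else is a real failure
    }
    return lastErr
}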

Performance

Single DC, private network, SSD, etc.: 800 mixed read/write operations in 5 or 6 seconds.

3 different continents, 3 different providers, all over the public internet, pretty much the worst 3-node cluster you could build from a performance perspective: 800 mixed read/write operations in 38-40 seconds.

However, it's fairer to say that "8 clients can simultaneously execute 100 jobs in a row"...I didn't push further to find out if perhaps 24 clients could have done 2400 jobs in that time.

Note that's end-to-end time - marshaling the json, sending it, web overhead, db response, network, etc.

Now off to my next madcap adventure.

Comments

  • Thank you very much for sharing your findings.

    A first question. What is DDL?

  • vpsGOD Member, Host Rep

    @raindog308 thanks for the report. I'll keep an eye on this when I happen to be working on something similar.

    @yomero DDL: data definition language

  • vpsGOD said: @yomero DDL: data definition language

    Found that, but didn't get what raindog means by it. Thanks.

  • raindog308 Administrator, Veteran

    yomero said: Found that, but didn't get what raindog means by it. Thanks.

    I meant I restructured my schema to look more like the one in the article: one table to hold the bulky JSON, and a meta/meta-archive table to hold status info.
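    Illustrative only, not my exact DDL, but the flavor of the split looks like this:

    package queue

    import "database/sql"

    // Illustrative only: small, hot status rows in one table and the bulky
    // JSON in another, keyed by the same id. The meta table also gets a
    // *_archive twin (same shape) for finished jobs.
    const createMeta = `
    CREATE TABLE job_meta (
        id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        queue  VARCHAR(64) NOT NULL,
        status ENUM('ready','taken','done') NOT NULL DEFAULT 'ready',
        owner  VARCHAR(64) NULL,
        KEY (queue, status)
    ) ENGINE=InnoDB`

    const createPayload = `
    CREATE TABLE job_payload (
        id   BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        body MEDIUMTEXT NOT NULL
    ) ENGINE=InnoDB`

    func createTables(db *sql.DB) error {
        for _, ddl := range []string{createMeta, createPayload} {
            if _, err := db.Exec(ddl); err != nil {
                return err
            }
        }
        return nil
    }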

  • First, thanks for sharing your work and findings. It is a great opportunity to really learn about different ways of doing things (nicely).

    I had a few thoughts:

    a) Since you're working around "crashes" of the DB instances, does having the logs on a tmpfs (i.e. RAM) make sense? It should speed up things considerably. You do have spare RAM it seems.

    b) I don't know your workloads and transaction boundaries, but another thought is that if the job queues are transient, you could have them sitting in RAM-only tables, which should again give you orders-of-magnitude speedups - but if a crash happens, well...

    c) A completely different approach is of course to have only the queue part sit in a separate (memory- or disk-backed) queue (RabbitMQ etc.) instead of a DB (or a NoSQL-type DB), with a simple consumer thread that "gives" you a work item from the queue. Being all memory-backed, this will be really fast, freeing up your DB for "real" work.

    d) Do take a look at https://github.com/pingcap/tidb (I've not used it but it does look interesting and has elements of what you've used/liked like etcd, raft etc.).

    Just my $0.02.

    Thanks.

  • raindog308 Administrator, Veteran

    Thanks @nullnothere. I agree that something like RabbitMQ is probably a better approach. I really had no idea that queues in a DB were such a hard/interesting problem. RabbitMQ is actually very cool, though not WAN-oriented as far as HA goes.

    I'll check out tidb. I've also run across rqlite and CockroachDB. rqlite is its own thing, while CockroachDB implements Postgres-compatible SQL.

  • @raindog308
    This was an interesting read, thanks for that.

    Sadly, I think MySQL is overkill for my type of site (and a waste of resources).
    I like rqlite; it seems perfect. The only question I have: since it uses SQLite, does rqlite still have the write-lock limitation (database locked while writing)?

  • Thanks @raindog308, those linked articles were enjoyable reads, and it was fun to see it in action with your tests. The trick of normalizing out big blobs makes sense. Did you implement archiving of completed jobs via flag + background task?
