Server recommendation for very heavy CPU/disk task

fixxation · November 2021

Hi - I need a bit of help with a recommendation for a server.

I have a huge amount of data in tiny JSON files that I need to load into a MySQL database. There could be nearly 100 million files that I need to parse, each about 500-1000 bytes in size.

At the moment, I need to parse through the files sequentially, so it takes a long time, but in the months ahead I should be able to handle them in parallel and split them into chunks, and then I'll be able to fully utilise multiple cores on a server. I am doing a fair bit of caching with Redis to speed things up, but this is still a huge job overall. Currently using a CPX21 with Hetzner, unfortunately only one core really being used for the actual 'parse' process.

I'm not sure what would be best for this - a high spec Ryzen VPS with maybe 4-6 cores, 4GB+ RAM, and as fast as possible NVMe disk, of at least 100GB NVMe storage space -- or should I be considering a dedicated server, so I know that the I'll be getting full use of the resources all the time? VPS prices are more attractive, but would I have full speed of the NVMe disk all the time?

I'm trying to keep costs down, as this is a bit of a hobby project at the moment - but I think being Black Friday, now is a good time to get a hefty server that can handle the job better than my current 'load box'. Ideally I would want to snag a bargain in the days ahead and get a really high spec server for EUR 100 or less, that will do the job well for me over the year ahead.

Thoughts? Thanks in advance.

omelas · November 2021

have a dedicate mysql server that receive from others, and a bunch of threads connect to that to insert to database as remote user?

rick2610 · November 2021

I would recommend a dedicated server for consistent disk performance. Is the bottle neck disk or cpu?

1nf · November 2021

How long will it take? You could a massive bare metal box with an hourly provider for a few hours.

AXYZE · November 2021

Go with AX41-NVME or get preemptible VM instances from any major cloud provider if this doesnt need to run 24/7.

AX41-NVME has 'no setup free' promo now, without promo setup fee is about 30-40euro iirc.
https://www.hetzner.com/dedicated-rootserver/ax41-nvme/configurator

VPSes will throttle if you will use a lot of disk for longer periods of time, even VDS ones. 500-1000byte files are very demanding on drive. If you would get AX41 then you have 2x NVMe so you can split files into two disks and get A LOT better performance than shared VPS. I dont know if standard RAID is good idea for such small files.

Some tests even show that RAID0 will reduce performance for small file operations because of overhead.

cadddr · November 2021

If it's batchjob thing rent CCX-22/higher move data in, do stuff, get it out, delete instance. 100GB data shouldn't take more than 8hr including moving data.

ralf · November 2021

If you're just doing simple data transformation, I don't really understand why you need a new server for this...

Most of the time taken for this task will be reading the tiny files off disk, so the processing time won't much on long of the time to copy this data to another server.

If it was me, I'd write a perl script that takes as input a list of filenames from stdin, opens that as JSON, transforms it and output one or more INSERT statements to stdout.

You can then do:
find -type f basedir/ | ./transform.pl > huge.sql

Periodically put some commits in the output every 1000 records or so, or instead of outputting to stdout, you can write to files split every 1000 or so.

If there's any pattern to the naming, you could add an -iname to the find to restrict the output SQL to a subset of source files.

Using find this way means that you're paying the cost of reading the directory linearly, not having to cache the directory in RAM so that it can be sorted (e.g. if you used wildcards in the shell), and so the transfer speed from disk will be about as optimal as it can be. If you're only doing simple transformations, you might even outperform the speed of a network transfer to a remote server.

jsg · November 2021

@fixxation said:
Hi - I need a bit of help with a recommendation for a server.

I have a huge amount of data in tiny JSON files that I need to load into a MySQL database. There could be nearly 100 million files that I need to parse, each about 500-1000 bytes in size.

At the moment, I need to parse through the files sequentially, so it takes a long time, but in the months ahead I should be able to handle them in parallel and split them into chunks, and then I'll be able to fully utilise multiple cores on a server.

You can parse files in parallel just as well.

I am doing a fair bit of caching with Redis to speed things up, but this is still a huge job overall. Currently using a CPX21 with Hetzner, unfortunately only one core really being used for the actual 'parse' process.

That is one problem. Spread your work over multiple cores.

I'm not sure what would be best for this - a high spec Ryzen VPS with maybe 4-6 cores, 4GB+ RAM, and as fast as possible NVMe disk, of at least 100GB NVMe storage space -- or should I be considering a dedicated server, so I know that the I'll be getting full use of the resources all the time? VPS prices are more attractive, but would I have full speed of the NVMe disk all the time?

We are talking about 50 - 100 GB of data. Let's assume worst case (100 GB). Based on a recent benchmark I did with a decent SSD - not even NVMe - based VPS you'd need about 1 hour to read 100 GB of data in small files but the limited IOPS would slow it down to roughly 8 hrs.
Note that being smart about IOPS there's be two routes, (a) NVMe, but that would just plus-minus double IOPS, or (b) dedicated hardware with NVMe which could make things go a lot faster (the "killer" effect with any kind of VM (VDS or VPS) is that the drives' IO is shared with dozens of other VMs).

At the same time one should keep context in mind: it's 'a bit of a hobby project at the moment', so why would you care whether the whole process is done in 1 hour or in 2 days?

My recommendation therefore is this: Get yourself a good VDS with 2 vCores (or even better 4 which might be feasible due to BF) - but do NOT go for best price only! Rather try to get a bargain from a reputable provider who doesn't overcrowd their nodes. NVMe would be nice but frankly a good SSD (ideally Raid 10) will do fine too; it's not a big issue whether the process takes say 8 hrs or 3 days for 'a bit of a hobby project at the moment'. Go for 4 - 8 GB of memory (decent speed DDR4 is highly preferable). AMD Zen or Intel Xeon E5-26xx v4 (or better) isn't really that important, both will do fine enough.

I'm trying to keep costs down, as this is a bit of a hobby project at the moment - but I think being Black Friday, now is a good time to get a hefty server that can handle the job better than my current 'load box'. Ideally I would want to snag a bargain in the days ahead and get a really high spec server for EUR 100 or less, that will do the job well for me over the year ahead.

€100 is way too high limit for 'a bit of a hobby project at the moment' - and IMO it's not needed. Cut that number in half, that will get you a really decent, damn good enough(TM) VDS (with higher spec) or a halfway decent dedi (with lower specs). Both will be sufficient.

fixxation · November 2021

@AXYZE said:
Go with AX41-NVME or get preemptible VM instances from any major cloud provider if this doesnt need to run 24/7.

AX41-NVME has 'no setup free' promo now, without promo setup fee is about 30-40euro iirc.
https://www.hetzner.com/dedicated-rootserver/ax41-nvme/configurator

VPSes will throttle if you will use a lot of disk for longer periods of time, even VDS ones. 500-1000byte files are very demanding on drive. If you would get AX41 then you have 2x NVMe so you can split files into two disks and get A LOT better performance than shared VPS. I dont know if standard RAID is good idea for such small files.

Some tests even show that RAID0 will reduce performance for small file operations because of overhead.

Thanks @AXYZE - the AX41-NVMe looks good, it's a lot more RAM than I need, but the 1TB is great because the uncompressed data I need to load is roughly 700GB altogether.

I've been on to Hetzner support asking if they are throttling this instance in particular, but not getting any straight answers - they just keep mentioning how it's shared and performance can vary etc etc. Just trying to get a straight answer out of them so I can rule-out it being a bug in my code. I highly suspect some throttling is taking place though, because the program runs so fast at start, and then slows to a halt after a a certain amount of time. Just waiting for an answer from their technical team now - if they can confirm the throttling then the AX41-NVME looks the way to go- thanks for the help.

fixxation · November 2021

@ralf said:
If you're just doing simple data transformation, I don't really understand why you need a new server for this...

Most of the time taken for this task will be reading the tiny files off disk, so the processing time won't much on long of the time to copy this data to another server.

If it was me, I'd write a perl script that takes as input a list of filenames from stdin, opens that as JSON, transforms it and output one or more INSERT statements to stdout.

You can then do:
find -type f basedir/ | ./transform.pl > huge.sql

Periodically put some commits in the output every 1000 records or so, or instead of outputting to stdout, you can write to files split every 1000 or so.

If there's any pattern to the naming, you could add an -iname to the find to restrict the output SQL to a subset of source files.

Using find this way means that you're paying the cost of reading the directory linearly, not having to cache the directory in RAM so that it can be sorted (e.g. if you used wildcards in the shell), and so the transfer speed from disk will be about as optimal as it can be. If you're only doing simple transformations, you might even outperform the speed of a network transfer to a remote server.

Thanks @ralf - I probably should have mentioned that it's not a simple data transformation - the JSON files are in about 8 different formats - so I have to differentiate between the type of data in the file, and then once I know which file type it is, I can load it - and then each file will end up being normalised into a central model in the DB, which is split up across 10 tables or so. I wish it was a straight forward load from files to a single table or two but not in this case.

xetsys · November 2021

+1 Hetzner AX41 NVMe if you get it with Samsung disks

fixxation · November 2021

That's some great advice as well @jsg - many thanks for all the detail. Sorry, when I said EUR 100 I was thinking annually to try and get some bargain but I don't think that'll happen!

Once all my raw data is extracted, I'm looking at about 700GB, maybe even 800GB altogether, and for the time being, I have to step through the files in sequential order by time ascending - this is probably the thing that's slowing me down the most at the moment, but I have a plan to run it in parallel in the months ahead.

After reading your post, it just confirms that I think I really need dedicated here - I'll be running this routine several times a month, so I need consistent performance, and not to worry about throttling or resources being taken by other customers. I can do EUR 40 a month, knowing that I'll have the performance, and TBH, the specs of the AX41-NVME that the others have mentioned here (Ryzen 5 3600, 6 cores, 64GB DDR4, and 2 x 512GB NVMe seems like great value.

bsh · November 2021

For lowend machine, IMO you'd try this approach:
Step 1: Parse it into SQL statement flat file
Step 2: Use command line to import SQL statement flat file into SQLite3 or MySQL
Step 3 (optional if Step 2 choose MySQL): Export from SQLite3 to MySQL

jsg · November 2021

@fixxation said:
... for the time being, I have to step through the files in sequential order by time ascending - this is probably the thing that's slowing me down the most at the moment, but I have a plan to run it in parallel in the months ahead.

I don't get it, what keeps you from going multi-core?

After reading your post, it just confirms that I think I really need dedicated here - I'll be running this routine several times a month, so I need consistent performance, and not to worry about throttling or resources being taken by other customers. I can do EUR 40 a month, knowing that I'll have the performance, and TBH, the specs of the AX41-NVME that the others have mentioned here (Ryzen 5 3600, 6 cores, 64GB DDR4, and 2 x 512GB NVMe seems like great value.

Yes, a dedi has the huge advantage of offering the full IOPS of your drive.

But be warned: Not using Raid 1 on those 2 NVMes (which, based on what I hear, can be pretty worn out) is playing lottery. So I suggest to pay a bit more to get 2 x 1TB NVMes.

TerokNor · November 2021

One of my projects is based on linguistics research and I have been facing some similar situations with heavy parsing.

A few comments:

For heavy loads we use Hetzner AX-101 servers with NVMe
Filesystem does play role. Millions of small files -> btrfs, reiserfs and I recently I have seen some good results with ZFS
IIRC MySQL LOAD DATA INFILE works faster than INSERT INTO.. by orders of magnitude so I would suggest to convert json -> csv and then LOAD DATA
You may see some improvements with RAID0.
You will definitely see improvements with multiple threads + nvme. We parse millions of html files in matter of seconds.

Send me a PM if you need some help, I write parsers for fun.

ralf · November 2021

@fixxation said:
Thanks @ralf - I probably should have mentioned that it's not a simple data transformation - the JSON files are in about 8 different formats - so I have to differentiate between the type of data in the file, and then once I know which file type it is, I can load it

If they're all JSON (well, even if they're not), you can load it and then decide how to process it. There's no point reading it to decide what kind of file it is, then launching another program and reading it again. Try to do it all in the same process, so you're not constantly spawning new processes, as that will kill performance.

fixxation · November 2021

@jsg said:
I don't get it, what keeps you from going multi-core?

Ahh, it's the data - there are 'parent' files, which maybe 200-500 some odd 'child' files per parent will follow, and if I don't have the initial parent file, all the child files are unused and wasted. So there are a few dependencies at play here. Not unsolvable though, if I load all the parents for the entire dataset first, then I could rewind and load the children files and use all the available cores. I expect to see a huge performance boost then, just need to get time to modify the routine.

yoursunny · November 2021

You should really get your code in order before thinking about servers.

Go programming language will be efficient, because it can seamlessly scale to multiple cores, while not burden you with memory allocation like C++ does.
Scripting languages such as Python or JavaScript would be too slow.

fixxation · November 2021

@TerokNor said:
One of my projects is based on linguistics research and I have been facing some similar situations with heavy parsing.

A few comments:

For heavy loads we use Hetzner AX-101 servers with NVMe

Filesystem does play role. Millions of small files -> btrfs, reiserfs and I recently I have seen some good results with ZFS

IIRC MySQL LOAD DATA INFILE works faster than INSERT INTO.. by orders of magnitude so I would suggest to convert json -> csv and then LOAD DATA

You may see some improvements with RAID0.

You will definitely see improvements with multiple threads + nvme. We parse millions of html files in matter of seconds.

Send me a PM if you need some help, I write parsers for fun.

Thanks for that @TerokNor - those are some good tips. Interesting about the filesystem, I hadn't even thought about that. Funny you mention LOAD DATA INFILE - I was initially hoping to use this but there ended-up being so many dependencies, I've setting tons of foreign keys as I go so tricky to flatten everything down and do a pure csv load
That's really encouraging to hear about you parsing millions of files in such a short amount of time. When I was watching my load script slow to a turtle's pace the other day, knowing it would run for weeks (or even months) then I finally convinced myself that a VPS isn't enough for this. I'll send you a PM once I get my dedicated box up and running and hopefully have some performance improvements to report. Cheers.

jsg · November 2021

@yoursunny said:
You should really get your code in order before thinking about servers.

Go programming language will be efficient, because it can seamlessly scale to multiple cores, while not burden you with memory allocation like C++ does.
Scripting languages such as Python or JavaScript would be too slow.

No. The bottleneck seems to be IO. In such cases scripting languages are acceptable, especially for occasional or infrequent jobs.

TerokNor · November 2021

@yoursunny said: Go programming language will be efficient, because it can seamlessly scale to multiple cores, while not burden you with memory allocation like C++ does.

What do you mean by that? Go does not allocate memory?

TerokNor · November 2021

@fixxation said: I was initially hoping to use this but there ended-up being so many dependencies, I've setting tons of foreign keys as I go so tricky to flatten everything down and do a pure csv load

This may be a performance bottleneck. Depending on the nature of your data, it may be beneficially to denormalize, at least while loading.

fixxation · November 2021

@yoursunny - the code isn't perfect by any means, but it's a start - and for a hobby project that's better than nothing. It's nearly a year of full-time effort so now isn't the time to throw it all out and change programming languages.

I know where the bottlenecks are all the application level and pretty sure about the hardware level, I'm almost convinced I was having disk throttling taking place on the VPS. So dedicated should be an improvement for sure. When I get time to get around to altering it for multi-core, that'll be another big boost, at least the dedicated server will be ready for it.

Had a quick look around for other providers with dedicated Ryzen servers - seems that Hetzner still has the most competitive pricing in this space, even without a Black Friday sale. I can't find anyone that beats them on price.

TerokNor · November 2021

If you want to take a smaller step before going to a dedi and multiple cores, and you want to test the IO bottleneck, you can try a netcup root server. Their performance is quite impressive for that price.

TimboJones · November 2021

@TerokNor said:
4. You may see some improvements with RAID0.

I don't think so, this would hurt small stuff to improve big stuff.

codelock · November 2021

CHIA?

fluffernutter · November 2021

@fixxation said:
I've been on to Hetzner support asking if they are throttling this instance in particular, but not getting any straight answers - they just keep mentioning how it's shared and performance can vary etc etc. Just trying to get a straight answer out of them so I can rule-out it being a bug in my code. I highly suspect some throttling is taking place though, because the program runs so fast at start, and then slows to a halt after a a certain amount of time. Just waiting for an answer from their technical team now - if they can confirm the throttling then the AX41-NVME looks the way to go- thanks for the help.

Just to clarify, tbe AX41-NVMe is a dedicated server. There is no throttling in place, you can use 100% of the hardware, since it's entirely unshared.

fluffernutter · November 2021

Just reread your comment, my bad! Looks like you have the right idea

Daniel15 · November 2021

I've setting tons of foreign keys

I'd recommend not adding any indexes until after the import. A lot of small index updates will be significantly slower than just creating the index once the data is already there.

Also make sure you're using prepared statements rather than compiling the SQL query every time, and batch inserts (ie insert lots of rows in one query, rather than one separate INSERT per row) . Having said that, IIRC LOAD DATA INFILE is faster than anything an SQL query can do.

Maybe @raindog308 has some ideas - I think he's got a bunch of DBA experience.

Howdy, Stranger!

Categories

In this Discussion

Server recommendation for very heavy CPU/disk task

Comments

Howdy, Stranger!

Quick Links

Categories

In this Discussion

Server recommendation for *very heavy* CPU/disk task

Comments

Server recommendation for very heavy CPU/disk task