Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!


Server recommendation for *very heavy* CPU/disk task
New on LowEndTalk? Please Register and read our Community Rules.

Server recommendation for *very heavy* CPU/disk task

Hi - I need a bit of help with a recommendation for a server.

I have a huge amount of data in tiny JSON files that I need to load into a MySQL database. There could be nearly 100 million files that I need to parse, each about 500-1000 bytes in size.

At the moment, I need to parse through the files sequentially, so it takes a long time, but in the months ahead I should be able to handle them in parallel and split them into chunks, and then I'll be able to fully utilise multiple cores on a server. I am doing a fair bit of caching with Redis to speed things up, but this is still a huge job overall. Currently using a CPX21 with Hetzner, unfortunately only one core really being used for the actual 'parse' process.

I'm not sure what would be best for this - a high spec Ryzen VPS with maybe 4-6 cores, 4GB+ RAM, and as fast as possible NVMe disk, of at least 100GB NVMe storage space -- or should I be considering a dedicated server, so I know that the I'll be getting full use of the resources all the time? VPS prices are more attractive, but would I have full speed of the NVMe disk all the time?

I'm trying to keep costs down, as this is a bit of a hobby project at the moment - but I think being Black Friday, now is a good time to get a hefty server that can handle the job better than my current 'load box'. Ideally I would want to snag a bargain in the days ahead and get a really high spec server for EUR 100 or less, that will do the job well for me over the year ahead.

Thoughts? Thanks in advance.

Comments

  • omelasomelas Member
    edited November 25

    have a dedicate mysql server that receive from others, and a bunch of threads connect to that to insert to database as remote user?

  • I would recommend a dedicated server for consistent disk performance. Is the bottle neck disk or cpu?

    Thanked by 1fixxation
  • 1nf1nf Member

    How long will it take? You could a massive bare metal box with an hourly provider for a few hours.

    Thanked by 1fixxation
  • AXYZEAXYZE Member
    edited November 25

    Go with AX41-NVME or get preemptible VM instances from any major cloud provider if this doesnt need to run 24/7.

    AX41-NVME has 'no setup free' promo now, without promo setup fee is about 30-40euro iirc.
    https://www.hetzner.com/dedicated-rootserver/ax41-nvme/configurator

    VPSes will throttle if you will use a lot of disk for longer periods of time, even VDS ones. 500-1000byte files are very demanding on drive. If you would get AX41 then you have 2x NVMe so you can split files into two disks and get A LOT better performance than shared VPS. I dont know if standard RAID is good idea for such small files.

    Some tests even show that RAID0 will reduce performance for small file operations because of overhead.

  • cadddrcadddr Member
    edited November 25

    If it's batchjob thing rent CCX-22/higher move data in, do stuff, get it out, delete instance. 100GB data shouldn't take more than 8hr including moving data.

  • ralfralf Member
    edited November 25

    If you're just doing simple data transformation, I don't really understand why you need a new server for this...

    Most of the time taken for this task will be reading the tiny files off disk, so the processing time won't much on long of the time to copy this data to another server.

    If it was me, I'd write a perl script that takes as input a list of filenames from stdin, opens that as JSON, transforms it and output one or more INSERT statements to stdout.

    You can then do:
    find -type f basedir/ | ./transform.pl > huge.sql

    Periodically put some commits in the output every 1000 records or so, or instead of outputting to stdout, you can write to files split every 1000 or so.

    If there's any pattern to the naming, you could add an -iname to the find to restrict the output SQL to a subset of source files.

    Using find this way means that you're paying the cost of reading the directory linearly, not having to cache the directory in RAM so that it can be sorted (e.g. if you used wildcards in the shell), and so the transfer speed from disk will be about as optimal as it can be. If you're only doing simple transformations, you might even outperform the speed of a network transfer to a remote server.

    Thanked by 1fixxation
  • jsgjsg Member, Resident Benchmarker
    edited November 25

    @fixxation said:
    Hi - I need a bit of help with a recommendation for a server.

    I have a huge amount of data in tiny JSON files that I need to load into a MySQL database. There could be nearly 100 million files that I need to parse, each about 500-1000 bytes in size.

    At the moment, I need to parse through the files sequentially, so it takes a long time, but in the months ahead I should be able to handle them in parallel and split them into chunks, and then I'll be able to fully utilise multiple cores on a server.

    You can parse files in parallel just as well.

    I am doing a fair bit of caching with Redis to speed things up, but this is still a huge job overall. Currently using a CPX21 with Hetzner, unfortunately only one core really being used for the actual 'parse' process.

    That is one problem. Spread your work over multiple cores.

    I'm not sure what would be best for this - a high spec Ryzen VPS with maybe 4-6 cores, 4GB+ RAM, and as fast as possible NVMe disk, of at least 100GB NVMe storage space -- or should I be considering a dedicated server, so I know that the I'll be getting full use of the resources all the time? VPS prices are more attractive, but would I have full speed of the NVMe disk all the time?

    We are talking about 50 - 100 GB of data. Let's assume worst case (100 GB). Based on a recent benchmark I did with a decent SSD - not even NVMe - based VPS you'd need about 1 hour to read 100 GB of data in small files but the limited IOPS would slow it down to roughly 8 hrs.
    Note that being smart about IOPS there's be two routes, (a) NVMe, but that would just plus-minus double IOPS, or (b) dedicated hardware with NVMe which could make things go a lot faster (the "killer" effect with any kind of VM (VDS or VPS) is that the drives' IO is shared with dozens of other VMs).

    At the same time one should keep context in mind: it's 'a bit of a hobby project at the moment', so why would you care whether the whole process is done in 1 hour or in 2 days?

    My recommendation therefore is this: Get yourself a good VDS with 2 vCores (or even better 4 which might be feasible due to BF) - but do NOT go for best price only! Rather try to get a bargain from a reputable provider who doesn't overcrowd their nodes. NVMe would be nice but frankly a good SSD (ideally Raid 10) will do fine too; it's not a big issue whether the process takes say 8 hrs or 3 days for 'a bit of a hobby project at the moment'. Go for 4 - 8 GB of memory (decent speed DDR4 is highly preferable). AMD Zen or Intel Xeon E5-26xx v4 (or better) isn't really that important, both will do fine enough.

    I'm trying to keep costs down, as this is a bit of a hobby project at the moment - but I think being Black Friday, now is a good time to get a hefty server that can handle the job better than my current 'load box'. Ideally I would want to snag a bargain in the days ahead and get a really high spec server for EUR 100 or less, that will do the job well for me over the year ahead.

    €100 is way too high limit for 'a bit of a hobby project at the moment' - and IMO it's not needed. Cut that number in half, that will get you a really decent, damn good enough(TM) VDS (with higher spec) or a halfway decent dedi (with lower specs). Both will be sufficient.

    Thanked by 2fixxation vickybus
  • @AXYZE said:
    Go with AX41-NVME or get preemptible VM instances from any major cloud provider if this doesnt need to run 24/7.

    AX41-NVME has 'no setup free' promo now, without promo setup fee is about 30-40euro iirc.
    https://www.hetzner.com/dedicated-rootserver/ax41-nvme/configurator

    VPSes will throttle if you will use a lot of disk for longer periods of time, even VDS ones. 500-1000byte files are very demanding on drive. If you would get AX41 then you have 2x NVMe so you can split files into two disks and get A LOT better performance than shared VPS. I dont know if standard RAID is good idea for such small files.

    Some tests even show that RAID0 will reduce performance for small file operations because of overhead.

    Thanks @AXYZE - the AX41-NVMe looks good, it's a lot more RAM than I need, but the 1TB is great because the uncompressed data I need to load is roughly 700GB altogether.

    I've been on to Hetzner support asking if they are throttling this instance in particular, but not getting any straight answers - they just keep mentioning how it's shared and performance can vary etc etc. Just trying to get a straight answer out of them so I can rule-out it being a bug in my code. I highly suspect some throttling is taking place though, because the program runs so fast at start, and then slows to a halt after a a certain amount of time. Just waiting for an answer from their technical team now - if they can confirm the throttling then the AX41-NVME looks the way to go- thanks for the help.

    Thanked by 1LiliLabs
  • @ralf said:
    If you're just doing simple data transformation, I don't really understand why you need a new server for this...

    Most of the time taken for this task will be reading the tiny files off disk, so the processing time won't much on long of the time to copy this data to another server.

    If it was me, I'd write a perl script that takes as input a list of filenames from stdin, opens that as JSON, transforms it and output one or more INSERT statements to stdout.

    You can then do:
    find -type f basedir/ | ./transform.pl > huge.sql

    Periodically put some commits in the output every 1000 records or so, or instead of outputting to stdout, you can write to files split every 1000 or so.

    If there's any pattern to the naming, you could add an -iname to the find to restrict the output SQL to a subset of source files.

    Using find this way means that you're paying the cost of reading the directory linearly, not having to cache the directory in RAM so that it can be sorted (e.g. if you used wildcards in the shell), and so the transfer speed from disk will be about as optimal as it can be. If you're only doing simple transformations, you might even outperform the speed of a network transfer to a remote server.

    Thanks @ralf - I probably should have mentioned that it's not a simple data transformation - the JSON files are in about 8 different formats - so I have to differentiate between the type of data in the file, and then once I know which file type it is, I can load it - and then each file will end up being normalised into a central model in the DB, which is split up across 10 tables or so. I wish it was a straight forward load from files to a single table or two but not in this case.

  • +1 Hetzner AX41 NVMe if you get it with Samsung disks

    Thanked by 1fixxation
  • That's some great advice as well @jsg - many thanks for all the detail. Sorry, when I said EUR 100 I was thinking annually to try and get some bargain but I don't think that'll happen!

    Once all my raw data is extracted, I'm looking at about 700GB, maybe even 800GB altogether, and for the time being, I have to step through the files in sequential order by time ascending - this is probably the thing that's slowing me down the most at the moment, but I have a plan to run it in parallel in the months ahead.

    After reading your post, it just confirms that I think I really need dedicated here - I'll be running this routine several times a month, so I need consistent performance, and not to worry about throttling or resources being taken by other customers. I can do EUR 40 a month, knowing that I'll have the performance, and TBH, the specs of the AX41-NVME that the others have mentioned here (Ryzen 5 3600, 6 cores, 64GB DDR4, and 2 x 512GB NVMe seems like great value.

  • bshbsh Member

    For lowend machine, IMO you'd try this approach:
    Step 1: Parse it into SQL statement flat file
    Step 2: Use command line to import SQL statement flat file into SQLite3 or MySQL
    Step 3 (optional if Step 2 choose MySQL): Export from SQLite3 to MySQL

  • jsgjsg Member, Resident Benchmarker

    @fixxation said:
    ... for the time being, I have to step through the files in sequential order by time ascending - this is probably the thing that's slowing me down the most at the moment, but I have a plan to run it in parallel in the months ahead.

    I don't get it, what keeps you from going multi-core?

    After reading your post, it just confirms that I think I really need dedicated here - I'll be running this routine several times a month, so I need consistent performance, and not to worry about throttling or resources being taken by other customers. I can do EUR 40 a month, knowing that I'll have the performance, and TBH, the specs of the AX41-NVME that the others have mentioned here (Ryzen 5 3600, 6 cores, 64GB DDR4, and 2 x 512GB NVMe seems like great value.

    Yes, a dedi has the huge advantage of offering the full IOPS of your drive.

    But be warned: Not using Raid 1 on those 2 NVMes (which, based on what I hear, can be pretty worn out) is playing lottery. So I suggest to pay a bit more to get 2 x 1TB NVMes.

    Thanked by 1fixxation
  • TerokNorTerokNor Member
    edited November 25

    One of my projects is based on linguistics research and I have been facing some similar situations with heavy parsing.

    A few comments:

    1. For heavy loads we use Hetzner AX-101 servers with NVMe
    2. Filesystem does play role. Millions of small files -> btrfs, reiserfs and I recently I have seen some good results with ZFS
    3. IIRC MySQL LOAD DATA INFILE works faster than INSERT INTO.. by orders of magnitude so I would suggest to convert json -> csv and then LOAD DATA
    4. You may see some improvements with RAID0.
    5. You will definitely see improvements with multiple threads + nvme. We parse millions of html files in matter of seconds.

    Send me a PM if you need some help, I write parsers for fun.

    Thanked by 2fixxation kp93
  • ralfralf Member

    @fixxation said:
    Thanks @ralf - I probably should have mentioned that it's not a simple data transformation - the JSON files are in about 8 different formats - so I have to differentiate between the type of data in the file, and then once I know which file type it is, I can load it

    If they're all JSON (well, even if they're not), you can load it and then decide how to process it. There's no point reading it to decide what kind of file it is, then launching another program and reading it again. Try to do it all in the same process, so you're not constantly spawning new processes, as that will kill performance.

    Thanked by 1fixxation
  • @jsg said:
    I don't get it, what keeps you from going multi-core?

    Ahh, it's the data - there are 'parent' files, which maybe 200-500 some odd 'child' files per parent will follow, and if I don't have the initial parent file, all the child files are unused and wasted. So there are a few dependencies at play here. Not unsolvable though, if I load all the parents for the entire dataset first, then I could rewind and load the children files and use all the available cores. I expect to see a huge performance boost then, just need to get time to modify the routine.

    Thanked by 1jsg
  • yoursunnyyoursunny Member, IPv6 Advocate

    You should really get your code in order before thinking about servers.

    Go programming language will be efficient, because it can seamlessly scale to multiple cores, while not burden you with memory allocation like C++ does.
    Scripting languages such as Python or JavaScript would be too slow.

  • @TerokNor said:
    One of my projects is based on linguistics research and I have been facing some similar situations with heavy parsing.

    A few comments:

    1. For heavy loads we use Hetzner AX-101 servers with NVMe
    2. Filesystem does play role. Millions of small files -> btrfs, reiserfs and I recently I have seen some good results with ZFS
    3. IIRC MySQL LOAD DATA INFILE works faster than INSERT INTO.. by orders of magnitude so I would suggest to convert json -> csv and then LOAD DATA
    4. You may see some improvements with RAID0.
    5. You will definitely see improvements with multiple threads + nvme. We parse millions of html files in matter of seconds.

    Send me a PM if you need some help, I write parsers for fun.

    Thanks for that @TerokNor - those are some good tips. Interesting about the filesystem, I hadn't even thought about that. Funny you mention LOAD DATA INFILE - I was initially hoping to use this but there ended-up being so many dependencies, I've setting tons of foreign keys as I go so tricky to flatten everything down and do a pure csv load :(
    That's really encouraging to hear about you parsing millions of files in such a short amount of time. When I was watching my load script slow to a turtle's pace the other day, knowing it would run for weeks (or even months) then I finally convinced myself that a VPS isn't enough for this. I'll send you a PM once I get my dedicated box up and running and hopefully have some performance improvements to report. Cheers.

  • jsgjsg Member, Resident Benchmarker

    @yoursunny said:
    You should really get your code in order before thinking about servers.

    Go programming language will be efficient, because it can seamlessly scale to multiple cores, while not burden you with memory allocation like C++ does.
    Scripting languages such as Python or JavaScript would be too slow.

    No. The bottleneck seems to be IO. In such cases scripting languages are acceptable, especially for occasional or infrequent jobs.

    Thanked by 1fixxation
  • @yoursunny said: Go programming language will be efficient, because it can seamlessly scale to multiple cores, while not burden you with memory allocation like C++ does.

    What do you mean by that? Go does not allocate memory?

  • @fixxation said: I was initially hoping to use this but there ended-up being so many dependencies, I've setting tons of foreign keys as I go so tricky to flatten everything down and do a pure csv load

    This may be a performance bottleneck. Depending on the nature of your data, it may be beneficially to denormalize, at least while loading.

    Thanked by 1fixxation
  • @yoursunny - the code isn't perfect by any means, but it's a start - and for a hobby project that's better than nothing. It's nearly a year of full-time effort so now isn't the time to throw it all out and change programming languages.

    I know where the bottlenecks are all the application level and pretty sure about the hardware level, I'm almost convinced I was having disk throttling taking place on the VPS. So dedicated should be an improvement for sure. When I get time to get around to altering it for multi-core, that'll be another big boost, at least the dedicated server will be ready for it.

    Had a quick look around for other providers with dedicated Ryzen servers - seems that Hetzner still has the most competitive pricing in this space, even without a Black Friday sale. I can't find anyone that beats them on price.

  • If you want to take a smaller step before going to a dedi and multiple cores, and you want to test the IO bottleneck, you can try a netcup root server. Their performance is quite impressive for that price.

    Thanked by 1fixxation
  • @TerokNor said:
    4. You may see some improvements with RAID0.

    I don't think so, this would hurt small stuff to improve big stuff.

  • CHIA?

  • @fixxation said:
    I've been on to Hetzner support asking if they are throttling this instance in particular, but not getting any straight answers - they just keep mentioning how it's shared and performance can vary etc etc. Just trying to get a straight answer out of them so I can rule-out it being a bug in my code. I highly suspect some throttling is taking place though, because the program runs so fast at start, and then slows to a halt after a a certain amount of time. Just waiting for an answer from their technical team now - if they can confirm the throttling then the AX41-NVME looks the way to go- thanks for the help.

    Just to clarify, tbe AX41-NVMe is a dedicated server. There is no throttling in place, you can use 100% of the hardware, since it's entirely unshared.

  • Just reread your comment, my bad! Looks like you have the right idea :smile:

    Thanked by 1fixxation
  • Daniel15Daniel15 Member
    edited November 26

    I've setting tons of foreign keys

    I'd recommend not adding any indexes until after the import. A lot of small index updates will be significantly slower than just creating the index once the data is already there.

    Also make sure you're using prepared statements rather than compiling the SQL query every time, and batch inserts (ie insert lots of rows in one query, rather than one separate INSERT per row) . Having said that, IIRC LOAD DATA INFILE is faster than anything an SQL query can do.

    Maybe @raindog308 has some ideas - I think he's got a bunch of DBA experience.

Sign In or Register to comment.