Linux VPS for sorting large files
I'm looking for a VPS to sort two files once a week. Each file is around 20 GB compressed, with half a billion lines. I add some lines, then need to sort the files. It's basically a one-line pipeline: gunzip | sort | nl | sort | gzip.
I'm looking for a VPS that allows a higher load once a week. I can throttle it with cpulimit if needed; I don't mind if it takes longer. I think a fast system would only be under load for maybe an hour per week, while a heavily loaded server could easily take a day. The task needs both CPU and quite a lot of temporary file writing. I don't mind if it takes a while, as long as it doesn't get me kicked off. If you have a server with less load on a certain day of the week, I can adjust to that day.
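For reference, one weekly run looks roughly like this (file names, buffer size and temp path are just examples, and nice/ionice are only there to keep the box responsive for others):

  nice -n 19 ionice -c3 sh -c '
    gunzip -c data.txt.gz \
      | sort -S 512M -T /var/tmp/sortwork \
      | nl \
      | sort -S 512M -T /var/tmp/sortwork \
      | gzip > data.sorted.txt.gz
  '

The -S flag caps how much RAM each sort may use, and -T points its temporary merge files at whatever disk has room.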
Update: the rest of the week, the server should be available for downloads. Just without any heavy load.
VZ Type: KVM or OpenVZ
Number of Cores: 1 or 2
RAM: depends: more RAM means less disk activity and faster sorting. I can work with 128 MB but more is better.
Disk Space: 150 GB
Disk Type: depends: SSD is faster, HDD causes more load. I can work with any.
Bandwidth: 1 TB per month should be enough
Port Speed: doesn't matter
DDoS Protection: no
Number of IPs: 1 IPv4 is enough. I can also work with just IPv6
Location: doesn't matter
Budget: crypto! Not too much though; this is only a side project that doesn't use many resources (on average)
Billing period: depends. I've been burned by several disappearing hosts in the past; other than that, I prefer to pay per year.
Comments
@seriesn won’t disappoint.
https://nexusbytes.com/
NexusBytes is prem no doubt. Can handle any workload. Highly recommended.
But if you need it for just an hour each week, you can grab a VDS at 6 cents an hour from RamNode. Fully dedicated, so you can use 100% CPU. And yes, they accept crypto.
Thanks for the mention fam!
I wouldn’t recommend our storage server for this; block storage, however, 💯
Join the family
I quickly checked the site, and it does look out of budget for this task.
Sorry, I think my description wasn't complete:
Update: the rest of the week, the server should be available for downloads. Just without any heavy load.
I checked RamNode: the VDS doesn't have dedicated I/O, so they may still not appreciate my activity.
RamNode support should be able to clarify the disk IO. That said, the other day I compiled Google Chrome from scratch on their VDS; it took 10 hours straight at 100% CPU with heavy disk IO, without any issue.
What's the budget?
Europe and USA locations are cheaper than those in other countries.
Can't you eliminate the two sort commands since your input data are already sorted? You just uncompress them, add the weekly new data when suitable, and then compress the data. Something like: gunzip | [add new data with something like awk] | gzip. You won't need much CPU this way.
Is this a $10/y kind of thing or a $10/m thing?
Are the two tasks related? I'm thinking you can have one budget VPS for downloads and another one you spin up on demand for sorting.
If the files to be downloaded are the same compressed files you sort, you can also just use this structure and spend the additional few minutes to transfer the files.
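Moving the result over afterwards is a one-liner anyway (hostname and path made up):

  rsync -av data.sorted.txt.gz downloads.example.com:/var/www/files/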
The gzip step is what makes this task hard to simplify. Is there any flexibility in the choice of compression, or in the implementation (e.g. spamming flush markers)? If you could use transparent filesystem compression (e.g. btrfs), it would become simpler still.
lzop is a very CPU-friendly compression utility/algorithm.
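For example, as a drop-in replacement for the gzip step (just a sketch; whether the download side is happy with .lzo files is another question):

  gunzip -c data.txt.gz | sort -T /var/tmp | nl | sort -T /var/tmp | lzop -c > data.txt.lzo
  lzop -dc data.txt.lzo | head    # decompressing is just as easy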
If most of the lines are already sorted, you can first sort just the new lines, then run sort -m to merge the two sorted files.
Notice that the existing data is streamed straight through and never written to temporary files, which greatly reduces disk IO.
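As a sketch, assuming the weekly additions sit in a small new.txt:

  sort new.txt -o new.sorted.txt
  gunzip -c data.txt.gz \
    | sort -m - new.sorted.txt \
    | gzip > data.merged.txt.gz

The -m flag only merges already-sorted inputs, so no temporary files are written for the big file.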
You don't need a persistent server for this. Instead, get block storage (an S3-compatible bucket) to serve downloads, and an hourly compute instance for the weekly processing.
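Roughly, the weekly job on the throwaway instance would be (bucket name, endpoint and the aws CLI are just placeholders; s3cmd or rclone work the same way):

  aws s3 cp s3://my-bucket/data.txt.gz . --endpoint-url https://s3.example.com
  gunzip -c data.txt.gz | sort -T /var/tmp | nl | sort -T /var/tmp | gzip > data.new.txt.gz
  aws s3 cp data.new.txt.gz s3://my-bucket/data.txt.gz --endpoint-url https://s3.example.com

Downloads are then served straight from the bucket for the rest of the week, and the instance is only billed for the hours it actually runs.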
I've created and funded an account. It looks promising, but I haven't had the time to do a full test-run.
That depends on the sort order. I realize (once again) I didn't give all the details, but I didn't expect to be asked about them. One of the files is sorted in chronological order, but duplicate entries have to be removed.
If I can get the RamNode thing to work, it's going to be a $1 (one day) per month thing. I haven't found an option to automatically turn a node on for one day a month, though; that would be perfect, since I could then cronjob everything.
Yes, the large files should be available for downloads all (most of) the time.
Gzip is meant to make the files a few GB smaller to download, and to prevent browsers from trying to open a 30 GB txt file.
But gzip isn't the part that makes it slow.
You may be on to something, at least for one of the files (the one that's actually sorted, not the chronological one). I may also be able to improve performance a lot by permanently storing more of the temporary files. I'll experiment with this too when I have time.
I wrote an awk script to merge two sorted text files. It might be able to consume less CPU time and disk space than the sort command. The script, merge_sorted_textfiles.awk, is as follows:
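A minimal version, reading the second file through the new_data_file variable and comparing whole lines as described below, looks roughly like this (a sketch; the original script may differ in detail):

  # merge_sorted_textfiles.awk -- merge sorted stdin with the sorted file named
  # in new_data_file, writing the merged result to stdout
  BEGIN {
      more_new = (getline new_data_line < new_data_file) > 0
  }
  {
      # print pending new-data lines that do not sort after the current input line
      while (more_new && !($0 < new_data_line)) {
          print new_data_line
          more_new = (getline new_data_line < new_data_file) > 0
      }
      print
  }
  END {
      # flush whatever is left of the new-data file
      while (more_new) {
          print new_data_line
          more_new = (getline new_data_line < new_data_file) > 0
      }
  }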
Assuming you have two sorted text files, tmp1.txt and tmp2.txt, you can merge them and write the result to stdout by invoking the awk script as follows:
< tmp1.txt awk -f merge_sorted_textfiles.awk -v new_data_file=tmp2.txt
The script compares input data on the whole string (as in "$0 < new_data_line"). But with awk's flexible string operations, you can easily modify the script to merge based on the comparison of chosen data fields.
Thank you!
Maybe doable with our API.
I never updated what I chose, so here it is: I used RamNode for a while; I tried several of their servers and the VDS turned out to be the best choice. That lasted until I got a dedicated server donated (running projects for the community has its perks), which is what I use now (every week).
I also tried to improve the data sorting, with some success, but it still causes a lot of disk activity. I once had the opportunity to test awk on a server with 256 GB RAM:
awk '!a[$0]++'
(this command removes duplicates but keeps lines in the original order)
It turned out 256 GB isn't enough; it got about halfway through.
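Every distinct line ends up as a key in the array a, so memory grows with the amount of unique data; that's why even 256 GB runs out on these files. The same order-preserving dedup can be pushed to disk instead (untested sketch, GNU sort assumed): number the lines, keep the first copy of each line, then restore the original order.

  gunzip -c data.txt.gz \
    | awk '{ print NR "\t" $0 }' \
    | sort -k2 -s -u -T /var/tmp/sortwork \
    | sort -k1,1n -T /var/tmp/sortwork \
    | cut -f2- \
    | gzip > deduped.txt.gz

With a stable GNU sort, -u keeps the first of the lines whose key compares equal, which here is the earliest (chronological) occurrence.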
This topic can be locked.