Linux VPS for sorting large files
I'm looking for a VPS to sort two files once a week. Each file is around 20 GB compressed, with half a billion lines. I add some lines, then need to sort the files. It's basically a one-line pipeline: gunzip | sort | nl | sort | gzip.
I'm looking for a VPS that allows a higher load once a week. I can throttle it with cpulimit if needed; I don't mind if it takes longer. I think a fast system would only be under load for maybe an hour per week, while a heavily loaded server could easily take a day. The task needs both CPU and quite a lot of temporary file writing. I don't mind if it takes a while, as long as it doesn't get me kicked off. If you have a server with less load on a certain day of the week, I can adjust to that day.
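For reference, one weekly run looks roughly like this (file names, buffer size and temp path are just examples, and nice/ionice are only there to keep the box responsive for others):

  nice -n 19 ionice -c3 sh -c '
    gunzip -c data.txt.gz \
      | sort -S 512M -T /var/tmp/sortwork \
      | nl \
      | sort -S 512M -T /var/tmp/sortwork \
      | gzip > data.sorted.txt.gz
  '

The -S flag caps how much RAM each sort may use, and -T points its temporary merge files at whatever disk has room.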
Update: the rest of the week, the server should be available for downloads. Just without any heavy load.
VZ Type: KVM or OpenVZ
Number of Cores: 1 or 2
RAM: depends: more RAM means less disk activity and faster sorting. I can work with 128 MB but more is better.
Disk Space: 150 GB
Disk Type: depends: SSD is faster, HDD causes more load. I can work with any.
Bandwidth: 1 TB per month should be enough
Port Speed: doesn't matter
DDoS Protection: no
Number of IPs: 1 IPv4 is enough. I can also work with just IPv6
Location: doesn't matter
Budget: crypto! Not too much though; this is only a side project that doesn't use many resources (on average)
Billing period: depends. I've been burned by several disappearing hosts in the past; other than that, I prefer to pay per year.
Comments
@seriesn won’t disappoint.
https://nexusbytes.com/
NexusBytes is prem no doubt. Can handle any workload. Highly recommended.
But if you need it for just an hour each week, you can grab a VDS at 6 cents an hour from RamNode. Fully dedicated, so you can use 100% CPU. And yes, they accept crypto.
Thanks for the mention fam!
I wouldn’t recommend our storage server for this; block storage, however, 💯
Join the family
I quickly checked the site, and it does look out of budget for this task.
Sorry, I think my description wasn't complete:
Update: the rest of the week, the server should be available for downloads. Just without any heavy load.
I checked RamNode: the VDS doesn't have dedicated I/O, so they may still not appreciate my activity.
RamNode support should be able to clarify the disk IO. That said, the other day I compiled Google Chrome from scratch on their VDS; it took 10 hours straight at 100% CPU with heavy disk IO, without any issue.
What's the budget?
Europe and USA locations are cheaper than those in other countries.
Can't you eliminate the two sort commands since your input data are already sorted? You just uncompress them, add the weekly new data when suitable, and then compress the data. Something like: gunzip | [add new data with something like awk] | gzip. You won't need much CPU this way.
Is this a $10/y kind of thing or a $10/m thing?
Are the two tasks related? I'm thinking you can have one budget VPS for downloads and another one you spin up on demand for sorting.
If the files to be downloaded are the same compressed files you sort, you can also just use this structure and spend the additional few minutes to transfer the files.
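Moving the result over afterwards is a one-liner anyway (hostname and path made up):

  rsync -av data.sorted.txt.gz downloads.example.com:/var/www/files/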
The gzip step is what makes this task hard to simplify. Is there any flexibility in the choice of compression, or in the implementation (e.g. spamming flush markers)? If you could use transparent filesystem compression (e.g. btrfs), it would become simpler still.
lzop is a very CPU-friendly compression utility/algorithm.
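For example, as a drop-in replacement for the gzip step (just a sketch; whether the download side is happy with .lzo files is another question):

  gunzip -c data.txt.gz | sort -T /var/tmp | nl | sort -T /var/tmp | lzop -c > data.txt.lzo
  lzop -dc data.txt.lzo | head    # decompressing is just as easy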
If most of the lines are already sorted, you can first sort just the new lines, then run sort -m to merge the two sorted files.
Notice that the existing data is streamed straight through and never written to temporary files, which greatly reduces disk IO.
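As a sketch, assuming the weekly additions sit in a small new.txt:

  sort new.txt -o new.sorted.txt
  gunzip -c data.txt.gz \
    | sort -m - new.sorted.txt \
    | gzip > data.merged.txt.gz

The -m flag only merges already-sorted inputs, so no temporary files are written for the big file.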
You don't need a persistent server for this. Instead, get block storage (an S3-compatible bucket) to serve downloads, and an hourly compute instance for the weekly processing.
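Roughly, the weekly job on the throwaway instance would be (bucket name, endpoint and the aws CLI are just placeholders; s3cmd or rclone work the same way):

  aws s3 cp s3://my-bucket/data.txt.gz . --endpoint-url https://s3.example.com
  gunzip -c data.txt.gz | sort -T /var/tmp | nl | sort -T /var/tmp | gzip > data.new.txt.gz
  aws s3 cp data.new.txt.gz s3://my-bucket/data.txt.gz --endpoint-url https://s3.example.com

Downloads are then served straight from the bucket for the rest of the week, and the instance is only billed for the hours it actually runs.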
I've created and funded an account. It looks promising, but I haven't had the time to do a full test-run.
That depends on the sort order. I realize (once again) I didn't give all the details, but I didn't expect to be asked about them. One of the files is sorted in chronological order, but duplicate entries have to be removed.
If I can get the RamNode thing to work, it's going to be a $1 (one day) per month thing. I haven't found an option to automatically turn a node on for one day a month, though; that would be perfect, since I could then cronjob everything.
Yes, the large files should be available for downloads all (most of) the time.
Gzip is meant to make the files a few GB smaller to download, and to prevent browsers from trying to open a 30 GB txt file.
But gzip isn't the part that makes it slow.
You may be on to something, at least for one of the files (the one that's actually sorted, not the chronological one). I may also be able to improve performance a lot by permanently storing more of the temporary files. I'll experiment with this too when I have time.
I wrote an awk script to merge two sorted text files. It might be able to consume less CPU time and disk space than the sort command. The script, merge_sorted_textfiles.awk, is as follows:
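A minimal version, reading the second file through the new_data_file variable and comparing whole lines as described below, looks roughly like this (a sketch; the original script may differ in detail):

  # merge_sorted_textfiles.awk -- merge sorted stdin with the sorted file named
  # in new_data_file, writing the merged result to stdout
  BEGIN {
      more_new = (getline new_data_line < new_data_file) > 0
  }
  {
      # print pending new-data lines that do not sort after the current input line
      while (more_new && !($0 < new_data_line)) {
          print new_data_line
          more_new = (getline new_data_line < new_data_file) > 0
      }
      print
  }
  END {
      # flush whatever is left of the new-data file
      while (more_new) {
          print new_data_line
          more_new = (getline new_data_line < new_data_file) > 0
      }
  }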
Assuming you have two sorted text files, tmp1.txt and tmp2.txt, you can merge them and write the result to stdout by invoking the awk script as follows:
< tmp1.txt awk -f merge_sorted_textfiles.awk -v new_data_file=tmp2.txt
The script compares input data on the whole string (as in "$0 < new_data_line"). But with awk's flexible string operations, you can easily modify the script to merge based on the comparison of chosen data fields.
Thank you!
Maybe doable with our API.
I never updated what I chose, so here it is: I used RamNode for a while; I tried several of their servers and the VDS turned out to be the best choice. That lasted until I got a dedicated server donated (running projects for the community has its perks), which is what I use now (every week).
I also tried to improve the data sorting, with some success, but it still causes a lot of disk activity. I once had the opportunity to test awk on a server with 256 GB RAM:
awk '!a[$0]++'
(this command removes duplicates but keeps lines in the original order)
It turned out 256 GB isn't enough; it got about halfway through.
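Every distinct line ends up as a key in the array a, so memory grows with the amount of unique data; that's why even 256 GB runs out on these files. The same order-preserving dedup can be pushed to disk instead (untested sketch, GNU sort assumed): number the lines, keep the first copy of each line, then restore the original order.

  gunzip -c data.txt.gz \
    | awk '{ print NR "\t" $0 }' \
    | sort -k2 -s -u -T /var/tmp/sortwork \
    | sort -k1,1n -T /var/tmp/sortwork \
    | cut -f2- \
    | gzip > deduped.txt.gz

With a stable GNU sort, -u keeps the first of the lines whose key compares equal, which here is the earliest (chronological) occurrence.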
This topic can be locked.