Linux VPS for sorting large files

LoyceV Member
edited January 2021 in Requests

I'm looking for a VPS to sort 2 files once a week. Each file is around 20 GB compressed, with half a billion lines. I add some lines, then need to sort the files. It's basically one line of code piping like this: gunzip | sort | nl | sort | gzip.
I'm looking for a VPS that allows higher load once a week. I can limit it with cpuload if needed, I don't mind if it takes longer. I think a fast system would only be under load maybe an hour per week. A heavily loaded server would easily take a day. This task requires both CPU and quite a lot of temporary file writing. I don't mind if it takes a while, as long as it doesn't get me kicked off. If you have a server with less load on a certain day of the week, I can adjust to that day.
Update: the rest of the week, the server should be available for downloads. Just without any heavy load.
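
For reference, spelled out with explicit file names and the sort(1) options that matter at this size, the weekly job looks roughly like the sketch below (not my exact command; the file names, the -S buffer size and the -T temp directory are placeholders):

export LC_ALL=C                       # byte-wise comparisons are much faster for this kind of data
gunzip -c weekly.txt.gz |
  sort -S 1G -T /scratch |            # spills sorted chunks to temp files under /scratch
  nl -ba |                            # number every line
  sort -S 1G -T /scratch |
  gzip > weekly-sorted.txt.gz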

VZ Type: KVM or OpenVZ
Number of Cores: 1 or 2
RAM: depends: more RAM means less disk activity and faster sorting. I can work with 128 MB but more is better.
Disk Space: 150 GB
Disk Type: depends: SSD is faster, HDD causes more load. I can work with any.
Bandwidth: 1 TB per month should be enough
Port Speed: doesn't matter
DDoS Protection: no
Number of IPs: 1 IPv4 is enough. I can also work with just IPv6
Location: doesn't matter
Budget: crypto! Not too much though, this is only a side project that doesn't take much resources (on average)
Billing period: depends: I've been burned by several disappearing hosts in the past, other than that I prefer to pay per year.

Comments

  • lentro Member, Host Rep

    @seriesn won’t disappoint.

    https://nexusbytes.com/

  • NexusBytes is prem no doubt. Can handle any workload. Highly recommended.

    But if you need it for just an hour each week you can grab a VDS at 6 cents an hour from RamNode. Fully dedicated so you can use 100% cpu. And yes accepts crypto.

  • @lentro said:
    @seriesn won’t disappoint.

    https://nexusbytes.com/

    Thanks for the mention fam!

    @LoyceV said:
    I'm looking for a VPS to sort 2 files once a week. Each file is around 20 GB compressed, with half a billion lines. I add some lines, then need to sort the files. It's basically one line of code piping like this: gunzip | sort | nl | sort | gzip.
    I'm looking for a VPS that allows higher load once a week. I can limit it with cpuload if needed, I don't mind if it takes longer. I think a fast system would only be under load maybe an hour per week. A heavily loaded server would easily take a day. This task requires both CPU and quite a lot of temporary file writing. I don't mind if it takes a while, as long as it doesn't get me kicked off. If you have a server with less load on a certain day of the week, I can adjust to that day.

    VZ Type: KVM or OpenVZ
    Number of Cores: 1 or 2
    RAM: depends: more RAM means less disk activity and faster sorting. I can work with 128 MB but more is better.
    Disk Space: 150 GB
    Disk Type: depends: SSD is faster, HDD causes more load. I can work with any.
    Bandwidth: 1 TB per month should be enough
    Port Speed: doesn't matter
    DDoS Protection: no
    Number of IPs: 1 IPv4 is enough. I can also work with just IPv6
    Location: doesn't matter
    Budget: crypto! Not too much though, this is only a side project that doesn't take much resources (on average)
    Billing period: depends: I've been burned by several disappearing hosts in the past, other than that I prefer to pay per year.

    I wouldn’t recommend our storage server for this, however, block storage 💯

    Join the family :)

  • LoyceV Member
    edited January 2021

    @rattlecattle said:
    NexusBytes is prem no doubt. Can handle any workload. Highly recommended.

    I quickly checked the site, and it looks out of budget for this task indeed.

    But if you need it for just an hour each week you can grab a VDS at 6 cents an hour from RamNode. Fully dedicated so you can use 100% cpu. And yes accepts crypto.

    Sorry, I think my description wasn't complete:
    Update: the rest of the week, the server should be available for downloads. Just without any heavy load.
    I checked RamNode: the VDS doesn't have dedicated I/O, so they may still not appreciate my activity.

  • @LoyceV said:
    I checked RamNode: the VDS doesn't have dedicated I/O, so they may still not appreciate my activity.

    RamNode support should be able to clarify on disk I/O. Although the other day, I compiled Google Chrome from scratch on their VDS. It took 10 hours straight at 100% CPU with high disk I/O without any issue.

  • What's the budget?

  • @LoyceV said:
    I'm looking for a VPS to sort 2 files once a week. Each file is around 20 GB compressed, with half a billion lines. I add some lines, then need to sort the files. It's basically one line of code piping like this: gunzip | sort | nl | sort | gzip.
    I'm looking for a VPS that allows higher load once a week. I can limit it with cpuload if needed, I don't mind if it takes longer. I think a fast system would only be under load maybe an hour per week. A heavily loaded server would easily take a day. This task requires both CPU and quite a lot of temporary file writing. I don't mind if it takes a while, as long as it doesn't get me kicked off. If you have a server with less load on a certain day of the week, I can adjust to that day.
    Update: the rest of the week, the server should be available for downloads. Just without any heavy load.

    VZ Type: KVM or OpenVZ
    Number of Cores: 1 or 2
    RAM: depends: more RAM means less disk activity and faster sorting. I can work with 128 MB but more is better.
    Disk Space: 150 GB
    Disk Type: depends: SSD is faster, HDD causes more load. I can work with any.
    Bandwidth: 1 TB per month should be enough
    Port Speed: doesn't matter
    DDoS Protection: no
    Number of IPs: 1 IPv4 is enough. I can also work with just IPv6
    Location: doesn't matter
    Budget: crypto! Not too much though, this is only a side project that doesn't take much resources (on average)
    Billing period: depends: I've been burned by several disappearing hosts in the past, other than that I prefer to pay per year.

    Europe and USA locations are cheaper than other countries.

  • chihcherng Veteran
    edited January 2021

    Can't you eliminate the two sort commands since your input data are already sorted? You just uncompress them, add the weekly new data when suitable, and then compress the data. Something like: gunzip | [add new data with something like awk] | gzip. You won't need much CPU this way.
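
    Concretely, the idea could look something like this (a rough sketch with placeholder file names; it uses sort -m for the merge instead of awk, assuming the weekly additions are small enough to sort on their own):

    sort new-lines.txt > new-sorted.txt
    gunzip -c current.txt.gz | sort -m - new-sorted.txt | gzip > updated.txt.gz

    Neither command re-sorts the big file, so CPU and temporary disk usage stay low.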

  • MikeA Member, Patron Provider

    Is this a $10/y looking thing or $10/m thing?

  • @LoyceV said:
    Update: the rest of the week, the server should be available for downloads. Just without any heavy load.

    Are the two tasks related? I'm thinking you can have one budget VPS for downloads and another one you spin up on demand for sorting.

    If the files to be downloaded are the same compressed files being sorted, you can still use this setup and just spend the additional few minutes transferring the files.
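
    For example, once the sorting instance finishes, something like this could push the result over to the download box (host name and paths are placeholders):

    rsync -av --partial weekly-sorted.txt.gz download-box:/var/www/downloads/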

  • rm_ IPv6 Advocate, Veteran

    [add new data with something like awk]

  • darkimmortal Member
    edited January 2021

    The gzip is the factor that makes this task hard to simplify. Is there any flexibility in the choice of compression, or in the implementation (e.g. spamming flush markers)? If you could use transparent filesystem compression (e.g. btrfs), it becomes simpler still.
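
    For instance, with btrfs the files could stay as plain text while the filesystem compresses them behind the scenes (a sketch; the device and mount point are placeholders):

    mkfs.btrfs /dev/vdb
    mount -o compress=zstd /dev/vdb /data
    # anything written under /data is now compressed transparently, so sort can
    # read and write plain text there without a gzip step in the pipe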

  • raindog308 Administrator, Veteran

    @darkimmortal said: The gzip is the factor that makes this task hard to simplify. Is there any flexibility in the choice of compression, or in the implementation (e.g. spamming flush markers)? If you could use transparent filesystem compression (e.g. btrfs), it becomes simpler still.

    lzop is a very CPU-friendly compression utility/algorithm.
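
    For example, keeping the same pipeline shape but swapping the output compressor (file names and sort options are placeholders; lzop trades some compression ratio for much lower CPU):

    gunzip -c input.txt.gz | sort -S 1G -T /scratch | lzop > output.txt.lzo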

  • yoursunny Member, IPv6 Advocate
    edited January 2021

    If most of the lines are already sorted, you should first sort the new lines, then run sort -m to merge two sorted files.
    Notice that the existing data is only streamed through the merge and never written to temporary files on disk, which greatly reduces disk I/O.

    aws s3 cp s3://example/new.txt - | sort > new-sorted.txt
    sort -m \
      <(aws s3 cp s3://example/20210103.txt.gz - | gunzip) \
      new-sorted.txt \
      | gzip | aws s3 cp - s3://example/20210110.txt.gz
    

    You don't need a persistent server for this. Instead, get object storage (an S3-compatible bucket) to serve downloads, and an hourly compute instance for the weekly processing.

  • LoyceV Member
    edited January 2021

    @rattlecattle said:
    RamNode support should be able to clarify on disk I/O. Although the other day, I compiled Google Chrome from scratch on their VDS. It took 10 hours straight at 100% CPU with high disk I/O without any issue.

    I've created and funded an account. It looks promising, but I haven't had the time to do a full test-run.

    @chihcherng said:
    Can't you eliminate the two sort commands since your input data are already sorted? You just uncompress them, add the weekly new data when suitable, and then compress the data.

    That depends on the sort-order. I (once again) realize I didn't give all the details (but didn't expect to be asked about the details). One of the files is sorted in chronological order, but duplicate entries have to be removed.

    @MikeA said:
    Is this a $10/y looking thing or $10/m thing?

    If I can get the RamNode thing to work, it's going to be a $1 (one day) per month thing. I haven't found the option to automatically turn on a node for a day once a month though, that would be perfect so I can just cronjob everything.

    @CyberneticTitan said:
    Are the two tasks related? I'm thinking you can have one budget VPS for downloads and another one you spin up on demand for sorting.

    Yes, the large files should be available for downloads all (most of) the time.

    @darkimmortal said:
    The gzip is the factor that makes this task hard to simplify. Is there any flexibility in the choice of compression, or in the implementation (e.g. spamming flush markers)? If you could use transparent filesystem compression (e.g. btrfs), it becomes simpler still.

    Gzip is meant to make the files a few GB smaller to download, and to prevent browsers from trying to open a 30 GB .txt file.
    But gzip isn't the part that makes it slow.

    @yoursunny said:
    If most of the lines are already sorted, you should first sort the new lines, then run sort -m to merge two sorted files.

    You may be on to something for at least one of the files (the one that's actually sorted, not in chronological order). I may also be able to improve performance a lot by permanently storing more temporary files. I'll experiment with this too when I have time.

  • I wrote an awk script to merge two sorted text files. It might be able to consume less CPU time and disk space than the sort command. The script, merge_sorted_textfiles.awk, is as follows:

    # The second sorted file is passed in with -v new_data_file=...;
    # the main sorted file is read as normal awk input.
    # Prime the first line of new data; if there is none, just pass the input through.
    BEGIN { if ((getline new_data_line < new_data_file) <= 0) no_more_new_data = 1; }
    # New data exhausted: print the current line and everything after it, then stop.
    no_more_new_data == 1 { do { print; } while (getline); exit; }
    # Current input line sorts before (or ties with) the pending new line: print it.
    $0 < new_data_line { print; next; }
    $0 == new_data_line { print; next; }
    # Current input line sorts after the pending new line: emit new lines until we catch up.
    { print new_data_line;
      while (getline new_data_line < new_data_file) {
        if ($0 < new_data_line) { print; next; }
        if ($0 == new_data_line) { print; next; }
        print new_data_line;
      }
      print;
      no_more_new_data = 1;
    }
    # Main input ended first: flush whatever is left of the new data.
    END { if (no_more_new_data != 1)
            do { print new_data_line; }
            while (getline new_data_line < new_data_file);
    }
    

    Assuming you have two sorted text files, tmp1.txt and tmp2.txt, you can merge them and write the result to stdout by running the awk script like this:

    < tmp1.txt awk -f merge_sorted_textfiles.awk -v new_data_file=tmp2.txt

    The script compares whole input lines (as in "$0 < new_data_line"), but with awk's flexible string operations you can easily modify it to merge on the comparison of chosen data fields.

  • Nick_A Member, Top Host, Host Rep

    @LoyceV said: I've created and funded an account. It looks promising, but I haven't had the time to do a full test-run.

    Thank you!

    @LoyceV said: If I can get the RamNode thing to work, it's going to be a $1 (one day) per month thing. I haven't found the option to automatically turn on a node for a day once a month though, that would be perfect so I can just cronjob everything.

    Maybe doable with our API.

  • I never updated what I chose, so here it is: I've used RamNode for a while and tried several of their servers; the VDS turned out to be the best choice. Until I got a dedicated server donated (running projects for the community has its perks), which is what I use now (every week).

    I also tried to improve the data sorting, with some success. But it still causes a lot of disk activity. I once had the opportunity to test awk on a server with 256 GB RAM:
    awk '!a[$0]++'
    (this command removes duplicates but keeps lines in the original order)
    It turned out 256 GB isn't enough; it got about halfway.
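
    For the record, a disk-based way to do the same dedup without the huge in-memory array would be something along these lines (a sketch assuming GNU sort; the file names, -S buffer size and -T temp directory are placeholders):

    export LC_ALL=C
    gunzip -c merged.txt.gz |
      nl -ba |                          # tag every line with its original line number
      sort -t$'\t' -k2 -s -u -S 2G -T /scratch |
      sort -t$'\t' -k1,1n -S 2G -T /scratch |
      cut -f2- |                        # strip the line-number column again
      gzip > merged-dedup.txt.gz

    The first sort orders by content and (being stable with -s -u) keeps only the first copy of each duplicate line; the second sort puts the survivors back in their original, chronological order.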

    This topic can be locked.
