Storing and Serving 10's of Millions of Images

BigB Member
edited February 2020 in Help

I have a collection of ~30 million images I need for AI and statistical analysis for a thesis I'm working on. The images are quite small and currently take up about 1.5TB in total.

Currently I use a Hetzner NVMe Ryzen server with 2x 1TB NVMe in RAID 0 (I do have a daily external backup).

All images are named after their SHA-1 hash and put into subfolders like "ab/cd/ef/gh/abcdefgh.jpg" to prevent too many files in one directory.
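To illustrate, the sharding works roughly like this sketch (the storage root is a placeholder, and I'm assuming the stored filename is the full hex digest):

```python
import hashlib
from pathlib import Path

ROOT = Path("/data/images")  # placeholder storage root

def shard_path(data: bytes, ext: str = ".jpg") -> Path:
    """Map image bytes to an ab/cd/ef/gh/<sha1>.jpg style path."""
    digest = hashlib.sha1(data).hexdigest()
    # the first four 2-character pairs become the directory levels
    return ROOT.joinpath(digest[0:2], digest[2:4], digest[4:6], digest[6:8],
                         digest + ext)

def store(data: bytes) -> Path:
    """Write an image to its sharded location and return the path."""
    path = shard_path(data)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```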

All metadata I generate is stored in a MySQL database on the same server.

I access the files either locally (some analysis is done with Python on the server itself) or externally: I set up an nginx server to serve images over HTTPS for some applications.

I back up the server using incremental rsync.
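Incremental rsync backups are typically done with --link-dest, which hard-links files that are unchanged since the previous snapshot, so each snapshot only costs the space of what changed. A minimal sketch of that approach (paths are placeholders; the exact setup here may differ):

```python
import datetime
import subprocess
from pathlib import Path

SRC = "/data/images/"                # trailing slash: sync directory contents
DEST_ROOT = Path("/backup/images")   # placeholder backup target
DEST_ROOT.mkdir(parents=True, exist_ok=True)

today = datetime.date.today().isoformat()
snapshots = sorted(p for p in DEST_ROOT.iterdir() if p.is_dir())

cmd = ["rsync", "-a", "--delete"]
if snapshots:
    # hard-link anything unchanged since the most recent snapshot
    cmd.append(f"--link-dest={snapshots[-1]}")
cmd += [SRC, str(DEST_ROOT / today)]
subprocess.run(cmd, check=True)
```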

So much for the background. Now to my question:

Since I'm running out of space, and having everything on RAID 0 is a little worrying, I want to offload all images to a separate server.

I want either something like RAID 10 for higher IOPS, or an SSD cache in front of RAID 1 or 5. With just a single hard drive, many small images are simply too slow: a single spinning disk only manages on the order of 100-200 random reads per second, so small-file workloads quickly become seek-bound.

Any ideas on improving image storage?

Is there something that does SSD caching automatically?

What should I look out for? Do you know of any offers?

Storage: 3TB+, location: EU, bandwidth: ~5-10TB at 1 Gbit peak. CPU + RAM are not that important.

Budget: < $150 (the cheaper the better).

This server would be needed until October.

Thanks a lot.

Comments

  • Never use RAID 1 or RAID 5 here, because you only get the write speed of one drive; for writes it's the same as having a single drive (only read speeds improve). Besides, RAID 1/5 with big drives is just too risky: if one drive breaks down you are looking at a long rebuild period, during which performance heavily degrades, and if the other drive fails as well you'll lose all your data.

    The best option for this is ZFS mirroring with SSD caching. If you're interested and need help with this, I'll send you a PM if that's OK :smile:

  • vfuse Member, Host Rep
    edited February 2020

    You can ask Hetzner to add a 3.8TB datacenter SSD for €38/month; these datacenter drives can handle a lot more IOPS than consumer drives.

    I ran a server with millions of small files, and the only way to go is SSD/NVMe; besides, nowadays you don't pay that much more for SSD/NVMe.

  • edited February 2020

    Did you try https://www.hetzner.com/storage/storage-box?

    Free internal traffic to Hetzner
    Usable as a network drive (Samba?)

    Try the BX10 for €3.59.

  • @vfuse said:
    You can ask Hetzner to add a 3.8TB datacenter SSD for €38/month; these datacenter drives can handle a lot more IOPS than consumer drives.

    That is a very good deal, I will look into that.

  • @marvel said:
    ZFS mirroring with SSD caching

    How does caching work in ZFS?
    I need the newest files and some of the most accessed ones in cache. Is that doable?
    Thanks

  • @vfuse said:
    You can ask Hetzner to add a 3.8TB datacenter SSD for €38/month; these datacenter drives can handle a lot more IOPS than consumer drives.

    I ran a server with millions of small files, and the only way to go is SSD/NVMe; besides, nowadays you don't pay that much more for SSD/NVMe.

    It's a good option, but you need at least RAID 1, so you need two drives.

  • marvel Member
    edited February 2020

    @BigB said:

    @marvel said:
    ZFS mirroring with SSD caching

    How does caching work in ZFS?
    I need the newest files and some of the most accessed ones in cache. Is that doable?
    Thanks

    Yes, ZFS caches the most-accessed data in memory first (the ARC); when that fills up, it spills over to the SSD (the L2ARC). It tracks which blocks are used most recently and most frequently and keeps those cached ahead of time. It's truly awesome and cheap, because you can use spinning disks and add as many SSDs and as much memory as you want. You can also easily expand the pool.

    It's also faster than any hardware RAID solution if you have good CPUs. And last but not least, it's self-healing, so it's great for data integrity. I've been running some ZFS pools for more than 7 years now, without a single corrupt file or any lost data.
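    For the curious: the RAM cache is the ARC and the SSD cache is the L2ARC. On a Linux box with ZFS-on-Linux you can watch how well they're doing with a small script like this sketch (the path and counter names are the standard ones the zfs kernel module exposes):

    ```python
    #!/usr/bin/env python3
    """Print ZFS ARC (RAM) and L2ARC (SSD) cache hit ratios."""

    ARCSTATS = "/proc/spl/kstat/zfs/arcstats"  # Linux ZFS kstat interface

    def read_arcstats(path=ARCSTATS):
        stats = {}
        with open(path) as f:
            for line in f:
                parts = line.split()
                # data lines look like: "<name> <type> <value>"
                if len(parts) == 3 and parts[2].isdigit():
                    stats[parts[0]] = int(parts[2])
        return stats

    def hit_ratio(hits, misses):
        total = hits + misses
        return hits / total if total else 0.0

    s = read_arcstats()
    print(f"ARC   hit ratio: {hit_ratio(s['hits'], s['misses']):.1%}")
    print(f"L2ARC hit ratio: {hit_ratio(s['l2_hits'], s['l2_misses']):.1%}")
    ```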

  • BigBBigB Member
    edited February 2020

    OK, I will do some benchmarking; IOPS might be an issue.
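    For a quick sanity check of random small-file reads, a rough sketch like this could work (the image root is a placeholder, and the numbers will be inflated by the page cache unless you drop it first):

    ```python
    import random
    import time
    from pathlib import Path

    ROOT = Path("/data/images")  # placeholder: wherever the images live
    SAMPLE = 2000

    # Reservoir-sample paths while walking the tree (rglob over 30M files
    # is slow; sampling paths from the MySQL metadata would be faster).
    paths = []
    for i, p in enumerate(ROOT.rglob("*.jpg")):
        if len(paths) < SAMPLE:
            paths.append(p)
        elif random.random() < SAMPLE / (i + 1):
            paths[random.randrange(SAMPLE)] = p

    random.shuffle(paths)
    start = time.perf_counter()
    for p in paths:
        p.read_bytes()
    elapsed = time.perf_counter() - start
    print(f"{len(paths) / elapsed:.0f} random reads/sec")
    ```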

  • Multiple load-balanced Hetzner VPSes with Block Storage?
