Storing and Serving Tens of Millions of Images
I have a collection of ~30 million images that I need for AI and statistical analysis for a thesis I'm working on. The images are quite small and currently take up about 1.5TB in total.
Currently I use a Hetzner NVMe Ryzen server with 2x 1TB NVMe in RAID 0 (I do have a daily external backup).
All images are named after their SHA-1 hash and put into subfolders like "ab/cd/ef/gh/abcdefgh.jpg" to prevent too many files in one directory.
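A minimal sketch of that layout in Python, in case it helps anyone reading along. The function names `sharded_path` and `sha1_of_file`, the `/data/images` root and the defaults (four levels, two characters each, `.jpg`) are assumptions taken from the example path above, not the actual script used on the server.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in chunks to keep memory flat."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def sharded_path(root, digest, ext=".jpg", levels=4, width=2):
    """Build root/ab/cd/ef/gh/<digest><ext> from the leading characters of digest.

    levels and width are hypothetical defaults matching the "ab/cd/ef/gh" example.
    """
    if len(digest) < levels * width:
        raise ValueError("digest too short for the requested sharding")
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(root).joinpath(*parts, digest + ext)


if __name__ == "__main__":
    # Hypothetical root directory; prints the path for the example name above.
    print(sharded_path("/data/images", "abcdefgh"))
```

Note that four two-character hex levels give 256^4 (about 4.3 billion) possible leaf directories, so with ~30 million files most leaves hold a single file. Two levels (65,536 leaves, roughly 460 files each) would probably already keep directories small, but that is a judgment call.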
All metadata I generate is stored in a MySQL DB on the same server.
I access the files either locally (some analysis is done with Python on the server itself) or externally: I set up an nginx server to serve images over HTTPS for some applications.
I back up the server using incremental rsync.
So much for the background. Now to my question:
Since I'm running out of space and having it on RAID 0 is a little bit worrying, I want to offload all images to a separate server.
I either want something like RAID 10 for higher IOPS, or an SSD cache with RAID 1 or 5. With just a single hard drive, reading many small images is just too slow.
Any ideas on improving image storage?
Is there something that does SSD caching automatically?
What should I look out for? Do you know any offers?
Storage 3TB+, location EU, BW: ~5TB-10TB @ 1 Gbit peak. CPU + RAM not that important.
Budget < $150 (the cheaper the better)
This server would be needed until October.
Thanks a lot.
Comments
Never use RAID1 or RAID5, because you will only get the write speed of one drive; it's just the same as having one drive (only read speeds will improve). Besides, RAID1/5 with big drives is just too risky: if one breaks down you are looking at a long rebuild period, and the other might fail as well and you'll lose all your data. Performance will heavily degrade during that time as well.
The best option for this is ZFS mirroring with SSD caching. If you're interested and need help with this, I've sent you a PM, if that's OK.
You can ask Hetzner to add a 3.8TB datacenter SSD for 38 euro/month. These datacenter drives can handle a lot more IOPS than consumer drives.
I ran a server with millions of small files and the only way to go is SSD/NVMe. Besides, nowadays you don't pay that much more for SSD/NVMe.
Did you try https://www.hetzner.com/storage/storage-box?
Free internal traffic to Hetzner
Useable as network drive (SAMBA?)
Try BX10 for € 3.59
That is a very good deal, I will look into that.
How does caching work in ZFS?
I need the newest files and some of the most accessed ones in cache. Is that doable?
Thanks
It's a good option but you need at least RAID1. So you need two drives.
Yes, so ZFS will cache the most accessed files in memory first; if that's full it will use the SSD. It kinda predicts which files you'll access most and caches those in advance. It's truly awesome and cheap, because you can use spinning disks and add as many SSDs and/or as much memory as you want. You can also easily expand the pool.
It's also faster than any hardware RAID solution if you have good CPUs. And last but not least, it's self-healing, so it's great for data integrity. I've been running some ZFS pools for more than 7 years now, without a single corrupt file or lost data.
OK, will do some benchmarking. IOPS might be an issue.
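For a rough first number, something like the following Python sketch could time random reads of small files on a candidate server. The function name `benchmark_random_reads`, the sample size and the seed are all made up for illustration; a dedicated tool such as fio gives far more trustworthy IOPS figures.

```python
import random
import time


def benchmark_random_reads(paths, sample_size=1000, seed=0):
    """Read a random sample of files in full and report files/second.

    paths is a list of file paths; sample_size and seed are arbitrary defaults.
    """
    rng = random.Random(seed)
    sample = rng.sample(paths, min(sample_size, len(paths)))
    total_bytes = 0
    start = time.perf_counter()
    for p in sample:
        with open(p, "rb") as f:
            total_bytes += len(f.read())
    elapsed = time.perf_counter() - start
    return {
        "files": len(sample),
        "bytes": total_bytes,
        "seconds": elapsed,
        "files_per_second": len(sample) / elapsed if elapsed > 0 else float("inf"),
    }
```

Keep in mind that files read recently are likely served from the OS page cache (or the ZFS cache), so a second run over the same sample will look much faster than the disks really are. Sampling from the full 30 million paths, or using a different seed per run, makes cold reads more likely.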
Multiple, load balanced Hetzner VPSes with Block Storage?