How to set up your own distributed, redundant, and encrypted storage grid in a few easy steps

joepie91joepie91 Member, Provider
edited November 2012 in Tutorials

If you have a few different VPSes, you'll most likely have a significant amount of unused storage space across all of them. This guide will be a quick introduction to setting up and using Tahoe-LAFS, a distributed, redundant, and encrypted storage system - some may call it 'cloud storage'.

What are the requirements?

  • At least 2 VPSes required, at least 3 VPSes recommended. More is better.
  • Each VPS should have at least 256MB RAM (for OpenVZ burstable), or 128MB RAM (for OpenVZ vSwap and other virtualization technologies with proper memory accounting).
  • Reading comprehension and an hour of your time or so :)

What is Tahoe-LAFS?

From the Tahoe-LAFS website:

Tahoe-LAFS is a Free and Open cloud storage system. It distributes your data across multiple servers. Even if some of the servers fail or are taken over by an attacker, the entire filesystem continues to function correctly, including preservation of your privacy and security.

How does Tahoe-LAFS work?

The short version: Tahoe-LAFS uses a RAID-like mechanism to store 'shares' (parts of a file) across the storage grid, according to the settings you specified. When a file is retrieved, all storage servers are asked for shares of this file, and the shares are downloaded from the servers that respond fastest. The requesting client then reconstructs the original file from those shares.
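To make the 'shares' idea concrete, here is a minimal sketch of a 2-of-3 scheme: the data is split into two halves plus an XOR parity share, and any two of the three shares are enough to rebuild it. This is not the encoding Tahoe-LAFS itself uses (it has its own erasure-coding library and encrypts the data as well); the function names are made up for this illustration, and the code is written for Python 3.

```python
from typing import Dict, List


def xor_bytes(left: bytes, right: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(left, right))


def encode_2_of_3(data: bytes) -> List[bytes]:
    """Split data into 3 shares, any 2 of which can rebuild it."""
    if len(data) % 2:
        data += b"\x00"  # pad to an even length; decoding trims this again
    half = len(data) // 2
    first, second = data[:half], data[half:]
    return [first, second, xor_bytes(first, second)]


def decode_2_of_3(shares: Dict[int, bytes], original_length: int) -> bytes:
    """Rebuild the data from any 2 shares, keyed by share number (0, 1, 2)."""
    if len(shares) < 2:
        raise ValueError("need at least 2 of the 3 shares")
    if 0 in shares and 1 in shares:
        first, second = shares[0], shares[1]
    elif 0 in shares:
        # Second half is missing: recover it from the first half and the parity.
        first = shares[0]
        second = xor_bytes(first, shares[2])
    else:
        # First half is missing: recover it from the second half and the parity.
        second = shares[1]
        first = xor_bytes(second, shares[2])
    return (first + second)[:original_length]


data = b"hello tahoe"
shares = encode_2_of_3(data)
# Pretend the server holding share 0 disappeared.
recovered = decode_2_of_3({1: shares[1], 2: shares[2]}, len(data))
print(recovered == data)
```

In the terms used later in this guide, this toy scheme has 3 total shares and needs 2 of them, so it stores 3 / 2 = 1.5 times the size of the original data.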

All shares are encrypted and checksummed; a storage server cannot read the contents of a share (or of the file it derives from), and it cannot modify a share without the change being detected.

There are (roughly) two types of files: immutable (these cannot be changed afterwards) and mutable (these can be changed). Immutable files will result in a "read capability" (an encoded string that tells Tahoe-LAFS how to find it and how to decrypt it) and a "verify capability" (that can be used for verifying or repairing the file). A mutable file will also yield a "write capability" that can be used to modify the file. This way, it is possible to have a mutable file, but restrict the write capability to yourself, while sharing the read capability with others.

There is also a pseudo-filesystem with directories; while it isn't required to use this, it makes it possible to, for example, mount part of a Tahoe-LAFS filesystem via FUSE.

For more specifics, read this documentation entry.

How do I set it up?

1. Install dependencies

Follow the below instructions for all VPSes.

To install and run Tahoe-LAFS, you will need Python (with development files), setuptools, and the usual tools for compiling software. On Debian, this can be installed by running apt-get install python python-dev python-setuptools build-essential. If you use a different distro, your package manager or package names may differ.

Python setuptools comes with a Python package manager (or installer, rather) named easy_install. We'd rather have pip as our Python package manager, so we'll install that instead: easy_install pip.

After installing pip, we'll install the last dependency we need to install manually (pip install twisted), and then we can install Tahoe-LAFS itself: pip install allmydata-tahoe.

When you're done installing all of the above, you'll have to make a new user (adduser tahoe) that you're going to use to run Tahoe-LAFS under. From this point on, run all commands as the tahoe user.
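A quick way to confirm that step 1 worked is to check that the commands it should have installed can be found on PATH. The sketch below does this from Python 3 using only the standard library; the list of command names is an assumption based on the packages above (on some systems the interpreter is called python3 rather than python).

```python
import shutil
from typing import Iterable, List


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the commands from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Commands that step 1 is expected to provide (assumed names).
required = ["python", "pip", "tahoe"]
missing = missing_tools(required)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required commands are available.")
```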

2. Setting up an introducer

First of all, you'll need an 'introducer' - this is basically the central server that all other nodes connect to, to be made aware of other nodes in the storage grid. While the storage grid will continue to function if the introducer goes down, no new nodes will be discovered, and there will be no reconnections to nodes that went down until the introducer is back up.

Preferably, this introducer should be installed on a server that is not a storage node, but it's possible to run an introducer and a storage node alongside each other.

Run the following on the VPS you wish to use as an introducer, as the tahoe user:

tahoe create-introducer ~/.tahoe-introducer
tahoe start ~/.tahoe-introducer

Your introducer should now be started successfully. Read out the file ~/.tahoe-introducer/introducer.furl and note the entire contents down somewhere. You will need this later to connect the other nodes.
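If you would rather read the FURL out with a script than copy it by hand, something like the following sketch could be used. It assumes that the FURL is a single line starting with pb://; the function names are made up for this example, and nothing runs against your node until you call read_introducer_furl yourself.

```python
from pathlib import Path


def validate_furl(text: str) -> str:
    """Strip whitespace and check that the text looks like a FURL."""
    furl = text.strip()
    # Assumption: a FURL is one line that starts with "pb://".
    if not furl.startswith("pb://") or "\n" in furl:
        raise ValueError("this does not look like an introducer FURL")
    return furl


def read_introducer_furl(node_dir: Path) -> str:
    """Read and validate introducer.furl from an introducer directory."""
    return validate_furl((node_dir / "introducer.furl").read_text())


# Example call, to be run as the tahoe user on the introducer:
# print(read_introducer_furl(Path.home() / ".tahoe-introducer"))
```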

3. Setting up storage nodes

Now it's time to set up the actual storage nodes. This will involve a little more configuration than the introducer node. On each storage node, run the following command: tahoe create-node.

If all went well, a storage node should now be created. Now edit ~/.tahoe/tahoe.cfg in your editor of choice. I will explain all the important configuration values - you can leave the rest of the values unchanged. Note that the 'shares' settings all apply to uploads from that particular server - each machine connected to the network can pick their own encoding settings.

  • nickname: The name for this particular storage node, as it will appear in the web panel.
  • introducer.furl: The FURL for the introducer node - this is the address that you noted down before.
  • shares.needed: The number of shares needed to reconstruct a file.
  • shares.happy: The number of different servers that must be available for storing shares for an upload to succeed.
  • shares.total: The total number of shares that should be created on upload. One storage node may hold more than one share, as long as it doesn't violate the shares.happy setting.
  • reserved_space: The amount of space that should be reserved for other applications on this server. Read below for more information.
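To show how these values fit together, the sketch below renders them as a config fragment using Python 3's configparser. The section names ([node], [client], [storage]) and the enabled key are assumptions about the file layout, so compare the output with the tahoe.cfg that tahoe create-node generated rather than overwriting that file. The FURL and nickname in the example are placeholders.

```python
import configparser
import io


def render_tahoe_cfg(nickname, introducer_furl, needed, happy, total,
                     reserved_space):
    """Return the settings discussed above as config-file text."""
    # interpolation=None so that a '%' in a value is taken literally.
    config = configparser.ConfigParser(interpolation=None)
    config["node"] = {"nickname": nickname}
    config["client"] = {
        "introducer.furl": introducer_furl,
        "shares.needed": str(needed),
        "shares.happy": str(happy),
        "shares.total": str(total),
    }
    # Assumed layout: storage-related options live in their own section.
    config["storage"] = {"enabled": "true", "reserved_space": reserved_space}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(render_tahoe_cfg(
    nickname="storage-node-1",                      # placeholder name
    introducer_furl="pb://PLACEHOLDER@example.org:12345/introducer",
    needed=4, happy=6, total=8,
    reserved_space="1G",
))
```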

[cont.]

Comments

  • joepie91joepie91 Member, Provider
    edited November 2012

    Reserved space

    Tahoe-LAFS has a somewhat interesting way of counting space - instead of keeping track of how much space it can use for itself, it tries to make sure that a certain amount of space stays available for other applications. In practice, this means that if another application fills up 1GB of disk space, this 1GB is subtracted from the amount of space that Tahoe-LAFS can use, not from the reserved amount. The end result is that Tahoe-LAFS is very conservative in the way it uses disk space. You can therefore typically set the amount of reserved space to a very low value like 1GB to 5GB, because by the time free space drops to that amount, you will still have plenty of time to clean up your VPS before the last gigabytes are used up by other applications.
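The accounting described above boils down to one subtraction. The sketch below is a simplified model of it (not Tahoe-LAFS code), with a made-up function name and example numbers:

```python
GB = 1000 ** 3


def usable_for_tahoe(free_bytes: int, reserved_bytes: int) -> int:
    """Space Tahoe-LAFS may still fill: free disk space minus the reserve."""
    return max(0, free_bytes - reserved_bytes)


reserved = 5 * GB

# 20GB free on the disk, 5GB reserved for other applications.
before = usable_for_tahoe(20 * GB, reserved)

# Another application writes 1GB, so only 19GB is free. The reserve stays
# at 5GB; the 1GB comes out of what Tahoe-LAFS is allowed to use.
after = usable_for_tahoe(19 * GB, reserved)

print(before // GB, after // GB)
```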

    Share settings

    At first, share settings may seem very tricky to configure correctly. My advice would be to set it as the following:

    • shares.total: about 80% of the number of servers you have available.
    • shares.happy: 2 lower than shares.total
    • shares.needed: half of shares.total

    This means that if you have for example 10 storage servers, shares.total = 8, shares.happy = 6, shares.needed = 4.

    Now you can't just set any arbitrary values here - your share settings will influence the 'expansion factor' - how many times more space you use than the file would take up on its own. You can calculate the expansion factor by doing shares.total / shares.needed - for example, with the above suggested setup the expansion factor would be 2, meaning that a 100MB file would take up 200MB of space.

    The level of redundancy can be calculated quite easily as well: the number of servers you can lose while being guaranteed to still have access to your data is shares.happy - shares.needed (this assumes the worst-case scenario). In most cases, the number of servers you can lose will be shares.total - shares.needed.
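The rules of thumb and formulas from this section can be put into a few lines of Python 3. This is only a sketch of the advice above: the rounding and the lower bounds for very small grids are assumptions added here so that the result always satisfies needed <= happy <= total.

```python
def suggested_share_settings(servers: int) -> dict:
    """Apply the 80% / minus 2 / half rules of thumb from this guide."""
    if servers < 2:
        raise ValueError("this guide assumes at least 2 servers")
    total = max(2, (servers * 8 + 5) // 10)   # about 80%, rounded
    needed = max(1, total // 2)               # half of shares.total
    happy = max(needed, total - 2)            # 2 lower than shares.total
    return {"total": total, "happy": happy, "needed": needed}


def expansion_factor(total: int, needed: int) -> float:
    """How many times more space a file takes up on the grid."""
    return total / needed


def worst_case_losses(happy: int, needed: int) -> int:
    """Servers you can lose while still being guaranteed access."""
    return happy - needed


settings = suggested_share_settings(10)
print(settings)
print(expansion_factor(settings["total"], settings["needed"]))
print(worst_case_losses(settings["happy"], settings["needed"]))
```

For 10 servers this reproduces the numbers above: 8 total shares, 6 happy, 4 needed, an expansion factor of 2, and a guaranteed tolerance of 2 lost servers.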

    4. Starting your storage nodes

    On each node, simply run the command tahoe start as the tahoe user, and you should be in business!

    5. (optional) Install a local client

    To more easily use Tahoe-LAFS, you may want to install a Tahoe-LAFS client on your local machine. To do this, you should basically follow the instructions in step 3 - however, instead of running tahoe create-node, you should run tahoe create-client. Configuring and starting works the same, but you don't need to fill in the reserved_space option (as you're not storing files).

    Using your new storage grid

    There are several ways to use your storage grid:

    Via the web interface

    Simply make sure you have a client (or storage node) installed, and point your browser at http://localhost:3456/ - you will see the web interface for Tahoe-LAFS, which will allow you to use it. The "More info" link on a directory page (or for a file) will give you the read, write, and verify capability URIs that you need to work with them using other methods.

    Using Python

    I recently started working on a Python module named pytahoe, that you can use to easily interface with Tahoe-LAFS from a Python application or shell. To install it, simply run pip install pytahoe as root - you'll need to make sure that you have libfuse/libfuse2 installed. There is no real documentation for now other than in the code itself, but the below code gives you an idea of how it works:

    >>> import pytahoe
    >>> fs = pytahoe.Filesystem()
    >>> d = fs.Directory("URI:DIR2:hnncfsbzsxv5fhdymxhycm3xc4:qjipiqg3bozb5evb6krdwfmsgks6j4ymivopgx7eoxcjb3avslqq")
    >>> d.upload("devilskitchen.tar.gz")
    
    

    The result of this is something like this.

    Mounting a directory

    You can also mount a directory as a local filesystem using FUSE (on OpenVZ, make sure your host supports FUSE). Right now, the easiest way appears to be using pytahoe (this can be done from a Python shell as well). Example:

    >>> import pytahoe
    >>> fs = pytahoe.Filesystem()
    >>> d = fs.Directory("URI:DIR2:hnncfsbzsxv5fhdymxhycm3xc4:qjipiqg3bozb5evb6krdwfmsgks6j4ymivopgx7eoxcjb3avslqq")
    >>> d.mount("/mnt/something")
    

    Via the web API

    If you're using something that is not Python, or want a bit more control over what you do, you may want to use the Tahoe-LAFS WebAPI directly - documentation for this can be found here.
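As a rough idea of what talking to the WebAPI looks like, here is a sketch in Python 3 using only the standard library. It assumes a local node listening on http://localhost:3456, that a PUT to /uri stores an immutable file and returns its capability string, and that a GET on /uri/ followed by the capability returns the contents; check the WebAPI documentation before relying on any of this. The function names are made up for this example, and nothing is sent until you call them.

```python
import urllib.parse
import urllib.request

GATEWAY = "http://localhost:3456"  # assumed address of your local node


def cap_url(cap: str, gateway: str = GATEWAY) -> str:
    """Build the URL for a capability, escaping the ':' characters in it."""
    return gateway + "/uri/" + urllib.parse.quote(cap, safe="")


def upload_immutable(data: bytes, gateway: str = GATEWAY) -> str:
    """Upload data as an immutable file and return its capability string."""
    request = urllib.request.Request(gateway + "/uri", data=data, method="PUT")
    with urllib.request.urlopen(request) as response:
        return response.read().decode("ascii").strip()


def download(cap: str, gateway: str = GATEWAY) -> bytes:
    """Fetch the contents of the file that a capability points to."""
    with urllib.request.urlopen(cap_url(cap, gateway)) as response:
        return response.read()


# Example round trip (requires a running node):
# cap = upload_immutable(b"hello grid")
# print(cap, download(cap))
```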

    Need more help?

    There's plenty more (very clear) documentation on the Tahoe-LAFS website! :)

    EDIT: For those interested in copying this guide - it's released under the WTFPL, meaning you can basically do with it whatever you want, including copying it elsewhere. Credits or a donation are both appreciated, but neither is required :)

  • I wish there was a like button on here.

    This signature is brought to you by the NSA. Spying on the entire world since 1952!

  • +1 (thanks) Nice post.

  • @TheHackBox said: I wish there was a like button on here.

    Or a thanks button

  • IshaqIshaq Member, Provider

    Nice guide, reading..

    [BudgetNode] DDoS Protected. 7 Locations (US/EU). Check out our latest offer!
  • @joepie91: is this how you set up Cryto Storage Grid?

    Catalyst Host - Pie Approved!
  • joepie91joepie91 Member, Provider

    @HalfEatenPie said: @joepie91: is this how you set up Cryto Storage Grid?

    Yes, but I also have a statistics node running.

  • @joepie91 said: Yes, but I also have a statistics node running.

    Awesome. A tip of the hat to you.

    Catalyst Host - Pie Approved!
  • InfinityInfinity Member, Provider

    Great 'un @joepie91. :)

    Cablestreet - London based ISP - Managed Solutions, Carrier Services, Colocation, Dedicated Servers, VMs, and more..

  • Interesting write up.

    Still mucking with GlusterFS and not real impressed so far.

    Perhaps I'll give Tahoe a spin :) Thanks immensely for your contribution with this tutorial.

  • Cheers @joepie, I've been meaning to set this up for a few of my projects :)

    Need to reach me quickly? Ping me on Discord

  • jcalebjcaleb Moderator

    thanks man!

  • @joepie91 great work!

  • Thanks man :)

  • Great post!!! I'll fav it and read more carefully later :)

    How many stars in your bowl? How many sorrows in your soul?
  • Thank you.

    I trust Namesilo for my domains. Use DOLLARLESS for $1 discount.
  • Great work. Going to test this.

    Serving you the best VPS, Web hosting, dedicated servers and more - Cloud Shards | Query Foundry
    We operate the network AS62638 | Available in Syd AU and Dallas, Los Angeles and NYC USA
  • Out of curiosity @joepie91, what if one of the servers suddenly just "disappear" from the network? What happens to the files?

    Catalyst Host - Pie Approved!
  • @joepie91 if you don't have a wiki/blog of your own to save this for future generations, why not copy to http://www.lowendtalk.com/wiki/

  • Good timing @joepie91 as this is what I'm setting up on all those storage vps I've been buying...

    "Go cheap on rarely used things"

  • rm_rm_ Member
    edited November 2012

    BTW I noticed Wheezy has http://packages.debian.org/wheezy/tahoe-lafs in the repo.
    Might be even easier to set up that way. And I like using the official repos much better than silly language-specific ones (PEAR, PECL, Ruby Gems, PIP/easy_install, etc)

  • joepie91joepie91 Member, Provider
    edited November 2012

    @HalfEatenPie said: Out of curiosity @joepie91, what if one of the servers suddenly just "disappear" from the network? What happens to the files?

    This doesn't really matter; if you have set up your share settings as I advised above, for example, you can usually lose half the servers before it becomes a problem. It's usually worth repairing (via a deep check) now and then if you often lose nodes, because this will redistribute shares over new nodes to meet the original settings again.

    From a practical viewpoint, I've had many (and I mean MANY) nodes disappear from my storage grid over time, and barely ever had an issue with it. If you get to the point where you have 20 shares spread over 20 nodes and you only need 10 to reconstruct the file... your storage grid is practically invincible. Just be sure to do a deep check now and then :)

    @rm_ said: BTW I noticed Wheezy has http://packages.debian.org/wheezy/tahoe-lafs in the repo.

    Might be even easier to set up that way. And I like using the official repos much better than silly language-specific ones (PEAR, PECL, Ruby Gems, PIP/easy_install, etc)

    I'm not sure who packages this, so I would be careful :)

  • Thanks for the guide @joepie91! Very interesting read.

    Fusioned | KVM SSD VPS | LSI RAID10 | Netherlands 1Gbps | R1Soft | IPv4 & IPv6 | SolusVM
  • rm_rm_ Member
    edited November 2012

    @joepie91 okay assuming I have 10 nodes with 10 GB of space each, with your recommended settings:

    • how many of those 10 can disappear with data still intact?
    • what is the amount of usable space out of the raw 10x10GB capacity?

    @joepie91 said: I'm not sure who packages this, so I would be careful :)

    Wha.... do you expect people to personally know each and every Debian Developer, and not install any packages from Debian 'main' if they don't? O.o

  • :) Prerolled packages are pretty alright. Even compiling the source for things could be filled with baddies. Not like we pre-audit stuff at compile time

    @joepie91, how much space are you combining in nodes and doing so all over internet?

  • joepie91joepie91 Member, Provider
    edited November 2012

    @rm_ said: @joepie91 okay assuming I have 10 nodes with 10 GB of space each, with your recommended settings:

    • how many of those 10 can disappear with data still intact?

    Total shares would be 8, happy would be 6, and needed would be 4 - this means you can lose 6 - 4 = 2 servers (worst-case scenario) without losing access to your data. It's likely possible to lose 3 or 4 servers (this depends on whether the servers you lose hold one share or more than one). By "losing" servers, I only mean the (max.) 8 servers that you uploaded a share to in the first place. Since your total number of servers is 10, you could lose 2 more servers without any issues if those servers happen not to hold any shares for this file.

    Summary: worst case scenario, you can lose any 2 servers. Best case scenario, you can lose 6 servers. It'll usually be somewhere in the middle.

    @rm_ said: - what is the amount of usable space out of the raw 10x10GB capacity?

    Since your expansion factor is 8 / 4 = 2, and every storage server has an equal amount of space available, you should be able to use 100 / 2 = 50GB of practical space.
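Both answers can be checked with a few lines of Python 3. This is just the arithmetic from the reply written out; the best case assumes that each of the 8 shares ended up on a different server, and the capacity figure assumes that every server offers the same amount of space.

```python
def loss_tolerance(servers: int, total: int, happy: int, needed: int):
    """Return (worst case, best case) number of servers you can lose."""
    worst = happy - needed
    # Best case: one share per server, so servers without a share can all go,
    # plus as many share-holding servers as there are spare shares.
    best = (servers - total) + (total - needed)
    return worst, best


def usable_capacity_gb(servers: int, gb_per_server: float,
                       total: int, needed: int) -> float:
    """Raw capacity divided by the expansion factor."""
    return servers * gb_per_server / (total / needed)


print(loss_tolerance(servers=10, total=8, happy=6, needed=4))
print(usable_capacity_gb(servers=10, gb_per_server=10, total=8, needed=4))
```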

    @rm_ said: Wha.... do you expect people to personally know each and every Debian Developer, and not install any packages from Debian 'main' if they don't? O.o

    No, but since one of the stated features of Tahoe-LAFS is file integrity and confidentiality, you'll want to make sure that you're not depending for that on a potentially modified version of the software :)

    @pubcrawler said: @joepie91, how much space are you combining in nodes and doing so all over internet?

    iqj5wkzuo2x3tdcjhauzsafpe5gwcojq    [name removed] CA       13.41GB
    a2bjjtujmabiwfqungzlywzyjszm2gyp    [name removed]      265.96GB
    fzu6dmqq23u2km6ywtlym4tvmtefn25b    Box     3.35GB
    oywsltqtxm6su6gu54j6bxmgh5qf6o5r    Git     4.29GB
    mbbs6staiw56f7dtyxxnzecixjoz2m2r    Haless      44.04GB
    n3fhesvxzg5mpq3gsov76lf2sdwfwo45    Konjassiem      9.16GB
    z3hc2nw2g2jjhb7vntt5z3mtdcebiho6    Arvel       7.14GB
    cqq4hmk7flrfwmlt6mldulfrc4swdrhl    Eris        26.86GB
    akd5kzq4bsmdr6yeyltaro3t2rtap5xo    [name removed]      600.95GB
    u5ygxnwa25ztku4qpubsjjahlp2pl5bp    Discordia       11.01GB
    sxbcue26orebknqpzchx5yl63ywep66n    Alba        69.10GB
    s72mw7cw3ojzki5wz7qxhxs2eex4ethf    CVM-VZ      54.00GB
    6ck5rd7g46o6kx2wxcym3ku3obwv645d    [name removed]      26.60GB
    hepqdbu7mohz6jg4uzozouotapfm74pk    [name removed] US       11.37GB
    qenkbcotohq4c4vhsfmzjmixqhj7ohww    Shi     4.45GB
    mhelfzivcdzjisxrlwkxo3rnmp5bef3m    Basket      43.67GB
    jxba3idp4epcvfughxsni5c7pprgrxkw    Aarnist     33.83GB
    5yunndzcq7a2bqvlyqjj6kxedgiymhtt    [name removed] ZNC      13.46GB
    y3hgi5fi3qdnoamemuj5qpfrnmopy5ra    equinox     5.03GB
    jyq6lzjwff3a7ijae54y3zfg2mcv2ykr    Nijaxor     48.43GB
    pu5m53joaxfdc5zwbcvzu3gv65v3wab3    Sabit       17.66GB
    
    Total free storage space: 1313.78GB
    

    The nodes are distributed geographically fairly evenly.

    The 600.95GB node is a bit lost, because it's connected to the old introducer address (which no longer exists), so I can't use that space right now. I'm having some issues tracking down the owner :)

  • @joepie91,

    Fascinating post with the storage amounts. So Tahoe doesn't care that nodes have different storage amounts available? No sort of disclaimer or worry or best case against such?

  • joepie91joepie91 Member, Provider

    @pubcrawler said: @joepie91,

    Fascinating post with the storage amounts. So Tahoe doesn't care that nodes have different storage amounts available? No sort of disclaimer or worry or best case against such?

    No, the actual amount of storage space that you have available doesn't really matter. The only caveat is that you won't be able to use up all of it in all situations - say that you, for example, have total/happy shares set to 10, but only 2 servers offer more than 30GB of space, then your ceiling for storing files will be at about 30GB - after all, at some point, you simply only have 2 servers left that have more space to store files, and that wouldn't satisfy shares.happy.
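A simplified way to model this ceiling: if every upload has to place shares on at least shares.happy different servers, uploads stop succeeding once fewer than that many servers have room left. The amount of share data each server can take before that happens is then the free space of the server in position shares.happy when sorted from largest to smallest. The sketch below (Python 3, made-up function name) assumes one share per server per upload and ignores everything else about share placement.

```python
from typing import List


def per_server_ceiling_gb(free_gb: List[float], happy: int) -> float:
    """Share data per server that fits before fewer than `happy` have room."""
    if happy < 1 or happy > len(free_gb):
        raise ValueError("shares.happy must be between 1 and the server count")
    # The happy-th largest server is the first one whose filling up leaves
    # fewer than `happy` servers with free space.
    return sorted(free_gb, reverse=True)[happy - 1]


# 10 servers, of which only 2 offer more than 30GB.
grid = [100, 80] + [30] * 8
print(per_server_ceiling_gb(grid, happy=10))
```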

  • @joepie91 also, isn't it the case that by default, nodes closest in latency terms get filled up faster on average?

    "Go cheap on rarely used things"

  • joepie91joepie91 Member, Provider

    @craigb said: @joepie91 also, isn't it the case that by default, nodes closest in latency terms get filled up faster on average?

    No. Nodes are, as far as I am aware, only chosen by latency when downloading. Uploading will happen with deterministic randomness - as it should, because if the storage servers were picked on basis of latency, it would create a single (geographical) point of failure.
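"Deterministic randomness" can be sketched as follows: sort the servers by a hash of something file-specific combined with each server's ID. The order looks random and differs from file to file, but anyone who knows the same inputs arrives at the same order. This is only an illustration of the idea in Python 3, not the actual selection algorithm in Tahoe-LAFS; the server names and the file identifier are made up.

```python
import hashlib
from typing import List


def upload_order(server_ids: List[str], file_id: bytes) -> List[str]:
    """Order servers by a hash of the file identifier plus the server ID."""
    def sort_key(server_id: str) -> bytes:
        return hashlib.sha256(file_id + server_id.encode("utf-8")).digest()
    return sorted(server_ids, key=sort_key)


servers = ["amsterdam", "dallas", "sydney", "london", "tokyo"]
print(upload_order(servers, b"file-one"))
print(upload_order(servers, b"file-two"))
```

Latency and location play no part in the order, which is the property the reply above is pointing at.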

  • @joepie91 that's good to know - thanks!

    "Go cheap on rarely used things"

  • joepie91joepie91 Member, Provider

    That being said, if you're planning on for example building a CDN with Tahoe-LAFS as backend, you'll probably want to make sure that you either have an expansion factor of at least 3, or heavy caching, so that it's likely that data can be retrieved entirely from the same geographical area as the request originates from :)

  • @joepie91 said: building a CDN with Tahoe-LAFS as backend

    I like this.

    Catalyst Host - Pie Approved!
  • RaymiiRaymii Member
    edited November 2012

    I also like this. Mirrored it at https://raymii.org/cms/p_Tahoe_LAFS_set_up_your_own_distributed_redundant_and_encrypted_storage_grid

    btw, I once also had VPSes that were part of the CCC storage grid...

    Quis custodiet ipsos custodes?
    https://raymii.org - https://cipherli.st
  • Very good, thanks for posting this. Will read properly and have a play with it myself later. Just what I've been wanting to do with a few of my boxes.

  • @joepie91 said: I'm not sure who packages this, so I would be careful :)

    Since you're using pip to install tahoe-lafs, don't you have to trust pip and whoever packaged tahoe-lafs? Either way, you have to trust someone... I'd much rather it be a Debian package, which can be verified against the original source pretty easily.

    Also, why use pip to install twisted rather than using the python-twisted Debian package?

    Overall though, great guide!

  • joepie91joepie91 Member, Provider

    @NickM said: Since you're using pip to install tahoe-lafs, don't you have to trust the pip and whoever packaged tahoe-lafs? Either way, you have to trust someone...

    The PyPI package (which pip uses) is packaged by zooko and warner, who are actually Tahoe-LAFS developers :)

    @NickM said: Also, why use pip to install twisted rather than using the python-twisted debian package?

    Because this way the dependency management is up entirely to the same package manager you're using to install Tahoe-LAFS - meaning that if for example a Twisted update is required, this can theoretically be done by pip itself. Twisted should be automatically installed when installing Tahoe-LAFS, but for some reason that dependency isn't working quite right, so I've included it as a separate installation step.

  • @joepie91 said: Because this way the dependency management is up entirely to the same package manager you're using to install Tahoe-LAFS - meaning that if for example a Twisted update is required, this can theoretically be done by pip itself. Twisted should be automatically installed when installing Tahoe-LAFS, but for some reason that dependency isn't working quite right, so I've included it as a separate installation step.

    It's almost like a gem install :P

    Quis custodiet ipsos custodes?
    https://raymii.org - https://cipherli.st
  • By the way @joepie91 your article is very popular:

    image

    Quis custodiet ipsos custodes?
    https://raymii.org - https://cipherli.st
  • Can anyone here comment on the performance of Tahoe? I'm thinking of using either Tahoe or GlusterFS, but read/write speed is important, and I can't seem to find a lot of performance-related info on Tahoe. I like the p2p design of Tahoe, but it has to be fast too.
