
Providers - what happens when a node blows up?

raindog308 Administrator, Veteran
edited May 2012 in General

Curious how smaller vendors (not the Amazons of this world) handle node failures.

I assume (good) providers use enterprise-grade gear with RAID disks, redundant NICs, redundant power supplies, etc. But what if a node goes down anyway - a CPU failure, human error, a software problem, whatever.

At work, we use VMware and vMotion for HA and have other (ridiculously expensive) HA systems. Do the various open source virtualization systems (OvZ, Xen, KVM) offer something like that? Can you pick up a VPS and move it from one node to another over the network? I'm guessing that if you do, the VPS is down for the duration of the move?

Of course, that doesn't help if the node just crashes - I imagine you'd need some sort of SAN to overcome that.

I get the sense that the really small players rent a dedicated server from someone, stick a virt layer on it, and start selling VPSes without a lot of thought about what happens if the node fails.

I'm also assuming that better providers either keep spare parts/nodes at the DC, or live near it...but if the box goes down, the VPSes on it might be down for as long as it takes to fix that box.

Really just curious...

Comments

  • rds100 Member
    edited May 2012

    You can live-migrate an OpenVZ container to another node about 95% of the time, with no downtime and no reboot of the VPS. The drawback is that both nodes must be working. In some rare cases (e.g. if ntpd or another process using POSIX timers runs inside the container) live migration doesn't work; then you can do an offline migration instead, which means the VPS reboots, but other than that the downtime is minimal.
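
    A minimal sketch of how that looks from the source node - assuming OpenVZ's standard vzmigrate tool; the container ID and destination hostname are hypothetical:

    ```python
    import subprocess

    def migrate_container(ctid: int, dest: str, online: bool = True) -> bool:
        """Try a live (online) migration; fall back to offline if it fails."""
        cmd = ["vzmigrate"]
        if online:
            cmd.append("--online")  # live migration: no reboot of the container
        cmd += [dest, str(ctid)]
        result = subprocess.run(cmd)
        if result.returncode != 0 and online:
            # e.g. ntpd or another POSIX-timer user inside the container;
            # offline migration reboots the VPS but keeps downtime minimal
            return migrate_container(ctid, dest, online=False)
        return result.returncode == 0

    # migrate_container(101, "node2.example.com")  # hypothetical values
    ```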

  • Damian Member
    edited May 2012

    @raindog308 said: I'm also assuming that better providers either keep spare parts/nodes at the DC, or live near it...but if the box goes down, the VPSes on it might be down for as long as it takes to fix that box.

    This is the approach we take. We keep spare hard drives (really annoying), PSUs, and NICs on hand, and we're on good terms with a local Dell/HP supplier for anything else that might burn up (CPU, RAM, etc.). I'm pushing the rest of the team to move away from HP/Dell hardware and assemble our own nodes from standard components, which we could buy from any store that sells computer parts, unlike vendor-specific parts.

    On the other hand, even with this planning, it takes 45 minutes to drive to the datacenter, so that's at minimum 45 minutes of downtime. Too long. We're moving to a datacenter that offers remote hands to mitigate this.

    And even further, don't forget about infrastructure failures, like I discussed here: http://www.lowendtalk.com/discussion/2211/when-shopping-for-a-datacenter

  • Infinity Member, Host Rep

    If a node blows up then it is usually in lots of pieces. Trololol. </really-crap-joke>

    I'd guess this is an issue for providers with rented servers, but then if a node's hardware goes wrong it's up to the server's provider to recover data etc. It depends on the individual provider and their terms of service.

    For providers who colo, it would depend on the individual provider - which is the point of this thread: to see what different providers do.

  • FRCorey Member

    I keep backups of the VMs themselves and just copy them over to Amazon S3. I just wish SolusVM had a more robust backup system with more choices.
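
    For what it's worth, a minimal sketch of that kind of off-node copy job, assuming boto3; the dump directory and bucket name are hypothetical (this isn't SolusVM's own backup system, just a cron-able script):

    ```python
    import os
    import boto3

    def backup_vm_dumps(dump_dir: str, bucket: str) -> None:
        """Upload every VM dump file in dump_dir to S3, keyed by filename."""
        s3 = boto3.client("s3")
        for name in os.listdir(dump_dir):
            path = os.path.join(dump_dir, name)
            if os.path.isfile(path):
                s3.upload_file(path, bucket, f"vm-backups/{name}")

    # backup_vm_dumps("/var/lib/vz/dump", "my-offsite-backups")  # hypothetical
    ```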

  • KMyers Member

    @Infinity said: If a node blows up then it is usually in lots of pieces. Trololol.

    Yes, and the fire department typically leaves it in a state of disrepair.

    Honestly, this has only happened to me once: the motherboard started to smoke. It was a leased server, so they just migrated the hard disks to a new one.

  • subigo Member

    I've never seen a node crash so hard you couldn't just migrate clients off of it first. There are almost always errors and warning signs before something bad happens. If it did happen, I'd just have another server set up, restore managed users from backup, and set unmanaged users back up in their original state (if they have a backup, I'll restore it). These days, though, I normally tell everyone they need their own restorable backups if they want something restored.
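
    One way to automate that kind of early-warning check - a sketch assuming smartmontools is installed; the device list is hypothetical, and matching smartctl's summary line this way is a simplification:

    ```python
    import subprocess

    def drive_healthy(device: str) -> bool:
        """Return True if smartctl reports the drive's overall health as PASSED."""
        out = subprocess.run(
            ["smartctl", "-H", device],
            capture_output=True, text=True,
        )
        return "PASSED" in out.stdout

    for dev in ("/dev/sda", "/dev/sdb"):  # hypothetical device list
        if not drive_healthy(dev):
            print(f"{dev}: time to migrate clients off this node")
    ```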

  • Jacob Member

    If you're colocating, it's generally a good idea to keep exact spares/replacements of the original parts. It's always good to keep a spare chassis, and especially spare PSUs if you don't have a dual-PSU setup.

    I've noticed some strange things on our KS2 (Kansas) node and I'm beginning to think it's about to flip... Bob is going to switch it to a new chassis tomorrow and run a quick test on all the parts.

  • William Member

    Pull out the HDDs, put them in a spare, boot up, done.
    30 minutes of downtime.

  • Kairus Member

    @Infinity said: I'd guess this is an issue for providers with rented servers

    That's the one thing I liked about renting servers (not for VPS nodes). I had some strange kernel panics on one FreeBSD server (and not on another with the same specs and the exact same software setup). They replaced the RAM, the issue persisted, and then they swapped out all the hardware. And when I wanted to upgrade, it was simply, "I want to upgrade 5 servers, what kind of deal can you give me?"

  • Boltersdriveer Member, LIR

    What @William said. But it's not that straightforward for us: we had a few problems here and there last year when a hard drive died on us. It took a few hours of communicating with the provider, and they still didn't help us. Probably just miscommunication during that period, but I'm not too worried about that nowadays :)

  • miTgiB Member

    If a node blows up, it releases its blue smoke of life and floats to heaven.

    Honestly, I try to keep a live node that is empty but online in both data centers at all times. I don't always succeed with this plan, but I'm also adding small nodes so fast that I generally have enough spare parts around to build a node. I've had nodes fail in many different ways, but since moving to 100% SuperMicro, I've never had anything other than drives fail while in service. And drives just fail, there's nothing to do about it, and I have spares on hand for that too.

    For LA, Newegg is a few miles away, so if I have to replace a part it's always there the next day, and QuadraNet will lend me anything as long as I replace it, since we use a lot of the same parts. Charlotte is a 20-minute drive, and I have just pulled a node, brought it back to work on at the comfort of a local bench, and returned the next day with a fully functional node.

  • KuJoe Member, Host Rep
    edited May 2012

    @William said: Pull out the HDDs, put them in a spare, boot up, done.

    This. That's why all of our nodes are identical (except the HDDs).
