
Providers - what happens when a node blows up?

raindog308 Administrator, Veteran
edited May 2012 in General

Curious how smaller vendors (not the Amazons of this world) handle node failures.

I assume (good) providers use enterprise-grade gear with RAID disks, redundant NICs, redundant power supplies, etc. But what if a node goes down anyway - a CPU failure, human error, a software problem, whatever.

At work, we use VMware and vMotion for HA and have other (ridiculously expensive) HA systems. Do the various open source virtualization systems (OvZ, Xen, KVM) offer something like that? Can you pick up a VPS and move it from one node to another over the network? I'm guessing that if you do, the VPS is down for the duration of the move?

Of course, that doesn't help if the node just crashes - I imagine you'd need some sort of SAN to overcome that.

I get the sense that the really small players rent a dedicated server from someone, stick a virt layer on it, and start selling VPSes without a lot of thought about what happens if the node fails.

I'm also assuming that better providers either keep spare parts/nodes at the DC, or live near it...but if the box goes down, the VPSes on it might be down for as long as it takes to fix that box.

Really just curious...

Comments

  • rds100 Member
    edited May 2012

    You can live-migrate an OpenVZ container to another node about 95% of the time, with no downtime and no reboot of the VPS. The drawback is that both nodes must be working. In some rare cases (e.g. if ntpd or another process using POSIX timers runs inside the container) live migration doesn't work; then you can do an offline migration instead, which means the VPS reboots, but other than that the downtime is minimal.
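
    A minimal sketch of how that looks from the source node - assuming OpenVZ's standard vzmigrate tool; the container ID and destination hostname are hypothetical:

    ```python
    import subprocess

    def migrate_container(ctid: int, dest: str, online: bool = True) -> bool:
        """Try a live (online) migration; fall back to offline if it fails."""
        cmd = ["vzmigrate"]
        if online:
            cmd.append("--online")  # live migration: no reboot of the container
        cmd += [dest, str(ctid)]
        result = subprocess.run(cmd)
        if result.returncode != 0 and online:
            # e.g. ntpd or another POSIX-timer user inside the container;
            # offline migration reboots the VPS but keeps downtime minimal
            return migrate_container(ctid, dest, online=False)
        return result.returncode == 0

    # migrate_container(101, "node2.example.com")  # hypothetical values
    ```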

  • Damian Member
    edited May 2012

    @raindog308 said: I'm also assuming that better providers either keep spare parts/nodes at the DC, or live near it...but if the box goes down, the VPSes on it might be down for as long as it takes to fix that box.

    This is the approach we take. We keep spare hard drives (really annoying), PSUs, and NICs on hand, and we're on good terms with a local Dell/HP supplier for anything else that might burn up (CPU, RAM, etc.). I'm pushing the rest of the team to move away from HP/Dell hardware and assemble our own nodes from standard components, which we could buy from any store that sells computer parts, unlike vendor-specific parts.

    On the other hand, even with this planning, it takes 45 minutes to drive to the datacenter, so that's at minimum 45 minutes of downtime. Too long. We're moving to a datacenter that offers remote hands to mitigate this.

    And even further, don't forget about infrastructure failures, like I discussed here: http://www.lowendtalk.com/discussion/2211/when-shopping-for-a-datacenter

  • Infinity Member, Host Rep

    If a node blows up then it is usually in lots of pieces. Trololol. </really-crap-joke>

    I'd guess this is an issue for providers with rented servers, but then if a node's hardware goes wrong it's up to the server's provider to recover data etc. It depends on the individual provider and their terms of service.

    For providers who colo, it would depend on the individual provider - which is the point of this thread: to see what different providers do.

  • FRCorey Member

    I keep backups of the VMs themselves and just copy them over to Amazon S3. I just wish SolusVM had a more robust backup system with more choices.
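
    For what it's worth, a minimal sketch of that kind of off-node copy job, assuming boto3; the dump directory and bucket name are hypothetical (this isn't SolusVM's own backup system, just a cron-able script):

    ```python
    import os
    import boto3

    def backup_vm_dumps(dump_dir: str, bucket: str) -> None:
        """Upload every VM dump file in dump_dir to S3, keyed by filename."""
        s3 = boto3.client("s3")
        for name in os.listdir(dump_dir):
            path = os.path.join(dump_dir, name)
            if os.path.isfile(path):
                s3.upload_file(path, bucket, f"vm-backups/{name}")

    # backup_vm_dumps("/var/lib/vz/dump", "my-offsite-backups")  # hypothetical
    ```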

  • KMyers Member

    @Infinity said: If a node blows up then it is usually in lots of pieces. Trololol.

    Yes, and the fire department typically leaves it in a state of disrepair.

    Honestly, this has only happened to me once: the motherboard started to smoke. It was a leased server, so they just migrated the hard disks to a new one.

  • subigo Member

    I've never seen a node crash so hard you couldn't just migrate clients off of it first. There are almost always errors and warning signs before something bad happens. If it did happen, I'd just have another server set up, restore managed users from backup, and set unmanaged users back up in their original state (if they have a backup, I'll restore it). These days, though, I normally tell everyone they need their own restorable backups if they want something restored.
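
    One way to automate that kind of early-warning check - a sketch assuming smartmontools is installed; the device list is hypothetical, and matching smartctl's summary line this way is a simplification:

    ```python
    import subprocess

    def drive_healthy(device: str) -> bool:
        """Return True if smartctl reports the drive's overall health as PASSED."""
        out = subprocess.run(
            ["smartctl", "-H", device],
            capture_output=True, text=True,
        )
        return "PASSED" in out.stdout

    for dev in ("/dev/sda", "/dev/sdb"):  # hypothetical device list
        if not drive_healthy(dev):
            print(f"{dev}: time to migrate clients off this node")
    ```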

  • Jacob Member

    If you're colocating, it's generally a good idea to keep exact spares/replacements of the original parts. It's always good to keep a spare chassis, and especially spare PSUs if you don't have a dual-PSU setup.

    I've noticed some strange things on our KS2 (Kansas) node and I'm beginning to think it's about to flip... Bob is going to switch it to a new chassis tomorrow and run a quick test on all the parts.

  • William Member

    Pull out the HDDs, put them in a spare, boot up, done.
    30 minutes of downtime.

  • Kairus Member

    @Infinity said: I'd guess this is an issue for providers with rented servers

    That's the one thing I liked about renting servers (not for VPS nodes). I had some strange kernel panics on one FreeBSD server (and not on another with the same specs and the exact same software setup). They replaced the RAM, the issue persisted, and then they swapped out all the hardware. And when I wanted to upgrade, it was simply, "I want to upgrade 5 servers, what kind of deal can you give me?"

  • Boltersdriveer Member, LIR

    What @William said. But it's not that straightforward for us: we had a few problems here and there last year when a hard drive died on us. It took a few hours of communicating with the provider, and they still didn't help us. Probably just miscommunication during that period, but I'm not too worried about that nowadays :)

  • miTgiB Member

    If a node blows up, it releases its blue smoke of life and floats to heaven.

    Honestly, I try to keep a live node that is empty but online in both data centers at all times. I don't always succeed with this plan, but I'm also adding small nodes so fast that I generally have enough spare parts around to build a node. I've had nodes fail in many different ways, but since moving to 100% SuperMicro, I've never had anything other than drives fail while in service. And drives just fail, there's nothing to do about it, and I have spares on hand for that too.

    For LA, Newegg is a few miles away, so if I have to replace a part it's always there the next day, and QuadraNet will lend me anything as long as I replace it, since we use a lot of the same parts. Charlotte is a 20-minute drive, and I have just pulled a node, brought it back to work on at the comfort of a local bench, and returned the next day with a fully functional node.

  • KuJoe Member, Host Rep
    edited May 2012

    @William said: Pull out the HDDs, put them in a spare, boot up, done.

    This. That's why all of our nodes are identical (except the HDDs).
