ZxHost Failure - OpenVZ

I've been with ZxHost for two years, and I've never had a problem.

On March 25 I received this email:

Since then, I have not had any news, and the VPS does not work.

Did other people have the same problem?

Hello Stephane,

The node your OpenVZ VM operates on has been affected by a RAID failure before all migrations to our KVM environment could be fully completed.

We are working to restore the data from the RAID set if possible; however, to get you back online ASAP we are looking to set up your new KVM VM.

If you can please reply to this email with the OS you require along with any further requirements, we will be providing an extension of 1 month to all affected services.

Thanks,

ZXHost

Comments

  • NekkiNekki Veteran

    Did you reply to the email?

    Thanked by 1netomx
  • jarjar Patron Provider, Top Host, Veteran
    edited April 2017

    comeback said: Did other people have the same problem?

    Nope, you're the only one on the node! RAID failures generally only cause problems for one person. I've never heard of one happening before though.

    :)

  • BopieBopie Member

    @jarland I almost thought that your comment was from @nekki, had to check the names twice ;)

  • @jarland said:

    comeback said: Did other people have the same problem?

    Nope, you're the only one on the node! RAID failures generally only cause problems for one person. I've never heard of one happening before though.

    :)

    Sorry, I phrased that badly.

    Have you ever had this problem with another provider?

    Do you think they can fix it?

  • NekkiNekki Veteran

    @Bopie said:
    @jarland I almost thought that your comment was from @nekki, had to check the names twice ;)

    There would almost certainly be more swearing if it was me.

    Thanked by 1Bopie
  • HarambeHarambe Member, Host Rep

    @comeback said:

    Sorry, I phrased that badly.

    Have you ever had this problem with another provider?

    Yes.

    Do you think they can fix it?

    Who knows. Shit happens. This is why you always need backups.
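
For anyone in the OP's position, here is a minimal sketch of the kind of nightly off-node backup that turns a dead node into an inconvenience rather than a disaster. It is Python wrapping rsync; the source path and backup host are made up, and it assumes rsync is available on both ends:

```python
#!/usr/bin/env python3
"""Minimal nightly off-node backup sketch (hypothetical paths and host)."""
import datetime
import subprocess

# Hypothetical values -- point these at your own data and your own backup box.
SOURCE = "/var/www/"
DEST = "backup@backup.example.com:/backups/vps1/"

def run_backup() -> None:
    stamp = datetime.date.today().isoformat()
    # -a preserves permissions and timestamps, --delete mirrors removals on the target.
    result = subprocess.run(
        ["rsync", "-a", "--delete", SOURCE, DEST],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # In practice you'd alert (mail, webhook) rather than just print.
        print(f"{stamp}: backup FAILED: {result.stderr.strip()}")
    else:
        print(f"{stamp}: backup completed")

if __name__ == "__main__":
    run_backup()
```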

  • jarjar Patron Provider, Top Host, Veteran

    comeback said: Have you ever had this problem with another provider?

    Pretty much every provider. Depends on the reason the RAID failed. Honestly, you can never know who is going to be honest with you about why. Could be they were lazy replacing a drive and another went out, could be a controller going nuts. If a controller failed then I'd give it a 50/50 shot of recovery.
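
On the "lazy replacing a drive" point: the window for a second failure stays small only if a degraded array gets noticed right away. A rough sketch of a check you could cron on a Linux box using mdadm software RAID (hardware controllers need their vendor's CLI instead, and the /proc/mdstat parsing here is only a heuristic):

```python
#!/usr/bin/env python3
"""Degraded md-RAID check sketch (Linux software RAID only; parsing is a heuristic)."""
import re
import sys

def degraded_arrays(path: str = "/proc/mdstat") -> list:
    """Return the names of md arrays whose member status shows a missing drive."""
    try:
        text = open(path).read()
    except FileNotFoundError:
        return []  # no md arrays (or not Linux)
    bad = []
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
        # Status lines look like "[4/4] [UUUU]"; an underscore marks a failed/missing member.
        if current and re.search(r"\[[U_]*_[U_]*\]", line):
            bad.append(current)
            current = None
    return bad

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        print("DEGRADED:", ", ".join(bad), "- replace the dead drive before its partner goes too")
        sys.exit(1)
    print("all md arrays report full membership")
```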

  • pbgbenpbgben Member, Host Rep

    @jarland said:

    comeback said: Have you ever had this problem with another provider?

    Pretty much every provider. Depends on the reason the RAID failed. Honestly, you can never know who is going to be honest with you about why. Could be they were lazy replacing a drive and another went out, could be a controller going nuts. If a controller failed then I'd give it a 50/50 shot of recovery.

    "What do you mean I can't buy these controllers anymore"

    Thanked by 1jar
  • NekkiNekki Veteran

    @comeback

    DID YOU RESPOND TO THE FUCKING EMAIL.

    Thanked by 1imok
  • BopieBopie Member

    @Nekki said:
    @comeback

    DID YOU RESPOND TO THE FUCKING EMAIL.

    And there is the real nekki

    Thanked by 1ErawanArifNugroho
  • FalzoFalzo Member

    I am also affected by this, but luckily I used this old storage node for backups only and don't need the data restored after all.

    To satisfy @Nekki: of course I replied, and I have waited patiently since then... ;-)

    Those services on the old Hetzner nodes were due to be transferred to ZXHost's Frankfurt location a while ago, but @AshleyUk probably couldn't get it done as fast as planned.

    I also noticed some recent changes, at least to the naming of the storage nodes in Frankfurt, where the newer nodes are located. So I assume Ashley is steadily working on this, including migrating services over, which might take quite some time depending on how many services there are and how much data was put into them...

    I'd also appreciate a tad more info or status updates in between - there's probably already a big ticket backlog anyway ^^

    Thanked by 1AshleyUk
  • Thanks for the tags! People who were affected and replied to the email got their new VM set up.

    I have been working my way through the migration of the VMs for a while now - quite a few storage nodes from Hetzner. It has taken longer for many reasons, including some people just not replying to emails :), and sadly this happened on one of the nodes before it was fully empty.

    I think I know who the OP is, as I received a reply to the email around the same time as this post. I'm awaiting further details and will happily resolve this for the OP.

    Thanked by 1Falzo
  • Ashley, thanks for posting. How is the RAID recovery going? Can I ask what the RAID level was? Are you saying you're migrating all your old Hetzner storage servers to Frankfurt?

  • @willie said:
    Ashley, thanks for posting. How is the RAID recovery going? Can I ask what the RAID level was? Are you saying you're migrating all your old Hetzner storage servers to Frankfurt?

    It was running RAID 10; it does not look too good to be honest, but we're still trying.

    And yes, we have been working on it for a while; we had nearly finished, and this just happened to be one of the last few servers with the issue.

  • AshleyUk said:

    It was running RAID 10; it does not look too good to be honest, but we're still trying.

    Oh yikes. Is this multiple drive failures, or a hardware controller, or what, if you don't mind my asking? (I have some storage with you, but it's in your Frankfurt Ceph cluster, which sounds safer than RAID 10.)

  • @willie said:

    AshleyUk said:

    It was running RAID 10; it does not look too good to be honest, but we're still trying.

    Oh yikes. Is this multiple drive failures, or a hardware controller, or what, if you don't mind my asking? (I have some storage with you, but it's in your Frankfurt Ceph cluster, which sounds safer than RAID 10.)

    Multiple drive failures.

  • rokokrokok Member

    AshleyUk said: Are you saying you're migrating all your old Hetzner storage servers to Frankfurt?

    I've got no issues on my old storage, but I need an answer: will the Frankfurt location have free incoming bandwidth like Hetzner?

  • HarambeHarambe Member, Host Rep

    @rokok said:

    AshleyUk said: Are you saying you're migrating all your old Hetzner storage servers to Frankfurt?

    I've got no issues on my old storage, but I need an answer: will the Frankfurt location have free incoming bandwidth like Hetzner?

    If it's the same as the ceph plans, then yeah, free inbound.

    Thanked by 1AshleyUk
  • williewillie Member
    edited April 2017

    AshleyUk said: Multiple drive failures.

    Thanks. I'm getting less enthusiastic about raid-10. Will try to aim for Raid-6, Ceph, ZFS etc. I'm liking my VPS on your Ceph cluster. It feels incredibly solid for some reason.

  • jarjar Patron Provider, Top Host, Veteran
    edited April 2017

    @willie said:

    AshleyUk said: Multiple drive failures.

    Thanks. I'm getting less enthusiastic about raid-10. Will try to aim for Raid-6, Ceph, ZFS etc. I'm liking my VPS on your Ceph cluster. It feels incredibly solid for some reason.

    I just don't understand how multiple drives fail at once unless you either get a really really unlucky dice roll or someone was lazy about replacing one of the bad drives because "it's fine, array is still alive."

    RAID10 is amazing. But you also have to consider that SO many people are running fleets of RAID10 that you're going to hear more failure stories about it than about a configuration that fewer servers are running.

    I say don't throw out the popular choice because you hear about the few times that one array fails. It's popular because you have to lose two drives to kill it, and if people are on top of things and not lazy, and controllers don't fail themselves in a spectacular way (a risk on any RAID array, don't buy shit controllers and keep spares), cases of failure should be exceptionally minimal.

    And let's be real, hosts don't have to tell you it's because they were slow replacing the first drive that failed. You won't know any different, you weren't there. So it's real easy to blame something else, and you'll never really know who is telling the truth.

    Thanked by 1MikeA
  • williewillie Member
    edited April 2017

    jarland said: It's popular because you have to lose two drives to kill it,

    Yes, and the same is true of RAID-5, which is deprecated with large drives these days. The other alternatives I mentioned can survive every possible 2-drive failure, or even 3-drive (etc.) depending on configuration; see the quick arithmetic sketch after this comment. They were designed for the specific reason that 2-drive failures aren't all that rare. Remember also that in a RAID-10 rebuild you're pounding the crap out of the surviving member of the pair that had a failure. That increases its own likelihood of failure.

    From what I understand, Online's Enterprise C14 product ($$$$) is a distributed software RAID with something like 47 servers (rack of 1U's I guess) and up to something like 20 can fail. For the non-enterprise version the number is lower but it's still a lot compared to what we're used to.
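
The quick arithmetic referenced above, under the idealised assumption that the second failure hits a uniformly random surviving drive (which rebuild stress on the dead drive's mirror partner actually works against): in RAID-10 a second loss is only fatal if it hits that partner, while RAID-6 tolerates any two concurrent losses.

```python
"""Back-of-the-envelope odds for the 'two drives kill it' point.
Idealised: treats the second failure as hitting a uniformly random surviving drive."""

def raid10_second_failure_fatal(n_drives: int) -> float:
    # After one drive dies, only its mirror partner is fatal:
    # 1 fatal candidate out of the (n_drives - 1) survivors.
    return 1 / (n_drives - 1)

def raid6_second_failure_fatal(n_drives: int) -> float:
    # RAID-6 keeps two parity blocks, so any two concurrent losses are survivable.
    return 0.0

if __name__ == "__main__":
    for n in (4, 8, 12):
        print(f"{n:>2} drives: RAID-10 second failure fatal "
              f"{raid10_second_failure_fatal(n):.0%}, RAID-6 {raid6_second_failure_fatal(n):.0%}")
```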

  • daffydaffy Member

    We just had a batch of 12 drives where 2 went belly up within 48 hours. 1.5-year-old Toshiba enterprise drives. Luckily, the rebuild finished on the first drive just before the second one decided to die.

  • This is one of the things we look at when we build our servers: making sure that drives aren't all from the same batch. We have been caught out by HP drives failing at the same time. We make sure that, at the very minimum, the hot spare is from a different batch, as that gives us a chance to correct things if it turns out to be a model issue. We have also experienced drives failing during a rebuild. The RAID model isn't so great on really big new drives because of the rebuild times... Is there anything as good as it yet? Not as far as I have seen.
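
As a small companion to the batch point above, a rough sweep of SMART health and serial numbers makes same-batch drives easy to spot. This sketch assumes smartmontools is installed, root privileges, and plain /dev/sdX device naming; the PASSED/OK string match is only a heuristic:

```python
#!/usr/bin/env python3
"""SMART health/serial sweep sketch (assumes smartmontools, root, /dev/sdX naming)."""
import glob
import subprocess

for dev in sorted(glob.glob("/dev/sd[a-z]")):
    # -i prints identity info (including the serial), -H prints the overall health verdict.
    out = subprocess.run(["smartctl", "-i", "-H", dev],
                         capture_output=True, text=True).stdout
    serial = next((line.split(":", 1)[1].strip()
                   for line in out.splitlines()
                   if line.lower().startswith("serial number")), "?")
    healthy = "PASSED" in out or "OK" in out
    # Matching serial prefixes across drives is a quick hint they came from the same batch.
    print(f"{dev}  serial={serial}  health={'ok' if healthy else 'CHECK ME'}")
```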
