What is considered reasonable downtime?

techwiz44 · April 2018

I do IT for a living so I know about planning image backups, fault tolerance, failover, redundancy, RAID 10 etc. I also understand that sometimes things just happen. For example: one of the VPS providers that was heavily active on LEB was subjected to a long-term denial of service attack and during that process my VPS was off-line. I certainly would not fault the providing company. But during the last 60 days my VPS has twice gone off-line for 4 to 5 hours straight. The first time was less than 30 days ago and the second occurrence is happening as I type this. I did what any reasonable client would do: I filled out a support ticket and said my VPS is off-line, please advise.

When this took place 30 days ago, after five hours of downtime all I received was a comment from the support department that there was a problem with another VPS and supposedly things had been corrected. Here I am less than 30 days later and my VPS has been off-line for over four hours. This is right in the middle of the business day, and again I filled out a support ticket asking what the heck is going on. This provider has a Twitter page, a Facebook page, and a network status link, none of which mention any outages.

If they’re having a serious issue I believe all their technicians should be focusing on solving it and not taking time out to hand hold affected clients. But after 4 to 5 hours is it reasonable to be advised as to when my VPS will come back online?
What is considered reasonable here?
At this point the provider is nameless. If this continues they won’t be!

WebProject · April 2018

4-5 hours of downtime for no valid reason is not great, unless the firewall on your VPS blocked your own IP address.

mikho · April 2018

what does the SLA say?

deank · April 2018

It depends on what the SLA is. Even 99% uptime allows a fair number of hours: about 7 hours a month.

YokedEgg · April 2018

99.9% yearly uptime is a good estimate. Don't confuse this with 99.99%.

Neoon · April 2018

It also depends on what you pay. If you pay just a small amount, get another provider; maybe you can claim on the SLA.

HashTag · April 2018

Around an hour a month is reasonable. Anything over that and I'd be changing host, SLA or no SLA.

omelas · April 2018

SLA	Allowed downtime per year
95%	18 days 6 hours
99%	3 days 15 hours
99.9%	8 hours 45 minutes
99.99%	52 minutes
99.999%	5 minutes
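
The figures in this table come from multiplying the unavailable fraction by the length of the period. A rough Python sketch, assuming a 365-day year; the helper name `allowed_downtime` is made up for illustration:

```python
from datetime import timedelta


def allowed_downtime(sla_percent: float,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the downtime an uptime percentage permits over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    # The unavailable fraction (e.g. 0.001 for 99.9%) times the period length.
    return period * (1 - sla_percent / 100)


# Reproduce the per-year rows of the table above.
for sla in (95, 99, 99.9, 99.99, 99.999):
    print(f"{sla}%\t{allowed_downtime(sla)}")
```

Passing `period=timedelta(days=30)` gives the per-month allowance instead, which is where figures like "about 7 hours" for 99% come from.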

EHRA · April 2018

It all depends on the contracted service and the expectations that you have. When I contract a VPS or dedicated server I expect the highest possible uptime; if I do not get it, I will go to another provider. What is happening to you would be unacceptable for my needs, considering the length of the downtime and how soon the problem recurred.

PS: If your complaint is legitimate, do not be afraid to name the provider. This gives them a right of reply and helps us stay away from bad suppliers.

deank · April 2018

And, of course, do not forget that you do have an option to sue.

techwiz44 · April 2018

They are no longer anonymous...

https://hostus.us/terms-and-conditions.html

I have been down for over 5 hours. I have received one reply from Alexander in support stating they are rebooting the node and will monitor same. For over 5 hours I cannot ssh into my vps. When managed via their panel it says the server cannot be contacted. I'm way past pissed. I am purchasing another VPS from another provider now.

deank · April 2018

There is a chance that it's running fsck. Fsck can take hours in some cases.

Of course, I am not defending the clowns. Just saying.

lazyt · April 2018

Fsck can take days in extreme cases.

Aidan · April 2018

@lazyt said:
Fsck can take days in extreme cases.

Weeks on certain storage nodes ;-;

deank · April 2018

Months on some deadpooled storage servers.

cubedata · April 2018

isn't @AlexanderM related to HostUS? let's see what he says on this.

AlexanderM · April 2018

@techwiz44 said:
They are no longer anonymous...

https://hostus.us/terms-and-conditions.html

I have been down for over 5 hours. I have received one reply from Alexander in support stating they are rebooting the node and will monitor same. For over 5 hours I cannot ssh into my vps. When managed via their panel it says the server cannot be contacted. I'm way past pissed. I am purchasing another VPS from another provider now.

Do you have a ticket number so I can check it and locate your account? 5 hours is a long downtime. We did have an issue on an Atlanta node today which is mainly used for $1/month services, but not for 5 hours. If you let me know your ticket ID or account email I can tell you the facts rather than hearsay!

I’m sorry that your experience with us hasn’t been great. If you send me the details above I’ll arrange for a refund for your service.

@cubedata said:
isn't @AlexanderM related to HostUS? let's see what he says on this.

Yes, I have owned and managed the company for 5 years.

Thanks,
Alex

willie · April 2018

I think you are saying your VPS worked ok other than two multi-hour outages in a two-month period. Yeah that's not great, but it's well enough within range of stuff we see on LET that I wouldn't go nuts over it.

It's often the case that there's an outage, the host deploys a fix, and then it turns out that the issue wasn't really solved so the problem happens again. It's also in the nature of LET that people entering this market haven't yet gotten their processes completely figured out, so stuff is more likely to break than if you host at AWS. You have to set your expectations accordingly.

Obviously if the same thing keeps happening then there's a problem, but if it's a one-off (well, two-off) I'd say it's usual, and if your reliability needs can't handle it, then you have to move to a HA scheme involving multiple servers and maybe multiple hosts, rather than simply expecting more from a single VPS.

404error · April 2018

@AlexanderM said:
...we did have an issue on an atlanta node today which is mainly used for $1/month services...

Is this your way of saying YGWYPF ? hehe

YokedEgg · April 2018

@404error said:

@AlexanderM said:
...we did have an issue on an atlanta node today which is mainly used for $1/month services...

Is this your way of saying YGWYPF ? hehe

That abbreviation is almost as long as just writing out "you get what you pay for".

KuJoe · April 2018

@techwiz44 said:
They are no longer anonymous...

https://hostus.us/terms-and-conditions.html

I have been down for over 5 hours. I have received one reply from Alexander in support stating they are rebooting the node and will monitor same. For over 5 hours I cannot ssh into my vps. When managed via their panel it says the server cannot be contacted. I'm way past pissed. I am purchasing another VPS from another provider now.

I'm not able to find an SLA anywhere on their site. I suggest finding a provider that offers one and stands by it.

willie · April 2018

KuJoe said: I'm not able to find an SLA anywhere on their site. I suggest finding a provider that offers one and stands by it.

Lol, if this server was within LET limits we're talking about a $7/month product at most (23 cents a day). So we'd likely be looking at 46 cents compensation for the two outages at best. It's not going to matter.

deank · April 2018

If no SLA is found, it defaults to 100% uptime guarantee.

doghouch · April 2018

@deank said:
Months on some deadpooled storage servers.

More like half a year...

donli · April 2018

@doghouch said:

@deank said:
Months on some deadpooled storage servers.

More like half a year...

Hope springs eternal among people with no backup.

omelas · April 2018

What's reasonable expected downtime for no SLA? Technically a provider could run away without ever turning on the server, but that obviously wouldn't work.

willie · April 2018

omelas said: what's reasonable expected downtime for no SLA?

Realistically this is LET and you can't expect much. You might look at the long term reliability reports in the old vpsboard.com host review threads. Keep in mind that a lot of "downtime" is network interruption, while the server itself is still running.

Generally I'd say a server is pretty good if it's rebooted without notice no more than maybe 1x a year, and with notice no more than maybe 2x a year, plus it shouldn't have extended (> 1 hour) unavailability more than 1x or 2x a year without notice. If it has brief (< 2 minutes) network outages 1x or 2x a month that's not great but not too bad. At some point a 1 hour outage is less bad than several 5 minute outages.

I really wouldn't rely on any single server for a service that will significantly inconvenience you if it's down. You ought to have backup services just like you have to back up your data. I personally keep local copies of any data that I might have to use while a remote server is down. The remote server (modulo temporary outages) is more reliable than my laptop, but those outages are hard to foresee. E.g. my home internet was out for a few hours earlier this week, so the remote servers were unreachable through no fault of their own.

deadpool · April 2018

I think anything in the high 90% range is alright by me personally. Obviously some people need their website to be up 100% of the time to reach potential clients or viewers. I have a provider that guarantees 99.99% uptime but HetrixTools says otherwise, and I thought about busting their balls but to me 98.76% vs 99.99% is a very small margin.

donli · April 2018

@caniac22 said:
I thought about busting their balls but to me 98.76% vs 99.99% is a very small margin.

99.99% = 53 minutes downtime/year.

98.76% = 109 hours downtime/year.

If you are running a production application and every hour of downtime costs you money, moving to a better host can easily pay for itself in savings. If you are running a personal project and being down just means mild annoyance to someone getting something for free, then that's a different matter.
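
To make the "pays for itself" point concrete, here is a small sketch of the arithmetic. The hourly cost of downtime and the extra hosting price are hypothetical placeholders, not figures from this thread, and the function names are made up for illustration:

```python
HOURS_PER_YEAR = 365 * 24


def yearly_downtime_hours(uptime_percent: float) -> float:
    """Hours per year a server is down at a given uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


def yearly_savings(current_uptime: float, better_uptime: float,
                   cost_per_down_hour: float,
                   extra_hosting_cost_per_year: float) -> float:
    """Net yearly gain from moving to the more reliable host.

    A positive result means the move pays for itself.
    """
    hours_saved = (yearly_downtime_hours(current_uptime)
                   - yearly_downtime_hours(better_uptime))
    return hours_saved * cost_per_down_hour - extra_hosting_cost_per_year


# Hypothetical inputs: $20 lost per hour down, $120/year more for the better host.
print(yearly_savings(98.76, 99.99, cost_per_down_hour=20.0,
                     extra_hosting_cost_per_year=120.0))
```

With a cost per down hour of zero (the free personal project case), the result is simply the negative of the price difference, which matches the "different matter" above.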

willie · April 2018

omelas said: what's reasonable expected downtime for no SLA?

Put another way, an LET style host is doing ok with 99.9% uptime, which allows downtime of about 9 hours a year. That's about what you get if you run a server in a DC and do a halfway decent job of keeping it running and fixing outages. To get 99.99% (under 1h/year down) you need a failover server, plus serious monitoring and 24/7 op intervention if something goes wrong.

SLAs for very big systems (e.g. Google search) are based on requests rather than downtime. They're spread across 1000s of servers so it's assumed something will always be down somewhere. But "everything out simultaneously" is headline news and is supposed to never happen. A 99.999% SLA means that out of every billion requests, 0.001% or about 10,000 are allowed to fail. You have to do some work to make your service that reliable.

According to a book I saw about Google SRE (site reliability engineering), if you're tasked with 99.999% SLA then the other 0.001% is called your "error budget". If you exceed your error budget (your server fails too many requests) then you get dinged for an unreliable service. But if you don't use much of it (you were allowed 10,000 errors and you only made 20) you get dinged for suboptimal resource allocation (you did extra work to make your service more reliable than necessary, instead of making it faster or adding features).
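
The error budget idea can be sketched in a few lines. This is only an illustration of the arithmetic described above, not how Google computes it; the function names are invented:

```python
def error_budget(total_requests: int, target_percent: float) -> int:
    """Number of requests allowed to fail under a request-based target."""
    return round(total_requests * (1 - target_percent / 100))


def budget_used(failed_requests: int, total_requests: int,
                target_percent: float) -> float:
    """Fraction of the error budget consumed (1.0 means exactly spent)."""
    budget = error_budget(total_requests, target_percent)
    if budget == 0:
        # A target with no budget at all: any failure overshoots it.
        return float("inf") if failed_requests else 0.0
    return failed_requests / budget


# The example from the post: one billion requests at a 99.999% target.
print(error_budget(1_000_000_000, 99.999))
# Only 20 failures against that budget: far under, i.e. "more reliable than necessary".
print(budget_used(20, 1_000_000_000, 99.999))
```

A value above 1.0 from `budget_used` is the "too many failed requests" case; a value close to 0 is the "suboptimal resource allocation" case.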

All their services rely on other services and know the error budgets of the services that they rely on, and take the possible upstream failures into account when figuring out how to meet their own reliability goals. Your thing has to keep running (maybe degraded) even if something it relies on is temporarily down. Similarly things depending on you have to keep running even if YOUR thing is down. So it's acceptable and expected for your thing to have some downtime, up to a specified amount.
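
One simplified way to see why upstream error budgets matter: if a service fails whenever any hard dependency fails, and the failures are independent, its best-case availability is the product of its own availability and those of its dependencies. Both conditions are assumptions made here for illustration and are not taken from the book summary; the function name is made up:

```python
from math import prod
from typing import Iterable


def serial_availability(own: float, upstreams: Iterable[float]) -> float:
    """Best-case availability when every upstream is a hard dependency.

    Availabilities are fractions (0.9999 for 99.99%). Assumes independent
    failures and no graceful degradation.
    """
    return own * prod(upstreams)


# A 99.99% service with two 99.99% hard dependencies ends up below 99.99%.
combined = serial_availability(0.9999, [0.9999, 0.9999])
print(f"{combined * 100:.4f}%")
```

This is the reason for designing a service to keep running, maybe degraded, while an upstream is down: otherwise every dependency eats into its own budget.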

Apparently it's bad juju to undershoot your error budget too often. One service had an allowed error budget of X but in practice it always worked, so its downstream clients started to rely on that. One day management said something like "hey this thing is allowed to be down 25 minutes a year and it's never been down at all. Let's shut it off for 25 minutes and see what happens". Of course other stuff that depended on it started crashing (as they suspected it would) and people got spoken to. I hate Google in all kinds of ways but there are some things they have figured out pretty well. Now they make sure that thing is down 25 minutes a year even if they have to shut it off intentionally, to make sure that its downstream is able to survive the outages.

Here's a summary of the book, with a link: http://danluu.com/google-sre-book/

Ole_Juul · April 2018

99.9% is OK for one year because shit happens, but I'd not be happy if they were down 8 hours every year. I'd expect somewhere closer to 99.99% for any decent host. And my experience confirms that.

PS: for the $7/yr hosts I accept whatever I get.
