XenPower Uptime/Redundancy

iaTa · March 2014

For the last day or so I've had a major issue with our XenPower Dallas server. Salvatore replied to my ticket to say that one of the upstreams wasn't announcing their prefix correctly, so some traffic was getting dropped.

I thanked Salvatore for fixing the problem, but I also mentioned that I was quite concerned that it took over 24 hours and asked why it took so long.

This is the reply I received from another member of support regarding my concerns:

If you are a business where uptime is very important, I suggest you switch to a redundant setup like www.iwstack.com. Our budget brands such as XenPower and OverZold do not benefit from the same level of uptime and network redundancy.
XenPower is thought to be a budget brand offering a lot of resources at a low price, however it does not compare well with a business grade cloud with a lot of extra features and HA/fail-over. For a business, the small price difference of a couple of dollars should be totally irrelevant.

I moved over to Prometeus/Incero because they both have top-notch reputations, so this response really did surprise me. Uptime is important to everybody, not just businesses. It has never been mentioned that XenPower is a budget brand. In fact, the complete opposite has been stated numerous times on LET.

To receive a reply worded in that way after experiencing over 24 hours of downtime leaves a very bad taste in the mouth. Is it a reply which should be expected and accepted?

myhken · March 2014

All brands from Mr. Salvatore other than Prometeus/iwStack are low-end brands. In fact, I'm pretty sure Mr. S wants iwStack.com to be the brand with the highest quality now, since they can offer more features there than on the Prometeus.com brand.

DalComp · March 2014

Nothing unusual for a business trying to upsell their clients.

iaTa · March 2014

So because XenPower is "low end" I should just accept over 24 hours down time and not ask any questions?

I would class LES at $5/y budget. Not a $90/y XEN. In your opinions what should I be spending to avoid situations such as this then?

concerto49 · March 2014

iaTa said: So because XenPower is "low end" I should just accept over 24 hours down time and not ask any questions?

Is there an uptime SLA? Has it been met? If not, ask them.

iaTa said: I would class LES at $5/y budget. Not a $90/y XEN.

That is budget. Even the most basic Linode offer is $20/month, so more than 2x $90/year.

Maounique · March 2014

There will be issues when you do not control all aspects of the DC. In Italy everything is under our control, from the DC to the multiple peering and carriers; in the US, however, we depend on Incero staff. This is one of the reasons we are still testing there. Hopefully things will improve or we will be able to find the perfect mix, but that is not likely to be up to par with the DC in Italy. US people have another approach towards business: a high cost of litigation means they might provide what they promised, but if they don't, you can't do much about it without huge legal costs, especially for non-locals. Therefore, finding a perfect DC there is going to take some time.

That being said, Incero is not bad for a US DC, and there are very few incidents like these. Too bad they were quick to blame their upstream and suggest that our customers stop using Cogent when the issue was on their side. Me being out for a couple of days and Salvatore being sick didn't help either.

ErawanArifNugroho · March 2014

So the problem is at XenPower Dallas? Sometimes small details make a big difference to the explanation. It should be noted in the first post.

As for simple uptime, see uptime.erawan.me. Most of my VPSes are from Prometeus because of its stability, and based on my experience it's hard to trust another provider for uptime and other factors.

For now, I hope uncle gets well soon.

Maounique · March 2014

It was a problem with a small subnet of ours not being announced properly, with some routes over a couple of carriers being broken/filtered. Nobody is perfect and never will be: the more people along the chain, the higher the chance something will go wrong and the longer it will take to fix, and, as you know, if something can go wrong, it will. The difference is made by the level of cooperation you get to fix it.

@iaTa said:
So because XenPower is "low end" I should just accept over 24 hours down time and not ask any questions?

I would class LES at $5/y budget. Not a $90/y XEN. In your opinions what should I be spending to avoid situations such as this then?

Nobody said you should not ask any questions, but let's be fair:

1) It was not a total failure; multiple locations saw it up all the time.
2) I was reasonably forthcoming with the explanations and took the problem upstream to try to solve it.
3) XenPower offers big resources at a low price. That is budget: you are paying 5.75 EUR a month, pretty close to LEB standards. IWStack is not much pricier, but it does offer more redundancy while still being a pretty budget offer.
4) Not taking advantage of our free anycast DNS system to set up some redundancy in another place, be it with us or another provider, is not a good idea. If you expect enterprise-level uptime, you need to set up enterprise-level redundancy and look for enterprise-level offers. We have 100% SLA offers for banks and other critical businesses in the VMware and RHEV clouds, but you will not like the price.

iaTa · March 2014

@concerto49 said:
Is there an uptime SLA? Has it been met? If not, ask them.

Prometeus don't seem to advertise an SLA. Salvatore has stated on a couple of forums that their uptime is near 100%. I guess 99.7% is still "near". My fault for not checking this beforehand.
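
As a rough check of that figure, here is a minimal sketch of the arithmetic. It assumes exactly 24 hours of downtime (the outage was "over 24 hours", so this is a lower bound on the damage) and a 365-day measurement window. The function name `uptime_percent` and both windows are illustrative assumptions, not anything Prometeus publishes.

```python
def uptime_percent(downtime_hours: float, period_hours: float) -> float:
    """Return uptime as a percentage of the measurement period."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    return 100.0 * (1.0 - downtime_hours / period_hours)


HOURS_PER_YEAR = 365 * 24    # assumption: non-leap year
HOURS_PER_MONTH = 30 * 24    # assumption: 30-day month

# Assumed outage length: 24 hours (the thread only says "over 24 hours").
yearly = uptime_percent(24, HOURS_PER_YEAR)
monthly = uptime_percent(24, HOURS_PER_MONTH)

print(f"{yearly:.2f}% over a year, {monthly:.2f}% over a 30-day month")
```

Measured over a full year, one day of downtime still rounds to 99.7%, but over a single 30-day month the same outage is closer to 96.7%, so the measurement window matters as much as the headline number.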

@concerto49 said:
That is budget. Even the most basic Linode offer is $20/month, so more than 2x $90/year.

This isn't Linode but if you are saying that's what I need to be paying to avoid similar situations then fine. I didn't realise $90/y was looked on as being such a poor product, even by their own staff.

@ErawanArifNugroho said:
So the problem is at XenPower Dallas? Sometimes small details make a big difference to the explanation. It should be noted in the first post.

@iaTa said:
For the last day or so I've had a major issue with our XenPower Dallas server.

Maounique thank you for explaining what went wrong. It would have been good to have that sort of reply in my ticket rather than the upsell I received instead.

My replies to your edit: That was the point of this post - I asked questions/raised concerns and received no answers. 1) From my testing around 15% of locations could see the server. That's down in my book. 2) The ticket has zero explanations, that's what I was asking for. 3) So budget means over 24 hours down time is Ok? 4) Shouldn't be necessary but maybe I will have to consider it.

ErawanArifNugroho · March 2014

my bad, sorry. I do need some rest

Maounique · March 2014

1) Agreed. It was merely an amendment to the idea that seemed to transpire that it was a total failure.
2) I count this as an explanation:
"I'm sorry it took so long. One of the upstream wasn't announcing our prefix correctly so some traffic were dropped."
You replied saying how bad we are and how you thought we were better. I agree it is not a good impression for a first-time customer, I agree it should have been solved faster, and I agree it is a failure. But if you have been online long enough, you know things on the internet do not stay up all the time, Google and Amazon included. You know you need to set up redundancy if you need 100% uptime, and you know that even then, with a major carrier failing, it will not be up 100%; rather, fewer people will see it down. We have a redundant network and redundant storage, but that is as far as it goes; you do need to set up redundant locations too. With IWStack you can do that from the same interface: you can scale it up or down as needed, and set up internal load balancing and firewall/NAT/IPSec, so if you have to switch something you just clone the VM, redirect ports and do the work with 0 downtime, etc.
3) Not by far. I agreed and still agree this was a serious screw-up; my point is that you will need at least 2 locations to make sure things stay up close to 100%. Hardware failure can also strike at any time, and for this you need a redundant setup such as iwstack, but even that cannot eliminate the risk of network failure, whether in our network, at the gates or at the internet carriers, unless you set up 2 locations there too. You say uptime is very important to you, which is why I thought you were a company. Whatever the reason you need high uptime, you need to get a redundant setup.
4) We offer free backup space and free anycast DNS failover exactly because nothing is perfect. Everything can fail: given enough servers, enough switches, enough routes, enough people and enough time, it will never be 100%, but it can be close if we take precautions and consider our options well, even at a budget price.
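
To put a rough number on the two-location point, here is a minimal sketch. If each location is up a fraction `a` of the time and failures are independent, the chance that all of them are down together is the product of the individual downtimes. Independence is a strong assumption (a shared carrier or DNS failure breaks it), and the 99.7% input and the name `combined_availability` are only illustrative.

```python
from typing import Iterable


def combined_availability(availabilities: Iterable[float]) -> float:
    """Availability of a service that is up if at least one location is up.

    Assumes failures at different locations are statistically independent,
    which real networks only approximate.
    """
    all_down = 1.0
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("each availability must be between 0 and 1")
        all_down *= 1.0 - a  # probability this location is also down
    return 1.0 - all_down


# Assumed figure: 99.7% per location, as discussed earlier in the thread.
single = combined_availability([0.997])
double = combined_availability([0.997, 0.997])

print(f"one location:  {single * 100:.4f}%")
print(f"two locations: {double * 100:.4f}%")
```

Under these assumptions a second location turns roughly a day of downtime per year into a few minutes, but only if failover between the two actually works and the failures are not correlated.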

iaTa · March 2014

1) Ok

2) You count that as an explanation? One short sentence. No details, no reasons for long delay to fix, why did it happen in the first place, what's been done to stop similar situations etc. I did not reply saying how bad you are at all. This was my reply:

Hi Salvatore,

I appreciate that the issue is now fixed but I am incredibly concerned that our newly relaunched website (and hence a very important time) was inaccessible for the majority of users around the world for over 24 hours.

I switched to Prometeus/Incero to avoid situations exactly such as this. How did this happen in the first place and why did it take so long to fix? Was it an Incero issue or a Prometeus issue? I would have expected this sort of problem to be sorted within minutes, not over a day.

Are you able to give any assurances that something like this won't happen again?

Kind regards,

Which was polite and completely reasonable. To which I received the reply in the first post with zero answers to my questions. I have since received another reply from the same member of staff with more facetious comments regarding my previous hosting experiences. I did not want this to play out the way that it is. It was not my intention. But I will defend my corner when necessary.

3) Uptime is important to everyone, no matter how small and insignificant.

4) I back up daily, and using anycast DNS should be total overkill for what I'm doing. Going by your previous post, you mentioned that you were not available and Salvatore was sick. So I should build redundancy into my setup to cater for your staffing issues, as the problem would have been sorted much more quickly had you been available?

serverian · March 2014

iaTa said: 3) Uptime is important to everyone, no matter how small and insignificant.

Then you do something like this: http://www.tuxz.net/blog/High_Availability_Automated_origin_failover_using_CloudFlare_Nagios_and_OpenShift/

Maounique · March 2014

1) OK.
2) Yes, it was an explanation. It was not detailed, yes, but far from the zero explanation you claimed.
3) It depends how important. The measure of it is the lengths people are willing to go to get it, either by paying more or by working more to set up redundancy.
4) It would not have been much quicker, given that Incero staff were not willing to acknowledge the problem easily and it only affected our subnet, not their network at large. They said it was a problem with Cogent and that our customers should not use Cogent to test. It took a long explanation and testing by Salvatore to prove it was not Cogent's fault, and even to pinpoint the problem for them, before they agreed to check it. It might have been quicker if I had more experience in dealing with these issues, as I am available at night anyway, but it was my first interaction with their staff and I didn't expect it to go south so quickly. I suspect 3-4 hours could have been saved in the best circumstances.

Once again, we are both sorry for this situation; however, we can count the failures we have had in 2+ years on the fingers of one hand. This excludes planned downtime for upgrades and such, short DDoSes lasting a few minutes, and OVZ reboots due to kernel issues, soft lockups and race conditions. Demanding reassurance that this will not happen again is not reasonable, I am sorry to say, but I did point out how to increase your uptime in the future. You took it the wrong way, hence this whole issue.

Either way, to show how sorry we are, I am offering you a full refund for this issue, even though we do not offer an SLA. What do you think?

tchen · March 2014

Money, Availability, Latency. Pick two.

iaTa · March 2014

I'm going to leave it there as we're going round in circles.

@serverian thanks for the suggestion. Looks like it might be a bit tricky to use that setup with WordPress but I need to sort something.

serverian · March 2014

@iaTa said:
serverian thanks for the suggestion. Looks like it might be a bit tricky to use that setup with WordPress but I need to sort something.

This looks easier: http://blog.booru.org/?p=12

Shoaib_A · March 2014

@iaTa said:
So because XenPower is "low end" I should just accept over 24 hours down time and not ask any questions?

**I would class LES at $5/y budget. Not a $90/y XEN.** In your opinions what should I be spending to avoid situations such as this then?

Go tell that at WHT and you would be ridiculed, because by general industry standards that really is ultra-low-budget pricing for the specs & quality you get.

Here the problem was out of the provider's hands & they could do nothing but wait for their DC to fix it, yet they tried to explain it to you.

Problems can happen to any VPS provider, even the best in the industry. Try to understand & move along, or, if uptime is so important to you, host your stuff in multiple locations with the same or multiple providers, so that if things go wrong at one place the other one is ready to take its place.
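
The multiple-locations idea could be sketched roughly as follows: probe each origin in order of preference and serve from the first one that answers. This is only an illustration under assumptions: the hostnames are placeholders, a plain TCP connect stands in for a real health check, and `tcp_alive` / `pick_origin` are hypothetical helper names, not part of any provider's tooling.

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_alive(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def pick_origin(
    origins: Iterable[str],
    is_healthy: Callable[[str], bool] = tcp_alive,
) -> Optional[str]:
    """Return the first origin that passes the health check, or None."""
    for host in origins:
        if is_healthy(host):
            return host
    return None


# Placeholder hostnames, listed in order of preference.
ORIGINS = ["dallas.example.com", "milan.example.com"]

if __name__ == "__main__":
    # Simulate the primary being down instead of probing the network.
    down = {"dallas.example.com"}
    print(pick_origin(ORIGINS, is_healthy=lambda h: h not in down))
```

A check run from a single vantage point like this would not catch an outage that is only visible from some networks; that needs probes from several locations, and the DNS or proxy switch itself is left out here.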
