iwStack outage
First time I've seen a service outage in the several months I've been with them. One of my servers went down; the other was unaffected.
Today at 10:01 AM CEST the iwStack orchestrator received a disconnection event from several hosts, which triggered a massive High Availability recovery procedure for more than 600 instances.
At 10:50, while most instances were back up and running, a couple of hundred were stuck in the starting state, waiting for the network setup to complete.
At 11:10, in an attempt to speed up the process, we forced a network restart (including a VR rebuild), but this turned out to be the wrong solution and caused further delay.
Finally, at 13:00 all the queued instances were started. If your instances are still in the stopped state, just start them. Please open a ticket if an instance doesn't start.
At present we have disabled the HA flag for all instances while we investigate the incident.
We are sorry for any inconvenience this issue may have caused.
Comments
I don't understand "the iwStack orchestrator received a disconnection event from several hosts which triggered a massive High Availability" recovery, but whatever the reason is, they should make sure such incidents won't happen again. This puts a bad image on cloud computing, which claims 100% uptime.
I just received the email too. But my uptime is still at 309 days, so it seems it didn't affect all instances.
I hope there are no more problems like this in the future.
@instatech: Dude, NOTHING ON THIS PLANET has 100% uptime.
1) Ensure you have recent backups
2) Redundancy, redundancy, redundancy (multiple instances, multiple providers, multiple geographies)
3) Have "Hot spares" always ready to go (up to date servers that are NOT normally exposed to the internet that you can failover to) in multiple geographies/regions
4) Have a good failover mechanism (whether DNS, or load balancer device)
5) Hold providers with persistent recurring faults accountable (and dump them at YOUR convenience). Prometeus is NOT a provider with recurring/persistent issues. This is the first iwStack outage I am aware of (at least since I started using it).
6) Did I mention redundancy?
7) Sit back, relax, enjoy life
This is called having a Business Continuity Plan
Cheers
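Step 4 (a good failover mechanism) can be as simple as a health probe that prefers the primary and falls back to a hot spare. A minimal Python sketch, assuming hypothetical endpoints (`primary.example.com` and `spare.example.net` are placeholders, not anything from this thread):

```python
# Minimal failover sketch: probe the primary over TCP, fall back to
# the hot spare only when the probe fails. Hostnames are hypothetical.
import socket

PRIMARY = ("primary.example.com", 443)   # placeholder endpoint
HOT_SPARE = ("spare.example.net", 443)   # placeholder endpoint

def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_endpoint(primary=PRIMARY, spare=HOT_SPARE):
    """Prefer the primary; fail over to the spare only when the primary is down."""
    return primary if is_reachable(*primary) else spare
```

In practice you would plug a check like this into your DNS updater or load balancer rather than calling it ad hoc, and probe the actual service (HTTP health endpoint) instead of just the TCP port.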
The issue was with a network card, which caused network problems on some hosts; that in turn caused the orchestrator to go into HA mode and restart all of those instances on other nodes. Following that, some instances were stuck in the starting state; it didn't affect all instances. As mentioned in the RFO, it is being looked into, of course to avoid such issues in the future.
Also, iwStack does not claim 100% uptime. This is the first issue of this scale since iwStack's inception.
There is no 100% uptime.
Here is a somewhat more extensive RFO:
http://board.prometeus.net/viewtopic.php?f=15&t=1409&p=1965#p1965
In this case, the main problem was the HA itself. Without HA, the downtime would have been only the few minutes it took us to isolate the malfunctioning NIC and solve the problem. However, those few minutes of downtime convinced the orchestrator that all the VMs on the affected nodes were down, and it proceeded to restart them on other nodes. That meant the queue was full for hours, and since the virtual routers are VMs too, running on random nodes, at times the VMs started before the VR, or came up on nodes which had been online the whole time where, while the VM was up, the network was down. It was a huge mess.
You can defend against a node failure, even a few, but when the orchestrator thinks tons of nodes died at once, it cannot really be fixed fast.
We are thinking of adding some code to check whether more than one node appears offline and, if so, to wait for human intervention, because that is highly unlikely to happen due to actual node failure. CloudStack was conceived by people used to XenServer clusters, with KVM added later. In hindsight, it would have been better to put it on Xen at that time, but what is done is done. We do plan to make a Xen cluster soon, though, to test it and give people a choice, and maybe phase out KVM in time if it proves successful.
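The safeguard Maounique describes, treating several simultaneously "offline" nodes as a probable network fault rather than real failures, could look roughly like this minimal sketch (function and action names are my own illustration, not CloudStack's actual API):

```python
# Sketch of the proposed HA safeguard. Names are illustrative only.
# Idea: one offline node is most likely a genuine hardware failure, so
# automatic HA recovery is safe; several nodes vanishing at once is far
# more likely a network/NIC problem, so hold and page an operator.

def ha_decision(offline_nodes, max_auto_recover=1):
    """Decide what the orchestrator should do for apparently-offline nodes.

    Returns one of: "noop", "auto_recover", "wait_for_operator".
    """
    if not offline_nodes:
        return "noop"
    if len(offline_nodes) <= max_auto_recover:
        return "auto_recover"
    # More nodes "down" than a plausible independent-failure count:
    # pause HA and wait for a human to confirm before mass-restarting VMs.
    return "wait_for_operator"
```

The threshold (`max_auto_recover`) is the tunable: set it to whatever number of simultaneous real node failures your cluster could plausibly see, and anything above it is treated as a monitoring or network artifact.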
My instances were not affected, it seems. All 100% up, like throughout the last months. Thumbs up for that!
@Maounique, @Prometeus: PLEASE do not phase out KVM
"and give people a choice" --> THIS is the better solution, IMHO
Don't read "phase out" as closing down the KVM zones, far from it; we would just make Xen zones the default and only re-assign nodes if KVM usage drops and there is a need for more nodes in the Xen zone.
We have only phased out a few products so far. I can only remember the separate Windows offer (with Proxmox, outside the cloud), and we will discontinue the KVM storage plans as well as the atomic Cloudmin ones made redundant by the xenpower L plans. Add to this the old shared hosting with a shared IP and no resource isolation, where people suffer from bad neighbours, because the new plans with a dedicated IP, dedicated IOPS and dedicated CPU cycles are far better.
ah, OK
OK, thanks to all for posting detailed information; now I understand it. @geekalot, I like your Business Continuity Plan, it is useful and I will follow it.
I can definitely understand the rationale behind it (support costs alone for managing multiple KVM 'variations') but I would be sad to see classic KVM go.
@instatech, it is basically combinations & permutations:
I won't bore you with the actual math, but suffice it to say it is HIGHLY unlikely (a 1% of 1% of 1% of 1% chance of a complete failure across 4 independent servers with 99% uptime each; even less for 99.9% uptime).
Just try your best to reduce any SINGLE point of failure.
This is how you can string together cheaper 2nd (or even 3rd) tier providers and have better performance than the expensive "1st" tier providers -- all day, every day (IMHO).
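For the curious, the math being alluded to is just multiplying the failure probabilities of independent servers. A quick sketch (assuming fully independent failures, which real deployments only approximate, since shared providers, regions, or DNS create correlated risk):

```python
# Availability math for n independent redundant servers.
# Assumes failures are truly independent, a simplification.

def combined_unavailability(per_server_uptime, n):
    """Probability that ALL n independent servers are down at the same time."""
    return (1.0 - per_server_uptime) ** n

def combined_uptime(per_server_uptime, n):
    """Probability that AT LEAST ONE of the n servers is up."""
    return 1.0 - combined_unavailability(per_server_uptime, n)

# Example: four independent servers at 99% uptime each.
# Complete-outage probability is 0.01 ** 4 = 1e-8, i.e. "eight nines".
```

This is why stringing together several cheap 99%-uptime providers can beat one expensive provider, as long as the failures really are uncorrelated.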
Cheers
Good riddance to classic KVM. I've been using a newer one for a while and I have to say it's much better!
No; as long as at least 3 nodes are populated 2/3 full to offer 2+1 redundancy, it will not happen. Even then, sacrificing some space on 3 nodes will not create as many problems for us as moving would for our really long-time customers. It will be a bitch to maintain, though, but hopefully nothing at the level of individual nodes for regular VPS, which usually need more maintenance than iwStack nodes. And it is easy here to do it without downtime: just put the node in maintenance, wait for the VMs to be moved without downtime or noticeable issues, and proceed.
@Maounique - our experience with CloudStack is that XenServer is a lot more robust than KVM. We have a 200-node CS install; the only thing that really bites with XenServer is that you need to keep your cluster size small (8-16 servers). Also you need to be careful that you have the exact same processor model/stepping/revision across your cluster, otherwise migrations start being refused.
Those problems are similar with KVM, so that is not an issue. We hit some walls when we were designing iwStack; it took way too long, and KVM happened to work much faster at that time.
Your heart has 100 percent uptime.
Actually, it doesn't. Sometimes a heart can stop and be started again.
@Coudio, Well, at least you HOPE so ... LOL
Better pray to your "provider" if it does not :-)
Nope. There are many people whose hearts have stopped at least once, and while the majority of people may happen to have 100% uptime, on average it is not 100%.
The only 100% sure thing is death, so far.
And taxes
I am 100% sure I am wearing blue trousers at the moment.
Blue is subjective; I often say something is blue and my partner says it is green. You learn the colors as a kid when your parents explain them, but the spectrum is continuous and there are not just 7 colors; there are practically infinitely many.
Taxes are not 100% sure either; many people live in tribal areas where there is no government to collect them, not even money.
500+ ESXi hosts here with a mixture of hardware. Fuck all the issues with failover and migration. We're now standardizing on E5s for the next 10 clusters, but still, didn't expect so many issues...
Excellent point
@Raymii - make sure the machines are IDENTICAL: same mobo, same CPU version, same RAM type, etc. It drove me crazy about 18 months ago when we built this thing.
Our main VM was down until I restarted it from the panel.
I have to say my iwStack uptime and performance have been better than Rackspace, which is intended for mission-critical stuff, and iwStack pricing is an order of magnitude better.
Mind sharing the list? They will definitely be on the top list of places to retire ... assuming they have internet connectivity :-)
But seriously though, even "tribal areas" have their own "taxation" system ... e.g., having to ante up 100 cows to marry the chief's daughter etc
Sorry about that; at this time HA is turned off until we can make sure something like this will not happen again.