
WTF?! How was I pwned?


Comments

  • jlay Member
    edited July 2020

    Change management policies are free; if something needs a questionable fix, don't do it in 'production', where exposure is the highest. Set up a sandbox on another VLAN for testing.

    Stuff happens, but it's not a good look to have things like this happening, even in isolated cases when it's so easily avoided.

    Edit:
    This all may sound critical, but now I wonder - what may happen next? What primary and secondary controls are going to be implemented, and what is the expected result? Are there contingencies in place?

    This is all a part of doing business in good faith

    Thanked by 1: pwned
  • jar Patron Provider, Top Host, Veteran
    edited July 2020

    @jlay said:
    Change management policies are free; if something needs a questionable fix, don't do it in 'production', where exposure is the highest. Set up a sandbox on another VLAN for testing.

    Stuff happens, but it's not a good look to have things like this happening, even in isolated cases when it's so easily avoided.

    Edit:
    This all may sound critical, but now I wonder - what may happen next? What primary and secondary controls are going to be implemented, and what is the expected result? Are there contingencies in place?

    This is all a part of doing business in good faith

    Sounding like someone who earned their job title, but that's a bit of a high expectation for this market segment. That’s why you do well in your career, you badass ;)

    I just like to see the honesty and transparency, as lying about it is another time-honored LET tradition. Props to VirMach for being open about it; they could’ve just let the thread die.

    Thanked by 3: angelius, pwned, jlay
  • jlay Member
    edited July 2020

    @jar said:

    @jlay said:
    Change management policies are free; if something needs a questionable fix, don't do it in 'production', where exposure is the highest. Set up a sandbox on another VLAN for testing.

    Stuff happens, but it's not a good look to have things like this happening, even in isolated cases when it's so easily avoided.

    Edit:
    This all may sound critical, but now I wonder - what may happen next? What primary and secondary controls are going to be implemented, and what is the expected result? Are there contingencies in place?

    This is all a part of doing business in good faith

    Sounding like someone who earned their job title, but that's a bit of a high expectation for this market segment. That’s why you do well in your career, you badass ;)

    I just like to see the honesty and transparency, as lying about it is another time-honored LET tradition. Props to VirMach for being open about it; they could’ve just let the thread die.

    :) Thanks for the kind words, Jar

    It may come across as critical, but they're words of love. Nobody is perfect, but that's what processes and mitigations are for :smile:

    It may cost money in the moment by not adding gear to the fleet, but a long track record of doing what's right for the customers gets noticed

    Thanked by 1: jar
  • smarthead

    @VirMach said:
    We've worked with @pwned to resolve this issue.

    A total of 47 virtual servers could have potentially been affected by this, and we will send out communication soon. We did already patch everyone; we are just triple/quadruple checking everything to make sure we didn't miss anything before sending out more information. This was related to a specific fix for a problem that we discovered in SolusVM's configuration for certain VMs, which resulted in higher than normal idle CPU usage. We are also working on our own layer of security to prevent this from potentially becoming an issue in the future. We've informed SolusVM and have also requested that they consider making some changes.

    I actually don't believe anyone else was affected to the same degree. This was just a combination of very specific and rare scenarios. It definitely could have been handled better -- our staff followed some direct instructions from SolusVM instead of questioning them. Said staff was instructed not to use the fix in that manner, but it was never reverted. The fix should have theoretically worked out; it just wasn't ideal.

    We never pushed out this fix in this manner outside of a single node, but we are checking others as well. I'll return with more information later.

    What was the actual problem you fixed? We had the same malware on some of our KVM instances.

    SSH using keys only, no web or mail server.
    We use OpenNebula, not SolusVM, and we still have no idea what happened, only when.

    Thanked by 1: pwned
  • VirMach Member, Patron Provider

    @smarthead said: What was the actual problem you fixed? We had the same malware on some of our KVM instances.

    SSH using keys only, no web or mail server.

    We use OpenNebula, not SolusVM, and we still have no idea what happened, only when.

    I'm not sure how OpenNebula does it, but SolusVM seems to insert settings into the configuration file for the KVM instance, and this includes settings such as the authentication for VNC.
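
    To illustrate the kind of thing that ends up in there (this is a generic libvirt example, not our or SolusVM's actual tooling): a KVM guest's domain XML normally carries something like <graphics type='vnc' port='-1' autoport='yes' passwd='...'/>, and if that passwd attribute is missing, the console is effectively open. A rough audit sketch, assuming the libvirt Python bindings and qemu:///system access on the node:

        # Hypothetical audit sketch: flag KVM guests whose VNC console has no password set.
        # Assumes the libvirt Python bindings are installed on the node.
        import libvirt
        import xml.etree.ElementTree as ET

        conn = libvirt.open("qemu:///system")
        for dom in conn.listAllDomains():
            # VIR_DOMAIN_XML_SECURE is needed; without it libvirt strips the passwd attribute.
            xml_desc = dom.XMLDesc(libvirt.VIR_DOMAIN_XML_SECURE)
            for gfx in ET.fromstring(xml_desc).findall("./devices/graphics"):
                if gfx.get("type") == "vnc" and not gfx.get("passwd"):
                    print(f"{dom.name()}: VNC graphics device has no password set")
        conn.close()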

    @jlay said: Nobody is perfect, but that's what processes and mitigations are for

    It may cost money in the moment by not adding gear to the fleet, but a long track record of doing what's right for the customers gets noticed

    @jlay said: if something needs a questionable fix, don't do it in 'production', where exposure is the highest. Set up a sandbox on another VLAN for testing.

    Stuff happens, but it's not a good look to have things like this happening, even in isolated cases when it's so easily avoided.

    Okay so just to be clear, we definitely tested this thoroughly outside of just pushing it out to the live environment. We tried multiple different methods, and I definitely leaned toward a fix we haven't deployed yet.

    From the beginning, I had concerns about the initial solution SolusVM provided for pushing this out in bulk. I probably tested a dozen-plus different configurations/VMs, and that was after I had already established the specific way I wanted to do it. We had internal discussions about it, I essentially voiced my concerns to the team, and they did agree in the end. The problem is that the person who was going to push out an initial batch of the fix, where we needed it most, did not disable the configuration change, and by default it takes effect after a reboot. This means anyone who rebooted, re-installed, or touched certain other controls could have potentially been affected. In addition, I'm pretty sure that in one conversation or another these actually were rebooted.

    I wasn't the person who directly handled these, so unfortunately I didn't follow up as I should have; I'd really have to look further later to see what actually happened and make sure that kind of miscommunication does not occur in the future. As for why the person didn't revert it, I'm not sure; I've reviewed our conversation and it seemed like there was a mutual understanding. Even so, this person did also mention that he was going through the configurations and ensuring they were correct and functional. Of course, this person did also unfortunately depart about a week or two later, and honestly I'd have to perform a thorough audit to see what actually was and wasn't done, and why.

    @jlay said: This all may sound critical, but now I wonder - what may happen next? What primary and secondary controls are going to be implemented, and what is the expected result? Are there contingencies in place?

    This is all a part of doing business in good faith

    I do know, however, that I'm now the one spearheading this, so I will continue onward as I had always planned, and I believe that's a lot closer to what you would expect.

    Even with how it turned out, we only did it to the level we thought necessary, so this was pushed out to less than 0.1% of our customers. I've checked the logs, and it looks like closer to a fifth of those were actually potentially affected, because the others don't follow any of the patterns of a compromised service, and there was no reboot either. Realistically, I think this may have only affected 2 or 3 VMs. I know it's not 0, but there will always be a level of thinking and planning that goes into anything we do so that if everything goes wrong, we still end up mitigating the impact. We didn't just take what SolusVM told us and immediately push it out to all servers because we'd save processing power. We'd never do that.

    @tetech said: Sounds like a VirMach BF special. I'd agree to raise a ticket, and say that you're not looking for help setting up your VM but want them to be aware in case there is a vulnerability on the node.

    @PulsedMedia said: whatever host you went with is asking $5 per ticket?

    Whatever the case is, you should raise a ticket with them just to be sure.

    @thedp said: It’s a security-related matter so it should go above everything 😊

    These are the kind of things you would be able to make tickets about without worrying about getting billed. I know we really pushed the rules and scared some people off beyond a reasonable level, but we were just trying to stop people from making the type of tickets they're making anyway. Like today, we had a ticket from a limited support package that reported an outage using that special button, in the priority queue, and it was because he ordered the VM, it began setting up at, let's say, 1:00 AM, and at around 1:05 AM he tried re-installing the OS. Then he rebooted it, re-installed it twice more, and made the ticket right after, all within the span of 20-30 minutes.

    Then another person wanted us to basically install/configure his software for him (and this one I'm pretty sure was actually a Black Friday special), and he got slightly annoyed that we gave him some general instructions for free that didn't exactly meet his configuration requirements. Neither of those people got charged, because otherwise I'd have two LET threads I'd have to spend a few hours each on where they talk about how we scammed them. Another person made not one but two priority tickets (remember, these are the new tickets where we specifically put a price tag and a warning on them) on his limited support plan, he's that confident, and then a third ticket, all about the same thing. That's just today. Anyway, I didn't mean to go on this weird tangent, but I had to address the whole idea of everyone being afraid of us being unreasonable for whatever reason. Of course it's good to take what we say at face value, but you guys are right: this got marked as priority support, nothing was billed, and it was of course handled.

    @thedp said: Perhaps you should let your provider know what’s going on and your findings. Just an ‘FYI’ for them and if they’re willing to help or provide feedback, then they just might.

    @pwned was very helpful in this. His initial ticket basically had all the information we needed. I even told him a version of this: it was very easy to take him seriously from the start and rule out that he had just set his password to "Dog123", because he actually provided all the information we needed to indicate something else.

    So here's the actual TLDR part of the post: the full-ish explanation. I was planning on making this after I informed customers but I had to gather my thoughts either way. Customers are at the same time getting a more concise version of this.

    When we began OpenVZ to KVM migrations, apart from a bunch of other issues, we noticed something strange. Certain nodes were at higher loads than others. They were definitely at unacceptable levels, but we had done all the math and planning correctly. Initially we were just worried about performance, so we ran some real-world tests and such and realized that the high load did not really affect it to the level where it would be problematic. Of course, it still had to be addressed.

    We did the thing we usually do if our automated systems are off by a little bit and allow a node to overfill with a few extra VMs: we sent out some requests to see if anyone wanted to migrate to other nodes. That works out in the cases I mentioned; however, in this case, now that we know what was causing it, it makes sense why those efforts failed. Essentially, no single person or small group of people was really using a lot of processing power, yet the nodes were still overloading. And again, I do want to clarify that only a few nodes had this problem to a visible level, so it's not like everything was going haywire, and they weren't overloading in a way that made it an emergency situation. We've of course had these nodes locked off since December, and they cooled off after a few weeks.

    Being busy, and having disagreements about what could be causing it, we didn't really dedicate our lives to it. Since the nodes had calmed down, there was no reason to have a debate when we had other important things to handle. (One sysadmin thought it was customer VMs with malware that somehow became more apparent after the conversion, I personally thought it might be some slight misconfigurations resulting from the conversions, and another sysadmin thought we had just put too high a quantity of services on those nodes and should have spread them across more.)

    Enter the problem: we still had those OpenVZ to KVM conversions that we had delayed. Since this issue had only happened on a small number of servers, we crossed our fingers and completed the remaining conversions.

    Well, right off the bat it was worse. Again, now that I know what was causing the issue, it makes sense why it would be worse. So of course it's more of an immediate issue again. We try the migrations again; it doesn't work. We try various patches (this was one of the maintenance windows we scheduled; we bundled it with some work we were doing for hardware as well and thought it would be a good time). These aren't related to the current issue, and they don't make a difference. I mean, they probably slightly improve the overall performance of the node, but it's a drop in the bucket.

    I get more involved in the matter (usually I lay back a little more and focus on sales, but I had been shifting back to sysadmin work and customer service like in the early days).

    One of these days I have a flashback to a conversation with one of my friends, when I was discussing this for fun (yes, I know, it's extremely exciting; after a long day of work, what better way to relax than coming up with drawn-out theories). Anyway, back when we had the initial issue, he had mentioned that he thought it was related to interrupts and some specific bug. Then I remembered that I had researched this issue and sent it off to one of our sysadmins, and that was that. So this time I followed up on it myself. After an hour or two (well, much longer if you count the initial research and discussions) I had mapped out all the VMs on each of the two (or maybe three?) nodes that had this issue to a concerning level. The information I had gathered was related to clock settings and some other issues with Windows guests when it came to KVM configuration. Well, it wasn't Windows. Then I realized that it was definitely correlated with operating systems. I tallied everything up, and it seemed like stretch (Debian 9) was the issue. I started exploring this on my own for a bit, and then came across information showing that Proxmox had a similar problem with something else and had patched it. So while I was still looking for the specific configuration issue, I figured I'd reach out to SolusVM at this point, as they would have more tools and flexibility (source code) to implement a fix more efficiently, and perhaps even, out of kindness, help us figure out this issue, which at this point I was fairly certain was more related to libvirt/KVM than to SolusVM.

    Oh, right, I still haven't really stated the problem: each affected VM was using an abnormally high amount of processing power at idle. Instead of idling at X, they were idling at, let's say, 3-4X. Multiply that by a bunch of VMs and it becomes difficult to spot and locate, but it adds up, especially with Debian 9 becoming much more frequently used by this point. These bad nodes just ended up being unlucky and having a lot of stretch VMs. We also believe Ubuntu 18 has the same problem, and potentially others, but not that many people use it yet, so it hasn't been thoroughly verified.
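
    Just to give a rough sense of how that kind of tally works and why it adds up (the numbers and data below are made up for illustration, not our actual figures or tooling):

        # Illustrative sketch: tally average idle CPU by guest OS across a node.
        # The vms list is fabricated sample data, not real measurements.
        from collections import defaultdict

        vms = [
            {"name": "vm101", "os": "debian9", "idle_cpu_pct": 3.8},
            {"name": "vm102", "os": "centos7", "idle_cpu_pct": 1.0},
            {"name": "vm103", "os": "debian9", "idle_cpu_pct": 4.1},
            {"name": "vm104", "os": "ubuntu16", "idle_cpu_pct": 0.9},
        ]

        totals, counts = defaultdict(float), defaultdict(int)
        for vm in vms:
            totals[vm["os"]] += vm["idle_cpu_pct"]
            counts[vm["os"]] += 1

        for os_name in sorted(totals):
            avg = totals[os_name] / counts[os_name]
            print(f"{os_name}: {counts[os_name]} VMs, avg idle {avg:.1f}% of a core")

        # If 60 stretch VMs each idle ~3% of a core higher than normal,
        # that is roughly 1.8 extra cores of constant load on one node.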

    We provided a lot of information to SolusVM, mainly letting them know we were leaning toward the KVM configs. At the time I thought it might be related to a device, more specifically the disk device and how the Debian template was built in relation to it and its configuration, with some vague connections to the other material I had discussed. I wasn't right on the money, but luckily SolusVM replied in about a week, and they had located the specific device causing the issue in the configuration. (Well, they didn't exactly come up with this from a bunch of testing, so they technically didn't come up with it themselves, but they had found the specific problem with this specific operating system documented online.)

    So they put in a request to patch this, with no ETA, letting us know it would be at least a month until they could give us any news. We decided that for these couple of nodes we needed to move forward with it to some degree as soon as possible. This is also where the questionable workaround was provided to us.

    Here's where it became questionable: libvirt has some default configuration, and then SolusVM modifies it in whatever way it wants. This is the step where OpenNebula would come in, @smarthead. Initially I was just concerned that the patch would not function this way, because we didn't know what would be re-inserted and what wouldn't, and we wouldn't know specifically what to try to detach. Instead, in our testing we used a copy of the live configuration, and it worked out fine. I did test it with the method SolusVM provided, and it seemed like it wasn't doing something correctly. I spoke with our team and voiced this concern, and also vaguely mentioned that we wouldn't know what else could go wrong since there's a lot of other configuration.

    Well, what went wrong is that SolusVM doesn't re-insert the VNC authentication settings if you use a custom configuration.
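
    For anyone chasing a similar problem, one well-known libvirt behavior worth knowing about (I'm not claiming this is exactly what happened inside SolusVM's workaround): dumping a domain's live XML without the security-info flag silently omits the VNC passwd attribute, so redefining a guest from that copy can leave the console passwordless. A minimal check with the libvirt Python bindings; "vm123" is a placeholder name:

        # Sketch: show how a copy of the live XML can silently drop VNC auth.
        import libvirt

        conn = libvirt.open("qemu:///system")
        dom = conn.lookupByName("vm123")  # placeholder domain name

        plain = dom.XMLDesc(0)                               # passwd attribute stripped
        secure = dom.XMLDesc(libvirt.VIR_DOMAIN_XML_SECURE)  # passwd attribute included

        print("passwd in plain dump: ", "passwd=" in plain)
        print("passwd in secure dump:", "passwd=" in secure)
        # Redefining the guest from the plain dump would leave its VNC console with no password.
        conn.close()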

    We're working on further locking down VNC so there's something to fall back on should this ever become a problem again. We've also asked SolusVM to consider a few different fixes that we could not implement as efficiently ourselves without being able to customize the code. I think a combination of these things should be pretty solid, even outside of this specific scenario.
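
    As one example of the kind of extra layer that can help here (a generic idea, not necessarily what we're building): besides requiring a password, the VNC listener itself can be kept off public interfaces and periodically audited, so an unauthenticated console isn't reachable even if the panel misconfigures it. A rough sketch, where MGMT_ADDRS is a hypothetical allow-list:

        # Sketch: flag KVM guests whose VNC console listens outside a management allow-list.
        import libvirt
        import xml.etree.ElementTree as ET

        MGMT_ADDRS = {"127.0.0.1", "10.0.0.5"}  # placeholder management/panel addresses

        conn = libvirt.open("qemu:///system")
        for dom in conn.listAllDomains():
            root = ET.fromstring(dom.XMLDesc(0))
            for gfx in root.findall("./devices/graphics[@type='vnc']"):
                listen = gfx.get("listen") or ""
                if listen not in MGMT_ADDRS:
                    print(f"{dom.name()}: VNC listening on '{listen or 'default'}'")
        conn.close()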

    They did get back to us and wanted us to proceed in a way that's closer to how we did our initial testing, where we confirmed it functioned and everything was still intact, and then they followed up with an actual patch. We'll definitely test this thoroughly in our dev environment and try to break it for a few weeks before we push it out.
