WTF?! How was I pwned?

pwned Member

A monitoring service showed that my KVM VPS was rebooted, and when I logged in /var/log was empty and there were several strange processes running: /etc/my.conf, /var/ssh.conf, /usr/bin/.sshd, /usr/bin/pythno, /usr/bin/dpkgd/ps ax. (I did save a few of these binaries before I nuked the box, in case there's a way to safely examine them.)

I happened to have a screen with /var/log/auth.log tailed. The last entry is ominous:

Jun 21 03:59:48 XXXX login[394]: pam_unix(login:auth): check pass; user unknown
Jun 21 03:59:48 XXXX login[394]: pam_unix(login:auth): authentication failure; logname=LOGIN uid=0 euid=0 tty=/dev/tty1 ruser= rhost=
Jun 21 03:59:51 XXXX login[394]: FAILED LOGIN (1) on '/dev/tty1' FOR 'UNKNOWN', Authentication failure

Wait, /dev/tty1? Isn't that the console connection? How would anyone connect there?

The down alert came in seconds later at 04:00:37 UTC. It went back up at 04:04:38 UTC.

More importantly, what did I do wrong to allow this to happen? I've been admin'ing VPSs for years, and I've never had something like this happen.

Background: Installed Debian buster a few days ago (used the provider's Plesk VNC console to select the netinst kernel and initrd through grub), no root account, no services other than ssh at install. On first boot, the usual: configured sshd to a non-standard port, disabled sshd password auth, set sshd AllowUsers to my account only, and installed ntp, nginx, certbot, and munin. Configured iptables to allow tcp/80, tcp/443, udp/123, and the tcp ssh port through. Pretty sure I even disabled the VNC console through the Plesk control panel when I was done. It's an idler--there really wasn't anything running on it.
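For the curious, that hardening amounts to roughly the following (an illustrative sketch; the port number and username are placeholders, not my real ones):

    # /etc/ssh/sshd_config (the non-default bits)
    Port 2222
    PermitRootLogin no
    PasswordAuthentication no
    AllowUsers my_username

    # iptables, IPv4 inbound: allow the listed services, then default-deny
    iptables -A INPUT -i lo -j ACCEPT
    iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -p tcp --dport 2222 -j ACCEPT   # ssh on the non-standard port
    iptables -A INPUT -p tcp --dport 80 -j ACCEPT     # http
    iptables -A INPUT -p tcp --dport 443 -j ACCEPT    # https
    iptables -A INPUT -p udp --dport 123 -j ACCEPT    # ntp
    iptables -P INPUT DROP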

My account with the host has (had!) a unique 64-character password, and 2FA (TOTP) is active.

I'm stuck on that /dev/tty1 entry...all the usual login attempts come in through an IP to sshd, but not that one. What am I missing here?

Thoughts?

Thanks in advance!

Thanked by 1mxvin

Comments

  • LittleCreek Member, Provider

    Is there a console app in the control panel for the vps?

    Thanked by 1pwned


  • pwned Member

    I see a VNC/Desktop button on the management page, and under the control panel there's a VNC button. I don't think there's a non-VNC console app or an ssh-tunnel-style console option.

  • thedp Member
    edited June 21

    Doesn’t look like an actual login attempt though.

    You should inspect those confs and binaries in a sandbox; they might give you clues about what's going on.

    Edit: In fact, could those auth log entries have been you yourself logging in, but for some reason logged as 'unknown'?

    Thanked by 1pwned
  • pwned Member

    No, my logins all go through sshd, show my username, and log an IP address. There are no other tty1 entries in auth.log that I can see. I have another box with the same host, and no tty1 entries there, either. I was doing something else at 03:59 UTC and didn't notice the situation until about 6:30 UTC.

    Obviously I'm concentrating on that entry because of the timing, and maybe my ego doesn't want to admit I screwed up and got cracked.

  • LTniger Member

    @pwned said:
    No, my logins all go through sshd, show my username, and log an IP address. There are no other tty1 entries in auth.log that I can see. I have another box with the same host, and no tty1 entries there, either. I was doing something else at 03:59 UTC and didn't notice the situation until about 6:30 UTC.

    Obviously I'm concentrating on that entry because of the timing, and maybe my ego doesn't want to admit I screwed up and got cracked.

    Don't be delusional. You have not been pwned yet. Do you have any other evidence of intrusion?


  • pwned Member

    The running processes and empty log directory are pretty big clues, aren't they?

  • jar Provider

    @pwned said:
    The running processes and empty log directory are pretty big clues, aren't they?

    For sure. Though you're probably focusing too hard on the auth.log entry. Actual authentication seems terribly unlikely unless there's a default account/pass in their OS images that you weren't aware of. What I'd have been interested in knowing, and you might take note of it next time, is what user account the processes were owned by / running under. If your individual services don't run as root, that should tell you which user account was compromised and through what service. Maybe munin has a vulnerability, for example.

    I'm assuming you hadn't deployed a dynamic web app yet. You just said Nginx not Nginx + PHP + Wordpress or something. That's the culprit in 999 out of 1000 cases but you're no doubt well aware of that.
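    A quick way to capture that next time (the <pid> is a placeholder; this also assumes ps itself is trustworthy, which on a rooted box it may not be):

      ps axo user,pid,ppid,lstart,comm                # like ps ax, but with the owning user and start time
      grep -E '^(Name|Uid|Gid)' /proc/<pid>/status    # owner and IDs straight from /proc, no ps needed
      ls -l /proc/<pid>/exe /proc/<pid>/cwd           # where the running binary actually lives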

    Thanked by 1pwned
  • pwned Member

    I used the standard debian netinst files from debian.org, so I hope there's not a default account! (As I mentioned, I selected the no root account option in the installer.)

    The only user account on the box is mine.

    No webapps, PHP, wordpress, etc. The only change to the nginx config was to create an alias to the munin directory, /var/cache/munin/www. Munin was listening on 127.0.0.1:4949, not 0.0.0.0, and iptables wasn't open for tcp/4949.
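    The nginx change was just a static alias along these lines (a sketch; server block details omitted, paths as in the stock Debian munin package):

      # inside the default server block
      location /munin/ {
          alias /var/cache/munin/www/;   # static HTML/PNGs written by munin-cron
      }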

    Sorry, I did run ps ax, not ps aux:

      PID TTY      STAT   TIME COMMAND
        1 ?        Ss     0:03 /sbin/init
        2 ?        S      0:00 [kthreadd]
        3 ?        I<     0:00 [rcu_gp]
        4 ?        I<     0:00 [rcu_par_gp]
        6 ?        I<     0:00 [kworker/0:0H-kblockd]
        8 ?        I<     0:00 [mm_percpu_wq]
        9 ?        S      0:02 [ksoftirqd/0]
       10 ?        I      0:09 [rcu_sched]
       11 ?        I      0:00 [rcu_bh]
       12 ?        S      0:00 [migration/0]
       14 ?        S      0:00 [cpuhp/0]
       15 ?        S      0:00 [kdevtmpfs]
       16 ?        I<     0:00 [netns]
       17 ?        S      0:00 [kauditd]
       18 ?        S      0:00 [khungtaskd]
       19 ?        S      0:00 [oom_reaper]
       20 ?        I<     0:00 [writeback]
       21 ?        S      0:00 [kcompactd0]
       22 ?        SN     0:00 [ksmd]
       23 ?        I<     0:00 [crypto]
       24 ?        I<     0:00 [kintegrityd]
       25 ?        I<     0:00 [kblockd]
       26 ?        I<     0:00 [edac-poller]
       27 ?        I<     0:00 [devfreq_wq]
       28 ?        S      0:00 [watchdogd]
       29 ?        S      0:00 [kswapd0]
       47 ?        I<     0:00 [kthrotld]
       48 ?        I<     0:00 [ipv6_addrconf]
       49 ?        I      0:00 [kworker/u2:1-events_unbound]
       58 ?        I<     0:00 [kstrp]
      112 ?        I<     0:00 [kworker/0:1H-kblockd]
      142 ?        I      0:00 [kworker/u2:2-events_unbound]
      169 ?        I<     0:00 [kworker/u3:0]
      171 ?        S      0:00 [jbd2/vda2-8]
      172 ?        I<     0:00 [ext4-rsv-conver]
      204 ?        Ss     0:00 /lib/systemd/systemd-journald
      222 ?        Ss     0:00 /lib/systemd/systemd-udevd
      256 ?        I<     0:00 [ata_sff]
      259 ?        S      0:00 [scsi_eh_0]
      261 ?        I<     0:00 [scsi_tmf_0]
      262 ?        S      0:00 [scsi_eh_1]
      264 ?        I<     0:00 [scsi_tmf_1]
      269 ?        I<     0:00 [ttm_swap]
      378 ?        Ssl    0:00 /usr/sbin/rsyslogd -n -iNONE
      386 ?        Ss     0:00 /usr/sbin/cron -f
      388 ?        Ss     0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
      408 ?        Ss     0:00 /usr/sbin/vnstatd -n
      411 ?        Ssl    0:01 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 106:112
      417 ?        Ss     0:00 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
      418 ?        S      0:00 nginx: worker process
      650 ?        Ss     0:00 /lib/systemd/systemd --user
      651 ?        S      0:00 (sd-pam)
     2150 ?        Ss     0:00 /usr/bin/perl -wT /usr/sbin/munin-node
     2741 ?        Ss     0:00 /usr/sbin/sshd -D
     3011 ?        Ssl    0:03 /etc/my.conf
     3025 ?        Ssl    0:03 /var/ssh.conf
     3050 ?        Ss     0:00 /lib/systemd/systemd-logind
     3124 ?        Ssl    0:00 /usr/bin/.sshd
     3178 ?        Ssl    0:00 /usr/bin/pythno
     3235 tty1     Ss+    0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
     3417 ?        I      0:00 [kworker/0:0-ata_sff]
     3427 ?        I      0:00 [kworker/0:1-events_power_efficient]
     3428 ?        Ss     0:00 sshd: my_username [priv]
     3431 ?        Ss     0:00 /lib/systemd/systemd --user
     3432 ?        S      0:00 (sd-pam)
     3440 ?        S      0:00 sshd: [email protected]/0
     3441 pts/0    Ss+    0:00 -bash
     3451 ?        Ss     0:00 sshd: my_username [priv]
     3457 ?        S      0:00 sshd: [email protected]/1
     3458 pts/1    Ss     0:00 -bash
     3484 ?        I      0:00 [kworker/0:2-ata_sff]
     3590 ?        I      0:00 [kworker/u2:0-events_unbound]
     3629 pts/1    S+     0:00 ps ax
     3630 pts/1    S+     0:00 sh -c /usr/bin/dpkgd/ps ax
     3631 pts/1    R+     0:00 /usr/bin/dpkgd/ps ax
    
    
    Thanked by 1jar
  • raindog308 Moderator

    /usr/bin/dpkgd/ps ax

    If this is a compromised binary, then you have no idea what was running.
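    If you ever need to sanity-check binaries on a box like this, a rough first pass looks like the commands below. It assumes dpkg's database and the tools themselves weren't also tampered with, which a decent rootkit will do, so verifying from rescue media or a known-good mount is the only really trustworthy way.

      dpkg -V openssh-server procps coreutils   # verify installed files against package metadata
      debsums -c                                # needs the debsums package; lists files whose checksums changed
      dpkg -S /usr/bin/dpkgd/ps                 # a path owned by no package is an immediate red flag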


  • pwned Member

    @raindog308 said:
    /usr/bin/dpkgd/ps ax

    If this is a compromised binary, then you have no idea what was running.

    It is--you're absolutely correct.

  • jar Provider
    edited June 21

    I guess any reasonable way you spin it, if /var/log is empty it was rooted. It wasn't booted into single-user mode or anything, since your screen session was alive. So even if they got a console, they weren't guessing your password.

    Interesting set of variables. I’m a bit curious too. Compromised SSH key really isn’t a standard thing for this kind of behavior, that’s too personal and specific where these attacks are most typically automated and not targeted (granted can’t rule out that this one was). Compromised password still possible I suppose but seems like you’d have a log of that in your screen session and it would probably mean you’re compromised somewhere that password was noted.

    Thanked by 1pwned
  • pwned Member

    I should clarify--I had screen running on my local box, ssh'd into the victim, with tail -F /var/log/auth.log running on the victim. When the victim rebooted I lost the connection. If it had been an ssh compromise, I'm pretty sure I would have seen at least the ssh connection start in tail before they had a chance to cover their tracks, right?

    Now if they somehow had a console, they could have booted single user at grub, installed the rootkit, then rebooted. I was able to ssh in after the break.
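    (For reference, with console access that step doesn't even need a password: at the GRUB menu you press 'e', edit the linux line roughly as below, and boot with Ctrl-x. The placeholders are generic; 'single' goes to rescue mode, which should stall on a locked root account, while init=/bin/bash drops straight to a root shell.)

      linux /boot/vmlinuz-<version> root=<rootdev> ro single
      linux /boot/vmlinuz-<version> root=<rootdev> rw init=/bin/bash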

    If it was targeted, I would think the 8GB quad-core black friday special would have been a better choice than the 384 MB KVM-lite instance...

    Thanked by 1jar
  • vbloods Member

    that's scary.

  • sureiam Member

    Haven't run Debian in years, but doesn't it have a recovery option at boot that auto-logs in as root? A compromised VNC session might get in that way.

    Thanked by 1pwned
  • pwned Member

    @sureiam said:
    Haven't run Debian in years, but doesn't it have a recovery option at boot that auto-logs in as root? A compromised VNC session might get in that way.

    I'd have to check, but I'm pretty sure that recovery option hangs if there's no root account, as in this case.

    I've just about convinced myself to re-install as before just to see what happens.

  • lonea Member, Provider

    Most likely your VNC connection had no password on it, and someone just VNC'd in and did a single-user-mode boot.

    Thanked by 1pwned


  • thedp Member

    Perhaps you should let your provider know what’s going on and your findings. Just an ‘FYI’ for them and if they’re willing to help or provide feedback, then they just might.

    Thanked by 1pwned
  • pwned Member

    @lonea said:
    Most likely your VNC connection had no password on it, and someone just VNC'd in and did a single-user-mode boot.

    If so, how can I prevent that in the future? I'm not sure, but I thought I'd turned VNC off at the control panel. Assuming it was on, doesn't that mean someone compromised the host to gain access to my control panel?

    I certainly don't run a VNC server on the VPS.

  • lonea Member, Provider

    No, they connected via the vps node's IP and VNC port, ip:5900, ip:5901, etc

    Try to see if you can connect to it.
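    Something like this from your own machine, against the node's IP rather than the VPS IP (5900-5999 is just the usual VNC range; your host may map it differently):

      nmap -p 5900-5999 <node-ip>
      nc -v <node-ip> 5901        # a real VNC server answers with an "RFB 003.xxx" version banner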

    @pwned said:

    @lonea said:
    Most likely your VNC connection had no password on it, and someone just VNC'd in and did a single-user-mode boot.

    If so, how can I prevent that in the future? I'm not sure, but I thought I'd turned VNC off at the control panel. Assuming it was on, doesn't that mean someone compromised the host to gain access to my control panel?

    I certainly don't run a VNC server on the VPS.

    Thanked by 1pwned


  • pwned Member

    Someone asked for the files I grabbed. They're available at https://we.tl/t-8RuPUWz9t1

  • pwned Member

    @lonea said:
    No, they connected via the vps node's IP and VNC port, ip:5900, ip:5901, etc

    Try to see if you can connect to it.

    No connection. No process listening on those ports, and it's not open in iptables.

  • jar Provider
    edited June 21

    @pwned said: I should clarify--I had screen running on my local box, ssh'd into the victim, with tail -F /var/log/auth.log running on the victim. When the victim rebooted I lost the connection. If it had been an ssh compromise, I'm pretty sure I would have seen at least the ssh connection start in tail before they had a chance to cover their tracks, right?

    Now if they somehow had a console, they could have booted single user at grub, installed the rootkit, then rebooted. I was able to ssh in after the break.

    Ah ok, now that does sound plausible.

    @pwned said: No connection. No process listening on those ports, and it's not open in iptables.

    Makes me wonder if the host was compromised.

    Thanked by 1pwned
  • thedp Member

    @jar said: Makes me wonder if the host was compromised.

    That's why I advised raising this with the provider: it might ring some bells or even light up some bulbs on their end. They still need to be aware of what's going on and be on standby in case other 'neighbors' are reporting the same.

    Thanked by 1pwned
  • pwned Member

    @thedp said:

    @jar said: Makes me wonder if the host was compromised.

    That's why I advised raising this with the provider: it might ring some bells or even light up some bulbs on their end. They still need to be aware of what's going on and be on standby in case other 'neighbors' are reporting the same.

    Will do. I was hesitant in case someone found an obvious mistake on my part (why waste the host's time in that case), and it's a black-friday-no-support deal, so this might cost me $5. But at this point I'm curious enough to accept the charge if it gets us an answer.

  • thedp Member
    edited June 21

    @pwned said:

    @thedp said:

    @jar said: Makes me wonder if the host was compromised.

    That's why I advised raising this with the provider: it might ring some bells or even light up some bulbs on their end. They still need to be aware of what's going on and be on standby in case other 'neighbors' are reporting the same.

    Will do. I was hesitant in case someone found an obvious mistake on my part (why waste the host's time in that case), and it's a black-friday-no-support deal, so this might cost me $5. But at this point I'm curious enough to accept the charge if it gets us an answer.

    It’s a security-related matter so it should go above everything 😊

    Thanked by 1pwned
  • PulsedMedia Member, Provider

    @pwned said: Will do. I was hesitant in case someone found an obvious mistake on my part (why waste the host's time in that case), and it's a black-friday-no-support deal, so this might cost me $5. But at this point I'm curious enough to accept the charge if it gets us an answer.

    whatever host you went with is asking $5 per ticket? :O
    Whatever the case is, you should raise a ticket with them just to be sure.

  • tetech Member

    Sounds like a VirMach BF special. I'd agree to raise a ticket, and say that you're not looking for help setting up your VM but want them to be aware in case there is a vulnerability on the node.

    Thanked by 1pwned
  • sdglhm Member

    @pwned said: why waste the host's time in that case

    Are we sure that we're on LET?


    I'd also guess the VNC part. Maybe your host has VNC running on the node, and you don't need a port open on the VPS for that VNC to be reachable. (It will even show the reboot.)

    I've seen a similar setup at some providers. They don't use your IP address but some IP address from the provider. It's possible that the password was cracked and someone was able to log in using VNC.

    Thanked by 1pwned


  • raindog308 Moderator

    @pwned said: Someone asked for the files I grabbed. They're available at https://we.tl/t-8RuPUWz9t1

    You may find it interesting to run:

    strings .sshd

    If you're not familiar, the 'strings' command pulls out all printable strings (English words and otherwise) embedded in a binary.

    That .sshd binary has

    • a list of 188 IP addresses
    • references to many other binaries
    • the word "Google", which doesn't appear in my OpenSSH binary LOL
    • references to Mozilla, AppleWebKit, etc.
    • the string "#!/bin/bash" and commands to put things in /etc/rc.d
    • C++ stuff (OpenSSH does not use C++)
    • mentions of many files and binaries on the system
    • the name and address of a guy in Denmark, though looking him up, I believe that's just an author credit from some binary packed inside it

    The IP addresses (just judging by the nslooked-up hostnames) are mostly DNS servers. Lots and lots of .cn there, as well as cnmobile.net, chinamobile.com, etc.

    The same IP list is in pythno and, since it's the same binary, in ssh.conf (which is not a conf file but a binary). The same list is also in my.conf (also a binary).

    Not really my specialty but I'm sure someone used to deconstructing binaries could learn more from these.
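    For anyone who wants to repeat the triage on those samples, this is roughly all it takes with standard binutils/coreutils (the -n thresholds are just values I'd pick):

      file .sshd                                  # what the binary claims to be (ELF, stripped, packed?)
      sha256sum .sshd my.conf ssh.conf pythno     # hashes to look up in the usual malware databases
      strings -n 8 .sshd | sort -u | less         # printable strings of 8+ characters
      strings -n 7 .sshd | grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$' | sort -u   # the embedded IP list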


  • VirMach Member, Provider, Top Provider

    We've worked with @pwned to resolve this issue.

    A total of 47 virtual servers could have potentially been affected by this, and we will send out communication soon. We did already patch everyone; we are just triple/quadruple checking everything to make sure we didn't miss anything before sending out more information. This was related to a specific fix for a problem we discovered with SolusVM's configuration for certain VMs, which resulted in higher than normal idle CPU usage. We are also working on our own layer of security to prevent this from being an issue in the future. We've informed SolusVM and have also requested they consider making some changes.

    I actually don't believe anyone else was affected to the same degree. This was just a combination of very specific and rare scenarios. It definitely could have been handled better -- our staff followed some direct instructions from SolusVM instead of questioning them. Said staff was instructed not to use the fix in that manner, but it was never reverted. The fix should have theoretically worked out, but it just wasn't ideal.

    We never pushed out this fix in this manner outside of a single node, but we are checking others as well. I'll return with more information later.

  • jlay Member
    edited July 1

    Change management policies are free. If something needs a questionable fix, don't do it in 'production' where exposure is the highest; set up a sandbox on another VLAN for testing.

    Stuff happens, but it's not a good look to have stuff like this happening. Even in singular cases when it's so easily avoided.

    Edit:
    This all may sound critical, but now I wonder - what may happen next? What primary and secondary controls are going to be implemented, and what is the expected result? Are there contingencies in place?

    This is all a part of doing business in good faith

    Thanked by 1pwned


  • jar Provider
    edited July 1

    @jlay said:
    Change management policies are free. If something needs a questionable fix, don't do it in 'production' where exposure is the highest; set up a sandbox on another VLAN for testing.

    Stuff happens, but it's not a good look to have stuff like this happening. Even in singular cases when it's so easily avoided.

    Edit:
    This all may sound critical, but now I wonder - what may happen next? What primary and secondary controls are going to be implemented, and what is the expected result? Are there contingencies in place?

    This is all a part of doing business in good faith

    Sounding like someone who earned their job title, but that's a bit of a high expectation for this market segment. That's why you do well in your career, you badass ;)

    I just like to see the honesty and transparency, as lying about it is another time-honored LET tradition. Props to VirMach for being open about it; they could've just let the thread die.

    Thanked by 3angelius pwned jlay
  • jlay Member
    edited July 1

    @jar said:

    @jlay said:
    Change management policies are free. If something needs a questionable fix, don't do it in 'production' where exposure is the highest; set up a sandbox on another VLAN for testing.

    Stuff happens, but it's not a good look to have stuff like this happening. Even in singular cases when it's so easily avoided.

    Edit:
    This all may sound critical, but now I wonder - what may happen next? What primary and secondary controls are going to be implemented, and what is the expected result? Are there contingencies in place?

    This is all a part of doing business in good faith

    Sounding like someone who earned their job title, but that's a bit of a high expectation for this market segment. That's why you do well in your career, you badass ;)

    I just like to see the honesty and transparency, as lying about it is another time-honored LET tradition. Props to VirMach for being open about it; they could've just let the thread die.

    :) Thanks for the kind words Jar

    It may come across as critical, but they're words of love. Nobody is perfect, but that's what processes and mitigations are for :smile:

    It may cost money in the moment by not adding gear to the fleet, but a long track record of doing what's right for the customers gets noticed

    Thanked by 1jar


  • smarthead Member

    @VirMach said:
    We've worked with @pwned to resolve this issue.

    A total of 47 virtual servers could have potentially been affected by this, and we will send out communication soon. We did already patch everyone; we are just triple/quadruple checking everything to make sure we didn't miss anything before sending out more information. This was related to a specific fix for a problem we discovered with SolusVM's configuration for certain VMs, which resulted in higher than normal idle CPU usage. We are also working on our own layer of security to prevent this from being an issue in the future. We've informed SolusVM and have also requested they consider making some changes.

    I actually don't believe anyone else was affected to the same degree. This was just a combination of very specific and rare scenarios. It definitely could have been handled better -- our staff followed some direct instructions from SolusVM instead of questioning them. Said staff was instructed not to use the fix in that manner, but it was never reverted. The fix should have theoretically worked out, but it just wasn't ideal.

    We never pushed out this fix in this manner outside of a single node, but we are checking others as well. I'll return with more information later.

    What was the actual problem you fixed? We had the same malware on some of our KVM instances.

    SSH with key auth only, no web or mail server.
    We use OpenNebula, not SolusVM, and we still have no idea what happened, just when.

    Thanked by 1pwned
  • VirMach Member, Provider, Top Provider

    @smarthead said: What was the actual problem you fixed? We had the same malware on some of our KVM instances.

    SSH with key auth only, no web or mail server.

    We use OpenNebula, not SolusVM, and we still have no idea what happened, just when.

    I'm not sure how OpenNebula does it, but SolusVM seems to insert settings into the configuration file for the KVM instance, and this includes the authentication settings for VNC.
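    For context, in a libvirt domain definition the VNC password is just an attribute on the graphics element, something like the snippet below (values are placeholders, not our actual settings). If that passwd attribute is missing, the VNC listener accepts connections with no authentication at all.

      <graphics type='vnc' port='5901' autoport='no' listen='0.0.0.0' passwd='PLACEHOLDER'>
        <listen type='address' address='0.0.0.0'/>
      </graphics>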

    @jlay said: Nobody is perfect, but that's what processes and mitigations are for

    It may cost money in the moment by not adding gear to the fleet, but a long track record of doing what's right for the customers gets noticed

    @jlay said: if something needs a questionable fix, don't do it in 'production' where exposure is the highest; set up a sandbox on another VLAN for testing.

    Stuff happens, but it's not a good look to have stuff like this happening. Even in singular cases when it's so easily avoided.

    Okay so just to be clear, we definitely tested this thoroughly outside of just pushing it out to the live environment. We tried multiple different methods, and I definitely leaned toward a fix we haven't deployed yet.

    From the beginning, I had concerns regarding the initial solution provided by SolusVM for pushing this out in bulk. I probably tested out a dozen-plus different configurations/VMs, and that was after I had already established the specific way I wanted to do it. We had internal discussions about it and I essentially voiced my concerns to the team, and they did agree in the end. The problem is that the person who was going to push out an initial batch of the fix where we needed it most did not disable the configuration change, and by default it takes effect after a reboot. This means anyone who rebooted, re-installed, or touched some other controls would have potentially been affected. In addition, I'm pretty sure that in one conversation or another these actually were rebooted.

    I wasn't the person who directly handled these, so unfortunately I didn't follow up as I should have; I'd really have to look even further later to see what actually happened and make sure that kind of miscommunication does not occur in the future. As for why the person didn't revert it, I'm not sure; I've reviewed our conversation and it seemed like there was a mutual understanding. Even so, this person did also mention that he was going through the configurations and ensuring they're correct and functional. Of course, this person did also unfortunately depart about a week or two later, and honestly I'd have to perform a thorough audit to see what actually was and wasn't done and why.

    @jlay said: This all may sound critical, but now I wonder - what may happen next? What primary and secondary controls are going to be implemented, and what is the expected result? Are there contingencies in place?

    This is all a part of doing business in good faith

    I do know however that I'm now the one spearheading this, so I will continue onward as I had always planned and I believe that's a lot closer to what you would expect.

    Even with how it turned out, we only did it to the level we thought necessary, so this was pushed out to less than 0.1% of our customers. I've checked logs, and it looks like it's actually closer to 1/5th of that which was potentially affected, because the others don't follow any of the patterns of a compromised service and there's no reboot either. Realistically, I think this may have actually only affected 2 or 3 VMs. I know it's not 0, but there will always be a level of thinking and planning that goes into anything we do so that if everything goes wrong, we still end up mitigating the impact. We didn't just look at what SolusVM said to us and immediately push it out to all servers just because we'd save processing power. We'd never do that.

    @tetech said: Sounds like a VirMach BF special. I'd agree to raise a ticket, and say that you're not looking for help setting up your VM but want them to be aware in case there is a vulnerability on the node.

    @PulsedMedia said: whatever host you went with is asking $5 per ticket?

    Whatever the case is, you should raise a ticket with them just to be sure.

    @thedp said: It’s a security-related matter so it should go above everything 😊

    These are the kind of things you would be able to make tickets about without worrying about getting billed. I know we really pushed the rules and scared some people off beyond a reasonable level, but we were just trying to discourage the type of tickets people end up making anyway. Like today we had a ticket from someone on a limited support package who reported an outage using that special button, in the priority queue, and it was because he ordered the VM, it began setting up at, let's say, 1:00AM, and at around 1:05AM he tried re-installing the OS. Then he rebooted it, then he re-installed it twice more, and then made the ticket right after, all within the span of 20-30 minutes.

    Then another person wanted us to basically install/configure his software for him (and this one I'm pretty sure was actually a black friday special), and he got slightly annoyed that we gave him some general instructions for free that didn't exactly meet his requirements for configuration. Neither of those people got charged, because otherwise I'd have two LET threads I'd have to spend a few hours each on where they talk about how we scammed them. Another person made not one but two priority tickets (remember, these are the new tickets where we specifically put a price tag and a warning on them) on his limited support plan, he's that confident, and then a third ticket, all about the same thing. That's just today. Anyway, I didn't mean to go on this weird tangent, but I had to address the whole thing about everyone being afraid of us being unreasonable. Of course it's good to take what we say at face value, but you guys are right: this got marked as priority support, nothing was billed, and it was of course handled.

    @thedp said: Perhaps you should let your provider know what’s going on and your findings. Just an ‘FYI’ for them and if they’re willing to help or provide feedback, then they just might.

    @pwned was very helpful in this. His ticket initially basically had all the information we needed. I even told him a version of this: it was very easy to take him seriously from the start and rule out that he had just set his password as "Dog123" because he actually provided all the information we needed to indicate something else.

    So here's the actual TLDR part of the post: the full-ish explanation. I was planning on making this after I informed customers but I had to gather my thoughts either way. Customers are at the same time getting a more concise version of this.

    When we began OpenVZ to KVM migrations, apart from a bunch of other issues, we noticed something strange. Certain nodes were at higher loads than others. They were definitely at unacceptable levels, but we had done all the math and planning correctly. Initially we were just worried about performance, so we ran some real-world tests and such and realized that the high load did not really affect it to the level where it would be problematic. Of course, it still had to be addressed.

    We did the thing we usually do if our automated systems are off by a little bit and allow a node to overfill with a few extra VMs: we sent out some requests to see if anyone wanted to migrate to other nodes. That works in those cases, but here it failed, and now that we know what was causing the load, it makes sense why it failed. Essentially, no single person or small group of people was really using a lot of processing power; the nodes were still overloading. And again, I do want to clarify that only a few nodes had this problem at a visible level, so it's not like everything was going haywire, and they weren't overloading in a way that made it an emergency. We've of course had these nodes locked off since December, and they cooled off after a few weeks.

    Being busy, and having disagreements on what could be causing it, we didn't really dedicate our lives to it. Since the nodes had calmed down, there was no reason to have a debate when we had other important things to handle (one sysadmin thought it was customer VMs with malware that somehow became more apparent after the conversion, I personally thought it might be slight misconfigurations resulting from the conversions, and another sysadmin thought we had just put too high a quantity of services on those nodes and should have spread them across more nodes).

    Enter the problem: we still had those OpenVZ to KVM conversions that we had delayed. Since this issue had only happened to a small number of servers, we crossed our fingers and completed the remaining conversions.

    Well, right off the bat it was worse. Again, now that I know what was causing the issue, it makes sense why it would be worse. So of course now it's an immediate issue again. We try the migrations again; that doesn't work. We try various patches (this was one of the maintenance windows we scheduled; we bundled it with some work we were doing for hardware as well and thought it would be a good time). These aren't related to the current issue. They don't make a difference. I mean, they probably slightly improve the overall performance of the node, but it's a drop in the bucket.

    I get more involved in the matter (usually I lay back a little more and focus on sales but I had been shifting back to sys admin work and customer service like the early days.)

    One of these days I have a flashback to a conversation I had with one of my friends when I was discussing this for fun (yes, I know, it's extremely exciting, and after a long day of work what better way to relax than come up with drawn-out theories). Anyway, he had mentioned back then, when we had the initial issue, that he thought it was related to interrupts and some specific bug. Then I remembered I had researched this issue and sent it off to one of our sys admins, and that was that. So this time I followed up on it myself. After an hour or two (well, much longer if you count the initial research and discussions) I had mapped out all the VMs on each of the two (or maybe three?) nodes that had this issue to a concerning level. The information I had gathered was related to clock settings and some other issues with Windows guests when it came to KVM configuration. Well, it wasn't Windows. Then I realized that it was definitely correlated with operating systems. Tallied everything up, and it seemed like stretch (Debian 9) was the issue. I started exploring this on my own for a bit, and then came across information showing Proxmox had had a similar problem for something else and had patched it. So while I was still looking for the specific configuration issue, I figured I'd reach out to SolusVM at this point, as they would have more tools and flexibility (source code) to implement this more efficiently, and perhaps even out of kindness help us figure out this issue, which at this point I was fairly certain was more related to libvirt/KVM than to SolusVM.

    Oh, right, so I still haven't really stated the problem: each VM was using an abnormally high amount of processing power at idle. So instead of idling at X they were idling at, let's say, 3-4X. Multiply that by a bunch of VMs and it becomes difficult to spot, but it adds up, especially with Debian 9 becoming much more frequently used by this point. So these bad nodes just ended up being unlucky and having a lot of stretch VMs. We also believe Ubuntu 18 has the same problem, and potentially others, but not that many people use it yet so it hasn't been thoroughly verified.

    We provided a lot of information to SolusVM, mainly letting them know we were leaning toward the KVM configs. At the time I thought it might be related to a device, more specifically the disk device and how the Debian template was built in relation to it and its configuration, with some vague connections to the other material I had discussed. I wasn't right on the money, but luckily SolusVM replied in about a week, and they had located the specific device causing the issue in the configuration. (Well, they didn't exactly come up with this solution from a bunch of testing, so they technically didn't come up with it, but they had found the specific problem with this specific operating system documented online.)

    So they put in a request to patch this, with no ETA, letting us know it would be at least a month until they could give us any news. We decided that for these couple of nodes we needed to move forward with it to some degree as soon as possible. This is also where the questionable workaround was provided to us.

    Here's where it became questionable: libvirt has some default configuration, and then SolusVM modifies it in its own way. (This is the step where OpenNebula would come in, @smarthead.) Initially I was just concerned that the patch would not function this way, because we didn't know what would be re-inserted and what wouldn't, and we wouldn't know specifically what to try to detach. Instead, we used a copy of the live configuration in our testing, and it worked out fine. I did test it with the method SolusVM provided, and it seemed like it wasn't doing something correctly. I spoke with our team and voiced this concern, and also vaguely mentioned that we wouldn't know what else could go wrong, since there's a lot of other configuration.

    Well, what went wrong is that SolusVM doesn't re-insert the VNC authentication settings if you use a custom configuration.

    We're working on further locking off VNC so there's something to fall back to should this ever become a problem again. We've also asked SolusVM to consider potentially doing a few different fixes that we could not do ourselves as efficiently without access to customizing the code. I think a combination of these things should be pretty solid, even outside of this specific scenario.
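    As one cheap safety net (a sketch of the idea, not necessarily what we will deploy), a node can be audited for domains whose VNC graphics element has lost its password; note that virsh hides the passwd attribute unless you ask for --security-info:

      #!/bin/sh
      # warn about any running domain that has a VNC listener but no passwd attribute
      for dom in $(virsh list --name); do
          xml=$(virsh dumpxml --security-info "$dom")
          echo "$xml" | grep -q "type='vnc'" || continue
          echo "$xml" | grep "type='vnc'" | grep -q "passwd=" || echo "WARNING: $dom has VNC with no password"
      done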

    They did get back to us and wanted us to proceed in a way that's more similar to how we did our initial testing where we confirmed it functioned and everything was still intact, and then followed up with an actual patch. We'll definitely thoroughly test this on our dev environment and try to break it for a few weeks before we push it out.
