kernel:NMI watchdog: BUG: Soft lockup- CPU#3 stuck for 22s! [mysqld:4001920]

desfire Member
edited March 2019 in Help

Hi there,

For the past 3 weeks my server has been becoming unresponsive, to the point that I had to reboot it to get it working again. Today I was finally able to diagnose the issue:

message from syslogd@servername at [date]
kernel:NMI watchdog: BUG: Soft lockup- CPU#3 stuck for 22s! [mysqld:4001920]

I assume this is due to not enough CPU to handle MySQL, but that is strange: the server is usually only using 25% CPU. What else could it be?

Comments

  • Bochi Member

    As far as I know, a soft lockup means the lock occurs at the kernel level, so it might be related to an I/O issue or high wait time.
    Are you running on a dedicated or a virtual server? Maybe it could be related to some kind of overcommitment as well, where there are simply no more resources available.

  • @desfire said:
    server is usually only using 25% CPU, what else could it be?

    Lemme guess, quadcore cpu? 25% means 1 full core then.
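
To check the point above (25% overall on four cores can mean one core is fully busy), here is a minimal Python sketch that compares two snapshots of `/proc/stat` and reports a busy percentage per core. The function names are made up for illustration, and counting `iowait` as idle time is an assumption, not the only reasonable choice.

```python
def parse_cpu_times(stat_text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for each per-core 'cpuN' line."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-core lines start with 'cpu0', 'cpu1', ...; the aggregate line is plain 'cpu'.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        # Field order: user nice system idle iowait irq softirq steal guest guest_nice
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        total = sum(values[:8])  # guest time is already included in user/nice
        times[fields[0]] = (total - idle, total)
    return times


def per_core_busy_percent(before_text, after_text):
    """Compare two /proc/stat snapshots and return {cpu_name: busy percentage}."""
    before = parse_cpu_times(before_text)
    after = parse_cpu_times(after_text)
    result = {}
    for cpu, (busy_after, total_after) in after.items():
        busy_before, total_before = before[cpu]
        delta_total = total_after - total_before
        if delta_total:
            result[cpu] = 100.0 * (busy_after - busy_before) / delta_total
        else:
            result[cpu] = 0.0
    return result
```

To use it on a live box, read `/proc/stat` once, wait a second or so, read it again, and pass both strings to `per_core_busy_percent`. If one core sits near 100% while the others are near 0%, the 25% average is hiding a saturated core.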

  • @Bochi said:
    As far as I know, a soft lockup means the lock occurs at the kernel level, so it might be related to an I/O issue or high wait time.
    Are you running on a dedicated or a virtual server? Maybe it could be related to some kind of overcommitment as well, where there are simply no more resources available.

    It is a dedicated server

  • What's your kernel and distribution? I'll throw the danger money on a 2.6 based CentOS 6.

  • @Letzien said:
    What's your kernel and distribution? I'll throw the danger money on a 2.6 based CentOS 6.

    CloudLinux 7.6
    3.10.0-962.3.2.lve1.5.24.9.el7.x86_64

  • Well damn, there goes that theory, even if it is CentOS (kind-of). You may have a hardware issue, but I'd check your current RAM and HDD before going any further- if you're not perpetually running swap, then I'd start looking into hardware.

  • @Letzien said:
    Well damn, there goes that theory, even if it is CentOS (kind-of). You may have a hardware issue, but I'd check your current RAM and HDD before going any further- if you're not perpetually running swap, then I'd start looking into hardware.

    Only using around 20-30% RAM and 0% swap, and using 45% of disk space. What's the best way of checking HDD health?

  • willie Member
    edited March 2019

    NMI is likely to be a hardware problem. Open a ticket with the provider.

  • This is a good starting point for understanding CPU soft lockups: http://www.inetservicescloud.com/knowledgebase/what-is-a-cpu-soft-lockup/

  • @desfire said:

    Only using around 20-30% RAM and 0% swap, and using 45% of disk space. What's the best way of checking HDD health?

    If SATA, install smartmontools (sometimes packaged as smartmon-tools), and run smartctl -a /dev/sdX (sda, sdb, sdc, etc.)

    There's a darn good chance it's either the CPU or the motherboard, though. I'd suggest syncing everything and running stress to see what you end up with.
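
For checking several drives in one go, a small Python sketch like the one below can pull the verdict out of `smartctl -H` output, which for ATA drives prints a line of the form `SMART overall-health self-assessment test result: PASSED`. The function name `smart_health` is hypothetical, and the sketch assumes that line format; other drive types may word it differently.

```python
import re


def smart_health(output):
    """Return the overall-health verdict (e.g. 'PASSED') from `smartctl -H` text, or None."""
    match = re.search(r"overall-health self-assessment test result:\s*(\S+)", output)
    return match.group(1) if match else None
```

The text can come from something like `subprocess.run(["smartctl", "-H", "/dev/sda"], capture_output=True, text=True).stdout`, run as root. Keep in mind that the overall verdict only fails once an attribute crosses its threshold, so a `PASSED` result is a weak signal; the full attribute table from `smartctl -a` is worth reading as well.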

  • desfire Member
    edited March 2019

    @Letzien said:

    If SATA, install smartmontools (sometimes packaged as smartmon-tools), and run smartctl -a /dev/sdX (sda, sdb, sdc, etc.)

    There's a darn good chance it's either the CPU or the motherboard, though. I'd suggest syncing everything and running stress to see what you end up with.


    [root@ser ~]# smartctl -H /dev/sdb
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-962.3.2.lve1.5.24.9.el7.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    [root@ser ~]# smartctl -H /dev/sda
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-962.3.2.lve1.5.24.9.el7.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    I have also checked whether the RAID 1 is working properly, and it seems to be fine as well:

    [root@ser ~]# cat /proc/mdstat
    Personalities : [raid1]
    md2 : active raid1 sdb3[1] sda3[0]
    3889340224 blocks super 1.2 [2/2] [UU]
    bitmap: 15/29 pages [60KB], 65536KB chunk

    md0 : active raid1 sdb1[1] sda1[0]
    16760832 blocks super 1.2 [2/2] [UU]

    md1 : active raid1 sdb2[1] sda2[0]
    767424 blocks super 1.2 [2/2] [UU]

    unused devices: <none>

    All tests passed, I guess I will need to get in touch with @Hetzner_OL? Hardware issue?
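
The `[UU]` markers in the `/proc/mdstat` output above mean both mirror members are up; an underscore in that position (for example `[U_]`) marks a missing member. As a rough sketch, assuming the same layout as the output pasted above, the check can be automated in Python. The function name `degraded_arrays` is made up for illustration.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array_name: status} for md arrays whose status string shows a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"\s*(md\d+)\s*:", line)
        if header:
            # Start of a new array entry, e.g. 'md2 : active raid1 sdb3[1] sda3[0]'
            current = header.group(1)
            continue
        # The status string looks like [UU]; '_' marks a failed or missing member.
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(1):
                degraded[current] = status.group(1)
            current = None
    return degraded
```

Feeding it the contents of `/proc/mdstat` from a healthy box returns an empty dict. This only reports what the md layer sees, so it says nothing about CPU, RAM or motherboard faults.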

  • Shut down and back up your DB, because they'll probably nuke your box.

    I'd install 'stress' and let it run for a while, and turn it over to them if it hangs up again.

  • desfire said: All tests passed, I guess I will need to get in touch with @Hetzner_OL?

    No, she's in a different part of the company than the part that deals with stuff like this. Open a ticket.

  • @willie said:

    desfire said: All tests passed, I guess I will need to get in touch with @Hetzner_OL?

    No, she's in a different part of the company than the part that deals with stuff like this. Open a ticket.

    Yeah I meant opening a ticket ;) thanks

  • @Letzien said:
    Shut down and back up your DB, because they'll probably nuke your box.

    I'd install 'stress' and let it run for a while, and turn it over to them if it hangs up again.

    I do daily backups thanks, will also run stress, thanks for the help

  • hardgamers Member
    edited March 2019

    Can you share the log lines that come after:

    message from syslogd@servername at [date]
    kernel:NMI watchdog: BUG: Soft lockup- CPU#3 stuck for 22s! [mysqld:4001920]

    Maybe on pastebin or similar; just remove any sensitive information.

    Sometimes the cause of the lockup is in the log right after the NMI watchdog warning.
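
To collect those lines without scrolling through the whole file, a minimal Python sketch could look like this. The function name `lines_after_lockup` and the default of 20 following lines are arbitrary choices for illustration; the match is case-insensitive because the capitalisation of the message varies between the console broadcast and the log file.

```python
def lines_after_lockup(log_lines, context=20):
    """Return a list of (warning_line, following_lines) for each soft-lockup warning."""
    lines = list(log_lines)
    reports = []
    for index, line in enumerate(lines):
        if "soft lockup" in line.lower():
            # Keep up to `context` lines right after the warning (call trace, modules, etc.).
            reports.append((line, lines[index + 1 : index + 1 + context]))
    return reports
```

On a CentOS-family system the kernel messages usually end up in `/var/log/messages` (an assumption here; the path depends on the syslog setup), so something like `lines_after_lockup(open("/var/log/messages", errors="replace"))` would return the blocks to review and paste.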

  • @hardgamers said:
    Can you share the log lines that come after:

    message from syslogd@servername at [date]
    kernel:NMI watchdog: BUG: Soft lockup- CPU#3 stuck for 22s! [mysqld:4001920]

    Maybe on pastebin or similar; just remove any sensitive information.

    Sometimes the cause of the lockup is in the log right after the NMI watchdog warning.

    As I rebooted, there is not much information. However, in the March 11 log there is something I do not understand from just before the downtime: https://pastebin.com/mDtrwW0B
