Kernel Errors - Help, please!

AmitzAmitz Member
edited July 2012 in Help

Dear all,

I have logwatch installed on my server and today, the daily summary contained the following lines:

 WARNING:  Kernel Errors Present
    grsec: From 124.189.2.60: Invalid alignment/Bus error occurred at b76e ...:  1 Time(s)
    grsec: From 178.14.49.233: Invalid alignment/Bus error occurred at b4e8 ...:  1 Time(s)
    grsec: From 207.46.13.114: Invalid alignment/Bus error occurred at b4e0 ...:  2 Time(s)
    grsec: From 66.249.72.212: Invalid alignment/Bus error occurred at b4e0 ...:  1 Time(s)
    grsec: From 68.15.45.29: Invalid alignment/Bus error occurred at b4e8 ...:  1 Time(s)
    grsec: From 70.196.201.18: Invalid alignment/Bus error occurred at b76e ...:  1 Time(s)
    grsec: From 77.184.174.4: Invalid alignment/Bus error occurred at b76e ...:  1 Time(s)
    grsec: From 77.9.249.96: Invalid alignment/Bus error occurred at b76e ...:  1 Time(s)
    grsec: From 79.242.123.175: Invalid alignment/Bus error occurred at b4e8 ...:  1 Time(s)
    grsec: From 81.107.2.171: Invalid alignment/Bus error occurred at b4e8 ...:  1 Time(s)
    grsec: From 98.210.80.60: Invalid alignment/Bus error occurred at b4e8 ...:  1 Time(s)

Should that worry me? I do not know exactly what the impact of that message may be, and Google did not make me much smarter... Thank you in advance for your advice!

Kind regards
-A

Comments

  • MaouniqueMaounique Host Rep, Veteran
    edited July 2012

    Looks like memory errors. Try taking out some sticks, or replacing them until the errors go away. It could also be a bad motherboard, memory controller, or something like that, but I strongly suspect a hardware problem.
    M

  • AmitzAmitz Member

    Unfortunately, I do not have direct access to the server. I will have to inform the DC then... :(

  • MaouniqueMaounique Host Rep, Veteran
    edited July 2012

    From the looks of it, the errors are all in the same range, so it is probably a bad memory stick. There are two areas of error, at b4 and b7, but that is not a sure indication. Can you run a memory test?
    Then you will be sure.
    Something like this: http://linux.m2osw.com/memory-test-on-live-system
    M
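To see how the addresses in the logwatch summary cluster, the lines can be tallied per address. This is only a sketch: `tally_grsec_addresses` is a made-up helper name, it expects lines in exactly the logwatch format quoted in the first post, and it only sees the truncated address prefix that logwatch prints.

```shell
#!/bin/sh
# Hypothetical helper: read logwatch-style grsec lines on stdin and print
# "<address prefix> <total occurrences>", one per line, sorted by address.
tally_grsec_addresses() {
    awk '/Invalid alignment\/Bus error occurred at/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "at")      { addr = $(i + 1) }   # word after "at" is the address prefix
            if ($i == "Time(s)") { n = $(i - 1) }      # word before "Time(s)" is the count
        }
        count[addr] += n
    }
    END { for (a in count) print a, count[a] }' | sort
}

# Usage (assumption: the summary was saved to a file of your choosing):
#   tally_grsec_addresses < logwatch-summary.txt
```

For the eleven lines in the first post this gives three distinct prefixes (b4e0, b4e8 and b76e), which matches the "two areas, b4 and b7" observation.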

  • AmitzAmitz Member

    Thank you, Maounique!
    What program would you suggest for memory testing? The server hosts a website that is live. Is there anything that can do the test while the server is running in normal mode?

  • MaouniqueMaounique Host Rep, Veteran

    Yeah, sorry, I just thought of this later and edited my previous post.
    M

  • AmitzAmitz Member

    Again, thank you!
    I will switch the website over to a spare system in case the test renders the server unusable for some time, and will then see what happens...

  • MaouniqueMaounique Host Rep, Veteran

    The test shouldn't render the server unusable. However, if the memory is bad and data the OS needs in order to function gets moved through the bad areas, the kernel might hang.
    If you only have one site, moving it is the best thing to do, though.
    M

  • AmitzAmitz Member

    My server has 4 GB of RAM. Do I understand the link right that this would be the command to use?

    dd if=/dev/urandom bs=1024 of=/tmp/memtest count=4294967296
    md5sum /tmp/memtest; md5sum /tmp/memtest; md5sum /tmp/memtest
    
  • MaouniqueMaounique Host Rep, Veteran
    edited July 2012

    I think the original command was right. The larger the file, the more likely it is to catch the error quickly. Only if you have very little spare memory should you run a large count on a small file at a time.
    So set bs to something like 100 MB and do it 40 times if you have a lot of unused memory (likely if you offload the site and the system is idle), or do 10 MB 400 times if your memory is at the limit. I would go with 100 MB.
    M
    P.S. count=4294967296 should be about 1000 times lower: you already take a chunk of 1 K, so the iteration count should be about 4 million, not 4 billion.
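The arithmetic behind the P.S. can be checked in the shell: dd writes bs × count bytes, so count is the target size divided by the block size. A minimal sketch, where `dd_count` is just an illustrative helper name:

```shell
#!/bin/sh
# Hypothetical helper: number of whole blocks of block_bytes that fit in total_bytes.
# Both arguments are in bytes; the result is rounded down.
dd_count() {
    total_bytes=$1
    block_bytes=$2
    echo $(( total_bytes / block_bytes ))
}

# 4 GiB written in 1 KiB blocks: about 4 million, not 4 billion
dd_count $((4 * 1024 * 1024 * 1024)) 1024

# 4 GiB written in 100 MiB blocks (rounded down)
dd_count $((4 * 1024 * 1024 * 1024)) $((100 * 1024 * 1024))
```

The first call prints 4194304 and the second prints 40.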

  • AmitzAmitz Member
    edited July 2012

    Ah, right! I did not take the chunk size into consideration! So I will use:

    dd if=/dev/urandom bs=104857600 of=/tmp/memtest count=40
    md5sum /tmp/memtest; md5sum /tmp/memtest; md5sum /tmp/memtest
    
  • MaouniqueMaounique Host Rep, Veteran
    edited July 2012

    http://people.redhat.com/dledford/memtest.shtml
    This is much more comprehensive.
    I wonder why there is no standard utility for that. Something in the kernel or even user space. Not everyone would like to reboot and run memtest.
    M

  • MrAndroidMrAndroid Member
    edited July 2012

    You could run stress

  • MaouniqueMaounique Host Rep, Veteran
    edited July 2012

    That might cook it. Testing memory only with bogus random data will probably not crash the kernel, but if the tool does many other things that need memory fed to the kernel, the kernel may die if the memory is bad.
    M

  • AmitzAmitz Member
    edited July 2012

    Okay, I did the first test with:

    dd if=/dev/urandom bs=104857600 of=/tmp/memtest count=40
    md5sum /tmp/memtest; md5sum /tmp/memtest; md5sum /tmp/memtest
    

    and this is the result:

    25083a1361a4c50a44ceaacb2a6d41b6  /tmp/memtest
    25083a1361a4c50a44ceaacb2a6d41b6  /tmp/memtest
    25083a1361a4c50a44ceaacb2a6d41b6  /tmp/memtest
    

    Looks fine to me, what do you think?
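Comparing the three hashes by eye works, but the check can also be scripted. The following is a sketch only: `checksums_agree` is a made-up helper name, and it assumes `md5sum` from GNU coreutils is available, as in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: hash a file several times and report whether every pass agrees.
# $1 = file to hash, $2 = number of passes (defaults to 3).
checksums_agree() {
    file=$1
    passes=${2:-3}
    first=$(md5sum "$file" | cut -d' ' -f1)
    i=1
    while [ "$i" -lt "$passes" ]; do
        next=$(md5sum "$file" | cut -d' ' -f1)
        if [ "$next" != "$first" ]; then
            echo "MISMATCH on pass $((i + 1)): $first vs $next"
            return 1
        fi
        i=$((i + 1))
    done
    echo "OK: $passes identical checksums ($first)"
}

# Usage, after the dd command has created the file:
#   checksums_agree /tmp/memtest 3
```

A non-zero exit status means at least one pass produced a different hash than the first.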

  • MaouniqueMaounique Host Rep, Veteran

    Looks OK, try the other script too. By the looks of it, stress might not be so dangerous either.
    M

  • AmitzAmitz Member

    Your support is very much appreciated!
    I will now fully move the website to another server and wait until DNS propagation is over before I proceed. There will surely be a downtime if I find some hardware error so it seems wise to me to have the website at another place then already.

    The server is at OVH (duck and cover) by the way, so I will use their rescue boot mode, which offers all kinds of tests to check the server's health.

  • Have you tried recompiling the kernel with the latest grsec, or switching to a stock kernel and trying to reproduce the issue? Which grsec version is your kernel compiled with?

  • AmitzAmitz Member

    Honestly, I did not mess with the kernel at all, and to be even more honest: I would not even dare to try to do anything with it. I would consider myself some kind of advanced amateur when it comes to Linux, but some things (like kernel compilation) still scare the hell out of me.

    But another thing: I did not want to wait and already started the rescue mode and the hardware tests. Looking forward to the results...

  • AmitzAmitz Member

    Mmmmmmh.... The RAM test passed without errors:
    pastebin.com/nX3ZGLNR

    Same with hard disk. Will now stress the CPU a bit.

  • AmitzAmitz Member

    Well, stressing the CPU now for quite some time (and still ongoing). No problems yet.

    Is there any other reason that you could imagine for those kernel errors? I will investigate the exact kernel version as soon as I have finished the stress tests and have access to the server again...

  • prometeusprometeus Member, Host Rep

    Heat. Sometimes a spike in temperature causes nasty, unrepeatable errors...

  • AmitzAmitz Member

    Heat. Could be a problem with OVH. I do not know much about their data center design...
    I have looked through my logs and the problem seems to be there since the first logwatch email that I have received. Obviously, I have just ignored that until now.

    I wonder whether I should keep that OVH server. But it is incredibly cheap.
    Intel Q6600 4x 2.40 GHz, 4 GB RAM and 1TB HD, 10 TB Traffic, for lousy EUR 26.99 per month.

    Runs like a charm (except for the kernel errors), by the way. And I cannot even complain about their network or the German support team. I receive answers to tickets within 4 hours and they have been friendly so far. But I cannot trust this server with those error messages. And they will not do anything about the hardware as long as the tests that I have run do not report any problem. What would you do in my shoes?

  • @Amitz said: What would you do in my shoes?

    As long as the server runs fine, ignore the error messages. They aren't hurting anyone, are they?

  • AmitzAmitz Member

    @gsrdgrdghd said: As long as the server runs fine, ignore the error messages. They aren't hurting anyone, are they?

    This is exactly the part that I am unable to judge.

  • PhilNDPhilND Member

    @Amitz OVH servers are water-cooled. But by nature (since it was the first quad core) the Q6600 runs hot.

  • @PhilND Is that a joke? ;)

  • PhilNDPhilND Member

    Which part?

  • MaouniqueMaounique Host Rep, Veteran

    Try upgrading the kernel and put some monitoring in place for the temperature of the CPU, motherboard and hard drive.
    It might be some bug. Since you have already offloaded the site, reinstall everything cleanly and upgrade to the latest stable versions of the software you are using. Before putting it into production, make sure everything runs OK under some stress.
    If it does, you are OK to go.
    At times, errors are just impossible to track down, or even misleading. To be on the safe side, do what is under your control (software upgrades and such), since the hardware is out of reach or expensive to check.
    If you still get errors, then it must be some hardware problem. If you don't, then even if it is a hardware problem, it could run like this indefinitely. At worst, keep your backups current.
    Good luck, it is indeed a good deal you have there :P
    M
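As a rough idea of what temperature monitoring could look like: on many Linux systems the kernel exposes thermal sensors under /sys/class/thermal, with values in thousandths of a degree Celsius. Whether that directory exists depends on the kernel and hardware, so treat this as a sketch under that assumption; both helper names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert millidegrees Celsius to whole degrees (rounded down).
millideg_to_celsius() {
    echo $(( $1 / 1000 ))
}

# Hypothetical helper: print every readable thermal zone with its temperature.
# Assumes the sysfs thermal interface is present; prints nothing if it is not.
print_temps() {
    for zone in /sys/class/thermal/thermal_zone*/temp; do
        [ -r "$zone" ] || continue
        echo "$zone: $(millideg_to_celsius "$(cat "$zone")") C"
    done
}

print_temps
```

Run from cron and appended to a log, this would at least show whether the errors coincide with temperature spikes.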

  • AmitzAmitz Member

    Yes, I will try to update the kernel tomorrow. My first kernel update ever. TENSION IN THE AIR!
    I hope that I find a good tutorial on how to do this on CentOS...

    However, yes, the OVH (Kimsufi) deal for that server is indeed nice. They still have some left at this price in case somebody is interested:
    http://www.isgenug.de/hot_deals/index.xml

  • AmitzAmitz Member

    Okay, there was no way to upgrade the kernel, as it was obviously already the latest version available. However, I did a complete reinstall, now with CentOS 6.2 instead of 5.8, and the errors are gone...