
cPanel-based server: high load every Sunday for 5 hours

Hello.
I have a dedicated server from OVH with an Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz and 16GB RAM. I installed Virtualizor on it and created 2 VPS: cPanel/WHM and CloudLinux on VPS1, and nginx on the other. I use VPS2 only for storing downloadable files. VPS1 (cPanel) hosts some websites with low traffic and has 1 domain account with a large number of email accounts; they use almost 55GB of space for email alone. My problem is with VPS1, the cPanel/CloudLinux one. VPS2 uses almost 300GB of disk and VPS1 uses almost 60-70GB.

The problem is that every Sunday, from 12pm to 5pm (GMT+6), for exactly 5 hours, the server load goes high. VPS2 is totally normal, and the node is totally normal both at that time and otherwise. I checked and monitored it, and cPanel's L4 technician team and then the L3 technician team also monitored the VPS, and nobody can find the exact reason why the load goes high for exactly 5 hours every Sunday. There was no heavy CPU process at the time, but the I/O wait rate was high. No cron job, no backup, and no update are scheduled at that time. I use "sar" to check server load over time, and tech-SysSnapv2 for detailed logs.
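
For reference, these are roughly the sar invocations that can isolate a window like 12pm-5pm (a sketch; it assumes the stock sysstat setup on CentOS 7 with history files under /var/log/sa/):

    sar -q -s 12:00:00 -e 17:00:00                        # run queue and load averages for today
    sar -u -s 12:00:00 -e 17:00:00                        # CPU usage, including %iowait
    sar -d -s 12:00:00 -e 17:00:00                        # per-device I/O, to spot the busy disk
    sar -q -f /var/log/sa/sa16 -s 12:00:00 -e 17:00:00    # same, but for a past day's file (e.g. the 16th)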

Now the cPanel Level 3 and Level 4 teams have given up, and they said the high load might have a cause other than cPanel. But I think it is because of cPanel, because VPS2 is on the same node and the node is also normal at that time. Both VPS run CentOS 7 and are up to date.

Here are the log files from week 1 and week 2: log 1, log 2. Some log files were replaced because I was late downloading them.

If anyone here has any idea what else can cause this at a specific time, for exactly 5 hours, every Sunday, please tell me. Or if you are willing to check and solve the problem, tell me as well; I will give some courtesy money.

Thanked by 1 Chuck

Comments

  • Do they use ColoCrossing?

  • AnthonySmith Member, Patron Provider

    Best guess: you have software RAID, and the mdadm RAID resync is generating excessive IOPS because it has been left at its defaults.
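
    A quick sketch of how to check that on a stock CentOS 7 box (paths assume the default mdadm packaging):

        cat /proc/mdstat                                           # shows any running check/resync and its speed
        cat /etc/cron.d/raid-check                                 # CentOS ships a weekly RAID check cron here
        sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max   # current resync speed limits (KB/s)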

  • @Chuck said:
    Do they use ColoCrossing?

    Who? The node is my dedicated server from OVH.

  • @AnthonySmith said:
    Best guess: you have software RAID, and the mdadm RAID resync is generating excessive IOPS because it has been left at its defaults.

    Very interesting. The software RAID is on the main dedicated server. I virtualized the dedicated server to create the 2 VPS; one VPS never goes high while the other one does. Is it possible that, due to the software RAID on the dedicated server, one VPS can get a high load spike?

  • AnthonySmith Member, Patron Provider

    bdspice said: Is it possible that, due to the software RAID on the dedicated server, one VPS can get a high load spike?

    Yes, absolutely. I assume they both run different stacks, so the lack of IOPS creating I/O wait could significantly impact one and not the other.

    Given the timings, it seems almost certain to be the case.

    Try turning off bitmap caching and reducing the max speed to 5000 on your mdadm array and see if it still happens; it may simply be time to upgrade your disks to a faster array.

    Thanked by 2 bdspice, vimalware
  • @AnthonySmith said:

    bdspice said: Is it possible that, due to the software RAID on the dedicated server, one VPS can get a high load spike?

    Yes, absolutely. I assume they both run different stacks, so the lack of IOPS creating I/O wait could significantly impact one and not the other.

    Given the timings, it seems almost certain to be the case.

    Try turning off bitmap caching and reducing the max speed to 5000 on your mdadm array and see if it still happens; it may simply be time to upgrade your disks to a faster array.

    Can you help me do this, or give me a tutorial or some pointers on how to do it? I am a bit new to this, which is why I left it all at the defaults.

  • If you have IPv6 enabled, I would ask you to run
    ip route show cache table all | grep -c cache
    now, while you are not affected, and then again on Sunday when you are.

    Thanked by 1 bdspice
  • AnthonySmith Member, Patron Provider

    bdspice said: Can you help me do this, or give me a tutorial or some pointers on how to do it? I am a bit new to this, which is why I left it all at the defaults.

    Yeah, no problem. Just PM me the output of: cat /proc/mdstat

    I will try and advise you from there.

    Thanked by 1 bdspice
  • @AnthonySmith said:

    Try turning off bitmap caching and reducing the max speed to 5000 on your mdadm array

    Here is how I reduced the max speed from 200000 to 5000; the command is:
    sysctl -w dev.raid.speed_limit_max=5000

    Here is the command I used to turn off the bitmap; the server has two arrays, md2 and md3, and this one is for md2:
    mdadm --grow --bitmap=none /dev/md2

    Here is the output of cat /proc/mdstat:

        Personalities : [raid1]
        md3 : active raid1 sda3[0] sdb3[1]
              1744516032 blocks [2/2] [UU]
              bitmap: 5/13 pages [20KB], 65536KB chunk
    
        md2 : active raid1 sda2[0] sdb2[1]
              204798912 blocks [2/2] [UU]
    
        unused devices: <none>
    

    Am I doing it right? Here is the status of the partitions from the OVH panel:

    Name of partition   Mount point   Used space   Inodes used
    md2                 /             4%           1%
    md3                 /vpsprt       62%          1%
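
    A minimal way to double-check both changes (note that in the /proc/mdstat output above md3 still shows a bitmap, so the same --grow command would be needed for md3 as well):

        cat /proc/mdstat | grep -i bitmap    # any output here means an array still has its bitmap
        sysctl dev.raid.speed_limit_max      # should now report 5000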
    
  • WHM/cPanel backups, maybe?

    Thanked by 1 bdspice
  • Here is the latest update. I changed the server time to my time zone, then changed the cron job in /etc/cron.d/raid-check from Sunday to Friday, because a high server load on Friday is not a problem for my office since it is our day off. I disabled the bitmap for both arrays with:
    "mdadm --grow --bitmap=none /dev/md2"
    "mdadm --grow --bitmap=none /dev/md3"

    I also reduced the max speed from 200000 to 5000, as advised, with:
    sysctl -w dev.raid.speed_limit_max=5000
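
    One thing to keep in mind: sysctl -w only changes the running kernel, so that value is lost on reboot. A small drop-in file (the filename here is arbitrary) keeps it persistent on CentOS 7:

        echo 'dev.raid.speed_limit_max = 5000' > /etc/sysctl.d/90-raid-resync.conf   # any *.conf here is read at boot
        sysctl --system                                                              # reload all sysctl drop-ins now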

    Now I will move the DNS of the domain that uses the heavy email storage to Cloudflare and convert the email format for that cPanel account from maildir to mdbox.

    I hope this will get the problem solved, in sha Allah.

    Thanked by 1 alilet
  • @FrankZ said:
    If you have IPv6 enabled, I would ask you to run
    ip route show cache table all | grep -c cache
    now, while you are not affected, and then again on Sunday when you are.

    The output is 0. What does that mean?

  • AnthonySmith Member, Patron Provider

    bdspice said: I hope this will get the problem solved, in sha Allah.


    Seems you have done enough to confirm whether it's the RAID sync, or at least rule it out. Keep us updated.

    Thanked by 1 bdspice
  • FrankZ Veteran
    edited December 2018

    bdspice said: The output is 0. What does that mean?

    Neighbor discovery cache. It should be 0 or a low number (< 50). If you have 500 to 1500 come Sunday, let me know. If it stays low, then this is not your problem. @AnthonySmith is probably correct, but since this is a once-a-week thing, I figured I would give you something else to check.
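
    If the count does climb on Sunday, a couple of related things to look at (a sketch; the sysctl name is the standard IPv6 neighbour-table threshold):

        ip -6 neigh show | wc -l                    # rough count of IPv6 neighbour entries
        sysctl net.ipv6.neigh.default.gc_thresh3    # table size limit that triggers garbage collection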

    Thanked by 1 bdspice
  • @FrankZ @AnthonySmith Thanks. Since I changed the max speed and moved the resync time to Friday, the resync is running right now, but I see no high load anywhere on my server. So now it's time to watch what happens on Sunday. I will update you all. Thanks again.

    Thanked by 1 FrankZ
  • Hello, here is the latest update. The VPS is now normal, but the main dedicated server is doing something and its load average is 1, which is also normal. Here is some output I got just now. Please check it and tell me whether there is anything to worry about and what is going on. Is it syncing now or not? If the sync is running, how much time will it take at my drive speed?

    [root@dedicated cron.d]# cat /proc/mdstat
    Personalities : [raid1]
    md3 : active raid1 sda3[0] sdb3[1]
          1744516032 blocks [2/2] [UU]
          [>....................]  check =  4.3% (76682624/1744516032) finish=5331.7min speed=5213K/sec
    
    md2 : active raid1 sda2[0] sdb2[1]
          204798912 blocks [2/2] [UU]
            resync=DELAYED
    
    unused devices: <none>
    [root@dedicated cron.d]# date
    Fri Dec 14 16:11:38 +06 2018
    [root@dedicated cron.d]# mdadm -D /dev/md2
    /dev/md2:
               Version : 0.90
         Creation Time : Tue Jan  9 22:58:49 2018
            Raid Level : raid1
            Array Size : 204798912 (195.31 GiB 209.71 GB)
         Used Dev Size : 204798912 (195.31 GiB 209.71 GB)
          Raid Devices : 2
         Total Devices : 2
       Preferred Minor : 2
           Persistence : Superblock is persistent
    
           Update Time : Fri Dec 14 07:31:41 2018
                 State : active, resyncing (DELAYED)
        Active Devices : 2
       Working Devices : 2
        Failed Devices : 0
         Spare Devices : 0
    
    Consistency Policy : resync
    
                  UUID : 80c79728:4f7a0921:a4d2adc2:26fd5302
                Events : 0.12983
    
        Number   Major   Minor   RaidDevice State
           0       8        2        0      active sync   /dev/sda2
           1       8       18        1      active sync   /dev/sdb2
    [root@dedicated cron.d]# mdadm -D /dev/md3
    /dev/md3:
               Version : 0.90
         Creation Time : Tue Jan  9 22:58:49 2018
            Raid Level : raid1
            Array Size : 1744516032 (1663.70 GiB 1786.38 GB)
         Used Dev Size : 1744516032 (1663.70 GiB 1786.38 GB)
          Raid Devices : 2
         Total Devices : 2
       Preferred Minor : 3
           Persistence : Superblock is persistent
    
           Update Time : Fri Dec 14 16:11:51 2018
                 State : active, checking
        Active Devices : 2
       Working Devices : 2
        Failed Devices : 0
         Spare Devices : 0
    
    Consistency Policy : resync
    
          Check Status : 4% complete
    
                  UUID : ee258205:a4ce61ec:a4d2adc2:26fd5302
                Events : 0.148380
    
        Number   Major   Minor   RaidDevice State
           0       8        3        0      active sync   /dev/sda3
           1       8       19        1      active sync   /dev/sdb3
    [root@dedicated cron.d]# sar -q
    Linux 3.10.0-693.11.6.el7.x86_64 (dedicated)    12/14/2018      _x86_64_        (8 CPU)
    
    02:00:04 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
    02:10:04 AM        11       267      0.13      0.14      0.31         0
    02:20:04 AM         2       249      0.13      0.13      0.22         0
    02:30:04 AM         2       250      0.15      0.22      0.22         0
    02:40:04 AM         3       253      0.12      0.10      0.16         0
    02:50:03 AM         2       251      0.10      0.07      0.11         0
    03:00:04 AM         3       251      0.17      0.22      0.19         2
    03:10:03 AM         2       261      0.15      0.18      0.17         0
    03:20:03 AM         2       254      0.11      0.14      0.16         0
    03:30:03 AM         2       261      0.58      0.53      0.32         0
    03:40:04 AM         3       278      0.02      0.23      0.28         2
    03:50:03 AM         2       250      0.11      0.16      0.23         0
    04:00:05 AM         3       262      0.06      0.20      0.27         0
    04:10:03 AM         2       253      0.04      0.07      0.16         0
    04:20:03 AM         2       252      0.29      0.24      0.19         0
    04:30:03 AM         2       254      0.05      0.08      0.13         0
    04:40:04 AM         3       256      0.15      0.08      0.10         0
    04:50:03 AM         2       251      0.03      0.11      0.12         0
    05:00:04 AM         3       252      0.00      0.04      0.08         0
    05:10:04 AM         3       249      0.20      0.09      0.07         0
    05:20:04 AM         2       293      0.06      0.12      0.12         1
    05:30:04 AM         2       249      0.00      0.18      0.26         2
    05:40:04 AM         3       259      0.87      0.32      0.25         0
    05:50:03 AM         2       247      0.01      0.06      0.13         0
    06:00:04 AM         3       250      0.08      0.18      0.21         0
    06:10:04 AM         2       250      0.19      0.19      0.18         0
    06:20:03 AM         3       249      0.29      0.26      0.19         0
    06:30:04 AM         2       248      0.06      0.07      0.12         0
    06:40:04 AM         3       253      0.17      0.19      0.16         0
    06:50:03 AM         2       253      0.01      0.07      0.12         0
    07:00:04 AM         3       258      0.00      0.02      0.08         1
    07:10:03 AM         2       248      0.40      0.20      0.15         0
    07:20:04 AM         3       258      0.00      0.06      0.11         0
    07:30:03 AM         2       250      0.36      0.16      0.15         0
    07:40:04 AM         3       250      0.23      0.25      0.23         0
    07:50:03 AM         2       252      0.28      0.19      0.20         1
    08:00:04 AM         3       249      0.09      0.11      0.17         2
    08:10:03 AM         2       250      0.14      0.23      0.23         1
    08:20:03 AM         2       251      0.11      0.13      0.19         0
    08:30:03 AM         2       253      0.82      0.31      0.20         0
    08:40:04 AM         3       251      0.13      0.17      0.18         0
    08:50:03 AM         2       246      0.05      0.07      0.12         2
    09:00:04 AM         3       250      0.09      0.19      0.15         0
    
    09:00:04 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
    09:10:03 AM         2       248      0.19      0.17      0.15         2
    09:20:04 AM         2       250      0.19      0.15      0.14         0
    09:30:03 AM         2       250      0.12      0.18      0.17         1
    09:40:04 AM         3       253      0.11      0.11      0.14         0
    09:50:03 AM         2       248      0.19      0.11      0.13         0
    10:00:04 AM         4       251      0.26      0.19      0.16         3
    10:10:03 AM         2       263      0.13      0.12      0.14         0
    10:20:04 AM         2       247      0.19      0.13      0.14         1
    10:30:03 AM         2       250      0.02      0.11      0.13         0
    10:40:04 AM         3       255      0.14      0.13      0.14         0
    10:50:04 AM         2       249      0.87      0.42      0.28         1
    11:00:04 AM         3       252      0.05      0.14      0.21         0
    11:10:03 AM         2       249      0.04      0.08      0.13         0
    11:20:03 AM         2       252      0.09      0.15      0.15         0
    11:30:03 AM         2       255      0.00      0.06      0.11         0
    11:40:04 AM         3       253      0.01      0.11      0.14         0
    11:50:04 AM         2       252      0.10      0.48      0.37         0
    12:00:05 PM         3       265      0.08      0.10      0.21         1
    12:10:04 PM         2       253      1.00      0.94      0.63         1
    12:20:03 PM         2       282      1.17      1.25      0.94         0
    12:30:03 PM         2       295      1.14      1.14      1.05         0
    12:40:04 PM         3       264      1.01      1.06      1.05         0
    12:50:03 PM         2       254      1.11      1.15      1.11         0
    01:00:04 PM         3       261      1.04      1.11      1.13         1
    01:10:03 PM         2       253      1.02      1.12      1.16         0
    01:20:03 PM         2       251      1.28      1.18      1.18         0
    01:30:04 PM         2       256      1.24      1.33      1.23         0
    01:40:04 PM         3       281      2.43      1.52      1.30         0
    01:50:03 PM         2       279      1.13      1.25      1.27         0
    02:00:04 PM         3       264      1.10      1.09      1.16         0
    02:10:03 PM         2       256      1.08      1.09      1.13         0
    02:20:04 PM         2       265      1.23      1.12      1.14         0
    02:30:04 PM         2       256      1.06      1.09      1.12         0
    02:40:04 PM         3       267      2.01      1.58      1.32         0
    02:50:03 PM         4       263      1.17      1.18      1.22         0
    03:00:04 PM         4       259      1.75      1.80      1.48         0
    03:10:03 PM         2       253      1.10      1.34      1.38         0
    03:20:03 PM         2       254      1.15      1.15      1.26         0
    03:30:04 PM         2       254      1.09      1.08      1.18         0
    03:40:04 PM         3       253      1.02      1.14      1.19         1
    03:50:03 PM         3       272      1.50      1.68      1.45         0
    04:00:04 PM         2       258      1.63      1.82      1.61         3
    
    04:00:04 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
    04:10:03 PM         2       254      1.16      1.30      1.45         1
    Average:            3       256      0.49      0.48      0.48         0
    [root@dedicated cron.d]# vi raid-check
    # Run system wide raid-check once a week on Sunday at 1am by default
    0 1 * * 5 root /usr/sbin/raid-check
    
  • jackb Member, Host Rep

    It will take four days at the current speed, so I'd recommend either raising the speed or pausing the check at the end of the window you want it to run in. Search for "echo idle sys block mdadm" to see how to pause the RAID check.
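
    For reference, the sysfs interface that search points to looks roughly like this (md3 taken from the output above):

        cat /sys/block/md3/md/sync_action             # shows "check", "resync" or "idle"
        echo idle  > /sys/block/md3/md/sync_action    # stop the running check
        echo check > /sys/block/md3/md/sync_action    # kick off a check again later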

    Thanked by 1 bdspice
  • AnthonySmith Member, Patron Provider

    If everything is running fine and the sync time is under 7 days (it is), then just leave it; your problem is solved.

    There is just no way to quickly sync slow SATA drives with zero impact; you get to choose one or the other, or you upgrade to SSD :)

  • @jackb said:
    It will take four days at the current speed, so I'd recommend either raising the speed or pausing the check at the end of the window you want it to run in. Search for "echo idle sys block mdadm" to see how to pause the RAID check.

    Can I pause the process now by entering this command?
    echo "idle" > /sys/block/md3/md/sync_action
    Then how much should I raise the speed from the current max of 5000?
    Can I make it 15000? When it was 200000, the average load was around 15 and the check took 5 hours, so I divided 200000 by 15 = around 14000.
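
    Rough arithmetic on the duration, using the md3 size from the /proc/mdstat output above (1744516032 1K blocks) and treating the limit as KB/s:

        echo "scale=1; 1744516032 / 5000  / 3600" | bc    # ~96.9 hours (about 4 days) at 5000 KB/s
        echo "scale=1; 1744516032 / 15000 / 3600" | bc    # ~32.3 hours at 15000 KB/s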

  • @FrankZ @AnthonySmith Thanks. Sunday has passed without any trouble, so now I know the cause was just the mdadm array. Thanks again @AnthonySmith for your suggestion to turn off bitmap caching and reduce the max speed to 5000 on the mdadm array. I have finally solved an 11-month-old problem :smiley:

    Thanked by 1 FrankZ
  • AnthonySmith Member, Patron Provider

    bdspice said: @FrankZ @AnthonySmith Thanks. Sunday has passed without any trouble, so now I know the cause was just the mdadm array. Thanks again @AnthonySmith for your suggestion to turn off bitmap caching and reduce the max speed to 5000 on the mdadm array. I have finally solved an 11-month-old problem

    No worries. If your server has space for another drive, you can store the bitmap cache on a tiny SSD, which will speed up the resync a lot and no longer impact the I/O either.
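
    If I read the mdadm man page correctly, --grow also accepts a file-backed bitmap, so a rough sketch (with a hypothetical SSD mounted at /mnt/ssd; the bitmap file must not live on the array itself) would be:

        mdadm --grow /dev/md3 --bitmap=/mnt/ssd/md3.bitmap    # write-intent bitmap stored on the SSD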

    Thanked by 1 vimalware
  • @AnthonySmith said:

    No worries. If your server has space for another drive, you can store the bitmap cache on a tiny SSD, which will speed up the resync a lot and no longer impact the I/O either.

    How would I know whether my server has space for another drive or not? I just bought it from OVH SoYouStart. And do I need the bitmap cache? Is it that important?

  • AnthonySmith Member, Patron Provider

    bdspice said: How would I know whether my server has space for another drive or not? I just bought it from OVH SoYouStart. And do I need the bitmap cache? Is it that important?

    That is a question for whoever you lease the server from, but SYS is not flexible, so you can probably forget about it.

    You don't need a bitmap cache; however, it greatly improves resync speed because, in simple terms, it is used to keep track of blocks that may be out of sync.

    For a 2-disk RAID 1 I would not worry too much, as the penalty of having one probably does more harm to performance than good; probably not worth spending money on, now that I think about it.

    You just have a far better chance of recovering from a significant disk failure or system crash if you have a bitmap cache, but it's like a permanent double-check that slows things down.

    I am sure that when your server/service grows you will migrate to an SSD-based server anyway, and then none of these problems will be noticeable anymore.
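
    And if you ever want the safety net back without extra hardware, re-adding the internal bitmap is a one-liner per array:

        mdadm --grow /dev/md3 --bitmap=internal    # re-enable the write-intent bitmap on md3
        mdadm --grow /dev/md2 --bitmap=internal    # and on md2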

    Thanked by 2 vimalware, bdspice
  • You can also yolo and run the cron only once a month :) Sure, it checks for inconsistencies between the two drives, but you perform backups anyway, so in the very unlikely case there were inconsistencies that absolutely killed everything (I've yet to see this across 1000+ servers), you could simply restore your backups.
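
    For example, going by the stock /etc/cron.d/raid-check line shown earlier, a once-a-month variant (1st of the month at 1am, day chosen arbitrarily) would look like:

        # /etc/cron.d/raid-check - monthly instead of weekly
        0 1 1 * * root /usr/sbin/raid-check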

    Thanked by 1 bdspice