
cPanel-based server: high load every Sunday for 5 hours

Hello.
I have a dedicated server from OVH with an Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz and 16GB RAM. I installed Virtualizor on it and created 2 VPS: cPanel/WHM and CloudLinux on VPS1, and nginx on the other. I use VPS2 only for storing downloadable files. VPS1 (cPanel) hosts some websites with low traffic and has 1 domain account with a large number of email accounts; they use almost 55GB of space for email alone. My problem is with VPS1, the cPanel/CloudLinux one. VPS2 uses almost 300GB of disk and VPS1 uses almost 60-70GB.

The problem is that every Sunday, from 12pm to 5pm (GMT+6), for exactly 5 hours, the server load goes high. VPS2 is totally normal, and the node is totally normal both at that time and otherwise. I checked and monitored it, and cPanel's L4 technician team and then the L3 technician team also monitored the VPS, and nobody can find the exact reason why the load goes high for exactly 5 hours every Sunday. There was no heavy CPU process at the time, but the I/O wait rate was high. No cron job, no backup, and no update are scheduled at that time. I use "sar" to check server load over time, and tech-SysSnapv2 for detailed logs.
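
For reference, these are roughly the sar invocations that can isolate a window like 12pm-5pm (a sketch; it assumes the stock sysstat setup on CentOS 7 with history files under /var/log/sa/):

    sar -q -s 12:00:00 -e 17:00:00                        # run queue and load averages for today
    sar -u -s 12:00:00 -e 17:00:00                        # CPU usage, including %iowait
    sar -d -s 12:00:00 -e 17:00:00                        # per-device I/O, to spot the busy disk
    sar -q -f /var/log/sa/sa16 -s 12:00:00 -e 17:00:00    # same, but for a past day's file (e.g. the 16th)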

Now the cPanel Level 3 and Level 4 teams have given up, and they said the high load might have a cause other than cPanel. But I think it is because of cPanel, because VPS2 is on the same node and the node is also normal at that time. Both VPS run CentOS 7 and are up to date.

Here are the log files from week 1 and week 2: log 1, log 2. Some log files were replaced because I was late downloading them.

If anyone here has any idea what else can cause this at a specific time, for exactly 5 hours, every Sunday, please tell me. Or if you are willing to check and solve the problem, tell me as well; I will give some courtesy money.

Thanked by 1 Chuck

Comments

  • Do they use ColoCrossing?

  • AnthonySmith Member, Patron Provider

    Best guess: you have software RAID, and the mdadm RAID resync is generating excessive IOPS because it has been left at its defaults.
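
    A quick sketch of how to check that on a stock CentOS 7 box (paths assume the default mdadm packaging):

        cat /proc/mdstat                                           # shows any running check/resync and its speed
        cat /etc/cron.d/raid-check                                 # CentOS ships a weekly RAID check cron here
        sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max   # current resync speed limits (KB/s)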

  • @Chuck said:
    Do they use ColoCrossing?

    Who? The node is my dedicated server from OVH.

  • @AnthonySmith said:
    Best guess: you have software RAID, and the mdadm RAID resync is generating excessive IOPS because it has been left at its defaults.

    Very interesting. The software RAID is on the main dedicated server. I virtualized the dedicated server to create the 2 VPS; one VPS never goes high while the other one does. Is it possible that, due to the software RAID on the dedicated server, one VPS can get a high load spike?

  • AnthonySmith Member, Patron Provider

    bdspice said: Is it possible that, due to the software RAID on the dedicated server, one VPS can get a high load spike?

    Yes, absolutely. I assume they both run different stacks, so the lack of IOPS creating I/O wait could significantly impact one and not the other.

    Given the timings, it seems almost certain to be the case.

    Try turning off bitmap caching and reducing the max speed to 5000 on your mdadm array and see if it still happens; it may simply be time to upgrade your disks to a faster array.

    Thanked by 2 bdspice, vimalware
  • @AnthonySmith said:

    bdspice said: Is it possible that, due to the software RAID on the dedicated server, one VPS can get a high load spike?

    Yes, absolutely. I assume they both run different stacks, so the lack of IOPS creating I/O wait could significantly impact one and not the other.

    Given the timings, it seems almost certain to be the case.

    Try turning off bitmap caching and reducing the max speed to 5000 on your mdadm array and see if it still happens; it may simply be time to upgrade your disks to a faster array.

    Can you help me do this, or give me a tutorial or some pointers on how to do it? I am a bit new to this, which is why I left it all at the defaults.

  • If you have IPv6 enabled, I would ask you to run
    ip route show cache table all | grep -c cache
    now, while you are not affected, and then again on Sunday when you are.

    Thanked by 1 bdspice
  • AnthonySmith Member, Patron Provider

    bdspice said: Can you help me do this, or give me a tutorial or some pointers on how to do it? I am a bit new to this, which is why I left it all at the defaults.

    Yeah, no problem. Just PM me the output of: cat /proc/mdstat

    I will try and advise you from there.

    Thanked by 1 bdspice
  • @AnthonySmith said:

    Try turning off bitmap caching and reducing the max speed to 5000 on your mdadm array

    Here is how I reduced the max speed from 200000 to 5000; the command is:
    sysctl -w dev.raid.speed_limit_max=5000

    Here is the command I used to turn off the bitmap; the server has two arrays, md2 and md3, and this one is for md2:
    mdadm --grow --bitmap=none /dev/md2

    Here is the output of cat /proc/mdstat:

        Personalities : [raid1]
        md3 : active raid1 sda3[0] sdb3[1]
              1744516032 blocks [2/2] [UU]
              bitmap: 5/13 pages [20KB], 65536KB chunk
    
        md2 : active raid1 sda2[0] sdb2[1]
              204798912 blocks [2/2] [UU]
    
        unused devices: <none>
    

    Am I doing it right? Here is the status of the partitions from the OVH panel:

    Name of partition   Mount point   Used space   Inodes used
    md2                 /             4%           1%
    md3                 /vpsprt       62%          1%
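
    A minimal way to double-check both changes (note that in the /proc/mdstat output above md3 still shows a bitmap, so the same --grow command would be needed for md3 as well):

        cat /proc/mdstat | grep -i bitmap    # any output here means an array still has its bitmap
        sysctl dev.raid.speed_limit_max      # should now report 5000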
    
  • WHM/cPanel backups, maybe?

    Thanked by 1 bdspice
  • Here is the latest update. I changed the server time to my time zone, then changed the cron job in /etc/cron.d/raid-check from Sunday to Friday, because a high server load on Friday is not a problem for my office since it is our day off. I disabled the bitmap for both arrays with:
    "mdadm --grow --bitmap=none /dev/md2"
    "mdadm --grow --bitmap=none /dev/md3"

    I also reduced the max speed from 200000 to 5000, as advised, with:
    sysctl -w dev.raid.speed_limit_max=5000
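
    One thing to keep in mind: sysctl -w only changes the running kernel, so that value is lost on reboot. A small drop-in file (the filename here is arbitrary) keeps it persistent on CentOS 7:

        echo 'dev.raid.speed_limit_max = 5000' > /etc/sysctl.d/90-raid-resync.conf   # any *.conf here is read at boot
        sysctl --system                                                              # reload all sysctl drop-ins now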

    Now I will move the DNS of the domain that uses the heavy email storage to Cloudflare and convert the email format for that cPanel account from maildir to mdbox.

    I hope this will get the problem solved, in sha Allah.

    Thanked by 1 alilet
  • @FrankZ said:
    If you have IPv6 enabled, I would ask you to run
    ip route show cache table all | grep -c cache
    now, while you are not affected, and then again on Sunday when you are.

    The output is 0. What does that mean?

  • AnthonySmith Member, Patron Provider

    bdspice said: I hope this will get the problem solved, in sha Allah.


    Seems you have done enough to confirm whether it's the RAID sync, or at least rule it out. Keep us updated.

    Thanked by 1 bdspice
  • FrankZ Veteran
    edited December 2018

    bdspice said: The output is 0. What does that mean?

    Neighbor discovery cache. It should be 0 or a low number (< 50). If you have 500 to 1500 come Sunday, let me know. If it stays low, then this is not your problem. @AnthonySmith is probably correct, but since this is a once-a-week thing, I figured I would give you something else to check.
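
    If the count does climb on Sunday, a couple of related things to look at (a sketch; the sysctl name is the standard IPv6 neighbour-table threshold):

        ip -6 neigh show | wc -l                    # rough count of IPv6 neighbour entries
        sysctl net.ipv6.neigh.default.gc_thresh3    # table size limit that triggers garbage collection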

    Thanked by 1 bdspice
  • @FrankZ @AnthonySmith Thanks. Since I changed the max speed and moved the resync time to Friday, the resync is running right now, but I see no high load anywhere on my server. So now it's time to watch what happens on Sunday. I will update you all. Thanks again.

    Thanked by 1 FrankZ
  • Hello, here is the latest update. The VPS is now normal, but the main dedicated server is doing something and its load average is 1, which is also normal. Here is some output I got just now. Please check it and tell me whether there is anything to worry about and what is going on. Is it syncing now or not? If the sync is running, how much time will it take at my drive speed?

    [root@dedicated cron.d]# cat /proc/mdstat
    Personalities : [raid1]
    md3 : active raid1 sda3[0] sdb3[1]
          1744516032 blocks [2/2] [UU]
          [>....................]  check =  4.3% (76682624/1744516032) finish=5331.7min speed=5213K/sec
    
    md2 : active raid1 sda2[0] sdb2[1]
          204798912 blocks [2/2] [UU]
            resync=DELAYED
    
    unused devices: <none>
    [root@dedicated cron.d]# date
    Fri Dec 14 16:11:38 +06 2018
    [root@dedicated cron.d]# mdadm -D /dev/md2
    /dev/md2:
               Version : 0.90
         Creation Time : Tue Jan  9 22:58:49 2018
            Raid Level : raid1
            Array Size : 204798912 (195.31 GiB 209.71 GB)
         Used Dev Size : 204798912 (195.31 GiB 209.71 GB)
          Raid Devices : 2
         Total Devices : 2
       Preferred Minor : 2
           Persistence : Superblock is persistent
    
           Update Time : Fri Dec 14 07:31:41 2018
                 State : active, resyncing (DELAYED)
        Active Devices : 2
       Working Devices : 2
        Failed Devices : 0
         Spare Devices : 0
    
    Consistency Policy : resync
    
                  UUID : 80c79728:4f7a0921:a4d2adc2:26fd5302
                Events : 0.12983
    
        Number   Major   Minor   RaidDevice State
           0       8        2        0      active sync   /dev/sda2
           1       8       18        1      active sync   /dev/sdb2
    [root@dedicated cron.d]# mdadm -D /dev/md3
    /dev/md3:
               Version : 0.90
         Creation Time : Tue Jan  9 22:58:49 2018
            Raid Level : raid1
            Array Size : 1744516032 (1663.70 GiB 1786.38 GB)
         Used Dev Size : 1744516032 (1663.70 GiB 1786.38 GB)
          Raid Devices : 2
         Total Devices : 2
       Preferred Minor : 3
           Persistence : Superblock is persistent
    
           Update Time : Fri Dec 14 16:11:51 2018
                 State : active, checking
        Active Devices : 2
       Working Devices : 2
        Failed Devices : 0
         Spare Devices : 0
    
    Consistency Policy : resync
    
          Check Status : 4% complete
    
                  UUID : ee258205:a4ce61ec:a4d2adc2:26fd5302
                Events : 0.148380
    
        Number   Major   Minor   RaidDevice State
           0       8        3        0      active sync   /dev/sda3
           1       8       19        1      active sync   /dev/sdb3
    [root@dedicated cron.d]# sar -q
    Linux 3.10.0-693.11.6.el7.x86_64 (dedicated)    12/14/2018      _x86_64_        (8 CPU)
    
    02:00:04 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
    02:10:04 AM        11       267      0.13      0.14      0.31         0
    02:20:04 AM         2       249      0.13      0.13      0.22         0
    02:30:04 AM         2       250      0.15      0.22      0.22         0
    02:40:04 AM         3       253      0.12      0.10      0.16         0
    02:50:03 AM         2       251      0.10      0.07      0.11         0
    03:00:04 AM         3       251      0.17      0.22      0.19         2
    03:10:03 AM         2       261      0.15      0.18      0.17         0
    03:20:03 AM         2       254      0.11      0.14      0.16         0
    03:30:03 AM         2       261      0.58      0.53      0.32         0
    03:40:04 AM         3       278      0.02      0.23      0.28         2
    03:50:03 AM         2       250      0.11      0.16      0.23         0
    04:00:05 AM         3       262      0.06      0.20      0.27         0
    04:10:03 AM         2       253      0.04      0.07      0.16         0
    04:20:03 AM         2       252      0.29      0.24      0.19         0
    04:30:03 AM         2       254      0.05      0.08      0.13         0
    04:40:04 AM         3       256      0.15      0.08      0.10         0
    04:50:03 AM         2       251      0.03      0.11      0.12         0
    05:00:04 AM         3       252      0.00      0.04      0.08         0
    05:10:04 AM         3       249      0.20      0.09      0.07         0
    05:20:04 AM         2       293      0.06      0.12      0.12         1
    05:30:04 AM         2       249      0.00      0.18      0.26         2
    05:40:04 AM         3       259      0.87      0.32      0.25         0
    05:50:03 AM         2       247      0.01      0.06      0.13         0
    06:00:04 AM         3       250      0.08      0.18      0.21         0
    06:10:04 AM         2       250      0.19      0.19      0.18         0
    06:20:03 AM         3       249      0.29      0.26      0.19         0
    06:30:04 AM         2       248      0.06      0.07      0.12         0
    06:40:04 AM         3       253      0.17      0.19      0.16         0
    06:50:03 AM         2       253      0.01      0.07      0.12         0
    07:00:04 AM         3       258      0.00      0.02      0.08         1
    07:10:03 AM         2       248      0.40      0.20      0.15         0
    07:20:04 AM         3       258      0.00      0.06      0.11         0
    07:30:03 AM         2       250      0.36      0.16      0.15         0
    07:40:04 AM         3       250      0.23      0.25      0.23         0
    07:50:03 AM         2       252      0.28      0.19      0.20         1
    08:00:04 AM         3       249      0.09      0.11      0.17         2
    08:10:03 AM         2       250      0.14      0.23      0.23         1
    08:20:03 AM         2       251      0.11      0.13      0.19         0
    08:30:03 AM         2       253      0.82      0.31      0.20         0
    08:40:04 AM         3       251      0.13      0.17      0.18         0
    08:50:03 AM         2       246      0.05      0.07      0.12         2
    09:00:04 AM         3       250      0.09      0.19      0.15         0
    
    09:00:04 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
    09:10:03 AM         2       248      0.19      0.17      0.15         2
    09:20:04 AM         2       250      0.19      0.15      0.14         0
    09:30:03 AM         2       250      0.12      0.18      0.17         1
    09:40:04 AM         3       253      0.11      0.11      0.14         0
    09:50:03 AM         2       248      0.19      0.11      0.13         0
    10:00:04 AM         4       251      0.26      0.19      0.16         3
    10:10:03 AM         2       263      0.13      0.12      0.14         0
    10:20:04 AM         2       247      0.19      0.13      0.14         1
    10:30:03 AM         2       250      0.02      0.11      0.13         0
    10:40:04 AM         3       255      0.14      0.13      0.14         0
    10:50:04 AM         2       249      0.87      0.42      0.28         1
    11:00:04 AM         3       252      0.05      0.14      0.21         0
    11:10:03 AM         2       249      0.04      0.08      0.13         0
    11:20:03 AM         2       252      0.09      0.15      0.15         0
    11:30:03 AM         2       255      0.00      0.06      0.11         0
    11:40:04 AM         3       253      0.01      0.11      0.14         0
    11:50:04 AM         2       252      0.10      0.48      0.37         0
    12:00:05 PM         3       265      0.08      0.10      0.21         1
    12:10:04 PM         2       253      1.00      0.94      0.63         1
    12:20:03 PM         2       282      1.17      1.25      0.94         0
    12:30:03 PM         2       295      1.14      1.14      1.05         0
    12:40:04 PM         3       264      1.01      1.06      1.05         0
    12:50:03 PM         2       254      1.11      1.15      1.11         0
    01:00:04 PM         3       261      1.04      1.11      1.13         1
    01:10:03 PM         2       253      1.02      1.12      1.16         0
    01:20:03 PM         2       251      1.28      1.18      1.18         0
    01:30:04 PM         2       256      1.24      1.33      1.23         0
    01:40:04 PM         3       281      2.43      1.52      1.30         0
    01:50:03 PM         2       279      1.13      1.25      1.27         0
    02:00:04 PM         3       264      1.10      1.09      1.16         0
    02:10:03 PM         2       256      1.08      1.09      1.13         0
    02:20:04 PM         2       265      1.23      1.12      1.14         0
    02:30:04 PM         2       256      1.06      1.09      1.12         0
    02:40:04 PM         3       267      2.01      1.58      1.32         0
    02:50:03 PM         4       263      1.17      1.18      1.22         0
    03:00:04 PM         4       259      1.75      1.80      1.48         0
    03:10:03 PM         2       253      1.10      1.34      1.38         0
    03:20:03 PM         2       254      1.15      1.15      1.26         0
    03:30:04 PM         2       254      1.09      1.08      1.18         0
    03:40:04 PM         3       253      1.02      1.14      1.19         1
    03:50:03 PM         3       272      1.50      1.68      1.45         0
    04:00:04 PM         2       258      1.63      1.82      1.61         3
    
    04:00:04 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
    04:10:03 PM         2       254      1.16      1.30      1.45         1
    Average:            3       256      0.49      0.48      0.48         0
    [root@dedicated cron.d]# vi raid-check
    # Run system wide raid-check once a week on Sunday at 1am by default
    0 1 * * 5 root /usr/sbin/raid-check
    
  • jackb Member, Host Rep

    It will take four days at the current speed, so I'd recommend either raising the speed or pausing the check at the end of the window you want it to run in. Search for "echo idle sys block mdadm" to see how to pause the RAID check.
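
    For reference, the sysfs interface that search points to looks roughly like this (md3 taken from the output above):

        cat /sys/block/md3/md/sync_action             # shows "check", "resync" or "idle"
        echo idle  > /sys/block/md3/md/sync_action    # stop the running check
        echo check > /sys/block/md3/md/sync_action    # kick off a check again later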

    Thanked by 1 bdspice
  • AnthonySmith Member, Patron Provider

    If everything is running fine and the sync time is under 7 days (it is), then just leave it; your problem is solved.

    There is just no way to quickly sync slow SATA drives with zero impact; you get to choose one or the other, or you upgrade to SSD :)

  • @jackb said:
    It will take four days at the current speed, so I'd recommend either raising the speed or pausing the check at the end of the window you want it to run in. Search for "echo idle sys block mdadm" to see how to pause the RAID check.

    Can I pause the process now by entering this command?
    echo "idle" > /sys/block/md3/md/sync_action
    Then how much should I raise the speed from the current max of 5000?
    Can I make it 15000? When it was 200000, the average load was around 15 and the check took 5 hours, so I divided 200000 by 15 = around 14000.
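
    Rough arithmetic on the duration, using the md3 size from the /proc/mdstat output above (1744516032 1K blocks) and treating the limit as KB/s:

        echo "scale=1; 1744516032 / 5000  / 3600" | bc    # ~96.9 hours (about 4 days) at 5000 KB/s
        echo "scale=1; 1744516032 / 15000 / 3600" | bc    # ~32.3 hours at 15000 KB/s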

  • @FrankZ @AnthonySmith Thanks. Sunday has passed without any trouble, so now I know the cause was just the mdadm array. Thanks again @AnthonySmith for your suggestion to turn off bitmap caching and reduce the max speed to 5000 on the mdadm array. I have finally solved an 11-month-old problem :smiley:

    Thanked by 1 FrankZ
  • AnthonySmith Member, Patron Provider

    bdspice said: @FrankZ @AnthonySmith Thanks. Sunday has passed without any trouble, so now I know the cause was just the mdadm array. Thanks again @AnthonySmith for your suggestion to turn off bitmap caching and reduce the max speed to 5000 on the mdadm array. I have finally solved an 11-month-old problem

    No worries. If your server has space for another drive, you can store the bitmap cache on a tiny SSD, which will speed up the resync a lot and no longer impact the I/O either.
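
    If I read the mdadm man page correctly, --grow also accepts a file-backed bitmap, so a rough sketch (with a hypothetical SSD mounted at /mnt/ssd; the bitmap file must not live on the array itself) would be:

        mdadm --grow /dev/md3 --bitmap=/mnt/ssd/md3.bitmap    # write-intent bitmap stored on the SSD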

    Thanked by 1 vimalware
  • @AnthonySmith said:

    No worries. If your server has space for another drive, you can store the bitmap cache on a tiny SSD, which will speed up the resync a lot and no longer impact the I/O either.

    How would I know whether my server has space for another drive or not? I just bought it from OVH SoYouStart. And do I need the bitmap cache? Is it that important?

  • AnthonySmith Member, Patron Provider

    bdspice said: How would I know whether my server has space for another drive or not? I just bought it from OVH SoYouStart. And do I need the bitmap cache? Is it that important?

    That is a question for whoever you lease the server from, but SYS is not flexible, so you can probably forget about it.

    You don't need a bitmap cache; however, it greatly improves resync speed because, in simple terms, it is used to keep track of blocks that may be out of sync.

    For a 2-disk RAID 1 I would not worry too much, as the penalty of having one probably does more harm to performance than good; probably not worth spending money on, now that I think about it.

    You just have a far better chance of recovering from a significant disk failure or system crash if you have a bitmap cache, but it's like a permanent double-check that slows things down.

    I am sure that when your server/service grows you will migrate to an SSD-based server anyway, and then none of these problems will be noticeable anymore.
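
    And if you ever want the safety net back without extra hardware, re-adding the internal bitmap is a one-liner per array:

        mdadm --grow /dev/md3 --bitmap=internal    # re-enable the write-intent bitmap on md3
        mdadm --grow /dev/md2 --bitmap=internal    # and on md2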

    Thanked by 2 vimalware, bdspice
  • You can also yolo and run the cron only once a month :) Sure, it checks for inconsistencies between the two drives, but you perform backups anyway, so in the very unlikely case there were inconsistencies that absolutely killed everything (I've yet to see this across 1000+ servers), you could simply restore your backups.
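
    For example, going by the stock /etc/cron.d/raid-check line shown earlier, a once-a-month variant (1st of the month at 1am, day chosen arbitrarily) would look like:

        # /etc/cron.d/raid-check - monthly instead of weekly
        0 1 1 * * root /usr/sbin/raid-check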

    Thanked by 1 bdspice