Comments
Are you testing ICMP, TCP, or HTTP?
How did you determine this is a network issue rather than a server-side application issue?
For example, a daily backup job that locks the database for too long could cause an HTTP request that accesses that database to fail.
The end is not neigh, as @VirMach does not offer IPv6 yet and thus does not need neighbor discovery protocol.
My mailserver with them on another NY node is fine: https://hetrixtools.com/report/uptime/18dfa185a0dfcca7df57a055540744fb/
I also have some VPSes on other nodes in NY (NY10GKVMXX) and no hetrixtools alerts.
OP mentioned PMS.
Therefore, I was notified of PMS.
Will there be PMS?
We shall see.
It's an HTTP monitor that checks for a certain word on the home page, checking from 4 different locations. I also saw it myself: as soon as I received the downtime email, I opened my website and it was down.
Just after posting this thread there have been 3 more downtimes, two of which lasted 4 minutes each.
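For anyone following along, a keyword check like the one described above can be sketched roughly as follows. This is only a minimal illustration, not how HetrixTools actually implements it. The function names, the result labels, and the 10 second timeout are assumptions made for this example.

```python
import urllib.error
import urllib.request


def classify(status, body, keyword):
    """Classify one probe result the way a keyword monitor might (labels are made up)."""
    if status is None:
        return "timeout-or-connection-error"
    if status >= 500:
        return "server-error"
    if status != 200:
        return "unexpected-status"
    return "up" if keyword in body else "keyword-missing"


def probe(url, keyword, timeout=10.0):
    """Fetch url once and classify the outcome; timeout is in seconds (assumed value)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return classify(resp.status, body, keyword)
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as exceptions but still carry a status code.
        return classify(exc.code, "", keyword)
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return classify(None, "", keyword)
```

The point of separating the cases is that a timeout, a 50X, and a missing keyword each point at a different layer, so a monitor that only says "down" hides most of the useful information.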
Then don't title this thread with 'network issues'.
You should first probe on L3 before stepping up to the higher OSI layers.
L7 isn't called the 'network layer'.
Check if you encounter similar issues with an ICMP/Ping probe at the same time.
I think hetrixtools shows the error code in the details: was it a timeout, a network issue, or a 50X from your webserver? Make sure those are not 404/50X from your code/script/network. Any DNS issues, e.g. requests sometimes ending up on the wrong server due to duplicate A entries?
It's like the worst description ever, the same as sending a pigeon letter to a mechanic where everything you write is "my car not worky".
Didn't know you can check error detail in Hetrix Tools
The log says
Error 28: Operation timed out after 10001 milliseconds with 0 bytes received (10002ms)
Opened ticket yesterday and today there is no downtime so far. Haven't received response on ticket yet but looks like they fixed something.
As per the suggestion from VirMach I added a ping monitor, and the next time the error came I noticed that the ping monitor was working 100% fine while the website monitor threw an error. I am using Cloudflare and the actual error is
504 Gateway timeout
so I guess something is wrong with my server? I am using nginx on Debian and when I run top it shows that the machine is working fine with no load issues. Maybe I need to check the nginx logs!?
Here's the error from the log:
2021/08/20 06:18:28 [error] 14657#14657: *1927970 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xx.xx.xx.xx, server: www.testing.com, request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php/php7.4-fpm.sock", host: "www.testing.com"
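To see whether these timeouts cluster around particular times (a backup job, a crawler, a cron task), one option is to count them per minute. The sketch below assumes the error log lines look like the one quoted above. The function name and the regular expression are written for this example and may need adjusting for other log formats.

```python
import re
from collections import Counter

# Captures the timestamp (to the minute) and the upstream of an
# "upstream timed out" line in the nginx error log format quoted above.
TIMEOUT_RE = re.compile(
    r'^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2} \[error\].*'
    r'upstream timed out.*upstream: "(?P<upstream>[^"]+)"'
)


def timeouts_per_minute(lines):
    """Count upstream timeouts keyed by (minute, upstream)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("ts"), match.group("upstream"))] += 1
    return counts
```

It could be fed from the log file, for example `timeouts_per_minute(open("/var/log/nginx/error.log"))`, where the path is an assumption (a common Debian default) and should be replaced with whatever the `error_log` directive points to.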
Some request is taking too long to be processed. Always helpful to check if something had overloaded the server during that time.
I restarted machine thinking it may fix the issue but it didn't.
Here's screenshot of when everything was calm:
And here's screenshot from exact moment when site went down:
I wonder what is causing that sudden CPU spike. It's a 4-core VPS with 8GB RAM. I don't know which nginx/PHP settings I should use to fix this issue. The issue is with fastcgi. I can increase the fastcgi_read_timeout timeout but this doesn't seem like the right solution.
Enable more detailed CPU usage in htop to display IOwait % and steal %.
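If htop is not cooperating, the same iowait and steal figures can be derived from /proc/stat on Linux. The sketch below is a rough illustration, not what htop does internally. The function names are made up for this example, and it only looks at the first eight counters of the aggregate cpu line.

```python
import time

# First eight counters of the aggregate "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def percentages(before, after):
    """Share of each CPU state between two samples, in percent."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * value / total for name, value in deltas.items()}


def sample(interval=1.0, path="/proc/stat"):
    """Read /proc/stat twice, interval seconds apart (Linux only)."""
    def read():
        with open(path) as handle:
            return parse_cpu_line(handle.readline())

    before = read()
    time.sleep(interval)
    return percentages(before, read())
```

Calling `sample()` during a downtime window and looking at the "iowait" and "steal" entries should show whether the machine is waiting on disk or losing CPU time to the host, which is hard to tell from a plain CPU bar.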
I think mysql generates a lot of wa (I/O wait) and php waits for it. There might be some layer 7 issue, either an attack or some misconfigured crawler.
I enabled this setting but I still don't see any more CPU detail in htop. It is still showing the same information.
I have disabled a couple of plugins (it's WordPress/WooCommerce) so let's see. Here's my nginx config:
You need to add CPU average, and press the space bar on the keyboard to change the type.
After more than 2 months I finally found out what the issue is. It was related to pm.max_children, which was using the default value of 2. I changed its value along with other related parameters according to my RAM and now the situation is much better.
There is almost no error except an occasional issue which lasts only about a second or 2, unlike the previous cases of 2+ minutes. I am still tweaking it a little bit as we speak.
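For reference, a common rule of thumb for sizing pm.max_children is to divide the RAM left over for PHP by the average memory of one PHP-FPM worker. The sketch below only shows that arithmetic. The function name is made up, and the reserved amount and per-worker size are placeholder assumptions, not values from this thread, so measure your own.

```python
def suggest_max_children(total_ram_mb, reserved_mb, avg_worker_mb):
    """Estimate how many PHP-FPM workers fit in the RAM left for PHP."""
    if avg_worker_mb <= 0:
        raise ValueError("avg_worker_mb must be positive")
    available = total_ram_mb - reserved_mb
    # Never suggest fewer than one worker, even if the estimate is zero or negative.
    return max(1, int(available // avg_worker_mb))


# 8 GB VPS as in this thread; 4096 MB reserved for MySQL/nginx/OS and
# 64 MB per worker are hypothetical numbers chosen for illustration.
print(suggest_max_children(8192, 4096, 64))  # 64
```

With only 2 children, a third concurrent request has to queue until a worker is free, which would fit the nginx "upstream timed out" errors seen earlier whenever a couple of slow requests were in flight.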