Unable to find cause of high disk I/O

emmd19 Member
edited February 2017 in Help

I've been experiencing issues with high disk I/O on my KVM VPS recently, which has caused several excessive I/O alerts from my host (a well-known provider here on LEB/LET). In particular, there is always a constant disk read of about 7MB/s. The VPS is running Ubuntu 16.04 64-bit and has 1.5GB of RAM, of which about 2/3 is in use at any given time. Swap usage hovers between 25-100MB out of 1.2GB.

Here's the output from a single invocation of iostat, which, according to my understanding, represents running averages since boot:
Here's the I/O graph over the last 24 hours from my host's control panel:

The VPS is used for web/HTTP and a PostgreSQL database. I initially suspected the Postgres database was causing issues, but I can't seem to find the source of this phantom read I/O. iotop does not provide any clues, and running iostat at 1s intervals shows that disk I/O is minimal. Furthermore, vmstat shows that swap activity is minimal as well. I even considered the possibility that my host's I/O metering was buggy, but they replied that, since I/O usage is read directly from the hypervisor, their readings can't be wrong.

I'm at my wit's end trying to figure this out. Does anyone have any ideas?

Comments

  • I'd run lsof and start checking processes. There's far too much information you just haven't given us. What sort of host is this running on? What version of KVM? Etc, etc.

    There may very well be a nasty way VirtIO is being handled with your Postgres DBs. Check your systat and everything else.

  • emmd19 Member
    edited February 2017

    How would lsof help? Since I'm just a client, I have no idea about the particulars of my host's underlying KVM implementation, sorry :(

  • Crawl through your dmesg and look for "virtio" information; what does your ethernet device show up as, etc.

    What's your build? What're you running for these services- are they stock distrib, custom, etc..

    To check and change things without taking down services, you can play around with ionice.
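
    Rough sketches of both the dmesg crawl and ionice (the PID and command below are placeholders; ionice only really takes effect with the CFQ scheduler):

    # look for virtio bits in the kernel log
    dmesg | grep -i virtio

    # drop an already-running process to the idle I/O class
    ionice -c 3 -p 1234
    # or start a one-off job at the lowest best-effort priority
    ionice -c 2 -n 7 some-heavy-command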

  • emmd19 Member
    edited February 2017

    lspci:
    00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
    00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
    00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
    00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
    00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
    00:02.0 VGA compatible controller: Cirrus Logic GD 5446
    00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
    00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
    00:05.0 SCSI storage controller: Red Hat, Inc Virtio block device
    00:06.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon

    All packages are stock - nginx, postgres from Ubuntu repos, and some custom Python/Django projects running behind gunicorn. These are not resource-heavy at all, with the possible exception of frequent database accesses (although certainly not to the point of several MB/s sustained). I've just remounted / with noatime - will let you know how that goes...
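
    For reference, the remount was along these lines (the fstab line is just illustrative):

    mount -o remount,noatime /
    # and noatime in the / entry of /etc/fstab so it survives a reboot, e.g.:
    # /dev/vda1  /  ext4  defaults,noatime  0  1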

  • Your iostat screenshot above shows I/O that seems to be consistent with the graph. When you say minimal I/O with iostat at 1s intervals, does it look like the screenshot above?

    Have you tried stopping Postgresql and measuring the I/O?

  • Ishaq Member, Provider

    What does iotop show?

  • @xyz Nope, when running iostat at 1s intervals most of them look like this:

  • Ishaq Member, Provider

    Your TPS value is high, especially if the provider uses HDDs.

  • emmd19 Member
    edited February 2017

    @Ishaq When running iotop, the disk read is usually anywhere from 0 to a few hundred KB/s. There are rare moments when it spikes/bursts due to a Postgres SELECT running on a large table; however, these never last more than a second or two. Basically there's nothing I can see that would explain sustained disk I/O.

  • @Ishaq said:
    Your TPS value is high, especially if the provider uses HDDs.

    I guess there's not much point in hiding my provider lol. It's LunaNode, and IIRC this is one of their SSD-cached plans out of OVH-BHS.

  • Ishaq Member, Provider

    Try installing and running atop.

  • Will try that now. In the meantime, here's about 30 seconds worth of vmstat 1 in case that helps:

  • emmd19 said: @xyz Nope, when running iostat at 1s intervals most of them look like this:

    Could you run iostat 10, leave it for a minute, and screenshot all of the output?

  • atop -d -A specifically. I'm so used to systat that I had to look that up. :D

  • @Ishaq @WSS atop -d -A:

  • Ishaq Member, Provider

    How strange.

  • Huh. What's throwing me for a loop is that it's showing 20% USER above, but then- nada, so we should be seeing something here.

    I'm just going to assume it's an ancient KVM on CentOS 6.

    I'm assuming you've tried changing priorities and nonsuch.

  • emmd19 Member
    edited February 2017

    @xyz iostat 10 for 1 minute (1st entry is the average since boot):

  • @WSS My CPU load average is around 20-30% - is that what USER% means?

  • xyz Member
    edited February 2017

    CPU usage is user+system (+nice if you have any nice'd processes).

    From that iostat, it looks to me like your I/O is usually low, but you get spikes (like that 2047 tps reading), which pull the average up. The graph is likely averaging over a long period of time, and the first iostat reading shows an average which seems to be in line with what you get if you average all your other iostat readings.

  • @xyz said:
    CPU usage is user+system (+nice if you have any nice'd processes).

    From that iostat, it looks to me like your I/O is usually low, but you get spikes (like that 2047 tps reading), which pull the average up. The graph is likely averaging over a long period of time, and the first iostat reading shows an average which seems to be in line with what you get if you average all your other iostat readings.

    The problem with this is the generated graph, because it looks pretty consistent. I guess we'd need better sample data from what the host is running - and again, that would be a lot more useful from the host's perspective than from inside QEMU. The end result is that we're all left wondering. :)

  • @xyz said:
    CPU usage is user+system.

    From that iostat, it looks to me like your I/O is usually low, but you get spikes (like that 2047 tps reading), which pull the average up. The graph is likely averaging over a long period of time, and the first iostat reading shows an average which seems to be in line with what you get if you average all your other iostat readings.

    Hmm... that makes sense, I suppose. I guess the short-term is to upgrade to beefier hardware...

  • @emmd19 said:
    @WSS My CPU load average is around 20-30% - is that what USER% means?

    This little article should help you understand what the different numbers actually mean: http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

  • So going off the conclusion that I need to upgrade my infrastructure - what do you guys recommend? Is this level of resource/disk utilization still within the realm of VPS, or is dedicated the way to go?

  • I'd recommend you start benchmarking/setting up accounting to see if you can actually find what's going on, first. If you're having quick little stabs that even out to 7MB/s, you might just have to work on your queries and/or change the design. We're all flying blind here- you might try asking your host to show you your allocated system use as well as running accounting under your processes.
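
    If you want a cheap starting point for that kind of per-process accounting (assuming the sysstat package is installed), something like this would do:

    # per-process disk read/write statistics, every 5 seconds, 12 samples
    pidstat -d 5 12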

  • Your iostat shows up to 25 MB/s read. At the same time it shows that the amount of data read per minute is about the amount one would expect per 10s. Ergo you have something that reads a lot in spikes.

    You will need to watch with finer granularity and find out who is reading cyclically from vda.

    Also show your mounts and tell us about your swap.
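
    A rough way to do all of that (assuming iotop is installed):

    # batch mode, timestamped, only processes actually doing I/O, every 10s
    iotop -obt -d 10

    # mounts and swap
    findmnt
    cat /proc/swaps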

  • @bsdguy said:
    Also show your mounts and tell us about your swap.

    Filesystem wouldn't hurt, either. You brazen hussy.

  • Can't you ever think about anything else, you slut? How perfect!

    File system? Don't care yet. 25 MB/s smells strongly like cache. A propos smelling: That whole thing smells.

  • Being that we don't have any host specs, I still wonder if we're getting a combination of random select hits and just an overall shitty driver base since it's all virtio.

    I know you'll do such crazy things for 25MB/s.. even if your sisters' eyebrow entrances me so.

  • emmd19 Member
    edited February 2017

    All right guys, get your mind out of the gutter :P Filesystems are nothing special, just a single 15GB / formatted as ext4 with plenty of free space:
    Filesystem   Size  Used  Avail  Use%  Mounted on
    udev         744M   12K   744M    1%  /dev
    tmpfs        150M  1.3M   149M    1%  /run
    /dev/vda1     16G  9.1G   6.1G   60%  /
    none         4.0K     0   4.0K    0%  /sys/fs/cgroup
    none         5.0M     0   5.0M    0%  /run/lock
    none         749M     0   749M    0%  /run/shm
    none         100M     0   100M    0%  /run/user

    Swap consists of a ~256MB swap partition on /dev/vdb1 and an additional 1GB swapfile on the root filesystem (/dev/vda1).

    Swap utilization is modest and currently at 218/1061MB with minimal swap activity.

  • WSS said: The problem with this is the generated graph, because it looks pretty consistent

    The graph looks like it takes 5-6 samples per hour, so it's likely averaging every ~10 minutes. If the load is consistent over those periods of time, you're not going to see the spikes that you see with 10s granularity.

    emmd19 said: I guess the short-term is to upgrade to beefier hardware...

    As others mentioned, see if you can identify what's causing the spikes. Are these perhaps the large SELECTs you're referring to?

    You can also just stop PostgreSQL temporarily and do a similar iostat test to see if the read spikes disappear.

    Disk is usually the bane of most server workloads (unless you're on SSDs) - if you do look at upgrading, perhaps consider SSD storage or more RAM (to cache reads more, assuming that they're cacheable).

  • Local-file-as-caches. That'll do. Throw more RAM at it.

  • emmd19 Member
    edited February 2017

    I've done a brief review of the application code that's running (it powers a website that gets about 200K-400K hits a month). Unfortunately there's not much optimization that can be done in terms of DB calls; the app makes frequent calls to an external API for each incoming HTTP request from a user, and the response of each call results in accesses to many individual rows. Hence it's an inherently read-heavy (and to a lesser extent write-heavy) app.

  • Throw RAM at it and let it become someone else's eventual problem. :D

  • @xyz said:

    WSS said: The problem with this is the generated graph, because it looks pretty consistent

    The graph looks like it takes 5-6 samples per hour, so it's likely averaging every ~10 minutes. If the load is consistent over those periods of time, you're not going to see the spikes that you see with 10s granularity.

    emmd19 said: I guess the short-term is to upgrade to beefier hardware...

    As others mentioned, see if you can identify what's causing the spikes. Are these perhaps the large SELECTs you're referring to?

    You can also just stop PostgreSQL temporarily and do a similar iostat test to see if the read spikes disappear.

    Disk is usually the bane of most server workloads (unless you're on SSDs) - if you do look at upgrading, perhaps consider SSD storage or more RAM (to cache reads more, assuming that they're cacheable).

    I can't stop PostgreSQL since it is somewhat mission-critical right now, but I can clone everything to a new VM and let it sit for a while (where it won't be dealing with any workload). I'll let you know how that goes.

  • bsdguy Member
    edited February 2017

    Your web server (and hence the db) isn't guilty, I think. The hits don't explain that.
    Before we go on, turn off your swap, at least on vda. With a cache of about a third of total memory, having swap on top isn't needed anyway.

    And show us an lsof output cleaned of standard shit.
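
    Something like this, roughly (the swapfile path is a guess; cat /proc/swaps will show the real one):

    swapoff /swapfile
    # crude lsof trim: keep only entries with a real numeric file descriptor
    lsof -nP | awk '$4 ~ /^[0-9]+[rwu]/' | less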

  • emmd19 said: I've done a brief review of the application code that's running (it powers a website that gets about 200K-400K hits a month). Unfortunately there's not much optimization that can be done in terms of DB calls; the app makes frequent calls to an external API for each incoming HTTP request from a user, and the response of each call results in accesses to many individual rows. Hence it's an inherently read-heavy (and to a lesser extent write-heavy) app.

    10k reqs/day is around one every ~9 seconds. Do you get clumps of requests together often? The graphs show fairly steady load throughout the day (with a few peaks here and there), but you're seeing very spiky load during smaller windows. It's hard to guess, but if you have consistent load, you shouldn't be getting those big read spikes - you'd expect it to be spread out more.

    In other words, your load may be coming from another source.
    Then again, I don't know your application...

  • A very simple means to track this would be to do a ps/systat(atop)/etc from a script to read through and see just what's going on. At least with a basic ext4 system it isn't going to be a silly journal issue or anything like that. Basically, it just needs to be babysat and watched.
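
    For example (assuming a reasonably recent atop):

    atop -w /tmp/atop.raw 30     # write a sample every 30 seconds to a raw log
    atop -r /tmp/atop.raw -d     # replay it later with the per-process disk view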

  • eva2000 Member
    edited February 2017

    emmd19 said: I've been experiencing issues with high disk I/O on my KVM VPS recently, which has caused several excessive I/O alerts from my host (a well-known provider here on LEB/LET). In particular, there is always a constant disk read of about 7MB/s. The VPS is running Ubuntu 16.04 64-bit and has 1.5GB of RAM, of which about 2/3 is in use at any given time. Swap usage hovers between 25-100MB out of 1.2GB.

    If you have sysstat and sar installed and configured, you'd have historic records close to the exact date and time (10-minute interval entries by default), so you can go back to that exact date/time and review the sar -d output to at least see which block device the I/O is coming from, matching it up against lsblk output.

    For the last 24 hours on CentOS (not sure where your sar logs live on Ubuntu):

    sar -d -f /var/log/sa/sa$(date +%d -d yesterday)
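
    On Ubuntu/Debian the raw files should live under /var/log/sysstat/ instead (and collection has to be switched on via ENABLED="true" in /etc/default/sysstat), so presumably:

    sar -d -f /var/log/sysstat/sa$(date +%d -d yesterday)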

  • My guess (entirely based on the limited context and previous hunches) is that it's the DB. There are probably tables that could do with an index update and/or an overall statistics update (and related general DB maintenance). When was the last time you explicitly ran some PG maintenance scripts?

    Easy way to test (if possible): shut down the DB for a bit and see if the IO graph drops (consistently). Of course it is a HeisenTest(tm), but it's probably worth it. Needless to say, I assume you're not going to have some logging daemon spewing lots of messages, promptly pushing the IO back up to where it was.
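
    If you can afford a short window, the test itself is only a couple of commands (assuming Ubuntu 16.04's systemd service name):

    systemctl stop postgresql
    iostat -xz 10 6        # watch for about a minute with the DB down
    systemctl start postgresql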

  • rincewind Member
    edited February 2017

    Install the blktrace package, and run

    btrace /dev/vda1
    

    for an online trace.

    You can also use blktrace to record logs and parse them later with blkparse.
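
    For an offline capture, something along these lines (the 30-second window is arbitrary):

    blktrace -d /dev/vda1 -o vda1 -w 30    # record 30s of block-layer events
    blkparse -i vda1 | less                # inspect them afterwards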

  • emmd19 Member
    edited February 2017

    @nullnothere Yep, shutting off Postgres makes disk I/O drop like a rock. What sort of maintenance do you recommend? I just finished reindexing the entire database, and so far I've just been letting autovacuum do its thing.

  • @emmd19 said: Yep, shutting off Postgres makes disk I/O drop like a rock. What sort of maintenance do you recommend?

    I'm no PG expert (hopefully some other PG gurus here can help), but if you're doing all the usual good things (vacuum, analyze, reindex) then I think you're OK. Check that they are happening frequently enough (or that they're happening correctly). You may need to increase autovacuum frequency as well.

    If, despite these maintenance jobs, you're still seeing high IO, then you'll have to start looking at the queries (figure out the slow ones, etc.). Perhaps somewhere you're doing a table scan (missing an index). The nice thing is that it doesn't appear to be a CPU hog (so it's not some sort of stupid loop join that the query planner is picking for whatever reason).

    You may have "grown" (DB wise) to a point where now you need more memory for optimum performance etc.

    Hope this gets you going (you're now entering DBA land...)
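
    For the record, the kind of checks being talked about might look roughly like this (database, table and column names are made up):

    # when autovacuum/autoanalyze last touched each table
    psql -d mydb -c "SELECT relname, last_autovacuum, last_autoanalyze FROM pg_stat_user_tables;"
    # a manual maintenance pass
    psql -d mydb -c "VACUUM (VERBOSE, ANALYZE);"
    # and for a suspect query, check whether it is doing a sequential scan
    psql -d mydb -c "EXPLAIN ANALYZE SELECT * FROM some_table WHERE some_flag;"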

  • Turns out that I had some relatively DB-intensive cron jobs that were originally configured to run every minute. This was fine when the DB was small but not so much now that the DB is much larger (~2GB). I've set them to run less frequently and that seems to have done the trick for now!
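
    For the curious, it's just a crontab schedule change along these lines (the script name is made up):

    # was: * * * * *  /usr/local/bin/refresh-stats
    */15 * * * *  /usr/local/bin/refresh-stats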

  • @nullnothere I did some more digging and found a query that is essentially a table scan (filtering on an unindexed boolean). Added a partial index on that field and so far so good.
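
    The index itself is a one-liner, roughly (table and column names changed):

    psql -d mydb -c "CREATE INDEX CONCURRENTLY items_unprocessed_idx ON items (processed) WHERE NOT processed;"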

  • While you're at it, you might want to considerably increase index memory for PG rather than wasting it on disk caching.
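
    For example, in postgresql.conf (numbers purely illustrative for a 1.5GB box):

    shared_buffers = 384MB          # needs a restart to take effect
    work_mem = 8MB
    effective_cache_size = 768MB    # planner hint, not an allocation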

  • @bsdguy said:
    While you're at it, you might want to considerably increase index memory for PG rather than wasting it on disk caching.

    WSS said: ... Throw more RAM at it.

    WSS said: Throw RAM at it ...

    ;-) ;-) ;-)

  • @Ishaq said:
    What does iotop show?

    Thanks for the advice.

  • I used to have a script somewhere for measuring actual disk read/write requests per second on a per-process basis; this thread reminds me I should probably share it.
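
    Until I dig it up, here's a rough equivalent (needs root; the temp file paths are just scratch names):

    #!/bin/sh
    # which processes grew their read_bytes counter over a 10s window
    snap() {
        for f in /proc/[0-9]*/io; do
            pid=${f%/io}; pid=${pid#/proc/}
            awk -v p="$pid" '$1 == "read_bytes:" {print p, $2}' "$f" 2>/dev/null
        done | sort
    }
    snap > /tmp/io.a
    sleep 10
    snap > /tmp/io.b
    # keep only PIDs present in both snapshots and print the delta
    join /tmp/io.a /tmp/io.b | awk '$3 > $2 {printf "pid %s read %d bytes in 10s\n", $1, $3 - $2}'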
