Comments
The graph looks like it takes 5-6 samples per hour, so it's likely averaging over ~10-minute windows. If the load is consistent within those periods, you're not going to see the spikes that show up at 10-second granularity.
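Something like this would show it, assuming the sysstat package is installed (the interval and count here are just my choices):

```shell
# 6 samples at 10-second intervals (~1 minute) of extended disk stats;
# compare r/s and rkB/s here against what the averaged graph shows.
iostat -dxk 10 6
```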
As others mentioned, see if you can identify what's causing the spikes. Are these perhaps the large SELECTs you're referring to?
You can also just stop PostgreSQL temporarily and do a similar iostat test to see if the read spikes disappear.
Disk is usually the bane of most server workloads (unless you're on SSDs) - if you do look at upgrading, perhaps consider SSD storage or more RAM (to cache reads more, assuming that they're cacheable).
Local-file-as-caches. That'll do. Throw more RAM at it.
I've done a brief review of the application code that's running (it powers a website that gets about 200K-400K hits a month). Unfortunately there's not much optimization that can be done in terms of DB calls; the app makes frequent calls to an external API for each incoming HTTP request from a user, and the response of each call results in accesses to many individual rows. Hence it's an inherently read-heavy (and to a lesser extent write-heavy) app.
Throw RAM at it and let it become someone else's eventual problem.
I can't stop PostgreSQL since it is somewhat mission-critical right now, but I can clone everything to a new VM and let it sit for a while (where it won't be dealing with any workload). I'll let you know how that goes.
I don't think your web server (and hence the DB) is the culprit. The hit count doesn't explain that load.
Before we go on, turn off your swap at least on vda. Having a cache of about 1/3rd of total mem and having a swap is not needed anyway.
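Roughly like this (the partition name is a guess; check what actually backs the swap first):

```shell
swapon --show              # confirm which device holds the swap
sudo swapoff /dev/vda2     # placeholder partition; or swapoff -a for all
# to keep it off across reboots, also comment out the swap line in /etc/fstab
```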
And show us an lsof output cleaned of standard shit.
10k reqs/day is around one every ~9 seconds. Do you get clumps of requests together often? The graphs show fairly steady load throughout the day (with a few peaks here and there), but you're seeing very spiky load during smaller windows. It's hard to guess, but if you have consistent load, you shouldn't be getting those big read spikes - you'd expect it to be spread out more.
In other words, your load may be coming from another source.
Then again, I don't know your application...
A very simple means to track this would be to do a ps/systat(atop)/etc from a script to read through and see just what's going on. At least with a basic ext4 system it isn't going to be a silly journal issue or anything like that. Basically, it just needs to be babysat and watched.
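A minimal sketch of that babysitting loop (the interval and fields are arbitrary choices of mine; atop or pidstat -d will give you real per-process IO counters):

```shell
# Take $1 samples, $2 seconds apart, of the busiest processes, so a
# read spike on the graph can later be matched to a PID.
io_watch() {
    samples=$1
    interval=$2
    i=0
    while [ "$i" -lt "$samples" ]; do
        date '+%F %T'
        ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -6
        sleep "$interval"
        i=$((i + 1))
    done
}

# e.g. an hour of 10-second samples, appended to a log:
# io_watch 360 10 >> /tmp/io-watch.log
```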
If you have sysstat installed and sar collection configured, you'd have historic records close to the exact date and time (entries at 10-minute intervals by default), so you can go back to that date/time and review the sar -d output to at least see which block device the IO is coming from (match the device numbers against lsblk output).
The log path below is for CentOS; not sure where your sar logs are on Ubuntu.
sar -d -f /var/log/sa/sa$(date +%d -d yesterday)
My guess (entirely based on the limited context and totally based on previous hunches) is that it's the DB. There are probably tables that could do with an index update and/or an overall statistics update (and related general DB maintenance). When was the last time you explicitly ran some PG maintenance scripts?
Easy way to test (if possible) - shutdown the DB for a bit and see if the IO graph drops (consistently). Of course it is a HeisenTest(tm) but it's probably worth it. Needless to say I assume you're not going to have some logging daemon spew lots of messages, promptly increasing the IO back to where it was.
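Roughly like this (the service name varies by distro, e.g. postgresql@14-main on Ubuntu, and the interval is arbitrary):

```shell
sudo systemctl stop postgresql
iostat -dxk 10 6      # a minute of 10-second samples: do the read spikes vanish?
sudo systemctl start postgresql
```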
Install the blktrace package and run it for an online trace. You can also use blktrace to record logs and parse them later with blkparse.
@nullnothere Yep, shutting off Postgres makes disk I/O drop like a rock. What sort of maintenance do you recommend? I just finished reindexing the entire database, and so far I've just been letting autovacuum do its thing.
I'm no PG expert (hopefully some of the PG gurus here can help), but if you're doing all the usual good things (vacuum, analyze, reindex) then I think you're OK. Check that they're happening frequently enough (and that they're completing correctly). You may need to increase the autovacuum frequency as well.
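To check whether autovacuum is actually keeping up, something like this works (the database name is a placeholder):

```shell
# When were tables last (auto)vacuumed/analyzed? NULLs or old timestamps
# on big tables suggest autovacuum needs tuning.
psql -d appdb -c "
  SELECT relname, last_vacuum, last_autovacuum,
         last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
   ORDER BY last_autovacuum NULLS FIRST;"
```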
If despite these maintenance jobs (etc.), you're having high IO then you'll have to start looking at the queries (figure out the slow ones etc.). Perhaps somewhere you're doing a table scan (missing an index). The nice thing is that it doesn't appear to be a CPU hog (so it's not some sort of a stupid loop join etc. that the query planner is picking for whatever reason).
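A quick way to hunt for those table scans, with the same placeholder database name (the counters are cumulative since the last stats reset):

```shell
# Tables read most often by sequential scan; high seq_tup_read with a
# low idx_scan count usually means a missing index.
psql -d appdb -c "
  SELECT relname, seq_scan, seq_tup_read, idx_scan
    FROM pg_stat_user_tables
   ORDER BY seq_tup_read DESC
   LIMIT 10;"
```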
You may have "grown" (DB wise) to a point where now you need more memory for optimum performance etc.
Hope this gets you going (you're now entering DBA land...)
Turns out that I had some relatively DB-intensive cron jobs that were originally configured to run every minute. This was fine when the DB was small but not so much now that the DB is much larger (~2GB). I've set them to run less frequently and that seems to have done the trick for now!
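For context, the change amounts to a crontab edit along these lines (the script path is a made-up placeholder):

```shell
# before: ran every minute, hammering the DB
# * * * * *   /usr/local/bin/db-job.sh
# after: every 15 minutes
*/15 * * * *  /usr/local/bin/db-job.sh
```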
@nullnothere I did some more digging and found a query that is essentially a table scan (filtering on an unindexed boolean). Added a partial index on that field and so far so good.
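For anyone finding this later, a sketch of that kind of fix (table and column names invented for illustration):

```shell
# Partial index covering only the rows the query filters for;
# CONCURRENTLY builds it without blocking writes.
psql -d appdb -c \
  "CREATE INDEX CONCURRENTLY items_pending_idx ON items (id) WHERE pending;"

# Verify the planner now uses it (expect an Index Scan, not a Seq Scan):
psql -d appdb -c "EXPLAIN SELECT * FROM items WHERE pending;"
```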
As you are at it you might want to considerably increase index memory for PG rather than wasting it on disk caching.
;-) ;-) ;-)
Thanks for the advice.
I used to have a script somewhere for measuring actual disk read+write requests per second on a per-process basis; this thread reminds me I should probably share it.
@AnthonySmith please share it if you do find it...