Comments
The graph looks like it takes 5-6 samples per hour, so it's likely averaging over ~10-minute windows. If the load is consistent within those periods, you're not going to see the spikes that show up at 10-second granularity.
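Something like this would show it, assuming the sysstat package is installed (the interval and count here are just my choices):

```shell
# 6 samples at 10-second intervals (~1 minute) of extended disk stats;
# compare r/s and rkB/s here against what the averaged graph shows.
iostat -dxk 10 6
```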
As others mentioned, see if you can identify what's causing the spikes. Are these perhaps the large SELECTs you're referring to?
You can also just stop PostgreSQL temporarily and do a similar iostat test to see if the read spikes disappear.
Disk is usually the bane of most server workloads (unless you're on SSDs) - if you do look at upgrading, perhaps consider SSD storage or more RAM (to cache reads more, assuming that they're cacheable).
Local-file-as-caches. That'll do. Throw more RAM at it.
I've done a brief review of the application code that's running (it powers a website that gets about 200K-400K hits a month). Unfortunately there's not much optimization that can be done in terms of DB calls; the app makes frequent calls to an external API for each incoming HTTP request from a user, and the response of each call results in accesses to many individual rows. Hence it's an inherently read-heavy (and to a lesser extent write-heavy) app.
Throw RAM at it and let it become someone else's eventual problem.
I can't stop PostgreSQL since it is somewhat mission-critical right now, but I can clone everything to a new VM and let it sit for a while (where it won't be dealing with any workload). I'll let you know how that goes.
I don't think your web server (and hence the DB) is the culprit. The hit count doesn't explain that load.
Before we go on, turn off your swap at least on vda. Having a cache of about 1/3rd of total mem and having a swap is not needed anyway.
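Roughly like this (the partition name is a guess; check what actually backs the swap first):

```shell
swapon --show              # confirm which device holds the swap
sudo swapoff /dev/vda2     # placeholder partition; or swapoff -a for all
# to keep it off across reboots, also comment out the swap line in /etc/fstab
```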
And show us an lsof output cleaned of standard shit.
10k reqs/day is around one every ~9 seconds. Do you get clumps of requests together often? The graphs show fairly steady load throughout the day (with a few peaks here and there), but you're seeing very spiky load during smaller windows. It's hard to guess, but if you have consistent load, you shouldn't be getting those big read spikes - you'd expect it to be spread out more.
In other words, your load may be coming from another source.
Then again, I don't know your application...
A very simple means to track this would be to do a ps/systat(atop)/etc from a script to read through and see just what's going on. At least with a basic ext4 system it isn't going to be a silly journal issue or anything like that. Basically, it just needs to be babysat and watched.
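A minimal sketch of that babysitting loop (the interval and fields are arbitrary choices of mine; atop or pidstat -d will give you real per-process IO counters):

```shell
# Take $1 samples, $2 seconds apart, of the busiest processes, so a
# read spike on the graph can later be matched to a PID.
io_watch() {
    samples=$1
    interval=$2
    i=0
    while [ "$i" -lt "$samples" ]; do
        date '+%F %T'
        ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -6
        sleep "$interval"
        i=$((i + 1))
    done
}

# e.g. an hour of 10-second samples, appended to a log:
# io_watch 360 10 >> /tmp/io-watch.log
```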
If you have sysstat installed and sar collection configured, you'd have historic records close to the exact date and time (entries at 10-minute intervals by default), so you can go back to that date/time and review the sar -d output to at least see which block device the IO is coming from (match the device numbers against lsblk output).
The log path below is for CentOS; not sure where your sar logs are on Ubuntu.
sar -d -f /var/log/sa/sa$(date +%d -d yesterday)
My guess (entirely based on the limited context and totally based on previous hunches) is that it's the DB. There are probably tables that could do with an index update and/or an overall statistics update (and related general DB maintenance). When was the last time you explicitly ran some PG maintenance scripts?
Easy way to test (if possible) - shutdown the DB for a bit and see if the IO graph drops (consistently). Of course it is a HeisenTest(tm) but it's probably worth it. Needless to say I assume you're not going to have some logging daemon spew lots of messages, promptly increasing the IO back to where it was.
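Roughly like this (the service name varies by distro, e.g. postgresql@14-main on Ubuntu, and the interval is arbitrary):

```shell
sudo systemctl stop postgresql
iostat -dxk 10 6      # a minute of 10-second samples: do the read spikes vanish?
sudo systemctl start postgresql
```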
Install the blktrace package and run it for an online trace. You can also use blktrace to record logs and parse them later with blkparse.
@nullnothere Yep, shutting off Postgres makes disk I/O drop like a rock. What sort of maintenance do you recommend? I just finished reindexing the entire database, and so far I've just been letting autovacuum do its thing.
I'm no PG expert (hopefully some of the PG gurus here can help), but if you're doing all the usual good things (vacuum, analyze, reindex) then I think you're OK. Check that they're happening frequently enough (and that they're completing correctly). You may need to increase the autovacuum frequency as well.
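To check whether autovacuum is actually keeping up, something like this works (the database name is a placeholder):

```shell
# When were tables last (auto)vacuumed/analyzed? NULLs or old timestamps
# on big tables suggest autovacuum needs tuning.
psql -d appdb -c "
  SELECT relname, last_vacuum, last_autovacuum,
         last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
   ORDER BY last_autovacuum NULLS FIRST;"
```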
If despite these maintenance jobs (etc.), you're having high IO then you'll have to start looking at the queries (figure out the slow ones etc.). Perhaps somewhere you're doing a table scan (missing an index). The nice thing is that it doesn't appear to be a CPU hog (so it's not some sort of a stupid loop join etc. that the query planner is picking for whatever reason).
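A quick way to hunt for those table scans, with the same placeholder database name (the counters are cumulative since the last stats reset):

```shell
# Tables read most often by sequential scan; high seq_tup_read with a
# low idx_scan count usually means a missing index.
psql -d appdb -c "
  SELECT relname, seq_scan, seq_tup_read, idx_scan
    FROM pg_stat_user_tables
   ORDER BY seq_tup_read DESC
   LIMIT 10;"
```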
You may have "grown" (DB wise) to a point where now you need more memory for optimum performance etc.
Hope this gets you going (you're now entering DBA land...)
Turns out that I had some relatively DB-intensive cron jobs that were originally configured to run every minute. This was fine when the DB was small but not so much now that the DB is much larger (~2GB). I've set them to run less frequently and that seems to have done the trick for now!
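For context, the change amounts to a crontab edit along these lines (the script path is a made-up placeholder):

```shell
# before: ran every minute, hammering the DB
# * * * * *   /usr/local/bin/db-job.sh
# after: every 15 minutes
*/15 * * * *  /usr/local/bin/db-job.sh
```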
@nullnothere I did some more digging and found a query that is essentially a table scan (filtering on an unindexed boolean). Added a partial index on that field and so far so good.
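For anyone finding this later, a sketch of that kind of fix (table and column names invented for illustration):

```shell
# Partial index covering only the rows the query filters for;
# CONCURRENTLY builds it without blocking writes.
psql -d appdb -c \
  "CREATE INDEX CONCURRENTLY items_pending_idx ON items (id) WHERE pending;"

# Verify the planner now uses it (expect an Index Scan, not a Seq Scan):
psql -d appdb -c "EXPLAIN SELECT * FROM items WHERE pending;"
```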
As you are at it you might want to considerably increase index memory for PG rather than wasting it on disk caching.
;-) ;-) ;-)
Thanks for the advice.
I used to have a script somewhere for measuring actual disk read+write requests per second on a per-process basis; this thread reminds me I should probably share it.
@AnthonySmith please share it if you do find it...