I was recently contacted by a customer who runs a busy ERP system on their Oracle database. Their users have been screaming for some time about unpredictable performance, where application screens which would usually open in under a second suddenly take an age. ERP systems are legendary for this, of course, because as each screen loads the application tends to fire off a glut of queries to populate all the various fields, graphs and so on. When performance problems strike, the elapsed time for each query rises, which usually results in “stuff going slowly”. And unfortunately, because of something known to psychologists as the availability heuristic, users tend not to remember the times when performance was acceptable… just the occasions when they were left frustrated.
Initial Investigation
The first thing I ask for in situations like this is a set of AWR reports. Customers are generally happy to part with such information, whereas the running of bespoke SQL scripts to harvest data can often cause delays as the scripts have to be investigated and tested. AWR reports, I’m sure you know, contain information about the behaviour and performance of the database between two points in time – and much of the data is averaged. For this reason, I always always ask for a set of multiple, consecutive reports covering the problematic period where each report covers no more than one hour. I’d prefer less than one hour per report, but the default configuration of Oracle results in hourly snapshots so that’s usually the limit… as it was for this customer.
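As an aside: if you have enough control over the environment, the snapshot interval itself can be shortened so that each report covers a smaller window. Here’s a minimal sketch, assuming SYSDBA access and that the extra AWR volume is acceptable (the 30-minute value is just an example):

# Reduce the AWR snapshot interval from the default 60 minutes to 30,
# leaving retention and other settings untouched.
sqlplus -s / as sysdba <<'EOF'
exec DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(interval => 30)
EOF

In this case, of course, we had to work with the hourly snapshots that were already in place.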
As a first pass, I like to run the AWR reports through my awr-parser.sh script. This parses them all and produces a CSV file as output, which can then be easily used to graph information such as the amount of time spent working on CPU versus time spent waiting on I/O requests to complete. Often this is enough to get a basic idea of what’s happening…
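If you don’t have the script to hand, the general idea is easy enough to sketch. The snippet below is only a rough illustration – the section names, field positions and file naming are assumptions based on 11g-style text reports, and the real awr-parser.sh handles far more than this:

# Rough sketch: pull DB CPU and total User I/O wait time out of a batch of
# text-format AWR reports and write them to a CSV for graphing.
# Section names and field positions are assumptions - adjust for your version.
echo "report,db_cpu_s,user_io_wait_s" > awr_summary.csv
for file in awrrpt_*.txt; do
  db_cpu=$(grep -A 20 "Time Model Statistics" "$file" \
             | awk '/^DB CPU/ {gsub(",","",$3); print $3; exit}')
  user_io=$(grep -A 20 "Foreground Wait Class" "$file" \
             | awk '/^User I\/O/ {gsub(",","",$5); print $5; exit}')
  echo "$file,${db_cpu:-0},${user_io:-0}" >> awr_summary.csv
done

The resulting CSV drops straight into a spreadsheet or gnuplot for graphing.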
In the resulting graph, each column shows data from a single (hourly) AWR report; blue represents DB CPU while red represents User I/O wait time. If we are looking to improve the user experience with low-latency storage then we would hope to find two things: a) lots of User I/O wait time, and b) average latency which is higher than we would expect a flash array to deliver.
Unfortunately, the first requirement isn’t really there in the above graph. There just isn’t enough red, except for peaks towards the end of the day where there appears to be some batch job or backup taking place. A peek at the AWR reports shows that this is in fact a scheduled job which performs an online rebuild of an index. With more capable storage we could make this go a lot faster, so would that perhaps help the customer?
Sadly, the answer was no, it wouldn’t. The customer specifically stated that it’s the online day where they struggle and not the index rebuild. Hmmm.
More Graphs
Let’s have a look at some of the other information that came out of the AWR reports to see if anything interesting shows up:
Plenty of spare CPU capacity here. In fact, if you think about how much Oracle licenses cost per CPU – and how much CPU is not being used… it looks like this database server represents a significant over-investment.
Not much in the way of I/O… there are spikes here and there, including those for the index rebuilds, but overall the I/O rates are very low and for many samples we barely scratch 10MiB/sec.
So what have we got? A system which hardly performs any I/O and doesn’t seem to spend any time waiting on I/O during the period the users are complaining about. Time to go back to the customer and ask for some more detail…
Second Attempt
After a plea for more details, I received another batch of AWR reports covering a single 24 hour period, as well as the crucial information that complaints were received between 5am and 6am. Perfect. Let’s have a look at that throughput graph again:
Again, the throughput is very low – including around the critical 5am-6am period where complaints were received – which might fool us into thinking any problems are not related to the storage. But what if we plot a graph of I/O Operations Per Second (IOPS)?
Bingo! There’s a big spike of read IOPS at exactly the time where problems were reported, which we can put down to a large number of random physical reads. We know from the customer that they have a multi-controller monolithic disk array – the top-of-the-range product from the world’s biggest disk array vendor. But it’s still a disk array, which means random I/O frequently results in high latency as somewhere a disk head has to move across the platter and then the platter has to rotate to the correct sector.
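As a side note, a rough per-snapshot read IOPS figure can be pulled straight from the text reports using the "physical read total IO requests" statistic. This is just a sketch – the column position is an assumption based on the usual Instance Activity Stats layout (Statistic / Total / per Second / per Trans):

# Sketch: print the per-second value of "physical read total IO requests"
# (i.e. read IOPS) for each report. The $(NF-1) field position assumes the
# Instance Activity Stats column layout described above.
for file in awrrpt_*.txt; do
  iops=$(grep "physical read total IO requests" "$file" | head -1 \
           | awk '{print $(NF-1)}')
  echo "$file,$iops"
done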
So we should be able to just check the average wait time for random physical reads (the wait event db file sequential read) and it will reveal itself as the problem. Right? We’ll expect to see waits in the region of 4-8ms or perhaps even worse if it’s really struggling.
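Pulling the per-report figure is easy enough with the same kind of one-liner used for the histograms later on – with the usual caveat that the section name and layout are assumptions which vary between Oracle versions:

# Sketch: show the "db file sequential read" line (waits, total wait time,
# average wait) from the Foreground Wait Events section of each report.
for file in awrrpt_*.txt; do
  grep -A 40 "Foreground Wait Events" "$file" \
    | grep "db file sequential read" | head -1
done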
Here’s a graph of the average user I/O wait time seen across each AWR report – don’t forget we’re especially interested in the troublesome 5am-6am period:
Not what we expected. The 5am-6am slot is showing the best latency figures for the entire day! And this is when users are complaining?
Averages Lie
A disk array like this usually has a reasonably large DRAM cache and quite possibly a set of SSDs being used for further caching or tiering of data. If a physical read can be satisfied from the cache, it will likely be serviced in under a millisecond. But if we experience a cache miss, we’re forced to pay the penalty of mechanical latency and wait for many milliseconds. Thus there are two ranges of response times: sub-millisecond and many milliseconds.
Average wait times won’t show this because they are simply the average of all waits over the period of the AWR report (an entire hour in this case). What if many of the reads are cache hits but some still miss? Perhaps there are enough misses to cause users to complain, yet enough hits to make the average look acceptable?
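To put some entirely made-up numbers on that (they are illustrative only, not taken from this customer’s system): suppose 90% of reads are 0.5ms cache hits and the remaining 10% are 8ms misses.

$ awk 'BEGIN { avg = 0.9*0.5 + 0.1*8; printf "average wait = %.2f ms\n", avg }'
average wait = 1.25 ms

An average of 1.25ms looks perfectly healthy, yet one read in ten is paying the full price of mechanical latency – more than enough to make a busy ERP screen feel sluggish.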
Luckily, there is another section in the AWR report which can help here: the Wait Event Histogram. Here are the histograms covering db file sequential read waits (random physical reads) for our 24 hour period:
$ cat header.txt; for file in `ls -1`; do grep -A 40 "Wait Event Histogram" $file | grep "db file sequential read" | head -1; done

                                                    % of Waits
                                 -----------------------------------------------
                           Total
Event                      Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
db file sequential read     3329  99.2    .8    .0    .0    .0    .0    .0    .0
db file sequential read     3487  99.4    .6    .0    .0
db file sequential read    35.4K  21.0   1.2   2.5  23.2  39.0  11.9   1.1
db file sequential read    82.9K  34.6   1.8   3.4  15.9  31.9  11.2   1.2
db file sequential read    428.6  67.4   2.1   2.5  10.1  15.2   2.4    .3
db file sequential read    4440.  92.6   1.3   1.2   1.9   2.6    .4    .0
db file sequential read    2013K  91.7   1.4   1.2   2.1   3.0    .5    .0
db file sequential read    131.7  45.7   1.1   2.8  15.6  29.1   5.4    .3
db file sequential read    76.3K  20.7   1.7   2.8  21.7  44.0   8.7    .4
db file sequential read    83.7K  31.1   1.1   2.4  16.6  38.1  10.0    .7
db file sequential read    430.4  79.6   1.5   2.0   4.4   9.3   2.9    .3
db file sequential read    106.6  32.3   1.7   3.0  15.4  36.1  10.3   1.2
db file sequential read      94K  23.8   1.8   2.7  16.6  41.7  12.1   1.2
db file sequential read    248.6  36.0   1.8   2.9  14.8  34.5   8.9   1.1
db file sequential read    202.5  31.9   1.4   2.0   9.0  38.0  16.1   1.6
db file sequential read    237.7  20.2   1.9   2.9  17.7  43.5  12.5   1.2
db file sequential read    49.5K   9.4   1.0   2.7  24.1  48.0  13.6   1.1
db file sequential read    75.6K  42.4   1.5   3.0  16.9  29.9   5.9    .5
db file sequential read    88.7K  61.4   1.3   2.5  12.4  18.2   3.8    .4
db file sequential read    644.3  84.3   1.0   1.5   4.3   7.3   1.5    .2
db file sequential read    61.3K  34.9   6.9   4.4  11.0  31.2  10.5   1.1
db file sequential read      330  96.7   3.1    .2    .0    .0
db file sequential read      12K  88.7   9.5   1.2    .4    .2    .1    .0    .0
Now the problem becomes clear. Although many reads are hitting the cache (<1ms bucket), there are an awful lot that are missing. The <2ms and <4ms buckets are virtually empty and then around <8ms we start to see big numbers again. Those are cache misses which incur the penalty of seek time on a hard disk drive somewhere in the array.
What’s more, it’s clear that the array is under some pressure because in many of the AWR reports there are substantial waits in the <16ms and above buckets. Look at the rows where over one third of read requests took over 8ms – that’s exactly the sort of behaviour that is making people replace disks with flash.
Let’s visualise the above data. I’ve gone back to the older set of AWR reports and graphed the histogram buckets to see if we can get a feel for the distribution of waits:
As you can see, I’ve tried to go for a heat map-style colour scheme, with green indicating cache hits on the storage array and the colours moving from yellow through orange to red indicating cache misses which result in disk drive seek time. There are also some real outliers in black where requests took over one thousand milliseconds to be serviced (ouch).
Another way to look at it is to make the picture clearer by reducing the detail. In the graph below I’ve simply assumed that everything sub-millisecond is a storage array cache hit and everything else is a miss:
Clearly there are a lot of physical reads paying the price of mechanical latency here.
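Incidentally, that crude hit/miss split can be reproduced directly from the histogram lines themselves. The following is just a sketch which assumes the same layout as the output shown earlier (four words of event name, then Total Waits in field 5, then the <1ms percentage in field 6):

# Sketch: treat the <1ms bucket as a storage array cache hit and everything
# else as a miss. Field $6 is the <1ms percentage in the layout shown above.
for file in awrrpt_*.txt; do
  grep -A 40 "Wait Event Histogram" "$file" \
    | grep "db file sequential read" | head -1 \
    | awk -v f="$file" '{printf "%s: hits %.1f%%, misses %.1f%%\n", f, $6, 100-$6}'
done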
Conclusion
What this customer does next is up for debate – there are more than just technical issues to consider. But the point of this post is to illustrate that while average wait times are great for getting a rough feel of how a system is behaving, they also offer huge scope to lose details. And sometimes those details are important. The devil, as they say, is truly in the detail…
Update: January 2015 – I have now added the ability to read Wait Event Histograms to my awr-parser.sh script which you can find on my Useful Scripts page…