If you’ve been unfortunate enough to follow my dithering on Twitter recently you’ll know that I’ve been lurching between thinking that there is and isn’t a problem with Oracle’s tracking of I/O statistics in its AWR reports.
I’m now convinced there is a problem, but I can’t work out what causes it… so step 1 is to describe the problem here, after which step 2 will probably be to sit back and hope someone far more intelligent than me will solve it.
But first some background:
AWR I/O Statistics
I’ve written about the I/O statistics contained in Oracle AWR Reports before, so I won’t repeat myself too much other than to highlight two critical areas which we’ll focus on today: Instance Statistics and IOStat Summaries. By the way, the format of AWR reports changed in 11.2.0.4 and 12c to include a new IO Profile section, but today we’re covering reports from 11.2.0.3.
First of all, here are some sensible statistics. I’m going to show you the IOStat by Function Summary section of a report from a real life database:
IOStat by Function summary -> 'Data' columns suffixed with M,G,T,P are in multiples of 1024 other columns suffixed with K,M,G,T,P are in multiples of 1000 -> ordered by (Data Read + Write) desc Reads: Reqs Data Writes: Reqs Data Waits: Avg Function Name Data per sec per sec Data per sec per sec Count Tm(ms) --------------- ------- ------- ------- ------- ------- ------- ------- ------- Buffer Cache Re 14.9G 504.1 4.22491 0M 0.0 0M 1690.3K 0.0 Direct Reads 12.2G 3.5 3.45612 0M 0.0 0M 0 N/A DBWR 0M 0.0 0M 7.9G 186.6 2.25336 0 N/A Others 4.2G 12.1 1.18110 3.6G 2.5 1.03172 41.9K 0.1 LGWR 1M 0.0 .000277 5.8G 11.9 1.65642 17.4K 1.0 Direct Writes 0M 0.0 0M 4M 0.0 .001110 0 N/A TOTAL: 31.2G 519.7 8.86241 17.4G 201.0 4.94263 1749.6K 0.0 -------------------------------------------------------------
This section of the report is breaking down all I/O operations into the functions that caused them (e.g. buffer cache reads, direct path reads or writes, redo log writes by LGWR, the writing of dirty buffers by DBWR and so on). Let’s ignore that level of detail now and just focus on the TOTAL row at the bottom.
To try and make this simple to describe, I’ve gone a bit crazy with the colours. In green I’ve highlighted the labels for reads and writes – and now let’s walk through the columns:
- The first column is the function name, but we’re just focussing on TOTAL
- The second column is the total amount of reads that happened in this snapshot: 31.2GB
- The third column is the average number of read requests per second, i.e. 519.7 read IOPS
- The fourth column is the average volume of data read per second, i.e. 8.86 MB/sec read throughput
- The fifth column is the total amount of writes that happened in this snapshot: 17.4GB
- The sixth column is the average number of write requests per second, i.e. 201 write IOPS
- The seventh column is the average volume of data written per second, i.e. 4.94 MB/sec write throughput
- The eighth and ninth columns are not of interest to us here
The whole section is based on the DBA_HIST_IOSTAT_FUNCTION view. What we care about today is the IOPS figures (shown in red) and the Throughput figures (shown in blue). Pay attention to the comments in the view header which show that Data columns (including throughput) are multiples of 1024 while other columns (including IOPS) are multiples of 1000. It’s interesting that the two throughput values are obviously measured in MB/sec and yet are missing the “M” suffix – I assume this “falls off the end” of the column because of the number of decimal places displayed.
Now that we have these figures explained, let’s compare them to what we see in the Instance Activity Stats section of the same AWR report:
Instance Activity Stats -> Ordered by statistic name Statistic Total per Second per Trans -------------------------------- ------------------ -------------- ------------- physical read IO requests 1,848,576 513.3 121.5 physical read bytes 29,160,300,544 8,096,201.8 1,917,053.5 physical read total IO requests 1,892,366 525.4 124.4 physical read total bytes 33,620,594,176 9,334,578.5 2,210,281.7 physical read total multi block 17,096 4.8 1.1 physical reads 3,559,607 988.3 234.0 physical write IO requests 671,728 186.5 44.2 physical write bytes 8,513,257,472 2,363,660.5 559,677.7 physical write total IO requests 723,348 200.8 47.6 physical write total bytes 18,657,198,080 5,180,071.5 1,226,559.6 physical write total multi block 26,192 7.3 1.7 physical writes 1,039,216 288.5 68.3
For this section we care about the per Second column because both IOPS and Throughput are measured using units per second. For both reads and writes there are two sets of statistics: those with the word total in them and those without. You can find the full description of 11gR2 statistics in the documentation, but the difference between the two is best summed up by this snippet which describes physical read total bytes:
Total size in bytes of disk reads by all database instance activity including application reads, backup and recovery, and other utilities. The difference between this value and “physical read bytes” gives the total read size in bytes by non-application workload.
I’ve underlined non-application workload here because this is critical. If you merely look at the Load Profile section at the top of an AWR report you will only see values for “application workload” I/O but this does not include stuff like RMAN backups, archive logging and so on… important stuff if you care about the actual I/O workload. For this reason, we only care about the following four statistics:
- physical read total IO requests (per second) = the average number of read IOPS
- physical read total bytes (per second) = the average read throughput in bytes per second
- physical write total IO requests (per second) = the average number of write IOPS
- physical write total bytes (per second) = the average write throughput in bytes per second
Again I’ve coloured the IOPS figures in red and the throughput figures in blue.
Tying It Together
Now that we have our two sets of values, let’s just compare them to make sure they align. The IOPS figures do not require any conversion but the throughput figures do: the values in the Instance Activity Stats section are in bytes/second and we want them to be in MB/second so we need to divide by 1048576 (i.e. 1024 * 1024).
IOStat by Function |
Instance Activity Stats |
Error Percentage |
|
Read IOPS |
519.7 |
525.4 |
1.09% |
Write IOPS |
201.0 |
200.8 |
0.10% |
Read Throughput |
8.86241 MB/sec |
9,334,578.5 bytes/sec 8.90215 MB/sec |
0.45% |
Write Throughput |
4.94263 MB/sec |
5,180,071.5 bytes/sec 4.94010 MB/sec |
0.05% |
I’ve calculated the error percentages here to see how far the figures vary. It is my assumption that the Instance Activity Stats are accurate and that any margin of error in the IOStat figures comes as a result of sampling issues. The highest error percentage we see here is just over 1%, which is hardly a problem in my opinion.
Don’t Believe The Stats
So far I have no complaints about the matching of statistics in the AWR report. But now let me introduce you to the AWR report that has been puzzling me recently:
IOStat by Function summary -> 'Data' columns suffixed with M,G,T,P are in multiples of 1024 other columns suffixed with K,M,G,T,P are in multiples of 1000 -> ordered by (Data Read + Write) desc Reads: Reqs Data Writes: Reqs Data Waits: Avg Function Name Data per sec per sec Data per sec per sec Count Tm(ms) --------------- ------- ------- ------- ------- ------- ------- ------- ------- Direct Reads 3.8T 2.7E+04 3.2G 405M 3.6 .335473 0 N/A Direct Writes 0M 0.0 0M 164.5G 1159.0 139.519 0 N/A Buffer Cache Re 63.2G 3177.6 53.6086 0M 0.0 0M 3292.5K 23.4 DBWR 0M 0.0 0M 2G 45.3 1.71464 0 N/A LGWR 0M 0.0 0M 485M 19.2 .401739 19.7K 0.9 Others 220M 12.3 .182232 26M 0.7 .021536 15.6K 34.5 Streams AQ 0M 0.0 0M 0M 0.0 0M 1 9.0 TOTAL: 3.9T 3.0E+04 3.3G 167.4G 1227.8 141.992 3327.8K 23.3 ------------------------------------------------------------- Instance Activity Stats -> Ordered by statistic name Statistic Total per Second per Trans -------------------------------- ------------------ -------------- ------------- physical read total IO requests 35,889,568 29,728.4 3,239.4 physical read total bytes 4,261,590,852,608 3.52999864E+09 3.8465483E+08 physical write total IO requests 1,683,381 1,394.4 151.9 physical write total bytes 205,090,714,624 169,882,555.1 18,511,663.0
Again I’ve coloured the IOPS measurements in red and the throughput measurements in blue. And as before we need to convert the bytes per second values shown in the Instance Activity Stats section to MB/sec as shown in the IOStat by Function Summary section.
IOStat by Function |
Instance Activity Stats |
Error Percentage |
|
Read IOPS |
30,000 |
29,728.4 |
0.91% |
Write IOPS |
1227.8 |
1,394.4 |
11.95% |
Read Throughput |
3.3 GB/sec 3,379.2 MB/sec |
3,529,998,640 bytes/sec 3,366.469 MB/sec |
0.38% |
Write Throughput |
141.992 MB/sec |
169,882,555.1 bytes/sec 162.01263 MB/sec |
12.36% |
Do you see what’s bugging me here? The write values for both IOPS and throughput are out by over 10% when I compare the values in IOStat by Function against the Instance Activity Stats. Ten percent is a massive margin of error at this level – we’re talking 20MB/sec. To translate that into something easier to understand, if 20MB/sec were sustained over a 24 hour period it would amount to over 1.6TB of data. I’ve seen smaller data warehouses.
So why is this happening? Unfortunately I don’t have access to the system where this AWR report was created, so I cannot tell if, for example, there was an instance restart between the start and end snapshots (although the elapsed time of the report was just 20 minutes so it seems unlikely).
The truth is I don’t know. Which is why I’m writing about it here… if you think you have the answer, or just as importantly if you see the same behaviour, let me know!
Filed under: AWR Reports, Blog, Database, Storage Tagged: AWR reports, database, oracle