DB OPTimize

No, It is not always the database fault..

When things are starting to get wrong, Database ( i.e DBA ) is always the first tier to blame. I am sure you have faced this situation, probably much more than once or twice.
Guilty as charged, until proven innocent, right? 

Well, it is not that nothing can’t go wrong in the database, but I wish it would stop be treated as the black sheep of the family.

Recently, I had been in a situation where application was constantly complaining for bad response time, day after day. First thing to explore was whether application changed its behaviour, since most of the time that is the cause for performance issues. Surprisingly enough, it did not. No new SQLs, no major changes in execution count.
Second, I went to look for database changing its behaviour. But it didn’t. No changes in execution plan (usually major cause for performance degradation). As much as I have looked, from the database side, it looked like database didn’t change its behavior.

So what could be wrong?

Storage can go wrong for example, and that was the case here.

Looking for problems in the wrong place will not assist in solving them. But just saying “it is not the database” probably won’t help. You will probably always have to prove it…
Average physical read/write time is what oracle experiencing when dealing with IO. But underneath these numbers, two more tiers are involved. Storage and File System.

Well, after checking average physical read time and average physical write time of several files in this database, it was clear (unfortunately, just to me for a while) that writes became much slower, at exactly the time when the application have started to complain.

Don’t think it was easy. For many days Storage guys continues to say “Nothing is wrong with the storage” (how many times have you heard this? I wonder), but finally, when numbers became so high, it couldn’t be ignored any more.

At last, spotlight went to spot on the right place (and stopped pointing to the wrong place, the database). Well, probably just until the next complaint will rise again…but I guess I can’t win it all..

Here is a query that can be run in Oracle 10g and above. It might look scary, but in short, it’s just go over dba_hist_filestatxs (where file statistics are held overtime), and since numbers there are incremented every snapshot (it doesn’t hold the deltas), I have used LAG analytic function to subtract current value from preceding value (i.e, current snapshot from preceding snapshot), and I did it for every statistics I wanted to observe.
That’s it.

select *   from (
select
b.begin_interval_time begin_interval_time ,
nullif((phywrts – ( lag(phywrts,1,0) over( order by a.snap_id ) )),0) delta_phywrts,
trunc( (writetim*10 – ( lag(writetim*10,1,0) over( order by a.snap_id ) ) ) /
nullif((phywrts – ( lag(phywrts,1,0) over( order by a.snap_id ) )),0)  ,3 ) avg_write_ms,
nullif((phyrds – ( lag(phyrds,1,0) over( order by a.snap_id ) )),0)   delta_phyrds,
trunc( (readtim*10 – ( lag(readtim*10,1,0) over( order by a.snap_id ) ) ) /
nullif((phyrds – ( lag(phyrds,1,0) over( order by a.snap_id ) )),0)  ,3 ) avg_read_ms,
nullif((singleblkrds – ( lag(singleblkrds,1,0) over( order by a.snap_id ) )),0) delta_singleblkrds,
trunc( (singleblkrdtim*10 – ( lag(singleblkrdtim*10,1,0) over( order by a.snap_id ) ) ) /
nullif((singleblkrds – ( lag(singleblkrds,1,0) over( order by a.snap_id ) ))  ,0),3 ) avg_singleblkrds_ms
from dba_hist_filestatxs a ,
dba_hist_snapshot b
where filename = ‘my_file.dbf’
and  a.snap_id=b.snap_id
)
where to_char(begin_interval_time,’HH24′) between 7 and 11  — in case you wish to see only specific hours
and to_char(begin_interval_time,’HH24′) not in (‘FRIDAY’,’SATURDAY ‘) — in case you wish to filter nonworking days
order by begin_interval_time;

Below is part of the real life output, where major changes in average write time can clearly be seen, especially since 24.5. I have taken statistics of a specific data file, each time between 09:30-12:00, in normal working days (not in a weekend). Snapshot interval is 15 minutes. (For simplification, although above query shows reads and writes statistics, below I am showing only writes, since this is the issue relevant for this post).

on 7.5, average write time was between 2-7 ms.
on 16.5, average write time was between 6-11 ms.
on 21.5, average write time was between 6-14 ms.
on 24.5, average write time was between 12-21 ms.
on 13.6, average write time was very very bad. around 40 ms write time. This is 10 times slower than a month ago!
Finally, on 6/7, average write time went back to normal. went down to around 5 ms.

Pay attention to the increasing values of the 3rd column, avg_write_ms, which shows the average physical write time of a specific datafile, for every snapshot.

BEGIN_INTERVAL_TIME DELTA_PHYWRTS AVG_WRITE_MS
=================== ============= ============
07-MAY-12 09:30:05 1970 3.395
07-MAY-12 09:45:10 2007 4.489
07-MAY-12 10:00:12 1679 2.829
07-MAY-12 10:15:17 2120 3.037
07-MAY-12 10:30:19 2355 2.169
07-MAY-12 10:45:23 1609 2.094
07-MAY-12 11:00:24 750 7.533
07-MAY-12 11:15:26 737 2.713
07-MAY-12 11:30:28 779 3.465
07-MAY-12 11:45:29 625 2.448
16-MAY-12 09:30:48 572 6.73
16-MAY-12 09:45:49 645 5.395
16-MAY-12 10:00:51 576 8.072
16-MAY-12 10:15:52 720 10.513
16-MAY-12 10:30:54 622 11.495
16-MAY-12 10:45:55 745 7.208
16-MAY-12 11:01:00 692 9.205
16-MAY-12 11:15:01 920 10.793
16-MAY-12 11:30:03 839 8.116
16-MAY-12 11:45:05 755 8.172
21-MAY-12 09:30:54 2156 8.409
21-MAY-12 09:45:57 2016 14.627
21-MAY-12 10:00:59 1877 10.436
21-MAY-12 10:15:03 2224 6.753
21-MAY-12 10:30:06 1145 12.043
21-MAY-12 10:45:10 869 12.105
21-MAY-12 11:00:12 774 11.989
21-MAY-12 11:15:14 857 8.191
21-MAY-12 11:30:16 737 8.71
21-MAY-12 11:45:18 826 12.421
24-MAY-12 09:30:27 661 14.19
24-MAY-12 09:45:29 715 12.881
24-MAY-12 10:00:30 800 15.637
24-MAY-12 10:15:32 716 12.835
24-MAY-12 10:30:33 895 19.307
24-MAY-12 10:45:37 799 19.148
24-MAY-12 11:00:38 716 21.983
24-MAY-12 11:15:41 841 21.878
24-MAY-12 11:30:44 724 15.138
24-MAY-12 11:45:46 653 15.773
13-JUN-12 09:30:15 24608 24.476
13-JUN-12 09:45:17 24249 31.427
13-JUN-12 10:00:21 20164 39.09
13-JUN-12 10:15:23 19807 38.68
13-JUN-12 10:30:24 19572 38.088
13-JUN-12 10:45:26 18812 41.594
13-JUN-12 11:00:29 19003 40.993
13-JUN-12 11:15:33 21660 44.534
13-JUN-12 11:30:34 23242 38.171
13-JUN-12 11:45:36 21055 42.248
06-JUL-12 09:30:50 3057 6.319
06-JUL-12 09:45:51 3406 5.733
06-JUL-12 10:00:52 3880 8.103
06-JUL-12 10:15:53 4324 5.381
06-JUL-12 10:30:54 5409 4.906
06-JUL-12 10:45:55 2714 5.036
06-JUL-12 11:00:56 2723 4.697
06-JUL-12 11:15:57 2434 7.781
06-JUL-12 11:30:58 3804 4.91
06-JUL-12 11:45:59 2243 5.737

Ladies and gentlemen of the jury, what is your verdict?
We have found the defendant NOT GUILTY, your honor.
To sum it all up, as stated in the beginning, Database is not always to blame. This time the jury found it innocent.
🙂
First thing to check is the application. If you see that applications is behaving more or less the same (no new SQLs, same execution count per SQL), check the database.
And if the database is behaving more or less the same, it is simply not there.
Something else is the source of the problem.

As always, comments are most welcome..

Merav Kedem,
Founder of
DB OPTimize
Oracle Performance Tuning & DBA Consulting