Search This Blog

Thursday, July 12, 2012

No, It is not always the database fault..

When things are starting to get wrong, Database ( i.e DBA ) is always the first tier to blame. I am sure you have faced this situation, probably much more than once or twice.
Guilty as charged, until proven innocent, right?

Well, it is not that nothing can't go wrong in the database, but I wish it would stop be treated as the black sheep of the family.

Recently, I had been in a situation where application was constantly complaining for bad response time, day after day. First thing to explore was whether application changed its behaviour, since most of the time that is the cause for performance issues. Surprisingly enough, it did not. No new SQLs, no major changes in execution count.
Second, I went to look for database changing its behaviour. But it didn't. No changes in execution plan (usually major cause for performance degradation). As much as I have looked, from the database side, it looked like database didn't change its behavior.

So what could be wrong?

Storage can go wrong for example, and that was the case here.

Looking for problems in the wrong place will not assist in solving them. But just saying "it is not the database" probably won't help. You will probably always have to prove it...
Average physical read/write time is what oracle experiencing when dealing with IO. But underneath these numbers, two more tiers are involved. Storage and File System.

Well, after checking average physical read time and average physical write time of several files in this database, it was clear (unfortunately, just to me for a while) that writes became much slower, at exactly the time when the application have started to complain.

Don't think it was easy. For many days Storage guys continues to say "Nothing is wrong with the storage" (how many times have you heard this? I wonder), but finally, when numbers became so high, it couldn't be ignored any more.

At last, spotlight went to spot on the right place (and stopped pointing to the wrong place, the database). Well, probably just until the next complaint will rise again...but I guess I can't win it all..

Here is a query that can be run in Oracle 10g and above. It might look scary, but in short, it's just go over dba_hist_filestatxs (where file statistics are held overtime), and since numbers there are incremented every snapshot (it doesn't hold the deltas), I have used LAG analytic function to subtract current value from preceding value (i.e, current snapshot from preceding snapshot), and I did it for every statistics I wanted to observe.
That's it.


select *   from (
   select
      b.begin_interval_time begin_interval_time ,  
      nullif((phywrts - ( lag(phywrts,1,0) over( order by a.snap_id ) )),0) delta_phywrts,
      trunc( (writetim*10 - ( lag(writetim*10,1,0) over( order by a.snap_id ) ) ) /  
             nullif((phywrts - ( lag(phywrts,1,0) over( order by a.snap_id ) )),0)  ,3 ) avg_write_ms,
      nullif((phyrds - ( lag(phyrds,1,0) over( order by a.snap_id ) )),0)   delta_phyrds,
      trunc( (readtim*10 - ( lag(readtim*10,1,0) over( order by a.snap_id ) ) ) /  
             nullif((phyrds - ( lag(phyrds,1,0) over( order by a.snap_id ) )),0)  ,3 ) avg_read_ms,
      nullif((singleblkrds - ( lag(singleblkrds,1,0) over( order by a.snap_id ) )),0) delta_singleblkrds,
      trunc( (singleblkrdtim*10 - ( lag(singleblkrdtim*10,1,0) over( order by a.snap_id ) ) ) /  
             nullif((singleblkrds - ( lag(singleblkrds,1,0) over( order by a.snap_id ) ))  ,0),3 ) avg_singleblkrds_ms
   from dba_hist_filestatxs a , 
   dba_hist_snapshot b 
   where filename = 'my_file.dbf'
   and  a.snap_id=b.snap_id  
  )
 where to_char(begin_interval_time,'HH24') between 7 and 11  -- in case you wish to see only specific hours
 and to_char(begin_interval_time,'HH24') not in ('FRIDAY','SATURDAY ') -- in case you wish to filter nonworking days
  order by begin_interval_time;




Below is part of the real life output, where major changes in average write time can clearly be seen, especially since 24.5. I have taken statistics of a specific data file, each time between 09:30-12:00, in normal working days (not in a weekend). Snapshot interval is 15 minutes. (For simplification, although above query shows reads and writes statistics, below I am showing only writes, since this is the issue relevant for this post).

on 7.5, average write time was between 2-7 ms.
on 16.5, average write time was between 6-11 ms.
on 21.5, average write time was between 6-14 ms.
on 24.5, average write time was between 12-21 ms.
on 13.6, average write time was very very bad. around 40 ms write time. This is 10 times slower than a month ago!
Finally, on 6/7, average write time went back to normal. went down to around 5 ms.

Pay attention to the increasing values of the 3rd column, avg_write_ms, which shows the average physical write time of a specific datafile, for every snapshot.

BEGIN_INTERVAL_TIMEDELTA_PHYWRTSAVG_WRITE_MS
============================================
07-MAY-12 09:30:0519703.395
07-MAY-12 09:45:1020074.489
07-MAY-12 10:00:1216792.829
07-MAY-12 10:15:1721203.037
07-MAY-12 10:30:1923552.169
07-MAY-12 10:45:2316092.094
07-MAY-12 11:00:247507.533
07-MAY-12 11:15:267372.713
07-MAY-12 11:30:287793.465
07-MAY-12 11:45:296252.448
16-MAY-12 09:30:485726.73
16-MAY-12 09:45:496455.395
16-MAY-12 10:00:515768.072
16-MAY-12 10:15:5272010.513
16-MAY-12 10:30:5462211.495
16-MAY-12 10:45:557457.208
16-MAY-12 11:01:006929.205
16-MAY-12 11:15:0192010.793
16-MAY-12 11:30:038398.116
16-MAY-12 11:45:057558.172
21-MAY-12 09:30:5421568.409
21-MAY-12 09:45:57201614.627
21-MAY-12 10:00:59187710.436
21-MAY-12 10:15:0322246.753
21-MAY-12 10:30:06114512.043
21-MAY-12 10:45:1086912.105
21-MAY-12 11:00:1277411.989
21-MAY-12 11:15:148578.191
21-MAY-12 11:30:167378.71
21-MAY-12 11:45:1882612.421
24-MAY-12 09:30:2766114.19
24-MAY-12 09:45:2971512.881
24-MAY-12 10:00:3080015.637
24-MAY-12 10:15:3271612.835
24-MAY-12 10:30:3389519.307
24-MAY-12 10:45:3779919.148
24-MAY-12 11:00:3871621.983
24-MAY-12 11:15:4184121.878
24-MAY-12 11:30:4472415.138
24-MAY-12 11:45:4665315.773
13-JUN-12 09:30:152460824.476
13-JUN-12 09:45:172424931.427
13-JUN-12 10:00:212016439.09
13-JUN-12 10:15:231980738.68
13-JUN-12 10:30:241957238.088
13-JUN-12 10:45:261881241.594
13-JUN-12 11:00:291900340.993
13-JUN-12 11:15:332166044.534
13-JUN-12 11:30:342324238.171
13-JUN-12 11:45:362105542.248
06-JUL-12 09:30:5030576.319
06-JUL-12 09:45:5134065.733
06-JUL-12 10:00:5238808.103
06-JUL-12 10:15:5343245.381
06-JUL-12 10:30:5454094.906
06-JUL-12 10:45:5527145.036
06-JUL-12 11:00:5627234.697
06-JUL-12 11:15:5724347.781
06-JUL-12 11:30:5838044.91
06-JUL-12 11:45:5922435.737


Ladies and gentlemen of the jury, what is your verdict?
We have found the defendant NOT GUILTY, your honor.

To sum it all up, as stated in the beginning, Database is not always to blame. This time the jury found it innocent.
:)
First thing to check is the application. If you see that applications is behaving more or less the same (no new SQLs, same execution count per SQL), check the database.
And if the database is behaving more or less the same, it is simply not there.
Something else is the source of the problem.

As always, comments are most welcome..


Merav Kedem,
Founder of
DB OPTimize
Oracle Performance Tuning & DBA Consulting


1 comment:

  1. It is very simple, Applications are rarely correctly instrumented. Operating systems are barely instrumented and networks less so.

    Thus the well instrumented Database in between can become a useful source of information about what goes on above and bellow.

    The simple reason that the DBA gets kicked so often is because the apps guys probably can't answer questions about their own application!

    We need to be immensely clear, just because the Database recorded the problem that doesn't mean it caused the problem!

    ReplyDelete