Your database history is precious. Keeping it is crucial. Keep it as if it were a valuable treasure.
And why is that? Because it holds both the normal and the abnormal behaviour of your database, and the two are equally important.
It will save you, over and over again. That’s a guarantee.
Your database history keeps information about everything that goes well and everything that goes wrong (which will assist in the detection stage).
And it gives you the ability to return to the normal, desired behaviour (which will assist in fixing the problem).
Oracle changes SQL execution plans. By now, it's practically an axiom. A well-known fact. Like knowing the sun will shine in the morning, but with a bad connotation.
It happens all the time, even if you don’t want it to, and it causes severe problems.
The bad effects are wide-ranging: online transactions suddenly get stuck; DW reports suddenly finish hours later than they should; night processes leak into morning hours, overloading the system. And this time it is the database's fault (as opposed to a previous post I once wrote, "No, it is not always the database fault").
How to detect things that suddenly go wrong?
First, keep your history.
Oracle does it by default, keeping 8 days of performance history in its AWR tables at a 1-hour interval (note that using the AWR tables requires the Diagnostics Pack license).
Eight days of history is not enough, as proven many times in my experience. You should increase it to at least 2-3 months.
And why is that? Because, for instance, there are processes which run only once a month, and in order to compare the current month's execution behaviour to last month's, or even to two months ago, you have to keep old enough history.
I would say that in most cases keeping a history of somewhere between 70-100 days is enough.
I also prefer to change the default interval from 1 hour to 15 minutes. It allows you to detect issues faster; otherwise you need to wait until a full hour has passed before the last hour's history is written to the database.
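Before changing anything, it's also worth checking how far back your history actually goes right now. A quick sketch, using the standard AWR snapshot view dba_hist_snapshot:

select min(begin_interval_time) oldest_snapshot,
       max(end_interval_time) newest_snapshot,
       count(*) snapshot_count
from dba_hist_snapshot;

If oldest_snapshot is only about a week old, you are running with the defaults.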
So how can you see your current retention period (how long your history is kept) and your snapshot interval?
By running the following query:
select extract( day from snap_interval) * 24 * 60 +
       extract( hour from snap_interval) * 60 +
       extract( minute from snap_interval ) interval_in_minutes,
       extract( day from retention) retention_period_in_days,
       extract( day from retention) * 24 * 60 +
       extract( hour from retention) * 60 +
       extract( minute from retention ) retention_period_in_minutes
from dba_hist_wr_control;
This is what you will see in case you haven’t changed Oracle’s default setting:
And how do you change it to something else? By using dbms_workload_repository.modify_snapshot_settings:
This will change the interval to 15 minutes, and the history period to 70 days (as 70 days * 24 hours * 60 minutes = 100800 minutes):
execute dbms_workload_repository.modify_snapshot_settings (interval => 15, retention => 100800);
This is what you see after the change:
OK. Now you are set. This was the first step to success. That was easy.
The next steps are to find the abnormal behaviour, and then fix it. This can be useful in several situations: someone is complaining right now that things are not working as they should; a complaint requires checking something in the past; the DBA proactively looks for performance issues; and so on.
The following query can show you the top SQLs running in your database for the last 5 minutes:
select a.sql_id, b.sql_text, count(*) samples
from v$active_session_history a,
     v$sqlarea b
where sample_time between sysdate - 5/60/24 and sysdate
and a.sql_id = b.sql_id (+)
and a.session_type != 'BACKGROUND'
and a.sql_id is not null
group by a.sql_id, b.sql_text
order by count(a.sql_id) desc;
After spotting a problematic sql_id, the query below will show its current performance information, such as: how many executions ran, the average execution time, how many rows are returned on average, when it was last active, etc.
select a.sql_id,
       a.plan_hash_value,
       a.executions,
       trunc(a.elapsed_time/1000000/a.executions,5 ) avg_exec_sec,
       trunc(a.rows_processed/a.executions,2 ) avg_rows,
       a.last_active_time
from v$sql a
where a.sql_id = '&sql_id'   -- the problematic sql_id found above
and a.executions > 0;
Then, you can explore the historical behaviour of the same problematic SQL:
select b.begin_interval_time ,
       a.snap_id,
       a.plan_hash_value,
       a.executions_delta,
       trunc( a.elapsed_time_delta/1000000/a.executions_delta,5) avg_exec_sec,
       trunc( a.rows_processed_delta/a.executions_delta,2) rows_avg,
       trunc( a.buffer_gets_delta/a.executions_delta,2) buff_avg,
       trunc( a.elapsed_time_delta/1000000,2) total_elapsed_sec
from dba_hist_sqlstat a ,
     dba_hist_snapshot b
where a.snap_id = b.snap_id
and a.instance_number = b.instance_number
and a.sql_id = '&sql_id'     -- the problematic sql_id found above
and a.executions_delta > 0
order by a.snap_id desc;
This query shows what the performance behaviour of this SQL was over time, within each snapshot interval.
As my interval was set to 15 minutes, I have performance statistics for every 15 minutes. For each snapshot (a 15-minute interval) I can see which execution plan was used, how many executions ran, the average elapsed time, how many rows were returned, how many buffers were read, the total time it consumed, and more.
These are the results of an online query constantly changing its execution plan (a very undesired behaviour):
Now I can easily detect abnormal behaviour of this SQL.
I can tell that a severe performance issue started at 12:30 and ended around 14:00. I can also say that the problem occurred when the execution plan changed, and Oracle chose a bad execution plan for this SQL. During this hour and a half, average execution time went up from a few milliseconds (and even less) to somewhere around 2 seconds. Multiply this bad average execution time by thousands of executions (as seen in the executions column) and you can clearly see the problem.
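A quick way to spot this kind of plan instability is to list every plan the SQL has used and when. This is a sketch along the same lines, using the same AWR views and &sql_id substitution as above:

select a.plan_hash_value,
       min(b.begin_interval_time) first_seen,
       max(b.begin_interval_time) last_seen,
       sum(a.executions_delta) total_executions
from dba_hist_sqlstat a,
     dba_hist_snapshot b
where a.snap_id = b.snap_id
and a.instance_number = b.instance_number
and a.sql_id = '&sql_id'     -- the problematic sql_id found above
group by a.plan_hash_value
order by first_seen;

More than one plan_hash_value for the same SQL over a short period is exactly the red flag we are looking for.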
Let's put into words what we see in the above results:
At 24-NOV-15 12.45 (i.e. 12:45-13:00, the line marked in red), this SQL was executed 7429 times. Oracle ran it with a particular execution plan: plan_hash_value=4279890028. It returned an average of 1 row per execution, the average execution time was 1.95 seconds, and the total time those 7429 executions consumed between 12:45-13:00 was 14519 seconds (~4 hours).
WOW!!! That’s a lot.
In contrast, at 25-NOV-15 10.15 (i.e. 10:15-10:30, the line marked in green), the same SQL ran 8098 times. This time Oracle ran it with a different execution plan, plan_hash_value=2391881619. It also returned an average of 1 row per execution. Nevertheless, the average execution time was 0.00081 seconds, and the total time those 8098 executions consumed between 10:15-10:30 was only 6.6 seconds.
WOW again. What a huge difference!
Same interval (15 minutes), almost the same number of executions (7429 vs. 8098), but far from the same average time (1.95 sec vs. 0.00081 sec), and hence far from the same total time (14519 sec vs. 6.6 sec).
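You can verify the arithmetic directly (a throwaway check, not part of the DBA workflow):

select 7429 * 1.95 bad_total_sec,     -- ~14487 sec, matching the ~14519 sec reported
       8098 * 0.00081 good_total_sec  -- ~6.6 sec
from dual;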
This turned into a huge performance issue at my customer's site. The application was stuck and couldn't progress as it should.
Now you probably ask yourself how to fix it and how to prevent it from occurring again?
Well, this will be discussed in the future… hold your breath for now
Comments are always welcome