Historical performance data is extremely important when tuning application. This gives us the ability to instantly know when things start go wrong, and the ability to investigated over the past what start causing things to go wrong. Often I am feeling like a modern Sherlock Holmes, searching for criminals
OEM does this, and so does several third party tools (such as Precise). Of course this can be done by manual scripts also.
Here is an example:
One of my customers was replicating data from mainframe to Oracle, and was complaining that replication has started working very slowly. With this information I went comparing program’s current behavior to program’s normal behavior bases on overtime activity information collected and saved for this instance. Indeed I saw that suddenly the program have started waiting much longer than it use to on “log file sync”. This was clue #1.
When looking at a wider time frame of program’s activity I have spotted the exact time when things start going wrong. This was clue #2.
Ok, something was changed. Now I needed to find out what was changed that was causing replication program to start waiting longer and longer for “log file sync”.
As we all know one of the reasons for “log file sync” is massive commits. I went looking for overtime information on Commits, and saw that the amount of commits were doubled compared to their usual amount per day (3.4M instead of 1.8M). This was clue #3. Suspicious….
I needed to look for the cause of this massive increased in commits. First, to answer – when do we do commits? After DMLs of course. So let’s look for suspicious DMLs (new DMLs, DMLs with high executions, DMLs with increased executions, etc). This was clue #4.
Now we are getting closer to solving the mystery. When adding clue #4 (suspicious DMLs) to clue #2 (time when things start going wrong), it wasn’t hard to find the criminal to blame. A massive new UPDATE had started at that time, executing over 2.6M executions a day, causing the massive commits, and the increase of “log file sync”.
Now all I have to do was to go back to the application and ask them few questions, such as: what was this SQL, why was it added, why was it executing so many times a day.
You will be surprise to know that the answer was “by mistake”. Well, they remove this UPDATE and things went back to normal.
All this investigation was finished by less than an hour. No doubt it would have been much harder and much longer to find the root cause of the abnormal program behavior without overtime information saved on application activity and instance activity.
All this investigation and conclusion can be visually seen in the following link: http://www.dboptimize.co.il/Oracle_abnormal_program_behavior_Redo_log.pdf
Remember: from my experience at least 80% of instance performance problems are caused by application… Much less are caused by wrong instance tuning (init parameters etc). It is always beneficial to know your application and talk to people who are responsible to it.