Sreekanth Bandi: November 2012

Here I would like to share performance troubleshooting experience that I had recently. The database environment comprises Extended Oracle RAC on 2 Solaris Sparc M8000 machines, running Solaris 10, each located in a separate data centre which are 1-2 miles apart from each other. The oracle clusterware version is 11.1.0.7. There are 14 different database instances(some are 10gR2 and others are 11gR1) running on each node in the cluster, which are monitored by Enterprise Manager Grid Control.

The problem is noticed after Enterprise Manager sending EM Alerts with ORA-12170 error for all the instances running on one node; the instances on the other node are fine. There was another alert with the message "Agent is unrecheable, but the host is available".

The EM agent status is shown as running. There are about 20 OS processes with the command /install/app/oracle/product/agent10g/perl/bin/perl /install/app/oracle/product/. The ps -ef | grep agent will list all such processes. Even after shutting down the agent, these processes did not disappear. All these processes are killed using kill command. It did not improve anything.

The top command shows the load averages as 15 to 20. At normal times the load averages are observed in the range of 3 to 7. The machine has 2 quad-core CPUs. The top command output also shows few zombie processes, the no of zombie processes are different each time top output is refreshed. The process ids of these zombie processes are also different. The command ps -ef | grep defunct confirmed this.

The database application users started complaining about connection issues with time out errors.

The load average is not high, the response time of commands like ls, ps, oraenv is not bad but we observed 2 or 3 seconds wait time before any command (ls, ps, oraenv) is returning the control to the shell afer showing the output, if any. Connecting to any database instance as sysdba locally is also quick(fraction of second) but the quit command spends 3 to 5 secs to exit.

Then I have found the article with ID 1002436.1 on My Oracle Support site. (I don't remember which search terms I have used to get this page from the knowledge base. I was using keywords like process exit slow on solaris; zombie process on solaris)

As explained in the article, the file /var/adm/pacct was 2GB. It looked like we encountered the specified bug at OS level which has caused the performance problems. The adm account was locked which has cron job scheduled to process this file on regular basis and hence the file has grown big. The process accounting was turned off (/usr/lib/acct turnacct off) and the performance of the server has come down to normal. Later process accounting was turned on (/usr/lib/acct turnacct on) but the performance was stable.

The OS reboot might have solved this, but I am glad that we got to the bottom of what has caused the issue.

Sreekanth Bandi

Thursday, 22 November 2012

Performance Troubleshooting on Solaris 10

Thursday, 8 November 2012

Is NLS_LENGTH_SEMANTICS really dynamic?