We have an 11.2.0.1.6 two-node RAC database that belongs to an R12.1.3 EBS system.

Our node 1 went down with the errors below:
opiodr aborting process unknown ospid (11075852) as a result of ORA-28
LMON (ospid: 63767522) detects hung instances during IMR reconfiguration
LMON (ospid: 63767522) tries to kill the instance 2.
Please check instance 2’s alert log and LMON trace file for more details.
Tue Mar 19 10:58:36 2013
USER (ospid: 32900426): terminating the instance due to error 481
Tue Mar 19 10:58:36 2013
Errors in file /oracle11g/PROD00/db/diag/rdbms/PROD00/PROD001/trace/PROD001_lmon_63767522.trc:
ORA-29702: error occurred in Cluster Group Service operation
System state dump is made for local instance
System State dumped to trace file /oracle11g/PROD00/db/diag/rdbms/PROD00/PROD001/trace/PROD001_diag_9373174.trc
Instance terminated by USER, pid = 32900426
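
For quick triage, the key lines can be pulled straight out of the alert log and the LMON trace. This is only a sketch: the trace directory is taken from the file paths above, but the alert_PROD001.log file name is an assumed (standard ADR) name for this instance.

TRC=/oracle11g/PROD00/db/diag/rdbms/PROD00/PROD001/trace   # path as shown in the messages above
grep -E "ORA-29702|error 481|terminating the instance" $TRC/alert_PROD001.log   # alert log name assumed
grep -E "CGS|IMR|timeout" $TRC/PROD001_lmon_63767522.trc | head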
Error Codes
—————————————————

From: PROD001_lmon_63767522.trc

*** 2013-03-19 10:55:00.531

* DRM RCFG called (swin 1)
CGS recovery timeout = 85 sec
Begin DRM(5108) (swin 1)

*** 2013-03-19 10:57:11.547
kjxgmrcfg: Reconfiguration started, type 6
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 274 0.

*** 2013-03-19 10:57:11.567
Name Service frozen

..
* kjfcln: DRM aborted due to CGS rcfg.

*** 2013-03-19 10:57:16.439
*** 2013-03-19 10:57:41.514
=====================================================
kjxgmpoll: CGS state (274 1) start 0x51482867 cur 0x51482885 rcfgtm 30 sec
=====================================================
Group name: PROD00
Member id: node 0 inst 1
Cached KGXGN event: 0
Group State:
State: 274 1
Flags: 0x4 SSFlags: 0x0
Reconfig started cur-tm 0x6aba6ec8 start-tm 0x6ab9fc80 tmout 0x55 state 0x2
Reconfig INPG type 6 inc 274 rsn 5 data 0x1
Reconfig COMP type 6 inc 274 rsn 5 data 0x1
..

*** 2013-03-19 10:58:31.632
=====================================================
kjxgmpoll: CGS state (274 1) start 0x51482867 cur 0x514828b7 rcfgtm 80 sec

*** 2013-03-19 10:58:36.664
=====================================================
kjxgmpoll: CGS state (274 1) start 0x51482867 cur 0x514828bc rcfgtm 85 sec
kjxgmpoll: the CGS reconfiguration has spent 85 seconds.
kjxgmpoll: terminate the CGS reconfig.
Error: Cluster Group Service reconfiguration takes too long
LMON caught an error 29702 in the main loop
error 29702 detected in background process
ORA-29702: error occurred in Cluster Group Service operation
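
The rcfgtm counter above climbs from 30 to 85 seconds against the 85-second CGS rcfg timeout, at which point LMON gives up and the instance is terminated with ORA-29702. The timer progression can be followed with a simple grep against the LMON trace (illustrative only):

grep "kjxgmpoll: CGS state" /oracle11g/PROD00/db/diag/rdbms/PROD00/PROD001/trace/PROD001_lmon_63767522.trc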

We see many drm quiesce hang messages in the LMON traces on both nodes:
find . -name "*lmon*.trc" | xargs grep -i "quiesce hang"
./oracle/PROD001_lmon_63767522.trc:* Request pseudo reconfig due to drm quiesce hang
./oracle/PROD001_lmon_63767522.trc:* Request pseudo reconfig due to drm quiesce hang
./oracle/PROD002_lmon_14221454.trc:* Request pseudo reconfig due to drm quiesce hang
./oracle/PROD002_lmon_14221454.trc:* Request pseudo reconfig due to drm quiesce hang
./oracle/PROD002_lmon_14221454.trc:* Request pseudo reconfig due to drm quiesce hang
./oracle/PROD002_lmon_14221454.trc:* Request pseudo reconfig due to drm quiesce hang
./oracle/PROD002_lmon_14221454.trc:* Request pseudo reconfig due to drm quiesce hang
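
The same search with a per-file count (grep -c instead of grep) gives a quick feel for how often each LMON trace hit the quiesce hang; the output will of course vary by environment:

find . -name "*lmon*.trc" | xargs grep -ci "quiesce hang"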

Based on these, the issue appears to be an occurrence of Bug 12879027: LMON gets stuck in DRM quiesce causing intermittent pseudo reconfiguration.
To get the fix for the bug, install the 11.2.0.3 patchset into the RDBMS $ORACLE_HOME and then apply the 11.2.0.3.3 PSU (or a later PSU) on top.
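
Once the patchset and PSU are in place, the fix can be double-checked from the RDBMS home. This is a sketch and assumes the installed OPatch version supports the -bugs_fixed listing; v$instance should then also report version 11.2.0.3 on both nodes.

$ORACLE_HOME/OPatch/opatch lsinventory -bugs_fixed | grep -i 12879027   # the fix for Bug 12879027 should be listed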

More details can be found in MOS note: Bug 12879027 – LMON gets stuck in DRM quiesce causing intermittent pseudo reconfiguration [ID 12879027.8].
