Oracle RAC Troubleshooting — The Inter-Node Issues That Fool You

RAC troubleshooting is different from single-instance troubleshooting in ways that are non-obvious until you’ve been through a few incidents. The most frustrating RAC problems are those where everything looks healthy on each individual node, but the cluster as a whole is behaving badly. This post focuses specifically on the inter-node and cluster-layer problems that don’t show up in single-instance diagnostic tools.


The Cluster Interconnect — Your First Suspect

Every RAC node communicates with other nodes through the private interconnect network. Cache Fusion — the mechanism that transfers data blocks between nodes without writing to disk — uses this network. Global enqueues, distributed lock management, and SCN coordination all happen over this network.

If the interconnect has high latency, packet loss, or bandwidth saturation, every RAC operation that requires inter-node coordination suffers. The symptoms can look like a slow database, like lock contention, like I/O problems, or like “random” performance degradation.

First diagnostic: verify the interconnect configuration is correct.

```sql
-- Check which network interfaces each instance uses for the interconnect
SELECT inst_id, name, ip_address, is_public
FROM   gv$cluster_interconnects;
-- The interconnect should be a private network (is_public = 'NO')
-- If it shows a public network IP, RAC is using the wrong interface
```

This is more common than you’d think. After a server rebuild or network reconfiguration, RAC sometimes falls back to the public network for cluster communication. Cache Fusion traffic over a shared public network brings performance to its knees.

Also check:

```bash
# On each node, measure interconnect latency
# (-f flood ping requires root; -s 1472 sends near-MTU-sized packets)
ping -I <private_nic> -f -s 1472 -c 1000 <other_node_private_ip>
# Should be < 1 ms. Anything above 2 ms is problematic for RAC.
```
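The Clusterware-level view of which interface is registered for cluster traffic can be cross-checked with oifcfg, run from the Grid Infrastructure home (output format varies slightly by version):

```bash
# Lists each registered interface with its subnet and role:
# "cluster_interconnect" should be on the private NIC, "public" on the public one
oifcfg getif
```

If the interface the OS is actually using (per v$cluster_interconnects) disagrees with what oifcfg reports, the Clusterware network registration needs to be corrected before chasing anything else.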

GCS and GES — Understanding the Wait Events

The Global Cache Service (GCS) and Global Enqueue Service (GES) are the heart of RAC coordination. When you see waits in AWR or ASH with “gc” prefix, you’re looking at inter-node block transfers:

  • gc cr request - requesting a consistent read copy of a block
  • gc current request - requesting the current version of a block
  • gc cr block 2-way - block transferred between 2 nodes (fast)
  • gc current block 2-way - current block transferred (fast)
  • gc cr block 3-way - block transferred through a 3rd node (slower)
  • gc cr disk read - block had to be read from disk by the owning node

Two-way transfers are fast (milliseconds). Three-way and disk reads are slow. High rates of three-way transfers or disk reads in your “gc” wait events indicate hot block contention — the same blocks being accessed by multiple nodes.

```sql
-- Find the hot objects causing gc waits
-- (v$segment_statistics is one row per segment per statistic; sum VALUE)
SELECT owner, object_name, object_type,
       SUM(value) AS gc_blocks_received
FROM   v$segment_statistics
WHERE  statistic_name IN ('gc cr blocks received', 'gc current blocks received')
GROUP  BY owner, object_name, object_type
ORDER  BY SUM(value) DESC
FETCH FIRST 20 ROWS ONLY;
```

If you identify specific hot segments, examine whether they can be redesigned to reduce inter-node contention: reverse key indexes for sequential inserts, hash partitioning to spread hot blocks, larger sequence cache values so each instance fetches a bigger range of numbers at once instead of contending for the sequence on every call.
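As a sketch of those three remedies (the table, index, and sequence names below are placeholders, not objects from any real schema):

```sql
-- Reverse key index: spreads monotonically increasing keys
-- across many leaf blocks instead of one hot right-hand block
CREATE INDEX orders_pk_rix ON orders (order_id) REVERSE;

-- Hash partitioning: spreads a hot segment's blocks across
-- partitions so different nodes tend to touch different blocks
ALTER TABLE order_events
  MODIFY PARTITION BY HASH (event_id) PARTITIONS 8 ONLINE;  -- ONLINE needs 12.2+

-- Larger sequence cache: each instance caches 1000 values locally
ALTER SEQUENCE order_seq CACHE 1000;
```

Each has trade-offs (reverse key indexes hurt range scans, for example), so treat these as options to evaluate against the workload, not defaults.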


ASH for RAC — The Multi-Node View

Standard AWR and ASH views show data per instance. For RAC troubleshooting, you need the multi-instance views:

```sql
-- ASH across all nodes for a specific time window
SELECT inst_id,
       event,
       COUNT(*) AS waits,
       ROUND(COUNT(*) / SUM(COUNT(*)) OVER () * 100, 2) AS pct
FROM   gv$active_session_history
WHERE  sample_time BETWEEN TIMESTAMP '2025-04-19 14:00:00'
                       AND TIMESTAMP '2025-04-19 14:30:00'
AND    session_state = 'WAITING'
GROUP  BY inst_id, event
ORDER  BY waits DESC
FETCH FIRST 20 ROWS ONLY;
```

GV$ views (as opposed to V$ views) span all instances. Always use GV$ for RAC-wide diagnostics.

If you see a specific wait event concentrated on one instance but not others, you have a node-specific problem (local I/O issue, CPU saturation on that node, OS-level problem). If you see it across all nodes, it’s a shared resource problem (storage, interconnect, shared pool).
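A quick way to make that per-instance comparison for a suspect event is gv$system_event (the event name below is just an example; substitute whatever dominates your AWR report):

```sql
-- Compare one wait event instance by instance since startup
SELECT inst_id,
       total_waits,
       ROUND(time_waited_micro / 1e6, 1) AS seconds_waited
FROM   gv$system_event
WHERE  event = 'gc cr block 2-way'
ORDER  BY inst_id;
```

Roughly equal numbers across instances point at a shared resource; one instance an order of magnitude above the rest points at that node.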


Node Eviction — The Most Disruptive RAC Event

A node eviction (also called a node reboot by Clusterware) is when the cluster forcibly removes a node. Symptoms: applications get ORA-03113 or ORA-01033 errors, one RAC node reboots unexpectedly, alert log shows “Evicting peer instance” messages.

Evictions happen when a node fails to respond to the Cluster Synchronization Services (ocssd) heartbeats within the configured timeout: the network heartbeat sent over the interconnect, and the disk heartbeat written to the voting disks. They're Oracle's way of ensuring cluster integrity — a silent node is more dangerous than an evicted node.
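The timeouts in play can be read straight from Clusterware (run as the Grid Infrastructure owner; the values returned vary by version and configuration):

```bash
# Network heartbeat timeout in seconds before a node is evicted
crsctl get css misscount
# Voting disk I/O timeout in seconds
crsctl get css disktimeout
```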

Diagnosing why an eviction happened:

```bash
# Check the Cluster Synchronization Services daemon (ocssd) log
tail -200 $ORACLE_BASE/diag/crs/<hostname>/crs/trace/ocssd.trc
grep -i "evict\|reboot\|vote" $ORACLE_BASE/diag/crs/<hostname>/crs/trace/ocssd.trc
# List the most recently written Clusterware trace files
ls -lt $ORACLE_BASE/diag/crs/<hostname>/crs/trace/ | head -20
```

Common eviction causes:

  • Voting disk I/O timeout (node couldn’t write to voting disks within the timeout)
  • Network split-brain (nodes couldn’t communicate over interconnect)
  • OS-level freeze (high memory pressure triggering OOM killer)
  • NTP misconfiguration causing clock skew between nodes

Voting disk I/O issues are the most common. If your voting disks are on shared storage that experiences latency spikes, nodes can be evicted even when the database and application are perfectly healthy. For OCI RAC, Oracle manages the voting disk configuration — but if you’re on-prem, ensure your voting disks are on low-latency, highly available storage separate from your data storage.
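To confirm where the voting disks live and that they are all online:

```bash
# Lists each voting disk with its state (ONLINE/OFFLINE) and location
crsctl query css votedisk
```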


Services — Use Them for Workload Management

One of the most underused RAC features is Services. Instead of connecting directly to a specific instance, applications connect to a service name that Oracle routes to the appropriate instance(s).

```bash
# On a Clusterware-managed RAC database, services are created with srvctl
# (DBMS_SERVICE.CREATE_SERVICE has no preferred/available instance parameters).
# RACDB below stands in for your database unique name.

# OLTP workload: preferred on instance RAC1, fails over to RAC2
srvctl add service -db RACDB -service OLTP_SERVICE -preferred RAC1 -available RAC2
srvctl start service -db RACDB -service OLTP_SERVICE

# Reporting workload: preferred on instance RAC2, fails over to RAC1
srvctl add service -db RACDB -service REPORT_SERVICE -preferred RAC2 -available RAC1
srvctl start service -db RACDB -service REPORT_SERVICE
```

With services:

  • OLTP connections go to RAC1, reporting connections go to RAC2
  • If RAC1 fails, OLTP connections automatically failover to RAC2
  • Resource Manager plans can be tied to services for workload isolation
  • TAF (Transparent Application Failover) settings are defined at the service level

Running all connections on a single service name that distributes across all nodes without workload isolation is leaving one of RAC’s most powerful capabilities unused.
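On the client side, a sketch of a tnsnames.ora entry pointing at such a service (the SCAN hostname and port here are placeholders):

```
OLTP =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = OLTP_SERVICE)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC))
    )
  )
```

Because the client names only the service, Oracle decides which instance actually serves the connection, and failover behavior moves with the service rather than being hard-coded per client.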
