ORA-19566 & RMAN-03009 exceeded limit of 0 corrupt blocks

I have faced with that issue on our 12.1.3 EBS system on AIX system with 2 node 11gR2 RAC databases.

During the full backup operation, I noticed that RMAN log mention below error message at the backup log file.

RMAN-03009: failure of backup command on ORA_DISK_3 channel at 03/29/2012 22:19:19
ORA-19566: exceeded limit of 0 corrupt blocks for file +ORADATA/PROD/datafile/apps_ts_xxcustom_data.308.752148073
RMAN-03009: failure of backup command on ORA_DISK_2 channel at 03/29/2012 21:42:32
ORA-19566: exceeded limit of 0 corrupt blocks for file +ORADATA/PROD/datafile/apps_ts_xxcustom_data.395.741949465
RMAN-03009: failure of backup command on ORA_DISK_8 channel at 03/29/2012 22:18:24
ORA-19566: exceeded limit of 0 corrupt blocks for file +ORADATA/PROD/datafile/apps_ts_xxcustom_data.583.753881375

We have below error at alert.log:

 Hex dump of (file 440, block 1106642) in trace file /oracle11g/PROD/db/diag/rdbms/PROD/PROD1/trace/PROD1_ora_45744742.trc
 Corrupt block relative dba: 0x6e10e2d2 (file 440, block 1106642)
 Bad check value found during backing up datafile
 Data in bad block:
 type: 6 format: 2 rdba: 0x6e10e2d2
 last change scn: 0x08fc.fd857452 seq: 0x2 flg: 0x04
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x74520602
 check value in block header: 0x937c
 computed block checksum: 0x3a7
 Reread of blocknum=1106642, file=+ORADATA/PROD/datafile/apps_ts_xxcustom_data.716.771786733. found same corrupt data
 Reread of blocknum=1106642, file=+ORADATA/PROD/datafile/apps_ts_xxcustom_data.716.771786733. found same corrupt data
 Reread of blocknum=1106642, file=+ORADATA/PROD/datafile/apps_ts_xxcustom_data.716.771786733. found same corrupt data
 Reread of blocknum=1106642, file=+ORADATA/PROD/datafile/apps_ts_xxcustom_data.716.771786733. found same corrupt data
 Reread of blocknum=1106642, file=+ORADATA/PROD/datafile/apps_ts_xxcustom_data.716.771786733. found same corrupt data

After some search on Metalink and I followed Intermittent Ora-19566 When Using RMAN Backup [ID 549256.1] and then We first started to work with our storage team to can identified Issue is related with storage or not.

I noticed that issue is not with storage level. So We decided to go with SR1.

During SR session I followed below steps to identified why  I faced this corruption and how we can solve an issue in a short period.

Firstly While corruption happens Oracle started to fill in V$DATABASE_BLOCK_CORRUPTION dynamic view. So I query below view

select * from V$DATABASE_BLOCK_CORRUPTION ;

INST_ID FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTION_TYPE
1 71 1.340.426 1 0 CHECKSUM
1 81 1.338.102 1 0 CHECKSUM
1 101 1.335.514 1 0 CHECKSUM
1 139 1.332.486 1 0 CHECKSUM
1 192 1.074.210 1 0 CHECKSUM
1 234 980.706 1 0 CHECKSUM
1 284 974.490 1 0 CHECKSUM
1 307 1.060.358 1 0 CHECKSUM
1 309 1.047.338 1 0 CHECKSUM
1 334 926.887 1 0 CHECKSUM
1 354 882.790 1 0 CHECKSUM
1 413 1.205.042 1 0 CHECKSUM
1 440 1.106.642 1 0 CHECKSUM
1 740 1.243.698 1 0 CHECKSUM

I was trying to check if the blocks are on free space or occupied by any object.

Even if the blocks were occupied by any object and if that object was dropped the block would still be in corrupted formatted unless Oracle reuses it.

If the blocks are indeed in free space then we use a document to manually format the block.

So the output of these queries is important.

SELECT e.owner, e.segment_type, e.segment_name, e.partition_name, c.file#
, greatest(e.block_id, c.block#) corr_start_block#
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1) corr_end_block#
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1)
- greatest(e.block_id, c.block#) + 1 blocks_corrupted
, null description
FROM dba_extents e, v$database_block_corruption c
WHERE e.file_id = c.file#
AND e.block_id <= c.block# + c.blocks - 1
AND e.block_id + e.blocks - 1 >= c.block#;
SELECT s.owner, s.segment_type, s.segment_name, s.partition_name, c.file#
, header_block corr_start_block#
, header_block corr_end_block#
, 1 blocks_corrupted
, 'Segment Header' description
FROM dba_segments s, v$database_block_corruption c
WHERE s.header_file = c.file#
AND s.header_block between c.block# and c.block# + c.blocks - 1;
SELECT null owner, null segment_type, null segment_name, null partition_name, c.file#
, greatest(f.block_id, c.block#) corr_start_block#
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1) corr_end_block#
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1)
- greatest(f.block_id, c.block#) + 1 blocks_corrupted
, 'Free Block' description
FROM dba_free_space f, v$database_block_corruption c
WHERE f.file_id = c.file#
AND f.block_id <= c.block# + c.blocks - 1
AND f.block_id + f.blocks - 1 >= c.block#
order by file#, corr_start_block#;

After using upper queries We run below queries:

SQL>Select file#,block# from v$database_block_corruption ;

From rman:

RMAN > Validate datafile <file no> block <Block no> ;
PS: You can use below sql's for your issues
select 'Validate datafile '|| file# || ' block '||block# ||';' FROM v$database_block_corruption;
select 'Select * from dba_free_space where file_id='|| file# || ' and '||block# ||' between block_id and block_id + blocks -1;' FROM v$database_block_corruption;

So our issue we just replaced the file no with file# and block no with block# ;

The above will run the validate only on that file and block and not entire database.

So we just started ti run those queries our first corrupted block
We just run for single file and output as below

rman connect target / ( Do not connect catalog)

RMAN > Validate datafile 71 block 1340426 ;
Validate datafile 71 block 1340426;
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
71 FAILED 0 0 1 0
File Name: +ORADATA/PROD/datafile/apps_ts_xxcustom_data.365.739618975
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data             0                  0
Index            1                  1
Other            0                  0

So this indicates the block is still corrupted.

Its showing failing block for index.

I run the following:

SQL>Select file#,block# from v$database_block_corruption where file#=71 ;
SQL>Select * from dba_free_space where file_id=71 and <block# value from above query> between block_id and block_id + blocks -1 ;

If this returns row then the block is in free space.

Query return:

APPS_TS_XXCUSTOM_DATA 71 1337856 24903680 3040 71

The good news is this block is in free space. Which mean when the index has dropped the block was still in the corrupted state but since it was not reused by any other segment oracle still believe’s it to be corrupt.

Once this block is reused oracle would autoformat this block.
So We can do a block recovery. By using the backups taken to tape or disk

As I mention previously We should first use below queries for other corrupted blocks

SQL>Select file#,block# from v$database_block_corruption ;

From rman

RMAN > Validate datafile <file no> block <Block no> ;

So replace the file no with file# and block no with block# ;

Once done

Select * from dba_free_space where file_id=71 and <block# value from above query> between block_id and block_id + blocks -1 ;

For example

V$database_block_corruption

2 81 1.338.102 1 0 CHECKSUM
2 101 1.335.514 1 0 CHECKSUM

Rman> backup validate datafile 81 block 1338102 ;
SQL>Select * from from dba_free_space where file_id=81 and 1338102 between block_id and block_id + blocks -1 ;

So corrupt blocks are not occupied by any objects. So My production database activities would not be affected.

However, since the blocks have still not been reused rman still considers them as corrupted. So what We have to do?

We should now go with the rman block recovery command.

Please note block recover command would restore archivelogs required for block recovery to disk for doing a block recovery of the affected block.

You can do the following for block recover:

A single command will recovery all blocks affected in v$database_block_corruption ;

Rman> run
{ allocate channel c1 device type disk ;
allocate channel c2 device type disk ;
blockrecover corruption list ;
}

Or Alternate command is

Rman>recover corruption list ;

In these steps, We faced first block corruption recovery took 1 hour 06 min., second block corruption took almost 1 hour. We started to thing the recovery process could be more than 10 hours. But After 2th recover other block corruption took less than 5-10 min.

So the expected behavior is while recovering corruption list process end, dynamic view v$database_block_corruption will be updated. But We see that after rman complete process this view still show us there are corruption block available.

While We started to run validate command again view became empty

rman target / nocatalog
RMAN> Validate datafile 71 block 1340426;

So what is the cause of this corruption? What is a suggestion to prevent this error?

By default db_block_checksum is true.

When dbwr writes blocks to disk it does a xor of bytes within the blocks. Once the block is written to disk no oracle process modifies the blocks unless its re-read to buffer.
Before the block is read back to buffer the server process does a recomputation of the checksum value by doing xor of the bytes within the block.
This should match with the value stored in the block by dbwr process as no oracle process has modified the blocks.
If there is a mismatch in this value it clearly indicates some OS/hardware issue which has overwritten some contents of the blocks when they were on disk.

Also, any corruption by Oracle would be through redo changes. However, we see rman block recovery has gone fine.

Clearly indicating the redo changes vectors were good.You can set db_block_checking=typical …. however, these parameters would have some performance impact around 2 to 10%.

Or set DB_ULTRA_SAFE

Refer

New Parameter DB_ULTRA_SAFE introduce In 11g [ID 465130.1]

These parameters would help detect corruption at earlies in memory and prevent on occasion corrupted block from written to disk.

Again these might also have some performance impact around 2-10% depending on your application

References:
New Parameter DB_ULTRA_SAFE introduce In 11g [ID 465130.1]
Known Corruption issues caused by 3rd party Software Provider [ID 1323649.1]
Intermittent Ora-19566 When Using RMAN Backup [ID 549256.1]
Ext/Pub RMAN 9i Block-Level Media Recovery – Concept & [ID 144911.1 ]

Reklam


Bir Cevap Yazın

Aşağıya bilgilerinizi girin veya oturum açmak için bir simgeye tıklayın:

WordPress.com Logosu

WordPress.com hesabınızı kullanarak yorum yapıyorsunuz. Çıkış  Yap /  Değiştir )

Facebook fotoğrafı

Facebook hesabınızı kullanarak yorum yapıyorsunuz. Çıkış  Yap /  Değiştir )

Connecting to %s

%d blogcu bunu beğendi: