Recovering from the LOSS of REDO log files

tuanlpOctober 28, 2023377 views

In the following post we’ll go through every possible scenario where we loose REDO, one by one. REDO is the most crucial part when we talk about recovery and contains the current transaction details, hence if the data is redo is lost Point time recovery is impossible (depending on the scenario).

From now onwards I would expect the readers are aware of the basics of REDO logs.

Redo logs can have below STATUS’s (select status from v$log;):-

Log status:

UNUSED – Online redo log has never been written to. This is the state of a redo log that was just added, or just after a RESETLOGS, when it is not the current redo log.
CURRENT – Current redo log. This implies that the redo log is active. The redo log could be open or closed.
ACTIVE – Log is active but is not the current log. It is needed for crash recovery. It may be in use for block recovery. It may or may not be archived.
CLEARING – Log is being re-created as an empty log after an ALTER DATABASE CLEAR LOGFILE statement. After the log is cleared, the status changes to UNUSED.
CLEARING_CURRENT – Current log is being cleared of a closed thread. The log can stay in this status if there is some failure in the switch such as an I/O error writing the new log header.
INACTIVE – Log is no longer needed for instance recovery. It may be in use for media recovery. It might or might not be archived.

Recovering from loss of REDO is completely dependent on the STATUS of the member/s that are corrupted or lost. Hence the following discussion would cover the recovery scenario’s segregating with the STATUS of the lost member/s:-

CASE 1:- Recovering from the loss of one member in a multiplexed REDO log environment.

Assuming the case for an env. where there are 3 REDO log groups with 2 members in each group. In such a case if one member of one group is lost, the database would continue to be running as its able to write on the other member of this group but would throw the error in the Alert log and the STATUS of the lost member would become INVALID. To recover from this scenario following are the steps required:-

STEP 1:- Check the Status of the Log:-

select member, a.group#, a.status, b.status from v$log a, v$logfile b where a.group# = b.group# order by a.group#, member;

In this case the solution is to drop and re-create the lost member, but a member of the CURRENT redo log group (a.status in above query) could not be dropped (ORA-01609), hence we have to switch the logfile to change the STATUS of the logfile from CURRENT to ACTIVE or INACTIVE. Hence:-

STEP 2:- Alter system switch logfile;

STEP 3:- Alter database drop log file member ‘<LOG File name where b.status in above query was INVALID>’;

STEP 4:- Alter database add logfile member ‘<Same LOG FILE NAME that was dropped>’ to group <a.group# from above query>;

STEP 5:- Now to make sure the logfile member is available for Oracle to write the redo log switch the logfile (Alter system switch logfile; until b.status from above query becomes NULL for this member)

Easy enough… right ? Well it is undoubtedly BUT when a DBA notices the Online REDO logfile corruption, its stressful enough to make him/her sweat… Trust Me..!!!

Anyway moving towards the next cases, where all redo members are lost, this seems interesting right ? Well I can not agree more on that, so lets take a look on these cases.

CASE 2:- Recovering from the loss of All the members in a multiplexed REDO log environment.

When all the members of a REDO group are lost, the database would crash, in fact SMON would kill itself causing the shutdown as there is no place to write the REDO, which in turn means no recovery of transaction thus Oracle shuts itself to make sure no more transactions are committed when it cannot write the same to the redo logs.

STEP 1:- STARTUP;

This startup would fail indicating errors that the SMON is not able to find the REDO logs to check the status of ROLL Forward. Thus you need to mount the database to see the status of the logfile members.

STEP 2:- STARTUP MOUNT;

STEP 3:- select member, a.group#, a.status, b.status, a.archived from v$log a, v$logfile b where a.group# = b.group# order by a.group#, member;

Now it depends on the a.status and a.archived column of the above query, to decide how to proceed with the recovery.

CASE 2.a) When the lost group’s STATUS = INACTIVE

If the lost group is INACTIVE and ARCHIVED = YES (Archiver process has written the redo for this logfile to Archived redo log) then We’ll clear the logfile (alter database clear logfile group <a.group# of the lost group from the above query>) and then open the database but if ARCHIVED = NO (Hence the the redo log has not been archived) then we’ll clear the logfile with UNARCHIVED clause (alter database clear unarchived logfile group <a.group# of the lost group from the above query>; ) and then open the database using , in this case as you are clearing the log you should immediately take a full backup.

CASE 2.b) When the lost group’s STATUS = ACTIVE

In this case we have to try to make this group INACTIVE, for that we’ll issue checkpoint (ALTER SYSTEM CHECKPOINT), if successful, this would make the status of REDO group as INACTIVE and then you can continue with CASE 2.a), if unsuccessful, then we have to begin incomplete media recovery, recovering up through the log before the damaged log, for that:-

STARTUP MOUNT
select group#, status, archived, thread#, sequence#, first_change# from v$log; (Get the latest SCN {first_change#} of the group just before the group with STATUS = CURRENT)
rman target /
restore database until scn <SCN {first_chenge#}from v$log >;
recover database until scn <SCN {first_chenge#}from v$log >;
alter database open resetlogs;

But in this case you might have lost some transactions that were in the lost group, but there is no other way to recover from this scenario.

CASE 2.c) When the lost group’s STATUS = CURRENT

It’s similar to CASE 2.b).

To summarize the above cases:-

If the group is . . .	Then . . .	And you should . . .
Inactive	It is not needed for crash recovery	Clear the archived or unarchived group.
Active	It is needed for crash recovery	Attempt to issue a checkpoint and clear the log; if impossible, then you must restore a backup and perform incomplete recovery up to the most recent available redo log.
Current	It is the log that the database is currently writing to	Attempt to clear the log; if impossible, then you must restore a backup and perform incomplete recovery up to the most recent available redo log.

I hope this post would be helpful in scenarios when you know the recovery process but as there is lot of stress on DBAs during recovery scenarios even googling at that time for the syntax is an overhead, hence the idea is to make a repository that cover all possibile recovery scenarios step by step with syntax of commands at a single destination. I would try to cover all the scenarios in my subsequent posts. Please comment if any of the curious readers are interested to know the internals of the recovery process like which process does that, what information is in redo that is applied to the datafiles to recover them to point in time and etc.

PLEASE COMMENT IF YOU FACE ANY ERRORS WHEN PERFORMING THIS RECOVERY SO THAT WE CAN WORK ON THOSE AND ENHANCE THIS REPOSITORY FOR OTHER READERS.

How to Configure Network Interfaces using NMCLI on Centos 7.6

ORA-16086: Redo data cannot be written to the standby redo log

Related posts

Error: ORA-16765: Redo Apply is running

Khắc phục lỗi recovery khi OPEN RESETLOGS lỗi SYSTEM trong Oracle Database?

check the FRA usage and troubleshoot the issue in oracle DB