Several problems can halt the normal operation of an Oracle database or affect database I/O operations. The following sections describe the most common types of problems. For some of these problems, crash and instance recovery occur automatically and require no action on the part of the database administrator. For other problems, administrator-initiated media recovery is required.
This section contains these topics:
- Media Failure
- User Error
- Database Instance Failure
- Statement Failure
- Process Failure
Learn Oracle - Media Failure
An error can occur when trying to write or read an file on disk that is required to operate an Oracle database. This occurrence is called media failure because there is a physical problem reading or writing to files on the storage medium.
A common example of media failure is a disk head crash that causes the loss of all database files on a disk drive. All files associated with a database are vulnerable to a disk crash, including datafiles, control files, online redo logs, and archived logs.
The appropriate recovery from a media failure depends on the files affected. Media failure is the primary concern of a backup and recovery strategy, because it typically requires restoring some or all database files and the application of redo during recovery.
How Media Failures Affect Database Operation
Media failures can affect one or all types of files necessary for the operation of an Oracle database, including datafiles, online redo log files, and control files. Also, media failures can affect archived redo logs stored on disk.
Database operation after a media failure of online redo log files or control files depends on whether the online redo log or control file is protected by multiplexing, as recommended. When an online redo log or control file is multiplexed, multiple copies of the file are maintained on the system. Typically, multiplexed files are stored on separate disks.
If a media failure damages a single disk, and if you have a multiplexed online redo log, then the database can usually continue to operate without significant interruption. Damage to a nonmultiplexed online redo log causes database operation to halt and may cause permanent loss of data. Damage to any control file, whether it is multiplexed or not, halts database operation once Oracle attempts to read or write to the damaged control file, which happens frequently, for example at every checkpoint and log switch.
Media failures that affect datafiles can be divided into two categories: read errors and write errors. In a read error, Oracle discovers it cannot read a datafile and an operating system error is returned to the application, along with an Oracle error indicating that the file cannot be found, cannot be opened, or cannot be read. Oracle continues to run, but the error is returned each time an unsuccessful read occurs. At the next checkpoint, a write error will occur when Oracle attempts to write the file header as part of the standard checkpoint process.
If Oracle discovers that it cannot write to a datafile, and if Oracle is in ARCHIVELOG mode, then Oracle returns an error in the database writer trace file and takes the datafile offline automatically. Only the datafile that cannot be written to is taken offline; the tablespace containing that file remains online.
If the datafile that cannot be written to is in the SYSTEM tablespace, then the file is not taken offline. Instead, an error is returned and Oracle shuts down the instance. The reason for this exception is that all files in the SYSTEM tablespace must be online in order for Oracle to operate properly. For the same reason, the undo tablespaces (if in automatic undo management mode) or the datafiles of a tablespace containing active rollback segments (if in manual undo management mode) must remain online.
If Oracle cannot write to a datafile, and Oracle is not archiving the filled online redo log files, then the database writer background process fails and the current instance fails. If the problem is temporary (for example, the disk controller was powered off), then crash or instance recovery usually can be performed using the online redo log files, in which case the instance can be restarted. However, if a datafile is permanently damaged and archiving is not used, then you must restore the entire database using the most recent consistent backup.
Learn Oracle - Recovery of Read-Only Tablespaces
Recovery is not needed on any read-only tablespace during crash or instance recovery. During startup, recovery verifies that each online read-only datafile does not need media recovery. That is, the file was not restored from a backup taken before it was made read-only. If Oracle DBA restore a read-only tablespace from a backup taken before the tablespace was made read-only, then Oracle DBA cannot access the tablespace until Oracle DBA complete media recovery.
Learn Oracle - User Error
As an administrator, Oracle DBA can do little to prevent user errors such as accidentally dropping a table. Often, user error can be reduced by increased training on database and application principles. Oracle DBA can also avoid user errors by administering privileges correctly so that users are able to do less potential damage. Furthermore, by planning an effective recovery scheme ahead of time, Oracle DBA can ease the work necessary to recover from user errors.
Typically, a user error such as a dropped table requires either re-entering the lost changes manually (if a record of them exists), importing the dropped object (if an export file exists), or performing incomplete recovery either of an individual tablespaces (called tablespace point-in-time recovery (TSPITR)) or of the entire database.
Learn Oracle - Database Instance Failure
Database instance failure occurs when a problem prevents an Oracle database instance from continuing to run. An instance failure can result from a hardware problem, such as a power outage, or a software problem, such as an operating system crash. Instance failure also results when Oracle DBA issue a SHUTDOWN ABORT or STARTUP FORCE statement.
Learn Oracle - Mechanics of Instance and Crash Recovery
When one or more instances fail, Oracle automatically recovers the lost changes associated with the instance or instances. Crash or instance recovery consists of the following steps:
- Rolling forward to recover data that has not been recorded in the datafiles, yet has been recorded in the online redo log, including changes to undo blocks. This phase is called cache recovery.
- Opening the database. Instead of waiting for all transactions to be rolled back before making the database available, Oracle allows the database to be opened as soon as cache recovery is complete. Any data that is not locked by unrecovered transactions is immediately available.
- Marking all transactions systemwide that were active at the time of failure as DEAD and marking the rollback or undo segments containing these transactions as PARTLY AVAILABLE.
- Rolling back dead transactions as part of SMON recovery. This phase is called transaction recovery.
Resolving any pending distributed transactions undergoing a two-phase commit at the time of the instance failure.
- As new transactions encounter rows locked by dead transactions, they can automatically roll back the dead transaction to release the locks. If Oracle DBA are using Fast-Start Recovery, then only the data block is immediately rolled back, as opposed to the entire transaction.
Learn Oracle - Statement Failure
Statement failure occurs when there is a logical failure in the handling of a statement in an Oracle program. For example, assume that all extents of a table (in other words, the number of extents specified in the MAXEXTENTS parameter of the CREATE TABLE statement) are allocated, and are completely filled with data. A valid INSERT statement cannot insert a row because no space is available. Therefore, the statement fails.
If a statement failure occurs, then the Oracle software or operating system returns an error. A statement failure usually requires no recovery steps: Oracle automatically corrects for statement failure by rolling back any effects of the statement and returning control to the application. The user can simply re-execute the statement after the problem indicated by the error message is corrected. For example, if insufficient extents are allocated, then the DBA needs to allocate more extents so that the user's statement can execute.
Learn Oracle - Process Failure
A process failure is a failure in a user, server, or background process of a database instance such as an abnormal disconnect or process termination. When a process failure occurs, the failed subordinate process cannot continue work, although the other processes of the database instance can continue.
The Oracle background process PMON detects aborted Oracle processes. If the aborted process is a user or server process, then PMON resolves the failure by rolling back the current transaction of the aborted process and releasing any resources that this process was using. Recovery of the failed user or server process is automatic. If the aborted process is a background process, then the instance usually cannot continue to function correctly. Therefore, Oracle DBA must shut down and restart the instance.
Learn Oracle - Network Failure
When your system uses networks such as local area networks and phone lines to connect client workstations to database servers, or to connect several database servers to form a distributed database system, network failures such as aborted phone connections or network communication software failures can interrupt the normal operation of a database system. For example:
A network failure can interrupt normal execution of a client application and cause a process failure to occur. In this case, the Oracle background process PMON detects and resolves the aborted server process for the disconnected user process, as described in the previous section.
A network failure can interrupt the two-phase commit of a distributed transaction. After the network problem is corrected, the Oracle background process RECO of each involved database automatically resolves any distributed transactions not yet resolved at all nodes of the distributed database system.