Legato Storage Manager Administrator's Guide Release 8.1.5 A68160-01
This chapter provides information about how you can use LSM to recover from different types of system disasters on your Oracle Server. It stresses the importance of preparing for a disaster. If you back up your Oracle data regularly and implement the planning procedures outlined in this chapter, you will be well prepared to recover from a disaster.
You can use LSM to recover from different types of disaster on your Oracle Server machine. The degree of data loss during a disaster can range from one or more files lost when a disk crashes to an entire computer system. The degree of severity of the disaster determines the procedures you need to perform to recover data on your Oracle Server.
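The information in this chapter explains how to recover from four different types of disaster on your Oracle Server: loss of the operating system and LSM software, loss of the LSM indexes and configuration files, damage to Oracle database files, and loss of the LSM machine itself, which requires moving LSM to a new machine.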
It is important to develop a plan for recovering from a disaster on your Oracle Server. Not only do you need to back up important data on a daily basis, but you need to develop and test a plan for recovering your data should you experience a disk crash or loss of data. The more time and effort you invest in creating and testing your disaster recovery plan, the better prepared you will be should disaster strike.
The section "LSM Server Bootstrap Backups" describes how the LSM Server is preconfigured to perform an automatic nightly bootstrap backup. The bootstrap is a special backup save set that includes the client index, media index, and resource database. You can also perform a manual bootstrap backup by using the procedure in "Manual Bootstrap Backup".
The bootstrap information printed at the end of every bootstrap backup is essential for recovering LSM Server indexes and resource configuration files. As explained in "Maintaining Bootstrap Information", you should keep the bootstrap printout in a safe place, ready for use during a disaster recovery. If you do not keep the bootstrap printout, you will need to determine the save set ID of the most recent bootstrap by means of the procedure in "Bootstrap Save Set ID" below.
For more information on using the bootstrap information, see the section "Recovering LSM Indexes and Configuration Files", under either "Disaster Recovery on UNIX" or "Disaster Recovery on Windows NT".
Along with the bootstrap information, you should keep accurate records of your network and system configurations and maintain all your original software in a safe location. For a comprehensive disaster recovery, you need the following items:
The most efficient way to recover the bootstrap is to save the bootstrap information prior to a disaster. However, if you do not have the information, you must scan the most recent backup volume to find the save set ID of the most recent bootstrap. Use the scanner -B command, as it will always find a valid bootstrap.
Use the following steps to find the most recent save set ID for a bootstrap on a UNIX system:
Use the scanner -B command to locate the most recent bootstrap on the media. For example:
For Solaris systems:
/usr/sbin scanner -B /dev/rmt/0hbn
For AIX systems:
/usr/bin scanner -B /dev/rmt0.1
For DIGITAL UNIX systems:
/usr/opt/networker/bin scanner -B /dev/nrmt0h
For HP-UX systems:
/opt/networker/bin scanner -B /dev/rmt/0mnb
The scanner -B command displays the latest bootstrap save set information found on the backup volume, as illustrated in the following example:
scanner: scanning 8mm tape jupiter.001 /dev/rmt/0hbn
scanner: Bootstrap 1148869870 of 8/21/96 7:45:15 located on volume jupiter.001, file 88
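If you capture the scanner -B output to a file, the save set ID and file number that mmrecov later asks for can be pulled out mechanically. The following is a minimal sketch, not part of LSM: the function name parse_bootstrap_line is hypothetical, and the pattern assumes the output is laid out exactly like the UNIX sample above, which may not hold for other releases or platforms.

```python
import re

# Pattern for the "Bootstrap" line in the sample output above. The layout is
# an assumption taken from that single sample; adjust it if your output differs.
BOOTSTRAP_RE = re.compile(
    r"scanner: Bootstrap (?P<ssid>\d+) of (?P<date>\S+) (?P<time>\S+) "
    r"located on volume (?P<volume>\S+), file (?P<file>\d+)"
)


def parse_bootstrap_line(text):
    """Return ssid, date, time, volume, and file number, or None if absent."""
    match = BOOTSTRAP_RE.search(text)
    if match is None:
        return None
    info = match.groupdict()
    info["file"] = int(info["file"])  # the file number is numeric
    return info


sample = (
    "scanner: scanning 8mm tape jupiter.001 /dev/rmt/0hbn\n"
    "scanner: Bootstrap 1148869870 of 8/21/96 7:45:15 "
    "located on volume jupiter.001, file 88\n"
)
print(parse_bootstrap_line(sample))
```

The save set ID is kept as a string because it is only ever typed back in at a prompt, never used in arithmetic.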
After you locate the bootstrap with the most recent date, you can run the mmrecov command on a UNIX system to recover the LSM Server indexes and resource configuration. For more information, see "Recovering LSM Indexes and Configuration Files".
Use the following steps to find the most recent save set ID for a bootstrap on a Windows NT system:
Use the scanner -B command to locate the most recent bootstrap. For example:
C:\win32app\nsr\bin> scanner -B \\.\Tape0
The scanner -B command displays the latest bootstrap save set information found on the backup volume, as shown in this example:
scanner: scanning 8mm tape mars.006 on \\.\Tape0
scanner: Bootstrap 1148869870 8/11/96 6:29:58 mars.006, file 88
After you locate the bootstrap with the most recent date, you can run the mmrecov command on a Windows NT system to recover the LSM Server indexes and resource configuration. For more information, see "Recovering LSM Indexes and Configuration Files".
As an additional precaution to help you recover from loss of critical data, find out how each disk on your Oracle Server machine is partitioned and formatted before a disaster occurs, then print and save this information. If a disk is damaged or destroyed during a disaster, use the disk information to recreate the disk exactly as it was prior to the disk crash.
Use the df command to find out how the LSM Server disks are partitioned and mounted. Use the appropriate operating system command to print disk partitioning information.
For Solaris systems, use the df and prtvtoc commands.
For AIX systems, use the df and lslv commands or the Logical Volume Manager in the System Management Interface Tool (SMIT).
For DIGITAL UNIX systems, use the df and fstab commands.
For HP-UX systems, use the df and bdf commands.
For example, the df information looks similar to the following:
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t3d0s6 480919 414138 18691 96% /usr
/dev/dsk/c0t3d0s0 1251422 183449 942833 17% /
swap 208112 380 207732 1% /tmp
/dev/dsk/c0t3d0s5 96031 12799 73632 15% /var
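Alongside the printout, you may want to keep the df listing in a structured form that is easy to compare after the disk is rebuilt. The sketch below is illustrative only: parse_df is a hypothetical helper, and it assumes six whitespace-separated columns exactly as in the sample above (df layouts vary between UNIX variants).

```python
def parse_df(text):
    """Parse df output laid out like the sample above into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) != 6:
            continue  # ignore wrapped or malformed lines
        filesystem, kbytes, used, avail, capacity, mount = fields
        rows.append({
            "filesystem": filesystem,
            "kbytes": int(kbytes),
            "used": int(used),
            "avail": int(avail),
            "capacity": capacity,
            "mounted_on": mount,
        })
    return rows


sample = """\
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t3d0s6 480919 414138 18691 96% /usr
/dev/dsk/c0t3d0s0 1251422 183449 942833 17% /
swap 208112 380 207732 1% /tmp
/dev/dsk/c0t3d0s5 96031 12799 73632 15% /var
"""

for row in parse_df(sample):
    print(row["mounted_on"], row["filesystem"], row["kbytes"])
```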
The prtvtoc command example below provides information about how each disk is partitioned for a Solaris system. The device name is the "raw" device corresponding to the device name used for the output from the df command shown previously.
/dev/dsk/c0t3d0s0 partition map
Dimensions:
512 bytes/sector
80 sectors/track
19 tracks/cylinder
1520 sectors/cylinder
3500 cylinders
2733 accessible cylinders
Flags:
1: unmountable
10: read-only
                          First     Sector    Last
Partition  Tag  Flags     Sector    Count     Sector   Mount Directory
    0       2    00            0   2663040   2663039   /
    1       3    01      2663040    261440   2924479
    2       5    00            0   4154160   4154159
    5       7    00      2924480    205200   3129679   /var
    6       4    00      3129680   1024480   4154159   /usr
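Because you will retype these numbers when recreating the disk, it can be worth checking a transcribed partition map for copying errors. In the sample above, each partition's last sector equals its first sector plus its sector count minus one, and partition 2 spans 2733 accessible cylinders of 1520 sectors each. The following sketch checks both relationships; the helper name inconsistent_partitions is hypothetical and the values are copied from the sample.

```python
# (partition, first_sector, sector_count, last_sector, mount) copied from the
# prtvtoc sample above.
PARTITIONS = [
    (0, 0, 2663040, 2663039, "/"),
    (1, 2663040, 261440, 2924479, ""),
    (2, 0, 4154160, 4154159, ""),
    (5, 2924480, 205200, 3129679, "/var"),
    (6, 3129680, 1024480, 4154159, "/usr"),
]
SECTORS_PER_CYLINDER = 1520
ACCESSIBLE_CYLINDERS = 2733


def inconsistent_partitions(partitions):
    """Return the partition numbers where first + count - 1 != last."""
    return [
        number
        for number, first, count, last, _mount in partitions
        if first + count - 1 != last
    ]


# An empty list means every row is internally consistent.
print(inconsistent_partitions(PARTITIONS))

# Total accessible sectors; this matches the sector count of partition 2.
print(SECTORS_PER_CYLINDER * ACCESSIBLE_CYLINDERS)
```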
The lslv command example below gives you information about the logical volumes on an AIX system.
OUTPUT of $ lslv hd6
LOGICAL VOLUME:     hd6                    VOLUME GROUP:   rootvg
LV IDENTIFIER:      00004421b56f747b.1     PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               paging                 WRITE VERIFY:   off
MAX LPs:            128                    PP SIZE:        4 megabyte(s)
COPIES:             1                      SCHED POLICY:   parallel
LPs:                8                      PPs:            8
STALE PPs:          0                      BB POLICY:      non-relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    32
MOUNT POINT:        N/A                    LABEL:          None
MIRROR WRITE CONSISTENCY: off
EACH LP COPY ON A SEPARATE PV ?: yes
If a disk is damaged, you can use the printed output of these commands to restore the disk and recover the filesystems to their original state.
Prior to a disaster, copy the information that appears in the Windows NT Disk Administrator window, including the size of the partitions, the formatting methods, and the drive letters the partitions have been assigned to.
The procedures to follow for disaster recovery depend on whether your Oracle Server machine is running UNIX or Windows NT. For more information, see either "Disaster Recovery on UNIX" or "Disaster Recovery on Windows NT" below.
For either UNIX or Windows NT, it is difficult to provide step-by-step disaster recovery instructions since every disaster situation is unique. The examples included in the following sections are designed to give you general principles on how to recover critical data and to help you understand the procedures.
While performing any disaster recovery procedures on your UNIX system, keep in mind the following hardware, operating system, and LSM requirements. Fulfill the requirements that are pertinent to the disaster recovery procedure that you are following.
Use the following list to install and configure your system hardware correctly:
Adhere to the following list when you reinstall the UNIX operating system:
Fulfill the following requirements to ensure that you reinstall LSM successfully. Refer to the Oracle installation guide for your UNIX system for LSM installation instructions.
When a disk with the operating system and LSM binaries has been damaged or completely destroyed, you need to replace the damaged disk and reinstall both the operating system and the LSM software. If the disk was not completely destroyed and the operating system or LSM is still operational, use only those steps in this section that apply to your situation.
Use the following steps to guide you through recovering the operating system and LSM software:
Using the saved disk information, run the appropriate operating system command to make a filesystem for each raw partition that you plan to recover, then mount the block partition. (LSM does not initialize or create filesystems; it recovers data into existing filesystems.)
Use newfs or mkfs, as appropriate for your system. For AIX systems, use SMIT. For HP-UX systems, use mkfs.
You need to fully configure the operating system by recreating any unique configurations that existed before you lost data or experienced a disk crash. If you use a device with a default configuration that is not directly supported by the operating system, you also need to modify the appropriate device configuration files during installation:
If you lost the LSM indexes and resource configuration files that reside in the /nsr directory, you will need to follow the instructions in the next section to recover them.
If the LSM Server indexes and configuration files that reside in the /nsr directory have been destroyed, you will need to use the mmrecov command to recover them.
If the operating system and LSM software were also destroyed, they must be reinstalled prior to recovering the /nsr directory contents. See "Recovering the Operating System and LSM Software".
When you use the mmrecov command to recover the /nsr directory, you actually recover the contents of three important directories: /nsr/index (the client index), /nsr/mm (the media index), and /nsr/res (the resource configuration files).
mmrecov recovers the /nsr/res directory as /nsr/res.R. Later, you must change the directory name to /nsr/res.
The mmrecov command asks you for the bootstrap save set identification number (ssid). If you followed the recommended procedures to prepare for loss of critical data, you have a copy of the bootstrap file as a hardcopy printout with the name of the backup media you need and the bootstrap save set ID (ssid).
In the following example, ssid "17851237" is the most recent bootstrap backup:
Jun 17 22:21 1997 mars's LSM bootstrap information
date     time      level  ssid      file  record  volume
6/14/97  23:46:13  full   17826163    48       0  mars.1
6/15/97  22:45:15  9      17836325    87       0  mars.2
6/16/97  22:50:34  9      17846505   134       0  mars.2
6/17/97  22:20:25  9      17851237    52       0  mars.3
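If the bootstrap report is also kept as a text file, the entry to give mmrecov (the one with the latest date and time) can be selected programmatically. This is a minimal sketch under the assumption that the report has the seven columns shown above; parse_bootstrap_report and latest_bootstrap are hypothetical names, not LSM utilities.

```python
from datetime import datetime


def parse_bootstrap_report(text):
    """Return one dict per data row of a bootstrap report like the one above."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # title line or blank line
        date, time, level, ssid, file_no, record, volume = fields
        try:
            when = datetime.strptime(date + " " + time, "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # column header line
        entries.append({
            "when": when,
            "level": level,
            "ssid": ssid,
            "file": int(file_no),
            "record": int(record),
            "volume": volume,
        })
    return entries


def latest_bootstrap(entries):
    """Pick the entry with the most recent date and time."""
    return max(entries, key=lambda entry: entry["when"])


sample = """\
Jun 17 22:21 1997 mars's LSM bootstrap information
date time level ssid file record volume
6/14/97 23:46:13 full 17826163 48 0 mars.1
6/15/97 22:45:15 9 17836325 87 0 mars.2
6/16/97 22:50:34 9 17846505 134 0 mars.2
6/17/97 22:20:25 9 17851237 52 0 mars.3
"""

newest = latest_bootstrap(parse_bootstrap_report(sample))
print(newest["ssid"], newest["file"], newest["volume"])
```

The three printed values correspond to the save set ID, starting file number, and volume that the mmrecov prompts request.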
If you do not have this information, you can still recover the indexes by finding the ssid using the scanner -B command. See "Bootstrap Save Set ID".
After you locate the bootstrap with the most recent date on your UNIX system, you can run the mmrecov command, supplying the save set ID and file number displayed by the scanner command, to recover the LSM Server indexes and resource configuration.
With the operating system and LSM software in place, recover the indexes and configuration files from the backup media by following these steps:
Run the mmrecov command to extract the contents of the bootstrap backup. Do not run mmrecov from the root directory (/); any other directory is acceptable. For example:
# mmrecov
Doing mmrecov operation as root on the server !!!
mmrecov: Using mars as server
NOTICE: mmrecov is used to recover the NetWorker server's on-line file and media indexes from media (backup tapes or disks) when either of the server's on-line file or media index has been lost or damaged. Note that this command will OVERWRITE the server's existing on-line file and media indexes. mmrecov is not used to recover NetWorker clients' on-line indexes; normal recover procedures may be used for this purpose. See the mmrecov(8) and nsr_crash(8) man pages for more details.
rd=mars:/space1/DISKDEV1
rd=mars:/space1/DISKDEV2
/space1/DISKDEV1
/space1/DISKDEV2
What is the name of the device you plan on using [rd=mars:/space1/DISKDEV1]? /space1/DISKDEV1
Enter the latest bootstrap save set id []: 17851237
Enter starting file number (if known) [0]: 52
Enter starting record number (if known) [0]: 0
Please insert the volume on which save set id 17851237 started into /space1/DISKDEV1. When you have done this, press <RETURN>: [Return]
Scanning /space1/DISKDEV1 for save set 17851237; this may take a while...
scanner: scanning optical disk TestBackup.199 on /space1/DISKDEV1
/nsr/res/nsr.res
/nsr/res/nsrjb.res
scanner: ssid 17851237: scan complete
scanner: ssid 17851237: 44 KB, 11 file(s)
/nsr/res/nsrla.res
/nsr/res/
/nsr/mm/
/nsr/index/mars/
/nsr/index/
/nsr/
/
nsrmmdbasm -r /nsr/mm/mmvolume/
nsrindexasm -r /nsr/index/mars/db/
/space1/DISKDEV1: mount operation in progress
/space1/DISKDEV1: mounted optical disk TestBackup.199 (write protected)
The bootstrap entry in the on-line index for mars has been recovered. The complete index is now being reconstructed from the various partial indexes which were saved during the normal save for this server.
If your resource files were lost, they are now recovered in the 'res.R' directory. Copy or move them to the 'res' directory, after the index has been reconstructed and you have shut down the daemons. Then restart the daemons.
Otherwise, just restart the daemons after the index has been reconstructed.
7 records recovered, 0 discarded.
nsrindexasm: Pursuing index pieces of /nsr/index/mars/db from mars.
Recovering files into their original locations.
nsrindexasm -r ./mars/db/
merging with existing mars index
mars: 2035 records recovered, 0 discarded.
Received 1 matching file(s) from NSR server `mars'
Recover completion time: Tue Jun 24 16:46:38 1997
Cross checking index for client mars to remove duplicate records
The index for `mars' is now fully recovered.
You can use LSM commands such as nsrwatch or nwadmin to watch the progress of the LSM Server during the recovery of the index and configuration files. Open a new window (shell tool) to monitor the recovery so that the mmrecov output does not display on top of the nsrwatch output.
mars# nsrwatch
Tue 16:36:11 server notice: started
Tue 16:36:30 index notice: The client index is missing, recover the index or run nsrck -c
Tue 16:36:30 index notice: completed checking 2 client(s)
Tue 16:36:34 /space1/DISKDEV1 volume TestBackup.199 not found in media index
Tue 16:45:21 /space1/DISKDEV1 mount operation in progress
Tue 16:45:30 /space1/DISKDEV1 mounted optical disk TestBackup.199 (write protected)
Tue 16:45:33 index notice: nsrim has finished cross checking the media db
Tue 16:46:24 index notice: cross-checking index for mars
Tue 16:46:30 /space1/DISKDEV1 mounted optical disk TestBackup.199 (write protected)
Tue 16:46:31 mars:/nsr/index/mars (6/24/97) starting read from TestBackup.199 of 397
Tue 16:46:35 mars:/nsr/index/mars (6/24/97) done reading 397 KB
Tue 16:46:38 index notice: cross-checking index for mars
Unlike the /nsr/index directory, the /nsr/res directory containing the configuration files cannot be reliably overwritten while LSM is running. Therefore, mmrecov recovers the /nsr/res directory as /nsr/res.R.
To complete the recovery of the LSM configuration files:
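Complete these steps after mmrecov has finished and this final message appears:
The on-line index for 'server' is now fully recovered.
Shut down the LSM Server by using the nsr_shutdown command:
# nsr_shutdown
Save the existing res directory as res.orig and rename the recovered res.R directory to res:
# cd /nsr
# mv res res.orig
# mv res.R res
Restart the LSM daemons:
# nsrd
# nsrexecd
Remove the res.orig directory when it is no longer needed:
# rm -r /nsr/res.orig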
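The rename commands above can be wrapped with a few safety checks if you prefer to script them. The following is only a sketch: swap_resource_dir is a hypothetical helper, it assumes the LSM daemons have already been stopped with nsr_shutdown, and it does not restart them or delete res.orig. The demonstration runs in a temporary directory rather than in /nsr.

```python
import tempfile
from pathlib import Path


def swap_resource_dir(nsr_root):
    """Move res to res.orig and res.R to res under nsr_root."""
    root = Path(nsr_root)
    res = root / "res"
    recovered = root / "res.R"
    saved = root / "res.orig"
    if not recovered.is_dir():
        raise FileNotFoundError(f"{recovered} not found; run mmrecov first")
    if saved.exists():
        raise FileExistsError(f"{saved} already exists; move it aside first")
    if res.exists():
        res.rename(saved)  # equivalent to: mv res res.orig
    recovered.rename(res)  # equivalent to: mv res.R res
    return res


# Demonstration against a scratch directory, not a real /nsr tree.
with tempfile.TemporaryDirectory() as scratch:
    (Path(scratch) / "res").mkdir()
    (Path(scratch) / "res.R").mkdir()
    (Path(scratch) / "res.R" / "nsr.res").write_text("recovered")
    swap_resource_dir(scratch)
    print(sorted(entry.name for entry in Path(scratch).iterdir()))
```

Refusing to proceed when res.orig already exists avoids silently discarding a configuration saved during an earlier recovery attempt.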
This section describes how to recover from a crash in which one or more files of an Oracle8 or Oracle8i database were damaged on your UNIX Oracle Server.
The first sign of a disk crash is usually an I/O error, which Oracle typically records in the trace file and in the alert log.
If only one database file is affected, your database may have encountered a bad spot on the disk. This can be circumvented by reformatting the disk to make a new list of bad blocks. However, this can be time-consuming, so it is prudent to have a spare disk available to swap in while you reformat.
If several database files are affected, all on the same disk, you could have a disk controller problem or a disk head crash. A bad controller can be replaced, and data on the disk will often be in perfect shape. But after a head crash, you will need to use that spare disk as a replacement.
Keep an extra disk on hand as a "hot spare," in case a disk failure occurs. Format it and verify that it works. If a disk failure does occur, it is much faster to swap in a spare disk than it is to rename database files and update the control file accordingly.
After you have determined the Oracle data that needs to be recovered, you must first restore the relevant files.
You can restore and recover the Oracle database files by using one of these programs:
For more information about the Oracle Enterprise Manager, see "Using the Oracle Enterprise Manager Backup Manager". For complete details about using RMAN in Oracle8i, refer to the Oracle8i Backup and Recovery Guide or, for Oracle8, the comparable guide.
This section describes the situation where your original LSM machine is beyond repair, so you want to move LSM to a new machine. This procedure assumes that you are not updating the operating system or the LSM software.
If you want to make changes to the operating system or the LSM software, we strongly suggest that you first configure the new machine exactly like the original, using the same version of the operating system and LSM software. After configuring the new machine, make sure the system is operational, perform a couple of successful backups, and then update or upgrade the operating system or the LSM software, one at a time.
To move LSM to a new machine, use the same steps for recovering a primary disk and the LSM indexes and configuration files. See "Recovering the Operating System and LSM Software" and "Recovering LSM Indexes and Configuration Files" for complete information.
However, you should be aware of the following requirements for configuring the software:
After LSM is moved to another machine, you must recover the LSM resource database (nsr.res file) to have the same resource and attribute settings on your new machine as you had on the previous one.
After you successfully move your server, check the following:
Use the savegrp -O command to perform a manual bootstrap backup as soon as possible. See "Manual Bootstrap Backup" for more information.
While performing any disaster recovery procedures on your Windows NT system, keep in mind the following hardware, operating system, and LSM requirements. Fulfill the requirements that are pertinent to the disaster recovery procedure that you are following.
Use the following list to install and configure your system hardware correctly:
Adhere to the following list when you reinstall the Windows NT operating system:
Fulfill the following requirements to ensure that you reinstall LSM successfully. Refer to the Oracle installation guide for your Windows NT system for LSM installation instructions.
When a disk with the operating system and LSM binaries has been damaged or completely destroyed, you need to replace the damaged disk and reinstall both the operating system and the LSM software. If the disk was not completely destroyed and the operating system or LSM is still operational, use only those steps in this section that apply to your situation.
To recover the operating system and LSM software, follow these steps:
You need to fully configure the operating system by recreating any unique configurations that existed before you lost data or experienced a disk crash.
If you lost the LSM indexes and configuration files directory, you will need to follow the instructions in the next section to recover them.
If the LSM indexes and configuration files that reside in the \nsr directory have been destroyed, you need to use the mmrecov command to recover them.
If the operating system and the LSM software were also destroyed, they must be reinstalled prior to recovering the \nsr directory contents. See the preceding section, "Recovering the Operating System and LSM Software".
When you use the mmrecov command to recover the \nsr directory, you actually recover the contents of three important directories: \nsr\index (the client index), \nsr\mm (the media index), and \nsr\res (the resource configuration files).
mmrecov recovers the \nsr\res directory as \nsr\res.R. Later, you must change the directory name to \nsr\res.
The mmrecov command asks you for the bootstrap save set identification number (ssid). If you followed the recommended procedures to prepare for loss of critical data, you have a copy of the bootstrap file as a hardcopy printout with the name of the backup media you need and the bootstrap save set ID (ssid).
In the following example, ssid "1148869870" is the most recent bootstrap backup:
August 20 03:30 1997 LSM bootstrap information Page 1
date     time     level  ssid        file  record  volume
8/08/97  7:44:38  full   1148869706    55       0  mars.004
8/09/97  6:12:09  full   1148869754    48       0  mars.005
8/10/97  6:14:23  full   1148869808    63       0  mars.006
8/11/97  6:29:58  full   1148869870    88       0  mars.006
If you do not have this information, you can still recover the indexes by finding the bootstrap ssid using the scanner -B command. See "Bootstrap Save Set ID".
After you locate the bootstrap with the most recent date on your Windows NT system, you can run the mmrecov command, supplying the save set ID and file number displayed by the scanner command, to recover the LSM Server indexes and resource configuration.
With the operating system and LSM software in place, recover the indexes and configuration files from the backup media:
Run the mmrecov command to extract the contents of the bootstrap save set. For example:
C:\win32app\nsr\bin>mmrecov
mmrecov: Using mars.universe.com as server
NOTICE: mmrecov is used to recover the LSM server's on-line file and media indexes from media (backup tapes or disks) when either of the server's on-line file or media index has been lost or damaged. Note that this command will OVERWRITE the server's existing on-line file and media indexes. mmrecov is not used to recover LSM clients' on-line indexes; normal recover procedures may be used for this purpose.
Enter the latest bootstrap save set id []: 15132
Enter starting file number (if known) [0]: 9
Enter starting record number (if known) [0]:
Please insert the volume on which save set id 15132 started into \\.\Tape0. When you have done this, press <RETURN>:
Scanning \\.\Tape0 for save set 15132; this may take a while..
scanner: scanning 4mm tape mars.universe.com.001 on \\.\Tape0
C:\win32app\nsr\res\nsr.res
C:\win32app\nsr\res\nsrjb.res
C:\win32app\nsr\res\nsrla.res
C:\win32app\nsr\res\
nsrmmdbasm -r C:\win32app\nsr\mm\mmvolume
C:\win32app\nsr\mm\mmvolume: file exists, overwriting
nsrindexasm -r C:\win32app\nsr\index\mars.universe.com\db
C:\win32app\nsr\index\mars.universe.com\
C:\win32app\nsr\index\
C:\win32app\nsr\mm\
C:\win32app\nsr\
C:\win32app\
C:\
scanner: ssid 15132: scan complete
scanner: ssid 15132: 290 KB, 12 files
takin.legato.com: 2247 records recovered, 0 discarded.
Cross checking index for client mars.universe.com to remove duplicate records
The index for 'mars.universe.com' is now fully recovered.
\\.\Tape0: mount operation in progress
\\.\Tape0: verifying label, moving backward 2 files
\\.\Tape0: mounted 4mm tape mars.universe.com.001
The bootstrap entry in the on-line index for mars.universe.com has been recovered.
If your resource files were lost, they are now recovered in the 'res.R' directory. Copy or move them to the 'res' directory, after you have shut down the service. Then restart the service. Otherwise, just recycle the service.
The LSM Server indexes and configuration files should be fully recovered.
Unlike the \nsr\index directory, the \nsr\res directory that contains the configuration files cannot be reliably overwritten while LSM is running. Therefore, mmrecov recovers the \nsr\res directory as \nsr\res.R. To complete the recovery of the LSM configuration files, shut down LSM, rename the recovered \nsr\res.R directory to \nsr\res, and then restart LSM.
When the mmrecov program finishes recovering the indexes and configuration files, it displays this final message:
The on-line index for 'server' is now fully recovered.
Complete these steps after mmrecov completes:
Once you recover the LSM Server's indexes and configuration files, you can recover the Windows NT system Registry by using the NetWorker User program.
The NetWorker User program is provided as part of your LSM installation. You can start the NetWorker User program by selecting the NetWorker User icon from the taskbar or double-clicking the NetWorker User icon from the Program Manager. For more information about how to use the program, see the NetWorker User online help or refer to Appendix D, "Running the NetWorker User Program".
To recover the Windows NT system Registry on your LSM Server, follow these steps:
The system should be restored to its status prior to the disk crash.
This section describes how to recover from a crash in which one or more files of an Oracle database were damaged on your Windows NT Oracle Server.
The first sign of a disk crash is usually an I/O error, which Oracle typically records in the trace file and in the alert log.
If only one database file is affected, your database may have encountered a bad spot on the disk. This can be circumvented by reformatting the disk to make a new list of bad blocks. However, this can be time-consuming, so it is prudent to have a spare disk available to swap in while you reformat.
If several database files are affected, all on the same disk, you could have a disk controller problem or a disk head crash. A bad controller can be replaced, and data on the disk will often be in perfect shape. But after a head crash, you will need to use that spare disk as a replacement.
Keep an extra disk on hand as a "hot spare," in case a disk failure occurs. Format it and verify that it works. If a disk failure does occur, it is much faster to swap in a spare disk than it is to rename database files and update the control file accordingly.
After you have determined the Oracle data that needs to be recovered, you must first restore the relevant files.
You can restore and recover the Oracle database files by using one of these programs:
For more information about the Oracle Enterprise Manager, see "Using the Oracle Enterprise Manager Backup Manager".
To recover an Oracle8i or Oracle8 database using Recovery Manager, please see the Oracle8i Backup and Recovery Guide, or, for Oracle8, the comparable level guide.
This section describes the situation where your original LSM machine is beyond repair, so you want to move LSM to a new machine. This procedure assumes that you are not updating the operating system or the LSM software.
If you want to make changes to the operating system or the LSM software, we strongly suggest that you first configure the new server exactly like the original, using the same version of the operating system and LSM software. After configuring the new server, make sure the system is operational, perform a couple of successful backups, and then update or upgrade the operating system or the LSM software, one at a time.
To move LSM to a new machine, use the same steps for recovering a primary disk and the LSM indexes and configuration files. See "Recovering the Operating System and LSM Software" and "Recovering LSM Indexes and Configuration Files" for complete information.
However, you should be aware of the following requirements for configuring the software:
After LSM is moved to another machine, you must recover the LSM resource database (nsr.res file) to have the same resource and attribute settings on your new machine as you had on the previous one.
After you successfully move your server, check the following:
Use the savegrp -O command to perform a manual bootstrap backup as soon as possible. See "Manual Bootstrap Backup" for more information.
The methods for restoring and recovering Oracle database files are outlined in the Oracle8i Backup and Recovery Guide (or comparable Oracle8 guide) for RMAN.