Type of Resynchronization Used for a Specific Failure Mode [ ALLBASE/Replicate User's Guide ] MPE/iX 5.0 Documentation
ALLBASE/Replicate User's Guide
Type of Resynchronization Used for a Specific Failure Mode
Many circumstances can interrupt the soft resynchronization process.
Planned downtime, network failures, soft failures, hard failures,
processing problems, and data corruption can affect the master, the
slave, or both.
You may recover from the interruption and resume operation without
bringing up a backup system, or you may switch roles so that a slave
becomes a master DBEnvironment while repairs are underway, and then
switch back after repairs are completed.
A variety of recovery processes are available, and the appropriate
recovery process depends on the circumstances causing the failure and the
configuration of the ALLBASE/Replicate environment.
Short, Planned Interruptions of the Slave
For short, planned interruptions of soft resynchronization on the slave,
recovery amounts to restarting the soft resynchronization process. This
assumes that the log files did not wrap around and overwrite log
information for transactions committed on the master that have not yet
committed on the slave. During the interruption, audit log records will
continue to accumulate in the log files on the master DBEnvironment.
When restarted, the slave will tell the master the transaction identifier
for the transaction most recently committed to each partition being
replicated. The master will find the next transaction to send to the
slave, and soft resynchronization will resume where it left off, without
losing any transactions.
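The restart handshake described above can be sketched in Python. This is
only an illustrative model, not ALLBASE/Replicate code: the function name,
the data layout (one ordered list of committed transactions per
partition), and the use of LookupError are all assumptions made for the
example.

```python
from typing import Dict, List, Tuple

# Hypothetical model: for each partition, the master log is an ordered
# list of (transaction_id, payload) pairs for committed transactions.
LogRecords = List[Tuple[int, str]]


def transactions_to_send(master_log: Dict[str, LogRecords],
                         slave_last_committed: Dict[str, int]) -> Dict[str, LogRecords]:
    """Return, per partition, the committed transactions the slave still needs.

    slave_last_committed maps each replicated partition to the identifier
    of the transaction most recently committed on the slave.
    """
    pending: Dict[str, LogRecords] = {}
    for partition, last_id in slave_last_committed.items():
        records = master_log.get(partition, [])
        ids = [txn_id for txn_id, _ in records]
        if last_id not in ids:
            # The master log no longer holds the slave's last transaction
            # (for example, the log wrapped around), so this model cannot
            # resume soft resynchronization for the partition.
            raise LookupError(
                f"partition {partition}: transaction {last_id} not in master log")
        # Resume with the transaction that follows the slave's last commit.
        pending[partition] = records[ids.index(last_id) + 1:]
    return pending
```

In this model, a slave that reports transaction 101 for a partition whose
master log holds 101, 102, and 103 would be sent 102 and 103, and a slave
whose last transaction has already been overwritten triggers the error
path that, in practice, corresponds to needing a hard resynchronization.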
Extended, Planned Interruptions of the Slave
If a planned interruption of the slave will last for an extended period
of time, it may take longer to soft resynchronize the slave than to use
hard resynchronization. In that case, prepare in advance for the hard
resynchronization.
Plan to STORE or UNLOAD the master DBEnvironment in parallel with
whatever work you are doing on the slave. This will save time in
restoring the slave. If STOREONLINE is available, you can leave the
master in service while working on the slave.
If you cannot use STOREONLINE, halt operation of the master
DBEnvironment, or lock the tables that are being unloaded during the
unload process. With a planned interruption, schedule the time the
master will be out of service to minimize inconvenience to the users.
After the image is stored, place the master back in service. Ensure that
there is enough log space on the master to record all new transactions
applied to the DBEnvironment from the time the store is made until the
slave is reactivated. After the image has been restored on the slave,
soft resynchronization can bring the slave up to date with the new
transactions on the master.
If the master is resynchronizing to more than one slave, you may choose
to continue resynchronization between the master and the second slave.
When the first slave is back in service, you may choose to hard
resynchronize the slave that was down, using data from the slave that
remained in service. With this strategy, you could avoid taking the
master DBEnvironment out of service.
Short, Unplanned Interruptions of the Slave
Short, unplanned occurrences such as a short power failure at the slave
site, or a short network outage, can be handled much like short, planned
interruptions. The slave site should coordinate with the master site to
ensure that sufficient log space is available on the master. Check how
far the slave is behind the master to determine if it would save time to
do a hard resynchronization.
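One rough way to make that comparison is to estimate how long soft
resynchronization would need to clear the backlog and compare it with the
expected duration of a hard resynchronization. The Python sketch below is
only an assumption-laden estimate: the function names, the constant-rate
model, and every input value are hypothetical and would have to be
measured at your own site.

```python
from typing import Optional


def soft_catch_up_hours(backlog_txns: float,
                        slave_apply_per_hour: float,
                        master_commit_per_hour: float) -> Optional[float]:
    """Estimate hours for the slave to catch up, or None if it never will.

    Assumes constant rates: the backlog shrinks only by the amount the
    slave applies beyond what the master keeps committing.
    """
    if backlog_txns <= 0:
        return 0.0
    net_rate = slave_apply_per_hour - master_commit_per_hour
    if net_rate <= 0:
        return None  # the slave falls further behind instead of catching up
    return backlog_txns / net_rate


def prefer_hard_resync(backlog_txns: float,
                       slave_apply_per_hour: float,
                       master_commit_per_hour: float,
                       hard_resync_hours: float) -> bool:
    """Return True when the estimate favors hard resynchronization."""
    soft_hours = soft_catch_up_hours(
        backlog_txns, slave_apply_per_hour, master_commit_per_hour)
    return soft_hours is None or soft_hours > hard_resync_hours
```

For example, with a backlog of 120,000 transactions, a slave that applies
50,000 per hour, and a master that commits 20,000 per hour, the estimate
is 4 hours of soft resynchronization, which would favor soft
resynchronization if a hard resynchronization is expected to take 6 hours.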
Extended, Unplanned Interruptions of the Slave
If the slave will be out of service for a significant period of time,
plan for interim processing, and prepare for recovery using hard
resynchronization.
Depending on the use of the slave, there may be additional issues. If
the slave serves primarily as a backup for the master, then the major
issue is getting the slave back in service and resynchronized.
However, if the slave is being used to offload read-only processing from
the master, decide whether to halt read-only processing altogether until
the slave is back in service, or to switch the read-only applications to
read from the master until the slave is online.
Failure of the Slave to Keep Up with the Master
Under certain conditions, transactions may be committed on the master
very quickly, and yet much more slowly on the slave. This may happen if
most operations on the master are bulk operations, such as updates that
operate on many rows at a time. Bulk updates on the master translate
into row-by-row operations on the slave. Because the row-by-row
operations are much slower than the bulk operations, the slave can fall
behind the master. If the slave falls too far behind the master, a hard
resynchronization may become necessary to catch the slave up to the
master.
Insufficient network capacity may also slow the transmission of records
from the master to the slave. Until network capacity can be increased, a
hard resynchronization may be necessary to bring the slave up to date
with the master.
Short, Planned Interruptions of the Master
If a master must be available almost continuously, then consider
switching roles between the master and the slave during short, planned
interruptions. Let the slave act as the master while the necessary work
is being done on the old master.
There is no such thing as an instantaneous switchover from a master to a
slave. It takes some time to ensure that all transactions have been
transferred. It also takes time to shut down applications executing
against the master and reexecute them against the slave.
When the master is brought back into service, decide whether it can be
soft resynchronized from the slave or must be hard resynchronized. It is
important that the old master is brought back into synchronization with
the transactions applied to the slave (while it served as master), before
the old master resumes its master role. If you do not update the old
master first, the master will report an error that the slave is ahead of
the master when you try to restart soft resynchronization from the master
to the slave, and soft resynchronization will abort.
If the master can be out of service for a short time, you can continue
operating the slave without switching roles between the master and the
slave. Be sure to shut the master down cleanly. Stop all write activity
against the system to prevent additional new transactions from being
entered on the master. Ensure that all transactions committed on the
master were replicated on the slave before completely shutting down the
master. When the master is brought up, restart the slave and master
resynchronization applications, and the process will pick up where it
left off without losing any transactions from the master.
You can use two of the ALLBASE/Replicate resynchronization options,
enabled through the use of environment variables, to tell you when the
committed transactions have been completely transmitted to the slave.
The RESYNCstoplog option, set to 1, tells the master application to stop
the resynchronization process when it runs out of committed transactions
to send to the slave. The RESYNCrptnotrx option, set to 1, tells the
master application to print a message when it is unable to find any
transactions to send to the slave. To use these options, stop the
resynchronization applications for a moment using BREAK or ABORTJOB. Set
the appropriate environment variables, then restart the resynchronization
applications. When there are no more
transactions on the master for the slave, the applications will stop and
send a console message that there are no more transactions to send.
(Refer to Tables 3-1 and 3-2 in chapter 3 for further explanation of the
environment variables.)
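The effect of the two options can be pictured with a small Python model
of a master-side polling loop. This is a sketch for intuition only and is
not how the product is implemented: the function, its parameters, the
polling limit, and the message text are all hypothetical. Only the
behavior attributed to RESYNCstoplog and RESYNCrptnotrx comes from the
description above.

```python
from collections import deque
from typing import Iterable, List, Tuple

# Hypothetical message text; the real console message is not reproduced here.
NO_TRX_MESSAGE = "no transactions to send"


def run_master_loop(pending: Iterable[int],
                    stop_when_empty: bool,
                    report_when_empty: bool,
                    max_polls: int = 10) -> Tuple[List[int], List[str], bool]:
    """Model a master application polling for committed transactions.

    stop_when_empty   models RESYNCstoplog set to 1.
    report_when_empty models RESYNCrptnotrx set to 1.
    Returns (transactions sent, messages printed, whether the loop stopped).
    """
    queue = deque(pending)
    sent: List[int] = []
    messages: List[str] = []
    for _ in range(max_polls):
        if queue:
            sent.append(queue.popleft())  # send the next committed transaction
            continue
        if report_when_empty:
            messages.append(NO_TRX_MESSAGE)
        if stop_when_empty:
            return sent, messages, True   # nothing left to send: stop
    return sent, messages, False          # still polling when the model ends
```

With both options on, the model sends every pending transaction, records
one message, and stops, which is the signal that it is safe to proceed
with the planned shutdown of the master.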
Extended, Planned Interruptions of the Master
The same considerations mentioned for short, planned interruptions of the
master apply here. For extended interruptions, you will probably want to
switch roles and let the slave become the master. Shut the master down
cleanly and let all transactions committed on the master be replicated on
the slave with soft resynchronization. However, a hard resynchronization
of the master from the slave will probably be necessary if the master is
down for an extended period of time.
Short, Unplanned Interruptions of the Master
For short, unplanned interruptions of the master, such as a short power
failure, transactions in process against the master from connected user
applications will be rolled back if they did not commit before the
interruption.
You may choose to bring up the master without the lost transactions, and
re-establish soft resynchronization with the slave. After the lost
transactions are identified and applied to the master, they will be
automatically applied to the slave.
Extended, Unplanned Interruptions of the Master
Usually extended, unplanned interruptions on the master are the result of
equipment failure. You can switch roles so that the slave assumes the
master role during repairs. Be sure that all the transactions committed
on the master have been replicated on the slave before the slave assumes
the master role.
The failure may have caused the master to shut down before all its
committed transactions have been replicated on the slave. If possible,
access the log files containing the unreplicated committed transactions
on the master and replicate them to the slave, before it becomes the
master.
If the log files are still accessible, you may be able to use the SQLUtil
WRAPDBE command to create a temporary "WRAPPER" DBEnvironment for
associating the log files. Then use soft resynchronization to transfer
the unreplicated transactions from the WRAPPER DBE to the slave. This is
a good reason for using archive logging, dual logging, and/or mirrored
disks to ensure that at least one set of log files will be usable. See
more about WRAPPER DBEnvironments later in this chapter.
If the log files are not available, either manually reapply the lost
transactions before the slave assumes the master role, or agree to
operate with missing transactions until you can make corrections later.
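The choices in this section can be summarized as a small decision
function. The Python sketch below is only a restatement of the text under
simplifying assumptions: the enum, the function name, and the three yes/no
inputs are hypothetical, and a real decision will also depend on your
configuration and business requirements.

```python
from enum import Enum


class PromotionPlan(Enum):
    SWITCH_ROLES = "slave already has every committed transaction; switch roles"
    REPLAY_VIA_WRAPPER = "replicate unreplicated transactions from a WRAPPER DBE first"
    REAPPLY_MANUALLY = "manually reapply the lost transactions, then switch roles"
    ACCEPT_MISSING = "switch roles now and correct missing transactions later"


def plan_slave_promotion(all_committed_replicated: bool,
                         master_logs_accessible: bool,
                         can_reapply_manually: bool) -> PromotionPlan:
    """Pick a plan for promoting the slave after an extended master failure."""
    if all_committed_replicated:
        return PromotionPlan.SWITCH_ROLES
    if master_logs_accessible:
        # Log files survive: recover the unreplicated transactions from them.
        return PromotionPlan.REPLAY_VIA_WRAPPER
    if can_reapply_manually:
        return PromotionPlan.REAPPLY_MANUALLY
    return PromotionPlan.ACCEPT_MISSING
```

The ordering of the checks reflects the preference expressed above:
recover transactions from the master log files whenever they are still
usable, and fall back to manual correction only when they are not.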
Protecting Transactions in the Master Log Files from Being Overwritten
When soft resynchronization is interrupted, master log records for
transactions not yet committed on the slave can be overwritten. If the
log records are overwritten, you will be forced to do a hard
resynchronization instead of a soft resynchronization. How this can
occur, and how to protect against it, are discussed below.
During normal soft resynchronization, a lock is placed in the master log
files that prevents the overwriting of master log records not yet
replicated. However, when the soft resynchronization process is halted
or aborts, the master log lock is lost.
Non-archive log files are reused after a transaction is committed, and
archive log files can be reused after their contents are stored.
Therefore, there is the possibility that log records for transactions on
the master, not yet committed on the slave, can be overwritten when the
log files are reused.
To protect against overwriting the log files, ensure that there is enough
room in the log files to contain all transactions generated on the master
during the period of the outage. You may need to use the SQLUtil ADDLOG
statement to add enough log file space to guarantee the needed room.
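A back-of-the-envelope sizing calculation can help decide how much log
space to add. The Python sketch below is an assumption-based estimate, not
a formula from ALLBASE/Replicate: the function name, the average log bytes
per transaction, and the safety factor are hypothetical values that you
would replace with measurements from your own master DBEnvironment.

```python
import math


def additional_log_bytes_needed(txns_per_hour: float,
                                avg_log_bytes_per_txn: float,
                                outage_hours: float,
                                free_log_bytes: float,
                                safety_factor: float = 1.5) -> int:
    """Estimate extra log space (in bytes) needed to survive an outage.

    Expected log volume is rate x record size x duration, padded by a
    safety factor; the result is the shortfall against free log space.
    """
    expected = (txns_per_hour * avg_log_bytes_per_txn
                * outage_hours * safety_factor)
    return max(0, math.ceil(expected - free_log_bytes))
```

For instance, 20,000 transactions per hour at an assumed 512 bytes of log
per transaction over an 8-hour outage, with a 1.5 safety factor, comes to
122,880,000 bytes. With 100,000,000 bytes free, this estimate suggests
adding at least 22,880,000 bytes of log space.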
You cannot protect against overwriting the log files by manually placing
a lock in the master log files after the slave fails (using the SQLAudit
GET AUDITPOINT or LOCK AUDITPOINT command). When the slave application
fails, the log lock placed by the master on its log files is terminated.
Log files needed for soft resynchronization can then be overwritten. By
the time you notify the master location that the slave has failed, it may
be too late to manually place a lock on the log files.
It is important to monitor master log space consumption while the slave
system is out of use to avoid these problems. If you are using archive
logging, you may associate a WRAPPER DBEnvironment with the stored log
files to retrieve the transactions.