COMMUNICATOR 3000 MPE MPE/iX RELEASE 4.0
Software Resiliency Changes
by Donna Gracyk
Commercial Systems Division
As part of a continuing effort to improve the availability of the MPE/iX
operating system and to make the system more resilient to software and
hardware failures, this release of MPE/iX includes software resiliency
changes in two areas of the operating system: KSAM XL and volume
management. This article discusses both.
KSAM XL
KSAM XL now handles the detection of fatal errors more gracefully (for
example, detecting a bad pointer in a corrupt key index structure).
Instead of aborting the system when an unexpected error is encountered
while opening a KSAM file or during an FGETKEYINFO call, KSAM XL sets a
field in the KSAM control block, an internal data structure that is part
of the KSAM XL file itself, to indicate that an internal error was
detected.
For recovery purposes, KSAM XL allows you to read from a KSAM XL file
even when the internal error field is set, although the read attempt may
not succeed. To prevent file corruption (or further corruption if the
file is already corrupt), KSAM XL does not allow writes to the file while
the internal error field is set. Any attempt to write to a file with an
internal error results in the following file system error:
KSAM INTERNAL ERROR (FSERR 175)
Because a file system error is now returned in some cases that previously
aborted the system, application programs should always check the
condition code returned from intrinsic calls. For more information on
recovering a corrupt KSAM XL file, refer to the "Recovering from Index
Corruption" section in the Using KSAM XL Reference Manual (32650-90168).
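The read and write behavior described in this section can be illustrated
with a small Python model. This is only a sketch for reasoning about
application logic, not KSAM XL code or an MPE/iX interface: the class
KsamFileModel and its methods are hypothetical, and only the error number
175 comes from this article. The point it shows is that a write can now
return an error status that the caller must check.

```python
FSERR_OK = 0                      # assumed "no error" value for this model
FSERR_KSAM_INTERNAL_ERROR = 175   # KSAM INTERNAL ERROR (from the article)


class KsamFileModel:
    """Toy model of a KSAM XL file (hypothetical, for illustration only)."""

    def __init__(self):
        self.records = {}
        # Models the internal error field kept in the KSAM control block.
        self.internal_error = False

    def mark_internal_error(self):
        """Simulate KSAM XL detecting an unexpected internal error."""
        self.internal_error = True

    def read(self, key):
        """Reads stay allowed when the field is set; None means not found."""
        return self.records.get(key)

    def write(self, key, value):
        """Return a status code; writes are refused while the field is set."""
        if self.internal_error:
            return FSERR_KSAM_INTERNAL_ERROR
        self.records[key] = value
        return FSERR_OK


def save_record(ksam_file, key, value):
    """Caller-side pattern: always check the status returned by a write."""
    status = ksam_file.write(key, value)
    if status != FSERR_OK:
        return f"write refused (FSERR {status}); recover the file first"
    return "written"
```

In this model, a caller that ignores the returned status would silently
lose the record, which is why checking the result of every call matters.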
VOLUME MANAGEMENT
Changes have also been made in the area of volume management to make the
system more resilient to fatal errors detected when mounting a user
volume. MPE/iX now detects more of the error conditions that previously
brought the system down during a user volume mount. Instead of aborting
the system, it displays the following error messages on the console:
*** ERROR MOUNTING VOLUME ***
COULD NOT MOUNT VOLUME ON LDEV x. INFO xxxx; SUBSYS xxxx.
VOLUME WILL BE MOUNTED AS ERROR VOLUME.
An example of an error that previously caused a system abort is
corruption in the disk free space map, caused by a disk hardware problem
and detected during the volume mount.
The output from a DSTAT command lets you know if an error was detected
during the mount of a user volume. The STATUS column contains the status
ERROR-MOUNT as shown below:
SYSA (PUB.SYS): DSTAT ALL

LDEV-TYPE    STATUS       VOLUME (VOLUME SET - GEN)
-----------  ---------    ---------------------------
 1- 079371   MASTER       MEMBER1   (MPEXL_SYSTEM_VOLUME_SET-0)
 2- 079371   MEMBER       MEMBER2   (MPEXL_SYSTEM_VOLUME_SET-0)
 3- 079371   MEMBER       MEMBER3   (MPEXL_SYSTEM_VOLUME_SET-0)
 4- 079371   MEMBER       MEMBER4   (MPEXL_SYSTEM_VOLUME_SET-0)
 5- 079371   MEMBER       MEMBER5   (MPEXL_SYSTEM_VOLUME_SET-0)
35- 079370   MEMBER       MEMBER35  (MPEXL_SYSTEM_VOLUME_SET-0)
50- 022040   MASTER       MEMBER1   (DEPTB_VS-0)
51- 022040   ERROR-MOUNT  MEMBER2   (DEPTB_VS-0)
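
If DSTAT output is captured to a text file for monitoring, a short script
can pick out the volumes that were mounted as error volumes. The Python
sketch below assumes the row layout seen in the listing above (LDEV, type,
status, volume name, then the volume set and generation in parentheses).
The function and field names are hypothetical, and rows that do not fit
this layout are skipped rather than interpreted.

```python
import re
from typing import NamedTuple


class VolumeEntry(NamedTuple):
    ldev: int
    dev_type: str
    status: str
    volume: str
    volume_set: str
    generation: int


# One data row, e.g. "51- 022040 ERROR-MOUNT MEMBER2 (DEPTB_VS-0)".
# The layout is inferred from the sample listing, not from a format spec.
_ROW = re.compile(
    r"^\s*(\d+)-\s*(\S+)\s+(\S+)\s+(\S+)\s+\((.+)-(\d+)\)\s*$"
)


def parse_dstat(text):
    """Return a VolumeEntry for every line that looks like a DSTAT data row."""
    entries = []
    for line in text.splitlines():
        match = _ROW.match(line)
        if match is None:
            continue  # prompt, header, separator, or unrecognized row
        ldev, dev_type, status, volume, volume_set, gen = match.groups()
        entries.append(
            VolumeEntry(int(ldev), dev_type, status, volume, volume_set, int(gen))
        )
    return entries


def error_mounts(text):
    """Return only the entries whose STATUS column is ERROR-MOUNT."""
    return [e for e in parse_dstat(text) if e.status == "ERROR-MOUNT"]
```

Applied to the listing above, error_mounts would report a single entry:
LDEV 51, volume MEMBER2 of volume set DEPTB_VS.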