0. Introduction
Sometimes, the computer "dies"... so this note discusses system failures,
system hangs, memory dumps, subsystem numbers, and interpreting a system
abort number.
Sometimes, the system is alive ... so the free speedometer is discussed.
There are two basic kinds of system
failure that an MPE/iX (or MPE XL) user will encounter: a "System
Failure" and a "system hang". The former is easily identified by the
"System Failure" message that appears on the hardware console (ldev 20).
The latter is typified by users complaining that the machine is "hung".
Each will be discussed below.
1. System Failure
A System Failure reports the following information on the hardware console:
SYSTEM ABORT 504 FROM SUBSYSTEM 143
SECONDARY STATUS: INFO = -34, SUBSYS = 107
SYSTEM HALT 7, $01F8
Additionally, the hex display (enabled on the console by typing control-B)
displays something like:
B007 0101 02F8 DEAD
Note that the "504" and "$01F8" above are the same value, shown in decimal
and in hex. Further, the hex display shows "0101" and "02F8". These two
numbers are reporting the following:
0101 02F8
The last two hex digits of each number ("01" and "F8") are packets 1 and 2
of the System Abort number (01F8). The first two hex digits of each 4-digit
hex number ("01" and "02" above) are "packet numbers".
Note: if the System Abort number is in the range 0 to $FF (decimal 255),
only one packet will be needed to represent it, and no packet 2 will be shown.
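The packet layout described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name decode_abort_packets is hypothetical, packet 1 is taken to hold the high-order byte (as in the 0101 02F8 example), and a one-packet display is assumed to carry the whole value in its low byte.

```python
def decode_abort_packets(words):
    """Combine hex-display packet words into a System Abort number.

    Assumption: each 4-digit word is <packet number><data byte>, and
    packets arrive in order starting at 1, most significant byte first.
    """
    value = 0
    for expected, word in enumerate(words, start=1):
        number = int(word, 16)
        packet = number >> 8   # top two hex digits: packet number
        data = number & 0xFF   # bottom two hex digits: one byte of the value
        if packet != expected:
            raise ValueError(f"expected packet {expected}, got {packet} in {word!r}")
        value = (value << 8) | data
    return value


abort = decode_abort_packets(["0101", "02F8"])
print(abort, f"${abort:04X}")  # 504 $01F8
```

The decimal result is the number to pass to the "errmsg" lookup described in section 1.1.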
1.1 Interpreting the System Failure Number
The System Failure number (504 in the example above) can be converted to
"English" by doing the following on a live MPE/iX machine:
:hello manager.sys
(or any user with PM capability)
:debug
= errmsg (#504, #98)
(the "#" signs are required!)
'Prefetch of needed data for a READ/WRITE request could not be made.'
c
The above "= errmsg" command looks up the System Failure number (message #504)
in the system error catalog (set #98...a magic number). This catalog is
not complete, and some System Failure numbers are not in the catalog.
1.2 Interpreting the Subsystem Number
If the System Failure reported a subsystem (143 in the example above), the
following might convert it to a subsystem name:
:debug
= errmsg (#32765, #143)
c
Note: In the above example, the "#" signs are required, and the 32765 is a
"magic" number.
Here are two examples, one which succeeds, and one which fails:
:debug
= errmsg (#32765, #143)
'File System'
= errmsg (#32765, #129)
'External error - subsys: #129 info: #32765'
If the above doesn't produce a useful string, you can
try two other approaches, both of which use the appropriate SYMOS file
after loading the DAT macros.
The first uses a macro called "subsysstr", which knows
about 30 hand-coded subsystem numbers, and also knows how to use the
"errmsg" function (shown above):
:debug
use datinit.dat.telesup
macstart , '1'
= subsysstr (#129)
'7978 Tape Device Mgr'
If that doesn't work, then you probably are looking at a relatively unusual
subsystem number. The slowest, but most reliable, method of translating a
subsystem number into something is to search the SYMOS for a constant of
the form SUBSYS_xxxxx. Here's an example, using subsystem number #129.
:debug
use datinit.dat.telesup
macstart , '1'
env filter '129'
set dec /* important, because "129" is decimal, not hex */
symlist subsys@ ,,c
SUBSYS_7978_DM CONST INTEGER #129
SUBSYS_TAPE CONST INTEGER #129
env filter ''
In the above example, two lines matched the filter,
showing that subsystem #129 is either "SUBSYS_TAPE" or "SUBSYS_7978_DM".
Since a 7978 is a tape drive, I'd suspect that SUBSYS_7978_DM is the
most likely "answer" to the "what is subsystem 129" question. (I also
submitted a bug report to HP: no two *different* SUBSYS constants should
ever have the same value!)
An optional step to dramatically improve the
performance of the SYMLIST command is to prefetch the SYMOS file into
memory. An example is:
:fetch symos.osb79.telesup
The SYMOS file you should fetch is the one that was
opened by the MACSTART command above. You can see which one this is by
doing a SYMINFO command:
symf
1.3 Interpreting the Secondary Status Number
The Secondary Status line may provide some additional information about the
System Failure, if the INFO and SUBSYS values are not 0.
Take the two numbers (in the above example, INFO = -34, and SUBSYS = 107), and
use the "errmsg" function as follows:
:debug
=errmsg (-#34, #107) /* "#"s are necessary */
'The length specified was beyond the bounds of the specified object.'
c
Not all Secondary Status messages are in the catalog. If you had tried one that
is not, you would see:
:debug
= errmsg (-#51, #107)
'External error - subsys: #107 info: #51'
Note: I recommend submitting a bug report to HP for any System Failure or
Secondary Status values that are not in the catalog!
1.4 Other System Halts
The most common type of system failure is a deliberate
call to an internal MPE routine called system_abort. When this kind of
system failure occurs, the three line message shown above is printed.
Note the third line, that said "SYSTEM HALT 7, $01F8". The "7" means that
system_abort was called. At least seven other kinds of system halts are
defined (SYSTEM HALT 0 through SYSTEM HALT 6).
The SYSTEM HALTS 0..6 represent system failures for problems other than
system_abort, and usually reflect a problem "lower" in the operating system
(e.g.: in the interrupt handling code).
SYSTEM HALTS 1..7 should produce a multi-line printout on the console.
SYSTEM HALT 0 does not.
If the console output is missing, or corrupted, you can determine the type of
SYSTEM HALT that occurred by looking at the hex display.
You can think of the hex display as presenting a series
of 16-bit numbers (4 hex digits) in a sequence. The sequence is
repeated over and over, with a pause of about 1/2 second between each number.
The last number in the sequence is usually $DEAD. The first number is usually
of the form $Bnxx. The "xx" portion (the bottom two hex digits) reports the
type of SYSTEM HALT that occurred.
In the example at the start of this note, the hex display is showing:
B007 0101 02F8 DEAD
The "07" means: SYSTEM HALT 7 (system_abort was called).
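As a rough illustration of reading the hex display sequence, the following Python sketch pulls the SYSTEM HALT type out of the first word. The function name parse_halt_display is hypothetical, and the sketch assumes the "usual" layout described above: a leading word of the form Bnxx and an optional trailing DEAD marker.

```python
def parse_halt_display(sequence):
    """Split a hex-display sequence into (halt type, remaining words).

    Assumption: the first word has the form Bnxx and a trailing DEAD
    marker may be present, as the note says is usually the case.
    """
    words = sequence.upper().split()
    if not words or len(words[0]) != 4 or words[0][0] != "B":
        raise ValueError("first word is not of the form Bnxx")
    halt_type = int(words[0][2:], 16)  # bottom two hex digits
    rest = words[1:]
    if rest and rest[-1] == "DEAD":
        rest = rest[:-1]               # drop the end-of-sequence marker
    return halt_type, rest


print(parse_halt_display("B007 0101 02F8 DEAD"))  # (7, ['0101', '02F8'])
```

For SYSTEM HALT 7, the remaining words are the System Abort packets shown in section 1.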
2. System Hangs
Sometimes, the system seems to "hang", and little or no
response is seen by the users. When this happens, it is important to
characterize what is hung, and what isn't. The following questions
should be asked before stopping the machine and taking a memory dump:
- What does the hex display show? (See: Speedometer in section 4 below)
- Does any terminal get a response from the Command Interpreter?
(If a terminal is sitting with a ":" prompt, hit return. Does another ":"
prompt come out?)
- Is the hardware console (ldev 20) hung?
- If a terminal can be found that is working, does a :SHOWPROC command hang
the terminal?
- Does a control-A at the hardware console (ldev 20) result in an "="
prompt?
- Are the disc drives active?
The answers to these questions will aid the person who analyzes the dump.
3. Dump Loading
Once a memory dump has been taken, and the system
rebooted, you will probably want to load the dump for analysis. The
following steps should be done:
- Logon as MGR.TELESUP, DUMPS
(Note: if the DUMPS group does not exist, logon as MGR.TELESUP and do
: NEWGROUP DUMPS, and then CHGROUP DUMPS)
- Enter: DAT.DAT
This will run the DAT (Dump Analysis Tool) program.
- Enter: GETDUMP FOO
"FOO" will be the name of the dump. This name must begin with a letter,
and be 1 to 5 letters and/or digits long. One recommendation is to call
the dump S#### where #### is the System Abort number (e.g.: S0504).
DAT will request a tape whose formal name is DUMPTAPE (this may be file
equated before running DAT, if necessary).
- REPLY to the tape request.
DAT will read the first few records of the tape, and report how much disc
storage will be required to hold the dump. DAT will then allocate all of
the necessary disc storage "up front", before reading the rest of the
tape.
If DAT is able to allocate enough disc space, and if the dump is on a
single tape (or DDS), you can now walk away for a while.
*** Please do the next two steps even if you think you don't want to
analyze the dump yourself! It saves 5 to 15 minutes for the next person
who analyzes the dump!
- Enter: MACSTART "FOO", "1"
This will tell DAT that you want to start analyzing the dump. The "FOO"
(in quotes!) is the name of the dump you used on the earlier GETDUMP
command (which didn't use quotes!). The extra "1" tells DAT that you are
only interested in "macros" for the operating system.
If this process encounters a few errors, please do a screen capture
(e.g.: PSCREEN) so we can analyze them later.
- Enter: PROCESS_WAIT ; UI_SHOWJOB
These two commands may take up to 15 minutes to run.
- Enter: EXIT
You have now loaded a dump, FOO, and "prepared" it. If you want to send the dump
to anyone for analysis, use STORE to store it as follows:
STORE FOO@
The "@" is important, because the dump is actually
stored on disc as FOOMEM and FOOVAR, where "FOO" is the name you picked
for the dump. Someday, dumps may be stored as even more files (e.g.:
FOO001, FOOMEM, FOO002, FOOVAR), so the "@" will always be needed.
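The naming rules above can be checked mechanically. The Python sketch below is an illustration only: the helper names dump_name_for_abort and dump_files are hypothetical, and it assumes that "1 to 5 letters and/or digits" means the whole name (including the leading letter) is at most 5 characters.

```python
import re

# Rule from the steps above: begins with a letter, and (by assumption)
# is 1 to 5 letters/digits long in total.
DUMP_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]{0,4}")


def dump_name_for_abort(abort_number):
    """Build the recommended S#### name from a decimal System Abort number."""
    name = f"S{abort_number:04d}"
    if not DUMP_NAME.fullmatch(name):
        raise ValueError(f"{name!r} does not fit the dump naming rule")
    return name


def dump_files(name):
    """List the disc files the note says currently make up a loaded dump."""
    if not DUMP_NAME.fullmatch(name):
        raise ValueError(f"{name!r} is not a valid dump name")
    return [name.upper() + "MEM", name.upper() + "VAR"]


print(dump_name_for_abort(504))  # S0504
print(dump_files("FOO"))         # ['FOOMEM', 'FOOVAR']
```

Because future versions might add more files, the STORE command should still use the "@" wildcard rather than a fixed list like this one.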
4. Speedometer
The HP 3000, running MPE/iX (or MPE XL) has a free "speedometer", which tells
us how busy the computer is.
When the system is alive, the hex display on the
hardware console functions as a speedometer, reporting how busy the
system is. (Remember: some machines have LED hex displays, and all have
the ability to put the hex display on the status line of the hardware
console, when control-B is hit.)
The speedometer will typically cycle between two values: FxFF and FFFF.
Ignore the FFFF value.
The "x" digit in the FxFF value reports what percentage busy the CPU is.
The number should be multiplied by 10 to obtain the percentage.
Examples:
F4FF means: 40% busy.
FAFF means: 100% busy ("A" is the hex value for decimal 10).
F0FF means: idle (0% busy).
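The speedometer arithmetic can be written as a short Python sketch. The function name cpu_busy_percent is hypothetical, and returning None for the FFFF value is just one way of expressing "ignore it".

```python
def cpu_busy_percent(word):
    """Convert a speedometer value of the form FxFF to percent busy.

    Returns None for FFFF, which the note says to ignore.
    """
    w = word.strip().upper()
    if len(w) != 4 or w[0] != "F" or w[2:] != "FF":
        raise ValueError(f"{word!r} is not of the form FxFF")
    if w == "FFFF":
        return None
    digit = int(w[1], 16)  # the "x" digit, 0 through A
    if digit > 10:
        raise ValueError(f"unexpected busy digit in {word!r}")
    return digit * 10


for value in ["F4FF", "FAFF", "F0FF", "FFFF"]:
    print(value, cpu_busy_percent(value))
# F4FF 40
# FAFF 100
# F0FF 0
# FFFF None
```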
Note: on newer HP 3000s, you will have to interact with the GSP (Guardian
Service Processor) to see the speedometer. A typical scenario is:
- Connect to the GSP (press control-B at the hardware console, or telnet to
the GSP port, or use a browser and logon to the GSP via a Secure Web
Server);
- Login to the GSP (often just by pressing <return> twice);
- Get to the Virtual Front Panel by entering: VFP <return>
- If asked, say "Yes" to the "Proceed with Live Mode of VFP? (Y/[N]) y"
question;
- Watch for a few updates:
unknown, no source stated legacy PA HEX chassis-code FFFF
unknown, no source stated legacy PA HEX chassis-code F0FF
unknown, no source stated legacy PA HEX chassis-code FFFF
unknown, no source stated legacy PA HEX chassis-code F0FF
(The above says: F0FF, which is 0% busy)
- Exit the VFP by typing "q": q
- (optional) exit the GSP by typing "co": co