Copyright (c) 1995 Allegro Consultants, Inc.
0. Introduction
Sometimes, the computer "dies"... so this note discusses
system failures,
system hangs,
memory dumps,
subsystem numbers,
and interpreting a system abort number.
Sometimes, the system is alive ... so the free
speedometer is discussed.
There are two basic kinds of system failure that an MPE/iX (or
MPE XL) user will encounter: a "System Failure" and
a "system hang". The former is easily identified by the
"System Failure" message that appears on the hardware
console (ldev 20). The latter is typified by users complaining
that the machine is "hung". Each will be discussed below.
1. System Failure
A System Failure reports the following information on the hardware
console:
SYSTEM ABORT 504 FROM SUBSYSTEM 143
SECONDARY STATUS: INFO = -34, SUBSYS = 107
SYSTEM HALT 7, $01F8
Additionally, the hex display (enabled on the console by typing
control-B) displays something like:
B007 0101 02F8 DEAD
Note that the "504" and "$1F8"
above are the same value, shown in decimal and in hex. Further,
the hex display shows "0101" and "02F8".
These two numbers are reporting the following:
0101 02F8
The last two hex digits of each number (01 and F8) are packets 1 and 2
of the System Abort number (01F8). The first two hex digits
(01 and 02 above)
of each 4-digit hex number are the "packet numbers".
Note: if the System Abort number is in the range 0 to $FF (decimal
255), only one packet will be needed to represent it,
and no packet 2 will be shown.
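The packet layout described above can be sketched in a few lines of Python. This is only an illustration, not an HP tool: the function name decode_abort_number is made up here, and the code assumes that packet 1 carries the high byte when two packets are present, and that a lone packet carries the whole value, as the example and the note suggest.

```python
def decode_abort_number(display: str) -> int:
    """Reassemble the System Abort number from the hex display words.

    Assumption: each packet word is 'NNDD', where NN is the packet
    number and DD is one byte of the abort number, high byte first.
    """
    packets = {}
    for word in display.split():
        word = word.lstrip("$").upper()
        if word == "DEAD" or word.startswith("B"):
            continue  # skip the end marker and the $Bnxx halt-type word
        packets[int(word[:2], 16)] = int(word[2:], 16)

    value = 0
    for number in sorted(packets):  # packet 1 first, then packet 2
        value = (value << 8) | packets[number]
    return value


abort = decode_abort_number("B007 0101 02F8 DEAD")
print(abort, f"${abort:04X}")  # prints: 504 $01F8
```

The decimal result is the number to feed to the "errmsg" lookup in section 1.1.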
1.1 Interpreting the System Failure Number
The System Failure number (504 in the example above)
can be converted to "English" by doing the following
on a live MPE/iX machine:
:hello manager.sys (or any user with PM capability)
:debug
= errmsg (#504, #98)
(the "#" signs are required!)
'Prefetch of needed data for a READ/WRITE request could not be made.'
c
The above "= errmsg" command looks up the System Failure
number (message #504) in the system error catalog (set #98...a magic number).
This catalog is not complete, and some System Failure numbers
are not in the catalog.
1.2 Interpreting the Subsystem Number
If the System Failure reported a subsystem (143 in the example
above), the following might convert it to a subsystem name:
:debug
= errmsg (#32765, #143)
c
Note: In the above example, the "#" signs are required,
and the 32765 is a "magic" number.
Here are two examples, one which succeeds, and one which fails:
:debug
= errmsg (#32765, #143)
'File System'
= errmsg (#32765, #129)
'External error - subsys: #129 info: #32765'
If the above doesn't produce a useful string, you can try two other
approaches, both of which use the appropriate SYMOS file after
loading the DAT macros.
The first uses a macro called "subsysstr", which knows about 30
hand-coded subsystem numbers, and also knows how to use the "errmsg"
function (shown above):
:debug
use datinit.dat.telesup
macstart , '1'
= subsysstr (#129)
'7978 Tape Device Mgr'
If that doesn't work, then you probably are looking at a relatively unusual
subsystem number. The slowest, but most reliable, method of translating
a subsystem number into something is to search the SYMOS for
a constant of the form SUBSYS_xxxxx. Here's an example, using
subsystem number #129.
:debug
use datinit.dat.telesup
macstart , '1'
env filter '129'
set dec /* important, because "129" is decimal, not hex */
symlist subsys@ ,,c
SUBSYS_7978_DM CONST INTEGER #129
SUBSYS_TAPE CONST INTEGER #129
env filter ''
In the above example, two lines matched the filter, showing that
subsystem #129 is either "SUBSYS_TAPE" or "SUBSYS_7978_DM". Since
a 7978 is a tape drive, I'd suspect that SUBSYS_7978_DM is the most
likely "answer" to the "what is subsystem 129" question. (I also
submitted a bug report to HP: no two *different* SUBSYS
constants should ever have the same value!)
An optional step to dramatically improve the performance of the SYMLIST command
is to prefetch the SYMOS file into memory. An example is:
:fetch symos.osb79.telesup
The SYMOS file you should fetch is the one that was opened by the MACSTART
command above. You can see which one this is by doing a SYMINFO command:
symf
1.3 Interpreting the Secondary Status Number
The Secondary Status line may provide some additional information
about the System Failure, if the INFO and SUBSYS values are not 0.
Take the two numbers (in the above example, INFO = -34, and SUBSYS
= 107), and use the "errmsg" function as follows:
:debug
=errmsg (-#34, #107) /* "#"s are necessary */
'The length specified was beyond the bounds of the specified object.'
c
Not all Secondary Status messages are in the catalog. If you had
tried one that is not, you would see:
:debug
= errmsg (-#51, #107)
'External error - subsys: #107 info: #51'
Note: I recommend submitting a bug report to HP for any System
Failure or Secondary Status values that are not in the catalog!
1.4 Other System Halts
The most common type of system failure is a deliberate call to
an internal MPE routine called system_abort. When this kind of
system failure occurs, the three line message shown above is printed.
Note the third line, which said "SYSTEM HALT 7, $01F8".
The "7" means that system_abort was called.
At least seven other kinds of system halts are defined (SYSTEM
HALT 0 through SYSTEM HALT 6).
The SYSTEM HALTS 0..6 represent system failures for problems
other than system_abort, and usually reflect a problem "lower"
in the operating system (e.g.: in the interrupt handling code).
SYSTEM HALTS 1..7 should produce a multi-line printout
on the console. SYSTEM HALT 0 does not.
If the console output is missing, or corrupted, you can determine
the type of SYSTEM HALT that occurred by looking at the
hex display.
You can think of the hex display as presenting a series of 16-bit
numbers (4 hex digits) in a sequence. The sequence is repeated
over and over, with a pause of about 1/2 second between each number.
The last number in the sequence is usually $DEAD.
The first number is usually of the form $Bnxx.
The "xx"
portion (the bottom two hex digits) reports the type of SYSTEM
HALT that occurred.
In the example at the start of this note, the hex display is showing:
B007 0101 02F8 DEAD
The "07" means: SYSTEM HALT 7 (system_abort was
called).
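As a rough illustration of reading the halt type from the first word of the sequence, here is a small Python sketch. The function name halt_type is hypothetical, and the code assumes the first word really does have the $Bnxx form described above (the text says only that it "usually" does).

```python
def halt_type(display: str) -> int:
    """Return the SYSTEM HALT type from the first hex display word.

    Assumption: the first word has the form Bnxx, and xx (the bottom
    two hex digits) is the halt type.
    """
    first = display.split()[0].lstrip("$").upper()
    if len(first) != 4 or not first.startswith("B"):
        raise ValueError(f"first word {first!r} is not of the form Bnxx")
    return int(first[2:], 16)


halt = halt_type("B007 0101 02F8 DEAD")
if halt == 7:
    print("SYSTEM HALT 7: system_abort was called")
else:
    print(f"SYSTEM HALT {halt}: not a system_abort call")
```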
2. System Hangs
Sometimes, the system seems to "hang", and little or
no response is seen by the users. When this happens, it is important
to characterize what is hung, and what isn't. The following questions
should be asked before stopping the machine and taking a memory dump:
- What does the hex display show? (See: Speedometer in section 4 below)
- Does any terminal get a response from the Command Interpreter?
(If a terminal is sitting with a ":" prompt, hit return. Does another ":"
prompt come out?)
- Is the hardware console (ldev 20) hung?
- If a terminal can be found that is working, does a :SHOWPROC command hang
the terminal?
- Does a control-A at the hardware console (ldev 20) result in an "=" prompt?
- Are the disc drives active?
The answers to these questions will aid the person who analyzes the dump.
3. Dump Loading
Once a memory dump has been taken, and the system rebooted, you
will probably want to load the dump for analysis. The following
steps should be done:
- Logon as MGR.TELESUP, DUMPS
(Note: if the DUMPS group does not exist, logon as MGR.TELESUP
and do: NEWGROUP DUMPS, and then CHGROUP DUMPS)
- Enter: DAT.DAT
This will run the DAT (Dump Analysis Tool) program.
- Enter: GETDUMP FOO
"FOO" will be the name of the dump. This name must begin with a letter, and
be 1 to 5 letters and/or digits long. One recommendation is to call the dump
S#### where #### is the System Abort number (e.g.: S0504).
DAT will request a tape whose formal name is DUMPTAPE (this may
be file equated before running DAT, if necessary).
- REPLY to the tape request.
DAT will read the first few records of the tape, and report how
much disc storage will be required to hold the dump. DAT will
then allocate all of the necessary disc storage "up front",
before reading the rest of the tape.
If DAT is able to allocate enough disc space, and if the dump
is on a single tape (or DDS), you can now walk away for a while.
*** Please do the next two steps even if you think you don't want
to analyze the dump yourself! It saves 5 to 15 minutes for the
next person who analyzes the dump!
- Enter: MACSTART "FOO", "1"
This will tell DAT that you want to start analyzing the dump.
The "FOO" (in quotes!) is the name of the dump you used
on the earlier GETDUMP command (which didn't use quotes!). The
extra "1" tells DAT that you are only interested in
"macros" for the operating system.
If this process encounters a few errors, please do a screen capture
(e.g.: PSCREEN) so we can analyze them later.
- Enter: PROCESS_WAIT ; UI_SHOWJOB
These two commands may take up to 15 minutes to run.
- Enter: EXIT
You have now loaded a dump, FOO, and "prepared" it.
If you want to send the dump to anyone for analysis, use STORE
to store it as follows:
STORE FOO@
The "@" is important, because the dump is actually stored
on disc as FOOMEM and FOOVAR, where "FOO" is the name
you picked for the dump. Someday, dumps may be stored as even
more files (e.g.: FOO001, FOOMEM, FOO002, FOOVAR), so the "@"
will always be needed.
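The dump naming rule from the GETDUMP step (begin with a letter, 1 to 5 letters and/or digits) and the suggested S#### convention can be sketched in Python as follows. The helper names are invented for this illustration, and the check covers only the rule as stated in this note, not anything else DAT might enforce.

```python
import re


def is_valid_dump_name(name: str) -> bool:
    """True if name starts with a letter and is 1 to 5 letters/digits."""
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9]{0,4}", name) is not None


def dump_name_for_abort(abort_number: int) -> str:
    """Build the suggested S#### name from a System Abort number."""
    name = f"S{abort_number:04d}"
    if not is_valid_dump_name(name):
        # e.g. a 5-digit abort number would give a 6-character name
        raise ValueError(f"{name!r} does not fit the dump naming rule")
    return name


print(dump_name_for_abort(504))  # prints: S0504
```

An abort number above 9999 would not fit the 5-character limit with this convention, so the sketch raises an error rather than guessing at an abbreviation.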
4. Speedometer
The HP 3000, running MPE/iX (or MPE XL) has a free "speedometer", which
tells us how busy the computer is.
When the system is alive, the hex display on the hardware console
functions as a speedometer, reporting how busy the system is.
(Remember: some machines have LED hex displays, and all have the ability
to put the hex display on the status line of the hardware console, when
control-B is hit.)
The speedometer will typically cycle between two values:
FxFF and FFFF.
Ignore the FFFF value.
The "x" digit in the
FxFF value reports what percentage busy
the CPU is. The number should be multiplied by 10 to obtain the percentage.
Examples:
F4FF
means: 40% busy.
FAFF
means: 100% busy ("A" is the hex
value for decimal 10).
F0FF
means: idle (0% busy).
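The arithmetic in these examples is simple enough to check by hand, but a short Python sketch may make it concrete. The function name busy_percent is made up for this illustration; it returns None for the FFFF value, which the text says to ignore.

```python
def busy_percent(code: str):
    """Convert a speedometer value of the form FxFF to percent busy.

    Returns None for FFFF (to be ignored). Raises ValueError for any
    value that does not look like a speedometer reading.
    """
    code = code.strip().upper()
    if code == "FFFF":
        return None
    if len(code) != 4 or code[0] != "F" or code[2:] != "FF":
        raise ValueError(f"{code!r} is not of the form FxFF")
    level = int(code[1], 16)  # the "x" digit, 0 through A
    if level > 10:
        raise ValueError(f"unexpected busy digit in {code!r}")
    return level * 10


for code in ("F4FF", "FAFF", "F0FF", "FFFF"):
    print(code, busy_percent(code))
```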
Note: on newer HP 3000s, you will have to interact with the GSP (Guardian
Service Processor) to see the speedometer. A typical scenario is:
- Connect to the GSP (press control-B at the hardware console, or telnet to
the GSP port, or use a browser and logon to the GSP via a Secure Web
Server);
- Login to the GSP (often just by pressing <return> twice);
- Get to the Virtual Front Panel by entering: VFP <return>
- If asked, say "Yes" to the "Proceed with Live Mode of VFP? (Y/[N]) y"
question;
- Watch for a few updates:
unknown, no source stated legacy PA HEX chassis-code FFFF
unknown, no source stated legacy PA HEX chassis-code F0FF
unknown, no source stated legacy PA HEX chassis-code FFFF
unknown, no source stated legacy PA HEX chassis-code F0FF
(The above says: F0FF, which is 0% busy)
- Exit the VFP by typing "q": q
- (optional) exit the GSP by typing "co": co
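If the VFP output is captured to a file or a terminal log, the busy readings can be pulled out with a short script. The Python sketch below is only an illustration: the function name busy_readings is hypothetical, and it assumes each update line contains the text "chassis-code" followed by a 4-digit hex value, as in the sample above. Other GSP firmware may format the lines differently.

```python
import re

# Assumption: each VFP update ends with "chassis-code XXXX".
CHASSIS_CODE = re.compile(r"chassis-code\s+([0-9A-Fa-f]{4})")


def busy_readings(vfp_text: str) -> list:
    """Return the percent-busy values found in captured VFP output."""
    readings = []
    for match in CHASSIS_CODE.finditer(vfp_text):
        code = match.group(1).upper()
        if code == "FFFF":
            continue  # the FFFF value carries no busy information
        if code[0] == "F" and code[2:] == "FF":
            readings.append(int(code[1], 16) * 10)
    return readings


sample = """\
unknown, no source stated legacy PA HEX chassis-code FFFF
unknown, no source stated legacy PA HEX chassis-code F0FF
unknown, no source stated legacy PA HEX chassis-code FFFF
unknown, no source stated legacy PA HEX chassis-code F0FF
"""
print(busy_readings(sample))  # prints: [0, 0]
```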