MPE/iX and Performance : Not Incompatible
by
1989
(converted to HTML 96/03/05)
(updated 2002-07-29)
0. Abstract
With the release of MPE XL (now called MPE/iX) on HPPA (Hewlett Packard
Precision Architecture), many new features have
arrived for the programmer. These include mapped files and a
very large address space. One new feature overlooked by many is
the RISC architecture. Although RISC suggests "reduced complexity",
optimizing performance on RISC is paradoxically more complex than
on the classic HP3000. This paper asks: "what can we do to
maximize performance?" Some answers are presented, and
particular attention is given to the characteristics of mapped
files, the file system, and Native Mode versus Compatibility Mode.
1. Mapped Files
This section will introduce mapped files and discuss their
performance characteristics.
1.1 Mapped File Introduction
From a programmer's viewpoint, MPE/iX has two basic types of
files: the ordinary, record-oriented files that have existed
since the birth of MPE, and mapped files.
A mapped file is an ordinary MPE file that is accessed via virtual
memory loads and stores instead of (or in addition to) the file
system intrinsics. Instead of calling FOPEN, a programmer can
call the new HPFOPEN intrinsic, and specify that a file is to be
opened for "mapped" access. This will result in two pieces of
information being returned to the program: a file number (like
FOPEN would have returned), and a virtual memory address. The
virtual memory address returned is the address of the first byte
of data in the file. If the address is stored in a pointer, as
shown in the following example, and the pointer is then
"de-referenced", the first byte from the file is brought into
memory.
HP Pascal/XL                      SPLash!
var filedata  : ^char;            virtual byte pointer filedata;
    firstbyte : char;             byte firstbyte;
                                  double filedata'spaceid = filedata;
...                               ...
hpfopen (..., filedata, ...);     hpfopen (..., filedata'spaceid, ...);
firstbyte := filedata^;           firstbyte := filedata;
Note: the above example is shown in both HP Pascal/XL and SPLash!, but
most of the rest of the examples in this paper will be done in SPLash!, a
native mode version of SPL/V, which allows easy manipulation of
32 bit and 64 bit virtual addresses. Mapped file access is also
available in HP C/XL. (The SPLash! example shows the need to get the
pointer passed by reference...the intrinsic declaration of HPFOPEN
has no way of telling a compiler that that parameter expects to be
a pointer-by-reference, so SPLash! would treat "filedata" as a
request to pass the address that filedata *points to*, instead of
the address of the filedata pointer itself.)
With the above fragment of code, let's look at fetching the first
two 80-byte records.
byte array
rec0' (0 : 79),
rec1' (0 : 79);
move rec0' := filedata, (80); ! get first 80 bytes
move rec1' := filedata (80), (80); ! get second 80 bytes
If the file system had been used to access the first two records,
as in:
fread (fid, rec0', -80);
fread (fid, rec1', -80);
then the total CPU utilized by the FREADs would be much greater
than the CPU used by the two "move" statements.
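For readers without an MPE/iX system at hand, the same contrast can be sketched in Python on a modern operating system. This is only an analogy, not MPE/iX code: Python's mmap module stands in for mapped access via HPFOPEN, an ordinary read() stands in for FREAD, and the helper names (make_demo_file, read_records_mapped, read_records_file) are invented for the illustration.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 80  # bytes per record, as in the example above


def make_demo_file(num_records):
    """Create a temporary file of fixed-length records (hypothetical test data)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(num_records):
            f.write(("record %d" % i).ljust(RECORD_SIZE).encode("ascii"))
    return path


def read_records_mapped(path, count):
    """Fetch the first `count` records by indexing the mapped bytes directly."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return [bytes(m[i * RECORD_SIZE:(i + 1) * RECORD_SIZE])
                    for i in range(count)]


def read_records_file(path, count):
    """Fetch the first `count` records through ordinary file reads."""
    with open(path, "rb") as f:
        return [f.read(RECORD_SIZE) for _ in range(count)]
```

Both functions return the same bytes. The mapped version simply indexes memory at record_number * 80, which is what the two "move" statements do; the relative cost of the two paths on a modern system will not match the MPE/iX timings reported in this paper.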
1.2 How are Mapped Files Implemented?
In MPE/iX, all files are stored on disc as an array of bytes. A
file is called a "mapped file" if it happens to have been opened
by a user who requested its virtual address be returned as a
result of the HPFOPEN intrinsic. At the lowest level of MPE/iX,
ALL disc files are always opened as mapped files. Usually, we
call a file a "mapped file" if we intend to access its data via
virtual memory along with (or instead of) the file system intrinsics.
Two aspects of disc files have changed from MPE V to MPE/iX:
- The file label is not stored as part of the file.
- There is no wasted space between records or between blocks.
The first change is a decade overdue. The second change is a
direct result of the virtual memory system of HPPA.
When any disc file is opened in MPE/iX, a module called the
"Virtual Space Manager" allocates a range of virtual addresses
sufficient to cover the entire file. The process is called
"mapping", as in: mapping the file into virtual memory.
"Mapping" provides a one-to-one correspondence between a virtual
memory address and a byte of disc data for every byte in the file.
If a program tries to use a virtual address that has been mapped
onto a file to fetch a byte of data, the following is done by
hardware:
1. Extract the upper 53 bits of the 64-bit virtual address,
   calling it the VPN (Virtual Page Number).
2. Is the virtual page "in" memory? (I.e.: is there a
   physical page of 2,048 bytes that has been assigned to that VPN?)
3. If yes, then using the bottom 11 bits (the page "offset")
   of the original 64-bit virtual address, index into the
   physical page, fetch the byte, and return.
4. If no, interrupt and ask the software to bring our page
   into physical memory.
5. When our page arrives in memory, our process will be
   restarted at step #1 above.
The above process can be phrased in a simpler manner:
- If the virtual address is in real memory, fetch the data;
  otherwise do a "page fault" and swap the page into memory and
  then fetch the byte.
Note: this description of virtual memory is simplified, and
omits features such as the Translation Lookaside Buffer (TLB).
Thus, to fetch the first byte of record number 100 (counting from
record 0) of an 80-byte record file, we can simply take the virtual
address of the first byte of the file, add 8000 to it, and then fetch
a byte from that address. Sooner or later, the byte will appear in
the register that we asked it to be loaded into.
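The hardware steps described above can be sketched in Python, under the simplifications the paper already makes (no TLB, no protection checks). The dictionary of resident pages and the load_page callback are hypothetical stand-ins for physical memory and for the software page-fault handler, and the toy file is assumed to be mapped at virtual address 0. Nothing here is actual MPE/iX code.

```python
PAGE_OFFSET_BITS = 11               # low 11 bits index into a 2,048-byte physical page
PAGE_SIZE = 1 << PAGE_OFFSET_BITS   # 2,048


def split_virtual_address(va):
    """Split a 64-bit virtual address into (VPN, page offset)."""
    vpn = va >> PAGE_OFFSET_BITS     # upper 53 bits
    offset = va & (PAGE_SIZE - 1)    # lower 11 bits
    return vpn, offset


def fetch_byte(va, resident_pages, load_page):
    """Return the byte at virtual address `va`.

    resident_pages: dict mapping VPN -> bytes of that page (pages "in" memory)
    load_page:      hypothetical callback that brings a page in from disc
    """
    vpn, offset = split_virtual_address(va)
    if vpn not in resident_pages:            # page fault
        resident_pages[vpn] = load_page(vpn)
    return resident_pages[vpn][offset]


# Toy "file" mapped at virtual address 0: 101 records of 80 bytes each.
RECORD_SIZE = 80
file_data = bytes(i % 256 for i in range(101 * RECORD_SIZE))
faults = []                                  # VPNs that had to be loaded


def load_page(vpn):
    faults.append(vpn)
    start = vpn * PAGE_SIZE
    return file_data[start:start + PAGE_SIZE]


resident = {}
first_byte = fetch_byte(100 * RECORD_SIZE, resident, load_page)      # offset 8000
next_byte = fetch_byte(100 * RECORD_SIZE + 1, resident, load_page)   # same page
```

The first fetch faults the page in; the second finds it already resident, so only one entry is recorded in faults.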
The detailed workings of virtual memory are quite complex, and
beyond the scope of this paper. For now, let's just remember:
- When bytes of a file are accessed via a virtual address, the
data is brought into memory as needed by the operating system
via "page faults". Once a page is in memory, its data can be
accessed at main-memory speeds. On a typical MPE/iX machine,
many millions of bytes of mapped files could be in memory all
at the same time.
If anything is stored into the virtual address, the physical page
is marked dirty. Dirty pages are eventually written out to disc,
but this process might not occur for quite some time.
When we talk about a "page" in reference to the CPU hardware, we
generally mean a "physical page" of 2,048 bytes. At most other
times, "page" refers to a "logical page" (sometimes incorrectly
called a "virtual page") of 4,096 bytes. When a logical page is
brought into memory, it will occupy two consecutive physical pages.
1.3 Prefetch
"Prefetching" is the act of bringing more data from disc into
memory than was immediately requested by a user, in an attempt to
prevent a second disc read shortly after the first.
The disc caching code on MPE V had two "dials" the system manager
could twist to control the amount of data prefetched: one dial
controlled the size of cache domains created for sequential disc
reads, and the other controlled the size of domains created for
random disc reads.
On MPE/iX, the system manager has no such controls. Instead, the
prefetch size is determined (at present) by one primary factor:
what subsystem is asking for the data to be read from disc. If
the request to read data from disc is from the memory manager
(due to a page fault), one logical page is read. If the request
is from the file system, several logical pages are read.
Clearly, this has enormous performance implications. Consider a
program accessing a file of 256 byte records in a sequential
manner. Assuming the file has about 90,000 records, and assuming
that the file system requests 4 logical pages at a time, then the
memory mapped access will have 5,625 page faults versus 1,406 for
the file system accessor. (Remember: a logical page is 4,096 bytes,
and a physical page is 2,048 bytes. Unless dealing with the
lowest levels of MPE/iX, we normally refer to logical pages.)
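The arithmetic behind these two estimates can be written out as a short Python sketch. The function name and parameters are invented for the illustration; the only inputs are the figures assumed in the text (90,000 records of 256 bytes, 4,096-byte logical pages, 1 page per page fault versus 4 pages per file system read).

```python
LOGICAL_PAGE = 4096  # bytes per logical page


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def estimated_disc_reads(num_records, record_size, pages_per_read):
    """Estimate the disc reads needed to scan a file sequentially."""
    total_bytes = num_records * record_size
    logical_pages = ceil_div(total_bytes, LOGICAL_PAGE)
    return ceil_div(logical_pages, pages_per_read)


mapped_reads = estimated_disc_reads(90000, 256, pages_per_read=1)
file_system_reads = estimated_disc_reads(90000, 256, pages_per_read=4)
```

This gives 5,625 reads for mapped access. For the file system it gives 1,407, because the sketch rounds the final partial read up; the 1,406 quoted above drops the fraction (5,625 / 4 = 1,406.25).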
As a test of the above, a program was run that did a simple sequential
read of the file SL.PUB.SYS (89,867 records of 256 bytes). This
file takes about 22 megabytes of disc space. In between each run, a
separate 16 megabyte file was read in an attempt to flush as much of
the SL.PUB.SYS file data from memory as possible (see the section:
Measurement Problems).
The following table shows the CPU and elapsed times the test program
needed to read SL.PUB.SYS. The test program was running in Native Mode.
SL.PUB.SYS sequential read (times in milliseconds)
  CPU   Elapsed    Delta   Access Method
-----   -------   ------   -------------
19686    146298   126612   Memory Mapped
35398     44361     8963   FreadDir
36590     44957     8367   Fread
39465     46802     7337   FreadDir & FreadSeek
48650     51949     3299   Fread & FreadSeek
The "Delta" column (Elapsed minus CPU) shows the amount of time the
program was presumably waiting for the data to come from disc.
The "FreadDir" access method consisted of using the FREADDIR
intrinsic with ascending record numbers, which results in
reading exactly the same records as the FREAD intrinsic. The
last two rows added a call to the FREADSEEK intrinsic in an
attempt to have MPE/iX prefetch data before it was read. For
those two tests, FREADSEEK was called once every 4 reads, with
a request to prefetch the fourth record following the current.
The implications:
- Use sequential FREADDIR to sequentially read a file that is
not already in memory (see note below);
- Don't use FREADSEEK. At least in these tests, it never seems
to help, and only costs extra CPU time.
Taking the first delta figure, 126,612, and guessing that we can
do a disc read in 22.5 milliseconds, we get an estimate of 5,627
disc reads, which closely matches our prediction of 5,625.
If we take the delta for the FREAD test, 8,367, and use the
same estimate of 22.5 milliseconds per disc read, we get 372 disc
reads. This implies that FREAD is prefetching in chunks of 15 or
16 logical pages, not the 4 originally assumed.
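The two estimates can be reproduced with a few lines of Python. This is only a restatement of the arithmetic in the text: the 22.5 millisecond figure is the paper's guess, the timings come from the SL.PUB.SYS table, and the function name is invented for the illustration.

```python
MS_PER_DISC_READ = 22.5   # guessed time for one disc read, in milliseconds
LOGICAL_PAGE = 4096       # bytes per logical page


def implied_disc_reads(elapsed_ms, cpu_ms):
    """Estimate disc reads from the time spent waiting (Elapsed minus CPU)."""
    delta_ms = elapsed_ms - cpu_ms
    return delta_ms / MS_PER_DISC_READ


# SL.PUB.SYS: 89,867 records of 256 bytes, rounded up to whole logical pages.
file_pages = -(-(89867 * 256) // LOGICAL_PAGE)

mapped_reads = implied_disc_reads(146298, 19686)   # "Memory Mapped" row
fread_reads = implied_disc_reads(44957, 36590)     # "Fread" row

# Logical pages that each FREAD-driven disc read must have brought in.
pages_per_fread_read = file_pages / fread_reads
```

The file occupies 5,617 logical pages, so the mapped-file estimate works out to almost exactly one disc read per page, while the FREAD estimate works out to a little over 15 pages per disc read.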
Note that with the FREAD & FREADSEEK test the delta was cut by more
than half (from 8,367 to 3,299), at the cost of greatly increased CPU
time (from 36,590 to 48,650).
A second large file was tested, NL.PUB.SYS (64,275 records of 256
bytes each, 16 megabytes):
NL.PUB.SYS sequential read (times in milliseconds)
  CPU   Elapsed    Delta   Access Method
-----   -------   ------   -------------
11507     74920    63413   Memory Mapped
22109     26240     4131   FreadDir
23857     27364     3507   Fread
25735     28124     2389   FreadDir & FreadSeek
28887     31151     2264   Fread & FreadSeek
These results mirror those for reading SL.PUB.SYS.
1.4 Memory Resident Data
The previous section examined the performance of mapped files
versus the file system for data that was out on disc.
Frequently, the data for a file will happen to be resident in
memory. This is the case when a file is accessed multiple times
in a relatively short period. This section examines the
performance of accessing file data that is already in memory.
Using the same Native Mode program (an SPL/V program compiled
with SPLash!), the file CATALOG.PUB.SYS was sequentially read.
This file has 7040 records of 80 bytes each for a total of 0.5 megabytes.
 CPU   Elapsed   Access Method
----   -------   -------------
 181       182   Mapped File
1660      1677   FreadDir
1678      1680   Fread
1959      1976   FreadDir & FreadSeek
1977      1994   Fread & FreadSeek
The file CATALOG was read once to bring it into memory. The
time to do this is not reflected in the above table.
Note that the elapsed time is just slightly more than the CPU
time. This is because the process is never paused to wait for
disc I/O.
The implications:
- If the file's data is likely to be in memory, use mapped
file access!
- FREADSEEK should not be used for files where the data is in
memory already.
1.5 NM vs CM vs OCT
MPE/iX can execute in any of three modes: Native Mode (executing
RISC instructions), Compatibility Mode (emulating classic HP3000
CISC instructions), and a blend of the two produced by the Object
Code Translator (OCT). Briefly, a Compatibility Mode (CM)
program can be run through the OCT to produce a hybrid program
file that contains the original CISC instructions as well as
their translation into RISC instructions. OCT'ed programs must
obey ALL the same restrictions as CM programs (e.g.: 16-bit wide
stack of 65,535 bytes). (For more information on OCT, CM, and
NM, the reader is directed to the book "Beyond RISC" from
Software Research Northwest.)
The data in the preceding tests was obtained from a Native Mode
program. This section examines the performance of the file system
when called from the three types of program code: NM, OCT, and CM.
As a reminder of what can be accomplished by what my partner,
Steve Cooper, calls the "second migration", mapped file access is
also shown in the table. The "second migration" is the process of
adapting programs to take advantage of the new features in MPE/iX.
The "first migration" is the one HP talks about: porting a
program to Native Mode (which usually means minimal changes).
The file CATALOG.PUB.SYS was sequentially read in the same
manner as before, with the IDENTICAL program compiled in SPL/V
(CM), run through the Object Code Translator (OCT), and compiled
by SPLash! (NM). The following table shows the results:
CATALOG.PUB.SYS (times in milliseconds)
 CPU   Elapsed   Mode   Access Method
----   -------   ----   -------------
 181       182   NM     Mapped (requires NM)
1660      1677   NM     FreadDir
1678      1680   NM     Fread
1959      1976   NM     FreadDir & FreadSeek
1977      1994   NM     Fread & FreadSeek
3326      3343   OCT    FreadDir
3838      3854   OCT    Fread
4196      4214   CM     FreadDir
4850      4881   CM     Fread
5196      5216   OCT    FreadDir & FreadSeek
5670      5690   OCT    Fread & FreadSeek
6471      6493   CM     FreadDir & FreadSeek
7473      7493   CM     Fread & FreadSeek
The implications:
- NM is far faster than CM or OCT.
- Calling FREADSEEK from CM or OCT programs is even more of a
penalty than calling it from NM programs.
- FREADDIR is still slightly faster than FREAD.
The test program was produced from the source file "READER" with
the following commands:
CM:   spl reader, $newpass, $null
      prep $oldpass, reader.cm
OCT:  octcomp reader.cm, readero.cm, , noovf
NM:   splasm reader
Note that the "noovf" option on the "octcomp" command tells the
OCT that the program does not expect to generate arithmetic
overflows and to optimize its translation with that in mind.
This results in slightly faster OCT'ed programs.
The basic reason that the CM and OCT programs are so much slower
is that simple disc files are handled by Native Mode portions of
MPE/iX. Some types of disc files are still handled by
Compatibility Mode portions of MPE/iX, ported from MPE V/E. These
include message files, RIO files, Circular files, and KSAM files.
When a CM or OCT program calls the FREAD intrinsic to read a
record from an ordinary disc file, the FREAD intrinsic must
"switch" to Native Mode and call the Native Mode FREAD intrinsic.
This switch is not inexpensive. OCT programs pay the same switch
overhead as CM programs because they are still emulating the
Classic instruction set, albeit faster than the emulator. NM
programs (e.g.: HP Pascal/XL and SPLash!) are already in Native
Mode when they call FREAD, so no switch is necessary.
The next test shows the results of serially reading a KSAM file
of 1,000 80-byte records from NM, OCT, and CM programs. As in
the CATALOG test, the file was brought into memory before the
start of the test.
 CPU   Elapsed   Mode   Access Method
----   -------   ----   -------------
2475      2494   OCT    Fread
2677      2696   CM     Fread
3239      3257   NM     Fread
Note that the FREAD intrinsic returns the records in key
order, not the chronological order in which they were written.
Note that the FreadDir test was dropped. The FREADDIR
intrinsic cannot be used on KSAM files.
The mapped file test was dropped because it reads the data in
chronological order, not key order.
The implications:
- If KSAM is being used heavily, don't migrate the programs
  into NM until a native mode version of KSAM is available
  (from HP or another vendor).
[Update 96/03/07: KSAM/iX is in Native Mode]
2. Memory & Disc Utilization
In MPE V, stacks were limited to a maximum of 65,535 bytes. In
MPE/iX, the limitation is 1 gigabyte (1,073,741,824 bytes).
(This limit includes the CM stack & heap, the NM stack, the
NM heap, and the XRT.)
In MPE V, if any part of the stack was in memory, then the entire
stack was in memory. In MPE/iX, only the logical pages recently
referenced are likely to be in memory at any time. Additionally,
only those pages that have EVER been referenced are allocated
disc storage. As more and more stack/heap pages are touched,
more and more pages are allocated on disc. This means that
having an array of 1,000,000 bytes in SPLash! (or Pascal/XL, or
any NM language) is not expensive...until you use it. A megabyte
array will have 1 million bytes of virtual address space assigned to
it, but the disc storage will range from 0 to roughly 250 logical
pages (245 for 1,000,000 bytes; 256 for 1,048,576 bytes)!
Disc files are allocated storage exactly like the stack/heap:
only those pages ever touched are allocated disc sectors. (Since
extents may be allocated several logical pages at a time, some
rounding-up does occur.) This means that it is feasible to have
"sparse" files. For example, a file with 1 byte for every
possible Social Security number would have a limit of 999,999,999
bytes. If a single write is done to record 2345, then a single
extent will be allocated. A test done on MPE XL 1.1 resulted in
an extent of 2,048 sectors being allocated. This does not mean
that all future extents will be of equal size. Unfortunately,
the programmer has no control over the extent size.
3. Data Alignment
On the Classic HP3000, the natural data alignment was 16 bits.
With rare exceptions, 32-bit and 64-bit data could be placed at
any 16-bit boundary with impunity and no performance
ramifications.
On the HPPA HP3000s, the natural data alignment is 32 bits for
32-bit data, and (sometimes) 64 bits for 64-bit data. (The 64-bit
alignment applies primarily to IEEE 64-bit floating point
numbers.)
As a result, if code is ported from a CM language to its NM
equivalent, one of two problems can result: program aborts (or
other errors) due to misaligned data; or performance slowdowns.
Most NM compilers provide a means of specifying that certain
variables are only 16-bit aligned. When this is done, then the
compilers will typically emit 3 instructions to load a 32 bit
variable instead of the 1 that would have been required if the
variable was aligned on a 32-bit boundary. This is necessary
because the RISC hardware does not allow the LDW (Load 32-bit
Word) instruction to be given an address that is not a multiple
of 4 bytes (32 bits). Instead, 2 LDH (Load 16-bits) instructions
and one DEP (deposit) instruction must be used to build the 32
bit value in a register.
No performance data is shown here because the implications are
clear from the instruction count: 1 versus 3.
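What the three-instruction sequence computes can be illustrated with a small Python model. This is not compiler output: the function names are invented, memory is modelled as a big-endian byte string (HPPA is big-endian), and the shift-and-OR at the end merely plays the role of the DEP instruction.

```python
import struct


def load_word_aligned(memory, addr):
    """Model of LDW: one 32-bit load, legal only at a multiple of 4 bytes."""
    if addr % 4 != 0:
        raise ValueError("LDW requires a 4-byte aligned address")
    return struct.unpack_from(">I", memory, addr)[0]


def load_halfword(memory, addr):
    """Model of LDH: one 16-bit load, legal only at a multiple of 2 bytes."""
    if addr % 2 != 0:
        raise ValueError("LDH requires a 2-byte aligned address")
    return struct.unpack_from(">H", memory, addr)[0]


def load_word_16bit_aligned(memory, addr):
    """Build a 32-bit value from two 16-bit loads plus a combine step."""
    high = load_halfword(memory, addr)        # first LDH
    low = load_halfword(memory, addr + 2)     # second LDH
    return (high << 16) | low                 # combine (the DEP step)


# A 32-bit value stored at byte offset 2: 16-bit aligned, not 32-bit aligned.
memory = bytes([0x00, 0x00, 0x12, 0x34, 0x56, 0x78, 0x00, 0x00])
value = load_word_16bit_aligned(memory, 2)
```

Calling load_word_aligned(memory, 2) raises an error in this model, which mirrors the hardware restriction on LDW; the three-step path returns the value in either case, at three times the instruction count.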
4. SORT vs HPSORT
Compatibility Mode programs that call the SORT intrinsics still
get the old sort package, running in OCT.
Native Mode programs have a choice of two intrinsics to do
sorting: SORT and HPSORT. These two intrinsics are interfaces
to a new sort package which runs in Native Mode. The native mode
sort package lacks some of the features of the CM sort facility
(e.g.: the ability to pass procedures to do the comparison), and
has one additional wrinkle: sometimes it calls the CM sort to do the sort!
In its present incarnation, NM Sort will call CM Sort when it
gets a "difficult" sort. This includes sorts that specify an
alternate collating sequence.
Additionally, when NM Sort does stay in NM, it does NOT open a
temporary file called SORTSCR. Instead, it uses two temporary
files that are either nameless or have a name like HPSORT1 and
HPSORT2 (?), depending on the release of MPE/iX. This means that
if a fairly simple sort is requested from a NM program, the
programmer cannot point the sort scratch file to a disc drive
he/she knows is separate from the input and output data.
In short, NM Sort is still evolving. Test runs should be made
before converting to NM simply to call NM sort.
5. System Performance
The overall system performance can still be affected by proper
tuning of the C, D, and E subqueues via the TUNE command.
The choice of disc drives for a file can also be controlled in
the usual manner (e.g.: BUILD FOO;DEV=3). However, the number of
extents cannot be easily controlled any more. The basic choice is
one extent or many extents.
Main memory is vital to the performance of the system. Unlike
MPE V, which tended to degrade slowly, MPE/iX will suffer a very
sharp drop in performance when not enough memory is available.
Economize on everything else ... and buy memory.
A 950 (and 955) will support up to 256 megabytes (128 per memory
controller). Three vendors offer memory for the machine: HP,
Kelly Computer Systems (the first to put 256 megabytes in a
user's computer), and EMC. Sites with Classic HP3000s may
be interested in Kelly's RAMDISC for the 3000, which can be
traded in on HPPA memory when needed.
6. NM vs. CM : Intrinsics
In an earlier section, we determined that some types of files are
still implemented with Compatibility Mode code.
File system intrinsics are not the only ones that might actually
be implemented in CM. The ASCII, BINARY, DASCII and DBINARY
Native Mode intrinsics currently switch to CM to do their work.
Although this may change in the future, the performance
implications are still interesting today.
Porting a program into Native Mode may reveal other intrinsics
that are still implemented in Compatibility Mode.
The following table shows the result of calling the ASCII
intrinsic a large number of times from programs written in NM,
OCT, and CM:
  CPU   Elapsed   Mode
-----   -------   ----
 9051      9084   NM
11688     11728   OCT
12211     12252   CM
Although the Native Mode program was the fastest, it was by a very
narrow margin.
The ASCII/BINARY/etc. intrinsics have always been a performance
bottleneck on MPE V. They haven't changed in MPE/iX. The
following table shows the results of calling the ASCII intrinsic
versus calling a "clone" of the intrinsic:
 CPU   Elapsed   Mode   Procedure
----   -------   ----   ---------
 457       471   NM     ASCII clone
9009      9040   NM     ASCII intrinsic
Similar savings can be obtained for BINARY, DASCII, DBINARY, and
CTRANSLATE. Contact the author for NMOBJ files that can be used
as replacement intrinsics.
7. Measurement Problems
Measuring performance on MPE/iX is extremely difficult. Unlike
MPE V, MPE/iX provides no control over what disc data is (or is
not) in memory. As a result, tests must be run multiple times
with best-case (or average-case) times used.
(Note: the program FLUSH can be used to flush pages of closed files
out of memory.)
The difficulty of measuring performance is at its worst when
looking at disc I/O. The following is a partial list of features
that would aid this type of analysis:
- An intrinsic that will make free all pages of memory that
are not marked memory-resident or locked.
- An intrinsic that will force all pages that are dirty to disc.
- An intrinsic that would return for a virtual address
information like: size of object and number of logical
pages currently in memory.
The first feature would allow the system to be returned to a
known "blank slate" state, allowing repeatable performance
testing.
Note: an intrinsic allows the system manager/performance tuner/
software developer the ability to exercise the above functions
programmatically. This is clearly superior to simply having a
command for two reasons:
- A command can be written by the user which simply calls the
intrinsic. The opposite is not inexpensively true.
- Intrinsics are not as easy to abuse by the casual user.
[Update 96/03/07: the author has a utility program, FFLUSH, which
flushes the pages for all closed files out of memory. This allows
for relatively easy and repeatable testing for many of the timing questions
considered in this paper.]
One valuable tool used in this paper is DEBUG. Given a virtual
address associated with a mapped file, the debugger can be used
to determine the number of logical pages that are currently in
memory. Assuming the file starts at virtual address $123.0, then
the debugger command:
= vainfo ($123.0, "pages_in_mem"), #
will report (in decimal) the number of 4,096 byte logical pages
that are currently in memory.
8. Conclusions
Obtaining optimum performance with MPE/iX is more difficult than
on MPE V ... there are more things to tune, with much less
knowledge. Things to remember:
- The amount of memory on the machine is critical;
- Migration to Native Mode is important, but should not be
done blindly. If an application is a heavy KSAM or message
file user, do some timing tests first.
- The "second migration" is more important ... it means
taking advantage of the new features.
Perhaps, when MPE/iX begins to stabilize, and third-party
performance tools are developed and marketed, the folklore on how
to maximize performance will begin to grow as it did under MPE V.
In the meantime, keep the faith!
NOTE: All timings in this paper were obtained running under MPE
XL 1.1. Initial testing on MPE XL 1.2 shows no major
differences.
WAIT!
Postscript!
FREADSEEK has been given a bad name in this article. Well, like
the "goto", it has its uses. Further testing (and a lot of
thought) resulted in a modification to the test program that was
reading SL.PUB.SYS with the results:
SL.PUB.SYS sequential read (times in milliseconds)
  CPU   Elapsed    Delta   Access Method
-----   -------   ------   -------------
15273     49525    34252   Memory Mapped & FreadSeek   <---
19686    146298   126612   Memory Mapped
35398     44361     8963   FreadDir
35903     65936    30033   FreadDir & FreadSeek
36590     44957     8367   Fread
39769     64529    24760   Fread & FreadSeek
Notice the incredible change in the "Memory Mapped & FreadSeek"
numbers (first line). The crucial difference here is in the timing and
quantity of calls to the FREADSEEK intrinsic. Earlier testing
showed that the best case "throughput" for reading data with a
mapped file (where the data was already memory resident) was
about 3,111 bytes per millisecond. (Obtained from the
memory-resident speed of reading CATALOG.PUB.SYS (80 * 7040
bytes) in 181 milliseconds.) Clearly, any prefetch should be
done far enough ahead of time that the data is in memory by the
time it is needed. This calculation shows that if we
assume it takes 30 milliseconds to read data from disc, then it
must be requested 30 * 3,111 bytes (about 93,000 bytes) before it
is needed.
The test program was adjusted to prefetch 128 records ahead
(instead of 4 records ahead). The next round of timings showed a
gain, but not as much as hoped for. Then, we realized that the
prefetch was reading 8 logical pages. So, after processing 24
logical pages (100,000 bytes) the test program was prefetching 8
logical pages instead of 24! The program was modified again, to
fetch 24 logical pages at a time, resulting in the times shown
above.
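The prefetch-distance reasoning above can be checked with a short Python sketch. The function names are invented for the illustration; the inputs are the figures quoted in the text (CATALOG.PUB.SYS read in 181 milliseconds, an assumed 30 millisecond disc read, 256-byte records in SL.PUB.SYS, 4,096-byte logical pages).

```python
import math

LOGICAL_PAGE = 4096  # bytes per logical page


def throughput_bytes_per_ms(total_bytes, cpu_ms):
    """Best-case consumption rate for memory-resident mapped data."""
    return total_bytes / cpu_ms


def prefetch_lead_pages(throughput, disc_read_ms):
    """Logical pages consumed while one disc read is still in flight."""
    lead_bytes = throughput * disc_read_ms
    return math.ceil(lead_bytes / LOGICAL_PAGE)


throughput = throughput_bytes_per_ms(80 * 7040, 181)   # about 3,111 bytes/ms
lead_pages = prefetch_lead_pages(throughput, 30)       # pages to stay ahead by

# The first adjustment (128 records of 256 bytes ahead) covered only this much:
first_attempt_pages = (128 * 256) // LOGICAL_PAGE
```

Under these assumptions the program needs to stay about 23 logical pages ahead, which the paper rounds to 24, while the 128-record prefetch covered only 8.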
Moral: prefetching via FREADSEEK is worth the time, but ONLY
after careful analysis. Prefetching at the wrong time, or
prefetching too little data, is worse than not prefetching at all.