0. Abstract
With the release of MPE XL (now called MPE/iX) on
HPPA (Hewlett-Packard Precision Architecture), many new features have
arrived for the programmer. These include mapped files and a very large
address space. One new feature overlooked by many is the RISC
architecture itself. Although RISC stands for "Reduced Instruction Set
Computing", optimizing performance on RISC is paradoxically more complex
than on the classic HP3000. This paper asks: "what can we do to maximize
performance?" Some
answers are presented, and particular attention is given to the
characteristics of mapped files, the file system, and Native Mode versus
Compatibility Mode.
1. Mapped Files
This section will introduce mapped files and discuss their performance
characteristics.
1.1 Mapped File Introduction
From a programmer's viewpoint, MPE/iX has two basic
types of files: the ordinary, record-oriented files that have existed
since the birth of MPE, and mapped files.
A mapped file is an MPE ordinary file that is going
to be accessed via virtual memory loads and stores instead of (or in
addition to) via file system intrinsics. Instead of calling FOPEN, a
programmer can call the new HPFOPEN intrinsic, and specify that a file
is to be opened for "mapped" access. This will result in two pieces of
information being returned to the program: a file number (like FOPEN
would have returned), and a virtual memory address. The virtual memory
address returned is the address of the first byte of data in the file.
If the address is stored in a pointer, as shown in the following
example, and the pointer is then "de-referenced", the first byte from
the file is brought into memory.
The example is shown twice: first in HP Pascal/XL, then in SPLash!.

HP Pascal/XL:

   var filedata  : ^char;
       firstbyte : char;
   ...
   hpfopen (..., filedata, ...);
   firstbyte := filedata^;

SPLash!:

   virtual byte pointer filedata;
   byte firstbyte;
   double filedata'spaceid = filedata;
   ...
   hpfopen (..., filedata'spaceid, ...);
   firstbyte := filedata;
Note: the above example was done with HP Pascal/XL, but
most of the rest of the examples in this paper will be done in SPLash!, a
native mode version of SPL/V, which allows easy manipulation of 32 bit
and 64 bit virtual addresses. Mapped file access is also available in HP
C/XL. (The SPLash! example shows the need to get the pointer passed by
reference: the intrinsic declaration of HPFOPEN has no way of telling a
compiler that the parameter is expected to be a pointer passed by
reference, so SPLash! would treat "filedata" as a request to pass the
address that filedata *points to*, instead of the address of the
filedata pointer itself.)
With the above fragment of code, let's look at fetching the first two 80-byte
records.
byte array
rec0' (0 : 79),
rec1' (0 : 79);
move rec0' := filedata, (80); ! get first 80 bytes
move rec1' := filedata (80), (80); ! get second 80 bytes
If the file system had been used to access the first two records, as in:
fread (fid, rec0', -80);
fread (fid, rec1', -80);
then the total CPU utilized by the FREADs would be much greater than the CPU
used by the two "move" statements.
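For comparison, the same two fetches can be written in HP C/XL, which
(as noted above) also supports mapped access. The following is a
minimal sketch: "filedata" is assumed to already hold the virtual
address returned by HPFOPEN, and the HPFOPEN call itself is omitted.

   /* Sketch: fetch the first two 80-byte records of a mapped file.
      "filedata" is assumed to hold the address returned by HPFOPEN. */
   #include <string.h>

   char rec0[80], rec1[80];

   void fetch_first_two(char *filedata)
   {
      memcpy(rec0, filedata,      80);   /* get first 80 bytes  */
      memcpy(rec1, filedata + 80, 80);   /* get second 80 bytes */
   }

As with the "move" statements, these copies cost only a handful of
instructions each; no intrinsic call is involved.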
1.2 How are Mapped Files Implemented?
In MPE/iX, all files are stored on disc as an array of bytes. A
file is called a "mapped file" if it happens to have been opened by a
user who requested its virtual address be returned as a result of the
HPFOPEN intrinsic. At the lowest level of MPE/iX, ALL disc files are
always opened as mapped files. Usually, we call a file a "mapped file"
if we intend to access its data via virtual memory along with (or
instead of via) the file system intrinsics.
Two aspects of disc files have changed from MPE V to MPE/iX:
- The file label is not stored as part of the file.
- There is no wasted space between records or between blocks.
The first change is a decade overdue. The second change is a direct result of
the virtual memory system of HPPA.
When any disc file is opened in MPE/iX, a module called the
"Virtual Space Manager" allocates a range of virtual addresses
sufficient to cover the entire file. The process is called "mapping", as
in: mapping the file into virtual memory. "Mapping" provides a
one-to-one correspondence between a virtual memory address and a byte of
disc data for every byte in the file.
If a program tries to use a virtual address that has been mapped
onto a file to fetch a byte of data, the following is done by hardware:
- Extract the upper 53 bits of the 64-bit virtual address, calling it the
  VPN (Virtual Page Number).
- Is the virtual page "in" memory? (I.e.: is there a physical page of 2,048
  bytes that has been assigned to that VPN?)
- If yes, then using the bottom 11 bits (the page "offset") of the original
  64-bit virtual address, index into the physical page, fetch the byte, and
  return.
- If no, interrupt and ask the software to bring our page into physical
  memory.
- When our page arrives in memory, our process will be restarted at the
  first step above.
The above process can be phrased in a simpler manner: if the virtual
address is in real memory, fetch the data; otherwise take a "page fault"
to swap the page into memory, and then fetch the byte.
Note: this description of virtual memory is simplified, and omits features such
as the Translation Lookaside Buffer (TLB).
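Conceptually, the translation amounts to a shift, a mask, and a table
lookup. The following C fragment is only a sketch of the idea; the
lookup and page-in routines are hypothetical stand-ins for the hardware
and the memory manager.

   /* Conceptual sketch of translating a 64-bit virtual address;
      NOT real MPE/iX code. */
   typedef unsigned long long va_t;

   extern char *page_in_memory(va_t vpn);   /* hypothetical lookup  */
   extern char *page_fault(va_t vpn);       /* hypothetical page-in */

   char fetch_byte(va_t va)
   {
      va_t     vpn    = va >> 11;        /* upper 53 bits: VPN           */
      unsigned offset = va & 0x7FF;      /* lower 11 bits: page offset   */
      char *page = page_in_memory(vpn);  /* physical page for this VPN?  */

      if (page == NULL)
         page = page_fault(vpn);         /* software brings the page in  */
      return page[offset];               /* index into 2,048-byte page   */
   }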
Thus, to fetch the first byte of record number 100 (counting the first
record as record 0) in a file of 80-byte records, we can simply take the
virtual address of the first byte of the file, add 8000 to it, and then
fetch a byte from that address. Sooner or later, the byte will appear in
the register that we asked it to be loaded into.
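In code, that address arithmetic is trivial. A sketch (the function name
is, of course, hypothetical):

   /* Sketch: address of record n in a mapped file of fixed-size
      records.  For record 100 of an 80-byte-record file this yields
      base + 8000. */
   char *record_address(char *base, long n, int recsize)
   {
      return base + n * (long) recsize;
   }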
The detailed workings of virtual memory are quite complex, and beyond
the scope of this paper. For now, let's just remember:
- When bytes of a file are accessed via a virtual address, the data is
  brought into memory as needed by the operating system via "page faults".
  Once a page is in memory, its data can be accessed at main-memory speeds.
  On a typical MPE/iX machine, many millions of bytes of mapped files could
  be in memory all at the same time.
- If anything is stored into the virtual address, the physical page is
  marked dirty. Dirty pages are eventually written out to disc, but this
  process might not occur for quite some time.
When we talk about a "page" in reference to the CPU hardware, we
generally mean a "physical page" of 2,048 bytes. At most other times,
"page" refers to a "logical page" (sometimes incorrectly called a
"virtual page") of 4,096 bytes. When a logical page is brought into
memory, it will occupy two consecutive physical pages.
1.3 Prefetch
"Prefetching" is the act of bringing more data from disc into
memory than was immediately requested by a user, in an attempt to
prevent a second disc read shortly after the first.
The disc caching code on MPE V had two "dials" the system manager
could twist to control the amount of data prefetched: one dial
controlled the size of cache domains created for sequential disc reads,
and another controlled the size of domains created for random disc
reads.
On MPE/iX, the system manager has no such controls. Instead, the
prefetch size is determined (at present) by one primary factor: what
subsystem is asking for the data to be read from disc. If the request to
read data from disc is from the memory manager (due to a page fault),
one logical page is read. If the request is from the file system,
several logical pages are read.
Clearly, this has enormous performance implications. Consider a
program accessing a file of 256-byte records in a sequential manner.
Assume the file has about 90,000 records (90,000 * 256 = 23,040,000
bytes, or 5,625 logical pages of 4,096 bytes), and assume that the file
system requests 4 logical pages at a time. Then the memory mapped access
will take 5,625 page faults (one per logical page) versus 1,406 disc
reads for the file system accessor.
(Remember: a logical page is 4,096 bytes, and a physical page is 2,048
bytes. Unless dealing with the lowest levels of MPE/iX, we normally
refer to logical pages.)
As a test of the above, a program was run that did a simple
sequential read of the file SL.PUB.SYS (89,867 records of 256 bytes).
This file takes about 22 megabytes of disc space. In between each run, a
separate 16 megabyte file was read in an attempt to flush as much of the
SL.PUB.SYS file data from memory as possible (see the section:
Measurement Problems).
The following table shows the CPU and elapsed times the test program,
running in Native Mode, needed to read SL.PUB.SYS.
SL.PUB.SYS sequential read (times in milliseconds)
CPU Elapsed Delta Access Method
----- ------- ------ -------------
19686 146298 126612 Memory Mapped
35398 44361 8963 FreadDir
36590 44957 8367 Fread
39465 46802 7337 FreadDir & FreadSeek
48650 51949 3299 Fread & FreadSeek
The "Delta" column shows the amount of time the program was presumably
waiting for the data to come from disc.
The "FreadDir" access method consisted of using the FREADDIR
intrinsic with ascending record numbers, which results in reading
exactly the same records as the FREAD intrinsic. The last two rows added
a call to the FREADSEEK intrinsic in an attempt to have MPE/iX prefetch
data before it was read. For those two tests, FREADSEEK was called once
every 4 reads, with a request to prefetch the fourth record following
the current.
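In outline, the "Fread & FreadSeek" loop looked something like the
following C sketch. (The actual test program was SPLash!, not C; the
intrinsics are declared via the usual #pragma intrinsic mechanism, and
all error checking is omitted.)

   #pragma intrinsic FREAD
   #pragma intrinsic FREADSEEK

   void read_with_seek(int fid, long numrecs)
   {
      char buf[256];
      long rec;

      for (rec = 0; rec < numrecs; rec++)
      {
         if (rec % 4 == 0)              /* once every 4 reads...       */
            FREADSEEK(fid, rec + 4);    /* ...hint: prefetch 4 ahead   */
         FREAD(fid, buf, -256);         /* negative count = byte count */
      }
   }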
The implications:
- Use sequential FREADDIR to sequentially read a file that is not already
in memory (see note below);
- Don't use FREADSEEK. At least in these tests, it never seems to help,
and only costs extra CPU time.
Taking the first delta figure, 126,612, and guessing that we can do a
disc read in 22.5 milliseconds, we get an estimate of 5,627 disc reads,
which matches our prediction.
If we take the delta for the FREAD test, 8,367, and use the
same estimate of 22.5 milliseconds per disc read, we see 372 disc reads.
This implies that FREAD is prefetching in chunks of 15 or 16 logical
pages, not the 4 originally assumed.
Note that with the FREAD & FREADSEEK test the delta was cut about in half,
at the cost of greatly increased CPU time.
A second large file was tested, NL.PUB.SYS (64,275 records of 256 bytes each,
16 megabytes):
CPU Elapsed Delta Access Method NL.PUB.SYS
----- ------- ------ ------------- (sequential)
11507 74920 63413 Memory Mapped
22109 26240 4131 FreadDir
23857 27364 3507 Fread
25735 28124 2389 FreadDir & FreadSeek
28887 31151 2264 Fread & FreadSeek
These results mirror those for reading SL.PUB.SYS.
1.4 Memory Resident Data
The previous section examined the performance of mapped files
versus the file system for data that was out on disc. Frequently, the
data for a file will happen to be resident in memory. This is the case
when a file is accessed multiple times in a relatively short period.
This section examines the performance of accessing file data that is
already in memory. Using the same Native Mode program (an SPL/V program
compiled with SPLash!), the file CATALOG.PUB.SYS was sequentially read.
This file has 7040 records of 80 bytes each for a total of 0.5
megabytes.
CPU Elapsed Access Method
---- ------- -------------
181 182 Mapped File
1660 1677 FreadDir
1678 1680 Fread
1959 1976 FreadDir & FreadSeek
1977 1994 Fread & FreadSeek
The file CATALOG was read once to bring it into memory. The time to do
this is not reflected in the above table.
Note that the elapsed time is just slightly more than the CPU time. This
is because the process never pauses to wait for disc I/O.
The implications:
- If the file's data is likely to be in memory, use mapped file access!
- FREADSEEK should not be used for files where the data is in memory already.
1.5 NM vs CM vs OCT
MPE/iX can execute in any of three modes: Native Mode (executing
RISC instructions), Compatibility Mode (emulating classic HP3000 CISC
instructions), and a blend of the two produced by the Object Code
Translator (OCT). Briefly, a Compatibility Mode (CM) program can be run
through the OCT to produce a hybrid program file that contains the
original CISC instructions as well as their translation into RISC
instructions. OCT'ed programs must obey ALL the same restrictions as CM
programs (e.g.: 16-bit wide stack of 65,535 bytes). (For more
information on OCT, CM, and NM, the reader is directed to the book
"Beyond RISC" from Software Research Northwest.)
The data in the preceding tests was obtained from a Native Mode
program. This section examines the performance of the file system when
called from the three types of program code: NM, OCT, and CM. As a
reminder of what can be accomplished by what my partner, Steve Cooper,
calls the "second migration", mapped file access is also shown in the
table. The "second migration" is the process of adapting programs to
take advantage of the new features in MPE/iX. The "first migration" is
the one HP talks about: porting a program to Native Mode (which usually
means minimal changes).
The file CATALOG.PUB.SYS was sequentially read in the same
manners as before, with the IDENTICAL program compiled in SPL/V (CM),
run through the Object Code Translator (OCT), and compiled by SPLash!
(NM). The following table shows the results:
CATALOG.PUB.SYS (times in milliseconds)
CPU Elapsed Mode Access Method
---- ------- ---- ---------
181 182 NM Mapped (requires NM)
1660 1677 NM FreadDir
1678 1680 NM Fread
1959 1976 NM FreadDir & FreadSeek
1977 1994 NM Fread & FreadSeek
3326 3343 OCT FreadDir
3838 3854 OCT Fread
4196 4214 CM FreadDir
4850 4881 CM Fread
5196 5216 OCT FreadDir & FreadSeek
5670 5690 OCT Fread & FreadSeek
6471 6493 CM FreadDir & FreadSeek
7473 7493 CM Fread & FreadSeek
The implications:
- NM is far faster than CM or OCT.
- Calling FREADSEEK from CM or OCT programs is even more of a penalty than
calling it from NM programs.
- FREADDIR is still slightly faster than FREAD.
The test program was produced from the source file "READER" with the following commands:
CM: spl reader, $newpass, $null
prep $oldpass, reader.cm
OCT: octcomp reader.cm, readero.cm, , noovf
NM: splasm reader
Note that the "noovf" option on the "octcomp" command tells the OCT
that the program does not expect to generate arithmetic overflows and to
optimize its translation with that in mind. This results in slightly
faster OCT'ed programs.
The basic reason that the CM and OCT programs are so much slower
is that simple disc files are handled by Native Mode portions of MPE/iX.
Some types of disc files are still handled by Compatibility Mode
portions of MPE/iX, ported from MPE V/E. These include message files,
RIO files, Circular files, and KSAM files.
When a CM or OCT program calls the FREAD intrinsic to read a
record from an ordinary disc file, the FREAD intrinsic must "switch" to
Native Mode and call the Native Mode FREAD intrinsic. This switch is not
inexpensive. OCT programs pay the same switch overhead as CM programs
because they are still emulating the Classic instruction set, albeit
faster than the emulator. NM programs (e.g.: HP Pascal/XL and SPLash!)
are already in Native Mode when they call FREAD, so no switch is necessary.
The next test shows the results of serially reading a KSAM file
of 1,000 80-byte records from NM, OCT, and CM programs. As in the
CATALOG test, the file was brought into memory before the start of the test.
CPU Elapsed Mode Access Method
---- ------- ---- -------------
2475 2494 OCT Fread
2677 2696 CM Fread
3239 3257 NM Fread
Note that the FREAD intrinsic returns the records in key order, not the
chronological order in which they were written.
Note that the FreadDir test was dropped. The FREADDIR intrinsic cannot be
used on KSAM files.
The mapped file test was dropped because it reads the data in chronological
order, not key order.
The implications:
- If KSAM is being used heavily, don't migrate the programs into NM
  until a native mode version of KSAM is available (from HP or another
  vendor).
[Update 96/03/07: KSAM/iX is in Native Mode]
2. Memory & Disc Utilization
In MPE V, stacks were limited to a maximum of 65,535 bytes. In
MPE/iX, the limitation is 1 gigabyte (1,073,741,824 bytes). (This limit
includes the CM stack & heap, the NM stack, the NM heap, and the XRT.)
In MPE V, if any part of the stack was in memory, then the entire
stack was in memory. In MPE/iX, only the logical pages recently
referenced are likely to be in memory at any time. Additionally, only
those pages that have EVER been referenced are allocated disc storage.
As more and more stack/heap pages are touched, more and more pages are
allocated on disc. This means that having an array of 1,000,000 bytes in
SPLash! (or Pascal/XL, or any NM language) is not expensive...until you
use it. A megabyte array will have a megabyte of virtual address space
assigned to it, but its disc storage will range from 0 to 256 logical
pages (a full megabyte is 1,048,576 bytes, or exactly 256 pages of 4,096
bytes)!
Disc files are allocated storage exactly like the stack/heap:
only those pages ever touched are allocated disc sectors. (Since extents
may be allocated several logical pages at a time, some rounding-up does
occur.) This means that it is feasible to have "sparse" files. For
example, a file with 1 byte for every possible Social Security number
would have a limit of 999,999,999 bytes. If a single write is done to
record 2345, then a single extent will be allocated. A test done on MPE
XL 1.1 resulted in an extent of 2,048 sectors being allocated. This does
not mean that all future extents will be of equal size. Unfortunately,
the programmer has no control over the extent size.
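A sketch of that sparse-file experiment in HP C/XL: one directed write
into an otherwise untouched file. (The FOPEN call, with its foptions and
aoptions words, is omitted here; "fid" is assumed to be a file built
with 1-byte records and a very large record limit.)

   #pragma intrinsic FWRITEDIR

   void touch_ssn(int fid)
   {
      char flag = 1;

      /* Write 1 byte to record 2345; only a single extent should be
         allocated for the entire (potentially huge) file. */
      FWRITEDIR(fid, &flag, -1, 2345L);
   }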
3. Data Alignment
On the Classic HP3000, the natural data alignment was 16 bits.
With rare exceptions, 32-bit and 64-bit data could be placed at any
16-bit boundary with impunity and no performance ramifications.
On the HPPA HP3000s, the natural data alignment is 32 bits for
32-bit data, and (sometimes) 64 bits for 64-bit data. (The 64-bit
alignment applies primarily to IEEE 64-bit floating point numbers.)
As a result, if code is ported from a CM language to its NM
equivalent, one of two problems can result: program aborts (or other
errors) due to misaligned data; or performance slowdowns.
Most NM compilers provide a means of specifying that certain
variables are only 16-bit aligned. When this is done, the compilers
will typically emit 3 instructions to load a 32-bit variable instead of
the 1 that would have been required if the variable were aligned on a
32-bit boundary. This is necessary because the RISC hardware does not
allow the LDW (Load 32-bit Word) instruction to be given an address that
is not a multiple of 4 bytes (32 bits). Instead, 2 LDH (Load 16 bits)
instructions and one DEP (deposit) instruction must be used to build the
32-bit value in a register.
No performance data is shown here because the implications are clear from the
instruction count: 1 versus 3.
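In C terms, the difference looks like the sketch below: the aligned case
is a single word load (LDW), while the 16-bit-aligned case must merge
two halfword loads, which is what the 2 LDH + 1 DEP sequence
accomplishes (HPPA is big-endian):

   /* Sketch: loading a 32-bit value at each alignment. */
   unsigned int load32_aligned(const unsigned int *p)
   {
      return *p;                          /* one LDW                  */
   }

   unsigned int load32_halfaligned(const unsigned short *p)
   {
      return ((unsigned int) p[0] << 16)  /* first LDH                */
             | p[1];                      /* second LDH + merge (DEP) */
   }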
4. SORT vs HPSORT
Compatibility Mode programs that call the SORT intrinsics still get the old
sort package, running in OCT.
Native Mode programs have a choice of two intrinsics to do
sorting: SORT and HPSORT. These two intrinsics are interfaces to a new
sort package which runs in Native Mode. The native mode sort package
lacks some of the features of the CM sort facility (e.g.: the ability to
pass procedures to do the comparison), and has one additional wrinkle:
sometimes it calls the CM sort to do the sort!
In its present incarnation, NM Sort will call CM Sort when it
gets a "difficult" sort. This includes sorts that specify an alternate
collating sequence.
Additionally, when NM Sort does stay in NM, it does NOT open a
temporary file called SORTSCR. Instead, it uses two temporary files that
are either nameless or have a name like HPSORT1 and HPSORT2 (?),
depending on the release of MPE/iX. This means that if a fairly simple
sort is requested from a NM program, the programmer cannot point the
sort scratch file to a disc drive he/she knows is separate from the
input and output data.
In short, NM Sort is still evolving. Test runs should be made before converting
to NM simply to call NM sort.
5. System Performance
The overall system performance can still be affected by proper tuning of the
C, D, and E subqueues via the TUNE command.
The choice of disc drives for a file can also be controlled in
the usual manner (e.g.: BUILD FOO;DEV=3). However, the number of extents
cannot be easily controlled any more. The basic choice is one extent or
many extents.
Main memory is vital to the performance of the system. Unlike MPE V, which
tended to degrade slowly, MPE/iX will suffer a very sharp drop in performance
when not enough memory is available. Economize on everything else ... and buy
memory.
A 950 (and 955) will support up to 256 megabytes (128 per memory
controller). Three vendors offer memory for the machine: HP, Kelly
Computer Systems (the first to put 256 megabytes in a user's computer),
and EMC. Sites with Classic HP3000s may be interested in Kelly's RAMDISC
for the 3000, which can be traded in on HPPA memory when needed.
6. NM vs. CM : Intrinsics
In an earlier section, we determined that some types of files are still
implemented with Compatibility Mode code.
File system intrinsics are not the only ones that might actually
be implemented in CM. The ASCII, BINARY, DASCII and DBINARY Native Mode
intrinsics currently switch to CM to do their work. Although this may
change in the future, the performance implications are still interesting today.
Porting a program into Native Mode may reveal other intrinsics that are still
implemented in Compatibility Mode.
The following table shows the result of calling the ASCII
intrinsic a large number of times from programs written in NM, OCT, and CM:
CPU Elapsed Mode
---- ------- ----
9051 9084 NM
11688 11728 OCT
12211 12252 CM
Although the Native Mode program was the fastest, it is by a very narrow margin.
The ASCII/BINARY/etc. intrinsics have always been a performance
bottleneck on MPE V. They haven't changed in MPE/iX. The following table
shows the results of calling the ASCII intrinsic versus calling a
"clone" of the intrinsic:
CPU Elapsed Mode Procedure
---- ------- ---- ---------
457 471 NM ASCII clone
9009 9040 NM ASCII intrinsic
Similar savings can be obtained for BINARY, DASCII, DBINARY, and
CTRANSLATE. Contact the author for NMOBJ files that can be used as
replacement intrinsics.
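The clones themselves are not shown in this paper, but the idea is easy
to sketch. Something like the following C routine does what the ASCII
intrinsic does for base 10; this is a hypothetical reimplementation, not
the author's actual NMOBJ code:

   /* Sketch of an "ASCII clone": convert a 16-bit value to decimal
      characters, returning the number of characters produced. */
   int ascii_clone(unsigned short value, char *out)
   {
      char tmp[5];
      int  n = 0, len = 0;

      do                               /* generate digits in reverse   */
      {
         tmp[n++] = (char) ('0' + value % 10);
         value /= 10;
      } while (value != 0);

      while (n > 0)                    /* reverse into caller's buffer */
         out[len++] = tmp[--n];
      return len;
   }

Because the clone is called directly, with no intrinsic dispatch and no
switch to CM, it avoids nearly all of the overhead shown in the table.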
7. Measurement Problems
Measuring performance on MPE/iX is extremely difficult. Unlike
MPE V, MPE/iX provides no control over what disc data is (or is not) in
memory. As a result, tests must be run multiple times with best-case (or
average-case) times used.
(Note: the program FLUSH can be used to flush pages of closed files out
of memory.)
The difficulty of measuring performance is at its worst when
looking at disc I/O. The following is a partial list of features that
would aid this type of analysis:
- An intrinsic that will make free all pages of memory that are not marked
memory-resident or locked.
- An intrinsic that will force all pages that are dirty to disc.
- An intrinsic that would return, for a virtual address, information
  such as the size of the object and the number of logical pages
  currently in memory.
The first feature would allow the system to be returned to a known
"blank slate" state, allowing repeatable performance testing.
Note: an intrinsic would give the system manager, performance tuner, or
software developer the ability to exercise the above functions
programmatically. This is clearly superior to simply having a command,
for two reasons:
- A command can be written by the user which simply calls the intrinsic;
  the opposite is not inexpensively true.
- Intrinsics are not as easy for the casual user to abuse.
[Update 96/03/07: the author has a utility program, FFLUSH, which
flushes the pages for all closed files out of memory. This allows for
relatively easy and repeatable testing for many of the timing questions
considered in this paper.]
One valuable tool used in this paper is DEBUG. Given a virtual
address associated with a mapped file, the debugger can be used to
determine the number of logical pages that are currently in memory.
Assuming the file starts at virtual address $123.0, then the debugger
command:
= vainfo ($123.0, "pages_in_mem"), #
will report (in decimal) the number of 4,096 byte logical pages that are
currently in memory.
8. Conclusions
Obtaining optimum performance with MPE/iX is more difficult than
on MPE V ... there are more things to tune, with much less knowledge.
Things to remember:
- The amount of memory on the machine is critical;
- Migration to Native Mode is important, but should not be done blindly. If
an application is a heavy KSAM or message file user, do some timing tests
first.
- The "second migration" is more important ... it means taking advantage of
the new features.
Perhaps, when MPE/iX begins to stabilize, and third-party performance
tools are developed and marketed, the folklore on how to maximize
performance will begin to grow as it did under MPE V. In the meantime,
keep the faith!
NOTE: All timings in this paper were obtained running under MPE
XL 1.1. Initial testing on MPE XL 1.2 shows no major differences.
WAIT!
Postscript!
FREADSEEK has been given a bad name in this article. Well, like
the "goto", it has its uses. Further testing (and a lot of thought)
resulted in a modification to the test program that was reading
SL.PUB.SYS with the results:
SL.PUB.SYS sequential read (times in milliseconds)
CPU Elapsed Delta Access Method
----- ------- ----- -------------
15273 49525 34252 Memory Mapped & FreadSeek <---
19686 146298 126612 Memory Mapped
35398 44361 8963 FreadDir
35903 65936 30033 FreadDir & FreadSeek
36590 44957 8367 Fread
39769 64529 24760 Fread & FreadSeek
Notice the incredible change in the "Memory Mapped & FreadSeek"
numbers (first line). The crucial difference here is in the timing and
quantity of calls to the FREADSEEK intrinsic. Earlier testing showed
that the best case "throughput" for reading data with a mapped file
(where the data was already memory resident) was about 3,111 bytes per
millisecond. (Obtained from the memory-resident speed of reading
CATALOG.PUB.SYS (80 * 7040 bytes) in 181 milliseconds.) Clearly, any
prefetch should be done far enough ahead of time that the data is in
memory by the time it is needed. The above calculation showed that if we
assume it takes 30 milliseconds to read data from disc, then data must
be requested 30 * 3,111 = roughly 93,000 bytes (about 23 logical pages)
before it is needed.
The test program was adjusted to prefetch 128 records ahead
(instead of 4 records ahead). The next round of timings showed a gain,
but not as much as hoped for. Then, we realized that each prefetch was
reading only 8 logical pages. So, after processing 24 logical pages
(roughly 100,000 bytes) the test program had prefetched only 8 logical
pages, not 24! The program was modified again, to prefetch 24 logical
pages at a time, resulting in the times shown above.
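One plausible shape of that final loop, as a C sketch (the real test
program was SPLash!, and the paper does not show its exact code): each
FREADSEEK reads about 8 logical pages, so a hint is issued every 128
records, aimed 384 records (24 logical pages) ahead.

   #include <string.h>
   #pragma intrinsic FREADSEEK

   void mapped_read_with_seek(int fid, char *filedata, long numrecs)
   {
      char buf[256];
      long rec;

      for (rec = 0; rec < numrecs; rec++)
      {
         if (rec % 128 == 0)               /* every 8 logical pages... */
            FREADSEEK(fid, rec + 384);     /* ...prefetch 24 pages out */
         memcpy(buf, filedata + rec * 256L, 256);  /* mapped access    */
      }
   }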
Moral: prefetching via FREADSEEK is worth the time, but ONLY
after careful analysis. Failure to prefetch at the right time, or not
enough data, is worse than not prefetching at all.