Process management uses kernel structures down to the pregions
to execute a process. The u_area, proc
structure, vas, and pregion
are per-process resources, because each process has its own unique
copies of these structures, which are not shared among multiple
processes. Below the pregion level are
the systemwide resources. These structures can be shared among
multiple processes (although they are not required to be shared).

Memory management kernel structures map pregions to physical
memory and provide support for the processor's ability
to translate virtual addresses to physical memory. The table that
follows introduces the structures involved in memory management;
these are discussed later in detail.

Table 1-7 Principal Memory Management Kernel Structures

Kernel structure | Purpose |
---|---|
vas | Keeps track of the structural elements associated with a process in memory. One vas maintained per process. |
pregion | A per-process resource that describes the regions attached to the process. |
region | A memory-resident system resource that can be shared among processes. Points to the process's B-tree, vnode, pregions. |
B-tree | Balanced tree that stores pairs of page indices and chunk addresses. At the root of a B-tree of VFDs and DBDs is struct broot. |
hpde | Contains information for virtual to physical translation (that is, from VFD to physical memory). |

Virtual Address Space (vas)

The vas represents the virtual
address space of a process and serves as the head of a doubly linked
list of process region data structures called pregions. The vas
data structure is always memory resident. When a process is invoked,
the system allocates a vas structure and puts its address in p_vas,
a field in the proc structure.

The virtual address space of a process is broken down into
logical chunks of virtually contiguous pages. (See the Process
Management white paper for a table of vas entries.)

Virtual memory elements of a pregion

Each pregion represents a
process's view of a particular portion of pages and information
on getting to those pages. The pregion points
to the region data structure that describes the pages'
physical locations in memory or in secondary storage. The pregion
also contains the virtual addresses to which the process's
pages are mapped, the page usage (text, data, stack, and so forth),
and page protections (read, write, execute, and so on).

The following elements of a per-process pregion
structure are important to the virtual memory subsystem.

Table 1-8 Principal elements of struct pregion

Element | Purpose |
---|---|
p_type | Type of pregion. |
*p_reg | Pointer to the region attached by the pregion. |
p_space, p_vaddr | Virtual address of the pregion, based on virtual space and virtual offset. |
p_off | Offset into the region, specified in pages. |
p_count | Number of pages mapped by the pregion. |
p_ageremain, p_agescan, p_stealscan, p_bestnice | Used in the vhand algorithm to age and steal pages of memory (discussed later). |
*p_vas | Pointer to the vas to which the pregion is linked. |
p_forw, p_back | The doubly-linked list, used by vhand to walk the active pregions. |
p_deactsleep | The address at which a deactivated process is sleeping. |
p_pagein | Size of an I/O, used for scheduling when moving data into memory. |
p_strength, p_nextfault | Used to track the ratio between sequential and random faults; used to adjust p_pagein. |
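
To make the role of p_space, p_vaddr, p_off, and p_count concrete, the following is a minimal sketch, not the HP-UX implementation. It assumes 4 KB pages, a simplified pregion that holds only the fields needed here, and a hypothetical NULL-terminated p_next link standing in for the list headed by the vas. The helper names find_pregion and region_page_index are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u            /* assumption: 4 KB pages */

/* Simplified stand-in for struct pregion; only the fields used below. */
struct pregion {
    struct pregion *p_next;        /* hypothetical link for the per-vas list */
    uint32_t p_space;              /* virtual space */
    uint32_t p_vaddr;              /* virtual offset of the first mapped byte */
    uint32_t p_off;                /* offset into the region, in pages */
    uint32_t p_count;              /* number of pages mapped */
};

/* Return the pregion that maps (space, vaddr), or NULL if none does. */
const struct pregion *find_pregion(const struct pregion *head,
                                   uint32_t space, uint32_t vaddr)
{
    for (const struct pregion *p = head; p != NULL; p = p->p_next) {
        uint64_t start = p->p_vaddr;
        uint64_t end = start + (uint64_t)p->p_count * PAGE_SIZE;
        if (p->p_space == space && vaddr >= start && vaddr < end)
            return p;
    }
    return NULL;
}

/* Convert a virtual offset inside a pregion to a page index in its region. */
uint32_t region_page_index(const struct pregion *p, uint32_t vaddr)
{
    return p->p_off + (vaddr - p->p_vaddr) / PAGE_SIZE;
}
```

The page index returned by region_page_index is the kind of value that the region's chunk or B-tree lookup, described later, takes as input.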

The Region, a system resource

The region is a system-wide kernel data structure that associates
groups of pages with a given process. Regions can be one of two
types, private (used by a single process) or shared (able to be
used by more than one process). Space for a region data structure
is allocated as needed. The region structure is never written to
a swap device, although its B-tree
may be. Regions are pointed to by pregions, which are a per-process
resource. Regions point to the vnode
where the blocks of data reside when not in memory.

Table 1-9 region (struct region)

Element | Meaning |
---|---|
r_flags | Region flags (enumerated shortly). |
r_type | RT_PRIVATE: Multiple processes cannot share region. PT_DATA and PT_STACK pregions point to RT_PRIVATE regions. RT_SHARED: Multiple processes can share region. PT_SHMEM and most PT_TEXT pregions point to RT_SHARED regions. |
r_pgsz | Size of region in pages if all pages are in memory. |
r_nvalid | Number of valid pages in region. This equals the number of valid vfds in the B-tree or b_chunk. |
r_dnvalid | Number of pages in swapped region. If the system swaps the entire process, the value of r_nvalid is copied here to later calculate how many pages the process will need when it faults back in. This information is used to decide which process to reactivate. |
r_swalloc | Total number of pages reserved and allocated for this region on the swap device. Does not account for swap space allocated for vfd/dbd pairs. |
r_swapmem, r_vfd_swapmem | Memory reserved for pseudo-swap or vfd swap. |
r_lockmem | Number of pages currently allocated to the region for lockable memory, including lockable memory allocated for vfd/dbd pairs. |
r_pswapf, r_pswapb | Forward and backward pointers to lists of pseudo-swap pages. |
r_refcnt | Number of pregions pointing at the region. |
r_zomb | Set to indicate modified text. If an executing a.out file on a remote system has changed, the pages are flushed from the processor's cache, causing the next attempted access to fault. The fault handler finds that r_zomb is non-zero, prints the message "Pid %d killed due to text modification or page I/O error" and sends the process a SIGKILL. |
r_off | Offset into the page-aligned vnode, specified in pages; valid only if RF_UNALIGNED is not set. Page r_off of the vnode is referenced by the first entry of the first chunk of the region's B-tree. |
r_incore | Number of pregions sharing the region whose associated processes have the SLOAD flag set. |
r_mlockcnt | Number of processes that have locked this region in memory. |
r_dbd | Disk block descriptor for B-tree pages written to a swap device. Specifies the location of the first page in the linked list of pages. |
r_fstore, r_bstore | Pointers to vnode of origin and destination of block. This data depends on the type of pregion above the region. In most cases, r_bstore is set to the paging system vnode, the global swapdev_vp that is initialized at system startup. |
r_forw, r_back | Pointers to linked list of all active pregions. |
r_hchain | Hash for region. |
r_lock | Region lock structure used to get read or read/write locks to modify the region structure. |
r_mlock | Wait for region to be locked in memory. |
r_poip | Number of page I/Os in progress. |
r_root | Root of B-tree; if referencing more than one chunk, r_key is set to DONTUSE_IDX. |
r_key, r_chunk | Used instead of B-tree search if referencing 32 or fewer pages. |
r_excproc | Pointer to the proc table entry, if the process has RF_EXCLUSIVE set in r_flags. |
r_hdl | Hardware-dependent layer structure. |
r_next, r_prev | Circularly linked list of all regions sharing pages/vnode. |
r_pregs | List of pregions pointing to the region. |
r_lchain | Linked list of memory lock ranges. |
r_mlockswap | Swap reserved to cover locks. |

a.out Support for Unaligned Pages

Text and data of most executables start on a four-kilobyte
page boundary. HP-UX can treat these as memory-mapped files, because
a page in the file maps directly to a page in memory.

In addition to the fields shown, struct region has fields
to support executables compiled on older versions of HP-UX whose
text and data do not align on a (4 KB) page boundary. These executables
are referenced by regions that have RF_UNALIGNED set in r_flags.

Table 1-10 a.out support by regions

Element | Meaning |
---|---|
r_byte, r_bytelen | Offset into the a.out file and length of its text. |
r_hchain | Hash list of unaligned regions. |

Various indicators of the state of the region are specified
in r_flags.

Table 1-11 Region flags

Region flag | Meaning |
---|---|
RF_ALLOC | Always set because HP-UX regions are allocated and freed on demand; there is no free list. |
RF_MLOCKING | Indicator of whether a region is locked; set before r_mlock, cleared after r_mlock is released. |
RF_UNALIGNED | Set if text of an executable does not start on a page boundary. In this case, the text is read through the buffer cache to align it, and the vfds are pointed at the buffer cache pages. |
RF_WANTLOCK | Set if another stream wanted to lock this region, but found it already locked and went to sleep. After the region is unlocked, this flag ensures that wakeup() is called so the waiting stream(s) can proceed. |
RF_HASHED | The text is unaligned (RF_UNALIGNED) and thus is on a hash chain. The region is hashed with r_fstore and r_byte; the head of each hash chain is in texts[]. The RF_UNALIGNED flag may be set without the RF_HASHED flag (if the system tries to get the hashed region but it is locked, the system will create a private one), but the RF_HASHED flag will never be set without the RF_UNALIGNED flag. |
RF_EVERSWP, RF_NOWSWP | Set if the B-tree has ever been or is now written to a swap device. These flags are used for debugging. |
RF_IOMAP | This region was created with an iomap() system call, and thus requires special handling when calling exit(). |
RF_LOCAL | Region is swapped locally. |
RF_EXCLUSIVE | The mapping process is allowed exclusive access to the region. This flag is set, and r_excproc is set to the proc table pointer. |
RF_SWLAZYWRT | If an a.out is marked EXEC_MAGIC, a lazy swap algorithm is used, meaning swap is not reserved or allocated until needed. The text file is not likely to be modified, but if it is, a page of swap will be reserved for it at that time. |
RF_STATIC_PREDICT | Text object uses static branch prediction for compiler optimization. |
RF_ALL_MLOCKED | Entire region is memory locked, as a result of a plock having been performed on the pregion associated with the region. |
RF_SWAPMEM | Region is using pseudo-swap; that is, a portion of memory is being held for swap use. |
RF_LOCKED_LARGE | Region is using large pages; used with superpages. |
RF_SUPERPAGE_TEXT | Text region using large pages. |
RF_FLIPPER_DISABLE | Disable kernel assist prediction; a flag used for performance profiling. |
RF_MPROTECTED | Some part of the region is subject to the system call mprotect, which is performed on a memory-mapped file. |
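
Two of the rules stated in the flag descriptions can be written as small checks: RF_ALLOC is always set, and RF_HASHED is never set without RF_UNALIGNED. The sketch below is illustrative only. The bit values are hypothetical (the real header values are not given here), the helper names are invented, and clearing RF_WANTLOCK at unlock time is an assumption about how the flag is consumed.

```c
#include <stdbool.h>

/* Hypothetical bit assignments; the real values live in the kernel headers. */
enum {
    RF_ALLOC     = 1 << 0,
    RF_UNALIGNED = 1 << 1,
    RF_HASHED    = 1 << 2,
    RF_WANTLOCK  = 1 << 3
};

/* Check the two invariants described for r_flags. */
bool region_flags_consistent(unsigned r_flags)
{
    if (!(r_flags & RF_ALLOC))
        return false;                      /* RF_ALLOC is always set */
    if ((r_flags & RF_HASHED) && !(r_flags & RF_UNALIGNED))
        return false;                      /* hashed implies unaligned */
    return true;
}

/* On unlock, report whether a sleeper must be woken.
 * Assumption: the flag is cleared once it has been acted on. */
bool region_unlock_needs_wakeup(unsigned *r_flags)
{
    bool wanted = (*r_flags & RF_WANTLOCK) != 0;
    *r_flags &= ~(unsigned)RF_WANTLOCK;
    return wanted;
}
```

A caller would invoke wakeup() only when region_unlock_needs_wakeup() returns true, which is the behavior the RF_WANTLOCK description implies.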

pseudo-vas for Text and Shared Library pregions

When a file is opened as an a.out or shared library, the easiest
way to keep track of the region is to create a pseudo-vas
the first time the file is opened as an executable. This is done
by calling mapvnode() and storing
the vas pointer in the vnode's
v_vas element. On subsequent opens
of the file as an executable, the non-NULL value in v_vas
aids in finding the region to which the virtual address space is
being attached.

The pseudo-vas is type PT_MMAP,
and the associated pregion has
PF_PSEUDO set in p_flags.
This pregion is attached to the
region for this vnode. All the
processes that use this executable or shared library (non-pseudo
pregions) then attach to the region with type PT_TEXT
(a.out) or PT_MMAP
(shared library).

The number of processes using a particular vnode
as an executable is kept in the pseudo-vas
in va_refcnt.

All pregions associated with
a region are connected with a doubly-linked list that begins
with the region element r_pregs,
and is defined in the pregions by p_prpnext
and p_prpprev. The list is sorted
by p_off, the pregion's
offset into the region, and is NULL-terminated.

Even after all processes using the a.out or
shared library exit, the handle to the region remains; its pages
can be disposed of at that time.

Chunks -- Keeping the vfds and dbds together in one place

Since information is typically needed about groups of (rather
than individual) pages, pages are grouped into chunks. A chunk
contains 32 pairs of virtual frame descriptors and disk block descriptors:

- The kernel looks for a page in memory by its virtual frame descriptor (vfd).
- The kernel looks for a page on disk by its disk block descriptor (dbd).

By definition, if the vfd's
pg_v bit is set, the vfd
is used; if not, the dbd is used.
A one-to-one correspondence is maintained between vfd
and dbd through the vfddbd
structure, which simply contains one vfd (c_vfd)
and one dbd (c_dbd).

HP-UX regions use chunks of vfds
and dbds to keep track of page
ownership:

- For assignment from virtual page to physical page if the page is valid. (This is required in addition to the PDIR. The term "assignment" is used (rather than mapping) because the page might not be translated but valid.)
- Other virtual attributes of the page (such as whether the page is locked in memory, or whether it is valid).
- Location on disk for front-store and back-store pages.
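
The pairing of one vfd with one dbd, and the rule that pg_v selects which of the two is used, can be sketched as follows. This is an illustration under stated assumptions: the bit-field widths are hypothetical (the text says only that each descriptor is one word), only the fields needed here are shown, and locate_page and struct page_ref are invented names.

```c
#define CHUNK_PAGES 32                 /* vfd/dbd pairs per chunk, per the text */

/* One-word descriptors; field widths are assumptions for this sketch. */
struct vfd { unsigned int pg_v : 1; unsigned int pg_pfnum : 21; };
struct dbd { unsigned int dbd_type : 3; unsigned int dbd_data : 29; };

/* One vfd paired with one dbd. */
struct vfddbd { struct vfd c_vfd; struct dbd c_dbd; };

enum page_where { PAGE_IN_MEMORY, PAGE_ON_DISK };

/* Result of a lookup: which descriptor applies and its payload. */
struct page_ref { enum page_where where; unsigned value; };

/* Pick the vfd or the dbd for a page within one chunk, based on pg_v. */
struct page_ref locate_page(const struct vfddbd *chunk, unsigned page)
{
    const struct vfddbd *e = &chunk[page % CHUNK_PAGES];
    struct page_ref r;
    if (e->c_vfd.pg_v) {
        r.where = PAGE_IN_MEMORY;
        r.value = e->c_vfd.pg_pfnum;   /* page frame number */
    } else {
        r.where = PAGE_ON_DISK;
        r.value = e->c_dbd.dbd_data;   /* disk location data */
    }
    return r;
}
```

The slot within a chunk is the page index modulo 32; selecting the chunk itself is the job of the region's r_key/r_chunk shortcut or its B-tree.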

Virtual Frame Descriptors (vfd)

A one-word structure called a virtual frame descriptor enables
processes to reference pages of memory. The vfd
is used when the process is in memory, and can be used to refer
to the page of memory described in pfdat.

Table 1-12 Virtual Frame Descriptor (struct vfd)

Element | Meaning |
---|---|
pg_v | Valid flag. If set, this page of memory contains valid data and pg_pfnum is valid. If not set, the page's valid data is on a swap device. |
pg_cw | Copy-on-write flag. If set, a write to the page causes a data protection fault, at which time the system copies the page. |
pg_lock | Lock flag. If set, raw I/O is occurring on this page. Either the data is being transferred between the page and the disk, or data is being transferred between two memory pages. The kernel sleeps waiting for completion of I/O before launching further raw I/O to or from this page. Nothing can read the page while it is being written to disk. |
pg_mlock | If set, the page is locked in memory and cannot be paged out. |
pg_pfnum (aliased as pg_pfn) | Page frame number, from which can be accessed the correct pfdat entry for this page. |

Disk Block Descriptor (dbd)

When the pg_v bit in a vfd
is not set, the vfd is invalid
and the page of data is not in memory but on disk. In this case,
the disk block descriptor (dbd)
gives valid reference to the data. Like the vfd
structure, the dbd is one word
long.

Table 1-13 Disk Block Descriptor (struct dbd)

Element | Meaning |
---|---|
dbd_type | One of six three-bit flags used to interpret dbd_data. DBD_NONE: No copy of this data exists on disk. DBD_FSTORE, DBD_BSTORE: Page can be found on a "front or back store" device, pointed to by a region's vnode. [1] DBD_DFILL: This is a demand-fill page. No space is allocated; when a fault occurs it is initialized by filling it with data from disk. DBD_DZERO: This is a demand zero page; when requested, allocate a new page and initialize it with zeroes. DBD_HOLE: Used for a sparse memory-mapped file; when read, the page gives zeros. When written to, a page is allocated, initialized to zero, data inserted, at which time the dbd type changes to DBD_NONE. |
dbd_data | vnode type (nfs, ufs) specific data. A pointer points to data in a file pointed to by a vnode. |
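
The six dbd_type values amount to a small dispatch on how a missing page is to be produced. The sketch below restates the table as a switch. The numeric enum values, the fill_action categories, and both function names are assumptions made for illustration. Only the DBD_HOLE to DBD_NONE transition on write is taken from the text; leaving other types unchanged on write is a simplification.

```c
/* The six types named in the text; numeric values here are hypothetical. */
enum dbd_type {
    DBD_NONE, DBD_FSTORE, DBD_BSTORE, DBD_DFILL, DBD_DZERO, DBD_HOLE
};

/* Illustrative categories for how a faulting page gets its contents. */
enum fill_action {
    FILL_NOTHING,      /* no disk copy to fetch */
    FILL_FROM_VNODE,   /* read from the front or back store vnode */
    FILL_FROM_DISK,    /* demand fill with data from disk */
    FILL_ZERO          /* allocate a page and zero it */
};

/* Map a dbd type to the way the page would be filled on a fault. */
enum fill_action dbd_fill_action(enum dbd_type t)
{
    switch (t) {
    case DBD_FSTORE:
    case DBD_BSTORE:
        return FILL_FROM_VNODE;
    case DBD_DFILL:
        return FILL_FROM_DISK;
    case DBD_DZERO:
    case DBD_HOLE:
        return FILL_ZERO;
    case DBD_NONE:
    default:
        return FILL_NOTHING;
    }
}

/* A write to a hole allocates a real page, so the type becomes DBD_NONE. */
enum dbd_type dbd_type_after_write(enum dbd_type t)
{
    return t == DBD_HOLE ? DBD_NONE : t;
}
```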

Balanced Trees (B-Trees)

Each region contains either a single array of vfd/dbd
(chunk) or a pointer to a B-tree.
The structure called a B-tree allows
for quick searches and efficient storage of sparse data. A bnode
is the same size as a chunk; both can be gotten from the same source
of memory.

The region's B-tree stores pairs of page indices
and chunk addresses. HP-UX uses an order 29 B-tree.

A B-tree is searched with
a key and yields a value. In the region B-tree,
the key is the page number in the region divided by 32, the number
of vfddbds in a chunk.

Each node of a B-tree contains
room for order+1 keys (or index numbers) and order+2 values. If
a node grows to contain more than order keys, it is split into two
nodes; half of the pairs are kept in the original node and the other
half are copied to the new node. The B-tree
node data also includes the number of valid elements contained in
that node.

Table 1-14 B-tree Node Description (struct bnode)

Element | Meaning |
---|---|
b_key[B_SIZE] | The array of keys used for each page index of the bnode. |
b_nelem | Number of valid keys/values in the bnode. |
b_down[B_SIZE+1] | The array of values in the bnode, either pointers to another bnode (if this is an interior bnode) or pointers to chunks (if this is a leaf bnode). |
b_scr1, b_scr2 | bnode padding to the size of a chunk, to allow bnodes and chunks to be allocated from the same pool of memory. |
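
The search described above (key = page number / 32, interior nodes pointing to other bnodes, leaves pointing to chunks) can be sketched as below. This is a generic B-tree descent, not the kernel's routine. It assumes B_SIZE equals order+1, omits the padding fields, and assumes a convention in which child i of an interior node covers keys below b_key[i] and the last child covers the rest. The names page_to_key and btree_find_chunk are invented.

```c
#include <stddef.h>

#define CHUNK_PAGES 32               /* vfddbds per chunk */
#define B_ORDER     29               /* order stated in the text */
#define B_SIZE      (B_ORDER + 1)    /* assumption: room for order+1 keys */

/* Simplified bnode; padding fields b_scr1/b_scr2 are omitted. */
struct bnode {
    int   b_key[B_SIZE];
    int   b_nelem;                   /* number of valid keys */
    void *b_down[B_SIZE + 1];        /* child bnodes or chunk pointers */
};

/* The B-tree key for a page is its chunk number. */
int page_to_key(int page)
{
    return page / CHUNK_PAGES;
}

/* Find the chunk holding `page`; depth is the number of levels (1 = leaf root).
 * Returns NULL when no chunk exists for that key (sparse region). */
void *btree_find_chunk(const struct bnode *root, int depth, int page)
{
    int key = page_to_key(page);
    const struct bnode *n = root;

    /* Descend through interior levels. */
    for (int level = 1; n != NULL && level < depth; level++) {
        int i = 0;
        while (i < n->b_nelem && key >= n->b_key[i])
            i++;
        n = n->b_down[i];
    }
    if (n == NULL)
        return NULL;

    /* Exact match in the leaf yields the chunk address. */
    for (int i = 0; i < n->b_nelem; i++)
        if (n->b_key[i] == key)
            return n->b_down[i];
    return NULL;
}
```

Returning NULL for an absent key reflects the sparse-storage property: no chunk needs to exist for page ranges the region has never touched.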

Root of the B-tree

A structure called struct broot
points to the start of the B-tree.

Table 1-15 Struct broot

Element | Meaning |
---|---|
b_root | Pointer to the initial point of the B-tree. |
b_depth | Number of levels in the B-tree. |
b_npages | Pages used to construct the B-tree, counting both pages used for chunks and bnodes. |
b_rpages | Number of real pages in the region; swap pages reserved for the B-tree by the kernel, using the routine vfdpgs(). Amount of swap allocated for the vfd/dbd pairs in the B-tree structure. |
b_list | Pointer to a linked list of memory pages from which new bnodes or chunks can be added to the B-tree. |
b_nfrag | Number of the next chunk available, derived from the unused 256-byte fragments in b_list. |
b_rp | Pointer to the region using the B-tree. |
b_protoidx, b_proto1, b_proto2 | Stores page index of default dbd and prototype to minimize time and memory costs to allocate chunk space. |
b_vproto | List of page ranges whose bits are marked copy on write. |
b_key_cache, b_val_cache | Caches of most recently used keys and pointers to chunks associated with the keys; checked first when querying the virtual memory subsystem. |

The struct vfdcw governs
the vfd prototype.

Table 1-16 vfd prototype (struct vfdcw)

Element | Meaning |
---|---|
v_start[MAXVPROTO] | Page that indexes start of copy-on-write range; set to -1 if unused. |
v_end[MAXVPROTO] | End of copy-on-write range. |
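
A range table of this shape is typically consulted with a linear scan. The following sketch shows one way to test whether a page falls in a copy-on-write range. The value of MAXVPROTO is hypothetical, treating v_end as inclusive is an assumption, and page_is_copy_on_write is an invented helper name.

```c
#include <stdbool.h>

#define MAXVPROTO 4                  /* hypothetical; real value not given here */

struct vfdcw {
    int v_start[MAXVPROTO];          /* first page of range, -1 if slot unused */
    int v_end[MAXVPROTO];            /* last page of range (assumed inclusive) */
};

/* Return true if `page` lies inside any recorded copy-on-write range. */
bool page_is_copy_on_write(const struct vfdcw *cw, int page)
{
    for (int i = 0; i < MAXVPROTO; i++) {
        if (cw->v_start[i] == -1)
            continue;                /* unused slot */
        if (page >= cw->v_start[i] && page <= cw->v_end[i])
            return true;
    }
    return false;
}
```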

Hardware-Independent Page Information table (pfdat)

The hardware independent layer of the virtual memory subsystem
manages pages in memory, pages written to swap devices, and the
movement of pages from one to the other. The act of moving data
from physical memory to a swap device, or moving data from a swap
device to physical memory, is called paging.

Basic to hardware independence is the page frame data table
(pfdat), a big array indexed directly
through the page number. Each page of available memory is represented
by one pfdat structure; one pfdat entry
represents each page frame writable to a swap device. HP-UX never
pages kernel memory (the pages containing kernel text, stack, and
data); thus, pfdat manages only
the subset representing freely available physical memory. When
the pfdat is initialized, all free
pages are linked in a list pointed to by phead.

Table 1-17 Principal entries in struct pfdat (page frame data)

Element | Meaning |
---|---|
pf_hchain | Hash chain link. |
pf_flags | Page frame data flags (shown in the next table). |
pf_pfn | Physical page frame number. |
pf_use | Number of regions sharing the page; when pf_use drops to zero, the page can be placed on the free linked list. |
pf_devvp | vnode for swap device. (Hashing is done on the tuple of (pf_devvp, pf_data).) |
pf_data | Disk block number on swap device. |
pf_next, pf_prev | Next and previous free pfdat entries. |
pf_cache_waiting | If set, this element means that a thread is waiting to grab the pf_lock on that page. Required for synchronization. |
pf_lock | Lock pfdat entry (beta semaphore), used to lock the page while modifying the pde (physical-to-virtual translation, access rights, or protection ID). |
pf_hdl | Hardware dependent layer elements (see hdl_pfdat discussion, shortly). |
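
The interaction between pf_use and the free list headed by phead can be sketched as a reference-count release. This is a simplified illustration, not kernel code. It assumes phead is a sentinel node of a circular doubly linked list, uses a hypothetical bit value for P_QUEUE (the flag that marks a page as being on the free queue), ignores locking, and uses the invented helper names freelist_init and page_release.

```c
#include <stddef.h>

#define P_QUEUE 0x1u                 /* hypothetical bit value for the flag */

/* Simplified pfdat with only the fields used in this sketch. */
struct pfdat {
    struct pfdat *pf_next, *pf_prev; /* free-list links */
    unsigned      pf_flags;
    unsigned      pf_pfn;            /* physical page frame number */
    int           pf_use;            /* number of regions sharing the page */
};

/* Assumption: phead is a sentinel in a circular doubly linked list. */
void freelist_init(struct pfdat *phead)
{
    phead->pf_next = phead->pf_prev = phead;
}

/* Drop one reference. When pf_use reaches zero, append the page to the
 * free list and mark it P_QUEUE. Returns 1 if the page was freed. */
int page_release(struct pfdat *phead, struct pfdat *pf)
{
    if (pf->pf_use > 0 && --pf->pf_use == 0) {
        pf->pf_prev = phead->pf_prev;
        pf->pf_next = phead;
        phead->pf_prev->pf_next = pf;
        phead->pf_prev = pf;
        pf->pf_flags |= P_QUEUE;
        return 1;
    }
    return 0;
}
```

Appending at the tail is a design choice of this sketch; it keeps recently freed pages away from the head so their contents stay reclaimable for longer, but the text does not state which end the kernel uses.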

Flags showing the Status of the Page

Table 1-18 Principal pf_flags values

Flag | Meaning |
---|---|
P_QUEUE | Page is on the free queue, headed by phead. |
P_BAD | Page is marked as bad by the memory deallocation subsystem. |
P_HASH | Page is on a hash queue; contains head of queue. |
P_ALLOCATING | Page is being allocated; prevents another process from taking the page while it is being remapped. |
P_SYS | Page is being used by the kernel rather than by a user process. Pages marked with this flag include dynamic buffer cache pages, B-tree pages and the results of kernel memory allocation. They are used by the kernel for critical data structures in addition to the kernel static pages that were not included in pfdat. |
P_DMEM | Page is locked by the memory diagnostics subsystem; set and cleared with an ioctl() call to the dmem driver. |
P_LCOW | Page is being remapped by copy-on-write. |
P_UAREA | Page is used by a pregion of type PT_UAREA. |

Hardware-Dependent Layer page frame data entry

If pf_hdl is referenced in
struct pfdat, the struct hdlpfdat
(defined in hdl_pfdat.h) is used.
pf_hdl is of type struct hdlpfdat.

Table 1-19 struct hdlpfdat

Element | Meaning |
---|---|
hdlpf_flags | Flags that show the HDL status of the page. HDLPF_TRANS: A virtual address translation exists for this page. HDLPF_PROTECT: Page is protected from user access. This flag indicates that the saved values are valid. HDLPF_STEAL: Virtual translation should be removed when pending I/O is complete. HDLPF_MOD: Analogous to changing the pde_modified flag in the pde. HDLPF_REF: Analogous to changing the pde_ref flag in the pde. HDLPF_READA: Read-ahead page in transit; used to indicate to the hdl_pfault() routine that it should start the next I/O request before waiting for the current I/O request to complete. |
hdlpf_savear | Saved page access rights. |
hdlpf_saveprot | Saved page protection ID. |