HP-UX Memory Management: White Paper > Chapter 1 MEMORY MANAGEMENT

VIRTUAL MEMORY STRUCTURES

Figure 1-13 Memory management structures

[Memory management structures]

Process management uses kernel structures down to the pregions to execute a process. The u_area, proc structure, vas, and pregion are per-process resources, because each process has its own unique copies of these structures, which are not shared among multiple processes.

Below the pregion level are the systemwide resources. These structures can be shared among multiple processes (although they are not required to be shared).

Memory management kernel structures map pregions to physical memory and provide support for the processor's ability to translate virtual addresses to physical memory. The table that follows introduces the structures involved in memory management; these are discussed later in detail.

Table 1-7 Principal Memory Management Kernel Structures

Kernel structure | Purpose
vas | Keeps track of the structural elements associated with a process in memory. One vas is maintained per process.
pregion | A per-process resource that describes the regions attached to the process.
region | A memory-resident system resource that can be shared among processes. Points to the process's B-tree, vnode, and pregions.
B-tree | Balanced tree that stores pairs of page indices and chunk addresses. At the root of a B-tree of VFDs and DBDs is struct broot.
hpde | Contains information for virtual to physical translation (that is, from VFD to physical memory).

 

Virtual Address Space (vas)

The vas represents the virtual address space of a process and serves as the head of a doubly linked list of process region data structures called pregions. The vas data structure is always memory resident.

When a process is invoked, the system allocates a vas structure and puts its address in p_vas, a field in the proc structure.

The virtual address space of a process is broken down into logical chunks of virtually contiguous pages. (See the Process Management white paper for a table of vas entries.)

Virtual memory elements of a pregion

Each pregion represents a process's view of a particular portion of pages and information on getting to those pages. The pregion points to the region data structure that describes the pages' physical locations in memory or in secondary storage. The pregion also contains the virtual addresses to which the process's pages are mapped, the page usage (text, data, stack, and so forth), and page protections (read, write, execute, and so on).

Figure 1-14 Virtual memory elements of the pregion

[Virtual memory elements of the pregion]

The following elements of a per-process pregion structure are important to the virtual memory subsystem.

Table 1-8 Principal elements of struct pregion

Element | Purpose
p_type | Type of pregion.
*p_reg | Pointer to the region attached by the pregion.
p_space, p_vaddr | Virtual address of the pregion, based on virtual space and virtual offset.
p_off | Offset into the region, specified in pages.
p_count | Number of pages mapped by the pregion.
p_ageremain, p_agescan, p_stealscan, p_bestnice | Used in the vhand algorithm to age and steal pages of memory (discussed later).
*p_vas | Pointer to the vas to which the pregion is linked.
p_forw, p_back | Links of the doubly linked list used by vhand to walk the active pregions.
p_deactsleep | The address at which a deactivated process is sleeping.
p_pagein | Size of an I/O, used for scheduling when moving data into memory.
p_strength, p_nextfault | Used to track the ratio between sequential and random faults; used to adjust p_pagein.
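To make the roles of p_space, p_vaddr, p_off, and p_count concrete, the following is a minimal sketch of how a virtual address could be resolved to a pregion and then to a page index within the attached region. The sk_ types, the va_head field, the NULL-terminated list, and the 4 KB page size are assumptions made for illustration; the real structures hold many more fields and the kernel's own lookup routine is not shown in this paper.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096u          /* assumed 4 KB page size */

/* Simplified stand-in for struct pregion (illustrative only). */
struct sk_pregion {
    struct sk_pregion *p_forw;      /* next pregion on the vas list */
    struct sk_pregion *p_back;      /* previous pregion on the vas list */
    uint32_t p_space;               /* virtual space of the pregion */
    uint32_t p_vaddr;               /* starting virtual offset in that space */
    uint32_t p_off;                 /* offset into the region, in pages */
    uint32_t p_count;               /* number of pages mapped */
};

/* Simplified stand-in for the vas; va_head is a hypothetical field name. */
struct sk_vas {
    struct sk_pregion *va_head;
};

/* Return the pregion that maps (space, vaddr), or NULL if none does. */
struct sk_pregion *sk_find_pregion(const struct sk_vas *vas,
                                   uint32_t space, uint32_t vaddr)
{
    for (struct sk_pregion *p = vas->va_head; p != NULL; p = p->p_forw) {
        uint64_t end = (uint64_t)p->p_vaddr +
                       (uint64_t)p->p_count * SK_PAGE_SIZE;
        if (p->p_space == space && vaddr >= p->p_vaddr &&
            (uint64_t)vaddr < end)
            return p;
    }
    return NULL;
}

/* Page index within the region for an address that lies inside pregion p. */
uint32_t sk_region_page(const struct sk_pregion *p, uint32_t vaddr)
{
    return p->p_off + (vaddr - p->p_vaddr) / SK_PAGE_SIZE;
}
```

For example, a pregion with p_vaddr 0x8000, p_off 3, and p_count 4 would place address 0x9000 at page 4 of its region under these assumptions.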

 

The Region, a system resource

The region is a system-wide kernel data structure that associates groups of pages with a given process. Regions can be one of two types, private (used by a single process) or shared (able to be used by more than one process). Space for a region data structure is allocated as needed. The region structure is never written to a swap device, although its B-tree may be.

Regions are pointed to by pregions, which are a per-process resource. Regions point to the vnode where the blocks of data reside when not in memory.

Table 1-9 region (struct region)

Element | Meaning
r_flags | Region flags (enumerated shortly).
r_type | Region type, one of:

  • RT_PRIVATE: Multiple processes cannot share region. PT_DATA and PT_STACK pregions point to RT_PRIVATE regions.

  • RT_SHARED: Multiple processes can share region. PT_SHMEM and most PT_TEXT pregions point to RT_SHARED regions.

r_pgsz | Size of region in pages if all pages are in memory.
r_nvalid | Number of valid pages in region. This equals the number of valid vfds in the B-tree or b_chunk.
r_dnvalid | Number of pages in swapped region. If the system swaps the entire process, the value of r_nvalid is copied here to later calculate how many pages the process will need when it faults back in. This information is used to decide which process to reactivate.
r_swalloc | Total number of pages reserved and allocated for this region on the swap device. Does not account for swap space allocated for vfd/dbd pairs.
r_swapmem, r_vfd_swapmem | Memory reserved for pseudo-swap or vfd swap.
r_lockmem | Number of pages currently allocated to the region for lockable memory, including lockable memory allocated for vfd/dbd pairs.
r_pswapf, r_pswapb | Forward and backward pointers to lists of pseudo-swap pages.
r_refcnt | Number of pregions pointing at the region.
r_zomb | Set to indicate modified text. If an executing a.out file on a remote system has changed, the pages are flushed from the processor's cache, causing the next attempted access to fault. The fault handler finds that r_zomb is non-zero, prints the message "Pid %d killed due to text modification or page I/O error", and sends the process a SIGKILL.
r_off | Offset into the page-aligned vnode, specified in pages; valid only if RF_UNALIGNED is not set. Page r_off of the vnode is referenced by the first entry of the first chunk of the region's B-tree.
r_incore | Number of pregions sharing the region whose associated processes have the SLOAD flag set.
r_mlockcnt | Number of processes that have locked this region in memory.
r_dbd | Disk block descriptor for B-tree pages written to a swap device. Specifies the location of the first page in the linked list of pages.
r_fstore, r_bstore | Pointers to vnode of origin and destination of block. This data depends on the type of pregion above the region. In most cases, r_bstore is set to the paging system vnode, the global swapdev_vp that is initialized at system startup.
r_forw, r_back | Pointers to linked list of all active pregions.
r_hchain | Hash for region.
r_lock | Region lock structure used to get read or read/write locks to modify the region structure.
r_mlock | Wait for region to be locked in memory.
r_poip | Number of page I/Os in progress.
r_root | Root of B-tree; if referencing more than one chunk, r_key is set to DONTUSE_IDX.
r_key, r_chunk | Used instead of B-tree search if referencing 32 or fewer pages.
r_excproc | Pointer to the proc table entry, if the process has RF_EXCLUSIVE set in r_flags.
r_hdl | Hardware-dependent layer structure.
r_next, r_prev | Circularly linked list of all regions sharing pages/vnode.
r_pregs | List of pregions pointing to the region.
r_lchain | Linked list of memory lock ranges.
r_mlockswap | Swap reserved to cover locks.

 

a.out Support for Unaligned Pages

Text and data of most executables start on a four-kilobyte page boundary. HP-UX can treat these as memory-mapped files, because a page in the file maps directly to a page in memory.

In addition to the fields shown, struct region has fields to support executables compiled on older versions of HP-UX whose text and data do not align on a 4 KB page boundary. These executables are referenced by regions that have RF_UNALIGNED set in r_flags.

Table 1-10 a.out support by regions

Element | Meaning
r_byte, r_bytelen | Offset into the a.out file and length of its text.
r_hchain | Hash list of unaligned regions.

 

Region flags

Various indicators of the state of the region are specified in r_flags.

Table 1-11 Region flags

Region flag | Meaning
RF_ALLOC | Always set because HP-UX regions are allocated and freed on demand; there is no free list.
RF_MLOCKING | Indicator of whether a region is locked; set before r_mlock is acquired, cleared after r_mlock is released.
RF_UNALIGNED | Set if text of an executable does not start on a page boundary. In this case, the text is read through the buffer cache to align it, and the vfds are pointed at the buffer cache pages.
RF_WANTLOCK | Set if another stream wanted to lock this region, but found it already locked and went to sleep. After the region is unlocked, this flag ensures that wakeup() is called so the waiting stream(s) can proceed.
RF_HASHED | The text is unaligned (RF_UNALIGNED) and thus is on a hash chain. The region is hashed with r_fstore and r_byte; the head of each hash chain is in texts[]. The RF_UNALIGNED flag may be set without the RF_HASHED flag (if the system tries to get the hashed region but it is locked, the system will create a private one), but the RF_HASHED flag will never be set without the RF_UNALIGNED flag.
RF_EVERSWP, RF_NOWSWP | Set if the B-tree has ever been (RF_EVERSWP) or is now (RF_NOWSWP) written to a swap device. These flags are used for debugging.
RF_IOMAP | This region was created with an iomap() system call, and thus requires special handling when calling exit().
RF_LOCAL | Region is swapped locally.
RF_EXCLUSIVE | The mapping process is allowed exclusive access to the region. When this flag is set, r_excproc is set to the proc table pointer.
RF_SWLAZYWRT | If an a.out is marked EXEC_MAGIC, a lazy swap algorithm is used, meaning swap is not reserved or allocated until needed. The text file is not likely to be modified, but if it is, a page of swap will be reserved for it at that time.
RF_STATIC_PREDICT | Text object uses static branch prediction for compiler optimization.
RF_ALL_MLOCKED | Entire region is memory locked, as a result of a plock having been performed on the pregion associated with the region.
RF_SWAPMEM | Region is using pseudo-swap; that is, a portion of memory is being held for swap use.
RF_LOCKED_LARGE | Region is using large pages; used with superpages.
RF_SUPERPAGE_TEXT | Text region using large pages.
RF_FLIPPER_DISABLE | Disable kernel assist prediction; a flag used for performance profiling.
RF_MPROTECTED | Some part of the region is subject to the system call mprotect, which is performed on a memory-mapped file.
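The relationship between RF_UNALIGNED and RF_HASHED described in the table can be expressed as a simple consistency check on r_flags. This is only a sketch: the bit values below are invented for illustration and are not taken from the HP-UX headers, and sk_flags_consistent is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit assignments; the real values in the HP-UX headers differ. */
#define SK_RF_UNALIGNED 0x0004u
#define SK_RF_HASHED    0x0010u

/*
 * RF_HASHED is never set without RF_UNALIGNED, but RF_UNALIGNED may be set
 * alone (for example, when a private copy was created because the hashed
 * region was locked).
 */
bool sk_flags_consistent(uint32_t r_flags)
{
    if ((r_flags & SK_RF_HASHED) && !(r_flags & SK_RF_UNALIGNED))
        return false;
    return true;
}
```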

 

pseudo-vas for Text and Shared Library pregions

When a file is opened as an a.out or shared library, the easiest way to keep track of the region is to create a pseudo-vas the first time the file is opened as an executable. This is done by calling mapvnode() and storing the vas pointer in the vnode's v_vas element. On subsequent opens of the file as an executable, the non-NULL value in v_vas aids in finding the region to which the virtual address space is being attached.

The pseudo-vas is type PT_MMAP, and the associated pregion has PF_PSEUDO set in p_flags. This pregion is attached to the region for this vnode. All the processes that use this executable or shared library (non-pseudo pregions) then attach to the region with type PT_TEXT (a.out) or PT_MMAP (shared library). The number of processes using a particular vnode as an executable is kept in the pseudo-vas in va_refcnt.
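The first-open and subsequent-open behavior can be sketched as follows. This is a simplified model, not the kernel code: sk_make_pseudo_vas stands in for the work the text attributes to mapvnode(), the sk_ structures carry only the two fields mentioned above, and user-space calloc() replaces kernel allocation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal stand-in for the pseudo-vas (illustrative only). */
struct sk_pvas {
    int va_refcnt;              /* processes using the vnode as an executable */
};

/* Minimal stand-in for a vnode (illustrative only). */
struct sk_vnode {
    struct sk_pvas *v_vas;      /* NULL until first opened as an executable */
};

/* Hypothetical helper standing in for the pseudo-vas creation step. */
static struct sk_pvas *sk_make_pseudo_vas(void)
{
    return calloc(1, sizeof(struct sk_pvas));
}

/*
 * Open a file as an a.out or shared library: create the pseudo-vas on the
 * first open, reuse it on later opens, and count the user either way.
 */
struct sk_pvas *sk_open_executable(struct sk_vnode *vp)
{
    if (vp->v_vas == NULL) {
        vp->v_vas = sk_make_pseudo_vas();
        if (vp->v_vas == NULL)
            return NULL;        /* allocation failed */
    }
    vp->v_vas->va_refcnt++;
    return vp->v_vas;
}
```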

All pregions associated with a region are connected with a doubly-linked list that begins with the region element r_pregs, and is defined in the pregions by p_prpnext and p_prpprev. The list is sorted by p_off, the pregion's offset into the region, and is NULL-terminated.
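A sorted, NULL-terminated, doubly linked list of this kind could be maintained as in the sketch below. The sk_ structures are reduced to the fields named in the text, and sk_attach is a hypothetical helper; incrementing r_refcnt here and placing equal offsets after existing entries are assumptions rather than documented kernel behavior.

```c
#include <stddef.h>

/* Reduced stand-in for struct pregion (illustrative only). */
struct sk_pregion {
    struct sk_pregion *p_prpnext;   /* next pregion on the region's list */
    struct sk_pregion *p_prpprev;   /* previous pregion on the region's list */
    unsigned long p_off;            /* offset into the region, in pages */
};

/* Reduced stand-in for struct region (illustrative only). */
struct sk_region {
    struct sk_pregion *r_pregs;     /* head of the list, NULL when empty */
    int r_refcnt;                   /* number of pregions pointing here */
};

/* Insert prp into rp's pregion list, keeping the list sorted by p_off. */
void sk_attach(struct sk_region *rp, struct sk_pregion *prp)
{
    struct sk_pregion *prev = NULL;
    struct sk_pregion *cur = rp->r_pregs;

    /* Walk past every pregion whose offset is not greater than ours. */
    while (cur != NULL && cur->p_off <= prp->p_off) {
        prev = cur;
        cur = cur->p_prpnext;
    }

    prp->p_prpprev = prev;
    prp->p_prpnext = cur;
    if (cur != NULL)
        cur->p_prpprev = prp;
    if (prev != NULL)
        prev->p_prpnext = prp;
    else
        rp->r_pregs = prp;          /* new head of the list */

    rp->r_refcnt++;
}
```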

Even after all processes using the a.out or shared library exit, the handle to the region remains; its pages can be disposed of at that time.

Figure 1-15 Mapping the pseudo-vas structures

[Mapping the pseudo-vas structures]

Chunks -- Keeping the vfds and dbds together in one place

Since information is typically needed about groups of (rather than individual) pages, pages are grouped into chunks. A chunk contains 32 pairs of virtual frame descriptors and disk block descriptors:

  • The kernel looks for a page in memory by its virtual frame descriptor (vfd).

  • The kernel looks for a page on disk by its disk block descriptor (dbd).

  • By definition, if the vfd's pg_v bit is set, the vfd is used; if not, the dbd is used.

A one-to-one correspondence is maintained between vfd and dbd through the vfddbd structure, which simply contains one vfd (c_vfd) and one dbd (c_dbd).
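The chunk layout and the "vfd if valid, otherwise dbd" rule can be sketched in C. The descriptors are modeled as opaque 32-bit words, and the position of the pg_v bit (SK_PG_V) is an assumption made for illustration; only the field names c_vfd and c_dbd and the 32-entry, 256-byte chunk come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_CHUNK_ENTRIES 32
#define SK_PG_V 0x80000000u     /* assumed position of the pg_v bit */

/* Each descriptor is one 32-bit word; its contents are treated as opaque. */
typedef struct { uint32_t word; } sk_vfd;
typedef struct { uint32_t word; } sk_dbd;

/* One vfd paired with one dbd, as in the vfddbd structure. */
struct sk_vfddbd {
    sk_vfd c_vfd;
    sk_dbd c_dbd;
};

/* A chunk is 32 vfddbd pairs. */
struct sk_chunk {
    struct sk_vfddbd entry[SK_CHUNK_ENTRIES];
};

/* 32 pairs of two 4-byte words gives the 256-byte chunk. */
_Static_assert(sizeof(struct sk_chunk) == 256, "chunk must be 256 bytes");

/* True if the vfd's valid bit is set, meaning the page is in memory. */
bool sk_page_in_memory(const struct sk_vfddbd *e)
{
    return (e->c_vfd.word & SK_PG_V) != 0;
}

/* Return the descriptor word that currently describes the page. */
uint32_t sk_descriptor_word(const struct sk_vfddbd *e)
{
    return sk_page_in_memory(e) ? e->c_vfd.word : e->c_dbd.word;
}
```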

Figure 1-16 A chunk contains 32 vfddbd (256 bytes)

[A chunk contains 32 vfddbd (256 bytes)]

HP-UX regions use chunks of vfds and dbds to keep track of page ownership. For each page, the chunk records:

  • The assignment from virtual page to physical page, if the page is valid. (This is required in addition to the PDIR. The term "assignment" is used rather than "mapping" because a page might be valid but not translated.)

  • Other virtual attributes of the page (such as whether the page is locked in memory, or whether it is valid).

  • The location on disk for front-store and back-store pages.

Virtual Frame Descriptors (vfd)

A one-word structure called a virtual frame descriptor enables processes to reference pages of memory. The vfd is used when the process is in memory, and can be used to refer to the page of memory described in pfdat.

Figure 1-17 Virtual frame descriptor (vfd) contents

[Virtual frame descriptor (vfd) contents]

Table 1-12 Virtual Frame Descriptor (struct vfd)

Element | Meaning
pg_v | Valid flag. If set, this page of memory contains valid data and pg_pfnum is valid. If not set, the page's valid data is on a swap device.
pg_cw | Copy-on-write flag. If set, a write to the page causes a data protection fault, at which time the system copies the page.
pg_lock | Lock flag. If set, raw I/O is occurring on this page. Either the data is being transferred between the page and the disk, or data is being transferred between two memory pages. The kernel sleeps waiting for completion of I/O before launching further raw I/O to or from this page. Nothing can read the page while it is being written to disk.
pg_mlock | If set, the page is locked in memory and cannot be paged out.
pg_pfnum (aliased as pg_pfn) | Page frame number, from which the correct pfdat entry for this page can be accessed.

 

Disk Block Descriptor (dbd)

When the pg_v bit in a vfd is not set, the vfd is invalid and the page of data is not in memory but on disk. In this case, the disk block descriptor (dbd) gives valid reference to the data. Like the vfd structure, the dbd is one word long.

Figure 1-18 Contents of disk block descriptor (dbd)

[Contents of disk block descriptor (dbd)]

Table 1-13 Disk Block Descriptor (struct dbd)

Element | Meaning
dbd_type | One of six three-bit values used to interpret dbd_data:

  • DBD_NONE: No copy of this data exists on disk.

  • DBD_FSTORE, DBD_BSTORE: Page can be found on a "front or back store" device, pointed to by a region's vnode. [1]

  • DBD_DFILL: This is a demand-fill page. No space is allocated; when a fault occurs it is initialized by filling it with data from disk.

  • DBD_DZERO: This is a demand-zero page; when requested, allocate a new page and initialize it with zeros.

  • DBD_HOLE: Used for a sparse memory-mapped file; when read, the page gives zeros. When written to, a page is allocated, initialized to zero, and the data inserted, at which time the dbd type changes to DBD_NONE.

dbd_data | Data specific to the vnode type (nfs, ufs). Points to data in a file pointed to by a vnode.

[1] When the dbd_type is DBD_FSTORE, the page of data resides in the file pointed to by r_fstore (typically a file system). When the dbd_type is DBD_BSTORE, the page of data resides in the file or device file pointed to by r_bstore (typically a swap device).
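As a way of summarizing the six dbd types, the sketch below maps each type to the source a fault handler would fill the page from. The enum values and the names sk_fill and sk_fill_source are hypothetical; the numeric encodings used by HP-UX are not given in this paper, and the real fault path does considerably more work.

```c
/* Illustrative encodings; each fits in three bits, as dbd_type does. */
enum sk_dbd_type {
    SK_DBD_NONE,
    SK_DBD_FSTORE,
    SK_DBD_BSTORE,
    SK_DBD_DFILL,
    SK_DBD_DZERO,
    SK_DBD_HOLE
};

/* Where the contents of a faulted-in page would come from. */
enum sk_fill {
    SK_FILL_NOTHING,        /* no copy of the data exists on disk */
    SK_FILL_FROM_FSTORE,    /* read through the region's r_fstore vnode */
    SK_FILL_FROM_BSTORE,    /* read through the region's r_bstore vnode */
    SK_FILL_FROM_DISK,      /* demand fill with data from disk */
    SK_FILL_ZERO            /* hand back a zero-filled page */
};

/* Map a dbd type to the fill source described in Table 1-13. */
enum sk_fill sk_fill_source(enum sk_dbd_type type)
{
    switch (type) {
    case SK_DBD_FSTORE: return SK_FILL_FROM_FSTORE;
    case SK_DBD_BSTORE: return SK_FILL_FROM_BSTORE;
    case SK_DBD_DFILL:  return SK_FILL_FROM_DISK;
    case SK_DBD_DZERO:  return SK_FILL_ZERO;
    case SK_DBD_HOLE:   return SK_FILL_ZERO;    /* sparse file reads as zeros */
    case SK_DBD_NONE:
    default:            return SK_FILL_NOTHING;
    }
}
```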

 

Balanced Trees (B-Trees)

Each region contains either a single array of vfd/dbd pairs (a chunk) or a pointer to a B-tree. A B-tree allows for quick searches and efficient storage of sparse data. A bnode is the same size as a chunk, so both can be allocated from the same source of memory. The region's B-tree stores pairs of page indices and chunk addresses. HP-UX uses an order 29 B-tree.

A B-tree is searched with a key and yields a value. In the region B-tree, the key is the page number in the region divided by 32, the number of vfddbds in a chunk.
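The key computation is plain integer arithmetic, sketched below. The key follows the text (page number divided by 32); using the remainder to select the vfddbd within the chunk is an assumption that follows naturally from the chunk size, and the names sk_page_pos and sk_locate are hypothetical.

```c
#include <stdint.h>

#define SK_CHUNK_ENTRIES 32u    /* vfddbd pairs per chunk */

/* Position of a page: which chunk (the B-tree key) and which pair in it. */
struct sk_page_pos {
    uint32_t key;   /* B-tree key: page number divided by 32 */
    uint32_t slot;  /* index of the vfddbd within the chunk (assumed) */
};

/* Split a page number within the region into B-tree key and chunk slot. */
struct sk_page_pos sk_locate(uint32_t page_in_region)
{
    struct sk_page_pos pos;
    pos.key = page_in_region / SK_CHUNK_ENTRIES;
    pos.slot = page_in_region % SK_CHUNK_ENTRIES;
    return pos;
}
```

Under these assumptions, page 70 of a region falls in the chunk with key 2, at slot 6.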

Figure 1-19 A sample B-tree (order = 3, depth = 3)

[A sample B-tree (order = 3, depth = 3)]

Each node of a B-tree contains room for order+1 keys (or index numbers) and order+2 values. If a node grows to contain more than order keys, it is split into two nodes; half of the pairs are kept in the original node and the other half are copied to the new node. The B-tree node data also includes the number of valid elements contained in that node.

Table 1-14 B-tree Node Description (struct bnode)

Element | Meaning
b_key[B_SIZE] | The array of keys used for each page index of the bnode.
b_nelem | Number of valid keys/values in the bnode.
b_down[B_SIZE+1] | The array of values in the bnode, either pointers to another bnode (if this is an interior bnode) or pointers to chunks (if this is a leaf bnode).
b_scr1, b_scr2 | bnode padding to the size of a chunk, to allow bnodes and chunks to be allocated from the same pool of memory.
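The claim that a bnode is the same size as a chunk can be checked with a little arithmetic. The sketch below assumes B_SIZE is order+1 = 30 (inferred from the order-29 tree and the "order+1 keys, order+2 values" description), 32-bit keys and counts, and 32-bit pointers modeled as plain words; none of these widths are stated in this paper. With those assumptions the node occupies 30×4 + 4 + 31×4 + 2×4 = 256 bytes. The lookup helper sk_leaf_find is hypothetical.

```c
#include <stdint.h>

#define SK_ORDER  29                /* HP-UX uses an order 29 B-tree */
#define SK_B_SIZE (SK_ORDER + 1)    /* assumed: room for order+1 keys */

/* Stand-in for struct bnode on a 32-bit kernel (illustrative only). */
struct sk_bnode {
    int32_t  b_key[SK_B_SIZE];          /* keys (page index / 32) */
    int32_t  b_nelem;                   /* number of valid keys/values */
    uint32_t b_down[SK_B_SIZE + 1];     /* 32-bit pointers modeled as words */
    uint32_t b_scr1;                    /* padding up to the chunk size */
    uint32_t b_scr2;
};

/* 120 + 4 + 124 + 8 = 256 bytes, the same as a chunk of 32 vfddbd pairs. */
_Static_assert(sizeof(struct sk_bnode) == 256, "bnode must match chunk size");

/* Return the index of key in a leaf bnode, or -1 if it is not present. */
int sk_leaf_find(const struct sk_bnode *bp, int32_t key)
{
    for (int i = 0; i < bp->b_nelem; i++) {
        if (bp->b_key[i] == key)
            return i;
    }
    return -1;
}
```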

 

Root of the B-tree

A structure called struct broot points to the start of the B-tree.

Table 1-15 Struct broot

Element | Meaning
b_root | Pointer to the initial point of the B-tree.
b_depth | Number of levels in the B-tree.
b_npages | Pages used to construct the B-tree, counting both pages used for chunks and bnodes.
b_rpages | Number of real pages in the region; swap pages reserved for the B-tree by the kernel, using the routine vfdpgs(). Amount of swap allocated for the vfd/dbd pairs in the B-tree structure.
b_list | Pointer to a linked list of memory pages from which new bnodes or chunks can be added to the B-tree.
b_nfrag | Number of the next chunk available, derived from the unused 256-byte fragments in b_list.
b_rp | Pointer to the region using the B-tree.
b_protoidx, b_proto1, b_proto2 | Stores page index of default dbd and prototype to minimize time and memory costs to allocate chunk space.
b_vproto | List of page ranges whose bits are marked copy on write.
b_key_cache, b_val_cache | Caches of most recently used keys and pointers to chunks associated with the keys; checked first when querying the virtual memory subsystem.

 

vfd/dbd prototypes

The struct vfdcw governs the vfd prototype.

Table 1-16 vfd prototype (struct vfdcw)

Element | Meaning
v_start[MAXVPROTO] | Page that indexes start of copy-on-write range; set to -1 if unused.
v_end[MAXVPROTO] | End of copy-on-write range.
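A range table of this shape can be queried as in the following sketch. The value of SK_MAXVPROTO, the int element type, and the treatment of v_end as inclusive are all assumptions made for illustration; only the field names and the -1 "unused" marker come from the table above.

```c
#include <stdbool.h>

#define SK_MAXVPROTO 4      /* assumed; the real MAXVPROTO is not given here */

/* Stand-in for struct vfdcw (illustrative only). */
struct sk_vfdcw {
    int v_start[SK_MAXVPROTO];  /* first page of each range, -1 if unused */
    int v_end[SK_MAXVPROTO];    /* last page of each range (assumed inclusive) */
};

/* True if page falls inside any recorded copy-on-write range. */
bool sk_is_copy_on_write(const struct sk_vfdcw *proto, int page)
{
    for (int i = 0; i < SK_MAXVPROTO; i++) {
        if (proto->v_start[i] == -1)
            continue;           /* slot not in use */
        if (page >= proto->v_start[i] && page <= proto->v_end[i])
            return true;
    }
    return false;
}
```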

 

Hardware-Independent Page Information table (pfdat)

The hardware independent layer of the virtual memory subsystem manages pages in memory, pages written to swap devices, and the movement of pages from one to the other. The act of moving data from physical memory to a swap device, or moving data from a swap device to physical memory, is called paging.

Basic to hardware independence is the page frame data table (pfdat), a big array indexed directly through the page number. Each page of available memory is represented by one pfdat structure; one pfdat entry represents each page frame writable to a swap device. HP-UX never pages kernel memory (the pages containing kernel text, stack, and data); thus, pfdat manages only the subset representing freely available physical memory. When the pfdat is initialized, all free pages are linked in a list pointed to by phead.

Table 1-17 Principal entries in struct pfdat (page frame data)

Element | Meaning
pf_hchain | Hash chain link.
pf_flags | Page frame data flags (shown in the next table).
pf_pfn | Physical page frame number.
pf_use | Number of regions sharing the page; when pf_use drops to zero, the page can be placed on the free linked list.
pf_devvp | vnode for swap device. (Hashing is done on the tuple of (pf_devvp, pf_data).)
pf_data | Disk block number on swap device.
pf_next, pf_prev | Next and previous free pfdat entries.
pf_cache_waiting | If set, a thread is waiting to grab the pf_lock on that page. Required for synchronization.
pf_lock | Lock pfdat entry (beta semaphore), used to lock the page while modifying the pde (physical-to-virtual translation, access rights, or protection ID).
pf_hdl | Hardware-dependent layer elements (see hdl_pfdat discussion, shortly).
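The interplay of the pfdat array, the free list headed by phead, and the pf_use count can be modeled in a few lines. This is a user-space sketch under several assumptions: the table has only eight entries, phead is treated as a circular sentinel, the SK_P_QUEUE bit value is invented, and locking is omitted entirely. The sk_ function names are hypothetical.

```c
#include <stddef.h>

#define SK_NPAGES  8        /* tiny table, for illustration only */
#define SK_P_QUEUE 0x01u    /* assumed bit value for "on the free queue" */

/* Reduced stand-in for struct pfdat (illustrative only). */
struct sk_pfdat {
    struct sk_pfdat *pf_next;   /* next free entry */
    struct sk_pfdat *pf_prev;   /* previous free entry */
    unsigned pf_flags;
    unsigned pf_pfn;            /* page frame number, also the table index */
    int pf_use;                 /* number of regions sharing the page */
};

struct sk_pfdat sk_pfdat_table[SK_NPAGES];  /* indexed by page number */
struct sk_pfdat sk_phead;                   /* sentinel head of the free list */

/* Link an entry at the tail of the free list and mark it queued. */
static void sk_enqueue(struct sk_pfdat *p)
{
    p->pf_prev = sk_phead.pf_prev;
    p->pf_next = &sk_phead;
    sk_phead.pf_prev->pf_next = p;
    sk_phead.pf_prev = p;
    p->pf_flags |= SK_P_QUEUE;
}

/* At initialization every page is free and linked from phead. */
void sk_pfdat_init(void)
{
    sk_phead.pf_next = sk_phead.pf_prev = &sk_phead;
    for (unsigned i = 0; i < SK_NPAGES; i++) {
        sk_pfdat_table[i].pf_pfn = i;
        sk_pfdat_table[i].pf_use = 0;
        sk_pfdat_table[i].pf_flags = 0;
        sk_enqueue(&sk_pfdat_table[i]);
    }
}

/* Take the first free page, or return NULL if the list is empty. */
struct sk_pfdat *sk_alloc_page(void)
{
    struct sk_pfdat *p = sk_phead.pf_next;
    if (p == &sk_phead)
        return NULL;
    p->pf_prev->pf_next = p->pf_next;
    p->pf_next->pf_prev = p->pf_prev;
    p->pf_next = p->pf_prev = NULL;
    p->pf_flags &= ~SK_P_QUEUE;
    p->pf_use = 1;
    return p;
}

/* Drop one reference; the page returns to the free list at zero. */
void sk_release_page(unsigned pfn)
{
    struct sk_pfdat *p = &sk_pfdat_table[pfn];
    if (p->pf_use > 0 && --p->pf_use == 0)
        sk_enqueue(p);
}
```

A page shared by two regions (pf_use of 2) stays off the free list after the first release and rejoins it only after the second.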

 

Flags showing the Status of the Page

Table 1-18 Principal pf_flags values

Flag | Meaning
P_QUEUE | Page is on the free queue, headed by phead.
P_BAD | Page is marked as bad by the memory deallocation subsystem.
P_HASH | Page is on a hash queue; contains head of queue.
P_ALLOCATING | Page is being allocated; prevents another process from taking the page while it is being remapped.
P_SYS | Page is being used by the kernel rather than by a user process. Pages marked with this flag include dynamic buffer cache pages, B-tree pages and the results of kernel memory allocation. They are used by the kernel for critical data structures in addition to the kernel static pages that were not included in pfdat.
P_DMEM | Page is locked by the memory diagnostics subsystem; set and cleared with an ioctl() call to the dmem driver.
P_LCOW | Page is being remapped by copy-on-write.
P_UAREA | Page is used by a pregion of type PT_UAREA.

 

Hardware-Dependent Layer page frame data entry

The pf_hdl element of struct pfdat is of type struct hdlpfdat, which is defined in hdl_pfdat.h.

Table 1-19 struct hdlpfdat

Element | Meaning
hdlpf_flags | Flags that show the HDL status of the page:

  • HDLPF_TRANS: A virtual address translation exists for this page.

  • HDLPF_PROTECT: Page is protected from user access. This flag indicates that the saved values are valid.

  • HDLPF_STEAL: Virtual translation should be removed when pending I/O is complete.

  • HDLPF_MOD: Analogous to changing the pde_modified flag in the pde.

  • HDLPF_REF: Analogous to changing the pde_ref flag in the pde.

  • HDLPF_READA: Read-ahead page in transit; used to indicate to the hdl_pfault() routine that it should start the next I/O request before waiting for the current I/O request to complete.

hdlpf_savear | Saved page access rights.
hdlpf_saveprot | Saved page protection ID.