SWAP SPACE MANAGEMENT

Swap space is an area on a high-speed storage device (almost always a disk drive), reserved for use by the virtual memory system for deactivation and paging processes. At least one swap device (primary swap) must be present on the system.

During system startup, the location (disk block number) and size of each swap device are displayed in 512-KB blocks. The swapper reserves swap space at process creation time, but does not allocate swap space from the disk until pages need to go out to disk. Reserving swap at process creation protects the swapper from running out of swap space. You can add or remove swap dynamically (that is, while the system is running), without having to regenerate the kernel.

HP-UX uses both physical and pseudo swap to enable efficient execution of programs.

Pseudo-Swap Space

System memory used for swap space is called pseudo-swap space. It allows users to execute processes in memory without allocating physical swap. Pseudo-swap is controlled by an operating-system parameter; by default, swapmem_on is set to 1, enabling pseudo-swap.

Typically, when the system executes a process, swap space is reserved for the entire process, in case it must be paged out. According to this model, to run one gigabyte of processes, the system would have to have one gigabyte of configured swap space. Although this protects the system from running out of swap space, disk space reserved for swap is under-utilized if minimal or no swapping occurs.

To avoid such waste of resources, HP-UX is configured to access up to three-quarters of system memory capacity as pseudo-swap. This means that system memory serves two functions: as process-execution space and as swap space. By using pseudo-swap space, a one-gigabyte memory system with one gigabyte of swap can run up to 1.75 GB of processes. As before, if a process attempts to grow or be created beyond this extended threshold, it will fail.
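The arithmetic behind the 1.75 GB figure can be sketched as follows. This is only an illustration of the model described above; the function name and its parameters are hypothetical and do not correspond to any HP-UX interface.

```python
def max_process_space_gb(memory_gb, physical_swap_gb, pseudo_swap_enabled=True):
    """Upper bound on reservable process space under the model in the text."""
    pseudo_fraction = 0.75  # three-quarters of system memory, as stated above
    pseudo_gb = memory_gb * pseudo_fraction if pseudo_swap_enabled else 0.0
    return physical_swap_gb + pseudo_gb


# One gigabyte of memory plus one gigabyte of configured swap.
print(max_process_space_gb(1.0, 1.0))         # 1.75
print(max_process_space_gb(1.0, 1.0, False))  # 1.0 (pseudo-swap disabled)
```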

When using pseudo swap for swap, the pages are locked; as the amount of pseudo-swap increases, the amount of lockable memory decreases.

For factory-floor systems (such as controllers), which perform best when the entire application is resident in memory, pseudo-swap space can be used to enhance performance: you can either lock the application in memory or make sure the total number of processes created does not exceed three-quarters of system memory.

Pseudo-swap space is set to a maximum of three-quarters of system memory because the system can begin paging once three-quarters of system available memory has been used. The unused quarter of memory allows a buffer between the system and the swapper to give the system computational flexibility.

When the number of processes created approaches capacity, the system might exhibit thrashing and a decrease in system response time. If necessary, you can disable pseudo-swap space by setting the tunable parameter swapmem_on in /usr/conf/master.d/core-hpux to zero.

Regions that have pseudo-swap allocated are kept on a null-terminated, doubly linked list; the head of this list is called pswaplist.

Physical Swap Space

There are two kinds of physical swap space: device swap and file-system swap.

Device Swap Space

Device swap space resides in its own reserved area (an entire disk or logical volume of an LVM disk) and is faster than file-system swap because the system can write an entire request (256 KB) to a device at once.

File-System Swap Space

File-system swap space is located on a mounted file system and can vary in size with the system's swapping activity. However, its throughput is slower than device swap, because free file-system blocks may not always be contiguous; therefore, separate read/write requests must be made for each file-system block.

To optimize system performance, file-system swap space is allocated and de-allocated in swchunk-sized chunks. swchunk is a configurable operating system parameter; its default is 2048 KB (2 MB). Once a chunk of file system space is no longer being used by the paging system, it is released for file system use, unless it has been preallocated with swapon.

If swapping to file-system swap space, each chunk of swap space is a file in the file system swap directory, and has a name constructed from the system name and the swaptab index (such as becky.6 for swaptab[6] on a system named becky).
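As a rough illustration of chunk-sized allocation and the naming scheme, the sketch below rounds a request up to whole chunks and builds a chunk file name. The helper names are hypothetical; only the 2048 KB default and the name format (system name, a dot, the swaptab index) come from the text.

```python
import math

SWCHUNK_KB = 2048  # default swchunk size given in the text (2 MB)


def chunks_needed(request_kb, swchunk_kb=SWCHUNK_KB):
    """File-system swap grows in whole chunks, so round the request up."""
    return math.ceil(request_kb / swchunk_kb)


def chunk_file_name(system_name, swaptab_index):
    """Name of the swap file that backs swaptab[swaptab_index]."""
    return f"{system_name}.{swaptab_index}"


print(chunks_needed(5000))          # 3
print(chunk_file_name("becky", 6))  # becky.6
```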

Swap Space Parameters

Several configurable parameters deal with swapping.

Table 1-24 Configurable swap-space parameters

Parameter	Purpose
`swchunk`	The number of `DEV_BSIZE` blocks in a unit of swap space, by default, 2 MB on all systems.
`maxswapchunks`	Maximum number of swap chunks allowed on a system.
`swapmem_on`	Parameter allowing creation of more processes than you have physical swap space for, by using pseudo-swap.

Swap Space Global Variables

When the kernel is initialized, conf.c includes globals.h, which contains numerous characteristics related to swap space, shown in the next table. The most important to swap space reservation are swapspc_cnt, swapspc_max, swapmem_cnt, swapmem_max, and sys_mem.

Table 1-25 Swap-space characteristics in globals.h

Element	Meaning
`bswlist`	head of free swap header list.
`*pageoutbp`	pointer to `swbuf` header used by `pageout` when swapping.
`ref_hand`	current reference hand used by `pageout` daemon.
`maxmem`	page count of actual max memory per process.
`physmem`	page count of physical memory on this CPU.
`nswdev`	number of swap devices.
`nswap`	page count of size of swap space.
`*fswdevt`	pointer to file system swap table.
`*swaptab`	pointer to the table of swap chunks.
`swapphys_buf`	pages of physical swap space to keep available.
`swapphys_cnt`	pages of available physical swap space on disk.
`swapspc_cnt`	Total amount of swap currently available on all devices and file systems enabled in units of pages. Updated each time a device or file system is enabled for swapping.
`swapspc_max`	Total amount of device and file-system swap currently enabled on the system in units of pages. Updated each time a device or file system is enabled for swapping.
`swapspc_debit`	number of swap blocks by which to adjust `swapspc_cnt`.
`swapspc_sparing`	number of swap blocks unavailable to swap.
`swapmem_max`	Maximum number of pages of pseudo-swap enabled. Initialized to 3/4 available system memory.
`swapmem_cnt`	Total number of pages of pseudo-swap currently available. Initialized to 3/4 available system memory.
`maxfs_pri`	highest available device priority.
`maxdev_pri`	highest available swap priority.
`sys_mem`	Number of pages of memory not available for use as pseudo-swap. Initialized to 1/4 available system memory.
`sysmem_max`	maximum pages not available for swap.
`freemem`	page count of remaining blocks of free memory.
`freemem_cnt`	Number of processes waiting for memory.

Swap Space Values

System swap space values are calculated as follows:

Total swap available on the system is swapspc_max (for device swap and file system swap) + swapmem_max (for pseudo-swap).
Allocated swap is swapspc_max - [sum(swdevt[n].sw_nfpgs) + sum(fswdevt[n].fsw_nfpgs)] (for device swap and file system swap) + (swapmem_max - swapmem_cnt) (for pseudo-swap).
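The two calculations above can be written out as a small sketch. The function names are hypothetical, and the per-device and per-file-system free-page counts (sw_nfpgs, fsw_nfpgs) are passed in as plain lists rather than read from kernel tables. All values are in pages.

```python
def total_swap(swapspc_max, swapmem_max):
    """Device and file-system swap plus pseudo-swap."""
    return swapspc_max + swapmem_max


def allocated_swap(swapspc_max, sw_nfpgs, fsw_nfpgs, swapmem_max, swapmem_cnt):
    """Pages actually allocated, following the formula in the text."""
    physical = swapspc_max - (sum(sw_nfpgs) + sum(fsw_nfpgs))
    pseudo = swapmem_max - swapmem_cnt
    return physical + pseudo


# Made-up numbers: two swap devices and one file-system swap area.
print(total_swap(1024, 768))                             # 1792
print(allocated_swap(1024, [300, 200], [100], 768, 700)) # 492
```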

In HP-UX, only data area growth (using sbrk()) or stack growth will cause a process to die for lack of swap space. Program text does not use swap.

Reservation of Physical Swap Space

Swap reservation is a numbers game. The system has a finite number of pages of physical swap space. By decrementing the appropriate counters, HP-UX reserves space for its processes.

Most UNIX systems and UNIX-like systems allocate swap when needed. However, if the system runs out of swap space but needs to write a process' page(s) to a swap device, it has no alternative but to kill the process. To alleviate this problem, HP-UX reserves swap at the time the process is forked or exec'd. When a new process is forked or executed, if insufficient swap space is available and reserved to handle the entire process, the process may not execute.

At system startup, swapspc_cnt and swapmem_cnt are initialized to the total amount of swap space and pseudo-swap available.

Whenever the swapon() call is made to a device or file system, the amount of swap newly enabled is converted to units of pages and added to the two global swap-reservation counters swapspc_max (total enabled swap) and swapspc_cnt (available swap space).

Each time swap space is reserved for a process (that is, at process creation or growth time), swapspc_cnt is decremented by the number of pages required. The kernel does not actually assign disk blocks until needed.

Once swap space is exhausted (that is, swapspc_cnt == 0), any subsequent request to reserve swap causes the system to allocate an additional chunk of file-system swap space. If successful, both swapspc_max and swapspc_cnt are updated and the current (and subsequent) requests can be satisfied. If a file-system chunk cannot be allocated, the request fails, unless pseudo-swap is available.

When swap space is no longer needed (due to process termination or shrinkage), swapspc_cnt is incremented by the number of pages freed. swapspc_cnt never exceeds swapspc_max and is always greater than or equal to zero. If a chunk of file-system swap is no longer needed, it is released back to the file system and swapspc_max and swapspc_cnt are updated.

If no device or file system swap space is available, the system uses pseudo-swap as a last resort. It decrements swapmem_cnt and locks the pages into memory. Pseudo swap is either free or allocated; it is never reserved.
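The counter bookkeeping described in this section can be modeled roughly as shown below. This is a simplified sketch, not kernel code: the class, its methods, and the fs_chunks_left field are hypothetical, each request is satisfied entirely from one source, and file-system growth is modeled as a fixed budget of 512-page chunks.

```python
CHUNK_PAGES = 512  # one default swchunk: 2 MB of 4 KB pages


class SwapReservation:
    """Toy model of swapspc_max, swapspc_cnt, swapmem_max and swapmem_cnt."""

    def __init__(self, swapmem_max=0, fs_chunks_left=0):
        self.swapspc_max = 0
        self.swapspc_cnt = 0
        self.swapmem_max = swapmem_max
        self.swapmem_cnt = swapmem_max
        self.fs_chunks_left = fs_chunks_left  # hypothetical growth budget

    def swapon(self, pages):
        # Newly enabled swap is added to both counters.
        self.swapspc_max += pages
        self.swapspc_cnt += pages

    def reserve(self, pages):
        # If physical swap is short, try to add file-system chunks first.
        shortfall = pages - self.swapspc_cnt
        if shortfall > 0:
            chunks = -(-shortfall // CHUNK_PAGES)  # ceiling division
            if chunks <= self.fs_chunks_left:
                self.fs_chunks_left -= chunks
                self.swapon(chunks * CHUNK_PAGES)
        if self.swapspc_cnt >= pages:
            self.swapspc_cnt -= pages
            return "physical"
        # Last resort: pseudo-swap.
        if self.swapmem_cnt >= pages:
            self.swapmem_cnt -= pages
            return "pseudo"
        return None  # the reservation fails

    def release(self, pages):
        # swapspc_cnt never exceeds swapspc_max.
        self.swapspc_cnt = min(self.swapspc_cnt + pages, self.swapspc_max)


r = SwapReservation(swapmem_max=768, fs_chunks_left=1)
r.swapon(1024)
print(r.reserve(1000))  # physical
print(r.reserve(100))   # physical (after one file-system chunk is added)
print(r.reserve(500))   # pseudo
print(r.reserve(500))   # None
```

No disk blocks are assigned anywhere in this model, which mirrors the point above that reservation only adjusts counters.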

Swap Reservation Spinlock

The rswap_lock spinlock guards the swap reservation structures swapspc_cnt, swapspc_max, swapmem_cnt, swapmem_max, sys_mem, and pswaplist.

Reservation of Pseudo-Swap Space

Approximately 3/4 of available system memory is available as pseudo-swap space if the tunable parameter swapmem_on is set to 1. Pseudo-swap is tracked in the global pseudo swap reservation counters swapmem_max (enabled pseudo-swap) and swapmem_cnt (currently available pseudo-swap). If physical swap space is exhausted and no additional file-system swap can be acquired, pseudo swap space is reserved for the process by decrementing swapmem_cnt.

For example, on a 64MB system, swapmem_max and swapmem_cnt track approximately 48MB of pseudo-swap space, the remainder tracked by the global sys_mem, which represents the number of pages reserved for system use only.
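The 64MB example works out as follows in pages, assuming 4 KB pages and treating all 64MB as available memory (in practice the available figure is somewhat smaller, which is why the text says "approximately"). The function name is hypothetical.

```python
PAGE_SIZE = 4096  # 4 KB pages


def pseudo_swap_split(memory_bytes):
    """Return (swapmem_max, sys_mem) in pages for a given memory size."""
    total_pages = memory_bytes // PAGE_SIZE
    swapmem_max = total_pages * 3 // 4  # three-quarters for pseudo-swap
    sys_mem = total_pages - swapmem_max  # remaining quarter for system use
    return swapmem_max, sys_mem


swapmem_max, sys_mem = pseudo_swap_split(64 * 1024 * 1024)
print(swapmem_max, sys_mem)                        # 12288 4096
print(swapmem_max * PAGE_SIZE // (1024 * 1024))    # 48 (MB of pseudo-swap)
```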

Processes track the number of pseudo swap pages allocated to them by incrementing a per region counter r_swapmem. All regions using pseudo swap are linked on the pseudo swap list pswaplist. Once pseudo swap is exhausted (that is, swapmem_cnt==0), attempts at process creation or growth will fail.

Because the swapper competes with the operating system for use of memory, swapmem_cnt can also be decremented by the operating system for any dynamically allocated memory. Once swapmem_cnt is exhausted, subsequent requests for swap space fail; however, the operating system can still reserve memory out of the malloc pool.

Once a process no longer needs its allocated pseudo swap space, swapmem_cnt is incremented by the amount released and r_swapmem is updated. If the system returns the pseudo swap space used for dynamically allocated kernel memory, the amount being released is first added to sys_mem. Once sys_mem grows to its maximum value, any additional pages returned are used to update swapmem_cnt.

swapmem_cnt must be less than or equal to swapmem_max and greater than or equal to zero.

Because pseudo swap is shared by the swapper and memory allocation routines, it is used sparingly. The operating system periodically checks to see if physical swap space has been recently freed. If it has, the system attempts to migrate processes using pseudo swap only to use the available physical swap by walking the doubly linked list of pseudo swap regions. swapspc_cnt is decremented by the r_swapmem value for each region on the list until either swapspc_cnt drops to zero or no other regions utilize pseudo swap. swapmem_cnt is then incremented by the amount of pseudo swap successfully migrated.
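The migration walk can be sketched as below. The Region class and function name are hypothetical, the list is an ordinary Python list rather than a doubly linked list, and the sketch assumes a region can be migrated partially when physical swap runs out midway, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    r_swapmem: int  # pages of pseudo-swap held by this region


def migrate_pseudo_swap(pswaplist, swapspc_cnt, swapmem_cnt):
    """Move pseudo-swap reservations onto freed physical swap."""
    still_pseudo = []
    for region in pswaplist:
        moved = min(region.r_swapmem, swapspc_cnt)
        swapspc_cnt -= moved   # physical swap is now reserved instead
        swapmem_cnt += moved   # pseudo-swap is handed back
        region.r_swapmem -= moved
        if region.r_swapmem > 0:
            still_pseudo.append(region)
    return still_pseudo, swapspc_cnt, swapmem_cnt


regions = [Region("A", 100), Region("B", 200)]
left, spc, mem = migrate_pseudo_swap(regions, swapspc_cnt=250, swapmem_cnt=0)
print([r.name for r in left], spc, mem)  # ['B'] 0 250
```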

Pseudo Swap and Lockable Memory

Because pseudo swap is related to system memory usage, the swap reservation scheme reflects lockable memory policies.

Although the system is not necessarily allocating additional memory when a process locks itself into memory, locked pages are no longer available for general use. This causes swapmem_cnt to be decremented to account for the pages. swapmem_cnt is also decremented by the size of the entire process if that process gets plocked in memory.

Figure 1-28 Reserving swap space from file-system swap to memory

How Swap Space is Prioritized

All swap devices and file systems enabled for swap have an associated priority, ranging from 0 to 10, indicating the order that swap space from a device or file system is used. System administrators can specify swap-space priority using a parameter of the swapon(1M) command.

Swapping rotates among both devices and file systems of equal priority. Given equal priority, however, devices are swapped to by the operating system before file systems, because devices make more efficient use of CPU time.

We recommend that you assign the same swapping priority to most swap devices, unless a device is significantly slower than the rest. Assigning equal priorities limits disk head movement, which improves swapping performance.

Three Rules of Swap Space Allocation

Start at the lowest priority swap device or file system. The lower the number, the higher priority; that is, space is taken from a system with a zero priority before it is taken from a system with a one priority.
If multiple devices have the same priority, swap space is allocated from the devices in a round-robin fashion. Thus, to interleave swap requests between a number of devices, the devices should be assigned the same priority. Similarly, if multiple file systems have the same priority, requests for swap are interleaved between the file systems. In the figure, swap requests are initially interleaved between the two swap devices at priority 0.
If a device and a file system have the same swap priority, all the swap space from the device is allocated before any file-system swap space. Thus, the device at priority 1 will be filled before swap is allocated from the file system at priority 1.
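The three rules can be illustrated with a small model. The SwapArea and SwapChooser names are hypothetical, and the sketch assumes the round-robin rotation advances one page at a time; the text does not state the granularity of the rotation.

```python
from dataclasses import dataclass


@dataclass
class SwapArea:
    name: str
    priority: int     # lower number is used first (rule 1)
    is_device: bool   # device swap versus file-system swap
    free_pages: int


class SwapChooser:
    def __init__(self, areas):
        self.areas = areas
        self.rotation = {}  # (priority, is_device) -> next position to try

    def choose(self):
        """Take one page and return the name of the area it came from."""
        for priority in sorted({a.priority for a in self.areas}):
            for is_device in (True, False):  # rule 3: devices first
                group = [a for a in self.areas
                         if a.priority == priority and a.is_device == is_device]
                if not group:
                    continue
                start = self.rotation.get((priority, is_device), 0)
                for step in range(len(group)):  # rule 2: round-robin
                    idx = (start + step) % len(group)
                    if group[idx].free_pages > 0:
                        group[idx].free_pages -= 1
                        self.rotation[(priority, is_device)] = (idx + 1) % len(group)
                        return group[idx].name
        return None  # no swap space left anywhere


chooser = SwapChooser([
    SwapArea("dev0a", 0, True, 2),
    SwapArea("dev0b", 0, True, 1),
    SwapArea("dev1", 1, True, 1),
    SwapArea("fs1", 1, False, 1),
])
print([chooser.choose() for _ in range(6)])
# ['dev0a', 'dev0b', 'dev0a', 'dev1', 'fs1', None]
```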

Figure 1-29 Choosing a swap location

Swap Space Structures

Swapping is accomplished on HP-UX using the following data structures:

Device swap priority array (swdev_pri[]), used to link together swap devices with the same priority. That is, the entry in swdev_pri[n] is the head of a list of swap devices having priority n. The first field in swdev_pri[] structure is the head of the list; the sw_next field in the swdevt[] structure links each device into the appropriate priority list.
File system swap priority array (swfs_pri[]), which serves the same purpose as swdev_pri[], but for file system swap priority.
Device swap table (struct swdevt), defined in conf.h to establish the fundamental swap device information.
File system swap table (struct fswdevt), defined in swap.h for supplementary file-system swap.
Swap table of available chunks (struct swaptab), which keeps track of the available free pages of swap space.
Mapping of swap pages (struct swapmap), whose entries together with swaptab combine for a swap disk block descriptor.

The following table details the elements of the struct swdevt.

Table 1-26 Device swap table (struct swdevt)

Element	Meaning
`sw_dev`	Actual swap device, as defined by its major (upper 8 bits) and minor (lower 24 bits) numbers.
`sw_enable`	Enabled flag. Zero if device swap is disabled; one if enabled.
`sw_start`	Offset into the swap area on disk, in kilobytes.
`sw_nblksavail`	Size of swap area, in kilobytes.
`sw_nblksenabled`	Number of blocks enabled for swap. Must be a multiple of `swchunk` (2MB default).
`sw_nfpgs`	Number of free swap pages on the device. Updated whenever a page is used or freed.
`sw_priority`	Priority of swap device (0-10).
`sw_head`, `sw_tail`	First and last `swaptab[]` entry associated with swap device.
`sw_next`	Pointer to the next device swap entry (`swdevt`) at this priority; implemented as a circular list used to update the pointer in `swdev_pri` for round-robin use of all devices at a particular priority.

The following table details the principal elements of the struct fswdevt.

Table 1-27 File system swap table (struct fswdevt)

Element	Meaning
`fsw_next`	Pointer to next file system swap (fswdevt entry) at this priority; implemented as a circular list.
`fsw_enable`	Enabled flag. Zero if file-system swap is disabled; one if enabled.
`fsw_nfpgs`	Number of free swap pages in this file system swap; updated whenever a page is used or freed.
`fsw_allocated`	Number of `swchunks` (2MB default) allocated on this file-system swap.
`fsw_min`	Minimum `swchunks` to be preallocated when the file-system swap is enabled.
`fsw_limit`	Maximum `swchunks` allowed on file system; unlimited if set to zero.
`fsw_reserve`	Minimum blocks (of size `fsw_bsize`) reserved for non-swap use on this file system.
`fsw_priority`	Priority of device (0-10). Priority can also be determined by identifying `swfs_pri[]` linked list.
`fsw_vnode`	`vnode` of the file system swap directory (`/paging`) under which the swap files are created.
`fsw_bsize`	Block size used on this file system; used to determine how much space `fsw_reserve` is reserving.
`fsw_head`, `fsw_tail`	Index into `swaptab[]` of first, last entry associated with this file system swap.
`fsw_mntpoint`	File system mount point; character representation of `fsw_vnode`, used for utilities (such as `swapinfo(1M)`) and error messages.

`swaptab` and `swapmap` Structures

Two structures track swap space. The swaptab[] array tracks a chunk of swap space. swapmap entries hold swap information on a per-page level. swaptab defaults to track a 2MB chunk of space and swapmap tracks each page within that 2MB chunk.

Each entry in the swaptab[] array has a pointer (called st_swpmp) to a unique swapmap. swapmap entries have backwards pointers to the swaptab index. There is one entry in the swapmap for each page represented by the swaptab entry (default 2 MB, or 512 pages); that is, swapmap conforms in size to swchunk.

A linked list of free swap pages begins at the swaptab entry's st_free and uses each free swapmap entry's sm_next. When a page of swap is needed, the kernel walks the structures using the getswap() routine in vm_swalloc.c, which calls other routines that actually locate the chunk, and so forth.

Beginning with the lowest priority, we examine swdev_pri[].curr, which points to a swdevt entry.
If sw_nfpgs is zero (no free pages), we follow the pointer sw_next to get the next swdevt entry at this priority.
If none of these have free pages, we move on to swfs_pri[].curr, the file system swap at this priority, checking fsw_nfpgs for free pages.
If we are still unsuccessful, we move to the next priority and try again.
Once we find a swdevt or fswdevt with free pages, we walk that device's swaptab list, starting with sw_head or fsw_head, and using st_next in each swaptab entry, until we find a swaptab entry with non-zero st_nfpgs.
st_free points to the first free swapmap entry (and thus first free page) in this swaptab chunk.
The swalloc() routine creates a disk block descriptor (dbd) using 14 bits of dbd_data for the swaptab index and 14 bits for the swapmap index. The r_bstore in the region is set to the disk device vnode or the file system directory vnode, and the dbd is marked DBD_BSTORE.
When faulting in from swap, the same process is followed as for faulting in from the file system: r_bstore and dbd_data are hashed together and checked for a soft fault, then devswap_pagein() is called. The devswap_pagein() routine uses the dbd_data as a 14-bit swaptab index and a 14-bit swapmap index to determine the location of the page on disk.

Now all information needed to retrieve the page from swap has been stored.
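The way two 14-bit indexes share dbd_data can be sketched as follows. The field widths come from the text; the bit order (swaptab index in the upper field, swapmap index in the lower) and the function names are assumptions made for illustration.

```python
INDEX_BITS = 14
INDEX_MASK = (1 << INDEX_BITS) - 1  # 0x3FFF, largest 14-bit value


def pack_dbd_data(swaptab_index, swapmap_index):
    """Combine a swaptab index and a swapmap index into one value."""
    for value in (swaptab_index, swapmap_index):
        if not 0 <= value <= INDEX_MASK:
            raise ValueError("index does not fit in 14 bits")
    return (swaptab_index << INDEX_BITS) | swapmap_index


def unpack_dbd_data(dbd_data):
    """Recover (swaptab_index, swapmap_index), as a page-in would need to."""
    return (dbd_data >> INDEX_BITS) & INDEX_MASK, dbd_data & INDEX_MASK


packed = pack_dbd_data(6, 300)
print(unpack_dbd_data(packed))  # (6, 300)
```

With 14 bits for each index, this layout can name at most 16384 chunks and 16384 pages within a chunk.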

Figure 1-30 The swaptab and swapmap structures

Table 1-28 Swap table entry (struct swaptab)

Element	Meaning
`st_free`	Index to the first free page in the chunk. Each entry maps to a 4 KB page of swap.
`st_next`	Index to next `swaptab` entry for same device or file-system swap; at end of list, `st_next` is -1.
`st_flags`	`ST_INDEL`: File-system swap flag, indicating chunk is being deleted; do not allocate pages from it. Set only by the `realswapoff()` routine. `ST_FREE`: File-system swap flag, indicating chunk may be deleted, because none of its pages are in use. In the case of remote swap, the chunk should not be deleted immediately; set `st_free_time` to current time plus 30 minutes (1800 seconds) when setting this flag. Once 30 minutes has elapsed, the chunk can be freed. If the chunk is needed during the interim, the flag can be cleared using `chunk_release()`, called from `lsync()`. `ST_INUSE`: `swaptab` entry is being changed.
`st_dev , st_fsp`	Pointers to `swdevt` entry that references the `swaptab` entry.
`st_nfpgs`	Number of free pages in this (`swchunk`) `swaptab` entry.
`st_swpmp`	Pointer to `swapmap[]` array that defines this `swchunk` of swap pages.
`st_free_time`	Indicates when remote `fs` chunk can be freed (see explanation of `ST_FREE` flag).

Table 1-29 swap map entry (struct swapmap)

Element	Meaning
`sm_ucnt`	Number of threads using the page. When decremented to zero, the swap page is free and the free pages linked list can be updated.
`sm_next`	Index of the next free page in the `swapmap[]`. This is valid only if `sm_ucnt` is zero; that means that this `swapmap` entry is included in the linked list beginning with the `st_free` field of the `swaptab` entry.
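The free-page list formed by st_free and sm_next can be modeled with two parallel arrays, as in the sketch below. The SwapChunk class is hypothetical, and using -1 to mark the end of the free list is an assumption borrowed from the st_next convention in Table 1-28.

```python
class SwapChunk:
    """Toy model of one swaptab entry and its swapmap."""

    def __init__(self, npages=512):
        self.sm_ucnt = [0] * npages                      # use count per page
        self.sm_next = list(range(1, npages)) + [-1]     # free-list links
        self.st_free = 0                                 # first free page
        self.st_nfpgs = npages                           # free pages in chunk

    def alloc_page(self):
        if self.st_nfpgs == 0:
            return -1
        page = self.st_free
        self.st_free = self.sm_next[page]  # unlink the head of the free list
        self.sm_ucnt[page] = 1
        self.st_nfpgs -= 1
        return page

    def release_page(self, page):
        self.sm_ucnt[page] -= 1
        if self.sm_ucnt[page] == 0:        # page is free again: relink it
            self.sm_next[page] = self.st_free
            self.st_free = page
            self.st_nfpgs += 1


chunk = SwapChunk(npages=4)
first = chunk.alloc_page()
second = chunk.alloc_page()
chunk.release_page(first)
print(first, second, chunk.alloc_page(), chunk.st_nfpgs)  # 0 1 0 2
```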

Deactivation using the pager

Since vhand() is tuned to be nice regarding I/O usage and CPU usage, it allows the pager to fault out swapped processes. The swapper marks the process to be swapped for deactivation, which takes it off the run queue. Because the process cannot run, its pages cannot be referenced again once they have been aged. When the steal hand comes around, it steals all the pages in the region.

When memory pressure is high, sched() selects a process to swap using the routine choose_deactivate(). This routine is biased to choose non-interactive processes over interactive ones, sleeping processes over running ones, and long-running processes over newer ones.
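The stated biases can be illustrated with a sort key. This is a hypothetical model, not the logic of choose_deactivate(): it assumes the three preferences apply in strict order, whereas the text only says the routine is biased toward them. The Proc fields and the function name are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    name: str
    interactive: bool
    sleeping: bool
    run_time: int  # seconds since the process started (hypothetical field)


def pick_deactivation_candidate(procs):
    """Prefer non-interactive, then sleeping, then longest-running."""
    # False sorts before True, so non-interactive and sleeping come first;
    # run_time is negated so that longer-running processes come first.
    return min(procs, key=lambda p: (p.interactive, not p.sleeping, -p.run_time))


procs = [
    Proc("shell", interactive=True, sleeping=True, run_time=10),
    Proc("batch", interactive=False, sleeping=False, run_time=500),
    Proc("daemon", interactive=False, sleeping=True, run_time=100),
]
print(pick_deactivation_candidate(procs).name)  # daemon
```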

Once a process has been chosen to be deactivated, the following actions occur:

The process's SDEACT flag and its threads' TSDEACT flags are set.
The process's threads are removed from the run queue. If the process is waiting for I/O, its SDEACTSELF flag and its threads' TSDEACTSELF flags are set. When I/O completes, the process deactivates in the paging routines.
The process's p_deactime in the proc structure is set to the current time to establish a record of how long the process is deactivated.
The process is positioned in the active pregion chain to ready it for the steal hand.
The uarea pregion is added to the list of active regions for it to get paged out.
The global counter deactive_cnt is incremented.

A process that has been inactive long enough for all its pages to have been aged and stolen is virtually swapped out already. The global deactprocs points to the head of a list of inactive processes, its chain running through the pregion element p_nextdeact. If the average number of free pages drops below lotsfree, these pages are swapped out.

When memory pressure eases, a deactivated process is reactivated. The choose_reactivate() routine is biased to choose interactive processes over non-interactive ones, runnable processes over sleeping ones, and processes that have been deactivated longest over those more recently deactivated.
