MAINTAINING PAGE AVAILABILITY

Two computational elements maintain page availability:

Paging thresholds trigger the gamut of paging events.
The vhand and sched daemons (system processes) handle the actual paging and deactivation.

vhand monitors free pages to keep their number above a threshold and ensure sufficient memory for demand paging. vhand governs the overall state of the paging system. sched becomes operative when the number of pages available in memory diminishes below a certain level. vhand and sched will be described in the context of their work shortly.




	NOTE: The `sched` process is known colloquially as the swapper.

Paging Thresholds

Memory management uses paging thresholds that trigger various paging activities. The figure shows the full range of available memory and indicates which paging activity occurs when the memory level falls below each paging threshold.

Figure 1-24 Available memory in the system

The value termed freemem represents the total number of free pages in the phead linked list, which includes all memory available in a system after kernel initialization.

Three tunable paging thresholds are initialized by the setmemthresholds() routine.

Table 1-20 setmemthresholds() paging thresholds

Paging threshold	Meaning
`lotsfree`	Plenty of free memory, specified in pages. The upper bound from which the paging daemon begins to steal pages.
`desfree`	Amount of memory desired free, specified in pages. This is the lower bound at which the paging daemon begins stealing pages.
`minfree`	The minimal amount of free memory tolerable, specified in pages. If free memory drops below this boundary, `sched()` recognizes the system is desperate for memory and deactivates entire processes whether they are runnable or not.

The `gpgslim` Paging Threshold

The gpgslim paging threshold is the point at which vhand starts paging. gpgslim adjusts dynamically according to the needs of the system. It oscillates between an upper bound called lotsfree and a lower bound called desfree. Both lotsfree and desfree are calculated when the system boots up and are based on the size of system memory.

When the system boots, gpgslim is set to 1/4 the distance between lotsfree and desfree (desfree + (lotsfree - desfree)/4). As the system runs, this value fluctuates between desfree and lotsfree. When the sum of available memory and the number of pages scheduled for I/O (soon to be freed) falls below gpgslim, vhand() begins aging and stealing little-used pages in an attempt to increase the available memory above this threshold.

The system wants to keep memory at gpgslim. If the system is not stressed, gpgslim starts rising, because it does not need to have a lot more pages freed. As memory becomes more scarce, the system tries to maintain the pool of free memory, causing gpgslim to fall. If gpgslim decreases to minfree, the system starts to deactivate entire processes.
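
The boot-time value of gpgslim and the test that starts vhand can be written out as a short sketch. This is an illustration only, not kernel code: the function names and the page counts are made up for the example, and only the formula and the comparison come from the text above.

```python
def initial_gpgslim(lotsfree: int, desfree: int) -> int:
    """Boot-time gpgslim: one quarter of the way from desfree up to lotsfree."""
    return desfree + (lotsfree - desfree) // 4


def should_start_paging(freemem: int, pages_pending_io: int, gpgslim: int) -> bool:
    """True when free pages plus pages about to be freed by I/O fall below gpgslim."""
    return freemem + pages_pending_io < gpgslim


# Hypothetical page counts, chosen only to exercise the formula.
lotsfree, desfree = 8192, 1024
gpgslim = initial_gpgslim(lotsfree, desfree)
paging = should_start_paging(freemem=2500, pages_pending_io=200, gpgslim=gpgslim)
```

The result always lies between desfree and lotsfree, which matches the range in which gpgslim is said to oscillate.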

How Memory Thresholds are Tuned

Performance testing has shown that memory usage differs for a server versus a workstation. Workstations typically run a few large applications whereas servers typically run many applications of varying size. Consequently, the paging and deactivation thresholds on workstations are a smaller fraction of memory than on the servers. In a typical workstation environment, applications start up requiring a large number of pages, which eventually reduce to a smaller working set of pages. By allowing applications to claim more memory before paging or deactivating, the working set is more likely to stay in memory.

Paging and activation algorithms take these and other differences into account. Depending on the physical memory size of the system, the paging thresholds are initialized to either a "small memory" or "large memory" set of values.

Small Memory Thresholds

For small memory systems (that is, systems with 32 MB or less of freemem), the paging thresholds are set to a smaller fraction of total memory to allow applications to utilize more memory before the system begins paging and deactivating. The paging thresholds are set as follows:

Table 1-21 Small-memory paging thresholds

Threshold	Limit	Not to exceed
`lotsfree`	1/8 `freemem`	1 MB
`desfree`	1/16 `freemem`	240 KB
`minfree`	1/2 `desfree`	100 KB
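
As a worked example of Table 1-21, the sketch below applies each fraction and then its limit. It works in bytes for readability, although the kernel keeps these thresholds in pages; the function name and the sample memory sizes are assumptions made for the illustration.

```python
KB = 1024
MB = 1024 * KB


def small_memory_thresholds(freemem_bytes: int) -> dict:
    """Apply the small-memory fractions, then the 'not to exceed' limits."""
    lotsfree = min(freemem_bytes // 8, 1 * MB)
    desfree = min(freemem_bytes // 16, 240 * KB)
    minfree = min(desfree // 2, 100 * KB)
    return {"lotsfree": lotsfree, "desfree": desfree, "minfree": minfree}


# Two hypothetical systems: one below every limit, one that hits all three.
tiny = small_memory_thresholds(2 * MB)
typical = small_memory_thresholds(16 * MB)
```

With 2 MB of freemem none of the limits applies, so the thresholds are plain fractions; with 16 MB all three thresholds sit at their limits.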

Large Memory Thresholds

For large memory systems (that is, systems with greater than 32 MB of freemem), the paging thresholds are set to a larger fraction of memory to allow vhand() to start paging earlier so that it can efficiently walk a (potentially) longer active pregions list. This also helps sched() process a potentially longer active process list by starting process deactivation earlier. The paging thresholds are set as follows:

Table 1-22 Large-memory paging thresholds

Threshold	Limit	Capacity if `freemem` < 2 GB	Capacity if `freemem` > 2 GB
`lotsfree`	1/16 `freemem`	32 MB	64 MB
`desfree`	1/64 `freemem`	4 MB	12 MB
`minfree`	1/4 `desfree`	1 MB	5 MB

These settings result in a linear increase of the paging thresholds up to a certain memory size, after which the thresholds remain fixed. For example, lotsfree increases linearly and reaches its maximum value of 32 MB when freemem is 512 MB. For memory sizes beyond 512 MB, lotsfree remains fixed at 32 MB. This results in the system paging earlier for smaller memory configurations and later for larger sizes.

When physical memory sizes exceed 2 GB, all the paging thresholds are increased to a larger set of fixed values.
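
The linear-then-fixed behavior of Table 1-22 can be checked with a short sketch. Several points here are assumptions rather than statements from the text: values are in bytes rather than pages, a system with exactly 2 GB is given the smaller limits, and the minfree fraction is applied to the uncapped desfree figure so that the limits listed for minfree can actually be reached.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def large_memory_thresholds(freemem_bytes: int) -> tuple:
    """Return (lotsfree, desfree, minfree) in bytes for a large-memory system."""
    if freemem_bytes > 2 * GB:
        lots_cap, des_cap, min_cap = 64 * MB, 12 * MB, 5 * MB
    else:
        # Assumption: exactly 2 GB falls in the lower bracket.
        lots_cap, des_cap, min_cap = 32 * MB, 4 * MB, 1 * MB

    lotsfree = min(freemem_bytes // 16, lots_cap)
    desfree = min(freemem_bytes // 64, des_cap)
    # Assumption: 1/4 of the uncapped desfree fraction, then the limit.
    minfree = min((freemem_bytes // 64) // 4, min_cap)
    return lotsfree, desfree, minfree


at_64mb = large_memory_thresholds(64 * MB)
at_512mb = large_memory_thresholds(512 * MB)
at_1gb = large_memory_thresholds(1 * GB)
at_4gb = large_memory_thresholds(4 * GB)
```

Under these assumptions lotsfree is the same at 512 MB and at 1 GB, which is the plateau described above, and all three thresholds move to the larger fixed values at 4 GB.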

How Paging is Triggered

schedpaging() runs at a rate termed vhandrunrate, a tunable parameter that defaults to eight times per second. Paging is activated when the sum of free memory and paroled memory (freemem + parolemem) is less than lotsfree.
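
A minimal sketch of this trigger follows. The function names and sample page counts are illustrative assumptions; the default rate of eight runs per second and the comparison against lotsfree are taken from the text.

```python
VHANDRUNRATE = 8  # default runs per second, per the text


def schedpaging_period_seconds(vhandrunrate: int = VHANDRUNRATE) -> float:
    """Time between successive schedpaging() runs."""
    return 1.0 / vhandrunrate


def below_lotsfree(freemem: int, parolemem: int, lotsfree: int) -> bool:
    """True when free plus paroled pages have dropped below lotsfree."""
    return freemem + parolemem < lotsfree


period = schedpaging_period_seconds()
triggered = below_lotsfree(freemem=1000, parolemem=200, lotsfree=1500)
```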

`vhand`, the pageout daemon

Programmatically, vhand is awakened by schedpaging() periodically to maintain recently referenced pages and to move pages out when memory is tight. vhand operates on the basis of vhandargs_t, which consists of a pointer to the target pregion, a count of the physical pages visited, and a nice value for preferential aging.

vhand can also be awakened by allocpfd2() (in vm_page.c), a routine that allocates a single page of memory.

If all the pages on the free memory list (phead) are locked, or the routine has been called while using the interrupt control stack (ICS) and all pages on the free list are also in the page cache (phash), allocpfd2() cannot get any pages.

If on the ICS without any available pages, allocpfd2() wakes the page daemon. Regardless of which stack the system is running on, allocpfd2() then wakes up unhashdaemon, which removes pages from the page cache.

If on the ICS, allocpfd2() returns NULL; if not on the ICS, allocpfd2() sleeps waiting for a page to become available, and then retries.
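
The decision path described for allocpfd2() when no page is available can be summarized as a sketch. It models only the order of actions from the text; the function names and the action strings are invented for the illustration and do not correspond to kernel symbols.

```python
def cannot_allocate(all_free_pages_locked: bool,
                    on_ics: bool,
                    all_free_pages_in_page_cache: bool) -> bool:
    """The two situations in which allocpfd2() cannot get a page."""
    return all_free_pages_locked or (on_ics and all_free_pages_in_page_cache)


def allocation_fallback(on_ics: bool) -> list:
    """Ordered actions taken once no page can be obtained."""
    actions = []
    if on_ics:
        actions.append("wake page daemon")
    # unhashdaemon is woken regardless of which stack is in use.
    actions.append("wake unhashdaemon")
    if on_ics:
        actions.append("return NULL")
    else:
        actions.append("sleep until a page is available, then retry")
    return actions


ics_path = allocation_fallback(on_ics=True)
normal_path = allocation_fallback(on_ics=False)
```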

Two-Handed Clock Algorithm

A doubly linked list of pregions, termed the active pregion list, is used by vhand to examine memory availability. Conceptually, the pregions can be visualized as being linked in a circle, in the center of which are two clock-like hands. The two hands function as a steal hand and an age hand.

A steal hand removes pages whose reference bits remain clear since the most recent pass of the age hand.
An age hand clears reference bits on in-core pages in an active pregion.

The kernel automatically keeps an appropriate distance between the hands, based on the available paging bandwidth, the number of pages that need to be stolen, the number of pages already scheduled to be freed, and the frequency at which vhand runs.

Figure 1-25 Two-handed vhand clock algorithm, showing also the factors that affect vhand

The two hands cycle through the active pregion linked lists of physical memory to look for memory pages that have not been referenced recently and move them to secondary storage - the swap space. Pages that have not been referenced from the time the age hand passes to the time the steal hand passes are pushed out of memory. The hands rotate at a variable rate determined by the demand for memory.

The vhand daemon decides when to start paging by determining how much free memory is available. Once free memory drops below the gpgslim threshold, paging occurs. vhand attempts to free enough pages to bring the supply of memory back up to gpgslim. Between gpgslim and lotsfree, the page daemon continues to age pages (that is, clear their reference bits) but no longer steals pages.
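
The interplay of the two hands can be illustrated with a toy simulation. It is a simplified model under stated assumptions: pages sit in one flat circular list rather than in pregions, both hands advance one page per step, and the names Page, clock_step, and run_clock are hypothetical. The steal hand acts before the age hand in each step.

```python
from dataclasses import dataclass


@dataclass
class Page:
    referenced: bool = True   # reference bit, set when the page is touched
    resident: bool = True     # False once the page has been stolen


def clock_step(pages, age_pos, steal_pos):
    """Advance both hands by one page; return new positions and any stolen index."""
    stolen = None
    victim = pages[steal_pos]
    if victim.resident and not victim.referenced:
        victim.resident = False          # not referenced since it was aged
        stolen = steal_pos
    pages[age_pos].referenced = False    # age hand clears the reference bit
    n = len(pages)
    return (age_pos + 1) % n, (steal_pos + 1) % n, stolen


def run_clock(pages, age_pos, steal_pos, steps, touches):
    """touches maps a step number to page indexes referenced just before it."""
    stolen = []
    for step in range(steps):
        for idx in touches.get(step, ()):
            if pages[idx].resident:
                pages[idx].referenced = True
        age_pos, steal_pos, victim = clock_step(pages, age_pos, steal_pos)
        if victim is not None:
            stolen.append(victim)
    return stolen


pages = [Page() for _ in range(8)]
# Age hand starts three pages ahead of the steal hand; page 4 is touched
# again after being aged, so it survives the steal hand's visit.
stolen = run_clock(pages, age_pos=3, steal_pos=0, steps=6, touches={3: [4]})
```

In this run, pages 3 and 5 are stolen because nothing referenced them between the two hands, while page 4 stays resident.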

Factors Affecting `vhand`

vhand responds to various workloads, transient situations, and memory configurations. When aging and stealing from regions, vhand

ages some constant fraction of each pregion.
uses the pregion field p_agescan to track the last age hand location.
uses the pregion field p_ageremain to track remaining pages to be aged.
uses the pregion field p_stealscan to track the last steal hand location.
pushes vfd/dbd pairs to swap if they have no valid pages.

When the age hand arrives at a region, it ages some constant fraction of pages before moving to the next region (by default 1/16 of the region's total pages). The p_agescan tag enables the age hand to move to the location within a pregion where it left off during its previous pass, while the p_ageremain charts how many pages must be aged to fill the 1/16 quota before moving on to the next pregion.

The steal hand uses the pregion field p_stealscan to locate itself within a pregion and resume taking pages that have not been referenced since last aged. If no valid pages remain, vhand pushes out of memory the vfd/dbd pairs associated with the region.
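
The bookkeeping done by the age hand can be sketched as follows. The field names p_count, p_agescan, and p_ageremain follow the text; the Pregion class, the helper functions, the per-call budget, and the wrap-around at the end of the pregion are assumptions made for the illustration.

```python
from dataclasses import dataclass, field

AGE_FRACTION = 16  # the age hand ages 1/16 of a pregion per visit by default


@dataclass
class Pregion:
    p_count: int                      # total pages in the pregion
    p_agescan: int = 0                # where the age hand left off
    p_ageremain: int = 0              # pages still to age in this visit
    ref_bits: list = field(default_factory=list)


def new_pregion(p_count: int) -> Pregion:
    preg = Pregion(p_count=p_count, ref_bits=[True] * p_count)
    preg.p_ageremain = max(1, p_count // AGE_FRACTION)
    return preg


def age_pregion(preg: Pregion, budget: int) -> int:
    """Age up to `budget` pages, never exceeding the remaining quota."""
    aged = 0
    while aged < budget and preg.p_ageremain > 0:
        preg.ref_bits[preg.p_agescan] = False
        preg.p_agescan = (preg.p_agescan + 1) % preg.p_count
        preg.p_ageremain -= 1
        aged += 1
    return aged


preg = new_pregion(64)                # quota for one visit is 64 // 16 = 4 pages
first = age_pregion(preg, budget=3)   # interrupted before the quota is filled
second = age_pregion(preg, budget=10) # resumes at p_agescan, stops at the quota
```

The second call ages only one page: it resumes at the saved p_agescan position and stops as soon as p_ageremain reaches zero, at which point the hand would move on to the next pregion.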

How much to age and steal depends on several factors:

frequency of vhand runs (by default eight times per second).
available paging bandwidth (based on comparison with a global rate of pageouts completed within an interval of time).
how often the system falls to zero free memory.
position of the paging threshold gpgslim.
number of pages already scheduled to be freed.

vhand is biased against threads that have nice priorities: the nicer a thread, the more likely vhand will steal its pages. The pregion field p_bestnice reflects the best (numerically, the smallest value) nice value of all threads sharing a region.

What Happens when `vhand` Wakes Up

Refer to the table that follows for explanations of the vhand variables.

vhand establishes pagecounts for pages to age and pages to steal, and sets the coalescecnt to zero.
vhand uses the SCRITICAL flag to get access to the system critical memory pool. (The SCRITICAL flag for the vhand process is set when the process starts running for the first time.)
vhand increments the value coalescecnt and compares it to the value coalescerate. If coalescecnt is higher, vhand attempts to remove pages from kernel allocation buckets until freemem is above lotsfree. Then vhand resets coalescecnt to zero.
Next vhand updates the value of gpgslim, based on value of memzeroperiod.
vhand updates pageoutrate, using pageoutcnt.
vhand updates targetlaps, the number of desired laps between the age and steal hands. If fewer CPU cycles are being used than the value of targetcpu, vhand increases the value of targetlaps (up to a maximum of 15); if more CPU cycles are being used than targetcpu, targetlaps is decreased.
vhand updates agerate, the number of pages to age per second.
If vhandinfoticks is non-zero, diagnostic information prints to the console.
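
The targetlaps adjustment in the list above can be sketched as a small feedback rule. The maximum of 15 and the comparison against targetcpu come from the text; the step size of one lap and the lower bound of one are assumptions, as is the function name.

```python
MAX_TARGETLAPS = 15


def update_targetlaps(targetlaps: int, cpu_used_pct: float,
                      targetcpu_pct: float = 10.0,
                      step: int = 1, minimum: int = 1) -> int:
    """Widen the gap between the hands when vhand is cheap, narrow it when costly."""
    if cpu_used_pct < targetcpu_pct:
        return min(targetlaps + step, MAX_TARGETLAPS)
    if cpu_used_pct > targetcpu_pct:
        return max(targetlaps - step, minimum)
    return targetlaps


widened = update_targetlaps(8, cpu_used_pct=4.0)
narrowed = update_targetlaps(8, cpu_used_pct=25.0)
capped = update_targetlaps(15, cpu_used_pct=1.0)
```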




	NOTE: None of the variables in the table that follows may be tuned.

Table 1-23 Variables affecting vhand

Variable	Purpose
`coalescerate`	How often `vhand()` attempts to reclaim unused memory from the kernel allocation buckets, beginning at 128; that is, every 128th time `vhand` runs, it attempts to return memory to the system. If successful, `vhand` resets `coalescerate` to every 128th time. If unsuccessful, `vhand` multiplies `coalescerate` by two (checks memory half as often) up to every 512th time.
`memzeroperiod`	Minimum time period (default = 3 seconds) permissible for `freemem` to reach zero events; determines how often `gpgslim` is adjusted when `vhand()` is running. `gpgslim` is incremented if `freemem` does not reach zero twice within `memzeroperiod`. `gpgslim` is decremented if `freemem` reaches zero twice within `memzeroperiod` slightly above `lotsfree`.
`pageoutrate`	Current pageout rate, calculated empirically from number of pageouts completed.
`pageoutcnt`	Recent count of pageouts completed
`targetlaps`	Ideal gap between steal and age hands for `handlaps`; adapts at run time. During normal operation, the hands should be as far apart as possible to give processes maximum time to reset a cleared reference bit being used by a page. `targetlaps` is defined in the kernel as a static variable; it does not appear in the symbol table.
`targetcpu`	Maximum percentage of CPU `vhand` should spend paging. (Default value is 10%.)
`handlaps`	Actual number of laps between the age and steal hands.
`agerate`	Number of pages the age hand visits to age per second; adapts continually to system load. These are defined in the kernel as static variables (meaning they do not appear in the symbol table).
`stealrate`	How many pages the steal hand visits per second; adapts continually to system load. These are defined in the kernel as static variables (meaning they do not appear in the symbol table).
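
The back-off described for coalescerate in the table can be sketched as follows. The values 128 and 512 and the doubling on failure come from the table; the function name and the sample sequence of outcomes are illustrative assumptions.

```python
COALESCE_START = 128   # attempt reclaim every 128th vhand run
COALESCE_LIMIT = 512   # never back off beyond every 512th run


def next_coalescerate(coalescerate: int, reclaimed: bool) -> int:
    """Reset on success; otherwise check half as often, up to the limit."""
    if reclaimed:
        return COALESCE_START
    return min(coalescerate * 2, COALESCE_LIMIT)


rate = COALESCE_START
history = []
# Hypothetical outcomes of four successive reclaim attempts.
for reclaimed in (False, False, False, True):
    rate = next_coalescerate(rate, reclaimed)
    history.append(rate)
```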

`vhand` Steals and Ages Pages

Once vhand establishes its criteria, it proceeds to traverse the linked list of pregions. Continuing in the clock-hands analogy, vhand is ready to move its hands.

vhand determines how many pages and what pages are available to steal.
- If the steal hand is pointing to bufcache_preg, vhand steals buffers from the buffer cache with the stealbuffers() routine. The global parameter dbc_steal_factor determines how much more aggressively to steal buffer cache pages than pregion pages. If dbc_steal_factor has a value of 16, buffer cache pages are treated no differently than pregion pages; the default value of 48 means that buffer cache pages are stolen three times as aggressively as pregion pages.
- If the steal hand points to a pregion whose region has no valid pages (that is, r_nvalid == 0), vhand pushes its B-tree out to the swap device. If none of the processes using the region are loaded in memory (that is, r_incore == 0), the entire region may be swapped out.
- Otherwise, vhand steals all pages between p_stealhand and (p_agescan - p_count/16 * handlaps), up to the steal quota (calculated from stealrate).
- vhand updates p_stealscan to the page number following the last stolen page of the affected pregion.
- If vhand has not stolen as many pages as permissible (calculated from stealrate), it moves to the next pregion and repeats the process until it satisfies the system's demand.
Next, vhand moves the age hand to clear the reference bit from a selected number of pages.
- If the age hand points to bufcache_preg, vhand ages one sixteenth of the pages in the buffer cache with the agebuffers() routine.
- vhand determines the best nice value (that is, the lowest number) of all the pregions using the region. For each page in the region, if the nice value exceeds a randomly generated number, vhand does not age the page.
- Otherwise, vhand ages all pages between p_agehand and (p_agehand + p_ageremain) by clearing the pde_ref bit and purging the TLB.
- Finally, vhand updates p_agehand to be the page number after the last aged page in the affected pregion.

Note, the steal hand is moved first to keep it behind the age hand and prevent aging and stealing a page in the same cycle.
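
The range of pages the steal hand may take from one pregion, as given by the expression p_agescan - p_count/16 * handlaps, can be worked through in a sketch. The function name is hypothetical, and several simplifications are assumed: integer division for p_count/16, a half-open page range, and no wrap-around at the end of the pregion.

```python
def steal_range(p_stealhand: int, p_agescan: int, p_count: int,
                handlaps: int, quota: int) -> range:
    """Pages eligible for stealing in one pregion, limited by the steal quota."""
    # The steal hand must stay `handlaps` age-quotas behind the age hand.
    upper = p_agescan - (p_count // 16) * handlaps
    if upper <= p_stealhand:
        return range(0)   # nothing safely behind the age hand yet
    return range(p_stealhand, min(upper, p_stealhand + quota))


# Hypothetical pregion of 160 pages: the per-visit age quota is 10 pages,
# so with handlaps = 2 the steal hand stops 20 pages short of p_agescan.
limited_by_quota = steal_range(50, 100, 160, handlaps=2, quota=20)
limited_by_gap = steal_range(50, 100, 160, handlaps=2, quota=100)
too_close = steal_range(90, 100, 160, handlaps=2, quota=20)
```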

Figure 1-26 Ranges within which pregion pages are aged and stolen

The `sched()` routine

The sched() routine (colloquially termed "the swapper") handles the deactivation and reactivation of processes when free memory falls below minfree, or when the system appears to be thrashing.




	NOTE: Deactivation occurs on a per-thread basis. `sched()` chooses to deactivate on a process level and then deactivates each thread.

Deactivation occurs when sched() determines the system:

is low on memory; that is, if freemem falls below the deactivation threshold minfree and more than one process is running.
appears to be thrashing; that is, if the system has a high paging rate and low CPU usage.

Reactivation occurs when the system is no longer low on memory or thrashing.
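
The two deactivation conditions can be expressed as a pair of predicates. This is a sketch only: the text does not say what counts as a high paging rate or low CPU usage, so those cut-offs are passed in as parameters, and the function names are made up for the example.

```python
def is_thrashing(paging_rate: float, cpu_usage_pct: float,
                 high_paging_rate: float, low_cpu_pct: float) -> bool:
    """High paging rate combined with low CPU usage; cut-offs are caller-supplied."""
    return paging_rate >= high_paging_rate and cpu_usage_pct <= low_cpu_pct


def should_deactivate(freemem: int, minfree: int,
                      running_processes: int, thrashing: bool) -> bool:
    """Deactivate when memory is below minfree with several processes, or when thrashing."""
    low_on_memory = freemem < minfree and running_processes > 1
    return low_on_memory or thrashing


low_memory_case = should_deactivate(freemem=50, minfree=100,
                                    running_processes=3, thrashing=False)
single_process_case = should_deactivate(freemem=50, minfree=100,
                                        running_processes=1, thrashing=False)
```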

What to Deactivate or Reactivate

Deactivation and reactivation are determined by:

process priority; the lower the process priority (meaning the higher the nice value), the more likely it will be deactivated. The higher the process priority, the more likely it will be reactivated. Real-time processes are ineligible for deactivation.
process size; the larger the process resident set size, the more likely it will be deactivated.
process state; a process that has been sleeping or has been in memory for some time is likely to be deactivated. A process that has been deactivated for a while and is now ready to run is likely to be reactivated.
process type. A batch process (one that works continuously) or one marked for serialization is more likely than an interactive process (one that works in spurts) to be deactivated. Interactive processes are more likely to be reactivated than batch or serialized processes.
time in current state

The swapper deactivates processes and prevents them from running, thus reducing the rate at which new pages are accessed. Once swapper detects that available memory has risen above minfree and the system is not thrashing, the swapper reactivates the deactivated processes and continues monitoring memory availability.

Figure 1-27 sched() chooses processes to deactivate based on size, nice priority, and how long it has been running.

sched() walks the chain of active processes, examining each, and deciding the best candidate to be deactivated based on size, nice priority, and how long it has been running.

Programmatically, sched() deactivates and reactivates processes.

If the system appears to be thrashing or experiencing memory pressure, the sched routine walks through the active process list calculating each process's deactivation priority based on type, state, size, length of time in memory, and how long it has been sleeping. (Batch processes and processes marked for serialization by the serialize() command are more likely to be deactivated than interactive processes.) The best candidate is then marked for deactivation.

If the system is not thrashing or experiencing memory pressure, the sched routine walks through the active proc list calculating each deactivated process' reactivation priority based on how long it has been deactivated, its size, state, and type. Batch processes and those marked by the serialize() command are less likely to be reactivated than is an interactive process. Once the most deserving process has been determined, it is reactivated.
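
The walk over the active process list can be pictured as a scoring pass. The sketch below is purely illustrative: the text names the factors but gives no formula, so every weight here is a made-up placeholder, and the Proc class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    name: str
    nice: int
    resident_pages: int
    seconds_in_memory: float
    seconds_sleeping: float
    is_batch_or_serialized: bool = False
    is_realtime: bool = False


def deactivation_score(p: Proc) -> float:
    """Higher score means a better deactivation candidate (placeholder weights)."""
    score = float(p.nice)                    # nicer processes are deactivated sooner
    score += p.resident_pages / 1000.0       # larger resident sets score higher
    score += p.seconds_in_memory / 10.0      # long residency counts against a process
    score += p.seconds_sleeping / 10.0       # so does a long sleep
    if p.is_batch_or_serialized:
        score += 20.0                        # batch and serialized work goes first
    return score


def pick_deactivation_candidate(procs):
    """Return the best candidate, skipping real-time processes entirely."""
    eligible = [p for p in procs if not p.is_realtime]
    if not eligible:
        return None
    return max(eligible, key=deactivation_score)


procs = [
    Proc("editor", nice=0, resident_pages=500,
         seconds_in_memory=5, seconds_sleeping=0),
    Proc("batchjob", nice=10, resident_pages=20000,
         seconds_in_memory=120, seconds_sleeping=0, is_batch_or_serialized=True),
    Proc("rtctl", nice=0, resident_pages=90000,
         seconds_in_memory=900, seconds_sleeping=0, is_realtime=True),
]
candidate = pick_deactivation_candidate(procs)
```

Reactivation could be modeled the same way with the preferences reversed, favoring interactive processes that have been deactivated longest.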

When a process is deactivated

Once a process and its pregions are marked for deactivation, sched()

removes the process from the run queue.
adds its uarea to the active pregion list so that vhand can page it out.
moves all the pregions associated with the target process in front of the steal hand, so that vhand can steal from them immediately.
enables vhand to scan and steal pages from the entire pregion, instead of 1/16.

Eventually, vhand pushes the deactivated process's pages to secondary storage.

When a process is reactivated

Processes stay deactivated until the system has freed up enough memory and the paging rate has slowed sufficiently to return processes to the run queue. The process with the highest reactivation priority is then returned to the run queue.

Once a process and its pregions are marked for reactivation, sched():

removes the process's uarea from the active pregion list.
clears all deactivation flags.
brings in the vfd/dbd pairs.
faults in the uarea.
adds the process to the run queue.

Self-Deactivation

Earlier HP-UX implementations did not permit a process to be swapped out if it was holding a lock, doing I/O, or was not at a signalable priority. Even if priority made it most likely to be deactivated, vhand bypassed the process.

Now, if the most deserving process cannot be deactivated immediately, it is marked for self-deactivation; that is, the process sets a self-deactivation flag. The next time the process must fault in a page, it deactivates itself.
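
Self-deactivation amounts to a deferred action keyed on the next page fault. The sketch below models that flow; the Proc class, its field names, and the two functions are assumptions for the illustration and do not reflect actual kernel structures.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    holding_lock: bool = False
    doing_io: bool = False
    signalable: bool = True
    self_deactivate: bool = False   # set when deactivation must be deferred
    deactivated: bool = False


def request_deactivation(p: Proc) -> str:
    """Deactivate now if possible; otherwise mark the process to do it itself."""
    if p.holding_lock or p.doing_io or not p.signalable:
        p.self_deactivate = True
        return "deferred"
    p.deactivated = True
    return "deactivated"


def on_page_fault(p: Proc) -> None:
    """At its next page fault, a marked process deactivates itself."""
    if p.self_deactivate:
        p.self_deactivate = False
        p.deactivated = True


busy = Proc(holding_lock=True)
outcome = request_deactivation(busy)   # cannot be deactivated immediately
state_before_fault = busy.deactivated
on_page_fault(busy)                    # the process now deactivates itself
```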




	NOTE: `sched()` deactivates and reactivates processes. As part of a process's deactivation or reactivation, all its threads get deactivated or reactivated. `sched()` does not deactivate or reactivate threads individually.

Thrashing

Thrashing is defined as low CPU usage with high paging rate. Thrashing might occur when several processes are running, several processes are waiting for I/O to complete, or active processes have been marked for serialization.

On systems with very demanding memory needs (for example, systems that run many large processes), the paging daemons can become so busy deactivating and reactivating processes and swapping pages in and out that the system spends too much time paging and not enough time running processes.

When this happens, system performance degrades rapidly, sometimes to such a degree that nothing seems to be happening. At this point, the system is said to be thrashing, because it is doing more overhead than productive work.

If your working set is larger than physical memory, the system will thrash. To solve the problem,

reduce the working set of running processes by deactivation, or
increase the size of physical memory.

If you are left with one huge process constrained by physical memory and the system still thrashes, you will need to rewrite the application so that it uses fewer pages simultaneously, for example by grouping data structures according to access.

Serialization

All processes marked by the serialize command are run serially. This functionality unjams the bottleneck (recognizable by process throughput degradation) caused by groups of large processes contending for the CPU. By running large processes one at a time, the system can make more efficient use of the CPU as well as system memory since each process does not end up constantly faulting in its working set, only to have the pages stolen when another process starts running.

As long as there is enough memory in the system, processes marked by serialize() behave no differently than other processes in the system. However, once memory becomes tight, processes marked by serialize are run one at a time in priority order. Each process runs for a finite interval of time before another serialized process may run. The user cannot enforce an execution order on serialized processes.

serialize() can be run from the command line or with a PID value. serialize() also has a timeshare option that returns the PID specified to normal timeshare scheduling algorithms.

If serialization is insufficient to eliminate thrashing, you will need to add more main memory to the system.
