HPlogo HP-UX Memory Management: White Paper > Chapter 1 MEMORY MANAGEMENT

MEMORY-RELEVANT PORTIONS OF THE PROCESSOR


Figure 1-5 Processor architecture, showing major components

[Processor architecture, showing major components]

The figure above and the table that follows name the principal processor components. Of these, the registers, translation lookaside buffer, and cache are crucial to memory management, and are discussed in greater detail following the table.

Table 1-2 Processor Architecture, components and purposes

Component / Purpose
Central Processing Unit (CPU)

The main component responsible for reading program and data from memory, and executing the program instructions. Within the CPU are the following:

  • Registers, high-speed memory used to hold data while it is being manipulated by instructions, for computations, interruption processing, protection mechanisms, and virtual memory management. Registers are discussed shortly in greater detail.

  • Control Hardware (also called instruction or fetch unit) that coordinates and synchronizes the activity of the CPU by interpreting (decoding) instructions to generate control signals that activate the appropriate CPU hardware.

  • Execution Hardware to perform the actual arithmetic, logic, and shift operations. Execution Hardware can take on many specialized tasks but most common are the Arithmetic and Logic Unit (ALU) and the Shift Merge Unit (SMU).

Instruction and Data Cache

The cache is a portion of high-speed memory used by the CPU for quick access to data and instructions. The most recently accessed data is kept in the cache.
Translation Lookaside Buffer (TLB)

The processor component that enables the CPU to access data through virtual address space by:

  • Translating the virtual address to physical address.

  • Checking access rights, so that access is granted to instructions, data, or I/O only if the requesting process has proper authorization.

Floating Point Coprocessor

An assist processor that carries out specialized tasks for the CPU.

System Interface Unit (SIU)

Bus circuitry that allows the CPU to communicate with the central (native) bus.

 

The Page Table or PDIR

The operating system maintains a table in memory called the Page Directory (PDIR) which keeps track of all pages currently in memory. When a page is mapped in some virtual address space, it is allocated an entry in the PDIR. The PDIR is what links a physical page in memory to its virtual address.

The PDIR is implemented as a memory-resident table of software structures called page directory entries (PDEs), which contain virtual addresses. The PDIR maps the entire physical memory, with one entry for every page of physical memory. Each entry contains a 48- or 64-bit virtual address. When the processor needs to find a physical page not indexed in the TLB, it searches the PDIR for an entry with a matching virtual address.

The PDIR table is a hash table with collision chains. The virtual address is used to hash into one of the buckets in the hash table and the corresponding chain is searched until a chain entry with a matching virtual address is found.
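The hash-and-chain search just described can be sketched in C. This is a simplified illustration only: the real kernel's hash function, field widths, and locking differ, and the names below merely echo the hpde fields documented later in this chapter.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified page directory entry: a virtual address tag plus a chain link. */
struct pde {
    uint32_t    pde_space;   /* virtual space ID */
    uint32_t    pde_vpage;   /* virtual page number within the space */
    uint32_t    pde_phys;    /* physical page number */
    struct pde *pde_next;    /* collision chain; NULL at end of list */
};

#define PDIR_BUCKETS 1024    /* must be a power of two for the mask below */

/* Hypothetical hash: fold space and vpage, mask down to a bucket index. */
static unsigned
pdir_hash(uint32_t space, uint32_t vpage)
{
    return (space ^ vpage) & (PDIR_BUCKETS - 1);
}

/* Hash into a bucket, then walk the collision chain until an entry
 * with a matching virtual address is found. */
struct pde *
pdir_lookup(struct pde *buckets[], uint32_t space, uint32_t vpage)
{
    struct pde *p = buckets[pdir_hash(space, vpage)];

    while (p != NULL) {
        if (p->pde_space == space && p->pde_vpage == vpage)
            return p;        /* PDIR hit */
        p = p->pde_next;
    }
    return NULL;             /* PDIR miss: page fault path */
}
```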

Page Fault

A trap occurs when a translation is missing from the translation lookaside buffer (TLB, discussed shortly). If the processor can find the missing translation in the PDIR, it installs the translation in the TLB and allows execution to continue. If not, a page fault occurs.

A page fault is a trap taken when a page needed by a process is missing from main memory. This occurrence is also known as a PDIR miss. A PDIR miss indicates that the page is either on the free list, in the page cache, or on disk; the memory management system must then find the requested page on the swap device or in the file system and bring it into main memory.

Conversely, a PDIR hit indicates that a translation for the virtual address exists in the PDIR; the translation is installed in the TLB and execution continues.

The Hashed Page Directory (hpde) structure

Each PDE contains information on the virtual-to-physical address translation, along with other information necessary for the management of each page of virtual memory. The structural elements of the hashed page directory for PA-RISC 1.1 are shown in the following table.

Table 1-3 struct hpde, the hashed page directory

Element / Meaning
pde_valid - Flag set by the kernel to indicate a valid PDE entry.
pde_vpage - Virtual page number; the high 15 bits of the virtual offset.
pde_space - Contains the complete 16-bit virtual space ID.
pde_ref - Reference bit, set by the kernel when it receives certain interrupts; used by vhand() to tell whether a page has been used recently.
pde_accessed - Used by the stingy cache flush algorithm (a performance enhancement) to indicate that the page may be in the data cache.
pde_rtrap - Data reference trap enable bit; when set, any access to the page causes a page reference trap interruption.
pde_dirty - Dirty bit; set if the page in memory differs from the copy on disk.
pde_dbrk - Data break; used by the TLB.
pde_ar - Access rights; used by the TLB. (See mmap(2) and mprotect(2) for information about how programs can manipulate this field.)
pde_protid - Protection ID; used by the TLB.
pde_executed - Used by the stingy cache flush algorithm to indicate that the page is referenced as text.
pde_uip - Lock flag used by trap-handling code.
pde_phys - Physical page number; the physical memory address divided by the page size.
pde_modified - Indicates to the high-level virtual memory routines whether the page has been modified since it was last written to a swap device.
pde_ref_trickle - Trickle-up bit for references; used with pde_ref on systems whose hardware can search the htbl directly.
pde_block_mapped - Block mapping flag; indicates the page is mapped by a block TLB entry and cannot be aliased.
pde_alias - Virtual alias field. If set, the PDE has been allocated from elsewhere in kernel memory rather than as a member of the sparse PDIR.
pde_next - Pointer to the next entry, or null at the end of the list.

 

A word-oriented hpde structure (struct whpde) is implemented for faster manipulation and is documented in /usr/include/machine/pde.h. The pde.h header file also contains definitions for space manipulation, the maximum number of entries in the PDIR hash table, constants related to field positions within the PDE structure, access rights (which are now granted on a region basis), and a second hashed page directory structure (struct hpde2_0) for PA-RISC 2.0.

NOTE: The 2.0 version of the hpde structure has a field named var_page that can hold the page size information. This is used in implementing super-pages (>4K) on systems based on the PA 2.0 processor.
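As a rough illustration of Table 1-3, a PDE can be sketched as a C bitfield structure. Only the 15-bit pde_vpage and 16-bit pde_space widths come from the table; the remaining widths, the field order, and the packing are hypothetical, and the authoritative layout is in /usr/include/machine/pde.h.

```c
#include <stdint.h>

/* Illustrative sketch of a hashed page directory entry, loosely
 * following Table 1-3. Bit widths other than pde_vpage (15) and
 * pde_space (16) are hypothetical. */
struct hpde_sketch {
    uint32_t pde_valid  : 1;   /* entry holds a live translation */
    uint32_t pde_vpage  : 15;  /* high 15 bits of the virtual offset */
    uint32_t pde_space  : 16;  /* complete 16-bit virtual space ID */
    uint32_t pde_ref    : 1;   /* page referenced recently (vhand) */
    uint32_t pde_dirty  : 1;   /* memory copy differs from disk */
    uint32_t pde_ar     : 7;   /* access rights, used by the TLB */
    uint32_t pde_phys   : 20;  /* physical page number */
};
```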

Translation Lookaside Buffer (TLB)

The translation lookaside buffer (TLB) translates virtual addresses to physical addresses.

Figure 1-6 Role of the TLB.

[Role of the TLB.]

Address translation is handled from the top of the memory hierarchy, hitting the fastest components first (such as the TLB on the processor), then moving on to the page directory table (PDIR, in main memory), and lastly to secondary storage.
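That top-down search order can be sketched as follows. Everything here is a hypothetical model, not HP-UX code: a tiny direct-mapped TLB and a toy PDIR table stand in for the real structures.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_SLOTS  16        /* toy on-chip TLB: fastest, searched first */
#define PDIR_PAGES 64        /* toy memory-resident PDIR */

struct xlate { uint32_t vpage, ppn; bool valid; };

static struct xlate tlb[TLB_SLOTS];
static uint32_t     pdir_ppn[PDIR_PAGES];
static bool         pdir_mapped[PDIR_PAGES];

/* Stand-in for the hashed PDIR search; fills *ppn on a PDIR hit. */
static bool
pdir_search(uint32_t vpage, uint32_t *ppn)
{
    if (vpage < PDIR_PAGES && pdir_mapped[vpage]) {
        *ppn = pdir_ppn[vpage];
        return true;
    }
    return false;
}

/* Search order from the figure: TLB first, then PDIR, else page fault. */
bool
translate(uint32_t vpage, uint32_t *ppn)
{
    struct xlate *t = &tlb[vpage % TLB_SLOTS];

    if (t->valid && t->vpage == vpage) {            /* TLB hit */
        *ppn = t->ppn;
        return true;
    }
    if (pdir_search(vpage, ppn)) {                  /* TLB miss, PDIR hit */
        *t = (struct xlate){ vpage, *ppn, true };   /* install in the TLB */
        return true;
    }
    return false;                                   /* PDIR miss: page fault */
}
```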

Organization and Types of TLB

Depending on model, the TLB may be organized on the processor in one of two ways:

  • Unified TLB - A single TLB that holds translations for both data and instructions.

  • Split Data and Instruction TLB - Dual TLB units in the processor each of which hold translations specifically for data or instructions.

At one time, many systems were designed with a split Data TLB (DTLB) and Instruction TLB (ITLB) to account for the different locality characteristics and access patterns of data and instructions (frequent random access of data versus relatively sequential, single-use access of instructions). Since then, cost factors have allowed much larger TLBs to be included on processors, which has lessened the disadvantages of a unified TLB. As a result, many newer processors have unified TLBs.

Block TLB

In addition to the standard TLB that maps each entry to a single page of memory, many processors also have a block TLB. The block TLB is used to map entries to virtual address ranges larger than a single page, that is, multiple hpdes. Block TLB entries are used to reference kernel memory that remains resident. Since the operating system moves data in and out of memory by pages, a range of pages referenced by a block TLB entry is locked in memory and cannot be paged out. Addressing blocks of pages thus increases the overall address range of the TLB and the speed with which large transactions can be serviced, and so the block TLB may be thought of as a hardware implementation of large pages. The block TLB is typically used for graphics, because graphics data is accessed in huge chunks. It is also used for mapping other static areas such as kernel text and data.
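A block TLB entry's wider reach can be illustrated with a small sketch (the field names are hypothetical): one entry translates any page inside a contiguous range, where a standard entry covers exactly one page.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a block TLB entry: one entry covers a contiguous
 * range of pages rather than a single page. */
struct btlb_entry {
    uint32_t base_vpage;   /* first virtual page of the block */
    uint32_t base_ppn;     /* first physical page of the block */
    uint32_t npages;       /* pages covered by this single entry */
};

/* Translate any page within the block from one entry. */
bool
btlb_translate(const struct btlb_entry *e, uint32_t vpage, uint32_t *ppn)
{
    if (vpage < e->base_vpage || vpage >= e->base_vpage + e->npages)
        return false;                          /* outside the block */
    *ppn = e->base_ppn + (vpage - e->base_vpage);
    return true;
}
```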

Figure 1-7 The TLB is a cache for address translations

[The TLB is a cache for address translations]

The TLB translates addresses

The TLB looks up the translation for the virtual page numbers (VPNs) and gets the physical page numbers (PPNs) used to reference physical memory.

Ideally, the TLB would be large enough to hold translations for every page of physical memory; however, this is prohibitively expensive. Instead, the TLB holds a subset of the entries from the page directory table (PDIR) in memory. The TLB speeds up the process of examining the PDIR by caching copies of its most recently used translations.

Because the purpose of the TLB is to satisfy virtual to physical address translation, the TLB is only searched when memory is accessed while in virtual mode. This condition is indicated by the D-bit in the PSW (or the I-bit for instruction access).

TLB Entries

Since the TLB translates virtual to physical addresses, each entry contains both the Virtual Page Number (VPN) and the Physical Page Number (PPN). Entries also contain Access Rights, an Access Identifier, and five flags.

Table 1-4 TLB flags (PA 2.x architecture)

Flag / Meaning
O - Ordered. Accesses to data for load and store are ranked by strength -- strongly ordered, ordered, and weakly ordered. (See the PA-RISC 2.0 specifications for the memory model and definitions.)
U - Uncacheable. Determines whether data references to a page from memory address space may be moved into the cache. Typically set to 1 for data references to a page that maps to the I/O address space, or for memory address space that must not be moved into the cache.
T - Page reference bit. If set, any access to this page causes a reference trap, handled either by hardware or by software trap handlers.
D - Dirty bit. When set, indicates that the associated page in memory differs from the same page on disk. The page must be flushed before being invalidated.
B - Break. Causes a trap on any instruction that is capable of writing to this page.
P - Prediction method for branching; optional, used for performance tuning.

 

The T, D, and B flags are present only in data or unified TLBs.

In PA 1.x architecture, an E bit (or "valid" bit) indicates that the TLB entry reflects the current attributes of the physical page in memory.

Instruction and Data Cache

Cache is fast, associative memory on the processor module that stores recently accessed instructions and data. From it, the processor learns whether it has immediate access to data or needs to go out to (slower) main memory for it.

Cacheable data going to the CPU from main memory passes through the cache. Conversely, the cache serves as the means by which the CPU passes data to and from main memory. Cache reduces the time required for the CPU to access data by maintaining a copy of the data and instructions most recently requested.

A cache improves system performance because most memory accesses are to addresses that are very close to, or the same as, previously accessed addresses. The cache takes advantage of this property by bringing a block of data into the cache whenever the CPU requests an address. Although the hit rate depends on the size of the cache, its associativity, and the workload, performance measurements show that the vast majority of references find their data already in the cache.

Cache Organization

Depending on model, PA-RISC processors are equipped with either a unified cache or separate caches for instructions and data (for better locality and faster performance). In multiprocessing systems, each processor has its own cache, and a cache controller maintains consistency.

Cache memory itself is organized as follows:

  • A quantity of equal-sized blocks called cache lines; a cache line is the unit of data transferred between the cache and main memory. A cache line can be 16, 32, or 64 bytes long, and is aligned.

  • One 15-bit long cache tag for every cache line, to describe its contents and determine if the desired data is present. The tag contains

    • Physical Page Number (PPN), identifying the page in main memory where the data resides.

    • Flag Bits. When set, a valid flag indicates that the cache line contains valid data. A dirty bit is set if the CPU has modified the contents of the cache line; that is, the cache (not main memory) contains the most current data. If the dirty bit is not set, the line is said to be "clean," meaning that the cache line's contents have not been modified. Other implementation-specific flags may be present.

  • Both the cache tag and the cache line have associated parity bits, used to detect corruption and ensure the entry is correct.

Figure 1-8 Every cache entry consists of a cache tag and cache line.

[Every cache entry consists of a cache tag and cache line]

How the CPU Uses Cache And TLB

When a process executes, it keeps its code (text) and data in processor registers for referencing. If the data or code is not present in the registers, the CPU supplies the virtual address of the desired data to the TLB and to the cache controller. Depending on implementation, caches can be direct mapped, set associative, or fully associative. Recent PA-RISC implementations use direct-mapped caches and fully associative TLBs. Virtual addresses can be sent in parallel to the TLB and cache because the cache is virtually indexed.

A physical page may not be referenced by more than one virtual page, and a virtual address cannot translate to two different physical addresses; that is, PA-RISC does not support hardware address aliasing, although HP-UX implements software address aliasing for text only in EXEC_MAGIC executables.

The cache controller uses the low-order bits of the virtual address to index into the direct-mapped cache. Each index in the cache finds a cache tag containing a physical page number (PPN) and a cache line of data. If the cache controller finds an entry at the cache location, it checks whether the line is the right one by comparing the PPN in the cache tag with the PPN returned by the TLB, because blocks from many different locations in main memory can legitimately map to a given cache location. If the data is not in the cache but the page is translated, the resulting data cache miss is handled completely by the hardware. A TLB miss occurs if the page is not translated in the TLB; if the translation is also not in the PDIR, HP-UX uses the page fault code to fault it in. If the data and code are not in RAM, they may have to be paged in from disk, in which case a disk-to-memory transaction must be performed.
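The tag comparison described above can be sketched as follows. This is a toy model of a virtually indexed, physically tagged lookup; the sizes and structure are illustrative, not those of any particular PA-RISC processor.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES  32u      /* one of the 16/32/64-byte line sizes */
#define CACHE_LINES 2048u    /* illustrative cache size */

struct cline {
    uint32_t ppn;            /* physical page number kept in the tag */
    bool     valid;          /* line holds live data */
    uint8_t  data[LINE_BYTES];
};

static struct cline cache[CACHE_LINES];

/* Low-order virtual address bits pick the slot (virtual indexing);
 * the PPN stored in the tag is then compared against the PPN the
 * TLB returned for the same address (physical tagging). */
bool
cache_hit(uint32_t vaddr, uint32_t ppn_from_tlb)
{
    struct cline *c = &cache[(vaddr / LINE_BYTES) % CACHE_LINES];
    return c->valid && c->ppn == ppn_from_tlb;
}
```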

Figure 1-9 PPNs from Cache and TLB are compared

[PPNs from Cache and TLB are compared]

On a more detailed level, the next figure demonstrates the mapping of virtual and physical address components.

Figure 1-10 Virtual address translation

[Virtual address translation]

TLB Hits and Misses

The sequence followed by the processor as it validates addresses is one of "hit or miss."

  • The TLB is searched; that is, each virtual address and byte offset issued by the processor indexes an entry in the TLB.

    • If the entry is valid, it is known as a TLB hit. The TLB contains a valid physical page number (PPN), which might be accessed in cache.

    • If the entry is invalid or the TLB cannot provide a physical page number, a TLB miss occurs and must be handled. On certain systems, a hardware walker searches the PDIR and if it finds the page, updates the TLB. On systems not equipped with a hardware TLB handler or if the hardware walker does not find an entry in the PDIR, a software interrupt is generated. The software interrupt resolves the fault and updates the TLB, allowing the access to proceed.

There are five TLB miss handlers (instruction, data, non-access instruction, non-access data, and dirty) located in locore.s; the header file pde.h contains the TLB/PDIR structure definitions.

TLB Role in Access Control and Page Protection

In addition to assisting in virtual address translation, the translation lookaside buffer (TLB) serves a security function on behalf of the processor, by controlling access and ensuring that a user process sees only data for which it has privilege rights.

The TLB contains access rights and protection identifiers. PA-RISC allows up to four protection IDs to be associated with each process. These IDs are held in control registers CR-8, CR-9, CR-12, and CR-13.

Table 1-5 Security checks in the TLB

Security check / Purpose
Protection Checks

The P-bit (Protection ID Validation Enable bit) of the Processor Status Word (PSW) is checked:

  • If not set, protection checking on the page is waived (treated as passed), and checking proceeds to access rights validation.

  • If the protection ID validation bit is set, the access ID of the TLB entry is compared to the protection IDs in CR-8, CR-9, CR-12, and CR-13.

Access Rights Check

Access Rights are stored in a seven-bit field containing permissible access type and two privilege levels affecting the executing instruction:

  • Access types are read, write, execute.

  • Privilege levels are checked for read access and write access, and for kernel and user execution.

 
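The protection check in Table 1-5 amounts to a small comparison loop, sketched below with hypothetical types (the real check is performed in hardware on every access, not by software):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the protection check: when the PSW P-bit is set, the
 * TLB entry's access ID must match one of the four protection IDs
 * held in CR-8, CR-9, CR-12, and CR-13. When the P-bit is clear,
 * the check is waived and treated as passed. */
bool
protection_check(bool psw_p_bit, uint32_t access_id, const uint32_t cr[4])
{
    if (!psw_p_bit)
        return true;               /* checking waived */
    for (int i = 0; i < 4; i++)
        if (cr[i] == access_id)
            return true;           /* one protection ID matched */
    return false;                  /* protection trap */
}
```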

Figure 1-11 “Access control to virtual pages” shows the checkpoints for controlling access to a page of data through the TLB. Two checks are performed: protection check and access rights check. If both checks pass, access is granted to the page referenced by the TLB.

Figure 1-11 Access control to virtual pages

[Access control to virtual pages]

Cache Hits and Misses

  • When the cache line was first copied into the cache, its Physical Page Number was stored in the corresponding cache tag. The cache controller compares the PPN from the tag to the PPN supplied by the TLB.

    • If the PPN in the cache tag matches the PPN from the TLB, a cache hit occurs. The data is present in the cache and is supplied to the CPU.

    • If the PPN in the cache tag does not match the PPN from the TLB, a cache miss occurs: the bytes referenced on the virtual page are not yet in the cache, so the cache line must be loaded from memory. (Typically, these implementations do not load an entire page into the cache at a time; they load one cache line at a time.) The CPU must wait while the data is brought into the cache from main memory.

The time it takes to service a cache miss varies depending on whether the line already present in the cache is clean or dirty. If the existing cache line is dirty, its old contents must be written out to memory before the new contents are read in. If the cache line is "clean" (that is, not modified), it does not have to be written back to main memory, and the penalty is fewer instruction cycles than when a dirty line must be written back.

All PA-RISC machines use a cache write-back policy, meaning that the main memory is updated only when the cache line is replaced.
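The clean-versus-dirty penalty can be modeled with a short sketch that counts memory transactions on a miss (a hypothetical model of the write-back policy above, not a cycle-accurate description):

```c
#include <stdbool.h>

struct line { bool valid, dirty; };

/* Write-back on replacement: a dirty victim line is written out to
 * memory before the new line is read in; a clean victim is simply
 * overwritten. Memory transactions stand in for the cycle penalty. */
int
service_miss(struct line *victim)
{
    int mem_ops = 0;

    if (victim->valid && victim->dirty)
        mem_ops++;                 /* write old contents back to memory */
    mem_ops++;                     /* read new contents in from memory */
    victim->valid = true;
    victim->dirty = false;         /* freshly loaded line starts clean */
    return mem_ops;
}
```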

Figure 1-12 Summary of page retrieval from TLB, Cache, PDIR

[Summary of page retrieval from TLB, Cache, PDIR]

PA-RISC allows for privilege level promotion by using a GATEWAY instruction. This instruction performs an interspace branch to increase the privilege level. The most common example of this in HP-UX is a system call, which changes the privilege level from user to kernel.

Registers

Registers, high-speed memory in the processor's CPU, are used by the software as storage elements that hold data for instruction control flow, computations, interruption processing, protection mechanisms, and virtual memory management.

All computations are performed between registers, or between a register and a constant embedded in an instruction, which minimizes the need to access main memory. This register-intensive approach accelerates the performance of a PA-RISC system. Register memory is much faster than conventional main memory, but it is also much more expensive, and is therefore reserved for processor-specific purposes.

Registers are classified as privileged or non-privileged, according to the privilege level required by the instructions that access them.

Table 1-6 Types of Registers

Type of Register / Purpose
32 General Registers, each 32 bits in size (non-privileged)

Used to hold immediate results or data that is accessed frequently, such as the passing of parameters. Listed are those with uses specified by PA-RISC or HP-UX.

  • GR0 - Permanent Zero

  • GR1 - ADDIL target address

  • GR2 - Return pointer. Contains the instruction offset of the instruction to which to return

  • GR23 - Argument word 3 (arg3)

  • GR24 - Argument word 2 (arg2)

  • GR25 - Argument word 1 (arg1)

  • GR26 - Argument word 0 (arg0)

  • GR27 - Global data pointer (dp)

  • GR28 - Return value

  • GR29 - Return value (double)

  • GR30 - Stack pointer (sp)

7 Shadow Registers (privileged)

Store the contents of GR1, 8, 9, 16, 17, 24, and 25 on interrupt, so that they can be restored on return from interrupt. Numbered SHR0-SHR6.

8 Space Registers, holding a 16-, 24-, or 32-bit space ID (SR5-SR7 are privileged)

Hold the space IDs for the current running process.

  • SR0 - Instruction address space link register used for branch and link external instructions.

  • SR1-SR7 - Used to form virtual addresses for processes.

32 Control Registers, each 32 bits. (most are privileged)

Used to reflect different states of the system, many related primarily to interrupt handling.

  • CR0 - Recovery Counter, used to provide software recovery of hardware faults in fault-tolerant systems and for debugging.

  • CR10 - Low-order bits are known as the Coprocessor Configuration Register (CCR), 8 bits that indicate presence and usability of coprocessors. Bits 0, 1 correspond to the floating point coprocessor; bit 2, the performance monitor coprocessor.

  • CR14 - Interruption Vector Address (IVA)

  • CR16 - Interval Timer. Two internal registers, one counting at a rate between twice and half the implementation-specific "peak instruction rate", the other register containing a 32-bit comparison value. Each processor in a multi-processor system has its own Interval Timer, but they need not be synchronized nor clock at the same frequency.

  • CR17 - Stores the contents of the Instruction Address Space Queue at time of interruption.

  • CR19 - Used to pass an instruction to an interrupt handler.

  • CR20, CR21 - Used to pass a virtual address to an interruption handler.

  • CR26, CR27 - Temporary registers readable by code executing at any privilege level but writable only by privileged code.

64 Floating Point Registers of 32 bits each, or 32 registers of 64 bits each.

Data registers used to hold computations.

  • FP-0L - Status register. Controls arithmetic modes, enables traps, indicates exceptions, results of comparison, and identifies coprocessor implementation.

  • FP-0R through FP-3 - Exception registers, containing information on floating point operations whose execution has completed and caused a delayed trap.

2 Instruction Address Queues, each 64 bits

Two queues 2 elements deep. The front elements of the queues (IASQ_Front and IAOQ_Front) form the virtual address of the current instruction, while the back elements (IASQ_Back and IAOQ_Back) contain the address of the following instruction.

  • Instruction Address Space Queue holds the space ID of the current and following instruction.

  • Instruction Address Offset Queue holds the offset of the instruction for the given space. The high-order 62 bits contain the word offset of the instruction; the 2 low-order bits maintain the privilege level of the instruction.

1 Processor Status Word (PSW), 32 bits (privileged)

Contains the current processor state. When an interruption occurs, the PSW is saved into the Interrupt Processor Status Word (IPSW), to be restored later. The low-order five bits of the PSW are the system mask, defined as mask/unmask or enable/disable bits. Interrupts disabled by a PSW bit are ignored by the processor; interrupts that are masked remain pending until unmasked.
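The packing of the Instruction Address Offset Queue described above can be illustrated with two small helpers that extract the privilege level and word offset from a 64-bit IAOQ value:

```c
#include <stdint.h>

/* The IAOQ packs a word offset in the high-order 62 bits and the
 * privilege level in the 2 low-order bits. */
static unsigned
iaoq_priv(uint64_t iaoq)
{
    return (unsigned)(iaoq & 0x3);   /* low 2 bits: privilege level */
}

static uint64_t
iaoq_offset(uint64_t iaoq)
{
    return iaoq >> 2;                /* high 62 bits: word offset */
}
```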