When a process is fork'd, a duplicate copy of its parent process forms the basis of the child process.

Region Type Dictates Complexity

Under the kernel procdup() routine, the system walks the pregion list of the parent process, duplicating each pregion for the child process. How this is done is dictated by the region type.

  • If the region is type RT_SHARED, a new pregion is created that attaches to the parent's region.

  • If the region is type RT_PRIVATE, the region is duplicated first, and then a new pregion is created and attached to the new region.

Duplicating pregions for Shared Regions

Because a region of type RT_SHARED is shared by parent and child, fewer changes occur to the pregions and region: Only a new pregion must be created and attached to the shared region.

  • A new pregion is allocated and fields copied from the parent pregion to the child pregion.

  • The pregion elements used by vhand (p_agescan, p_ageremain, and p_stealscan) are initialized to zero and the child pregion is added to the active pregion chain just before the stealhand, to prevent it from being stolen yet.

  • The region elements r_incore and r_refcnt are incremented to reflect the number of in-core pregions accessing the region and the number of pregions, in-core or paged, accessing the region.
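The dispatch on region type, and the reference counting for a shared region, can be sketched in a few lines of Python. This is a simplified model rather than kernel code: the Region and Pregion classes, the p_reg field name, and the dup_region() stand-in are assumptions made for illustration. Only RT_SHARED, RT_PRIVATE, r_refcnt, and r_incore are taken from the text.

```python
from dataclasses import dataclass

RT_SHARED, RT_PRIVATE = "RT_SHARED", "RT_PRIVATE"


@dataclass
class Region:
    r_type: str
    r_refcnt: int = 1   # pregions attached, in core or paged
    r_incore: int = 1   # in-core pregions attached


@dataclass
class Pregion:
    p_reg: Region       # hypothetical field name for the attached region


def dup_region(parent_region):
    """Stand-in for the private-region copy; the real work is far larger."""
    return Region(r_type=parent_region.r_type)


def dup_pregions(parent_pregions):
    """Walk the parent's pregion list and build the child's list."""
    child_pregions = []
    for prp in parent_pregions:
        if prp.p_reg.r_type == RT_SHARED:
            # Shared: attach a new pregion to the parent's own region.
            region = prp.p_reg
            region.r_refcnt += 1
            region.r_incore += 1
        else:
            # Private: duplicate the region first, then attach to the copy.
            region = dup_region(prp.p_reg)
        child_pregions.append(Pregion(p_reg=region))
    return child_pregions
```

In this model a shared text region ends up with both counters at 2 after one fork, while a private data region gets a fresh region whose counters start at 1.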

The procedure is considerably more complex when an RT_PRIVATE region is copied.

Figure 1-31 Duplicating pregions with shared regions

[Duplicating pregions with shared regions]

Duplicating pregions for Private Regions

Forking a process with a region of type RT_PRIVATE requires that a new child region be allocated first.

  • The child region's pointers are set:

    • r_fstore, the forward store pointer, is set to the same value as the parent's, and the vnode's reference count (v_count) is incremented.

    • r_bstore, the backward store pointer, is set to the kernel global swapdev_vp, and its v_count is also incremented.

  • The child region is attached to the end of the linked list of active regions.

  • Swap is reserved. If insufficient swap space is available, fork() fails and returns the error ENOMEM.

  • The child region's B-tree structures are initialized and sufficient swap space is reserved for a completely filled B-tree.

Figure 1-32 Duplicating a child process of type RT_PRIVATE

[Duplicating a child process of type RT_PRIVATE]

  • The parent's vfd and dbd proto values are copied to the child's B-tree root.

  • The vfd proto in both the parent region and the child region are set so that all pages of the region are copy-on-write.

  • The B-tree element b_vproto is set to indicate that the copy-on-write flag (pg_cw) must be set in the vfd for any new vfddbd pair added to the B-tree.

  • A chunk of vfddbds is created for the child's B-tree (equal to each chunk of vfddbds in the parent's B-tree) and filled with proto values. The pg_cw bit is already set to copy-on-write for all default vfds in the child B-tree's chunk.

Setting copy-on-write when the vfd is valid

Before the chunks of vfddbds in the child region can be used, the validity of every entry must be checked.

  • If a vfd is valid (that is, its pg_v is set), the pg_cw of the parent's vfd must be set and copied to the child. If pg_lock is set in the parent, it must be unset in the child, as locks are not inherited.

Once the vfd is valid, further modifications are made to the low-level structures:

  • The r_nvalid element in the child region is incremented to reflect the number of valid pages.

  • The vfd contains a pfn (page frame number), which indexes into the pfdat[] array. The pfdat entry pf_use count (number of regions using this page) must be incremented.

  • If the parent vfd's copy-on-write bit isn't set, the pde must be set for translations to the page to behave as copy-on-write.
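The per-entry work for a valid vfd can be modeled as below. This is a minimal sketch that assumes the steps apply to entries whose valid bit is set, as the section heading indicates. The Vfd class, the dup_valid_vfd() name, and the pde_updates list (which only records which pages would need their pde changed) are hypothetical; the flag and counter names follow the text.

```python
from dataclasses import dataclass, replace


@dataclass
class Vfd:
    pg_v: bool = False     # valid bit
    pg_cw: bool = False    # copy-on-write bit
    pg_lock: bool = False  # page locked in memory
    pfn: int = 0           # page frame number, index into pfdat[]


def dup_valid_vfd(parent, child_region, pf_use, pde_updates):
    """Copy one valid parent vfd into the child and update the counters."""
    if not parent.pg_cw:
        # Translation was not copy-on-write before: the pde must change.
        pde_updates.append(parent.pfn)
    parent.pg_cw = True
    # The child inherits the entry, but never the lock.
    child = replace(parent, pg_lock=False)
    child_region["r_nvalid"] += 1   # one more valid page in the child
    pf_use[parent.pfn] += 1         # one more region using this frame
    return child
```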

Reconciling the Page and Swap Image

If a page has been written to a swap device, but has since been modified, the swap-device data now differs from the data in memory. The disk page must be disassociated from the page in memory by setting the dbd type to DBD_NONE. Then, the next time the page is written to a swap device, it will be assigned a new location.

Everything is now set up from the perspective of the parent's B-tree for copy-on-write.

Setting the child region's copy-on-write status

  • The child's r_swalloc is set to the number of region and B-tree pages reserved.

  • The r_prev and r_next are set to link the child region to the parent region.

  • The kernel chooses new space for the pregion, rather than copying it from the parent pregion. This establishes two ranges of virtual addresses (different space, same offset) translating to the single range of physical address.

    • If a parent process accesses its virtual addresses, it will get a TLB miss fault because the addresses have been purged from the TLB.

    • If a child process accesses any of its virtual addresses, it will also get a TLB miss fault because the addresses did not previously exist in the TLB, and do not exist in HTBL.

Duplicating a Process Address Space to Make the Process copy-on-write

  • procdup() creates a duplicate copy of a process based on forktype, parent process (pp), child process (cp), and parent thread (pt) and child thread (ct).

    procdup() allocates memory for the uarea of the child. (In fact, procdup() is the routine that calls createU() to create the uarea too.)

    procdup() calls dupvas to duplicate the parent's virtual address space, based on the kind of process (fork vs vfork) being executed.

  • If the process was created by fork, dupvas duplicates the parent process's virtual address space; if the process was vfork'd, the parent's virtual address space is used.

    dupvas looks for and finds each private data object, does whatever each requires to be duplicated (there are special considerations required for text, memory mapping, data objects, graphics), and when it finishes duplicating the special objects, calls private_copy or shared_copy, depending on whether it is dealing with a private or shared region.

    • If the region is shared, shared_copy increments the reference count on the region to indicate it is being shared.

    • If the region is private, private_copy locks the region and enables the region to be duplicated by calling dupreg().

  • dupreg() allocates a new region for the child, duplicates the parent's vfds and the entire region structure, then calls do_dupc to duplicate entries under the region.

  • do_dupc() sets up a parent-child relationship, and by duplicating the relationship, sets up the child to be copy-on-write. It makes sure the parent's region is valid, sets copy-on-write for the child, sets the translation as rx (read-execute) only, and duplicates information for every vfddbd combination in the region.

    Once do_dupc() completes, the child process exists as a duplicated version of the parent process. The child process is attached to the child's address space and is no longer dependent on the parent.

  • do_dupc then calls hdl_cw() to update the child's access rights and make the child copy on write.
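The call sequence just described can be traced with a toy model. The Python functions below borrow the kernel routine names from the text, but their signatures, the dictionary-based region, and the trace list are invented for illustration and do not reflect the real interfaces.

```python
def procdup(regions, forktype, trace):
    trace.append("procdup")
    return dupvas(regions, forktype, trace)


def dupvas(regions, forktype, trace):
    trace.append("dupvas")
    if forktype == "vfork":
        return regions  # the child uses the parent's address space
    return [
        private_copy(r, trace) if r["type"] == "RT_PRIVATE" else shared_copy(r, trace)
        for r in regions
    ]


def shared_copy(region, trace):
    trace.append("shared_copy")
    region["refcnt"] += 1  # one more user of the same region
    return region


def private_copy(region, trace):
    trace.append("private_copy")
    return dupreg(region, trace)


def dupreg(region, trace):
    trace.append("dupreg")
    child = dict(region, refcnt=1)  # new region for the child
    do_dupc(region, child, trace)
    return child


def do_dupc(parent, child, trace):
    trace.append("do_dupc")
    parent["cow"] = child["cow"] = True
    hdl_cw(child, trace)


def hdl_cw(region, trace):
    trace.append("hdl_cw")
    region["access"] = "rx"  # read-execute only until the first write
```

Forking a process with a single private region yields the trace procdup, dupvas, private_copy, dupreg, do_dupc, hdl_cw in this model.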

Duplicating the uarea for the Child's Process

The createU() routine builds a uarea and address space for the child process. The uarea is set up last for a fork'd process, to prevent the child process from resuming in the middle of pregion duplication code. If the process is vfork'd, the uarea is created during exec(). Until then, the child uses the parent's uarea.

  • When a user process is created with FORK_PROCESS, a temporary space is allocated for a working copy of the parent's uarea to be modified into the child's uarea. The temporary space will be freed after the uarea is copied to the new region. fork() updates the savestate in the parent uarea's u_pcb just before copying the data. (vfork() does not do this because it creates the uarea during exec(), and the savestate will change immediately.)

  • A region is allocated for the new uarea, its data structure is initialized, its r_bstore value set back to the swap device, and the new region is added to the list of active regions. The uarea has no r_fstore value, since it comes with ready-made data.

  • Space is allocated for the uarea's pregion, which is initialized. Each uarea has a unique space ID. The new pregion is marked with the PF_NOPAGE flag. uarea pregions are unaffected by vhand because they are not added to the list of active pregions. Only if an entire process is swapped out are the uarea's pages written to a swap device.

  • Once created, the pregion is attached into the linked list of pregions connected to the vas. Its pointer is stored in r_pregs, its p_prpnext set to NULL, and its r_incore and r_refcnt set to one.

  • Once swap space is reserved for the uarea and B-tree pages and the default dbd is set to DBD_DFILL, the uarea pages (UPAGES) are allocated. Each page requires a pfdat entry from phead (sleeping if none is available immediately). The pfn is stored in the vfd, the pg_v bit is set to mark the page valid, r_nvalid is incremented, and a pde is created for the physical-to-virtual translation. The pfdat entry's P_UAREA and HDLPF_TRANS flags are set, and the dbd is set to DBD_NONE.

  • In the child uarea, the pointers u_procp and u_kthreadp are set to point to the child process and the child thread, respectively.

Conceivably, the child can now run successfully. The current state is therefore saved in the copied uarea with a setjmp() call and pointed to with pcb_sswap. Thus, when the child first calls the resume() routine, it detects that pcb_sswap is non-zero and does a longjmp() to get back here. The child then returns from procdup() with the value FORKRTN_CHILD.

The parent's open file table is copied to the child and the copied uarea is copied into the actual pregion. This copy causes TLB miss faults that cause the pregion's pdes to be written to the TLB, thus associating the uarea's virtual address with the physical pages just set up. The process completes by returning from procdup with the return value FORKRTN_PARENT.

Reading from the parent's copy-on-write page

When the parent region accesses one of its RT_PRIVATE pages for read, the processor generates a TLB miss fault, which the kernel handles as an interrupt. The TLB miss fault handler finds the pde and inserts the information (including the new access rights) into the processor's TLB. On return from the interrupt, the processor retries the read and is successful, since PDE_AR_CW allows user-mode read and execute access.

Figure 1-33 The first time a read is done to a copy-on-write page

[The first time a read is done to a copy-on-write page]

Reading from the child's copy-on-write page

When the child region accesses one of its pages for read, the TLB miss handler does not find a pde for the virtual address, because none has been set up yet. The virtual address was set up in the pregion structure. If you are not doing copy-on-access (which is now the default) and the page is needed, the aliased translation must be made.

  • First a save state is created.

  • The vas pointer is taken and the skip list searched to find the pregion containing the page with this address.

  • If the page translates to more than one virtual address, the appropriate alias is acquired.

  • The child region fails to access a page for read and gets a TLB miss, but the miss handler finds a translation and loads it into the TLB.

  • The routine returns from interrupt and succeeds in reading the page.

Faulting In A Page

When regions are initialized, the dbd_data field of the disk block descriptor (dbd) is set to DBD_DINVAL (0xfffffff) in all cases. The prototype dbd_type values are set as follows:

  • DBD_FSTORE for text and initialized data,

  • DBD_DZERO for stack and uninitialized data.

When a page is read for the first time, a TLB miss fault results because the physical page (and therefore its translation in the sparse PDIR) does not yet exist. The fault handler is responsible for bringing in the page and restarting the instruction that faulted. In determining whether or not the page is valid, the fault handler determines which pregion in the faulting process contains the faulting address. The fault code eventually calls virtual_fault(), the primary virtual-fault handling routine. The arguments passed to this routine are the virtual address causing the fault, the virtual address and virtual space of the pregion, and a flag indicating read or write access.

The kernel searches the B-tree for the vfd and dbd of the page. If the valid bit in the vfd flag is set, another process has read the address into memory already. If the r_zomb flag is set in the region, the program prints the message "Pid %d killed due to text modification or page I/O error" and returns SIGKILL, which the handler sends to the process.

Faulting In a Page of Stack or Uninitialized Data

If the dbd_type value is set to DBD_DZERO (as is the case for stack and uninitialized data), the process sets the copy-on-write bit to zero. The kernel then checks to determine whether the page pertains to a system process or to a high-priority thread. If neither is the case and memory is tight, the process sleeps until free memory is driven down to the priority associated with the process. (In the worst case, a thread might wait until memory is above desfree.)

Once the process is restarted, vfd and dbd pointers are examined to ensure their continued accuracy. A free pfdat entry is acquired from phead, its pfn (pf_pfn) placed in the vfd, the vfd's valid bit set, and the region's r_nvalid counter (number of valid pages) incremented. The process changes dbd_type to DBD_NONE and dbd_data to 0xfffff0c. Finally, the virtual-to-physical translation of the page is added to the sparse PDIR and the page is zeroed.
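The bookkeeping for a zero-fill fault can be sketched as follows. This is a simplified model under stated assumptions: the dictionaries standing in for the vfd, dbd, region, and sparse PDIR, the fault_in_zero_page() name, and the 4096-byte page size are all illustrative, and the memory-pressure sleep is omitted. The field names and constant values follow the text.

```python
DBD_DZERO, DBD_NONE = "DBD_DZERO", "DBD_NONE"
PAGE_SIZE = 4096  # assumed page size for this sketch


def fault_in_zero_page(vaddr, vfd, dbd, region, phead, pdir, memory):
    """Satisfy a fault on a DBD_DZERO page with a zeroed free page."""
    if dbd["dbd_type"] != DBD_DZERO:
        raise ValueError("not a zero-fill page")
    vfd["pg_cw"] = False              # copy-on-write bit set to zero
    pfn = phead.pop(0)                # take a free pfdat entry from phead
    vfd["pfn"] = pfn                  # store its pf_pfn in the vfd
    vfd["pg_v"] = True                # mark the vfd valid
    region["r_nvalid"] += 1           # one more valid page in the region
    dbd["dbd_type"] = DBD_NONE        # no disk copy exists for this page
    dbd["dbd_data"] = 0xfffff0c
    pdir[vaddr] = pfn                 # add the virtual-to-physical translation
    memory[pfn] = bytes(PAGE_SIZE)    # zero the page
    return pfn
```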

Figure 1-34 Checking the free list to fault in a DBD_FSTORE page

[Checking the free list to fault in a DBD_FSTORE page]

Faulting in a Page of Text or Initialized Data

If a process has a virtual fault on a DBD_FSTORE page, the kernel uses the region's r_fstore pointer to the vnode to determine which file-system specific pagein() routine (for example, ufs_pagein(), nfs_pagein(), cdfs_pagein(), vx_pagein()) to call. The pagein() routines are used to recover the correct page from a free list of memory pages or to read in a correct page from disk.

The pagein routine gets information about the page being faulted from the vm_pagein_init() routine, which gets the vfd/dbd pairs, sets up the region index, and ascertains that no valid page already exists.

One page must be reserved. Then vm_no_io_required() is called to determine if the page can be satisfied locally, either by a zero-filled page (sparse file) or from the page cache.

vm_no_io_required() checks for the faulted page in the page cache:

  • vm_no_io_required acquires the device vnode pointer (devvp) that points to the actual disk device (such as /dev/vg00/lvol5) rather than to the file referenced by r_fstore.

  • If the dbd data field is DBD_DINVAL, vm_no_io_required gets the actual location of the disk block on the disk device and stores this value in the dbd data field.

  • vm_no_io_required calls pageincache() with the device vnode pointer and the dbd_data to determine whether the faulted page is on the hash list.

  • The pageincache() routine hashes on the vnode pointer and data to choose a pfdat pointer in phash[]. The routine walks the pf_hchain chain of pfdat entries looking for a matching vnode pointer (pf_devvp) and data value (pf_data). If it finds a match, it removes it from the free list.

  • If pageincache() returns a pfdat entry, the region's valid page count (r_nvalid) is incremented, the vfd is updated with the pfn (pf_pfn), and a virtual-to-physical translation for the page is added to the sparse PDIR (if it had been removed).

On successfully finding the page in the free list, vm_no_io_required() returns a 1, meaning that no I/O is required to retrieve the page. This is called a soft page fault.

If vm_no_io_required() cannot find the page locally, it returns 0, meaning the page must be faulted in from disk.
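The soft-fault path can be illustrated with a small hash-table model. The Python functions below reuse the names pageincache and vm_no_io_required, but their signatures, the bucket() helper, the NBUCKETS value, and the on_free_list flag are assumptions for this sketch; the real routines take different arguments and do considerably more.

```python
from dataclasses import dataclass

NBUCKETS = 8  # assumed size of phash[] for this sketch


@dataclass
class Pfdat:
    pf_pfn: int
    pf_devvp: str          # device vnode, modeled here as a path string
    pf_data: int           # disk block address
    on_free_list: bool = True


def bucket(devvp, data):
    """Hash on the device vnode and block address to pick a chain."""
    return hash((devvp, data)) % NBUCKETS


def pageincache(phash, devvp, data):
    """Walk the hash chain; on a match, reclaim the page from the free list."""
    for pfd in phash.get(bucket(devvp, data), []):
        if pfd.pf_devvp == devvp and pfd.pf_data == data:
            pfd.on_free_list = False
            return pfd
    return None


def vm_no_io_required(phash, devvp, data, vaddr, vfd, region, pdir):
    """Return 1 for a soft fault (no I/O needed), 0 if disk I/O is required."""
    pfd = pageincache(phash, devvp, data)
    if pfd is None:
        return 0
    region["r_nvalid"] += 1
    vfd["pfn"] = pfd.pf_pfn
    pdir[vaddr] = pfd.pf_pfn   # restore the translation if it was removed
    return 1
```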

Retrieving the Page of Text or Initialized Data from Disk

If the required page is not found in the free list, the pagein() routines refer to dbd to ascertain which page to fetch. (The information had been stored in the dbd by vm_no_io_required().) The pagein() routines also schedule read-ahead pages for I/O, the number of read-ahead pages based on the value of p_pagein in the pregion. This value is adjusted based on whether the file is being accessed at random or sequentially.

Figure 1-35 DBD_FSTORE fault of data not in the free list

[DBD_FSTORE fault of data not in the free list]

If it is being accessed at random, a minimal number of read-ahead pages are required; if sequentially, a maximal number of read-ahead pages are desired, up to the end of the pregion's pages.

  • Each time I/O is scheduled for the pregion, the p_nextfault field in the pregion structure is set to the page expected to be read next if further reading is required.

  • If the next page fault matches p_nextfault, the file is being accessed sequentially. In this case, the value of p_pagein is multiplied by two, up to maxpagein_size, a global set to 64. If p_strength is less than 100 (defined as PURELY_SEQUENTIAL), it is also incremented.

  • If the next fault does not match p_nextfault, the file is being accessed at random. In this case, the value of p_pagein is divided by two, down to no less than minpagein_size, a global set to 1. If p_strength is greater than -100 (defined as PURELY_RANDOM), it is also decremented.

A page of memory is allocated from phead, a virtual-to-physical translation added to the sparse PDIR, the I/O scheduled from the disk to the page, and the process put to sleep awaiting the non-read-ahead I/O to complete (the process does not await read-ahead I/O to complete). The vfd is marked valid. The dbd is left with dbd_type set to DBD_FSTORE and dbd_data set to the block address on the disk.

Regardless of whether the page data is retrieved from zero-fill, free list, or disk, the page directory entry (pde) has been touched. The instruction is retried and gets a TLB miss fault; the miss handler writes the modified pde data into the TLB; the instruction is retried again and succeeds.

p_strength varies between -100 and 100; p_pagein varies by powers of two between 1 and 64.
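The read-ahead adjustment described above amounts to a doubling and halving rule with clamps. The sketch below models it in Python; the adjust_readahead() function and its argument order are hypothetical, while the limits and the p_pagein, p_strength, and p_nextfault names come from the text.

```python
MAXPAGEIN_SIZE = 64        # upper bound on read-ahead pages
MINPAGEIN_SIZE = 1         # lower bound on read-ahead pages
PURELY_SEQUENTIAL = 100    # ceiling for p_strength
PURELY_RANDOM = -100       # floor for p_strength


def adjust_readahead(p_pagein, p_strength, fault_page, p_nextfault):
    """Return the updated (p_pagein, p_strength) after one page fault."""
    if fault_page == p_nextfault:
        # Sequential access: double the read-ahead, up to the maximum.
        p_pagein = min(p_pagein * 2, MAXPAGEIN_SIZE)
        if p_strength < PURELY_SEQUENTIAL:
            p_strength += 1
    else:
        # Random access: halve the read-ahead, down to the minimum.
        p_pagein = max(p_pagein // 2, MINPAGEIN_SIZE)
        if p_strength > PURELY_RANDOM:
            p_strength -= 1
    return p_pagein, p_strength
```

Starting from p_pagein of 1, six consecutive sequential faults reach the cap of 64, and a single random fault after that halves the value to 32.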