When a process is fork'd,
a duplicate copy of its parent process forms the basis of the child
process.

Region Type Dictates Complexity

Under the kernel procdup()
routine, the system walks the pregion list of the parent process,
duplicating each pregion for the child process. How this is done
is dictated by the region type. If the region is type RT_SHARED,
a new pregion is created that attaches
to the parent's region. If the region is type RT_PRIVATE,
the region is duplicated first, and then a new pregion
is created and attached to the new region.
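
The walk described above can be sketched as a small Python model. This is an illustrative simulation, not HP-UX kernel source: the Region and Pregion classes, the r_type and pages fields, and the dup_pregions helper are hypothetical names, and the private case copies pages eagerly for brevity rather than modelling copy-on-write. Only RT_SHARED, RT_PRIVATE, r_refcnt, and r_incore are names taken from this document.

```python
from dataclasses import dataclass, field

RT_SHARED = "RT_SHARED"
RT_PRIVATE = "RT_PRIVATE"


@dataclass
class Region:
    r_type: str                      # hypothetical field holding the region type
    pages: list = field(default_factory=list)
    r_refcnt: int = 1                # pregions attached to this region
    r_incore: int = 1                # in-core pregions attached to this region


@dataclass
class Pregion:
    region: Region


def dup_pregions(parent_pregions):
    """Return the child's pregion list, built by walking the parent's list."""
    child = []
    for prp in parent_pregions:
        reg = prp.region
        if reg.r_type == RT_SHARED:
            # Shared: the new pregion attaches to the parent's region.
            reg.r_refcnt += 1
            reg.r_incore += 1
            child.append(Pregion(reg))
        else:
            # Private: duplicate the region first, then attach a new pregion.
            child.append(Pregion(Region(RT_PRIVATE, list(reg.pages))))
    return child
```

In this model a shared text region ends up with two pregions pointing at one Region object, while a private data region yields a second, independent Region.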

Duplicating pregions for Shared Regions

Because a region of type RT_SHARED
is shared by parent and child, fewer changes occur to the pregions
and region: Only a new pregion
must be created and attached to the shared region. A new pregion
is allocated and fields copied from the parent pregion to the child
pregion. The pregion elements used by vhand
(p_agescan, p_ageremain,
and p_stealscan) are initialized
to zero and the child pregion is added to the active pregion
chain just before the stealhand,
to prevent it from being stolen yet. The region elements r_incore
and r_refcnt are incremented to
reflect the number of in-core pregions
accessing the region and the number of pregions,
in-core or paged, accessing the region.
The procedure is considerably more complex when an RT_PRIVATE
region is copied.

Duplicating pregions for Private Regions

Forking a process with a region of type RT_PRIVATE
requires that a new child region be allocated first. The child region's pointers
are then set: r_fstore,
the forward store pointer, is set to the same value as the parent's,
and the vnode's reference
count (v_count) is incremented. r_bstore, the backward
store pointer, is set to the kernel global swapdev_vp,
and its v_count is also incremented.
The child region is attached to the
end of the linked list of active regions. Swap is reserved. If insufficient swap space is
available, fork() fails and returns
the error ENOMEM. The child region's B-tree structures
are initialized and sufficient swap space is reserved for a completely
filled B-tree.
The parent's vfd and
dbd proto values are copied to
the child's B-tree root.
The vfd
proto values in both the parent region and the child region are set so
that all pages of the region are copy-on-write. The B-tree element
b_vproto is set to indicate that
the copy-on-write flag (pg_cw)
must be set in the vfd for any
new vfddbd pair added to the B-tree.
A chunk of vfddbds
is created for the child's B-tree
(equal to each chunk of vfddbds
in the parent's B-tree)
and filled with proto values. The pg_cw
bit is already set to copy-on-write for all default vfds
in the child B-tree's
chunk.
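
One way to picture the proto values is as a default that the B-tree hands out for any page that has no explicit entry. The following is a minimal Python sketch under that assumption, with a dictionary standing in for the B-tree. The Vfd and BTree classes and the lookup, add, and mark_copy_on_write helpers are hypothetical; only pg_v, pg_cw, and b_vproto are names from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Vfd:
    pg_v: bool = False    # valid bit
    pg_cw: bool = False   # copy-on-write bit


class BTree:
    """Sparse page map: pages without an entry take the proto value."""

    def __init__(self, b_vproto=Vfd()):
        self.b_vproto = b_vproto
        self.entries = {}

    def lookup(self, page):
        return self.entries.get(page, self.b_vproto)

    def add(self, page):
        # A newly added entry starts out as a copy of the proto.
        self.entries[page] = self.b_vproto
        return self.entries[page]


def mark_copy_on_write(tree):
    """Set pg_cw in the proto so that every later entry inherits it."""
    tree.b_vproto = replace(tree.b_vproto, pg_cw=True)
```

Entries that already exist are not touched by the proto change in this sketch; they need the per-entry pass that the text describes next.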

Setting copy-on-write when the vfd is valid

Before the chunks of vfddbds
in the child region can be used, the validity of every entry must
be checked. If a vfd
is not valid (that is, its pg_v
is not set), the pg_cw of the parent's
vfd must be set and copied to the
child. If pg_lock is set in the
parent, it must be unset in the child, as locks are not inherited.
Once the vfd is valid, further
modifications are made to the low-level structures: The r_nvalid
element in the child region is incremented to reflect the number
of valid pages. The vfd contains
a pfn (page frame number), which
indexes into the pfdat[] array.
The pfdat entry pf_use
count (number of regions using this page) must be incremented. If the parent vfd's
copy-on-write bit isn't set, the pde
must be set for translations to the page to behave as copy-on-write.
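
The per-entry pass can be sketched in Python as follows. This is a simplified model, not kernel code: it assumes that every entry present in the parent's chunk is marked copy-on-write and copied to the child, and that the page-level bookkeeping (r_nvalid and pf_use) applies only to entries whose valid bit is set. The Vfd and Region classes, the vfds dictionary, the dup_entries helper, and the pf_use dictionary standing in for pfdat[] are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass
class Vfd:
    pg_v: bool = False     # valid bit
    pg_cw: bool = False    # copy-on-write bit
    pg_lock: bool = False  # lock bit, never inherited by the child
    pfn: int = -1          # page frame number, meaningful only when valid


@dataclass
class Region:
    vfds: dict             # page index -> Vfd (stand-in for a vfddbd chunk)
    r_nvalid: int = 0


def dup_entries(parent, child, pf_use):
    """Copy the parent's entries to the child and mark both copy-on-write."""
    for page, pvfd in parent.vfds.items():
        pvfd.pg_cw = True
        cvfd = replace(pvfd, pg_lock=False)   # locks are not inherited
        child.vfds[page] = cvfd
        if cvfd.pg_v:
            child.r_nvalid += 1               # one more valid page in the child
            pf_use[cvfd.pfn] += 1             # one more region uses this frame
```

After the pass, a valid page has a pf_use count of two and carries pg_cw in both regions, so a write by either process can be detected and resolved with a private copy.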

Reconciling the Page and Swap Image

If a page has been written to a swap device, but has since
been modified, the swap-device data now differs from the data in
memory. The disk page must be disassociated from the page in memory
by setting the dbd type to DBD_NONE.
Then, the next time the page is written to a swap device, it will
be assigned a new location.

Everything is now set up from the perspective of the parent's
B-tree for copy-on-write.

Setting the child region's copy-on-write status

The child's r_swalloc
is set to the number of region and B-tree
pages reserved. The r_prev and
r_next are set to link the child
region to the parent region. The kernel chooses new space for the pregion,
rather than copying it from the parent pregion.
This establishes two ranges of virtual addresses (different space,
same offset) translating to a single range of physical addresses. If a parent process accesses its virtual
addresses, it will get a TLB miss fault because the addresses have
been purged from the TLB. If a child process accesses any of its virtual addresses,
it will also get a TLB miss fault because the addresses did not
previously exist in the TLB, and do not exist in HTBL.

Duplicating a Process Address Space to Make the Process copy-on-write

procdup()
creates a duplicate copy of a process based on forktype, parent
process (pp), child process (cp),
parent thread (pt), and child
thread (ct). procdup() allocates memory
for the uarea of the child. (In
fact, procdup() is the routine
that calls createU() to create
the uarea too.) procdup() calls dupvas
to duplicate the parent's virtual address space, based
on the kind of process (fork vs
vfork) being executed. If the process was created by fork,
dupvas duplicates the parent process's
virtual address space; if the process was vfork'd,
the parent's virtual address space is used. dupvas looks for and finds
each private data object, does whatever each requires to be duplicated
(there are special considerations required for text, memory mapping,
data objects, graphics), and when it finishes duplicating the special
objects, calls private_copy or
shared_copy, depending on whether
it is dealing with a private or shared region. If the region is shared, shared_copy
increments the reference count on the region to indicate it is being
shared. If the region is private, private_copy
locks the region and enables the region to be duplicated by calling dupreg().
dupreg()
allocates a new region for the child, duplicates the parent's
vfds and the entire region structure,
then calls do_dupc to duplicate
entries under the region. do_dupc() sets
up a parent-child relationship, and by duplicating the relationship,
sets up the child to be copy-on-write.
It makes sure the parent's region is valid, sets copy-on-write
for the child, sets the translation to rx
(read-execute) only, and duplicates information for every vfddbd combination
in the region. Once do_dupc() completes,
the child process exists as a duplicated version of the parent process.
The child process is attached to the child's address space
and is no longer dependent on the parent. do_dupc then calls
hdl_cw() to update the child's
access rights and make the child copy on write.

Duplicating the uarea for the Child's Process

The createU() routine builds
a uarea and address space for the
child process. The uarea is set
up last for a fork'd process,
to prevent the child process from resuming in the middle of pregion
duplication code. If the process is vfork'd,
the uarea is created during exec().
Until then, the child uses the parent's uarea. When a user process is created with
FORK_PROCESS, a temporary space
is allocated for a working copy of the parent's uarea
to be modified into the child's uarea. The temporary space
will be freed after the uarea is
copied to the new region. fork() updates
the savestate in the parent uarea's
u_pcb just before copying the data.
(vfork() does not do this because
it creates the uarea during exec(),
and the savestate will change immediately.) A region is allocated for the new uarea,
its data structure is initialized, its r_bstore
value set back to the swap device, and the new region is added to
the list of active regions. The uarea
has no r_fstore value, since it
comes with ready-made data. Space is allocated for the uarea's
pregion, which is initialized.
Each uarea has a unique space
ID. The new pregion is marked with the PF_NOPAGE
flag. uarea pregions are unaffected
by vhand because they are not added
to the list of active pregions.
Only if an entire process is swapped out are the uarea's
pages written to a swap device. Once created, the pregion
is attached into the linked list of pregions
connected to the vas. Its pointer is stored in r_pregs,
its p_prpnext set to NULL,
and its r_incore and r_refcnt
set to one. Once swap space is reserved for the uarea
and B-tree pages and the default
dbd is set to DBD_DFILL,
the uarea pages (UPAGES)
are allocated. Each page requires a pfdat
entry from phead (sleeping if none
is available immediately). The pfn
is stored in the vfd, the pg_v
is set as valid, r_nvalid is incremented,
and a pde is created for the physical-to-virtual
translation. The pfdat entry's
P_UAREA and HDLPF_TRANS
flags are set, and the dbd is set
to DBD_NONE. The pointers u_procp
(to the child process) and u_kthreadp
(to the child thread) are pointed to the child uarea.
Conceivably, the child can now run successfully. The current
state is therefore saved in the copied uarea
with a setjmp() call and pointed
to with pcb_sswap. Thus, when
the child first calls the resume() routine,
it detects that pcb_sswap is non-zero
and does a longjmp() to get back
here. The child then returns from procdup() with
the value FORKRTN_CHILD. The parent's open file table is copied to the child
and the copied uarea is copied into the actual pregion. This copy
causes TLB miss faults that cause the pregion's
pdes to be written to the TLB, thus associating the uarea's
virtual address with the physical pages just set up. The process
completes by returning from procdup
with the return value FORKRTN_PARENT.

Reading from the parent's copy-on-write page

When the parent region accesses one of its RT_PRIVATE
pages for read, the processor generates a TLB miss fault, which
the kernel handles as an interrupt. The TLB miss fault handler
finds the pde and inserts the information
(including the new access rights) into the processor's
TLB. On return from the interrupt, the processor retries the read
and is successful, since PDE_AR_CW
allows user-mode read and execute access.

Reading from the child's copy-on-write page

When the child region accesses one of its pages for read,
the TLB miss handler does not find a pde
for the virtual address, because none has been set up yet.
The virtual address was set up in the pregion
structure. If you are not doing copy-on-access (which is now the
default) and the page is needed, the aliased translation must be
made. First a save state
is created. The vas pointer
is taken and the skip list searched to find the pregion
containing the page with this address. If the page translates to more than one virtual
address, the appropriate alias is acquired. The child region fails to access a page for read
and gets a TLB miss, but the miss handler finds a translation and
loads it into the TLB. The routine returns from interrupt and succeeds
in reading the page.
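
The two read paths differ only in whether a pde already exists when the miss occurs. A toy Python model of that difference is shown below. It is a sketch under simplifying assumptions: the Mmu class, its tlb and pdir dictionaries, and the resolve callback (which stands in for the pregion lookup and alias selection) are all hypothetical, and access rights are not modelled.

```python
class Mmu:
    """Toy model: a TLB in front of a page directory."""

    def __init__(self):
        self.tlb = {}     # virtual page -> physical page
        self.pdir = {}    # page directory entries: virtual page -> physical page
        self.misses = 0

    def read(self, vpage, resolve=None):
        """Return the physical page for vpage, handling a TLB miss if needed."""
        if vpage not in self.tlb:
            self.misses += 1
            if vpage not in self.pdir:
                if resolve is None:
                    raise LookupError(f"no translation for page {vpage:#x}")
                # Child case: no pde yet, so build the aliased translation.
                self.pdir[vpage] = resolve(vpage)
            # The miss handler loads the pde into the TLB.
            self.tlb[vpage] = self.pdir[vpage]
        # The retried access now succeeds.
        return self.tlb[vpage]
```

For the parent, the pde is present and only the purged TLB entry has to be reloaded. For the child, the first read also creates the pde, after which both virtual pages translate to the same physical page.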

Faulting In A Page

When regions are initialized, the dbd_data field of the disk block descriptor (dbd)
is set to DBD_DINVAL (0xfffffff) in
all cases. The prototype dbd_type values are set as follows: DBD_FSTORE
for text and initialized data, DBD_DZERO for stack
and uninitialized data.
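
The prototype assignment amounts to a small lookup table, sketched here in Python. The Dbd class, the proto_dbd and fault_action helpers, and the region-kind strings ("text", "data", "stack", "bss") are hypothetical labels chosen for illustration; the constant names and the DBD_DINVAL value are those given in the text.

```python
from dataclasses import dataclass

DBD_FSTORE = "DBD_FSTORE"
DBD_DZERO = "DBD_DZERO"
DBD_DINVAL = 0xFFFFFFF   # value as printed in this document


@dataclass
class Dbd:
    dbd_type: str
    dbd_data: int = DBD_DINVAL   # always starts out invalid


def proto_dbd(kind):
    """Return the prototype dbd for a region kind (hypothetical labels)."""
    if kind in ("text", "data"):      # text and initialized data
        return Dbd(DBD_FSTORE)
    if kind in ("stack", "bss"):      # stack and uninitialized data
        return Dbd(DBD_DZERO)
    raise ValueError(f"unknown region kind: {kind}")


def fault_action(dbd):
    """Name the path a first-touch fault takes for this dbd."""
    if dbd.dbd_type == DBD_DZERO:
        return "zero-fill"
    if dbd.dbd_type == DBD_FSTORE:
        return "pagein"
    raise ValueError(f"unhandled dbd_type: {dbd.dbd_type}")
```

The two return values of fault_action correspond to the two fault paths this document describes: zero-filling a fresh page, or calling a file-system pagein() routine.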
When a page is read for the first time, a TLB miss fault results
because the physical page (and therefore its translation in the
sparse PDIR) does not yet exist. The fault handler is responsible
for bringing in the page and restarting the instruction that faulted.
In determining whether or not the page is valid, the fault handler
determines which pregion in the
faulting process contains the faulting address. The fault code
eventually calls virtual_fault(),
the primary virtual-fault handling routine. The arguments passed
to this routine are the virtual address causing the fault, the virtual
address and virtual space of the pregion,
and a flag indicating read or write access. The kernel searches the B-tree
for the vfd and dbd of
the page. If the valid bit in the vfd
flag is set, another process has read the address into memory already.
If the r_zomb flag is set in the
region, the kernel prints the message "Pid %d killed due to text modification or
page I/O error" and returns SIGKILL,
which the handler sends to the process.

Faulting In a Page of Stack or Uninitialized Data

If the dbd_type value is
set to DBD_DZERO (as is the case
for stack and uninitialized data), the process sets the copy-on-write
bit to zero. The kernel then checks to determine whether the page
pertains to a system process or to a high-priority thread. If neither
and memory is tight, the process sleeps until free memory is driven
down to the priority associated with the process. (In worst case,
a thread might wait until memory is above desfree.) Once the process is restarted, vfd
and dbd pointers are examined to
ensure their continued accuracy. A free pfdat entry
is acquired from phead, its pfn (pf_pfn) placed
in the vfd, the vfd's
valid bit set, and the region's r_nvalid
counter (number of valid pages) incremented. The process changes
dbd_type to DBD_NONE
and dbd_data to 0xfffff0c.
Finally, the virtual-to-physical translation of the page is added
to the sparse PDIR and the page is zeroed.

Faulting in a Page of Text or Initialized Data

If a process has a virtual fault on a DBD_FSTORE
page, the kernel uses the region's r_fstore
vnode pointer
to determine which file-system specific pagein() routine
(for example, ufs_pagein(), nfs_pagein(),
cdfs_pagein(), vx_pagein())
to call. The pagein() routines
are used to recover the correct page from a free list of memory
pages or to read in a correct page from disk. The pagein routine gets information
about the page being faulted from the vm_pagein_init() routine,
which gets the vfd/dbd pairs, sets
up the region index, and ascertains that no valid page already exists. One page must be reserved. Then vm_no_io_required() is
called to determine if the page can be satisfied locally, either
by a zero-filled page (sparse file) or from the page cache. vm_no_io_required() checks
for the faulted page in the page cache: vm_no_io_required
acquires the device vnode pointer (devvp) that points to the actual
disk device (such as /dev/vg00/lvol5) rather than to the file referenced
by r_fstore. If the dbd data
field is DBD_DINVAL, vm_no_io_required
gets the actual location of the disk block on the disk device and
stores this value in the dbd data
field. vm_no_io_required
calls pageincache() with the device
vnode pointer and the dbd_data
to determine whether the faulted page is on the hash list. The pageincache() routine
hashes on the vnode pointer and
data to choose a pfdat pointer in phash[].
The routine walks the pf_hchain chain
of pfdat entries looking for a
matching vnode pointer (pf_devvp)
and data value (pf_data). If it
finds a match, it removes it from the free list. If pageincache()
returns a pfdat entry, the region's
valid page count (r_nvalid) is
incremented, the vfd is updated
with the pfn (pf_pfn),
and a virtual-to-physical translation for the page to the sparse
PDIR is added (if it had been removed).
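
The hash lookup can be sketched in Python as follows. This is a simplified model, not the kernel routine: the PageCache class, its release helper, the bucket count, and the use of Python's built-in hash() are assumptions made for illustration. The names phash, pf_hchain, pf_devvp, pf_data, and pf_pfn come from the text.

```python
from dataclasses import dataclass


@dataclass
class Pfdat:
    pf_pfn: int      # page frame number
    pf_devvp: str    # device vnode (a device path stands in for the pointer)
    pf_data: int     # block location on that device


class PageCache:
    def __init__(self, nbuckets=8):
        # phash[]: each bucket plays the role of one pf_hchain chain.
        self.phash = [[] for _ in range(nbuckets)]
        self.free = []

    def _bucket(self, devvp, data):
        return self.phash[hash((devvp, data)) % len(self.phash)]

    def release(self, pf):
        """Put a page on the free list while keeping it findable by hash."""
        self._bucket(pf.pf_devvp, pf.pf_data).append(pf)
        self.free.append(pf)

    def pageincache(self, devvp, data):
        """Return the matching pfdat and reclaim it from the free list."""
        for pf in self._bucket(devvp, data):
            if pf.pf_devvp == devvp and pf.pf_data == data:
                if pf in self.free:
                    self.free.remove(pf)
                return pf
        return None
```

A hit means the page contents are still in memory, so the caller can reuse the frame without scheduling any disk I/O.
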
On successfully finding the page in the free list, vm_no_io_required() returns
a 1, meaning that no I/O is required to retrieve the page. This
is called a soft page fault. If vm_no_io_required() cannot
find the page locally, it returns 0, meaning the page must be faulted
in from disk.

Retrieving the Page of Text or Initialized Data from Disk

If the required page is not found in the free list, the pagein() routines
refer to dbd to ascertain which
page to fetch. (The information had been stored in the dbd
by vm_no_io_required().) The pagein()
routines also schedule read-ahead pages for I/O, the number of read-ahead
pages based on the value of p_pagein
in the pregion. This value is adjusted based on whether the file
is being accessed at random or sequentially. If it is being accessed at random, a minimal number of read-ahead
pages are required; if sequentially, a maximal number of read-ahead
pages are desired, up to the end of the pregion's
pages. Each time I/O is scheduled for the pregion,
the p_nextfault field in the pregion
structure is set to the page expected to be read next if further
reading is required. If the next page fault matches p_nextfault,
the file is being accessed sequentially. In this case, the value
of p_pagein is multiplied by two,
up to maxpagein_size, a global
set to 64. If p_strength is less
than 100 (defined as PURELY_SEQUENTIAL),
it is also incremented. If the next fault does not match p_nextfault,
the file is being accessed at random. In this case, the value of
p_pagein is divided by two, down
to no less than minpagein_size,
a global set to 1. If p_strength
is greater than -100 (defined as PURELY_RANDOM),
it is also decremented.
A page of memory is allocated from phead,
a virtual-to-physical translation added to the sparse PDIR, the
I/O scheduled from the disk to the page, and the process put to
sleep awaiting the non-read-ahead I/O to complete (the process does
not await read-ahead I/O to complete). The vfd
is marked valid. The dbd is left
with dbd_type set to DBD_FSTORE
and dbd_data set to the block address
on the disk. Regardless of whether the page data is retrieved from zero-fill,
free list, or disk, the page directory entry (pde)
has been touched. The instruction is retried and gets a TLB miss
fault; the miss handler writes the modified pde data into the TLB;
the instruction is retried again and succeeds. p_strength varies between
-100 and 100; p_pagein varies by
powers of two between 1 and 64.
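
The read-ahead adjustment described above can be sketched in Python. This is a minimal model under stated assumptions: the Pregion class, the note_fault helper, and the initial field values are hypothetical, and the sketch assumes that each fault reads exactly p_pagein pages, so the next expected fault is the faulting page plus p_pagein. The limits and constant names are those given in the text.

```python
from dataclasses import dataclass

MINPAGEIN_SIZE = 1
MAXPAGEIN_SIZE = 64
PURELY_SEQUENTIAL = 100
PURELY_RANDOM = -100


@dataclass
class Pregion:
    p_pagein: int = MINPAGEIN_SIZE   # pages to read per fault (assumed start)
    p_strength: int = 0              # access-pattern score (assumed start)
    p_nextfault: int = -1            # page expected to fault next


def note_fault(prp, page):
    """Update read-ahead state for a fault on page; return pages to read."""
    if page == prp.p_nextfault:
        # Sequential access: double the read-ahead, up to the maximum.
        prp.p_pagein = min(prp.p_pagein * 2, MAXPAGEIN_SIZE)
        if prp.p_strength < PURELY_SEQUENTIAL:
            prp.p_strength += 1
    else:
        # Random access: halve the read-ahead, down to the minimum.
        prp.p_pagein = max(prp.p_pagein // 2, MINPAGEIN_SIZE)
        if prp.p_strength > PURELY_RANDOM:
            prp.p_strength -= 1
    prp.p_nextfault = page + prp.p_pagein
    return prp.p_pagein
```

A run of sequential faults grows the window from 1 toward 64 pages, and a single out-of-order fault halves it again.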