This section provides information on parallel execution.

Transforming Eligible Loops for Parallel Execution (+Oparallel)
The +Oparallel option causes the compiler to transform eligible loops for parallel execution on multiprocessor machines. If you link separately from the compile line and you compiled with the +Oparallel option, you must link with the cc command and specify the +Oparallel option to link in the right startup files and runtime support.

When a program is compiled with the +Oparallel option, the compiler looks for opportunities for parallel execution in loops and generates parallel code to execute the loop on the number of processors set by the MP_NUMBER_OF_THREADS environment variable, discussed below. By default, this is the number of processors on the executing machine.

For a discussion of parallelization, including how to use the +Oparallel option, see "Parallelizing C Programs" below. For more detail on +Oparallel, see the description in "Controlling Specific Optimizer Features" earlier in this chapter.
Environment Variable for Parallel Programs

The environment variable MP_NUMBER_OF_THREADS is available for use with parallel programs. It enables you to set the number of processors that are to execute your program in parallel. If you do not set this variable, it defaults to the number of processors on the executing machine.

On the C shell, the following command sets MP_NUMBER_OF_THREADS to indicate that programs compiled for parallel execution can execute on two processors:

setenv MP_NUMBER_OF_THREADS 2
If you use the Korn shell, the equivalent command is:

export MP_NUMBER_OF_THREADS=2
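Although this variable is consulted by the parallel runtime support, a program can also inspect it directly with the standard getenv routine. The following is a minimal sketch; the messages are illustrative only:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* The parallel runtime reads this variable; a program can, too. */
    char *s = getenv("MP_NUMBER_OF_THREADS");

    if (s != NULL)
        printf("MP_NUMBER_OF_THREADS is set to %s\n", s);
    else
        printf("unset: the runtime defaults to the processor count\n");
    return 0;
}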
Parallelizing C Programs

The following sections discuss how to compile C programs for parallel execution and describe inhibitors to parallelization.

Compiling Code for Parallel Execution

The following command lines compile (without linking) three source files: x.c, y.c, and z.c. The files x.c and y.c are compiled for parallel execution. The file z.c is compiled for serial execution, even though its object file will be linked with x.o and y.o.

cc +O3 +Oparallel -c x.c y.c
cc +O3 -c z.c
The following command line links the three object files, producing the executable file para_prog:

cc +O3 +Oparallel -o para_prog x.o y.o z.o
As this command line implies, if you compile and link separately, you must use cc, not ld. The command line to link must also include the +Oparallel and +O3 options in order to link in the right startup files and runtime support.

NOTE: To ensure the best performance from a parallel program, do not run more than one parallel program on a multiprocessor machine at the same time. Running two or more parallel programs simultaneously, or running one parallel program on a heavily loaded system, will slow performance. You should run a parallel-executing program at a higher priority than any other user program; see rtprio(1) for information about setting real-time priorities.
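For example, a command of the following form starts the program at real-time priority 64 (an arbitrary value chosen here for illustration; see rtprio(1) for the valid range and the privileges required):

rtprio 64 para_prog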
HP-UX 10.20 users: At runtime, compiler-inserted code performs a check to determine whether a call is to a system routine or to a user-defined routine with the same name as a system routine. If the call is to a system routine, the code inhibits parallel execution. If your program makes explicit use of threads, do not attempt to parallelize it.

Profiling Parallelized Programs

Profiling a program that has been compiled for parallel execution is performed in much the same way as it is for non-parallel programs:

1. Compile the program with the -G option.
2. Run the program to produce profiling data.
3. Run gprof against the program.
4. View the output from gprof.

The differences are:

- Running the program in Step 2 produces a gmon.out file for the master process and gmon.out.1, gmon.out.2, and so on for each of the slave processes. Thus, if your program is to execute on two processors, Step 2 will produce two files, gmon.out and gmon.out.1.

- The flat profile that you view in Step 4 indicates loops that were parallelized with the following notation:

routine_name##pr_line_0123

where routine_name is the name of the routine containing the loop, pr (parallel region) indicates that the loop was parallelized, and 0123 is the line number of the beginning of the loop or loops that are parallelized.
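Assuming a hypothetical source file prog.c and execution on two processors, the whole sequence might look like this:

cc +O3 +Oparallel -G -o para_prog prog.c   # Step 1: compile for profiling
para_prog                                  # Step 2: writes gmon.out and gmon.out.1
gprof para_prog gmon.out                   # Steps 3 and 4: master process
gprof para_prog gmon.out.1                 # ...and the slave process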
Conditions Inhibiting Loop Parallelization

The following sections describe different conditions that can inhibit parallelization. Additionally, +Onoloop_transform and +Onoinline may be helpful options if you experience any problem while using +Oparallel.

Calling Routines with Side Effects

The compiler will not parallelize any loop containing a call to a routine that has side effects. A routine has side effects if it does any of the following (an example follows the list):

- Modifies an extern or static variable
- Redefines variables that are local to the calling routine
- Calls another subroutine or function that does any of the above
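As an illustration (the names here are hypothetical), the loop below cannot be parallelized because each iteration calls a routine that modifies a static variable:

static int count;              /* file-scope static variable */

void record(int x)
{
    count += x;                /* side effect: modifies a static variable */
}

void process(int *a, int n)
{
    int i;
    for (i = 0; i < n; i++)    /* not parallelized: record() has side effects */
        record(a[i]);
}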
Indeterminate Iteration Counts

If the compiler determines that a loop's iteration count cannot be computed at runtime before the loop starts to execute, it will not parallelize the loop. The reason for this precaution is that the runtime code must know the iteration count in order to know how many iterations to distribute to the different processors for execution.

The following conditions can prevent a runtime count (an example follows the list):

- The loop is an infinite loop.
- A conditional break statement or goto out of the loop appears in the loop.
- The loop modifies either the loop-control or loop-limit variable.
- The loop is a while construct and the condition being tested is defined within the loop.
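For example, a conditional break makes the trip count unknowable before the loop executes, so a loop like the following (with hypothetical names) is not parallelized:

void scale(int *a, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (a[i] < 0)          /* conditional break: the iteration count     */
            break;             /* cannot be computed before the loop starts */
        a[i] = 2 * a[i];
    }
}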
Data Dependence

When a loop is parallelized, the iterations are executed independently on different processors, and the order of execution will differ from the serial order that occurs on a single processor. This effect of parallelization is not a problem when the iterations can be executed in any order with no effect on the results. Consider the following loop:

for (i = 0; i < 5; i++)
    a[i] = a[i] * b[i];

In this example, the array a would always end up with the same data regardless of whether the order of execution were 0-1-2-3-4, 4-3-2-1-0, 3-1-4-0-2, or any other order. The independence of each iteration from the others makes the loop an eligible candidate for parallelization.

Such is not the case in the following:

for (i = 1; i < 5; i++)
    a[i] = a[i-1] * b[i];
In this loop, the order of execution does matter. The data used in iteration i depends upon the data produced in the previous iteration, i-1. The array a would end up with very different data if the order of execution were any other than 1-2-3-4. The data dependence in this loop thus makes it ineligible for parallelization.

Not all data dependences inhibit parallelization. The following paragraphs discuss some of the exceptions.

Nested Loops and Matrices

Some nested loops that operate on matrices may have a data dependence in the inner loop only, allowing the outer loop to be parallelized. Consider the following:

for (i = 0; i < 10; i++)
    for (j = 1; j < 100; j++)
        a[i][j] = a[i][j-1] + 1;
The data dependence in this nested loop occurs in the inner (j) loop: each access of a[i][j] depends upon the element a[i][j-1] having been assigned in the previous iteration. If the iterations of the j loop were to execute in any other order than the one in which they would execute on a single processor, the matrix would be assigned different values. The inner loop, therefore, must not be parallelized.

But no such data dependence appears in the outer (i) loop: each row of the matrix is computed independently of every other row. Consequently, the compiler can safely distribute entire rows of the matrix to execute on different processors; the data assignments will be the same regardless of the order in which the rows are executed, so long as each row executes in serial order.

When analyzing a loop, the compiler errs on the safe side: it assumes that what looks like a data dependence really is one, and does not parallelize the loop. Consider the following:

for (i = 100; i < 200; i++)
    a[i] = a[i-k];
The compiler will assume that a data dependence exists in this loop because it appears that data defined in a previous iteration is being used in a later iteration. However, if the value of k is 100, the dependence is only apparent, not real, because every element a[i-k] that the loop reads is defined outside the loop.
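If the offset is in fact fixed, one way to make that independence visible to the compiler (a hypothetical rewrite for illustration, not a documented compiler feature) is to use a compile-time constant, so that the access pattern can be analyzed:

#define K 100                  /* offset known at compile time */

void update(double *a)         /* a has at least 200 elements */
{
    int i;
    for (i = 100; i < 200; i++)
        a[i] = a[i - K];       /* reads a[0..99], all defined outside the loop */
}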