Level 2 performs optimizations within each procedure. At level
2, the optimizer performs all optimizations performed at the prior
level, with the following additions:
Coloring register allocation.
Induction variable elimination and strength reduction.
Local and global common subexpression elimination.
Advanced constant folding and propagation. (Simple
constant folding is done by level 0 optimization.)
Loop invariant code motion.
Unused definition elimination.
The examples in this section are shown at the source code
level wherever possible. Transformations that cannot be shown at
the source level are shown in assembly language.
Coloring Register Allocation |
 |
The name of this optimization comes from the similarity to
map coloring algorithms in graph theory. This optimization determines
when and how long commonly used variables and expressions occupy
a register. It minimizes the number of references to memory (loads
and stores) a code segment makes. This can improve run-time speed.
You can help the optimizer understand when certain variables
are heavily used within a function by declaring these variables
with the register
qualifier. The first 10 register
qualified variables encountered in the source are honored. You should
pick the ten most important variables to be most effective.
The coloring register allocator may override your choices
and promote to a register a variable not declared register
over one that is, based on estimated speed improvements.
The following code shows the type of optimization the coloring
register allocation module performs. The code:
LDI 2,r104 COPY r104,r103 LDO 5(r103),r106 COPY r106,r105 LDO 10(r105),r107
|
becomes:
LDI 2,r25 LDO 5(r25),r26 LDO 10(r26),r31
|
Induction Variables and Strength Reduction |
 |
The induction variables and strength reduction module removes
expressions that are linear functions of a loop counter and replaces
each of them with a variable that contains the value of the function.
Variables of the same linear function are computed only once. This
module also simplifies the function by replacing multiplication
instructions with addition instructions wherever possible.
For example, the code:
for (i=0; i<25; i++) { r[i] = i * k; }
|
becomes:
t1 = 0; for (i=0; i<25; i++) { r[i] = t1; t1 += k; }
|
Local and Global Common Subexpression Elimination |
 |
The common subexpression elimination module identifies expressions
that appear more than once and have the same result, computes the
result, and substitutes the result for each occurrence of the expression.
The types of subexpression include instructions that load values
from memory, as well as arithmetic evaluation.
For example, the code:
a = x + y + z; b = x + y + w;
|
becomes:
t1 = x + y; a = t1 + z; b = t1 + w;
|
Constant Folding and Propagation |
 |
Constant folding computes the value of a constant expression
at compile time. For example:
A = 10; B = A + 5; C = 4 * B;
|
can be replaced by:
Loop Invariant Code Motion |
 |
The loop invariant code motion module recognizes instructions
inside a loop whose results do not change and moves them outside
the loop. This ensures that the invariant code is only executed
once.
For example, the code:
x = z; for(i=0; i<10; i++) { a[i] = 4 * x + i; }
|
becomes:
x = z; t1 = 4 * x; for(i=0; i<10; i++) { a[i] = t1 + i; }
|
Store/Copy Optimization |
 |
Where possible, the store/copy optimization module substitutes
registers for memory locations, by replacing store instructions
with copy instructions and deleting load instructions.
For example, the following HP C code:
a = x + 23; where a is a local variable
|
produces the following code for the unoptimized case:
LDO 23(r26),r1 STW r1,-52(0,sp) LDW -52(0,sp),ret0
|
and this code for the optimized case:
Unused Definition Elimination |
 |
The unused definition elimination module removes unused memory
location and register definitions. These definitions are often a
result of transformations made by other optimization modules.
For example, the function:
f(int x) { int a,b,c: a = 1; b = 2; c = x * b; return c; }
|
becomes:
f(int x) { int a,b,c; b = 2; c = x * b; return c; }
|
Software Pipelining |
 |
Software pipelining is a code transformation that optimizes
program loops. It rearranges the order in which instructions are
executed in a loop. It generates code that overlaps operations from
different loop iterations. Software pipelining is useful for loops
that contain arithmetic operations on floats and doubles.
The goal of this optimization is to avoid CPU stalls due to
memory or hardware pipeline latencies. The software pipelining transformation
adds code before and after the loop to achieve a high degree of
optimization within the loop.
The following pseudo-code fragment shows a loop before and
after the software pipelining optimization. Four significant things
happen:
A portion of the first iteration of
the loop is performed before the loop.
A portion of the last iteration of the loop is performed
after the loop.
The loop is unrolled twice.
Operations from different loop iterations are interleaved
with each other.
The following is a C for
loop:
#define SIZ 10000 float x[SIZ], y[SIZ]; \*Software pipelining works with*\ int i; \*floats and doubles. *\ init(); for (i = 0;i<= SIZ;i++); { x[i] =x[i] / y[i] + 4.00 }
|
When this loop is compiled with software pipelining, the optimization
can be expressed in pseudo-code as follows:
R1 = 0; Initialize array index.
|
R2 = 4.0; Load constant value.
|
R3 = Y[0]; Load first Y value.
|
R4 = X[0]; Load first X value.
|
R5 = R4 / R3; Perform division on first element:n = X[0] / Y[0].
|
R6 = R1; Save current array index.
|
R1++; Increment array index.
|
R7 = X[R1]; Load current X value.
|
R8 = Y[R1]; Load current Y value.
|
R9 = R5 + R2; Perform addition on prior row:X[i] = n + 4.0.
|
R10 = R7 / R8; Perform division on current row:m = X[i+1] / Y[i+1].
|
X[R6] = R9; Save result of operations on prior row.
|
R6 = R1; Save current array index.
|
R1++; Increment array index.
|
R4 = X[R1]; Load next X value.
|
R3 = Y[R1]; Load next Y value.
|
R11 = R10 + R2; Perform addition on current row:X[i+1] = m + 4
|
R5 = R4 / R3; Perform division on next row:n = X[i+2] / Y[i+2]
|
X[R6] = R11 Save result of operations on current row.
|
} while (R1 <= 100); End loop.
|
R9 = R5 + R2; Perform addition on last row:X[i+2] = n + 4
|
X[R6] = R9; Save result of operations on last row.
|
This transformation stores intermediate results of the division
instructions in unique registers (noted as n
and m). These registers are not referenced
until several instructions after the division operations. This decreases
the possibility that the long latency period of the division instructions
will stall the instruction pipeline and cause processing delays.
Prerequisites of Pipelining
Software pipelining is attempted on a loop that meets the
following criteria:
It is the innermost loop.
There are no branches or function calls within the
loop.
The loop is of moderate size.
This optimization produces slightly larger program files and
increases compile time. It is most beneficial in programs containing
loops that are executed a large number of times. This optimization
is not recommended for loops that are executed only a small number
of times.
Use the +Onopipeline
option with the +O2,
+O3, or +O4
option to suppress software pipelining if program size is more important
than execution speed. This will perform level two optimization,
but disable software pipelining.
Register Reassociation |
 |
Array references often require one or more instructions to
compute the virtual memory address of the array element specified
by the subscript expression. The register reassociation optimization
implemented in the PA-RISC compilers tries to reduce the cost of
computing the virtual memory address expression for array references
found in loops.
Within loops, the virtual memory address expression can be
rearranged and separated into a loop varying term and a loop invariant
term. Loop varying terms are those items whose values may change
from one iteration of the loop to another. Loop invariant
terms are those items whose values are constant throughout
all iterations of the loop. The loop varying term
corresponds to the difference in the virtual memory address associated
with a particular array reference from one iteration of the loop
to the next.
The register reassociation optimization dedicates a register
to track the value of the virtual memory address expression for
one or more array references in a loop and updates the register
appropriately in each iteration of a loop.
The register is initialized outside the loop to the loop invariant
portion of the virtual memory address expression and the register
is incremented or decremented within the loop by the loop variant
portion of the virtual memory address expression. On PA-RISC, the
update of such a dedicated register can often be performed for "free"
using the base-register modification capability of load and store
instructions.
The net result is that array references in loops are converted
into equivalent but more efficient pointer dereferences.
For example:
int a[10][20][30]; void example (void) { int i, j, k; for (k = 0; k < 10; k++) for (j = 0; j < 10; j++) for (i = 0; i < 10; i++) { a[i][j][k] = 1; } }
|
after register reassociation is applied to the innermost loop
becomes:
int a[10][20][30]; void example (void) { int i, j, k; register int (*p)[20][30]; for (k = 0; k < 10; k++) for (j = 0; j < 10; j++) for (p = (int (*)[20][30]) a[0][j][k], i = 0; i < 10; i++) { *(p++[0][0]) = 1; } }
|
In the above example, the compiler-generated temporary register
variable, p,
strides through the array a
in the innermost loop. This register pointer variable is initialized
outside the innermost loop and auto-incremented within the innermost
loop as a side-effect of the pointer dereference.
Register reassociation can often enable another loop optimization.
After performing the register reassociation optimization, the loop
variable may be needed only to control the iteration count of the
loop. If this is case, the original loop variable can be eliminated
altogether by using the PA-RISC ADDIB
and ADDB machine
instructions to control the loop iteration count.