HP 3000 Manuals

Level 2 Optimization Modules [ HP C Programmer's Guide ] MPE/iX 5.0 Documentation
HP C Programmer's Guide

Level 2 Optimization Modules 

Level 2 performs optimizations within each procedure.  At level 2, the
optimizer performs all optimizations performed at the prior level, with
the following additions:

   *   Coloring register allocation.
   *   Induction variable elimination and strength reduction.
   *   Local and global common subexpression elimination.
   *   Advanced constant folding and propagation.  (Simple constant
       folding is done by level 0 optimization.)
   *   Loop invariant code motion.
   *   Store/copy optimization.
   *   Unused definition elimination.
   *   Software pipelining.
   *   Register reassociation.

The examples in this section are shown at the source code level wherever
possible.  Transformations that cannot be shown at the source level are
shown in assembly language.

Coloring Register Allocation 

The name of this optimization comes from the similarity to map coloring
algorithms in graph theory.  This optimization determines when and how
long commonly used variables and expressions occupy a register.  It
minimizes the number of references to memory (loads and stores) a code
segment makes.  This can improve run-time speed.

You can help the optimizer understand when certain variables are heavily
used within a function by declaring these variables with the register
qualifier.  The first 10 register qualified variables encountered in the
source are honored.  You should pick the ten most important variables to
be most effective.

The coloring register allocator may override your choices and promote to
a register a variable not declared register over one that is, based on
estimated speed improvements.

The following code shows the type of optimization the coloring register
allocation module performs.  The code:

          LDI     2,r104
          COPY    r104,r103
          LDO     5(r103),r106
          COPY    r106,r105
          LDO     10(r105),r107

becomes:

          LDI     2,r25
          LDO     5(r25),r26
          LDO     10(r26),r31

Induction Variables and Strength Reduction 

The induction variables and strength reduction module removes expressions
that are linear functions of a loop counter and replaces each of them
with a variable that contains the value of the function.  Variables of
the same linear function are computed only once.  This module also
simplifies the function by replacing multiplication instructions with
addition instructions wherever possible.

For example, the code:

          for (i=0; i<25; i++) {
               r[i] = i * k;
          }

becomes:

          t1 = 0;
          for (i=0; i<25; i++) {
               r[i] = t1;
               t1 += k;
          }

Local and Global Common Subexpression Elimination 

The common subexpression elimination module identifies expressions that
appear more than once and have the same result, computes the result, and
substitutes the result for each occurrence of the expression.  The types
of subexpression include instructions that load values from memory, as
well as arithmetic evaluation.

For example, the code:

          a = x + y + z;
          b = x + y + w;

becomes:

          t1 = x + y;
          a = t1 + z;
          b = t1 + w;

Constant Folding and Propagation 

Constant folding computes the value of a constant expression at compile
time.  For example:

     A = 10;
     B = A + 5;
     C = 4 * B;

can be replaced by:

     A = 10;
     B = 15;
     C = 60;

Loop Invariant Code Motion 

The loop invariant code motion module recognizes instructions inside a
loop whose results do not change and moves them outside the loop.  This
ensures that the invariant code is only executed once.

For example, the code:

          x = z;
          for(i=0; i<10; i++)
          {
               a[i] = 4 * x + i;
          }

becomes:

          x = z;
          t1 = 4 * x;
          for(i=0; i<10; i++)
          {
               a[i] = t1 + i;
          }

Store/Copy Optimization 

Where possible, the store/copy optimization module substitutes registers
for memory locations, by replacing store instructions with copy
instructions and deleting load instructions.

For example, the following HP C code:

          a = x + 23;     where a is a local variable  

          return a;

produces the following code for the unoptimized case:

          LDO     23(r26),r1
          STW     r1,-52(0,sp)
          LDW     -52(0,sp),ret0

and this code for the optimized case:

          LDO     23(r26),ret0

Unused Definition Elimination 

The unused definition elimination module removes unused memory location
and register definitions.  These definitions are often a result of
transformations made by other optimization modules.

For example, the function:

          f(int x)
          {
               int a,b,c:

               a = 1;
               b = 2;
               c = x * b;
               return c;
          }

becomes:

          f(int x)
          {
               int a,b,c;

               b = 2;
               c = x * b;
               return c;
          }

Software Pipelining 

Software pipelining is a code transformation that optimizes program
loops.  It rearranges the order in which instructions are executed in a
loop.  It generates code that overlaps operations from different loop
iterations.  Software pipelining is useful for loops that contain
arithmetic operations on floats and doubles.

The goal of this optimization is to avoid CPU stalls due to memory or
hardware pipeline latencies.  The software pipelining transformation adds
code before and after the loop to achieve a high degree of optimization
within the loop.

Example.   

The following pseudo-code fragment shows a loop before and after the
software pipelining optimization.  Four significant things happen:

   *   A portion of the first iteration of the loop is performed before
       the loop.

   *   A portion of the last iteration of the loop is performed after the
       loop.

   *   The loop is unrolled twice.

   *   Operations from different loop iterations are interleaved with
       each other.

The following is a C for loop:

     #define SIZ 10000
     float x[SIZ], y[SIZ]; \*Software pipelining works with*\
     int i;                \*floats and doubles.           *\
     init();
     for (i = 0;i<= SIZ;i++);
             {
             x[i] =x[i] / y[i] + 4.00
             }

When this loop is compiled with software pipelining, the optimization can
be expressed in pseudo-code as follows:

     R1 = 0;                   Initialize array index. 

     R2 = 4.0;                 Load constant value. 

     R3 = Y[0];                Load first Y value. 

     R4 = X[0];                Load first X value. 

     R5 = R4 / R3;
                               Perform division on first element: 
                               n = X[0] / Y[0].   

     do {                      Begin loop. 

           R6 = R1;            Save current array index. 

           R1++;               Increment array index. 

           R7 = X[R1];         Load current X value. 

           R8 = Y[R1];         Load current Y value. 

           R9 = R5 + R2;       Perform addition on prior row: 
                               X[i] = n + 4.0. 

           R10 = R7 / R8;      Perform division on current row: 
                               m = X[i+1] / Y[i+1]. 

           X[R6] = R9;         Save result of operations on prior row. 

           R6 = R1;            Save current array index. 

           R1++;               Increment array index.   

           R4 = X[R1];         Load next X value.   

           R3 = Y[R1];         Load next Y value.   

           R11 = R10 + R2;     Perform addition on current row: 
                               X[i+1] = m + 4  

           R5 = R4 / R3;       Perform division on next row: 
                               n = X[i+2] / Y[i+2]  

           X[R6] = R11         Save result of operations on current row. 

     } while (R1 <= 100);      End loop.   

     R9 = R5 + R2;             Perform addition on last row: 
                               X[i+2] = n + 4  

     X[R6] = R9;          Save result of operations on last row. 

This transformation stores intermediate results of the division
instructions in unique registers (noted as n and m).  These registers are
not referenced until several instructions after the division operations.
This decreases the possibility that the long latency period of the
division instructions will stall the instruction pipeline and cause
processing delays.

Prerequisites of Pipelining.   

Software pipelining is attempted on a loop that meets the following
criteria:

   *   It is the innermost loop.
   *   There are no branches or function calls within the loop.
   *   The loop is of moderate size.

This optimization produces slightly larger program files and increases
compile time.  It is most beneficial in programs containing loops that
are executed a large number of times.  This optimization is not
recommended for loops that are executed only a small number of times.

Use the +Onopipeline option with the +O2, +O3, or +O4 option to suppress
software pipelining if program size is more important than execution
speed.  This will perform level two optimization, but disable software
pipelining. 

Register Reassociation 

Array references often require one or more instructions to compute the
virtual memory address of the array element specified by the subscript
expression.  The register reassociation optimization implemented in the
PA-RISC compilers tries to reduce the cost of computing the virtual
memory address expression for array references found in loops.

Within loops, the virtual memory address expression can be rearranged and
separated into a loop varying term and a loop invariant term.  Loop
varying terms are those items whose values may change from one iteration
of the loop to another.  Loop invariant terms are those items whose
values are constant throughout all iterations of the loop.  The loop 
varying term corresponds to the difference in the virtual memory address
associated with a particular array reference from one iteration of the
loop to the next.

The register reassociation optimization dedicates a register to track the
value of the virtual memory address expression for one or more array
references in a loop and updates the register appropriately in each
iteration of a loop.

The register is initialized outside the loop to the loop invariant
portion of the virtual memory address expression and the register is
incremented or decremented within the loop by the loop variant portion of
the virtual memory address expression.  On PA-RISC, the update of such a
dedicated register can often be performed for "free" using the
base-register modification capability of load and store instructions.

The net result is that array references in loops are converted into
equivalent but more efficient pointer dereferences.

For example:

     int a[10][20][30];

     void example (void)
     {
             int i, j, k;

             for (k = 0; k < 10; k++)
                for (j = 0; j < 10; j++)
                    for (i = 0; i < 10; i++)
                    {
                        a[i][j][k] = 1;
                    }
     }

after register reassociation is applied to the innermost loop becomes:

     int a[10][20][30];

     void example (void)
     {
       int i, j, k;
       register int (*p)[20][30];

       for (k = 0; k < 10; k++)
         for (j = 0; j < 10; j++)
           for (p = (int (*)[20][30]) &a[0][j][k], i = 0; i < 10; i++)
           {
               *(p++[0][0]) = 1;
           }
     }

In the above example, the compiler-generated temporary register variable,
p, strides through the array a in the innermost loop.  This register
pointer variable is initialized outside the innermost loop and
auto-incremented within the innermost loop as a side-effect of the
pointer dereference.

Register reassociation can often enable another loop optimization.  After
performing the register reassociation optimization, the loop variable may
be needed only to control the iteration count of the loop.  If this is
case, the original loop variable can be eliminated altogether by using
the PA-RISC ADDIB and ADDB machine instructions to control the loop
iteration count.
MPE/iX 5.0 Documentation