Optimizing Compilers
In the realm of highly parallel code, and even to some extent for moderately parallel code,
advanced compilers have exposed more instruction-level parallelism. But for slightly parallel
code, most advances in optimizing compilers [1] have actually reduced the amount of
instruction-level parallelism. For example, common subexpression elimination, code motion of
loop invariants, induction variable elimination, and elimination of redundant loads and stores all
reduce redundant computation. Yet redundant computation (e.g., computing the same value twice
in a basic block) is exactly what provides extra instruction-level parallelism, since the
redundant copies are independent and can execute simultaneously. In our experience, for slightly
parallel code, only tree height reduction and reduction in strength provide added instruction-level
parallelism.
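As an illustration of this effect (a hypothetical code fragment, not drawn from any particular compiler), consider common subexpression elimination applied to a pair of expressions sharing a multiply:

```python
def before_cse(x, y):
    # The two multiplies are independent instructions:
    # a superscalar machine could issue them in the same cycle.
    a = (x * y) + 1
    b = (x * y) - 1
    return a, b

def after_cse(x, y):
    # CSE removes the redundant multiply, but now both adds
    # depend on t, serializing them behind a single multiply.
    t = x * y
    a = t + 1
    b = t - 1
    return a, b
```

Both versions compute identical results; the optimized one executes fewer instructions but leaves less independent work available per cycle.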
Of course many effective compilation techniques have been developed for highly parallel code.
Vectorizing compilers, parallelizing compilers, trace scheduling, loop unrolling, and loop
jamming can all increase the accessible parallelism within code. Finally, for moderately parallel
code, techniques like loop unrolling and trace scheduling can speed up non-vectorizable
applications when running on superscalar or superpipelined machines.
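As a sketch of how unrolling helps (the function name and the four-way unroll factor are illustrative choices), unrolling a reduction loop with multiple accumulators breaks the serial dependence chain of a single running sum:

```python
def dot_unrolled(xs, ys):
    # Four independent accumulators replace one serial add chain,
    # exposing parallelism a superscalar or superpipelined
    # machine could exploit across iterations.
    s0 = s1 = s2 = s3 = 0.0
    n = len(xs) - len(xs) % 4
    for i in range(0, n, 4):
        s0 += xs[i] * ys[i]
        s1 += xs[i + 1] * ys[i + 1]
        s2 += xs[i + 2] * ys[i + 2]
        s3 += xs[i + 3] * ys[i + 3]
    # Handle any leftover elements when the length is not a multiple of 4.
    for i in range(n, len(xs)):
        s0 += xs[i] * ys[i]
    return s0 + s1 + s2 + s3
```

Note that combining the four partial sums at the end is itself an instance of tree height reduction.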
3.3.2. Effects of Longer Base Latencies
So far our assumption has been that the latency of all operations, or at least the simple
operations, is one base machine cycle. As we discussed previously, no known machines have
this characteristic. For example, few machines have one cycle loads without a possible data
interlock either before or after the load. Similarly, few machines can execute floating-point
operations in one cycle. What are the effects of longer latencies? Consider a machine where
ALU operations are one cycle, but loads, stores, and branches are two cycles, and floating-point
operations are three cycles. Then the base machine is actually like a slightly superpipelined
machine. If we multiply the execution frequency of each instruction by its latency, we get the
average degree of superpipelining.
    instruction    frequency    latency    contribution
    ALU/shift         40%          1           0.4
    load/store        35%          2           0.7
    branch            15%          2           0.3
    FP                10%          3           0.3
    total                                      1.7
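The latency-weighted sum in the table can be checked directly (frequencies and latencies taken from the table above):

```python
# (frequency, latency) pairs for each instruction class, from the table.
mix = {
    "ALU/shift":  (0.40, 1),
    "load/store": (0.35, 2),
    "branch":     (0.15, 2),
    "FP":         (0.10, 3),
}

# Average degree of superpipelining: the sum of frequency * latency
# over all instruction classes.
degree = sum(f * lat for f, lat in mix.values())
```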
Thus, our machine is closer to a superpipelined machine of degree two than it is to our ideal base
machine. To the extent that some operation latencies are greater than one base machine cycle,
the remaining amount of exploitable instruction-level parallelism will be reduced. In this
example, assuming the average degree of instruction-level parallelism in slightly parallel code is
around two, this machine should not stall often on data-dependency interlocks, since the
available parallelism slightly exceeds the average degree of superpipelining of 1.7.