The PA-8000 executes at a peak rate of four instructions
per cycle, enabled by a large complement of computational
units, shown at the left side of Figure 1. For integer operation,
it includes two 64-bit integer ALUs and two 64-bit
shift/merge units. All integer functional units have a single-cycle
latency. For floating-point applications, the chip
includes dual floating-point multiply-and-accumulate (FMAC)
units and dual divide/square root units. We optimized the
FMAC units for performing the common operation A times
B plus C. By fusing an add to a multiply, each FMAC can
execute two floating-point operations in just three cycles. In
addition to providing low latency for floating-point operations,
the FMAC units are fully pipelined so that the PA-8000’s
peak throughput is four floating-point operations per cycle.
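
The peak figure follows from simple arithmetic: two FMAC units, each completing one multiply and one add per issue. The sketch below is a back-of-the-envelope model only, not a description of the actual issue logic. It assumes a stream of fully independent FMAC operations, one new operation accepted per unit per cycle, and no stalls; the function and constant names are illustrative.

```python
import math

FMAC_UNITS = 2       # from the text: dual FMAC units
FMAC_LATENCY = 3     # from the text: result available after three cycles
FLOPS_PER_FMAC = 2   # one multiply plus one add per fused operation


def peak_flops_per_cycle():
    """Peak floating-point operations per cycle with every FMAC unit busy."""
    return FMAC_UNITS * FLOPS_PER_FMAC


def cycles_for_independent_fmacs(n):
    """Cycles to finish n independent FMACs on fully pipelined units.

    Assumption: each unit accepts one new operation per cycle and no
    operation depends on another, so only the final issue group pays
    the full pipeline latency.
    """
    if n == 0:
        return 0
    issue_cycles = math.ceil(n / FMAC_UNITS)
    return issue_cycles + FMAC_LATENCY - 1


if __name__ == "__main__":
    n = 1000
    cycles = cycles_for_independent_fmacs(n)
    print(peak_flops_per_cycle(), cycles, n * FLOPS_PER_FMAC / cycles)
```

Under these assumptions, 1,000 independent FMACs finish in 502 cycles, so the sustained rate approaches but never quite reaches the four-operation peak. Dependent operations would expose the three-cycle latency instead.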
The two divide/square root units are not pipelined, but other
floating-point operations can execute on the FMAC units
while the divide/square root units are busy. A single-precision
divide or square root operation requires 17 cycles;
a double-precision operation requires 31 cycles.
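
Because the divide/square root units are not pipelined, their throughput is set by latency rather than by issue rate. The following sketch estimates this under stated assumptions: each unit is occupied for the full latency of an operation, a unit can start its next operation immediately after finishing the previous one, and all operations in a batch have the same precision. The names are illustrative, not taken from the design.

```python
import math

DIVIDE_UNITS = 2                                # from the text: dual units
DIVIDE_LATENCY = {"single": 17, "double": 31}   # cycles, from the text


def cycles_for_divides(n, precision):
    """Cycles to finish n divide or square root operations.

    Assumption: a non-pipelined unit is busy for the whole latency, so
    the operations proceed in rounds of DIVIDE_UNITS at a time.
    """
    if n == 0:
        return 0
    rounds = math.ceil(n / DIVIDE_UNITS)
    return rounds * DIVIDE_LATENCY[precision]


def sustained_divides_per_cycle(precision):
    """Long-run divide or square root throughput with both units busy."""
    return DIVIDE_UNITS / DIVIDE_LATENCY[precision]


if __name__ == "__main__":
    print(cycles_for_divides(4, "double"))
    print(sustained_divides_per_cycle("single"))
```

In this model, four double-precision divides occupy the two units for 62 cycles, during which the FMAC units remain free for other floating-point work.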

Such a large array of computational units would be pointless
if they could not obtain enough data to operate on. To
that end, the PA-8000 incorporates two complete load/store
pipes, including two address adders, a 96-entry dual-ported
TLB, and a dual-ported cache. The right side of Figure 1
shows the dual load/store units and the memory system
interface. The symmetry of dual functional units throughout
the processor allows a number of simplifications in data
paths, control logic, and signal routing. In effect, this duality
provides separate even and odd machines.