A good example of a pipelined ALU organization for vector processing is the vector
facility developed for the IBM 370 architecture and implemented on the high-end
3090 series [PADE88, TUCK87]. This facility is an optional add-on to the basic system but is highly integrated with it. It resembles vector facilities found on supercomputers, such as the Cray family.
The IBM facility makes use of a number of vector registers. Each register is actually a bank of scalar registers. To compute the vector sum of two vectors A and B, the vectors are loaded into two vector registers. The data from these registers are passed through the ALU as fast as possible, and the results are stored in a third vector register. The overlap of computation, together with the loading of the input data into the registers in a block, results in a significant speedup over an ordinary ALU operation.
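The register-based vector sum described above can be sketched in software. This is only an illustrative model, not the IBM implementation: the function name `vector_add` and the register length `VECTOR_REGISTER_LENGTH = 128` are assumptions made for the example, and Python lists stand in for main storage and for the vector registers.

```python
VECTOR_REGISTER_LENGTH = 128  # assumed register size, for illustration only


def vector_add(a, b):
    """Add two equal-length vectors one register-sized block at a time."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    result = []
    for start in range(0, len(a), VECTOR_REGISTER_LENGTH):
        # Block load: fill two vector registers from storage.
        vr_a = a[start:start + VECTOR_REGISTER_LENGTH]
        vr_b = b[start:start + VECTOR_REGISTER_LENGTH]
        # Stream both registers through the ALU into a third register.
        vr_c = [x + y for x, y in zip(vr_a, vr_b)]
        # Store the result register back to storage.
        result.extend(vr_c)
    return result
```

A vector longer than one register is simply processed in several blocks; the sketch does not model the pipelining that lets the hardware overlap successive element operations.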
ORGANIZATION  The IBM vector architecture, and similar pipelined vector ALUs, provide increased performance over loops of scalar arithmetic instructions in three ways:
• The fixed and predetermined structure of vector data permits housekeeping
instructions inside the loop to be replaced by faster internal (hardware or microcoded) machine operations.
• Data-access and arithmetic operations on several successive vector elements
can proceed concurrently by overlapping such operations in a pipelined design
or by performing multiple-element operations in parallel.
• The use of vector registers for intermediate results avoids additional storage references.
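The third point can be made concrete with a toy count of storage references. The sketch below is a hypothetical model, not a description of the IBM hardware: the class `CountingStorage` and the functions `through_storage` and `through_register` are invented for illustration, and the computation D = (A + B) × C is chosen only because it needs one intermediate result.

```python
class CountingStorage:
    """Toy main storage that counts every load and store."""

    def __init__(self, **vectors):
        self.data = {name: list(values) for name, values in vectors.items()}
        self.references = 0

    def load(self, name, i):
        self.references += 1
        return self.data[name][i]

    def store(self, name, i, value):
        self.references += 1
        self.data[name][i] = value


def through_storage(mem, n):
    """D = (A + B) * C with the intermediate sum written back to storage."""
    for i in range(n):
        mem.store("T", i, mem.load("A", i) + mem.load("B", i))
    for i in range(n):
        mem.store("D", i, mem.load("T", i) * mem.load("C", i))


def through_register(mem, n):
    """Same computation, holding the intermediate sum in a register."""
    for i in range(n):
        t = mem.load("A", i) + mem.load("B", i)  # never written to storage
        mem.store("D", i, t * mem.load("C", i))
```

In this model, `through_storage` makes six storage references per element and `through_register` makes four, because the intermediate sum is neither stored nor reloaded.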
Figure 17.19 shows the general organization of the vector facility. Although the
vector facility is seen to be a physically separate add-on to the processor, its architecture is an extension of the System/370 architecture and is compatible with it. The vector facility is integrated into the System/370 architecture in the following ways:
• Existing System/370 instructions are used for all scalar operations.
• Arithmetic operations on individual vector elements produce exactly the same
result as do corresponding System/370 scalar instructions. For example, one
design decision concerned the definition of the result in a floating-point
DIVIDE operation. Should the result be exact, as it is for scalar floating-point
division, or should an approximation be allowed that would permit a higher-speed implementation but could sometimes introduce an error in one or more
low-order bit positions? The decision was made to uphold complete compatibility with the System/370 architecture at the expense of a minor performance
degradation.
• Vector instructions are interruptible, and their execution can be resumed from
the point of interruption after appropriate action has been taken, in a manner
compatible with the System/370 program-interruption scheme.
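The idea of resuming a vector instruction from the point of interruption can be sketched as follows. This is a minimal software analogy under stated assumptions: the names `Interrupted`, `vector_add_resumable`, the `state["index"]` progress counter, and the `interrupt_at` test hook are all hypothetical and are not taken from the System/370 definition.

```python
class Interrupted(Exception):
    """Raised to simulate an interruption in the middle of a vector operation."""


def vector_add_resumable(a, b, result, state, interrupt_at=None):
    """Add a and b into result, starting from the element in state["index"].

    Progress is recorded in state after each element, so a later call
    with the same state continues where the earlier call stopped.
    """
    while state["index"] < len(a):
        i = state["index"]
        if interrupt_at is not None and i == interrupt_at:
            raise Interrupted(i)  # element i has not been processed yet
        result[i] = a[i] + b[i]
        state["index"] = i + 1


a, b = [1, 2, 3, 4], [10, 20, 30, 40]
c = [0] * len(a)
state = {"index": 0}
try:
    vector_add_resumable(a, b, c, state, interrupt_at=2)
except Interrupted:
    pass  # the interruption would be handled here
vector_add_resumable(a, b, c, state)  # resume; elements 0 and 1 are not redone
```

Because the progress counter is saved outside the operation itself, the elements completed before the interruption are not recomputed when execution resumes.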