The architecture specifies that each register contains from 8 to 512 scalar
elements. The choice of actual length involves a design trade-off. The time to do a
vector operation consists essentially of the overhead for pipeline startup and
register filling plus one cycle per vector element. Thus, the use of a large number of
register elements reduces the relative startup time for a computation. However,
this efficiency must be balanced against the added time required for saving and
restoring vector registers on a process switch and the practical cost and space
limits. These considerations led to the use of 128 elements per register in the current
3090 implementation.
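
The trade-off can be made concrete with a small timing model. The sketch below follows the statement that a vector operation costs a startup overhead plus one cycle per element, and assumes a long vector is processed in register-sized sections, each paying the startup overhead once. The function names and the startup value of 30 cycles are hypothetical, chosen only for illustration; they are not figures for the 3090.

```python
import math


def vector_op_cycles(n_elements, register_length, startup_cycles):
    """Cycles to process n_elements in sections of register_length elements.

    Each section pays the startup overhead once, then one cycle per element.
    """
    sections = math.ceil(n_elements / register_length)
    return sections * startup_cycles + n_elements


def startup_fraction(n_elements, register_length, startup_cycles):
    """Fraction of the total cycles spent on startup overhead."""
    total = vector_op_cycles(n_elements, register_length, startup_cycles)
    return (total - n_elements) / total


STARTUP = 30  # hypothetical overhead in cycles, not a measured 3090 value

for length in (8, 128, 512):
    frac = startup_fraction(1024, length, STARTUP)
    print(f"{length:4d} elements/register: {frac:.1%} of cycles are startup")
```

Under these assumptions the startup share falls steadily as the register length grows. The model deliberately leaves out the other side of the trade-off, the cost of saving and restoring longer registers on a process switch.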
Three additional registers are needed by the vector facility. The vector-mask
register contains mask bits that may be used to select which elements in the vector
registers are to be processed for a particular operation. The vector-status register
contains control fields, such as the vector count, that determine how many elements
in the vector registers are to be processed. The vector-activity count keeps track of
the time spent executing vector instructions.
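
The roles of the vector-mask register and the vector count can be illustrated with a simplified model of one masked operation. This is a sketch only: the function name is hypothetical, registers are modeled as Python lists, and the exact rules by which the real facility applies the mask are not described in the text above.

```python
def masked_vector_add(dest, src, mask, vector_count):
    """Return dest with src added in the selected element positions.

    Only the first vector_count elements are considered, and of those only
    the positions whose mask bit is 1 are processed. All other elements of
    dest are left unchanged.
    """
    result = list(dest)
    for i in range(vector_count):
        if mask[i]:
            result[i] = dest[i] + src[i]
    return result


vr1 = [1, 2, 3, 4, 5, 6, 7, 8]
vr2 = [10] * 8
mask = [1, 0, 1, 0, 1, 0, 1, 0]

# With a vector count of 5, elements 5 through 7 are never touched,
# even though mask bit 6 is set.
print(masked_vector_add(vr1, vr2, mask, vector_count=5))
```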
COMPOUND INSTRUCTIONS As was discussed previously, instruction execution
can be overlapped using chaining to improve performance. The designers of the
IBM vector facility chose not to include this capability for several reasons. The
System/370 architecture would have to be extended to handle complex interruptions
(including their effect on virtual memory management), and corresponding
changes would be needed in the software. A more basic issue was the cost of
including the additional controls and register access paths in the vector facility for
generalized chaining.
Instead, three operations are provided that combine into one instruction (one
opcode) the most common sequences in vector computation, namely multiplication
followed by addition, subtraction, or summation. The storage-to-register
MULTIPLY-AND-ADD instruction, for example, fetches a vector from storage, multiplies it by
a vector from a register, and adds the product to a third vector in a register. By use
of the compound instructions MULTIPLY-AND-ADD and MULTIPLY-AND-SUBTRACT
in the example of Figure 17.20, the total time for the iteration is
reduced from 10 to 8 cycles.
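
The effect of the storage-to-register MULTIPLY-AND-ADD can be sketched as a single pass over the elements. The model below is illustrative only: the function and parameter names are hypothetical, storage is modeled as a flat Python list with unit stride, and the result is assumed to replace the contents of the register holding the third vector, which the description above does not state explicitly.

```python
def multiply_and_add(storage, address, vr_multiplier, vr_accumulator, vector_count):
    """Model of a storage-to-register multiply-and-add.

    For each of the first vector_count elements, fetch an operand from
    storage, multiply it by the matching element of vr_multiplier, and add
    the product to the matching element of vr_accumulator. Elements beyond
    vector_count are returned unchanged.
    """
    result = list(vr_accumulator)
    for i in range(vector_count):
        product = storage[address + i] * vr_multiplier[i]
        result[i] = vr_accumulator[i] + product
    return result


storage = [0, 0, 1, 2, 3, 4]
vr_b = [10, 10, 10, 10]
vr_c = [1, 1, 1, 1]

# One instruction performs the fetch, the multiply, and the add;
# no separate register is needed to hold the intermediate product.
print(multiply_and_add(storage, 2, vr_b, vr_c, vector_count=3))
```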
Unlike chaining, compound instructions do not require the use of additional
registers for temporary storage of intermediate results, and they require one less
register access. For example, consider the following chain:
    A → VR1
    VR1 + VR2 → VR1
In this case, two stores to the vector register VR1 are required. In the IBM
architecture there is a storage-to-register ADD instruction. With this instruction, only the
sum is placed in VR1. The compound instruction also avoids the need to reflect in
the machine-state description the concurrent execution of a number of instructions,
which simplifies status saving and restoring by the operating system and the
handling of interrupts.
THE INSTRUCTION SET Table 17.3 summarizes the arithmetic and logical
operations that are defined for the vector architecture. In addition, there are memory-to