vector computation to be specified. The notation indicates that operations on all indices J in the given interval are to be carried out as a single operation.
How this can be achieved is addressed shortly.
The program in Figure 17.15b indicates that all the elements of the ith row are
to be computed in parallel. Each element in the row is a summation, and the summations (across K) are done serially rather than in parallel. Even so, only vector
multiplications are required for this algorithm as compared with scalar multiplications for the scalar algorithm.
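The idea can be sketched in Python (an illustration in list notation, not the book's Figure 17.15b): each pass over K performs one vector multiplication that updates every element of row i of C at once, while the summation across K remains serial.

```python
# Sketch of row-at-a-time vector computation of C = A * B.
# The inner list comprehension stands in for a single vector operation
# applied across all column indices J simultaneously; the loop over K
# is the serial summation described in the text.

N = 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = [[0.0] * N for _ in range(N)]

for i in range(N):
    for k in range(N):  # serial summation across K
        # One "vector multiplication": the scalar A[i][k] times row k of B,
        # accumulated into every position J of row i of C in one operation.
        C[i] = [C[i][j] + A[i][k] * B[k][j] for j in range(N)]
```

On vector hardware the bracketed update would be issued as a single instruction over the whole row rather than as a loop over J.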
Another approach, parallel processing, is illustrated in Figure 17.15c. This approach assumes that we have N independent processors that can function in parallel. To utilize processors effectively, we must somehow parcel out the computation
to the various processors. Two primitives are used. The primitive FORK n causes an
independent process to be started at location n. In the meantime, the original
process continues execution at the instruction immediately following the FORK.
Every execution of a FORK spawns a new process. The JOIN instruction is essentially the inverse of the FORK. The statement JOIN N causes N independent
processes to be merged into one that continues execution at the instruction following the JOIN. The operating system must coordinate this merger, and so execution does not continue until all N processes have reached the JOIN instruction.
The program in Figure 17.15c is written to mimic the behavior of the vector-processing program. In the parallel processing program, each column of C is computed by a separate process. Thus, the elements in a given row of C are computed
in parallel.
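The FORK/JOIN structure can be mimicked with Python threads (an assumption for illustration: the book's primitives are machine-level instructions, not a library API). Each "forked" process computes one column of C, and joining all N threads plays the role of JOIN N.

```python
# Sketch of the parallel processing program: one process per column of C.

import threading

N = 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = [[0] * N for _ in range(N)]

def compute_column(j):
    # The body of a forked process: fill column j of C.
    for i in range(N):
        C[i][j] = sum(A[i][k] * B[k][j] for k in range(N))

threads = [threading.Thread(target=compute_column, args=(j,))
           for j in range(N)]
for t in threads:
    t.start()   # FORK: spawn an independent process for this column
for t in threads:
    t.join()    # JOIN N: execution proceeds only after all N processes arrive
```

Note that because every process contributes one element to each row, the elements of a given row are indeed computed in parallel (on CPython the threads are interleaved rather than truly simultaneous, but the coordination pattern is the same).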
The preceding discussion describes approaches to vector computation in logical or architectural terms. Let us turn now to a consideration of types of processor
organization that can be used to implement these approaches. A wide variety of
organizations have been and are being pursued. Three main categories stand out:
• Pipelined ALU
• Parallel ALUs
• Parallel processors
Figure 17.16 illustrates the first two of these approaches. We have already discussed pipelining in Chapter 12. Here the concept is extended to the operation of
the ALU. Because floating-point operations are rather complex, there is opportunity for decomposing a floating-point operation into stages, so that different stages
can operate on different sets of data concurrently. This is illustrated in Figure 17.17a.
Floating-point addition is broken up into four stages (see Figure 9.22): compare,
shift, add, and normalize. A vector of numbers is presented sequentially to the first
stage. As the processing proceeds, four different sets of numbers will be operated on
concurrently in the pipeline.
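The timing behavior of such a pipeline can be illustrated with a small simulation (a sketch: the four stage functions below are placeholders for the real compare/shift/add/normalize hardware, since the point here is occupancy, not floating-point arithmetic).

```python
# Toy simulation of a four-stage floating-point add pipeline.
# Once the pipe fills, four different operand pairs are in flight
# on every clock tick.

def compare(x):   return x   # compare exponents, select the larger
def shift(x):     return x   # shift the smaller significand right
def add(x):       return x   # add the significands
def normalize(x): return x   # renormalize the result

STAGES = [compare, shift, add, normalize]

def run_pipeline(inputs):
    latches = [None] * len(STAGES)   # one latch per stage
    pending = list(inputs)
    results, ticks = [], 0
    while pending or any(v is not None for v in latches):
        ticks += 1
        # Advance the pipe: each occupant moves up one stage.
        for s in range(len(STAGES) - 1, 1 - 1, -1):
            latches[s] = latches[s - 1] if s > 0 else None
        latches[0] = pending.pop(0) if pending else None
        # Every stage processes its current occupant during this tick.
        latches = [STAGES[s](x) if x is not None else None
                   for s, x in enumerate(latches)]
        # Whatever just left the normalize stage retires.
        if latches[-1] is not None:
            results.append(latches[-1])
            latches[-1] = None
    return results, ticks

pairs = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0), (9.0, 10.0)]
results, ticks = run_pipeline(pairs)
# 5 operand pairs through a 4-stage pipe finish in 5 + 4 - 1 = 8 ticks,
# versus 5 * 4 = 20 if each addition occupied the whole unit.
```

The 8-versus-20 tick comparison is the essence of the speedup: the pipeline does not make any single addition faster, it simply keeps all four stages busy on different data.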
It should be clear that this organization is suitable for vector processing. To see
this, consider the instruction pipelining described in Chapter 12. The processor goes
through a repetitive cycle of fetching and processing instructions. In the absence of
branches, the processor is continuously fetching instructions from sequential locations. Consequently, the pipeline is kept full and a savings in time is achieved. Similarly, a pipelined ALU will save time only if it is fed a stream of data from sequential