Although interleaved multithreading appears to offer better processor utilization
than blocked multithreading, it does so at the expense of single-thread performance.
The multiple threads compete for cache resources, which raises the
probability of a cache miss for a given thread.
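The cache effect can be illustrated with a small simulation. The following is a minimal sketch, not a model of any real processor: it assumes a direct-mapped cache of eight one-block lines and two hypothetical threads whose working sets map to the same lines. The function name `miss_rate` and the access traces are invented for illustration.

```python
def miss_rate(trace, num_lines=8):
    """Return the miss rate of a direct-mapped cache for a list of block addresses.

    Assumption: each cache line holds exactly one block, and a block maps
    to line (block mod num_lines).
    """
    cache = [None] * num_lines
    misses = 0
    for block in trace:
        index = block % num_lines
        if cache[index] != block:
            misses += 1
            cache[index] = block  # evict whatever was in this line
    return misses / len(trace)


# Hypothetical working sets: thread A loops over blocks 0..7,
# thread B loops over blocks 8..15, so both map to the same eight lines.
thread_a = [b for _ in range(10) for b in range(0, 8)]
thread_b = [b for _ in range(10) for b in range(8, 16)]

# Thread A running alone: only the first pass over its blocks misses.
alone = miss_rate(thread_a)

# Interleaved execution: the threads alternate one access at a time
# and evict each other's blocks.
interleaved = [block for pair in zip(thread_a, thread_b) for block in pair]
shared = miss_rate(interleaved)

print(f"thread A alone: {alone:.2f}")
print(f"interleaved:    {shared:.2f}")
```

This trace is deliberately a worst case: every access in the interleaved run evicts a block that the other thread needs next. Real workloads and set-associative caches would show a smaller penalty, but the direction of the effect is the same.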
More opportunities for parallel execution are available if the processor can
issue multiple instructions per cycle. Figures 17.8d through 17.8i illustrate a number
of variations among processors that have hardware for issuing four instructions per
cycle. In all these cases, only instructions from a single thread are issued in a single
cycle. The following alternatives are illustrated:
• Superscalar: This is the basic superscalar approach with no multithreading.
Until relatively recently, this was the most powerful approach to providing
parallelism within a processor. Note that during some cycles, not all of the
available issue slots are used. During these cycles, fewer than the maximum
number of instructions are issued; this is referred to as horizontal loss. During
other instruction cycles, no instructions can be issued, so no issue slots are
used; this is referred to as vertical loss.
• Interleaved multithreading superscalar: During each cycle, as many instructions
as possible are issued from a single thread. With this technique, potential
delays due to thread switches are eliminated, as previously discussed. However,
the number of instructions issued in any given cycle is still limited by dependencies
that exist within any given thread.
• Blocked multithreaded superscalar: Again, instructions from only one thread
may be issued during any cycle, and blocked multithreading is used.
• Very long instruction word (VLIW): A VLIW architecture, such as IA-64,
places multiple instructions in a single word. Typically, a VLIW is constructed by
the compiler, which places operations that may be executed in parallel in the
same word. In a simple VLIW machine (Figure 17.8g), if it is not possible to completely
fill the word with instructions to be issued in parallel, no-ops are used.
• Interleaved multithreading VLIW: This approach should provide similar efficiencies
to those provided by interleaved multithreading on a superscalar
architecture.
• Blocked multithreaded VLIW: This approach should provide similar efficiencies
to those provided by blocked multithreading on a superscalar architecture.
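The notions of horizontal and vertical loss, and the no-op padding of a simple VLIW word, can be made concrete with a short sketch. It assumes a four-wide machine, as in the alternatives above. The function names `issue_slot_loss` and `pad_vliw_word`, the per-cycle issue counts, and the operation strings are all hypothetical and chosen only for illustration.

```python
def issue_slot_loss(issued_per_cycle, width=4):
    """Classify unused issue slots for a trace of per-cycle issue counts.

    A cycle that issues nothing loses all of its slots (vertical loss).
    A cycle that issues some, but fewer than `width`, instructions loses
    the remaining slots (horizontal loss).
    """
    horizontal = 0
    vertical = 0
    for issued in issued_per_cycle:
        if not 0 <= issued <= width:
            raise ValueError(f"issue count {issued} outside 0..{width}")
        if issued == 0:
            vertical += width
        else:
            horizontal += width - issued
    total = width * len(issued_per_cycle)
    return {
        "used": sum(issued_per_cycle),
        "horizontal_loss": horizontal,
        "vertical_loss": vertical,
        "total_slots": total,
    }


def pad_vliw_word(operations, width=4):
    """Fill a VLIW word to `width` slots, using no-ops for the empty slots."""
    if len(operations) > width:
        raise ValueError("too many operations for one word")
    return list(operations) + ["nop"] * (width - len(operations))


# Hypothetical trace: instructions issued in each of six cycles.
trace = [2, 0, 4, 1, 0, 3]
loss = issue_slot_loss(trace)
print(loss)

# Hypothetical operations the compiler found to be independent.
word = pad_vliw_word(["add r1,r2,r3", "load r4,0(r5)"])
print(word)
```

For this trace, 10 of the 24 available slots are used, 6 are lost horizontally, and 8 are lost vertically. The padded word shows the same waste from the VLIW side: each no-op occupies a slot that a superscalar machine would simply leave empty.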
The final two approaches illustrated in Figure 17.8 enable the parallel, simultaneous
execution of multiple threads:
• Simultaneous multithreading: Figure 17.8j shows a system capable of issuing 8
instructions at a time. If one thread has a high degree of instruction-level
parallelism, it may on some cycles be able to fill all of the horizontal slots. On