The load-use latency is three cycles. There is a six-cycle latency
for dependent FP operations. The ICACHE is shared between
all eight threads. Each thread has its own instruction buffer.
The Fetch stage/unit fetches up to four instructions per cycle
and puts them into the thread’s instruction buffer. Threads
can be in the “Wait” (as opposed to “Ready”) state due to an ITLB
miss, an ICACHE miss, or a full instruction buffer. A
“Least-Recently-Fetched” algorithm is used to select one of the
“Ready” threads for which the next instruction will be fetched.
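To make the fetch policy concrete, the sketch below models one core’s front end under simplified assumptions: a thread is “Ready” unless it has a pending ITLB or ICACHE miss or its instruction buffer is full, and the least-recently-fetched “Ready” thread is selected each cycle. The structure layout, field names, and buffer depth are illustrative, not taken from the actual design.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_THREADS  8
    #define IBUF_ENTRIES 8   /* illustrative instruction-buffer depth */

    /* Per-thread front-end state (hypothetical field names). */
    typedef struct {
        bool     itlb_miss;     /* waiting on an ITLB miss              */
        bool     icache_miss;   /* waiting on an ICACHE miss            */
        int      ibuf_count;    /* instructions sitting in the buffer   */
        uint64_t last_fetched;  /* cycle this thread last fetched       */
    } thread_fe_t;

    static bool fetch_ready(const thread_fe_t *t)
    {
        /* "Ready" means no outstanding miss and room in the buffer. */
        return !t->itlb_miss && !t->icache_miss &&
               t->ibuf_count < IBUF_ENTRIES;
    }

    /* Least-Recently-Fetched selection: among "Ready" threads, pick
     * the one whose last fetch was longest ago; -1 if none is Ready. */
    static int pick_fetch_thread(const thread_fe_t th[NUM_THREADS])
    {
        int best = -1;
        for (int i = 0; i < NUM_THREADS; i++) {
            if (!fetch_ready(&th[i]))
                continue;
            if (best < 0 || th[i].last_fetched < th[best].last_fetched)
                best = i;
        }
        return best;
    }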
Fig. 7 shows the Integer/Load/Store pipeline and illustrates
how different threads can occupy different pipeline stages in
a given cycle. In other words, threads are interleaved across
pipeline stages with very few restrictions. The Load/Store
and Floating Point units are shared between all eight threads.
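As a rough illustration of the interleaving shown in Fig. 7, the sketch below issues one instruction per cycle from rotating threads into a simplified six-stage pipeline and prints which thread occupies each stage. The stage names and the round-robin issue order are assumptions made for the example, not the core’s actual pick policy.

    #include <stdio.h>

    #define STAGES 6   /* e.g. Fetch, Select, Decode, Execute, Memory, Writeback */

    int main(void)
    {
        const char *stage_name[STAGES] = { "F", "S", "D", "E", "M", "W" };
        int stage_tid[STAGES];          /* thread ID in each stage, -1 = bubble */
        for (int s = 0; s < STAGES; s++) stage_tid[s] = -1;

        for (int cycle = 0; cycle < 10; cycle++) {
            /* Advance the pipeline: each instruction moves one stage per cycle. */
            for (int s = STAGES - 1; s > 0; s--)
                stage_tid[s] = stage_tid[s - 1];
            /* Issue from a rotating thread (simple round-robin for illustration). */
            stage_tid[0] = cycle % 8;

            printf("cycle %2d:", cycle);
            for (int s = 0; s < STAGES; s++) {
                if (stage_tid[s] >= 0)
                    printf("  %s=T%d", stage_name[s], stage_tid[s]);
                else
                    printf("  %s=--", stage_name[s]);
            }
            printf("\n");
        }
        return 0;
    }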
The eight threads within each SPARC core are divided into
two thread groups (TGs) of four threads each. Once again,
the threads could be in “Wait” states due to events such as a
DCACHE miss, a DTLB miss, or a data dependency. Every cycle, the
“Pick” stage attempts to select one instruction to execute from each
of the two TGs, choosing among that TG’s “Ready” threads with a
“Least-Recently-Picked” algorithm. Because each TG picks
independently of the other, structural hazards can arise; for
example, load instructions may be picked from both TGs in the same
cycle even though each SPC has only one load/store unit. These
hazards are resolved in the “Decode” stage.
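The sketch below, using invented structure and function names, shows one way the per-TG pick and the subsequent hazard check could be modeled: each TG independently picks its least-recently-picked “Ready” thread, and a decode-style check then cancels one pick if both TGs selected a load/store instruction, since the core has only one load/store unit. Which pick is cancelled here is an arbitrary choice for the example.

    #include <stdbool.h>
    #include <stdint.h>

    #define THREADS_PER_TG 4

    typedef enum { OP_ALU, OP_LOADSTORE, OP_FP } op_class_t;

    /* Per-thread pick state within a thread group (hypothetical fields). */
    typedef struct {
        bool       ready;        /* not waiting on a DCACHE/DTLB miss or data dependency */
        op_class_t next_op;      /* class of the thread's next instruction */
        uint64_t   last_picked;  /* cycle this thread was last picked      */
    } pick_state_t;

    /* Least-Recently-Picked selection within one TG; -1 if no thread is Ready. */
    static int pick_from_tg(const pick_state_t tg[THREADS_PER_TG])
    {
        int best = -1;
        for (int i = 0; i < THREADS_PER_TG; i++) {
            if (!tg[i].ready)
                continue;
            if (best < 0 || tg[i].last_picked < tg[best].last_picked)
                best = i;
        }
        return best;
    }

    /* Decode-stage resolution of the single load/store unit: if both
     * TGs picked a load/store this cycle, cancel one of the picks
     * (TG1 here, chosen arbitrarily for the example). */
    static void resolve_lsu_hazard(const pick_state_t tg0[], int *pick0,
                                   const pick_state_t tg1[], int *pick1)
    {
        if (*pick0 >= 0 && *pick1 >= 0 &&
            tg0[*pick0].next_op == OP_LOADSTORE &&
            tg1[*pick1].next_op == OP_LOADSTORE) {
            *pick1 = -1;   /* cancelled pick is retried in a later cycle */
        }
    }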