The load-use latency is three cycles, and dependent FP operations
have a six-cycle latency. The ICACHE is shared among all eight
threads, but each thread has its own instruction buffer.
The Fetch unit fetches up to four instructions per cycle
and places them in the fetching thread’s instruction buffer.
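
The fetch behavior described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not a description of the actual hardware: the buffer depth (`BUFFER_DEPTH`), the round-robin thread selection, and the `streams` input (one queue of pending instructions per thread) are all hypothetical, since the text gives only the thread count and the fetch width.

```python
from collections import deque

NUM_THREADS = 8    # from the text: eight threads share the ICACHE
FETCH_WIDTH = 4    # from the text: up to four instructions per cycle
BUFFER_DEPTH = 8   # assumption: the text does not give the buffer capacity


class FetchUnit:
    """Toy model: one thread is served per cycle, into its own buffer."""

    def __init__(self):
        # One private instruction buffer per thread.
        self.buffers = [deque() for _ in range(NUM_THREADS)]
        # Assumption: plain round-robin; the real selection policy may differ.
        self.next_thread = 0

    def fetch_cycle(self, streams):
        """Fetch for one thread this cycle; return (thread_id, count)."""
        tid = self.next_thread
        self.next_thread = (tid + 1) % NUM_THREADS
        buf = self.buffers[tid]
        free = BUFFER_DEPTH - len(buf)
        # Limited by fetch width, free buffer space, and available instructions.
        count = min(FETCH_WIDTH, free, len(streams[tid]))
        for _ in range(count):
            buf.append(streams[tid].popleft())
        return tid, count


# Hypothetical input: six pending instructions for each thread.
streams = [deque(f"t{t}_i{i}" for i in range(6)) for t in range(NUM_THREADS)]
unit = FetchUnit()
print(unit.fetch_cycle(streams))  # (0, 4)
```

In this model a thread never receives more than four instructions in a cycle, and a full buffer or an empty stream reduces the count, possibly to zero.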