The 1990s have been referred to as the Golden Age of Microarchitecture. There were many innovations in the basic microarchitecture of a uniprocessor, enabled by the increasing number of transistors provided by Moore’s Law. Classical uniprocessor organizations were totally transformed as a result of these innovations. At the turn of the century, chips transitioned from uniprocessors to chip multiprocessors (CMPs). The initial organization (or microarchitecture) of the “multiprocessor portion” of a CMP, i.e., the hardware that was not in the processor cores (e.g., shared caches, interconnect), still resembled a canonical symmetric multiprocessor (SMP). It was clear to the second author that the continuing transistor bounty could be used to rethink the microarchitecture of the multiprocessor portion of a CMP. Specifically, since the designers of a CMP had complete control over what hardware it contained and how it would function, they could contemplate and implement techniques that could never be practical if they required interactions between distinct chips over which the designers might not have complete control, as was the case in canonical SMPs. Since caches accounted for a significant portion of this hardware, rethinking the organization and functioning of caches in CMPs was a logical place to start.
Our first foray into different cache operations, albeit in the SMP context, was Coherence Decoupling [1], which targeted the latency of coherence misses. Here we proposed to separate the two major operations needed for correct cache operation on a coherence miss: obtaining the accessed data, and obtaining the coherence permissions to that data. We observed that the data could typically be obtained more quickly than the necessary coherence permissions, so a processor could (speculatively) start working with the data while the permissions were still pending, taking corrective action if the speculative access turned out to be incorrect. Encouraged by the promising results of this work, we started to contemplate other novel mechanisms for optimizing cache operations.
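To make the decoupling idea concrete, here is a minimal software sketch of the speculate-then-verify pattern it relies on. The actual mechanism in [1] is implemented in hardware within the coherence protocol and speculative core; all names, structures, and values below are purely illustrative assumptions, not the paper's design.

```c
/* Illustrative sketch of coherence decoupling: on a coherence miss,
 * the core speculatively consumes a (possibly stale) local copy of
 * the data while the coherence-permission request is still pending;
 * when the authoritative reply arrives, the speculation is verified
 * and squashed if the value used was wrong. Hypothetical code only. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    int  spec_value;   /* value speculatively consumed by the core */
    bool pending;      /* coherence permission request outstanding */
} SpecLoad;

/* Issue a load that misses for coherence reasons: return the stale
 * local copy immediately instead of stalling for permissions. */
static int speculative_load(SpecLoad *s, int stale_copy) {
    s->spec_value = stale_copy;
    s->pending = true;
    return stale_copy;   /* core starts computing with this value */
}

/* Later, the owner/directory replies with permissions and the
 * authoritative value; verify that the speculation was correct. */
static bool verify_load(SpecLoad *s, int authoritative_value) {
    s->pending = false;
    return s->spec_value == authoritative_value;
}

int main(void) {
    SpecLoad s;
    int v = speculative_load(&s, 42);   /* use stale value now */
    printf("speculatively computing with %d\n", v);

    /* permission reply arrives; here the value happened to be unchanged */
    if (verify_load(&s, 42))
        printf("speculation correct: work commits\n");
    else
        printf("misspeculation: squash and re-execute with the new value\n");
    return 0;
}
```

The sketch captures why the technique wins when it does: if the stale copy usually matches the authoritative value, the core hides the permission latency entirely, and pays a rollback cost only on the (hopefully rare) mismatches.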
Around the same time, there was considerable work in the community on optimizing the latency of on-chip caches using novel last-level cache organizations. Although the last-level caches at that time were the L2 caches, the proposed techniques remain applicable to large on-chip caches generally (e.g., the L3 caches in today’s server processors). The NUCA cache work was an early proposal in this direction [2]. Another line of work observed that running parallel programs, or multiple independent programs, on a CMP introduced new problems due to interference in its shared resources. A significant body of work was already underway to alleviate the negative impact of such interference.