Figure 3: Design of the GPU-CC architecture. Cores and load-store units communicate via FIFO buffers
and five data lanes named A to E. The single instruction each core executes is stored in a local configuration
register (CR). For clarity, only four of the 32 cores and two of the 16 load-store units in an SM are shown.