7. CONCLUSION
GPUs are throughput-oriented processors that depend on massive multithreading to tolerate long-latency memory accesses. The latest GPUs are all equipped with on-chip data caches to reduce the latency of memory accesses and to conserve the bandwidth of the NoC and off-chip memory modules. But these tiny data caches are vulnerable to thrashing under massive multithreading, especially when divergent load instructions generate long bursts of cache accesses. Meanwhile, the blocks fetched by divergent loads exhibit high intra-warp locality and are expected to be cached atomically so that the issuing warp can fully hit in the L1D cache when the load is next issued. However, GPU caches are not designed with sufficient awareness of either the SIMD execution model or memory divergence.
In this work, we renovate cache management policies to design a GPU-specific data cache, DaCache. This design starts with the observation that warp scheduling essentially shapes the locality pattern in cache access streams. Thus we incorporate warp scheduling logic into the insertion policy so that blocks are inserted into the LRU chain according to their issuing warp's scheduling priority. Then we deliberately prioritize coherent loads over divergent loads. To enable thrashing resistance, the cache ways are partitioned by the desired warp concurrency into two regions, a locality region and a thrashing region, so that replacement is constrained within the thrashing region. When no replacement candidate is available in the thrashing region, incoming requests are bypassed. We also implement a dynamic partitioning scheme based on caching effectiveness sampled at runtime. Experiments show that DaCache achieves a 40.4% performance improvement over the baseline GPU and outperforms two state-of-the-art thrashing-resistant cache management techniques, RRIP and DIP, by 40% and 24.9%, respectively.
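To make the gated insertion and replacement described above concrete, the following is a minimal C++ sketch of one cache set, not the paper's simulator implementation. Names such as Block, DaCacheSet, localityWays_, and findVictim are hypothetical, and the victim-eligibility test (protecting divergent blocks outright) is a simplifying assumption.

    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <list>
    #include <optional>

    // One cache set, modeled as an LRU chain ordered from MRU (front) to LRU (back).
    struct Block {
        uint64_t tag;
        bool divergent;   // fetched by a divergent load and still awaiting intra-warp reuse
    };

    class DaCacheSet {
    public:
        DaCacheSet(int numWays, int localityWays)
            : numWays_(numWays), localityWays_(localityWays) {}

        // Try to fill a block fetched by a warp with the given scheduling
        // priority (0 = highest). Returns false when the fill is bypassed.
        bool insert(const Block& blk, int warpPriority) {
            if (static_cast<int>(chain_.size()) >= numWays_) {
                auto victim = findVictim();
                if (!victim) return false;   // no candidate in thrashing region: bypass
                chain_.erase(*victim);
            }
            // Insertion position follows the issuing warp's scheduling priority:
            // higher-priority warps land closer to the MRU end of the chain.
            int pos = std::min(warpPriority, static_cast<int>(chain_.size()));
            auto it = chain_.begin();
            std::advance(it, pos);
            chain_.insert(it, blk);
            return true;
        }

    private:
        // Replacement is constrained to the thrashing region (chain positions at
        // or beyond localityWays_); among eligible blocks, the LRU-most one wins.
        std::optional<std::list<Block>::iterator> findVictim() {
            std::optional<std::list<Block>::iterator> victim;
            int idx = 0;
            for (auto it = chain_.begin(); it != chain_.end(); ++it, ++idx) {
                if (idx >= localityWays_ && !it->divergent) victim = it;
            }
            return victim;
        }

        int numWays_;             // total associativity of the set
        int localityWays_;        // boundary between locality and thrashing regions
        std::list<Block> chain_;  // blocks ordered from MRU to LRU
    };

In the actual design, the priority-ordered insertion and the locality/thrashing boundary are maintained by the cache controller, and the boundary itself moves under the dynamic partitioning scheme; the sketch fixes it at construction time only for brevity.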