GPUs are throughput-oriented processors that depend on massive
multithreading to tolerate long latency memory accesses. The
latest GPUs are all equipped with on-chip data caches to reduce
the latency of memory accesses and conserve the bandwidth of the
network-on-chip (NoC) and off-chip memory modules. However, these
tiny data caches are vulnerable
to thrashing from massive multithreading, especially when
divergent load instructions generate long bursts of cache accesses.
Meanwhile, the blocks fetched by a divergent load exhibit high
intra-warp locality and are expected to be cached in their entirety,
so that the issuing warp can fully hit in the L1D the next time it
issues the load. However, GPU caches are not designed with sufficient
awareness of either the SIMD execution model or memory divergence.
In this work, we renovate the cache management policies to design
a GPU-specific data cache, DaCache. This design starts with
the observation that warp scheduling fundamentally shapes the locality
pattern of cache access streams. Thus we incorporate the
warp scheduling logic into the insertion policy so that blocks are
inserted into the LRU-chain according to their issuing warp’s
scheduling priority. We then deliberately prioritize coherent loads
over divergent loads.
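To make the policy concrete, the C++ sketch below maps a warp's
scheduling priority to an LRU-chain insertion position and promotes
coherent loads toward the MRU end; the 16-way set, the
priority-to-position mapping, and all identifiers are illustrative
assumptions rather than the exact design.

    #include <cstddef>

    // Sketch of a scheduling-priority-aware insertion policy. The set
    // layout and the priority-to-position mapping are assumptions.
    struct DaCacheSet {
        static constexpr std::size_t kWays = 16;  // assumed associativity
        // LRU-chain position 0 is MRU; position kWays - 1 is LRU.

        // warp_priority: the issuing warp's rank under the scheduler
        // (0 = highest, e.g., the oldest warp under greedy-then-oldest).
        std::size_t insertion_position(std::size_t warp_priority,
                                       bool divergent) const {
            // Higher-priority warps insert closer to MRU, so their blocks
            // survive until the warp is rescheduled to re-issue the load.
            std::size_t pos = (warp_priority < kWays) ? warp_priority
                                                      : kWays - 1;
            // Coherent loads are prioritized over divergent ones by
            // promoting them one position toward the MRU end.
            if (!divergent && pos > 0)
                --pos;
            return pos;
        }
    };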
To enable thrashing resistance, the cache ways are partitioned,
according to the desired warp concurrency, into two regions, a
locality region and a thrashing region, so that replacement is
constrained to the thrashing region. When no replacement candidate is
available in the thrashing region, incoming requests bypass the cache.
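A minimal sketch of the constrained replacement and bypass decision,
under the same assumed 16-way set; the reservable-way check stands in
for lines locked by outstanding fills and is an assumed detail:

    #include <cstddef>
    #include <optional>

    // Constrained replacement with bypassing. LRU-chain positions
    // [0, partition) form the locality region and [partition, kWays)
    // the thrashing region; only the latter yields victims.
    struct PartitionedSet {
        static constexpr std::size_t kWays = 16;
        std::size_t partition;      // set from the desired warp concurrency
        bool reservable[kWays];     // indexed by LRU-chain position

        // Returns the victim's LRU-chain position, or std::nullopt to
        // signal that the incoming request should bypass the L1D.
        std::optional<std::size_t> pick_victim() const {
            // Scan the thrashing region from the LRU end upward.
            for (std::size_t pos = kWays; pos-- > partition; ) {
                if (reservable[pos])
                    return pos;
            }
            return std::nullopt;    // no candidate available: bypass
        }
    };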
We also implement a dynamic partitioning scheme based on the caching
effectiveness sampled at runtime.
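A minimal sketch of the adaptation loop, assuming the sampled
effectiveness signal is whether divergent loads fully hit in the L1D
and an epoch of 4096 samples; both are illustrative choices, not the
exact mechanism.

    #include <cstddef>

    // Runtime adaptation of the partition point between the locality
    // and thrashing regions of a 16-way set.
    struct DynamicPartitioner {
        std::size_t partition = 8;  // initial split of a 16-way set
        unsigned full_hits = 0;     // divergent loads hitting on all blocks
        unsigned partial_hits = 0;  // loads whose blocks were partly evicted
        unsigned sampled = 0;

        void observe_divergent_load(bool fully_hit) {
            (fully_hit ? full_hits : partial_hits)++;
            if (++sampled < 4096)   // assumed sampling epoch
                return;
            // Many partial hits mean blocks are evicted mid-reuse, so the
            // protected locality region grows; dominant full hits let it
            // shrink and leave more ways for the thrashing region.
            if (partial_hits > full_hits && partition < 15)
                ++partition;
            else if (full_hits > 2 * partial_hits && partition > 1)
                --partition;
            full_hits = partial_hits = sampled = 0;
        }
    };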
Experiments show that DaCache achieves a 40.4% performance improvement
over the baseline GPU and outperforms two state-of-the-art
thrashing-resistant cache management techniques.