GPUs are throughput-oriented processors that depend on massive
multithreading to tolerate long-latency memory accesses. The
latest GPUs are all equipped with on-chip data caches to reduce
the latency of memory accesses and to save the bandwidth of the NoC
and off-chip memory modules. Yet these tiny data caches are vulnerable
to thrashing under massive multithreading, especially when
divergent load instructions generate long bursts of cache accesses.
Meanwhile, the blocks of divergent loads exhibit high intra-warp
locality and are expected to be cached atomically so that the issuing
warp can fully hit in the L1D on the next issuance of the load. However,
GPU caches are not designed with enough awareness of either the SIMD
execution model or memory divergence.
In this work, we renovate cache management policies to design
a GPU-specific data cache, DaCache. The design starts with
the observation that warp scheduling essentially shapes the locality
pattern in cache access streams. We therefore incorporate the
warp scheduling logic into the insertion policy, so that blocks are
inserted into the LRU chain according to their issuing warp's scheduling
priority. We further prioritize coherent loads over divergent
loads. To enable thrashing resistance, the cache
ways are partitioned by the desired warp concurrency into two regions,
a locality region and a thrashing region, and replacement is
constrained to the thrashing region. When no replacement candidate
is available in the thrashing region, incoming requests
bypass the cache. We also implement a dynamic partitioning scheme
based on caching effectiveness sampled at runtime. Experiments
show that DaCache achieves a 40.4% performance improvement
over the baseline GPU and outperforms two state-of-the-art
thrashing-resistant cache management techniques.
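
To make these mechanisms concrete, the following C++ sketch shows how gauged insertion, constrained replacement, and bypass could interact within a single cache set. It is a minimal illustration under assumptions, not the paper's implementation: the 8-way set, the initial 4-way locality region, and all identifiers (Line, CacheSet, fill, and so on) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>

// Minimal sketch of gauged insertion, constrained replacement, and bypass
// for a single L1D set. All names here are hypothetical; the real logic
// sits in the GPU's L1D pipeline, not in host C++.
struct Line {
    uint64_t tag;
    int warp_priority;  // scheduling priority of the issuing warp (0 = highest)
    bool divergent;     // set if the filling load was memory-divergent
    bool reserved;      // fill in flight; not evictable
};

class CacheSet {
    static constexpr size_t kWays = 8;
    size_t locality_ways_ = 4;   // partition point between the two regions
    std::list<Line> lru_chain_;  // front = MRU position, back = LRU position

public:
    // Fills a block on a miss; returns false if the request must be
    // bypassed around the L1D because no victim is available.
    bool fill(const Line& line) {
        if (lru_chain_.size() == kWays) {
            // Replacement is constrained to the thrashing region: only
            // unreserved lines past the locality region are candidates,
            // and the deepest one (closest to LRU) is chosen.
            auto victim = lru_chain_.end();
            size_t pos = 0;
            for (auto it = lru_chain_.begin(); it != lru_chain_.end(); ++it, ++pos)
                if (pos >= locality_ways_ && !it->reserved)
                    victim = it;
            if (victim == lru_chain_.end())
                return false;  // thrashing region exhausted -> bypass
            lru_chain_.erase(victim);
        }
        // Gauged insertion: the block enters the LRU chain at a depth set
        // by its warp's scheduling priority, with coherent loads placed
        // one slot ahead of divergent ones at the same priority level.
        size_t depth = size_t(line.warp_priority) + (line.divergent ? 1 : 0);
        auto it = lru_chain_.begin();
        for (size_t i = 0; i < depth && it != lru_chain_.end(); ++i)
            ++it;
        lru_chain_.insert(it, line);
        return true;
    }

    void set_locality_ways(size_t n) { locality_ways_ = n; }  // tuner hook
};
```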
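
The dynamic partitioning scheme can likewise be sketched as a small tuner that samples caching effectiveness and moves the partition point; in a full model, its output would drive set_locality_ways() above. The abstract only states that effectiveness is sampled at runtime, so the epoch length and hit-rate thresholds below are illustrative assumptions.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical tuner for the locality/thrashing partition. The sampling
// epoch (100K accesses) and the 0.5 / 0.2 hit-rate thresholds are made up
// for illustration; only the sampled-effectiveness idea comes from the paper.
class PartitionTuner {
    size_t locality_ways_;
    const size_t max_ways_;
    uint64_t hits_ = 0, accesses_ = 0;

public:
    PartitionTuner(size_t initial, size_t max_ways)
        : locality_ways_(initial), max_ways_(max_ways) {}

    void record(bool hit) { ++accesses_; hits_ += hit; }

    // Called periodically; resizes the partition once per sampling epoch.
    void maybe_resize() {
        if (accesses_ < 100000) return;
        double hit_rate = double(hits_) / double(accesses_);
        // Grow the protected locality region while caching is effective;
        // shrink it (allowing more replacement) when it is not.
        if (hit_rate > 0.5 && locality_ways_ + 1 < max_ways_)
            ++locality_ways_;
        else if (hit_rate < 0.2 && locality_ways_ > 0)
            --locality_ways_;
        hits_ = accesses_ = 0;
    }

    size_t locality_ways() const { return locality_ways_; }
};
```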
