In addition to our proof-of-concept design, we present a
set of alternative designs that we considered but did
not adopt because of poor performance or increased complexity.
These designs include a shared L2 TLB, a TLB prefetcher,
and alternative page walk cache
designs. We also analyzed the impact of large pages on the
GPU TLB. We find that large pages do decrease the
TLB miss rate. However, to remain compatible
with CPU page tables and to ease the burden on the programmer,
we cannot rely solely on large pages for GPU MMU
performance.
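
The effect of page size on TLB miss rate can be illustrated with a small simulation. The sketch below is not our evaluation methodology: it assumes a fully associative, LRU-replaced TLB with a hypothetical 64 entries, and a synthetic trace that sweeps a 64 MiB buffer twice at a 1 KiB stride. The function name `tlb_miss_rate` and all parameters are illustrative assumptions.

```python
from collections import OrderedDict


def tlb_miss_rate(addresses, page_size, entries):
    """Miss rate of a fully associative LRU TLB over a virtual address trace."""
    tlb = OrderedDict()  # virtual page number -> None, ordered by recency
    misses = 0
    for addr in addresses:
        vpn = addr // page_size
        if vpn in tlb:
            tlb.move_to_end(vpn)  # hit: mark as most recently used
        else:
            misses += 1
            tlb[vpn] = None
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used entry
    return misses / len(addresses)


# Assumed synthetic trace: two sweeps over a 64 MiB buffer at a 1 KiB stride.
BUFFER_BYTES = 64 * 1024 * 1024
trace = [a for _ in range(2) for a in range(0, BUFFER_BYTES, 1024)]

ENTRIES = 64  # hypothetical TLB size, not a measured configuration
small = tlb_miss_rate(trace, 4 * 1024, ENTRIES)         # 4 KiB pages
large = tlb_miss_rate(trace, 2 * 1024 * 1024, ENTRIES)  # 2 MiB pages
print(f"4 KiB pages: {small:.4%} miss rate")
print(f"2 MiB pages: {large:.4%} miss rate")
```

Under these assumptions the buffer spans 16,384 small pages, far more than the TLB holds, so every first touch of a page misses in both sweeps. With 2 MiB pages the buffer spans only 32 pages, which all fit in the TLB, so only the 32 compulsory misses remain.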