The massive parallelism
from these Single-Instruction Multiple Data (SIMD) threads helps
GPUs achieve a dramatic improvement in computational power
compared to CPUs. To reduce the latency of memory operations,
GPUs employ multiple levels of data caches, which also save off-chip
memory bandwidth when there is locality among the accesses.
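As a concrete illustration (a minimal sketch, not drawn from this work), the CUDA kernel below shows the execution pattern described above: every thread runs the same instruction on a different element, and adjacent threads access adjacent addresses, so requests from a warp fall on the same cache lines and are served by the on-chip data caches rather than off-chip memory. The kernel name, sizes, and launch parameters are illustrative assumptions.

// Hypothetical example: SIMD-style threads with cache-friendly, coalesced accesses.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void axpy(float a, const float *x, const float *y, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n)
        out[i] = a * x[i] + y[i];  // neighboring threads read neighboring words
}

int main() {
    const int n = 1 << 20;
    float *x, *y, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // cover all n elements
    axpy<<<blocks, threads>>>(3.0f, x, y, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);  // expect 5.0
    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}

Because each group of 32 consecutive threads issues one wide, contiguous memory request, the L1/L2 caches can capture the spatial locality, which is the bandwidth-saving behavior referred to above.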