4.2 High performance using smaller GPU memory

Figure 3 shows the execution time of the benchmarks as we vary the size of the GPU memory. The baseline and the zero-copy scheme run programmer-modified codes, targeting a smaller GPU memory and bypassing GPU memory entirely, respectively. We evaluate the runtime of ScaleGPU while decreasing the size of the GPU memory from 100% to 12.5% of the data size, and normalize each runtime to that of the baseline with 100% GPU memory. Each runtime is further divided into host-to-device (H2D) transfer latency, device-to-host (D2H) transfer latency, and GPU kernel execution latency. Note that Shortest Path and Bank Account cannot be manually modified to fit in smaller memory due to the nature of their algorithms.

First, ScaleGPU achieves an average of 8% speedup for split-friendly workloads such as vectorAdd and histogram by overlapping the data transfers with the GPU kernel executions. The zero-copy scheme achieves an average of 58% speedup because these applications do not reuse data and all of their memory accesses are coalesced. ScaleGPU does not reach the zero-copy scheme's performance because it first accesses the GPU memory before forwarding each request to CPU memory.

Next, ScaleGPU achieves an average of 32% speedup for hotspot over the manually modified codes. Whereas both the manually modified codes and the zero-copy scheme suffer significant performance losses due to the increased memory transfers, ScaleGPU maintains baseline performance using only 25% of the GPU memory. This improvement comes from frequent data reuse in GPU memory as well as from overlapping the data transfers with the GPU execution.

Finally, Shortest Path and Bank Account fail to run on the baseline with smaller GPU memory because these applications cannot be split to fit in a smaller GPU memory. Although the zero-copy scheme successfully runs on smaller GPU memory, it suffers from a significant performance loss, since these applications reuse data that zero-copy must re-fetch from CPU memory on every access.
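For context, the programmer-modified baseline makes a workload fit in a reduced GPU memory by tiling its data and processing one chunk at a time. The sketch below shows this pattern for a vectorAdd-style kernel; the function and buffer names are ours, not the benchmarks' actual code.

// Sketch: manually splitting a vectorAdd-style workload so each chunk
// fits in a reduced GPU memory budget. All names are illustrative.
#include <cuda_runtime.h>
#include <algorithm>

__global__ void vecAddKernel(const float *a, const float *b, float *c, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void vecAddChunked(const float *hA, const float *hB, float *hC,
                   size_t n, size_t chunk) {
    float *dA, *dB, *dC;                              // device buffers sized for one chunk
    cudaMalloc(&dA, chunk * sizeof(float));
    cudaMalloc(&dB, chunk * sizeof(float));
    cudaMalloc(&dC, chunk * sizeof(float));
    for (size_t off = 0; off < n; off += chunk) {     // process the input chunk by chunk
        size_t len = std::min(chunk, n - off);
        cudaMemcpy(dA, hA + off, len * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB + off, len * sizeof(float), cudaMemcpyHostToDevice);
        vecAddKernel<<<(unsigned)((len + 255) / 256), 256>>>(dA, dB, dC, len);
        cudaMemcpy(hC + off, dC, len * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

This restructuring works only when chunks are independent, which is why Shortest Path and Bank Account, whose accesses depend on data outside any single chunk, cannot be rewritten this way.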
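The zero-copy scheme maps pinned host memory into the GPU address space so kernels access CPU memory directly over PCIe, with no explicit H2D or D2H copies. A minimal sketch using CUDA's standard mapped-memory API follows; the kernel and buffer names are illustrative.

// Sketch: zero-copy via mapped pinned host memory. The kernel dereferences
// host memory directly over the PCIe bus; no cudaMemcpy is ever issued.
#include <cuda_runtime.h>

__global__ void scaleKernel(const float *in, float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];                 // streaming, coalesced accesses
}

int main() {
    const size_t n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);            // enable mapped pinned allocations
    float *hIn, *hOut;                                // pinned host buffers visible to the GPU
    cudaHostAlloc(&hIn,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&hOut, n * sizeof(float), cudaHostAllocMapped);
    float *dIn, *dOut;                                // device-side aliases of the same memory
    cudaHostGetDevicePointer(&dIn,  hIn,  0);
    cudaHostGetDevicePointer(&dOut, hOut, 0);
    scaleKernel<<<(unsigned)((n + 255) / 256), 256>>>(dIn, dOut, n);
    cudaDeviceSynchronize();
    cudaFreeHost(hIn); cudaFreeHost(hOut);
    return 0;
}

Because every access crosses the PCIe bus, this approach suits the streaming, coalesced patterns of vectorAdd and histogram, but any data reuse pays the bus latency on each repeated access, matching the losses observed for hotspot, Shortest Path, and Bank Account.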
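At the API level, the transfer/compute overlap behind ScaleGPU's speedup corresponds to double-buffered asynchronous copies on CUDA streams. The sketch below illustrates the general technique only; ScaleGPU performs the equivalent overlap transparently in its runtime, so none of this code appears in the benchmarks.

// Sketch: double-buffered overlap of H2D copies, kernel execution, and D2H
// copies using two CUDA streams. Host buffers must be pinned (cudaHostAlloc)
// for cudaMemcpyAsync to actually overlap with kernel execution.
#include <cuda_runtime.h>
#include <algorithm>

__global__ void vecAddKernel(const float *a, const float *b,
                             float *c, size_t n);     // as in the first sketch

void vecAddOverlapped(const float *hA, const float *hB, float *hC,
                      size_t n, size_t chunk) {
    cudaStream_t s[2];
    float *dA[2], *dB[2], *dC[2];                     // one buffer set per stream
    for (int j = 0; j < 2; ++j) {
        cudaStreamCreate(&s[j]);
        cudaMalloc(&dA[j], chunk * sizeof(float));
        cudaMalloc(&dB[j], chunk * sizeof(float));
        cudaMalloc(&dC[j], chunk * sizeof(float));
    }
    int k = 0;
    for (size_t off = 0; off < n; off += chunk, k ^= 1) {
        size_t len = std::min(chunk, n - off);
        // Work queued in stream s[k] overlaps with work still running in s[k^1].
        cudaMemcpyAsync(dA[k], hA + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        cudaMemcpyAsync(dB[k], hB + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        vecAddKernel<<<(unsigned)((len + 255) / 256), 256, 0, s[k]>>>(dA[k], dB[k], dC[k], len);
        cudaMemcpyAsync(hC + off, dC[k], len * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();                          // wait for both streams to drain
    for (int j = 0; j < 2; ++j) {
        cudaStreamDestroy(s[j]);
        cudaFree(dA[j]); cudaFree(dB[j]); cudaFree(dC[j]);
    }
}

Two buffer sets let one chunk's copies proceed while the previous chunk's kernel is still executing, which is how the H2D and D2H latencies in Figure 3 can hide behind kernel time.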