Because combining optimizations yields seemingly
arbitrary performance benefits, we tested all combinations
and found that, with OpenCL on AMD GPUs,
kernel splitting (KS) + register preloading (RP) + image
memory (IM) performs best. Figure 7 presents
the speedup obtained on both AMD and NVIDIA GPUs
with OpenCL and CUDA, respectively. We compared the
unoptimized version as well as the one with architecture-specific
optimizations and found that the unoptimized
CUDA implementation performs better than the unoptimized
OpenCL implementation. However, for the optimized
version, OpenCL on the AMD GPU is 12% faster than
CUDA on its NVIDIA counterpart.
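
The exhaustive search over optimization combinations described above can be sketched as follows. This is a minimal illustration, not the harness used for the measurements: the list of optimizations contains only the three named in this section (the full set tested may be larger), and `measure_runtime` is a hypothetical callback that builds and times the kernel for a given combination.

```python
from itertools import combinations

# Labels taken from the text; assumed here to be the complete candidate set.
OPTIMIZATIONS = ("KS", "RP", "IM")


def all_combinations(opts):
    """Yield every subset of opts, from the empty (unoptimized) tuple to all of them."""
    for r in range(len(opts) + 1):
        for combo in combinations(opts, r):
            yield combo


def best_combination(opts, measure_runtime):
    """Time every combination and return (combo, runtime) for the fastest one.

    measure_runtime is a hypothetical callback: it receives a tuple of
    optimization labels and returns the measured kernel time in seconds.
    """
    timings = {combo: measure_runtime(combo) for combo in all_combinations(opts)}
    best = min(timings, key=timings.get)
    return best, timings[best]


def speedup(baseline_time, optimized_time):
    """Speedup of the optimized run relative to the baseline run."""
    return baseline_time / optimized_time
```

With n optimizations the search times 2^n variants, which is practical only for a small candidate set such as the one assumed here.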