In Figure 5, we compare the performance of the unoptimized
OpenCL implementation with that of the implementation optimized
with kernel splitting. Kernel splitting delivers a 1.7-fold
performance improvement, which can be explained
as follows. The AMD GPU architecture has only one branch
execution unit for five processing cores, as discussed in
Section II-C. Hence, branching on an AMD GPU incurs
a large performance penalty, because each branch takes five
times as long as a branch on the NVIDIA GPU architecture,
for example.
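
To make the transformation concrete, the following is a minimal sketch of kernel splitting, written as plain C that emulates work-items with a host loop rather than as actual OpenCL kernels. It assumes the branch condition (here a hypothetical `mode` flag) is the same for all work-items and is known on the host before launch. The names `op_a`, `op_b`, `kernel_unsplit`, `kernel_mode0`, `kernel_mode1`, and the `launch_*` helpers are illustrative and do not come from the implementation evaluated in Figure 5.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-element operations standing in for the two branch bodies. */
float op_a(float x) { return 2.0f * x + 1.0f; }
float op_b(float x) { return x * x; }

/* Unsplit "kernel": every work-item evaluates the branch on mode. */
void kernel_unsplit(const float *in, float *out, size_t gid, int mode)
{
    if (mode == 0)
        out[gid] = op_a(in[gid]);
    else
        out[gid] = op_b(in[gid]);
}

/* Split "kernels": one per branch outcome, each free of the branch. */
void kernel_mode0(const float *in, float *out, size_t gid)
{
    out[gid] = op_a(in[gid]);
}

void kernel_mode1(const float *in, float *out, size_t gid)
{
    out[gid] = op_b(in[gid]);
}

/* Host loop standing in for a 1-D kernel launch over n work-items. */
void launch_unsplit(const float *in, float *out, size_t n, int mode)
{
    assert(in != NULL && out != NULL);
    for (size_t gid = 0; gid < n; ++gid)
        kernel_unsplit(in, out, gid, mode);
}

/* With splitting, the branch is evaluated once on the host,
   which then launches the matching branch-free kernel. */
void launch_split(const float *in, float *out, size_t n, int mode)
{
    assert(in != NULL && out != NULL);
    if (mode == 0) {
        for (size_t gid = 0; gid < n; ++gid)
            kernel_mode0(in, out, gid);
    } else {
        for (size_t gid = 0; gid < n; ++gid)
            kernel_mode1(in, out, gid);
    }
}

/* Returns 1 if both variants produce identical output for the given mode. */
int split_matches_unsplit(int mode)
{
    enum { N = 8 };
    float in[N], ref[N], split[N];
    for (size_t i = 0; i < N; ++i)
        in[i] = (float)i * 0.5f;
    launch_unsplit(in, ref, N, mode);
    launch_split(in, split, N, mode);
    for (size_t i = 0; i < N; ++i)
        if (ref[i] != split[i])
            return 0;
    return 1;
}
```

In this sketch the unsplit variant evaluates the conditional once per work-item, whereas the split variant evaluates it once per launch on the host. The sketch does not cover branches whose condition varies between work-items, which cannot be hoisted to the host in this way.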