The GPU implementation (By line (R)) of the 3×3 convolution
has a performance of 7.8 Gpixels/s. GPU-CC achieves
a speed-up of 1.9× with a performance of 14.8 Gpixels/s.
Considering that each input and output pixel has to be transferred
at least once, GPU-CC reaches 89% of the peak off-
chip memory bandwidth. For a 5×5 convolution kernel the
GPU-CC architecture attains a speed-up of 2.4× compared
to an optimized implementation on a conventional GPU.
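
The 3×3 figures above can be cross-checked with a short calculation. The following is a minimal sketch, not the authors' measurement method: the function names are hypothetical, and the value of 4 bytes per pixel is an assumption that the text does not state. Under that assumption, one read and one write per pixel give a lower bound on off-chip traffic, and dividing by the reported 89% utilization gives the peak bandwidth that the claim would imply.

```python
# Sanity check of the reported 3x3 convolution figures.
# ASSUMPTION: 4 bytes per pixel (not stated in the text).
BYTES_PER_PIXEL = 4


def speedup(baseline_gpix_s, improved_gpix_s):
    """Ratio of improved throughput to baseline throughput."""
    return improved_gpix_s / baseline_gpix_s


def min_traffic_gb_s(gpix_s, bytes_per_pixel=BYTES_PER_PIXEL):
    """Lower bound on off-chip traffic in GB/s.

    Each pixel is read once as input and written once as output,
    hence the factor of 2.
    """
    return gpix_s * 2 * bytes_per_pixel


def implied_peak_gb_s(gpix_s, utilization, bytes_per_pixel=BYTES_PER_PIXEL):
    """Peak bandwidth implied by a throughput and a utilization fraction."""
    return min_traffic_gb_s(gpix_s, bytes_per_pixel) / utilization


if __name__ == "__main__":
    print(f"speed-up:      {speedup(7.8, 14.8):.2f}x")
    print(f"min. traffic:  {min_traffic_gb_s(14.8):.1f} GB/s")
    print(f"implied peak:  {implied_peak_gb_s(14.8, 0.89):.1f} GB/s")
```

With a different pixel size the traffic and implied peak scale linearly, so the sketch only checks the consistency of the reported numbers and does not identify the hardware.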
In the standard GPU implementation, a total of
220 thousand instructions are fetched, decoded and issued to