The Naive implementation reads nine input pixels from, and writes one
output pixel to, off-chip memory. Threads are organized in blocks of
16 by 16. This implementation exploits only whatever locality the
input pixels exhibit within the 16×16 block.
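A minimal sketch of such a naive kernel, assuming a 3×3 filter and
single-precision pixels (the names conv2d_naive and coeff, and the
border handling, are illustrative, not the paper's exact code):

// Naive 2D convolution: every thread fetches its full 3x3 neighborhood
// from off-chip (global) memory and writes one output pixel, so nothing
// is re-used beyond what the 16x16 block happens to share.
__global__ void conv2d_naive(const float *in, float *out,
                             const float *coeff, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // launched as 16x16 blocks
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1)
        return;                                     // skip the image border

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; dy++)                // nine global-memory reads
        for (int dx = -1; dx <= 1; dx++)
            sum += coeff[(dy + 1) * 3 + (dx + 1)]
                 * in[(y + dy) * width + (x + dx)];
    out[y * width + x] = sum;                       // one global-memory write
}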
In the other implementations, threads are organized in a vector of 512
threads, matching the width of the image. Since the GPU used has 14
SMs, each thread block processes a chunk of 36 or 37 lines (512/14) of
the image, such that previously loaded lines can be re-used.
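A host-side sketch of this organization (all names are hypothetical;
conv2d_by_line stands in for the By line variant sketched below, and a
512×512 image is assumed):

// One 512-thread block per SM, each covering a contiguous chunk of
// lines. A simple ceiling split gives 37 lines per block; the paper's
// variants balance the chunks at 36 or 37 lines.
int width = 512, height = 512, numSMs = 14;
int linesPerBlock = (height + numSMs - 1) / numSMs;  // ceil(512/14) = 37
conv2d_by_line<<<numSMs, width>>>(d_in, d_out, d_coeff,
                                  width, height, linesPerBlock);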
In the By line implementations, this re-use is achieved implicitly, by
relying on the L1 cache in each SM.
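A minimal sketch of such a By line kernel, assuming one thread per
image column and a lines parameter giving the chunk size (names and
border handling are illustrative):

// Each thread walks down its column. Of the nine pixels in every 3x3
// window, six were already read on the previous iteration, so the
// re-use comes from L1 cache hits rather than explicit staging.
__global__ void conv2d_by_line(const float *in, float *out,
                               const float *coeff,
                               int width, int height, int lines)
{
    int x = threadIdx.x;                        // column index, 0..width-1
    int y0 = max(blockIdx.x * lines, 1);        // first line of this chunk
    int yEnd = min(blockIdx.x * lines + lines, height - 1);
    if (x < 1 || x >= width - 1) return;        // skip the image border

    for (int y = y0; y < yEnd; y++) {
        float sum = 0.0f;
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++)
                sum += coeff[(dy + 1) * 3 + (dx + 1)]
                     * in[(y + dy) * width + (x + dx)];
        out[y * width + x] = sum;
    }
}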
In the Shared memory implementations, the re-use is managed manually,
by staging rows of the image in the shared memory of each SM.
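A sketch of this manual staging, assuming a 512-pixel-wide image so
that one 512-thread block loads a full row at a time; a rotating
three-row buffer is used, borders are clamped or skipped for brevity,
and all names are illustrative:

// Three image rows live in shared memory; after each output row the
// oldest staged row is overwritten by the next row down the image.
__global__ void conv2d_smem(const float *in, float *out,
                            const float *coeff,
                            int width, int height, int lines)
{
    __shared__ float row[3][512];               // assumes width == 512
    int x = threadIdx.x;
    int y0 = blockIdx.x * lines;
    int yEnd = min(y0 + lines, height - 1);

    for (int i = 0; i < 3; i++)                 // preload rows y0-1..y0+1,
        row[i][x] = in[max(y0 + i - 1, 0) * width + x];  // top clamped
    __syncthreads();

    for (int y = y0; y < yEnd; y++) {
        if (x >= 1 && x < width - 1) {
            float sum = 0.0f;
            for (int dy = 0; dy < 3; dy++)      // rows y-1, y, y+1
                for (int dx = -1; dx <= 1; dx++)
                    sum += coeff[dy * 3 + (dx + 1)]
                         * row[(y - y0 + dy) % 3][x + dx];
            out[y * width + x] = sum;
        }
        __syncthreads();                        // reads of oldest row done
        if (y + 2 < height)                     // rotate the buffer
            row[(y - y0) % 3][x] = in[(y + 2) * width + x];
        __syncthreads();
    }
}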
The third and fifth implementations (annotated with (R)) add an extra
level of re-use by keeping previously loaded lines in registers.
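A sketch of that register level, layered on the By line variant above:
the nine window values are held in per-thread registers and shifted
down one row per iteration, so only the bottom row is fetched
(illustrative only):

// Only three new pixels are loaded per output pixel; the other six
// are re-used from the registers filled on the previous iteration.
__global__ void conv2d_regs(const float *in, float *out,
                            const float *coeff,
                            int width, int height, int lines)
{
    int x = threadIdx.x;
    int y0 = max(blockIdx.x * lines, 1);
    int yEnd = min(blockIdx.x * lines + lines, height - 1);
    if (x < 1 || x >= width - 1) return;

    float top[3], mid[3], bot[3];               // kept in registers
    for (int dx = -1; dx <= 1; dx++) {          // prime the upper rows
        top[dx + 1] = in[(y0 - 1) * width + x + dx];
        mid[dx + 1] = in[y0 * width + x + dx];
    }
    for (int y = y0; y < yEnd; y++) {
        for (int dx = -1; dx <= 1; dx++)        // only the new row is loaded
            bot[dx + 1] = in[(y + 1) * width + x + dx];
        float sum = 0.0f;
        for (int i = 0; i < 3; i++)
            sum += coeff[i] * top[i] + coeff[3 + i] * mid[i]
                 + coeff[6 + i] * bot[i];
        out[y * width + x] = sum;
        for (int i = 0; i < 3; i++) {           // shift the window down
            top[i] = mid[i];
            mid[i] = bot[i];
        }
    }
}

Note that such small per-thread arrays only stay in registers if the
compiler fully unrolls the loops that index them (e.g. via #pragma
unroll); otherwise they spill to local memory.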
All of these implementations, except Naive, outperform the
NVIDIA CUDA SDK implementations of 2D convolution.