2. BACKGROUND AND MOTIVATION
NVIDIA’s Fermi GPU architecture [8] consists of multiple
independent streaming multiprocessors (SMs) that share an
off-chip memory. Each SM has a private instruction and data
cache, a scratchpad (shared) memory, 32 cores, 16 load-store
units, 4 special function units and two schedulers (see Fig. 1).
GPUs are programmed in an explicitly data-parallel language
such as CUDA or OpenCL. The programmer writes
code for a single thread, specifies how many threads to
invoke, and groups these threads into blocks, as only
threads within a block can synchronize and share data via
the shared memory.
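As an illustration of this programming model, the following minimal CUDA sketch (kernel and variable names are hypothetical, and the block size of 256 is an arbitrary choice) shows how each thread computes one output element, how threads within a block stage data in the shared memory and synchronize, and how the host specifies the thread count and its grouping into blocks.

__global__ void scale_shared(const float *in, float *out, float alpha, int n)
{
    __shared__ float tile[256];                     // per-block scratchpad (shared) memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index

    if (i < n)
        tile[threadIdx.x] = in[i];                  // stage data in shared memory
    __syncthreads();                                // only threads within a block can synchronize

    if (i < n)
        out[i] = alpha * tile[threadIdx.x];         // each thread produces one output element
}

// Host side: the programmer chooses the number of threads and their grouping
// into blocks, here 256 threads per block (matching the shared tile size):
//   scale_shared<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);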
As an example, consider the activity graph in Fig. 2 of
an SM executing a 2D convolution kernel (see also Section
4). The SM’s activity is split into three groups: (1) integer
instructions representing address calculations and control
operations, (2) floating point instructions on actual data
and (3) load and store operations. Both the naive version
(Fig. 2a) and the optimized version (Fig. 2b) start with address
calculations, after which load instructions are issued.
After an idle period the data arrives from the off-chip memory
and floating point instructions are issued. The optimized
kernel shows fewer load operations (and corresponding address
calculations) than the naive implementation, due to
the caching of data elements in registers (see Section 4.1).
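To make the register-caching effect concrete, the following hypothetical CUDA sketch (not the actual kernel of Section 4.1) shows the idea for a single row of a width-3 convolution: when each thread computes four adjacent outputs, the six overlapping inputs are loaded into registers once and reused, instead of the twelve loads a naive one-output-per-thread version would issue.

__global__ void conv_row_regcached(const float *in, float *out,
                                   const float *w, int width)
{
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * 4;  // first of 4 outputs per thread
    if (x0 + 6 > width) return;                             // skip incomplete tail (illustrative)

    float r[6];                                  // registers holding the overlapping inputs
    for (int i = 0; i < 6; ++i)
        r[i] = in[x0 + i];                       // 6 loads cover all 4 outputs

    for (int o = 0; o < 4; ++o)                  // reuse registers; no further loads
        out[x0 + o] = w[0] * r[o] + w[1] * r[o + 1] + w[2] * r[o + 2];
}

Fewer loads also imply fewer address calculations, which matches the difference visible between Fig. 2a and Fig. 2b.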
Although the kernel in Fig. 2b is optimized and minimizes
the number of memory loads, there are still idle cycles
where the SM is stalled waiting for data, despite the many
threads it is executing to hide latency. Furthermore, many
cycles are spent on address calculations and load instructions
rather than on calculations on actual data. In 64% of the
clock cycles at least one of the two schedulers in the SM is
idle. Of the executed instructions, 34% are floating point
instructions on actual data, resulting in only 12% of the
possible executed instructions over the duration of the
kernel being spent on computations on actual data.
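A back-of-the-envelope check of these numbers, under the assumption that the possible executed instructions are counted as two issue slots per cycle (one per scheduler), relates the two percentages as

\[
  \underbrace{\frac{\text{issued instructions}}{\text{available issue slots}}}_{\approx\,0.12/0.34\,\approx\,0.35}
  \times
  \underbrace{\frac{\text{floating point instructions on actual data}}{\text{issued instructions}}}_{0.34}
  \approx 0.12 ,
\]

i.e., only about a third of the available issue slots are filled at all, and of the instructions that are issued only about a third operate on actual data.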