2. BACKGROUND AND MOTIVATION
NVIDIA’s Fermi GPU architecture [8] consists of multiple
independent streaming multiprocessors (SMs) that share an off-
chip memory. Each SM has a private instruction and data
cache, a scratchpad (shared) memory, 32 cores, 16 load-store
units, 4 special function units, and two schedulers (see Fig. 1).
GPUs are programmed in an explicitly data-parallel language
such as CUDA or OpenCL. The programmer writes
code for a single thread, specifies how many threads have
to be invoked, and groups these threads into blocks, as only
threads within a block can synchronize and share data via
the shared memory.
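To make this programming model concrete, the following minimal CUDA sketch (the kernel name, sizes, and launch configuration are illustrative and not taken from the kernels studied in this paper) shows the single-thread code the programmer writes and the launch that specifies how many threads are invoked and how they are grouped into blocks:

#include <cuda_runtime.h>

// Code for a single thread: each thread computes one output element.
__global__ void scale(const float *in, float *out, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        out[i] = alpha * in[i];
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // The launch configuration specifies how many threads are invoked
    // and how they are grouped into blocks (here 256 threads per block).
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_in, d_out, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Only threads within one block can cooperate through shared memory and barrier synchronization; the simple kernel above needs neither, but kernels such as the convolution discussed next may use both.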
As an example, consider the activity graph in Fig. 2 of
an SM executing a 2D convolution kernel (see also Section
4). The SM’s activity is split into three groups: (1) integer
instructions representing address calculations and control
operations, (2) floating point instructions on actual data
and (3) load and store operations. Both the naive version
(Fig. 2a) and the optimized version (Fig. 2b) start with address
calculations, after which load instructions are issued.
After an idle period the data arrives from the off-chip memory
and floating point instructions are issued. The optimized
kernel shows fewer load operations (and corresponding address
calculations) than the naive implementation, due to
the caching of data elements in registers (see Section 4.1).
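The convolution kernels themselves are presented in Section 4; as an illustration of what caching data elements in registers means, the following hedged CUDA sketch (the kernel name conv_row3, the weights w0–w2, and COLS_PER_THREAD are our own illustrative choices, not the paper’s code) lets each thread compute several adjacent outputs of a 3-tap row filter while keeping the overlapping inputs in registers, so roughly one global load is issued per output instead of three:

#define COLS_PER_THREAD 4

__global__ void conv_row3(const float * __restrict__ in,
                          float * __restrict__ out,
                          float w0, float w1, float w2, int width)
{
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * COLS_PER_THREAD;
    if (x0 + COLS_PER_THREAD + 1 >= width) return;   // skip threads at the right border

    // Load the first two inputs of the sliding window into registers once.
    float a = in[x0];
    float b = in[x0 + 1];

    #pragma unroll
    for (int c = 0; c < COLS_PER_THREAD; ++c) {
        float d = in[x0 + c + 2];                    // only one new load per output
        out[x0 + c + 1] = w0 * a + w1 * b + w2 * d;  // 3-tap filter on registers
        a = b;                                       // shift the register window
        b = d;
    }
}

Because neighbouring outputs reuse inputs already held in registers, fewer load instructions (and the corresponding address calculations) are issued per output element.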
Although the kernel in Fig. 2b is optimized and minimizes
the number of memory loads, there are still idle cycles
where the SM is stalled waiting for data, despite the many
threads it is executing to hide latency. Furthermore, many
cycles are spent on address calculations and load instructions
rather than on computations on actual data. In 64% of the
clock cycles, at least one of the two schedulers in the SM is
idle. Of the executed instructions, 34% are floating
point instructions on actual data, resulting in only 12% of
the possible executed instructions over the duration of the
kernel being spent on computations on actual data.
