6. CONCLUSIONS & FUTURE WORK
In this paper we proposed the GPU-CC architecture, which adds an extra mode of computation to contemporary GPU architectures to better utilize their computational resources. By configuring the cores of a GPU in a network with direct communication, performance improves by 1.9× and 2.4× for the 3×3 and 5×5 convolution examples, respectively, while the instruction fetch and decode count is reduced significantly, resulting in an estimated 12% reduction in power consumption, at the cost of an extra 12.4% of memory space on the GPU.

Table 1: Performance of five versions of 2D convolution (3×3) for a 512×512 image on an NVIDIA GTX 470 and on the GPU-CC architecture.

Version            Performance      Speed-up
Naive              3.5 Gpixels/s    1.0
By line            6.4 Gpixels/s    1.8
By line (R)        7.8 Gpixels/s    2.2
Shared memory      4.7 Gpixels/s    1.3
Shared memory (R)  4.7 Gpixels/s    1.3
GPU-CC             14.8 Gpixels/s   4.2
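The speed-up column in Table 1 follows directly from the throughput ratios against the naive version; a minimal sketch (with the table's numbers hard-coded, as an illustration only) confirms the arithmetic, including the 1.9× figure quoted in the text, which is consistent with GPU-CC measured against the fastest conventional GPU version rather than the naive one:

```python
# Throughputs from Table 1, in Gpixels/s (hard-coded from the paper's data).
throughputs = {
    "Naive": 3.5,
    "By line": 6.4,
    "By line (R)": 7.8,
    "Shared memory": 4.7,
    "Shared memory (R)": 4.7,
    "GPU-CC": 14.8,
}

# The table's speed-up column is relative to the naive GPU implementation.
baseline = throughputs["Naive"]
for version, gpix in throughputs.items():
    print(f"{version}: {gpix / baseline:.1f}x")  # e.g. GPU-CC: 4.2x

# The 1.9x figure in the text is consistent with GPU-CC relative to the
# fastest conventional version, "By line (R)": 14.8 / 7.8 ~= 1.9.
print(f"GPU-CC vs best GPU version: "
      f"{throughputs['GPU-CC'] / throughputs['By line (R)']:.1f}x")
```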
In future work we plan a more thorough analysis of the FIFO buffer sizes, the number of data lanes, and possibly other interconnect topologies for the GPU-CC architecture. We also plan to quantify the energy consumption benefits using GPGPU-Sim's power model [4]. Furthermore, we plan to improve programmability and to experiment with a broader range of applications, including applications that require more instructions than the number of cores available.