cation other than graphics, such as electronic design automation,
medical imaging, and signal processing. Programmable
GPUs are found not only in desktop computers, but also
in mobile devices such as tablets and in supercomputers,
all of which share the need for a large amount of energy-efficient
compute power.
GPUs spend most of their hardware on many small (but
heavily pipelined) ‘cores’, with no branch prediction, no
speculative execution and only small caches. Instructions
are issued in SIMD-style vectors, and latency is hidden by
concurrently executing many independent vectors, resulting
in a high-performance, energy-efficient SIMT architecture.
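As a concrete, generic illustration of this execution model (the kernel below is a textbook example, not specific to this work), a simple CUDA SAXPY kernel is shown: the 32 threads of a warp issue each instruction in lock-step on different data elements, and the SM hides memory latency by switching among the many resident warps rather than through large caches or speculation.

// Minimal SAXPY kernel: every thread computes y[i] = a * x[i] + y[i].
// All 32 threads of a warp execute each instruction in lock-step (SIMT);
// while one warp waits for its loads of x[i] and y[i] to return, the
// warp scheduler issues instructions from other resident warps, hiding
// the latency without branch prediction or speculative execution.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launched with far more threads than cores, e.g.
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
// so that each SM has enough independent warps to keep its pipelines busy.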
The number of cores on GPGPUs has increased from
just over a hundred in 2006 [5] to thousands in 2013 [9],
an increase of 21× in just 6.5 years. In the same period,
performance (GFLOPS) has increased by ‘only’ 9×, and energy
efficiency (GFLOPS/W) by a mere 5×. Power consumption
(TDP), moreover, has been capped at a ceiling of 250 W since 2008,
while clock frequencies have decreased. Together this
reveals a trend in which more parallelism through more cores is
preferred over higher clock frequencies, i.e. more hardware is spent
in order to increase performance and energy efficiency.
Simply adding more cores to a GPU does not result in
an equivalent increase in performance or energy efficiency.
Moreover, GPUs spend many cycles on data movement and
control. In this work we propose an extension to the current
GPU architecture in which the cores in an SM can be configured
in a network with direct communication, creating
a spatial computing architecture. Furthermore, each core
executes a fixed instruction, significantly reducing the number of
instruction fetches and decodes. The data movement and control of an
application are made implicit in the network, freeing up the
cores for computations on actual data. By better utilizing
the available cores, this increases performance and
energy efficiency, while adding only a relatively small amount
of hardware and preserving the original GPU functionality.
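To illustrate the idea, the sketch below is a host-side software model of such a spatial configuration, assuming a hypothetical mapping of the SAXPY example above onto fixed-instruction cores; it is not the configuration mechanism itself. Each ‘core’ is modelled as a single-operation stage whose result flows directly to its consumer, so the loop control and operand movement that a conventional kernel spells out as instructions are instead implied by the structure of the pipeline.

// Illustrative software model only (the stage names and the mapping are
// assumed, not taken from the proposed hardware): each stage stands for
// a core that executes one fixed instruction and forwards its result
// over a direct link to the next core in the configured network.
#include <cstdio>

// Fixed-instruction "cores": each performs exactly one operation.
static inline float core_mul(float a, float xi)   { return a * xi; }
static inline float core_add(float axi, float yi) { return axi + yi; }

// Elements stream through the fixed pipeline; loop control and the
// movement of intermediate values are implicit in the stage-to-stage
// connections instead of being executed as separate instructions.
void spatial_saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i) {
        float xi  = x[i];               // core 0: load x
        float axi = core_mul(a, xi);    // core 1: multiply
        float yi  = y[i];               // core 2: load y
        y[i]      = core_add(axi, yi);  // cores 3-4: add, store y
    }
}

int main()
{
    float x[4] = {1, 2, 3, 4}, y[4] = {4, 3, 2, 1};
    spatial_saxpy(4, 2.0f, x, y);
    for (float v : y) printf("%g ", v); // prints: 6 7 8 9
    printf("\n");
    return 0;
}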