GPU-CC ARCHITECTURE
To better utilize the available cores in the GPU, we propose
the GPU-CC architecture, which allows the cores in an
SM to be configured as a network with direct communication,
creating a spatial computing architecture. By moving
data directly from one core to the next, data movement and
control are made implicit in the network and the instruction
count can be reduced. Furthermore, each core is assigned one
fixed instruction, which it executes for the entire duration of
the kernel. This instruction is stored in a local configuration
register and has to be loaded only once.
The standard GPU architecture is preserved, and no hardware
blocks are removed. This assures backwards compatibility
for current GPU programs, and programs which do not benefit
from the GPU-CC architecture can use the standard
GPU architecture as is. Only configuration registers
and a communication network with FIFO buffers are added.
The programmer can switch between the GPU's standard
architecture and the GPU-CC architecture at run-time, and
specifies each core's GPU-CC instruction and connections by
hand in assembly. Compiler support is planned as future work.
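As an illustration, consider what one core's configuration has
to encode: its fixed instruction, the producers feeding its input
operands, and the lane on which its result leaves the core. The
sketch below is ours, written in C rather than the actual assembly
syntax; all field names, widths and the opcode value are
assumptions.

    /* Hypothetical sketch of what a per-core GPU-CC configuration
     * encodes. Field names and widths are our assumptions; the real
     * format is the assembly the programmer writes by hand. */
    #include <stdint.h>

    typedef enum { LANE_A, LANE_B, LANE_C, LANE_D, LANE_E } lane_t;

    typedef struct {
        uint8_t opcode;       /* the one fixed instruction this core runs */
        struct {              /* per input operand: which producer to     */
            uint8_t src_core; /* listen to, and on which lane (this is    */
            lane_t  src_lane; /* the consumer-side mux setting)           */
        } in[3];
        lane_t  out_lane;     /* lane on which the result is sent out     */
        uint8_t active;       /* unused cores can be switched off         */
    } core_config_t;

    /* Example: core 5 multiply-adds values produced by cores 1, 2 and 3
     * and sends the result out on lane D. */
    static const core_config_t core5 = {
        .opcode   = 0x2A,     /* hypothetical multiply-add encoding */
        .in       = { {1, LANE_A}, {2, LANE_B}, {3, LANE_C} },
        .out_lane = LANE_D,
        .active   = 1,
    };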
The cores in an SM in the GPU-CC architecture are connected
to each other via a communication network with
FIFO buffers, as shown in Fig. 3. Via five data lanes, named
A to E, cores can send data to each other's FIFOs. Because
data is passed directly between cores, the register file is not
required and can be switched off. The multiplexers in the
network are controlled by the configuration registers, creating
a static circuit-switched network for the duration of a kernel's
execution.
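The following is a minimal sketch of how this statically
configured network could behave, assuming each consumer's input
FIFO on a lane selects one producing core via its multiplexer;
the mux_select table and the FIFO layout are our assumptions, not
taken from the paper.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_CORES  32
    #define NUM_LANES   5    /* data lanes A to E            */
    #define FIFO_DEPTH 16    /* input FIFOs hold 16 elements */

    typedef struct {
        uint32_t data[FIFO_DEPTH];
        int head, count;
    } fifo_t;

    static bool fifo_full(const fifo_t *f) { return f->count == FIFO_DEPTH; }

    static void fifo_push(fifo_t *f, uint32_t v) {
        f->data[(f->head + f->count) % FIFO_DEPTH] = v;
        f->count++;
    }

    /* Static mux settings, loaded once from the configuration
     * registers: mux_select[lane][consumer] holds the producing core
     * whose output on this lane feeds the consumer's input FIFO
     * (-1 = not connected). */
    static int    mux_select[NUM_LANES][NUM_CORES];
    static fifo_t in_fifo[NUM_LANES][NUM_CORES];

    /* Deliver a value a producer puts on a lane. Because the circuit
     * is fixed for the whole kernel, no per-cycle routing decision is
     * made; only the FIFO full/empty handshake remains. */
    static bool route(int lane, int producer, uint32_t value) {
        /* The producer may only fire if every receiving FIFO has
         * space available (back-pressure). */
        for (int c = 0; c < NUM_CORES; c++)
            if (mux_select[lane][c] == producer && fifo_full(&in_fifo[lane][c]))
                return false;                    /* stall the producer */
        for (int c = 0; c < NUM_CORES; c++)
            if (mux_select[lane][c] == producer)
                fifo_push(&in_fifo[lane][c], value);
        return true;
    }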
In GPU-CC the register file and the instruction fetch and decode
units are switched off. According to the Integrated
Power and Performance model of Hong and Kim [3], 12% of
the power consumption of a GPU comes from these parts.
Presumably even more power is saved because each core executes
a fixed instruction in GPU-CC, rather than a mix of (integer and
floating-point) instructions. The power used by the communication
network is expected to be low compared to the
register file's power consumption, as it is smaller in memory
size (see below) and consists of simple FIFO buffers instead
of a multi-bank memory system with operand collectors. In
GPU-CC not all cores are used in every application, which
means some cores can be disabled, saving even more power.
Each core has three input FIFOs, as a core can execute
instructions with (up to) three input operands. The load-store
units have two input FIFOs: one for the address and
one for the data in case of a store. All FIFOs have a size of
16 elements, except the address FIFO of the load-store units,
which holds 256 elements. These sizes are determined empirically;
a more detailed evaluation is planned as future work.
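For reference, the FIFO dimensioning as constants; the 32-bit
element width is our assumption.

    /* FIFO dimensioning per the proposal (sizes determined
     * empirically in the paper). */
    enum {
        CORE_IN_FIFOS        = 3,    /* one per input operand          */
        CORE_IN_FIFO_DEPTH   = 16,
        LDST_DATA_FIFO_DEPTH = 16,   /* store data                     */
        LDST_ADDR_FIFO_DEPTH = 256,  /* deeper, to hide cache-miss     */
    };                               /* latency (see below)            */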
Cores are triggered to execute an instruction when all their
input FIFOs have a data element available and all FIFOs
of the receiving cores have space available. The latency of a
load operation in a load-store unit can be very long in case
of a cache miss, and the load-store unit only removes an item
from its FIFO once the operation has completed. Therefore its
input FIFO for addresses is made (much) larger. In addition,
the load-store unit is equipped with a new prefetch element,
which scans the address FIFO. When it detects an
address that maps to a new cache line, it generates a memory
request to fill the L1 cache with the corresponding line.
This way the load-store unit's subsequent load operations
will hit in the L1 cache, resulting in minimal stall cycles.
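The firing rule and the prefetch scan can be summarized in a few
lines. The sketch below is ours: FIFO occupancy is reduced to
simple counters, the 128-byte L1 line size is an assumption, and
the scan is simplified to compare each address only against the
last cache line seen.

    #include <stdbool.h>
    #include <stdint.h>

    #define CACHE_LINE_BYTES 128   /* assumed L1 line size */

    /* Dataflow firing rule: a core executes its fixed instruction
     * only when every input FIFO holds a data element and every FIFO
     * it writes to has space available. */
    static bool core_can_fire(const int in_count[3], int n_inputs,
                              const int out_space[], int n_outputs) {
        for (int i = 0; i < n_inputs; i++)
            if (in_count[i] == 0) return false;    /* operand missing  */
        for (int o = 0; o < n_outputs; o++)
            if (out_space[o] == 0) return false;   /* receiver is full */
        return true;
    }

    /* Prefetch element of the load-store unit: scan the (large)
     * address FIFO and issue an L1 fill request for every address
     * that falls in a new cache line, so the actual load hits in L1. */
    static void prefetch_scan(const uint32_t addr_fifo[], int count,
                              void (*issue_l1_fill)(uint32_t line_addr)) {
        uint32_t last_line = UINT32_MAX;
        for (int i = 0; i < count; i++) {
            uint32_t line = addr_fifo[i] / CACHE_LINE_BYTES;
            if (line != last_line) {
                issue_l1_fill(line * CACHE_LINE_BYTES);
                last_line = line;
            }
        }
    }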
The main hardware costs of the GPU-CC architecture
are the configuration registers and FIFO buffers. Each of
the 32 cores has a configuration register and three 16-element
FIFOs. Each of the load-store units also has a configuration
register, one 256-element FIFO and one 16-element FIFO.
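As a back-of-the-envelope estimate (our assumptions: 32-bit data
elements and 16 load-store units per SM, as in NVIDIA's Fermi
architecture), the FIFO storage amounts to:

    cores:      32 cores x 3 FIFOs x 16 elements x 4 B =  6 KiB
    load-store: 16 units x (256 + 16) elements x 4 B   = 17 KiB
    total:                                             ~ 23 KiB

This is well below the 128 KiB register file of a Fermi SM,
supporting the expectation above that the communication network's
memory footprint is small compared to the register file.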