On BG/Q, the Message Unit (MU) [7, 9], as shown in Figure 2,
bridges the 5D Torus network and the memory subsystem and is
designed to provide ultra-low latency and high throughput. It contains injection and reception control logic, which manage message sending and receiving, plus global barrier control logic that provides the
barrier and collective functionality integrated onto the same
physical torus network. The MU also supports atomic operations,
L2 atomic, and L2 prefetching (i.e., reading messages from main
memory and loading them into L2). On the sending side, the injection control logic interprets the message descriptor provided by
the software and fetches the message contents from memory to
send them into the network. When a message arrives, the reception control logic writes it into the appropriate location in the memory
system (if possible, directly into the L2 cache). The hardware provides efficient mechanisms to poll the network device at user level
to detect the arrival of new packets. BG/Q’s system software provides highly optimized inline C functions, through the System Programming
Interface (SPI), for programming the MU and the torus interconnect.
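To make the descriptor-driven send path and the user-level polling described above more concrete, the following is a minimal, single-threaded sketch in C. It is a toy software model and not the SPI: every name in it (toy_descriptor, toy_packet, toy_fifo, toy_inject, toy_poll) and both size constants are hypothetical and chosen only for illustration. A single ring buffer stands in for the injection logic, the network, and the reception FIFO, so the sketch shows the control flow only and says nothing about real packet formats, memory ordering, or performance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FIFO_SLOTS   8   /* toy ring size (hypothetical); a power of two */
#define PACKET_BYTES 32  /* toy payload limit (hypothetical), not the torus packet size */

/* Hypothetical message descriptor: tells the "hardware" where the payload
 * lives in memory, how long it is, and where it should go. */
typedef struct {
    const void *payload;
    size_t      length;  /* in bytes, at most PACKET_BYTES in this toy */
    uint32_t    dest;    /* opaque destination identifier */
} toy_descriptor;

typedef struct {
    uint8_t  data[PACKET_BYTES];
    size_t   length;
    uint32_t dest;
} toy_packet;

/* One ring buffer standing in for the network plus the reception FIFO. */
typedef struct {
    toy_packet slots[FIFO_SLOTS];
    unsigned   head;  /* next slot to read, advanced by the poller */
    unsigned   tail;  /* next slot to write, advanced by the injector */
} toy_fifo;

/* Injection side: interpret the descriptor, fetch the payload from memory,
 * and deposit it as a packet. Returns 0 on success, -1 if the payload is
 * too long or the ring is full. */
int toy_inject(toy_fifo *f, const toy_descriptor *d)
{
    assert(f != NULL && d != NULL);
    if (d->length > PACKET_BYTES)
        return -1;
    if (f->tail - f->head == FIFO_SLOTS)
        return -1;

    toy_packet *p = &f->slots[f->tail % FIFO_SLOTS];
    if (d->length > 0)
        memcpy(p->data, d->payload, d->length);
    p->length = d->length;
    p->dest   = d->dest;
    f->tail++;
    return 0;
}

/* Reception side: a non-blocking poll made directly by the application.
 * Copies at most out_size bytes of the oldest packet into out and consumes
 * that packet. Returns the number of bytes copied, or -1 if nothing has
 * arrived. */
long toy_poll(toy_fifo *f, void *out, size_t out_size)
{
    assert(f != NULL && out != NULL);
    if (f->head == f->tail)
        return -1;

    const toy_packet *p = &f->slots[f->head % FIFO_SLOTS];
    size_t n = p->length < out_size ? p->length : out_size;
    if (n > 0)
        memcpy(out, p->data, n);
    f->head++;
    return (long)n;
}
```

In this sketch a sender fills in a toy_descriptor and calls toy_inject, while a receiver calls toy_poll in a loop until it returns a non-negative byte count. On real hardware the two sides run concurrently, which would require memory barriers that this single-threaded model leaves out.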
Figure 2 sketches the main features of BG/Q that we
exploit in our communication library; a detailed description
is beyond the scope of this paper.