The P7-IH interconnect [1, 21] is built around a network interface/hub chip called Torrent, shown in Figure 3, which can connect up to tens of thousands of compute nodes in a two-tiered dragonfly topology. The Hub Chips are integrated into the compute nodes and are designed to provide low latency at high bandwidth. Each Hub Chip contains the PowerBus Interface for coherency operations, the Host Fabric Interface (HFI) for communication, the Integrated Switch Router (ISR) for routing, the Nest Memory Management Unit (NMMU) for address mapping, and the Collective Acceleration Unit (CAU) for collective operations. Two HFI units in the Hub Chip manage incoming and outgoing communication. To send, software prepares the message in main memory and triggers the HFI to fetch it. The HFI can extract data either from P7-IH memory or directly from the P7-IH processor’s cache and pass it to the ISR for routing to the destination. On the receiving side, the HFI can write incoming network data to memory or directly into a processor’s L3 cache, lowering data access latency for the application running on that processor. In addition, the Hub Chip supports collective operations such as barriers, reductions, and multicast directly in hardware through the CAU, a specialized unit for collective acceleration. The Network Datagram Abstraction Interface (NDAI) can be used to program the HFIs directly. While the design of P7-IH differs from that of BG/Q, their down-to-the-metal capabilities are very similar. As in