Graph analytics are arguably one of the most demanding workloads
for high-performance systems and interconnection networks. Graph
applications often display all-to-all, fine-grained, high-rate communication patterns that expose the limits of the network protocol stacks.
Load and communication imbalance generate hard-to-predict network hot-spots, and may require computational steering due to unpredictable data distributions. In this paper we present a lightweight
communication library, implemented “on the metal” of BlueGene/Q
and POWER7 IH that we have used to support large-scale graph
algorithms up to 96K processing nodes and 6 million threads. With
this library we have explored several optimization techniques, including overlapped communication, non-blocking collectives, message aggregation, and computation in the network for special collective communication patterns, such as parallel prefix. The experimental results show significant performance improvements, ranging
from 5X to 10X, when compared to equally optimized MPI implementations