Graph analytics are arguably one of the most demanding workloads for high-performance systems and interconnection networks. Graph applications often display all-to-all, fine-grained, high-rate communication patterns that expose the limits of the network protocol stacks. Load and communication imbalance generate hard-to-predict network hot-spots, and may require computational steering due to unpredictable data distributions. In this paper we present a lightweight communication library, implemented “on the metal” of BlueGene/Q and POWER7 IH, that we have used to support large-scale graph algorithms on up to 96K processing nodes and 6 million threads. With this library we have explored several optimization techniques, including overlapped communication, non-blocking collectives, message aggregation, and computation in the network for special collective communication patterns, such as parallel prefix. The experimental results show significant performance improvements, ranging from 5X to 10X, when compared to equally optimized MPI implementations.