Graph analytics is arguably one of the most demanding workloads for high-performance systems and interconnection networks. Graph applications often display all-to-all, fine-grained, high-rate communication patterns that expose the limits of network protocol stacks. Load and communication imbalance generate hard-to-predict network hot-spots and may require computational steering due to unpredictable data distributions. In this paper we present a lightweight communication library, implemented “on the metal” of BlueGene/Q and POWER7 IH, that we have used to support large-scale graph algorithms on up to 96K processing nodes and 6 million threads. With this library we have explored several optimization techniques, including overlapped communication, non-blocking collectives, message aggregation, and in-network computation for special collective communication patterns such as parallel prefix. The experimental results show significant performance improvements, ranging from 5X to 10X, compared to equally optimized MPI implementations.