As we move towards efficient exascale systems, heterogeneous accelerators like NVIDIA GPUs are becoming a significant compute
component of modern HPC clusters. It has become important to
utilize every single cycle of every compute device available in the
system. From NICs to GPUs to co-processors, heterogeneous compute resources are the path forward. Another important
trend, especially with the introduction of non-blocking collective
communication in the latest MPI standard, is the overlapping of communication with computation. Achieving such overlap has become an important design goal for messaging libraries like MVAPICH2 and OpenMPI. In this paper, we present a benchmark that allows users of different MPI libraries to evaluate the performance of GPU-Aware Non-Blocking Collectives. The main performance metrics are overlap and latency. We provide insights on designing a GPU-Aware
benchmark and discuss the challenges associated with identifying
and implementing performance parameters like overlap, latency,
the effect of MPI_Test() calls used to progress communication, the effect of independent GPU communication proceeding concurrently with the overlapped computation, and the effect of the complexity, target, and scale of that overlapped computation. To illustrate
the efficacy of the proposed benchmark, we provide a comparative
performance evaluation of GPU-Aware Non-Blocking Collectives
in MVAPICH2 and OpenMPI.
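As a concrete illustration of the measurement pattern discussed above, the sketch below (our own minimal example, not the proposed benchmark's actual code) times a GPU-aware MPI_Iallreduce on CUDA device buffers once without and once with overlapped computation, calling MPI_Test() periodically to progress the operation. The buffer size, the host-side dummy compute, and the simplified overlap formula are illustrative assumptions; it assumes a CUDA-aware MPI library (e.g., MVAPICH2 or OpenMPI built with CUDA support) so device pointers can be passed directly to MPI.

```c
/* Minimal sketch of overlap/latency measurement for a GPU-aware
 * non-blocking collective. Not the benchmark's implementation. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

/* Placeholder compute; the real benchmark would vary its complexity,
 * target (CPU or GPU), and scale. */
static void dummy_compute(double seconds)
{
    double start = MPI_Wtime();
    while (MPI_Wtime() - start < seconds)
        ;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                 /* 1M doubles per rank (assumed size) */
    double *d_send, *d_recv;
    cudaMalloc((void **)&d_send, count * sizeof(double));
    cudaMalloc((void **)&d_recv, count * sizeof(double));
    cudaMemset(d_send, 0, count * sizeof(double));

    MPI_Request req;

    /* 1. Pure communication latency: non-blocking collective on device
     *    buffers followed by an immediate wait. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Iallreduce(d_send, d_recv, count, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double t_pure = MPI_Wtime() - t0;

    /* 2. Same collective overlapped with computation, with periodic
     *    MPI_Test() calls so the library can progress the operation. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Iallreduce(d_send, d_recv, count, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);
    int done = 0;
    double compute_time = 0.0;
    while (!done) {
        dummy_compute(t_pure / 10.0);          /* a slice of "useful" work */
        compute_time += t_pure / 10.0;
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
    double t_overlap = MPI_Wtime() - t0;

    /* Simplified overlap estimate: fraction of the pure communication time
     * hidden behind the injected computation. */
    double overlap = 100.0 * (1.0 - (t_overlap - compute_time) / t_pure);
    if (overlap < 0.0)   overlap = 0.0;
    if (overlap > 100.0) overlap = 100.0;
    if (rank == 0)
        printf("latency = %f s, estimated overlap = %.1f %%\n", t_pure, overlap);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```

With a CUDA-aware build, this would typically be compiled and launched with the library's own wrappers, e.g. `mpicc` and `mpirun`; the actual benchmark additionally sweeps message sizes, compute targets, and MPI_Test() frequencies rather than using the single fixed configuration shown here.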