performed using up to 96 BG/Q racks (98,304 nodes) of LLNL’s
Sequoia. We used a bulk-synchronous two-phase approach (one
phase to size the communication buffers, one phase to fill them)
for the MPI implementation and applied common optimizations, including
multiple passes over small buffers and double-buffering of the
communication. Both implementations used the direction-optimization technique [2].
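
The two-phase scheme described above (size the buffers, then fill them) can be illustrated with a minimal single-process sketch. It is not the code used in the experiments: it omits MPI entirely, and the names `owner` and `pack_two_phase`, as well as the modulo vertex distribution, are assumptions made only for illustration.

```python
from typing import List, Tuple


def owner(vertex: int, num_ranks: int) -> int:
    # Hypothetical 1D distribution (an assumption, not from the paper):
    # vertex v is owned by rank v mod num_ranks.
    return vertex % num_ranks


def pack_two_phase(
    targets: List[int], num_ranks: int
) -> Tuple[List[int], List[int], List[int]]:
    """Pack outgoing vertex ids into one contiguous send buffer.

    Returns (send_buffer, counts, offsets), where the messages for rank r
    occupy send_buffer[offsets[r] : offsets[r] + counts[r]].
    """
    # Phase 1: count how many entries go to each destination rank,
    # so the buffer can be sized exactly before anything is written.
    counts = [0] * num_ranks
    for v in targets:
        counts[owner(v, num_ranks)] += 1

    # An exclusive prefix sum turns the counts into per-rank start offsets.
    offsets = [0] * num_ranks
    for r in range(1, num_ranks):
        offsets[r] = offsets[r - 1] + counts[r - 1]

    # Phase 2: traverse the same data again and write each entry
    # into the next free slot of its destination's region.
    send_buffer = [0] * len(targets)
    cursor = offsets.copy()
    for v in targets:
        r = owner(v, num_ranks)
        send_buffer[cursor[r]] = v
        cursor[r] += 1

    return send_buffer, counts, offsets


if __name__ == "__main__":
    buf, counts, offsets = pack_two_phase([5, 2, 8, 3, 6], num_ranks=3)
    print(buf, counts, offsets)
```

In an MPI code, the per-rank counts and offsets produced this way would typically play the role of the send counts and displacements of a collective such as `MPI_Alltoallv`; whether the implementation evaluated here used that particular collective is not stated in the text.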