The traditional BSP model divides a program into a sequence of supersteps separated by global barriers, as shown in Figure 5 (a). In each superstep, the local computation steps execute in parallel, after which an all-to-all communication step exchanges messages. A global synchronization then ensures that all compute nodes have completed the current superstep before any of them proceeds to the next one. Although BSP scales well on tightly coupled distributed systems, it strictly serializes computation and communication, and it requires an all-to-all exchange after the computation in every superstep. To address this issue, we relax the strict synchronization between computation and communication, as shown in Figure 5 (b). In our programming abstraction, synchronized communication is replaced with asynchronous active messages, which allow communication to overlap with computation and remove the need to store all of a superstep's messages at the same time. This space saving is perhaps even more important than the performance improvement itself, particularly on large-scale configurations.
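The buffering difference between the two schemes can be illustrated with a small sketch. The code below is a single-process simulation written for illustration only, not the runtime described in this paper: the function names `bsp_superstep` and `active_message_step`, the `queue_capacity` parameter, and the toy workload (each node sends a scaled copy of its value to every node, and the receiver sums what it gets) are all assumptions. It only contrasts how many messages must be held at once in each scheme.

```python
from collections import deque


def bsp_superstep(values, num_nodes):
    """BSP style: compute everything, then exchange, then synchronize."""
    outboxes = [[] for _ in range(num_nodes)]
    # Computation phase: every node produces all of its messages first.
    for src in range(num_nodes):
        for dst in range(num_nodes):
            outboxes[dst].append(values[src] * (dst + 1))
    # All messages of the superstep are buffered at this point.
    peak_buffered = sum(len(box) for box in outboxes)
    # Exchange + barrier: messages are consumed only after computation ends.
    totals = [sum(box) for box in outboxes]
    return totals, peak_buffered


def active_message_step(values, num_nodes, queue_capacity=2):
    """Active-message style: a handler runs as messages arrive (simulated)."""
    totals = [0] * num_nodes
    in_flight = deque()  # hypothetical bounded network queue
    peak_buffered = 0

    def handler(dst, payload):
        # The handler folds the payload into the receiver's state directly,
        # so the message itself never needs to be stored.
        totals[dst] += payload

    def drain(limit):
        # Deliver queued messages until at most `limit` remain.
        while len(in_flight) > limit:
            dst, payload = in_flight.popleft()
            handler(dst, payload)

    for src in range(num_nodes):
        for dst in range(num_nodes):
            in_flight.append((dst, values[src] * (dst + 1)))
            peak_buffered = max(peak_buffered, len(in_flight))
            # Delivery is interleaved with the ongoing computation.
            drain(queue_capacity - 1)
    drain(0)  # flush what is still in flight
    return totals, peak_buffered


if __name__ == "__main__":
    vals = [1, 2, 3, 4]
    print(bsp_superstep(vals, 4))
    print(active_message_step(vals, 4))
```

Both functions compute the same per-node totals, but the BSP variant holds every message of the superstep before consuming any, while the active-message variant never holds more than `queue_capacity` messages. The sketch relies on the handler being a commutative reduction (a sum), so delivery order does not matter; a real asynchronous runtime would also need a termination-detection mechanism, which is omitted here.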