Message coalescing is one of the key design considerations of our runtime for increasing the effective bandwidth seen at user level. The runtime aggregates messages on a per-destination basis into a coalescing queue, which can be either shared or private. Figure 7 shows an example of a shared coalescing queue: the runtime preallocates buffers (coalescing queues), one per destination node (Dest0, Dest1, ..., Destn−1), which are shared across all the threads on the same node. When threads generate messages, these are pushed into the coalescing queues for their respective destinations. Because multiple threads may try to write to the same queue at any given moment, this operation requires coordination among threads. With private coalescing queues, each thread owns a set of preallocated buffers for all possible destination nodes, so enqueueing messages requires no coordination or locking. However, memory consumption, compared with the shared-queue case, increases with the number of active threads on each node.