many others [10,24,25]. These systems often adopt traditional parallel
programming models, such as the Bulk Synchronous Parallel
model (BSP) [28] and Map-Reduce [11], or more dynamic approaches
[14].
Graph algorithms expose the limitations of conventional architectures,
in particular when accessing large data sets [4,26]. The memory
access patterns are often irregular with little spatial locality and
data reuse, resulting in a high cache miss rate. Graph applications
usually handle sparse data sets with relatively simple accompanying
computational operations. Performance is therefore dominated
by the data transfers as opposed to the computational capabilities,
to sustain a large number of fine-grained, and nearly random remote
data accesses across the full memory system and interconnect.
Our recent experiences with graph algorithms on large-scale distributed
memory machines, show that the network design and how
the network is accessed and utilized,