operations at network line rate, without any measurable performance
degradation, within the router (the green area in
Figure 2). The user can define different logical trees using
Device Control Registers (DCRs) that identify the external input
links, a local contribution, and a single output link to which
the combined result of all inputs is forwarded. The
root of a tree has no defined output. There are 64 different
classes of DCRs per node that can implement a wide variety
of user-defined collective reduction trees. The router is
equipped with a floating-point unit and an integer unit; the latter can
operate on integer operands of up to 512 bytes (equivalent to
a uint4096/int4096). The collective operations are guaranteed
to be reproducible, returning exactly the same result
across multiple executions. We have used these powerful features
to perform non-blocking allreduces on arrays of 64-bit
counters and to implement more sophisticated wavefront algorithms that
perform bitwise parallel-prefix operations across the entire machine.
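
The logical-tree description above can be illustrated with a small software model. The sketch below is not the router's programming interface: the names TreeEntry, reduce_up and allreduce are hypothetical, the DCR contents are reduced to a list of input links plus one output link, and the layout of sixty-four 64-bit counters inside one 512-byte operand, with wrap-around addition, is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

OPERAND_BYTES = 512                                 # largest integer operand named in the text
COUNTER_BITS = 64
COUNTERS_PER_OPERAND = OPERAND_BYTES * 8 // COUNTER_BITS
MASK = (1 << COUNTER_BITS) - 1                      # assumed wrap-around at 64 bits


@dataclass
class TreeEntry:
    """Hypothetical stand-in for one DCR class on one node."""
    children: List[int]      # nodes attached to the external input links
    parent: Optional[int]    # node on the single output link; None marks the root


def reduce_up(tree: Dict[int, TreeEntry], local: Dict[int, List[int]], node: int) -> List[int]:
    """Combine a node's local contribution with the partial results of its inputs."""
    if len(local[node]) > COUNTERS_PER_OPERAND:
        raise ValueError("contribution does not fit in one 512-byte operand")
    total = list(local[node])
    for child in tree[node].children:
        partial = reduce_up(tree, local, child)
        total = [(a + b) & MASK for a, b in zip(total, partial)]
    return total


def allreduce(tree: Dict[int, TreeEntry], local: Dict[int, List[int]]) -> Dict[int, List[int]]:
    """Reduce towards the root, then hand the same result to every node."""
    root = next(n for n, entry in tree.items() if entry.parent is None)
    result = reduce_up(tree, local, root)
    return {n: list(result) for n in tree}


if __name__ == "__main__":
    # Four nodes: 0 is the root, 1 and 2 feed it, 3 feeds 2.
    tree = {
        0: TreeEntry([1, 2], None),
        1: TreeEntry([], 0),
        2: TreeEntry([3], 0),
        3: TreeEntry([], 2),
    }
    local = {n: [n + 1, 10 * n] for n in tree}
    print(allreduce(tree, local)[0])
```

Because integer addition is associative and commutative, this model returns the same counters for any tree shape. The sketch does not model floating-point operands, and it says nothing about how the hardware achieves reproducibility for them.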