3.1 A Resilient, Asynchronous Broadcast
The revocation notification needs to be propagated to all alive processes in the specified communicator, even when new failures happen during the Revoke propagation. Therefore, it is in essence a reliable broadcast. Among the four defining qualities of a reliable broadcast usually considered in the literature (Termination, Validity, Integrity, Agreement) [19], the non-uniform variants of the properties are sufficient, and the integrity criteria can be relaxed in the
context of the Revoke algorithm. First, the agreement and validity properties ensure that
if a process broadcasts a value v, all processes deliver v. In the uniform-agreement case, that property extends to failed processes: if a failed process had delivered the value, then
it must be delivered at all correct processes. In the Revoke operation, if failures kill the initiator as well as all the already notified processes, the Revoke notification is indeed
lost, and surviving processes may never receive the notification. However, either correct processes are not expecting messages from the set of dead processes, therefore no operation can deadlock, or at least a correct process is directly trying to exchange messages with a dead process and will detect its failure, which means that its blocking operations will complete in error, and leave the opportunity for the application to reissue the Revoke operation. In all cases, a nonuniform reliable broadcast is sufficient to ensure deadlock free operation. This is of practical significance, because the reliable broadcast respecting the uniform-agreement property requires that the system is free of send-omission failures (that is, a send has completed, but the receiver does not receive the message) [19]. In MPI, when a send operation
completes, it does not mean that the receiver has delivered the message; the message may still be buffered on the sender process, and when that process is the victim of a crash failure, it may thereby simultaneously commit a send-omission failure. Ensuring that the network is free of send-omission failures requires the acknowledgement of sent messages, or additional rounds of message exchanges before delivering the reliable broadcast. As Revoke can be implemented with a
non-uniformly agreeing reliable broadcast, that extra cost is spared.Second, the integrity property states that a message is delivered once at most, and variants with additional ordering
properties exist, like FIFO or causal ordering between the delivery of different broadcasts. In the case of a Revoke notification, the first Revoke message to reach the process has
the effect of immutably altering the state of the communicator. Supplementary deliveries of Revoke messages for the same communicator have no effect. Similarly, if multiple initiators concurrently broadcast a Revoke notification on the same communicator, the order in which these notifications are delivered has no importance, as the final outcome is always a switch to an immutable revoked state. Therefore, we can retain a non-ordered, relaxed integrity reliable broadcast, in which we allow multiple out-of-order deliveries, but retain the reasonable assumption that Revoke messages do not appear out of “thin air”. Then, as long as the algorithm
still ensures the non-uniform agreement property, there are no opportunities for inconsistent views. These simplified requirements are crucial for decreasing the cost of the Revoke operation, as the size of the messages and the number of message exchanges rounds can be drastically increased when one needs to implement an ordered, uniform reliable broadcast. Given the non-uniform agreement, the no-ordering, and loose integrity properties, in the Revoke reliable broadcast, a process that receives its first Revoke message can perform a single round of emissions to all its neighbors, with a constant message size, and then deliver the Revoke notification immediately, without further verifications. The last important aspect is the topology of the overlaynetwork employed to perform the broadcast operation. In
the reliable broadcast algorithm, when a process receives a
broadcast message for the first time, it immediately broadcasts that same message to all its neighbors in the overlay graph. The agreement property can be guaranteed only
when failures do not disconnect the overlay graph. In early
prototype versions of the ulfm implementation, the reliable
broadcast procedure employed a fully connected network
(which guarantees that disconnected cliques never form).
Obviously, this method scales poorly as, with the number
or processes, the graph degree is linear, and the number of
exchanged messages is quadratic. In practice, at scale, the
large graph degree resulted in the application aborting due
to resource exhaustion (too many open channels simultaneously, not enough memory for unexpected messages, etc.).Therefore, one needs to consider a more scalable overlay
topolog