2.3 Dependencies Between Processes
When the communication pattern is complex, a failure
can deeply disturb the application
and prevent it from implementing an effective recovery.
Consider the example in Figure 1: as long as no failure occurs, the processes are communicating in a point-to-point
pattern (called plan A). Process Pk waits to receive a
message from Pk−1, then sends a message to Pk+1 (when such processes exist). Let’s observe the effect of introducing
a failure in plan A, and consider that P1 has failed. As only
P2 communicates directly with P1, other processes do not
detect this condition, and only P2 is informed of the failure of P1. The situation at P2 now raises a dilemma: P3
waits on P2, a non-failed process, so its receive
must block until the matching send is posted at P2. However, P2 knows that P1 has failed, and that the application
should branch into its recovery procedure, plan B. If P2 were
to switch abruptly to plan B, it would cease matching the
receives P3 posted following plan A. At this point, P2 needs
an effective way of interrupting operations that it does not
intend to match anymore; otherwise, the application would
reach a deadlock, because the messages that P3 to Pn are waiting for
will never arrive. The proposed solution to this scenario is that, before switching to plan B, the user code in P2
calls MPIX_COMM_REVOKE, a new API that notifies all other
processes in the communicator that a condition requiring
recovery actions has been reached. Because the revocation is triggered explicitly by user code, the cost associated with consistency in error reporting
is paid only after an actual failure has happened, and only
when the algorithm requires it. Applications that do
not need consistency, or in which the user can prove that
the communication pattern remains safe, can enjoy better
recovery performance.
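
The scenario above can be illustrated with a small toy model. The sketch below is a single-process Python simulation, not MPI code: the names simulate_chain and blocked_ranks and the state labels are hypothetical, and the revocation is reduced to a boolean flag standing in for the effect of MPIX_COMM_REVOKE. It assumes, as in the example, that P1 is the failed process and that a receive from a live neighbor is interrupted only when the communicator has been revoked.

```python
FAILED, BLOCKED, PLAN_B = "failed", "blocked in plan A", "running plan B"


def simulate_chain(n, use_revoke):
    """Toy model of plan A with P1 failed; returns {rank: final state} for P1..Pn."""
    if n < 2:
        raise ValueError("the scenario needs at least P1 and P2")
    # Every process starts blocked in its plan A receive from its predecessor.
    state = {k: BLOCKED for k in range(1, n + 1)}
    state[1] = FAILED
    # Only P2 communicates directly with P1, so only P2 learns of the failure
    # and branches into plan B.
    state[2] = PLAN_B
    # Assumption: P2 optionally revokes the communicator before leaving plan A.
    revoked = use_revoke
    for k in range(3, n + 1):
        # Pk waits on a live predecessor that will never send. Its receive is
        # interrupted only if the communicator has been revoked.
        if revoked:
            state[k] = PLAN_B
    return state


def blocked_ranks(state):
    """Ranks still stuck in a plan A receive, in increasing order."""
    return [k for k, s in sorted(state.items()) if s == BLOCKED]


print(blocked_ranks(simulate_chain(5, use_revoke=False)))  # [3, 4, 5]
print(blocked_ranks(simulate_chain(5, use_revoke=True)))   # []
```

Without the revocation, P3 to P5 remain blocked on messages that never arrive, which is the deadlock described above. With it, every surviving process leaves plan A and can enter plan B.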