Errors can then be captured by the application by setting
the appropriate MPI_ERRHANDLER.
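As an illustration, the following minimal sketch (standard MPI only; the barrier and the recovery placeholder are illustrative, not part of any specific application) installs a returning error handler so that failures are reported through return codes rather than aborting the job:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler so that errors are
     * reported to the application instead of aborting the whole job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* With MPI_ERRORS_RETURN installed, the return code of each call can
     * be inspected locally. */
    int rc = MPI_Barrier(MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        fprintf(stderr, "rank %d: barrier returned error code %d\n", rank, rc);
        /* application-specific recovery would start here */
    }

    MPI_Finalize();
    return 0;
}
```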
An additional criterion to consider is that some MPI operations are collective, or have a matching call at some other
process (e.g., Send/Recv). Convenience would call for the same error to be returned uniformly at all ranks that participated in the communication. This would easily permit tracking the global progress of the application (and then inferring a consistent, synchronized recovery point). However, the performance consequences are dire, as this approach requires that every communication conclude with an agreement operation between its participants in order to determine the global success or failure of the communication, as viewed by each process. Such an operation cannot possibly be achieved in less
than the cost of an AllReduce, even without accounting for
the cost of actually tolerating failures during the operation,
and would thus impose an enormous overhead on communication. In regard to the goal of maintaining an unchanged
level of performance, it is clearly unacceptable to double,
at best, the cost of all latency-bound communication operations, especially when no failure has occurred. Furthermore, it is already customary for MPI operations to have a local-only semantic: for example, when an MPI_REDUCE completes
at a non-root process, there is no guarantee that the root
has received the result of the collective operation yet. The
semantic only specifies that when the operation completes,
the local input buffer can be reused.
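As a brief sketch of what this local completion semantic does and does not guarantee (the helper function is hypothetical, standard MPI calls only):

```c
#include <mpi.h>

/* Illustrates the local completion semantic of MPI_Reduce: its return at a
 * non-root rank only guarantees that the local input buffer may be reused. */
void reduce_example(MPI_Comm comm)
{
    int rank, result = 0;
    MPI_Comm_rank(comm, &rank);
    int local = rank + 1;

    int rc = MPI_Reduce(&local, &result, 1, MPI_INT, MPI_SUM, 0 /* root */, comm);
    if (rc == MPI_SUCCESS && rank != 0) {
        /* Guaranteed: 'local' may be overwritten immediately.
         * Not guaranteed: the root has already received the reduced value. */
        local = 0;
    }
}
```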
As a consequence, in ULFM, the reporting of errors has a local operation semantic: the local completion status (in error or success) cannot be used to infer whether the operation
has failed or succeeded at other ranks. In many applications,
this uncertainty is manageable, because the communication
pattern is simple enough. In some cases, however, the communication pattern does not allow such flexibility, and the
application therefore requires an operation to resolve that uncertainty, as described below.
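The sketch below illustrates this local reporting semantic and the way an explicit agreement can resolve the uncertainty when a uniform decision is needed; it assumes the MPIX_-prefixed names of the ULFM prototype (MPIX_ERR_PROC_FAILED, MPIX_Comm_agree), and the helper name is hypothetical:

```c
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ constructs of the ULFM prototype */

/* Hypothetical helper: broadcast a value, then decide uniformly whether the
 * operation is to be considered failed across all surviving ranks. */
int bcast_with_uniform_outcome(MPI_Comm comm, int *value)
{
    int rc = MPI_Bcast(value, 1, MPI_INT, 0 /* root */, comm);

    /* 'rc' is a purely local outcome: some ranks may observe
     * MPIX_ERR_PROC_FAILED while others observe MPI_SUCCESS. */
    int flag = (rc == MPI_SUCCESS);

    /* Fault-tolerant agreement on the logical AND of 'flag' over the
     * surviving ranks: afterwards every rank holds the same verdict. */
    MPIX_Comm_agree(comm, &flag);

    return flag ? MPI_SUCCESS : MPIX_ERR_PROC_FAILED;
}
```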