When a process of the application calls MPIX_COMM_REVOKE (similar operations exist for windows and files; without loss of generality, we reason here about communicators), all other alive processes in the communicator eventually receive a notification. The MPIX_COMM_REVOKE call affects the entire scope of the communicator without requiring a collective or matching call at any participant. Instead, the effect of the Revoke operation is observed at other processes during non-matching MPI communication calls: when a process receives this notification, any communication on the communicator (ongoing or future) is interrupted and a special error code is returned. All surviving processes can then safely enter the recovery procedure of the application, knowing that no alive process belonging to that communicator will deadlock as a result. After a communicator has been revoked, its state is definitively altered and it can never again be used to communicate. This alteration is not to be seen as the (direct) consequence of a failure, but as the consequence of the user explicitly calling a specific operation on the communicator. In a sense, revoking a communicator explicitly achieves the propagation of failure knowledge that has intentionally not been required, but is provided when the user deems it necessary. Because the object is discarded definitively, any stale message matching the revoked object is appropriately ignored without modifications to the matching logic, and multiple processes may simultaneously revoke the same communicator without fear of injecting delayed Revoke notifications that would interfere with post-recovery operations.
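The Revoke semantics described above can be illustrated with a small single-process simulation. This is a toy model in Python and not the MPI API: the names ToyComm and RevokedError are hypothetical, and the notification is modeled as an immediately visible flag, whereas a real implementation only guarantees that it is eventually received.

```python
class RevokedError(Exception):
    """Stands in for the special error code returned on a revoked communicator."""


class ToyComm:
    """Hypothetical toy communicator: a context id, a revoked flag, per-rank inboxes."""

    def __init__(self, context_id, ranks):
        self.context_id = context_id
        self.revoked = False
        self.inbox = {rank: [] for rank in ranks}

    def revoke(self):
        # Non-collective and idempotent: any rank may call it, and several
        # ranks calling it simultaneously leave the same final state.
        self.revoked = True

    def send(self, dst, payload):
        if self.revoked:
            raise RevokedError(f"communicator {self.context_id} is revoked")
        self.inbox[dst].append(payload)

    def recv(self, rank):
        # Ongoing and future operations are interrupted after revocation,
        # even if a matching message had already arrived.
        if self.revoked:
            raise RevokedError(f"communicator {self.context_id} is revoked")
        return self.inbox[rank].pop(0)


comm = ToyComm(context_id=0, ranks=[0, 1, 2])
comm.send(1, "before")   # message sent before the revocation
comm.revoke()            # rank 0 decides to revoke
comm.revoke()            # rank 2 does the same; the second call changes nothing

outcomes = []
for rank in (1, 2):
    try:
        comm.recv(rank)
        outcomes.append("delivered")
    except RevokedError:
        outcomes.append("interrupted")

print(outcomes)
```

In this sketch no receive can block forever on the revoked object, and the message sent before the revocation simply stays undelivered, which mirrors the claim that stale messages are ignored without changing the matching logic.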
To restore communication capability, ULFM provides the repair function MPIX_COMM_SHRINK, which derives new, fresh communicators that do not risk intermixing with pre-failure operations or delayed notifications.
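Why a freshly derived communicator cannot intermix with older traffic can be sketched with another toy model. The code below is Python and not the MPI API: ToyComm, shrink, and deliver are hypothetical names, the set of failed processes is passed in by hand (the real operation is collective and determines it internally), and the rank list keeps the original process identifiers rather than renumbering them.

```python
import itertools

# Hypothetical source of context ids that are never reused.
_next_context_id = itertools.count(1)


class ToyComm:
    """Hypothetical toy communicator identified by a unique context id."""

    def __init__(self, ranks):
        self.context_id = next(_next_context_id)
        self.ranks = list(ranks)
        self.revoked = False


def shrink(comm, failed):
    """Return a fresh communicator containing only the surviving ranks."""
    survivors = [rank for rank in comm.ranks if rank not in failed]
    return ToyComm(survivors)


def deliver(comm, message):
    """Accept a message only if it carries this communicator's context id."""
    msg_context, payload = message
    if comm.revoked or msg_context != comm.context_id:
        return None  # stale data or a delayed notice for an old context
    return payload


world = ToyComm(range(4))
stale = (world.context_id, "pre-failure data")   # still in flight
world.revoked = True                             # someone revoked the old object
repaired = shrink(world, failed={2})
fresh = (repaired.context_id, "post-recovery data")

results = (repaired.ranks, deliver(repaired, stale), deliver(repaired, fresh))
print(results)
```

Because the repaired communicator carries a context id that the old one never used, anything addressed to the revoked object fails the context check and is dropped, while post-recovery traffic is delivered normally.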