The MPI_COMM_FREE function is defined as a collective operation whose implementation is likely to be local; that is, it usually requires no communication. To minimize the performance impact, we designed a fault-tolerant barrier that progresses in the background, so that it does not significantly lengthen the MPI_COMM_FREE call itself. Deallocation of the communicator then becomes lazy. When the application calls MPI_COMM_FREE, the communicator is marked for deallocation, and the user handle can be destroyed immediately. However, the internal representation of the communicator is deallocated only when it is safe to do so, that is, after the background barrier completes. Like the Revoke operation, this barrier is implemented at the BTL level and essentially performs a binomial reduce-broadcast sequence. When a process receives the message of the broadcast phase, it can infer that every process has invoked MPI_COMM_FREE on that communicator. Hence all communication on the communicator has completed¹, whether successfully, in error because a participant died, or by interruption because the communicator was revoked.
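The lazy deallocation scheme can be illustrated with a small single-process simulation. The following is a minimal sketch, not the Open MPI implementation. The class `LazyCommFree` and its methods `comm_free` and `progress` are hypothetical names, and the messages are modeled as an in-memory queue rather than BTL traffic. The sketch also assumes no process fails; the actual barrier is fault tolerant, and that part is not modeled here. Each rank's `comm_free` only marks the communicator and returns. A rank sends its reduce message to its binomial-tree parent once it has called `comm_free` and heard from all of its children. The internal structure on a rank is released only when the broadcast phase reaches it.

```python
from collections import deque


class LazyCommFree:
    """Hypothetical simulation of lazy communicator deallocation driven by a
    background binomial reduce-broadcast barrier among n ranks (no failures)."""

    def __init__(self, n):
        self.n = n
        self.free_called = [False] * n   # rank r has invoked comm_free
        self.reduce_sent = [False] * n   # rank r has reported to its parent
        # Number of reduce messages rank r still expects from its children.
        self.pending = [len(self.children(r)) for r in range(n)]
        self.released = [False] * n      # internal representation deallocated
        self.queue = deque()             # in-flight messages: (kind, destination)

    def parent(self, r):
        # Binomial tree rooted at 0: clear the lowest set bit of r.
        return r & (r - 1)

    def children(self, r):
        # Children are r | m for powers of two m below r's lowest set bit
        # (any power of two for the root), as long as the rank exists.
        limit = self.n if r == 0 else r & -r
        kids, m = [], 1
        while m < limit and (r | m) < self.n:
            kids.append(r | m)
            m <<= 1
        return kids

    def comm_free(self, r):
        # Local part of the call: mark for deallocation and return immediately.
        self.free_called[r] = True
        self._try_reduce(r)

    def _try_reduce(self, r):
        if self.free_called[r] and self.pending[r] == 0 and not self.reduce_sent[r]:
            self.reduce_sent[r] = True
            if r == 0:
                # Root has heard from the whole tree: start the broadcast phase.
                self.queue.append(("bcast", 0))
            else:
                self.queue.append(("reduce", self.parent(r)))

    def progress(self):
        # Stand-in for background progress: deliver every in-flight message.
        while self.queue:
            kind, dst = self.queue.popleft()
            if kind == "reduce":
                self.pending[dst] -= 1
                self._try_reduce(dst)
            else:
                # Broadcast reached dst: every rank has called comm_free,
                # so releasing the internal structure is now safe.
                self.released[dst] = True
                for c in self.children(dst):
                    self.queue.append(("bcast", c))


sim = LazyCommFree(6)
for r in (3, 0, 5, 1):       # only some ranks have called comm_free
    sim.comm_free(r)
sim.progress()
print(sum(sim.released))     # 0: nothing may be released yet
for r in (2, 4):             # the remaining ranks call comm_free
    sim.comm_free(r)
sim.progress()
print(sum(sim.released))     # 6: all internal structures released
```

Nothing is released while ranks 2 and 4 have not yet called `comm_free`, even though the other four calls have already returned. This mirrors the property that receiving the broadcast implies every process has invoked MPI_COMM_FREE.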