and recommendations to improve the application’s MPI performance.
Moreover, MPI Advisor has minimal overhead.
For example, Table 3 shows the performance impact of mpiP
and MPI Advisor on the GFLOP/s attained by HPCG (pure
MPI), which was executed on Texas Advanced Computing
Center's (TACC) Stampede cluster using MVAPICH2 and 64 cores.
As can be seen, MPI Advisor’s performance impact is very
low, i.e., it reduces the achieved GFLOP/s by 1.3%, while
mpiP reduces it by less than 0.1%.
4. SUPPORTED TUNING STRATEGIES
Currently, MPI Advisor supports four classical and useful
MPI performance-tuning strategies. Although they may
seem commonplace to experts, they are likely not familiar
to non-experts. Below, for each strategy, we provide background
information, motivation for its use, and implementation
details. As mentioned previously, only a single execution
of the input application is required for MPI Advisor
to collect all the data needed to detect any performance
bottlenecks that are pertinent to the tuning strategies, and
to provide recommendations for enhancing the application’s
performance.
In addition, MPI Advisor can be easily expanded: each
strategy is implemented as an independent component, which
defines its own data collection and analyses.
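The extensibility described above can be pictured as a plug-in interface. The following Python sketch is purely illustrative: `TuningStrategy`, `Advisor`, and input names such as `"mpit"` and `"mpip"` are hypothetical and are not MPI Advisor's actual code. It only shows how independent components could each declare their own data needs, so that a single profiled run can serve all of them.

```python
from abc import ABC, abstractmethod


class TuningStrategy(ABC):
    """Hypothetical interface: one independent component per tuning strategy."""

    name = "unnamed"

    @abstractmethod
    def required_inputs(self) -> set[str]:
        """Names of the data items this strategy needs collected (e.g. 'mpit', 'mpip')."""

    @abstractmethod
    def analyze(self, data: dict) -> list[str]:
        """Turn the collected data into zero or more recommendations."""


class Advisor:
    """Hypothetical driver that plugs independent strategies together."""

    def __init__(self) -> None:
        self.strategies: list[TuningStrategy] = []

    def register(self, strategy: TuningStrategy) -> None:
        self.strategies.append(strategy)

    def inputs_to_collect(self) -> set[str]:
        # Union of every strategy's requirements, so that one profiled
        # execution gathers everything all strategies need.
        needed: set[str] = set()
        for s in self.strategies:
            needed |= s.required_inputs()
        return needed

    def report(self, data: dict) -> dict[str, list[str]]:
        # Each strategy analyzes the same collected data independently.
        return {s.name: s.analyze(data) for s in self.strategies}
```

Adding a new strategy in this sketch means writing one more subclass and registering it; no existing component has to change.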
4.1 Point-to-Point Protocol Threshold
In general, MPI libraries employ either the eager or rendezvous
protocol for point-to-point communication. As illustrated
in Figure 2, the eager protocol is asynchronous, while
the rendezvous protocol is synchronous. Consequently, they
differ in terms of their relative memory requirements. The
eager protocol, which is used for small messages, stores each
received message in an MPI-defined buffer whether or not
a matching receive has been posted. Although this reduces
synchronization delays and, thus, provides lower message
latencies, significant memory may be required to provide
buffer space for messages. In contrast, the synchronous rendezvous
protocol employs a handshake (a request-to-send, RTS, answered
by a clear-to-send, CTS) to initiate a data payload transfer;
a pending message is sent only when
there is adequate user-defined buffer space at the receiver.
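To make the contrast concrete, the toy Python model below traces the events for a single send under each protocol. It is a simplification under stated assumptions: the names `Receiver` and `send` are hypothetical, and real MPI implementations may add completion messages or use different handshake variants. It shows the trade-off described above: under eager, a message that arrives before its receive is posted costs library buffer memory at the receiver; under rendezvous, the sender waits for the RTS/CTS handshake instead.

```python
from dataclasses import dataclass


@dataclass
class Receiver:
    """Toy receiver state (hypothetical)."""
    recv_posted: bool              # has a matching receive been posted?
    library_buffer_bytes: int = 0  # bytes held in MPI-internal eager buffers


def send(size: int, protocol: str, rx: Receiver) -> list[str]:
    """Return the sequence of events for one send under a toy protocol model."""
    if protocol == "eager":
        # Payload is pushed immediately; with no matching receive posted,
        # the library must buffer it at the receiver.
        if not rx.recv_posted:
            rx.library_buffer_bytes += size
        return ["data"]
    if protocol == "rendezvous":
        # Handshake first: request-to-send, then clear-to-send once the
        # receiver has posted a receive with room for the payload.
        events = ["RTS"]
        if not rx.recv_posted:
            events.append("wait")  # sender stalls until the receive is posted
        events += ["CTS", "data"]
        return events
    raise ValueError(f"unknown protocol: {protocol!r}")
```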
Each MPI library defines the largest message size that
it will send using the eager protocol; messages with sizes
above this threshold are sent using the rendezvous protocol.
A user-defined parameter can be used to change this threshold,
which we generically call eager threshold. The MVAPICH2
default value of eager threshold is dependent on the
transport media (shared memory, TCP/IP, InfiniBand SDR,
QDR, FDR, etc.); it is 17 KB for intra-node communication
on Stampede, while Intel MPI sets the default eager threshold
to 256 KB independent of the architecture.
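The threshold rule itself fits in a few lines. In this sketch, `protocol_for` and `DEFAULT_EAGER_THRESHOLD` are hypothetical names, KB is assumed to mean 1024 bytes, and only the two defaults quoted above are included; it does not query any real MPI library.

```python
KB = 1024  # assumption: 1 KB = 1024 bytes

# Defaults quoted in the text. MVAPICH2's value depends on the transport,
# so only the intra-node figure on Stampede is listed (hypothetical keys).
DEFAULT_EAGER_THRESHOLD = {
    "mvapich2_intranode_stampede": 17 * KB,
    "intel_mpi": 256 * KB,
}


def protocol_for(msg_bytes: int, eager_threshold: int) -> str:
    """The threshold is the largest eagerly sent size; larger messages use rendezvous."""
    return "eager" if msg_bytes <= eager_threshold else "rendezvous"
```

For example, a 64 KB message would use rendezvous under the quoted MVAPICH2 intra-node default but would be sent eagerly under Intel MPI's default.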
Figure 2: Protocols used for point-to-point operations: (a) eager protocol; (b) rendezvous protocol.
Note that we focus only on increasing eager threshold
since decreasing it could lead to more messages using the
rendezvous protocol and, thus, degradation of application
performance. Nonetheless, increasing eager threshold also
increases the size of the buffer associated with the eager
protocol, which may cause program termination due to node
memory exhaustion. However, a study conducted at TACC
shows that, as depicted in Figure 3, most of the MPI jobs
do not use the entire memory available on a typical Stampede
node (32 GB). That being said, identifying a value of
eager threshold that is appropriate for an application requires
knowledge that is not commonplace among many
users of an HPC facility.
To tune the eager threshold parameter, MPI Advisor detects
the predominant message sizes transmitted by the input
application and infers the point-to-point protocol in use.
When applicable, MPI Advisor recommends increasing the
eager threshold value to reduce the number of messages
transmitted using the rendezvous protocol and, thus, improve
application performance. This recommendation is
specific to each MPI library and warns the user that making
the change may increase the memory footprint of the MPI
library. To determine whether or not to increase the eager
threshold, during the data collection and analysis phases,
MPI Advisor: 1) uses MPI_T to identify the value of eager
threshold, 2) uses mpiP performance data to determine the
number and size of messages transmitted via send and receive
operations, and 3) computes the median message size
per call site, determines the maximum of these, compares the
maximum to the value of eager threshold, and determines
the appropriate value for eager threshold. If the computed
value is larger than the default value, then MPI Advisor outputs
its recommendation, instructions, and warnings. This
strategy is based on several experiments and may be improved
in the future.
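The three steps above can be sketched as follows. This is a hypothetical reconstruction, not MPI Advisor's implementation: the function names, the input format (a mapping from call site to message sizes in bytes, standing in for the mpiP send/receive data of step 2), and the choice to propose exactly the largest per-call-site median, rounded up to whole bytes, are all assumptions. Reading the current threshold via MPI_T (step 1) is left to the caller.

```python
import math
from statistics import median
from typing import Optional


def recommend_eager_threshold(current_threshold: int,
                              sizes_by_call_site: dict[str, list[int]]) -> Optional[int]:
    """Hypothetical sketch of step 3; returns a new threshold in bytes, or None.

    current_threshold: eager threshold in bytes (obtained via MPI_T in step 1).
    sizes_by_call_site: call site -> sizes (bytes) of messages sent/received
        there (derived from mpiP data in step 2).
    """
    # Median message size per call site (skip sites with no recorded messages).
    medians = [median(sizes) for sizes in sizes_by_call_site.values() if sizes]
    if not medians:
        return None
    # The largest median sets the threshold needed to send it eagerly;
    # a message of exactly the threshold size is still sent eagerly.
    candidate = math.ceil(max(medians))
    # Only an increase is ever recommended.
    return candidate if candidate > current_threshold else None


def advice_text(current_threshold: int, suggested: int) -> str:
    """Hypothetical wording of the recommendation plus its memory warning."""
    return (f"Consider raising the eager threshold from {current_threshold} "
            f"to {suggested} bytes (the parameter name is MPI-library specific). "
            "Warning: this may increase the memory footprint of the MPI library.")
```

With a current threshold of 17,408 bytes (17 KB) and a call site whose median message is 32,768 bytes, this sketch would suggest 32,768 bytes; with Intel MPI's 256 KB default it would suggest no change.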
Figure 3 (plot): # of Jobs vs. Memory Utilization