4.2 Algorithms for Collective Operations
Often an MPI library includes more than one algorithm
to implement each collective operation. Since algorithm performance
may depend on system size, message size, and the
communication-initiator task, MPI libraries provide a default
algorithm for each combination of target system, collective
operation, and message size. Note, however, that
Intel MPI’s default settings are based on the performance
of the Intel MPI Benchmarks (IMB), while MVAPICH2’s
default settings are based on the performance of
the OSU Micro-Benchmarks (OMB). In addition, MPI libraries
often provide user-definable parameters for selection
of an algorithm for either all message sizes or specific ones.
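For illustration only (the exact mechanism is library-specific), Intel MPI exposes such parameters through its documented I_MPI_ADJUST_<opname> environment variables. The sketch below pins the MPI_Bcast algorithm before MPI_Init, under the assumption that the library honors a variable set programmatically in the process environment; the algorithm identifier 1 is arbitrary, and in practice the variable is usually set in the job script instead (e.g., mpirun -genv I_MPI_ADJUST_BCAST 1 ./app).

/* Sketch: pinning a collective algorithm via an environment variable.
 * I_MPI_ADJUST_BCAST is Intel MPI's documented knob for MPI_Bcast
 * algorithm selection; the value 1 and the size-range variant below
 * are illustrative, and other libraries use different variable names. */
#define _POSIX_C_SOURCE 200112L
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Force one algorithm for all message sizes ... */
    setenv("I_MPI_ADJUST_BCAST", "1", 1);
    /* ... or restrict the override to a message-size range (bytes):
     * setenv("I_MPI_ADJUST_BCAST", "1:0-8192", 1);                  */

    MPI_Init(&argc, &argv);   /* the library reads the variable here */

    int rank, value = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        value = 42;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d received %d\n", rank, value);

    MPI_Finalize();
    return 0;
}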
As the communication pattern of a given application may
be different from that of the benchmark used to tune a given
MPI library, using an algorithm that is not the library’s default
can result in surprisingly good performance. For example,
as shown in Figure 4, for small messages (128
bytes or less), a 256-core execution of OMB’s MPI_Bcast
(osu_bcast) with Intel MPI 4.1 and a user-defined (tuned)
algorithm ran 35 times faster than the same execution with Intel MPI 4.1 and its default algorithm, and performed comparably to MVAPICH2 1.9a2 with its default algorithm.
Another example (shown in Figure 5) results from running
OMB’s MPI_Gather (osu_gather) four times with Intel MPI 4.1: once with each of Intel MPI’s three algorithms fixed explicitly via a user-defined value, and once with the library’s default algorithm. Even though Intel MPI’s auto-selected
algorithm is the best choice for small message sizes
(it selects algorithm 2), for large message sizes Intel MPI’s
MPI_Gather auto-selected algorithm is far from optimal (it
selects algorithm 3). As these results indicate, expert recommendations about which algorithms to use for collective operations can yield better application performance than that achieved with the MPI library’s auto-selection strategy.
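The comparison in Figure 5 can be reproduced in spirit with a small timing loop of the kind sketched below, run once per algorithm setting. This is not the OMB code; the warm-up count, iteration count, and size range are arbitrary choices.

/* Minimal sketch in the spirit of osu_gather: time MPI_Gather over a
 * range of message sizes so that separate runs (one per algorithm
 * setting) can be compared. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_MSG (1 << 20)   /* 1 MB per rank */
#define WARMUP  10
#define ITERS   100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sendbuf = malloc(MAX_MSG);
    char *recvbuf = (rank == 0) ? malloc((size_t)MAX_MSG * size) : NULL;

    for (int bytes = 4; bytes <= MAX_MSG; bytes *= 4) {
        /* warm-up iterations are excluded from the measurement */
        for (int i = 0; i < WARMUP; i++)
            MPI_Gather(sendbuf, bytes, MPI_CHAR,
                       recvbuf, bytes, MPI_CHAR, 0, MPI_COMM_WORLD);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++)
            MPI_Gather(sendbuf, bytes, MPI_CHAR,
                       recvbuf, bytes, MPI_CHAR, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("%8d bytes  %10.2f us\n",
                   bytes, (t1 - t0) * 1e6 / ITERS);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}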
To tune an MPI library for a specific application, MPI
Advisor detects the algorithm used for each collective operation
employed by the application and determines if it is the
best choice.
Figure 4: 256-core MPI_Bcast performance of OMB (osu_bcast), time (us) vs. message size (bytes), with default and tuned algorithms in Intel MPI vs. default algorithm in MVAPICH2.
Figure 5: 256-core MPI_Gather latency of OMB (osu_gather), latency (us) vs. message size (bytes), with each algorithm in Intel MPI (user-defined) vs. the default (auto-selected) algorithm.
For each collective operation for which an alternate algorithm will provide improved performance, MPI
Advisor recommends the alternate algorithm and provides
the user with instructions on how to select this algorithm.
To accomplish this, during the data collection phase, MPI
Advisor records, via mpiP, the execution time and message
size of each collective operation employed by the application
and, using MPI_T, identifies the algorithm that was employed.
Next, for each of these operations, it consults the table built at installation time by the CE script, which records the execution time of every collective-operation algorithm in each MPI library installed on the system, and identifies the best algorithm for the target architecture at each message size of interest. If there are collective
operations for which the application should use different algorithms,
MPI Advisor outputs related recommendations.
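The MPI_T calls involved are part of the standard MPI tools interface. The sketch below (not MPI Advisor's actual code) lists the control variables whose names contain a given substring and prints the current value of the integer-valued ones, which is how a library's collective-algorithm setting can be inspected when it is exposed as a control variable; the relevant variable names differ between Intel MPI and MVAPICH2.

/* Sketch of the MPI_T control-variable interface: enumerate control
 * variables, filter by a substring given on the command line
 * (e.g. "GATHER" or "bcast"), and read the integer-valued ones. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *pattern = (argc > 1) ? argv[1] : "GATHER";
    int provided, num_cvar;

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&num_cvar);

    for (int i = 0; i < num_cvar; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &binding, &scope);
        if (strstr(name, pattern) == NULL)
            continue;

        printf("cvar %d: %s\n  %s\n", i, name, desc);

        /* read the current value of int-valued, unbound variables */
        if (dtype == MPI_INT && binding == MPI_T_BIND_NO_OBJECT) {
            MPI_T_cvar_handle handle;
            int count, value;
            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            MPI_T_cvar_read(handle, &value);
            printf("  current value: %d\n", value);
            MPI_T_cvar_handle_free(&handle);
        }
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}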
4.3 Mapping of MPI Tasks to Cores
Each MPI library provides its own default strategy for
mapping tasks to sockets and cores. There is no single best
strategy, since the optimal mapping depends strongly on application characteristics. Because the mapping determines the
proximity of the root process (in general, task 0) to the host
channel adapter (HCA) card, it can impact the efficiency
with which task 0 communicates with other MPI tasks.
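A quick way to inspect the mapping a library actually applies (independent of MPI Advisor) is to have every rank report the core it runs on, as in the Linux-specific sketch below; comparing rank 0's core against the socket that hosts the HCA then shows whether the root sits close to the adapter.

/* Diagnostic: every rank reports the CPU it is currently running on,
 * and rank 0 prints the resulting mapping.  sched_getcpu() is
 * Linux/glibc-specific. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size, cpu = sched_getcpu();
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *cpus = (rank == 0) ? malloc(size * sizeof(int)) : NULL;
    MPI_Gather(&cpu, 1, MPI_INT, cpus, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int r = 0; r < size; r++)
            printf("rank %d -> core %d\n", r, cpus[r]);
        free(cpus);
    }

    MPI_Finalize();
    return 0;
}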
For hybrid applications (MPI+OpenMP), the default mappings
provided by MVAPICH2 and Open MPI often do not
deliver the best performance because all of the threads associated with an MPI task are mapped to the same core; as a result, hybrid application performance may be drastically degraded by this mapping. For example, assume that a hybrid
code with four MPI tasks, each with two OpenMP threads,
is executed on a cluster with nodes that are each composed of two 4-core processors on sockets S1 and S2, and one HCA
on S2. In this case, by default, MVAPICH2, Intel MPI,
and Open MPI define different tasks-to-cores mappings. As
shown in Figure 6: a) MVAPICH2 maps the four tasks to
S1, b) Open MPI (version 1.7.4 and higher) maps each pair
of tasks to the first two cores of S1 and S2, and c) Intel MPI
maps each task to a pair of cores.
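The effect described above can be observed directly with a hybrid variant of the previous diagnostic, sketched below under the same Linux-specific assumption: if two OpenMP threads of the same task report the same core, they are time-sharing it.

/* Hybrid diagnostic: each OpenMP thread of each MPI task prints the
 * core it runs on.  Compile with, e.g., mpicc -fopenmp, and run with
 * 4 tasks of 2 threads each to reproduce the scenario in the text. */
#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* each thread reports its placement */
        printf("task %d / thread %d -> core %d\n",
               rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}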