4.2 Algorithms for Collective Operations
Often an MPI library includes more than one algorithm
to implement each collective operation. Since algorithm performance
may depend on system size, message size, and the
communication-initiator task, MPI libraries provide a default
algorithm for each combination of target system, collective
operation, and message size. Note, however, that
Intel MPI’s default settings are based on the performance
of the Intel MPI Benchmarks (IMB), while MVAPICH2’s
default settings are based on the performance of
the OSU Micro-Benchmarks (OMB). In addition, MPI libraries
often provide user-definable parameters for selection
of an algorithm for either all message sizes or specific ones.
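For illustration only (the exact mechanism is library-specific), Intel MPI exposes such parameters through its documented I_MPI_ADJUST_<opname> environment variables. The sketch below pins the MPI_Bcast algorithm before MPI_Init, under the assumption that the library honors a variable set programmatically in the process environment; the algorithm identifier 1 is arbitrary, and in practice the variable is usually set in the job script instead (e.g., mpirun -genv I_MPI_ADJUST_BCAST 1 ./app).

/* Sketch: pinning a collective algorithm via an environment variable.
 * I_MPI_ADJUST_BCAST is Intel MPI's documented knob for MPI_Bcast
 * algorithm selection; the value 1 and the size-range variant below
 * are illustrative, and other libraries use different variable names. */
#define _POSIX_C_SOURCE 200112L
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Force one algorithm for all message sizes ... */
    setenv("I_MPI_ADJUST_BCAST", "1", 1);
    /* ... or restrict the override to a message-size range (bytes):
     * setenv("I_MPI_ADJUST_BCAST", "1:0-8192", 1);                  */

    MPI_Init(&argc, &argv);   /* the library reads the variable here */

    int rank, value = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        value = 42;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d received %d\n", rank, value);

    MPI_Finalize();
    return 0;
}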
As the communication pattern of a given application may
be different from that of the benchmark used to tune a given
MPI library, using an algorithm that is not the library’s default
can result in surprisingly good performance. For example,
as shown in Figure 4, for small messages (128
bytes or less), a 256-core execution of OMB’s MPI_Bcast
(osu_bcast) with Intel MPI 4.1 and a user-defined (tuned)
algorithm ran 35 times faster than the same execution with Intel MPI 4.1 and its default algorithm, and performed comparably to MVAPICH2 1.9a2 with its default algorithm.
Another example (shown in Figure 5) results from running
OMB’s MPI_Gather (osu_gather) four times with Intel MPI 4.1: once with each of Intel MPI’s three algorithms fixed explicitly via a user-defined value, and once with the library’s default algorithm. Even though Intel MPI’s auto-selected
algorithm is the best choice for small message sizes
(it selects algorithm 2), for large message sizes Intel MPI’s
MPI_Gather auto-selected algorithm is far from optimal (it
selects algorithm 3). As these results indicate, expert recommendations about which algorithms to use for collective operations can yield better application performance than that achieved with the MPI library’s auto-selection strategy.
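The comparison in Figure 5 can be reproduced in spirit with a small timing loop of the kind sketched below, run once per algorithm setting. This is not the OMB code; the warm-up count, iteration count, and size range are arbitrary choices.

/* Minimal sketch in the spirit of osu_gather: time MPI_Gather over a
 * range of message sizes so that separate runs (one per algorithm
 * setting) can be compared. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_MSG (1 << 20)   /* 1 MB per rank */
#define WARMUP  10
#define ITERS   100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sendbuf = malloc(MAX_MSG);
    char *recvbuf = (rank == 0) ? malloc((size_t)MAX_MSG * size) : NULL;

    for (int bytes = 4; bytes <= MAX_MSG; bytes *= 4) {
        /* warm-up iterations are excluded from the measurement */
        for (int i = 0; i < WARMUP; i++)
            MPI_Gather(sendbuf, bytes, MPI_CHAR,
                       recvbuf, bytes, MPI_CHAR, 0, MPI_COMM_WORLD);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++)
            MPI_Gather(sendbuf, bytes, MPI_CHAR,
                       recvbuf, bytes, MPI_CHAR, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("%8d bytes  %10.2f us\n",
                   bytes, (t1 - t0) * 1e6 / ITERS);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}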
To tune an MPI library for a specific application, MPI
Advisor detects the algorithm used for each collective operation
employed by the application and determines if it is the
best choice.
Figure 4: 256-core MPI_Bcast performance of OMB (osu_bcast), time (us) vs. message size (bytes), with default and tuned algorithms in Intel MPI vs. default algorithm in MVAPICH2.
Figure 5: 256-core MPI_Gather latency of OMB (osu_gather), latency (us) vs. message size (bytes), with each algorithm in Intel MPI (user-defined) vs. the default (auto-selected) algorithm.
For each collective operation for which an alternate algorithm will provide improved performance, MPI
Advisor recommends the alternate algorithm and provides
the user with instructions on how to select this algorithm.
To accomplish this, during the data collection phase, MPI
Advisor records, via mpiP, the execution time and message
size of each collective operation employed by the application
and, using MPI_T, identifies the algorithm that was employed.
Next, for each of these operations, it consults the table built at installation time by the CE script, which records the execution time of every collective-operation algorithm in each MPI library installed on the system, and identifies the best algorithm for the target architecture at each message size of interest. If there are collective
operations for which the application should use different algorithms,
MPI Advisor outputs related recommendations.
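The MPI_T calls involved are part of the standard MPI tools interface. The sketch below (not MPI Advisor's actual code) lists the control variables whose names contain a given substring and prints the current value of the integer-valued ones, which is how a library's collective-algorithm setting can be inspected when it is exposed as a control variable; the relevant variable names differ between Intel MPI and MVAPICH2.

/* Sketch of the MPI_T control-variable interface: enumerate control
 * variables, filter by a substring given on the command line
 * (e.g. "GATHER" or "bcast"), and read the integer-valued ones. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *pattern = (argc > 1) ? argv[1] : "GATHER";
    int provided, num_cvar;

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&num_cvar);

    for (int i = 0; i < num_cvar; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &binding, &scope);
        if (strstr(name, pattern) == NULL)
            continue;

        printf("cvar %d: %s\n  %s\n", i, name, desc);

        /* read the current value of int-valued, unbound variables */
        if (dtype == MPI_INT && binding == MPI_T_BIND_NO_OBJECT) {
            MPI_T_cvar_handle handle;
            int count, value;
            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            MPI_T_cvar_read(handle, &value);
            printf("  current value: %d\n", value);
            MPI_T_cvar_handle_free(&handle);
        }
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}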
4.3 Mapping of MPI Tasks to Cores
Each MPI library provides its own default strategy for
mapping tasks to sockets and cores. There is no single best
strategy, since the optimal mapping depends strongly on application characteristics. Because the mapping determines the
proximity of the root process (in general, task 0) to the host
channel adapter (HCA) card, it can impact the efficiency
with which task 0 communicates with other MPI tasks.
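A quick way to inspect the mapping a library actually applies (independent of MPI Advisor) is to have every rank report the core it runs on, as in the Linux-specific sketch below; comparing rank 0's core against the socket that hosts the HCA then shows whether the root sits close to the adapter.

/* Diagnostic: every rank reports the CPU it is currently running on,
 * and rank 0 prints the resulting mapping.  sched_getcpu() is
 * Linux/glibc-specific. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size, cpu = sched_getcpu();
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *cpus = (rank == 0) ? malloc(size * sizeof(int)) : NULL;
    MPI_Gather(&cpu, 1, MPI_INT, cpus, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int r = 0; r < size; r++)
            printf("rank %d -> core %d\n", r, cpus[r]);
        free(cpus);
    }

    MPI_Finalize();
    return 0;
}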
For hybrid applications (MPI+OpenMP), the default mappings
provided by MVAPICH2 and Open MPI often do not
deliver the best performance because all of the threads associated with an MPI task are mapped to the same core; as a result, hybrid application performance may be drastically degraded by this mapping. For example, assume that a hybrid
code with four MPI tasks, each with two OpenMP threads,
is executed on a cluster with nodes that are each composed of two 4-core processors on sockets S1 and S2, and one HCA
on S2. In this case, by default, MVAPICH2, Intel MPI,
and Open MPI define different tasks-to-cores mappings. As
shown in Figure 6: a) MVAPICH2 maps the four tasks to
S1, b) Open MPI (version 1.7.4 and higher) maps each pair
of tasks to the first two cores of S1 and S2, and c) Intel MPI
maps each task to a pair of cores.
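The effect described above can be observed directly with a hybrid variant of the previous diagnostic, sketched below under the same Linux-specific assumption: if two OpenMP threads of the same task report the same core, they are time-sharing it.

/* Hybrid diagnostic: each OpenMP thread of each MPI task prints the
 * core it runs on.  Compile with, e.g., mpicc -fopenmp, and run with
 * 4 tasks of 2 threads each to reproduce the scenario in the text. */
#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* each thread reports its placement */
        printf("task %d / thread %d -> core %d\n",
               rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}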