2. BACKGROUND AND MOTIVATION
NVIDIA’s Fermi GPU architecture [8] consists of multiple
independent streaming multiprocessors (SMs) that share an off-
chip memory. Each SM has a private instruction and data
cache, a scratchpad (shared) memory, 32 cores, 16 load-store
units, 4 special function units, and two schedulers (see Fig. 1).
GPUs are programmed in an explicitly data-parallel language
such as CUDA or OpenCL. The programmer writes
code for a single thread, specifies how many threads have
to be invoked, and groups these threads into blocks, as only
threads within a block can synchronize and share data via
the shared memory.
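To make this programming model concrete, the following minimal CUDA sketch (the kernel name, sizes, and launch configuration are illustrative and not taken from the kernels studied in this paper) shows the single-thread code the programmer writes and the launch that specifies how many threads are invoked and how they are grouped into blocks:

#include <cuda_runtime.h>

// Code for a single thread: each thread computes one output element.
__global__ void scale(const float *in, float *out, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        out[i] = alpha * in[i];
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // The launch configuration specifies how many threads are invoked
    // and how they are grouped into blocks (here 256 threads per block).
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_in, d_out, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Only threads within one block can cooperate through shared memory and barrier synchronization; the simple kernel above needs neither, but kernels such as the convolution discussed next may use both.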
As an example, consider the activity graph in Fig. 2 of
an SM executing a 2D convolution kernel (see also Section
4). The SM’s activity is split into three groups: (1) integer
instructions representing address calculations and control
operations, (2) floating point instructions on actual data
and (3) load and store operations. Both the naive version
(Fig. 2a) and the optimized version (Fig. 2b) start with address
calculations, after which load instructions are issued.
After an idle period the data arrives from the off-chip memory
and floating point instructions are issued. The optimized
kernel shows fewer load operations (and corresponding address
calculations) than the naive implementation, due to
the caching of data elements in registers (see Section 4.1).
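The convolution kernels themselves are presented in Section 4; as an illustration of what caching data elements in registers means, the following hedged CUDA sketch (the kernel name conv_row3, the weights w0–w2, and COLS_PER_THREAD are our own illustrative choices, not the paper’s code) lets each thread compute several adjacent outputs of a 3-tap row filter while keeping the overlapping inputs in registers, so roughly one global load is issued per output instead of three:

#define COLS_PER_THREAD 4

__global__ void conv_row3(const float * __restrict__ in,
                          float * __restrict__ out,
                          float w0, float w1, float w2, int width)
{
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * COLS_PER_THREAD;
    if (x0 + COLS_PER_THREAD + 1 >= width) return;   // skip threads at the right border

    // Load the first two inputs of the sliding window into registers once.
    float a = in[x0];
    float b = in[x0 + 1];

    #pragma unroll
    for (int c = 0; c < COLS_PER_THREAD; ++c) {
        float d = in[x0 + c + 2];                    // only one new load per output
        out[x0 + c + 1] = w0 * a + w1 * b + w2 * d;  // 3-tap filter on registers
        a = b;                                       // shift the register window
        b = d;
    }
}

Because neighbouring outputs reuse inputs already held in registers, fewer load instructions (and the corresponding address calculations) are issued per output element.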
Although the kernel in Fig. 2b is optimized and minimizes
the number of memory loads, there are still idle cycles
where the SM is stalled waiting for data, despite the many
threads it is executing to hide latency. Furthermore, many
cycles are spent on address calculations and load instructions
rather than on computations on actual data. In 64% of the
clock cycles, at least one of the two schedulers in the SM is
idle. Of the executed instructions, 34% are floating
point instructions on actual data, resulting in only 12% of
the possible executed instructions over the duration of the
kernel being spent on computations on actual data.
