3. AUTOMATIC SPEECH RECOGNITION
The ASR service implements a transcription engine based on audio segmentation and a multi-pass ASR decoding strategy. A video is first segmented into speech utterances, and the utterances are subsequently transcribed.
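As a minimal sketch of this two-stage flow only, the control structure could look as follows; the helper names (segment_into_utterances, decode_multipass) are placeholders and not part of the described system, and the bodies are stubs.

def transcribe_audio(samples, sample_rate):
    """Segment the audio track into speech utterances, then decode each one."""
    utterances = segment_into_utterances(samples, sample_rate)
    return [decode_multipass(utt, sample_rate) for utt in utterances]

def segment_into_utterances(samples, sample_rate):
    """Placeholder: would return the speech regions found by the segmenter."""
    return []

def decode_multipass(utterance, sample_rate):
    """Placeholder: first pass with baseline models, later passes with
    speaker-adapted models."""
    return ""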
The audio segmentation is based on 64-component Text Independent Gaussian Mixture Models (TIGMM). An ergodic HMM whose state models represent the TIGMMs segments the audio stream into regions of speech, music, noise, or silence by computing the Viterbi alignment. Utterances are then defined as the detected speech regions. The speech utterances are clustered based on full covariance Gaussians on the Perceptual Linear Prediction (PLP) feature stream. Each cluster is forced to contain at least 10 seconds of speech, and utterances in these clusters share adaptation parameters in the subsequent transcription process.
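The segmentation step can be illustrated with the following sketch. It assumes that per-frame log-likelihoods from the four class models (speech, music, noise, silence) have already been computed on the PLP stream; the constant switching penalty is an illustrative stand-in for the ergodic HMM's transition probabilities, not the values used in the described system.

import numpy as np

CLASSES = ["speech", "music", "noise", "silence"]

def viterbi_segment(loglik, switch_penalty=-10.0):
    """loglik: (n_frames, n_classes) per-frame log-likelihoods.
    Returns the best class label per frame under a fully connected
    (ergodic) HMM with a constant penalty for changing class."""
    n_frames, n_classes = loglik.shape
    trans = np.full((n_classes, n_classes), switch_penalty)
    np.fill_diagonal(trans, 0.0)            # no penalty for staying in a class

    score = loglik[0].copy()
    backptr = np.zeros((n_frames, n_classes), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + trans       # cand[prev, cur]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + loglik[t]

    path = np.empty(n_frames, dtype=int)
    path[-1] = score.argmax()
    for t in range(n_frames - 1, 0, -1):    # backtrace the best state sequence
        path[t - 1] = backptr[t, path[t]]
    return path

def speech_regions(path, speech_idx=0):
    """Collapse the frame labels into (start, end) frame ranges of speech."""
    regions, start = [], None
    for t, c in enumerate(path):
        if c == speech_idx and start is None:
            start = t
        elif c != speech_idx and start is not None:
            regions.append((start, t))
            start = None
    if start is not None:
        regions.append((start, len(path)))
    return regions

The resulting speech regions are the utterances; the subsequent clustering with full covariance Gaussians and the 10-second minimum per cluster are not shown here.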
The baseline transcription system was a Broadcast News (BN) system trained on the 1996 and 1997 DARPA Hub4 acoustic model training sets (about 150 hours of data) and the 1996 Hub4 CSR language model training set (128M words). This system uses a Good-Turing smoothed 4-gram language model, pruned with the Seymore-Rosenfeld algorithm [5] to about 8M n-grams for a vocabulary of about 71k words. The baseline acoustic model is trained on PLP cepstra, uses a linear discriminant analysis transform to project 9 consecutive 13-dimensional frames onto a 39-dimensional feature space, and uses Semi-tied Covariances [6]. The acoustic model uses triphone state tying with about 8k distinct distributions. Distributions model emissions with 16-component Gaussian mixture densities. In addition to the baseline acoustic model, a feature space speaker adaptive model is used [6].
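The front-end projection can be sketched as below, assuming 13-dimensional PLP cepstra and a context of 9 frames; the projection matrix is random here purely for illustration, whereas the real system estimates the LDA (and semi-tied covariance) transform from training data.

import numpy as np

FRAME_DIM = 13   # PLP cepstra per frame
CONTEXT = 9      # consecutive frames stacked (4 on each side of the centre)
OUT_DIM = 39     # projected feature dimension

def stack_frames(plp, context=CONTEXT):
    """Stack each frame with its neighbours into a 9*13 = 117-dim supervector,
    padding the edges by repeating the first/last frame."""
    half = context // 2
    padded = np.pad(plp, ((half, half), (0, 0)), mode="edge")
    n = plp.shape[0]
    return np.stack([padded[t:t + context].reshape(-1) for t in range(n)])

def project(stacked, lda_matrix):
    """Project the stacked frames onto the 39-dimensional model space."""
    return stacked @ lda_matrix

if __name__ == "__main__":
    plp = np.random.randn(100, FRAME_DIM)                 # 100 frames of PLP cepstra
    lda = np.random.randn(FRAME_DIM * CONTEXT, OUT_DIM)   # placeholder transform
    feats = project(stack_frames(plp), lda)
    print(feats.shape)                                    # (100, 39)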