I. INTRODUCTION
Yet another approach that is gaining traction in the literature is to sidestep the network adaptation problem altogether and train networks on speaker-adapted features instead. Such features can be extracted using the speaker normalization machinery readily available for GMM-HMMs, such as vocal tract length normalization (VTLN) and feature-space MLLR (FMLLR). This approach works well despite the fact that the VTLN and FMLLR transforms are estimated under a GMM-HMM acoustic model and are then used in conjunction with a DNN-HMM.
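As a rough illustration of this pipeline, the following Python sketch (names such as fmllr_transforms and load_features are hypothetical, not from this paper) applies a per-speaker affine FMLLR transform y_t = A x_t + b, estimated on the GMM-HMM side, to raw frames before DNN training:

import numpy as np

def apply_fmllr(frames, A, b):
    # frames: (T, d) raw acoustic feature vectors for one utterance
    # A: (d, d), b: (d,) -- affine fMLLR transform estimated for this
    # speaker with the GMM-HMM system
    # Returns the speaker-normalized frames y_t = A x_t + b.
    return frames @ A.T + b

# Hypothetical usage: 'fmllr_transforms' maps speaker IDs to (A, b) pairs
# produced by a GMM-HMM adaptation pass; the DNN is then trained on y.
# A, b = fmllr_transforms[speaker_id]
# y = apply_fmllr(load_features(utterance), A, b)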
A better way might be to provide the network with untransformed features and let it learn the speaker normalization during training. To do so, the network must be told which features belong to which speaker. This can be accomplished by creating two sets of time-synchronous inputs: one set of acoustic features for phonetic discrimination, and another set of features that characterize the speaker who produced the audio for the first set. This idea is similar to [3], with one important difference: in our proposed work, the features that characterize a speaker are the same for all of that speaker's data. Another work relevant to ours is [4], where the authors learn speaker codes that are fed to a speaker adaptation network; that network produces speaker-adapted features which form the input to a regular DNN. The main difference in our proposed work (besides using i-vectors instead of speaker codes) is that we train a single network that performs speaker adaptation and phone classification simultaneously, instead of two separate networks. Lastly, the noise-aware DNNs proposed in [6], which take uncompensated features and time-dependent noise estimates as input, are also relevant to our work.
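A minimal sketch of this input arrangement, assuming the speaker-characterizing vector is fixed per speaker (the function name and shapes below are illustrative): each time-synchronous acoustic frame is concatenated with that speaker's vector before entering the network.

import numpy as np

def augment_with_speaker_vector(frames, spk_vec):
    # frames:  (T, d_acoustic) time-synchronous acoustic features
    # spk_vec: (d_speaker,) vector characterizing the speaker; by design
    #          it is identical for every frame of that speaker's data
    # Returns: (T, d_acoustic + d_speaker) input to a single DNN that
    #          learns speaker normalization and phone classification jointly.
    tiled = np.tile(spk_vec, (frames.shape[0], 1))  # repeat across time
    return np.hstack([frames, tiled])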
Why speaker recognition features should be helpful can be shown through a simple thought experiment. Imagine that there are two types of speakers, say A and B, that differ in the way they pronounce the phone /AA/. Speaker type A uses the canonical pronunciation /AA/, whereas speaker type B systematically pronounces it as /AE/. A DNN without speaker features will tend to classify B's /AA/ as /AE/ because, statistically, there will be more /AE/'s with canonical pronunciations in the training data. A DNN with speaker identity features, however, will learn to significantly increase the output score for /AA/ when presented with /AE/ acoustics for speakers of type B (but not for speakers of type A). In other words, the network can learn speaker-dependent transforms of the acoustic features that map them into a canonical phone classification space in which inter-speaker variability is significantly reduced.
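The thought experiment can be made concrete with a toy example (not from this paper; all numbers are illustrative). Type-B speech is modeled as a systematic +1 shift along a one-dimensional acoustic axis, so B's /AA/ lands exactly on the canonical /AE/ region, and a logistic regression stands in for the DNN:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000  # frames per (phone, speaker-type) group

def acoustics(phone, spk_type):
    # Canonical 1-D "acoustics": /AA/ centered at 0.0, /AE/ at 1.0.
    # Type-B speakers shift everything by +1, so their /AA/ collides
    # with the canonical /AE/ region.
    mean = (0.0 if phone == "AA" else 1.0) + (1.0 if spk_type == "B" else 0.0)
    return rng.normal(mean, 0.2, size=(n, 1))

X, S, y = [], [], []
for phone, label in [("AA", 0), ("AE", 1)]:
    for spk_type, s in [("A", 0.0), ("B", 1.0)]:
        X.append(acoustics(phone, spk_type))
        S.append(np.full((n, 1), s))   # speaker-identity input
        y.append(np.full(n, label))
X, S, y = np.vstack(X), np.vstack(S), np.concatenate(y)

plain = LogisticRegression().fit(X, y)
aware = LogisticRegression().fit(np.hstack([X, S]), y)
print("acoustics only     :", plain.score(X, y))                  # ~0.75
print("with speaker input :", aware.score(np.hstack([X, S]), y))  # ~0.99

Without the speaker input, B's /AA/ frames are indistinguishable from A's /AE/ frames and roughly a quarter of the data is misclassified; with it, the classifier learns the speaker-dependent shift and the confusion disappears.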
I-vectors [7] are a popular technique for speaker verification and speaker recognition because they encapsulate all the relevant information about a speaker's identity in a low-dimensional, fixed-length representation [8]. This makes them an attractive tool for speaker adaptation techniques for ASR. A concatenation of i-vectors and ASR features is used in [9] for discriminative speaker adaptation with region-dependent linear transforms. I-vectors are also employed in [10], [11] for clustering speakers or utterances on mobile devices for more efficient adaptation. The attractiveness of i-vectors and these recent applications motivated us to investigate them for speaker adaptation of DNN acoustic models.
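As background, the total variability model underlying i-vectors [7] can be summarized in one equation (standard notation from the i-vector literature, not taken from this paper):

\[
  M(s) \;=\; m \;+\; T\,w(s), \qquad w(s) \sim \mathcal{N}(0, I)
\]

where M(s) is the GMM mean supervector for speaker s, m is the speaker-independent supervector of the universal background model, T is a low-rank total variability matrix, and the i-vector is the MAP point estimate of the latent factor w(s). Its dimension is fixed by the rank of T, independent of the amount of speech available for the speaker, which is what makes it a convenient fixed-length speaker input for a DNN.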