State-of-the-art speaker recognition systems tend to use only
short-term spectral features as voice information. Spectral
parameters take into account some aspects of the acoustic level
of the signal, like spectral magnitudes, formant frequencies, etc.,
and they are highly related to the physical traits of the speaker.
However, humans tend to use several linguistic levels, such as
lexicon, prosody or phonetics, to recognise others by voice.
These levels of information are more related to learned habits or
style, and they are mainly manifested in the dialect, sociolect or
idiolect of the speaker.
Since these linguistic levels play an important role in the
human recognition process, much effort has been devoted to
adding this kind of information to automatic speaker recognition
systems. The work in [1] showed that idiolectal information provides
good recognition performance given a sufficient amount of data, and
more recent works [2-4] have demonstrated that prosody helps
to improve voice spectrum based recognition systems, supplying
complementary information not captured in the traditional
acoustic systems. Moreover, some of these parameters are more
robust than spectral features to common problems such as noise,
transmission channel, speech level or the distance between
the speaker and the microphone.
There are probably many more characteristics that may
provide complementary information and could be of great
value for speaker recognition. This work focuses on the use of
jitter and shimmer for a speaker verification system. Jitter and
shimmer are acoustic characteristics of voice signals, and they
are quantified as the cycle-to-cycle variations of fundamental
frequency and waveform amplitude, respectively. Both features
have been widely used to detect voice pathologies (see, e.g., [5,
6]). They are commonly measured over long sustained vowels,
and values of jitter and shimmer above a certain threshold are
considered to be related to pathological voices, which are
usually perceived by humans as breathy, rough or hoarse.
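As an illustration, the cycle-to-cycle definitions above can be sketched as follows. This is a minimal sketch, not the paper's implementation: the relative measures assumed here (mean absolute difference between consecutive cycles, divided by the mean value) are the common definitions, and the function names and input values are hypothetical.

```python
def relative_jitter(periods):
    """Relative jitter: mean absolute difference between consecutive
    pitch periods, divided by the mean period."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def relative_shimmer(amplitudes):
    """Relative shimmer: the same cycle-to-cycle measure applied to
    per-cycle peak amplitudes instead of periods."""
    diffs = [abs(b - a) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

# A perfectly periodic voice gives zero jitter; small cycle-to-cycle
# perturbations of the pitch period give small positive values.
steady = [0.010] * 5                             # pitch periods (s)
perturbed = [0.010, 0.0102, 0.0099, 0.0101, 0.010]
```

Under these definitions, both measures are dimensionless ratios, which is why they are often reported as percentages in the voice-pathology literature.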
In [7] it was reported that significant differences can occur in
jitter and shimmer measurements between different speaking
styles, especially for shimmer. Nevertheless, prosody is also
highly dependent on the emotion of the speaker, and prosodic
features are useful in automatic recognition systems even when
no emotional state is distinguished.
The aim of this work is to improve a prosodic and voice
spectral verification system by introducing new features based
on jitter and shimmer measurements. The experiments have
been done over the Switchboard-I conversational speech
database. Fusion of the different features has been performed at
the score level, using z-score normalization and the matcher
weighting fusion method.
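A minimal sketch of this score-level fusion follows. It assumes, as is common for the matcher weighting method, that each subsystem's weight is inversely proportional to its equal error rate; the function names, the EER-based weighting, and the statistics are illustrative, not taken from the paper.

```python
def z_norm(scores, mean, std):
    """Z-score normalization: map raw matcher scores to zero mean and
    unit variance, using statistics estimated on development data."""
    return [(s - mean) / std for s in scores]

def matcher_weights(eers):
    """Matcher weighting: weights inversely proportional to each
    matcher's equal error rate, normalized to sum to one."""
    inv = [1.0 / e for e in eers]
    total = sum(inv)
    return [w / total for w in inv]

def fuse(score_lists, weights):
    """Fused score per trial: weighted sum of the (already
    normalized) scores from each matcher."""
    return [sum(w * s for w, s in zip(weights, trial))
            for trial in zip(*score_lists)]
```

For example, a matcher with half the error rate of another receives twice its weight, so the better-performing subsystem dominates the fused score.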
This paper is organised as follows. In the next section, an
overview of the features used in this work is presented,
including a description of jitter and shimmer measurements. The
experimental setup and verification experiments are shown in
section 3. Finally, conclusions of the experiments are given in
section 4.