State-of-the-art speaker recognition systems tend to use only
short-term spectral features as voice information. Spectral
parameters capture acoustic properties of the signal, such as
spectral magnitudes and formant frequencies, and are closely
related to the physical traits of the speaker.
However, humans tend to use several linguistic levels, such as
lexicon, prosody or phonetics, to recognise others by voice.
These levels of information are more related to learned habits or
style, and they are mainly manifested in the dialect, sociolect or
idiolect of the speaker.
Since these linguistic levels play an important role in the
human recognition process, considerable effort has been devoted
to adding this kind of information to automatic speaker
recognition systems. The authors of [1] showed that idiolectal
information provided good recognition performance given a
sufficient amount of data, and
more recent works [2-4] have demonstrated that prosody helps
to improve voice spectrum based recognition systems, supplying
complementary information not captured in the traditional
acoustic systems. Moreover, some of these parameters have the
advantage of being more robust than spectral features to common
problems such as noise, the transmission channel, speech level,
or the distance between the speaker and the microphone.
There are probably many more characteristics that may
provide complementary information and be of great
value for speaker recognition. This work focuses on the use of
jitter and shimmer for a speaker verification system. Jitter and
shimmer are acoustic characteristics of voice signals, and they
are quantified as the cycle-to-cycle variations of fundamental
frequency and waveform amplitude, respectively. Both features
have been widely used to detect voice pathologies (see, e.g., [5,
6]). They are commonly measured over long sustained vowels,
and values of jitter and shimmer above a certain threshold are
considered to be related to pathological voices, which are
usually perceived by humans as breathy, rough or hoarse voices.
In [7] it was reported that significant differences can occur in
jitter and shimmer measurements between different speaking
styles, especially in the shimmer measurement. Nevertheless,
prosody is also highly dependent on the emotion of the speaker,
and prosodic features remain useful in automatic recognition
systems even when no emotional state is distinguished.
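To make the cycle-to-cycle definition concrete, the following is a minimal sketch of the common "local" jitter and shimmer variants: the mean absolute difference between consecutive pitch periods (or cycle peak amplitudes), normalised by the mean value. The function name, the toy input values, and the particular variant are illustrative assumptions; the paper does not specify which exact formulas it uses.

```python
import numpy as np

def jitter_shimmer(periods, amplitudes):
    """Local (cycle-to-cycle) jitter and shimmer as relative averages.

    periods    : successive pitch-period durations (seconds)
    amplitudes : per-cycle peak amplitudes
    These are the common "local" variants; other definitions exist.
    """
    periods = np.asarray(periods, dtype=float)
    amplitudes = np.asarray(amplitudes, dtype=float)
    # mean absolute difference between consecutive cycles,
    # normalised by the mean over all cycles
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    shimmer = np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
    return jitter, shimmer

# toy example: a nearly periodic voice with slight cycle variation
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]
amps = [1.00, 0.97, 1.02, 0.99, 1.01]
j, s = jitter_shimmer(periods, amps)
```

In practice the pitch periods and amplitudes must first be estimated from the waveform by a pitch-tracking step, which is the harder part of the measurement.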
The aim of this work is to improve a prosodic and voice
spectral verification system by introducing new features based
on jitter and shimmer measurements. The experiments have
been carried out on the Switchboard-I conversational speech
database. Fusion of the different features has been performed at
the score level using z-score normalization and the matcher
weighting fusion method.
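The score-level fusion described above can be sketched as follows: each subsystem's scores are z-normalised with statistics estimated on a development set, then combined as a weighted sum. The function names, the example scores, and the fixed weights are illustrative assumptions; in matcher weighting the weights are typically derived from each matcher's individual error rate.

```python
import numpy as np

def zscore_norm(scores, mean, std):
    """Z-score normalisation of matcher scores using development-set stats."""
    return (np.asarray(scores, dtype=float) - mean) / std

def matcher_weighting_fusion(score_sets, weights):
    """Weighted sum of z-normalised scores from several matchers.

    Weights are normalised to sum to one; better-performing matchers
    would normally receive larger weights.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * si for wi, si in zip(w, score_sets))

# hypothetical scores from a spectral and a prosodic subsystem
spec = zscore_norm([1.2, -0.5, 2.0], mean=0.4, std=0.8)
pros = zscore_norm([0.3, 0.1, 0.9], mean=0.2, std=0.3)
fused = matcher_weighting_fusion([spec, pros], weights=[0.7, 0.3])
```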
This paper is organised as follows. In the next section, an
overview of the features used in this work is presented,
including a description of jitter and shimmer measurements. The
experimental setup and verification experiments are shown in
section 3. Finally, conclusions of the experiments are given in
section 4.