where ξ(x, sj) represents the number of times a symbol x is generated
by a state sj in a sequence, while ξ(si, sj) denotes the frequency of
the joint occurrence of two states si and sj at adjacent time steps
over a sequence, and α and β are the forward and backward variables,
respectively. These are in fact the sufficient statistics for the emission
probability bi,j and the transition probability ai,j given a sequence. ξ(sj) represents
the frequency of state sj occurring in a sequence [37,42]. All of these
values can be obtained directly from the forward–backward algorithm
[42].
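For concreteness, the following sketch shows how these sufficient statistics can be obtained from a scaled forward–backward pass over a discrete HMM. It is a minimal illustration rather than the implementation used in this work; the names A (transition matrix), B (emission matrix), pi (initial distribution) and forward_backward_stats are our own notational choices.

```python
import numpy as np

def forward_backward_stats(A, B, pi, obs):
    """Sufficient statistics of a discrete HMM for one observation sequence.

    A   : (N, N) transitions, A[i, j] = P(s_j at t+1 | s_i at t)
    B   : (N, M) emissions,   B[j, x] = P(symbol x | state s_j)
    pi  : (N,)   initial state distribution
    obs : (T,)   sequence of observed symbol indices

    Returns xi_sym (N, M), xi_trans (N, N) and xi_state (N,), i.e. the
    expected counts written above as xi(x, s_j), xi(s_i, s_j) and xi(s_j).
    Assumes strictly positive A, B and pi entries.
    """
    N, M = B.shape
    T = len(obs)

    # Forward pass (alpha), rescaled at each step for numerical stability.
    alpha = np.zeros((T, N))
    scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass (beta), reusing the forward scale factors.
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]

    # State posteriors gamma[t, j] = P(state s_j at time t | obs).
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    # xi(x, s_j): expected number of times state s_j emits symbol x.
    xi_sym = np.zeros((N, M))
    for t in range(T):
        xi_sym[:, obs[t]] += gamma[t]

    # xi(s_i, s_j): expected number of adjacent (s_i, s_j) state pairs.
    xi_trans = np.zeros((N, N))
    for t in range(T - 1):
        pair = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi_trans += pair / pair.sum()

    # xi(s_j): expected number of occurrences of state s_j.
    xi_state = gamma.sum(axis=0)
    return xi_sym, xi_trans, xi_state
```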
The DHMM kernel vector UX for a given sequence X is simply the
concatenation of the two gradient vectors calculated from Eqs. (4) and
(5), respectively. The length of the resulting feature vector for a sequence
is therefore N × (M + N).
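Given the sufficient statistics above, assembling UX amounts to evaluating the two gradients and concatenating them. Since Eqs. (4) and (5) are not reproduced here, the sketch below assumes the standard Jaakkola–Haussler form of the HMM Fisher score and reuses forward_backward_stats from the previous sketch.

```python
def fisher_score_vector(A, B, pi, obs):
    """DHMM kernel vector U_X for one sequence.

    Assumes the usual Fisher score of a discrete HMM:
        d log P / d b_{j,x} = xi(x,  s_j) / b_{j,x} - xi(s_j)
        d log P / d a_{i,j} = xi(s_i, s_j) / a_{i,j} - xi(s_i)
    where, following the text, xi(s_j) counts occurrences of state s_j
    over the whole sequence.
    """
    xi_sym, xi_trans, xi_state = forward_backward_stats(A, B, pi, obs)
    grad_B = xi_sym / B - xi_state[:, None]    # (N, M) emission gradient
    grad_A = xi_trans / A - xi_state[:, None]  # (N, N) transition gradient
    # Concatenation gives the feature vector of length N * (M + N).
    return np.concatenate([grad_B.ravel(), grad_A.ravel()])
```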
Thus, the similarity of two sequences under a learned HMM can be
evaluated in a kernel fashion using the corresponding Fisher score vectors
as follows:
K(X, Y) = K(UX, UY), (6)
where K(·) can be any standard SVM kernel. (We have used the
gpdsHMM tool [41].)
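As an illustration, the snippet below evaluates Eq. (6) with an RBF kernel over the two Fisher score vectors; both the kernel type and the width gamma_rbf are illustrative choices, not the settings used in this work.

```python
def dhmm_kernel(obs_x, obs_y, A, B, pi, gamma_rbf=1e-3):
    """K(X, Y) = K(U_X, U_Y), here instantiated with an RBF kernel."""
    u_x = fisher_score_vector(A, B, pi, obs_x)
    u_y = fisher_score_vector(A, B, pi, obs_y)
    return np.exp(-gamma_rbf * np.sum((u_x - u_y) ** 2))
```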
5. Experiments and results
5.1. Datasets
Our study is carried out on three different datasets: GPDS-ULPGC
[30], the PIE dataset [43] and the RaFD database [44]. The GPDS-ULPGC
dataset was collected by us specifically for this study. It consists of fifty
users with ten samples per user (thus 500 images in total). The database
is composed of 54% males and 46% females, with ages ranging from ten to
sixty. Each sample is a color image of size 768 × 1024 pixels. It is available
for download from [30]. The PIE dataset is a publicly available
dataset [43] composed of sixty-eight subjects, with eleven samples per
subject (thus giving 748 images in total), where each sample is a color
image of size 200 × 300. The main characteristic of this dataset is that it
contains illumination changes and different facial hair styles (e.g., bearded
and beardless, as shown in Fig. 3). The RaFD Face Dataset is composed
of sixty subjects with nine samples per subject (thus giving a total of
540 images) [44]. The image resolution is 1024 × 681 and the database
contains eight facial expressions for each subject. Since our study focuses
on static lip features, only three of the eight expressions present in the
database (neutral, sadness and indifference) suit our purpose and are
used in the experiments. Furthermore, images showing non-frontal
poses or subjects with their mouths open have been removed.
5.2. Experimental methodology
We use a multi-class SVM for classification, built using the
one-versus-all strategy. The SVM_light [45] implementation is used.
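The sketch below illustrates the one-versus-all strategy on top of the Fisher score vectors: one binary SVM is trained per user, and a test sample is assigned to the class whose SVM returns the largest margin. scikit-learn's LinearSVC is used here purely as a stand-in for SVM_light, and the training details are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(features, labels):
    """Train one binary SVM per class (class vs. all remaining classes).

    features : (n_samples, n_features) Fisher score vectors U_X
    labels   : (n_samples,) integer class (user) labels
    """
    classifiers = {}
    for c in np.unique(labels):
        clf = LinearSVC()  # stand-in for SVM_light; kernel/C are assumptions
        clf.fit(features, (labels == c).astype(int))
        classifiers[c] = clf
    return classifiers

def predict_one_vs_all(classifiers, features):
    """Assign each sample to the class whose SVM gives the largest margin."""
    classes = sorted(classifiers)
    scores = np.stack(
        [classifiers[c].decision_function(features) for c in classes], axis=1)
    return np.asarray(classes)[scores.argmax(axis=1)]
```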