This paper describes the voice command system that forms part of the multimodal user interface for a residential application project demoed at CES 2012. The application is a 3D TV panel that can be controlled through face recognition, gesture, and speech. The speech interface is invoked with an activation keyword and terminated, in similar fashion, with a de-activation keyword.
Speaker recognition is performed on the activation keyword to personalize the set of voice commands available to the particular user, who in this scenario is a member of the household. A separate setting is also provided so that a guest user can have basic interaction with the system. A template matching scheme based on dynamic time warping (DTW) is employed for its simplicity and robustness to noise. Each template is a cluster of Gaussian Mixture Models (GMMs), each representing a sub-word unit. A state model for voice interaction is presented to allow efficient operation of the interface.
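To make the template matching idea concrete, the following is a minimal sketch of classic DTW alignment between two frame sequences. It is not the paper's implementation: the paper scores frames against GMM sub-word units, whereas this sketch substitutes a plain Euclidean frame distance for illustration; the function name and feature shapes are assumptions.

```python
import numpy as np

def dtw_distance(x, y):
    """DTW distance between feature sequences x (n, d) and y (m, d).

    Returns the minimum cumulative frame-to-frame distance along a
    monotonic alignment path (standard insertion/deletion/match steps).
    """
    n, m = len(x), len(y)
    # Pairwise Euclidean distances between every frame of x and y.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    # Accumulated cost matrix with an extra border row/column.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion (x frame repeated)
                acc[i, j - 1],      # deletion (y frame skipped)
                acc[i - 1, j - 1],  # match
            )
    return acc[n, m]
```

In the paper's setting, the per-frame cost would instead be the negative log-likelihood of the frame under the GMM for the corresponding sub-word unit, which is what gives the scheme its robustness to noise.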
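The activation/de-activation flow can be sketched as a minimal two-state machine: the system idles until the activation keyword is recognized, then accepts commands until the de-activation keyword arrives. The keyword strings, state names, and `step` function below are hypothetical placeholders, not the paper's actual state model.

```python
from enum import Enum, auto

class VoiceState(Enum):
    IDLE = auto()    # waiting for the activation keyword
    ACTIVE = auto()  # accepting personalized voice commands

# Hypothetical keywords; the real system would use its own phrases.
ACTIVATION_KEYWORD = "hello tv"
DEACTIVATION_KEYWORD = "goodbye tv"

def step(state, utterance):
    """Advance the interaction state on one recognized utterance."""
    if state is VoiceState.IDLE:
        if utterance == ACTIVATION_KEYWORD:
            # Speaker recognition on this keyword would run here to
            # select the household member's personalized command set.
            return VoiceState.ACTIVE
        return VoiceState.IDLE  # ignore everything else while idle
    if utterance == DEACTIVATION_KEYWORD:
        return VoiceState.IDLE
    return VoiceState.ACTIVE    # dispatch the command, stay active
```

Keeping the recognizer gated behind the IDLE state is what allows efficient operation: full command matching only runs after the activation keyword is detected.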