In general, every input channel in a multimodal system produces
its own continuous stream of (often noisy) sensor data; all of this
data must be combined in a way that allows a decision-making
system to select appropriate system behaviour. The initial robot bartender
makes use of a rule-based social state recogniser [10], which
infers the users’ social states using guidelines derived from the study
of human-human interactions in the bartender domain [7]. The rule-based
recogniser has performed well in a user evaluation of the initial,
simple scenario [3]. However, as the robot bartender is enhanced
to support increasingly complex scenarios, the range of multimodal
input sensors will increase, as will the number of social states to
recognise, making the rule-based solution less practical. Statistical
approaches to state recognition have also been shown to be more
robust to noisy input [14]. In addition, the rule-based version considers
only the top hypothesis from each sensor and ignores the associated
confidence scores; incorporating additional hypotheses together with their
confidence scores may also improve the performance of the classifier in more
complex scenarios, but again this type of decision-making is difficult
to incorporate into a rule-based framework.
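To make the contrast concrete, the sketch below (Python, using scikit-learn) shows one way confidence scores over full n-best sensor hypotheses could be folded into a statistical classifier, alongside a rule that fires only on each sensor's top hypothesis. It is a minimal illustration, not the bartender system's actual implementation: the sensor names, state labels, feature set, and toy training data are all invented for the example.

```python
# Illustrative sketch only: contrasts a rule over top sensor hypotheses with a
# statistical classifier fed confidence-weighted features from all hypotheses.
# Sensor names, state labels, and training data are invented for illustration.
from dataclasses import dataclass
from typing import Dict, List

from sklearn.linear_model import LogisticRegression


@dataclass
class Hypothesis:
    label: str         # e.g. "facing_bartender", "order_drink"
    confidence: float  # sensor-reported confidence in [0, 1]


# --- Rule-based style: look only at each sensor's top hypothesis ------------
def rule_based_state(sensors: Dict[str, List[Hypothesis]]) -> str:
    top = {name: max(hyps, key=lambda h: h.confidence).label
           for name, hyps in sensors.items() if hyps}
    if top.get("vision") == "facing_bartender" and top.get("speech") == "order_drink":
        return "seeking_attention"
    return "not_seeking_attention"


# --- Statistical style: confidence-weighted features over all hypotheses ----
FEATURES = ["vision:facing_bartender", "vision:facing_away",
            "speech:order_drink", "speech:other"]


def feature_vector(sensors: Dict[str, List[Hypothesis]]) -> List[float]:
    """Sum the confidence mass each sensor assigns to each hypothesis label."""
    vec = dict.fromkeys(FEATURES, 0.0)
    for name, hyps in sensors.items():
        for h in hyps:
            key = f"{name}:{h.label}"
            if key in vec:
                vec[key] += h.confidence
    return [vec[f] for f in FEATURES]


if __name__ == "__main__":
    # Toy training data: confidence-weighted feature vectors with state labels.
    X = [
        [0.90, 0.10, 0.80, 0.20],   # clearly facing and ordering
        [0.55, 0.45, 0.60, 0.40],   # noisy vision, moderate speech evidence
        [0.20, 0.80, 0.10, 0.90],   # facing away, no order
        [0.80, 0.20, 0.20, 0.80],   # facing the bar, but not ordering
    ]
    y = ["seeking_attention", "seeking_attention",
         "not_seeking_attention", "not_seeking_attention"]
    clf = LogisticRegression().fit(X, y)

    # A noisy observation: the top vision hypothesis is "facing_away", but the
    # full n-best lists still carry evidence the classifier can weigh up.
    sensors = {
        "vision": [Hypothesis("facing_away", 0.55),
                   Hypothesis("facing_bartender", 0.45)],
        "speech": [Hypothesis("order_drink", 0.70),
                   Hypothesis("other", 0.30)],
    }
    print("rule-based: ", rule_based_state(sensors))
    print("statistical:", clf.predict([feature_vector(sensors)])[0])
```

In this toy setting the hand-written rule is forced to commit to whichever hypothesis happens to rank first, whereas the classifier can trade weak or conflicting evidence off across sensors; this is the kind of decision-making that is awkward to express as rules but natural in a trained model.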