Syllable-unit based speech recognition has been pursued by several researchers because this approach makes it easier to incorporate prosodic features, such as tones, into the syllable units. Moreover, several studies have reported high tone-recognition accuracy when this approach is applied to isolated syllables and to syllables hand-segmented from continuous speech. Automatic segmentation of continuous speech into syllable units is therefore an important issue. Jittiwarangkul et al. (2000) proposed using several prosodic features, including short-time energy, zero-crossing rate, and pitch, together with heuristic rules for syllable segmentation.
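To illustrate the general idea only (not the exact heuristics of the cited work), a minimal sketch of energy- and zero-crossing-based syllable endpoint detection might look as follows; the frame size, hop, thresholds, and the omission of pitch tracking are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = max(1, 1 + (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def syllable_regions(x, energy_thresh=0.1, zcr_thresh=0.3):
    """Rough voiced-syllable detector: high energy and low zero-crossing rate.
    Thresholds are illustrative, not the rules used in the cited work;
    a pitch contour would normally be used as an additional cue."""
    frames = frame_signal(x)
    energy = np.sum(frames ** 2, axis=1)
    energy /= energy.max() + 1e-12                      # normalize to [0, 1]
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    voiced = (energy > energy_thresh) & (zcr < zcr_thresh)
    # Convert the boolean frame mask into (start_frame, end_frame) regions.
    regions, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(voiced)))
    return regions
```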
Ratsameewichai et al. (2002) suggested the use of a dual-band energy contour for phoneme segmentation. This method decomposed the input speech into a low- and a high-frequency component using a wavelet transform, computed the time-domain normalized energy of each component, and applied heuristic rules to select syllable and phoneme endpoints from the two energy contours.
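As a rough illustration rather than the exact procedure of the cited work, the band splitting and energy computation could be sketched as below, assuming the PyWavelets package and an arbitrarily chosen Daubechies wavelet; the endpoint-selection heuristics themselves are not reproduced.

```python
import numpy as np
import pywt  # PyWavelets

def dual_band_energy(x, wavelet="db4", frame_len=200, hop=80):
    """Split the signal into a low- and a high-frequency band with a one-level
    DWT, then return the frame-wise normalized energy contour of each band.
    The wavelet family and frame sizes are illustrative choices; note that the
    DWT halves the sample rate of each coefficient sequence."""
    low, high = pywt.dwt(np.asarray(x, dtype=float), wavelet)

    def contour(band):
        n = max(1, 1 + (len(band) - frame_len) // hop)
        e = np.array([np.sum(band[i * hop:i * hop + frame_len] ** 2)
                      for i in range(n)])
        return e / (e.max() + 1e-12)        # time-domain normalized energy

    return contour(low), contour(high)

# Endpoint selection would then apply heuristic rules (thresholds, minimum
# segment durations, etc.) to the two contours; those rules are not shown here.
```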
experiment using other typical techniques, dividing the
speech signal into detailed frequency bands before applying
energy-based segmentation rules seems to be an effective
approach. A phoneme-segmentation experiment on 1000
Thai isolated-syllables from 10 speakers achieved an average
accuracy of nearly 95%