Abstract:
Automatic indexing of spoken documents using a speech
recognizer has recently attracted attention. However,
generating an index from an automatic transcription is
difficult because the transcription contains many
recognition errors and out-of-vocabulary (OOV) words. To
solve this problem, we propose a document expansion
method using Web documents. To recover important
keywords that occur in the spoken document but are lost
to recognition errors, we retrieve Web documents relevant
to the spoken document. An index of the spoken document
is then generated by combining an index generated from
the automatic transcription with an index generated from
the Web documents. We first propose a method for
retrieving relevant Web documents; experimental results
show that the retrieved Web documents contained many OOV
words.
Next, we propose a method for combining the index from
the recognition result (the recognized index) with the
index from the Web documents (the Web index).
Experimental results show that the index generated by
document expansion was closer to the index from the
manual transcription than the index generated by the
conventional method. Finally, we conducted a spoken
document retrieval experiment, in which the
document-expansion-based index achieved higher retrieval
precision than the index from the conventional method.
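
The abstract does not state how the recognized index and the Web index are combined. As a minimal sketch only, assuming each index maps a term to its relative frequency and assuming a simple linear interpolation between the two, the idea might look as follows. The names build_index, combine_indexes, and web_weight are hypothetical and are not taken from the paper.

```python
from collections import Counter


def build_index(text):
    """Return a term -> relative-frequency mapping for whitespace-split text.

    Assumption: a plain relative-frequency index; the paper may use a
    different term weighting.
    """
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def combine_indexes(recognized, web, web_weight=0.3):
    """Linearly interpolate two term-weight indexes.

    Assumption: web_weight in [0, 1] is a hypothetical tuning parameter.
    Terms found only in the Web index (for example, OOV words missed by
    the recognizer) still receive a non-zero weight in the result.
    """
    if not 0.0 <= web_weight <= 1.0:
        raise ValueError("web_weight must be between 0 and 1")
    terms = set(recognized) | set(web)
    return {
        term: (1.0 - web_weight) * recognized.get(term, 0.0)
        + web_weight * web.get(term, 0.0)
        for term in terms
    }
```

With this sketch, a term that appears only in the retrieved Web documents enters the combined index with a weight scaled by web_weight, while terms confirmed by both sources keep a weight that blends the two.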