Figure 1: Internal Organization of eBizSearch
4.3 Enhancing Metadata Extraction
OAI compliance requires more metadata items than CiteSeer currently provides. CiteSeer extracts metadata items using customized regular expressions, but extraction quality for some items (especially authors and date) is poor and often requires manual correction. To extend the set of metadata items and improve extraction quality, we propose a machine-learning-oriented model in which the metadata extraction algorithm results from training. The algorithm used is a Support Vector Machine (SVM) [7], a supervised learning and classification method. It extracts the 13 metadata items defined in [10] from the header of research papers. Table 1 compares our latest experimental results with those reported in [10]. The results suggest that our SVM metadata extraction algorithm can achieve better performance than HMM-based metadata extraction with less training data.
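To make the idea of classifying header lines more concrete, the following is a minimal sketch of the kind of per-line feature vector that could be fed to an SVM classifier. It is an illustration only: the feature set, the regular expressions, and the function names `line_features` and `header_features` are assumptions made for this example and are not the features or code used in our experiments.

```python
import re

# Hypothetical patterns, chosen for illustration only.
YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


def line_features(line, position, total_lines):
    """Map one header line to a small dictionary of numeric/boolean features."""
    words = line.split()
    n = len(words)
    capitalized = sum(1 for w in words if w[0].isupper())
    digits = sum(ch.isdigit() for ch in line)
    return {
        # Relative position of the line in the header (0.0 = first, 1.0 = last).
        "rel_position": position / max(total_lines - 1, 1),
        "num_words": n,
        # Share of words starting with an upper-case letter.
        "cap_ratio": capitalized / n if n else 0.0,
        # Share of characters that are digits.
        "digit_ratio": digits / len(line) if line else 0.0,
        "has_year": bool(YEAR_RE.search(line)),
        "has_email": bool(EMAIL_RE.search(line)),
    }


def header_features(header_text):
    """Split a paper header into non-empty lines and featurize each one."""
    lines = [ln.strip() for ln in header_text.splitlines() if ln.strip()]
    return [line_features(ln, i, len(lines)) for i, ln in enumerate(lines)]
```

Given labeled headers, feature vectors of this kind would be used to train one classifier per metadata item; at extraction time each line of a new header is featurized in the same way and assigned the label of the best-scoring classifier. Unlike hand-written regular expressions, the decision rule is learned from examples, so adding a metadata item requires new training labels, not new patterns.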