Abstract: This paper presents an effective keyword search method for data-centric extensible markup language
(XML) documents. The method divides an XML document into compact connected integral subtrees,
called self-integral trees (SI-Trees), to capture the structural information in the XML document. The SI-Trees
are generated based on a schema guide. Meaningful self-integral trees (MSI-Trees), which contain all or
some of the input keywords, are then identified for the keyword search over the XML document. Indexing is used to
accelerate the retrieval of MSI-Trees related to the input keywords. The MSI-Trees are ranked to identify the
top-k results. Extensive tests demonstrate that this method takes 10-100 ms to answer
a keyword query and outperforms existing approaches by 1-2 orders of magnitude.
Key words: keyword searches; extensible markup language (XML); self-integral trees; ranking; indexing