Many documents with mathematical content are published
on the Web, but conventional search engines that rely only on
keyword search cannot fully exploit their mathematical
information. In particular, keyword search is insufficient
when expressions in a document are not annotated with natural
keywords or the user cannot describe her query with
keywords. Retrieving documents by querying their mathematical
content directly is very appealing in various domains
such as education, digital libraries, engineering, patent documents,
medical sciences, etc. Capturing the relevance of
mathematical expressions also greatly enhances document
classification in such domains.
Unlike text retrieval, where keywords carry enough semantics
to distinguish text documents and rank them, math
symbols do not contain much semantic information on their
own. In fact, mathematical expressions typically consist of
a few alphabetical symbols organized in rather complex structures.
Hence, the structure of an expression, which describes
the way such symbols are combined, should also be considered.
Unfortunately, there is no standard testbed with which
to evaluate the effectiveness of a mathematics retrieval algorithm.
In this paper we study the fundamental and challenging
problems in mathematics retrieval, namely how to capture
the relevance of mathematical expressions, how to query
them, and how to evaluate the results. We describe various
search paradigms and propose retrieval systems accordingly.
We discuss the benefits and drawbacks of each approach, and
further compare them through an extensive empirical study.