As mentioned in the challenges list, this task can be performed before
the disambiguation pile. In that case, we must prune irrelevant
data. This pruning mostly relies on a “bag of words”
approach, which uses cosine similarity to rapidly compare items
(documents in [48] and sentences in [49, chap. 5]) and
to select the relevant ones according to a threshold. If performed after
the pile, this task can instead be seen as the construction of a Big Data index.
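The bag-of-words pruning described above can be sketched as follows. This is a minimal illustration, not the exact procedure of [48] or [49]: the tokenization (whitespace splitting), the raw term-frequency weighting, and the threshold value are all simplifying assumptions.

```python
from collections import Counter
import math

def bow_vector(text):
    # Bag of words: a sparse term-frequency vector over lowercase tokens.
    # (Whitespace tokenization is an assumption for this sketch.)
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Cosine of the angle between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def prune(query, documents, threshold=0.2):
    # Keep only the documents whose similarity to the query
    # reaches the threshold; the rest are discarded as irrelevant.
    q = bow_vector(query)
    return [d for d in documents
            if cosine_similarity(q, bow_vector(d)) >= threshold]
```

With a query such as `"cat mat"`, a document sharing those terms scores well above the threshold, while an unrelated document scores zero and is pruned.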
Naturally, this problem is mainly addressed by people who intend
to design a search engine. Thus, [38] builds
an inverted index for fast keyword-search answering,
where a Lucene document is output for each entity, together with a
structured index to easily retrieve pieces of information about
a given entity. Likewise, but on RDF databases, [50] uses B+-trees
to index the object identifiers of RDF nodes and also uses an
inverted index to improve keyword queries. Unlike the previously
cited authors, [51], for the purpose of querying distributed RDF
repositories, builds indices on “schema paths” (concepts whose
instances have to be joined to answer a given query) to identify
the sources that may contain the needed information.
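The core idea behind the inverted indices used in [38] and [50] can be sketched without Lucene or B+-trees: map each term to the set of items that contain it, then answer a keyword query by intersecting posting sets. The document model (a dict of id-to-text) and the tokenization are assumptions of this sketch, not details taken from the cited systems.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Inverted index: map each term to the set of document ids
    # containing it, so queries avoid scanning every document.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def keyword_search(index, query):
    # Intersect posting sets: ids of documents containing ALL query terms.
    postings = [index.get(t, set()) for t in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)
```

A real engine would add ranking and compression on top, but the intersection of posting sets is what makes keyword answering fast relative to a full scan.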