As mentioned in the challenges list, this task can be performed before the disambiguation step. In that case, irrelevant data must be pruned. This pruning mostly relies on a “bag of words” representation: cosine similarity then makes it possible to compare items quickly (documents in [48], sentences in [49, chap. 5]) and to select the relevant ones according to a threshold.
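As an illustration, the following is a minimal sketch of such bag-of-words pruning, assuming simple whitespace tokenization and an arbitrary threshold of 0.3; the helper names and example texts are ours and are not taken from [48] or [49].

    # Minimal bag-of-words pruning sketch (illustrative only).
    from collections import Counter
    from math import sqrt

    def bag_of_words(text):
        """Represent a text as a multiset of lower-cased tokens."""
        return Counter(text.lower().split())

    def cosine_similarity(a, b):
        """Cosine similarity between two bag-of-words vectors."""
        common = set(a) & set(b)
        dot = sum(a[t] * b[t] for t in common)
        norm_a = sqrt(sum(v * v for v in a.values()))
        norm_b = sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def prune(query, documents, threshold=0.3):
        """Keep only the documents whose similarity to the query reaches the threshold."""
        q = bag_of_words(query)
        return [d for d in documents if cosine_similarity(q, bag_of_words(d)) >= threshold]

    docs = ["the Eiffel Tower is in Paris",
            "Paris is the capital of France",
            "stock prices fell sharply today"]
    print(prune("Paris France capital", docs))  # keeps only the second document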
If it is performed after that step, this task can instead be seen as the construction of a Big Data index. Obviously, this problem is mainly addressed by authors who intend to design a search engine. Thus, [38] builds an inverted index for fast keyword-search answering, in which a Lucene document is output for each entity, together with a structured index that makes it easy to retrieve pieces of information about a given entity. Likewise, but on RDF databases, [50] uses B+-trees to index the object identifiers of RDF nodes and also relies on an inverted index to improve keyword queries. Unlike the previously cited authors, [51] builds, for the purpose of querying distributed RDF repositories, indices on “schema paths” (concepts whose instances have to be joined to answer a given query) in order to identify the sources that may contain the needed information.
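To fix ideas, below is a minimal plain-Python sketch of an inverted index over entity descriptions, in the spirit of the keyword indexes just discussed; it is neither [38]'s Lucene-based implementation nor [50]'s B+-tree scheme, and the entity identifiers and descriptions are invented for the example.

    # Minimal inverted-index sketch for keyword search over entities (illustrative only).
    from collections import defaultdict

    entities = {
        "Q90":  "Paris capital city of France",
        "Q142": "France country in Western Europe",
        "Q64":  "Berlin capital city of Germany",
    }

    # Map each keyword to the set of entity identifiers whose description contains it.
    index = defaultdict(set)
    for entity_id, description in entities.items():
        for token in description.lower().split():
            index[token].add(entity_id)

    def keyword_search(*keywords):
        """Return the entities matching all given keywords (conjunctive query)."""
        result = None
        for kw in keywords:
            hits = index.get(kw.lower(), set())
            result = hits if result is None else result & hits
        return result or set()

    print(keyword_search("capital", "france"))  # {'Q90'}

A production engine such as Lucene adds relevance scoring and compressed posting lists on top of this, but the lookup principle, mapping each keyword to the entities that mention it, is the same.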