4.1. Identifying relevant pieces of information in messy data
As mentioned in the list of challenges, this task can be performed before
the disambiguation pile. In this case, irrelevant data must be pruned.
This pruning mostly relies on a “bag of words” approach: cosine similarity
makes it possible to compare items rapidly (documents in [48], sentences in
[49, chap. 5]) and to keep only the relevant ones, according to a threshold.
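The sketch below illustrates this kind of threshold-based pruning on bag-of-words vectors; the naive tokenization, the toy corpus and the 0.2 threshold are assumptions made for the example and do not reproduce the setups of [48] or [49].

```python
# Minimal sketch: prune items whose bag-of-words cosine similarity to a
# reference text falls below a threshold (toy data, illustrative only).
import math
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Very naive tokenization: lowercase and split on whitespace."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def prune_irrelevant(items: list[str], reference: str, threshold: float) -> list[str]:
    """Keep only the items similar enough to the reference text."""
    ref_vec = bag_of_words(reference)
    return [i for i in items
            if cosine_similarity(bag_of_words(i), ref_vec) >= threshold]

corpus = [
    "the museum acquired a new impressionist painting",
    "stock prices fell sharply on monday",
    "the gallery exhibits impressionist and cubist paintings",
]
# Keeps the first and third items, discards the unrelated one.
print(prune_irrelevant(corpus, "impressionist painting exhibition", threshold=0.2))
```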
If performed after the pile, this task can instead be seen as the construction of a Big Data index.
Unsurprisingly, this problem is mainly addressed by authors who intend
to design a search engine. In [38], an inverted index is built for fast
keyword-query answering: a Lucene document is output for each entity,
together with a structured index that makes it easy to retrieve pieces of
information about a given entity. Likewise, but on RDF databases, [50] use
B+-trees to index the object identifiers of RDF nodes and also rely on an
inverted index to speed up keyword queries.
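As a rough illustration of the inverted-index idea (not of the Lucene-based implementation of [38], nor of the B+-tree layout of [50]), the sketch below maps each keyword to the set of entity identifiers whose description contains it, so that a keyword query reduces to a set intersection. The entity descriptions and identifiers are made up for the example.

```python
# Minimal in-memory inverted index: keyword -> set of entity identifiers.
from collections import defaultdict

entities = {
    "e1": "claude monet french impressionist painter",
    "e2": "paris capital of france",
    "e3": "impressionist exhibition held in paris",
}

index: dict[str, set[str]] = defaultdict(set)
for entity_id, description in entities.items():
    for term in description.lower().split():
        index[term].add(entity_id)

def keyword_search(query: str) -> set[str]:
    """Return the entities whose description contains every query term."""
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

print(keyword_search("impressionist paris"))   # {'e3'}
```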
Unlike the previously cited authors, [51], whose aim is to query distributed
RDF repositories, build indices on “schema paths” (concepts whose instances
have to be joined to answer a given query) in order to identify the sources
that may contain the needed information.
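To fix ideas, the hypothetical sketch below indexes sources by the schema paths whose instances they can provide, in the spirit of [51]: given the schema paths required by a query, only the sources registered for those paths need to be contacted. The path notation and the source names are assumptions made for this example, not the actual data structures of [51].

```python
# Hypothetical schema-path index: a schema path (a tuple of concepts and
# properties that must be joined to answer a query) is mapped to the
# identifiers of the sources that can contribute instances of that path.
schema_path_index: dict[tuple[str, ...], set[str]] = {
    ("Painter", "painted", "Painting"): {"source_A", "source_C"},
    ("Painting", "exhibitedIn", "Museum"): {"source_B"},
    ("Museum", "locatedIn", "City"): {"source_B", "source_C"},
}

def relevant_sources(query_paths: list[tuple[str, ...]]) -> set[str]:
    """Union of the sources that may hold information for at least one path of the query."""
    sources: set[str] = set()
    for path in query_paths:
        sources |= schema_path_index.get(path, set())
    return sources

# Which sources may answer "paintings exhibited in a museum located in a city"?
print(relevant_sources([("Painting", "exhibitedIn", "Museum"),
                        ("Museum", "locatedIn", "City")]))   # {'source_B', 'source_C'}
```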