In this fact-harvesting task, some recent approaches focus on scalability in addition to recall and precision, taking advantage of Hadoop MapReduce to distribute the pattern-matching part of their algorithm.
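A minimal sketch of this distribution strategy is given below: a plain Hadoop MapReduce job in which every mapper matches a lexical pattern over its own split of the corpus and the reducer aggregates the matches. The bornIn pattern, the class names, and the paths are hypothetical, not taken from the cited systems.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PatternMatchJob {

  public static class MatchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Hypothetical lexical pattern for a bornIn(Person, Place) relation.
    private static final Pattern BORN_IN =
        Pattern.compile("(\\p{Lu}\\w+) was born in (\\p{Lu}\\w+)");
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Each mapper scans only its own split of the corpus, so the
      // pattern-matching work is distributed across the cluster.
      Matcher m = BORN_IN.matcher(line.toString());
      while (m.find()) {
        ctx.write(new Text("bornIn(" + m.group(1) + ", " + m.group(2) + ")"), ONE);
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text fact, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      // Sum how many lines supported the same candidate fact.
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(fact, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pattern-matching");
    job.setJarByClass(PatternMatchJob.class);
    job.setMapperClass(MatchMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged as a jar, such a job runs with the standard hadoop jar launcher over an input and an output directory; the match counts coming out of the reducer can later serve as redundancy-based evidence for a fact.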
Focusing now on velocity, almost the same group of authors has proposed a novel approach for the population of knowledge bases. They propose to extract a given set of relations from the documents published in a given "time slice". This extraction can be improved based on the topics covered by a document (e.g., do not try to extract music-domain relations from a sport document, as sketched below) or by matching relation patterns against an index built from the documents.
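The topic-gating idea can be pictured as follows, assuming an upstream classifier has already labeled the document; the topics, relation names, and regular expressions are invented for the illustration.

import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopicGatedExtraction {
  // A relation pattern paired with the name of the relation it instantiates.
  private record RelPattern(String relation, Pattern pattern) {}

  // Hypothetical topic -> patterns table: music patterns are never even
  // tried on a sport document, and vice versa.
  private static final Map<String, List<RelPattern>> PATTERNS_BY_TOPIC = Map.of(
      "music", List.of(new RelPattern("releasedAlbum",
          Pattern.compile("(\\p{Lu}[\\w ]+?) released the album (\\p{Lu}[\\w ]+)"))),
      "sport", List.of(new RelPattern("playsFor",
          Pattern.compile("(\\p{Lu}[\\w ]+?) plays for (\\p{Lu}[\\w ]+)"))));

  public static void main(String[] args) {
    String document = "Lionel Messi plays for Inter Miami.";
    String topic = "sport"; // assumed to come from an upstream topic classifier

    // Only the patterns plausible for this topic are matched.
    for (RelPattern rp : PATTERNS_BY_TOPIC.getOrDefault(topic, List.of())) {
      Matcher m = rp.pattern().matcher(document);
      while (m.find()) {
        System.out.println(rp.relation() + "(" + m.group(1) + ", " + m.group(2) + ")");
      }
    }
  }
}

On a sport document only the sport patterns are ever tried, which saves the cost of matching every pattern of every domain against every document.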
Moreover, since the Web is redundant (a given fact is published by tens of sites), a small percentage of the documents can cover a significant share of the facts. Likewise, the unstructured data gathered during a time slice is converted into RDF format within that slice's duration. It is important to note that all the data gathered during a period of time must be processed within that same period; otherwise, the processing cycle gets blocked.
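This blocking constraint can be pictured with a toy pipeline in which slice t+1 is gathered while slice t is being processed, and the loop stalls whenever processing overruns the slice; the slice length and method names below are hypothetical, not the authors' architecture.

import java.time.Duration;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TimeSlicePipeline {
  // Hypothetical slice length; a real system tunes it to the document velocity.
  private static final Duration SLICE = Duration.ofMinutes(30);

  public static void main(String[] args) throws Exception {
    ExecutorService processor = Executors.newSingleThreadExecutor();
    Future<?> previousSlice = null;

    while (true) {
      List<String> documents = gatherFor(SLICE); // collect documents for slice t
      // If slice t-1 is still being processed when slice t closes,
      // the whole cycle blocks right here.
      if (previousSlice != null) previousSlice.get();
      previousSlice = processor.submit(() -> extractFacts(documents));
    }
  }

  private static List<String> gatherFor(Duration slice) {
    // Placeholder: fetch the documents published during the slice.
    return List.of();
  }

  private static void extractFacts(List<String> documents) {
    // Placeholder: pattern matching, RDF conversion, etc. for one slice.
  }
}

Handing processing to a separate thread lets gathering and extraction overlap, which is what makes the per-slice deadline attainable in the first place.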
Recall that relations can be n-ary. For instance, in the Web-representative corpus of [64], n-ary relations represented 40% of all relations. Regarding n-ary relation extraction, two works are particularly relevant; both use Stanford CoreNLP typed-dependency paths to extract the arguments of facts.
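To make the typed-dependency idea concrete, the sketch below parses one sentence with Stanford CoreNLP and prints the dependency edges attached to the predicate; chains of such edges are the paths along which the arguments of an n-ary fact are collected. The sentence and the predicate filter are illustrative only, not taken from the cited works.

import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;

public class DependencyPathDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // Illustrative sentence carrying a ternary fact:
    // received(Marie Curie, Nobel Prize, 1911).
    Annotation doc = new Annotation("Marie Curie received the Nobel Prize in 1911.");
    pipeline.annotate(doc);

    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      SemanticGraph deps = sentence.get(
          SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
      // Every typed dependency is one step of a potential argument path;
      // the edges hanging off the predicate point to the fact's arguments.
      for (SemanticGraphEdge edge : deps.edgeListSorted()) {
        if (edge.getGovernor().word().equals("received")) {
          System.out.printf("%s --%s--> %s%n",
              edge.getGovernor().word(), edge.getRelation(), edge.getDependent().word());
        }
      }
    }
  }
}

With a Universal Dependencies parse, edges such as nsubj, obj, and obl typically point from the predicate to the subject, the object, and the temporal argument, i.e., to the three arguments of the ternary fact.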
To conclude on information extraction, let us point out that it is not all about free text: some work has also focused on Web tables or lists.
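As an illustration of that last direction, a few lines with an off-the-shelf HTML parser (jsoup here, our choice) already turn a small relational Web table into candidate binary facts; the table content and the hasCapital relation are invented.

import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WebTableFacts {
  public static void main(String[] args) {
    // Hypothetical relational table; real work crawls such tables from the Web.
    String html = "<table>"
        + "<tr><th>Country</th><th>Capital</th></tr>"
        + "<tr><td>France</td><td>Paris</td></tr>"
        + "<tr><td>Japan</td><td>Tokyo</td></tr>"
        + "</table>";

    Document page = Jsoup.parse(html);
    // Each data row of a two-column relational table yields one binary fact.
    for (Element row : page.select("table tr:has(td)")) {
      List<String> cells = row.select("td").eachText();
      System.out.println("hasCapital(" + cells.get(0) + ", " + cells.get(1) + ")");
    }
  }
}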