MapReduce. Originally introduced by Google to solve the
problem of building its web search index [12], MapReduce is today
the dominant programming model and associated implementation
for processing and generating large datasets [19].
The input data format in the MapReduce framework is application-specific:
it is specified by the user [20] and is well suited to semi-structured
or unstructured data.
MapReduce's output is a
set of <key, value> pairs. The name "MapReduce" reflects
the fact that users specify an algorithm using two kernel
functions: "Map" and "Reduce".
The Map function is applied to
the input data and produces a list of intermediate <key, value>
pairs, and the Reduce function merges all intermediate values
associated with the same intermediate key [19, 20].
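To make the two kernel functions concrete, the following minimal Python sketch simulates the map, shuffle, and reduce phases for the canonical word-count example. The single-process runner and the function names are illustrative assumptions, not Hadoop's actual API.

```python
# A minimal sketch of the MapReduce programming model, using word count
# as the classic example. The runner simulates the framework's
# map -> shuffle/group-by-key -> reduce pipeline in one process;
# it is an illustrative assumption, not Hadoop's real machinery.
from collections import defaultdict

def map_fn(_, line):
    # Map: emit an intermediate <word, 1> pair for every word in the line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values sharing the same key.
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle phase: group every intermediate value by its key.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    # Reduce phase: invoke the reducer once per distinct key.
    return [pair for k, vs in sorted(groups.items())
            for pair in reduce_fn(k, vs)]

lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```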
In a Hadoop cluster, a job (i.e., a MapReduce program [11]) is executed
by first breaking it down into pieces called
tasks.
When a node in the Hadoop cluster receives a job, it divides it
into tasks and runs them in parallel across other nodes [12].
Here the data-locality problem is handled by the JobTracker,
which queries the NameNode for the locations of the data blocks
and schedules tasks on DataNodes close to their input data.
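As a rough illustration of such locality-aware scheduling, the sketch below assigns each input block to a free node that already holds a replica when possible, falling back to a remote node otherwise. The toy cluster model and all names are hypothetical simplifications, not the JobTracker's actual algorithm.

```python
# A much-simplified sketch of locality-aware task scheduling.
# block_locations maps each input block to the DataNodes holding a
# replica (the information the JobTracker would obtain from the
# NameNode). Real Hadoop scheduling is considerably more involved.
def assign_tasks(block_locations, free_nodes):
    assignments = {}
    for block, replicas in block_locations.items():
        # Prefer a free node that already stores the block (data-local);
        # otherwise fall back to any free node (remote read).
        local = [n for n in replicas if n in free_nodes]
        assignments[block] = local[0] if local else next(iter(free_nodes))
    return assignments

blocks = {"b1": {"node1", "node2"}, "b2": {"node3"}, "b3": {"node2"}}
print(assign_tasks(blocks, free_nodes={"node2", "node3", "node4"}))
# e.g. {'b1': 'node2', 'b2': 'node3', 'b3': 'node2'}
```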
Let us note that this
processing in the form of <key, value> pairs does not preclude
computations that, at first glance, do not seem feasible in a
map-reduce manner. Indeed, MapReduce has been successfully
used for RDF/RDFS and OWL reasoning [21, 22] and for structured
data querying [23].
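To suggest how structured querying can be cast into this model, the sketch below expresses a reduce-side (repartition) join of two relations on a shared attribute: the Map step keys each tuple by the join attribute and tags it with its relation, and the Reduce step pairs up tuples from both sides. The relations, tags, and inline runner are illustrative assumptions, not the technique of [23].

```python
# A sketch of structured-data querying in map-reduce form: a
# reduce-side join of employees and departments on the dept attribute.
from collections import defaultdict

employees = [("alice", "sales"), ("bob", "ops")]     # (name, dept)
departments = [("sales", "NYC"), ("ops", "Berlin")]  # (dept, city)

def map_join(emps, depts):
    # Emit <dept, tagged tuple> so joining tuples meet at one reducer.
    for name, dept in emps:
        yield dept, ("E", name)
    for dept, city in depts:
        yield dept, ("D", city)

def reduce_join(dept, tagged):
    # Cross-product of employee and department tuples for this key.
    names = [v for tag, v in tagged if tag == "E"]
    cities = [v for tag, v in tagged if tag == "D"]
    for name in names:
        for city in cities:
            yield name, dept, city

groups = defaultdict(list)          # shuffle: group by join key
for key, value in map_join(employees, departments):
    groups[key].append(value)
print([row for k, vs in groups.items() for row in reduce_join(k, vs)])
# [('alice', 'sales', 'NYC'), ('bob', 'ops', 'Berlin')]
```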