MapReduce. Originally introduced by Google to solve the problem of building its web search index [12], MapReduce is today the leading programming model and associated implementation for processing and generating large datasets [19].
The input data format in the MapReduce framework is application-specific, is specified by the user [20], and is suitable for semi-structured or unstructured data. The output of MapReduce is a set of <key, value> pairs. The name "MapReduce" reflects the fact that users specify an algorithm by means of two kernel functions, "Map" and "Reduce": the Map function is applied to the input data and produces a list of intermediate <key, value> pairs, and the Reduce function merges all intermediate values associated with the same intermediate key [19, 20].
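To make the model concrete, the listing below shows the canonical word-count example written against Hadoop's org.apache.hadoop.mapreduce API (a minimal sketch following the standard Hadoop tutorial; it is an illustration, not part of any system discussed here). The Map function emits an intermediate <word, 1> pair for every token of its input, and the Reduce function sums the counts attached to each word:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: for every token of an input line, emit the
        // intermediate pair <word, 1>.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: merge all intermediate values that share the
        // same intermediate key by summing them.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Between the two phases, the framework itself partitions the intermediate pairs, sorts them, and groups them by key, which is precisely what relieves the user from the details of parallelization.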
In a Hadoop cluster, a job (i.e., a MapReduce program [11]) is executed by breaking it down into pieces called tasks. When a node in the Hadoop cluster receives a job, it is able to divide it and run it in parallel over other nodes [12].
Here the data-location problem is solved by the JobTracker, which communicates with the NameNode to find out where the input data resides and assigns each task to a DataNode close to that data.
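The fragment below sketches this locality-aware assignment in deliberately simplified form; the class and method names are hypothetical, and Hadoop's real JobTracker additionally takes rack topology, failures, and speculative execution into account:

    import java.util.List;
    import java.util.Set;

    public class LocalityAwareScheduler {

        // Prefer a node that already stores a replica of the task's
        // input block (data-local execution); otherwise fall back to
        // any idle node, which must then read the block over the network.
        public static String pickNode(List<String> blockReplicaNodes,
                                      Set<String> idleNodes) {
            for (String node : blockReplicaNodes) {
                if (idleNodes.contains(node)) {
                    return node; // the task runs where its input resides
                }
            }
            return idleNodes.iterator().next(); // non-local assignment
        }
    }

The design intent is to move the computation to the data rather than the data to the computation, since shipping a task is far cheaper than shipping a large data block.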
Let us note that this processing in the form of <key, value> pairs does not preclude computations which, at first glance, do not seem feasible in a MapReduce manner. Indeed, MapReduce has been successfully used in RDF/RDFS and OWL reasoning [21, 22] and in structured data querying [23].
Around HDFS and MapReduce there are dozens of projects, which cannot all be presented in detail here. These projects can be classified according to their capabilities: