Atop this hardware, the Apache
Hadoop25 system implements a MapReduce
model for data analytics. Hadoop
includes a distributed file system
(HDFS) for managing large numbers
of large files, distributed (with block
replication) across the local storage of
the cluster. HDFS and HBase, an open-source
implementation of Google’s
BigTable key-value store,3 are the big-data
analogs of Lustre for computational
science, albeit optimized for different
hardware and access patterns.
Atop the Hadoop storage system,
tools (such as Pig18) provide a high-level
programming model for the two-phase
MapReduce model. Coupled
with streaming data (Storm and Flume),
graph (Giraph), and relational data
(Sqoop) support, the Hadoop ecosystem
is designed for data analysis. Moreover,
tools (such as Mahout) enable classification,
recommendation, and prediction
via supervised and unsupervised
learning. Unlike scientific computing,
application development for data analytics
often relies on Java and Web services
tools (such as Ruby on Rails).
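
The two-phase MapReduce model mentioned above can be illustrated with a minimal single-process sketch. It is not Hadoop's API: the names `map_phase`, `shuffle`, `reduce_phase`, `word_count_mapper`, `sum_reducer`, and `run_job` are hypothetical, chosen only for illustration. A real Hadoop job would distribute the map and reduce tasks across the cluster and read its input from HDFS. Word counting is assumed here as the example workload.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Phase 1: apply the mapper to each record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key (a framework does this between the two phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Phase 2: collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Hypothetical mapper: emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Hypothetical reducer: total the counts emitted for one word."""
    return sum(values)


def run_job(records):
    """Chain map, shuffle, and reduce in one process (illustration only)."""
    pairs = map_phase(records, word_count_mapper)
    return reduce_phase(shuffle(pairs), sum_reducer)


if __name__ == "__main__":
    print(run_job(["big data", "Big compute"]))
```

Because each mapper call depends only on its own record and each reducer call only on its own key, both phases can in principle run in parallel over partitions of the data. That independence is what allows a framework to schedule the work across many cluster nodes.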