The Hadoop framework also provides an implementation of the
Google distributed file system [12]. For each benchmark trial, we
store all input and output data in the Hadoop distributed file system
(HDFS). We used the default HDFS settings of three replicas
per block and no compression; we also tested other configurations, such as using only a single replica per block as well as block- and record-level compression, but we found that our tests almost
always executed at the same speed or slower with these features enabled (see Section 5.1.3). After each benchmark run finishes for a
particular node scaling level, we delete the data directories on each
node and reformat HDFS so that the next set of input data is replicated uniformly across all nodes.
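As a sketch of the settings described above, the replication factor and output compression of that era's Hadoop releases were controlled through site configuration properties; the property names below (`dfs.replication`, `mapred.output.compress`, `mapred.output.compression.type`) existed in Hadoop's pre-1.0 configuration, but the exact values shown are illustrative of the defaults used in these trials, not taken from the benchmark's actual configuration files:

```xml
<!-- hdfs-site.xml: default replication used in the trials -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- three replicas per block (HDFS default) -->
  </property>
</configuration>

<!-- mapred-site.xml: compression was left disabled by default;
     the alternative configurations tested would enable it, e.g.: -->
<configuration>
  <property>
    <name>mapred.output.compress</name>
    <value>false</value> <!-- set true for the compressed trials -->
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value> <!-- BLOCK or RECORD, per the two tested variants -->
  </property>
</configuration>
```

Changing `dfs.replication` only affects files written after the change, which is consistent with reformatting HDFS between scaling levels so that every input set is loaded and replicated uniformly from scratch.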