There are two ways to load data into Hadoop's distributed file system: (1) use Hadoop's command-line file utility to upload files stored on the local filesystem into HDFS, or (2) create a custom data loader program that writes data using Hadoop's internal I/O API. Because we did not need to alter the input data for our MR programs, we loaded the files on each node in parallel directly into HDFS as plain text using the command-line utility.
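Option (1) amounts to a single shell invocation per file (e.g., hadoop fs -put local-file /hdfs-path). For option (2), the following is a minimal sketch of such a loader written against Hadoop's FileSystem API; the file paths are hypothetical, not the ones used in our experiments:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch of option (2): copy one local file into HDFS through
// Hadoop's internal I/O API. Paths are hypothetical.
public class HdfsLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml
        FileSystem fs = FileSystem.get(conf);     // default filesystem, i.e., HDFS

        try (InputStream in = Files.newInputStream(Paths.get("/local/data/part-0.txt"));
             FSDataOutputStream out = fs.create(new Path("/hdfs/data/part-0.txt"))) {
            IOUtils.copyBytes(in, out, 4096);     // stream the bytes into the HDFS file
        }
        fs.close();
    }
}
```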
Storing the data in this manner enables MR programs to read it with Hadoop's TextInputFormat, in which the key of each record is the byte offset of the line within its file and the corresponding value is the contents of that line.
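To make the record types concrete, a mapper over such input would be declared as in the sketch below; the tokenizing body is generic word-count-style logic for illustration only, not one of our benchmark tasks:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, the key is the byte offset of the line in the
// file (LongWritable) and the value is the line's contents (Text).
public class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (token, 1) for each whitespace-separated token on the line.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                context.write(new Text(token), ONE);
            }
        }
    }
}
```

Because TextInputFormat is Hadoop's default input format, no additional job configuration is needed to obtain records in this form.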
We found that this approach yielded better performance, both during loading and during task execution, than using Hadoop's serialized data formats or compression features.