We make two adjustments for this task to simplify processing in
Hadoop. First, we allow the aggregate to include
self-references, as it is non-trivial for a Map function to discover
the name of the input file it is processing. Second, on each node
we concatenate the HTML documents into larger files when storing
them in HDFS. We found that this improved Hadoop’s performance by
a factor of two and helped avoid memory issues on the central
HDFS master (the NameNode), which arise when a large number of
files are stored in the system.
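
The second adjustment can be illustrated with a minimal sketch. This is not the procedure we actually ran; it only shows one plausible way to pack many small HTML documents into a few larger files on a node's local disk before loading them into HDFS. The function name `concatenate_documents`, the `part-NNNNN.html` naming scheme, the newline separator between documents, and the 64 MB default size cap are all assumptions made for illustration.

```python
from pathlib import Path


def concatenate_documents(src_dir, dst_dir, max_bytes=64 * 1024 * 1024):
    """Pack the *.html files in src_dir into larger files in dst_dir.

    Hypothetical helper: a new output file is started whenever adding the
    next document would push the current one past max_bytes. A document
    larger than max_bytes still gets written, alone, to its own file.
    Returns the list of output paths in the order they were created.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    outputs = []   # paths of the concatenated files written so far
    out = None     # currently open output file, if any
    written = 0    # bytes written to the current output file

    for path in sorted(src.glob("*.html")):
        data = path.read_bytes()
        # Roll over to a new output file when the size cap would be exceeded.
        if out is None or (written > 0 and written + len(data) > max_bytes):
            if out is not None:
                out.close()
            out_path = dst / f"part-{len(outputs):05d}.html"
            out = open(out_path, "wb")
            outputs.append(out_path)
            written = 0
        out.write(data)
        out.write(b"\n")  # assumed separator between documents
        written += len(data) + 1

    if out is not None:
        out.close()
    return outputs
```

The resulting files could then be copied into HDFS with the standard `hadoop fs -put` command. Fewer, larger files mean fewer per-file metadata entries for the HDFS master to hold in memory, which is the effect described above.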