First, despite the increased complexity of the query, the performance of Hadoop is yet again limited by the speed with which the
large UserVisits table (20GB/node) can be read off disk. The MR
program has to perform a complete table scan, while the parallel
database systems were able to take advantage of clustered indexes
on UserVisits.visitDate to significantly reduce the amount of data
that needed to be read. When we broke down the costs of the different parts of the Hadoop query, we found that, regardless of the
number of nodes in the cluster, phase 2 and phase 3 took on average 24.3 seconds and 12.7 seconds, respectively. In contrast, phase
1, which contains the Map task that reads the UserVisits and
Rankings tables, took an average of 1434.7 seconds to complete.
Interestingly, it takes approximately 600 seconds of raw I/O to read
the UserVisits and Rankings tables off disk, and then another 300 seconds to split, parse, and deserialize the various attributes. Thus,
the CPU overhead needed to parse these tables on the fly, rather than raw disk bandwidth, is the limiting factor for Hadoop.
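To make the parsing overhead concrete, the following sketch shows the per-record work a Map task must repeat for every line when UserVisits is stored as delimited text. The field names, the '|' delimiter, and the sample record are illustrative assumptions, not the benchmark's exact on-disk format; the point is that each record incurs a split plus several string-to-typed-value conversions that a database reading pre-parsed, typed tuples avoids.

```python
# Hedged sketch of on-the-fly record deserialization in a Map task.
# Schema fields and the '|' delimiter are assumed for illustration.
from datetime import date

def parse_user_visit(line: str) -> dict:
    """Split, parse, and deserialize one UserVisits record.

    This per-record CPU work (tokenizing, date parsing, float
    conversion) is what the Map phase pays on every input line.
    """
    fields = line.rstrip("\n").split("|")          # tokenize
    source_ip, dest_url, visit_date, ad_revenue = fields[:4]
    y, m, d = (int(p) for p in visit_date.split("-"))
    return {
        "sourceIP": source_ip,                     # kept as string
        "destURL": dest_url,
        "visitDate": date(y, m, d),                # string -> date
        "adRevenue": float(ad_revenue),            # string -> float
    }

# One hypothetical record; a real run repeats this for 20GB/node.
record = parse_user_visit("158.112.27.3|url42|2000-01-15|34.5")
```

A parallel DBMS sidesteps this cost by storing attributes in their typed binary representation at load time, so query execution reads tuples directly instead of re-parsing text on every scan.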