The performance numbers for each benchmarked system are displayed
in Figs. 7 and 8. Similar to the Grep task, this query is
limited by reading data off disk. Thus, both commercial systems
benefit from compression and outperform HadoopDB and Hadoop.
In the “small” (substring) aggregation task, we observe a reversal
of the general rule that Hive adds overhead to hand-coded Hadoop
(the time taken by Hive is represented by the lower part of the
Hadoop bar in Fig. 8). Hive performs much better than Hadoop
because it uses a hash aggregation execution strategy (it maintains
an internal hash-aggregate map in the Map phase of the job), which
proves to be optimal when there is a small number of groups. In
the large aggregation task, Hive switches to sort-based aggregation
upon detecting that the number of groups is more than half the number
of input rows per block. In contrast, in our hand-coded Hadoop
plan we (and the authors of [23]) failed to take advantage of hash
aggregation for the smaller query because sort-based aggregation
(using Combiners) is standard MapReduce practice.
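The two strategies can be contrasted with a minimal sketch. The records, the seven-character substring key, and the function names below are illustrative assumptions, not the benchmark's actual implementation; the point is only that the hash variant keeps one accumulator per group while the sort-based (Combiner-style) variant must materialize and sort every record first.

```python
# Sketch of map-side hash aggregation vs. sort-based (Combiner-style)
# aggregation. Records are hypothetical (sourceIP, adRevenue) pairs.
from collections import defaultdict

def map_side_hash_aggregate(records):
    """Keep an in-memory hash map of partial sums, as Hive's map phase
    does; effective when the number of distinct groups is small."""
    partials = defaultdict(float)
    for source_ip, ad_revenue in records:
        partials[source_ip[:7]] += ad_revenue   # substring grouping key
    return dict(partials)

def sort_based_aggregate(records):
    """Sort all keyed records first, then sum each run of equal keys --
    the standard MapReduce Combiner pattern."""
    keyed = sorted((ip[:7], rev) for ip, rev in records)
    result, current_key, acc = {}, None, 0.0
    for key, rev in keyed:
        if key != current_key:
            if current_key is not None:
                result[current_key] = acc
            current_key, acc = key, 0.0
        acc += rev
    if current_key is not None:
        result[current_key] = acc
    return result

records = [("10.1.2.3", 1.5), ("77.8.9.1", 2.0), ("10.1.2.3", 0.5)]
assert map_side_hash_aggregate(records) == sort_based_aggregate(records)
```

Both strategies produce identical groups; they differ only in work done per record, which is why the better choice depends on the number of groups.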
These results illustrate the benefit of exploiting optimizers
present in database systems and relational query systems like
Hive, which can use statistics from the system catalog or simple
optimization rules to choose between hash aggregation and sort
aggregation.
Unlike Hadoop’s Combiner, Hive serializes partial aggregates
into strings instead of maintaining them in their natural binary representation.
Hence, Hive performs much worse than Hadoop on the
larger query.
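The cost difference between the two representations can be sketched as follows. The field layout and tab-delimited text format below are illustrative assumptions (Hive's actual serialization format is not specified here); the sketch only shows that a string round trip requires formatting and re-parsing, whereas a fixed binary layout does not.

```python
# Sketch: a (count, sum) partial aggregate kept in binary form vs.
# serialized to a delimited string. Layout is a hypothetical example.
import struct

partial = (12345, 678.9)   # (count, sum)

# Fixed-width binary form: one pack/unpack, no parsing.
binary = struct.pack(">qd", *partial)          # 8-byte int + 8-byte double
count, total = struct.unpack(">qd", binary)
assert (count, total) == partial
assert len(binary) == 16

# String form: formatting on write, tokenizing and numeric parsing on read.
text = f"{partial[0]}\t{partial[1]}".encode()
c, s = text.decode().split("\t")
assert (int(c), float(s)) == partial
```

With millions of partial aggregates flowing between the Map and Reduce phases, the per-record parsing cost of the string form accumulates, consistent with Hive's poorer showing on the larger query.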
PostgreSQL chooses to use hash aggregation for both tasks as it
can easily fit the entire hash aggregate table for each 1GB chunk
in memory. Hence, HadoopDB outperforms Hadoop on both tasks
due to its efficient aggregation implementation.
This query is well-suited for systems that use column-oriented
storage, since the two attributes accessed in this query (sourceIP
and adRevenue) consist of only 20 out of the more than 200 bytes
in each UserVisits tuple. Vertica is thus able to significantly outperform
the other systems due to the commensurate I/O savings.
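A back-of-envelope calculation, using the byte counts stated above, makes the magnitude of the saving concrete:

```python
# Fraction of each UserVisits tuple a column store must read for this
# query, per the figures in the text (20 of roughly 200 bytes).
bytes_needed = 20       # sourceIP + adRevenue
bytes_per_tuple = 200   # approximate full tuple width
fraction_read = bytes_needed / bytes_per_tuple
assert fraction_read == 0.10   # ~10% of the I/O of a row store
```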
6.2.5 Join Task
The join task involves finding the average pageRank of the set
of pages visited from the sourceIP that generated the most revenue
during the week of January 15-22, 2000. The key difference between
this task and the previous tasks is that it must read in two
different data sets and join them together (pageRank information is
found in the Rankings table and revenue information is found in the
UserVisits table). There are approximately 134,000 records in the
UserVisits table that have a visitDate value inside the requisite date
range.
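The logic of the join task can be sketched in a few lines. The in-memory tables below are tiny hypothetical examples (the real systems execute this as SQL or a MapReduce plan over the full Rankings and UserVisits tables); the sketch only shows the three steps: filter by date, find the top-revenue sourceIP, and average the pageRanks of its visited pages.

```python
# Sketch of the join task over toy Rankings and UserVisits data.
from datetime import date

rankings = {"a.com": 10, "b.com": 30}   # pageURL -> pageRank
user_visits = [                          # (sourceIP, destURL, visitDate, adRevenue)
    ("1.1.1.1", "a.com", date(2000, 1, 16), 5.0),
    ("1.1.1.1", "b.com", date(2000, 1, 20), 3.0),
    ("2.2.2.2", "a.com", date(2000, 1, 17), 4.0),
]

# 1. Restrict UserVisits to the requisite date range.
start, end = date(2000, 1, 15), date(2000, 1, 22)
in_range = [v for v in user_visits if start <= v[2] <= end]

# 2. Find the sourceIP that generated the most revenue in that window.
revenue = {}
for ip, _, _, rev in in_range:
    revenue[ip] = revenue.get(ip, 0.0) + rev
top_ip = max(revenue, key=revenue.get)

# 3. Join against Rankings: average pageRank of pages that IP visited.
ranks = [rankings[url] for ip, url, _, _ in in_range if ip == top_ip]
avg_rank = sum(ranks) / len(ranks)
assert top_ip == "1.1.1.1" and avg_rank == 20.0
```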
Unlike the previous three tasks, we were unable to use the same
SQL for the parallel databases and for Hadoop-based systems. This
is because the Hive build we extended was unable to execute this