query. Although this build accepts a SQL query that joins, filters
and aggregates tuples from two tables, such a query fails during
execution. Additionally, we noticed that the query plan for joins of
this type uses a highly inefficient execution strategy. In particular,
the filtering operation is planned after joining the tables. Hence,
we are only able to present hand-coded results for HadoopDB and
Hadoop for this query.
In HadoopDB, we push the selection, join, and partial aggregation
into the PostgreSQL instances with the following SQL:
SELECT sourceIP, COUNT(pageRank), SUM(pageRank),
SUM(adRevenue) FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL AND
UV.visitDate BETWEEN '2000-01-15' AND '2000-01-22'
GROUP BY UV.sourceIP;
We then use a single Reduce task in Hadoop that gathers all of
the partial aggregates from each PostgreSQL instance to perform
the final aggregation.
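The logic of that single Reduce task can be sketched as follows (a minimal Python sketch of combining the partial aggregates, not the actual Hadoop implementation; the tuple layout and function name are illustrative):

```python
from collections import defaultdict

def final_aggregate(partials):
    """Combine the partial aggregates emitted by each PostgreSQL instance.

    Each partial is a tuple:
      (sourceIP, count_pageRank, sum_pageRank, sum_adRevenue)
    A single reducer sums the counts and sums per sourceIP; the average
    pageRank can then be derived as sum(pageRank) / count(pageRank).
    """
    totals = defaultdict(lambda: [0, 0, 0.0])
    for source_ip, cnt, rank_sum, rev_sum in partials:
        t = totals[source_ip]
        t[0] += cnt       # COUNT(pageRank)
        t[1] += rank_sum  # SUM(pageRank)
        t[2] += rev_sum   # SUM(adRevenue)
    return {ip: (c, r, rev) for ip, (c, r, rev) in totals.items()}
```

Because COUNT and SUM are algebraic aggregates, combining per-node partials in this way yields exactly the same result as a single global GROUP BY.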
The parallel databases execute the SQL query specified in [23].
Although Hadoop has support for a join operator, this operator
requires that both input datasets be sorted on the join key. Such
a requirement limits the utility of the join operator since in many
cases, including the query above, the data is not already sorted and
performing a sort before the join adds significant overhead. We
found that even if we sorted the input data (and did not include the
sort time in total query time), query performance using the Hadoop
join was lower than the performance of the three-phase MR
program used in [23], which relied on standard 'Map' and 'Reduce' operators.
Hence, for the numbers we report below, we use an identical
MR program as was used (and described in detail) in [23].
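The core of such an MR join is a repartition (reduce-side) join: the Map phase tags each tuple with its source table and emits it keyed on the join attribute, and the Reduce phase pairs up matching tuples. The following is an illustrative sketch of that idea only, not the three-phase program from [23]; record layouts and names are assumptions:

```python
def map_phase(record, source):
    # Tag each tuple with its originating table and key it on the join
    # attribute (the URL), so matching tuples meet at the same reducer.
    if source == "rankings":
        page_url, page_rank = record
        yield (page_url, ("R", page_rank))
    else:  # uservisits tuples, assumed already filtered on visitDate
        source_ip, dest_url, ad_revenue = record
        yield (dest_url, ("UV", (source_ip, ad_revenue)))

def reduce_phase(key, tagged_values):
    # Join all Rankings and UserVisits tuples that share the same URL.
    ranks = [v for tag, v in tagged_values if tag == "R"]
    visits = [v for tag, v in tagged_values if tag == "UV"]
    for page_rank in ranks:
        for source_ip, ad_revenue in visits:
            yield (source_ip, page_rank, ad_revenue)
```

Unlike Hadoop's built-in join operator, this pattern needs no pre-sorted inputs: the framework's shuffle groups tuples by key, at the cost of moving both datasets across the network.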
Fig. 9 summarizes the results of this benchmark task. For
Hadoop, we observed similar results as found in [23]: its performance
is limited by completely scanning the UserVisits dataset on
each node in order to evaluate the selection predicate.
HadoopDB, DBMS-X, and Vertica all achieve higher performance
by using an index to accelerate the selection predicate
and having native support for joins. These systems see slight
performance degradation with a larger number of nodes due to the
final single-node aggregation of, and sorting by, adRevenue.
6.2.6 UDF Aggregation Task
The final task computes, for each document, the number of inward
links from other documents in the Documents table. URL links
appearing in each document are extracted and then aggregated by target URL.
HTML documents are concatenated into large files for Hadoop
(256MB each) and Vertica (56MB each) at load time. HadoopDB
was able to store each document separately in the Documents table
using the TEXT data type. DBMS-X processed each HTML
document file separately, as described below.
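The extract-and-count step of this task can be sketched as follows (a minimal Python sketch; the regex-based link extraction is a deliberate simplification of real HTML parsing, and all names are illustrative):

```python
import re

# Simplified link pattern; a production UDF would use a real HTML parser.
URL_RE = re.compile(r'href="(http://[^"]+)"')

def count_inlinks(documents):
    """Count, for each URL, how many links to it appear across documents.

    `documents` maps a document's own URL to its HTML contents,
    mirroring one row per document in the Documents table.
    """
    counts = {}
    for _doc_url, html in documents.items():
        for link in URL_RE.findall(html):
            counts[link] = counts.get(link, 0) + 1
    return counts
```

In the MR formulation, the extraction loop is the Map phase (emitting one record per outgoing link) and the counting is a standard sum Reduce.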
The parallel databases should theoretically be able to use a user-defined
function, F, to parse the contents of each document and