is now, more than ever, a requirement to perform data analysis
inside of the DBMS, rather than pushing data to external systems
for analysis. Disk-heavy architectures such as the eBay 96-node
DBMS do not have the necessary CPU horsepower for analytical
workloads [4].
Hence, awaiting us in the future are heavy-weight analytic
database jobs, requiring more time and more nodes. The probability
of failure in these next generation applications will be far larger
than it is today, and restarting entire jobs upon a failure will be
unacceptable (failures might be common enough that long-running
jobs never finish!). Thus, although Hadoop and HadoopDB pay a
performance penalty for runtime scheduling, block-level restart,
and frequent checkpointing, such an overhead to achieve robust
fault tolerance will become necessary in the future. One feature
of HadoopDB is that it can elegantly transition between both ends
of this spectrum. Since a chunk is the basic unit of work, HadoopDB
can operate in the high-performance/low-fault-tolerance space of
today's workloads (like Vertica) by setting the chunk size to infinity,
or in the high-fault-tolerance space by using more granular chunks
(like Hadoop).
In future work, we plan to explore the fault-tolerance/performance
tradeoff in more detail.
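The fault-tolerance/performance tradeoff above can be illustrated with a back-of-the-envelope expected-runtime model (a sketch for intuition only, not part of HadoopDB; the function and parameter names are our own). It uses the standard result that a task of length c, restarted from scratch on each failure under Poisson failures of rate r, takes expected time (e^{rc} - 1)/r to complete:

```python
import math

def expected_runtime(total_work, chunk_size, failure_rate, overhead_per_chunk):
    """Expected wall-clock seconds to finish `total_work` seconds of
    computation split into independently restartable chunks.

    Assumes failures arrive as a Poisson process with rate `failure_rate`
    (failures per second) and that a failure restarts only the current
    chunk from scratch. `overhead_per_chunk` models per-chunk scheduling
    and checkpointing cost.
    """
    n_chunks = total_work / chunk_size
    # Expected completion time of one chunk under restart-on-failure.
    per_chunk = (math.exp(failure_rate * chunk_size) - 1) / failure_rate
    return n_chunks * (per_chunk + overhead_per_chunk)

# A 10-hour job with, on average, one failure every 5 hours:
T = 10 * 3600.0
rate = 1 / (5 * 3600.0)

# Chunk = whole job (no checkpointing overhead): every failure restarts everything.
coarse = expected_runtime(T, chunk_size=T, failure_rate=rate, overhead_per_chunk=0.0)

# One-minute chunks with 5 seconds of overhead each: failures cost almost nothing.
fine = expected_runtime(T, chunk_size=60.0, failure_rate=rate, overhead_per_chunk=5.0)
```

Under these (illustrative) parameters, whole-job restart roughly triples the expected runtime, while fine-grained chunks finish close to the failure-free time at the cost of the per-chunk overhead, which is the tradeoff the chunk-size knob exposes.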
8. CONCLUSION
Our experiments show that HadoopDB is able to approach the
performance of parallel database systems while achieving fault tolerance,
the ability to operate in heterogeneous environments, and a
software license cost similar to Hadoop's. Although the
performance of HadoopDB does not in general match the performance
of parallel database systems, much of the gap is due to the
fact that PostgreSQL is not a column-store and that we did not use data
compression in PostgreSQL. Moreover, Hadoop and Hive are relatively
young open-source projects; we expect future releases to improve
performance, and HadoopDB will automatically benefit from those
improvements.
HadoopDB is therefore a hybrid of the parallel DBMS and
Hadoop approaches to data analysis, achieving the performance
and efficiency of parallel databases, yet still yielding the scalability,
fault tolerance, and flexibility of MapReduce-based systems.
The ability of HadoopDB to directly incorporate Hadoop and
open source DBMS software (without code modification) makes
HadoopDB particularly flexible and extensible for performing data
analysis at the large scales expected of future workloads.
9. ACKNOWLEDGMENTS
We’d like to thank Sergey Melnik and the three anonymous reviewers
for their extremely insightful feedback on an earlier version
of this paper, which we incorporated into the final version. We’d
also like to thank Eric McCall for helping us get Vertica running
on EC2. This work was sponsored by the NSF under grants IIS-
0845643 and IIS-0844480.