The production environment for analytical data management applications
is rapidly changing. Many enterprises are shifting away
from deploying their analytical databases on high-end proprietary
machines, and moving towards cheaper, lower-end, commodity
hardware, typically arranged in a shared-nothing MPP architecture,
often in a virtualized environment inside public or private “clouds”.
At the same time, the amount of data that needs to be analyzed is
exploding, requiring hundreds to thousands of machines to work in
parallel to perform the analysis.
There tend to be two schools of thought regarding what technology
to use for data analysis in such an environment. Proponents
of parallel databases argue that the strong emphasis on performance
and efficiency of parallel databases makes them well suited
to perform such analysis. Others argue
that MapReduce-based systems are better suited due to their superior
scalability, fault tolerance, and flexibility to handle unstructured
data. In this paper, we explore the feasibility of building a hybrid
system that takes the best features from both technologies; the prototype
we built approaches parallel databases in performance and
efficiency, yet still yields the scalability, fault tolerance, and flexibility
of MapReduce-based systems.