In this section we describe the desired properties of a system designed
for performing data analysis at the (soon to be more common)
petabyte scale. In the following section, we discuss how parallel
database systems and MapReduce-based systems each fail to meet
some subset of these properties.
Performance. Performance is the primary characteristic that commercial
database systems use to distinguish themselves from other
solutions, with marketing literature often filled with claims that a
particular solution is many times faster than the competition. A
tenfold speedup can make a big difference in the amount, quality, and
depth of analysis a system can do.
High-performance systems can also yield cost savings.
Upgrading to a faster software product can allow a corporation
to delay a costly hardware upgrade, or avoid buying additional compute
nodes as an application continues to scale. On public cloud
computing platforms, pricing is pay-as-you-go: one pays only for
what one uses, so the bill grows linearly with the storage, network
bandwidth, and compute power consumed.
Hence, if data analysis software product A requires an order of magnitude
more compute units than data analysis software product B to
perform the same task, then product A will cost (approximately)
an order of magnitude more than B. Efficient software has a direct
effect on the bottom line.
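
The cost argument above can be made concrete with a small sketch. It assumes a hypothetical pay-as-you-go price list in which the bill is linear in each resource; the function name `monthly_cost` and all rates and quantities are illustrative assumptions, not figures from any real provider. Because both products store and transfer the same data and differ only in compute, the cost ratio comes out slightly below ten, which is why the comparison holds only approximately.

```python
# Hypothetical linear pay-as-you-go pricing.
# All rates and quantities below are illustrative assumptions.
def monthly_cost(compute_units, storage_tb, network_tb,
                 compute_rate=100.0, storage_rate=25.0, network_rate=10.0):
    """Return a bill that grows linearly with each resource consumed."""
    return (compute_units * compute_rate
            + storage_tb * storage_rate
            + network_tb * network_rate)

# Same task and same data; product A needs 10x the compute units of product B.
cost_b = monthly_cost(compute_units=10, storage_tb=1, network_tb=1)
cost_a = monthly_cost(compute_units=100, storage_tb=1, network_tb=1)

# The ratio is close to, but not exactly, 10 because storage and
# network charges are identical for both products.
ratio = cost_a / cost_b
print(f"A costs {ratio:.1f}x as much as B")
```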
Fault Tolerance. Fault tolerance in the context of analytical data
workloads is measured differently than fault tolerance in the context
of transactional workloads. For transactional workloads, a
fault-tolerant DBMS can recover from a failure without losing any data
or updates from recently committed transactions, and in the context
of distributed databases, can successfully commit transactions
and make progress on a workload even in the face of worker node
failures. For read-only queries in analytical workloads, there are
neither write transactions to commit, nor updates to lose upon node
failure. Hence, a fault-tolerant analytical DBMS is simply one that