Large-scale analytical data processing has become widespread
in Web companies and across industries, not least
due to low-cost storage that enabled collecting vast amounts
of business-critical data. Putting this data at the fingertips
of analysts and engineers has grown increasingly important;
interactive response times often make a qualitative
difference in data exploration, monitoring, online customer
support,
rapid prototyping, debugging of data pipelines,
and other tasks.
Performing interactive data analysis at scale demands a
high degree of parallelism. For example, reading a terabyte
of compressed data from secondary storage in 1 s would
require more than 10,000 commodity disks. Similarly,
CPU-intensive queries may need to run on thousands of
cores to complete within seconds. At Google, massively
parallel computing is done using shared clusters of commodity
machines [5]. A cluster typically hosts a multitude of
distributed applications that share resources, have widely
varying workloads, and run on machines with different
hardware parameters. An individual worker in a distributed
application may take much longer to execute a given
task than others
or may never complete due to failures or
preemption by the cluster management system. Hence,
dealing with stragglers and failures is essential for achieving
fast execution and fault tolerance.
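The disk-count figure above is a back-of-the-envelope calculation; the sketch below reproduces it under an assumed per-disk sequential read throughput of roughly 100 MB/s (a typical commodity-disk figure, not stated in the text):

```python
# Back-of-the-envelope: disks needed to scan 1 TB in 1 second,
# assuming ~100 MB/s sequential read per commodity disk (an assumption
# of this sketch, not a number given in the text).
DATA_BYTES = 10**12        # 1 terabyte
DISK_BYTES_PER_SEC = 100 * 10**6  # assumed per-disk throughput
TARGET_SECONDS = 1

disks_needed = DATA_BYTES // (DISK_BYTES_PER_SEC * TARGET_SECONDS)
print(disks_needed)  # 10000
```

At ~100 MB/s per disk, about 10,000 disks must be read in parallel, consistent with the "more than 10,000 commodity disks" claim.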
The data used in Web and scientific computing are often
non-relational. Hence, a flexible data model is essential
in these domains. Data structures used in programming
languages,
messages exchanged by distributed systems,
structured documents, etc., lend themselves
naturally to
a nested representation. Normalizing and recombining
such data at Web scale is usually prohibitive. A nested
data model underlies most of the structured data processing
at Google [22] and reportedly at other major Web
companies.
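To make the contrast concrete, the sketch below shows a nested record and the joins its normalized form would imply. The field names are illustrative (loosely echoing a web-document schema), not an actual Dremel schema:

```python
# A nested record, in the spirit of structured documents and
# protocol messages. Field names here are hypothetical examples.
document = {
    "DocId": 10,
    "Name": [
        {"Url": "http://A",
         "Language": [{"Code": "en-us", "Country": "us"},
                      {"Code": "en"}]},
        {"Url": "http://B"},  # repeated field with no Language sub-records
    ],
}

# Traversing the nested structure directly: collect all language codes.
# A normalized relational layout would instead need separate Document,
# Name, and Language tables, recombined by joins -- prohibitive at scale.
codes = [lang["Code"]
         for name in document["Name"]
         for lang in name.get("Language", [])]
print(codes)  # ['en-us', 'en']
```

Repeated and optional sub-fields (here, `Name` and `Language`) are what make a flat relational encoding awkward: each level of repetition becomes another table and another join.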
This paper describes a system called Dremel that supports
interactive analysis of very large datasets over shared
clusters of commodity machines. Unlike traditional databases,
it is capable of operating on in situ nested data. In situ
refers to the ability to access data “in place,” for example, in
a distributed file system (like the Google File System (GFS) [14]) or
another storage layer (e.g., Bigtable [9]). Dremel can execute
many queries over such data that would ordinarily require
a sequence of MapReduce (MR) [12] jobs, but at a fraction of
the execution time. Dremel is not intended as a replacement
for MR and is often used in conjunction with it to
analyze outputs of MR pipelines or rapidly prototype larger
computations.
Dremel has been in production since 2006 and has
thousands of users within Google. Multiple instances of
Dremel are deployed in the company, ranging from tens to
thousands of nodes. Examples of system usage include the
following: