5.1.3 Compression
Almost every parallel DBMS (including DBMS-X and Vertica)
allows for optional compression of stored data. It is not uncommon for compression to result in space savings of a factor of 6–10.
Vertica’s internal data representation is highly optimized for data compression, and its execution engine operates directly on compressed data (i.e., it avoids decompressing the data during processing whenever possible). In general, since analysis tasks on large
data sets are often I/O bound, trading CPU cycles (needed to decompress input data) for I/O bandwidth (compressed data means
that there is less data to read) is a good strategy and translates to
faster execution. In situations where the executor can operate directly on compressed data, there is often no trade-off at all and
compression is an obvious win.
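To make the trade-off concrete, consider an illustrative (not measured) configuration: at roughly 100MB/s of sequential disk bandwidth, scanning a 1TB table costs about 10,000 seconds of pure I/O per disk, whereas a copy compressed by a factor of eight occupies 125GB and can be read in about 1,250 seconds. The compressed plan wins as long as the engine can decompress at least that fast, or, as in Vertica’s case, avoid decompression altogether.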
Hadoop and its underlying distributed filesystem support both
block-level and record-level compression on input data. We found,
however, that neither technique improved Hadoop’s performance
and in some cases actually slowed execution. Using compression also required more effort on our part, either to change code or to prepare the input data. It should also be noted that compression was not used in the original MR benchmark [8].
In order to use block-level compression in Hadoop, we first had
to split the data files into multiple, smaller files on each node’s local
file system and then compress each file using the gzip tool. Compressing the data in this manner reduced each data set by 20–25%
from its original size. These compressed files were then copied into HDFS just as if they were plain text files. Hadoop automatically detects when files are compressed and decompresses them on the fly as they are fed into Map instances; thus, we did not need to change our MR programs to use the compressed data. Even setting aside the longer load times (if one includes the splitting and compressing), Hadoop with block-level compression slowed most of the tasks by a few seconds, while CPU-bound tasks executed 50% slower.
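As a concrete illustration of this on-the-fly decompression, the short program below (our own sketch, not part of the benchmark code; the input path is hypothetical) uses Hadoop’s CompressionCodecFactory, the same suffix-based lookup that TextInputFormat performs, to open a .gz file transparently:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CompressedReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path("/data/Rankings/part-0000.gz");   // hypothetical path

    // Suffix-based codec lookup (".gz" -> GzipCodec); returns null for plain text.
    CompressionCodecFactory codecs = new CompressionCodecFactory(conf);
    CompressionCodec codec = codecs.getCodec(input);

    InputStream in = (codec == null)
        ? fs.open(input)                                     // plain text file
        : codec.createInputStream(fs.open(input));           // decompress on the fly

    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    System.out.println(reader.readLine());                   // first record
    reader.close();
  }
}

Because gzip streams are not splittable, each compressed file is consumed by a single Map task, which is presumably why the data first had to be broken into many smaller files before compression.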
We also tried executing the benchmarks using record-level compression. This required us to (1) write a custom tuple object using Hadoop’s API, (2) modify our data loader program to transform records into compressed and serialized custom tuples, and (3) refactor each benchmark. We initially believed that this would improve
CPU-bound tasks, because the Map and Reduce tasks no longer
needed to split the fields by the delimiter. We found, however, that this approach actually performed worse than block-level compression while compressing the data by only 10%.
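For reference, step (1) might look roughly like the sketch below: a Writable tuple whose fields are stored pre-split and gzip-compressed per record. The class name and field layout are our own illustration, not the tuple class actually used in the benchmarks.

import org.apache.hadoop.io.Writable;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedTupleWritable implements Writable {
  private String[] fields = new String[0];

  public void set(String[] f) { fields = f; }
  public String[] get()       { return fields; }

  // Serialize: write the field count and fields through a per-record gzip
  // stream, then emit the compressed bytes behind a length prefix.
  public void write(DataOutput out) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream gz = new DataOutputStream(new GZIPOutputStream(buf));
    gz.writeInt(fields.length);
    for (String f : fields) gz.writeUTF(f);
    gz.close();                                  // finish the gzip trailer
    byte[] bytes = buf.toByteArray();
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  // Deserialize: read the length-prefixed payload and decompress it back into
  // pre-split fields, so Map and Reduce code never parses delimiters.
  public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    DataInputStream gz = new DataInputStream(
        new GZIPInputStream(new ByteArrayInputStream(bytes)));
    int n = gz.readInt();
    fields = new String[n];
    for (int i = 0; i < n; i++) fields[i] = gz.readUTF();
    gz.close();
  }
}

Compressing each record in isolation, as above, gives the compressor very little redundancy to exploit, which is consistent with the modest 10% reduction we observed.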