The main emphasis of previous work has been on the compression
of numerical attributes, where coding techniques
have been employed to reduce the length of integers, floating
point numbers, and dates [13, 25]. However, string attributes
(i.e., attributes declared in SQL of type CHAR(n) or
VARCHAR(n)) often comprise a large portion of the length of
a record and thus have significant impact on query performance.
For example, the TPC-H benchmark schema contains
61 attributes, out of which 26 are string-valued, constituting
60% of the total database size. Surprisingly, there has
not been much work in the database literature on compressing
string attributes. Classic compression methods such as
Huffman coding [18], arithmetic coding [31], Lempel-Ziv [32,
33] (the basis for gzip), and order-preserving methods [4]
all have considerable CPU overhead that offsets the performance
gains of reduced I/O, making their use in databases
infeasible [12]. Hence, existing work in the database literature
employs simple, lightweight techniques such as NULL
suppression and dictionary encoding [6, 29]. This paper
contributes such an effective and practical database compression
method for string-valued attributes. Our method
achieves achieves better compression ratios than existing
methods while avoiding high CPU costs during decompression.