Figure 3 shows the compression methods Scuba uses for each
data type. Integers that fit in one bit less than N bytes (that is, in
8N - 1 bits) are encoded directly using N bytes. Dictionary encoding
means that each string is stored once in a dictionary and its index in
the dictionary (an integer) is stored in the row. String columns can
be stored compressed or uncompressed, depending on how many
distinct values there are. For compressed string columns, each index
is stored using the number of bits necessary to represent the
maximum index. Uncompressed columns store the raw string and
its length. For sets of strings, the indexes are sorted and delta
encoded, and each resulting value is then Fibonacci encoded. (Fibonacci
encoding uses a variable number of bits.) The encoded indexes are stored
consecutively in the row. For vectors, the row holds a 2-byte count of
the strings and a 1-byte size giving the number of bits in the maximum
dictionary index. Each index is then stored consecutively in the
row using that number of bits. All dictionaries are local to each
leaf and separate for each column. Compressing the data reduced
its volume by more than a factor of 6 (the ratio varies per table)
compared to storing 8-byte integers and raw strings in every column.
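
The string encodings described above can be illustrated with a minimal Python sketch. It is not Scuba's implementation: all function names are hypothetical, bits are shown as text strings of '0' and '1' rather than packed bytes, and two details the text leaves open are assumptions here. The first is that set deltas are shifted by one before Fibonacci encoding, since the standard Fibonacci code cannot represent zero. The second is that at least one bit is used per index.

```python
def dictionary_encode(values):
    """Store each distinct string once; return (dictionary, per-row indexes)."""
    dictionary, positions, indexes = [], {}, []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indexes.append(positions[value])
    return dictionary, indexes


def bits_needed(max_index):
    """Bits required to represent max_index (assumption: never fewer than 1)."""
    return max(1, max_index.bit_length())


def pack_fixed_width(indexes, width):
    """Write every index consecutively using exactly `width` bits."""
    return "".join(format(i, f"0{width}b") for i in indexes)


def encode_vector(indexes):
    """2-byte count, 1-byte bit size, then the indexes at that bit size."""
    width = bits_needed(max(indexes, default=0))
    header = format(len(indexes), "016b") + format(width, "08b")
    return header + pack_fixed_width(indexes, width)


def fibonacci_encode(n):
    """Standard Fibonacci code for n >= 1; every codeword ends in '11'."""
    if n < 1:
        raise ValueError("Fibonacci coding is defined for n >= 1")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number larger than n
    bits = ["0"] * len(fibs)
    for i in range(len(fibs) - 1, -1, -1):  # greedy, largest term first
        if fibs[i] <= n:
            bits[i] = "1"
            n -= fibs[i]
    return "".join(bits) + "1"  # the extra 1 forms the '11' terminator


def encode_index_set(indexes):
    """Sort, delta encode, then Fibonacci encode each delta.

    Assumption: each delta is shifted by +1 so that a zero is encodable.
    """
    out, previous = [], 0
    for index in sorted(set(indexes)):
        out.append(fibonacci_encode(index - previous + 1))
        previous = index
    return "".join(out)


def decode_index_set(bits):
    """Inverse of encode_index_set; returns the sorted indexes."""
    fibs = [1, 2]
    indexes, current = [], 0
    value, pos, previous_bit = 0, 0, "0"
    for bit in bits:
        if bit == "1" and previous_bit == "1":  # '11' ends a codeword
            current += value - 1  # undo the +1 shift, then undo the delta
            indexes.append(current)
            value, pos, previous_bit = 0, 0, "0"
            continue
        if pos >= len(fibs):
            fibs.append(fibs[-1] + fibs[-2])
        if bit == "1":
            value += fibs[pos]
        pos += 1
        previous_bit = bit
    return indexes


if __name__ == "__main__":
    dictionary, indexes = dictionary_encode(["a", "b", "a", "c"])
    print(dictionary, indexes)
    print(encode_vector(indexes))
    print(encode_index_set([5, 0, 2]))
```

In this sketch a four-row column with three distinct strings needs only 2 bits per row for its indexes, and small gaps between sorted set members produce short Fibonacci codewords, which is why delta encoding precedes the variable-length code.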