In addition to query execution, Shark also uses Spark’s execution
engine for distributed data loading. During loading, a table is split
into small partitions, each of which is loaded by a Spark task. The
loading tasks use the data schema to extract individual fields from
rows, marshal a partition of data into its columnar representation,
and store those columns in memory.
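The per-task transformation from rows to columns can be sketched as follows. This is an illustrative Python sketch, not Shark's actual (Scala) internals; the function and schema representation are assumptions.

```python
# Illustrative sketch: a loading task transposes its partition of
# rows into per-column arrays, using the schema to name the fields.
# Names here (load_partition, schema as (name, type) pairs) are
# hypothetical, not Shark's real API.

def load_partition(rows, schema):
    """Transpose a partition of row tuples into columnar arrays."""
    columns = {name: [] for name, _ in schema}
    for row in rows:
        for (name, _), value in zip(schema, row):
            columns[name].append(value)
    return columns

rows = [(1, "a"), (2, "b"), (3, "c")]
schema = [("id", "int"), ("val", "string")]
print(load_partition(rows, schema))
# {'id': [1, 2, 3], 'val': ['a', 'b', 'c']}
```

The columnar layout produced here is what makes per-column compression decisions (described next) possible.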
Each data loading task tracks metadata to decide whether each
column in a partition should be compressed. For example, the
loading task will compress a column using dictionary encoding
if its number of distinct values is below a threshold. This allows
each task to choose the best compression scheme for each partition,
rather than conforming to a global compression scheme that might
not be optimal for individual partitions. Because these decisions are
purely local, loading tasks require no coordination, and the load
phase runs with maximum parallelism; the only cost is that each
partition must maintain its own compression metadata.
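The per-partition decision logic can be sketched as below. This is a minimal Python illustration assuming a single cardinality threshold; the threshold value and the dictionary format are assumptions, not Shark's actual implementation.

```python
# Illustrative sketch: dictionary-encode a column only when its
# number of distinct values falls below a threshold. The threshold
# and the returned metadata format are hypothetical.
DICT_THRESHOLD = 256

def compress_column(values, threshold=DICT_THRESHOLD):
    """Choose a compression scheme based on this partition's data alone."""
    distinct = list(dict.fromkeys(values))  # distinct values, first-seen order
    if len(distinct) < threshold:
        index = {v: i for i, v in enumerate(distinct)}
        # Store the dictionary once plus small integer codes per row.
        return {"scheme": "dict",
                "dictionary": distinct,
                "codes": [index[v] for v in values]}
    # Too many distinct values: dictionary would not pay off here.
    return {"scheme": "none", "values": list(values)}

col = ["US", "UK", "US", "US", "UK"]
print(compress_column(col))
# {'scheme': 'dict', 'dictionary': ['US', 'UK'], 'codes': [0, 1, 0, 0, 1]}
```

Note that the decision depends only on the partition's own data, which is why no cross-task coordination is needed: a column may be dictionary-encoded in one partition and left uncompressed in another.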
It is important to clarify that an RDD’s lineage does not need
to contain the compression scheme and metadata for each partition.
The compression scheme and metadata are simply byproducts
of the RDD computation, and can be deterministically recomputed
along with the in-memory data in the case of failures.
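This determinism can be made concrete with a small sketch: because the metadata is a pure function of the partition's input data, recomputing a lost partition reproduces identical metadata, so nothing extra needs to be recorded in the lineage. The function below is hypothetical and purely illustrative.

```python
# Illustrative sketch: compression metadata is a deterministic
# byproduct of loading, so recomputation after a failure yields
# exactly the same result. load_and_compress is hypothetical.

def load_and_compress(rows):
    """Deterministically build a dictionary-encoded column."""
    distinct = sorted(set(rows))
    index = {v: i for i, v in enumerate(distinct)}
    return {"dictionary": distinct,
            "codes": [index[v] for v in rows]}

partition = ["x", "y", "x"]
first = load_and_compress(partition)
recomputed = load_and_compress(partition)  # e.g. after a node failure
assert first == recomputed  # same input data, same metadata
```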