In-memory computation is essential to low-latency query answering,
given that memory’s throughput is orders of magnitude higher than that of disks. Naïvely using Spark’s memory store, however,
can lead to undesirable performance. For this reason, Shark implements
a columnar memory store on top of Spark’s native memory
store.
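To make the layout concrete, the sketch below shows one way a columnar in-memory format can be organized on the JVM: each column of a cached partition is held as a primitive array, so a scan reads raw values without materializing row objects. This is a minimal illustration, not Shark's actual implementation; the `ColumnarPartition` class and its fields are assumed names.

```scala
// Illustrative sketch of a columnar in-memory layout (not Shark's actual code).
// Each column of a cached partition is stored contiguously, so scans read
// primitives directly instead of deserializing row objects.
case class ColumnarPartition(
    ids: Array[Int],        // column 1: JVM primitive int array
    prices: Array[Double],  // column 2: JVM primitive double array
    names: Array[String])   // column 3: reference type, still one array per column

object ColumnarPartitionExample {
  // Transpose row-oriented input into the columnar layout.
  def fromRows(rows: Seq[(Int, Double, String)]): ColumnarPartition =
    ColumnarPartition(
      rows.map(_._1).toArray,
      rows.map(_._2).toArray,
      rows.map(_._3).toArray)

  // A query touching only one column scans a single primitive array
  // and never materializes full rows.
  def sumPrices(p: ColumnarPartition): Double = {
    var sum = 0.0
    var i = 0
    while (i < p.prices.length) { sum += p.prices(i); i += 1 }
    sum
  }
}
```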
In-memory data representation affects both space footprint and
read throughput. A naïve approach is to simply cache the on-disk
data in its native format, performing on-demand deserialization in
the query processor. This deserialization becomes a major bottleneck:
in our studies, we saw that modern commodity CPUs can
deserialize at a rate of only 200MB per second per core.
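For contrast, the following sketch illustrates the naïve scheme described above: rows are cached as serialized byte arrays and deserialized on every access, so each read pays the per-core deserialization cost. Java serialization stands in here for the native on-disk format, and the helper names are hypothetical.

```scala
import java.io._

// Illustrative sketch of the naïve approach (hypothetical helpers, not Shark's
// code): cache rows in serialized form and deserialize on demand.
object NaiveRowCache {
  case class Row(id: Int, price: Double, name: String)

  // Serialize a row; Java serialization stands in for the on-disk format.
  def serialize(row: Row): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(row)
    out.close()
    bytes.toByteArray
  }

  // Every read deserializes the whole row, even if the query touches a
  // single column -- this is where the per-core throughput limit bites.
  def readPrice(cached: Array[Byte]): Double = {
    val in = new ObjectInputStream(new ByteArrayInputStream(cached))
    in.readObject().asInstanceOf[Row].price
  }
}
```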