Most tables have default constraints of 30 days and 100 GB,
although there are higher limits for high-volume tables like fbflow,
which holds network traffic data, and ads metrics, which records
revenue-generating data like ad clicks and impressions. Every 15
minutes, a cron job evicts data that has exceeded its age limit. If the
table exceeds its space limit, the oldest data for the table is evicted
until it is under its limit.
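
The two eviction rules above can be sketched as follows. This is a minimal illustration, not Scuba's implementation: the `Chunk` type, the `evict` function, and the assumption that data is evicted in whole chunks ordered by their newest timestamp are all hypothetical, and 1 GB is taken as 10^9 bytes for simplicity.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60   # seconds per day
GB = 10**9           # bytes per gigabyte (decimal, an assumption)


@dataclass
class Chunk:
    """Hypothetical unit of eviction: a batch of rows for one table."""
    newest_ts: int    # timestamp of the newest row in the chunk, in seconds
    size_bytes: int


def evict(chunks, now, max_age_s=30 * DAY, max_bytes=100 * GB):
    """Return the chunks that survive one eviction pass, oldest first."""
    # Age rule: drop everything that has exceeded the age limit.
    kept = [c for c in chunks if now - c.newest_ts <= max_age_s]

    # Space rule: drop the oldest remaining chunks until the table fits.
    kept.sort(key=lambda c: c.newest_ts)
    total = sum(c.size_bytes for c in kept)
    first_kept = 0
    while first_kept < len(kept) and total > max_bytes:
        total -= kept[first_kept].size_bytes
        first_kept += 1
    return kept[first_kept:]
```

In this sketch the periodic job would call `evict` once per table with that table's own limits; the age rule runs first so that the space rule only ever removes data that is still within its age limit.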
To keep some data around longer than the space limits
allow, Scuba also provides subsampling of data. In this case, a
uniform fraction of the rows older than a certain age is kept and
the remainder is deleted. In the future, we would like to explore more
sophisticated forms of sampling, such as stratified sampling, that
might choose a more representative set of rows.
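
Uniform subsampling of old rows can be sketched as below. The function name, the `time` field on each row, and the parameters `age_threshold_s` and `keep_fraction` are assumptions made for illustration; the text does not specify how Scuba selects the retained rows, so independent random selection per row is used here as one simple way to keep a uniform fraction in expectation.

```python
import random


def subsample_old_rows(rows, now, age_threshold_s, keep_fraction, rng=None):
    """Keep all recent rows and roughly keep_fraction of the older ones.

    rows: iterable of dicts with a "time" key (seconds), a hypothetical layout.
    """
    rng = rng or random.Random()
    kept = []
    for row in rows:
        is_old = now - row["time"] > age_threshold_s
        # Recent rows are always kept; each old row survives independently
        # with probability keep_fraction.
        if not is_old or rng.random() < keep_fraction:
            kept.append(row)
    return kept
```

Because every old row has the same chance of surviving, rare kinds of rows can disappear entirely from the sample, which is the weakness that stratified sampling would address.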