The time scale of our A/B tests might seem long, especially compared to those used by
many other companies to optimize metrics, such as click-through rates. This is partly
addressed by testing multiple variants against a control in each test: rather than
having only two variants, A and B, we typically include 5 to 10 algorithm variants in
each test, for example, the same new model but with different signal subsets, parameter
values, or model trainings. This is still too slow, however, to help us find, say, the
best parameter values for a model with many parameters. For new members, more test
cells also mean that more days of new signups must be allocated into the test before
each cell reaches the same sample size.
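To make that accounting concrete, the following back-of-the-envelope sketch shows how the number of days of signups scales with the number of cells; the signup rate and per-cell sample size are made-up illustrative numbers, not actual figures.

```python
# Illustrative arithmetic only; the numbers below are assumptions, not real data.
daily_signups_in_test = 5_000    # new members allocated to this test per day
sample_size_per_cell  = 50_000   # target number of members in each cell

def days_to_fill(num_cells: int) -> float:
    """Days of signups needed before every cell reaches the target sample size."""
    return num_cells * sample_size_per_cell / daily_signups_in_test

print(days_to_fill(2))   # a simple A/B test: 20 days
print(days_to_fill(10))  # 10 variants plus control-sized cells: 100 days
```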
Another option to speed up testing is to execute many different A/B tests at once
on the same member population. As long as the variations in test experience are
compatible with each other, and we judge that they do not interact nonlinearly in the
member experience, we might allocate each new member into several different tests at
once, for example, a similars test, a PVR algorithm test, and a search test. Accordingly, a
single member might get similars algorithm version B, PVR algorithm version D, and
search results version F. Over perhaps 30 sessions during the test period, the member’s
experience is accumulated into metrics for each of the three different tests.
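A minimal sketch of how such independent, concurrent allocation could work is shown below; the hash-based scheme, the function name allocate, and the cell counts are illustrative assumptions, not a description of our actual allocation system.

```python
import hashlib

def allocate(member_id: str, test_name: str, num_cells: int) -> int:
    """Deterministically map a member to a cell for a given test.

    Hashing the (test, member) pair makes assignments effectively independent
    across tests, so a single member can simultaneously be in, say, similars
    cell B, PVR cell D, and search cell F.
    """
    digest = hashlib.sha256(f"{test_name}:{member_id}".encode()).hexdigest()
    return int(digest, 16) % num_cells

member = "new-member-123"
for test, cells in [("similars", 5), ("pvr", 10), ("search", 6)]:
    print(test, allocate(member, test, cells))
```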
But to really speed up innovation, we also rely on a different type of experimentation
based on analyzing historical data. This offline experimentation changes from algorithm
to algorithm, but it always consists of computing, for every algorithm variant tested, a
metric that describes how well the variant fits previous user engagement.
For example, for PVR, we might have 100 different variants that differ only in their
parameter values, each trained on data up to two days ago.
We then use each algorithm variant to rank the catalog for a sample of members using
data up to two days ago, and find the ranks that each variant assigns to the videos
those members played in the last two days. From these ranks we compute metrics
for each member under each variant, such as the mean reciprocal rank, precision, and
recall, which are then averaged across the members in the sample, possibly with some
normalization. For a different and detailed offline metric example, used for our page
construction algorithm, see Alvino and Basilico [2015]. Offline experiments allow us to
iterate quickly on algorithm prototypes, and to prune the candidate variants that we
use in actual A/B experiments. The typical innovation flow is shown in Figure 8.
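The following is a minimal sketch of this kind of offline evaluation, assuming a per-member ranking function, a cutoff k, and a simple data layout; the function names, the cutoff value, and the plain averaging are illustrative assumptions rather than the actual production pipeline.

```python
import numpy as np

def reciprocal_rank(ranked_videos, played):
    """1 / rank of the first played video in the ranked list (0 if none appears)."""
    for rank, video in enumerate(ranked_videos, start=1):
        if video in played:
            return 1.0 / rank
    return 0.0

def precision_recall_at_k(ranked_videos, played, k):
    """Precision and recall of the top-k ranked videos against the played set."""
    top_k = set(ranked_videos[:k])
    hits = len(top_k & played)
    precision = hits / k
    recall = hits / len(played) if played else 0.0
    return precision, recall

def evaluate_variant(rank_fn, members, plays_last_two_days, k=40):
    """Average MRR, precision@k, and recall@k over a sample of members.

    rank_fn(member) returns the catalog ranked by the variant using only data
    available up to two days ago; plays_last_two_days maps each member to the
    set of videos they played since then.
    """
    mrrs, precisions, recalls = [], [], []
    for member in members:
        ranked = rank_fn(member)
        played = plays_last_two_days.get(member, set())
        if not played:
            continue  # no recent engagement to score this member against
        mrrs.append(reciprocal_rank(ranked, played))
        p, r = precision_recall_at_k(ranked, played, k)
        precisions.append(p)
        recalls.append(r)
    return np.mean(mrrs), np.mean(precisions), np.mean(recalls)
```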
As appealing as offline experiments are, they have a major drawback: they assume
that members would have behaved the same way, for example, playing the same videos,
if the new algorithm being evaluated had been used to generate the recommendations.
Thus, for instance, a new algorithm whose recommendations differ greatly from those of
the production algorithm is unlikely to find that its recommendations were played more
than those of the production algorithm, which is the one that actually served our
members. This suggests that offline
experiments need to be interpreted in the context of how different the algorithms
being tested are from the production algorithm. However, it is unclear what distance
metric across algorithms would lead to offline experiment interpretations that correlate
better with A/B test outcomes, which is ultimately what we are after. Thus, while we
rely heavily on offline experiments, for lack of a better option, to decide when to A/B
test a new algorithm and which new algorithms to test, we do not find them as
predictive of A/B test outcomes as we would like.
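For illustration only, one candidate distance of this kind is the average lack of overlap between the top-N recommendations of a tested algorithm and those of the production algorithm; the sketch below, including the cutoff N, is an assumption and not a metric we claim resolves the issue.

```python
def top_n_distance(rank_fn_candidate, rank_fn_production, members, n=40):
    """1 - average fraction of top-n recommendations shared with production.

    Returns 0.0 when the two algorithms recommend identical top-n lists for
    every sampled member, and 1.0 when the lists are completely disjoint.
    """
    overlaps = []
    for member in members:
        top_a = set(rank_fn_candidate(member)[:n])
        top_b = set(rank_fn_production(member)[:n])
        overlaps.append(len(top_a & top_b) / n)
    return 1.0 - sum(overlaps) / len(overlaps)
```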