obtaining a sample of test subjects that would faithfully represent
the real user population is virtually impossible. In
addition to the biased-sample problem, conducting a user
study to evaluate a large-scale video recommendation system
has several other disadvantages.
First, even given a very detailed set of instructions, it
would be difficult for the test subject to judge what would
be the best related video to suggest, since this decision is
very subjective and may be influenced by factors such as
user demographics, geographic location, emotional state and
cultural preferences. Even for relatively objective evaluation
tasks, such as document retrieval [6], inter-judge agreement
is low. We expect the agreement rate to be even lower
for rating video relatedness, which is highly subjective.
Second, as research shows [22], there is often a disconnect
between what subjects say they would like to watch and
what they actually choose to watch. This leads to a situation
in which explicitly solicited judgments correlate poorly
with the observed user behavior in the system.
Therefore, in the next sections we evaluate the performance
of the proposed methods using user simulations and
a large-scale online experiment, and forgo evaluation of our
methods on manually labeled data.
7.1.2 Metrics
Given the user-centric evaluation of our system, in this
section we address which evaluation metric is most suitable
in this particular setting.
One possible choice of metric is the click-through rate on
the related video suggestions presented to the user by the
system. However, research shows that the click-through rate
can be highly biased by factors such as the position and visual
attractiveness of the presentation [32]. We expect this bias to be
particularly strong in our setting, where the results are presented
in ranked order and each related result is presented as a small
snapshot from the video.
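To make the position bias concrete, the following is a minimal sketch (not the paper's implementation) of computing click-through rate from impression logs, broken down by display position; the event schema and field names are illustrative assumptions.

```python
# Sketch: per-position click-through rate over a log of impressions.
# Breaking CTR down by position exposes the bias discussed above:
# higher-ranked slots tend to show inflated CTR regardless of relevance.
from collections import defaultdict

def ctr_by_position(impressions):
    """impressions: iterable of (position, clicked) pairs, one per shown result."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for position, was_clicked in impressions:
        shown[position] += 1
        if was_clicked:
            clicked[position] += 1
    # CTR for each position = clicks / impressions at that position
    return {pos: clicked[pos] / shown[pos] for pos in shown}

log = [(1, True), (1, False), (1, True), (2, False), (2, True), (3, False)]
print(ctr_by_position(log))  # {1: 0.666..., 2: 0.5, 3: 0.0}
```

Comparing systems on raw CTR without such a breakdown would conflate relevance gains with presentation effects.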
Another choice of metric is based on the main functionality
of the related video suggestion system: measuring
the watch times of the suggested videos. Intuitively, a
systematic improvement of the system will generate more
relevant suggestions, which will result in higher user engagement
with the system and lead to longer watch times
of the suggested related videos.
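This attribution can be sketched as follows; a hedged, minimal illustration (not the paper's pipeline) that accumulates watch time only in the session segment following a click on a related suggestion, using an assumed event schema.

```python
# Sketch: total watch time attributed to a related-video click.
# Events are chronological (event_type, seconds_watched) tuples;
# 'related_click' marks a click on a related suggestion, 'watch' a playback span.
def session_watch_time(events):
    total = 0.0
    counting = False
    for event_type, seconds in events:
        if event_type == 'related_click':
            counting = True  # start attributing subsequent watch time
        elif event_type == 'watch' and counting:
            total += seconds
    return total

session = [('watch', 120.0),          # pre-click viewing, not attributed
           ('related_click', 0.0),
           ('watch', 300.0), ('watch', 95.0)]
print(session_watch_time(session))  # 395.0
```

Note that watch time before the first related click is excluded, since only post-click engagement reflects the quality of the suggestion.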
Following this intuition, we choose a watch time metric,
which estimates how much time the user spends watching
videos during the session following a click on a related video
suggestion. While the watch time metric has its limitations
(e.g., it may favor longer videos), it is