In this work, we propose to model relative attributes.
As opposed to predicting the presence of an attribute,
a relative attribute indicates the strength of an attribute
in an image with respect to other images. For example, in Figure 1, while it is difficult to assign a meaningful
value to the binary attribute ‘smiling’, we could all agree on
the relative attribute, i.e. Hugh Laurie is smiling less than
Scarlett Johansson, but more than Jared Leto. In addition to
being more natural, relative attributes would offer a richer
mode of communication, thus allowing access to more detailed
human supervision (and so potentially higher recognition
accuracy), as well as the ability to generate more informative
descriptions of novel images.
How can we learn relative properties? Whereas traditional
supervised classification is appropriate for learning attributes
that are intrinsically binary, it falls short when we
want to represent visual properties that are nameable but not
categorical. Our goal is instead to estimate the degree of
an attribute’s presence, which, importantly, differs from
the probability of a binary classifier’s prediction. To this
end, we devise an approach that learns a ranking function
for each attribute, given relative similarity constraints on
pairs of examples (or more generally a partial ordering on
some examples). The learned ranking function can estimate
a real-valued rank for each image, indicating the relative
strength of the attribute’s presence in it. Then, we introduce
novel forms of zero-shot learning and description that
exploit the relative attribute predictions.
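
To make the idea of learning a ranking function from pairwise constraints concrete, the following is a minimal sketch rather than the formulation used in this work. It assumes each image is already described by a feature vector, assumes a linear scoring function w·x, and swaps in plain subgradient descent on a pairwise hinge loss as the optimizer. The names fit_linear_ranker and rank_scores, and all hyperparameter values, are hypothetical.

```python
import numpy as np


def fit_linear_ranker(X, ordered_pairs, lr=0.1, reg=1e-3, epochs=200):
    """Fit a weight vector w so that w @ X[i] > w @ X[j] for each pair (i, j).

    Assumptions (not taken from the paper): a linear scoring function and
    subgradient descent on a pairwise hinge loss with L2 regularization.
    Each (i, j) in ordered_pairs means "image i shows the attribute more
    strongly than image j".
    """
    X = np.asarray(X, dtype=float)
    pairs = np.asarray(ordered_pairs, dtype=int)
    # One difference vector x_i - x_j per ordering constraint.
    diffs = X[pairs[:, 0]] - X[pairs[:, 1]]
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = diffs @ w
        violated = margins < 1.0  # constraints not yet satisfied with margin 1
        grad = reg * w - diffs[violated].sum(axis=0) / len(diffs)
        w -= lr * grad
    return w


def rank_scores(w, X):
    """Return a real-valued attribute strength for each row of X."""
    return np.asarray(X, dtype=float) @ w


# Toy usage: three images with a single hypothetical feature, where
# image 2 > image 1 > image 0 in attribute strength.
X_train = [[0.0], [1.0], [2.0]]
w = fit_linear_ranker(X_train, [(2, 1), (1, 0)])
scores = rank_scores(w, X_train)
```

The scores are meaningful only relative to one another: a larger value indicates a stronger attribute presence, but the values are not probabilities, which matches the distinction drawn above between a rank and a binary classifier's output.
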
The proposed ranking approach accounts for a subtle but
important difference between relative attributes and conceivable
alternatives based on regression or multi-way classification.
While such alternatives could also allow for a
richer vocabulary, during training they could suffer from
inconsistencies similar to those of binary attributes. For example,
it is more difficult to define, and perhaps more importantly to
agree on, an answer to “With what strength is he smiling?” than to “Is he
smiling more than she is?” Thus, we expect the relative
mode of supervision our approach permits to be more natural
and consistent for human labelers.