Human-nameable visual “attributes” can benefit various
recognition tasks. However, existing techniques restrict
these properties to categorical labels (for example, a person
is ‘smiling’ or not, a scene is ‘dry’ or not), and thus
fail to capture more general semantic relationships. We
propose to model relative attributes. Given training data
stating how object/scene categories relate according to different
attributes, we learn a ranking function per attribute.
The learned ranking functions predict the relative strength
of each property in novel images. We then build a generative
model over the joint space of attribute ranking outputs,
and propose a novel form of zero-shot learning in which the
supervisor relates the unseen object category to previously
seen objects via attributes (for example, ‘bears are furrier
than giraffes’). We further show how the proposed relative
attributes enable richer textual descriptions of new images,
which in practice humans can interpret more precisely.
We demonstrate the approach on datasets of faces and
natural scenes, and show its clear advantages over traditional
binary attribute prediction for these new tasks.
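
As a rough illustration of the ranking step, the sketch below learns one relative-attribute ranker from ordered image pairs via the standard reduction of pairwise ranking to a linear SVM on feature differences. The paper's own formulation is a modified ranking SVM that also accepts "similar strength" pairs; the features, pair counts, and `LinearSVC` settings here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' exact formulation): learn one
# relative-attribute ranking function from ordered image pairs by
# classifying signed feature differences.  All data below is synthetic.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))        # placeholder image features
true_w = rng.normal(size=64)          # hidden attribute-strength direction
strength = X @ true_w

# Ordered pairs (i, j): "image i shows the attribute more than image j".
pairs = []
for _ in range(500):
    i, j = rng.integers(0, 200, size=2)
    if strength[i] > strength[j]:
        pairs.append((i, j))

# Reduce each ordered pair to two signed difference vectors.
diffs = np.array([X[i] - X[j] for i, j in pairs] +
                 [X[j] - X[i] for i, j in pairs])
labels = np.array([1] * len(pairs) + [-1] * len(pairs))

svm = LinearSVC(C=1.0, fit_intercept=False, max_iter=10000)
svm.fit(diffs, labels)
w = svm.coef_.ravel()                 # learned ranking direction

# Rank score of a novel image: higher means "more of" the attribute.
rank_scores = X @ w
print("pairwise ordering accuracy:",
      np.mean([rank_scores[i] > rank_scores[j] for i, j in pairs]))
```

A similarly hedged sketch of the zero-shot step: each seen category is modeled as a Gaussian in the space of attribute rank scores, and an unseen category's Gaussian is placed using relative statements such as "the unseen class lies between classes A and B on attribute m". The midpoint placement rule, shared covariance, and all names below are illustrative assumptions rather than the paper's exact estimator.

```python
# Minimal sketch (under assumed details) of zero-shot classification in the
# space of attribute rank scores.
import numpy as np
from scipy.stats import multivariate_normal

def fit_seen_models(rank_scores, labels):
    """rank_scores: (n_images, n_attributes) outputs of the learned rankers."""
    models = {}
    for c in np.unique(labels):
        pts = rank_scores[labels == c]
        models[c] = (pts.mean(axis=0), np.cov(pts, rowvar=False))
    return models

def unseen_model(models, between):
    """between[m] = (lo, hi): the unseen class lies between them on attribute m."""
    mean = np.array([(models[lo][0][m] + models[hi][0][m]) / 2.0
                     for m, (lo, hi) in enumerate(between)])
    cov = np.mean([c for _, c in models.values()], axis=0)  # shared covariance
    return mean, cov

def classify(x, all_models):
    # Assign x to the category whose Gaussian gives it the highest likelihood.
    return max(all_models,
               key=lambda c: multivariate_normal(*all_models[c],
                                                 allow_singular=True).logpdf(x))

# Toy usage: two attributes, two seen classes, one described unseen class.
rng = np.random.default_rng(1)
scores = np.vstack([rng.normal([0, 0], 0.3, (50, 2)),
                    rng.normal([2, 2], 0.3, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
models = fit_seen_models(scores, labels)
models["unseen"] = unseen_model(models, between=[(0, 1), (0, 1)])
print(classify(np.array([1.0, 1.0]), models))   # expected: "unseen"
```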