A quick glance at an image is sufficient for a human to point
out and describe an immense amount of detail about the visual
scene [8]. However, replicating this remarkable ability has proven
to be an elusive task for our visual recognition models. The
majority of previous work in visual recognition has focused
on labeling images with a fixed set of visual categories, and
great progress has been achieved in these endeavors [38, 6].
Yet while closed vocabularies of visual concepts constitute
a convenient modeling assumption, they are vastly
restrictive compared to the enormous variety of rich
descriptions that a human can compose.