Some pioneering approaches that address the challenge of
generating image descriptions have been developed [23, 7].
However, these models often rely on hard-coded visual concepts
and sentence templates, which limits the variety of the descriptions
they can generate. Moreover, the focus of these works has been on reducing
complex visual scenes to a single sentence, which
we consider an unnecessary restriction.