Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning
When we humans tell a long paragraph about an image, we usually first implicitly compose a mental "script" and then comply with it to generate the paragraph. Inspired by this, we endow the modern encoder-decoder based image paragraph captioning model with such an ability by proposing the Hierarchical Scene Graph Encoder-Decoder (HSGED) for generating coherent and distinctive paragraphs. In particular, we use the image scene graph as the "script" to incorporate rich semantic knowledge and, more importantly, hierarchical constraints into the model. Specifically, we design a sentence scene graph RNN (SSG-RNN) to generate sub-graph-level topics, which constrain a word scene graph RNN (WSG-RNN) to generate the corresponding sentences. We propose irredundant attention in SSG-RNN to improve the possibility of abstracting topics from rarely described sub-graphs, and inheriting attention in WSG-RNN to generate sentences more grounded in the abstracted topics; both give rise to more distinctive paragraphs. An efficient sentence-level loss is also proposed to encourage the sequence of generated sentences to be similar to that of the ground-truth paragraphs. We validate HSGED on the Stanford image paragraph dataset and show that it not only achieves a new state-of-the-art 36.02 CIDEr-D, but also generates more coherent and distinctive paragraphs under various metrics.
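The two-level decoding scheme above can be sketched in miniature: a sentence-level RNN emits one topic per sentence, and a word-level RNN, conditioned on that topic, decodes the sentence. This is a minimal toy sketch, not the paper's model; the random weights, sizes, and toy vocabulary are all illustrative assumptions, and the attention modules and sentence-level loss are omitted.

```python
import numpy as np

# Hypothetical sketch of hierarchical topic -> sentence decoding in the
# spirit of HSGED: a sentence-level RNN (standing in for SSG-RNN) rolls
# over a pooled scene-graph feature to produce topic vectors, and a
# word-level RNN (standing in for WSG-RNN) greedily decodes each
# sentence under its topic. Weights and vocabulary are toy assumptions.

rng = np.random.default_rng(0)
H, V = 8, 5                       # hidden size, toy vocabulary size
VOCAB = ["<eos>", "a", "dog", "runs", "park"]

Ws = rng.standard_normal((H, H)) * 0.1   # sentence-RNN recurrence
Ww = rng.standard_normal((H, H)) * 0.1   # word-RNN recurrence
Wt = rng.standard_normal((H, H)) * 0.1   # topic conditioning
Wo = rng.standard_normal((V, H)) * 0.1   # output projection to vocab

def step(W, h, x):
    # One vanilla RNN step: new hidden state from recurrence plus input.
    return np.tanh(W @ h + x)

def generate_paragraph(graph_feat, n_sentences=3, max_words=4):
    """Sentence RNN produces one topic state per sentence; each topic
    then conditions a fresh word RNN that greedily decodes a sentence."""
    paragraph = []
    hs = np.zeros(H)
    for _ in range(n_sentences):
        hs = step(Ws, hs, graph_feat)    # sub-graph-level topic state
        topic = Wt @ hs
        hw, sentence = np.zeros(H), []
        for _ in range(max_words):
            hw = step(Ww, hw, topic)     # word state constrained by topic
            word = VOCAB[int(np.argmax(Wo @ hw))]
            if word == "<eos>":
                break
            sentence.append(word)
        paragraph.append(" ".join(sentence))
    return paragraph

para = generate_paragraph(rng.standard_normal(H))
print(para)
```

The key design point the sketch illustrates is that the word decoder never sees the raw input directly, only the topic state, so each sentence is constrained to its sub-graph-level topic.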