Contextualise, Attend, Modulate and Tell: Visual Storytelling

Zainy M. Malakan, Nayyer Aafaq, Ghulam Mubashar Hassan, Ajmal Mian

Research output: Chapter in Book/Conference paper › Conference paper › peer-review

4 Citations (Scopus)

Abstract

Automatic natural language description of visual content is an emerging and fast-growing topic that has recently attracted extensive research attention. However, unlike typical ‘image captioning’ or ‘video captioning’, coherent story generation from a sequence of images is a relatively less studied problem. Story generation poses the challenges of diverse language style, context modeling, coherence, and latent concepts that are not even visible in the visual content. Contemporary methods fall short of modeling the context and visual variance, and generate stories that lack language coherence across multiple sentences. To this end, we propose a novel framework, Contextualise, Attend, Modulate and Tell (CAMT), that models the temporal relationship among the image sequence in both the forward and backward directions. The contextual information and the regional image features are projected into a joint space and subjected to an attention mechanism that captures the spatio-temporal relationships among the images. Before the attentive representations of the input images are fed into a language model, gated modulation between the attentive representation and the input word embeddings is performed to capture the interaction between the inputs and their context. To the best of our knowledge, this is the first method that exploits such a modulation technique for story generation. We evaluate our model on the Visual Storytelling Dataset (VIST) using both automatic and human evaluation measures, and demonstrate that our CAMT model outperforms existing baselines.
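The gated modulation the abstract describes — blending the attentive visual representation with the input word embedding before the language model — can be sketched generically. This is a minimal illustration, not the paper's implementation: the weight names `W_v` and `W_w`, the sigmoid gate, and the convex blend are assumptions, since the abstract does not specify the exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_modulation(visual, word, W_v, W_w, b):
    """Fuse an attentive visual vector with a word embedding via a
    learned sigmoid gate (generic sketch; not the paper's exact form)."""
    gate = sigmoid(W_v @ visual + W_w @ word + b)   # element-wise gate in (0, 1)
    return gate * visual + (1.0 - gate) * word      # convex blend of the two inputs

# Toy dimensions: both inputs assumed already projected to a shared 4-d space.
rng = np.random.default_rng(0)
d = 4
visual = rng.standard_normal(d)            # attentive visual representation
word = rng.standard_normal(d)              # input word embedding
W_v = rng.standard_normal((d, d)) * 0.1    # illustrative gate parameters
W_w = rng.standard_normal((d, d)) * 0.1
b = np.zeros(d)

fused = gated_modulation(visual, word, W_v, W_w, b)
print(fused.shape)  # (4,)
```

Because the gate lies in (0, 1) element-wise, each component of the fused vector is an interpolation between the corresponding visual and word components, letting the model decide per dimension how much visual context versus linguistic context to pass on.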
Original language: English
Title of host publication: Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
Editors: Giovanni Maria Farinella, Petia Radeva, Jose Braz, Kadi Bouatouch
Publisher: SciTePress
Pages: 196-205
Number of pages: 10
Volume: 5
ISBN (Electronic): 9789897584886
ISBN (Print): 978-989-758-488-6
DOIs
Publication status: Published - 10 Feb 2021
Event: 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Vienna, Austria
Duration: 8 Feb 2021 – 10 Feb 2021

Conference

Conference: 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
Abbreviated title: VISIGRAPP
Country/Territory: Austria
City: Vienna
Period: 8/02/21 – 10/02/21
