Sequential Vision to Language as Story: A Storytelling Dataset and Benchmarking

Zainy M. Malakan, Saeed Anwar, Ghulam Mubashar Hassan, Ajmal Mian

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)


Storytelling is a remarkable human skill that plays a significant role in learning and experiencing everyday life. Developing narratives is central to human mental development, simultaneously encapsulating broad details of psychology, morality, and common sense. Contemporary deep-learning algorithms require similar skills to tell a story from a visual perspective. However, most algorithms function at a superficial or factual level, aligning descriptive text with images in a one-to-one manner without considering temporal relations. Stories are more expressive in style, language, and content, involving imaginary concepts not explicit in the images. An ideal deep-learning system should learn to develop cohesive, meaningful, and causal stories. Unfortunately, most existing storytelling methods are trained and evaluated on a single dataset, i.e., the VIsual STorytelling (VIST) dataset. Multiple datasets are essential to test the generalization ability of algorithms. We bridge this gap and present a new dataset for expressive and coherent story creation: the Sequential Storytelling Image Dataset (SSID), consisting of open-source video frames accompanied by story-like annotations. We provide four annotations (stories) for each set of five images. The image sets are collected manually from publicly available videos in three domains: documentaries, lifestyle, and movies, and are then annotated manually using Amazon Mechanical Turk. We perform a detailed analysis and benchmarking of the existing VIST dataset and our new SSID dataset and show that both datasets exhibit high variance among the multiple ground-truth stories corresponding to the same image set. Moreover, our dataset achieves lower mean average scores across all metrics, meaning that the ground-truth stories of our dataset are more diverse. Finally, we train and evaluate existing state-of-the-art rheto...
Original language: English
Pages (from-to): 70805-70818
Number of pages: 14
Journal: IEEE Access
Early online date: 10 Jul 2023
Publication status: Published - 2023
