Vision to Language: Methods, Metrics and Datasets

Research output: Chapter in Book/Conference paper › Chapter › peer-review

4 Citations (Scopus)


Alan Turing's pioneering 1950s vision of machines capable of thinking like humans is still what Artificial Intelligence (AI) and Deep Learning research aspires to realize, 70 years on. With replicating or modeling human intelligence as the ultimate goal, AI's Holy Grail is to create systems that can perceive and reason about the world like humans and perform tasks such as visual interpretation and processing, speech recognition, decision-making and language understanding. In this quest, two of the dominant subfields of AI, Computer Vision and Natural Language Processing, attempt to create systems that can fully understand the visual world and achieve human-like language processing, respectively. The ability to interpret and describe visual content in natural language is one of the most distinctive human capabilities. While humans find this rather easy, it is very hard for a machine to mimic this complex process. The past decade has seen significant research effort on computational tasks that involve translation between, and fusion of, the two modalities of human communication, namely Vision and Language. Moreover, the unprecedented success of deep learning has further propelled research on tasks that link images to sentences, instead of just tags (as done in Object Recognition and Classification). This chapter discusses the fundamentals of generating natural language descriptions of images, as well as the prominent and state-of-the-art methods, their limitations, the various challenges in image captioning, and future directions to push this technology further toward practical real-world applications. It also serves as a reference to a comprehensive list of data resources for training deep captioning models and of the metrics currently in use for model evaluation.
Original language: English
Title of host publication: Machine Learning Paradigms
Subtitle of host publication: Advances in Deep Learning-based Technological Applications
Editors: George A. Tsihrintzis, Lakhmi C. Jain
Place of Publication: Cham
Publisher: Springer International Publishing
Number of pages: 54
ISBN (Electronic): 978-3-030-49724-8
ISBN (Print): 978-3-030-49723-1
Publication status: Published - 2020

Publication series

Name: Learning and Analytics in Intelligent Systems
ISSN (Print): 2662-3447
ISSN (Electronic): 2662-3455

