Visual Question Generation Under Multi-granularity Cross-Modal Interaction.

Zi Chai, Xiaojun Wan, Soyeon Caren Han, Josiah Poon

Research output: Chapter in Book/Conference paper › Conference paper › peer-review

1 Citation (Scopus)


Visual question generation (VQG) aims to automatically ask human-like questions from input images, targeting given answers. A key issue in VQG is performing effective cross-modal interaction, i.e., dynamically focusing on answer-related regions during question generation. In this paper, we propose a novel framework for VQG based on multi-granularity cross-modal interaction, containing both object-level and relation-level interaction. For object-level interaction, we leverage both semantic and visual features in a contrastive learning setting. We further illustrate the importance of high-level relations (e.g., spatial, semantic) between regions and answers for generating deeper questions. Since such information was somewhat ignored by prior VQG studies, we propose relation-level interaction based on graph neural networks. We perform experiments on the VQA2.0 and Visual7w datasets under automatic and human evaluation, and our model outperforms all baseline models.
Original language: English
Title of host publication: MultiMedia Modeling
Subtitle of host publication: 29th International Conference, MMM 2023, Bergen, Norway, January 9–12, 2023, Proceedings, Part I
Editors: Duc-Tien Dang-Nguyen, Cathal Gurrin, Alan F. Smeaton, Martha Larson, Stevan Rudinac, Minh-Son Dao, Christoph Trattner, Phoebe Chen
Place of Publication: Cham
Number of pages: 12
ISBN (Electronic): 978-3-031-27077-2
ISBN (Print): 978-3-031-27076-5
Publication status: Published - 29 Mar 2023
Externally published: Yes

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13833 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

