TY - GEN
T1 - REXUP: I REason, I EXtract, I UPdate with Structured Compositional
Reasoning for Visual Question Answering
AU - Luo, Siwen
AU - Han, Soyeon Caren
AU - Sun, Kaiyuan
AU - Poon, Josiah
PY - 2020
Y1 - 2020
N2 - Visual Question Answering (VQA) is a challenging multi-modal task that requires not only the semantic understanding of images and questions, but also the sound perception of a step-by-step reasoning process that would lead to the correct answer. So far, most successful attempts in VQA have focused on only one aspect: either the interaction between the visual pixel features of images and the word features of questions, or the reasoning process of answering questions about images with simple objects. In this paper, we propose a deep reasoning VQA model (REXUP: REason, EXtract, and UPdate) with explicit visual structure-aware textual information, which works well in capturing the step-by-step reasoning process and detecting complex object relationships in photo-realistic images. REXUP consists of two branches, image object-oriented and scene graph-oriented, which jointly work with super-diagonal fusion compositional attention networks. We evaluate REXUP on the benchmark GQA dataset and conduct extensive ablation studies to explore the reasons behind REXUP’s effectiveness. Our best model significantly outperforms the previous state-of-the-art, achieving 92.7% on the validation set and 73.1% on the test-dev set. Our code is available at: https://github.com/usydnlp/REXUP/.
KW - GQA
KW - Scene graph
KW - Visual Question Answering
UR - http://www.scopus.com/inward/record.url?scp=85097060623&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-63830-6_44
DO - 10.1007/978-3-030-63830-6_44
M3 - Conference paper
SN - 9783030638290
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 520
EP - 532
BT - Neural Information Processing - 27th International Conference, ICONIP 2020, Proceedings
A2 - Yang, Haiqin
A2 - Pasupa, Kitsuchart
A2 - Leung, Andrew Chi-Sing
A2 - Kwok, James T.
A2 - Chan, Jonathan H.
A2 - King, Irwin
PB - Springer
T2 - 27th International Conference on Neural Information Processing, ICONIP 2020
Y2 - 18 November 2020 through 22 November 2020
ER -