MMVQA: A Comprehensive Dataset for Investigating Multipage Multimodal Information Retrieval in PDF-based Visual Question Answering

Yihao Ding, Kaixuan Ren, Jiabin Huang, Siwen Luo, Soyeon Caren Han

Research output: Chapter in Book/Conference paperConference paperpeer-review

Abstract

Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD), particularly with lengthy textual content. Existing studies primarily focus on real-world documents with sparse text, while challenges persist in comprehending the hierarchical semantic relations among multiple pages to locate multimodal components. The paper introduces MMVQA, a dataset tailored for research journal articles, encompassing multiple pages and multimodal retrieval. Our approach aims to retrieve entire paragraphs containing answers or visually rich document entities like tables and figures. The main contribution is introducing a comprehensive PDF Document VQA dataset, allowing the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks to grasp textual contents and relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. We aim to enhance the capabilities of existing vision-and-language models in handling challenges posed by text-dominant documents in VRD-QA. Code and Appendix are in https://github.com/adlnlp/pdfmvqa.

Original languageEnglish
Title of host publicationProceedings of the 33rd International Joint Conference on Artificial Intelligence, IJCAI 2024
EditorsKate Larson
PublisherInternational Joint Conferences on Artificial Intelligence
Pages6243-6251
Number of pages9
ISBN (Electronic)9781956792041
Publication statusPublished - 2024
Event33rd International Joint Conference on Artificial Intelligence - Jeju, Korea, Republic of
Duration: 3 Aug 20249 Aug 2024

Publication series

NameIJCAI International Joint Conference on Artificial Intelligence
ISSN (Print)1045-0823

Conference

Conference33rd International Joint Conference on Artificial Intelligence
Abbreviated titleIJCAI 2024
Country/TerritoryKorea, Republic of
CityJeju
Period3/08/249/08/24

Fingerprint

Dive into the research topics of 'MMVQA: A Comprehensive Dataset for Investigating Multipage Multimodal Information Retrieval in PDF-based Visual Question Answering'. Together they form a unique fingerprint.

Cite this