Modularized Zero-shot VQA with Pre-trained Models

Rui Cao, Jing Jiang


Abstract
Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by a few observations. First, VQA questions often require multiple steps of reasoning, a capability that most PTMs still lack. Second, different steps in VQA reasoning chains require different skills, such as object detection and relational reasoning, but a single PTM may not possess all of these skills. Third, recent work on zero-shot VQA does not explicitly consider multi-step reasoning chains, which makes it less interpretable than a decomposition-based approach. We propose a modularized zero-shot network that explicitly decomposes questions into sub-reasoning steps and is highly interpretable. We convert sub-reasoning tasks to acceptable objectives of PTMs and assign tasks to proper PTMs without any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method and its better interpretability compared with several baselines.
Anthology ID:
2023.findings-acl.5
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
58–76
URL:
https://aclanthology.org/2023.findings-acl.5
DOI:
10.18653/v1/2023.findings-acl.5
Cite (ACL):
Rui Cao and Jing Jiang. 2023. Modularized Zero-shot VQA with Pre-trained Models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 58–76, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Modularized Zero-shot VQA with Pre-trained Models (Cao & Jiang, Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.5.pdf