Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Yichi Zhang; Jiayi Pan; Yuchen Zhou; Rui Pan; Joyce Chai

doi:10.18653/v1/2023.emnlp-main.348

Abstract

Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans, emulating our understanding of the world. However, human perception of reality is not always faithful to the physical world, a phenomenon known as visual illusions. This raises a key question: do VLMs experience the same kinds of illusions as humans do, or do they faithfully learn to represent reality? To investigate this question, we build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs. Our findings show that although the overall alignment is low, larger models are closer to human perception and more susceptible to visual illusions. Our dataset and initial findings will promote a better understanding of visual illusions in humans and machines and provide a stepping stone for future computational models that can better align humans and machines in perceiving and communicating about the shared visual world. The code and data are available at [github.com/vl-illusion/dataset](https://github.com/vl-illusion/dataset).

Anthology ID: 2023.emnlp-main.348
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 5718–5728
URL: https://aclanthology.org/2023.emnlp-main.348
DOI: 10.18653/v1/2023.emnlp-main.348
Cite (ACL): Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, and Joyce Chai. 2023. Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5718–5728, Singapore. Association for Computational Linguistics.
Cite (Informal): Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans? (Zhang et al., EMNLP 2023)
PDF: https://aclanthology.org/2023.emnlp-main.348.pdf
Video: https://aclanthology.org/2023.emnlp-main.348.mp4