Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration

Xiwen Liang; Fengda Zhu; Li Lingling; Hang Xu; Xiaodan Liang

doi:10.18653/v1/2022.acl-long.332

Abstract

Vision-language navigation (VLN) is a challenging task due to its large search space in the environment. To address this problem, previous works have proposed fine-tuning large models pretrained on large-scale datasets. However, conventional fine-tuning methods require extra human-labeled navigation data and lack self-exploration capabilities in environments, which hinders their generalization to unseen scenes. To improve the ability of fast cross-domain adaptation, we propose Prompt-based Environmental Self-exploration (ProbES), which self-explores environments by sampling trajectories and automatically generates structured instructions via a large-scale cross-modal pretrained model (CLIP). Our method fully utilizes the knowledge learned from CLIP to build an in-domain dataset by self-exploration without human labeling. Unlike the conventional approach of fine-tuning, we introduce prompt tuning to achieve fast adaptation for language embeddings, which substantially improves learning efficiency by leveraging prior knowledge. By automatically synthesizing trajectory-instruction pairs in any environment without human supervision and applying instruction prompt tuning, our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE. Both qualitative and quantitative results show that ProbES significantly improves the generalization ability of the navigation model.
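
The self-exploration idea in the abstract (sample a trajectory, then fill a structured instruction using phrases chosen by a cross-modal model) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy graph `GRAPH`, the hard-coded `SCORES` table standing in for CLIP image-text similarities, the template wording, and all function names are assumptions made for illustration only.

```python
import random

# Hypothetical toy environment: viewpoints connected in a navigation graph.
GRAPH = {
    "v0": ["v1", "v2"],
    "v1": ["v0", "v3"],
    "v2": ["v0", "v3"],
    "v3": ["v1", "v2"],
}

# Hypothetical scores standing in for CLIP-style image-text similarity
# between each viewpoint's image and a small vocabulary of phrases.
SCORES = {
    "v0": {"hallway": 0.7, "kitchen": 0.2, "bedroom": 0.1},
    "v1": {"hallway": 0.1, "kitchen": 0.8, "bedroom": 0.1},
    "v2": {"hallway": 0.3, "kitchen": 0.1, "bedroom": 0.6},
    "v3": {"hallway": 0.2, "kitchen": 0.1, "bedroom": 0.7},
}


def sample_trajectory(graph, start, length, rng):
    """Random walk of `length` viewpoints, avoiding immediate backtracking."""
    path = [start]
    while len(path) < length:
        current = path[-1]
        previous = path[-2] if len(path) > 1 else None
        candidates = [n for n in graph[current] if n != previous]
        if not candidates:  # dead end: allow stepping back
            candidates = graph[current]
        path.append(rng.choice(candidates))
    return path


def best_phrase(viewpoint, scores):
    """Pick the phrase with the highest (assumed) similarity score."""
    return max(scores[viewpoint], key=scores[viewpoint].get)


def fill_template(path, scores):
    """Fill a fixed, hypothetical instruction template with chosen phrases."""
    phrases = [best_phrase(v, scores) for v in path]
    middle = "".join(f"walk past the {p}, " for p in phrases[:-1])
    return f"{middle}stop at the {phrases[-1]}."


if __name__ == "__main__":
    rng = random.Random(0)
    trajectory = sample_trajectory(GRAPH, "v0", 3, rng)
    print(trajectory, "->", fill_template(trajectory, SCORES))
```

In a real setting the score table would presumably be replaced by similarities computed from a pretrained model, and the resulting trajectory-instruction pairs would form the in-domain pretraining data that the abstract describes.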

Anthology ID: 2022.acl-long.332
Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: May
Year: 2022
Address: Dublin, Ireland
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 4837–4851
URL: https://aclanthology.org/2022.acl-long.332
DOI: 10.18653/v1/2022.acl-long.332
Cite (ACL): Xiwen Liang, Fengda Zhu, Li Lingling, Hang Xu, and Xiaodan Liang. 2022. Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4837–4851, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration (Liang et al., ACL 2022)
PDF: https://aclanthology.org/2022.acl-long.332.pdf
Code: liangcici/probes-vln
Data: Conceptual Captions, Objects365, Places
