Contrastively Pretrained Vision-Language Transformers and Domain Adaptation Methods for Multimodal TOD Systems

Youngjae Chang, Doo Young Kim, Jinyoung Kim, Keunha Kim, Hyunmook Cha, Suyoung Min, Youngjoong Ko, Kye-Hwan Lee, Joonwoo Park


Abstract
The Situated Interactive MultiModal Conversations (SIMMC2.1) Challenge 2022 is hosted by the Eleventh Dialog System Technology Challenge (DSTC11). This is the third consecutive year multimodal dialog systems have been selected as an official track of the competition, driven by continued interest from the research community. The SIMMC task is to build a shopping assistant agent that can communicate with customers in a virtual store, which requires processing store scenes and product catalogs along with the customer's requests. The task is decomposed into four steps, each of which forms a subtask. In this work, we explore common approaches to modeling multimodality and identify the most promising one. We also identify a discrepancy in using pretrained language models for dialog tasks and devise a simple domain-adaptation method. Our model placed third in the object coreference, dialog state tracking, and response generation subtasks.
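As a minimal sketch of the contrastive vision-language matching the title refers to, the snippet below scores cropped scene objects against a customer utterance and picks the best match, the core of the object coreference subtask. It uses an off-the-shelf CLIP checkpoint from Hugging Face as a stand-in for the paper's pretrained backbone, and the crop file names are hypothetical; the exact model and preprocessing in the paper may differ.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Stand-in contrastively pretrained backbone (assumption: the paper's
# exact checkpoint is not specified here, so we use public CLIP).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

utterance = "I want the red jacket on the top shelf."
# Hypothetical cropped object images taken from the store scene.
object_crops = [Image.open(p) for p in ["obj_0.png", "obj_1.png"]]

inputs = processor(
    text=[utterance], images=object_crops, return_tensors="pt", padding=True
)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image has shape (num_objects, 1): the similarity of each
# object crop to the utterance. The argmax is the predicted referent.
scores = out.logits_per_image.squeeze(-1)
print(f"most likely referent: object {scores.argmax().item()}")
```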
Anthology ID:
2023.dstc-1.4
Volume:
Proceedings of The Eleventh Dialog System Technology Challenge
Month:
September
Year:
2023
Address:
Prague, Czech Republic
Editors:
Yun-Nung Chen, Paul Crook, Michel Galley, Sarik Ghazarian, Chulaka Gunasekara, Raghav Gupta, Behnam Hedayatnia, Satwik Kottur, Seungwhan Moon, Chen Zhang
Venues:
DSTC | WS
Publisher:
Association for Computational Linguistics
Pages:
25–30
URL:
https://aclanthology.org/2023.dstc-1.4
Cite (ACL):
Youngjae Chang, Doo Young Kim, Jinyoung Kim, Keunha Kim, Hyunmook Cha, Suyoung Min, Youngjoong Ko, Kye-Hwan Lee, and Joonwoo Park. 2023. Contrastively Pretrained Vision-Language Transformers and Domain Adaptation Methods for Multimodal TOD Systems. In Proceedings of The Eleventh Dialog System Technology Challenge, pages 25–30, Prague, Czech Republic. Association for Computational Linguistics.
Cite (Informal):
Contrastively Pretrained Vision-Language Transformers and Domain Adaptation Methods for Multimodal TOD Systems (Chang et al., DSTC-WS 2023)
PDF:
https://aclanthology.org/2023.dstc-1.4.pdf