Fine-tuning CLIP Text Encoders with Two-step Paraphrasing

Hyunjae Kim, Seunghyun Yoon, Trung Bui, Handong Zhao, Quan Tran, Franck Dernoncourt, Jaewoo Kang


Abstract
Contrastive language-image pre-training (CLIP) models have demonstrated considerable success across various vision-language tasks, such as text-to-image retrieval, where the model is required to effectively process natural language input to produce an accurate visual output. However, current models still face limitations in dealing with linguistic variations in input queries, such as paraphrases, making it challenging to handle a broad range of user queries in real-world applications. In this study, we introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases. Our approach involves a two-step paraphrase generation process, where we automatically create two categories of paraphrases from web-scale image captions by leveraging large language models. Subsequently, we fine-tune the CLIP text encoder using these generated paraphrases while freezing the image encoder. Our resulting model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks, including paraphrased retrieval (with rank similarity scores improved by up to 7.6% and 9.6%), Visual Genome Relation and Attribution, as well as seven semantic textual similarity tasks.
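As an illustration of what a rank similarity score for paraphrased retrieval can look like, the following minimal sketch compares the top-k images retrieved for a query and for its paraphrase using Jaccard overlap. The abstract does not define the metric, so top-k Jaccard overlap is an assumption here, and the function names (`top_k`, `jaccard_at_k`), image ids, and similarity scores are all hypothetical.

```python
def top_k(scores, k):
    """Return the ids of the k highest-scoring items (ties broken by id)."""
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:k]


def jaccard_at_k(scores_a, scores_b, k):
    """Jaccard overlap between the top-k sets of two score tables.

    Assumption: this is one common way to measure rank similarity,
    not necessarily the exact metric used in the paper.
    """
    a = set(top_k(scores_a, k))
    b = set(top_k(scores_b, k))
    union = a | b
    # Two empty result sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


# Hypothetical text-to-image similarity scores for a query and its paraphrase.
original = {"img1": 0.91, "img2": 0.85, "img3": 0.40, "img4": 0.10}
paraphrase = {"img1": 0.88, "img2": 0.30, "img3": 0.79, "img4": 0.05}

similarity = jaccard_at_k(original, paraphrase, k=2)
print(f"Jaccard@2 = {similarity:.3f}")
```

A score of 1.0 means the paraphrase retrieves the same top-k images as the original query. Lower values indicate that the ranking is sensitive to rewording, which is the behavior the fine-tuning approach aims to reduce.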
Anthology ID: 2024.findings-eacl.144
Volume: Findings of the Association for Computational Linguistics: EACL 2024
Month: March
Year: 2024
Address: St. Julian’s, Malta
Editors: Yvette Graham, Matthew Purver
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 2175–2184
URL: https://aclanthology.org/2024.findings-eacl.144
Cite (ACL): Hyunjae Kim, Seunghyun Yoon, Trung Bui, Handong Zhao, Quan Tran, Franck Dernoncourt, and Jaewoo Kang. 2024. Fine-tuning CLIP Text Encoders with Two-step Paraphrasing. In Findings of the Association for Computational Linguistics: EACL 2024, pages 2175–2184, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal): Fine-tuning CLIP Text Encoders with Two-step Paraphrasing (Kim et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-eacl.144.pdf