Nattapong Tiyajamorn


2021

Language-agnostic Representation from Multilingual Sentence Encoders for Cross-lingual Similarity Estimation
Nattapong Tiyajamorn | Tomoyuki Kajiwara | Yuki Arase | Makoto Onizuka
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We propose a method to distill a language-agnostic meaning embedding from a multilingual sentence encoder. By removing language-specific information from the original embedding, we retrieve an embedding that fully represents the sentence’s meaning. The proposed method relies only on parallel corpora without any human annotations. Our meaning embedding allows efficient cross-lingual sentence similarity estimation by simple cosine similarity calculation. Experimental results on both quality estimation of machine translation and cross-lingual semantic textual similarity tasks reveal that our method consistently outperforms the strong baselines using the original multilingual embedding. Our method consistently improves the performance of any pre-trained multilingual sentence encoder, even in low-resource language pairs where only tens of thousands of parallel sentence pairs are available.
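The abstract describes similarity estimation as a plain cosine similarity between meaning embeddings. The following is a minimal sketch of that final scoring step only, not the authors' implementation. The function name `cosine_similarity` and the toy vectors are assumptions made up for illustration. In practice the inputs would be the distilled meaning embeddings of a source sentence and a target sentence.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for the meaning embeddings of a
# source sentence and its translation (values invented for illustration).
src_meaning = [0.2, 0.7, 0.1]
tgt_meaning = [0.25, 0.65, 0.05]

score = cosine_similarity(src_meaning, tgt_meaning)
print(f"cross-lingual similarity: {score:.3f}")
```

Because the score needs only one dot product and two norms per sentence pair, it can be computed without running any additional model once the embeddings are available.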