Generative Spoken Language Model based on continuous word-sized audio tokens

Robin Algayres, Yossi Adi, Tu Nguyen, Jade Copet, Gabriel Synnaeve, Benoît Sagot, Emmanuel Dupoux


Abstract
In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input to spoken LMs consists of 20ms- or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio tokens that can generate diverse and expressive language output. This is obtained by replacing the lookup table for lexical types with a Lexical Embedding function, the cross-entropy loss with a contrastive loss, and multinomial sampling with k-NN sampling. The resulting model is the first generative language model based on word-size continuous tokens. Its performance is on par with discrete-unit GSLMs in generation quality, as measured by automatic metrics and subjective human judgements. Moreover, it is five times more memory-efficient thanks to its large 200ms units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.
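To make the abstract's three substitutions concrete, here is a minimal sketch (not the authors' implementation) of how a contrastive training loss and k-NN sampling can replace the usual cross-entropy-over-vocabulary and multinomial sampling when tokens are continuous embeddings rather than discrete indices. The function names, the `bank` of candidate token embeddings, and the temperature values are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pred, target, bank, temperature=0.1):
    """InfoNCE-style loss: the LM's predicted continuous embedding should be
    closer to the true next token's embedding than to other candidates.
    pred:   (B, D) predicted embeddings; target: (B,) true token indices;
    bank:   (N, D) embeddings of candidate tokens (illustrative)."""
    logits = F.normalize(pred, dim=-1) @ F.normalize(bank, dim=-1).T / temperature
    return F.cross_entropy(logits, target)

def knn_sample(pred, bank, k=10, temperature=1.0):
    """Replace multinomial sampling over a fixed vocabulary by sampling
    among the k nearest neighbours of the predicted embedding."""
    sims = F.normalize(pred, dim=-1) @ F.normalize(bank, dim=-1).T  # (B, N)
    topk = sims.topk(k, dim=-1)                                     # k nearest candidates
    probs = F.softmax(topk.values / temperature, dim=-1)            # local distribution
    choice = torch.multinomial(probs, 1)                            # sample within the k
    return topk.indices.gather(-1, choice).squeeze(-1)              # (B,) bank indices
```

The key design point the abstract hints at: because tokens are continuous vectors, there is no finite softmax vocabulary, so the model scores candidates by embedding similarity at both training time (contrastive loss) and generation time (k-NN sampling).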
Anthology ID:
2023.emnlp-main.182
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
3008–3028
URL:
https://aclanthology.org/2023.emnlp-main.182
DOI:
10.18653/v1/2023.emnlp-main.182
Bibkey:
Cite (ACL):
Robin Algayres, Yossi Adi, Tu Nguyen, Jade Copet, Gabriel Synnaeve, Benoît Sagot, and Emmanuel Dupoux. 2023. Generative Spoken Language Model based on continuous word-sized audio tokens. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3008–3028, Singapore. Association for Computational Linguistics.
Cite (Informal):
Generative Spoken Language Model based on continuous word-sized audio tokens (Algayres et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.182.pdf
Video:
https://aclanthology.org/2023.emnlp-main.182.mp4