Bootstrapping Small & High Performance Language Models with Unmasking-Removal Training Policy

Yahan Yang, Elior Sulem, Insup Lee, Dan Roth


Abstract
BabyBERTa, a language model trained on small-scale child-directed speech in which none of the masked words are left unmasked during training, has been shown to achieve a level of grammaticality comparable to that of RoBERTa-base, which is trained on 6,000 times more words and has 15 times more parameters. Building on this promising result, we explore in this paper the performance of BabyBERTa-based models on downstream tasks, focusing on Semantic Role Labeling (SRL) and two Extractive Question Answering tasks, with the aim of building more efficient systems that rely on less data and smaller models. We investigate the influence of these models both alone and as a starting point for larger pre-trained models, separately examining the contribution of the pre-training data, the vocabulary, and the masking policy to downstream task performance. Our results show that BabyBERTa trained with the unmasking-removal policy is a much stronger starting point for downstream tasks than a model trained with the RoBERTa masking policy when 10M words are used for training, and that this tendency persists, although to a lesser extent, when more training data is added.
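To make the contrast between the two masking policies concrete, here is a minimal Python sketch of masked-language-model corruption with and without the "unmasking" case. It assumes the standard RoBERTa-style 80/10/10 split (mask / random replacement / keep original); the function name, the 15% selection rate, and the exact split used in the unmasking-removal branch are illustrative assumptions, not the paper's or BabyBERTa's implementation.

```python
import random

MASK_TOKEN = "<mask>"

def corrupt_for_mlm(tokens, vocab, mask_prob=0.15, unmasking_removal=False, rng=random):
    """Toy masked-language-model corruption.

    Standard RoBERTa-style policy: of the selected positions, 80% become
    <mask>, 10% become a random vocabulary token, and 10% are left
    unchanged ("unmasked") but still predicted. With
    unmasking_removal=True, the leave-unchanged case is dropped, so every
    selected position is actually corrupted.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None = position not included in the MLM loss
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # selected positions contribute to the loss
        r = rng.random()
        if unmasking_removal:
            # Unmasking-removal policy: never leave the original token in place
            # (90/10 mask/random split here is an assumed choice for illustration).
            corrupted[i] = MASK_TOKEN if r < 0.9 else rng.choice(vocab)
        else:
            if r < 0.8:
                corrupted[i] = MASK_TOKEN
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token (the "unmasking" case)
    return corrupted, labels
```

For example, with `unmasking_removal=True` a selected word such as "dog" will always appear to the model as `<mask>` or a random token, never as "dog" itself, whereas the standard policy occasionally shows the model the true token it is asked to predict.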
Anthology ID:
2023.emnlp-main.30
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
457–464
URL:
https://aclanthology.org/2023.emnlp-main.30
DOI:
10.18653/v1/2023.emnlp-main.30
Cite (ACL):
Yahan Yang, Elior Sulem, Insup Lee, and Dan Roth. 2023. Bootstrapping Small & High Performance Language Models with Unmasking-Removal Training Policy. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 457–464, Singapore. Association for Computational Linguistics.
Cite (Informal):
Bootstrapping Small & High Performance Language Models with Unmasking-Removal Training Policy (Yang et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.30.pdf
Video:
https://aclanthology.org/2023.emnlp-main.30.mp4