Right the docs: Characterising voice dataset documentation practices used in machine learning

Kathy Reid, Elizabeth T. Williams


Abstract
Voice-enabled technologies such as virtual assistants are quickly becoming ubiquitous. Their functionality relies on machine learning (ML) models that perform tasks such as automatic speech recognition (ASR). These models, in general, currently perform less accurately for some cohorts of speakers, across axes such as age, gender and accent; they are biased. ML models are trained from large datasets. ML practitioners (MLPs) are interested in addressing bias across the ML lifecycle, and they often use dataset documentation to understand dataset characteristics. However, there is a lack of research centred on voice dataset documentation. Our work makes an empirical contribution towards filling this gap, identifying shortcomings in voice dataset documents (VDDs) and arguing for actions to improve them. First, we undertake 13 interviews with MLPs who work with voice data, exploring how they use VDDs. We focus here on MLP roles and the trade-offs made when working with VDDs. Drawing from the literature and from interview data, we create a rubric through which to analyse VDDs for nine voice datasets. Triangulating the two methods in our findings, we show that VDDs are inadequate for the needs of MLPs on several fronts. VDDs currently codify voice data characteristics in fragmented ways that make it difficult to compare and combine datasets, presenting a barrier to MLPs’ bias reduction efforts. We then seek to address these shortcomings and “right the docs” by proposing improvement actions aligned to our findings.
Anthology ID:
2023.alta-1.6
Volume:
Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association
Month:
November
Year:
2023
Address:
Melbourne, Australia
Editors:
Smaranda Muresan, Vivian Chen, Casey Kennington, David Vandyke, Nina Dethlefs, Koji Inoue, Erik Ekstedt, Stefan Ultes
Venue:
ALTA
Publisher:
Association for Computational Linguistics
Pages:
51–66
URL:
https://aclanthology.org/2023.alta-1.6
Cite (ACL):
Kathy Reid and Elizabeth T. Williams. 2023. Right the docs: Characterising voice dataset documentation practices used in machine learning. In Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association, pages 51–66, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Right the docs: Characterising voice dataset documentation practices used in machine learning (Reid & Williams, ALTA 2023)
PDF:
https://aclanthology.org/2023.alta-1.6.pdf