CS2W: A Chinese Spoken-to-Written Style Conversion Dataset with Multiple Conversion Types

Zishan Guo, Linhao Yu, Minghui Xu, Renren Jin, Deyi Xiong


Abstract
Spoken texts, whether transcribed manually or by automatic speech recognition (ASR), often contain disfluencies and grammatical errors, which pose tremendous challenges to downstream tasks. Converting spoken language into written language is hence desirable. Unfortunately, few datasets are available for this task. To address this issue, we present CS2W, a Chinese Spoken-to-Written style conversion dataset comprising 7,237 spoken sentences extracted from transcribed conversational texts. CS2W covers four types of conversion problems: disfluencies, grammatical errors, ASR transcription errors, and colloquial words. Our annotation convention, data, and code are publicly available at https://github.com/guozishan/CS2W.
Anthology ID:
2023.emnlp-main.241
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
3962–3979
URL:
https://aclanthology.org/2023.emnlp-main.241
DOI:
10.18653/v1/2023.emnlp-main.241
Cite (ACL):
Zishan Guo, Linhao Yu, Minghui Xu, Renren Jin, and Deyi Xiong. 2023. CS2W: A Chinese Spoken-to-Written Style Conversion Dataset with Multiple Conversion Types. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3962–3979, Singapore. Association for Computational Linguistics.
Cite (Informal):
CS2W: A Chinese Spoken-to-Written Style Conversion Dataset with Multiple Conversion Types (Guo et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.241.pdf