Translating Ancient Chinese to Modern Chinese at Scale: A Large Language Model-based Approach

Jiahuan Cao, Dezhi Peng, Yongxin Shi, Zongyuan Jiang, Lianwen Jin


Abstract
Recently, the emergence of large language models (LLMs) has provided powerful foundation models for a wide range of natural language processing (NLP) tasks. However, the vast majority of the pre-training corpora of most existing LLMs are in English, so their Chinese proficiency falls far behind their English proficiency. Furthermore, ancient Chinese has a much larger vocabulary and far less available corpus than modern Chinese, which significantly challenges the generalization capacity of existing LLMs. In this paper, we investigate Ancient-Chinese-to-Modern-Chinese (A2M) translation using LLMs, including LLaMA and Ziya. Specifically, to improve the understanding of Chinese texts, we explore vocabulary expansion and incremental pre-training methods based on existing pre-trained LLMs. Subsequently, a large-scale A2M translation dataset containing 4M pairs is used to fine-tune the LLMs. Experimental results demonstrate the effectiveness of the proposed method, especially with Ziya-13B, in translating ancient Chinese to modern Chinese. Moreover, we analyze in depth the performance of various LLMs under different strategies, which we believe can benefit further research on LLM-based A2M approaches.
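The paper's code is not included on this page, but the vocabulary-expansion step described in the abstract can be illustrated with a minimal Hugging Face transformers sketch. Everything below is an assumption for illustration only: the base checkpoint name, the hand-picked token list (in practice it would be mined from the ancient-Chinese corpus), and the instruction wording of the fine-tuning pair are placeholders, not taken from the paper.

from transformers import AutoTokenizer, AutoModelForCausalLM

base = "huggyllama/llama-7b"  # placeholder checkpoint, not the paper's exact model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Add ancient-Chinese tokens that the base vocabulary covers poorly
# (illustrative examples; a real list would be derived from corpus statistics).
ancient_tokens = ["曰", "矣", "乎", "哉"]
num_added = tokenizer.add_tokens(ancient_tokens)
if num_added > 0:
    # New rows in the embedding matrix are randomly initialized and then
    # learned during the incremental pre-training stage.
    model.resize_token_embeddings(len(tokenizer))

# One A2M fine-tuning pair in instruction format (prompt wording assumed).
example = {
    "instruction": "Translate the following ancient Chinese into modern Chinese.",
    "input": "学而时习之，不亦说乎？",
    "output": "学习之后按时温习，不也是很愉快的吗？",
}

Resizing the embedding matrix is what allows the incremental pre-training stage to learn representations for the newly added tokens instead of routing them through byte-level fallback.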
Anthology ID:
2023.alt-1.9
Volume:
Proceedings of ALT2023: Ancient Language Translation Workshop
Month:
September
Year:
2023
Address:
Macau SAR, China
Venue:
alt
Publisher:
Asia-Pacific Association for Machine Translation
Pages:
61–69
URL:
https://aclanthology.org/2023.alt-1.9
Cite (ACL):
Jiahuan Cao, Dezhi Peng, Yongxin Shi, Zongyuan Jiang, and Lianwen Jin. 2023. Translating Ancient Chinese to Modern Chinese at Scale: A Large Language Model-based Approach. In Proceedings of ALT2023: Ancient Language Translation Workshop, pages 61–69, Macau SAR, China. Asia-Pacific Association for Machine Translation.
Cite (Informal):
Translating Ancient Chinese to Modern Chinese at Scale: A Large Language Model-based Approach (Cao et al., alt 2023)
PDF:
https://aclanthology.org/2023.alt-1.9.pdf