基于网络词典的现代汉语词义消歧数据集构建(Construction of a Modern Chinese Word Sense Dataset Based on Online Dictionaries)

Fukang Yan (严福康), Yue Zhang (章岳), Zhenghua Li (李正华)


Abstract
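“Word sense disambiguation (WSD), one of the most classic tasks in natural language processing, aims to identify the correct sense of a polysemous word in a given context. Polysemy is more prevalent in Chinese than in English, yet few Chinese WSD datasets have been publicly released. This paper crawls and merges two public online dictionaries and selects 1,083 words and their associated senses as annotation targets. Relevant sentences for these words are then extracted from web data and specialized corpora. Finally, the sentences are manually annotated by multiple annotators, with expert review. The dataset contains nearly 20,000 sentences, about 20 sentences per word on average. The dataset is split into training, validation, and test sets, and a variety of models are compared experimentally.”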
Anthology ID:
2023.ccl-1.4
Volume:
Proceedings of the 22nd Chinese National Conference on Computational Linguistics
Month:
August
Year:
2023
Address:
Harbin, China
Editors:
Maosong Sun, Bing Qin, Xipeng Qiu, Jing Jiang, Xianpei Han
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
43–53
Language:
Chinese
URL:
https://aclanthology.org/2023.ccl-1.4
Cite (ACL):
Fukang Yan, Yue Zhang, and Zhenghua Li. 2023. 基于网络词典的现代汉语词义消歧数据集构建(Construction of a Modern Chinese Word Sense Dataset Based on Online Dictionaries). In Proceedings of the 22nd Chinese National Conference on Computational Linguistics, pages 43–53, Harbin, China. Chinese Information Processing Society of China.
Cite (Informal):
基于网络词典的现代汉语词义消歧数据集构建(Construction of a Modern Chinese Word Sense Dataset Based on Online Dictionaries) (Yan et al., CCL 2023)
PDF:
https://aclanthology.org/2023.ccl-1.4.pdf