Xiaobing Zhao

2023

In the era of large models, low-resource question-answering tasks lag behind, which underscores the importance of data augmentation, a key research avenue in natural language processing. The main challenges are leveraging the large model’s internal knowledge for data augmentation, determining which component of QA data (the question, passage, or answer) benefits most from augmentation, and keeping the augmented content consistent without introducing excessive noise. To tackle these challenges, we introduce PQQ, an approach to question data augmentation consisting of Prompt Answer, Question Generation, and Question Filter. Our experiments show that ChatGPT underperforms on the experimental data, whereas PQQ outperforms existing augmentation strategies. Its general applicability is further validated by tests on high-resource QA tasks such as SQuAD 1.1 and TriviaQA.

2022

Question Generation Based on Grammar Knowledge and Fine-grained Classification
Yuan Sun | Sisi Liu | Zhengcuo Dan | Xiaobing Zhao
Proceedings of the 29th International Conference on Computational Linguistics

Question generation is the task of automatically generating questions from a given context and answers, but the types of the generated questions often do not match the answers. In minority languages such as Tibetan, where grammar rules are complex and training data is scarce, research on question generation is still in its infancy. To address these problems, this paper constructs a question type classifier and a question generator. We divide question types at a fine-grained level and integrate grammatical knowledge into the question type classifier to improve the accuracy of question type prediction. The types predicted by the classifier are then fed into the question generator. Our model improves the accuracy of interrogative words in generated questions, reaching a BLEU-4 of 17.52 on SQuAD, 19.31 on HotpotQA, and 25.58 on TibetanQA.

机器音译研究综述(Survey on Machine Transliteration)
Zhuo Li (李卓) | Zhijuan Wang (王志娟) | Xiaobing Zhao (赵小兵)
Proceedings of the 21st Chinese National Conference on Computational Linguistics

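Machine transliteration is the process of automatically converting text from one language into another based on phonetic similarity. It is a subtask of machine translation that focuses on translating phonetic information. Transliteration reveals how a source word is pronounced in another language, which makes the source language easier to understand for people unfamiliar with it and helps remove language and spelling barriers. Machine transliteration plays an important role in natural language applications such as multilingual text processing, corpus alignment, and information extraction. This paper describes the current challenges in machine transliteration; analyzes, classifies, and organizes the main transliteration methods; catalogs and summarizes transliteration datasets; and lists commonly used metrics for evaluating transliteration quality. Finally, it describes the open problems in the field and looks ahead to the future of transliteration research. The paper aims to serve as a quick introductory guide for newcomers to the field and as a reference for other researchers.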

2021

基于枢轴语言系统融合的词汇混淆网络神经机器翻译(Neural Machine Translation for Vocabulary Confusion Network Based on Pivotal Language System Fusion)
Xiaobing Zhao (赵小兵) | Bo Jin (金波) | Yuan Sun (孙媛)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

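Neural machine translation for low-resource languages is difficult and tends to produce poor-quality translations. To address the absence of bilingual parallel corpora between low-resource languages and Chinese, this paper uses forward and backward pivot translation to generate parallel sentence pairs from three low-resource languages to Chinese. It then applies word-level system combination to fuse the target-language translations produced by a Transformer model and a dual learning model, and performs word selection through a confusion neural network to produce higher-quality target-language translations. Experiments show that the proposed multi-model fusion method outperforms the individual models on three low-resource translation tasks (Estonian-Chinese, Latvian-Chinese, and Romanian-Chinese), further improving the quality of neural machine translation for low-resource languages.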

JCapsR: 一种联合胶囊神经网络的藏语知识图谱表示学习模型(JCapsR: A Joint Capsule Neural Network for Tibetan Knowledge Graph Representation Learning)
Yuan Sun (孙媛) | Jiaya Liang (梁家亚) | Andong Chen (陈安东) | Xiaobing Zhao (赵小兵)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

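Knowledge graph representation learning is a key technology in natural language processing. Existing research on knowledge graph representation focuses mainly on languages such as English and Chinese, while work on low-resource languages such as Tibetan is still at an exploratory stage. Building on a previously constructed Tibetan knowledge graph, this paper proposes JCapsR, a Tibetan knowledge graph representation learning model based on a joint capsule neural network. First, we use the TransR model to generate representations of the structural information in the Tibetan knowledge graph. Second, we use a Transformer model that combines multi-head attention and relation attention to represent the textual descriptions of Tibetan entities. Finally, we use JCapsR to further extract the relations of triples in the semantic space of the knowledge graph and fuse the entity description information with the structural information to obtain the representation of the Tibetan knowledge graph. Experimental results show that, compared with baseline systems, JCapsR improves Tibetan knowledge graph representation learning, and the work offers a reference for extending and optimizing knowledge graph representation learning for other low-resource languages.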

面向机器阅读理解的高质量藏语数据集构建(Construction of High-quality Tibetan Dataset for Machine Reading Comprehension)
Yuan Sun (孙媛) | Sisi Liu (刘思思) | Chaofan Chen (陈超凡) | Zhengcuo Dan (旦正错) | Xiaobing Zhao (赵小兵)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

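Machine reading comprehension uses algorithms to have a machine answer questions based on a given context, thereby testing how well the machine understands natural language. Dataset construction is a central task in machine reading comprehension. Current models have achieved remarkable results on most popular English datasets, even surpassing human performance. For low-resource languages, however, machine reading comprehension research is still in its early stages because corresponding datasets are lacking. Taking Tibetan as an example, this paper manually constructs a Tibetan machine reading comprehension dataset, TibetanQA, which contains 20,000 question-answer pairs and 1,513 articles. All articles come from the 云藏网 (Yunzang) website and cover 12 domains, including nature, culture, and education; the questions are varied in form and of some difficulty. In addition, strict procedures were followed for article collection, question construction, answer verification, answer diversity, and reasoning ability to ensure data quality, and a validation method based on ablating linguistic features from the input is used to demonstrate the quality of the dataset. Finally, the paper conducts a preliminary study of three classic English reading comprehension models on TibetanQA. Their results fall short of human performance, which indicates that Tibetan machine reading comprehension needs further exploration.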

Ti-Reader: 基于注意力机制的藏文机器阅读理解端到端网络模型(Ti-Reader: An End-to-End Network Model Based on Attention Mechanisms for Tibetan Machine Reading Comprehension)
Yuan Sun (孙媛) | Chaofan Chen (陈超凡) | Sisi Liu (刘思思) | Xiaobing Zhao (赵小兵)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

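Machine reading comprehension aims to teach machines to understand a passage and answer questions about it. To address the poor performance of machine reading comprehension models on low-resource languages, this paper proposes Ti-Reader, an attention-based end-to-end network model for Tibetan machine reading comprehension. First, to encode finer-grained Tibetan text information, the model combines syllables and words for word representation. It then uses a word-level attention mechanism to focus on key words in the text, a re-read mechanism to capture semantic information between the passage and the question, and a self-attention mechanism that matches the hidden representations of the question and answer against themselves, providing more clues for answer prediction. Finally, experimental results show that Ti-Reader improves performance on Tibetan machine reading comprehension and also performs well on the English dataset SQuAD.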

2020

2019

2018

Tibetan-Chinese Neural Machine Translation based on Syllable Segmentation
Wen Lai | Xiaobing Zhao | Wei Bao
Proceedings of the AMTA 2018 Workshop on Technologies for MT of Low Resource Languages (LoResMT 2018)