Xiaohu Liu


2023

UseClean: learning from complex noisy labels in named entity recognition
Jinjin Tian | Kun Zhou | Meiguo Wang | Yu Zhang | Benjamin Yao | Xiaohu Liu | Chenlei Guo
Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)

We investigate and refine denoising methods for the NER task on data that potentially contains extremely noisy labels from multiple sources. In this paper, we first summarize all possible noise types and noise generation schemes, based on which we build a thorough evaluation system. We then pinpoint the bottleneck of current state-of-the-art denoising methods using our evaluation system. Correspondingly, we propose several refinements: a two-stage framework to avoid error accumulation; a novel confidence score that uses minimal clean supervision to increase predictive power; automatic cutoff fitting to avoid extensive hyper-parameter tuning; and a warm-started weighted partial CRF to better learn on the noisy tokens. Additionally, we propose adaptive sampling to further boost performance in long-tailed entity settings. Across extensive experiments, our method improves F1 score by at least 5 to 10% on average over the current state of the art.

KEPLET: Knowledge-Enhanced Pretrained Language Model with Topic Entity Awareness
Yichuan Li | Jialong Han | Kyumin Lee | Chengyuan Ma | Benjamin Yao | Xiaohu Liu
Findings of the Association for Computational Linguistics: EMNLP 2023

In recent years, Pre-trained Language Models (PLMs) have shown their superiority by pre-training on unstructured text corpora and then fine-tuning on downstream tasks. On entity-rich textual resources like Wikipedia, Knowledge-Enhanced PLMs (KEPLMs) incorporate the interactions between tokens and mentioned entities in pre-training, and are thus more effective on entity-centric tasks such as entity linking and relation classification. Although they exploit Wikipedia’s rich structures to some extent, conventional KEPLMs still neglect a unique layout of the corpus: each Wikipedia page is centered on a topic entity (identified by the page URL and shown in the page title). In this paper, we demonstrate that KEPLMs that do not incorporate topic entities suffer from insufficient entity interaction and biased (relation) word semantics. We thus propose KEPLET, a novel Knowledge-Enhanced Pre-trained LanguagE model with Topic entity awareness. In an end-to-end manner, KEPLET identifies where to add the topic entity’s information in a Wikipedia sentence, fuses such information into the token and mentioned-entity representations, and supervises the network learning, thereby taking topic entities back into consideration. Experiments demonstrate the generality and superiority of KEPLET, which was applied to two representative KEPLMs and achieved significant improvements on four entity-centric tasks.

CL-QR: Cross-Lingual Enhanced Query Reformulation for Multi-lingual Conversational AI Agents
Zhongkai Sun | Zhengyang Zhao | Sixing Lu | Chengyuan Ma | Xiaohu Liu | Xing Fan | Wei Shen | Chenlei Guo
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

The growing popularity of conversational AI agents such as Alexa, Google Assistant, and Siri relies on accurate spoken language comprehension. The query reformulation (QR) method, which reformulates defective user queries, has been broadly adopted to mitigate the challenges of understanding a user’s intent from imperfect speech recognition results. However, due to the scarcity of non-English QR labels, providing high-quality QR for non-English users remains a challenge. This work proposes a novel cross-lingual QR framework, CL-QR, that leverages the abundant reformulation resources in English to improve non-English QR performance. It also proposes a Module-wise Mutually-supervised Feedback learning (MMF) algorithm that enables CL-QR to continually self-improve, which alleviates the lack of cross-lingual QR training data and enhances the delivery of high-quality reformulations learned in English for multilingual queries. Both offline evaluation and online A/B testing demonstrate the effectiveness of the proposed method.

PersonaPKT: Building Personalized Dialogue Agents via Parameter-efficient Knowledge Transfer
Xu Han | Bin Guo | Yoon Jung | Benjamin Yao | Yu Zhang | Xiaohu Liu | Chenlei Guo
Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)

2022

Joint Goal Segmentation and Goal Success Prediction on Multi-Domain Conversations
Meiguo Wang | Benjamin Yao | Bin Guo | Xiaohu Liu | Yu Zhang | Tuan-Hung Pham | Chenlei Guo
Proceedings of the 29th International Conference on Computational Linguistics

To evaluate the performance of a multi-domain goal-oriented Dialogue System (DS), it is important to understand what the users’ goals are for the conversations and whether those goals are successfully achieved. The success rate of goals directly correlates with user satisfaction and perceived usefulness of the DS. In this paper, we propose a novel automatic dialogue evaluation framework that jointly performs two tasks: goal segmentation and goal success prediction. We extend the RoBERTa-IQ model (Gupta et al., 2021) by adding multi-task learning heads for goal segmentation and success prediction. Using an annotated dataset from a commercial DS, we demonstrate that our proposed model reaches an accuracy that is on par with single-pass human annotation when compared against a three-pass gold annotation benchmark.

Overcoming Catastrophic Forgetting During Domain Adaptation of Seq2seq Language Generation
Dingcheng Li | Zheng Chen | Eunah Cho | Jie Hao | Xiaohu Liu | Fan Xing | Chenlei Guo | Yang Liu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Seq2seq language generation models that are trained offline with multiple domains in a sequential fashion often suffer from catastrophic forgetting. Lifelong learning has been proposed to handle this problem. However, existing approaches such as experience replay or elastic weight consolidation require incremental memory space. In this work, we propose an innovative framework, RMR_DSE, that leverages a recall optimization mechanism to selectively memorize important parameters of previous tasks via regularization, and uses a domain drift estimation algorithm to compensate for the drift between different domains in the embedding space. These designs enable the model to be trained on the current task while keeping the memory of previous tasks, and avoid much additional data storage. Furthermore, RMR_DSE can be combined with existing lifelong learning approaches. Our experiments on two seq2seq language generation tasks, paraphrase and dialog response generation, show that RMR_DSE outperforms SOTA models by a considerable margin and greatly reduces forgetting.

2016

Task Completion Platform: A self-serve multi-domain goal oriented dialogue platform
Paul Crook | Alex Marin | Vipul Agarwal | Khushboo Aggarwal | Tasos Anastasakos | Ravi Bikkula | Daniel Boies | Asli Celikyilmaz | Senthilkumar Chandramohan | Zhaleh Feizollahi | Roman Holenstein | Minwoo Jeong | Omar Khan | Young-Bum Kim | Elizabeth Krawczyk | Xiaohu Liu | Danko Panic | Vasiliy Radostev | Nikhil Ramesh | Jean-Phillipe Robichaud | Alexandre Rochette | Logan Stromberg | Ruhi Sarikaya
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

2015

Compact Lexicon Selection with Spectral Methods
Young-Bum Kim | Karl Stratos | Xiaohu Liu | Ruhi Sarikaya
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

1999

Mixed Language Query Disambiguation
Pascale Fung | Xiaohu Liu | Chi Shun Cheung
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics