Hao Xiang


2023

WebDP: Understanding Discourse Structures in Semi-Structured Web Documents
Peilin Liu | Hongyu Lin | Meng Liao | Hao Xiang | Xianpei Han | Le Sun
Findings of the Association for Computational Linguistics: ACL 2023

Web documents have become rich data resources in the current era, and understanding their discourse structure could benefit various downstream document processing applications. Unfortunately, current discourse analysis and document intelligence research mostly focuses on either the discourse structure of plain text or the superficial visual structure of documents, neither of which can accurately describe the discourse structure of highly free-styled, semi-structured web documents. To promote discourse studies on web documents, in this paper we introduce a benchmark, WebDP, oriented toward a new task named Web Document Discourse Parsing. Specifically, we propose a discourse structure representation schema for web documents that extends classical discourse theories and adds special features to capture the discourse characteristics of web documents. We then develop a manually annotated web document dataset, WEBDOCS, to facilitate the study of this parsing task. We compare current neural models on WEBDOCS, and the experimental results show that WebDP is feasible but also challenging for current models.