WebDP: Understanding Discourse Structures in Semi-Structured Web Documents

Peilin Liu, Hongyu Lin, Meng Liao, Hao Xiang, Xianpei Han, Le Sun


Abstract
Web documents have become rich data resources in the current era, and understanding their discourse structure could potentially benefit various downstream document processing applications. Unfortunately, current discourse analysis and document intelligence research mostly focuses on either the discourse structure of plain text or superficial visual structures in documents, neither of which can accurately describe the discourse structure of highly free-styled, semi-structured web documents. To promote discourse studies on web documents, in this paper we introduce a benchmark, WebDP, oriented toward a new task named Web Document Discourse Parsing. Specifically, we propose a web document discourse structure representation schema that extends classical discourse theories and adds special features to represent the discourse characteristics of web documents. We then develop a manually annotated web document dataset, WEBDOCS, to facilitate the study of this parsing task. We compare current neural models on WEBDOCS, and the experimental results show that WebDP is feasible but also challenging for current models.
Anthology ID:
2023.findings-acl.650
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10235–10258
URL:
https://aclanthology.org/2023.findings-acl.650
DOI:
10.18653/v1/2023.findings-acl.650
Cite (ACL):
Peilin Liu, Hongyu Lin, Meng Liao, Hao Xiang, Xianpei Han, and Le Sun. 2023. WebDP: Understanding Discourse Structures in Semi-Structured Web Documents. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10235–10258, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
WebDP: Understanding Discourse Structures in Semi-Structured Web Documents (Liu et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.650.pdf