SciPara: A New Dataset for Investigating Paragraph Discourse Structure in Scientific Papers

Anna Kiepura, Yingqiang Gao, Jessica Lam, Nianlong Gu, Richard H.R. Hahnloser


Abstract
Good scientific writing makes use of specific sentence and paragraph structures, providing a rich platform for discourse analysis and for developing tools to enhance text readability. In this vein, we introduce SciPara, a novel dataset consisting of 981 scientific paragraphs annotated by experts in terms of sentence discourse types and topic information. On this dataset, we explored two tasks: 1) discourse category classification, which is to predict the discourse category of a sentence by using its paragraph and surrounding paragraphs as context, and 2) discourse sentence generation, which is to generate a sentence of a certain discourse category by using various contexts as input. We found that Pre-trained Language Models (PLMs) can accurately identify Topic Sentences in SciPara, but have difficulty distinguishing Concluding, Transition, and Supporting Sentences. The quality of the sentences generated by all investigated PLMs improved with the amount of context, regardless of discourse category. However, not all contexts were equally influential. Contrary to common assumptions about well-crafted scientific paragraphs, our analysis revealed that paragraphs with complete discourse structures were less readable.
Anthology ID:
2024.codi-1.2
Volume:
Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024)
Month:
March
Year:
2024
Address:
St. Julians, Malta
Editors:
Michael Strube, Chloe Braud, Christian Hardmeier, Junyi Jessy Li, Sharid Loaiciga, Amir Zeldes, Chuyuan Li
Venues:
CODI | WS
Publisher:
Association for Computational Linguistics
Pages:
12–26
URL:
https://aclanthology.org/2024.codi-1.2
Cite (ACL):
Anna Kiepura, Yingqiang Gao, Jessica Lam, Nianlong Gu, and Richard H.R. Hahnloser. 2024. SciPara: A New Dataset for Investigating Paragraph Discourse Structure in Scientific Papers. In Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024), pages 12–26, St. Julians, Malta. Association for Computational Linguistics.
Cite (Informal):
SciPara: A New Dataset for Investigating Paragraph Discourse Structure in Scientific Papers (Kiepura et al., CODI-WS 2024)
PDF:
https://aclanthology.org/2024.codi-1.2.pdf