Assessing Authenticity and Anonymity of Synthetic User-generated Content in the Medical Domain

Tomohiro Nishiyama, Lisa Raithel, Roland Roller, Pierre Zweigenbaum, Eiji Aramaki


Abstract
Since medical text cannot be shared easily due to privacy concerns, synthetic data bears much potential for natural language processing applications. In the context of social media and user-generated messages about drug intake and adverse drug effects, this work presents different methods to examine the authenticity of synthetic text. We conclude that the generated tweets are untraceable and show enough authenticity from the medical point of view to be used as a replacement for a real Twitter corpus. However, original data might still be the preferred choice as they contain much more diversity.
Anthology ID:
2024.caldpseudo-1.2
Volume:
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Xuan-Son Vu
Venues:
CALD-pseudo | WS
Publisher:
Association for Computational Linguistics
Pages:
8–17
URL:
https://aclanthology.org/2024.caldpseudo-1.2
Cite (ACL):
Tomohiro Nishiyama, Lisa Raithel, Roland Roller, Pierre Zweigenbaum, and Eiji Aramaki. 2024. Assessing Authenticity and Anonymity of Synthetic User-generated Content in the Medical Domain. In Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), pages 8–17, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Assessing Authenticity and Anonymity of Synthetic User-generated Content in the Medical Domain (Nishiyama et al., CALD-pseudo-WS 2024)
PDF:
https://aclanthology.org/2024.caldpseudo-1.2.pdf