Extending Off-the-shelf NER Systems to Personal Information Detection in Dialogues with a Virtual Agent: Findings from a Real-Life Use Case

Mario Mina, Carlos Rodríguez, Aitor Gonzalez-Agirre, Marta Villegas


Abstract
We present the findings and results of our pseudonymisation system, which has been developed for a real-life use-case involving users and an informative chatbot in the context of the COVID-19 pandemic. Message exchanges between the two involve the former group providing information about themselves and their residential area, which could easily allow for their re-identification. We create a modular pipeline to detect PIIs and perform basic deidentification such that the data can be stored while mitigating any privacy concerns. The use-case presents several challenging aspects, the most difficult of which is the logistic challenge of not being able to directly view or access the data due to the very privacy issues we aim to resolve. Nevertheless, our system achieves a high recall of 0.99, correctly identifying almost all instances of personal data. However, this comes at the expense of precision, which only reaches 0.64. We describe the sensitive information identification in detail, explaining the design principles behind our decisions. We additionally highlight the particular challenges we’ve encountered.
Anthology ID:
2024.caldpseudo-1.6
Volume:
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Xuan-Son Vu
Venues:
CALD-pseudo | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
44–53
Language:
URL:
https://aclanthology.org/2024.caldpseudo-1.6
DOI:
Bibkey:
Cite (ACL):
Mario Mina, Carlos Rodríguez, Aitor Gonzalez-Agirre, and Marta Villegas. 2024. Extending Off-the-shelf NER Systems to Personal Information Detection in Dialogues with a Virtual Agent: Findings from a Real-Life Use Case. In Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), pages 44–53, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Extending Off-the-shelf NER Systems to Personal Information Detection in Dialogues with a Virtual Agent: Findings from a Real-Life Use Case (Mina et al., CALD-pseudo-WS 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.caldpseudo-1.6.pdf