Detecting Personal Identifiable Information in Swedish Learner Essays

Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, Therese Lindström Tiedemann, Elena Volodina


Abstract
Linguistic data can, and often does, contain PII (Personal Identifiable Information). From both a legal and an ethical standpoint, sharing such data is not permissible. According to the GDPR, pseudonymization, i.e. the replacement of sensitive information with surrogates, is an acceptable strategy for privacy preservation. While research has been conducted on the detection and replacement of sensitive data in Swedish medical data using Large Language Models (LLMs), it is unclear whether these models handle PII equally well in less structured and more thematically varied texts. In this paper, we present and discuss the performance of an LLM-based PII-detection system for Swedish learner essays.
Anthology ID:
2024.caldpseudo-1.7
Volume:
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Xuan-Son Vu
Venues:
CALD-pseudo | WS
Publisher:
Association for Computational Linguistics
Pages:
54–63
URL:
https://aclanthology.org/2024.caldpseudo-1.7
Cite (ACL):
Maria Irena Szawerna, Simon Dobnik, Ricardo Muñoz Sánchez, Therese Lindström Tiedemann, and Elena Volodina. 2024. Detecting Personal Identifiable Information in Swedish Learner Essays. In Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), pages 54–63, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Detecting Personal Identifiable Information in Swedish Learner Essays (Szawerna et al., CALD-pseudo-WS 2024)
PDF:
https://aclanthology.org/2024.caldpseudo-1.7.pdf