Wojciech Szmyd


2023

TrelBERT: A pre-trained encoder for Polish Twitter
Wojciech Szmyd | Alicja Kotyla | Michał Zobniów | Piotr Falkiewicz | Jakub Bartczuk | Artur Zygadło
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)

Pre-trained Transformer-based models have become immensely popular amongst NLP practitioners. We present TrelBERT – the first Polish language model suited for application in the social media domain. TrelBERT is based on an existing general-domain model and adapted to the language of social media by pre-training it further on a large collection of Twitter data. We demonstrate its usefulness by evaluating it in the downstream task of cyberbullying detection, in which it achieves state-of-the-art results, outperforming larger monolingual models trained on general-domain corpora, as well as multilingual in-domain models, by a large margin. We make the model publicly available. We also release a new dataset for the problem of harmful speech detection.