Eva Pettersson


2024

pdf bib
People and Places of the Past - Named Entity Recognition in Swedish Labour Movement Documents from Historical Sources
Crina Tudor | Eva Pettersson
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)

Named Entity Recognition (NER) is an important step in many Natural Language Processing tasks. The existing state-of-the-art NER systems are however typically developed based on contemporary data, and not very well suited for analyzing historical text. In this paper, we present a comparative analysis of the performance of several language models when applied to Named Entity Recognition for historical Swedish text. The source texts we work with are documents from Swedish labour unions from the 19th and 20th century. We experiment with three off-the-shelf models for contemporary Swedish text, and one language model built on historical Swedish text that we fine-tune with labelled data for adaptation to the NER task. Lastly, we propose a hybrid approach by combining the results of two models in order to maximize usability. We show that, even though historical Swedish is a low-resource language with data sparsity issues affecting overall performance, historical language models still show very promising results. Further contributions of our paper are the release of our newly trained model for NER of historical Swedish text, along with a manually annotated corpus of over 650 named entities.

2023

pdf bib
Low-Resource Techniques for Analysing the Rhetorical Structure of Swedish Historical Petitions
Ellinor Lindqvist | Eva Pettersson | Joakim Nivre
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

Natural language processing techniques can be valuable for improving and facilitating historical research. This is also true for the analysis of petitions, a source which has been relatively little used in historical research. However, limited data resources pose challenges for mainstream natural language processing approaches based on machine learning. In this paper, we explore methods for automatically segmenting petitions according to their rhetorical structure. We find that the use of rules, word embeddings, and especially keywords can give promising results for this task.

2022

pdf bib
Identifying Cleartext in Historical Ciphers
Maria-Elena Gambardella | Beata Megyesi | Eva Pettersson
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

In historical encrypted sources we can find encrypted text sequences, also called ciphertext, as well as non-encrypted cleartexts written in a known language. While most of the cryptanalysis focuses on the decryption of ciphertext, cleartext is often overlooked although it can give us important clues about the historical interpretation and contextualisation of the manuscript. In this paper, we investigate to what extent we can automatically distinguish cleartext from ciphertext in historical ciphers and to what extent we are able to identify its language. The problem is challenging as cleartext sequences in ciphers are often short, up to a few words, in different languages due to historical code-switching. To identify the sequences and the language(s), we chose a rule-based approach and run 7 different models using historical language models on various ciphertexts.

pdf bib
To the Most Gracious Highness, from Your Humble Servant: Analysing Swedish 18th Century Petitions Using Text Classification
Ellinor Lindqvist | Eva Pettersson | Joakim Nivre
Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Petitions are a rich historical source, yet they have been relatively little used in historical research. In this paper, we aim to analyse Swedish texts from around the 18th century, and petitions in particular, using automatic means of text classification. We also test how text pre-processing and different feature representations affect the result, and we examine feature importance for our main class of interest - petitions. Our experiments show that the statistical algorithms NB, RF, SVM, and kNN are indeed very able to classify different genres of historical text. Further, we find that normalisation has a positive impact on classification, and that content words are particularly informative for the traditional models. A fine-tuned BERT model, fed with normalised data, outperforms all other classification experiments with a macro average F1 score at 98.8. However, using less computationally expensive methods, including feature representation with word2vec, fastText embeddings or even TF-IDF values, with a SVM classifier also show good results for both unnormalise and normalised data. In the feature importance analysis, where we obtain the features most decisive for the classification models, we find highly relevant characteristics of the petitions, namely words expressing signs of someone inferior addressing someone superior.

2020

pdf bib
Czech Historical Named Entity Corpus v 1.0
Helena Hubková | Pavel Kral | Eva Pettersson
Proceedings of the Twelfth Language Resources and Evaluation Conference

As the number of digitized archival documents increases very rapidly, named entity recognition (NER) in historical documents has become very important for information extraction and data mining. For this task an annotated corpus is needed, which has up to now been missing for Czech. In this paper we present a new annotated data collection for historical NER, composed of Czech historical newspapers. This corpus is freely available for research purposes. For this corpus, we have defined relevant domain-specific named entity types and created an annotation manual for corpus labelling. We further conducted some experiments on this corpus using recurrent neural networks. We experimented with randomly initialized embeddings and static and dynamic fastText word embeddings. We achieved 0.73 F1 score with a bidirectional LSTM model using static fastText embeddings.

2019

pdf bib
Matching Keys and Encrypted Manuscripts
Eva Pettersson | Beata Megyesi
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Historical cryptology is the study of historical encrypted messages aiming at their decryption by analyzing the mathematical, linguistic and other coding patterns and their historical context. In libraries and archives we can find quite a lot of ciphers, as well as keys describing the method used to transform the plaintext message into a ciphertext. In this paper, we present work on automatically mapping keys to ciphers to reconstruct the original plaintext message, and use language models generated from historical texts to guess the underlying plaintext language.

2018

pdf bib
An Evaluation of Neural Machine Translation Models on Historical Spelling Normalization
Gongbo Tang | Fabienne Cap | Eva Pettersson | Joakim Nivre
Proceedings of the 27th International Conference on Computational Linguistics

In this paper, we apply different NMT models to the problem of historical spelling normalization for five languages: English, German, Hungarian, Icelandic, and Swedish. The NMT models are at different levels, have different attention mechanisms, and different neural network architectures. Our results show that NMT models are much better than SMT models in terms of character error rate. The vanilla RNNs are competitive to GRUs/LSTMs in historical spelling normalization. Transformer models perform better only when provided with more training data. We also find that subword-level models with a small subword vocabulary are better than character-level models. In addition, we propose a hybrid method which further improves the performance of historical spelling normalization.

2017

pdf bib
Annotating errors in student texts: First experiences and experiments
Sara Stymne | Eva Pettersson | Beáta Megyesi | Anne Palmér
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition

pdf bib
Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts
Gerold Schneider | Eva Pettersson | Michael Percillier
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

2015

pdf bib
Improving Verb Phrase Extraction from Historical Text by use of Verb Valency Frames
Eva Pettersson | Joakim Nivre
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

pdf bib
Ranking Relevant Verb Phrases Extracted from Historical Text
Eva Pettersson | Beáta Megyesi | Joakim Nivre
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

2014

pdf bib
A Multilingual Evaluation of Three Spelling Normalisation Methods for Historical Text
Eva Pettersson | Beáta Megyesi | Joakim Nivre
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

2013

pdf bib
Normalisation of Historical Text Using Context-Sensitive Weighted Levenshtein Distance and Compound Splitting
Eva Pettersson | Beáta Megyesi | Joakim Nivre
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2012

pdf bib
Parsing the Past - Identification of Verb Constructions in Historical Text
Eva Pettersson | Beáta Megyesi | Joakim Nivre
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

2011

pdf bib
Automatic Verb Extraction from Historical Swedish Texts
Eva Pettersson | Joakim Nivre
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

2008

pdf bib
Swedish-Turkish Parallel Treebank
Beáta Megyesi | Bengt Dahlqvist | Eva Pettersson | Joakim Nivre
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we describe our work on building a parallel treebank for a less studied and typologically dissimilar language pair, namely Swedish and Turkish. The treebank is a balanced syntactically annotated corpus containing both fiction and technical documents. In total, it consists of approximately 160,000 tokens in Swedish and 145,000 in Turkish. The texts are linguistically annotated using different layers from part of speech tags and morphological features to dependency annotation. Each layer is automatically processed by using basic language resources for the involved languages. The sentences and words are aligned, and partly manually corrected. We create the treebank by reusing and adjusting existing tools for the automatic annotation, alignment, and their correction and visualization. The treebank was developed within the project supporting research environment for minor languages aiming at to create representative language resources for language pairs dissimilar in language structure. Therefore, efforts are put on developing a general method for formatting and annotation procedure, as well as using tools that can be applied to other language pairs easily.

2004

pdf bib
MT Goes Farming: Comparing Two Machine Translation Approaches on a New Domain
Per Weijnitz | Eva Forsbom | Ebba Gustavii | Eva Pettersson | Jörg Tiedemann
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)