Karin Becker


2023

BB25HLegalSum: Leveraging BM25 and BERT-Based Clustering for the Summarization of Legal Documents
Leonardo de Andrade | Karin Becker
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Legal document summarization aims to provide a clear understanding of the main points and arguments in a legal document, contributing to the efficiency of the judicial system. In this paper, we propose BB25HLegalSum, a method that combines BERT clusters with the BM25 algorithm to summarize legal documents and present them to users with highlighted important information. The process involves selecting unique, relevant sentences from the original document, clustering them to find sentences about a similar subject, combining them to generate a summary according to three strategies, and highlighting them to the user in the original document. We outperformed baseline techniques using the BillSum dataset, a widely used benchmark in legal document summarization. Legal workers positively assessed the highlighted presentation.
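The relevance-selection step described in this abstract can be illustrated with a minimal BM25 sketch. This is not the authors' implementation: the query, the whitespace tokenizer, the parameter values k1 and b, and the function names are assumptions made for illustration, and the BERT-based clustering and highlighting stages are omitted.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized sentence against the query with Okapi BM25."""
    if not docs_tokens:
        return []
    n = len(docs_tokens)
    # Average sentence length; fall back to 1.0 if every sentence is empty.
    avgdl = sum(len(d) for d in docs_tokens) / n or 1.0
    # Document frequency: in how many sentences each term appears.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k_sentences(query, sentences, k=2):
    """Return the k best-scoring sentences, kept in document order."""
    docs = [s.lower().split() for s in sentences]
    scores = bm25_scores(query.lower().split(), docs)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]
```

Keeping the selected sentences in document order is a design choice of this sketch; it makes it straightforward to map them back to positions in the original text for highlighting.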

SSSD: Leveraging Pre-trained Models and Semantic Search for Semi-supervised Stance Detection
André de Sousa | Karin Becker
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Pre-trained models (PTMs) based on the Transformer architecture are trained on massive amounts of data and can capture nuances and complexities in linguistic expressions, making them a powerful tool for many natural language processing tasks. In this paper, we present SSSD (Semantic Similarity Stance Detection), a semi-supervised method for stance detection on Twitter that automatically labels a large, domain-related corpus for training a stance classification model. The method takes as input a domain set of tweets about a given target and a labeled query set of tweets with representative arguments related to the stances. It scales the automatic labeling of a large number of tweets and improves classification accuracy by leveraging the power of PTMs and semantic search to capture context and meaning. We largely outperformed all baselines in experiments using the SemEval benchmark.
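The automatic labeling idea in this abstract can be sketched as a nearest-query search: each unlabeled tweet receives the stance of its most similar labeled query tweet, provided the similarity is high enough. The sketch below is only an illustration under stated assumptions. It substitutes a toy bag-of-words vector for the PTM embeddings the abstract refers to, and the threshold value and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a PTM sentence embedding: a bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def auto_label(domain_tweets, labeled_queries, threshold=0.3):
    """Give each domain tweet the stance of its most similar query tweet.

    Tweets whose best similarity falls below the (assumed) threshold are
    left out of the automatically labeled training set.
    """
    queries = [(embed(text), stance) for text, stance in labeled_queries]
    labeled = []
    for tweet in domain_tweets:
        vec = embed(tweet)
        sim, stance = max(
            ((cosine(vec, qvec), qstance) for qvec, qstance in queries),
            key=lambda pair: pair[0],
        )
        if sim >= threshold:
            labeled.append((tweet, stance))
    return labeled
```

The resulting (tweet, stance) pairs would then serve as training data for a stance classifier; that training step is not shown here.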

2018

UFRGS Participation on the WMT Biomedical Translation Shared Task
Felipe Soares | Karin Becker
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the machine translation systems developed by the Universidade Federal do Rio Grande do Sul (UFRGS) team for the biomedical translation shared task. Our systems are based on statistical machine translation and neural machine translation, using the Moses and OpenNMT toolkits, respectively. We participated in four translation directions for the English/Spanish and English/Portuguese language pairs. To create our training data, we concatenated several parallel corpora, both from in-domain and out-of-domain sources, as well as terminological resources from UMLS. Our systems achieved the best BLEU scores according to the official shared task evaluation.

A Large Parallel Corpus of Full-Text Scientific Articles
Felipe Soares | Viviane Moreira | Karin Becker
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

INF-UFRGS at SemEval-2017 Task 5: A Supervised Identification of Sentiment Score in Tweets and Headlines
Tiago Zini | Karin Becker | Marcelo Dias
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes a supervised solution for detecting the polarity scores of tweets or news headlines in the financial domain, submitted to the SemEval 2017 Fine-Grained Sentiment Analysis on Financial Microblogs and News Task. The premise is that it is possible to understand market reaction to a company's stock by measuring the positive/negative sentiment contained in financial tweets and news headlines, where polarity is measured on a continuous scale ranging from -1.0 (very bearish) to 1.0 (very bullish). Our system receives as input the textual content of tweets or news headlines, together with their ids, stock cashtag or name of the target company, and the gold-standard polarity score for the training dataset. Our solution extracts features from these text instances using n-grams, hashtags, sentiment scores calculated by external APIs, and other features to train a regression model capable of detecting the continuous score of these sentiments with precision.
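The feature-based regression setup described in this abstract can be sketched roughly as follows. This is a simplified illustration, not the submitted system: the feature names, the plain SGD linear regressor, and the learning rate and epoch count are assumptions, and the sentiment scores obtained from external APIs are left out entirely.

```python
def extract_features(text):
    """Unigram and bigram counts plus hashtag and cashtag counts."""
    tokens = text.lower().split()
    feats = {}
    for tok in tokens:
        feats["uni=" + tok] = feats.get("uni=" + tok, 0.0) + 1.0
    for a, b in zip(tokens, tokens[1:]):
        key = "bi=" + a + "_" + b
        feats[key] = feats.get(key, 0.0) + 1.0
    feats["n_hashtags"] = float(sum(t.startswith("#") for t in tokens))
    feats["n_cashtags"] = float(sum(t.startswith("$") for t in tokens))
    return feats


def train(examples, epochs=100, lr=0.05):
    """Fit a linear regressor with plain SGD on squared error."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, target in examples:
            feats = extract_features(text)
            pred = bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())
            err = pred - target
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) - lr * err * v
            bias -= lr * err
    return weights, bias


def predict(text, weights, bias):
    """Predict a polarity score, clipped to the task range [-1.0, 1.0]."""
    feats = extract_features(text)
    raw = bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(-1.0, min(1.0, raw))
```

Clipping the output keeps predictions inside the scale the task defines, from -1.0 (very bearish) to 1.0 (very bullish).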

2016

INF-UFRGS-OPINION-MINING at SemEval-2016 Task 6: Automatic Generation of a Training Corpus for Unsupervised Identification of Stance in Tweets
Marcelo Dias | Karin Becker
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)