Pavel Sofroniev


2018

pdf bib
Phonetic Vector Representations for Sound Sequence Alignment
Pavel Sofroniev | Çağrı Çöltekin
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

This study explores a number of data-driven vector representations of the IPA-encoded sound segments for the purpose of sound sequence alignment. We test the alternative representations based on the alignment accuracy in the context of computational historical linguistics. We show that the data-driven methods consistently do better than linguistically-motivated articulatory-acoustic features. The similarity scores obtained using the data-driven representations in a monolingual context, however, performs worse than the state-of-the-art distance (or similarity) scoring methods proposed in earlier studies of computational historical linguistics. We also show that adapting representations to the task at hand improves the results, yielding alignment accuracy comparable to the state of the art methods.

2017

pdf bib
Computational analysis of Gondi dialects
Taraka Rama | Çağrı Çöltekin | Pavel Sofroniev
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper presents a computational analysis of Gondi dialects spoken in central India. We present a digitized data set of the dialect area, and analyze the data using different techniques from dialectometry, deep learning, and computational biology. We show that the methods largely agree with each other and with the earlier non-computational analyses of the language group.

pdf bib
The parse is darc and full of errors: Universal dependency parsing with transition-based and graph-based algorithms
Kuan Yu | Pavel Sofroniev | Erik Schill | Erhard Hinrichs
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We developed two simple systems for dependency parsing: darc, a transition-based parser, and mstnn, a graph-based parser. We tested our systems in the CoNLL 2017 UD Shared Task, with darc being the official system. Darc ranked 12th among 33 systems, just above the baseline. Mstnn had no official ranking, but its main score was above the 27th. In this paper, we describe our two systems, examine their strengths and weaknesses, and discuss the lessons we learned.

pdf bib
Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists
Gerhard Jäger | Johann-Mattis List | Pavel Sofroniev
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Most current approaches in phylogenetic linguistics require as input multilingual word lists partitioned into sets of etymologically related words (cognates). Cognate identification is so far done manually by experts, which is time consuming and as of yet only available for a small number of well-studied language families. Automatizing this step will greatly expand the empirical scope of phylogenetic methods in linguistics, as raw wordlists (in phonetic transcription) are much easier to obtain than wordlists in which cognate words have been fully identified and annotated, even for under-studied languages. A couple of different methods have been proposed in the past, but they are either disappointing regarding their performance or not applicable to larger datasets. Here we present a new approach that uses support vector machines to unify different state-of-the-art methods for phonetic alignment and cognate detection within a single framework. Training and evaluating these method on a typologically broad collection of gold-standard data shows it to be superior to the existing state of the art.