Mark Dras


2023

What Learned Representations and Influence Functions Can Tell Us About Adversarial Examples
Shakila Mahjabin Tonni | Mark Dras
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)

2022

Few-shot fine-tuning SOTA summarization models for medical dialogues
David Fraile Navarro | Mark Dras | Shlomo Berkovsky
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop

Abstractive summarization of medical dialogues presents a challenge for standard training approaches, given the paucity of suitable datasets. We explore the performance of state-of-the-art models with zero-shot and few-shot learning strategies and measure the impact of pretraining with general domain and dialogue-specific text on the summarization performance.

Detecting Textual Adversarial Examples Based on Distributional Characteristics of Data Representations
Na Liu | Mark Dras | Wei Emma Zhang
Proceedings of the 7th Workshop on Representation Learning for NLP

Although deep neural networks have achieved state-of-the-art performance in various machine learning tasks, adversarial examples, constructed by adding small non-random perturbations to correctly classified inputs, successfully fool highly expressive deep classifiers into incorrect predictions. Approaches to adversarial attacks in natural language tasks have boomed in the last five years using character-level, word-level, phrase-level, or sentence-level textual perturbations. While there is some work in NLP on defending against such attacks through proactive methods, like adversarial training, there are to our knowledge no effective general reactive approaches to defence via detection of textual adversarial examples such as are found in the image processing literature. In this paper, we propose two new reactive methods for NLP to fill this gap, which unlike the few limited-application baselines from NLP are based entirely on distribution characteristics of learned representations: we adapt one from the image processing literature (Local Intrinsic Dimensionality (LID)), and propose a novel one (MultiDistance Representation Ensemble Method (MDRE)). Adapted LID and MDRE obtain state-of-the-art results on character-level, word-level, and phrase-level attacks on the IMDB dataset as well as on the latter two with respect to the MultiNLI dataset. For future research, we publish our code.

2021

Mention Flags (MF): Constraining Transformer-based Text Generators
Yufei Wang | Ian Wood | Stephen Wan | Mark Dras | Mark Johnson
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This paper focuses on Seq2Seq (S2S) constrained text generation where the text generator is constrained to mention specific words which are inputs to the encoder in the generated outputs. Pre-trained S2S models or a Copy Mechanism are trained to copy the surface tokens from encoders to decoders, but they cannot guarantee constraint satisfaction. Constrained decoding algorithms always produce hypotheses satisfying all constraints. However, they are computationally expensive and can lower the generated text quality. In this paper, we propose Mention Flags (MF), which traces whether lexical constraints are satisfied in the generated outputs in an S2S decoder. The MF models can be trained to generate tokens in a hypothesis until all constraints are satisfied, guaranteeing high constraint satisfaction. Our experiments on the Common Sense Generation task (CommonGen) (Lin et al., 2020), End2end Restaurant Dialog task (E2ENLG) (Dušek et al., 2020) and Novel Object Captioning task (nocaps) (Agrawal et al., 2019) show that the MF models maintain higher constraint satisfaction and text quality than the baseline models and other constrained decoding algorithms, achieving state-of-the-art performance on all three tasks. These results are achieved with a much lower run-time than constrained decoding algorithms. We also show that the MF models work well in the low-resource setting.

2020

Large Scale Author Obfuscation Using Siamese Variational Auto-Encoder: The SiamAO System
Chakaveh Saedi | Mark Dras
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics

Author obfuscation is the task of masking the author of a piece of text, with applications in privacy. Recent advances in deep neural networks have boosted author identification performance, making author obfuscation more challenging. Existing approaches to author obfuscation are largely heuristic. Obfuscation can, however, be thought of as the construction of adversarial examples to attack author identification, suggesting that the deep learning architectures used for adversarial attacks could have application here. Current architectures are proposed to construct adversarial examples against classification-based models, which in author identification would exclude the high-performing similarity-based models employed when facing a large number of authorial classes. In this paper, we propose the first deep learning architecture for constructing adversarial examples against similarity-based learners, and explore its application to author obfuscation. We analyse the output in terms of both success in obfuscation and language acceptability, and compare its performance with some common baselines, showing promising results in finding a balance between safety and soundness of the perturbed texts.

2018

A Fast and Accurate Vietnamese Word Segmenter
Dat Quoc Nguyen | Dai Quoc Nguyen | Thanh Vu | Mark Dras | Mark Johnson
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Native Language Identification With Classifier Stacking and Ensembles
Shervin Malmasi | Mark Dras
Computational Linguistics, Volume 44, Issue 3 - September 2018

Ensemble methods using multiple classifiers have proven to be among the most successful approaches for the task of Native Language Identification (NLI), achieving the current state of the art. However, a systematic examination of ensemble methods for NLI has yet to be conducted. Additionally, deeper ensemble architectures such as classifier stacking have not been closely evaluated. We present a set of experiments using three ensemble-based models, testing each with multiple configurations and algorithms. This includes a rigorous application of meta-classification models for NLI, achieving state-of-the-art results on several large data sets, evaluated in both intra-corpus and cross-corpus modes.

VnCoreNLP: A Vietnamese Natural Language Processing Toolkit
Thanh Vu | Dat Quoc Nguyen | Dai Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

We present an easy-to-use and fast toolkit, namely VnCoreNLP—a Java NLP annotation pipeline for Vietnamese. Our VnCoreNLP supports key natural language processing (NLP) tasks including word segmentation, part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing, and obtains state-of-the-art (SOTA) results for these tasks. We release VnCoreNLP to provide rich linguistic annotations to facilitate research work on Vietnamese NLP. Our VnCoreNLP is open-source and available at: https://github.com/vncorenlp/VnCoreNLP

Predicting accuracy on large datasets from smaller pilot data
Mark Johnson | Peter Anderson | Mark Dras | Mark Steedman
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Because obtaining training data is often the most difficult part of an NLP or ML project, we develop methods for predicting how much data is required to achieve a desired test accuracy by extrapolating results from models trained on a small pilot training dataset. We model how accuracy varies as a function of training size on subsets of the pilot data, and use that model to predict how much training data would be required to achieve the desired accuracy. We introduce a new performance extrapolation task to evaluate how well different extrapolations predict accuracy on larger training sets. We show that details of hyperparameter optimisation and the extrapolation models can have dramatic effects in a document classification task. We believe this is an important first step in developing methods for estimating the resources required to meet specific engineering performance targets.
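
As a rough illustration of the extrapolation idea in this abstract, the sketch below fits an inverse power law to error rates measured on subsets of a pilot set and inverts it to estimate the training size needed for a target error. The power-law form, the function names, and the synthetic numbers are all assumptions made for illustration; the abstract does not state which extrapolation models the paper uses.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(b) - c * log(n).

    Assumed model (not taken from the paper): error(n) = b * n ** (-c).
    Returns the pair (b, c).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def required_size(b, c, target_error):
    """Invert error(n) = b * n ** (-c) to get the n that reaches target_error."""
    return (b / target_error) ** (1.0 / c)

# Synthetic pilot measurements generated from error(n) = 2 * n ** -0.5.
# These are made-up values for illustration, not results from any paper.
pilot_sizes = [100, 200, 400, 800]
pilot_errors = [2.0 * n ** -0.5 for n in pilot_sizes]

b, c = fit_power_law(pilot_sizes, pilot_errors)
needed = required_size(b, c, target_error=0.05)
print(f"b={b:.3f}, c={c:.3f}, estimated training size={needed:.0f}")
```

Real learning curves are noisy and need not follow a power law, so an estimate like this is only as good as the assumed curve family.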

2017

Unsupervised Text Segmentation Based on Native Language Characteristics
Shervin Malmasi | Mark Dras | Mark Johnson | Lan Du | Magdalena Wolska
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most work on segmenting text does so on the basis of topic changes, but it can be of interest to segment by other, stylistically expressed characteristics such as change of authorship or native language. We propose a Bayesian unsupervised text segmentation approach to the latter. While baseline models achieve essentially random segmentation on our task, indicating its difficulty, a Bayesian model that incorporates appropriately compact language models and alternating asymmetric priors can achieve scores on the standard metrics around halfway to perfect segmentation.

Feature Hashing for Language and Dialect Identification
Shervin Malmasi | Mark Dras
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We evaluate feature hashing for language identification (LID), a method not previously used for this task. Using a standard dataset, we first show that while feature performance is high, LID data is highly dimensional and mostly sparse (>99.5%) as it includes large vocabularies for many languages; memory requirements grow as languages are added. Next we apply hashing using various hash sizes, demonstrating that there is no performance loss with dimensionality reductions of up to 86%. We also show that using an ensemble of low-dimension hash-based classifiers further boosts performance. Feature hashing is highly useful for LID and holds great promise for future work in this area.
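
For readers unfamiliar with the technique named in this abstract, here is a minimal sketch of feature hashing applied to character n-grams. The choice of character trigrams, the bucket count, the use of `zlib.crc32` as the hash function, and the function names are assumptions for illustration only; the abstract does not describe the paper's actual feature set or hash function.

```python
import zlib
from collections import Counter

def char_ngrams(text, n=3):
    """Return all character n-grams of the text, padded with one space each side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def hash_features(text, num_buckets=1024, n=3):
    """Map n-gram counts into a fixed-length vector by hashing each n-gram.

    No vocabulary is stored: the vector length is num_buckets regardless of
    how many distinct n-grams occur. Colliding n-grams share a bucket and
    their counts are summed (unsigned variant of the hashing trick).
    """
    vec = [0] * num_buckets
    for gram, count in Counter(char_ngrams(text, n)).items():
        # crc32 is a stand-in deterministic hash, chosen only for this sketch.
        idx = zlib.crc32(gram.encode("utf-8")) % num_buckets
        vec[idx] += count
    return vec

vec = hash_features("the quick brown fox")
print(len(vec), sum(vec))
```

Because the vector size is fixed up front, memory use does not grow as languages and their vocabularies are added, which is the property the abstract highlights.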

A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing
Dat Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present a novel neural network model that learns POS tagging and graph-based dependency parsing jointly. Our model uses bidirectional LSTMs to learn feature representations shared for both POS tagging and dependency parsing tasks, thus handling the feature-engineering problem. Our extensive experiments, on 19 languages from the Universal Dependencies project, show that our model outperforms the state-of-the-art neural network-based Stack-propagation model for joint POS tagging and transition-based dependency parsing, resulting in a new state of the art. Our code is open-source and available together with pre-trained models at: https://github.com/datquocnguyen/jPTDP

Stock Market Prediction with Deep Learning: A Character-based Neural Language Model for Event-based Trading
Leonardo dos Santos Pinheiro | Mark Dras
Proceedings of the Australasian Language Technology Association Workshop 2017

From Word Segmentation to POS Tagging for Vietnamese
Dat Quoc Nguyen | Thanh Vu | Dai Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2017

2016

An empirical study for Vietnamese dependency parsing
Dat Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2016

Modeling Language Change in Historical Corpora: The Case of Portuguese
Marcos Zampieri | Shervin Malmasi | Mark Dras
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts with respect to their publication date taking into account lexical variation represented as word n-grams, and morphosyntactic variation represented by part-of-speech (POS) distribution. We report results of 99.8% accuracy using word unigram features with a Support Vector Machines classifier to predict the publication date of documents in time intervals of both one century and half a century. A feature analysis is performed to investigate the most informative features for this task and how they are linked to language change.

Predicting Post Severity in Mental Health Forums
Shervin Malmasi | Marcos Zampieri | Mark Dras
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

LTG at SemEval-2016 Task 11: Complex Word Identification with Classifier Ensembles
Shervin Malmasi | Mark Dras | Marcos Zampieri
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

Norwegian Native Language Identification
Shervin Malmasi | Mark Dras | Irina Temnikova
Proceedings of the International Conference Recent Advances in Natural Language Processing

Clinical Information Extraction Using Word Representations
Shervin Malmasi | Hamed Hassanzadeh | Mark Dras
Proceedings of the Australasian Language Technology Association Workshop 2015

Cognate Identification using Machine Translation
Shervin Malmasi | Mark Dras
Proceedings of the Australasian Language Technology Association Workshop 2015

Large-Scale Native Language Identification with Cross-Corpus Evaluation
Shervin Malmasi | Mark Dras
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Oracle and Human Baselines for Native Language Identification
Shervin Malmasi | Joel Tetreault | Mark Dras
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

Language Identification using Classifier Ensembles
Shervin Malmasi | Mark Dras
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

Squibs: Evaluating Human Pairwise Preference Judgments
Mark Dras
Computational Linguistics, Volume 41, Issue 2 - June 2015

2014

Chinese Native Language Identification
Shervin Malmasi | Mark Dras
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

Arabic Native Language Identification
Shervin Malmasi | Mark Dras
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

From Visualisation to Hypothesis Construction for Second Language Acquisition
Shervin Malmasi | Mark Dras
Proceedings of TextGraphs-9: the workshop on Graph-based Methods for Natural Language Processing

Cross-lingual Transfer Parsing for Low-Resourced Languages: An Irish Case Study
Teresa Lynn | Jennifer Foster | Mark Dras | Lamia Tounsi
Proceedings of the First Celtic Language Technology Workshop

Language Transfer Hypotheses with Linear SVM Weights
Shervin Malmasi | Mark Dras
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Finnish Native Language Identification
Shervin Malmasi | Mark Dras
Proceedings of the Australasian Language Technology Association Workshop 2014

2013

NLI Shared Task 2013: MQ Submission
Shervin Malmasi | Sze-Meng Jojo Wong | Mark Dras
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

Working with a small dataset - semi-supervised dependency parsing for Irish
Teresa Lynn | Jennifer Foster | Mark Dras
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

2012

Is Bad Structure Better Than No Structure?: Unsupervised Parsing for Realisation Ranking
Yasaman Motazedi | Mark Dras | François Lareau
Proceedings of COLING 2012

Proceedings of the First International Workshop on Optimization Techniques for Human Language Technology
Pushpak Bhattacharyya | Asif Ekbal | Sriparna Saha | Mark Johnson | Diego Molla-Aliod | Mark Dras
Proceedings of the First International Workshop on Optimization Techniques for Human Language Technology

Irish Treebanking and Parsing: A Preliminary Evaluation
Teresa Lynn | Özlem Çetinoğlu | Jennifer Foster | Elaine Uí Dhonnchadha | Mark Dras | Josef van Genabith
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Language resources are essential for linguistic research and the development of NLP applications. Low-density languages, such as Irish, therefore lack significant research in this area. This paper describes the early stages in the development of new language resources for Irish, namely the first Irish dependency treebank and the first Irish statistical dependency parser. We present the methodology behind building our new treebank and the steps we take to leverage the few existing resources. We discuss language-specific choices made when defining our dependency labelling scheme, and describe interesting Irish language characteristics such as prepositional attachment, copula, and clefting. We manually develop a small treebank of 300 sentences based on an existing POS-tagged corpus and report an inter-annotator agreement of 0.7902. We train MaltParser to achieve preliminary parsing results for Irish and describe a bootstrapping approach for further stages of development.

Active Learning and the Irish Treebank
Teresa Lynn | Jennifer Foster | Mark Dras | Elaine Uí Dhonnchadha
Proceedings of the Australasian Language Technology Association Workshop 2012

Valence Shifting: Is It A Valid Task?
Mary Gardiner | Mark Dras
Proceedings of the Australasian Language Technology Association Workshop 2012

Exploring Adaptor Grammars for Native Language Identification
Sze-Meng Jojo Wong | Mark Dras | Mark Johnson
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

Clause Restructuring For SMT Not Absolutely Helpful
Susan Howlett | Mark Dras
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Collocations in Multilingual Natural Language Generation: Lexical Functions meet Lexical Functional Grammar
François Lareau | Mark Dras | Benjamin Börschinger | Robert Dale
Proceedings of the Australasian Language Technology Association Workshop 2011

Topic Modeling for Native Language Identification
Sze-Meng Jojo Wong | Mark Dras | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2011

Exploiting Parse Structures for Native Language Identification
Sze-Meng Jojo Wong | Mark Dras
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Detecting Interesting Event Sequences for Sports Reporting
François Lareau | Mark Dras | Robert Dale
Proceedings of the 13th European Workshop on Natural Language Generation

2010

Dual-Path Phrase-Based Statistical Machine Translation
Susan Howlett | Mark Dras
Proceedings of the Australasian Language Technology Association Workshop 2010

Parser Features for Sentence Grammaticality Classification
Sze-Meng Jojo Wong | Mark Dras
Proceedings of the Australasian Language Technology Association Workshop 2010

2009

Coupling Hierarchical Word Reordering and Decoding in Phrase-Based Statistical Machine Translation
Maxim Khalilov | José A. R. Fonollosa | Mark Dras
Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009

Using Hypernymy Acquisition to Tackle (Part of) Textual Entailment
Elena Akhmatova | Mark Dras
Proceedings of the 2009 Workshop on Applied Textual Inference (TextInfer)

Improving Grammaticality in Statistical Sentence Generation: Introducing a Dependency Spanning Tree Algorithm with an Argument Satisfaction Model
Stephen Wan | Mark Dras | Robert Dale | Cécile Paris
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

A New Subtree-Transfer Approach to Syntax-Based Reordering for Statistical Machine Translation
Maxim Khalilov | José A. R. Fonollosa | Mark Dras
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

Contrastive Analysis and Native Language Identification
Sze-Meng Jojo Wong | Mark Dras
Proceedings of the Australasian Language Technology Association Workshop 2009

2008

Morphosyntactic Target Language Matching in Statistical Machine Translation
Simon Zwarts | Mark Dras
Proceedings of the Australasian Language Technology Association Workshop 2008

Seed and Grow: Augmenting Statistically Generated Summary Sentences using Schematic Word Patterns
Stephen Wan | Robert Dale | Mark Dras | Cécile Paris
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

Choosing the Right Translation: A Syntactically Informed Classification Approach
Simon Zwarts | Mark Dras
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

Syntax-based word reordering in phrase-based statistical machine translation: why does it work?
Simon Zwarts | Mark Dras
Proceedings of Machine Translation Summit XI: Papers

GLEU: Automatic Evaluation of Sentence-Level Fluency
Andrew Mutton | Mark Dras | Stephen Wan | Robert Dale
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

Proceedings of the Australasian Language Technology Workshop 2007
Nathalie Colineau | Mark Dras
Proceedings of the Australasian Language Technology Workshop 2007

Entailment due to Syntactically Encoded Semantic Relationships
Elena Akhmatova | Mark Dras
Proceedings of the Australasian Language Technology Workshop 2007

Exploring Approaches to Discriminating among Near-Synonyms
Mary Gardiner | Mark Dras
Proceedings of the Australasian Language Technology Workshop 2007

Statistical Machine Translation of Australian Aboriginal Languages: Morphological Analysis with Languages of Differing Morphological Richness
Simon Zwarts | Mark Dras
Proceedings of the Australasian Language Technology Workshop 2007

ACL 2007 Workshop on Deep Linguistic Processing
Timothy Baldwin | Mark Dras | Julia Hockenmaier | Tracy Holloway King | Gertjan van Noord
ACL 2007 Workshop on Deep Linguistic Processing

The Impact of Deep Linguistic Processing on Parsing Technology
Timothy Baldwin | Mark Dras | Julia Hockenmaier | Tracy Holloway King | Gertjan van Noord
Proceedings of the Tenth International Conference on Parsing Technologies

2006

Using Dependency-Based Features to Take the ’Para-farce’ out of Paraphrase
Stephen Wan | Mark Dras | Robert Dale | Cécile Paris
Proceedings of the Australasian Language Technology Workshop 2006

This Phrase-Based SMT System is Out of Order: Generalised Word Reordering in Machine Translation
Simon Zwarts | Mark Dras
Proceedings of the Australasian Language Technology Workshop 2006

2005

Searching for Grammaticality: Propagating Dependencies in the Viterbi Algorithm
Stephen Wan | Robert Dale | Mark Dras
Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05)

Towards Statistical Paraphrase Generation: Preliminary Evaluations of Grammaticality
Stephen Wan | Mark Dras | Robert Dale | Cécile Paris
Proceedings of the Third International Workshop on Paraphrasing (IWP2005)

Formal Grammars for Linguistic Treebank Queries
Mark Dras | Steve Cassidy
Proceedings of the Australasian Language Technology Workshop 2005

2004

Non-contiguous tree parsing
Mark Dras | Chung-hye Han
Proceedings of the 10th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

2003

Using Thematic Information in Statistical Headline Generation
Stephen Wan | Mark Dras | Cécile Paris | Robert Dale
Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering

Straight to the point: Discovering themes for summary generation
Stephen Wan | Mark Dras | Cécile Paris | Robert Dale
Proceedings of the Australasian Language Technology Workshop 2003

2002

Korean-English MT and S-TAG
Mark Dras | Chung-hye Han
Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+6)

2000

Multi-Component TAG and Notions of Formal Power
William Schuler | David Chiang | Mark Dras
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

Some remarks on an extension of synchronous TAG
David Chiang | William Schuler | Mark Dras
Proceedings of the Fifth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+5)

How problematic are clitics for S-TAG translation?
Mark Dras | Tonia Bleam
Proceedings of the Fifth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+5)

1999

A Meta-Level Grammar: Redefining Synchronous TAG for Translation and Paraphrase
Mark Dras
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1997

Representing Paraphrases Using Synchronous TAGs
Mark Dras
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics