Agha Ali Raza

Also published as: Agha Raza


2023

pdf bib
Data Pruning for Efficient Model Pruning in Neural Machine Translation
Abdul Azeemi | Ihsan Qazi | Agha Raza
Findings of the Association for Computational Linguistics: EMNLP 2023

Model pruning methods reduce memory requirements and inference time of large-scale pre-trained language models after deployment. However, the actual pruning procedure is computationally intensive, involving repeated training and pruning until the required sparsity is achieved. This paper combines data pruning with movement pruning for Neural Machine Translation (NMT) to enable efficient fine-pruning. We design a dataset pruning strategy by leveraging cross-entropy scores of individual training instances. We conduct pruning experiments on the task of machine translation from Romanian-to-English and Turkish-to-English, and demonstrate that selecting hard-to-learn examples (top-k) based on training cross-entropy scores outperforms other dataset pruning methods. We empirically demonstrate that data pruning reduces the overall steps required for convergence and the training time of movement pruning. Finally, we perform a series of experiments to tease apart the role of training data during movement pruning and uncover new insights to understand the interplay between data and model pruning in the context of NMT.

2020

pdf bib
SimplifyUR: Unsupervised Lexical Text Simplification for Urdu
Namoos Hayat Qasmi | Haris Bin Zia | Awais Athar | Agha Ali Raza
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper presents the first attempt at Automatic Text Simplification (ATS) for Urdu, the language of 170 million people worldwide. Being a low-resource language in terms of standard linguistic resources, recent text simplification approaches that rely on manually crafted simplified corpora or lexicons such as WordNet are not applicable to Urdu. Urdu is a morphologically rich language that requires unique considerations such as proper handling of inflectional case and honorifics. We present an unsupervised method for lexical simplification of complex Urdu text. Our method only requires plain Urdu text and makes use of word embeddings together with a set of morphological features to generate simplifications. Our system achieves a BLEU score of 80.15 and SARI score of 42.02 upon automatic evaluation on manually crafted simplified corpora. We also report results for human evaluations for correctness, grammaticality, meaning-preservation and simplicity of the output. Our code and corpus are publicly available to make our results reproducible.

2018

pdf bib
PronouncUR: An Urdu Pronunciation Lexicon Generator
Haris Bin Zia | Agha Ali Raza | Awais Athar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Urdu Word Segmentation using Conditional Random Fields (CRFs)
Haris Bin Zia | Agha Ali Raza | Awais Athar
Proceedings of the 27th International Conference on Computational Linguistics

State-of-the-art Natural Language Processing algorithms rely heavily on efficient word segmentation. Urdu is amongst languages for which word segmentation is a complex task as it exhibits space omission as well as space insertion issues. This is partly due to the Arabic script which although cursive in nature, consists of characters that have inherent joining and non-joining attributes regardless of word boundary. This paper presents a word segmentation system for Urdu which uses a Conditional Random Field sequence modeler with orthographic, linguistic and morphological features. Our proposed model automatically learns to predict white space as word boundary as well as Zero Width Non-Joiner (ZWNJ) as sub-word boundary. Using a manually annotated corpus, our model achieves F1 score of 0.97 for word boundary identification and 0.85 for sub-word boundary identification tasks. We have made our code and corpus publicly available to make our results reproducible.