Adapting Language Specific Components of Cross-Media Analysis Frameworks to Less-Resourced Languages: the Case of Amharic

Yonas Woldemariam; Adam Dahlgren

Abstract

We present an ASR-based pipeline for Amharic that orchestrates NLP components within a cross-media analysis framework (CMAF). One of the major challenges inherent to CMAFs is handling multiple languages effectively. As a result, many languages remain under-resourced and cannot benefit from the available media analysis solutions. Although Amharic is spoken natively by over 22 million people and the amount of Amharic multimedia content on the Web keeps growing, querying that content with simple text search is difficult. Searching audio and video content with simple keywords is even harder, because such content exists only in its raw form. In this study, we introduce a spoken and textual content processing workflow for Amharic into a CMAF. We design an ASR-named entity recognition (NER) pipeline with three main components: ASR, a transliterator, and NER. We explore various acoustic modeling techniques and develop an OpenNLP-based NER extractor, along with a transliterator that interfaces between ASR and NER. The resulting ASR-NER pipeline for Amharic extends the multilingual support of CMAFs. In addition, the state-of-the-art design principles and techniques employed in this study can inform work on other less-resourced languages, particularly the Semitic ones.

Anthology ID: 2020.sltu-1.42
Volume: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Month: May
Year: 2020
Address: Marseille, France
Editors: Dorothee Beermann, Laurent Besacier, Sakriani Sakti, Claudia Soria
Venue: SLTU
Publisher: European Language Resources Association
Pages: 298–305
Language: English
URL: https://aclanthology.org/2020.sltu-1.42
Cite (ACL): Yonas Woldemariam and Adam Dahlgren. 2020. Adapting Language Specific Components of Cross-Media Analysis Frameworks to Less-Resourced Languages: the Case of Amharic. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 298–305, Marseille, France. European Language Resources Association.
Cite (Informal): Adapting Language Specific Components of Cross-Media Analysis Frameworks to Less-Resourced Languages: the Case of Amharic (Woldemariam & Dahlgren, SLTU 2020)
PDF: https://aclanthology.org/2020.sltu-1.42.pdf
