Yoshiki Mikami


2012

Stemming Tigrinya Words for Information Retrieval
Omer Osman | Yoshiki Mikami
Proceedings of COLING 2012: Demonstration Papers

2008

A Rule-based Syllable Segmentation of Myanmar Text
Zin Maung Maung | Yoshiki Mikami
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages

Technical Terminology in Asian Languages: Different Approaches to Adopting Engineering Terms
Makiko Matsuda | Tomoe Takahashi | Hiroki Goto | Yoshikazu Hayase | Robin Lee Nagano | Yoshiki Mikami
Proceedings of the 6th Workshop on Asian Language Resources

The Link Structure of Language Communities and its Implication for Language-specific Crawling
Rizza Caminero | Yoshiki Mikami
Proceedings of the 6th Workshop on Asian Language Resources

2005

Language and Encoding Scheme Identification of Extremely Large Sets of Multilingual Text
Pavol Zavarsky | Yoshiki Mikami | Shota Wada
Proceedings of Machine Translation Summit X: Posters

This paper outlines our approach to identifying languages and encoding schemes in extremely large sets of multilingual documents. The sets analyzed in our Language Observatory project [1] comprise tens of millions of text documents. The approach lets us analyze about 250 documents per second (about 20 million documents per day) on a single Linux machine. Using multithreaded processing on a cluster of Linux servers, we can easily analyze more than 100 million documents per day.
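
As a rough check on the throughput figures quoted in the abstract, the arithmetic can be sketched as follows. The function names are hypothetical, and the cluster estimate assumes that throughput scales linearly with the number of machines, which the abstract does not state.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def docs_per_day(docs_per_second: float) -> float:
    """Convert a per-second processing rate into a per-day rate."""
    return docs_per_second * SECONDS_PER_DAY


def machines_needed(target_per_day: float, docs_per_second: float) -> int:
    """Estimate how many machines reach a daily target.

    Assumption (not from the abstract): each machine sustains the same
    rate and throughput scales linearly across the cluster.
    """
    return math.ceil(target_per_day / docs_per_day(docs_per_second))


# Single machine at the quoted rate of about 250 documents per second.
single_machine = docs_per_day(250)

# Machines required for the quoted cluster figure of 100 million per day.
cluster_size = machines_needed(100_000_000, 250)

print(f"{single_machine:,.0f} documents/day on one machine")
print(f"{cluster_size} machines for 100 million documents/day")
```

At 250 documents per second, one machine handles 21.6 million documents per day, consistent with the "about 20 million" quoted above. Under the linear-scaling assumption, five such machines would pass the 100 million mark.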