Brian Mak


2017

Derivation of Document Vectors from Adaptation of LSTM Language Model
Wei Li | Brian Mak
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

In many natural language processing (NLP) tasks, a document is commonly modeled as a bag of words using the term frequency-inverse document frequency (TF-IDF) vector. One major shortcoming of the frequency-based TF-IDF feature vector is that it ignores word order, which carries syntactic and semantic relationships among the words in a document. This paper proposes a novel distributed vector representation of a document, labeled DV-LSTM, which is derived from the result of adapting a long short-term memory recurrent neural network language model to the document. DV-LSTM is expected to capture high-level sequential information in the document, which other current document representations fail to do. It was evaluated on document genre classification with the Brown Corpus and the BNC Baby Corpus. The results show that DV-LSTM significantly outperforms the TF-IDF vector and the paragraph vector (PV-DM) in most cases, and combining them may further improve classification performance.
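A minimal sketch of the adaptation idea described in the abstract, not the authors' implementation: a copy of a pre-trained LSTM language model is fine-tuned on a single document for a few steps, and the resulting change in a chosen weight matrix is read out as that document's fixed-length vector. The PyTorch model, the number of adaptation steps, the learning rate, and the choice of reading out the output-layer weights are all illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """A small LSTM language model used only to illustrate the adaptation step."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embed(tokens))
        return self.out(hidden)

def document_vector(pretrained_lm, doc_token_ids, steps=5, lr=0.1):
    """Adapt a copy of the LM on one document; return the weight change as its vector."""
    lm = copy.deepcopy(pretrained_lm)                   # keep the shared pre-trained LM intact
    before = lm.out.weight.detach().clone()            # output-layer weights before adaptation
    optimizer = torch.optim.SGD(lm.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    tokens = torch.tensor(doc_token_ids).unsqueeze(0)  # shape (1, T)
    inputs, targets = tokens[:, :-1], tokens[:, 1:]    # next-word prediction pairs
    for _ in range(steps):                             # a few adaptation passes over the document
        optimizer.zero_grad()
        logits = lm(inputs)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        loss.backward()
        optimizer.step()
    delta = lm.out.weight.detach() - before            # how adaptation moved the weights
    return delta.flatten()                             # fixed-length document representation

# Usage: one vector per document from the same pre-trained LM, then any classifier on top.
lm = LSTMLanguageModel(vocab_size=100)
vec = document_vector(lm, doc_token_ids=[4, 8, 15, 16, 23, 42, 7, 9])
print(vec.shape)
```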

2003

PLASER: Pronunciation Learning via Automatic Speech Recognition
Brian Mak | Manhung Siu | Mimi Ng | Yik-Cheung Tam | Yu-Chung Chan | Kin-Wah Chan | Ka-Yee Leung | Simon Ho | Jimmy Wong | Jacqueline Lo
Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing