Shigeki Matsubara


2023

Revisiting Syntax-Based Approach in Negation Scope Resolution
Asahi Yoshida | Yoshihide Kato | Shigeki Matsubara
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

Negation scope resolution is the process of detecting the negated part of a sentence. Unlike the syntax-based approach employed in previous research, state-of-the-art methods performed better without the explicit use of syntactic structure. This work revisits the syntax-based approach and re-evaluates the effectiveness of syntactic structure in negation scope resolution. We replace the parser utilized in the prior works with state-of-the-art parsers and modify the syntax-based heuristic rules. The experimental results demonstrate that the simple modifications enhance the performance of the prior syntax-based method to the same level as state-of-the-art end-to-end neural-based methods.

Paper Recommendation Using Citation Contexts in Scholarly Documents
Tomoki Ikoma | Shigeki Matsubara
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

Automatic Insertion of Commas and Linefeeds into Lecture Transcripts based on Multi-Task Learning
Zhicheng Fang | Masaki Murata | Shigeki Matsubara
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

On the Use of Language Models for Function Identification of Citations in Scholarly Papers
Tomoki Ikoma | Shigeki Matsubara
Proceedings of the Second Workshop on Information Extraction from Scientific Publications

2022

A Model-Theoretic Formalization of Natural Language Inference Using Neural Network and Tableau Method
Ayahito Saji | Yoshihide Kato | Shigeki Matsubara
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

Construction of Responsive Utterance Corpus for Attentive Listening Response Production
Koichiro Ito | Masaki Murata | Tomohiro Ohno | Shigeki Matsubara
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In Japan, the number of single-person households, particularly among the elderly, is increasing. Consequently, opportunities for people to narrate are being reduced. To address this issue, conversational agents, e.g., communication robots and smart speakers, are expected to play the role of the listener. To realize these agents, this paper describes the collection of conversational responses by listeners that demonstrate attentive listening attitudes toward narrative speakers, and a method to annotate existing narrative speech with responsive utterances is proposed. To summarize, 148,962 responsive utterances by 11 listeners were collected in a narrative corpus comprising 13,234 utterance units. The collected responsive utterances were analyzed in terms of response frequency, diversity, coverage, and naturalness. These results demonstrated that diverse and natural responsive utterances were collected by the proposed method in an efficient and comprehensive manner. To demonstrate the practical use of the collected responsive utterances, an experiment was conducted, in which response generation timings were detected in narratives.

Classification of URL Citations in Scholarly Papers for Promoting Utilization of Research Artifacts
Masaya Tsunokake | Shigeki Matsubara
Proceedings of the first Workshop on Information Extraction from Scientific Publications

Utilizing citations for research artifacts (e.g., datasets, software) in scholarly papers contributes to the efficient expansion of research artifact repositories and to various applications, e.g., the search, recommendation, and evaluation of such artifacts. This study focuses on citations using URLs (URL citations) and aims to identify and analyze research artifact citations automatically. This paper addresses the classification task for each URL citation to identify (1) the role that the referenced resources play in research activities, (2) the type of referenced resources, and (3) the reason why the author cited the resources. This paper proposes a classification method using section titles and footnote texts as new input features. We extracted URL citations from international conference papers as experimental data. We performed 5-fold cross-validation using the data and computed the classification performance of our method. The results demonstrate that our method is effective in all tasks. An additional experiment demonstrates that using cited URLs as input features is also effective.
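
The 5-fold cross-validation mentioned in this abstract can be illustrated with a minimal sketch. It only shows how item indices might be partitioned into folds; the function name `k_fold_splits`, the contiguous (unshuffled) fold layout, and the item count of 12 are assumptions for illustration and are not taken from the paper.

```python
def k_fold_splits(n_items, k=5):
    """Yield (train_indices, test_indices) pairs for k contiguous folds.

    Hypothetical helper: the paper does not describe how its folds were built.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


# Illustrative item count only; not the size of the paper's data set.
splits = list(k_fold_splits(12, k=5))
```

Each item appears in exactly one test fold, so every URL citation would be scored once by a model that never saw it during training.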

2021

A New Representation for Span-based CCG Parsing
Yoshihide Kato | Shigeki Matsubara
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

This paper proposes a new representation for CCG derivations. CCG derivations are represented as trees whose nodes are labeled with categories strictly restricted by CCG rule schemata. This characteristic is not suitable for span-based parsing models because they predict node labels independently. In other words, span-based models may generate invalid CCG derivations that violate the rule schemata. Our proposed representation decomposes CCG derivations into several independent pieces and prevents the span-based parsing models from violating the schemata. Our experimental result shows that an off-the-shelf span-based parser with our representation is comparable with previous CCG parsers.

Natural Language Inference using Neural Network and Tableau Method
Ayahito Saji | Daiki Takao | Yoshihide Kato | Shigeki Matsubara
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

2020

Parsing Gapping Constructions Based on Grammatical and Semantic Roles
Yoshihide Kato | Shigeki Matsubara
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

A gapping construction consists of a coordinated structure where redundant elements are elided from all but one conjunct. This paper proposes a method of parsing sentences with gapping to recover elided elements. The proposed method is based on constituent trees annotated with grammatical and semantic roles that are useful for identifying elided elements. Our method outperforms the previous method in terms of F-measure and recall.

Relation between Degree of Empathy for Narrative Speech and Type of Responsive Utterance in Attentive Listening
Koichiro Ito | Masaki Murata | Tomohiro Ohno | Shigeki Matsubara
Proceedings of the Twelfth Language Resources and Evaluation Conference

Nowadays, spoken dialogue agents such as communication robots and smart speakers listen to narratives of humans. In order for such an agent to be recognized as a listener of narratives and convey the attitude of attentive listening, it is necessary to generate responsive utterances. Moreover, responsive utterances can express empathy for narratives, and showing an appropriate degree of empathy is significant for enhancing the speaker’s motivation. The degree of empathy shown by responsive utterances is thought to depend on their type. However, the relation between responsive utterances and degrees of empathy has not been explored yet. This paper describes the classification of responsive utterances based on the degree of empathy in order to explain that relation. In this research, responsive utterances are classified into five levels based on the effect of utterances and literature on attentive listening. Quantitative evaluations using 37,995 responsive utterances showed the appropriateness of the proposed classification.

2019

PTB Graph Parsing with Tree Approximation
Yoshihide Kato | Shigeki Matsubara
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The Penn Treebank (PTB) represents syntactic structures as graphs due to nonlocal dependencies. This paper proposes a method that approximates PTB graph-structured representations by trees. With our approximation method, we can reduce nonlocal dependency identification and constituency parsing to a single tree-based parsing task. An experimental result demonstrates that our approximation method with an off-the-shelf tree-based constituency parser significantly outperforms the previous methods in nonlocal dependency identification.

2018

Model-Theoretic Incremental Interpretation Based on Discourse Representation Theory
Yoshihide Kato | Shigeki Matsubara
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

Statistical Analysis of Missing Translation in Simultaneous Interpretation Using A Large-scale Bilingual Speech Corpus
Zhongxi Cai | Koichiro Ryu | Shigeki Matsubara
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Correcting Errors in a Treebank Based on Tree Mining
Kanta Suzuki | Yoshihide Kato | Shigeki Matsubara
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper provides a new method to correct annotation errors in a treebank. The previous error correction method constructs a pseudo parallel corpus where incorrect partial parse trees are paired with correct ones, and extracts error correction rules from the parallel corpus. By applying these rules to a treebank, the method corrects errors. However, this method does not achieve wide coverage of error correction. To achieve wide coverage, our method adopts a different approach. In our method, we consider that an infrequent pattern which can be transformed to a frequent one is an annotation error pattern. Based on a tree mining technique, our method seeks such infrequent tree patterns, and constructs error correction rules each of which consists of an infrequent pattern and a corresponding frequent pattern. We conducted an experiment using the Penn Treebank. We obtained 1,987 rules which are not constructed by the previous method, and the rules achieved good precision.

Transition-Based Left-Corner Parsing for Identifying PTB-Style Nonlocal Dependencies
Yoshihide Kato | Shigeki Matsubara
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Incremental Semantic Construction Using Normal Form CCG Derivation
Yoshihide Kato | Shigeki Matsubara
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

Japanese Word Reordering Executed Concurrently with Dependency Parsing and Its Evaluation
Tomohiro Ohno | Kazushi Yoshida | Yoshihide Kato | Shigeki Matsubara
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

2014

Japanese Word Reordering Integrated with Dependency Parsing
Kazushi Yoshida | Tomohiro Ohno | Yoshihide Kato | Shigeki Matsubara
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

Dependency Structure for Incremental Parsing of Japanese and Its Application
Tomohiro Ohno | Shigeki Matsubara
Proceedings of the 13th International Conference on Parsing Technologies (IWPT 2013)

2010

Construction of Back-Channel Utterance Corpus for Responsive Spoken Dialogue System Development
Yuki Kamiya | Tomohiro Ohno | Shigeki Matsubara | Hideki Kashioka
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In spoken dialogues, if a spoken dialogue system does not respond at all during the user’s utterances, the user might feel uneasy because the user does not know whether or not the system has recognized the utterances. In particular, back-channel utterances, which the system outputs as voices such as “yeah” and “uh huh” in English, have important roles for a driver in in-car speech dialogues because the driver does not look towards a listener while driving. This paper describes the construction of a back-channel utterance corpus and its analysis, with the aim of developing a system that can output back-channel utterances at the proper timing in responsive in-car speech dialogue. First, we constructed the back-channel utterance corpus by integrating the back-channel utterances that four subjects provided for the driver’s utterances in 60 dialogues in the CIAIR in-car speech dialogue corpus. Next, we analyzed the corpus and revealed the relation between back-channel utterance timings and information on bunsetsu, clause, pause and rate of speech. Based on the analysis, we examined the possibility of detecting back-channel utterance timings by a machine learning technique. As the result of the experiment, we confirmed that our technique achieved the same detection capability as a human.

Construction of Chunk-Aligned Bilingual Lecture Corpus for Simultaneous Machine Translation
Masaki Murata | Tomohiro Ohno | Shigeki Matsubara | Yasuyoshi Inagaki
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

With the development of speech and language processing, speech translation systems have been developed. These studies target spoken dialogues, and employ consecutive interpretation, which uses a sentence as the translation unit. On the other hand, there are a few studies on simultaneous interpreting, and language resources for promoting simultaneous interpreting research have recently been prepared, such as the publication of an analytical large-scale corpus. For the future, it is necessary to make the corpora more practical toward the realization of a simultaneous interpreting system. In this paper, we describe the construction of a bilingual corpus which can be used for simultaneous lecture interpreting research. Simultaneous lecture interpreting systems are required to recognize translation units in the middle of a sentence, and generate their translations at the proper timing. We constructed the bilingual lecture corpus by the following steps. First, we segmented sentences in the lecture data into semantically meaningful units for simultaneous interpreting. Then, we assigned translations to these units from the viewpoint of simultaneous interpreting. In addition, we investigated the possibility of automatically detecting the simultaneous interpreting timing from our corpus.

Collection of Usage Information for Language Resources from Academic Articles
Shunsuke Kozawa | Hitomi Tohyama | Kiyotaka Uchimoto | Shigeki Matsubara
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Recently, language resources (LRs) have become indispensable for linguistic research. However, existing LRs are often not fully utilized because their variety of usage is not well known, indicating that their intrinsic value is not recognized very well either. Regarding this issue, lists of usage information might improve LR searches and lead to their efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes to construct a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. Then, the extracted sentences are annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus.

Coherent Back-Channel Feedback Tagging of In-Car Spoken Dialogue Corpus
Yuki Kamiya | Tomohiro Ohno | Shigeki Matsubara
Proceedings of the SIGDIAL 2010 Conference

Correcting Errors in a Treebank Based on Synchronous Tree Substitution Grammar
Yoshihide Kato | Shigeki Matsubara
Proceedings of the ACL 2010 Conference Short Papers

Automatic Comma Insertion for Japanese Text Generation
Masaki Murata | Tomohiro Ohno | Shigeki Matsubara
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2009

Linefeed Insertion into Japanese Spoken Monologue for Captioning
Tomohiro Ohno | Masaki Murata | Shigeki Matsubara
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Incremental Parsing with Monotonic Adjoining Operation
Yoshihide Kato | Shigeki Matsubara
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

2008

Automatic Acquisition of Usage Information for Language Resources
Shunsuke Kozawa | Hitomi Tohyama | Kiyotaka Uchimoto | Shigeki Matsubara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Recently, language resources (LRs) have become indispensable for linguistic research. Unfortunately, it is not easy to find their usages by searching the web even though they must be described on the Internet or in academic articles. This indicates that the intrinsic value of LRs is not recognized very well. In this research, therefore, we extract a list of usage information for each LR to promote the efficient utilization of LRs. In this paper, we propose a method for extracting a list of usage information from academic articles by using rules based on syntactic information. The rules are generated by focusing on the syntactic features that are observed in the sentences describing usage information. As a result of experiments, we achieved 72.9% in recall and 78.4% in precision for the closed test and 60.9% in recall and 72.7% in precision for the open test.

Construction of a Metadata Database for Efficient Development and Use of Language Resources
Hitomi Tohyama | Shunsuke Kozawa | Kiyotaka Uchimoto | Shigeki Matsubara | Hitoshi Isahara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The National Institute of Information and Communications Technology (NICT) and Nagoya University have been jointly constructing a large-scale database named SHACHI by collecting detailed meta-information on language resources (LRs) in Asia and Western countries, for the purpose of effectively combining LRs. The purpose of this project is to investigate languages, tag sets, and formats compiled in LRs throughout the world, to systematically store LR metadata, to create a search function for this information, and ultimately to utilize all this for more efficient development of LRs. This metadata database contains more than 2,000 compiled LRs such as corpora, dictionaries, thesauruses and lexicons, forming a large-scale archive of LR metadata. Its metadata, an extended version of the OLAC metadata set conforming to Dublin Core, contains detailed meta-information and has been collected semi-automatically. This paper explains the design and the structure of the metadata database, as well as the realization of the catalogue search tool. Additionally, the website of this database is now open to the public and accessible to all Internet users.

Construction and Analysis of Word-level Time-aligned Simultaneous Interpretation Corpus
Takahiro Ono | Hitomi Tohyama | Shigeki Matsubara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, quantitative analyses of the delay in Japanese-to-English (J-E) and English-to-Japanese (E-J) interpretations are described. The Simultaneous Interpretation Database of Nagoya University (SIDB) was used for the analyses. The beginning time and end time of each word were added to the corpus using HMM-based phoneme segmentation, and the time lag between the corresponding words was calculated as the word-level delay. Word-level delay was calculated for 3,722 pairs and 4,932 pairs of words for J-E and E-J interpretations, respectively. The analyses revealed that J-E interpretation has a much larger delay than E-J interpretation and that the difference in word order between Japanese and English affects the degree of delay.
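
The word-level delay described in this abstract can be sketched as follows. This is a minimal illustration, assuming the delay is measured from the start of the source word to the start of its corresponding interpreted word; the abstract does not say which time points are compared. The `TimedWord` class, the function name, and the timestamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    """A word with time alignment in seconds (hypothetical structure)."""
    text: str
    start: float
    end: float


def word_level_delays(pairs):
    """Return the start-to-start lag for each (source, target) word pair.

    Assumption: delay = target.start - source.start. An end-to-end or
    end-to-start definition would be equally consistent with the abstract.
    """
    return [target.start - source.start for source, target in pairs]


# Made-up alignments for illustration; not values from SIDB.
pairs = [
    (TimedWord("kyou", 0.0, 0.5), TimedWord("today", 2.0, 2.5)),
    (TimedWord("kaigi", 0.5, 1.0), TimedWord("meeting", 3.5, 4.0)),
]
delays = word_level_delays(pairs)
mean_delay = sum(delays) / len(delays)
```

Averaging such lags separately over J-E and E-J pairs is one way the two interpreting directions could be compared.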

Construction of an Infrastructure for Providing Users with Suitable Language Resources
Hitomi Tohyama | Shunsuke Kozawa | Kiyotaka Uchimoto | Shigeki Matsubara | Hitoshi Isahara
Coling 2008: Companion volume: Posters

2006

Layered Speech-Act Annotation for Spoken Dialogue Corpus
Yuki Irie | Shigeki Matsubara | Nobuo Kawaguchi | Yukiko Yamaguchi | Yasuyoshi Inagaki
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes the design of speech act tags for spoken dialogue corpora and its evaluation. Compared with the tags used for conventional corpus annotation, the proposed speech intention tag is specialized enough to determine system operations. However, describing information in detail increases the number of tag types, which makes tag selection ambiguous. Therefore, we have designed an organization of tags, focusing on layered tagging and context-dependent tagging. Over 35,000 utterance units in the CIAIR corpus have been tagged by hand. To evaluate the reliability of the intention tag, a tagging experiment was conducted. The reliability of tagging is evaluated by comparing the tagging among several annotators using the kappa value. As a result, we confirmed that reliable data could be built. This corpus with speech intention tags could be widely used from basic research to applications of spoken dialogue. In particular, this would play an important role from the viewpoint of practical use of spoken dialogue corpora.
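
The kappa-based reliability check mentioned in this abstract can be illustrated with a short sketch. It assumes the two-annotator Cohen's kappa; the abstract does not state which kappa variant was used or how many annotators were compared at once. The function name and the tag labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who tagged the same items.

    Assumption: the paper's "kappa value" may be a different variant.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's tag distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical tag throughout
    return (observed - expected) / (1.0 - expected)


# Made-up tags for illustration; not the CIAIR tag set.
annotator_1 = ["request", "confirm", "request", "inform"]
annotator_2 = ["request", "confirm", "inform", "inform"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

A value near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance.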

A Syntactically Annotated Corpus of Japanese Spoken Monologue
Tomohiro Ohno | Shigeki Matsubara | Hideki Kashioka | Naoto Kato | Yasuyoshi Inagaki
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Recently, monologue data such as lectures and commentaries by professionals have been considered valuable intellectual resources and have been attracting attention. To use these monologue data effectively and efficiently, it is necessary for the data not just to be accumulated but also to be structured. This paper describes the construction of a Japanese spoken monologue corpus in which dependency structure is given to each utterance. Spontaneous monologue includes many very long sentences composed of two or more clauses. In these sentences, a subject or an adverb may be common to multiple clauses, and the subject or adverb may be considered to depend on multiple predicates. In order to give the dependency information in a realistic fashion, our research allows a bunsetsu to depend on multiple bunsetsus.

Collection of Simultaneous Interpreting Patterns by Using Bilingual Spoken Monologue Corpus
Hitomi Tohyama | Shigeki Matsubara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper provides an investigation of simultaneous interpreting patterns using a bilingual spoken monologue corpus. 4,578 pairs of English-Japanese aligned utterances in the CIAIR simultaneous interpretation database were used. This investigation is the largest in scale among observations of simultaneous interpreting speech. Simultaneous interpreters are required to generate the target speech simultaneously with the source speech. Therefore, they have various kinds of strategies to raise simultaneity. In this investigation, the simultaneous interpreting patterns with high frequency and high flexibility were extracted from the corpus. As a result, we collected 203 cases out of the aligned utterances in which simultaneous interpreters' strategies for raising simultaneity were observed. These 203 cases could be categorized into 12 types of interpreting pattern. It was clarified that 4.5 percent of the English-Japanese monologue data fitted those interpreting patterns. These interpreting patterns can be expected to be used as interpreting rules for simultaneous machine interpretation.

A Corpus Search System Utilizing Lexical Dependency Structure
Yoshihide Kato | Shigeki Matsubara | Yasuyoshi Inagaki
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper presents a corpus search system utilizing lexical dependency structure. The user's query consists of a sequence of keywords. For a given query, the system automatically generates the dependency structure patterns which consist of keywords in the query, and returns the sentences whose dependency structures match the generated patterns. The dependency structure patterns are generated by using two operations, combining and interpolation, which utilize dependency structures in the searched corpus. The operations enable the system to generate only the dependency structure patterns that occur in the corpus. The system achieves simple and intuitive corpus search, and it is linguistically sophisticated enough to utilize structural information.

Dependency Parsing of Japanese Spoken Monologue Based on Clause Boundaries
Tomohiro Ohno | Shigeki Matsubara | Hideki Kashioka | Takehiko Maruyama | Yasuyoshi Inagaki
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Simultaneous English-Japanese Spoken Language Translation Based on Incremental Dependency Parsing and Transfer
Koichiro Ryu | Shigeki Matsubara | Yasuyoshi Inagaki
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

Construction of Structurally Annotated Spoken Dialogue Corpus
Shingo Kato | Shigeki Matsubara | Yukiko Yamaguchi | Nobuo Kawaguchi
Proceedings of the Fifth Workshop on Asian Language Resources (ALR-05) and First Symposium on Asian Language Resources Network (ALRN)

2004

Stochastically Evaluating the Validity of Partial Parse Trees in Incremental Parsing
Yoshihide Kato | Shigeki Matsubara | Yasuyoshi Inagaki
Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together

2003

Example-based Spoken Dialogue System using WOZ System Log
Hiroya Murao | Nobuo Kawaguchi | Shigeki Matsubara | Yukiko Yamaguchi | Yasuyoshi Inagaki
Proceedings of the Fourth SIGdial Workshop of Discourse and Dialogue

2002

Bilingual Spoken Monologue Corpus for Simultaneous Machine Interpretation Research
Shigeki Matsubara | Akira Takagi | Nobuo Kawaguchi | Yasuyoshi Inagaki
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Multi-Dimensional Data Acquisition for Integrated Acoustic Information Research
Nobuo Kawaguchi | Shigeki Matsubara | Kazuya Takeda | Fumitada Itakura
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Example-based Speech Intention Understanding and Its Application to In-Car Spoken Dialogue System
Shigeki Matsubara | Shinichi Kimura | Nobuo Kawaguchi | Yukiko Yamaguchi | Yasuyoshi Inagaki
COLING 2002: The 19th International Conference on Computational Linguistics

Stochastic Dependency Parsing of Spontaneous Japanese Spoken Language
Shigeki Matsubara | Takahisa Murase | Nobuo Kawaguchi | Yasuyoshi Inagaki
COLING 2002: The 19th International Conference on Computational Linguistics

2001

Efficient Incremental Dependency Parsing
Yoshihide Kato | Shigeki Matsubara | Katsuhiko Toyama | Yasuyoshi Inagaki
Proceedings of the Seventh International Workshop on Parsing Technologies

1997

Utilizing extra-grammatical phenomena in incremental English-Japanese machine translation
Shigeki Matsubara | Yasuyoshi Inagaki
Proceedings of the 7th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages