LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development

Ilias Chalkidis, Nicolas Garneau, Catalina Goanta, Daniel Katz, Anders Søgaard


Abstract
In this work, we conduct a detailed analysis of the performance of legal-oriented pre-trained language models (PLMs). We examine the interplay between their original objective, acquired knowledge, and legal language understanding capacities, which we define as the upstream, probing, and downstream performance, respectively. We consider not only the models’ size but also the pre-training corpora used as important dimensions in our study. To this end, we release a multinational English legal corpus (LeXFiles) and a legal knowledge probing benchmark (LegalLAMA) to facilitate training and detailed analysis of legal-oriented PLMs. We release two new legal PLMs trained on LeXFiles and evaluate them alongside others on LegalLAMA and LexGLUE. We find that probing performance strongly correlates with upstream performance in related legal topics. On the other hand, downstream performance is mainly driven by the model’s size and prior legal knowledge, which can be estimated by upstream and probing performance. Based on these findings, we conclude that both dimensions are important for those seeking to develop domain-specific PLMs.
Anthology ID:
2023.acl-long.865
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
15513–15535
URL:
https://aclanthology.org/2023.acl-long.865
DOI:
10.18653/v1/2023.acl-long.865
Cite (ACL):
Ilias Chalkidis, Nicolas Garneau, Catalina Goanta, Daniel Katz, and Anders Søgaard. 2023. LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15513–15535, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development (Chalkidis et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-long.865.pdf
Video:
https://aclanthology.org/2023.acl-long.865.mp4