Proceedings of the Natural Legal Language Processing Workshop 2023

Daniel Preoțiuc-Pietro, Catalina Goanta, Ilias Chalkidis, Leslie Barrett, Gerasimos (Jerry) Spanakis, Nikolaos Aletras (Editors)


Anthology ID:
2023.nllp-1
Month:
December
Year:
2023
Address:
Singapore
Venues:
NLLP | WS
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2023.nllp-1

Anthropomorphization of AI: Opportunities and Risks
Ameet Deshpande | Tanmay Rajpurohit | Karthik Narasimhan | Ashwin Kalyan

Anthropomorphization is the tendency to attribute human-like traits to non-human entities. It is prevalent in many social contexts – children anthropomorphize toys, adults do so with brands, and it is a literary device. It is also a versatile tool in science, with behavioral psychology and evolutionary biology meticulously documenting its consequences. With widespread adoption of AI systems, and the push from stakeholders to make it human-like through alignment techniques, human voice, and pictorial avatars, the tendency for users to anthropomorphize it increases significantly. We take a dyadic approach to understanding this phenomenon with large language models (LLMs) by studying (1) the objective legal implications, as analyzed through the lens of the recent blueprint of AI bill of rights, and (2) the subtle psychological aspects of customization and anthropomorphization. We find that anthropomorphized LLMs customized for different user bases violate multiple provisions in the legislative blueprint. In addition, we point out that anthropomorphization of LLMs affects the influence they can have on their users, thus having the potential to fundamentally change the nature of human-AI interaction, with potential for manipulation and negative influence. With LLMs being hyper-personalized for vulnerable groups like children and patients among others, our work is a timely and important contribution. We propose a conservative strategy for the cautious use of anthropomorphization to improve trustworthiness of AI systems.

NOMOS: Navigating Obligation Mining in Official Statutes
Andrea Pennisi | Elvira González Hernández | Nina Koivula

The process of identifying obligations in a legal text is not a straightforward task, because not only are the documents long, but the sentences therein are long as well. As a result of long elements in the text, law is more difficult to interpret (Coupette et al., 2021). Moreover, the identification of obligations relies not only on the clarity and precision of the language used but also on the unique perspectives, experiences, and knowledge of the reader. In particular, this paper addresses the problem of identifying obligations using machine and deep learning approaches, showing a full comparison between both methodologies and proposing a new approach called NOMOS based on the combination of Positional Embeddings (PE) and Temporal Convolutional Networks (TCNs). Quantitative and qualitative experiments, conducted on legal regulations, demonstrate the effectiveness of the proposed approach.

Long Text Classification using Transformers with Paragraph Selection Strategies
Mohit Tuteja | Daniel González Juclà

In the legal domain, we often perform classification tasks on very long documents, for example court judgements. These documents often contain thousands of words, so their length poses a challenge for this modelling task. In this research paper, we present a comprehensive evaluation of various strategies to perform long text classification using Transformers in conjunction with strategies to select document chunks using traditional NLP models. We conduct our experiments on 6 benchmark datasets comprising lengthy documents, 4 of which are publicly available. Each dataset has a median word count exceeding 1,000. Our evaluation encompasses state-of-the-art Transformer models, such as RoBERTa, Longformer, HAT, MEGA and LegalBERT, and compares them with a traditional baseline TF-IDF + Neural Network (NN) model. We investigate the effectiveness of pre-training on large corpora, fine-tuning strategies, and transfer learning techniques in the context of long text classification.
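
As a rough illustration of what selecting document chunks with a traditional model can look like, the sketch below ranks chunks by their mean TF-IDF weight and keeps the top `k` in original order. This is an assumption-laden toy, not the selection strategy evaluated in the paper: the function name `select_chunks`, the whitespace tokenizer, the scoring rule, and the example chunks are all hypothetical.

```python
import math
from collections import Counter


def select_chunks(chunks, k):
    """Return the k chunks with the highest mean TF-IDF weight, in original order.

    Hypothetical scoring rule for illustration only: each token contributes
    tf * log(N / df), and the sum is averaged over the chunk length.
    """
    tokenized = [chunk.lower().split() for chunk in chunks]
    n_chunks = len(chunks)
    # Document frequency: in how many chunks does each token occur?
    df = Counter(tok for toks in tokenized for tok in set(toks))

    def score(toks):
        if not toks:
            return 0.0
        tf = Counter(toks)
        total = sum(cnt * math.log(n_chunks / df[tok]) for tok, cnt in tf.items())
        return total / len(toks)

    # Rank chunk indices by score, keep the best k, then restore document order.
    ranked = sorted(range(n_chunks), key=lambda i: score(tokenized[i]), reverse=True)
    return [chunks[i] for i in sorted(ranked[:k])]


# Hypothetical chunks of a court ruling.
chunks = [
    "the court the court the court",
    "the appellant seeks custody of the child",
    "the court dismissed the appeal with costs",
]
selected = select_chunks(chunks, k=2)
```

The selected chunks would then be concatenated and passed to a length-limited Transformer; repetitive chunks score low because their tokens occur in most other chunks.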

Do Language Models Learn about Legal Entity Types during Pretraining?
Claire Barale | Michael Rovatsos | Nehal Bhuta

Language Models (LMs) have proven their ability to acquire diverse linguistic knowledge during the pretraining phase, potentially serving as a valuable source of incidental supervision for downstream tasks. However, there has been limited research conducted on the retrieval of domain-specific knowledge, and specifically legal knowledge. We propose to explore the task of Entity Typing, serving as a proxy for evaluating legal knowledge as an essential aspect of text comprehension, and a foundational task to numerous downstream legal NLP applications. Through systematic evaluation and analysis with two types of prompting (cloze sentences and QA-based templates), and to clarify the nature of these acquired cues, we compare diverse types and lengths of entities (both general and domain-specific), semantic or syntactic signals, and different LM pretraining corpora (generic and legal-oriented) and architectures (encoder BERT-based and decoder-only with Llama2). We show that (1) Llama2 performs well on certain entities and exhibits potential for substantial improvement with optimized prompt templates, (2) law-oriented LMs show inconsistent performance, possibly due to variations in their training corpus, (3) LMs demonstrate the ability to type entities even in the case of multi-token entities, (4) all models struggle with entities belonging to sub-domains of the law, and (5) Llama2 appears to frequently overlook syntactic cues, a shortcoming less present in BERT-based architectures.

Pretrained Language Models v. Court Ruling Predictions: A Case Study on a Small Dataset of French Court of Appeal Rulings
Olivia Vaudaux | Caroline Bazzoli | Maximin Coavoux | Géraldine Vial | Étienne Vergès

NLP systems are increasingly used in the law domain, either by legal institutions or by the industry. As a result, there is a pressing need to characterize their strengths and weaknesses and understand their inner workings. This article presents a case study on the task of judicial decision prediction, on a small dataset from French Courts of Appeal. Specifically, our dataset of around 1000 decisions is about the habitual place of residency of children from divorced parents. The task consists in predicting, from the facts and reasons of the documents, whether the court rules that children should live with their mother or their father. Instead of feeding the whole document to a classifier, we carefully construct the dataset to make sure that the input to the classifier does not contain any ‘spoilers’ (it is often the case in court rulings that information all along the document mentions the final decision). Our results are mostly negative: even classifiers based on French pretrained language models (Flaubert, JuriBERT) do not classify the decisions with a reasonable accuracy. However, they can extract the decision when it is part of the input. With regard to these results, we argue that there is a strong caveat when constructing legal NLP datasets automatically.

Italian Legislative Text Classification for Gazzetta Ufficiale
Marco Rovera | Alessio Palmero Aprosio | Francesco Greco | Mariano Lucchese | Sara Tonelli | Antonio Antetomaso

This work introduces a novel, extensive annotated corpus for multi-label legislative text classification in Italian, based on legal acts from the Gazzetta Ufficiale, the official source of legislative information of the Italian state. The annotated dataset, which we released to the community, comprises over 363,000 titles of legislative acts, spanning over 30 years from 1988 until 2022. Moreover, we evaluate four models for text classification on the dataset, demonstrating how using only the acts’ titles can achieve top-level classification performance, with a micro F1-score of 0.87. Also, our analysis shows how Italian domain-adapted legal models do not outperform general-purpose models on the task. Models’ performance can be checked by users via a demonstrator system provided in support of this work.

Mixed-domain Language Modeling for Processing Long Legal Documents
Wenyue Hua | Yuchen Zhang | Zhe Chen | Josie Li | Melanie Weber

The application of Natural Language Processing (NLP) to specialized domains, such as the law, has recently received a surge of interest. As many legal services rely on processing and analyzing large collections of documents, automating such tasks with NLP tools such as language models emerges as a key challenge since legal documents may contain specialized vocabulary from other domains, such as medical terminology in personal injury text. However, most language models are general-purpose models, which either have limited reasoning capabilities on highly specialized legal terminology and syntax, such as BERT or ROBERTA, or are expensive to run and tune, such as GPT-3.5 and Claude. Thus, in this paper, we propose a specialized language model for personal injury text, LEGALRELECTRA, which is trained on mixed-domain legal and medical corpora. We show that as a small language model, our model improves over general-domain and single-domain medical and legal language models when processing mixed-domain (personal injury) text. Our training architecture implements the ELECTRA framework but utilizes REFORMER instead of BERT for its generator and discriminator. We show that this improves the model’s performance on processing long passages and results in better long-range text comprehension.

Questions about Contracts: Prompt Templates for Structured Answer Generation
Adam Roegiest | Radha Chitta | Jonathan Donnelly | Maya Lash | Alexandra Vtyurina | Francois Longtin

Finding the answers to legal questions about specific clauses in contracts is an important analysis in many legal workflows (e.g., understanding market trends, due diligence, risk mitigation), but more important is being able to do this at scale. In this paper, we present an examination of using large language models to produce (partially) structured answers to legal questions, primarily in the form of multiple choice and multiple select. We first show that traditional semantic matching is unable to perform this task at acceptable accuracy and then show how question-specific prompts can achieve reasonable accuracy across a range of generative models. Finally, we show that much of this effectiveness can be maintained when generalized prompt templates are used rather than question-specific ones.

Legal Judgment Prediction: If You Are Going to Do It, Do It Right
Masha Medvedeva | Pauline Mcbride

The field of Legal Judgment Prediction (LJP) has witnessed significant growth in the past decade, with over 100 papers published in the past three years alone. Our comprehensive survey of over 150 papers reveals a stark reality: only ~7% of published papers are doing what they set out to do - predict court decisions. We delve into the reasons behind the flawed and unreliable nature of the remaining experiments, emphasising their limited utility in the legal domain. We examine the distinctions between predicting court decisions and the practices of legal professionals in their daily work. We explore how a lack of attention to the identity and needs of end-users has fostered the misconception that LJP is a near-solved challenge suitable for practical application, and contributed to the surge in academic research in the field. To address these issues, we examine three different dimensions of ‘doing LJP right’: using data appropriate for the task; tackling explainability; and adopting an application-centric approach to model reporting and evaluation. We formulate a practical checklist of recommendations, delineating the characteristics that are required if a judgment prediction system is to be a valuable addition to the legal field.

Beyond The Text: Analysis of Privacy Statements through Syntactic and Semantic Role Labeling
Yan Shvartzshanider | Ananth Balashankar | Thomas Wies | Lakshminarayanan Subramanian

This paper formulates a new task of extracting privacy parameters from a privacy policy, through the lens of Contextual Integrity (CI), an established social theory framework for reasoning about privacy norms. Through extensive experiments, we further show that incorporating CI-based domain-specific knowledge into a BERT-based SRL model results in the highest precision and recall, achieving an F1 score of 84%. With our work, we would like to motivate new research in building NLP applications for the privacy domain.

Towards Mitigating Perceived Unfairness in Contracts from a Non-Legal Stakeholder’s Perspective
Anmol Singhal | Preethu Rose Anish | Shirish Karande | Smita Ghaisas

Commercial contracts are known to be a valuable source for deriving project-specific requirements. However, contract negotiations mainly occur among the legal counsel of the parties involved. The participation of non-legal stakeholders, including requirement analysts, engineers, and solution architects, whose primary responsibility lies in ensuring the seamless implementation of contractual terms, is often indirect and inadequate. Consequently, a significant number of sentences in contractual clauses, though legally accurate, can appear unfair from an implementation perspective to non-legal stakeholders. This perception poses a problem since requirements indicated in the clauses are obligatory and can involve punitive measures and penalties if not implemented as committed in the contract. Therefore, the identification of potentially unfair clauses in contracts becomes crucial. In this work, we conduct an empirical study to analyze the perspectives of different stakeholders regarding contractual fairness. We then investigate the ability of Pre-trained Language Models (PLMs) to identify unfairness in contractual sentences by comparing chain of thought prompting and semi-supervised fine-tuning approaches. Using BERT-based fine-tuning, we achieved an accuracy of 84% on a dataset consisting of proprietary contracts. It outperformed chain of thought prompting using Vicuna-13B by a margin of 9%.

Connecting Symbolic Statutory Reasoning with Legal Information Extraction
Nils Holzenberger | Benjamin Van Durme

Statutory reasoning is the task of determining whether a given law – a part of a statute – applies to a given legal case. Previous work has shown that structured, logical representations of laws and cases can be leveraged to solve statutory reasoning, including on the StAtutory Reasoning Assessment dataset (SARA), but this relies on costly human translation into structured representations. Here, we investigate a form of legal information extraction atop the SARA cases, illustrating how the task can be done with high performance. Further, we show how the performance of downstream symbolic reasoning directly correlates with the quality of the information extraction.

Retrieval-based Evaluation for LLMs: A Case Study in Korean Legal QA
Cheol Ryu | Seolhwa Lee | Subeen Pang | Chanyeol Choi | Hojun Choi | Myeonggee Min | Jy-Yong Sohn

While large language models (LLMs) have demonstrated significant capabilities in text generation, their utilization in areas requiring domain-specific expertise, such as law, must be approached cautiously. This caution is warranted due to the inherent challenges associated with LLM-generated texts, including the potential presence of factual errors. Motivated by this issue, we propose Eval-RAG, a new evaluation method for LLM-generated texts. Unlike existing methods, Eval-RAG evaluates the validity of generated texts based on the related documents that are collected by the retriever. In other words, Eval-RAG adopts the idea of retrieval augmented generation (RAG) for the purpose of evaluation. Our experimental results on Korean Legal Question-Answering (QA) tasks show that conventional LLM-based evaluation methods can be better aligned with lawyers’ evaluations when combined with Eval-RAG. In addition, our qualitative analysis shows that Eval-RAG successfully finds the factual errors in LLM-generated texts, while existing evaluation methods cannot.

Legal NLP Meets MiCAR: Advancing the Analysis of Crypto White Papers
Carolina Camassa

In the rapidly evolving field of crypto assets, white papers are essential documents for investor guidance, and are now subject to unprecedented content requirements under the European Union’s Markets in Crypto-Assets Regulation (MiCAR). Natural Language Processing (NLP) can serve as a powerful tool for both analyzing these documents and assisting in regulatory compliance. This paper delivers two contributions to the topic. First, we survey existing applications of textual analysis to unregulated crypto asset white papers, uncovering a research gap that could be bridged with interdisciplinary collaboration. We then conduct an analysis of the changes introduced by MiCAR, highlighting the opportunities and challenges of integrating NLP within the new regulatory framework. The findings set the stage for further research, with the potential to benefit regulators, crypto asset issuers, and investors.

Low-Resource Deontic Modality Classification in EU Legislation
Kristina Minkova | Shashank Chakravarthy | Gijs Dijck

In law, it is important to distinguish between obligations, permissions, prohibitions, rights, and powers. These categories are called deontic modalities. This paper evaluates the performance of two deontic modality classification models, LEGAL-BERT and a Fusion model, in a low-resource setting. To create a generalized dataset for multi-class classification, we extracted random provisions from European Union (EU) legislation. By fine-tuning previously researched and published models, we evaluate their performance on our dataset against fusion models designed for low-resource text classification. We incorporate focal loss as an alternative for cross-entropy to tackle issues of class imbalance. The experiments indicate that the fusion model performs better for both balanced and imbalanced data with a macro F1-score of 0.61 for imbalanced data, 0.62 for balanced data, and 0.55 with focal loss for imbalanced data. When focusing on accuracy, our experiments indicate that the fusion model performs better with scores of 0.91 for imbalanced data, 0.78 for balanced data, and 0.90 for imbalanced data with focal loss.
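
For readers unfamiliar with focal loss, the sketch below contrasts it with cross-entropy on a single example. It follows the standard formulation of Lin et al. (2017), FL(p) = -α (1 - p)^γ log p, where p is the probability assigned to the true class; the values γ = 2 and α = 1 and the two example probabilities are illustrative assumptions, not the settings used in the paper.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for one example, given the probability of the true class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss: cross-entropy scaled by (1 - p)^gamma.

    gamma and alpha are illustrative defaults (assumptions), not values
    reported in the abstract above.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


# An easy example (confidently correct) and a hard one (mostly wrong).
easy, hard = 0.9, 0.1

# How much more the hard example weighs than the easy one under each loss.
ce_ratio = cross_entropy(hard) / cross_entropy(easy)
fl_ratio = focal_loss(hard) / focal_loss(easy)
```

With γ = 0 the focal loss reduces to cross-entropy. For γ > 0 the relative weight of hard examples grows (`fl_ratio` exceeds `ce_ratio`), which is why the loss is commonly tried when frequent, easy classes dominate training.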

Automatic Anonymization of Swiss Federal Supreme Court Rulings
Joel Niklaus | Robin Mamié | Matthias Stürmer | Daniel Brunner | Marcel Gygli

Releasing court decisions to the public relies on proper anonymization to protect all involved parties, where necessary. The Swiss Federal Supreme Court relies on an existing system that combines different traditional computational methods with human experts. In this work, we enhance the existing anonymization software using a large dataset annotated with entities to be anonymized. We compared BERT-based models with models pre-trained on in-domain data. Our results show that using in-domain data to pre-train the models further improves the F1-score by more than 5% compared to existing models. Our work demonstrates that combining existing anonymization methods, such as regular expressions, with machine learning can further reduce manual labor and enhance automatic suggestions.

Exploration of Open Large Language Models for eDiscovery
Sumit Pai | Sounak Lahiri | Ujjwal Kumar | Krishanu Baksi | Elijah Soba | Michael Suesserman | Nirmala Pudota | Jon Foster | Edward Bowen | Sanmitra Bhattacharya

The rapid advancement of Generative Artificial Intelligence (AI), particularly Large Language Models (LLMs), has led to their widespread adoption for various natural language processing (NLP) tasks. One crucial domain ripe for innovation is the Technology-Assisted Review (TAR) process in Electronic discovery (eDiscovery). Traditionally, TAR involves manual review and classification of documents for relevance over large document collections for litigations and investigations. This process is aided by machine learning and NLP tools which require extensive training and fine-tuning. In this paper, we explore the application of LLMs to TAR, specifically for predictive coding. We experiment with out-of-the-box prompting and fine-tuning of LLMs using parameter-efficient techniques. We conduct experiments using open LLMs and compare them to commercially-licensed ones. Our experiments demonstrate that open LLMs lag behind commercially-licensed models in relevance classification using out-of-the-box prompting. However, topic-specific instruction tuning of open LLMs not only improves their effectiveness but can often outperform their commercially-licensed counterparts in performance evaluations. Additionally, we conduct a user study to gauge the preferences of our eDiscovery Subject Matter Specialists (SMS) regarding human-authored versus model-generated reasoning. We demonstrate that instruction-tuned open LLMs can generate high quality reasonings that are comparable to commercial LLMs.

Retrieval-Augmented Chain-of-Thought in Semi-structured Domains
Vaibhav Mavi | Abulhair Saparov | Chen Zhao

Applying existing question answering (QA) systems to specialized domains like law and finance presents challenges that necessitate domain expertise. Although large language models (LLMs) have shown impressive language comprehension and in-context learning capabilities, their inability to handle very long inputs/contexts is well known. Tasks specific to these domains need significant background knowledge, leading to contexts that can often exceed the maximum length that existing LLMs can process. This study explores leveraging the semi-structured nature of legal and financial data to efficiently retrieve relevant context, enabling the use of LLMs for domain-specialized QA. The resulting system outperforms contemporary models and also provides useful explanations for the answers, encouraging the integration of LLMs into legal and financial NLP systems for future research.

Joint Learning for Legal Text Retrieval and Textual Entailment: Leveraging the Relationship between Relevancy and Affirmation
Nguyen Hai Long | Thi Hai Yen Vuong | Ha Thanh Nguyen | Xuan-Hieu Phan

In legal text processing and reasoning, one normally performs information retrieval to find relevant documents of an input question, and then performs textual entailment to answer the question. The former is about relevancy whereas the latter is about affirmation (or conclusion). While relevancy and affirmation are two different concepts, there is obviously a connection between them. That is why performing retrieval and textual entailment sequentially and independently may not make the most of this mutually supportive relationship. This paper, therefore, proposes a multi-task learning model for these two tasks to improve their performance. Technically, in the COLIEE dataset, we use the information of Task 4 (conclusions) to improve the performance of Task 3 (searching for legal provisions related to the question). Our empirical findings indicate that this supportive relationship truly exists. This important insight sheds light on how leveraging the relationship between tasks can significantly enhance the effectiveness of our multi-task learning approach for legal text processing.

Super-SCOTUS: A multi-sourced dataset for the Supreme Court of the US
Biaoyan Fang | Trevor Cohn | Timothy Baldwin | Lea Frermann

Given the complexity of the judiciary in the US Supreme Court, various procedures, along with various resources, contribute to the court system. However, most research focuses on a limited set of resources, e.g., court opinions or oral arguments, for analyzing a specific perspective in court, e.g., partisanship or voting. To gain a fuller understanding of these perspectives in the legal system of the US Supreme Court, a more comprehensive dataset, connecting different sources in different phases of the court procedure, is needed. To address this gap, we present a multi-sourced dataset for the Supreme Court, comprising court resources from different procedural phases, connecting language documents with extensive metadata. We showcase its utility through a case study on how different court documents reveal the decision direction (conservative vs. liberal) of the cases. We analyze performance differences across three protected attributes, indicating that different court resources encode different biases, and reinforcing that considering various resources provides a fuller picture of the court procedures. We further discuss how our dataset can contribute to future research directions.

Transferring Legal Natural Language Inference Model from a US State to Another: What Makes It So Hard?
Alice Kwak | Gaetano Forte | Derek Bambauer | Mihai Surdeanu

This study investigates whether a legal natural language inference (NLI) model trained on the data from one US state can be transferred to another state. We fine-tuned a pre-trained model on the task of evaluating the validity of legal will statements, once with the dataset containing the Tennessee wills and once with the dataset containing the Idaho wills. Each model’s performance in the in-domain setting and the out-of-domain setting is compared to see if the models can transfer across the states. We found that the model trained on one US state can be mostly transferred to another state. However, it is clear that the model’s performance drops in the out-of-domain setting. The F1 scores of the Tennessee model and the Idaho model are 96.41 and 92.03 when predicting the data from the same state, but they drop to 66.32 and 81.60 when predicting the data from another state. Subsequent error analysis revealed that there are two major sources of errors. First, the model fails to recognize equivalent laws across states when there are stylistic differences between laws. Second, differences in the statutory section numbering system between the states make it difficult for the model to locate laws relevant to the cases being predicted on. This analysis provides insights on how future NLI systems can be improved. Also, our findings offer empirical support to legal experts advocating the standardization of legal documents.

Large Language Models are legal but they are not: Making the case for a powerful LegalLLM
Thanmay Jayakumar | Fauzan Farooqui | Luqman Farooqui

Applying recent advances in Natural Language Processing (NLP) to the legal sector poses challenging problems such as extremely long sequence lengths, specialized vocabulary that is usually only understood by legal professionals, and high amounts of data imbalance. The recent surge of Large Language Models (LLMs) has begun to provide new opportunities to apply NLP in the legal domain due to their ability to handle lengthy, complex sequences. Moreover, the emergence of domain-specific LLMs has displayed extremely promising results on various tasks. In this study, we aim to quantify how general LLMs perform in comparison to legal-domain models (be it an LLM or otherwise). Specifically, we compare the zero-shot performance of three general-purpose LLMs (ChatGPT-3.5, LLaMA-70b and Falcon-180b) on the LEDGAR subset of the LexGLUE benchmark for contract provision classification. Although the LLMs were not explicitly trained on legal data, we observe that they are still able to classify the theme correctly in most cases. However, we find that their mic-F1/mac-F1 performance is up to 19.2/26.8% lower than that of smaller models fine-tuned on the legal domain, thus underscoring the need for more powerful legal-domain LLMs.
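
Several abstracts in this volume report micro- and macro-averaged F1, and the gap between the two matters under class imbalance. The sketch below computes both for a single-label classification task using only the standard definitions. The function name `micro_macro_f1` and the toy labels are hypothetical and unrelated to LEDGAR.

```python
from collections import Counter


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[g] += 1  # gold label gets a false negative

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Toy imbalanced data: a classifier that always predicts the majority class.
gold = ["A"] * 8 + ["B"] * 2
pred = ["A"] * 10
micro, macro = micro_macro_f1(gold, pred)
```

On this toy data the micro score stays high because the majority class dominates the pooled counts, while the macro score is pulled down by the class that is never predicted. In the single-label setting, micro F1 coincides with accuracy.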

On the Potential and Limitations of Few-Shot In-Context Learning to Generate Metamorphic Specifications for Tax Preparation Software
Dananjay Srinivas | Rohan Das | Saeid Tizpaz-Niari | Ashutosh Trivedi | Maria Leonor Pacheco

Due to the ever-increasing complexity of income tax laws in the United States, the number of US taxpayers filing their taxes using tax preparation software (henceforth, tax software) continues to increase. According to the U.S. Internal Revenue Service (IRS), in FY22, nearly 50% of taxpayers filed their individual income taxes using tax software. Given the legal consequences of incorrectly filing taxes for the taxpayer, ensuring the correctness of tax software is of paramount importance. Metamorphic testing has emerged as a leading solution to test and debug legal-critical tax software due to the absence of correctness requirements and trustworthy datasets. The key idea behind metamorphic testing is to express the properties of a system in terms of the relationship between one input and its slightly metamorphosed twinned input. Extracting metamorphic properties from IRS tax publications is a tedious and time-consuming process. As a response, this paper formulates the task of generating metamorphic specifications as a translation task between properties extracted from tax documents - expressed in natural language - to a contrastive first-order logic form. We perform a systematic analysis on the potential and limitations of in-context learning with Large Language Models (LLMs) for this task, and outline a research agenda towards automating the generation of metamorphic specifications for tax preparation software.
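
The idea of a metamorphic relation between an input and its twin can be made concrete with a small sketch. Everything below is hypothetical: the two-bracket `toy_tax` function, its rates and threshold, and the faulty `buggy_tax` variant are invented for illustration and do not reflect real tax law or the specifications studied in the paper. The relation checked is that a slightly higher income should never produce a lower tax.

```python
def toy_tax(income: float) -> float:
    """Hypothetical two-bracket tax: 10% up to 10,000 and 20% above it."""
    low = min(income, 10_000.0)
    high = max(income - 10_000.0, 0.0)
    return 0.10 * low + 0.20 * high


def buggy_tax(income: float) -> float:
    """Deliberately faulty variant: the rate drops once income passes 10,000."""
    if income <= 10_000.0:
        return 0.10 * income
    return 0.05 * income


def holds_monotonicity(tax_fn, income: float, delta: float = 100.0) -> bool:
    """Metamorphic relation: the twin input (income + delta) must not owe less."""
    return tax_fn(income + delta) >= tax_fn(income)


# Probe a grid of incomes and collect those whose twin violates the relation.
incomes = range(0, 20_000, 50)
toy_violations = [x for x in incomes if not holds_monotonicity(toy_tax, x)]
buggy_violations = [x for x in incomes if not holds_monotonicity(buggy_tax, x)]
```

The check never needs to know the correct tax for any input; it only compares the two outputs, which is what makes the approach usable when no trustworthy oracle exists.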

AsyLex: A Dataset for Legal Language Processing of Refugee Claims
Claire Barale | Mark Klaisoongnoen | Pasquale Minervini | Michael Rovatsos | Nehal Bhuta

Advancements in natural language processing (NLP) and language models have demonstrated immense potential in the legal domain, enabling automated analysis and comprehension of legal texts. However, developing robust models in Legal NLP is significantly challenged by the scarcity of resources. This paper presents AsyLex, the first dataset specifically designed for Refugee Law applications to address this gap. The dataset introduces 59,112 documents on refugee status determination in Canada from 1996 to 2022, providing researchers and practitioners with essential material for training and evaluating NLP models for legal research and case review. Case review is defined as entity extraction and outcome prediction tasks. The dataset includes 19,115 gold-standard human-labeled annotations for 20 legally relevant entity types curated with the help of legal experts and 1,682 gold-standard labeled documents for the case outcome. Furthermore, we supply the corresponding trained entity extraction models and the resulting labeled entities generated through the inference process on AsyLex. Four supplementary features are obtained through rule-based extraction. We demonstrate the usefulness of our dataset on the legal judgment prediction task to predict the binary outcome and test a set of baselines using the text of the documents and our annotations. We observe that models pretrained on similar legal documents reach better scores, suggesting that acquiring more datasets for specialized domains such as law is crucial.

A Comparative Study of Prompting Strategies for Legal Text Classification
Ali Hakimi Parizi | Yuyang Liu | Prudhvi Nokku | Sina Gholamian | David Emerson

In this study, we explore the performance of large language models (LLMs) using different prompt engineering approaches in the context of legal text classification. Prior research has demonstrated that various prompting techniques can improve the performance of a diverse array of tasks done by LLMs. However, in this research, we observe that professional documents, and in particular legal documents, pose unique challenges for LLMs. We experiment with several LLMs and various prompting techniques, including zero/few-shot prompting, prompt ensembling, chain-of-thought, and activation fine-tuning and compare the performance on legal datasets. Although the new generation of LLMs and prompt optimization techniques have been shown to improve generation and understanding of generic tasks, our findings suggest that such improvements may not readily transfer to other domains. Specifically, experiments indicate that not all prompting approaches and models are well-suited for the legal domain which involves complexities such as long documents and domain-specific language.

Tracing Influence at Scale: A Contrastive Learning Approach to Linking Public Comments and Regulator Responses
Linzi Xing | Brad Hackinen | Giuseppe Carenini

U.S. Federal Regulators receive over one million comment letters each year from businesses, interest groups, and members of the public, all advocating for changes to proposed regulations. These comments are believed to have wide-ranging impacts on public policy. However, measuring the impact of specific comments is challenging because regulators are required to respond to comments but they do not have to specify which comments they are addressing. In this paper, we propose a simple yet effective solution to this problem by using an iterative contrastive method to train a neural model that matches text from public comments to responses written by regulators. We demonstrate that our proposal substantially outperforms a set of selected text-matching baselines on a human-annotated test set. Furthermore, it delivers performance comparable to the most advanced gigantic language model (i.e., GPT-4), and is more cost-effective when matching comments and regulator responses at larger scale.