Abstract
Automated text classification is a fundamental research topic within the legal domain because it underpins many intelligent legal solutions. Publicly available legal training data are scarce, and classification algorithms struggle to perform in low-data scenarios. Text augmentation techniques have been proposed to enhance classifiers through artificially synthesised training data. In this paper we present and evaluate a combination of rule-based and advanced generative text augmentation methods designed to create additional training data for the task of classifying legal contracts. We introduce a repurposed CUAD contract dataset, modified for the task of document-level classification, and compare a deep learning DistilBERT model with an optimised support vector machine (SVM) baseline to contrast shallow and deep strategies. The deep learning model significantly outperformed the shallow model on the full training data (F1-score of 0.9738 compared to 0.599). We achieved promising improvements when evaluating the combined augmentation techniques on three reduced datasets. Augmentation increased the F1-score by 66.6%, 17.5% and 2.6% for the 25%, 50% and 75% reduced datasets respectively, compared to the non-augmented baseline. We discuss the benefits augmentation can bring to low-data regimes and the need to extend augmentation techniques to preserve key terms in specialised domains such as law.
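As a rough illustration of the rule-based side of this approach, the sketch below implements random swap and random deletion, the two rule-based operations named in the keywords. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the whitespace tokenisation, the `protected` parameter (a hypothetical way to keep key legal terms from being deleted) and the sample clause are all illustrative choices that do not come from the paper.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps randomly chosen position pairs exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out  # nothing to swap in a zero- or one-token text
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng, protected=frozenset()):
    """Drop each token with probability p, never dropping protected terms.

    `protected` is a hypothetical extension: a set of lowercase key terms
    that must survive augmentation.
    """
    kept = [t for t in tokens if t.lower() in protected or rng.random() >= p]
    # Never return an empty text: fall back to one randomly chosen original token.
    return kept if kept else [rng.choice(list(tokens))]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Illustrative clause, not taken from the CUAD dataset.
clause = "The Licensee shall indemnify the Licensor against all claims".split()

swapped = random_swap(clause, n_swaps=2, rng=rng)
deleted = random_deletion(
    clause, p=0.3, rng=rng, protected={"licensee", "licensor", "indemnify"}
)
```

Each augmented token list would be joined back into text and added to the training set under the label of its source document. The swap count and deletion probability here are arbitrary placeholders and would need tuning for a real dataset.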
Original language | English |
---|---|
Article number | 102798 |
Number of pages | 22 |
Journal | Knowledge and Information Systems |
Early online date | 26 May 2025 |
Publication status | Published online - 26 May 2025 |
Bibliographical note
Publisher Copyright: © The Author(s) 2025.
Data Access Statement
For this study, the original CUAD dataset was repurposed as a document classification problem. The updated CUAD dataset is available upon request from the contact author (email: [email protected]).
Keywords
- Text Augmentation
- Legal Document Classification
- Random Swap
- Random Deletion
- Paraphrasing
- Round Trip Translation
- Deep Learning
- DistilBERT
- SVM