
Legal-Tech Open Diaries: Lessons learned on how to develop and deploy light-weight models in the era of humongous Language Models
Stelios Maroudas∗† Sotiris Legkas∗† Prodromos Malakasiotis† Ilias Chalkidis‡
†Department of Informatics, Athens University of Economics and Business, Greece
‡Department of Computer Science, University of Copenhagen, Denmark
Cognitiv+, Athens, Greece
∗ Equal contribution. Work done during capstone projects at Cognitiv+ (https://www.cognitivplus.com/).
Abstract
In the era of billion-parameter-sized Language Models (LMs), start-ups have to follow trends and adapt their technology accordingly. Nonetheless, there are open challenges, since the development and deployment of large models come with a need for high computational resources and have economic consequences. In this work, we follow the steps of the R&D group of a modern legal-tech start-up and present important insights on model development and deployment. We start from ground zero by pre-training multiple domain-specific multi-lingual LMs, which are a better fit for contractual and regulatory text than the available alternatives (XLM-R). We present benchmark results for these models on a half-public, half-private legal benchmark comprising 5 downstream tasks, showing the impact of larger model size. Lastly, we examine the impact of a full-scale pipeline for model compression which includes: a) Parameter Pruning, b) Knowledge Distillation, and c) Quantization. The resulting models are much more efficient without sacrificing performance at large.
1 Introduction
Transformer-based Language Models (LMs) (Radford and Narasimhan, 2018; Devlin et al., 2019; Liu et al., 2019) have stormed NLP benchmarks with state-of-the-art performance, while recently humongous billion-parameter-sized models (Brown et al., 2020; Rae et al., 2021; Hoffmann et al., 2022) have showcased impressive few-shot capabilities. In addition, multi-lingual LMs (Conneau et al., 2020) have also been developed, demonstrating exceptional results as well as impressive performance in zero-shot cross-lingual transfer.
The legal NLP literature is also flourishing with the release of many new resources, including large legal corpora (Henderson* et al., 2022), benchmark datasets (Chalkidis et al., 2021a; Koreeda and Manning, 2021; Zheng et al., 2021; Chalkidis et al., 2022; Habernal et al., 2022), and pre-trained legal-oriented language models (Chalkidis et al., 2020; Zheng et al., 2021). Despite this impressive progress, the efficacy of differently-sized language models on legal NLP tasks and the importance of domain (legal) specificity are still understudied, while the effect of model compression techniques on model performance and efficiency has so far been ignored.
In this work, we aim to shed light on all these directions, following model development across three incremental steps in a pipelined approach:
(a) model pre-training on large legal corpora,
(b) model fine-tuning on downstream tasks (see the sketch after this list), and
(c) model compression to improve efficiency.
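As a concrete illustration of step (b), the sketch below fine-tunes a pre-trained encoder on a toy sentence-classification task with the Hugging Face Trainer. The checkpoint (xlm-roberta-base, a public stand-in for the C-XLMs introduced next), the toy data, and the hyper-parameters are illustrative assumptions, not the actual setup used in this work.

    # Minimal fine-tuning sketch; checkpoint, toy data, and hyper-parameters are
    # placeholders rather than the configuration used in this work.
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    checkpoint = "xlm-roberta-base"  # public stand-in for a C-XLM checkpoint
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # Toy two-example dataset with "text" and "label" columns.
    train_data = Dataset.from_dict({
        "text": ["The supplier shall indemnify the customer.",
                 "This agreement is governed by English law."],
        "label": [1, 0],
    })
    train_data = train_data.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=1,
                               per_device_train_batch_size=2, learning_rate=3e-5),
        train_dataset=train_data,
        tokenizer=tokenizer,  # enables dynamic padding via the default collator
    )
    trainer.train()

The same skeleton carries over to the other task types by swapping the task head (e.g., AutoModelForTokenClassification for entity extraction).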
To do so, we initially develop 4 multi-lingual legal-oriented language models (C-XLMs). We benchmark their performance across 5 downstream legal NLP tasks, comprising both publicly available and private datasets, covering both English and multi-lingual scenarios in several task types, i.e., document/sentence classification, natural language inference, and entity extraction. Finally, we experiment with a full-scale pipeline for model compression, which includes a) Parameter Pruning, b) Knowledge Distillation, and c) Quantization, to produce much more efficient (smaller and faster) models that can be effectively deployed in production.
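To make the compression pipeline concrete, the following minimal PyTorch sketch illustrates the three steps in isolation; the helper names, pruning amount, temperature, and mixing weight are illustrative assumptions, not the settings used in this work.

    # Illustrative PyTorch sketch of the three compression steps; function names
    # and hyper-parameter values are placeholders, not the paper's actual setup.
    import torch
    import torch.nn.functional as F
    from torch.nn.utils import prune

    def prune_linear_layers(model: torch.nn.Module, amount: float = 0.3) -> None:
        # a) Parameter Pruning: zero out the lowest-magnitude weights of every
        # linear layer, then make the pruning mask permanent.
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                prune.l1_unstructured(module, name="weight", amount=amount)
                prune.remove(module, "weight")

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature: float = 2.0, alpha: float = 0.5):
        # b) Knowledge Distillation: blend the usual hard-label loss with a
        # soft-label KL term computed against the teacher's logits.
        soft = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1.0 - alpha) * hard

    def quantize(model: torch.nn.Module) -> torch.nn.Module:
        # c) Quantization: convert linear layers to int8 via dynamic quantization.
        return torch.quantization.quantize_dynamic(
            model, {torch.nn.Linear}, dtype=torch.qint8
        )

In practice, the distillation loss would drive a full training loop in which a smaller student learns from a frozen teacher, and quantization is typically applied last, to the model that is actually shipped to production.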
Our work aims to provide guidelines to legal-tech practitioners on model development (pre-training, fine-tuning, compression), taking both performance and efficiency into consideration. Our findings show that the impact of larger vs. smaller models and of domain-specific vs. generic models, as well as the efficacy of model compression techniques, varies across tasks, but in general larger domain-specific models perform better. Via full-scale model compression, we produce models with a performance decrease of 2.3 p.p., while being approx.