Legal-Tech Open Diaries: Lesson learned on how to develop and deploy
light-weight models in the era of humongous Language Models
Stelios Maroudas∗† Sotiris Legkas∗†
Prodromos Malakasiotis† Ilias Chalkidis‡
†Department of Informatics, Athens University of Economics and Business, Greece
‡Department of Computer Science, University of Copenhagen, Denmark
Cognitiv+, Athens, Greece
Abstract
In the era of billion-parameter-sized Language Models (LMs), start-ups have to follow trends and adapt their technology accordingly. Nonetheless, there are open challenges, since the development and deployment of large models come with a need for high computational resources and have economic consequences. In this work, we follow the steps of the R&D group of a modern legal-tech start-up and present important insights on model development and deployment. We start from ground zero by pre-training multiple domain-specific multi-lingual LMs which are a better fit to contractual and regulatory text compared to the available alternatives (XLM-R). We present benchmark results of such models on a half-public, half-private legal benchmark comprising 5 downstream tasks, showing the impact of larger model size. Lastly, we examine the impact of a full-scale pipeline for model compression which includes: a) Parameter Pruning, b) Knowledge Distillation, and c) Quantization. The resulting models are much more efficient without sacrificing performance at large.
1 Introduction
Transformer-based Language Models (LMs) (Radford and Narasimhan, 2018; Devlin et al., 2019; Liu et al., 2019) have stormed NLP benchmarks with state-of-the-art performance, while recently humongous billion-parameter-sized models (Brown et al., 2020; Rae et al., 2021; Hoffmann et al., 2022) have showcased impressive few-shot capabilities. In addition, multi-lingual LMs (Conneau et al., 2020) have also been developed, demonstrating exceptional results as well as impressive performance in zero-shot cross-lingual transfer.
The legal NLP literature is also flourishing with the release of many new resources, including large legal corpora (Henderson* et al., 2022), benchmark datasets (Chalkidis et al., 2021a; Koreeda and Manning, 2021; Zheng et al., 2021; Chalkidis et al., 2022; Habernal et al., 2022), and pre-trained legal-oriented language models (Chalkidis et al., 2020; Zheng et al., 2021). Despite this impressive progress, the efficacy of differently-sized language models on legal NLP tasks and the importance of domain (legal) specificity are still understudied, while the effect of model compression techniques on model performance and efficiency has been largely ignored.
∗ Equal contribution. Work done during capstone projects at Cognitiv+ (https://www.cognitivplus.com/).
In this work, we aim to shed light on all these directions, following model development across three incremental steps in a pipelined approach:
(a) model pre-training on large legal corpora,
(b) model fine-tuning on down-stream tasks, and
(c) model compression to improve efficiency.
To do so, we initially develop 4 multi-lingual legal-oriented language models (C-XLMs). We benchmark their performance across 5 down-stream legal NLP tasks, comprising both publicly available and private datasets, covering both English and multi-lingual scenarios in several task types, i.e., document/sentence classification, natural language inference, and entity extraction. Finally, we experiment with a full-scale pipeline for model compression which includes a) Parameter Pruning, b) Knowledge Distillation, and c) Quantization to produce much more efficient (smaller and faster) models that can be effectively deployed in production.
Our work aims to provide guidelines to legal-tech practitioners on model development (pre-training, fine-tuning, compression), taking both performance and efficiency into consideration. Our findings show that the impact of larger vs. smaller models, domain-specific vs. generic models, and the efficacy of model compression techniques varies across tasks, but in general larger domain-specific models perform better. Via full-scale model compression, we produce models with a performance decrease of only 2.3 p.p., while being approx. 42× smaller and approx. 66× faster. We also find that fully compressed models outperform equally sized distilled or fine-tuned models.
Model Alias #Langs #Layers #Units #Heads #Params Vocab. Size Train. Tokens MLM Acc.
XLM-R base 100 12 768 12 278M 250k 6.3T 74.0
XLM-R large 100 24 1024 16 559M 250k 6.3T 78.9
C-XLM tiny 10 4 128 4 9M 64k 92B 54.9
C-XLM small 10 6 256 4 21M 64k 92B 68.9
C-XLM base 10 12 512 8 71M 64k 92B 77.8
C-XLM large 10 24 1024 16 368M 64k 92B 81.5
Table 1: Model specifications, training tokens processed during pre-training, and MLM performance (Acc.) for all variants of our C-XLM models and the XLM-R models of Conneau et al. (2020), considered as baselines.
2 Model Specifications
Following Chalkidis et al. (2020), we pre-train from scratch legal domain-specific transformer-based language models. Our models are based on the RoBERTa architecture (Liu et al., 2019), i.e., trained with the Masked Language Modelling (MLM) objective, excluding the Next Sentence Prediction (NSP) one used by BERT (Devlin et al., 2019). In addition, based on industry needs and driven by the work of Conneau et al. (2020), our models are multilingual -usually referred to as XLM in the literature- and support ten languages in total (English, French, German, Greek, Spanish, Italian, Dutch, Polish, Portuguese, Russian).
We pre-train 4 variants of custom XLM models (C-XLM), starting from a large version with 24 Transformer blocks (layers), each consisting of 1024 hidden units and 16 attention heads, and continue by decreasing each time by a factor of 2 across all dimensions, i.e., blocks/layers, hidden units, and attention heads (Table 1).1
1 A minor exception is the tiny version, where we consider 4 attention heads of 32 hidden units per head instead of 2 attention heads with 64 units per head.
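To make the halving scheme concrete, the snippet below is a minimal sketch (not the authors' released code) of how the four C-XLM variants of Table 1 could be expressed as RoBERTa-style configurations with the Hugging Face transformers library; the 4× feed-forward expansion and the position-embedding setting are assumptions.

```python
# Sketch of the four C-XLM variants from Table 1 as RoBERTa-style configs.
from transformers import RobertaConfig

VARIANTS = {
    #         layers, hidden, heads
    "tiny":   (4,    128,   4),
    "small":  (6,    256,   4),
    "base":   (12,   512,   8),
    "large":  (24,   1024,  16),
}

def make_config(name: str) -> RobertaConfig:
    layers, hidden, heads = VARIANTS[name]
    return RobertaConfig(
        vocab_size=64_000,             # custom multi-lingual BPE vocabulary (Section 3.2)
        num_hidden_layers=layers,
        hidden_size=hidden,
        num_attention_heads=heads,
        intermediate_size=4 * hidden,  # assumed standard 4x FFN expansion
        max_position_embeddings=514,   # RoBERTa-style positions for 512-token inputs
    )

config = make_config("base")
```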
3 Pre-Training
3.1 Training Corpora
We pre-trained our models using multi-lingual corpora that consist of regulations and contracts. For regulations, we used the MultiEURLEX dataset of Chalkidis et al. (2021b), which comprises 65k EU regulations officially translated in 24 languages.2 We also considered additional publicly available English resources, specifically the 250 US code books that are part of the “Pile of Law” corpus released by Henderson* et al. (2022), alongside 36k UK laws published by Chalkidis and Søgaard (2022).
2 In our work, we consider 9 of these languages (English, French, German, Greek, Spanish, Italian, Dutch, Polish, Portuguese).
Regarding contracts, we considered the LEDGAR dataset (Tuggener et al., 2020), comprising 900k sections from US contracts in English, and 60k additional full contracts in English from a publicly available crawl of EDGAR. Since there are no publicly available contracts in the rest of the languages, we translated these documents using state-of-the-art Neural Machine Translation (NMT) systems across all languages of interest.3
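As a rough illustration of this translation step (see footnote 3 below), the sketch here uses the EasyNMT library with OpusMT checkpoints; the model name, language codes, and batching are illustrative assumptions rather than the exact production setup.

```python
# Hedged sketch of translating English contract text into the other target
# languages with OpusMT models via the EasyNMT library.
from easynmt import EasyNMT

target_langs = ["fr", "de", "el", "es", "it", "nl", "pl", "pt", "ru"]
model = EasyNMT("opus-mt")  # wraps Helsinki-NLP OpusMT checkpoints

def translate_contract(paragraphs, lang):
    # EasyNMT handles sentence splitting and batching internally.
    return model.translate(paragraphs, source_lang="en", target_lang=lang)

english_paragraphs = ["This Agreement is entered into by and between ..."]
for lang in target_langs:
    translated = translate_contract(english_paragraphs, lang)
```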
3.2 Custom Vocabulary
Relying on the above-mentioned resources, we built a custom vocabulary of 64k sub-word units that better fit the documents in the respective domains and languages of interest. We opted for Byte-Pair Encodings (BPEs) (Sennrich et al., 2016), similarly to most recent work on Transformer-based language models (Radford and Narasimhan, 2018; Liu et al., 2019; Conneau et al., 2020).
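A minimal sketch of how such a 64k byte-level BPE vocabulary can be trained with the Hugging Face tokenizers library is given below; the corpus file paths and special tokens are assumptions mirroring the standard RoBERTa setup.

```python
# Sketch: build a 64k byte-level BPE vocabulary over the multi-lingual legal corpora.
from tokenizers import ByteLevelBPETokenizer

corpus_files = [
    "corpora/eurlex_all_langs.txt",      # hypothetical path to regulation text
    "corpora/contracts_all_langs.txt",   # hypothetical path to contract text
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=64_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("c-xlm-tokenizer")
```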
3.3 Masked Language Modelling (MLM)
We pre-trained all variants of C-XLM (our domain-specific multi-lingual RoBERTa) for 1.1m steps (gradient updates) in total, based on a two-step approach, similarly to Devlin et al. (2019): we pre-train for 1m steps with sequences up to 128 sub-word units, followed by continued pre-training for 100k steps with sequences up to 512 sub-word units, always with a batch size of 512 sequences.4 For each example, we mask out 15% of the tokens in total. We train all models with a maximum learning rate of 1e-4, with warm-up for the initial (5%) training steps followed by a cosine decay.
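The following sketch illustrates the stated MLM recipe (15% masking, peak learning rate 1e-4, warm-up over the first 5% of steps, cosine decay) with standard PyTorch/transformers components; the choice of AdamW and any hyper-parameters not stated above are assumptions.

```python
# Hedged sketch of the MLM objective and learning-rate schedule described above.
import torch
from transformers import (DataCollatorForLanguageModeling,
                          get_cosine_schedule_with_warmup)

def build_mlm_training(model, tokenizer, total_steps):
    # Dynamic masking of 15% of the tokens in each batch.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )
    # Assumed optimizer; peak LR of 1e-4 as stated in the text.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # Linear warm-up over the first 5% of steps, then cosine decay.
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.05 * total_steps),
        num_training_steps=total_steps,
    )
    return collator, optimizer, scheduler
```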
In comparison, XLM-R models were pre-trained for 1.5m steps with batches of 8192 sequences, which accounts for approx. 63× more training tokens processed; the majority of those in high-resource languages like the ones we consider.
3 We used the OpusMT (en2m) mBART models via the EasyNMT library.
4 This approach aims at more efficient (compute-friendly) pre-training, since pre-training with shorter sequences severely decreases the needed compute and time.
Figure 1: MLM performance per language across C-XLM model variants, depicted with differently coloured webs.
Figure 2: Pre-training loss curves of C-XLMs.
3.4 MLM Results
In Figure 2, we observe the loss curves of differently sized models during pre-training. While models perform equally poorly in the very initial steps, larger models substantially outperform their smaller counterparts due to their increased capacity (number of parameters). Table 1 presents the accuracy of our different models. As expected, the large version (81.5% accuracy) followed by the base version (77.8% accuracy) of C-XLM outperform their corresponding generic XLM-R models by 2.6% and 3.8%, respectively.5 Figure 1 presents masked language modelling performance in finer detail across languages per model, highlighting the predominance of our two largest models.6
5 A comparison between the XLM-R models of Conneau et al. (2020) and our models (C-XLMs) is not ideal due to the different vocabularies used. Nevertheless, it provides a general idea of pre-training performance on legal-specific corpora.
6 More fine-grained MLM evaluation (per language and per document type) can be found in Appendix B.
4 Fine-tuning
4.1 Benchmark - Tasks and Datasets
In this section, we briefly present the evaluation benchmark that we use, which consists of both publicly available and private datasets. The benchmark is diverse, covering three task types (document, sentence, and token classification) and two multi-lingual datasets.7 The datasets in detail are:
MultiEURLEX (Chalkidis et al., 2021a), a multi-lingual dataset for legal topic classification comprising 65k EU laws officially translated in 23 EU languages.8 Each document (EU law) was originally annotated with relevant EUROVOC9 concepts by the Publications Office of the EU. We use the 21 ‘Level 1’ labels, obtained by Chalkidis et al. (2021a) from the original EUROVOC annotations of the documents. We use a derivative of the original dataset considering only 1k non-parallel documents per supported language (9k in total, Section 3.1).10 This is a multi-label document classification task, thus we evaluate performance using macro- (m-F1) and micro- (µ-F1) F1 scores.
7 We do not use the LexGLUE benchmark of Chalkidis et al. (2022), since it is monolingual (English only) and also covers tasks that involve litigation, which are out of scope.
8 MultiEURLEX is available at https://huggingface.co/datasets/multi_eurlex.
9 EUROVOC is a hierarchically organized taxonomy of concepts (a hierarchy of labels), available at http://eurovoc.europa.eu/.
10 This is in line with the work of Xenouleas et al. (2022), where the authors consider a more “realistic”, harder version of MultiEURLEX with fewer and non-parallel documents.
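For concreteness, the snippet below shows the micro- and macro-averaged F1 computation used for multi-label evaluation such as MultiEURLEX (e.g., over the 21 ‘Level 1’ EUROVOC labels), using scikit-learn on toy label-indicator matrices.

```python
# Micro/macro F1 for multi-label classification over binary indicator matrices.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0]])   # gold label indicators (toy example)
y_pred = np.array([[1, 0, 0], [0, 1, 1]])   # predicted indicators after thresholding

micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
```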
Model Alias MultiEURLEX UNFAIR-ToS CNLI Obligations ContractNER
µ-F1 m-F1 Acc. MAE µ-F1 m-F1 µ-F1 m-F1 µ-F1 m-F1
XLM-R (base) 75.3 53.2 86.6 0.17 84.0 81.9 89.7 88.2 92.4 93.9
XLM-R (large) 77.8 63.8 89.0 0.16 86.3 84.7 88.9 87.4 92.8 93.7
C-XLM (tiny) 66.5 46.1 78.2 0.27 70.2 69.2 88.7 87.4 87.2 89.3
C-XLM (small) 72.3 54.7 85.4 0.20 79.7 77.0 90.4 89.0 90.1 92.4
C-XLM (base) 75.3 59.4 87.3 0.18 84.0 82.1 91.2 90.4 92.9 93.9
C-XLM (large) 78.4 65.4 89.7 0.14 85.3 83.0 91.8 90.6 93.2 94.6
Table 2: Overall results of fine-tuned models across all down-stream tasks.
UNFAIR-ToS (Drawzeski et al., 2021) is a dataset for detecting unfair clauses in Terms of Service (ToS) agreements from on-line platforms (e.g., YouTube, Facebook, etc.) in 4 languages (English, German, Italian, and Polish). The dataset has been annotated at the sentence level with 8 types of unfair contractual terms, i.e., terms (sentences) that potentially violate user rights according to EU consumer law. Sentences have also been annotated according to a 3-level fairness score (fair, partially unfair, clearly unfair). In our case, we examine the latter task as sentence regression and evaluate performance using Mean Absolute Error (MAE) and Accuracy (Acc.) on rounded (discrete) scores.
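A small sketch of this evaluation protocol is shown below: MAE on the continuous predictions and accuracy after rounding to the discrete 3-level scale; the 0/1/2 encoding of the fairness levels is an assumption.

```python
# MAE on continuous fairness predictions plus accuracy on rounded scores.
import numpy as np
from sklearn.metrics import mean_absolute_error, accuracy_score

gold = np.array([0, 1, 2, 1])          # fair / partially unfair / clearly unfair
pred = np.array([0.2, 1.4, 1.8, 0.9])  # model outputs on the same scale

mae = mean_absolute_error(gold, pred)
acc = accuracy_score(gold, np.clip(np.rint(pred), 0, 2).astype(int))
```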
ContractNLI (Koreeda and Manning, 2021) is a dataset for contract-based Natural Language Inference (NLI). The dataset consists of 607 contracts, specifically Non-Disclosure Agreements (NDAs). Each document has been paired with 17 templated hypotheses and labeled with one out of three classes (entailment, contradiction, or neutral). We examine a lenient version of this task, where instead of the full document (NDA), we represent the document with a small number of sentences which have been annotated as rationales for the specific task. This is a single-label multi-class document classification task, and we evaluate performance using macro- (m-F1) and micro- (µ-F1) F1 scores.
Contract-Obligations (Chalkidis et al., 2018) is a proprietary (privately developed) dataset for obligation extraction from contracts (legal agreements). The dataset consists of 100 service agreements. Each contract has been split into paragraphs (approx. 9,400 in total) and labeled with 4 obligation sub-types, i.e., Obligation, Deliverable, Discretion, and Prohibition, while some paragraphs are not relevant, resulting in a total of 5 potential classes. This is a single-label multi-class document classification task. We evaluate performance using macro- (m-F1) and micro- (µ-F1) F1 scores.
ContractNER (Chalkidis et al., 2017) is a proprietary dataset for contract element extraction. The dataset consists of 3,500 contractual introductions from several types of contracts (service, employment, purchase, etc.). Each introduction (paragraph) has been labeled with 4 entity types (Title, Contracting Party, Start Date, Effective Date). This is a single-label multi-class token classification task. Thus, we evaluate performance using macro- (m-F1) and micro- (µ-F1) F1 scores at the entity level.
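Entity-level F1 of this kind is typically computed over predicted spans rather than individual tokens; the sketch below uses the seqeval library with illustrative BIO tags (the proprietary dataset's exact tag scheme is not public, so the tag names here are assumptions).

```python
# Entity-level micro/macro F1 over BIO-tagged sequences with seqeval.
from seqeval.metrics import f1_score

gold = [["B-TITLE", "I-TITLE", "O", "B-PARTY"]]
pred = [["B-TITLE", "I-TITLE", "O", "O"]]

micro_f1 = f1_score(gold, pred, average="micro")
macro_f1 = f1_score(gold, pred, average="macro")
```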
4.2 Experimental Set Up
We tune all models by conducting a grid search over learning rates ∈ {1e-4, 3e-4, 1e-5, 3e-5, 5e-5, 1e-6}. We use early stopping based on validation loss; we select and report test scores based on the model with the best validation performance.11
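A minimal sketch of this tuning protocol with the Hugging Face Trainer is shown below; only the learning-rate grid and the validation-loss criterion come from the text, while the number of epochs, patience, and batch size are assumptions.

```python
# Sketch: grid search over learning rates with early stopping on validation loss.
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

def run_grid_search(model_init, train_set, dev_set):
    best = None
    for lr in [1e-4, 3e-4, 1e-5, 3e-5, 5e-5, 1e-6]:
        args = TrainingArguments(
            output_dir=f"runs/lr_{lr}",
            learning_rate=lr,
            num_train_epochs=20,                # assumed upper bound
            per_device_train_batch_size=32,     # assumed
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            greater_is_better=False,
        )
        trainer = Trainer(
            model_init=model_init, args=args,
            train_dataset=train_set, eval_dataset=dev_set,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
        )
        trainer.train()
        dev_loss = trainer.evaluate()["eval_loss"]
        if best is None or dev_loss < best[0]:
            best = (dev_loss, lr, trainer)
    return best
```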
4.3 Fine-tuning Results
Table 2 presents the results of the fine-tuned baselines, the XLM-R models (upper zone), and of all the variants of our C-XLM models (lower zone) for each downstream task. We hypothesize that the base and large versions of C-XLM will perform better compared to their counterpart XLM-R models. Indeed, the base version of C-XLM always outperforms XLM-R across all 5 datasets, while the large version of C-XLM outperforms XLM-R in all but one (4 out of 5) datasets.
MultiEURLEX: Both large versions of C-XLM and XLM-R clearly outperform the rest of the models, with C-XLM outperforming XLM-R by 0.6 p.p. in µ-F1 and 1.6 p.p. in m-F1. Similarly, the base version of C-XLM outperforms the equivalent version of XLM-R. Interestingly, the small version of C-XLM has comparable performance with the latter while being approx. 13× smaller.
UNFAIR-ToS: Both large and base versions of C-XLM outperform their counterpart XLM-R models by 0.7 p.p. in accuracy. Again, the small version of C-XLM achieves performance competitive with the base-sized models.
11 Additional details and development scores are provided in Appendix A.
Figure 3: Radar plots with per-language performance for the multilingual MultiEURLEX and UNFAIR-ToS datasets for all versions of C-XLM; panel (a) MultiEURLEX, panel (b) UNFAIR-ToS.
ContractNLI: In this task, we find that the large version of XLM-R outperforms that of C-XLM (+1 p.p. in µ-F1 and +1.7 p.p. in m-F1), while both base models perform comparably. We also note that the relative differences between differently sized models are the most pronounced across all tasks.
Contract-Obligations: On this task, all C-XLM models except the tiny version outperform the baselines (XLM-R). Specifically, the large version of C-XLM achieves +2.9 p.p. in µ-F1 and +3.2 p.p. in m-F1 compared to the large version of XLM-R.
ContractNER: Similarly, our C-XLM models outperform the corresponding large and base baselines by approx. 0.5 p.p. in µ-F1. In addition, m-F1 is higher for our large model by 0.9 p.p., while the base models have identical results. Again, the small version of C-XLM is competitive with the baseline.
In general trends, we observe that larger models outperform smaller ones in most cases, and domain-specific models outperform generic ones, while using a substantially smaller (4×) vocabulary and being significantly less (63×) pre-trained. The largest relative differences occur in MultiEURLEX, a 21-class multi-label classification task, and CNLI, a sentence-pair classification task.
Language Parity: Figure 3 provides information, through radar plots, about scores per language for each variant of C-XLM. We generally observe that performance varies across languages (e.g., models perform better in English compared to German), while language performance disparity also varies across models (depicted as differently shaped webs) and across datasets (e.g., models are better in English compared to Italian in MultiEURLEX, but the opposite is true for UNFAIR-ToS).12
We rule out representation disparity as a possible explanation, since the training data equally represent all languages (equal number of training examples). Interestingly, pre-training (MLM) accuracy also does not correlate with down-stream performance. Based on the aforementioned points, we can only hypothesize that other qualitative characteristics (idiosyncrasies of a language in a specific context/domain) are responsible for performance disparities between languages.
Algorithm 1 Gradual Compression
if Teacher Size >> Student Size then
S0: Distill model to teacher assistant
S1: Prune model vocabulary and
fine-tune for 1-3 epochs (if needed).
S2: Prune model depth and distill.
S3: Prune model width and re-distill.
S4.1: Optimize computational graph.
S4.2: Apply 8-bit dynamic quantization.
5 Model Compression
5.1 Methodology
To compress and accelerate the inference of fine-
tuned transformer-based models we adopt gradual
compression, a pipeline that combines structured
pruning, knowledge distillation, and post-training
quantization to progressively reach the desired com-
pression rate, summarized in Algorithm 1.13
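Step S4.2 of Algorithm 1 (post-training 8-bit dynamic quantization) can be expressed directly with PyTorch, as in the hedged sketch below; graph optimization (S4.1) is tool-dependent and omitted here.

```python
# Post-training 8-bit dynamic quantization of the fine-tuned student's linear layers.
import torch

def quantize_student(model: torch.nn.Module) -> torch.nn.Module:
    model.eval()
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```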
Step 0 — Teacher Assistant:
In case the teacher
is very large and the desired compression rate is
high (e.g., reducing the large version of C-XLM
to the tiny one), teacher assistants (Mirzadeh et al.,
2020) are used to make the transition smoother.
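The distillation steps (S0, S2, S3) rely on a teacher-student objective; the sketch below shows a standard soft-target formulation with temperature plus hard-label cross-entropy. This formulation, the temperature, and the loss weighting are assumptions, since the exact objective used in the pipeline is not specified here.

```python
# Hedged sketch of a standard distillation loss for teacher -> (assistant/student) transfer.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: match the teacher's tempered output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: usual cross-entropy on the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```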
12 Refer to Appendix B for detailed results.
13 See additional details and results from preliminary experiments in Appendix B.