Legal-Tech Open Diaries: Lesson learned on how to develop and deploy
light-weight models in the era of humongous Language Models
Stelios Maroudas∗† Sotiris Legkas∗†
Prodromos Malakasiotis† Ilias Chalkidis‡
†Department of Informatics, Athens University of Economics and Business, Greece
‡Department of Computer Science, University of Copenhagen, Denmark
Cognitiv+, Athens, Greece
Abstract
In the era of billion-parameter-sized Language Models (LMs), start-ups have to follow trends and adapt their technology accordingly. Nonetheless, there are open challenges, since the development and deployment of large models come with a need for high computational resources and have economic consequences. In this work, we follow the steps of the R&D group of a modern legal-tech start-up and present important insights on model development and deployment. We start from ground zero by pre-training multiple domain-specific multi-lingual LMs which are a better fit to contractual and regulatory text compared to the available alternatives (XLM-R). We present benchmark results of such models on a half-public, half-private legal benchmark comprising 5 downstream tasks, showing the impact of larger model size. Lastly, we examine the impact of a full-scale pipeline for model compression which includes: a) Parameter Pruning, b) Knowledge Distillation, and c) Quantization. The resulting models are much more efficient without sacrificing performance at large.
1 Introduction
Transformer-based Language Models (LMs) (Radford and Narasimhan, 2018; Devlin et al., 2019; Liu et al., 2019) have stormed NLP benchmarks with state-of-the-art performance, while recently humongous billion-parameter-sized models (Brown et al., 2020; Rae et al., 2021; Hoffmann et al., 2022) have showcased impressive few-shot capabilities. In addition, multi-lingual LMs (Conneau et al., 2020) have also been developed, demonstrating exceptional results as well as impressive performance in zero-shot cross-lingual transfer.
The legal NLP literature is also flourishing with the release of many new resources, including large legal corpora (Henderson* et al., 2022), benchmark datasets (Chalkidis et al., 2021a; Koreeda and Manning, 2021; Zheng et al., 2021; Chalkidis et al., 2022; Habernal et al., 2022), and pre-trained legal-oriented language models (Chalkidis et al., 2020; Zheng et al., 2021). Despite this impressive progress, the efficacy of differently-sized language models on legal NLP tasks and the importance of domain (legal) specificity are still understudied, while the effect of model compression techniques on model performance and efficiency has been largely ignored.
∗ Equal contribution. Work done during capstone projects at Cognitiv+ (https://www.cognitivplus.com/).
In this work, we aim to shed light on all these directions, following model development across three incremental steps in a pipelined approach:
(a) model pre-training on large legal corpora,
(b) model fine-tuning on down-stream tasks, and
(c) model compression to improve efficiency.
To do so, we initially develop 4 multi-lingual legal-oriented language models (C-XLMs). We benchmark their performance across 5 down-stream legal NLP tasks, comprising both publicly available and private datasets, covering both English and multi-lingual scenarios in several task types, i.e., document/sentence classification, natural language inference, and entity extraction. Finally, we experiment with a full-scale pipeline for model compression which includes a) Parameter Pruning, b) Knowledge Distillation, and c) Quantization to produce much more efficient (smaller and faster) models that can be effectively deployed in production.
Our work aims to provide guidelines to legal-tech practitioners on model development (pre-training, fine-tuning, compression), taking both performance and efficiency into consideration. Our findings show that the impact of larger vs. smaller models, domain-specific vs. generic models, and the efficacy of model compression techniques varies across tasks, but in general larger domain-specific models perform better. Via full-scale model compression, we produce models with a performance decrease of only 2.3 p.p., while being approx. 42× smaller and approx. 66× faster. We also find that fully compressed models outperform equally sized distilled or fine-tuned models.
Model Alias #Langs #Layers #Units #Heads #Params Vocab. Size Train. Tokens MLM Acc.
XLM-R base 100 12 768 12 278M 250k 6.3T 74.0
XLM-R large 100 24 1024 16 559M 250k 6.3T 78.9
C-XLM tiny 10 4 128 4 9M 64k 92B 54.9
C-XLM small 10 6 256 4 21M 64k 92B 68.9
C-XLM base 10 12 512 8 71M 64k 92B 77.8
C-XLM large 10 24 1024 16 368M 64k 92B 81.5
Table 1: Model specifications, training tokens processed during pre-training, and MLM performance (Acc.) for all variants of our C-XLM models and the XLM-R models of Conneau et al. (2020), considered as baselines.
2 Model Specifications
Following Chalkidis et al. (2020), we pre-train from scratch legal domain-specific transformer-based language models. Our models are based on the RoBERTa architecture (Liu et al., 2019), i.e., trained with the Masked Language Modelling (MLM) objective, excluding the Next Sentence Prediction (NSP) one used by BERT (Devlin et al., 2019). In addition, based on industry needs and driven by the work of Conneau et al. (2020), our models are multilingual -usually referred to as XLM in the literature- and support ten languages in total (English, French, German, Greek, Spanish, Italian, Dutch, Polish, Portuguese, Russian).
We pre-train 4 variants of custom XLM models (C-XLM), starting from a large version with 24 Transformer blocks (layers), each consisting of 1024 hidden units and 16 attention heads, and continue by decreasing each time by a factor of 2 across all dimensions, i.e., blocks/layers, hidden units, and attention heads (Table 1).1
1 A minor exception is the tiny version, where we consider 4 attention heads of 32 hidden units per head instead of 2 attention heads with 64 units per head.
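To make the halving scheme concrete, the snippet below is a minimal sketch (not the authors' released code) of how the four C-XLM variants of Table 1 could be expressed as RoBERTa-style configurations with the Hugging Face transformers library; the 4× feed-forward expansion and the position-embedding setting are assumptions.

```python
# Sketch of the four C-XLM variants from Table 1 as RoBERTa-style configs.
from transformers import RobertaConfig

VARIANTS = {
    #         layers, hidden, heads
    "tiny":   (4,    128,   4),
    "small":  (6,    256,   4),
    "base":   (12,   512,   8),
    "large":  (24,   1024,  16),
}

def make_config(name: str) -> RobertaConfig:
    layers, hidden, heads = VARIANTS[name]
    return RobertaConfig(
        vocab_size=64_000,             # custom multi-lingual BPE vocabulary (Section 3.2)
        num_hidden_layers=layers,
        hidden_size=hidden,
        num_attention_heads=heads,
        intermediate_size=4 * hidden,  # assumed standard 4x FFN expansion
        max_position_embeddings=514,   # RoBERTa-style positions for 512-token inputs
    )

config = make_config("base")
```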
3 Pre-Training
3.1 Training Corpora
We pre-trained our models using multi-lingual corpora that consist of regulations and contracts. For regulations, we used the MultiEURLEX dataset of Chalkidis et al. (2021b), which comprises 65k EU regulations officially translated in 24 languages.2 We also considered additional publicly available English resources, specifically the 250 US code books that are part of the “Pile of Law” corpus released by Henderson* et al. (2022), alongside 36k UK laws published by Chalkidis and Søgaard (2022).
2 In our work, we consider 9 of these languages (English, French, German, Greek, Spanish, Italian, Dutch, Polish, Portuguese).
Regarding contracts, we considered the LEDGAR dataset (Tuggener et al., 2020), comprising 900k sections from US contracts in English, and 60k additional full contracts in English from a publicly available crawl of EDGAR. Since there are no publicly available contracts in the rest of the languages, we translated these documents using state-of-the-art Neural Machine Translation (NMT) systems across all languages of interest.3
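As a rough illustration of this translation step (see footnote 3 below), the sketch here uses the EasyNMT library with OpusMT checkpoints; the model name, language codes, and batching are illustrative assumptions rather than the exact production setup.

```python
# Hedged sketch of translating English contract text into the other target
# languages with OpusMT models via the EasyNMT library.
from easynmt import EasyNMT

target_langs = ["fr", "de", "el", "es", "it", "nl", "pl", "pt", "ru"]
model = EasyNMT("opus-mt")  # wraps Helsinki-NLP OpusMT checkpoints

def translate_contract(paragraphs, lang):
    # EasyNMT handles sentence splitting and batching internally.
    return model.translate(paragraphs, source_lang="en", target_lang=lang)

english_paragraphs = ["This Agreement is entered into by and between ..."]
for lang in target_langs:
    translated = translate_contract(english_paragraphs, lang)
```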
3.2 Custom Vocabulary
Relying on the above-mentioned resources, we built a custom vocabulary of 64k sub-word units that better fit the documents in the respective domains and languages of interest. We opted for Byte-Pair Encodings (BPEs) (Sennrich et al., 2016), similarly to most recent work on Transformer-based language models (Radford and Narasimhan, 2018; Liu et al., 2019; Conneau et al., 2020).
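A minimal sketch of how such a 64k byte-level BPE vocabulary can be trained with the Hugging Face tokenizers library is given below; the corpus file paths and special tokens are assumptions mirroring the standard RoBERTa setup.

```python
# Sketch: build a 64k byte-level BPE vocabulary over the multi-lingual legal corpora.
from tokenizers import ByteLevelBPETokenizer

corpus_files = [
    "corpora/eurlex_all_langs.txt",      # hypothetical path to regulation text
    "corpora/contracts_all_langs.txt",   # hypothetical path to contract text
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=64_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("c-xlm-tokenizer")
```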
3.3 Masked Language Modelling (MLM)
We pre-trained all variants of C-XLM (our domain-specific multi-lingual RoBERTa) for 1.1m steps (gradient updates) in total, based on a two-step approach, similarly to Devlin et al. (2019): we pre-train for 1m steps with sequences up to 128 sub-word units, followed by continued pre-training for 100k steps with sequences up to 512 sub-word units, always with a batch size of 512 sequences.4 For each example, we mask out 15% of the tokens in total. We train all models with a maximum learning rate of 1e-4, with warm-up for the initial (5%) training steps followed by a cosine decay.
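The following sketch illustrates the stated MLM recipe (15% masking, peak learning rate 1e-4, warm-up over the first 5% of steps, cosine decay) with standard PyTorch/transformers components; the choice of AdamW and any hyper-parameters not stated above are assumptions.

```python
# Hedged sketch of the MLM objective and learning-rate schedule described above.
import torch
from transformers import (DataCollatorForLanguageModeling,
                          get_cosine_schedule_with_warmup)

def build_mlm_training(model, tokenizer, total_steps):
    # Dynamic masking of 15% of the tokens in each batch.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )
    # Assumed optimizer; peak LR of 1e-4 as stated in the text.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # Linear warm-up over the first 5% of steps, then cosine decay.
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.05 * total_steps),
        num_training_steps=total_steps,
    )
    return collator, optimizer, scheduler
```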
In comparison, XLM-R models were pre-trained for 1.5m steps with batches of 8192 sequences, which accounts for approx. 63× more training tokens processed; the majority of those in high-resource languages like the ones we consider.
3 We used the OpusMT (en2m) mBART models via the EasyNMT library.
4 This approach aims at more efficient (compute-friendly) pre-training, since pre-training with shorter sequences severely decreases the needed compute and time.
Figure 1: MLM performance per language across C-XLM model variants, depicted with differently coloured webs.
Figure 2: Pre-training loss curves of C-XLMs.
3.4 MLM Results
In Figure 2, we observe the loss curves of differently sized models during pre-training. While models perform equally poorly in the very initial steps, larger models substantially outperform their smaller counterparts due to their increased capacity (number of parameters). Table 1 presents the accuracy of our different models. As expected, the large version (81.5% accuracy) followed by the base version (77.8% accuracy) of C-XLM outperform their corresponding generic XLM-R models by 2.6% and 3.8%, respectively.5 Figure 1 presents masked language modelling performance in finer detail across languages per model, highlighting the predominance of our two largest models.6
5 A comparison between the XLM-R models of Conneau et al. (2020) and our models (C-XLMs) is not ideal due to the different vocabularies used. Nevertheless, it provides a general idea of pre-training performance on legal-specific corpora.
6 More fine-grained MLM evaluation (per language and per document type) can be found in Appendix B.
4 Fine-tuning
4.1 Benchmark - Tasks and Datasets
In this section, we briefly present the evaluation benchmark that we use, which consists of both publicly available and private datasets. The benchmark is diverse, covering three task types (document, sentence, and token classification) and two multi-lingual datasets.7 The datasets in detail are:
MultiEURLEX (Chalkidis et al., 2021a), a multi-lingual dataset for legal topic classification comprising 65k EU laws officially translated in 23 EU languages.8 Each document (EU law) was originally annotated with relevant EUROVOC9 concepts by the Publications Office of the EU. We use the 21 ‘Level 1’ labels, obtained by Chalkidis et al. (2021a) from the original EUROVOC annotations of the documents. We use a derivative of the original dataset considering only 1k non-parallel documents per supported language (9k in total, Section 3.1).10 This is a multi-label document classification task, thus we evaluate performance using macro- (m-F1) and micro- (µ-F1) F1 scores.
7 We do not use the LexGLUE benchmark of Chalkidis et al. (2022), since it is monolingual (English only) and also covers tasks that involve litigation, which are out of scope.
8 MultiEURLEX is available at https://huggingface.co/datasets/multi_eurlex.
9 EUROVOC is a hierarchically organized taxonomy of concepts (a hierarchy of labels), available at http://eurovoc.europa.eu/.
10 This is in line with the work of Xenouleas et al. (2022), where the authors consider a more “realistic”, harder version of MultiEURLEX with fewer and non-parallel documents.
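For concreteness, the snippet below shows the micro- and macro-averaged F1 computation used for multi-label evaluation such as MultiEURLEX (e.g., over the 21 ‘Level 1’ EUROVOC labels), using scikit-learn on toy label-indicator matrices.

```python
# Micro/macro F1 for multi-label classification over binary indicator matrices.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0]])   # gold label indicators (toy example)
y_pred = np.array([[1, 0, 0], [0, 1, 1]])   # predicted indicators after thresholding

micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
```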
Model Alias MultiEURLEX UNFAIR-ToS CNLI Obligations ContractNER
µ-F1 m-F1 Acc. MAE µ-F1 m-F1 µ-F1 m-F1 µ-F1 m-F1
XLM-R (base) 75.3 53.2 86.6 0.17 84.0 81.9 89.7 88.2 92.4 93.9
XLM-R (large) 77.8 63.8 89.0 0.16 86.3 84.7 88.9 87.4 92.8 93.7
C-XLM (tiny) 66.5 46.1 78.2 0.27 70.2 69.2 88.7 87.4 87.2 89.3
C-XLM (small) 72.3 54.7 85.4 0.20 79.7 77.0 90.4 89.0 90.1 92.4
C-XLM (base) 75.3 59.4 87.3 0.18 84.0 82.1 91.2 90.4 92.9 93.9
C-XLM (large) 78.4 65.4 89.7 0.14 85.3 83.0 91.8 90.6 93.2 94.6
Table 2: Overall results of fine-tuned models across all down-stream tasks.
UNFAIR-ToS (Drawzeski et al., 2021) is a dataset for detecting unfair clauses in Terms of Service (ToS) agreements from on-line platforms (e.g., YouTube, Facebook, etc.) in 4 languages (English, German, Italian, and Polish). The dataset has been annotated at the sentence level with 8 types of unfair contractual terms, i.e., terms (sentences) that potentially violate user rights according to EU consumer law. Sentences have also been annotated according to a 3-level fairness score (fair, partially unfair, clearly unfair). In our case, we examine the latter task as sentence regression and evaluate performance using Mean Absolute Error (MAE) and Accuracy (Acc.) on rounded (discrete) scores.
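A small sketch of this evaluation protocol is shown below: MAE on the continuous predictions and accuracy after rounding to the discrete 3-level scale; the 0/1/2 encoding of the fairness levels is an assumption.

```python
# MAE on continuous fairness predictions plus accuracy on rounded scores.
import numpy as np
from sklearn.metrics import mean_absolute_error, accuracy_score

gold = np.array([0, 1, 2, 1])          # fair / partially unfair / clearly unfair
pred = np.array([0.2, 1.4, 1.8, 0.9])  # model outputs on the same scale

mae = mean_absolute_error(gold, pred)
acc = accuracy_score(gold, np.clip(np.rint(pred), 0, 2).astype(int))
```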
ContractNLI (Koreeda and Manning, 2021) is a dataset for contract-based Natural Language Inference (NLI). The dataset consists of 607 contracts, specifically Non-Disclosure Agreements (NDAs). Each document has been paired with 17 templated hypotheses and labeled with one out of three classes (entailment, contradiction, or neutral). We examine a lenient version of this task, where instead of the full document (NDA), we represent the document with a small number of sentences which have been annotated as rationales for the specific task. This is a single-label multi-class document classification task, and we evaluate performance using macro- (m-F1) and micro- (µ-F1) F1 scores.
Contract-Obligations (Chalkidis et al., 2018) is a proprietary (privately developed) dataset for obligation extraction from contracts (legal agreements). The dataset consists of 100 service agreements. Each contract has been split into paragraphs (approx. 9,400 in total) and labeled with 4 obligation sub-types, i.e., Obligation, Deliverable, Discretion, and Prohibition, while some paragraphs are not relevant, resulting in a total of 5 potential classes. This is a single-label multi-class document classification task. We evaluate performance using macro- (m-F1) and micro- (µ-F1) F1 scores.
ContractNER (Chalkidis et al., 2017) is a proprietary dataset for contract element extraction. The dataset consists of 3,500 contractual introductions from several types of contracts (service, employment, purchase, etc.). Each introduction (paragraph) has been labeled with 4 entity types (Title, Contracting Party, Start Date, Effective Date). This is a single-label multi-class token classification task. Thus, we evaluate performance using macro- (m-F1) and micro- (µ-F1) F1 scores at the entity level.
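Entity-level F1 of this kind is typically computed over predicted spans rather than individual tokens; the sketch below uses the seqeval library with illustrative BIO tags (the proprietary dataset's exact tag scheme is not public, so the tag names here are assumptions).

```python
# Entity-level micro/macro F1 over BIO-tagged sequences with seqeval.
from seqeval.metrics import f1_score

gold = [["B-TITLE", "I-TITLE", "O", "B-PARTY"]]
pred = [["B-TITLE", "I-TITLE", "O", "O"]]

micro_f1 = f1_score(gold, pred, average="micro")
macro_f1 = f1_score(gold, pred, average="macro")
```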
4.2 Experimental Set Up
We tune all models by conducting a grid search over learning rates ∈ {1e-4, 3e-4, 1e-5, 3e-5, 5e-5, 1e-6}. We use early stopping based on validation loss; we select and report test scores based on the model with the best validation performance.11
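A minimal sketch of this tuning protocol with the Hugging Face Trainer is shown below; only the learning-rate grid and the validation-loss criterion come from the text, while the number of epochs, patience, and batch size are assumptions.

```python
# Sketch: grid search over learning rates with early stopping on validation loss.
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

def run_grid_search(model_init, train_set, dev_set):
    best = None
    for lr in [1e-4, 3e-4, 1e-5, 3e-5, 5e-5, 1e-6]:
        args = TrainingArguments(
            output_dir=f"runs/lr_{lr}",
            learning_rate=lr,
            num_train_epochs=20,                # assumed upper bound
            per_device_train_batch_size=32,     # assumed
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            greater_is_better=False,
        )
        trainer = Trainer(
            model_init=model_init, args=args,
            train_dataset=train_set, eval_dataset=dev_set,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
        )
        trainer.train()
        dev_loss = trainer.evaluate()["eval_loss"]
        if best is None or dev_loss < best[0]:
            best = (dev_loss, lr, trainer)
    return best
```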
4.3 Fine-tuning Results
Table 2 presents the results of the fine-tuned baselines, the XLM-R models (upper zone), and of all the variants of our C-XLM models (lower zone) for each downstream task. We hypothesize that the base and large versions of C-XLM will perform better compared to their counterpart XLM-R models. Indeed, the base version of C-XLM always outperforms XLM-R across all 5 datasets, while the large version of C-XLM outperforms XLM-R in all but one (4 out of 5) datasets.
MultiEURLEX: Both large versions of C-XLM and XLM-R clearly outperform the rest of the models, with C-XLM outperforming XLM-R by 0.6 p.p. in µ-F1 and 1.6 p.p. in m-F1. Similarly, the base version of C-XLM outperforms the equivalent version of XLM-R. Interestingly, the small version of C-XLM has comparable performance with the latter while being approx. 13× smaller.
UNFAIR-ToS: Both large and base versions of C-XLM outperform their counterpart XLM-R models by 0.7 p.p. in accuracy. Again, the small version of C-XLM achieves performance competitive with the base-sized models.
11 Additional details and development scores are provided in Appendix A.
Figure 3: Radar plots with per-language performance for the multilingual MultiEURLEX and UNFAIR-ToS datasets for all versions of C-XLM; panel (a) MultiEURLEX, panel (b) UNFAIR-ToS.
ContractNLI: In this task, we find that the large version of XLM-R outperforms that of C-XLM (+1 p.p. in µ-F1 and +1.7 p.p. in m-F1), while both base models perform comparably. We also note that the relative differences between differently sized models are the most pronounced across all tasks.
Contract-Obligations: On this task, all C-XLM models except the tiny version outperform the baselines (XLM-R). Specifically, the large version of C-XLM achieves +2.9 p.p. in µ-F1 and +3.2 p.p. in m-F1 compared to the large version of XLM-R.
ContractNER: Similarly, our C-XLM models outperform the corresponding large and base baselines by approx. 0.5 p.p. in µ-F1. In addition, m-F1 is higher for our large model by 0.9 p.p., while the base models have identical results. Again, the small version of C-XLM is competitive with the baseline.
In general trends, we observe that larger models outperform smaller ones in most cases, and domain-specific models outperform generic ones, while using a substantially smaller (4×) vocabulary and being significantly less (63×) pre-trained. The largest relative differences occur in MultiEURLEX, a 21-class multi-label classification task, and CNLI, a sentence-pair classification task.
Language Parity: Figure 3 provides information, through radar plots, about scores per language for each variant of C-XLM. We generally observe that performance varies across languages (e.g., models perform better in English compared to German), while language performance disparity also varies across models (depicted as differently shaped webs) and across datasets (e.g., models are better in English compared to Italian in MultiEURLEX, but the opposite is true for UNFAIR-ToS).12
We rule out representation disparity as a possible explanation, since the training data equally represent all languages (equal number of training examples). Interestingly, pre-training (MLM) accuracy also does not correlate with down-stream performance. Based on the aforementioned points, we can only hypothesize that other qualitative characteristics (idiosyncrasies of a language in a specific context/domain) are responsible for performance disparities between languages.
Algorithm 1 Gradual Compression
if Teacher Size >> Student Size then
S0: Distill model to teacher assistant
S1: Prune model vocabulary and
fine-tune for 1-3 epochs (if needed).
S2: Prune model depth and distill.
S3: Prune model width and re-distill.
S4.1: Optimize computational graph.
S4.2: Apply 8-bit dynamic quantization.
5 Model Compression
5.1 Methodology
To compress and accelerate the inference of fine-
tuned transformer-based models we adopt gradual
compression, a pipeline that combines structured
pruning, knowledge distillation, and post-training
quantization to progressively reach the desired com-
pression rate, summarized in Algorithm 1.13
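Step S4.2 of Algorithm 1 (post-training 8-bit dynamic quantization) can be expressed directly with PyTorch, as in the hedged sketch below; graph optimization (S4.1) is tool-dependent and omitted here.

```python
# Post-training 8-bit dynamic quantization of the fine-tuned student's linear layers.
import torch

def quantize_student(model: torch.nn.Module) -> torch.nn.Module:
    model.eval()
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```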
Step 0 — Teacher Assistant:
In case the teacher
is very large and the desired compression rate is
high (e.g., reducing the large version of C-XLM
to the tiny one), teacher assistants (Mirzadeh et al.,
2020) are used to make the transition smoother.
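The distillation steps (S0, S2, S3) rely on a teacher-student objective; the sketch below shows a standard soft-target formulation with temperature plus hard-label cross-entropy. This formulation, the temperature, and the loss weighting are assumptions, since the exact objective used in the pipeline is not specified here.

```python
# Hedged sketch of a standard distillation loss for teacher -> (assistant/student) transfer.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: match the teacher's tempered output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: usual cross-entropy on the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```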
12 Refer to Appendix B for detailed results.
13 See additional details and results from preliminary experiments in Appendix B.