
more than two languages. Existing resources are often constructed in similar fashion to their monolingual counterparts (Scialom et al., 2020; Varab and Schluter, 2021) and subsequently share the same shortcomings of low data quality.
Our main contribution in this work is the construction of a novel multi- and cross-lingual corpus of reference texts and human-written summaries, extracted from legal acts of the European Union (EU). Aside from a varying number of training samples per language, we provide a paragraph-aligned validation and test set across all 24 official languages of the European Union,¹ which further enables cross-lingual evaluation settings.
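To illustrate how paragraph alignment enables cross-lingual settings, consider the following minimal Python sketch; the field names and layout are hypothetical illustrations, not the dataset's actual schema:

# Hypothetical layout of a paragraph-aligned sample; field names are
# illustrative assumptions, not the dataset's actual schema.
sample = {
    "reference": {  # aligned reference paragraphs per language
        "en": ["First paragraph ...", "Second paragraph ..."],
        "de": ["Erster Absatz ...", "Zweiter Absatz ..."],
    },
    "summary": {  # human-written summaries per language
        "en": "English summary ...",
        "de": "Deutsche Zusammenfassung ...",
    },
}

# Alignment across languages directly yields cross-lingual pairs,
# e.g., a German input text with an English target summary:
xls_pair = (sample["reference"]["de"], sample["summary"]["en"])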
2 Related Work
Related work can generally be categorized into works about EU data, or more broadly about summarization in the legal domain. Aside from that, we also compare our research to other existing multi- and cross-lingual works on text summarization.
2.1 The EU as a Data Source
Data generated by the European Union has been utilized extensively in other sub-fields of Natural Language Processing. The most prominent example is probably the Europarl corpus (Koehn, 2005), consisting of sentence-aligned translated texts generated from transcripts of European Parliament proceedings, which is frequently used in Machine Translation systems due to its size and language coverage.
In similar fashion to the parliament transcripts, the European Union has a dedicated web platform for legal acts, case law and treaties, called EUR-Lex (Bernet and Berteloot, 2006),² which we will refer to as the EUR-Lex platform. Data from the EUR-Lex platform has previously been utilized as a resource for extreme multi-label classification (Loza Mencía and Fürnkranz, 2010), most recently including an updated version by Chalkidis et al. (2019a,b). In particular, the MultiEURLEX dataset (Chalkidis et al., 2021) extends the monolingual resource to a multilingual one; however, it does not move beyond the classification of EuroVoc la-
bels. To our knowledge, document summaries of legal acts from the platform have recently been used as a monolingual English training resource for summarization systems (Klaus et al., 2022).

¹ https://eur-lex.europa.eu/content/help/eurlex-content/linguistic-coverage.html, last accessed: 2022-06-15
² Most recent URL: https://eur-lex.europa.eu, last accessed: 2022-06-15
2.2 Processing of Long Legal Texts
Recently, transformer-based models using sparse attention have been proposed to handle longer documents (Beltagy et al., 2020; Zaheer et al., 2020a). However, current models do not explicitly consider the content structure of a document. Yang et al. (2020) proposed a hierarchical Transformer model, SMITH, that incrementally encodes increasingly larger text blocks. Given the lengthy nature of legal texts, Aumiller et al. (2021) investigate methods to separate content into topically coherent segments, which can benefit the processing of unstructured and heterogeneous documents in long-form processing settings with limited context. From a data perspective, Kornilova and Eidelman (2019) propose BillSum, a resource based on US and California bill texts, ranging from approximately 5,000 to 20,000 characters in length. For the aforementioned English summarization corpus based on the EUR-Lex platform, Klaus et al. (2022) utilize an automatically aligned text corpus for fine-tuning BERT-like Transformer models on an extractive summarization objective. Their best-performing approach is a hybrid solution that prefaces the Transformer system with a TextRank-based pre-filtering step.
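As a rough illustration of such a pre-filtering step, the following sketch ranks sentences with a TextRank-style graph algorithm and keeps only the top-scoring ones; this is a minimal approximation under our own assumptions, not the exact implementation of Klaus et al. (2022):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

def textrank_prefilter(sentences, keep=20):
    # Build a sentence-similarity graph and score nodes with PageRank,
    # as in TextRank; return the top sentences in document order.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    graph = nx.from_numpy_array(cosine_similarity(tfidf))
    scores = nx.pagerank(graph)
    top = sorted(scores, key=scores.get, reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]

# The filtered sentences can then be passed to a BERT-like extractive model.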
2.3 Datasets for Multi- or Cross-lingual Summarization
For Cross-lingual Summarization (XLS), Wang et al. (2022b) provide an extensive survey of the currently available methods, datasets, and prospects. Resources for XLS can be divided into two primary categories: synthetic datasets and web-native multilingual resources. For the former, samples are created by directly translating summaries from a given source language into a separate target language. Examples include English-Chinese (and vice versa) by Zhu et al. (2019) and an English-German resource (Bai et al., 2021). Both works utilize news articles as data and neural MT systems for the translation.
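A minimal sketch of this construction, assuming an off-the-shelf MarianMT model from the transformers library in place of the custom MT systems used by the cited works:

from transformers import MarianMTModel, MarianTokenizer

# Translate an English reference summary into German to form a
# synthetic English-to-German cross-lingual sample; the model choice
# is an assumption, not what Zhu et al. (2019) or Bai et al. (2021) used.
name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

article_en = "Full English news article ..."
summary_en = "Its English reference summary ..."

batch = tokenizer([summary_en], return_tensors="pt", truncation=True)
summary_de = tokenizer.decode(model.generate(**batch)[0],
                              skip_special_tokens=True)

xls_sample = (article_en, summary_de)  # English source, German summary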
In contrast, there also exist web-native multilingual datasets, where both references and summaries were obtained primarily from parallel website data. Global Voices (Nguyen and Daumé III, 2019), XWikis (Perez-Beltrachini and Lapata, 2021), Spektrum (Fatima and Strube, 2021), and CLIDSUM (Wang et al., 2022a) represent instances of datasets for the news, encyclopedic, and dialogue domains, with differing numbers