EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form
Summarization in the Legal Domain
Dennis Aumiller∗†, Ashish Chouhan∗†‡ and Michael Gertz†
†Institute of Computer Science, Heidelberg University
‡School of Information, Media and Design, SRH Hochschule Heidelberg
{aumiller, chouhan, gertz}@informatik.uni-heidelberg.de
∗These authors contributed equally to this work.
Abstract
Existing summarization datasets come with
two main drawbacks: (1) They tend to focus
on overly exposed domains, such as news ar-
ticles or wiki-like texts, and (2) are primarily
monolingual, with few multilingual datasets.
In this work, we propose a novel dataset,
called EUR-Lex-Sum, based on manually cu-
rated document summaries of legal acts from
the European Union law platform (EUR-Lex).
Documents and their respective summaries ex-
ist as cross-lingual paragraph-aligned data in
several of the 24 official European languages,
enabling access to various cross-lingual and
lower-resourced summarization setups. We
obtain up to 1,500 document/summary pairs
per language, including a subset of 375 cross-
lingually aligned legal acts with texts available
in all 24 languages.
In this work, the data acquisition process is de-
tailed and key characteristics of the resource
are compared to existing summarization re-
sources. In particular, we illustrate challeng-
ing sub-problems and open questions on the
dataset that could facilitate future research in the direction of domain-specific cross-lingual summarization. Given the extreme length and language diversity of samples, we further conduct experiments with suitable extractive monolingual and cross-lingual baselines as a starting point for future work.
Code for the extraction, as well as access to our data and baselines, is available online at: https://github.com/achouhan93/eur-lex-sum.
1 Introduction
Despite a long history in the field of text summa-
rization (Luhn,1958), current systems in the area
are still mainly targeted towards a few select do-
mains. This stems in part from the homogeneity
of existing summarization datasets and extraction
processes: frequently, these are either collected
from news articles (Over and Yen,2004;Sandhaus,
2008;Hermann et al.,2015;Narayan et al.,2018;
Grusky et al.,2018;Hasan et al.,2021) or wiki-
style knowledge bases (Ladhak et al.,2020;Frefel,
2020), where alignment with supposed “summaries”
is particularly straightforward. Domain outliers do
exist, e.g., for scientific literature (Cachola et al.,
2020) or the legal domain (Gebendorfer and Elnag-
gar,2018;Kornilova and Eidelman,2019;Manor
and Li,2019;Bhattacharya et al.,2019), but are
primarily restricted to the English language or do
not contain finer-grained alignments between cross-
lingual documents.
The reasons for using these predominant domains are manifold: data is reasonably accessible throughout the internet, can be automatically
extracted, and the structure naturally lends itself
to the extraction of excerpts that can be seen as a
form of summarization. For news articles, short
snippets (or headlines) describing the gist of main
article texts are quite common. Wikipedia articles have an introductory paragraph that has been framed as a “summary” of the remaining article (Frefel, 2020),
whereas others utilize scholarly abstracts (or vari-
ants thereof) as extreme summaries of academic
texts (Cachola et al.,2020).
For a variety of reasons, using these datasets as
a training resource for summarization systems in-
troduces (unwanted) biases. Examples include ex-
treme lead bias (Zhu et al.,2021), focus on ex-
tremely short input/output texts (Narayan et al.,
2018), or high overlap in the document con-
tents (Nallapati et al.,2016). Models trained in
such a fashion also tend to score quite well in zero-shot evaluation on datasets from similar domains, but generalize poorly to samples that follow a different content distribution or require longer summaries.
Simultaneously, high-quality multilingual and
cross-lingual data for training summarization sys-
tems is scarce, particularly for datasets including
more than two languages. Existing resources are
often constructed in similar fashion to their mono-
lingual counterparts (Scialom et al.,2020;Varab
and Schluter, 2021) and consequently share the same shortcomings of low data quality.
Our main contribution in this work is the construction of a novel multi- and cross-lingual corpus of reference texts and human-written summaries, extracted from legal acts of the European Union (EU). Aside from a varying number of training samples per language, we provide a paragraph-aligned validation and test set across all 24 official languages of the European Union,¹ which further enables cross-lingual evaluation settings.

¹https://eur-lex.europa.eu/content/help/eurlex-content/linguistic-coverage.html, last accessed: 2022-06-15
2 Related Work
Related works can generally be categorized into works about EU data or, more broadly, about summarization in the legal domain. Aside from that, we
also compare our research to other existing multi-
and cross-lingual works for text summarization.
2.1 The EU as a Data Source
Data generated by the European Union has been
utilized extensively in other sub-fields of Natural
Language Processing. The most prominent exam-
ple is probably the Europarl corpus (Koehn,2005),
consisting of sentence-aligned translated texts gen-
erated from transcripts of the European Parliament
proceedings, frequently used in Machine Transla-
tion systems due to its size and language coverage.
In a similar fashion to the parliament transcripts, the European Union has a dedicated web platform for legal acts, case law, and treaties, called EUR-Lex (Bernet and Berteloot, 2006),² which we will
refer to as the EUR-Lex platform. Data from the
EUR-Lex platform has previously been utilized
as a resource for extreme multi-label classifica-
tion (Loza Mencía and Fürnkranz,2010), most
recently including an updated version by Chalkidis
et al. (2019a,b). In particular, the MultiEURLEX
dataset (Chalkidis et al., 2021) extends the monolingual resource to a multilingual one, but does not move beyond the classification of EuroVoc labels. To our knowledge, document summaries of
legal acts from the platform have recently been
used as a monolingual English training resource for summarization systems (Klaus et al., 2022).

²Most recent URL: https://eur-lex.europa.eu, last accessed: 2022-06-15
2.2 Processing of Long Legal Texts
Recently, Transformer-based models using sparse attention have been proposed to handle longer documents (Beltagy et al., 2020; Zaheer et al., 2020a).
However, the content structure is not explicitly con-
sidered in current models. Yang et al. (2020) pro-
posed a hierarchical Transformer model, SMITH,
that incrementally encodes increasingly larger text
blocks. Given the lengthy nature of legal texts, Aumiller et al. (2021) investigate methods to separate content into topically coherent segments, which
can benefit the processing of unstructured and het-
erogeneous documents in long-form processing set-
tings with limited context. From a data perspective,
Kornilova and Eidelman (2019) propose BillSum,
a resource based on US and California bill texts,
spanning approximately 5,000 to 20,000 characters in length. For the aforementioned En-
glish summarization corpus based on the EUR-Lex
platform, Klaus et al. (2022) utilize an automati-
cally aligned text corpus for fine-tuning BERT-like
Transformer models on an extractive summariza-
tion objective. Their best-performing approach is a
hybrid solution that prefaces the Transformer sys-
tem with a TextRank-based pre-filtering step.
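To make the pre-filtering idea concrete, the following is a minimal sketch of a TextRank-style sentence filter in Python. It illustrates the general technique rather than the exact implementation of Klaus et al. (2022), assumes scikit-learn and networkx are available, and uses a deliberately naive sentence splitter to stay self-contained.

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def textrank_prefilter(document: str, keep_ratio: float = 0.3) -> str:
    """Keep only the most central sentences so a length-limited model can process the input."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    if len(sentences) < 3:
        return document

    # Sentence similarity graph based on TF-IDF cosine similarity.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    graph = nx.from_numpy_array(cosine_similarity(tfidf))

    # PageRank scores correspond to sentence centrality (TextRank).
    scores = nx.pagerank(graph)
    n_keep = max(1, int(len(sentences) * keep_ratio))
    top_ids = sorted(sorted(scores, key=scores.get, reverse=True)[:n_keep])

    # Re-assemble the retained sentences in their original order.
    return ". ".join(sentences[i] for i in top_ids) + "."

Such a filter reduces an arbitrarily long document to a budgeted extract before any length-limited Transformer component is applied.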
2.3 Datasets for Multi- or Cross-lingual
Summarization
For Cross-lingual Summarization (XLS), Wang
et al. (2022b) provide an extensive survey on
the currently available methods, datasets, and
prospects. Resources for XLS can be divided into
two primary categories: synthetic datasets and web-
native multilingual resources. For the former, sam-
ples are created by directly translating summaries
from a given source language to a separate target
language. Examples include English-Chinese (and
vice versa) by Zhu et al. (2019), and an English-
German resource (Bai et al.,2021). Both works
utilize news articles for data and neural MT sys-
tems for the translation. In contrast, there also exist
web-native multilingual datasets, where both ref-
erences and summaries were obtained primarily
from parallel website data. Global Voices (Nguyen
and Daumé III,2019), XWikis (Perez-Beltrachini
and Lapata,2021), Spektrum (Fatima and Strube,
2021), and CLIDSUM (Wang et al.,2022a) repre-
sent instances of datasets for the news, encyclopedic, and dialogue domains, with differing numbers
of supported languages.
We have previously mentioned some of the multilingual summarization resources that cover multiple languages. MLSUM (Scialom et al.,
2020) is based on news articles in six languages,
however, without cross-lingual alignments. Sim-
ilarly without alignments, but larger in scale, is
MassiveSumm (Varab and Schluter, 2021). XL-Sum (Hasan et al., 2021) does provide document-aligned news articles in 44 distinct languages, extracted from translated articles published by the BBC. In particular, their work also provides
translations in several lower-resourced Asian lan-
guages. WikiLingua (Ladhak et al., 2020) sits between the multi- and cross-lingual settings: some weak alignments exist, but only to English references, not between the other languages themselves.
3 The EUR-Lex-Sum Dataset
We present a novel dataset based on available
multilingual document summaries from the EUR-
Lex platform. The final dataset, which we ti-
tle “EUR-Lex-Sum”, consists of up to 1,500 docu-
ment/summary pairs per language. For comparable
validation and test splits, we identified a subset of
375 cross-lingually aligned legal acts that are avail-
able in all 24 languages. In this section, the data
acquisition process is detailed, followed by a brief
exploratory analysis of the documents and their
content. Finally, key intrinsic characteristics of
the resource are compared to those of existing summarization resources. In short, we find that the combination of human-written summaries and comparatively long source and summary texts makes this dataset a suitable resource for evaluating a less common summarization setting, especially for long-form tasks.
3.1 Dataset Creation
The EUR-Lex platform provides access to various
legal documents published by organs within the
European Union. In particular, we focus on cur-
rently enforced EU legislation (legal acts) for the
20 domains from the EUR-Lex platform.³ From
the mentioned link, direct access to lists of pub-
lished legal acts associated with a particular do-
main is available, which forms the starting point
for our later crawling step. Notably, each of these
domains also provides a diverse set of specific keywords, topics, and regulations, which ensures a high level of diversity even within the dataset.

³https://eur-lex.europa.eu/browse/directories/legislation.html, last accessed: 2022-06-21
A legal act is uniquely identified by the so-called
Celex ID, composed of codes for the respective sec-
tor, year and document type. The ID is consistent
across all 24 languages, which makes it possible
to align articles on a document level. Across all
20 domains, the website reports a total of 26,468
legal acts spanning from 1952 until 2022. How-
ever, since a particular legal act may be assigned to multiple domains, approximately 22,000 unique legal acts can be extracted
from the platform. We do not consider EU case law
and treaties, which are also available through the
EUR-Lex platform, but in other document formats.
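To illustrate how the Celex ID enables document-level alignment, the sketch below constructs candidate content URLs for all 24 official languages of a single act. The URL pattern is an assumption made purely for illustration and may differ from the link construction used in our released crawling code.

OFFICIAL_LANGUAGES = [
    "BG", "ES", "CS", "DA", "DE", "ET", "EL", "EN", "FR", "GA", "HR", "IT",
    "LV", "LT", "HU", "MT", "NL", "PL", "PT", "RO", "SK", "SL", "FI", "SV",
]


def language_versions(celex_id: str) -> dict:
    """Map each official language to the content page of one legal act."""
    # Assumed EUR-Lex URL pattern, shown here only to illustrate the alignment idea.
    base = "https://eur-lex.europa.eu/legal-content/{lang}/TXT/?uri=CELEX:{cid}"
    return {lang: base.format(lang=lang, cid=celex_id) for lang in OFFICIAL_LANGUAGES}


# All language versions of one act share the same Celex ID (here: 32016R0679),
# which is what makes document-level (and later paragraph-level) alignment possible.
urls = language_versions("32016R0679")
print(urls["DE"])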
3.1.1 Crawling
The web page of a particular legal act contains the following content relevant for a summarization setting: (1) the published text of the particular legal act in various file formats, (2) metadata about the legal act, such as publication year, associated treaties, etc., (3) links to the content pages in other official languages, and (4) a link to an associated summary document, if available.
This work contributes a dataset of legal act contents and their respective summaries in different languages. Therefore, crawling over
the entirety of published legal acts gives access to
all relevant information needed to extract source
and summary text pairs. Since a single legal act
requires 50 individual web requests to extract files
across all languages, we have a total of around 5.5
million access requests, distributed across the span
of a month between May and June 2022. We dump
the content of all accessed acts in a local Elastic-
search instance, and separately mark documents
without existing associated summaries. This al-
lows the resource to be continually updated in the
future without re-crawling documents that do not
have available summaries.
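The following is a minimal sketch of this storage step, assuming a local Elasticsearch instance and the official Python client (version 8.x); the index name and field layout are illustrative placeholders rather than the exact schema used in our repository.

from typing import Optional

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
INDEX = "eurlex-acts"


def store_act(celex_id: str, language: str, html: str, summary_html: Optional[str]) -> None:
    """Store one language version of a legal act and flag missing summaries."""
    doc = {
        "celex_id": celex_id,
        "language": language,
        "content_html": html,
        "summary_html": summary_html,
        "has_summary": summary_html is not None,
    }
    # One document per (act, language); deterministic IDs make repeated runs idempotent.
    es.index(index=INDEX, id=f"{celex_id}_{language}", document=doc)

Documents stored without summaries stay flagged locally, which supports the continual-update workflow described above.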
3.1.2 Filtering
For further processing, we filter the documents
available through our offline storage. First, some article texts may only be available as scanned (PDF) documents, which compromises text quality; such documents are therefore discarded. For the most consistent repre-
sentation, we choose to limit ourselves to articles
present in an HTML document, with further ad-
vantages explained in Section 4.1. Availability of