EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form
Summarization in the Legal Domain
Dennis Aumiller∗†, Ashish Chouhan∗†‡ and Michael Gertz†
†Institute of Computer Science, Heidelberg University
‡School of Information, Media and Design, SRH Hochschule Heidelberg
{aumiller, chouhan, gertz}@informatik.uni-heidelberg.de
∗These authors contributed equally to this work.
Abstract
Existing summarization datasets come with
two main drawbacks: (1) They tend to focus
on overly exposed domains, such as news ar-
ticles or wiki-like texts, and (2) are primarily
monolingual, with few multilingual datasets.
In this work, we propose a novel dataset,
called EUR-Lex-Sum, based on manually cu-
rated document summaries of legal acts from
the European Union law platform (EUR-Lex).
Documents and their respective summaries ex-
ist as cross-lingual paragraph-aligned data in
several of the 24 official European languages,
enabling access to various cross-lingual and
lower-resourced summarization setups. We
obtain up to 1,500 document/summary pairs
per language, including a subset of 375 cross-
lingually aligned legal acts with texts available
in all 24 languages.
In this work, the data acquisition process is de-
tailed and key characteristics of the resource
are compared to existing summarization re-
sources. In particular, we illustrate challeng-
ing sub-problems and open questions on the
dataset that could facilitate future research in the direction of domain-specific cross-lingual summarization. Given the extreme length and language diversity of samples, we further conduct experiments with suitable extractive monolingual and cross-lingual baselines as a starting point for future work.
Code for the extraction, as well as access to our data and baselines, is available online at: https://github.com/achouhan93/eur-lex-sum.
1 Introduction
Despite a long history in the field of text summa-
rization (Luhn,1958), current systems in the area
are still mainly targeted towards a few select do-
mains. This stems in part from the homogeneity
of existing summarization datasets and extraction
processes: frequently, these are either collected
from news articles (Over and Yen,2004;Sandhaus,
2008;Hermann et al.,2015;Narayan et al.,2018;
Grusky et al.,2018;Hasan et al.,2021) or wiki-
style knowledge bases (Ladhak et al.,2020;Frefel,
2020), where alignment with supposed “summaries”
is particularly straightforward. Domain outliers do
exist, e.g., for scientific literature (Cachola et al.,
2020) or the legal domain (Gebendorfer and Elnag-
gar,2018;Kornilova and Eidelman,2019;Manor
and Li,2019;Bhattacharya et al.,2019), but are
primarily restricted to the English language or do
not contain finer-grained alignments between cross-
lingual documents.
The reasons for using these predominant domains are manifold: data is reasonably accessible throughout the internet, can be automatically
extracted, and the structure naturally lends itself
to the extraction of excerpts that can be seen as a
form of summarization. For news articles, short
snippets (or headlines) describing the gist of main
article texts are quite common. Wikipedia articles have an introductory paragraph that has been framed as a “summary” of the remaining article (Frefel, 2020),
whereas others utilize scholarly abstracts (or vari-
ants thereof) as extreme summaries of academic
texts (Cachola et al.,2020).
For a variety of reasons, using these datasets as
a training resource for summarization systems in-
troduces (unwanted) biases. Examples include ex-
treme lead bias (Zhu et al.,2021), focus on ex-
tremely short input/output texts (Narayan et al.,
2018), or high overlap in the document con-
tents (Nallapati et al.,2016). Models trained in
such a fashion also tend to score quite well in zero-shot evaluation on datasets from similar domains, but generalize poorly to samples that follow a different content distribution or require longer summaries.
Simultaneously, high-quality multilingual and
cross-lingual data for training summarization sys-
tems is scarce, particularly for datasets including
more than two languages. Existing resources are
often constructed in similar fashion to their mono-
lingual counterparts (Scialom et al.,2020;Varab
and Schluter, 2021) and consequently share the same shortcomings of low data quality.
Our main contribution in this work is the construction of a novel multi- and cross-lingual corpus of reference texts and human-written summaries, extracted from legal acts of the European Union (EU). Aside from a varying number of training samples per language, we provide a paragraph-aligned validation and test set across all 24 official languages of the European Union,¹ which further enables cross-lingual evaluation settings.

¹https://eur-lex.europa.eu/content/help/eurlex-content/linguistic-coverage.html, last accessed: 2022-06-15
2 Related Work
Related works can generally be categorized into works about EU data or, more broadly, about summarization in the legal domain. Aside from that, we
also compare our research to other existing multi-
and cross-lingual works for text summarization.
2.1 The EU as a Data Source
Data generated by the European Union has been
utilized extensively in other sub-fields of Natural
Language Processing. The most prominent exam-
ple is probably the Europarl corpus (Koehn,2005),
consisting of sentence-aligned translated texts gen-
erated from transcripts of the European Parliament
proceedings, frequently used in Machine Transla-
tion systems due to its size and language coverage.
In a similar fashion to the parliament transcripts, the European Union has a dedicated web platform for legal acts, case law, and treaties, called EUR-Lex (Bernet and Berteloot, 2006),² which we will
refer to as the EUR-Lex platform. Data from the
EUR-Lex platform has previously been utilized
as a resource for extreme multi-label classifica-
tion (Loza Mencía and Fürnkranz,2010), most
recently including an updated version by Chalkidis
et al. (2019a,b). In particular, the MultiEURLEX
dataset (Chalkidis et al., 2021) extends the monolingual resource to a multilingual one, but does not move beyond the classification of EuroVoc labels. To our knowledge, document summaries of
legal acts from the platform have recently been
used as a monolingual English training resource for summarization systems (Klaus et al., 2022).

²Most recent URL: https://eur-lex.europa.eu, last accessed: 2022-06-15
2.2 Processing of Long Legal Texts
Recently, Transformer-based models using sparse attention have been proposed to handle longer documents (Beltagy et al., 2020; Zaheer et al., 2020a).
However, the content structure is not explicitly con-
sidered in current models. Yang et al. (2020) pro-
posed a hierarchical Transformer model, SMITH,
that incrementally encodes increasingly larger text
blocks. Given the lengthy nature of legal texts, Aumiller et al. (2021) investigate methods to separate content into topically coherent segments, which
can benefit the processing of unstructured and het-
erogeneous documents in long-form processing set-
tings with limited context. From a data perspective,
Kornilova and Eidelman (2019) propose BillSum,
a resource based on US and California bill texts,
spanning approximately 5,000 to 20,000 characters in length. For the aforementioned En-
glish summarization corpus based on the EUR-Lex
platform, Klaus et al. (2022) utilize an automati-
cally aligned text corpus for fine-tuning BERT-like
Transformer models on an extractive summariza-
tion objective. Their best-performing approach is a
hybrid solution that prefaces the Transformer sys-
tem with a TextRank-based pre-filtering step.
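To make the pre-filtering idea concrete, the following is a minimal sketch of a TextRank-style sentence filter in Python. It illustrates the general technique rather than the exact implementation of Klaus et al. (2022), assumes scikit-learn and networkx are available, and uses a deliberately naive sentence splitter to stay self-contained.

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def textrank_prefilter(document: str, keep_ratio: float = 0.3) -> str:
    """Keep only the most central sentences so a length-limited model can process the input."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    if len(sentences) < 3:
        return document

    # Sentence similarity graph based on TF-IDF cosine similarity.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    graph = nx.from_numpy_array(cosine_similarity(tfidf))

    # PageRank scores correspond to sentence centrality (TextRank).
    scores = nx.pagerank(graph)
    n_keep = max(1, int(len(sentences) * keep_ratio))
    top_ids = sorted(sorted(scores, key=scores.get, reverse=True)[:n_keep])

    # Re-assemble the retained sentences in their original order.
    return ". ".join(sentences[i] for i in top_ids) + "."

Such a filter reduces an arbitrarily long document to a budgeted extract before any length-limited Transformer component is applied.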
2.3 Datasets for Multi- or Cross-lingual
Summarization
For Cross-lingual Summarization (XLS), Wang
et al. (2022b) provide an extensive survey on
the currently available methods, datasets, and
prospects. Resources for XLS can be divided into
two primary categories: synthetic datasets and web-
native multilingual resources. For the former, sam-
ples are created by directly translating summaries
from a given source language to a separate target
language. Examples include English-Chinese (and
vice versa) by Zhu et al. (2019), and an English-
German resource (Bai et al.,2021). Both works
utilize news articles for data and neural MT sys-
tems for the translation. In contrast, there also exist
web-native multilingual datasets, where both ref-
erences and summaries were obtained primarily
from parallel website data. Global Voices (Nguyen
and Daumé III,2019), XWikis (Perez-Beltrachini
and Lapata,2021), Spektrum (Fatima and Strube,
2021), and CLIDSUM (Wang et al.,2022a) repre-
sent instances of datasets for the news, encyclopedic, and dialogue domains, with differing numbers
of supported languages.
We have previously mentioned some of the multilingual summarization resources that cover multiple languages. MLSUM (Scialom et al.,
2020) is based on news articles in six languages,
however, without cross-lingual alignments. Sim-
ilarly without alignments, but larger in scale, is
MassiveSumm (Varab and Schluter, 2021). XL-Sum (Hasan et al., 2021) does provide document-aligned news articles in 44 distinct languages, extracted from translated articles published by the BBC. In particular, their work also provides
translations in several lower-resourced Asian lan-
guages. WikiLingua (Ladhak et al., 2020) sits between the multi- and cross-lingual settings: some weak alignments exist, but only to English references, not between the other languages themselves.
3 The EUR-Lex-Sum Dataset
We present a novel dataset based on available
multilingual document summaries from the EUR-
Lex platform. The final dataset, which we ti-
tle “EUR-Lex-Sum”, consists of up to 1,500 docu-
ment/summary pairs per language. For comparable
validation and test splits, we identified a subset of
375 cross-lingually aligned legal acts that are avail-
able in all 24 languages. In this section, the data
acquisition process is detailed, followed by a brief
exploratory analysis of the documents and their
content. Finally, key intrinsic characteristics of
the resource are compared to those of existing summarization resources. In short, we find that the combination of human-written summaries and comparatively long source and summary texts makes this dataset a suitable resource for evaluating a less common summarization setting, especially for long-form tasks.
3.1 Dataset Creation
The EUR-Lex platform provides access to various
legal documents published by organs within the
European Union. In particular, we focus on cur-
rently enforced EU legislation (legal acts) for the
20 domains from the EUR-Lex platform.³ From
the mentioned link, direct access to lists of pub-
lished legal acts associated with a particular do-
main is available, which forms the starting point
for our later crawling step. Notably, each of these
domains also provides a diverse set of specific keywords, topics, and regulations, which ensures a high level of diversity even within the dataset.

³https://eur-lex.europa.eu/browse/directories/legislation.html, last accessed: 2022-06-21
A legal act is uniquely identified by the so-called
Celex ID, composed of codes for the respective sec-
tor, year and document type. The ID is consistent
across all 24 languages, which makes it possible
to align articles on a document level. Across all
20 domains, the website reports a total of 26,468
legal acts spanning from 1952 until 2022. How-
ever, since a particular legal act may be assigned to multiple domains, approximately 22,000 unique legal acts can be extracted
from the platform. We do not consider EU case law
and treaties, which are also available through the
EUR-Lex platform, but in other document formats.
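To illustrate how the Celex ID enables document-level alignment, the sketch below constructs candidate content URLs for all 24 official languages of a single act. The URL pattern is an assumption made purely for illustration and may differ from the link construction used in our released crawling code.

OFFICIAL_LANGUAGES = [
    "BG", "ES", "CS", "DA", "DE", "ET", "EL", "EN", "FR", "GA", "HR", "IT",
    "LV", "LT", "HU", "MT", "NL", "PL", "PT", "RO", "SK", "SL", "FI", "SV",
]


def language_versions(celex_id: str) -> dict:
    """Map each official language to the content page of one legal act."""
    # Assumed EUR-Lex URL pattern, shown here only to illustrate the alignment idea.
    base = "https://eur-lex.europa.eu/legal-content/{lang}/TXT/?uri=CELEX:{cid}"
    return {lang: base.format(lang=lang, cid=celex_id) for lang in OFFICIAL_LANGUAGES}


# All language versions of one act share the same Celex ID (here: 32016R0679),
# which is what makes document-level (and later paragraph-level) alignment possible.
urls = language_versions("32016R0679")
print(urls["DE"])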
3.1.1 Crawling
The web page of a particular legal act contains the following content relevant for a summarization setting: (1) the published text of the particular legal act in various file formats, (2) metadata about the legal act, such as publication year, associated treaties, etc., (3) links to the content pages in other official languages, and (4) a link to an associated summary document, if available.
This work contributes a dataset of legal act contents and their respective summaries in different languages. Therefore, crawling over
the entirety of published legal acts gives access to
all relevant information needed to extract source
and summary text pairs. Since a single legal act
requires 50 individual web requests to extract files
across all languages, we have a total of around 5.5
million access requests, distributed across the span
of a month between May and June 2022. We dump
the content of all accessed acts in a local Elastic-
search instance, and separately mark documents
without existing associated summaries. This al-
lows the resource to be continually updated in the
future without re-crawling documents that do not
have available summaries.
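The following is a minimal sketch of this storage step, assuming a local Elasticsearch instance and the official Python client (version 8.x); the index name and field layout are illustrative placeholders rather than the exact schema used in our repository.

from typing import Optional

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
INDEX = "eurlex-acts"


def store_act(celex_id: str, language: str, html: str, summary_html: Optional[str]) -> None:
    """Store one language version of a legal act and flag missing summaries."""
    doc = {
        "celex_id": celex_id,
        "language": language,
        "content_html": html,
        "summary_html": summary_html,
        "has_summary": summary_html is not None,
    }
    # One document per (act, language); deterministic IDs make repeated runs idempotent.
    es.index(index=INDEX, id=f"{celex_id}_{language}", document=doc)

Documents stored without summaries stay flagged locally, which supports the continual-update workflow described above.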
3.1.2 Filtering
For further processing, we filter the documents
available through our offline storage. First, some article texts may only be available as scanned (PDF) documents, which compromises text quality; such documents are therefore discarded. For the most consistent repre-
sentation, we choose to limit ourselves to articles
present in an HTML document, with further ad-
vantages explained in Section 4.1. Availability of