Extractive Summarization of Legal Decisions using Multi-task Learning
and Maximal Marginal Relevance
Abhishek Agarwal and Shanshan Xu and Matthias Grabmair
Technical University of Munich, Germany
{abhishek.agarwal, shanshan.xu, matthias.grabmair}@tum.de
Abstract
Summarizing legal decisions requires the expertise of law practitioners, which is both time- and cost-intensive. This paper presents techniques for extractive summarization of legal decisions in a low-resource setting using limited expert-annotated data. We test a set of models that locate relevant content using a sequential model and tackle redundancy by leveraging maximal marginal relevance to compose summaries. We also demonstrate an implicit approach to help train our proposed models to generate more informative summaries. Our multi-task learning model variant leverages rhetorical role identification as an auxiliary task to further improve the summarizer. We perform extensive experiments on datasets containing legal decisions from the US Board of Veterans' Appeals and conduct quantitative and expert-ranked evaluations of our models. Our results show that the proposed approaches can achieve ROUGE scores vis-à-vis expert extracted summaries that match those achieved by inter-annotator comparison.
1 Introduction
In common-law systems, law practitioners research large numbers of legal decisions from past cases to find similar precedents that justify their arguments and lead to favorable outcomes. The analysis can be time-consuming and expensive, as these documents are long and verbose, and understanding them requires legal expertise. Automatic summarization of legal documents can help expedite the process cost-effectively. However, the limited availability of expert-annotated summaries makes it challenging to design such automated systems to assist paralegals, lawyers, and other law practitioners.
Extractive summarization aims to identify and extract essential sentences from the source document to compose the corresponding summary. It is more common in the legal domain due to the complexity of legal language and the scarcity of labeled data. By contrast, abstractive summarization generates an abstract representation that captures the salient ideas of the source text and might contain new words and phrases not present in the source document.
One of the main challenges of extractive summarization is the redundancy in legal documents, as legal decisions can often contain several semantically similar sentences. Our objective is to generate summaries that provide maximum information while minimizing redundancy. Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) has proved to be an effective tool to tackle redundancy explicitly (Zhong et al., 2019) by balancing the importance of query relevance and diversity. More recent methods like MMR-Select use a neural model's confidence scores as a substitute for query relevance. Additionally, the neural model can be trained to handle redundancy implicitly by adding a redundancy loss term (Xiao and Carenini, 2020).
Another challenge is the low availability of expert-annotated summarization datasets in the legal domain. In this work, we leverage large amounts of unlabeled data along with the small annotated datasets to gain maximum performance. Pre-trained transformers like BERT (Devlin et al., 2019) can improve the performance of downstream tasks, such as summarization, even with limited labeled data. However, such models trained on the general domain may fail to capture the intricacies of the domain-specific vocabulary used in legal decisions. The domain-specific variants of BERT (Chalkidis et al., 2020; Zheng et al., 2021) pre-trained on large corpora of legal texts can better embed legal terms and achieve robust performance in various legal-specific downstream tasks like argument mining (Xu et al., 2021), rhetorical role labeling (Bhattacharya et al., 2021a), and legal citation recommendation (Huang et al., 2021).
To maximize summarization performance, we also leverage Multi-task Learning (MTL) by aggregating training samples from several smaller datasets of multiple related tasks. MTL helps the model learn shared representations between the primary task (summarization) and the auxiliary task (rhetorical role identification) to generalize better. Rhetorical role identification involves determining the function of different sentences in order to understand the underlying reasoning and argument patterns in legal decisions. Previous works have often used rhetorical role labeling as a precursor to extractive summarization to improve performance (Zhong et al., 2019; Bhattacharya et al., 2021b). In this paper, we explore the idea of using rhetorical role identification as an auxiliary task to augment our annotated dataset and help generate better summaries.
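To make the multi-task setup concrete, the sketch below shows one common way of sharing a sequence encoder between two sentence-level tasks. It is purely illustrative: the layer sizes, head names, and the use of a Bi-GRU encoder here are assumptions for exposition, not our exact experimental configuration.

```python
import torch
import torch.nn as nn

class MultiTaskSummarizer(nn.Module):
    """Shared Bi-GRU encoder with one head per task (illustrative sizes)."""

    def __init__(self, emb_dim=768, hidden=256):
        super().__init__()
        # Shared sequence encoder over the sentence embeddings of a decision.
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        # Task-specific heads: both are binary sentence classifiers.
        self.summary_head = nn.Linear(2 * hidden, 1)     # in summary / not
        self.rhetorical_head = nn.Linear(2 * hidden, 1)  # Evidence/Reasoning / Other

    def forward(self, sent_embs, task):
        # sent_embs: (batch, num_sentences, emb_dim)
        states, _ = self.encoder(sent_embs)
        head = self.summary_head if task == "summarization" else self.rhetorical_head
        return head(states).squeeze(-1)  # per-sentence logits
```

Because both heads read the same encoder states, gradients from the auxiliary rhetorical role task shape the representation used by the summarizer.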
In brief, we consider our contributions to the extractive summarization of legal documents as follows:

• We generate informative summaries with maximum information and minimum redundancy in a low-resource setting. Our experiments demonstrate a general improvement in ROUGE scores for the proposed approaches.
• We further improve the summarizer in a multi-task setting by combining extractive summarization and rhetorical role labeling. The quantitative evaluation demonstrates that the multi-task models perform better than the single-task models.
• We evaluate the generated summaries qualitatively with the help of a legal expert. In contrast to the quantitative evaluation, the qualitative results show that our proposed approaches rank at least as well as human annotators.[1]

[1] Our code is available here.
2 Related Work
2.1 Extractive Summarization
Galgani et al. (2012) developed a rule-based approach to summarization that uses a knowledge base, statistical information, and other handcrafted features like POS tags, specific legal terms, and citations. Kim et al. (2012) propose a graph-based summarization system that constructs a directed graph for each document, where nodes are assigned weights based on how likely words in a given sentence are to appear in the conclusion of judgments. CaseSummarizer (Polsley et al., 2016), an automated text summarization tool, uses word frequency augmented with additional domain-specific knowledge to score the sentences in the case document. Liu and Chen (2019) propose a classification-based approach that uses several handcrafted features as input. However, such techniques require knowledge engineering of different features and do not tackle redundancy in legal decisions. Recently, various proposed approaches have tried to address redundancy in legal decisions for purposes of summarization. Zhong et al. (2019) iteratively select predictive sentences using a CNN-based train-attribute-mask pipeline, followed by a Random Forest classifier that distinguishes between sentences containing Reasoning/EvidentialSupport and other types; MMR then selects the final sentences for the summary. Bhattacharya et al. (2021b) demonstrate an unsupervised approach named DELSumm that generates extractive summaries by incorporating guidelines from legal experts into an optimization problem that maximizes informativeness and content-word coverage as well as conciseness. In this work, we use an MMR-based variant that tackles redundancy explicitly and can be combined with a neural classifier to generate summaries. It alleviates the need to engineer handcrafted features or specific expert guidelines to prevent redundancy.
2.2 Rhetorical Role Labeling
Saravanan and Ravindran (2010) propose a rule-based system along with a Conditional Random Field (CRF) approach to identify the different rhetorical segments in legal judgments. Nejadgholi et al. (2017) propose a semi-supervised approach to searching legal facts in immigration-specific case documents, using an unsupervised word embedding model to aid the training of a supervised fact-detecting classifier on a small set of annotated sentences. Walker et al. (2019) compare the performance of rule-based scripts and ML algorithms in classifying sentences that state findings of fact. Bhattacharya et al. (2019) explore the use of hierarchical BiLSTM models by adding an attention layer and experiment with pre-trained word and sentence embeddings (Bhattacharya et al., 2021a). Savelka et al. (2021) annotated legal cases from seven countries in six languages using a structural type system and found that Bi-GRU models could generalize across different jurisdictions to some degree. Despite copious work, there are very few annotated rhetorical role datasets in the legal domain. In this work, we use rhetorical role labeling as an auxiliary task to augment our annotated dataset and help generate better summaries.
3 Data
We use the dataset containing single-issue Post-Traumatic Stress Disorder (PTSD) decisions from the US Board of Veterans' Appeals[2] (BVA) by Zhong et al. (2019). These cases focus on veterans' appeals for benefits for a PTSD disability connected to stressful experiences during military service. The dataset is a sample from the BVA database that has been constrained to single-issue cases focusing on PTSD. In the texts, the BVA reviews the available evidence and either makes a finding that it warrants an award for service-connected PTSD (granted) or not (denied), or refers the case back to a lower administrative division for further development (remand). The dataset consists of 112 decisions and the corresponding expert-annotated gold-standard summaries. We have 92 cases (48 remanded, 28 denied, 16 granted) in the training set with one annotated summary each. Another 20 cases (10 remanded, 6 denied, 4 granted) constitute the test set, for which there are four extractive summaries by different annotators and two drafted abstractive summaries. Each annotator composed a 6-10 sentence summary based on predefined guidelines, selecting 1-3 sentences each from the Reasoning and Evidence annotation types. The Reasoning sentences connect the outcome to the facts, while Evidence sentences add more information to support the former.

[2] https://www.bva.va.gov
For rhetorical role labeling, we use two different datasets containing 50 plus 25 annotated BVA decisions[3] (Walker et al., 2019). The decisions in the larger dataset have partial annotations, so we keep only decisions that have annotations for at least 60%[4] of the sentences in each decision. This results in 28 decisions consisting of 17 denied and 11 granted outcomes, while the smaller dataset contains 10 denied and 15 granted decisions. We map the different annotation types of the two datasets to a uniform type system of six annotation types. For our experiments, we merge these into two classes, resulting in 1889 Evidence/Reasoning and 3728 Others sentences. Therefore, the final dataset has 53 decisions with 7473 binary sentence-level annotations.

[3] https://github.com/LLTLab/VetClaims-JSON
[4] The remaining sentences with missing annotations contain sentences from annotation types other than Evidence or Reasoning, so we automatically annotate these sentences with the Others type.
The datasets are from different time periods; therefore, the decisions have a slightly different document structure. We remove meta-information like the case number, dates, judge names, names of the witnesses, and other similar information to obtain a more uniform layout. We keep only the following sections (if present) from each decision in the dataset:

• Order
• Finding of Fact
• Conclusion of Law
• Reasons and Bases for Finding and Conclusion
• Remand
• Reasons for Remand
Additionally, we use the SpaCy[5] pipeline enhanced with additional handcrafted rules to segment the sentences in each document for the summarization dataset.[6] After pre-processing, the average number of sentences per decision in the summarization and rhetorical role labeling datasets is 77.29 (± 52.28) and 118.37 (± 78.33), respectively.

[5] https://spacy.io
[6] To validate the performance of the sentence segmentation, we manually segment 11 decisions (6 remanded, 3 denied, 2 granted) and compare the matches. In terms of Recall, our approach and the legal text sentence segmenter (Savelka et al., 2017) score 0.937 and 0.905, respectively.
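For illustration, a minimal version of such a rule-enhanced segmentation pipeline is sketched below. The abbreviation list and the boundary rule are hypothetical examples; our actual handcrafted rules are more extensive.

```python
import spacy
from spacy.language import Language

# Load a standard English pipeline; the real rule set is more extensive.
nlp = spacy.load("en_core_web_sm")

@Language.component("legal_boundaries")
def legal_boundaries(doc):
    # Hypothetical rule: never start a new sentence right after
    # common legal abbreviations such as statute citations.
    abbreviations = {"C.F.R.", "U.S.C."}
    for token in doc[:-1]:
        if token.text in abbreviations:
            doc[token.i + 1].is_sent_start = False
    return doc

# Custom boundary rules must run before the default parser-based segmenter.
nlp.add_pipe("legal_boundaries", before="parser")

text = "See 38 C.F.R. 3.304. The Board finds the evidence sufficient."
sentences = [sent.text for sent in nlp(text).sents]
```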
4 Our Approach
4.1 Sentence Embeddings
Sentence embeddings map an input sentence to a fixed-size dense vector representation. Sentence-BERT (Reimers and Gurevych, 2019) has recently emerged as an effective tool to derive semantically meaningful sentence embeddings. However, the lack of a domain-specific labeled entailment dataset required to train it makes it inaccessible for us. Alternatively, we can use a BERT model to extract a sentence embedding by pooling the embeddings of each token in the sentence. The mean-pooling operation outputs a 768-dimensional fixed-size representation for each sentence. Such embeddings generalize quite well and provide a good starting point for training our sequential models later. This work uses the legal-domain-specific transformer Legal-BERT (Zheng et al., 2021), trained on the 3,446,187 legal decisions of the Harvard Law case corpus, to generate the sentence embeddings.
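A minimal sketch of this embedding step follows. We assume the publicly available Hugging Face checkpoint zlucia/custom-legalbert for the Legal-BERT variant of Zheng et al. (2021); any BERT-style encoder with 768-dimensional hidden states can be substituted.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint for the Legal-BERT of Zheng et al. (2021).
MODEL_NAME = "zlucia/custom-legalbert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed_sentences(sentences):
    """Return one 768-d mean-pooled embedding per input sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        token_embs = model(**batch).last_hidden_state  # (B, T, 768)
    # Mask out padding tokens before averaging over the sequence.
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    return (token_embs * mask).sum(1) / mask.sum(1)    # (B, 768)

embs = embed_sentences(["The veteran served on active duty.",
                        "Service connection for PTSD is granted."])
```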
4.2 Weighted Loss Function
The conventional cross-entropy loss function for extractive summarization results in poor classification performance due to class imbalance. For each decision, we have on average very few positive labels (5-6 sentences) per summary, which results in a highly imbalanced dataset. To tackle this issue, we use a weighted cross-entropy loss function that puts more emphasis on positive labels by manually rescaling the weights for each class:
$$w_c = \frac{\#\text{samples}}{\#\text{classes} \times \#\text{samples}_c}$$

$$\mathcal{L}_{CE} = -\sum_{c=1}^{M} w_c \, y_{o,c} \log(p_{o,c})$$
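In PyTorch, for example, these class weights can be computed from the label counts and passed directly to the built-in cross-entropy loss; the label counts in the sketch below are illustrative values for a single decision.

```python
import torch
import torch.nn as nn

# Example label counts: very few positive (in-summary) sentences.
labels = torch.tensor([0] * 72 + [1] * 6)   # one decision's sentence labels
n_samples, n_classes = len(labels), 2

# w_c = #samples / (#classes * #samples_c), as defined above.
counts = torch.bincount(labels, minlength=n_classes).float()
weights = n_samples / (n_classes * counts)

loss_fn = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(n_samples, n_classes)  # stand-in model outputs
loss = loss_fn(logits, labels)
```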
4.3 Maximal Marginal Relevance
Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) iteratively (greedily) selects sentences for the summary while balancing query relevance and diversity:
$$\text{MMR} = \arg\max_{s_i \in D \setminus \hat{S}} \left[\, \lambda \,\text{Sim}(s_i, Q) - (1 - \lambda) \max_{s_j \in \hat{S}} \text{Sim}(s_i, s_j) \,\right]$$
The parameter $\lambda$ helps control the redundancy (novelty) in the extracted summary. We use cosine similarity to calculate the similarity between two sentence embeddings. The query $Q$ represents the case document and is obtained by averaging the embeddings of all sentences in the decision. Xiao and Carenini (2020) propose MMR-Select as an alternative approach to eliminate redundancy explicitly. Instead of the greedy method computing the query relevance to find suitable candidates, MMR-Select picks candidate sentences using the confidence scores $P(y_i)$ produced by a neural model, which makes it more robust:
$$\text{MMR-Select} = \arg\max_{s_i \in D \setminus \hat{S}} \left[\, \lambda \, P(y_i) - (1 - \lambda) \max_{s_j \in \hat{S}} \text{Sim}(s_i, s_j) \,\right]$$
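A sketch of the corresponding greedy selection loop is given below; replacing the confidence scores $P(y_i)$ with $\text{Sim}(s_i, Q)$ recovers classic MMR. The summary budget and $\lambda$ value are illustrative defaults, not tuned settings.

```python
import numpy as np

def mmr_select(embeddings, scores, budget=6, lam=0.6):
    """Greedy MMR-Select over embeddings (n, d) and scores P(y_i) of shape (n,)."""
    # Pairwise cosine similarities between sentence embeddings.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T

    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < budget:
        def mmr_score(i):
            # Penalize similarity to the most similar already-selected sentence.
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # restore document order
```

Sorting the selected indices at the end restores the original document order, which helps keep the extractive summary readable.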
4.4 Redundancy Loss
The major limitation of explicit methods like MMR is the disconnect between the sentence scoring and sentence selection phases. Such techniques rely on the classifier to score the sentences in the document and check for redundancy only later, when selecting the final sentences for the summary. Thus, the classifier used to generate the confidence scores does not implicitly learn how to handle redundancy. We can generate more informative summaries by teaching the neural model to avoid picking similar sentences. Xiao and Carenini (2020) propose adding a redundancy loss term $\mathcal{L}_{RD}$ to the cross-entropy loss function that penalizes the model for choosing two similar sentences with high confidence scores. The parameter $\beta$ balances the importance we assign to $\mathcal{L}_{CE}$ and $\mathcal{L}_{RD}$. The neural models tend to classify more sentences as part of the summary for longer case documents. Therefore, we scale the redundancy loss $\mathcal{L}_{RD}$ defined by Xiao and Carenini (2020) to ensure that it does not explode as the length of the document increases, preventing it from overshadowing the cross-entropy loss:
$$\mathcal{L} = \beta \, \mathcal{L}_{CE} + (1 - \beta) \, \mathcal{L}_{RD}$$

$$\mathcal{L}_{RD} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} P(y_i) \, P(y_j) \,\text{Sim}(s_i, s_j)$$
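The combined loss is straightforward to implement; below is a minimal PyTorch sketch for a single document with precomputed pairwise sentence similarities. The value of $\beta$ shown is an illustrative placeholder, not our tuned setting.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, labels, sim, class_weights, beta=0.7):
    """L = beta * L_CE + (1 - beta) * L_RD for one document.

    logits: (n, 2) per-sentence logits, labels: (n,) in {0, 1},
    sim: (n, n) cosine similarities between sentence embeddings.
    """
    l_ce = F.cross_entropy(logits, labels, weight=class_weights)

    # P(y_i): confidence that sentence i belongs to the summary.
    p = torch.softmax(logits, dim=-1)[:, 1]
    n = p.shape[0]
    # L_RD = (1 / n^2) * sum_ij P(y_i) P(y_j) Sim(s_i, s_j);
    # the 1/n^2 factor keeps the term from growing with document length.
    l_rd = (p.unsqueeze(0) * p.unsqueeze(1) * sim).sum() / n**2

    return beta * l_ce + (1 - beta) * l_rd
```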
4.5 Extractive Summarization
4.5.1 Single-Task Models
We define extractive summarization as a binary classification problem where the proposed models decide whether a given sentence belongs to the fixed-length summary or not. We use the proposed models to generate the summary only for Reasoning/Evidence sentences, as we can extract perfect matches for the other rhetorical role sentences (e.g., case issue and procedural background) by using regular expressions.[7] Our proposed models consist of two phases, Sentence Scoring and Sentence Selection, as shown in Figure 1. Initially, we use the approach explained in Section 4.1 to generate the embeddings for all the sentences in a given case document.

[7] This is particular to the task of summarizing BVA decisions as introduced by Zhong et al. (2019).
Sentence Scoring: Bidirectional Gated Recurrent Units (Bi-GRU) use two GRUs to simultaneously encode the sentence embeddings in both forward and backward directions. The concatenation of the forward and backward hidden states gives us the representation of each input sequence. A fully connected dense layer followed by a non-linear activation then outputs the confidence score for each sentence.
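The sketch below illustrates a sentence-scoring module of this kind; the hidden size and the choice of a sigmoid as the final non-linearity are assumptions for exposition.

```python
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    """Bi-GRU over sentence embeddings, one confidence score per sentence."""

    def __init__(self, emb_dim=768, hidden=256):
        super().__init__()
        self.bigru = nn.GRU(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.dense = nn.Linear(2 * hidden, 1)

    def forward(self, sent_embs):
        # sent_embs: (batch, num_sentences, emb_dim)
        states, _ = self.bigru(sent_embs)   # forward/backward states concatenated
        # Assumed non-linearity: sigmoid yields a confidence P(y_i) in [0, 1].
        return torch.sigmoid(self.dense(states)).squeeze(-1)
```

These per-sentence confidence scores are the $P(y_i)$ consumed by MMR-Select in the sentence selection phase.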