Extractive Summarization of Legal Decisions using Multi-task Learning
and Maximal Marginal Relevance
Abhishek Agarwal and Shanshan Xu and Matthias Grabmair
Technical University of Munich, Germany
{abhishek.agarwal, shanshan.xu, matthias.grabmair}@tum.de
Abstract
Summarizing legal decisions requires the expertise of law practitioners, which is both time- and cost-intensive. This paper presents techniques for extractive summarization of legal decisions in a low-resource setting using limited expert-annotated data. We test a set of models that locate relevant content using a sequential model and tackle redundancy by leveraging maximal marginal relevance to compose summaries. We also demonstrate an implicit approach to help train our proposed models to generate more informative summaries. Our multi-task learning model variant leverages rhetorical role identification as an auxiliary task to further improve the summarizer. We perform extensive experiments on datasets containing legal decisions from the US Board of Veterans' Appeals and conduct quantitative and expert-ranked evaluations of our models. Our results show that the proposed approaches can achieve ROUGE scores vis-à-vis expert extracted summaries that match those achieved by inter-annotator comparison.
1 Introduction
In common-law systems, law practitioners research large numbers of legal decisions from past cases to find similar precedents that justify their arguments and lead to favorable outcomes. The analysis can be time-consuming and expensive, as these documents are long and verbose, and understanding them requires legal expertise. Automatic summarization of legal documents can help expedite the process cost-effectively. However, the limited availability of expert-annotated summaries makes it challenging to design such automated systems to assist paralegals, lawyers, and other law practitioners.
Extractive summarization aims to identify and extract essential sentences from the source document to compose the corresponding summary. It is more common in the legal domain due to the complexity of legal language and the scarcity of labeled data. By contrast, abstractive summarization generates an abstract representation that captures the salient ideas of the source text and might contain new words and phrases not present in the source document.
One of the main challenges of extractive summarization is the redundancy in legal documents, as legal decisions can often contain several semantically similar sentences. Our objective is to generate summaries that provide maximum information while minimizing redundancy. Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) has proved to be an effective tool to tackle redundancy explicitly (Zhong et al., 2019) by balancing the importance of query relevance and diversity. More recent methods like MMR-Select use a neural model's confidence scores as a substitute for query relevance. Additionally, the neural model can be trained to handle redundancy implicitly by adding a redundancy loss term (Xiao and Carenini, 2020).
Another challenge is the low availability of expert-annotated summarization datasets in the legal domain. In this work, we leverage large amounts of unlabeled data along with the small annotated datasets to gain maximum performance. Pre-trained transformers like BERT (Devlin et al., 2019) can improve the performance of downstream tasks, such as summarization, even with limited labeled data. However, such models trained on the general domain may fail to capture the intricacies of the domain-specific vocabulary used in legal decisions. The domain-specific variants of BERT (Chalkidis et al., 2020; Zheng et al., 2021) pre-trained on large corpora of legal texts can better embed legal terms and achieve robust performance in various legal-specific downstream tasks like argument mining (Xu et al., 2021), rhetorical role labeling (Bhattacharya et al., 2021a), and legal citation recommendation (Huang et al., 2021).
To maximize summarization performance, we also leverage Multi-task Learning (MTL) by aggregating training samples from several smaller datasets of multiple related tasks. MTL helps the model learn shared representations between the primary task (summarization) and the auxiliary task (rhetorical role identification) to generalize better. Rhetorical role identification involves determining the function of different sentences in order to understand the underlying reasoning and argument patterns in legal decisions. Previous works have often used rhetorical role labeling as a precursor to extractive summarization to improve performance (Zhong et al., 2019; Bhattacharya et al., 2021b). In this paper, we explore the idea of using rhetorical role identification as an auxiliary task to augment our annotated dataset and help generate better summaries.
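To make the multi-task setup concrete, the sketch below shows one common way of sharing a sequence encoder between two sentence-level tasks. It is purely illustrative: the layer sizes, head names, and the use of a Bi-GRU encoder here are assumptions for exposition, not our exact experimental configuration.

```python
import torch
import torch.nn as nn

class MultiTaskSummarizer(nn.Module):
    """Shared Bi-GRU encoder with one head per task (illustrative sizes)."""

    def __init__(self, emb_dim=768, hidden=256):
        super().__init__()
        # Shared sequence encoder over the sentence embeddings of a decision.
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        # Task-specific heads: both are binary sentence classifiers.
        self.summary_head = nn.Linear(2 * hidden, 1)     # in summary / not
        self.rhetorical_head = nn.Linear(2 * hidden, 1)  # Evidence/Reasoning / Other

    def forward(self, sent_embs, task):
        # sent_embs: (batch, num_sentences, emb_dim)
        states, _ = self.encoder(sent_embs)
        head = self.summary_head if task == "summarization" else self.rhetorical_head
        return head(states).squeeze(-1)  # per-sentence logits
```

Because both heads read the same encoder states, gradients from the auxiliary rhetorical role task shape the representation used by the summarizer.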
In brief, we consider our contributions to the extractive summarization of legal documents as follows:

• We generate informative summaries with maximum information and minimum redundancy in a low-resource setting. Our experiments demonstrate a general improvement in ROUGE scores for the proposed approaches.
• We further improve the summarizer in a multi-task setting by combining extractive summarization and rhetorical role labeling. The quantitative evaluation demonstrates that the multi-task models perform better than the single-task models.
• We evaluate the generated summaries qualitatively with the help of a legal expert. In contrast to the quantitative evaluation, the qualitative results show that our proposed approaches rank at least as well as human annotators.[1]

[1] Our code is available here.
2 Related Work
2.1 Extractive Summarization
Galgani et al. (2012) developed a rule-based approach to summarization that uses a knowledge base, statistical information, and other handcrafted features like POS tags, specific legal terms, and citations. Kim et al. (2012) propose a graph-based summarization system that constructs a directed graph for each document, where nodes are assigned weights based on how likely words in a given sentence are to appear in the conclusion of judgments. CaseSummarizer (Polsley et al., 2016), an automated text summarization tool, uses word frequency augmented with additional domain-specific knowledge to score the sentences in the case document. Liu and Chen (2019) propose a classification-based approach that uses several handcrafted features as input. However, such techniques require knowledge engineering of different features and do not tackle redundancy in legal decisions. Recently, various proposed approaches have tried to address redundancy in legal decisions for purposes of summarization. Zhong et al. (2019) iteratively select predictive sentences using a CNN-based train-attribute-mask pipeline, followed by a Random Forest classifier that distinguishes between sentences containing Reasoning/EvidentialSupport and other types; MMR then selects the final sentences for the summary. Bhattacharya et al. (2021b) demonstrate an unsupervised approach named DELSumm that generates extractive summaries by incorporating guidelines from legal experts into an optimization problem that maximizes informativeness and content-word coverage as well as conciseness. In this work, we use an MMR-based variant that tackles redundancy explicitly and can be combined with a neural classifier to generate summaries. It alleviates the need to engineer handcrafted features or specific expert guidelines to prevent redundancy.
2.2 Rhetorical Role Labeling
Saravanan and Ravindran (2010) propose a rule-based system along with a Conditional Random Field (CRF) approach to identify the different rhetorical segments in legal judgments. Nejadgholi et al. (2017) propose a semi-supervised approach to searching legal facts in immigration-specific case documents, using an unsupervised word embedding model to aid the training of a supervised fact-detecting classifier on a small set of annotated sentences. Walker et al. (2019) compare the performance of rule-based scripts and ML algorithms in classifying sentences that state findings of fact. Bhattacharya et al. (2019) explore the use of hierarchical BiLSTM models by adding an attention layer and experiment with pre-trained word and sentence embeddings (Bhattacharya et al., 2021a). Savelka et al. (2021) annotated legal cases from seven countries in six languages using a structural type system and found that Bi-GRU models could generalize across different jurisdictions to some degree. Despite copious work, there are very few annotated rhetorical role datasets in the legal domain. In this work, we use rhetorical role labeling as an auxiliary task to augment our annotated dataset and help generate better summaries.
3 Data
We use the dataset containing single-issue Post-Traumatic Stress Disorder (PTSD) decisions from the US Board of Veterans' Appeals[2] (BVA) by Zhong et al. (2019). These cases focus on veterans' appeals for benefits for a PTSD disability connected to stressful experiences during military service. The dataset is a sample from the BVA database that has been constrained to single-issue cases focusing on PTSD. In the texts, the BVA reviews the available evidence and either makes a finding that it warrants an award for service-connected PTSD (granted) or not (denied), or refers the case back to a lower administrative division for further development (remand). The dataset consists of 112 decisions and the corresponding expert-annotated gold-standard summaries. We have 92 cases (48 remanded, 28 denied, 16 granted) in the training set with one annotated summary each. Another 20 cases (10 remanded, 6 denied, 4 granted) constitute the test set, for which there are four extractive summaries by different annotators and two drafted abstractive summaries. Each annotator composed a 6-10 sentence summary based on predefined guidelines, selecting 1-3 sentences each from the Reasoning and Evidence annotation types. The Reasoning sentences connect the outcome to the facts, while Evidence sentences add more information to support the former.

[2] https://www.bva.va.gov
For rhetorical role labeling, we use two different datasets containing 50 plus 25 annotated BVA decisions[3] (Walker et al., 2019). The decisions in the larger dataset have partial annotations, so we keep only decisions that have annotations for at least 60%[4] of the sentences in each decision. This results in 28 decisions consisting of 17 denied and 11 granted outcomes, while the smaller dataset contains 10 denied and 15 granted decisions. We map the different annotation types of the two datasets to a uniform type system of six annotation types. For our experiments, we merge these into two classes, resulting in 1889 Evidence/Reasoning and 3728 Others sentences. Therefore, the final dataset has 53 decisions with 7473 binary sentence-level annotations.

[3] https://github.com/LLTLab/VetClaims-JSON
[4] The remaining sentences with missing annotations contain sentences from annotation types other than Evidence or Reasoning, so we automatically annotate these sentences with the Others type.
The datasets are from different time periods; therefore, the decisions have a slightly different document structure. We remove meta-information like the case number, dates, judge names, names of the witnesses, and other similar information to obtain a more uniform layout. We keep only the following sections (if present) from each decision in the dataset:

• Order
• Finding of Fact
• Conclusion of Law
• Reasons and Bases for Finding and Conclusion
• Remand
• Reasons for Remand
Additionally, we use the SpaCy[5] pipeline enhanced with additional handcrafted rules to segment the sentences in each document for the summarization dataset.[6] After pre-processing, the average number of sentences per decision in the summarization and rhetorical role labeling datasets is 77.29 (± 52.28) and 118.37 (± 78.33), respectively.

[5] https://spacy.io
[6] To validate the performance of the sentence segmentation, we manually segment 11 decisions (6 remanded, 3 denied, 2 granted) and compare the matches. In terms of Recall, our approach and the legal text sentence segmenter (Savelka et al., 2017) score 0.937 and 0.905, respectively.
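For illustration, a minimal version of such a rule-enhanced segmentation pipeline is sketched below. The abbreviation list and the boundary rule are hypothetical examples; our actual handcrafted rules are more extensive.

```python
import spacy
from spacy.language import Language

# Load a standard English pipeline; the real rule set is more extensive.
nlp = spacy.load("en_core_web_sm")

@Language.component("legal_boundaries")
def legal_boundaries(doc):
    # Hypothetical rule: never start a new sentence right after
    # common legal abbreviations such as statute citations.
    abbreviations = {"C.F.R.", "U.S.C."}
    for token in doc[:-1]:
        if token.text in abbreviations:
            doc[token.i + 1].is_sent_start = False
    return doc

# Custom boundary rules must run before the default parser-based segmenter.
nlp.add_pipe("legal_boundaries", before="parser")

text = "See 38 C.F.R. 3.304. The Board finds the evidence sufficient."
sentences = [sent.text for sent in nlp(text).sents]
```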
4 Our Approach
4.1 Sentence Embeddings
Sentence embeddings map an input sentence to a fixed-size dense vector representation. Sentence-BERT (Reimers and Gurevych, 2019) has recently emerged as an effective tool to derive semantically meaningful sentence embeddings. However, the lack of a domain-specific labeled entailment dataset required to train it makes it inaccessible for us. Alternatively, we can use a BERT model to extract a sentence embedding by pooling the embeddings of each token in the sentence. The mean-pooling operation outputs a 768-dimensional fixed-size representation for each sentence. Such embeddings generalize quite well and provide a good starting point for training our sequential models later. This work uses the legal-domain-specific transformer Legal-BERT (Zheng et al., 2021), trained on the 3,446,187 legal decisions of the Harvard Law case corpus, to generate the sentence embeddings.
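A minimal sketch of this embedding step follows. We assume the publicly available Hugging Face checkpoint zlucia/custom-legalbert for the Legal-BERT variant of Zheng et al. (2021); any BERT-style encoder with 768-dimensional hidden states can be substituted.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint for the Legal-BERT of Zheng et al. (2021).
MODEL_NAME = "zlucia/custom-legalbert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed_sentences(sentences):
    """Return one 768-d mean-pooled embedding per input sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        token_embs = model(**batch).last_hidden_state  # (B, T, 768)
    # Mask out padding tokens before averaging over the sequence.
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    return (token_embs * mask).sum(1) / mask.sum(1)    # (B, 768)

embs = embed_sentences(["The veteran served on active duty.",
                        "Service connection for PTSD is granted."])
```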
4.2 Weighted Loss Function
The conventional cross-entropy loss function for extractive summarization results in poor classification performance due to class imbalance. For each decision, we have on average very few positive labels (5-6 sentences) per summary, which results in a highly imbalanced dataset. To tackle this issue, we use a weighted cross-entropy loss function that puts more emphasis on positive labels by manually rescaling the weights for each class:
$$w_c = \frac{\#\text{samples}}{\#\text{classes} \times \#\text{samples}_c}$$

$$\mathcal{L}_{CE} = -\sum_{c=1}^{M} w_c \, y_{o,c} \log(p_{o,c})$$
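In PyTorch, for example, these class weights can be computed from the label counts and passed directly to the built-in cross-entropy loss; the label counts in the sketch below are illustrative values for a single decision.

```python
import torch
import torch.nn as nn

# Example label counts: very few positive (in-summary) sentences.
labels = torch.tensor([0] * 72 + [1] * 6)   # one decision's sentence labels
n_samples, n_classes = len(labels), 2

# w_c = #samples / (#classes * #samples_c), as defined above.
counts = torch.bincount(labels, minlength=n_classes).float()
weights = n_samples / (n_classes * counts)

loss_fn = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(n_samples, n_classes)  # stand-in model outputs
loss = loss_fn(logits, labels)
```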
4.3 Maximal Marginal Relevance
Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) iteratively (greedily) selects sentences for the summary while balancing query relevance and diversity:
$$\text{MMR} = \arg\max_{s_i \in D \setminus \hat{S}} \left[\, \lambda \,\text{Sim}(s_i, Q) - (1 - \lambda) \max_{s_j \in \hat{S}} \text{Sim}(s_i, s_j) \,\right]$$
The parameter $\lambda$ helps control the redundancy (novelty) in the extracted summary. We use cosine similarity to calculate the similarity between two sentence embeddings. The query $Q$ represents the case document and is obtained by averaging the embeddings of all sentences in the decision. Xiao and Carenini (2020) propose MMR-Select as an alternative approach to eliminate redundancy explicitly. Instead of the greedy method computing the query relevance to find suitable candidates, MMR-Select picks candidate sentences using the confidence scores $P(y_i)$ produced by a neural model, which makes it more robust:
$$\text{MMR-Select} = \arg\max_{s_i \in D \setminus \hat{S}} \left[\, \lambda \, P(y_i) - (1 - \lambda) \max_{s_j \in \hat{S}} \text{Sim}(s_i, s_j) \,\right]$$
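A sketch of the corresponding greedy selection loop is given below; replacing the confidence scores $P(y_i)$ with $\text{Sim}(s_i, Q)$ recovers classic MMR. The summary budget and $\lambda$ value are illustrative defaults, not tuned settings.

```python
import numpy as np

def mmr_select(embeddings, scores, budget=6, lam=0.6):
    """Greedy MMR-Select over embeddings (n, d) and scores P(y_i) of shape (n,)."""
    # Pairwise cosine similarities between sentence embeddings.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T

    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < budget:
        def mmr_score(i):
            # Penalize similarity to the most similar already-selected sentence.
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # restore document order
```

Sorting the selected indices at the end restores the original document order, which helps keep the extractive summary readable.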
4.4 Redundancy Loss
The major limitation of explicit methods like MMR is the disconnect between the sentence scoring and sentence selection phases. Such techniques rely on the classifier to score the sentences in the document and check for redundancy only later, when selecting the final sentences for the summary. Thus, the classifier used to generate the confidence scores does not implicitly learn how to handle redundancy. We can generate more informative summaries by teaching the neural model to avoid picking similar sentences. Xiao and Carenini (2020) propose adding a redundancy loss term $\mathcal{L}_{RD}$ to the cross-entropy loss function that penalizes the model for choosing two similar sentences with high confidence scores. The parameter $\beta$ balances the importance we assign to $\mathcal{L}_{CE}$ and $\mathcal{L}_{RD}$. The neural models tend to classify more sentences as part of the summary for longer case documents. Therefore, we scale the redundancy loss $\mathcal{L}_{RD}$ defined by Xiao and Carenini (2020) to ensure that it does not explode as the length of the document increases, preventing it from overshadowing the cross-entropy loss:
$$\mathcal{L} = \beta \, \mathcal{L}_{CE} + (1 - \beta) \, \mathcal{L}_{RD}$$

$$\mathcal{L}_{RD} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} P(y_i) \, P(y_j) \,\text{Sim}(s_i, s_j)$$
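The combined loss is straightforward to implement; below is a minimal PyTorch sketch for a single document with precomputed pairwise sentence similarities. The value of $\beta$ shown is an illustrative placeholder, not our tuned setting.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, labels, sim, class_weights, beta=0.7):
    """L = beta * L_CE + (1 - beta) * L_RD for one document.

    logits: (n, 2) per-sentence logits, labels: (n,) in {0, 1},
    sim: (n, n) cosine similarities between sentence embeddings.
    """
    l_ce = F.cross_entropy(logits, labels, weight=class_weights)

    # P(y_i): confidence that sentence i belongs to the summary.
    p = torch.softmax(logits, dim=-1)[:, 1]
    n = p.shape[0]
    # L_RD = (1 / n^2) * sum_ij P(y_i) P(y_j) Sim(s_i, s_j);
    # the 1/n^2 factor keeps the term from growing with document length.
    l_rd = (p.unsqueeze(0) * p.unsqueeze(1) * sim).sum() / n**2

    return beta * l_ce + (1 - beta) * l_rd
```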
4.5 Extractive Summarization
4.5.1 Single-Task Models
We define extractive summarization as a binary classification problem where the proposed models decide whether a given sentence belongs to the fixed-length summary or not. We use the proposed models to generate the summary only for Reasoning/Evidence sentences, as we can extract perfect matches for the other rhetorical role sentences (e.g., case issue and procedural background) by using regular expressions.[7] Our proposed models consist of two phases, Sentence Scoring and Sentence Selection, as shown in Figure 1. Initially, we use the approach explained in Section 4.1 to generate the embeddings for all the sentences in a given case document.

[7] This is particular to the task of summarizing BVA decisions as introduced by Zhong et al. (2019).
Sentence Scoring: Bidirectional Gated Recurrent Units (Bi-GRU) use two GRUs to simultaneously encode the sentence embeddings in both forward and backward directions. The concatenation of the forward and backward hidden states gives us the representation of each input sequence. A fully connected dense layer followed by a non-linear activation then outputs the confidence score for each sentence.
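The sketch below illustrates a sentence-scoring module of this kind; the hidden size and the choice of a sigmoid as the final non-linearity are assumptions for exposition.

```python
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    """Bi-GRU over sentence embeddings, one confidence score per sentence."""

    def __init__(self, emb_dim=768, hidden=256):
        super().__init__()
        self.bigru = nn.GRU(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.dense = nn.Linear(2 * hidden, 1)

    def forward(self, sent_embs):
        # sent_embs: (batch, num_sentences, emb_dim)
        states, _ = self.bigru(sent_embs)   # forward/backward states concatenated
        # Assumed non-linearity: sigmoid yields a confidence P(y_i) in [0, 1].
        return torch.sigmoid(self.dense(states)).squeeze(-1)
```

These per-sentence confidence scores are the $P(y_i)$ consumed by MMR-Select in the sentence selection phase.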