An Exploration of Hierarchical Attention Transformers
for Efficient Long Document Classification
Ilias Chalkidis   Xiang Dai   Manos Fergadiotis
Prodromos Malakasiotis   Desmond Elliott
Department of Computer Science, University of Copenhagen, Denmark
CSIRO Data61, Sydney, Australia
Department of Informatics, Athens University of Economics and Business, Greece
Pioneer Centre for AI, Copenhagen, Denmark
Abstract
Non-hierarchical sparse attention Transformer-
based models, such as Longformer and Big
Bird, are popular approaches to working with
long documents. There are clear benefits
to these approaches compared to the original
Transformer in terms of efficiency, but Hier-
archical Attention Transformer (HAT) models
are a vastly understudied alternative. We de-
velop and release fully pre-trained HAT mod-
els that use segment-wise followed by cross-
segment encoders and compare them with
Longformer models and partially pre-trained
HATs. In several long document downstream
classification tasks, our best HAT model out-
performs equally-sized Longformer models
while using 10-20% less GPU memory and
processing documents 40-45% faster. In a series of ablation studies, we find that HATs perform best with cross-segment contextualization throughout the model, compared to alternative configurations that implement either early or late cross-segment contextualization. Our code is available on GitHub: https://github.com/coastalcph/hierarchical-transformers.
1 Introduction
Long Document Classification is the classification of a single long document, typically thousands of words long, e.g., classification of legal (Chalkidis et al., 2022) and biomedical documents (Johnson et al., 2016), or co-processing of long and shorter chunks of text, e.g., sequential sentence classification (Cohan et al., 2019), document-level multiple-choice QA (Pang et al., 2021), and document-level NLI (Koreeda and Manning, 2021).
One approach to working with long documents
is to simply expand standard Transformer-based
language models (BERT of Devlin et al. (2019),
RoBERTa of Liu et al. (2019), etc.) but this is
problematic for long sequences, given the $O(N^2)$ self-attention operations.
Corresponding author: ilias.chalkidis[at]di.ku.dk
Figure 1: Performance - Efficiency trade-off for HAT and Longformer on downstream tasks.
To address this compu-
tational problem, researchers have introduced ef-
ficient Transformer-based architectures. Several
sparse attention networks, such as Longformer of
Beltagy et al. (2020), or BigBird of Zaheer et al.
(2020), have been proposed relying on a combina-
tion of different attention patterns (e.g., relying on
local (neighbor), global and/or randomly selected
tokens). Another approach relies on Hierarchical
Attention Transformers (HATs) that use a multi-
level attention pattern: segment-wise followed by
cross-segment attention. Ad-hoc (partially pre-
trained), and non-standardized variants of HAT
have been presented in the literature (Chalkidis
et al., 2019; Wu et al., 2021; Chalkidis et al., 2022; Liu et al., 2022; Dai et al., 2022), but the potential
of such models is still vastly understudied.
In this work, we examine the potential of fully
(end-to-end) pre-trained HATs and aim to answer
three main questions: (a) Which configurations
of segment-wise and cross-segment attention lay-
ers in HATs perform best? (b) What is the effect
of pre-training HATs end-to-end, compared to ad-hoc (partially pre-trained) HATs, i.e., plugging randomly initialized cross-segment transformer blocks during fine-tuning? (c) Are there computational or downstream performance benefits to using HATs compared to widely-used sparse attention networks, such as Longformer and BigBird?
Figure 2: Attention patterns for the examined architectures: Hierarchical (segment-wise followed by cross-segment attention) and Sparse (combination of windowed and global attention) Attention Transformers.
2 Related Work
2.1 Sparse Attention Transformers
Longformer of Beltagy et al. (2020) consists of local (window-based) attention and global attention that reduces the computational complexity of the model and can thus be deployed to process up to 4096 tokens. Local attention is computed within a window of neighbouring (consecutive) tokens. Global attention relies on the idea of global tokens that are able to attend to, and be attended by, any other token in the sequence. Windowed (local) attention does not leverage hierarchical information in any sense, and can be considered greedy.
BigBird of Zaheer et al. (2020) is another sparse-attention-based Transformer that uses a combination of local, global, and random attention, i.e., all tokens also attend to a number of random tokens on top of those in the same neighbourhood. Both models are warm-started from the public RoBERTa checkpoint and are further pre-trained on masked language modelling. They have been reported to outperform RoBERTa on a range of tasks that require modelling long sequences.
In both cases (models), the attention scores for local (neighbor), global, and randomly selected tokens are combined (added), i.e., attention blends only word-level representations (Figure 2). BigBird is even more computationally expensive, with borderline improved results in some benchmarks, e.g., LRA of Tay et al. (2021), but not in others, e.g., LexGLUE of Chalkidis et al. (2022).
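For concreteness, the sketch below builds the kind of sparse (local + global, optionally random) attention mask that such models rely on. It is an illustration of the attention pattern only, not the official Longformer or BigBird implementation, and the toy sizes are assumptions.

# Illustrative sketch (not the official Longformer/BigBird code): a boolean
# attention mask combining a local window, a few global tokens, and optional
# BigBird-style random tokens. True means "query i may attend to key j".
import torch

def sparse_attention_mask(seq_len: int, window: int = 4,
                          global_ids=(0,), num_random: int = 0) -> torch.Tensor:
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    # Local (windowed) attention: each token attends to its +/- `window` neighbours.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    # Global attention: global tokens attend to, and are attended by, every token.
    for g in global_ids:
        mask[g, :] = True
        mask[:, g] = True
    # Random attention (BigBird): each token also attends to a few random tokens.
    if num_random > 0:
        rows = torch.arange(seq_len).unsqueeze(1).expand(-1, num_random)
        cols = torch.randint(0, seq_len, (seq_len, num_random))
        mask[rows, cols] = True
    return mask

print(sparse_attention_mask(seq_len=16, window=2, num_random=2).int())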
2.2 Hierarchical Attention Transformers
Hierarchical Attention Transformers (HATs) are di-
rectly inspired by Hierarchical Attention Networks
(HANs) of Yang et al. (2016). The main idea is to process (encode) a document in a hierarchical fashion, e.g., contextualize word representations per sentence, and then sentence-level representations
across sentences. Chalkidis et al. (2019) were prob-
ably the first to use HATs as a viable option for
processing long documents based on pre-trained
Transformer-based language models. They show
improved results using a hierarchical variant of
BERT compared to BERT (fed with truncated doc-
uments) or HANs. Similar models were used in the
work of Chalkidis et al. (2022), where they com-
pared hierarchical variants of several pre-trained
language models (BERT, RoBERTa, etc.) showcas-
ing comparable results to Longformer and BigBird
in long document classification tasks. Recently,
Dai et al. (2022) compared ad-hoc RoBERTa-based
HATs with Longformer and reported comparable
performance in four document classification tasks.
Wu et al. (2021) proposed a HAT architec-
ture, named Hi-Transformers, a shallow version
of our interleaved variant presented in detail in Sec-
tion 3.2. They showed that their model performs better than Longformer and BigBird across three classification tasks. Their analysis, however, relies on non-pre-trained models, i.e., all models considered are randomly initialized and directly fine-tuned on the downstream tasks; thus, the impact of pre-training such models is unknown.
Liu et al. (2022) propose a similar architecture,
named Hierarchical Sparse Transformer (HST). Liu
et al. showed that HST has improved results on the Long Range Arena (LRA) benchmark, text classification, and QA compared to Longformer and BigBird. Their analysis considers a single layout (topology) and is mainly limited to datasets where documents are not particularly long (<1000 tokens). In
our work, we consider several HAT layouts (con-
figurations) and evaluate our models in several
segment-level, document-level, and multi-segment
tasks with larger documents (Table 1).
2.3 Other Approaches
Several other efficient Transformer-based models
have been proposed in the literature (Katharopoulos et al., 2020; Kitaev et al., 2020; Choromanski et al., 2021). We refer readers to Xiong et al. (2021) and Tay et al. (2022) for a survey on efficient attention variants.
Figure 3: Top: The two main modules (building blocks) of Hierarchical Attention Transformers (HAT): the Segment-wise (SWE) and the Cross-segment (CSE) encoders. Bottom: The four examined HAT variants.
Recently, other non-Transformer-based approaches (Gu et al., 2022; Gupta et al., 2022)
have been proposed for efficient long sequence
processing relying on structured state spaces (Gu
et al.,2021). In this work, we do not compare
with such architectures (Transformer-based or not),
since there are no standardized implementations or
publicly available pre-trained models to rely on at
the moment. There are several other Transformer-
based encoder-decoder models (Guo et al.,2022;
Pang et al.,2022) targeting generative tasks, e.g.,
long document summarization (Shen et al.,2022),
which are out of the scope of this study.
3 Hierarchical Attention Transformers
3.1 Architecture
Hierarchical Attention Transformers (HATs) consider as input a sequence of tokens ($S$), organized in $N$ equally-sized segments (chunks), i.e., $S = [C_1, C_2, C_3, \dots, C_N]$. Each sub-sequence (segment) is a sequence of $K$ tokens, $C_i = [W_{i[CLS]}, W_{i1}, W_{i2}, W_{i3}, \dots, W_{iK-1}]$, i.e., each segment has its own segment-level representative [CLS] token. A HAT is built using two types of neural modules (blocks): (a) the Segment-wise encoder (SWE), a shared Transformer (Vaswani et al., 2017) block processing each segment ($C_i$) independently, and (b) the Cross-segment encoder (CSE), a Transformer block processing (and contextualizing) the segment-level representative tokens ($W_{i[CLS]}$). The two components can be used in several different layouts (topologies). We present HAT variants (architectures) in Section 3.2.
HATs use two types of absolute positional embeddings to model the position of tokens: segment-wise position embeddings ($P^{sw}_i \in \mathbb{R}^H$, $i \in [1, K]$) to model token positioning per segment, and cross-segment position embeddings ($P^{cs}_i \in \mathbb{R}^H$, $i \in [1, N]$) to model the position of a segment in the document. $P^{sw}$ embeddings are additive to word ones, as in most other Transformer-based models, such as BERT. Similarly, $P^{cs}$ embeddings are added to the segment representations ($W'_{i[CLS]}$) before they are passed to a CSE, and they are shared across all CSEs of the model. A more detailed depiction of HAT, including positional embeddings, is presented in Figure 4 of Appendix B.1.
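To make the above concrete, the following is a minimal, illustrative PyTorch sketch of one paired HAT layer, not the released implementation: a shared segment-wise encoder is applied per segment, and a cross-segment encoder then contextualizes the segment-level [CLS] representations. The toy dimensions, the use of nn.TransformerEncoderLayer as a stand-in for pre-trained blocks, and the placement of both positional embeddings inside the layer are simplifying assumptions.

# Minimal sketch of one paired HAT layer (segment-wise + cross-segment encoder).
import torch
import torch.nn as nn

class HATLayer(nn.Module):
    def __init__(self, hidden: int = 256, heads: int = 4, K: int = 128, N: int = 8):
        super().__init__()
        self.swe = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)  # segment-wise encoder (SWE)
        self.cse = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)  # cross-segment encoder (CSE)
        self.p_sw = nn.Embedding(K, hidden)  # segment-wise position embeddings P^{sw}
        self.p_cs = nn.Embedding(N, hidden)  # cross-segment position embeddings P^{cs}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, N segments, K tokens, hidden]; token 0 of each segment is its [CLS].
        B, N, K, H = x.shape
        x = x + self.p_sw(torch.arange(K, device=x.device))             # add P^{sw} to token representations
        x = self.swe(x.view(B * N, K, H)).view(B, N, K, H)              # encode each segment independently
        cls = x[:, :, 0] + self.p_cs(torch.arange(N, device=x.device))  # segment reps W'_{i[CLS]} + P^{cs}
        cls = self.cse(cls)                                             # contextualize segments across the document
        return torch.cat([cls.unsqueeze(2), x[:, :, 1:]], dim=2)        # write contextualized [CLS] back

layer = HATLayer()
out = layer(torch.randn(2, 8, 128, 256))  # 2 documents, 8 segments of 128 tokens each
print(out.shape)                          # torch.Size([2, 8, 128, 256])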
3.2 Examined Layouts
We first examine several alternative layouts of HAT
layers, i.e., the placement of SWE and CSE:
Ad-Hoc (AH): An ad-hoc (partially pre-trained) HAT (Chalkidis et al., 2022) comprises an initial stack of $L_{SWE}$ shared segment-wise encoders from a pre-trained Transformer-based model, followed by $L_{CSE}$ ad-hoc cross-segment encoders. In this case, the model initially encodes and contextualizes token representations per segment, and then builds higher-order segment-level representations (Figure 3(a)).
Interleaved (I): An interleaved HAT comprises a stack of $L_P$ paired segment-wise and cross-segment encoders. In this case, contrary to the ad-hoc version of HAT, cross-segment attention (contextualization) is performed across several levels (layers) of the model (Figure 3(b)).
Early-Contextualization (EC): An early-contextualized HAT comprises an initial stack of $L_P$ paired segment-wise and cross-segment encoders, followed by a stack of $L_{SWE}$ segment-wise encoders. In this case, cross-segment attention (contextualization) is only performed in the initial layers of the model (Figure 3(c)).
Late-Contextualization (LC): A late-contextualized HAT comprises an initial stack of $L_{SWE}$ segment-wise encoders, followed by a stack of $L_P$ paired segment-wise and cross-segment encoders. In this case, cross-segment attention (contextualization) is only performed in the latter layers of the model (Figure 3(d)).
We present task-specific HAT architectures (e.g.,
for token/segment/document classification, and
multiple-choice QA tasks) in Appendix A.1.
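As a rough illustration, the four layouts can be written down as layer "plans" over the two block types. The 12-to-14-layer splits below are illustrative assumptions, not the exact configurations used in our experiments.

# Hedged sketch: the four examined layouts as layer "plans", where "SW" is a
# segment-wise encoder, "CS" a cross-segment encoder, and "SW+CS" a paired block.
# The depths below are illustrative assumptions, not the released configurations.
LAYOUTS = {
    "ad-hoc (AH)":                  ["SW"] * 12 + ["CS"] * 2,     # L_SWE pre-trained, L_CSE plugged in ad hoc
    "interleaved (I)":              ["SW+CS"] * 8,                # L_P paired encoders throughout
    "early-contextualization (EC)": ["SW+CS"] * 4 + ["SW"] * 8,   # cross-segment attention early only
    "late-contextualization (LC)":  ["SW"] * 8 + ["SW+CS"] * 4,   # cross-segment attention late only
}

for name, plan in LAYOUTS.items():
    print(f"{name:30s} {' -> '.join(plan)}")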
3.3 Tokenization / Segmentation
Since HATs consider a sequence of segments, we need to define a segmentation strategy, i.e., how to group tokens (sub-words) into segments. Standard approaches consider sentences or paragraphs as segments. We opt for a dynamic segmentation strategy that balances the trade-off between preserving the text structure (avoiding sentence truncation) and minimizing padding, which in turn minimizes document truncation. We split each document into $N$ segments by grouping sentences up to $K$ total tokens.¹ Following Dai et al. (2022), our models consider segments of $K=128$ tokens each; such a window was shown to balance computational complexity with task performance.
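The following is a sketch of this dynamic segmentation strategy under simplifying assumptions: whole sentences are greedily packed into segments of at most K sub-word tokens. The sentence splitter and tokenizer shown are illustrative choices, and the helper name is ours.

# Sketch of the dynamic segmentation strategy: greedily pack whole sentences into
# segments of at most K sub-word tokens, avoiding sentence truncation and keeping
# padding low. Splitter/tokenizer choices below are illustrative, not prescriptive.
from nltk import sent_tokenize              # requires the NLTK "punkt" data package
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def segment_document(text: str, K: int = 128, N: int = 8) -> list[list[str]]:
    """Group sentences into at most N segments of up to K sub-word tokens each."""
    segments, current, current_len = [], [], 0
    for sent in sent_tokenize(text):
        n_tokens = len(tokenizer.tokenize(sent))
        # Start a new segment if this sentence would overflow the current one.
        if current and current_len + n_tokens > K:
            segments.append(current)
            current, current_len = [], 0
        current.append(sent)
        current_len += n_tokens
    if current:
        segments.append(current)
    return segments[:N]  # segments beyond N are truncated, as in any fixed-length model

doc = "This is a short example sentence. " * 60
print([len(seg) for seg in segment_document(doc)])  # number of sentences per segment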
4 Experimental Set Up
4.1 Evaluation Tasks
We consider three groups of evaluation tasks:
(a) Upstream (pre-training) tasks, which aim to pre-train (warm-start) the encoder in a generic self-supervised manner; (b) Midstream (quality-assessment) tasks, which aim to estimate the quality of the pre-trained models; and (c) Downstream tasks, which aim to estimate a model's performance in realistic (practical) applications.
Upstream (Pre-training) Task: We consider Masked Language Modeling (MLM), a well-established bidirectional extension of traditional language modeling proposed by Devlin et al. (2019) for Transformer-based text encoders. Following Devlin et al. (2019), we mask 15% of the tokens.
¹ Any sentence splitter can be used. In our work, we consider the NLTK (https://www.nltk.org/) English sentence splitter. We present examples in Appendix B.
Midstream Tasks: We consider four alternative midstream tasks. These tasks aim to assess the quality of the word, segment, and document representations of pre-trained models, i.e., models pre-trained on the MLM task.²
Segment Masked Language Modeling (SMLM), an extension of MLM where a percentage of the tokens in a subset (20%) of the segments are masked. We consider two alternatives: 40% (SMLM-40) and 100% (SMLM-100) masking (a masking sketch in code is given after this list). For this task, we predict the identity of the masked tokens, and we use cross-entropy loss as the evaluation metric. Intuitively, we assess cross-segment contextualization, since we predict the masked words of a segment mainly based on the other segments.
Segment Order Prediction (SOP), where the input to the model is a shuffled sequence of segments from a document. The goal of the task is to predict the correct position (order) of the segments, as in the original document. For this task, we predict the position of each segment as a regression task; hence, our evaluation metric is mean absolute error (MAE). Intuitively, we assess cross-segment contextualization and the quality of segment-level representations, since the segment order has to be resolved given segment relations.
Multiple-Choice Masked Segment Prediction (MC-MSP), where the input to the model is a sequence of segments from a document with one segment masked at a time, and a list of five alternative segments (choices), including the masked one. The goal of this task is for the model to identify the correct segment, i.e., the one masked from the original document. For this task, we predict the id of the correct pair (<masked document, choice>) across all pairs; hence, our evaluation metric is accuracy. Similarly to SOP, we assess cross-segment contextualization and the quality of segment-level representations, since predicting the correct segment has to be resolved based on both document-level semantics and the semantics of the segments neighboring the masked one.
² We present additional details (e.g., dataset curation) for the midstream tasks in Appendix A.2.
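Below is a hedged sketch of the SMLM masking step described above: 20% of a document's segments are selected, and 40% (SMLM-40) or 100% (SMLM-100) of the tokens inside them are replaced by a mask token. The helper and its defaults mirror the task description rather than the released data-curation code.

# Hedged sketch of SMLM-style masking: select 20% of a document's segments and
# mask a fraction of the tokens inside them (40% for SMLM-40, 100% for SMLM-100).
import random

MASK = "<mask>"

def smlm_mask(segments: list[list[str]], segment_ratio: float = 0.2,
              token_ratio: float = 0.4, seed: int = 0) -> list[list[str]]:
    rng = random.Random(seed)
    n_masked_segments = max(1, int(len(segments) * segment_ratio))
    chosen = set(rng.sample(range(len(segments)), n_masked_segments))
    masked = []
    for i, seg in enumerate(segments):
        if i in chosen:
            n_masked_tokens = max(1, int(len(seg) * token_ratio))
            positions = set(rng.sample(range(len(seg)), n_masked_tokens))
            seg = [MASK if j in positions else tok for j, tok in enumerate(seg)]
        masked.append(seg)
    return masked

doc = [[f"w{i}_{j}" for j in range(8)] for i in range(5)]  # 5 toy segments of 8 tokens
print(smlm_mask(doc, token_ratio=1.0))                     # SMLM-100 variant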