An Exploration of Hierarchical Attention Transformers
for Efficient Long Document Classification
Ilias Chalkidis   Xiang Dai   Manos Fergadiotis
Prodromos Malakasiotis   Desmond Elliott
Department of Computer Science, University of Copenhagen, Denmark
CSIRO Data61, Sydney, Australia
Department of Informatics, Athens University of Economics and Business, Greece
Pioneer Centre for AI, Copenhagen, Denmark
Abstract
Non-hierarchical sparse attention Transformer-
based models, such as Longformer and Big
Bird, are popular approaches to working with
long documents. There are clear benefits
to these approaches compared to the original
Transformer in terms of efficiency, but Hier-
archical Attention Transformer (HAT) models
are a vastly understudied alternative. We de-
velop and release fully pre-trained HAT mod-
els that use segment-wise followed by cross-
segment encoders and compare them with
Longformer models and partially pre-trained
HATs. In several long document downstream
classification tasks, our best HAT model out-
performs equally-sized Longformer models
while using 10-20% less GPU memory and
processing documents 40-45% faster. In a series of ablation studies, we find that HATs perform best with cross-segment contextualization throughout the model, compared to alternative configurations that implement either early or late cross-segment contextualization. Our code is available on GitHub: https://github.com/coastalcph/hierarchical-transformers.
1 Introduction
Long Document Classification is the classification of a single long document, typically thousands of words long, e.g., classification of legal (Chalkidis et al., 2022) and biomedical documents (Johnson et al., 2016), or co-processing of long and shorter chunks of text, e.g., sequential sentence classification (Cohan et al., 2019), document-level multiple-choice QA (Pang et al., 2021), and document-level NLI (Koreeda and Manning, 2021).
One approach to working with long documents
is to simply expand standard Transformer-based
language models (BERT of Devlin et al. (2019),
RoBERTa of Liu et al. (2019), etc.) but this is
problematic for long sequences, given the $O(N^2)$ self-attention operations.
Corresponding author: ilias.chalkidis[at]di.ku.dk
Figure 1: Performance - Efficiency trade-off for HAT and Longformer on downstream tasks.
To address this compu-
tational problem, researchers have introduced ef-
ficient Transformer-based architectures. Several
sparse attention networks, such as Longformer of
Beltagy et al. (2020), or BigBird of Zaheer et al.
(2020), have been proposed relying on a combina-
tion of different attention patterns (e.g., relying on
local (neighbor), global and/or randomly selected
tokens). Another approach relies on Hierarchical
Attention Transformers (HATs) that use a multi-
level attention pattern: segment-wise followed by
cross-segment attention. Ad-hoc (partially pre-
trained), and non-standardized variants of HAT
have been presented in the literature (Chalkidis
et al., 2019; Wu et al., 2021; Chalkidis et al., 2022; Liu et al., 2022; Dai et al., 2022), but the potential
of such models is still vastly understudied.
In this work, we examine the potential of fully
(end-to-end) pre-trained HATs and aim to answer
three main questions: (a) Which configurations
of segment-wise and cross-segment attention lay-
ers in HATs perform best? (b) What is the effect
of pre-training HATs end-to-end, compared to ad-hoc (partially pre-trained) HATs, i.e., plugging randomly initialized cross-segment transformer blocks during fine-tuning? (c) Are there computational or downstream performance benefits to using HATs compared to widely-used sparse attention networks, such as Longformer and BigBird?
Figure 2: Attention patterns for the examined architectures: Hierarchical (segment-wise followed by cross-segment attention) and Sparse (combination of windowed and global attention) Attention Transformers.
2 Related Work
2.1 Sparse Attention Transformers
Longformer of Beltagy et al. (2020) consists of local (window-based) attention and global attention that reduces the computational complexity of the model and can thus be deployed to process up to 4096 tokens. Local attention is computed within a window of neighbouring (consecutive) tokens. Global attention relies on the idea of global tokens that are able to attend to, and be attended by, any other token in the sequence. Windowed (local) attention does not leverage hierarchical information in any sense, and can be considered greedy.
BigBird of Zaheer et al. (2020) is another sparse-attention-based Transformer that uses a combination of local, global, and random attention, i.e., all tokens also attend to a number of random tokens on top of those in the same neighbourhood. Both models are warm-started from the public RoBERTa checkpoint and are further pre-trained on masked language modelling. They have been reported to outperform RoBERTa on a range of tasks that require modelling long sequences.
In both cases (models), the attention scores for local (neighbor), global, and randomly selected tokens are combined (added), i.e., attention blends only word-level representations (Figure 2). BigBird is even more computationally expensive, with borderline improved results in some benchmarks, e.g., LRA of Tay et al. (2021), but not in others, e.g., LexGLUE of Chalkidis et al. (2022).
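For concreteness, the sketch below builds the kind of sparse (local + global, optionally random) attention mask that such models rely on. It is an illustration of the attention pattern only, not the official Longformer or BigBird implementation, and the toy sizes are assumptions.

# Illustrative sketch (not the official Longformer/BigBird code): a boolean
# attention mask combining a local window, a few global tokens, and optional
# BigBird-style random tokens. True means "query i may attend to key j".
import torch

def sparse_attention_mask(seq_len: int, window: int = 4,
                          global_ids=(0,), num_random: int = 0) -> torch.Tensor:
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    # Local (windowed) attention: each token attends to its +/- `window` neighbours.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    # Global attention: global tokens attend to, and are attended by, every token.
    for g in global_ids:
        mask[g, :] = True
        mask[:, g] = True
    # Random attention (BigBird): each token also attends to a few random tokens.
    if num_random > 0:
        rows = torch.arange(seq_len).unsqueeze(1).expand(-1, num_random)
        cols = torch.randint(0, seq_len, (seq_len, num_random))
        mask[rows, cols] = True
    return mask

print(sparse_attention_mask(seq_len=16, window=2, num_random=2).int())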
2.2 Hierarchical Attention Transformers
Hierarchical Attention Transformers (HATs) are di-
rectly inspired by Hierarchical Attention Networks
(HANs) of Yang et al. (2016). The main idea is to process (encode) a document in a hierarchical fashion, e.g., contextualize word representations per sentence, and then sentence-level representations
across sentences. Chalkidis et al. (2019) were prob-
ably the first to use HATs as a viable option for
processing long documents based on pre-trained
Transformer-based language models. They show
improved results using a hierarchical variant of
BERT compared to BERT (fed with truncated doc-
uments) or HANs. Similar models were used in the
work of Chalkidis et al. (2022), where they com-
pared hierarchical variants of several pre-trained
language models (BERT, RoBERTa, etc.) showcas-
ing comparable results to Longformer and BigBird
in long document classification tasks. Recently,
Dai et al. (2022) compared ad-hoc RoBERTa-based
HATs with Longformer and reported comparable
performance in four document classification tasks.
Wu et al. (2021) proposed a HAT architec-
ture, named Hi-Transformers, a shallow version
of our interleaved variant presented in detail in Sec-
tion 3.2. They showed that their model performs better than Longformer and BigBird across three classification tasks. Their analysis, however, relies on non-pre-trained models, i.e., all models considered are randomly initialized and directly fine-tuned on the downstream tasks; thus, the impact of pre-training such models is unknown.
Liu et al. (2022) propose a similar architecture,
named Hierarchical Sparse Transformer (HST). Liu
et al. showed that HST has improved results on the Long Range Arena (LRA) benchmark, text classification, and QA compared to Longformer and BigBird. Their analysis considers a single layout (topology) and is mainly limited to datasets where documents are not particularly long (<1000 tokens). In
our work, we consider several HAT layouts (con-
figurations) and evaluate our models in several
segment-level, document-level, and multi-segment
tasks with larger documents (Table 1).
2.3 Other Approaches
Several other efficient Transformer-based models
have been proposed in the literature (Katharopoulos et al., 2020; Kitaev et al., 2020; Choromanski et al., 2021). We refer readers to Xiong et al. (2021) and Tay et al. (2022) for a survey on efficient attention variants.
Figure 3: Top: The two main modules (building blocks) of Hierarchical Attention Transformers (HAT): the Segment-wise (SWE) and the Cross-segment (CSE) encoders. Bottom: The four examined HAT variants.
Recently, other non-Transformer-based approaches (Gu et al., 2022; Gupta et al., 2022)
have been proposed for efficient long sequence
processing relying on structured state spaces (Gu
et al.,2021). In this work, we do not compare
with such architectures (Transformer-based or not),
since there are no standardized implementations or
publicly available pre-trained models to rely on at
the moment. There are several other Transformer-
based encoder-decoder models (Guo et al.,2022;
Pang et al.,2022) targeting generative tasks, e.g.,
long document summarization (Shen et al.,2022),
which are out of the scope of this study.
3 Hierarchical Attention Transformers
3.1 Architecture
Hierarchical Attention Transformers (HATs) consider as input a sequence of tokens ($S$), organized in $N$ equally-sized segments (chunks), i.e., $S = [C_1, C_2, C_3, \dots, C_N]$. Each sub-sequence (segment) is a sequence of $K$ tokens, $C_i = [W_{i[CLS]}, W_{i1}, W_{i2}, W_{i3}, \dots, W_{iK-1}]$, i.e., each segment has its own segment-level representative [CLS] token. A HAT is built using two types of neural modules (blocks): (a) the Segment-wise encoder (SWE), a shared Transformer (Vaswani et al., 2017) block processing each segment ($C_i$) independently, and (b) the Cross-segment encoder (CSE), a Transformer block processing (and contextualizing) the segment-level representative tokens ($W_{i[CLS]}$). The two components can be used in several different layouts (topologies). We present HAT variants (architectures) in Section 3.2.
HATs use two types of absolute positional embeddings to model the position of tokens: segment-wise position embeddings ($P^{sw}_i \in \mathbb{R}^H$, $i \in [1, K]$) to model token positioning per segment, and cross-segment position embeddings ($P^{cs}_i \in \mathbb{R}^H$, $i \in [1, N]$) to model the position of a segment in the document. $P^{sw}$ embeddings are additive to word ones, as in most other Transformer-based models, such as BERT. Similarly, $P^{cs}$ embeddings are added to the segment representations ($W'_{i[CLS]}$) before they are passed to a CSE, and they are shared across all CSEs of the model. A more detailed depiction of HAT, including positional embeddings, is presented in Figure 4 of Appendix B.1.
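To make the above concrete, the following is a minimal, illustrative PyTorch sketch of one paired HAT layer, not the released implementation: a shared segment-wise encoder is applied per segment, and a cross-segment encoder then contextualizes the segment-level [CLS] representations. The toy dimensions, the use of nn.TransformerEncoderLayer as a stand-in for pre-trained blocks, and the placement of both positional embeddings inside the layer are simplifying assumptions.

# Minimal sketch of one paired HAT layer (segment-wise + cross-segment encoder).
import torch
import torch.nn as nn

class HATLayer(nn.Module):
    def __init__(self, hidden: int = 256, heads: int = 4, K: int = 128, N: int = 8):
        super().__init__()
        self.swe = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)  # segment-wise encoder (SWE)
        self.cse = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)  # cross-segment encoder (CSE)
        self.p_sw = nn.Embedding(K, hidden)  # segment-wise position embeddings P^{sw}
        self.p_cs = nn.Embedding(N, hidden)  # cross-segment position embeddings P^{cs}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, N segments, K tokens, hidden]; token 0 of each segment is its [CLS].
        B, N, K, H = x.shape
        x = x + self.p_sw(torch.arange(K, device=x.device))             # add P^{sw} to token representations
        x = self.swe(x.view(B * N, K, H)).view(B, N, K, H)              # encode each segment independently
        cls = x[:, :, 0] + self.p_cs(torch.arange(N, device=x.device))  # segment reps W'_{i[CLS]} + P^{cs}
        cls = self.cse(cls)                                             # contextualize segments across the document
        return torch.cat([cls.unsqueeze(2), x[:, :, 1:]], dim=2)        # write contextualized [CLS] back

layer = HATLayer()
out = layer(torch.randn(2, 8, 128, 256))  # 2 documents, 8 segments of 128 tokens each
print(out.shape)                          # torch.Size([2, 8, 128, 256])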
3.2 Examined Layouts
We first examine several alternative layouts of HAT
layers, i.e., the placement of SWE and CSE:
Ad-Hoc (AH): An ad-hoc (partially pre-trained) HAT (Chalkidis et al., 2022) comprises an initial stack of $L_{SWE}$ shared segment-wise encoders from a pre-trained Transformer-based model, followed by $L_{CSE}$ ad-hoc cross-segment encoders. In this case, the model initially encodes and contextualizes token representations per segment, and then builds higher-order segment-level representations (Figure 3(a)).
Interleaved (I): An interleaved HAT comprises a stack of $L_P$ paired segment-wise and cross-segment encoders. In this case, contrary to the ad-hoc version of HAT, cross-segment attention (contextualization) is performed across several levels (layers) of the model (Figure 3(b)).
Early-Contextualization (EC): An early-contextualized HAT comprises an initial stack of $L_P$ paired segment-wise and cross-segment encoders, followed by a stack of $L_{SWE}$ segment-wise encoders. In this case, cross-segment attention (contextualization) is only performed in the initial layers of the model (Figure 3(c)).
Late-Contextualization (LC): A late-contextualized HAT comprises an initial stack of $L_{SWE}$ segment-wise encoders, followed by a stack of $L_P$ paired segment-wise and cross-segment encoders. In this case, cross-segment attention (contextualization) is only performed in the latter layers of the model (Figure 3(d)).
We present task-specific HAT architectures (e.g.,
for token/segment/document classification, and
multiple-choice QA tasks) in Appendix A.1.
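As a rough illustration, the four layouts can be written down as layer "plans" over the two block types. The 12-to-14-layer splits below are illustrative assumptions, not the exact configurations used in our experiments.

# Hedged sketch: the four examined layouts as layer "plans", where "SW" is a
# segment-wise encoder, "CS" a cross-segment encoder, and "SW+CS" a paired block.
# The depths below are illustrative assumptions, not the released configurations.
LAYOUTS = {
    "ad-hoc (AH)":                  ["SW"] * 12 + ["CS"] * 2,     # L_SWE pre-trained, L_CSE plugged in ad hoc
    "interleaved (I)":              ["SW+CS"] * 8,                # L_P paired encoders throughout
    "early-contextualization (EC)": ["SW+CS"] * 4 + ["SW"] * 8,   # cross-segment attention early only
    "late-contextualization (LC)":  ["SW"] * 8 + ["SW+CS"] * 4,   # cross-segment attention late only
}

for name, plan in LAYOUTS.items():
    print(f"{name:30s} {' -> '.join(plan)}")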
3.3 Tokenization / Segmentation
Since HATs consider a sequence of segments, we need to define a segmentation strategy, i.e., how to group tokens (sub-words) into segments. Standard approaches consider sentences or paragraphs as segments. We opt for a dynamic segmentation strategy that balances the trade-off between preserving the text structure (avoiding sentence truncation) and minimizing padding, which in turn minimizes document truncation. We split each document into $N$ segments by grouping sentences up to $K$ total tokens.¹ Following Dai et al. (2022), our models consider segments of $K=128$ tokens each; such a window was shown to balance computational complexity with task performance.
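The following is a sketch of this dynamic segmentation strategy under simplifying assumptions: whole sentences are greedily packed into segments of at most K sub-word tokens. The sentence splitter and tokenizer shown are illustrative choices, and the helper name is ours.

# Sketch of the dynamic segmentation strategy: greedily pack whole sentences into
# segments of at most K sub-word tokens, avoiding sentence truncation and keeping
# padding low. Splitter/tokenizer choices below are illustrative, not prescriptive.
from nltk import sent_tokenize              # requires the NLTK "punkt" data package
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def segment_document(text: str, K: int = 128, N: int = 8) -> list[list[str]]:
    """Group sentences into at most N segments of up to K sub-word tokens each."""
    segments, current, current_len = [], [], 0
    for sent in sent_tokenize(text):
        n_tokens = len(tokenizer.tokenize(sent))
        # Start a new segment if this sentence would overflow the current one.
        if current and current_len + n_tokens > K:
            segments.append(current)
            current, current_len = [], 0
        current.append(sent)
        current_len += n_tokens
    if current:
        segments.append(current)
    return segments[:N]  # segments beyond N are truncated, as in any fixed-length model

doc = "This is a short example sentence. " * 60
print([len(seg) for seg in segment_document(doc)])  # number of sentences per segment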
4 Experimental Set Up
4.1 Evaluation Tasks
We consider three groups of evaluation tasks:
(a) Upstream (pre-training) tasks, which aim to pre-train (warm-start) the encoder in a generic self-supervised manner; (b) Midstream (quality-assessment) tasks, which aim to estimate the quality of the pre-trained models; and (c) Downstream tasks, which aim to estimate a model's performance in realistic (practical) applications.
Upstream (Pre-training) Task: We consider Masked Language Modeling (MLM), a well-established bidirectional extension of traditional language modeling proposed by Devlin et al. (2019) for Transformer-based text encoders. Following Devlin et al. (2019), we mask 15% of the tokens.
¹ Any sentence splitter can be used. In our work, we consider the NLTK (https://www.nltk.org/) English sentence splitter. We present examples in Appendix B.
Midstream Tasks: We consider four alternative midstream tasks. These tasks aim to assess the quality of the word, segment, and document representations of pre-trained models, i.e., models pre-trained on the MLM task.²
Segment Masked Language Modeling (SMLM), an extension of MLM where a percentage of the tokens in a subset (20%) of the segments are masked. We consider two alternatives: 40% (SMLM-40) and 100% (SMLM-100) masking (a masking sketch in code is given after this list). For this task, we predict the identity of the masked tokens, and we use cross-entropy loss as the evaluation metric. Intuitively, we assess cross-segment contextualization, since we predict the masked words of a segment mainly based on the other segments.
Segment Order Prediction (SOP), where the input to the model is a shuffled sequence of segments from a document. The goal of the task is to predict the correct position (order) of the segments, as in the original document. For this task, we predict the position of each segment as a regression task; hence, our evaluation metric is mean absolute error (MAE). Intuitively, we assess cross-segment contextualization and the quality of segment-level representations, since the segment order has to be resolved given segment relations.
Multiple-Choice Masked Segment Prediction (MC-MSP), where the input to the model is a sequence of segments from a document with one segment masked at a time, and a list of five alternative segments (choices), including the masked one. The goal of this task is for the model to identify the correct segment, i.e., the one masked from the original document. For this task, we predict the id of the correct pair (<masked document, choice>) across all pairs; hence, our evaluation metric is accuracy. Similarly to SOP, we assess cross-segment contextualization and the quality of segment-level representations, since predicting the correct segment has to be resolved based on both document-level semantics and the semantics of the segments neighboring the masked one.
² We present additional details (e.g., dataset curation) for the midstream tasks in Appendix A.2.
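Below is a hedged sketch of the SMLM masking step described above: 20% of a document's segments are selected, and 40% (SMLM-40) or 100% (SMLM-100) of the tokens inside them are replaced by a mask token. The helper and its defaults mirror the task description rather than the released data-curation code.

# Hedged sketch of SMLM-style masking: select 20% of a document's segments and
# mask a fraction of the tokens inside them (40% for SMLM-40, 100% for SMLM-100).
import random

MASK = "<mask>"

def smlm_mask(segments: list[list[str]], segment_ratio: float = 0.2,
              token_ratio: float = 0.4, seed: int = 0) -> list[list[str]]:
    rng = random.Random(seed)
    n_masked_segments = max(1, int(len(segments) * segment_ratio))
    chosen = set(rng.sample(range(len(segments)), n_masked_segments))
    masked = []
    for i, seg in enumerate(segments):
        if i in chosen:
            n_masked_tokens = max(1, int(len(seg) * token_ratio))
            positions = set(rng.sample(range(len(seg)), n_masked_tokens))
            seg = [MASK if j in positions else tok for j, tok in enumerate(seg)]
        masked.append(seg)
    return masked

doc = [[f"w{i}_{j}" for j in range(8)] for i in range(5)]  # 5 toy segments of 8 tokens
print(smlm_mask(doc, token_ratio=1.0))                     # SMLM-100 variant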