
Early-Contextualization (EC): An early-contextualized HAT comprises an initial stack of L_P paired segment-wise and cross-segment encoders, followed by a stack of L_SWE segment-wise encoders. In this case, cross-segment attention (contextualization) is only performed in the initial layers of the model (Figure 3(c)).
Late-Contextualization (LC): A late-contextualized HAT comprises an initial stack of L_SWE segment-wise encoders, followed by a stack of L_P paired segment-wise and cross-segment encoders. In this case, cross-segment attention (contextualization) is only performed in the latter layers of the model (Figure 3(d)).
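To make the two layouts concrete, the following minimal Python sketch enumerates the encoder-block order of each variant; the function names and the "SW"/"CS" block labels are illustrative stand-ins, not the paper's released implementation.

```python
# Minimal sketch of the two layouts. "SW" = segment-wise encoder,
# "CS" = cross-segment encoder; a paired block is one SW layer
# followed by one CS layer. Names are illustrative only.

def early_contextualization(L_P, L_SWE):
    # Cross-segment attention only in the initial layers.
    return ["SW", "CS"] * L_P + ["SW"] * L_SWE

def late_contextualization(L_SWE, L_P):
    # Cross-segment attention only in the latter layers.
    return ["SW"] * L_SWE + ["SW", "CS"] * L_P

# Example: an encoder budget of 4 paired + 4 segment-wise blocks.
print(early_contextualization(L_P=4, L_SWE=4))
print(late_contextualization(L_SWE=4, L_P=4))
```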
We present task-specific HAT architectures (e.g., for token/segment/document classification and multiple-choice QA tasks) in Appendix A.1.
3.3 Tokenization / Segmentation
Since HATs consider a sequence of segments, we need to define a segmentation strategy, i.e., how to group tokens (sub-words) into segments. Standard approaches consider sentences or paragraphs as segments. We opt for a dynamic segmentation strategy that balances the trade-off between preserving the text structure (avoiding sentence truncation) and minimizing padding, which in turn minimizes document truncation. We split each document into N segments by grouping sentences up to K total tokens.¹ Following Dai et al. (2022), our models consider segments of K=128 tokens each; such a window was shown to balance computational complexity with task performance.

¹ Any sentence splitter can be used. In our work, we consider the NLTK (https://www.nltk.org/) English sentence splitter. We present examples in Appendix B.
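This dynamic strategy can be sketched as a greedy grouping of consecutive sentences, assuming a sentence splitter (e.g., NLTK's) has already run; the helper name and the whitespace token counter in the usage example are illustrative placeholders for the model's actual sub-word tokenizer.

```python
# Minimal sketch of the greedy segmentation described above: group
# consecutive sentences into segments of at most K tokens. The
# `count_tokens` callable stands in for the model's sub-word tokenizer.

def segment_sentences(sentences, count_tokens, K=128):
    segments, current, current_len = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        # Start a new segment if adding this sentence would exceed K.
        if current and current_len + n > K:
            segments.append(current)
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        segments.append(current)
    return segments

# Example with a whitespace "tokenizer" for illustration only.
sents = ["First sentence here.", "A second, longer sentence follows.", "Short."]
print(segment_sentences(sents, count_tokens=lambda s: len(s.split()), K=8))
```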
4 Experimental Setup
4.1 Evaluation Tasks
We consider three groups of evaluation tasks: (a) Upstream (pre-training) tasks, which aim to pre-train (warm-start) the encoder in a generic self-supervised manner; (b) Midstream (quality-assessment) tasks, which aim to estimate the quality of the pre-trained models; and (c) Downstream tasks, which aim to estimate a model's performance in realistic (practical) applications.
Upstream (Pre-training) Task: We consider Masked Language Modeling (MLM), a well-established bidirectional extension of traditional language modeling proposed by Devlin et al. (2019) for Transformer-based text encoders. Following Devlin et al. (2019), we mask 15% of the tokens.
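As a rough illustration of this corruption step (not the paper's code), the sketch below masks 15% of the token ids; real pipelines such as BERT's additionally apply the 80/10/10 mask/random/keep split from Devlin et al. (2019), which we omit here for brevity. The [MASK] id and the -100 ignore label are placeholder conventions.

```python
import random

MASK_ID = 103  # placeholder [MASK] id; depends on the tokenizer's vocabulary

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """Replace ~mask_prob of the tokens with [MASK]; labels keep the
    original ids at masked positions and -100 elsewhere."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)
            labels.append(tid)       # predict the original token here
        else:
            inputs.append(tid)
            labels.append(-100)      # position ignored by cross-entropy
    return inputs, labels

# Example: mask a toy sequence of token ids.
print(mask_tokens(list(range(20)), seed=0))
```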
Midstream Tasks: We consider four alternative midstream tasks. These tasks aim to assess the quality of the word, segment, and document representations of pre-trained models, i.e., models pre-trained on the MLM task.²
• Segment Masked Language Modeling (SMLM), an extension of MLM where a percentage of the tokens in a subset (20%) of segments is masked. We consider two alternatives: 40% (SMLM-40) and 100% (SMLM-100) masking. For this task, we predict the identity of the masked tokens; we use cross-entropy loss as the evaluation metric. Intuitively, we assess cross-segment contextualization, since the masked words of a segment are predicted mainly based on the other segments (a minimal sketch follows this list).
• Segment Order Prediction (SOP), where the input to the model is a shuffled sequence of segments from a document. The goal of the task is to predict the correct position (order) of the segments, as in the original document. For this task, we predict the position of each segment as a regression task; hence our evaluation metric is mean absolute error (MAE). Intuitively, we assess cross-segment contextualization and the quality of segment-level representations, since the segment order has to be resolved from segment relations.
• Multiple-Choice Masked Segment Prediction (MC-MSP), where the input to the model is a sequence of segments from a document with one segment masked at a time, and a list of five alternative segments (choices) including the masked one. The goal of this task is for the model to identify the correct segment, i.e., the one masked from the original document. For this task, we predict the id of the correct pair (<masked document, choice>) across all pairs; hence our evaluation metric is accuracy. As with SOP, we assess cross-segment contextualization and the quality of segment-level representations, since predicting the correct segment has to be resolved based on both document-level semantics and those of the segments neighboring the masked one.
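To illustrate how SMLM corruption differs from plain MLM, here is a minimal, self-contained sketch under the same placeholder assumptions as before ([MASK] id, -100 as the ignored label value): it masks a fraction of the tokens only inside a randomly chosen 20% subset of segments.

```python
import random

MASK_ID = 103  # placeholder [MASK] id, as in the MLM sketch above

def smlm_mask(segments, token_mask_prob, segment_prob=0.2, seed=0):
    """Mask token_mask_prob of the tokens (0.4 for SMLM-40, 1.0 for
    SMLM-100) inside a segment_prob subset of segments; all other
    positions get the ignored label -100."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for seg in segments:                        # seg: list of token ids
        selected = rng.random() < segment_prob  # is this segment masked?
        for tid in seg:
            if selected and rng.random() < token_mask_prob:
                inputs.append(MASK_ID)
                labels.append(tid)              # predict the original token
            else:
                inputs.append(tid)
                labels.append(-100)             # ignored by the loss
    return inputs, labels

# Example: three toy segments, SMLM-100 masking.
segs = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(smlm_mask(segs, token_mask_prob=1.0))
```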
² We present additional details (e.g., dataset curation) for the midstream tasks in Appendix A.2.