The Curious Case of Absolute Position Embeddings
Koustuv Sinha*  Amirhossein Kazemnejad*
Siva Reddy  Joelle Pineau  Dieuwke Hupkes  Adina Williams
McGill University / Mila - Quebec AI; Meta AI
{koustuv.sinha,amirhossein.kazemnejad}@mail.mcgill.ca
Abstract
Transformer language models encode the notion of word order using positional information. Most commonly, this positional information is represented by absolute position embeddings (APEs) that are learned from the pre-training data. However, in natural language, it is not absolute position that matters, but relative position, and the extent to which APEs can capture this type of information has not been investigated. In this work, we observe that models trained with APEs over-rely on positional information to the point that they break down when subjected to sentences with shifted position information. Specifically, when models are subjected to sentences starting from a non-zero position (excluding the effect of priming), they exhibit noticeably degraded performance on zero- to full-shot tasks, across a range of model families and model sizes. Our findings raise questions about the efficacy of APEs to model the relativity of position information, and invite further introspection on the sentence and word order processing strategies employed by these models.
1 Introduction
Recently, Transformer (Vaswani et al., 2017) language models (TLMs) have been widely used for natural language applications. Such models incorporate positional encodings: vectors encoding information about the order of words in context. Many models, such as RoBERTa (Liu et al., 2019), GPT3 (Brown et al., 2020) and OPT (Zhang et al., 2022), utilize absolute position embeddings (APEs) that directly encode absolute (linear) word order. APEs appear to contribute to the performance of such models, although when they are removed, some models become sensitive to ablative word scrambles (Sinha et al., 2021), while others work optimally (Haviv et al., 2022). Thus, what precisely APEs contribute remains unclear.
*Equal contributions.
Figure 1: Transformer models with absolute positional embeddings have different representations for sentences starting from non-zero positions. (Panels: the sentence “Who could Thomas observe without distracting Nathan?” shown with a zero starting position and with a non-zero starting position.)
It is conceivable that APEs may enable the model to handle the relative distances between words. If models were somehow learning relative position information despite using absolute positional embeddings, we would expect sentence encodings to be the same in most cases, regardless of where they appear in the context window. For example, the meaning of “smoking kills” should be constant in “Kim said smoking kills” (positions 2–3) and “It was commonly believed by most adult Americans in the 90s that smoking kills” (positions 13–14), despite the fact that these words appear in different absolute positions. Given this, our central question is: do APEs enable the model to learn the relative distances between the words in a sentence?
Prior work has attempted to explore the consequences of APEs using probing methods (Wang et al., 2021). APEs have been found not to capture the meaning of absolute or relative positions (Wang and Chen, 2020). APEs have also been found to bias model output with positional artefacts (Luo et al., 2021), motivating de-correlation of token and position information, which leads to better performance (Ke et al., 2021). Haviv et al. (2022) even find that causal TLMs perform adequately even without explicit APEs. However, a systematic study on the relativity of positional encodings is still needed.
To better understand the relativity of absolute position embeddings, we first need to ascertain the robustness of relative position understanding for a given input. TLMs are typically trained in batches containing multiple sentences, with a limited sequence window size which is typically much larger than an average sentence. We hypothesize that a systematic model should encode the same sentence equally throughout this context window. However, evaluating the encoding of a sentence starting from any position in this window in isolation is hard, as the representation of the sentence would depend on the prior context (Misra et al., 2020; Kassner and Schütze, 2020).
In this work, we subject models from several different architectures and sizes to phase shifting. In this paradigm, the sentences exposed to the model are provided contiguous position identifiers starting from a non-zero position (Figure 1). Such inspection allows us to gauge the model's sentence encodings at different positions, emulating sub-window sentence representation, while factoring out the influence of prior context. We investigate several zero-shot, few-shot and full-shot tasks by shifting the start positions of the sentences. We observe the following:
• TLMs display different sub-window sentence representation capabilities, resulting in decreased zero-shot task performance and variability in sentence perplexities.
• Autoregressive models, including the recently published OPT (Zhang et al., 2022), show erratic zero- and few-shot performance on sub-window representations, highlighting the brittleness of in-context learning evaluation.
• Masked Language Models (MLMs) encode sentences in non-standard positions better than their autoregressive counterparts.
• During fine-tuning, models suffer drastically on cross phase-shifted evaluation, suggesting position-specific overfitting.
We aim to raise awareness about issues with APEs, which are still widely used in pre-training large language models. Our results highlight the severity of position shortcuts taken by the model during pre-training and fine-tuning, and imply that the sub-window sentence representation capability of TLMs may vary far more than previously assumed. We will release the code and analysis used in this work on Github.¹
2 Approach
Position encodings used by TLMs come in three broad categories: fixed sinusoidal embeddings as proposed by Vaswani et al. (2017); absolute or learned embeddings, popularized by the BERT (Devlin et al., 2019) family of masked language models; and relative positions (Shaw et al., 2018), used by T5 (Raffel et al., 2020). Wang et al. (2021) present a comprehensive overview of current encoding strategies.

Despite being an older method, absolute position embeddings (APEs) are reportedly better than their relative counterparts on several tasks (Ravishankar et al., 2021), and are still used by the majority of large pre-trained TLMs, including the recently released OPT (Zhang et al., 2022). APEs compute a token's representation by adding its token embedding to the position embedding of the corresponding position:
x_i = θ_W[w_i] + θ_P[i],

where θ_W ∈ ℝ^{|V| × d} is the token embedding matrix for a vocabulary of size |V| with embedding dimension d, and θ_P ∈ ℝ^{T × d} is the absolute position embedding matrix, where T is the maximum context window size of the model. Now, a sentence S = [w_1, w_2, ..., w_n] containing n tokens is mapped during inference to positions 1, 2, ..., n contiguously for all models.
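For concreteness, the following minimal PyTorch sketch shows how this lookup-and-add is typically implemented; the sizes and token ids are illustrative and do not correspond to any particular pretrained model, and implementations usually index positions from 0 rather than 1.

```python
import torch
import torch.nn as nn

vocab_size, max_positions, d_model = 50000, 512, 768   # illustrative |V|, T, d

theta_W = nn.Embedding(vocab_size, d_model)    # token embedding matrix, |V| x d
theta_P = nn.Embedding(max_positions, d_model) # absolute position embeddings, T x d

token_ids = torch.tensor([[12, 845, 7, 290]])                 # w_1 ... w_n
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)   # [0, 1, ..., n-1]

# x_i = theta_W[w_i] + theta_P[i]
x = theta_W(token_ids) + theta_P(position_ids)                # shape (1, n, d)
```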
TLMs offer various context window sizes, i.e., the maximum sequence length in tokens they can train and infer on. Since this context window is usually larger than the average sentence length, multiple sentences can be packed together to “fill” the context window during pre-training. This allows TLMs to learn that sentences can start from various positions in their context window. If models trained with APEs do encode relativity of position, then the sentence representations should be roughly equal throughout the context window, regardless of their starting position.
2.1 Phase Shift Methodology
To understand the relativity of APEs, we examine the model performance under phase shift conditions. Phase shift² involves right-shifting the absolute positions of all tokens in the sentence by an equal distance k, such that the tokens are now mapped to new positions 1+k, 2+k, ..., n+k, or x_i = θ_W[w_i] + θ_P[i+k]. As such, phase shifting changes only the absolute positions, but preserves the relative distances between tokens in the sentence. Theoretically, we can shift the positions within the context window as long as k + n ≤ T. For example, given phase shift k = 100 and a sentence of length n, we would have the following vector of position ids:

p = [101, 102, 103, ..., n + 100]

Figure 2: Acceptability scores on the BLiMP (Warstadt et al., 2020) dataset across different phase shifts. RoBERTa only supports a context window of size T = 512, so we capped the scores at phase shift k = 300 to allow sentences of maximum length in BLiMP to be evaluated.

¹ https://github.com/kazemnejad/lm_pos_investigations
² Most related to our work, Kiyono et al. (2021) train a Transformer model from scratch using shifted positional embeddings for machine translation, and observe improved performance in extrapolation and interpolation setups.
While computing the task scores and perplexities of the models, we observed that all of the models exhibit poor task performance under phase shifts. Due to the non-shiftable nature of the [CLS] token in masked language models (MLMs), we first fix the position of the [CLS] token to the start position during phase shifting, which results in significantly improved performance for all models:

p = [1, 102, 103, ..., n + 100]
Figure 3: Distribution of sentences in BLiMP (Warstadt et al., 2020) having the lowest perplexities (i.e., deemed most acceptable) for each phase shift.

Furthermore, we observed yet another marked improvement in task performance when we use special tokens at the beginning of the sentence: typically the end-of-sentence ([EOS]) token in the case of MLM models (RoBERTa, BART). An explanation for this behavior is that, when models are pre-trained, multiple sentences are typically packed together in the context window, with the start of each sentence delimited by an [EOS] token.³ Thus, in all of our results, we opt for this configuration (adding an [EOS] token before the sentence) to ensure fairer evaluation for all model families. Concretely, the input to a model uses the following template:⁴

[CLS][EOS]<sentence>
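As a concrete illustration (not the paper's released code), the sketch below builds this template with a RoBERTa tokenizer and constructs phase-shifted position ids in which the first ([CLS]) position is held fixed, passing them to the model through the HuggingFace `position_ids` argument. The shift value and sentence are arbitrary, and HuggingFace RoBERTa additionally offsets position ids internally by the padding index, which a faithful implementation would have to account for.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

k = 100  # phase shift (arbitrary example value)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

sentence = "Who could Thomas observe without distracting Nathan?"
# Prepending the EOS string yields the template [CLS][EOS]<sentence>[EOS]
# once the tokenizer adds its own special tokens.
enc = tokenizer(tokenizer.eos_token + " " + sentence, return_tensors="pt")
n = enc["input_ids"].size(1)

# Phase-shifted, 0-indexed positions with [CLS] pinned to the start position:
# [0, k+1, k+2, ..., k+n-1]  (cf. p = [1, 102, 103, ...] in the text).
position_ids = torch.arange(n) + k
position_ids[0] = 0
# NOTE: real RoBERTa code would add the model's internal position offset here.

with torch.no_grad():
    out = model(**enc, position_ids=position_ids.unsqueeze(0))
```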
3 Impact of phase shifts on grammatical acceptability

First, we investigate the impact of phase shifting on model performance. We compute the perplexities of several publicly available models—RoBERTa (Liu et al., 2019), BART (Lewis et al., 2020), GPT2 (Radford et al., 2019) and OPT (Zhang et al., 2022)—to evaluate the grammatical acceptability capabilities of the models, using the BLiMP (Warstadt et al., 2020) benchmark.⁵ We compute the task score by comparing grammatical and ungrammatical sentence perplexities, applying phase shifts with increasing values of k to the sentences and models (Figure 2).
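As a minimal sketch of this scoring for a causal model (not the paper's exact pipeline, which also relies on the MLM pseudo-perplexity of Salazar et al. (2020) for RoBERTa and BART), one can compare the average token negative log-likelihood of a grammatical and an ungrammatical sentence under increasing phase shifts; the sentence pair below is an invented BLiMP-style example.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_nll(sentence: str, k: int) -> float:
    """Average token negative log-likelihood with all positions shifted by k."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    positions = (torch.arange(ids.size(1)) + k).unsqueeze(0)  # phase-shifted ids
    with torch.no_grad():
        out = model(ids, position_ids=positions, labels=ids)
    return out.loss.item()

good, bad = "The cats sleep.", "The cats sleeps."  # invented minimal pair
for k in (0, 100, 300):
    # The pair counts as correct if the grammatical sentence scores lower NLL.
    print(k, avg_nll(good, k) < avg_nll(bad, k))
```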
We observe that the task performance of all models, except for RoBERTa, drastically suffers from phase shifting. Autoregressive models in particular display worse results. This is likely due to a mismatch between the position information learned under the causal language modelling objective and the position information provided to the model during phase shift (Haviv et al., 2022). We also compare the perplexities of each sentence across different phase shifts and plot the frequency of sentences having the lowest perplexity at each k (Figure 3). We observe that in GPT2 more than 70% of the sentences have their best perplexity at k = 0, highlighting a severe zero-position bias. OPT350M has better sub-window sentence representation capacity than the similarly sized GPT2, which is also evident from the acceptability results in Figure 2.

³ While this is not the case for GPT2, we also observed improved performance in some cases when we add a beginning-of-sentence ([BOS]) token to the sentence and add a special [EOS] token to delimit the start of a sentence.
⁴ In cases where a model does not have the [CLS] token, we instead use [BOS]. If neither is available, we replace it with [EOS] (so a total of two [EOS]s will be prepended).
⁵ We adopt the perplexity computation strategy for RoBERTa and BART from Salazar et al. (2020).

Figure 4: Aggregate performance of the OPT family on six NLP tasks when various phase shifts are applied.
4 Impact of phase shifts on in-context learning

More recently, zero-shot and few-shot inference, commonly referred to as in-context learning, has become a de facto standard for evaluating pretrained language models (Brown et al., 2020). In this approach, the model's predictions are produced by conditioning it on certain prompts, such as instructions (zero-shot setting) or a few examples of input-output pairs (few-shot setting). In both cases, the model faces an extended input text, and we suspect it will be affected by deficiencies of APEs. To evaluate this hypothesis, we employ an experimental setup similar to §3. Under zero-shot and five-shot inference regimes, we assess model performance on standard NLP tasks when it is fed inputs at increasing values of phase shift. We choose the OPT model family because it is available in a wide range of sizes (125M to 30B parameters), allowing us to examine the behavior of APEs at different scales. Moreover, our evaluations take into account four tasks reported in the original paper: Winogrande (Sakaguchi et al., 2020), COPA (Gordon et al., 2012), PIQA (Bisk et al., 2020), and ARC (Clark et al., 2018), as well as two classification datasets from the GLUE benchmark (Wang et al., 2019): MRPC and RTE. We provide an aggregated view of the models' performance on all six accuracy-dominated benchmarks in Figure 4. The detailed plots for each task are in Appendix B.

Figure 5: Distribution of prompts with the best accuracy across all six tasks.
In most tasks, performance deteriorates when the model processes inputs at any phase shift other than zero, especially in zero-shot inference. More importantly, the model's performance is not always adversely affected by phase shifts: in fact, Figure 5 shows that non-zero starting positions result in the best accuracy for many prompts. This erratic performance is present in all model sizes, and scaling the number of parameters does not help. Furthermore, one can see that larger models are more affected by shifted starting positions, which suggests that absolute positional embeddings might need more data or training as the number of parameters increases.
Figure 6: GLUE task heatmap with varying fine-tuning train and test phase shifts, averaged across all models. Darker colors represent better task performance.
5 Impact of phase shifts on fine-tuning
Finally, we investigate the effect of phase shift in fine-tuning. We ask whether the models can generalize to out-of-phase sentences for a given task. We train RoBERTa, BART, GPT2 and OPT models on the CoLA, RTE and MRPC tasks from the GLUE benchmark (Wang et al., 2019) and evaluate them on phase shifts. We choose these three relatively small tasks in order to decrease the number of gradient updates to position embeddings during fine-tuning. We perform a cross-phase analysis by training and evaluating across different phase shifts (k = 0, 100, 200, 300) for all models on the same set of datasets, and show the averaged performance. We observe that for all models the task performance drops during out-of-phase evaluation (non-diagonals in Figure 6).
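The cross-phase grid can be summarized by the sketch below; `finetune` and `evaluate` are hypothetical placeholders standing in for a standard GLUE fine-tuning and evaluation loop in which every input receives phase-shifted position ids as described in §2.1.

```python
# Hypothetical sketch of the cross-phase analysis; finetune() and evaluate()
# stand in for a standard GLUE fine-tuning/evaluation loop whose inputs are
# given phase-shifted position ids (see the sketch in Section 2.1).
SHIFTS = (0, 100, 200, 300)

def cross_phase_grid(model_name, task, finetune, evaluate):
    scores = {}
    for k_train in SHIFTS:
        model = finetune(model_name, task, shift=k_train)   # train at shift k_train
        for k_test in SHIFTS:
            scores[(k_train, k_test)] = evaluate(model, task, shift=k_test)
    return scores  # diagonal: same-phase; off-diagonal: out-of-phase evaluation
```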
The drop in performance when evaluating out-of-phase sentences might simply be attributed to overfitting on position information during fine-tuning. However, we observe that for all tasks, training and evaluating on the same phase shift is worse when k ≠ 0 (diagonals in Figure 6). Out-of-phase training appears to be worst for CoLA, which suffers drastically when fine-tuning on different phase shifts. These results highlight a potential task data bias with respect to different positions.
6 Conclusion
In this work, we investigate the abilities of APEs in encoding the relative positions of the tokens in an input. We observe that TLMs using APEs encode sentences differently based on the starting position of the sentence in the context window. This result has major implications for the way we perceive the sentence processing capabilities of TLMs. Specifically, we observe that the representation of the same sentence varies depending on where it is in the context window, to the point that it impacts zero-shot, few-shot and full-shot task performance on sub-window sentences. Future work could leverage the start position in building robust and position-generalizable models. We hope our work can inform the community about the pitfalls of using APEs, and inspire the development and adoption of alternative, relative position embedding based approaches.
Limitations
Our work primarily focuses on evaluating the relative position encoding of APEs. We do not focus on relative position embeddings (RPEs; Shaw et al., 2018; Raffel et al., 2020), as our method of phase-shift analysis is not applicable to those classes of models. RPEs compute window-based position information on the fly, which does not require them to store an embedding for each absolute position. A phase shift in an RPE model would therefore not change the sentence processing pipeline, as the model recomputes the position information based on the shifted window. Thus, we need different tools than the one proposed in this paper to study the relative position encoding of RPEs.
We also acknowledge that our study is primarily focused on English language data from BLiMP and GLUE. It is likely that the same results would hold in a multi-lingual model; however, since many languages have more flexible word order than English, this should be investigated in follow-up work.
Ethical Consideration
Our work aims at understanding the difference in sentence representation caused by shifting position information. In practice, this could yield unintended results from a TLM deployed in production. Since we observe a large variation in results, we advise caution when deploying TLMs in sensitive real-world applications, as the relative positioning of a given sentence might evoke different responses from the model. We hope our work can motivate the use of better positional encoding schemes when pre-training TLMs in the future.
Acknowledgements
We would like to thank Kanishka Misra, Shagun Sodhani, Stephen Roller and Kushal Arora for their feedback on initial versions of this draft. We are also grateful for the anonymous reviewers' feedback. Siva Reddy acknowledges the support of the Facebook CIFAR AI Chair program.