The Curious Case of Absolute Position Embeddings
Koustuv Sinha*  Amirhossein Kazemnejad*
Siva Reddy  Joelle Pineau  Dieuwke Hupkes  Adina Williams
McGill University / Mila - Quebec AI; Meta AI
{koustuv.sinha,amirhossein.kazemnejad}@mail.mcgill.ca
Abstract
Transformer language models encode the notion of word order using positional information. Most commonly, this positional information is represented by absolute position embeddings (APEs) that are learned from the pre-training data. However, in natural language, it is not absolute position that matters, but relative position, and the extent to which APEs can capture this type of information has not been investigated. In this work, we observe that models trained with APEs over-rely on positional information to the point that they break down when subjected to sentences with shifted position information. Specifically, when models are subjected to sentences starting from a non-zero position (excluding the effect of priming), they exhibit noticeably degraded performance on zero- to full-shot tasks, across a range of model families and model sizes. Our findings raise questions about the efficacy of APEs to model the relativity of position information, and invite further introspection on the sentence and word order processing strategies employed by these models.
1 Introduction
Recently, Transformer (Vaswani et al., 2017) language models (TLMs) have been widely used for natural language applications. Such models incorporate positional encodings: vectors encoding information about the order of words in context. Many models, such as RoBERTa (Liu et al., 2019), GPT3 (Brown et al., 2020) and OPT (Zhang et al., 2022), utilize absolute position embeddings (APEs) that directly encode absolute (linear) word order. APEs appear to contribute to the performance of such models, although when they are removed, some models become sensitive to ablative word scrambles (Sinha et al., 2021), while others work optimally (Haviv et al., 2022). Thus, what precisely APEs contribute remains unclear.
*Equal contributions.
Figure 1: Transformer models with absolute positional embeddings have different representations for sentences starting from non-zero positions. (Panels: the sentence “Who could Thomas observe without distracting Nathan?” shown with a zero starting position and with a non-zero starting position.)
It is conceivable that APEs may enable the model to handle the relative distances between words. If models were somehow learning relative position information despite using absolute positional embeddings, we would expect sentence encodings to be the same in most cases, regardless of where they appear in the context window. For example, the meaning of “smoking kills” should be constant in “Kim said smoking kills” (positions 2–3) and “It was commonly believed by most adult Americans in the 90s that smoking kills” (positions 13–14), despite the fact that these words appear in different absolute positions. Given this, our central question is: do APEs enable the model to learn the relative distances between the words in a sentence?
Prior work has attempted to explore the consequences of APEs using probing methods (Wang et al., 2021). APEs have been found not to capture the meaning of absolute or relative positions (Wang and Chen, 2020). APEs have also been found to bias model output with positional artefacts (Luo et al., 2021), motivating de-correlation of token and position information, which leads to better performance (Ke et al., 2021). Haviv et al. (2022) even find that causal TLMs perform adequately even without explicit APEs. However, a systematic study on the relativity of positional encodings is still needed.
To better understand the relativity of absolute position embeddings, we first need to ascertain the robustness of relative position understanding for a given input. TLMs are typically trained in batches containing multiple sentences, with a limited sequence window size which is typically much larger than an average sentence. We hypothesize that a systematic model should encode the same sentence equally throughout this context window. However, evaluating the encoding of a sentence starting from any position in this window in isolation is hard, as the representation of the sentence would depend on the prior context (Misra et al., 2020; Kassner and Schütze, 2020).
In this work, we subject models from several different architectures and sizes to phase shifting. In this paradigm, the sentences exposed to the model are provided contiguous position identifiers starting from a non-zero position (Figure 1). Such inspection allows us to gauge the model's sentence encodings at different positions, emulating sub-window sentence representation, while factoring out the influence of prior context. We investigate several zero-shot, few-shot and full-shot tasks by shifting the start positions of the sentences. We observe the following:
• TLMs display different sub-window sentence representation capabilities, resulting in decreased zero-shot task performance and variability in sentence perplexities.
• Autoregressive models, including the recently published OPT (Zhang et al., 2022), show erratic zero- and few-shot performance on sub-window representations, highlighting the brittleness of in-context learning evaluation.
• Masked Language Models (MLMs) encode sentences in non-standard positions better than their autoregressive counterparts.
• During fine-tuning, models suffer drastically on cross phase-shifted evaluation, suggesting position-specific overfitting.
We aim to raise awareness about issues with APEs, which are still widely used in pre-training large language models. Our results highlight the severity of position shortcuts taken by the model during pre-training and fine-tuning, and imply that the sub-window sentence representation capability of TLMs may vary far more than previously assumed. We will release the code and analysis used in this work on Github.¹
2 Approach
Position encodings used by TLMs come in three broad categories: fixed sinusoidal embeddings as proposed by Vaswani et al. (2017); absolute or learned embeddings, popularized by the BERT (Devlin et al., 2019) family of masked language models; and relative positions (Shaw et al., 2018), used by T5 (Raffel et al., 2020). Wang et al. (2021) present a comprehensive overview of current encoding strategies.

Despite being an older method, absolute position embeddings (APEs) are reportedly better than their relative counterparts on several tasks (Ravishankar et al., 2021), and are still used by the majority of large pre-trained TLMs, including the recently released OPT (Zhang et al., 2022). APEs compute a token's representation by adding its token embedding to the position embedding of the corresponding position:
x_i = θ_W[w_i] + θ_P[i],

where θ_W ∈ ℝ^{|V| × d} is the token embedding matrix for a vocabulary of size |V| with embedding dimension d, and θ_P ∈ ℝ^{T × d} is the absolute position embedding matrix, where T is the maximum context window size of the model. Now, a sentence S = [w_1, w_2, ..., w_n] containing n tokens is mapped during inference to positions 1, 2, ..., n contiguously for all models.
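For concreteness, the following minimal PyTorch sketch shows how this lookup-and-add is typically implemented; the sizes and token ids are illustrative and do not correspond to any particular pretrained model, and implementations usually index positions from 0 rather than 1.

```python
import torch
import torch.nn as nn

vocab_size, max_positions, d_model = 50000, 512, 768   # illustrative |V|, T, d

theta_W = nn.Embedding(vocab_size, d_model)    # token embedding matrix, |V| x d
theta_P = nn.Embedding(max_positions, d_model) # absolute position embeddings, T x d

token_ids = torch.tensor([[12, 845, 7, 290]])                 # w_1 ... w_n
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)   # [0, 1, ..., n-1]

# x_i = theta_W[w_i] + theta_P[i]
x = theta_W(token_ids) + theta_P(position_ids)                # shape (1, n, d)
```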
TLMs offer various context window sizes, i.e., the maximum sequence length in tokens they can train and infer on. Since this context window is usually larger than the average sentence length, multiple sentences can be packed together to “fill” the context window during pre-training. This allows TLMs to learn that sentences can start from various positions in their context window. If models trained with APEs do encode relativity of position, then the sentence representations should be roughly equal throughout the context window, regardless of their starting position.
2.1 Phase Shift Methodology
To understand the relativity of APEs, we examine the model performance under phase shift conditions. Phase shift² involves right-shifting the absolute positions of all tokens in the sentence by an equal distance k, such that the tokens are now mapped to new positions 1+k, 2+k, ..., n+k, or x_i = θ_W[w_i] + θ_P[i+k]. As such, phase shifting changes only the absolute positions, but preserves the relative distances between tokens in the sentence. Theoretically, we can shift the positions within the context window as long as k + n ≤ T. For example, given phase shift k = 100 and a sentence of length n, we would have the following vector of position ids:

p = [101, 102, 103, ..., n + 100]

Figure 2: Acceptability scores on the BLiMP (Warstadt et al., 2020) dataset across different phase shifts. RoBERTa only supports a context window of size T = 512, so we capped the scores at phase shift k = 300 to allow sentences of maximum length in BLiMP to be evaluated.

¹ https://github.com/kazemnejad/lm_pos_investigations
² Most related to our work, Kiyono et al. (2021) train a Transformer model from scratch using shifted positional embeddings for machine translation, and observe improved performance in extrapolation and interpolation setups.
While computing the task scores and perplexities of the models, we observed that all of the models exhibit poor task performance under phase shifts. Due to the non-shiftable nature of the [CLS] token in masked language models (MLMs), we first fix the position of the [CLS] token to the start position during phase shifting, which results in significantly improved performance for all models:

p = [1, 102, 103, ..., n + 100]
Figure 3: Distribution of sentences in BLiMP (Warstadt et al., 2020) having the lowest perplexities (i.e., deemed most acceptable) for each phase shift.

Furthermore, we observed yet another marked improvement in task performance when we use special tokens at the beginning of the sentence: typically the end-of-sentence ([EOS]) token in the case of MLM models (RoBERTa, BART). An explanation for this behavior is that, when models are pre-trained, multiple sentences are typically packed together in the context window, with the start of each sentence delimited by an [EOS] token.³ Thus, in all of our results, we opt for this configuration (adding an [EOS] token before the sentence) to ensure fairer evaluation for all model families. Concretely, the input to a model uses the following template:⁴

[CLS][EOS]<sentence>
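As a concrete illustration (not the paper's released code), the sketch below builds this template with a RoBERTa tokenizer and constructs phase-shifted position ids in which the first ([CLS]) position is held fixed, passing them to the model through the HuggingFace `position_ids` argument. The shift value and sentence are arbitrary, and HuggingFace RoBERTa additionally offsets position ids internally by the padding index, which a faithful implementation would have to account for.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

k = 100  # phase shift (arbitrary example value)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

sentence = "Who could Thomas observe without distracting Nathan?"
# Prepending the EOS string yields the template [CLS][EOS]<sentence>[EOS]
# once the tokenizer adds its own special tokens.
enc = tokenizer(tokenizer.eos_token + " " + sentence, return_tensors="pt")
n = enc["input_ids"].size(1)

# Phase-shifted, 0-indexed positions with [CLS] pinned to the start position:
# [0, k+1, k+2, ..., k+n-1]  (cf. p = [1, 102, 103, ...] in the text).
position_ids = torch.arange(n) + k
position_ids[0] = 0
# NOTE: real RoBERTa code would add the model's internal position offset here.

with torch.no_grad():
    out = model(**enc, position_ids=position_ids.unsqueeze(0))
```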
3 Impact of phase shifts on grammatical acceptability

First, we investigate the impact of phase shifting on model performance. We compute the perplexities of several publicly available models—RoBERTa (Liu et al., 2019), BART (Lewis et al., 2020), GPT2 (Radford et al., 2019) and OPT (Zhang et al., 2022)—to evaluate the grammatical acceptability capabilities of the models, using the BLiMP (Warstadt et al., 2020) benchmark.⁵ We compute the task score by comparing grammatical and ungrammatical sentence perplexities, applying phase shifts with increasing values of k to the sentences and models (Figure 2).
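As a minimal sketch of this scoring for a causal model (not the paper's exact pipeline, which also relies on the MLM pseudo-perplexity of Salazar et al. (2020) for RoBERTa and BART), one can compare the average token negative log-likelihood of a grammatical and an ungrammatical sentence under increasing phase shifts; the sentence pair below is an invented BLiMP-style example.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_nll(sentence: str, k: int) -> float:
    """Average token negative log-likelihood with all positions shifted by k."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    positions = (torch.arange(ids.size(1)) + k).unsqueeze(0)  # phase-shifted ids
    with torch.no_grad():
        out = model(ids, position_ids=positions, labels=ids)
    return out.loss.item()

good, bad = "The cats sleep.", "The cats sleeps."  # invented minimal pair
for k in (0, 100, 300):
    # The pair counts as correct if the grammatical sentence scores lower NLL.
    print(k, avg_nll(good, k) < avg_nll(bad, k))
```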
We observe that the task performance of all models, except for RoBERTa, drastically suffers from phase shifting. Autoregressive models in particular display worse results. This is likely due to a mismatch between the position information learned under the causal language modelling objective and the position information provided to the model during phase shift (Haviv et al., 2022). We also compare the perplexities of each sentence across different phase shifts and plot the frequency of sentences having the lowest perplexity at each k (Figure 3). We observe that in GPT2 more than 70% of the sentences have their best perplexity at k = 0, highlighting a severe zero-position bias. OPT350M has better sub-window sentence representation capacity than the similarly sized GPT2, which is also evident from the acceptability results in Figure 2.

³ While this is not the case for GPT2, we also observed improved performance in some cases when we add a beginning-of-sentence ([BOS]) token to the sentence and add a special [EOS] token to delimit the start of a sentence.
⁴ In cases where a model does not have the [CLS] token, we instead use [BOS]. If neither is available, we replace it with [EOS] (so a total of two [EOS]s will be prepended).
⁵ We adopt the perplexity computation strategy for RoBERTa and BART from Salazar et al. (2020).

Figure 4: Aggregate performance of the OPT family on six NLP tasks when various phase shifts are applied.
4 Impact of phase shifts on in-context learning

More recently, zero-shot and few-shot inference, commonly referred to as in-context learning, has become a de facto standard for evaluating pretrained language models (Brown et al., 2020). In this approach, the model's predictions are produced by conditioning it on certain prompts, such as instructions (zero-shot setting) or a few examples of input-output pairs (few-shot setting). In both cases, the model faces an extended input text, and we suspect it will be affected by deficiencies of APEs. To evaluate this hypothesis, we employ an experimental setup similar to §3. Under zero-shot and five-shot inference regimes, we assess model performance on standard NLP tasks when it is fed inputs at increasing values of phase shift. We choose the OPT model family because it is available in a wide range of sizes (125M to 30B parameters), allowing us to examine the behavior of APEs at different scales. Moreover, our evaluations take into account four tasks reported in the original paper: Winogrande (Sakaguchi et al., 2020), COPA (Gordon et al., 2012), PIQA (Bisk et al., 2020), and ARC (Clark et al., 2018), as well as two classification datasets from the GLUE benchmark (Wang et al., 2019): MRPC and RTE. We provide an aggregated view of the models' performance on all six accuracy-dominated benchmarks in Figure 4. The detailed plots for each task are in Appendix B.

Figure 5: Distribution of prompts with the best accuracy across all six tasks.
In most tasks, performance deteriorates when the model processes inputs at any phase shift other than zero, especially in zero-shot inference. More importantly, the model's performance is not always adversely affected by phase shifts: in fact, Figure 5 shows that non-zero starting positions result in the best accuracy for many prompts. This erratic performance is present in all model sizes, and scaling the number of parameters does not help. Furthermore, one can see that larger models are more affected by shifted starting positions, which suggests that absolute positional embeddings might need more data or training as the number of parameters increases.
Figure 6: GLUE task heatmap with varying fine-tuning train and test phase shifts, averaged across all models. Darker colors represent better task performance.
5 Impact of phase shifts on fine-tuning
Finally, we investigate the effect of phase shift in fine-tuning. We ask whether the models can generalize to out-of-phase sentences for a given task. We train RoBERTa, BART, GPT2 and OPT models on the CoLA, RTE and MRPC tasks from the GLUE benchmark (Wang et al., 2019) and evaluate them on phase shifts. We choose these three relatively small tasks in order to decrease the number of gradient updates to position embeddings during fine-tuning. We perform a cross-phase analysis by training and evaluating across different phase shifts (k = 0, 100, 200, 300) for all models on the same set of datasets, and show the averaged performance. We observe that for all models the task performance drops during out-of-phase evaluation (non-diagonals in Figure 6).
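The cross-phase grid can be summarized by the sketch below; `finetune` and `evaluate` are hypothetical placeholders standing in for a standard GLUE fine-tuning and evaluation loop in which every input receives phase-shifted position ids as described in §2.1.

```python
# Hypothetical sketch of the cross-phase analysis; finetune() and evaluate()
# stand in for a standard GLUE fine-tuning/evaluation loop whose inputs are
# given phase-shifted position ids (see the sketch in Section 2.1).
SHIFTS = (0, 100, 200, 300)

def cross_phase_grid(model_name, task, finetune, evaluate):
    scores = {}
    for k_train in SHIFTS:
        model = finetune(model_name, task, shift=k_train)   # train at shift k_train
        for k_test in SHIFTS:
            scores[(k_train, k_test)] = evaluate(model, task, shift=k_test)
    return scores  # diagonal: same-phase; off-diagonal: out-of-phase evaluation
```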
The drop in performance when evaluating out-of-phase sentences might simply be attributed to overfitting on position information during fine-tuning. However, we observe that for all tasks, training and evaluating on the same phase shift is worse when k ≠ 0 (diagonals in Figure 6). Out-of-phase training appears to be worst for CoLA, which suffers drastically when fine-tuning on different phase shifts. These results highlight a potential task data bias with respect to different positions.
6 Conclusion
In this work, we investigate the abilities of APEs in encoding the relative positions of the tokens in an input. We observe that TLMs using APEs encode sentences differently based on the starting position of the sentence in the context window. This result has major implications for the way we perceive the sentence processing capabilities of TLMs. Specifically, we observe that the representation of the same sentence varies depending on where it is in the context window, to the point that it impacts zero-shot, few-shot and full-shot task performance on sub-window sentences. Future work could leverage the start position in building robust and position-generalizable models. We hope our work can inform the community about the pitfalls of using APEs, and inspire the development and adoption of alternative, relative position embedding based approaches.
Limitations
Our work primarily focuses on evaluating the relative position encoding of APEs. We do not focus on relative position embeddings (RPEs; Shaw et al., 2018; Raffel et al., 2020), as our method of phase-shift analysis is not applicable to those classes of models. RPEs compute window-based position information on the fly, which does not require them to store an embedding for each absolute position. A phase shift in an RPE model would therefore not change the sentence processing pipeline, as the model recomputes the position information based on the shifted window. Thus, we need different tools than the one proposed in this paper to study the relative position encoding of RPEs.
We also acknowledge that our study is primarily focused on English language data from BLiMP and GLUE. It is likely that the same results would hold in a multi-lingual model; however, since many languages have more flexible word order than English, this should be investigated in follow-up work.
Ethical Consideration
Our work aims at understanding the difference in sentence representation caused by shifting position information. In practice, this could yield unintended results from a TLM deployed in production. Since we observe a large variation in results, we advise caution when deploying TLMs in sensitive real-world applications, as the relative positioning of a given sentence might evoke different responses from the model. We hope our work can motivate the use of better positional encoding schemes when pre-training TLMs in the future.
Acknowledgements
We would like to thank Kanishka Misra, Shagun Sodhani, Stephen Roller and Kushal Arora for their feedback on initial versions of this draft. We are also grateful for the anonymous reviewers' feedback. Siva Reddy acknowledges the support of the Facebook CIFAR AI Chair program.