
position embeddings, we first need to ascertain the
robustness of relative position understanding for a
given input. TLMs are typically trained on batches containing multiple sentences, with a limited sequence window size that is usually much longer than an average sentence. We hypothesize that a systematic model should encode the same sentence equally throughout this context window. However, evaluating the encoding of a sentence starting from an arbitrary position in this window in isolation is hard, as the representation of the sentence would depend on the prior context (Misra et al., 2020; Kassner and Schütze, 2020).
In this work, we subject models of several different architectures and sizes to phase shifting. In this paradigm, the sentences exposed to the model are provided contiguous position identifiers starting from a non-zero position (Figure 1). Such inspection allows us to gauge the model’s sentence encodings at different positions, emulating sub-window sentence representation while factoring out the influence of prior context. We investigate several zero-shot, few-shot, and full-shot tasks by shifting the start positions of the sentences. We observe the following:
• TLMs display different sub-window sentence representation capabilities, resulting in decreased zero-shot task performance and variability in sentence perplexities.
• Autoregressive models, including the recently published OPT (Zhang et al., 2022), show erratic zero- and few-shot performance on sub-window representations, highlighting the brittleness of in-context learning evaluation.
• Masked Language Models (MLMs) encode sentences in non-standard positions better than their autoregressive counterparts.
• During fine-tuning, models suffer drastically on cross phase-shifted evaluation, suggesting position-specific overfitting.
We aim to raise awareness about issues with APEs, which are still widely used in pre-training large language models. Our results highlight the severity of position shortcuts taken by the model during pre-training and fine-tuning, and imply that TLMs may have far more variable sub-window sentence representation capabilities than previously assumed. We will release the code and analysis used in this work on GitHub.¹
2 Approach
Position encodings used by TLMs come in three broad categories: fixed sinusoidal embeddings, as proposed by Vaswani et al. (2017); absolute or learned embeddings, popularized by the BERT (Devlin et al., 2019) family of masked language models; and relative positions (Shaw et al., 2018), as used by T5 (Raffel et al., 2020). Wang et al. (2021) present a comprehensive overview of current encoding strategies.
Despite being an older method, absolute positional embeddings (APEs) are reportedly better than their relative counterparts on several tasks (Ravishankar et al., 2021), and are still used by the majority of large pre-trained TLMs, including the recently released OPT (Zhang et al., 2022). APEs compute a token’s representation by adding the input token embedding to the position embedding for the corresponding position:
$$x_i = \theta_W[w_i] + \theta_P[i],$$
where $\theta_W \in \mathbb{R}^{|V| \times d}$ is the token embedding matrix for a vocabulary of size $|V|$ and embedding dimension $d$, and $\theta_P \in \mathbb{R}^{|T| \times d}$ is the absolute position embedding matrix, with $T$ the maximum context window size of the model. Now, a sentence $S = [w_1, w_2, \ldots, w_n]$ containing $n$ tokens is mapped during inference to positions $1, 2, \ldots, n$ contiguously for all models.
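To make this concrete, the following is a minimal PyTorch sketch of the APE lookup described above; the embedding sizes and variable names (token_emb, pos_emb) are illustrative assumptions rather than the internals of any particular model.

```python
import torch
import torch.nn as nn

# Illustrative sizes; assumptions, not tied to any particular pre-trained model.
vocab_size, context_window, d_model = 50_000, 1024, 768

token_emb = nn.Embedding(vocab_size, d_model)    # theta_W in R^{|V| x d}
pos_emb = nn.Embedding(context_window, d_model)  # theta_P in R^{|T| x d}

def embed(input_ids: torch.Tensor) -> torch.Tensor:
    """Return x_i = theta_W[w_i] + theta_P[i] for every token in the batch."""
    seq_len = input_ids.size(-1)
    positions = torch.arange(seq_len, device=input_ids.device)  # 0, 1, ..., n-1
    return token_emb(input_ids) + pos_emb(positions)

# A sentence of n tokens is always placed on the first n positions at inference.
sentence = torch.randint(0, vocab_size, (1, 12))  # batch of one, 12 tokens
x = embed(sentence)                               # shape: (1, 12, 768)
```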
TLMs offer various context window sizes, where the context window is the maximum sequence length, in tokens, that the model can train and infer on. Since this context window is usually larger than the average sentence length, multiple sentences can be packed together to “fill” the context window during pre-training. This allows TLMs to learn that sentences can start from various positions in their context window. If models trained with APEs do encode relativity of position, then the sentence representations should be roughly equal throughout the context window, regardless of their starting position.
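For illustration, the sketch below packs tokenized sentences into a fixed context window as described above; the greedy packing routine and the window size are simplifying assumptions, not the exact recipe of any specific pre-training pipeline.

```python
from typing import List

def pack_sentences(tokenized: List[List[int]], window: int = 1024) -> List[List[int]]:
    """Greedily concatenate tokenized sentences into sequences of at most `window` tokens."""
    sequences, current = [], []
    for sent in tokenized:
        if current and len(current) + len(sent) > window:
            sequences.append(current)
            current = []
        current.extend(sent)
    if current:
        sequences.append(current)
    return sequences

# Within a packed sequence, the second sentence starts at position
# len(first_sentence), so pre-training exposes the model to sentences
# that begin at many different offsets inside the context window.
packed = pack_sentences([[7, 8, 9], [4, 5], [1, 2, 3, 6]], window=8)
```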
2.1 Phase Shift Methodology
To understand the relativity of APEs, we examine model performance under phase shift conditions. Phase shift² involves right-shifting the absolute positions of all tokens in the sentence by an equal distance $k$, such that the tokens are now mapped to positions $1+k, 2+k, \ldots, n+k$ rather than $1, 2, \ldots, n$.
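As a sketch of how such a shift can be applied in practice, the snippet below passes explicitly shifted position_ids to a HuggingFace GPT-2 model, whose forward pass accepts a position_ids argument; the model choice and the shift k = 300 are illustrative assumptions rather than the exact evaluation setup of this work.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

k = 300  # illustrative phase shift; k + n must stay within the 1024-position window

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
n = inputs["input_ids"].size(-1)

# Default positions 0 ... n-1 versus phase-shifted positions k ... k+n-1.
default_positions = torch.arange(n).unsqueeze(0)
shifted_positions = default_positions + k

with torch.no_grad():
    out_default = model(**inputs, position_ids=default_positions)
    out_shifted = model(**inputs, position_ids=shifted_positions)

# If APEs behaved purely relatively, the two outputs (and hence the sentence's
# perplexity) would be essentially identical; any gap reflects position sensitivity.
print((out_default.logits - out_shifted.logits).abs().max())
```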
¹https://github.com/kazemnejad/lm_pos_investigations
²More related to our work, Kiyono et al. (2021) train a Transformer model from scratch using shifted positional embeddings for machine translation, and observe improved performance in extrapolation and interpolation setups.