
categories of virtual tokens similarly. In § 4.3 we
will show how inducer-tuning addresses the two
parts through different residual forms.
We suggest that the mechanism of prefix-tuning can be further understood through the concept of inducing variables in the kernel learning literature (Titsias, 2009). Many computational methods in kernel learning utilize a small set of support points (inducing variables) to improve inference performance (Musco and Musco, 2017; Chen and Yang, 2021). Snelson and Ghahramani (2005) specifically treat the inducing variables as auxiliary pseudo-inputs and infer them through continuous optimization, which is similar to prefix-tuning. We emphasize that, at first sight, the defining trait of inducing-point methods is representing a vast number of training examples through a small number of points so as to reduce the computational cost; here, however, we instead aim to leverage the mechanism of inducing variables to better steer the estimation: our goal is to strengthen prefix-tuning by making the prefixes better modulate the attention output. We introduce and analyze this mechanism as follows.
Mechanism for well-steering inference outputs in inducing-point methods. Conceptually, inducing variables help the inference because they can represent the distribution of the query inputs and steer the kernel method without changing the kernel in use. In particular, we consider the distribution pattern of unconstrained inducing points $X_M$ (Snelson and Ghahramani, 2005, Figure 1). We observe that most of them lie close to the testing examples $X_*$, and in the new estimation (Snelson and Ghahramani, 2005, Equation (8)) the inducers $X_M$ receive large weights through the weight-assignment mechanism of kernel methods: recall that kernel methods can assign sample weights in the manner of attention (Choromanski et al., 2020; Chen et al., 2021; Tsai et al., 2019), so inducing variables close to the query automatically receive more attention and thus effectively modulate the output.
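To make this weighting mechanism concrete, the following minimal NumPy sketch (toy data under an RBF kernel; the kernel choice, lengthscale, and dimensions are arbitrary assumptions for illustration, not the setup of Snelson and Ghahramani, 2005) shows that inducing points placed close to the test inputs receive weights near one in the term $K(X_*, X_M)$ entering the estimate, whereas points far away receive weights near zero and hence can barely steer the output.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 * lengthscale^2))
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

rng = np.random.default_rng(0)
X_star = rng.normal(size=(5, 2))                    # test (query) inputs X_*
X_M_near = X_star + 0.1 * rng.normal(size=(5, 2))   # inducing points close to X_*
X_M_far = X_star + 5.0 * rng.normal(size=(5, 2))    # inducing points far from X_*

for name, X_M in [("near", X_M_near), ("far", X_M_far)]:
    K = rbf_kernel(X_star, X_M)                     # weights K(X_*, X_M) in the estimate
    print(name, "largest weight per test input:", K.max(axis=1).round(3))
# Inducers close to the test inputs receive weights near 1 and dominate the
# weighted combination; far inducers receive weights near 0 and cannot steer it.
```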
From this mechanism, we draw an inductive bias, "the prefix should be close to the query" (which is not enforced in prefix-tuning), and accordingly propose inducer-tuning. We remark that since we do not pursue the original goal of inducing variables, namely reducing the computational cost, it is natural that the concrete design in the next subsection differs from the usual form of inducing points, i.e., a small number of samples.
We speculate that prefix-tuning partially benefits from the above mechanism as well; some indirect evidence follows. As discussed in previous studies, to realize the full potential of prompting, manually designed prompts are expected to be related to the topic of the input sequence (Brown et al., 2020), i.e., close to the query; even soft prompts are recommended to be initialized with tokens relevant to the specific task (Li and Liang, 2021), which also requires the prompts to be close to the query to provide effective adaptation. With this belief, we propose inducer-tuning to further exploit the mechanism of inducing variables and improve upon prefix-tuning.
4.3 Method
Inducer-tuning follows the same design principle as prefix-tuning, modulating the attention output by inserting virtual tokens (vectors). However, unlike prefix-tuning, our virtual tokens are not shared among the input sequences. Inducer-tuning also incorporates the benefits of residual forms, which ease the initialization and remove the reparametrization trick in prefix-tuning. Specifically, we suggest the following modifications: (1) the "inducers" are adaptive to and customized for each input token to strengthen the expressiveness of the new attention output; (2) we model the virtual vectors in a residual form as an adapter, which makes the final attention output take a residual form as well (a minimal sketch follows this paragraph). We now discuss the intuitions behind these modifications in detail.
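Before doing so, we give a minimal NumPy sketch of the residual form referred to in (2). The bottleneck parameterization, dimensions, and initialization scales below are our own assumptions for illustration and not the exact design of inducer-tuning; the sketch only shows that once each token attends over the real keys plus its own query-anchored inducer, the new attention output is exactly a residual interpolation between the original attention output and the inducer value.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d, r = 10, 64, 8                                # tokens, head dim, bottleneck dim (assumed)
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

# Per-token inducers in residual (adapter-like) form: the virtual key starts from the
# token's own query plus a small bottleneck correction, so it stays close to Q_i and
# no reparametrization trick is needed at initialization.
W_down = 0.02 * rng.normal(size=(d, r))
W_up = 0.02 * rng.normal(size=(r, d))
K_ind = Q + np.tanh(Q @ W_down) @ W_up             # virtual key for each token
V_ind = np.tanh(Q @ W_down) @ W_up                 # virtual value (near zero at init)

scale = 1.0 / np.sqrt(d)
scores = Q @ K.T * scale                           # scores on real keys, shape (T, T)
score_ind = (Q * K_ind).sum(-1, keepdims=True) * scale   # each token's inducer score, (T, 1)

w = softmax(np.concatenate([scores, score_ind], axis=1))  # attend over real keys + inducer
g = w[:, -1:]                                      # attention weight given to the inducer
out_new = w[:, :-1] @ V + g * V_ind                # modulated attention output
out_orig = softmax(scores) @ V                     # original attention output

# The modulated output is exactly a residual interpolation around the original one.
assert np.allclose(out_new, (1.0 - g) * out_orig + g * V_ind)
```

The assertion holds because attending over the real keys plus one extra key-value pair always factors into the original attention output plus a weighted pull toward the extra value; with per-token inducers, this residual term is customized for each query.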
Adaptive inducers. There is an important difference between language models and kernel methods that makes fixed prefixes less effective than inducing variables in kernel methods. In language models, the distribution of the input queries keeps changing, and for some inputs the fixed prefixes fail to qualify as "inducing variables". Even worse, for a long input there likely exist query vectors that are far (in $\ell_2$ distance) from all the virtual vectors in the fixed prefixes, and these prefixes are thus unable to modulate the attention output well. Consistent with this, we observe in our experiments (§ 6) that prefix-tuning performs relatively worse on tasks with longer inputs.
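The $\ell_2$-distance issue can be illustrated with a small NumPy check (toy data; the head dimension, prefix length, and noise scales are arbitrary assumptions): a query close to one of the fixed prefix keys places almost all of its attention mass on the prefix, whereas a query far from every prefix key, yet close to some real key as easily happens in a long input, places almost none there and hence is barely modulated by the prefix.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

d, m, n = 64, 5, 20                          # head dim, prefix length, input length
rng = np.random.default_rng(0)
P_k = rng.normal(size=(m, d))                # fixed prefix keys (shared across inputs)
K = rng.normal(size=(n, d))                  # keys of the actual input tokens

q_near = P_k[0] + 0.1 * rng.normal(size=d)   # query l2-close to one prefix key
q_far = K[3] + 0.1 * rng.normal(size=d)      # query l2-far from all prefix keys

for name, q in [("near", q_near), ("far", q_far)]:
    scores = np.concatenate([P_k, K]) @ q / np.sqrt(d)
    w = softmax(scores)
    print(name, "attention mass on the prefix:", w[:m].sum().round(4))
# The near query concentrates its mass on the prefix; the far query gives the
# prefix almost no mass, so the fixed prefix cannot modulate its output.
```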
To alleviate the above issue, we propose adaptive modeling of the virtual key vectors. For a query $Q_i$, we suggest taking a vector close to $Q_i$ itself as the corresponding virtual key vector (the length of