[Figure 1: two panels — (a) Classical learning setups (Parametric, Nonparametric); (b) Modern retrieval-based setup.]
Figure 1: An illustration of a retrieval-based classification model. Given an input instance $x$, similar to an instance-based model, it retrieves similar (labeled) examples $R_x = \{(x'_j, y'_j)\}_j$ from the training data. Subsequently, it processes (potentially via a nonparametric method) the input instance along with the retrieved examples to make the final prediction $\hat{y} = f(x, R_x)$.
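As a concrete illustration of the setup in Figure 1b, the following is a minimal sketch (our own construction, not a model from this paper; the names `retrieve` and `predict` are hypothetical) in which the retriever returns the $k$ nearest labeled examples under Euclidean distance and $f$ is a plain majority vote over the retrieved set:

```python
import numpy as np

def retrieve(x, X_train, y_train, k=3):
    """Return R_x: the k labeled training examples nearest to x (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dists)[:k]
    return X_train[idx], y_train[idx]

def predict(x, X_train, y_train, k=3):
    """y_hat = f(x, R_x); here f is simply a majority vote over R_x."""
    _, y_nbrs = retrieve(x, X_train, y_train, k)
    values, counts = np.unique(y_nbrs, return_counts=True)
    return values[np.argmax(counts)]

# Toy data: two well-separated clusters with labels 0 and 1.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
print(predict(np.array([0.05, 0.05]), X, y))  # → 0
print(predict(np.array([0.95, 0.95]), X, y))  # → 1
```

With a majority-vote $f$ this reduces to the classical $k$-nearest-neighbor rule; the retrieval-based models discussed below replace $f$ with a learned (e.g., parametric seq2seq) component that consumes both $x$ and $R_x$.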
its prediction based on the nearest neighbors w.r.t. the input. However, these models often
suffer from weaker empirical performance than deep parametric models.
Increasingly, a middle ground combining the two paradigms and retaining the best of both
worlds is becoming popular across various domains, ranging from natural language [Das
et al., 2021, Wang et al., 2022, Liu et al., 2022, Izacard et al., 2022], to vision [Liu et al.,
2015, 2019, Iscen et al., 2022, Long et al., 2022], to reinforcement learning [Blundell et al.,
2016, Pritzel et al., 2017, Ritter et al., 2020], to even protein structure prediction [Cramer,
2021]. In such approaches, given a test input, one first retrieves relevant entries from a
data index and then processes the retrieved entries along with the test input to make the
final predictions using a machine learning model. This process is visualized in Figure 1b.
For example, in semantic parsing, models that augment a parametric seq2seq model with
similar examples not only outperform much larger models but are also more robust
to changes in data [Das et al., 2021].
While classical learning setups (cf. Figure 1a) have been studied extensively over decades,
even basic properties and trade-offs pertaining to retrieval-based models (cf. Figure 1b),
despite their aforementioned remarkable successes, remain highly under-explored. Most of
the existing efforts on retrieval-based machine learning models solely focus on developing end-
to-end domain-specific models, without identifying the key dataset properties or structures
that are critical in realizing performance gains by such models. Furthermore, at first glance,
due to the strong dependence between an input and its associated retrieved set, direct
application of existing statistical learning techniques does not appear straightforward.
This prompts the natural question: What should be the right theoretical framework that can
help rigorously showcase the value of the retrieved set in ensuring superior performance of
modern retrieval-based models?
In this paper, we take the first step towards answering this question, while focusing on
the classification setting (Sec. 2.1). We begin with the hypothesis that the model might
be using the retrieved set to implicitly perform local learning, adapting its predictions to
the neighborhood of the test point. This idea is inspired by Bottou and Vapnik [1992].
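This local-learning hypothesis can be sketched in code; the example below is our own illustration (the name `local_predict` and the toy task are hypothetical, not from this paper or from Bottou and Vapnik [1992]). For each test point we fit a simple (linear) model only on its $k$ nearest labeled neighbors, on a 1-D task that no single global linear classifier can solve but that is linearly separable within every small neighborhood:

```python
import numpy as np

def local_predict(x, X_train, y_train, k=5):
    """Local-learning sketch: fit a simple linear model on the k nearest
    neighbors of x (via least squares), then threshold its value at x."""
    dists = np.abs(X_train - x)
    idx = np.argsort(dists)[:k]
    # Design matrix with a bias column; least-squares fit on the local set.
    A = np.stack([X_train[idx], np.ones(k)], axis=1)
    w, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)
    return int(w[0] * x + w[1] > 0.5)

# Globally non-linearly-separable 1-D task: label is 1 iff x lies in [1, 2].
X = np.linspace(0, 3, 61)
y = ((X >= 1) & (X <= 2)).astype(float)
preds = np.array([local_predict(x, X, y) for x in X])
print(np.mean(preds == y))  # → 1.0
```

A global linear threshold on $x$ misclassifies one of the two decision boundaries, whereas the locally fitted linear model recovers both, because within each neighborhood the labels are explained by a single threshold.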
Such local learning is potentially beneficial when the underlying task has local structure:
a much simpler function class suffices to explain the data in a given local neighborhood,
even though the data can be complex overall (formally defined in Sec. 2.2). For instance