98.7% of the teacher model performance and reduces its memory by 2 times. Second, our Hybrid-LITE saves more than 13× memory compared to Hybrid-DPR while maintaining more than 98.0% of its performance, and Hybrid-DrBoost reduces the indexing memory (8×) compared to Hybrid-DPR while maintaining at least 98.5% of the performance. This shows that a light hybrid model can achieve sufficient performance while significantly reducing the indexing memory, which suggests the practical value of light retrievers for memory-limited applications, such as on-device retrieval.
One important reason for using hybrid retrievers in real-world applications is generalization. Thus, we further study whether reducing the indexing memory hampers the generalization of light hybrid retrievers. Two prominent ideas have emerged to test generalization: out-of-domain (OOD) generalization and adversarial robustness (Gokhale et al., 2022). We study the OOD generalization of retrievers on EntityQuestion (Sciavolino et al., 2021). To study robustness, we leverage six techniques (Morris et al., 2020) to create adversarial attack test sets based on the NQ dataset (see the sketch below). Our experiments demonstrate that Hybrid-LITE and Hybrid-DrBoost achieve better generalization performance than their individual components. The study of robustness shows that hybrid retrievers are consistently better than sparse and dense retrievers. Nevertheless, all retrievers are vulnerable, which suggests room for improving the robustness of retrievers; our datasets can aid future research.
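To illustrate how such attack sets can be constructed, below is a minimal sketch using the TextAttack library (Morris et al., 2020); the two augmenters shown are our own illustrative choices and are not necessarily among the six techniques used in our study.

```python
# pip install textattack
from textattack.augmentation import CharSwapAugmenter, WordNetAugmenter

# Illustrative perturbations: character swaps and WordNet synonym substitution.
# (Our own example choices; not necessarily the paper's six attack types.)
augmenters = [CharSwapAugmenter(), WordNetAugmenter()]

query = "who wrote the declaration of independence"
for augmenter in augmenters:
    # augment() returns a list of perturbed variants of the input query.
    print(type(augmenter).__name__, augmenter.augment(query))
```

Applying such perturbations to every question in a test collection, while keeping the gold answers fixed, yields an adversarial variant of the original evaluation set.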
2 Related Work
Hybrid Retriever integrates a sparse and a dense retriever and ranks documents by interpolating the relevance scores from each retriever. The most popular way to obtain the hybrid ranking is to apply a linear combination of the sparse/dense retriever scores (Karpukhin et al., 2020; Ma et al., 2020; Luan et al., 2021; Ma et al., 2021a; Luo et al., 2022). Instead of using the scores, Chen et al. (2022) adopt Reciprocal Rank Fusion (Cormack et al., 2009), which obtains the final ranking from the rank positions of each candidate retrieved by the individual retrievers. Arabzadeh et al. (2021) train a classification model to select one of three retrieval strategies: sparse, dense, or hybrid. Most hybrid models rely on heavy dense retrievers; one exception is Ma et al. (2021a), who use linear projection, PCA, and product quantization (Jegou et al., 2010) to compress the dense retriever component. Our hybrid retrievers use either DrBoost or our proposed LITE as the dense retriever, both of which are more memory-efficient and achieve better performance than the methods used by Ma et al. (2021a).
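To make the two fusion strategies concrete, the following is a minimal sketch (with toy scores and our own function names, not the implementation of any cited work) of linear score interpolation and Reciprocal Rank Fusion.

```python
def linear_fusion(sparse_scores, dense_scores, alpha=0.5):
    """Hybrid score: alpha * sparse + (1 - alpha) * dense, per document.

    BM25 and dense-encoder scores live on different scales, so in
    practice they are often min-max normalized before interpolation.
    """
    docs = set(sparse_scores) | set(dense_scores)
    return {
        d: alpha * sparse_scores.get(d, 0.0)
           + (1 - alpha) * dense_scores.get(d, 0.0)
        for d in docs
    }

def reciprocal_rank_fusion(rankings, k=60):
    """RRF (Cormack et al., 2009): score(d) = sum_r 1 / (k + rank_r(d))."""
    scores = {}
    for ranking in rankings:  # each ranking is a list of doc ids, best first
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example with two retrievers and three documents.
sparse = {"d1": 12.3, "d2": 8.1}  # e.g., BM25 scores
dense = {"d1": 0.52, "d3": 0.71}  # e.g., inner-product scores
print(linear_fusion(sparse, dense, alpha=0.3))
print(reciprocal_rank_fusion([["d1", "d2"], ["d3", "d1"]]))
```

RRF only needs rank positions, which sidesteps the score-scale mismatch that linear interpolation must handle via normalization or a tuned interpolation weight.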
Indexing-Efficient Dense Retriever. Efficiency includes two dimensions: latency (Seo et al., 2019; Lee et al., 2021; Varshney et al., 2022) and memory. In this work, our primary focus is on memory, specifically the memory used for indexing. Most existing DRs are indexing-heavy (Karpukhin et al., 2020; Khattab and Zaharia, 2020; Luo, 2022). To improve indexing efficiency, there are mainly three types of techniques. The first is vector product quantization (Jegou et al., 2010). The second is to compress a high-dimensional dense vector into a low-dimensional one, e.g., from 768 to 32 dimensions (Lewis et al., 2021; Ma et al., 2021a). The third is to use a binary vector (Yamada et al., 2021; Zhan et al., 2021). Our proposed method LITE (§3.2) reduces the indexing memory by jointly training the retrieval task and knowledge distillation from a teacher model.
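As a back-of-the-envelope illustration of why these techniques matter, the sketch below computes flat-index sizes for a corpus of roughly 21M passages (approximately the size of the Wikipedia split used by DPR; the round number is an assumption for illustration).

```python
# Flat-index footprint for ~21M passages (roughly the DPR Wikipedia corpus;
# the round number below is an assumption for illustration).
NUM_PASSAGES = 21_000_000

def index_gb(num_vectors: int, dim: int, bytes_per_component: float) -> float:
    """Uncompressed flat-index size in gigabytes."""
    return num_vectors * dim * bytes_per_component / 1e9

print(f"768-d float32: {index_gb(NUM_PASSAGES, 768, 4):.1f} GB")      # ~64.5 GB
print(f" 32-d float32: {index_gb(NUM_PASSAGES, 32, 4):.1f} GB")       # ~2.7 GB
print(f"768-d binary : {index_gb(NUM_PASSAGES, 768, 1 / 8):.1f} GB")  # ~2.0 GB
```

Dimension reduction and binarization thus each shrink the index by more than an order of magnitude before any further compression such as product quantization is applied.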
Generalization of IR. Two main benchmarks have been proposed to study the OOD generalization of retrievers: BEIR (Thakur et al., 2021b) and EntityQuestion (Sciavolino et al., 2021). As shown by previous work (Thakur et al., 2021b; Chen et al., 2022), generalization is a major concern for DR. To address this limitation, Wang et al. (2021) proposed GPL, a domain adaptation technique that generates synthetic question-answer pairs in specific domains. A follow-up work (Thakur et al., 2022) trains BPR and JPQ on the GPL synthetic data to achieve both efficiency and generalization. Chen et al. (2022) investigate a hybrid model in the OOD setting, but unlike us, they use a heavy DR and do not consider the indexing memory. Most existing work studies OOD generalization, and much less attention has been paid to the robustness of retrievers (Penha et al., 2022; Zhuang and Zuccon, 2022; Chen et al.). To study robustness, Penha et al. (2022) identify four ways to change the syntax of queries but not their semantics. Our work is complementary to Penha et al. (2022): we leverage adversarial attack techniques (Morris et al., 2020) to create six different test sets for the NQ dataset (Kwiatkowski et al., 2019).