
BLINK (Ledell et al., 2020) to fuse tables and text, which is an effective entity linker capable of linking against all Wikipedia entities and their corresponding passages. Given a flat table segment, BLINK returns the $l$ relevant passages linked to the entities in the table. However, as the table size and the number of passages grow, the input may become too long for BERT-based encoders (Devlin et al., 2019). Thus, to limit the number of input tokens, we split a table into several segments such that each segment contains only a single row. This setup can be seen as a trade-off to respect the input length limit, but our approach scales to full tables when the input capacity permits. More details about table-text blocks can be found in Appendix A.1.
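To make the construction concrete, the following sketch splits a flat table into single-row segments and pairs each segment with the passages linked to its entities. Here `link_entities` is a hypothetical placeholder for the BLINK linker (not its actual API), and the header-cell linearization is an illustrative assumption.

```python
from typing import Dict, List


def link_entities(cell_text: str) -> List[str]:
    """Hypothetical stand-in for the BLINK entity linker; in practice this
    would return the Wikipedia passages linked to entities in `cell_text`."""
    return []


def build_table_text_blocks(header: List[str], rows: List[List[str]]) -> List[Dict]:
    """Split a flat table into single-row segments and attach linked passages,
    keeping each table-text block within a BERT-style input budget."""
    blocks = []
    for row in rows:
        # One segment per row: the header paired with a single data row
        # (an assumed linearization, not necessarily the paper's exact format).
        segment = " ; ".join(f"{h} : {c}" for h, c in zip(header, row))
        # Passages linked to the entities appearing in this row's cells.
        passages = [p for cell in row for p in link_entities(cell)]
        blocks.append({"table_segment": segment, "passages": passages})
    return blocks
```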
3 Methodology
We present OTTER, an OpenQA Table-TExt Retriever. We first introduce the basic dual-encoder
architecture for dense retrieval (§ 3.1). We then
describe three mechanisms to mitigate the table-
text discrepancy and data sparsity problems, i.e.,
modality-enhanced representation (§ 3.2), mixed-
modality hard negative sampling (§ 3.3), and
mixed-modality synthetic pre-training (§ 3.4).
3.1 The Dual-Encoder Architecture
The prevailing choice for dense retrieval is the dual-encoder method. In this framework, a question $q$ and a table-text block $b$ are separately encoded into two $d$-dimensional vectors by a neural encoder $E(\cdot)$. Then, the relevance between $q$ and $b$ is measured by the dot product of these two vectors:
$$ s(q, b) = q^{\top} \cdot b = E(q)^{\top} \cdot E(b). \tag{1} $$
The benefit of this method is that all the table-text blocks can be pre-encoded into vectors to support indexed search at inference time. In this work, we initialize the encoder with pre-trained RoBERTa (Liu et al., 2019) and take the representation of the first [CLS] token as the encoded vector. Once an incoming question is encoded, approximate nearest neighbor search can be leveraged for efficient retrieval (Johnson et al., 2021).
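For concreteness, below is a minimal sketch of the dual-encoder scoring in Eq. (1) using Hugging Face transformers; the roberta-base checkpoint, the toy strings, and the use of two separately instantiated encoders are illustrative assumptions rather than the exact training setup.

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
question_encoder = RobertaModel.from_pretrained("roberta-base")
block_encoder = RobertaModel.from_pretrained("roberta-base")


def encode(encoder: RobertaModel, text: str) -> torch.Tensor:
    """Return the representation of the first token (<s>, RoBERTa's
    counterpart of [CLS]) as the d-dimensional vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]                  # shape: (1, d)


# Toy question and linearized table-text block.
q = encode(question_encoder, "Which team won the 2018 final?")
b = encode(block_encoder, "2018 final ; winner : France. France beat Croatia 4-2.")
score = (q * b).sum(dim=-1)                                  # s(q, b) = E(q)^T · E(b)
print(score.item())
```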
Training
The training objective is to learn representations that maximize the relevance between the question and its gold table-text block. We follow Karpukhin et al. (2020) to learn the representations. Formally, given a training set of $N$ instances, where the $i$-th instance $(q_i, b_i^+, b_{i,1}^-, \dots, b_{i,m}^-)$ consists of a positive block $b_i^+$ and $m$ negative blocks $\{b_{i,j}^-\}_{j=1}^{m}$, we minimize the cross-entropy loss:
$$ \mathcal{L}(q_i, b_i^+, \{b_{i,j}^-\}_{j=1}^{m}) = -\log \frac{e^{s(q_i, b_i^+)}}{e^{s(q_i, b_i^+)} + \sum_{j=1}^{m} e^{s(q_i, b_{i,j}^-)}}. $$
The negatives consist of one hard negative and $m-1$ in-batch negatives drawn from the other instances in the same mini-batch.
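A minimal PyTorch sketch of this objective, assuming pre-computed question and block vectors; the batch size, dimensionality, and random inputs are illustrative.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(q_vecs: torch.Tensor,
                     pos_vecs: torch.Tensor,
                     hard_neg_vecs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy loss with in-batch negatives.

    q_vecs:        (B, d) question vectors
    pos_vecs:      (B, d) gold block vectors
    hard_neg_vecs: (B, d) one hard-negative block vector per question
    For question i, the blocks of the other B-1 instances in the batch
    serve as its in-batch negatives.
    """
    blocks = torch.cat([pos_vecs, hard_neg_vecs], dim=0)   # (2B, d)
    scores = q_vecs @ blocks.t()                            # (B, 2B) dot-product relevance
    targets = torch.arange(q_vecs.size(0))                  # gold block of question i is column i
    return F.cross_entropy(scores, targets)


# Toy usage with random vectors.
B, d = 4, 768
print(contrastive_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, d)).item())
```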
3.2 Modality-enhanced Representation
Most dense retrievers use a coarse-grained single-modal representation, taken either from the representation of the [CLS] token or from the averaged token representations (Zhan et al., 2020; Huang et al., 2021), which is insufficient to represent cross-modal information. To remedy this, we propose to learn a modality-enhanced representation (MER) for table-text blocks.
As illustrated in Figure 3, instead of using only the coarse representation $h_{[CLS]}$ at the [CLS] token, MER incorporates tabular and textual representations ($h_{table}$ and $h_{text}$) to enhance the semantics of the table and the text. Thus, the modality-enhanced representation is $b = [h_{[CLS]}; h_{table}; h_{text}]$, where $;$ denotes concatenation.
Given the tokens in a tabular/textual modality, we calculate a representation in the following ways: (1) FIRST: the representation of the beginning token (i.e., [TAB] and [PSG]); (2) AVG: averaged token representations; (3) MAX: max pooling over token representations; (4) SelfAtt: a weighted average over token representations where the weights are computed by a self-attention layer. We discuss the impact of different types of MERs in § 5.3. Our best model adopts FIRST as the final setting. To match the dimensionality of the enriched block representation, we represent the question by replicating its encoded representation.
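The sketch below illustrates the FIRST variant under a few assumptions: the block input is laid out as [CLS] [TAB] table tokens [PSG] passage tokens, the [TAB]/[PSG] markers are added as special tokens, a single shared roberta-base encoder stands in for the separate question and block encoders, and the question vector is simply tiled three times to match the enriched dimensionality.

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# [TAB] and [PSG] mark the beginning of the tabular and textual modalities.
tokenizer.add_special_tokens({"additional_special_tokens": ["[TAB]", "[PSG]"]})
encoder = RobertaModel.from_pretrained("roberta-base")
encoder.resize_token_embeddings(len(tokenizer))

TAB_ID = tokenizer.convert_tokens_to_ids("[TAB]")
PSG_ID = tokenizer.convert_tokens_to_ids("[PSG]")


def encode_block(table_segment: str, passages: str) -> torch.Tensor:
    """FIRST-style MER: b = [h_[CLS]; h_table; h_text], where h_table and
    h_text are the representations of the [TAB] and [PSG] tokens."""
    text = f"[TAB] {table_segment} [PSG] {passages}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state[0]      # (seq_len, d)
    ids = inputs["input_ids"][0]
    h_cls = hidden[0]                                         # <s> plays the role of [CLS]
    h_table = hidden[(ids == TAB_ID).nonzero()[0, 0]]
    h_text = hidden[(ids == PSG_ID).nonzero()[0, 0]]
    return torch.cat([h_cls, h_table, h_text], dim=-1)        # (3d,)


def encode_question(question: str) -> torch.Tensor:
    """Replicate the question vector to match the 3d block dimensionality."""
    inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        q = encoder(**inputs).last_hidden_state[0, 0]          # (d,)
    return q.repeat(3)                                          # (3d,)
```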
3.3 Mixed-modality Hard Negative Sampling
Prior studies (Nogueira and Cho, 2019; Gillick et al., 2019) have found that hard negative sampling is essential in training a dense retriever. These methods treat each piece of evidence as a whole and retrieve the most similar irrelevant one as the hard negative. Instead of finding an entirely irrelevant block, we propose a mixed-modality hard negative sampling mechanism, which constructs more challenging hard negatives by substituting only partial information in the table or text.
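As a rough illustration of the idea (not necessarily the paper's exact construction), the sketch below forms a hard negative from a positive block and a similar-but-irrelevant block by substituting only one modality; the block fields follow the hypothetical dictionary format used in the earlier sketches.

```python
import random
from typing import Dict


def mixed_modality_hard_negative(positive: Dict[str, str],
                                 similar_block: Dict[str, str]) -> Dict[str, str]:
    """Corrupt only one modality of the positive block: keep its table segment
    and swap in foreign passages, or keep its passages and swap in a foreign
    table segment, yielding a negative that stays close to the positive."""
    negative = dict(positive)
    if random.random() < 0.5:
        negative["table_segment"] = similar_block["table_segment"]
    else:
        negative["passages"] = similar_block["passages"]
    return negative
```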