grounded understanding for dialog state tracking (DST). In addition to structured knowledge such as the ontology of slot type-value pairs, we also consider unstructured knowledge from the raw training data. We train a TOD model to query relevant knowledge for each turn in the context, and to leverage the retrieved knowledge to predict the dialog state. We evaluate our method on MultiWOZ (Budzianowski et al., 2018) in both the full-data and few-shot settings, and show superior performance compared to previous methods.
2 Related Work
2.1 Knowledge grounding
To relax the requirement of encoding knowledge of the whole world into model parameters, one direction is to disentangle knowledge representation from LMs. Most of these methods are applied to knowledge-intensive text generation tasks such as open-domain question answering (Lee et al., 2019; Karpukhin et al., 2020; Guu et al., 2020; Lewis et al., 2020; Borgeaud et al., 2021) and response generation with factual information (Dinan et al., 2019; Komeili et al., 2022; Thoppilan et al., 2022; Kim et al., 2020; Thulke et al., 2021; Chen et al., 2022). Similarly, some work retrieves information to serve as a reference that refines the model generation process (Weston et al., 2018; Gonzalez et al., 2019; Khandelwal et al., 2021; Zhang et al., 2021). Different from these approaches, our method focuses on learning and utilizing available domain-relevant knowledge for language understanding tasks. Moreover, we propose to leverage knowledge in various formats.
2.2 Knowledge-guided dialog understanding
Encoding the domain schema into model parameters (Hosseini-Asl et al., 2020; Madotto et al., 2020) may not be efficient for unseen domains and tasks where the ontology can differ. One line of research (Ren et al., 2018; Wu et al., 2019; Zhou and Small, 2019; Rastogi et al., 2020; Du et al., 2021; Lee et al., 2021) leverages question-answering techniques to predict values for each slot, or prepends all slot-value information to the context (Zhao et al., 2022). However, the latter approach is not scalable when the number of slot-value pairs is large, especially in multi-domain TOD systems. In addition, probably due to blurry attention over long contexts (Fan et al., 2021), Lee et al. (2021) find that adding potential slot values does not improve model performance. In contrast, retrieving only the relevant schema effectively addresses the scalability problem by specifying the knowledge with a fixed length.
Alternatively, instead of structured schema knowledge, recent research proposes to use hand-crafted demonstrations as prompts (Gupta et al., 2022) or to find similar examples that guide understanding tasks such as conversational semantic parsing (Yu et al., 2021; Pasupat et al., 2021; Yao et al., 2021). However, one turn can contain multiple dialog states, so the examples retrieved by previous methods may not provide sufficient evidence. Furthermore, our method can be applied to unify different forms of knowledge, including structured and unstructured ones.
3 Methodology
Our proposed method is illustrated in Figure 1. Given the context x, we first retrieve k relevant knowledge entries e based on the similarity between Enc(x) and Enc(e), where Enc is an encoder. We then integrate the retrieved entries e_1, e_2, ..., e_k with the original context to form x', which is used as the input for the target DST task.
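As a minimal sketch of this retrieve-then-concatenate step (the `encode` helper, the normalization assumption, and the separator tokens are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def retrieve_and_augment(context, knowledge_entries, encode, k=5):
    """Retrieve the k knowledge entries most similar to the dialog context
    and concatenate them with the context for the downstream DST model.

    `encode` is assumed to map a list of strings to a (n, d) array of
    L2-normalized vectors, e.g. any off-the-shelf or fine-tuned sentence
    encoder (hypothetical stand-in for Enc in the paper).
    """
    q = encode([context])[0]            # Enc(x)
    E = encode(knowledge_entries)       # Enc(e) for every candidate entry
    scores = E @ q                      # dot product = cosine sim on normalized vectors
    top_k = np.argsort(-scores)[:k]     # indices of the k highest-scoring entries
    retrieved = [knowledge_entries[i] for i in top_k]
    # Form x': retrieved entries concatenated with the original dialog context.
    # The "[KNOWLEDGE]" / "[CONTEXT]" separators are illustrative only.
    return "[KNOWLEDGE] " + " ; ".join(retrieved) + " [CONTEXT] " + context
```

In this sketch, both structured entries (serialized slot-value pairs) and unstructured entries (snippets from the raw training data) can be ranked the same way, since each is represented as plain text.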
Knowledge retrieval
Different from previous work (such as question answering), where there is only one ground-truth knowledge entry for each query, multiple entries in the form of slot-value pairs may exist in the ontology base that match the conversation context. Importantly, unlike passage retrieval, where the query (e.g., a sentence) and the target (e.g., another sentence or passage) are similar to the pre-training corpus, structured knowledge such as schema pairs may have a different representation distribution. Thus, an off-the-shelf encoder may retrieve noisy elements and degrade final performance, especially when training with the target task optimized on DST generation. Moreover, non-parametric retrieval methods such as TF-IDF and BM25 (Robertson and Zaragoza, 2009) rely on lexical overlap, which can be detrimental when schema entries have high word overlap with each other (e.g., the same value for different slots).
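To make the lexical-overlap concern concrete, a toy example (with made-up schema entries and a simple token-overlap score standing in for TF-IDF/BM25, not the paper's retriever):

```python
# Toy illustration of why lexical overlap can mislead schema retrieval:
# two slot-value entries that share tokens with the context tie, even though
# only one corresponds to the state of the current turn.
def token_overlap(query, entry):
    q, e = set(query.lower().split()), set(entry.lower().split())
    return len(q & e) / len(e)

entries = ["restaurant area centre", "hotel area centre"]
context = "i booked a restaurant in the centre , now i also need a hotel in the centre"

for entry in entries:
    print(f"{entry!r}: {token_overlap(context, entry):.2f}")
# Both entries score 0.67: a purely lexical scorer cannot tell which slot the
# current turn grounds, whereas a trained dense retriever can use the context.
```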
We therefore train our knowledge retriever to promote similar representations between a query and its ground-truth knowledge. We started by optimizing the marginal likelihood over all positive knowledge entries, but found in our preliminary studies that it resulted in a peaky distribution centered around specific elements. Instead, we mini-