Are Current Decoding Strategies Capable of Facing
the Challenges of Visual Dialogue?
Amit Kumar Chaudhary
CIMeC, University of Trento
amitkumar.chaudhar@unitn.it
Alex J. Lucassen
CIMeC, University of Trento
alex.lucassen@unitn.it
Ioanna Tsani
CIMeC, University of Trento
ioanna.tsani@unitn.it
Alberto Testoni
DISI, University of Trento
alberto.testoni@unitn.it
Abstract
Decoding strategies play a crucial role in natural language generation systems. They are usually designed and evaluated in open-ended text-only tasks, and it is not clear how different strategies handle the numerous challenges that goal-oriented multimodal systems face (such as grounding and informativeness). To answer this question, we compare a wide variety of different decoding strategies and hyperparameter configurations in a Visual Dialogue referential game. Although none of them successfully balances lexical richness, accuracy in the task, and visual grounding, our in-depth analysis allows us to highlight the strengths and weaknesses of each decoding strategy. We believe our findings and suggestions may serve as a starting point for designing more effective decoding algorithms that handle the challenges of Visual Dialogue tasks.
1 Introduction
The last few years have witnessed remarkable progress in developing efficient generative language models. The choice of the decoding strategy plays a crucial role in the quality of the output (see Zarrieß et al. (2021) for an exhaustive overview). It should be noted that decoding strategies are usually designed for and evaluated in text-only settings. The most widely used decoding strategies can be grouped into two main classes. On the one hand, decoding strategies that aim to generate text that maximizes likelihood (like greedy and beam search) have been shown to generate generic, repetitive, and degenerate output. Zhang et al. (2021) refer to this phenomenon as the likelihood trap and provide evidence that these strategies lead to sub-optimal sequences. On the other hand, stochastic strategies like pure sampling, top-k sampling, and nucleus sampling (Holtzman et al., 2020) increase the variability of generated texts by taking random samples from the model. However, this comes at the cost of generating words that are not semantically appropriate for the context in which they appear. Recently, Meister et al. (2022) used an information-theoretic framework to propose a new decoding algorithm (typical decoding), which samples tokens with an information content close to their conditional entropy. Typical decoding shows promising results in human evaluation experiments but, given its recent release, it is not clear yet how general this approach is.
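Since these strategies operate only on the model's next-token distribution, their selection step can be sketched on a toy example. The sketch below (with an invented five-word vocabulary and illustrative probabilities, not taken from any of the cited papers) contrasts nucleus sampling, which keeps the highest-probability head of the distribution up to cumulative mass p, with typical decoding, which keeps the tokens whose information content -log p is closest to the distribution's entropy:

```python
import math

def nucleus_filter(probs, p=0.9):
    """Nucleus (top-p) sampling: keep the smallest high-probability head
    of the distribution whose cumulative mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}

def typical_filter(probs, tau=0.9):
    """Typical decoding: keep the tokens whose information content -log p
    is closest to the distribution's entropy, up to cumulative mass tau."""
    entropy = -sum(pr * math.log(pr) for pr in probs.values() if pr > 0)
    ranked = sorted(probs.items(),
                    key=lambda kv: abs(-math.log(kv[1]) - entropy))
    kept, cum = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        cum += pr
        if cum >= tau:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}

# Invented next-token distribution over a five-word vocabulary.
probs = {"dog": 0.5, "cat": 0.2, "bird": 0.15, "zebra": 0.1, "xylophone": 0.05}

# With a tight budget, nucleus keeps the single most likely token,
# while typical decoding keeps mid-probability tokens instead.
print(sorted(nucleus_filter(probs, p=0.3)))    # ['dog']
print(sorted(typical_filter(probs, tau=0.3)))  # ['bird', 'cat']
```

In both cases the next token would then be drawn at random from the renormalized set (e.g., with random.choices), which is where the variability of stochastic strategies comes from.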
Multimodal vision & language systems have recently received a lot of attention from the research community, but a thorough analysis of different decoding strategies in these systems has not been carried out. Thus, the question arises of whether the above-mentioned decoding strategies can handle the challenges of multimodal systems, i.e., generate text that not only takes into account lexical variability, but also grounding in the visual modality. Moreover, in goal-oriented tasks, the informativeness of the generated text plays a crucial role as well. To address these research questions, in this paper we take a referential visual dialogue task, GuessWhat?! (De Vries et al., 2017), where two players (a Questioner and an Oracle) interact so that the Questioner identifies the secret object assigned to the Oracle among the ones appearing in an image (see Figure 1 for an example). Apart from well-known issues, such as repetitions in the output, this task poses specific challenges for evaluating decoding techniques compared to previous work. On the one hand, the generated output has to be coherent with the visual input upon which the conversation takes place. As highlighted by Rohrbach et al. (2018) and Testoni and Bernardi (2021b), multimodal generative models often generate hallucinated entities, i.e., tokens that refer to entities that do not appear in the image. On the other hand, the questions must be informative, i.e., they must help the Questioner to incrementally identify the target object.
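The structure of the game can be illustrated with a minimal rule-based sketch (a toy stand-in, not the neural Questioner and Oracle studied in this paper; the objects, attributes, and player policies below are all invented): the Oracle answers yes/no questions about a secret target, and the Questioner uses each answer to shrink the candidate set.

```python
# Toy sketch of the GuessWhat?! game loop with hypothetical rule-based
# players and a made-up scene of three candidate objects.
objects = [
    {"name": "dog", "color": "brown"},
    {"name": "cat", "color": "black"},
    {"name": "cat", "color": "white"},
]
target = objects[2]  # the secret object assigned to the Oracle

def oracle_answer(attr, value):
    """The Oracle answers yes/no based on the secret target object."""
    return "yes" if target.get(attr) == value else "no"

def questioner_play(candidates):
    """A minimal Questioner: asks about attributes and filters the
    candidate set after each answer, until one object remains."""
    dialogue = []
    for attr in ("name", "color"):
        if len(candidates) == 1:
            break
        value = candidates[0][attr]
        answer = oracle_answer(attr, value)
        dialogue.append((f"Is its {attr} {value}?", answer))
        candidates = [o for o in candidates
                      if (o.get(attr) == value) == (answer == "yes")]
    return dialogue, candidates

dialogue, remaining = questioner_play(list(objects))
```

Even this toy loop makes the informativeness requirement concrete: a question that fails to split the remaining candidates wastes a turn, which is exactly the property a decoding strategy for the Questioner must preserve.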
We show that the choice of the decoding strat-
arXiv:2210.12997v1 [cs.CL] 24 Oct 2022