Are Current Decoding Strategies Capable of Facing
the Challenges of Visual Dialogue?
Amit Kumar Chaudhary
CIMeC, University of Trento
amitkumar.chaudhar@unitn.it
Alex J. Lucassen
CIMeC, University of Trento
alex.lucassen@unitn.it
Ioanna Tsani
CIMeC, University of Trento
ioanna.tsani@unitn.it
Alberto Testoni
DISI, University of Trento
alberto.testoni@unitn.it
Abstract
Decoding strategies play a crucial role in natural language generation systems. They are usually designed and evaluated in open-ended text-only tasks, and it is not clear how different strategies handle the numerous challenges that goal-oriented multimodal systems face (such as grounding and informativeness). To answer this question, we compare a wide variety of different decoding strategies and hyperparameter configurations in a Visual Dialogue referential game. Although none of them successfully balances lexical richness, accuracy in the task, and visual grounding, our in-depth analysis allows us to highlight the strengths and weaknesses of each decoding strategy. We believe our findings and suggestions may serve as a starting point for designing more effective decoding algorithms that handle the challenges of Visual Dialogue tasks.
1 Introduction
The last few years have witnessed remarkable progress in developing efficient generative language models. The choice of the decoding strategy plays a crucial role in the quality of the output (see Zarrieß et al. (2021) for an exhaustive overview). It should be noted that decoding strategies are usually designed for and evaluated in text-only settings. The most widely used decoding strategies can be grouped into two main classes. On the one hand, decoding strategies that aim to generate text that maximizes likelihood (like greedy and beam search) have been shown to generate generic, repetitive, and degenerate output. Zhang et al. (2021) refer to this phenomenon as the likelihood trap and provide evidence that these strategies lead to sub-optimal sequences. On the other hand, stochastic strategies like pure sampling, top-k sampling, and nucleus sampling (Holtzman et al., 2020) increase the variability of generated texts by taking random samples from the model. However, this comes at the cost of generating words that are not semantically appropriate for the context in which they appear. Recently, Meister et al. (2022) used an information-theoretic framework to propose a new decoding algorithm (typical decoding), which samples tokens with an information content close to their conditional entropy. Typical decoding shows promising results in human evaluation experiments but, given its recent release, it is not clear yet how general this approach is.
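Since these strategies operate only on the model's next-token distribution, their selection step can be sketched on a toy example. The sketch below (with an invented five-word vocabulary and illustrative probabilities, not taken from any of the cited papers) contrasts nucleus sampling, which keeps the highest-probability head of the distribution up to cumulative mass p, with typical decoding, which keeps the tokens whose information content -log p is closest to the distribution's entropy:

```python
import math

def nucleus_filter(probs, p=0.9):
    """Nucleus (top-p) sampling: keep the smallest high-probability head
    of the distribution whose cumulative mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}

def typical_filter(probs, tau=0.9):
    """Typical decoding: keep the tokens whose information content -log p
    is closest to the distribution's entropy, up to cumulative mass tau."""
    entropy = -sum(pr * math.log(pr) for pr in probs.values() if pr > 0)
    ranked = sorted(probs.items(),
                    key=lambda kv: abs(-math.log(kv[1]) - entropy))
    kept, cum = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        cum += pr
        if cum >= tau:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}

# Invented next-token distribution over a five-word vocabulary.
probs = {"dog": 0.5, "cat": 0.2, "bird": 0.15, "zebra": 0.1, "xylophone": 0.05}

# With a tight budget, nucleus keeps the single most likely token,
# while typical decoding keeps mid-probability tokens instead.
print(sorted(nucleus_filter(probs, p=0.3)))    # ['dog']
print(sorted(typical_filter(probs, tau=0.3)))  # ['bird', 'cat']
```

In both cases the next token would then be drawn at random from the renormalized set (e.g., with random.choices), which is where the variability of stochastic strategies comes from.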
Multimodal vision & language systems have recently received a lot of attention from the research community, but a thorough analysis of different decoding strategies in these systems has not been carried out. Thus, the question arises of whether the above-mentioned decoding strategies can handle the challenges of multimodal systems, i.e., generate text that not only takes into account lexical variability, but also grounding in the visual modality. Moreover, in goal-oriented tasks, the informativeness of the generated text plays a crucial role as well. To address these research questions, in this paper we take a referential visual dialogue task, GuessWhat?! (De Vries et al., 2017), where two players (a Questioner and an Oracle) interact so that the Questioner identifies the secret object assigned to the Oracle among the ones appearing in an image (see Figure 1 for an example). Apart from well-known issues, such as repetitions in the output, this task poses specific challenges for evaluating decoding techniques compared to previous work. On the one hand, the generated output has to be coherent with the visual input upon which the conversation takes place. As highlighted by Rohrbach et al. (2018) and Testoni and Bernardi (2021b), multimodal generative models often generate hallucinated entities, i.e., tokens that refer to entities that do not appear in the image. On the other hand, the questions must be informative, i.e., they must help the Questioner to incrementally identify the target object.
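The structure of the game can be illustrated with a minimal rule-based sketch (a toy stand-in, not the neural Questioner and Oracle studied in this paper; the objects, attributes, and player policies below are all invented): the Oracle answers yes/no questions about a secret target, and the Questioner uses each answer to shrink the candidate set.

```python
# Toy sketch of the GuessWhat?! game loop with hypothetical rule-based
# players and a made-up scene of three candidate objects.
objects = [
    {"name": "dog", "color": "brown"},
    {"name": "cat", "color": "black"},
    {"name": "cat", "color": "white"},
]
target = objects[2]  # the secret object assigned to the Oracle

def oracle_answer(attr, value):
    """The Oracle answers yes/no based on the secret target object."""
    return "yes" if target.get(attr) == value else "no"

def questioner_play(candidates):
    """A minimal Questioner: asks about attributes and filters the
    candidate set after each answer, until one object remains."""
    dialogue = []
    for attr in ("name", "color"):
        if len(candidates) == 1:
            break
        value = candidates[0][attr]
        answer = oracle_answer(attr, value)
        dialogue.append((f"Is its {attr} {value}?", answer))
        candidates = [o for o in candidates
                      if (o.get(attr) == value) == (answer == "yes")]
    return dialogue, candidates

dialogue, remaining = questioner_play(list(objects))
```

Even this toy loop makes the informativeness requirement concrete: a question that fails to split the remaining candidates wastes a turn, which is exactly the property a decoding strategy for the Questioner must preserve.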
We show that the choice of the decoding strat-
arXiv:2210.12997v1 [cs.CL] 24 Oct 2022