gestures in the measure of "human-likeness" and below random gestures in the measure of "gesture appropriateness". We identify the cause as a mismatch between the input text or audio features and the characteristics of the gestures.
In this paper, we propose a gesture generation system with separate generators for each gesture type. The gesture types here refer to the categories classified by McNeill [2], which can be roughly divided into "Beat", a rhythmic gesture that appears for emphasis, and "Imagistic", a gesture that expresses something. Since a Beat gesture is a movement such as swinging the arms in time with the speech, generating it requires speech information. An Imagistic gesture, on the other hand, expresses the content of the speech, for example forming a circle with the hands when the speaker says, "It looks like a donut"; generating it therefore requires semantic information. Our system first
predicts from the input text whether each word is likely
to be Imagistic, Beat, or No-Gesture. Each predicted word
sequence is then fed into a generator dedicated to that gesture type. The Imagistic generator produces gestures from important words selected by a DNN, and the Beat generator produces gestures from audio synthesized from the text, each drawing on its own gesture library. The No-Gesture generator produces movements that bridge the preceding and following gestures. The gesture libraries used by the generators are built from the collected TED Gesture-Type Dataset. Finally, the generated gestures are interpolated and integrated to match the speech duration of the uttered words (a minimal sketch of this dispatch pipeline is given after the contribution list). Although a previous method [12] also generated gesture types separately, there was not enough data to train the Imagistic word extraction network, and its Beat gestures were limited to a small set of pre-defined gestures. The contributions of this work are the following:
• We created a dataset of TED Talks videos annotated with gesture types using crowdsourcing.
• We propose a method to generate gestures by explicitly separating the gesture types.
• We propose a method to build gesture libraries from the collected data and to generate each type of gesture using those libraries.
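To make the flow above concrete, the following is a minimal sketch of the dispatch pipeline in Python. All names (WordSpan, type_classifier, the three generator callables, resample) are illustrative assumptions rather than the authors' actual implementation, and boundary smoothing between clips is omitted.

from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class WordSpan:
    word: str
    start: float            # speech onset of the word (seconds)
    end: float               # speech offset of the word (seconds)
    gesture_type: str = ""   # "Imagistic", "Beat", or "No-Gesture"


def resample(poses: np.ndarray, n_frames: int) -> np.ndarray:
    """Linearly interpolate a (T, J) pose sequence to n_frames frames."""
    src = np.linspace(0.0, 1.0, num=len(poses))
    dst = np.linspace(0.0, 1.0, num=max(n_frames, 2))
    return np.stack(
        [np.interp(dst, src, poses[:, j]) for j in range(poses.shape[1])],
        axis=1,
    )


def generate_gestures(words: List[WordSpan],
                      type_classifier: Callable,
                      imagistic_gen: Callable,
                      beat_gen: Callable,
                      no_gesture_gen: Callable,
                      fps: int = 20) -> np.ndarray:
    """Predict a gesture type per word, dispatch contiguous runs of words
    to the matching generator, and stretch each clip to the speech
    duration of its words."""
    # 1) Per-word gesture-type prediction from the text alone.
    labels = type_classifier([w.word for w in words])
    for w, label in zip(words, labels):
        w.gesture_type = label

    # 2) Group consecutive words sharing the same predicted type.
    runs, current = [], [words[0]]
    for w in words[1:]:
        if w.gesture_type == current[-1].gesture_type:
            current.append(w)
        else:
            runs.append(current)
            current = [w]
    runs.append(current)

    # 3) Dispatch each run to its dedicated generator and resample the
    #    returned pose sequence to the run's speech duration.
    clips = []
    for run in runs:
        duration = run[-1].end - run[0].start
        kind = run[0].gesture_type
        if kind == "Imagistic":
            poses = imagistic_gen(run)    # library lookup keyed on words
        elif kind == "Beat":
            poses = beat_gen(run)         # timed against synthesized audio
        else:
            poses = no_gesture_gen(run)   # bridges neighbouring gestures
        clips.append(resample(poses, int(duration * fps)))

    # 4) Concatenate the clips (boundary smoothing omitted in this sketch).
    return np.concatenate(clips, axis=0)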
II. RELATED WORK
a) Rule-based gesture generation: Cassell et al. proposed a rule-based method for assigning gestures to words in conversation according to McNeill's definition of gesture types [3]. The same group also created a toolkit called BEAT [4] that extracts specific words and the relationships between words and assigns gestures to avatars. This approach makes it possible to generate Imagistic gestures that match the meaning of certain words. Other rule-based systems [5], [6] also take text as input, but while these methods can generate a limited number of Imagistic gestures, they do not use speech and cannot generate Beat gestures.
b) Data-driven gesture generation: With the remarkable development of deep learning, recent research on gesture generation has shifted toward data-driven methods, with some generating gestures from text only [10], some from speech only [8], [13], [9], and some from both [14], [11]. In recent years, a workshop [1] has been held to establish a benchmark for gesture generation; in it, the generated gestures are rated significantly below the original gestures on the human-likeness measure.
Yoon et al. [10] proposed a data-driven gesture generation method that uses a Seq2Seq network to translate text into gestures. However, generating a gesture such as Beat, in which arm movement and voice are synchronized, requires audio information.
Among studies generating gestures from audio, Kucherenko et al. [8] proposed a data-driven approach that uses an LSTM to transform speech into gestures. Ginosar et al. [13] proposed a method to generate gestures from the speech spectrum using a CNN. Ferstl et al. [9] proposed a method for predicting appropriate gesture parameters from speech and using those parameters to pull gestures from a database. Using actual gestures from a database is effective for generating more realistic gestures. However, these methods, which take only speech as input, can generate Beat gestures but cannot generate Imagistic gestures that express the content of a sentence, because they do not consider the context. Therefore, we take text as input and use synthetic audio converted from the text to generate Beat gestures and to synchronize voice and hand movements.
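As a rough illustration (not the paper's actual Beat generator), the sketch below shows how synthetic audio can supply the timing that text alone lacks: a placeholder TTS call produces a waveform, and librosa's onset detector yields anchor times at which Beat strokes could be placed.

import librosa


def beat_anchor_times(text, synthesize_speech, sr=22050):
    """Return onset times (in seconds) detected in synthesized speech;
    these can serve as anchors for the strokes of Beat gestures.
    `synthesize_speech` is a placeholder for any off-the-shelf TTS that
    returns a 1-D float waveform at the requested sample rate."""
    wav = synthesize_speech(text, sample_rate=sr)
    return librosa.onset.onset_detect(y=wav, sr=sr, units="time")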
Among studies generating gestures from both text and speech, Kucherenko et al. [14] proposed a method that aligns the features of the two modalities and uses an autoregressive network, and Yoon et al. [11] proposed a gesture generation method using a GRU-based network that takes the speaker's ID as input in addition to text and audio. These methods generate gestures directly from an end-to-end trained network, and the results are therefore less human-like than actual gestures. Our method is expected to produce more human-like movements, since the gestures are synthesized from actual gestures stored in the library.
III. TED GESTURE-TYPE DATASET
We introduce a new 3D gesture dataset annotated with gesture types, built from the TED dataset of Yoon et al. [10]. The reason for using TED videos is the availability of gestures, speech, and manually annotated subtitles. The actual annotation was done on Amazon Mechanical Turk, an online crowdsourcing service. The annotators divided the TED videos into gesture segments and determined whether each gesture was Beat, Imagistic, or No-Gesture. For Imagistic gestures, annotations of representative words for each gesture were also collected. In total, 13,714 gesture sequences were collected.
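The exact annotation schema is not given here; as an assumption, one record per annotated gesture segment might look like the sketch below, which also shows how the type distribution reported next can be tallied.

from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GestureAnnotation:
    video_id: str
    start: float           # segment start within the TED video (seconds)
    end: float             # segment end (seconds)
    gesture_type: str      # "Beat", "Imagistic", or "No-Gesture"
    representative_words: List[str] = field(default_factory=list)  # Imagistic only


def type_distribution(annotations: List[GestureAnnotation]) -> Dict[str, float]:
    """Fraction of each gesture type, e.g. {"Beat": 0.48, ...}."""
    counts = Counter(a.gesture_type for a in annotations)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}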
As shown in Figure 2 a), Beat gestures, Imagistic gestures, and No-Gestures account for 48%, 30%, and 22%, respectively. In the extranarrative gestures (non-narrative speech, such as setting descriptions and character introductions) of McNeill's experiment, Beat, Imagistic, and No-Gesture accounted for 54%, 28%, and 18%, respectively, which means that the data was collected almost as defined by McNeill. Figure 2 b) shows the distribution of the lengths of the collected gestures. Many of the gestures