Deep Gesture Generation for Social Robots
Using Type-Specific Libraries
Hitoshi Teshima¹, Naoki Wake³, Diego Thomas¹, Yuta Nakashima², Hiroshi Kawasaki¹, Katsushi Ikeuchi³
¹Kyushu University  ²Osaka University  ³Microsoft
teshima.hitoshi.058@s.kyushu-u.ac.jp
[Fig. 1 (system overview): the input text is passed to a gesture type prediction module that labels each word sequence as Beat, Imagistic, or No-Gesture; Beat words (via audio synthesized from text) go to the Beat generator, important Imagistic words to the Imagistic generator, and the remaining words to the No-Gesture generator, each backed by a library built from the TED Gesture-Type Dataset; the generated gestures are merged and interpolated into a pose sequence (Time × Joints × XYZ) for an avatar, or Labanotation for a robot.]
Fig. 1: Our proposed system predicts the gesture type for each word from the input text and feeds each word sequence to a separate generator to generate gestures.
Abstract— Body language such as conversational gesture is a powerful way to ease communication. Conversational gestures not only make speech more lively but also carry semantic meaning that helps to stress important information in the discussion. In the field of robotics, giving conversational agents (humanoid robots or virtual avatars) the ability to use gestures properly is critical, yet it remains a task of extraordinary difficulty. This is because, given only text as input, there are many possibilities and ambiguities in generating an appropriate gesture. Unlike previous works, we propose a new method that explicitly takes gesture types into account to reduce these ambiguities and generate human-like conversational gestures. Key to our proposed system is a new gesture database built on the TED dataset that allows us to map a word to one of three types of gestures: "Imagistic" gestures, which express the content of the speech, "Beat" gestures, which emphasize words, and "No-Gesture." We propose a system that first maps the words in the input text to their corresponding gesture type, generates type-specific gestures, and combines the generated gestures into one final smooth gesture. In our comparative experiments, the effectiveness of the proposed method was confirmed in user studies with both an avatar and a humanoid robot.
I. INTRODUCTION
We communicate information in two ways, speech and gesture, and both are important elements. As McNeill [2] argued, many human gestures are related to speech. Gestures supplement the content of the speech and make it easier for the listener to understand the information. In recent years, agents that interact with humans, such as avatars and humanoid robots, have become commonly used. If an agent communicates not only through speech but also through gestures, it becomes easier for humans to understand the conversation with the agent.
Recent gesture generation systems learn to generate gestures end-to-end from text, speech, or both. However, generating appropriate gestures that correspond to the content of the speech remains difficult due to the high ambiguity of gestures. In the workshop [1] held to establish a benchmark for gesture generation, many gestures produced by the participants' methods scored below the ground-truth gestures on the measure of human-likeness and below random gestures on the measure of gesture appropriateness. We attribute this to a mismatch between the text or audio features used as input and the characteristics of the gestures to be generated.
In this paper, we propose a gesture generation system with a separate generator for each gesture type. The gesture types here refer to the classification by McNeill [2], which can be roughly divided into "Beat," rhythmic gestures that appear for emphasis, and "Imagistic," gestures that express something. Since Beat gestures, such as swinging the arms in time with the speech, follow the rhythm of the voice, they require speech information to be generated. On the other hand, Imagistic gestures express the content of the speech, such as making a circle with the hands when the speaker says, "It looks like a donut," and therefore require semantic information to be generated. Our system first predicts from the input text whether each word is likely to be Imagistic, Beat, or No-Gesture. Each predicted word sequence is then fed into the generator dedicated to its gesture type. The Imagistic generator generates gestures from important words selected by a DNN, and the Beat generator generates gestures from audio synthesized from the text; each draws on its own gesture library. The No-Gesture generator generates movements that connect the preceding and following gestures. The gesture libraries used in each generator are built from the collected TED Gesture-Type Dataset. Finally, the generated gestures are interpolated and merged to match the speech duration of the uttered words. Although a previous method [12] also generated separate gesture types, there was not enough data to train an Imagistic word extraction network, and its Beat gestures were simply a small set of pre-defined gestures. The contributions of this work are the following:
• We created a dataset of TED Talks videos annotated with gesture types using crowdsourcing.
• We propose a method to generate gestures based on explicitly separating the gesture types.
• We propose a method to build gesture libraries from the collected data and to generate each type of gesture using these libraries.
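To make the routing described above concrete, the following is a minimal sketch of the type-specific pipeline. The class and function names, the stub type predictor, and the placeholder generators are illustrative assumptions, not the actual implementation.

    from dataclasses import dataclass
    from typing import List

    # Illustrative gesture-type labels following McNeill's classification.
    BEAT, IMAGISTIC, NO_GESTURE = "Beat", "Imagistic", "No-Gesture"

    @dataclass
    class GestureClip:
        """A pose sequence of shape (time, joints, xyz), kept as nested lists."""
        poses: List[List[List[float]]]

    def predict_gesture_types(words: List[str]) -> List[str]:
        # Placeholder for the learned type-prediction network; here we simply
        # alternate labels for illustration.
        return [(IMAGISTIC, BEAT, NO_GESTURE)[i % 3] for i, _ in enumerate(words)]

    def generate(words: List[str], label: str) -> GestureClip:
        # Stand-in for the three type-specific generators, each of which would
        # look up and adapt real clips from its gesture library.
        return GestureClip(poses=[[[0.0, 0.0, 0.0]] * 10] * len(words))

    def merge_and_interpolate(clips: List[GestureClip]) -> GestureClip:
        # Concatenate clips; a real system would also time-warp each clip to the
        # duration of its spoken words and blend the clip boundaries.
        merged = []
        for clip in clips:
            merged.extend(clip.poses)
        return GestureClip(poses=merged)

    def generate_gesture(text: str) -> GestureClip:
        words = text.split()
        labels = predict_gesture_types(words)
        # Group consecutive words with the same predicted type and route each
        # group to the generator dedicated to that type.
        clips, group, current = [], [], labels[0]
        for word, label in zip(words, labels):
            if label != current:
                clips.append(generate(group, current))
                group, current = [], label
            group.append(word)
        clips.append(generate(group, current))
        return merge_and_interpolate(clips)

    print(len(generate_gesture("this idea of the market with small shops").poses))

In the real system, predict_gesture_types corresponds to the learned prediction network and each generate call retrieves and adapts clips from the corresponding gesture library.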
II. RELATED WORK
a) Rule-based gesture generation: Cassell et al. proposed a rule-based method for assigning gestures to words in conversation according to McNeill's definition of gesture types [3]. They also created a toolkit called BEAT [4] that extracts specific words and relationships between words and assigns gestures to avatars. This method makes it possible to generate Imagistic gestures that match the meaning of certain words. There are other rule-based systems [5], [6] that also take text as input, but while these methods can generate a limited number of Imagistic gestures, they do not use speech and cannot generate Beat gestures.
b) Data-driven gesture generation: With the remarkable development of deep learning, recent research on gesture generation has tended toward data-driven methods, with some generating gestures from text only [10], some from speech only [8], [13], [9], and some from both [14], [11]. In recent years, the workshop [1] has been held to establish a benchmark for gesture generation; in this workshop, the evaluations of the generated gestures were significantly below those of the original gestures on the human-likeness measure.
Yoon et al. [10] proposed a data-driven gesture generation method that uses a Seq2Seq network to translate text into gestures. However, in order to generate gestures such as Beat, in which the arm and voice are synchronized, audio information is required.
In studies on generating gestures from audio, Kucherenko et al. [8] proposed a data-driven approach that uses an LSTM to transform voice into gestures. Ginosar et al. [13] proposed a method to generate gestures from the speech spectrum using a CNN. Ferstl et al. [9] proposed a method that predicts appropriate gesture parameters from speech and uses those parameters to pull gestures from a database. Using actual gestures from a database is effective in generating more realistic gestures. However, these methods, which use only speech as input, can generate Beat gestures but cannot generate Imagistic gestures that express the content of a sentence, because they do not consider the context. Therefore, we use text as input together with synthetic audio converted from the text to generate Beat gestures and synchronize the voice and hand movements.
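As a rough illustration of this text-to-audio step (not the authors' implementation), the sketch below synthesizes speech with gTTS and detects onsets with librosa as a proxy for the instants to which Beat strokes could be aligned; the choice of both libraries and the overall procedure are assumptions made for this example.

    # Sketch: synthesize audio from the Beat word sequence and estimate onset
    # times at which gesture strokes could be placed (illustrative only).
    from gtts import gTTS          # text-to-speech
    import librosa                 # audio analysis

    def beat_onset_times(text: str, path: str = "beat_words.mp3") -> list:
        # Convert the text into synthetic speech and save it to an audio file.
        gTTS(text=text, lang="en").save(path)
        # Load the synthesized audio and detect onsets (in seconds); a Beat
        # generator could align gesture strokes to these instants.
        y, sr = librosa.load(path)
        return librosa.onset.onset_detect(y=y, sr=sr, units="time").tolist()

    print(beat_onset_times("this idea of the market with small shops"))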
Among studies that generate gestures from both text and speech, Kucherenko et al. [14] proposed a method that aligns the features of the two modalities and generates gestures with an autoregressive network. Yoon et al. [11] proposed a gesture generation method using a GRU-based network that takes the speaker's ID as input in addition to text and audio. These methods generate gestures directly from an end-to-end trained network and are therefore less human-like than actual gestures. Our method is expected to produce more human-like movements, since the gestures are synthesized from actual gestures in the library.
III. TED GESTURE-TYPE DATASET
We introduce a new 3D gesture dataset with annotated gesture types built from the TED dataset of Yoon et al. [10]. The reason for using TED videos is the availability of gestures, speech, and manually annotated subtitles. The annotations were collected on Amazon Mechanical Turk, an online crowdsourcing service. The annotators segmented the TED videos into gestures and determined whether each gesture was Beat, Imagistic, or No-Gesture. For Imagistic gestures, annotations of the representative words for each gesture were also collected. In total, 13,714 gesture sequences were collected.
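For illustration only, the following sketch assumes a hypothetical record layout for one annotated gesture segment (the field names and sample records are not the dataset's actual schema) and shows how the per-type distribution reported below can be aggregated.

    from collections import Counter
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class GestureAnnotation:
        video_id: str                      # TED talk the segment comes from
        start_sec: float                   # segment start time in the video
        end_sec: float                     # segment end time in the video
        gesture_type: str                  # "Beat", "Imagistic", or "No-Gesture"
        representative_words: Optional[List[str]] = None  # only for Imagistic

    # Hypothetical sample records, shown only to illustrate the layout.
    annotations = [
        GestureAnnotation("talk_001", 12.4, 14.1, "Beat"),
        GestureAnnotation("talk_001", 14.1, 16.8, "Imagistic",
                          representative_words=["market", "shops"]),
        GestureAnnotation("talk_002", 3.0, 4.2, "No-Gesture"),
    ]

    # Aggregate the per-type distribution, as in Figure 2 a).
    counts = Counter(a.gesture_type for a in annotations)
    total = sum(counts.values())
    for gesture_type, n in counts.items():
        print(f"{gesture_type}: {100 * n / total:.0f}%")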
As shown in Figure 2 a), Beat gestures, Imagistic gestures, and No-Gestures account for 48%, 30%, and 22%, respectively. In the extranarrative (non-narrative speech, such as setting descriptions and character introductions) gestures of McNeill's experiment, Beat, Imagistic, and No-Gesture accounted for 54%, 28%, and 18%, respectively, which means that our data was collected almost as defined by McNeill. Figure 2 b) shows the distribution of the lengths of the collected gestures. Many of the gestures