gestures in the measure of "human-likeness" and below random gestures in the measure of "gesture appropriateness". We identify the cause as a mismatch between the input text or audio features and the characteristics of the gestures.
In this paper, we propose a gesture generation system with separate generators for each gesture type. The gesture types here refer to the categories classified by McNeill [2], which can be roughly divided into "Beat", a rhythmic gesture that appears for emphasis, and "Imagistic", a gesture that expresses something. Since a Beat gesture is a movement such as swinging the arms in time with the speech, generating it requires speech information. An Imagistic gesture, on the other hand, expresses the content of the speech, for example forming a circle with the hands when the speaker says, "It looks like a donut"; generating it therefore requires semantic information. Our system first
predicts from the input text whether each word is likely
to be Imagistic, Beat, or No-Gesture. Each predicted word
sequence is then fed into a generator dedicated to that gesture type. The Imagistic generator produces gestures from important words selected by a DNN, and the Beat generator produces gestures from audio synthesized from the text, each drawing on its own gesture library. The No-Gesture generator produces movements that bridge the preceding and following gestures. The gesture libraries used by the generators are built from the collected TED Gesture-Type Dataset. Finally, the generated gestures are interpolated and integrated to match the speech duration of the uttered words (a minimal sketch of this dispatch pipeline is given after the contribution list). Although a previous method [12] also generated gesture types separately, there was not enough data to train the Imagistic word extraction network, and its Beat gestures were limited to a small set of pre-defined gestures. The contributions of this work are the following:
• We created a dataset of TED Talks videos annotated with gesture types using crowdsourcing.
• We propose a method to generate gestures by explicitly separating the gesture types.
• We propose a method to build gesture libraries from the collected data and to generate each type of gesture using those libraries.
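To make the flow above concrete, the following is a minimal sketch of the dispatch pipeline in Python. All names (WordSpan, type_classifier, the three generator callables, resample) are illustrative assumptions rather than the authors' actual implementation, and boundary smoothing between clips is omitted.

from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class WordSpan:
    word: str
    start: float            # speech onset of the word (seconds)
    end: float               # speech offset of the word (seconds)
    gesture_type: str = ""   # "Imagistic", "Beat", or "No-Gesture"


def resample(poses: np.ndarray, n_frames: int) -> np.ndarray:
    """Linearly interpolate a (T, J) pose sequence to n_frames frames."""
    src = np.linspace(0.0, 1.0, num=len(poses))
    dst = np.linspace(0.0, 1.0, num=max(n_frames, 2))
    return np.stack(
        [np.interp(dst, src, poses[:, j]) for j in range(poses.shape[1])],
        axis=1,
    )


def generate_gestures(words: List[WordSpan],
                      type_classifier: Callable,
                      imagistic_gen: Callable,
                      beat_gen: Callable,
                      no_gesture_gen: Callable,
                      fps: int = 20) -> np.ndarray:
    """Predict a gesture type per word, dispatch contiguous runs of words
    to the matching generator, and stretch each clip to the speech
    duration of its words."""
    # 1) Per-word gesture-type prediction from the text alone.
    labels = type_classifier([w.word for w in words])
    for w, label in zip(words, labels):
        w.gesture_type = label

    # 2) Group consecutive words sharing the same predicted type.
    runs, current = [], [words[0]]
    for w in words[1:]:
        if w.gesture_type == current[-1].gesture_type:
            current.append(w)
        else:
            runs.append(current)
            current = [w]
    runs.append(current)

    # 3) Dispatch each run to its dedicated generator and resample the
    #    returned pose sequence to the run's speech duration.
    clips = []
    for run in runs:
        duration = run[-1].end - run[0].start
        kind = run[0].gesture_type
        if kind == "Imagistic":
            poses = imagistic_gen(run)    # library lookup keyed on words
        elif kind == "Beat":
            poses = beat_gen(run)         # timed against synthesized audio
        else:
            poses = no_gesture_gen(run)   # bridges neighbouring gestures
        clips.append(resample(poses, int(duration * fps)))

    # 4) Concatenate the clips (boundary smoothing omitted in this sketch).
    return np.concatenate(clips, axis=0)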
II. RELATED WORK
a) Rule-based gesture generation: Cassell et al. proposed a rule-based method for assigning gestures to words in conversation according to McNeill's definition of gesture types [3]. The same group also created a toolkit called BEAT [4] that extracts specific words and the relationships between words and assigns gestures to avatars. This approach makes it possible to generate Imagistic gestures that match the meaning of certain words. Other rule-based systems [5], [6] also take text as input, but while these methods can generate a limited number of Imagistic gestures, they do not use speech and cannot generate Beat gestures.
b) Data-driven gesture generation: With the remarkable development of deep learning, recent research on gesture generation has shifted toward data-driven methods, with some generating gestures from text only [10], some from speech only [8], [13], [9], and some from both [14], [11]. In recent years, a workshop [1] has been held to establish a benchmark for gesture generation; in it, the generated gestures are rated significantly below the original gestures on the human-likeness measure.
Yoon et al. [10] proposed a data-driven gesture generation method that uses a Seq2Seq network to translate text into gestures. However, generating a gesture such as Beat, in which arm movement and voice are synchronized, requires audio information.
Among studies generating gestures from audio, Kucherenko et al. [8] proposed a data-driven approach that uses an LSTM to transform speech into gestures. Ginosar et al. [13] proposed a method to generate gestures from the speech spectrum using a CNN. Ferstl et al. [9] proposed a method for predicting appropriate gesture parameters from speech and using those parameters to pull gestures from a database. Using actual gestures from a database is effective for generating more realistic gestures. However, these methods, which take only speech as input, can generate Beat gestures but cannot generate Imagistic gestures that express the content of a sentence, because they do not consider the context. Therefore, we take text as input and use synthetic audio converted from the text to generate Beat gestures and to synchronize voice and hand movements.
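As a rough illustration (not the paper's actual Beat generator), the sketch below shows how synthetic audio can supply the timing that text alone lacks: a placeholder TTS call produces a waveform, and librosa's onset detector yields anchor times at which Beat strokes could be placed.

import librosa


def beat_anchor_times(text, synthesize_speech, sr=22050):
    """Return onset times (in seconds) detected in synthesized speech;
    these can serve as anchors for the strokes of Beat gestures.
    `synthesize_speech` is a placeholder for any off-the-shelf TTS that
    returns a 1-D float waveform at the requested sample rate."""
    wav = synthesize_speech(text, sample_rate=sr)
    return librosa.onset.onset_detect(y=wav, sr=sr, units="time")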
Among studies generating gestures from both text and speech, Kucherenko et al. [14] proposed a method that aligns the features of the two modalities and uses an autoregressive network, and Yoon et al. [11] proposed a gesture generation method using a GRU-based network that takes the speaker's ID as input in addition to text and audio. These methods generate gestures directly from an end-to-end trained network, and the results are therefore less human-like than actual gestures. Our method is expected to produce more human-like movements, since the gestures are synthesized from actual gestures stored in the library.
III. TED GESTURE-TYPE DATASET
We introduce a new 3D gesture dataset annotated with gesture types, built from the TED dataset of Yoon et al. [10]. The reason for using TED videos is the availability of gestures, speech, and manually annotated subtitles. The actual annotation was done on Amazon Mechanical Turk, an online crowdsourcing service. The annotators divided the TED videos into gesture segments and determined whether each gesture was Beat, Imagistic, or No-Gesture. For Imagistic gestures, annotations of representative words for each gesture were also collected. In total, 13,714 gesture sequences were collected.
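The exact annotation schema is not given here; as an assumption, one record per annotated gesture segment might look like the sketch below, which also shows how the type distribution reported next can be tallied.

from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GestureAnnotation:
    video_id: str
    start: float           # segment start within the TED video (seconds)
    end: float             # segment end (seconds)
    gesture_type: str      # "Beat", "Imagistic", or "No-Gesture"
    representative_words: List[str] = field(default_factory=list)  # Imagistic only


def type_distribution(annotations: List[GestureAnnotation]) -> Dict[str, float]:
    """Fraction of each gesture type, e.g. {"Beat": 0.48, ...}."""
    counts = Counter(a.gesture_type for a in annotations)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}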
As shown in Figure 2 a), Beat gestures, Imagistic gestures, and No-Gestures account for 48%, 30%, and 22%, respectively. In the extranarrative gestures (non-narrative speech, such as setting descriptions and character introductions) of McNeill's experiment, Beat, Imagistic, and No-Gesture accounted for 54%, 28%, and 18%, respectively, which means that the data was collected almost as defined by McNeill. Figure 2 b) shows the distribution of the lengths of the collected gestures. Many of the gestures