tions form the backbone of AURL and enable efficient
learning of dynamic skills with our datasets.
We experimentally evaluate AURL on five dynamic ma-
nipulation tasks including swatting a fly swatter (see Fig. 1),
striking a box, and rattling a child’s toy on a UR10 robot.
Across all tasks, AURL can predict parameters for dynamic
behavior from audio with lower errors than our baselines.
Furthermore, when given desired sounds to emulate, the
dynamic behavior produced by AURL can generate sounds
with high similarity to the desired ones. On our aural simi-
larity metric, we find that self-supervised learning improves
performance by 34.5% over plain supervised learning.
To summarize, this paper makes three contributions. First,
we collect an audio-behavior dataset of 25k examples on
five tasks that capture contact-rich, dynamic interactions.
Second, we show that self-supervised learning techniques can
significantly improve the performance of behavior prediction
from audio. Third, we perform an ablative analysis on various
design choices such as the self-supervised objective used and
the amount of data needed for training. To the best of our
knowledge, AURL represents the first work to demonstrate
that dynamic behaviors can be learned solely from auditory
inputs. The dataset and robot videos from AURL are avail-
able on audio-robot-learning.github.io.
II. RELATED WORK
Our framework builds on prior work in audio-based robotics,
self-supervised learning, and dynamic manipulation. In this
section, we briefly describe the works most relevant to ours.
A. Audio for Robotics
The use of audio information has been extensively ex-
plored in the context of multimodal learning with visual
data [11,12,13,10,14]. Several of our design decisions
like the use of convolutional networks are inspired by this
line of work [11]. In the context of robotics, several works
have used audio for better navigation [15,16,9,17], where
the central idea is that sound provides a useful signal of the
environment around the robot. Recently, the sound gener-
ated by a quadrotor has been shown to provide signal for
visual localization [18]. The use of audio for manipulation
remains sparse [7,8,19]. Prior work in this domain has
looked at using sound to improve manipulation with granular
material [7], connecting sound with robotic planar manipula-
tion [8], and learning imitation-based policies from multimodal
sensory inputs [19]. We draw several points of inspiration
from these works including the use of contact microphones to
record sound [7,8] and the training of behavior models [19]. AURL
differs from these works in two aspects. First, we focus on
dynamic manipulation that generates sound through contact-
rich interaction. Second, we show that such manipulation be-
haviors can be learned without any visual or multimodal data.
B. Self-Supervised Learning on Sensory Data
Self-supervised representation learning has led to impres-
sive results in computer vision [20,21,22,23] and natural
language processing [24,25]. The goal of these methods is
to extract low-dimensional representations that can improve
downstream learning tasks without the need for labeled data.
This is done by first training a model on ‘pretext’ tasks with
an unlabeled dataset (e.g. Internet images). These pretext
tasks often include instance invariance to augmentations such
as color jitters or rotations [26,27,20,21,28]. The use of
such self-supervised learning has recently shown promise in
visual robotics tasks such as door opening [29] and dexterous
manipulation [30].
In the context of audio, several recent works have ex-
plored self-supervised representation learning for speech
and music data [31,32,33,34,35,36,37,38]. AURL
takes inspiration from BYOL-A [31] to use the BYOL [22]
framework for learning auditory representations. However,
unlike prior works that use music or speech data to train
their representations, our data includes multi-channel contact
sound that requires special consideration. For example, we
found that the Mixup augmentation used in BYOL-A does
not work well on our contact sound data.
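To make this style of pretext training concrete, the sketch below shows a BYOL-style objective over two augmented views of an audio spectrogram. It is a minimal illustration only: the encoder architecture, projector sizes, and the SpectrogramEncoder name are placeholder assumptions, not AURL's exact model or augmentation set.

```python
# Minimal BYOL-style pretext sketch for audio spectrograms (PyTorch).
# Illustrative only; architecture and augmentation choices are placeholders.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramEncoder(nn.Module):
    """Small CNN mapping a (channels, mels, time) spectrogram to an embedding."""
    def __init__(self, in_channels=2, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

def mlp(dim, hidden=256):
    return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

online_enc, online_proj, predictor = SpectrogramEncoder(), mlp(128), mlp(128)
target_enc, target_proj = copy.deepcopy(online_enc), copy.deepcopy(online_proj)
for p in list(target_enc.parameters()) + list(target_proj.parameters()):
    p.requires_grad = False  # target branch is updated only by EMA, not gradients

def byol_loss(spec_a, spec_b):
    """spec_a, spec_b: two augmented views of the same audio clip.
    In BYOL the loss is also computed with the two views swapped."""
    q = predictor(online_proj(online_enc(spec_a)))   # online branch prediction
    with torch.no_grad():
        z = target_proj(target_enc(spec_b))          # target branch embedding
    return 2 - 2 * F.cosine_similarity(q, z, dim=-1).mean()

@torch.no_grad()
def ema_update(tau=0.99):
    """Exponential moving average of online parameters into the target network."""
    for o, t in zip(online_enc.parameters(), target_enc.parameters()):
        t.mul_(tau).add_((1 - tau) * o)
    for o, t in zip(online_proj.parameters(), target_proj.parameters()):
        t.mul_(tau).add_((1 - tau) * o)
```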
C. Dynamic Manipulation
Training dynamic manipulation behaviors, i.e. behaviors
that require dynamic properties such as momentum, has
received significant interest in the robotics community [39,
40,41,42,43,44,45,46,47]. Several works in this
domain learn policies that output parameters of predefined
motion primitives [43,45], which makes them amenable to
supervised learning approaches. Our work uses a similar
action parameterization for the dynamic tasks we consider.
However, in contrast to many of these works [44,45,46],
AURL can operate in domains where visual information
does not contain sufficient temporal resolution or contact
information to effectively solve the task.
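As a hedged illustration of this kind of action parameterization, the sketch below regresses the parameters of a predefined motion primitive from a fixed-size audio embedding. The specific parameters (peak velocity, amplitude, duration) and network sizes are illustrative assumptions, not AURL's exact action space.

```python
# Sketch of parameter regression for a predefined motion primitive.
# The parameter set and network sizes are illustrative placeholders.
from dataclasses import dataclass
import torch
import torch.nn as nn

@dataclass
class SwingPrimitive:
    peak_velocity: float   # rad/s, peak joint velocity of the swing
    amplitude: float       # rad, total joint displacement
    duration: float        # s, time to complete the motion

class AudioToPrimitive(nn.Module):
    """Regresses primitive parameters from a fixed-size audio representation."""
    def __init__(self, audio_dim=128, n_params=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim, 64), nn.ReLU(), nn.Linear(64, n_params))

    def forward(self, audio_embedding):
        return self.head(audio_embedding)  # (batch, n_params), cf. SwingPrimitive

def parameter_loss(model, audio_embedding, executed_params):
    """Supervised objective: MSE between predicted and executed parameters,
    as in parameter-regression approaches [43,45]."""
    return nn.functional.mse_loss(model(audio_embedding), executed_params)
```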
III. AUDIO-BEHAVIOR DATASET COLLECTION
Since no prior work has released a dataset for studying the
interaction of audio with behavior learning, we create our
own. We select tasks that allow for self-supervised dataset
collection while being dynamic in nature (see Fig. 2 for task
visualizations). For tasks
that involve making contact with the environment, we use
deformable tools to ensure safety. All of our data is collected
on a UR10 robot, with behaviors generated through joint
velocity control, and audio recorded by contact microphones.
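The collection procedure this implies can be sketched as a simple loop: sample random primitive parameters, execute them on the arm, and log the resulting audio together with the parameters. The two helper stubs below are hypothetical placeholders for the robot and contact-microphone interfaces, not actual UR10 or recording APIs.

```python
# Hedged sketch of a self-supervised audio-behavior collection loop.
# execute_primitive / record_contact_audio are placeholder stubs, not real APIs.
import numpy as np

def execute_primitive(params):
    """Placeholder: would command a joint-velocity motion primitive on the robot."""
    pass

def record_contact_audio(run_motion, sample_rate=44100, seconds=2.0):
    """Placeholder: would record from contact microphones while the motion runs."""
    run_motion()
    return np.zeros((2, int(sample_rate * seconds)))  # dummy 2-channel waveform

def collect_dataset(n_episodes, param_low, param_high, seed=0):
    """Sample random primitive parameters, execute, and log (audio, params) pairs."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(param_low), np.asarray(param_high)
    dataset = []
    for _ in range(n_episodes):
        params = rng.uniform(low, high)  # random dynamic behavior
        audio = record_contact_audio(lambda: execute_primitive(params))
        dataset.append({"params": params, "audio": audio})
    return dataset
```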
A. Dynamic Tasks and Setup
Our dataset consists of five tasks: rattling a rattle, shaking
a tambourine, swatting a fly swatter, striking a horizontal sur-
face, and striking a vertical surface. For each task, we attach
a different object to our UR10's end effector with 3D-printed
mounts. Dynamic motion for each task is generated by
using motion primitives that control the robot’s velocity and
acceleration profile. After the execution of the robot motion,
distinctive sounds are produced either from the object itself
(e.g. rattle) or through interaction with the environment (e.g.
swatter). Audio generated from each interaction is recorded