Learning on the Job: Self-Rewarding Offline-to-Online Finetuning for
Industrial Insertion of Novel Connectors from Vision
Ashvin Nair1, Brian Zhu1,2, Gokul Narayanan2, Eugen Solowjow2, Sergey Levine1
Abstract: Learning-based methods in robotics hold the
promise of generalization, but what can be done if a learned
policy does not generalize to a new situation? In principle,
if an agent can at least evaluate its own success (i.e., with a
reward classifier that generalizes well even when the policy
does not), it could actively practice the task and finetune the
policy in this situation. We study this problem in the setting
of industrial insertion tasks, such as inserting connectors in
sockets and setting screws. Existing algorithms rely on precise
localization of the connector or socket and carefully managed
physical setups, such as assembly lines, to succeed at the task.
But in unstructured environments such as homes or even some
industrial settings, robots cannot rely on precise localization
and may be tasked with previously unseen connectors. Offline
reinforcement learning on a variety of connector insertion tasks
is a potential solution, but what if the robot is tasked with inserting a previously unseen connector? In such a scenario, we
will still need methods that can robustly solve such tasks with
online practice. One of the main observations we make in this
work is that, with a suitable representation learning and domain
generalization approach, it can be significantly easier for the
reward function to generalize to a new but structurally similar
task (e.g., inserting a new type of connector) than for the policy.
This means that a learned reward function can be used to facil-
itate the finetuning of the robot’s policy in situations where the
policy fails to generalize in zero shot, but the reward function
generalizes successfully. We show that such an approach can
be instantiated in the real world, pretrained on 50 different
connectors, and successfully finetuned to new connectors via
the learned reward function. Videos and visualizations can be
viewed at sites.google.com/view/learningonthejob
I. INTRODUCTION
Generalizable policies require broad and diverse datasets,
but for realistic applications, learning policies that can always
generalize in zero shot to new objects and environments is
often infeasible – indeed, even humans do not exhibit such
universal generalization capabilities. Instead, when faced
with a task that we don’t precisely know how to do, we can
quickly learn the task by leveraging our prior knowledge
and a little bit of practice. Reinforcement learning (RL)
provides us with a way to implement this kind of learning
on the job, using online finetuning in the new domain or
task, and potentially even extending it into a lifelong learning
system where the robot improves its generalization capacity
continually with each new task it masters.
However, instantiating this concept in a practical robotics
setting requires overcoming a number of obstacles. The robot
must be able to combine large amounts of diverse offline data
with small amounts of targeted online experience, and do so
First two authors contributed equally. 1University of California, Berkeley. 2Siemens Corporation. Correspondence: anair17@berkeley.edu
in a way that doesn’t require revisiting previously learned
tasks or domains, which means that we need an offline
RL algorithm that supports online finetuning. Perhaps more
importantly, the entire finetuning process must be supported
by the robot’s own sensors, without privileged information
or environment instrumentation, so as to retain the benefits
of autonomous learning. In particular, this means that when
adapting to a new task, the robot must be able to evaluate
on its own whether it is making progress on the task, using
a learned reward function.
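To make this concrete, the sketch below shows one way such a self-rewarding practice loop could be written. It is a minimal illustration with assumed interfaces (the names `env`, `policy`, `reward_classifier`, `replay_buffer`, `agent`, and the 0.5 success threshold are ours, not the paper's): every online transition is labeled by the learned success classifier rather than by an instrumented reward.

```python
# Minimal sketch of self-rewarded online practice (assumed interfaces, not the
# paper's code): the robot labels its own transitions with a learned success
# classifier and finetunes on that self-generated reward signal.
import torch


def practice_episode(env, policy, reward_classifier, replay_buffer, max_steps=50):
    """Run one autonomous practice episode, labeling rewards from vision."""
    obs = env.reset()
    for _ in range(max_steps):
        with torch.no_grad():
            action = policy.act(obs)                     # sample a ~ pi(a | s)
            next_obs = env.step(action)
            # Self-evaluation: estimated probability of a successful insertion.
            reward = reward_classifier(next_obs).item()
        replay_buffer.add(obs, action, reward, next_obs)
        if reward > 0.5:                                 # assumed success threshold
            break                                        # stop once insertion looks done
        obs = next_obs


def finetune_on_the_job(env, policy, reward_classifier, replay_buffer, agent,
                        n_episodes=200):
    """Offline-to-online finetuning driven only by self-generated rewards."""
    for _ in range(n_episodes):
        practice_episode(env, policy, reward_classifier, replay_buffer)
        agent.update(replay_buffer.sample(batch_size=256))  # e.g. IQL update steps
```

The important property is that nothing in this loop requires ground-truth success labels or socket instrumentation; only the reward classifier has to generalize to the new connector.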
We study this problem in the setting of learning a policy
from vision for performing industrial insertion tasks. This
family of assembly tasks, which includes plugging connectors into sockets, keys into locks, screwdrivers into screw intrusions, setting screws, and so on, is found in many stages of manufacturing. When automated in factories today, these
tasks are done by robots with specialized control algorithms
that rely on precise localization of the socket location.
[Fig. 1 graphic: a grid of the 50 training connectors (Type A variants, RCA, USB, VGA, DisplayPort, HDMI, Ethernet, NEMA plugs, keys, and others) and four panels: 1. Offline Dataset (50 domains, D_train); 2. Offline Training; 3. Self-Supervised Finetuning (D_test); 4. Evaluation. The training panel shows a ResNet-18 encoder producing latents z_I and z_S (with adversarial loss L_adv(d, d̂)) that feed the reward R_ψ(s), policy π(a|s), and Q, V heads; the finetuning panel rolls out π(a|s), labels transitions (s, a, r̂, s') with r̂ = R_ψ(s), and updates π, Q, V.]
Fig. 1: We describe how a robot can learn to insert an unseen connector from
prior experience under realistic conditions by evaluating its own reward. (1)
We first collect a diverse dataset with 50 connectors. There is significant
variation in the shape of the connectors, sockets, and background. (2) We
learn a reward function, policy, and value functions using offline RL. We
propose a domain adversarial information bottleneck (DAIB) in order to
generalize to new domains. A domain invariance loss is applied on part
of the latent representation z_I, while a domain-specific latent variable z_S is constrained by an information bottleneck. (3) We finetune online with self-generated rewards to master a new domain. (4) We evaluate π in D_test.
For robots to perform these insertion tasks in industrial
and warehouse settings with less human supervision, or in
unstructured environments such as homes, they must rely on
highly accurate state information of the external world (e.g.,
socket position and in-hand pose estimation). But such state
estimation, using either machine learning or computer vision
approaches, is brittle for unseen connectors. To solve the general problem of inserting a novel connector, one promising approach is to generalize from previously collected connector-insertion experience and learn a policy that inserts connectors from vision. Among these tasks, there is enough variability
to require generalization and adaptation, but also enough
internal structural regularity that we expect transfer between
connectors. We first collected a large offline dataset of insertion data for 50 connectors across 2 robots and diverse backgrounds, containing actions, images, and sparse reward labels.
Offline RL on this data alone generalizes to connectors very similar to those in the training dataset, but we also expect robots to be able to perform tasks in new domains, perhaps after some practice. How can a robot insert test connectors from vision in this setting, using offline RL on prior data to enable active online finetuning on a new connector?
The key insight is that we need to (1) adapt to new tasks
quickly with online finetuning if the zero-shot solution is not
sufficient and (2) generalize to new domains by finding com-
mon structure between domains while preserving important
domain-specific information. Ideally, a policy trained offline
can generalize from vision to new tasks. But if it does not, we
can still finetune in a new domain with minimal supervision
as long as we have a reward function that generalizes instead.
For training policies and reward functions that generalize to
test domains, we propose a split representation that combines
domain adversarial neural networks [1] for domain invariance
and a variational information bottleneck [2] for controlling
the flow of domain-specific information. This representation,
which we call the domain adversarial information bottleneck (DAIB), is used first to learn a robust reward function that detects successful insertions for an unseen connector. Next,
we modify implicit Q-learning (IQL), an offline RL algo-
rithm amenable to online finetuning, to use DAIB. During
online finetuning, DAIB can be used in combination with
online RL to enable fast learning of novel connectors.
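The PyTorch sketch below illustrates the general shape of such a split representation. It is our own illustrative reconstruction from the description above and Fig. 1, not the authors' implementation; the module names, latent dimensions, and loss weighting are assumptions. Image features (e.g., from the ResNet-18 encoder shown in Fig. 1) are split into a domain-invariant latent z_I, trained adversarially against a domain classifier through a gradient-reversal layer, and a domain-specific Gaussian latent z_S, penalized with a KL information bottleneck.

```python
# Illustrative DAIB-style split representation (a sketch, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg()


class DAIB(nn.Module):
    def __init__(self, feat_dim=512, z_inv=32, z_spec=8, n_domains=50):
        super().__init__()
        self.inv_head = nn.Linear(feat_dim, z_inv)        # domain-invariant z_I
        self.spec_head = nn.Linear(feat_dim, 2 * z_spec)   # mean / log-var of z_S
        self.domain_clf = nn.Linear(z_inv, n_domains)      # adversary on z_I

    def forward(self, features, domain_labels=None):
        z_i = self.inv_head(features)
        mu, log_var = self.spec_head(features).chunk(2, dim=-1)
        z_s = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterize

        # Information bottleneck: KL(q(z_S | s) || N(0, I)) limits how much
        # domain-specific information can flow through z_S.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1).mean()

        # Domain-adversarial loss: the gradient-reversal layer trains the encoder
        # to fool this classifier, pushing z_I toward domain invariance.
        adv = torch.tensor(0.0)
        if domain_labels is not None:
            logits = self.domain_clf(GradReverse.apply(z_i))
            adv = F.cross_entropy(logits, domain_labels)

        z = torch.cat([z_i, z_s], dim=-1)   # consumed by reward, policy, Q, V heads
        return z, kl, adv
```

A downstream reward classifier, policy, or Q/V head would then consume the concatenated latent z, with the total objective adding the two regularizers to the task loss, e.g. L_task + λ_adv·adv + β·kl, where λ_adv and β are treated here as assumed hyperparameters.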
We present two main contributions. We demonstrate a
system for finetuning under realistic real-world constraints
with minimal human supervision, and apply it to insert connectors robustly from vision without the need for accurate socket localization, both for observations and rewards. To
accomplish this, we propose a novel representation learn-
ing method that allows better generalization of policies
and reward functions to unseen domains. We outperform
regression-based baselines on the same dataset that combine
localizing the socket with hand-designed control policies, as
well as prior RL methods. We show that the policy can be finetuned to new tasks within 200 trials (about 50 minutes of real-world interaction), given our off-policy dataset of 70,000 trajectories from 50 prior domains. This system allows us
to finetune IQL to a test connector, increasing performance
significantly over the offline performance. Project videos
and our dataset of robotic insertion will be made public at
sites.google.com/view/learningonthejob
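For readers unfamiliar with IQL, the sketch below spells out the standard IQL losses (expectile regression for V, TD learning for Q, and advantage-weighted policy extraction) with the self-generated reward r̂ plugged in where an instrumented reward would otherwise appear. It follows the generic formulation of Kostrikov et al. rather than the authors' modified variant, and the interfaces and hyperparameters (`q_net`, `v_net`, `policy.log_prob`, τ, β) are assumptions; termination flags and target networks are omitted for brevity.

```python
# Sketch of standard IQL losses with a self-generated reward r_hat (assumed
# interfaces; not the authors' modified implementation). In this paper's setup
# the networks would consume the DAIB representation of the image observation.
import torch
import torch.nn.functional as F


def expectile_loss(diff, tau=0.7):
    """Asymmetric L2 that fits V toward an upper expectile of Q."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()


def iql_losses(q_net, v_net, policy, batch, gamma=0.99, beta=3.0):
    s, a, r_hat, s_next = batch                  # r_hat from the learned reward classifier
    with torch.no_grad():
        q_val = q_net(s, a)                      # a target network would be used in practice
        q_target = r_hat + gamma * v_net(s_next)
    v_loss = expectile_loss(q_val - v_net(s))    # V regressed toward an expectile of Q
    q_loss = F.mse_loss(q_net(s, a), q_target)   # Q regressed to r_hat + gamma * V(s')
    adv = q_val - v_net(s).detach()
    weights = torch.clamp(torch.exp(beta * adv), max=100.0)
    pi_loss = -(weights * policy.log_prob(s, a)).mean()   # advantage-weighted regression
    return v_loss, q_loss, pi_loss
```

During online finetuning, the same gradient steps would simply continue on the robot's self-labeled online experience, optionally mixed with the offline data.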
II. RELATED WORK
Reinforcement learning has been applied to a variety of
robotics tasks [3]–[11]. To utilize offline datasets with diverse
data in robotics, algorithms developed for offline RL [12]–
[15] have been studied in the robotics setting [16]–[20]. A
subset of offline RL algorithms are amenable to finetuning
[14], [21]–[25]. Our work builds on the direction of offline
pretraining followed by online finetuning in robotics. But
beyond this line of work, we focus on finetuning from visual
input in realistic settings with multiple domains and without
ground truth reward functions for the new task.
In this respect, our work is closest to prior work on
self-supervised RL that does not assume an external reward
function and instead learns it from data. One class of self-
supervised RL methods uses goal-conditioned RL with self-
supervised rewards [26]–[36]. While general, this class of
methods is a poor fit for industrial insertion, as high precision
is required in both the policy and in evaluating rewards. In-
stead, we train a domain generalizing reward classifier from
prior data. Prior methods have used learned rewards [37], and classifier rewards have previously been proposed as a scalable solution for robotics tasks [38], [39]. However, learned rewards have not previously been shown to be useful for finetuning in novel real-world robotic domains. Because we
focus on applying offline RL and finetuning from vision in
the industrial insertion setting, domain generalization of the
reward function is vital for our method to work in practice.
Many aspects of robotic insertion, or peg-in-hole assembly, have been studied in prior work [40]–[45], often utilizing
geometry and dynamic analysis, force control, tactile sens-
ing, and search, but these methods can be brittle to state
estimation errors. Learning-based methods, including RL,
have also been applied, usually for a single connector from
ground-truth state information [46]–[48]. In these cases, the
RL algorithm must learn to navigate the specific dynamics of
the single connector, but does not generalize across connec-
tors. More recent work has considered using meta-learning
to generalize and improve few-shot between domains [49].
Zhao et al. use offline RL and finetuning combined with
meta-learning to adapt to a new connector [50]. This work
assumes a known position of the socket and consistent
grasping of the connector, and is robust to a small amount
(±1mm) of noise. With known socket position and small
error, the learning algorithm can learn a structured noise
or exploration strategy that can overcome these errors. In
contrast, we initialize connectors within ±20mm of the
socket (20× the variance), which requires the robot to rely on
visual feedback since blind exploration will rarely succeed.
Closest to our work is prior work that also uses pixel input
for robotic insertion. Luo et al. incorporate vision alongside
proprioception, using a VAE to embed pixel input [51].
InsertionNet uses a vision system to localize the object and
socket, operating on a "residual policy" which is learned