From e82feb7521cf5988076d307b9527bf7d547a5a02 Mon Sep 17 00:00:00 2001
From: Amit Parag <aparag@laas.fr>
Date: Mon, 21 Feb 2022 14:16:35 +0100
Subject: [PATCH] amit icra 22

---
 amit_icra_22.md | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100644 amit_icra_22.md

diff --git a/amit_icra_22.md b/amit_icra_22.md
new file mode 100644
index 0000000..c42af56
--- /dev/null
+++ b/amit_icra_22.md
@@ -0,0 +1,30 @@
+---
+title: "Value learning from trajectory optimization and Sobolev descent: A step toward reinforcement learning with superlinear convergence properties"
+subtitle: IEEE ICRA - International Conference on Robotics and Automation, 2022
+author:
+- Amit Parag ^1,2^
+- Sébastien Kleff ^1,3^
+- Léo Saci ^1^
+- <a href="https://gepettoweb.laas.fr/index.php/Members/NicolasMansard">Nicolas Mansard</a> ^1,2^
+- <a href="https://homepages.laas.fr/ostasse/drupal/">Olivier Stasse</a> ^1,2^
+org:
+- ^1^ Gepetto Team, LAAS-CNRS, France.
+- ^2^ Artificial and Natural Intelligence Toulouse Institute, Toulouse.
+- ^3^ New York University, USA.
+
+hal: https://hal.archives-ouvertes.fr/hal-03356261
+peertube: https://peertube.laas.fr/videos/embed/9a3c5258-e5b7-49a5-a153-02e804a06f65
+sourcecode: https://github.com/amitparag/Kuka-arm-DvP
+...
+
+## Abstract
+
+The recent successes of deep reinforcement learning largely rely on the capability to generate massive amounts of data, which in turn implies the use of a simulator.
+In particular, current progress in multibody dynamics simulators is underpinning the implementation of reinforcement learning for end-to-end control of robotic systems.
+Yet simulators are mostly treated as black boxes, even though we have the knowledge to make them produce richer information.
+In this paper, we propose to use the derivatives of the simulator to help the learning converge.
+For that, we rely on model-based trajectory optimization to produce informative trials using first- and second-order simulation derivatives.
+These locally optimal runs give fair estimates of the value function and its derivatives, which we use to accelerate the convergence of the critic through Sobolev learning.
+We empirically demonstrate that the algorithm leads to a faster and more accurate estimation of the value function.
+The resulting value estimate is used in a model-predictive controller as a proxy for shortening the preview horizon.
+We believe that this is also a first step toward superlinear reinforcement learning algorithms using simulation derivatives, which we need for end-to-end legged locomotion.
--
GitLab
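
The Sobolev learning step mentioned in the abstract, i.e. regressing the critic on both the value and the value derivatives returned by the trajectory optimizer, can be sketched roughly as below. This is a minimal PyTorch illustration, not code from the linked repository; the network shape, loss weighting, and variable names are assumptions.

```python
# Illustrative sketch (assumed, not from the patch): fit a value network to
# targets V*(x) and dV*/dx obtained from locally optimal trajectories.
import torch
import torch.nn as nn


class ValueNet(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)


def sobolev_loss(model, x, v_target, dvdx_target, grad_weight=1.0):
    """L2 loss on the value plus L2 loss on its gradient w.r.t. the state."""
    x = x.requires_grad_(True)
    v = model(x).squeeze(-1)
    # Gradient of the predicted value w.r.t. the state, kept in the graph
    # (create_graph=True) so the gradient mismatch can itself be back-propagated.
    dvdx = torch.autograd.grad(v.sum(), x, create_graph=True)[0]
    value_term = ((v - v_target) ** 2).mean()
    grad_term = ((dvdx - dvdx_target) ** 2).sum(dim=-1).mean()
    return value_term + grad_weight * grad_term


# Hypothetical usage with a batch of states, values, and value gradients
# collected from trajectory optimization:
#   model = ValueNet(state_dim=14)
#   loss = sobolev_loss(model, states, values, value_gradients)
#   loss.backward(); optimizer.step()
```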