Reinforcement Methods for Autonomous Online Learning of Optimal Robot Behaviors


slide-1
SLIDE 1

Reinforcement Methods for Autonomous Online Learning of Optimal Robot Behaviors

F.L. Lewis, Dan Popa
Automation & Robotics Research Institute (ARRI), The University of Texas at Arlington, USA

Guido Herrmann
Bristol Robotics Lab, University of Bristol, UK

Supported by: NSF (Paul Werbos), ARO (Sam Stanton), AFOSR (Fariba Fahroo)

Talk available online at http://ARRI.uta.edu/acs

slide-2
SLIDE 2

  • Optimal Control
  • Reinforcement Learning
  • Policy Iteration
  • Q Learning
  • Humanoid Robot Learning Control Using RL
  • Telerobotic Interface Learning Using RL

Invited by Rolf Johansson

slide-3
SLIDE 3

It is man's obligation to explore the most difficult questions in the clearest possible way and use reason and intellect to arrive at the best answer. Man's task is to understand patterns in nature and society. The first task is to understand the individual problem, then to analyze symptoms and causes, and only then to design treatment and controls.
Ibn Sina, 980-1037 (Avicenna)

slide-4
SLIDE 4

Importance of Feedback Control
  • Darwin: feedback and natural selection
  • Volterra: feedback and fish population balance
  • Adam Smith: feedback and the international economy
  • James Watt: feedback and the steam engine
  • Feedback and cell homeostasis

The resources available to most species for their survival are meager and limited

Nature uses Optimal control

slide-5
SLIDE 5

F.L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits & Systems Magazine, invited feature article, pp. 32-50, Third Quarter 2009; also IEEE Control Systems Magazine, to appear.

slide-6
SLIDE 6

  

Discrete-Time Optimal Control

System:
x_{k+1} = f(x_k) + g(x_k) u_k

Cost:
V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i) = r(x_k, u_k) + \sum_{i=k+1}^{\infty} r(x_i, u_i)

Example: r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k

Control policy: u_k = h(x_k) = the prescribed control input function. Example: linear state-variable feedback, u_k = -K x_k.

Bellman equation (the difference-equation equivalent of the cost):
V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}),  V_h(0) = 0
which for the quadratic cost reads
V_h(x_k) = x_k^T Q x_k + u_k^T R u_k + V_h(x_{k+1})

Bellman's principle gives the Bellman optimality equation (DT HJB):
V^*(x_k) = \min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)
h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)

Optimal control:
u(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, \partial V^*(x_{k+1}) / \partial x_{k+1}

Off-line solution; the dynamics must be known.

slide-7
SLIDE 7

DT Optimal Control – Linear Systems, Quadratic Cost (LQR)

System: x_{k+1} = A x_k + B u_k

Cost: V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big)

Fact: for any stabilizing state feedback u_k = -L x_k the cost is quadratic,
V(x_k) = x_k^T P x_k
for some symmetric matrix P.

HJB = DT Riccati equation:
0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A

Optimal control: u_k = -L x_k, with L = (R + B^T P B)^{-1} B^T P A
Optimal cost: V^*(x_k) = x_k^T P x_k

Off-line solution; the dynamics must be known.
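For concreteness, a minimal offline sketch of this design in Python, using SciPy's solve_discrete_are; the small A, B, Q, R below are placeholders, not the aircraft example that appears later in the talk.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Placeholder system: any stabilizable (A, B) with Q >= 0, R > 0 will do.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.eye(1)

# Offline LQR design: requires full knowledge of A and B.
P = solve_discrete_are(A, B, Q, R)                 # DT algebraic Riccati equation
L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # L = (R + B'PB)^-1 B'PA
# Optimal policy u_k = -L x_k, optimal cost V*(x_k) = x_k' P x_k
```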

slide-8
SLIDE 8

Different methods of learning

Supervised learning Unsupervised learning Reinforcement learning

Machine learning- the formal study of learning systems

Synthesis of

  • Computational intelligence
  • Control systems
  • Neurobiology

We want robot controllers that learn optimal control solutions online in real-time

slide-9
SLIDE 9

Different methods of learning

[Block diagram: an adaptive learning system applies control inputs to the system/environment, observes the outputs and a reinforcement signal, and a critic compares them with the desired performance to tune the actor.]

Reinforcement learning Ivan Pavlov 1890s Actor-Critic Learning

We want OPTIMAL performance

  • ADP- Approximate Dynamic Programming
slide-10
SLIDE 10

  

System: x_{k+1} = f(x_k) + g(x_k) u_k
Cost: V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i)

Bellman equation: V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})
(difference-equation equivalent: V_h(x_k) = x_k^T Q x_k + u_k^T R u_k + V_h(x_{k+1}))

Bellman's principle gives the Bellman optimality equation (DT HJB):
V^*(x_k) = \min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big),  h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)
Optimal control: u(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, \partial V^*(x_{k+1}) / \partial x_{k+1}

Focus on these two eqs.

RL Policy Iterations to Solve Optimal Control Problem

slide-11
SLIDE 11

The Bellman equation can be interpreted as a consistency condition that must be satisfied by the value function at each time stage. It expresses a relation between the current value of being in state x and the value of being in the next state x', given that policy h is followed, and it captures the action, observation, evaluation, and improvement mechanisms of reinforcement learning.

Bellman Equation

Temporal Difference Idea

V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})

e_k = -V_h(x_k) + r(x_k, h(x_k)) + V_h(x_{k+1})

slide-12
SLIDE 12

Policy Evaluation and Policy Improvement

Policy Evaluation by Bellman Equation: Policy Improvement: consider algorithms that repeatedly interleave the two procedures:

Policy evaluation (Bellman equation): V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})
Policy improvement: h'(x_k) = \arg\min_u \big( r(x_k, u) + V_h(x_{k+1}) \big)
The new policy h'(x_k) is said to be greedy with respect to the value function V_h(x), and it satisfies V_{h'}(x) \le V_h(x).

At each step, one obtains a policy that is no worse than the previous policy. Can prove convergence under fairly mild conditions to the optimal value and optimal policy. Most such proofs are based on the Banach Fixed Point Theorem. One step is a contraction map.

There is a large family of algorithms that implement the policy evaluation and policy improvement procedures in various ways

Policy improvement makes V_{h'}(x) \le V_h(x). For the affine system the improved policy is
h'(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, \partial V_h(x_{k+1}) / \partial x_{k+1}
(Bertsekas and Tsitsiklis 1996; Sutton and Barto 1998).

slide-13
SLIDE 13

DT Policy Iteration to solve HJB

The cost of any given control policy h(x_k) satisfies the recursion (Bellman equation; a recursive form and consistency equation):
V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})

Recursive solution: pick a stabilizing initial control, then repeat:
Policy evaluation (solve the Bellman equation): V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})
Policy improvement: h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V_{j+1}(x_{k+1}) \big)

f(.) and g(.) do not appear. Howard (1960) proved convergence for MDPs.
The policy h_{j+1}(x_k) is said to be greedy with respect to the value function V_{j+1}(x). At each step, one obtains a policy that is no worse than the previous policy. Convergence to the optimal value and optimal policy can be proved under fairly mild conditions; most such proofs are based on the Banach fixed-point theorem, since one step is a contraction map.

(Bertsekas and Tsitsiklis 1996, Sutton and Barto 1998).

slide-14
SLIDE 14

Methods to implement Policy Iteration:
  • Exact computation: needs the full system dynamics
  • Temporal difference: for robot trajectory following
  • Monte Carlo learning: for learning episodic robot tasks

slide-15
SLIDE 15

DT Policy Iteration – Linear Systems, Quadratic Cost (LQR)

System: x_{k+1} = A x_k + B u_k = (A - B L) x_k under the policy u_k = -L x_k
For any stabilizing policy the LQR value is quadratic:
V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T(x_i) R u_i(x_i) \big) = x_k^T P x_k

DT policy iteration:
Policy evaluation: V_{j+1}(x_k) = x_k^T Q x_k + u_j^T(x_k) R u_j(x_k) + V_{j+1}(x_{k+1})
Policy improvement: u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, dV_{j+1}(x_{k+1}) / dx_{k+1}

Equivalent to an underlying problem, the DT LQR: policy evaluation solves the DT Lyapunov equation
(A - B L_j)^T P_{j+1} (A - B L_j) - P_{j+1} + Q + L_j^T R L_j = 0
and policy improvement gives L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A.

Hewer proved convergence in 1971. Policy iteration solves the Lyapunov equation WITHOUT knowing the system dynamics A and B.
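A minimal offline sketch of this Hewer-type iteration, assuming A and B are known so each Lyapunov equation can be solved exactly with SciPy; the online, model-free variant on the following slides replaces the Lyapunov solve with measured data.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_policy_iteration(A, B, Q, R, L0, iters=20):
    """Hewer-type policy iteration: Lyapunov solve, then gain update."""
    L = L0                                   # must be stabilizing: rho(A - B L0) < 1
    for _ in range(iters):
        Ac = A - B @ L
        # Policy evaluation: (A-BL)' P (A-BL) - P + Q + L'RL = 0
        P = solve_discrete_lyapunov(Ac.T, Q + L.T @ R @ L)
        # Policy improvement: L = (R + B'PB)^-1 B'PA
        L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, L
```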

slide-16
SLIDE 16

DT Policy Iteration – How to implement it online? (Linear systems, quadratic cost: LQR)

Policy evaluation: V_{j+1}(x_k) = x_k^T Q x_k + u_j^T(x_k) R u_j(x_k) + V_{j+1}(x_{k+1})
The LQR cost is quadratic, V_j(x) = x^T P_j x for some matrix P_j, so the evaluation step reads
x_k^T P_{j+1} x_k = x_k^T Q x_k + u_j^T R u_j + x_{k+1}^T P_{j+1} x_{k+1}
This solves the Lyapunov equation without knowing A and B.

Quadratic basis set: write x^T P x = W^T \phi(x); e.g. for two states
x^T P x = p_{11} x_1^2 + 2 p_{12} x_1 x_2 + p_{22} x_2^2 = [p_{11}  2p_{12}  p_{22}] [x_1^2  x_1 x_2  x_2^2]^T
so the evaluation step becomes a linear regression for the weights W_{j+1}:
W_{j+1}^T \big( \phi(x_k) - \phi(x_{k+1}) \big) = x_k^T Q x_k + u_j^T(x_k) R u_j(x_k)

Then update the control using
h_{j+1}(x_k) = -L_{j+1} x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A x_k
Need to know A AND B for the control update.
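A minimal sketch of that regression for a two-state example, assuming we only log (x_k, u_k, x_{k+1}) triples under the current policy; phi is the quadratic basis named above, and no A or B is used in the evaluation.

```python
import numpy as np

def phi(x):
    """Quadratic basis [x1^2, x1*x2, x2^2] for a two-state system."""
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def evaluate_policy_ls(X, U, Xnext, Q, R):
    """Batch least-squares policy evaluation from logged data:
       W'(phi(x_k) - phi(x_{k+1})) = x_k'Qx_k + u_k'Ru_k."""
    Phi = np.array([phi(x) - phi(xn) for x, xn in zip(X, Xnext)])
    d = np.array([x @ Q @ x + u @ R @ u for x, u in zip(X, U)])
    W, *_ = np.linalg.lstsq(Phi, d, rcond=None)
    # Recover P from the weights: diagonal terms direct, cross term halved.
    P = np.array([[W[0], W[1] / 2], [W[1] / 2, W[2]]])
    return P
```

The logged data must be persistently exciting, as noted a few slides later.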

 

  

slide-17
SLIDE 17

Implementation- DT Policy Iteration Nonlinear Case

Value Function Approximation (VFA)

V(x) = W^T \phi(x)
with basis functions \phi(x) and weights W.

LQR case: V(x) is quadratic,
V(x) = x^T P x = W^T \phi(x)
with quadratic basis functions \phi(x) and W^T = [p_{11}  2p_{12}  p_{22}  ...].

Nonlinear system case: use a neural network for \phi(x).
slide-18
SLIDE 18

Implementation – DT Policy Iteration

Value function update for a given control (Bellman equation):
V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})
VFA: V_j(x_k) = W_j^T \phi(x_k)

Assume measurements of x_k and x_{k+1} are available to compute u_{k+1}. Then the Bellman equation becomes a regression for the weights,
W_{j+1}^T \big( \phi(x_k) - \phi(x_{k+1}) \big) = r(x_k, h_j(x_k))
with regression vector \phi(x_k) - \phi(x_{k+1}). Solve for the weights in real time using RLS, or batch LS over many trajectories with different initial conditions over a compact set.

Then update the control using
u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, dV_{j+1}(x_{k+1}) / dx_{k+1} = -\tfrac{1}{2} R^{-1} g^T(x_k) \nabla\phi^T(x_{k+1}) W_{j+1}
Need to know g(x_k) for the control update. Since x_{k+1} is measured, knowledge of f(x) or g(x) is NOT needed for the value-function update.

slide-19
SLIDE 19

Policy evaluation: V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})
Policy improvement: u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, dV_{j+1}(x_{k+1}) / dx_{k+1}

  • 1. Select a control policy
  • 2. Find the associated cost (policy evaluation)
  • 3. Improve the control (policy improvement)

Online loop: at time k, apply u_k; observe x_k, the cost r_k, and x_{k+1}; update V by solving
W_{j+1}^T \big( \phi(x_k) - \phi(x_{k+1}) \big) = r(x_k, h_j(x_k))
until convergence to V_{j+1}; then update the control to u_{j+1}.

Needs about 10 lines of MATLAB code. This is direct optimal adaptive control: it solves the Lyapunov equation without knowing the dynamics.

slide-20
SLIDE 20

Persistence of Excitation

Regression vector must be PE

 

W_{j+1}^T \big( \phi(x_k) - \phi(x_{k+1}) \big) = r(x_k, h_j(x_k))

slide-21
SLIDE 21

The Adaptive Critic Architecture: the action network implements the control policy h_j(x_k) applied to the system; the critic network performs policy evaluation from the observed cost.

Value update (critic): V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})
Control policy update (actor): h_{j+1}(x_k) = \arg\min_{u} \big( r(x_k, u) + V_{j+1}(x_{k+1}) \big)

This leads to an ONLINE, FORWARD-IN-TIME implementation of optimal control.

Optimal Adaptive Control

Use RLS until convergence
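A minimal sketch of this online actor-critic loop with an RLS critic; step(), phi(), improve(), the probing-noise level, and the loop lengths are placeholder assumptions, not the authors' implementation.

```python
import numpy as np

def rls_step(W, S, reg, target, lam=0.99):
    """One recursive least-squares update of the critic weights W (covariance S)."""
    K = S @ reg / (lam + reg @ S @ reg)
    W = W + K * (target - reg @ W)
    S = (S - np.outer(K, reg @ S)) / lam
    return W, S

def adaptive_critic(step, phi, improve, x, L, W, n_collect=200, n_iters=10):
    """Online policy iteration: RLS critic (slower evaluation loop) wrapped
       around the real-time control loop; step(x, u) -> (x_next, cost) is the
       unknown plant, improve(W) returns the greedy gain for the critic weights W."""
    S = 100.0 * np.eye(len(W))
    for _ in range(n_iters):
        for _ in range(n_collect):
            u = -L @ x + 0.01 * np.random.randn(L.shape[0])   # probing noise for PE
            x_next, r = step(x, u)
            reg = phi(x) - phi(x_next)          # regression vector
            W, S = rls_step(W, S, reg, r)       # value (critic) update
            x = x_next
        L = improve(W)                          # policy improvement from the critic
    return W, L
```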

slide-22
SLIDE 22

Adaptive Control

[Adaptive control loop: plant, controller, output.]

  • Identify the controller: direct adaptive control
  • Identify the system model: indirect adaptive control
  • Identify the performance value V(x) = W^T \phi(x): optimal adaptive control

slide-23
SLIDE 23
slide-24
SLIDE 24

Greedy Value Function Update – Approximate Dynamic Programming
Value Iteration = Heuristic Dynamic Programming (HDP) (Paul Werbos)

Policy Iteration:
Policy evaluation: V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})   (a Lyapunov equation; an initial stabilizing control is needed)
Policy improvement: h_{j+1}(x_k) = \arg\min_{u} \big( r(x_k, u) + V_{j+1}(x_{k+1}) \big)
For the LQR, the underlying recursion (Hewer 1971) is
(A - B L_j)^T P_{j+1} (A - B L_j) - P_{j+1} + Q + L_j^T R L_j = 0

Value Iteration:
Value update: V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})   (a simple recursion; an initial stabilizing control is NOT needed)
Policy improvement: h_{j+1}(x_k) = \arg\min_{u} \big( r(x_k, u) + V_{j+1}(x_{k+1}) \big)
For the LQR, the underlying recursion is
P_{j+1} = (A - B L_j)^T P_j (A - B L_j) + Q + L_j^T R L_j
(Lancaster & Rodman proved convergence.)

Having two occurrences of the cost (V_{j+1} on the left, V_j on the right) is what allows the definition of the greedy update.
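A minimal sketch of this LQR value-iteration (HDP) recursion; A and B appear here only to display the underlying recursion, and note the initial gain L = 0 need not be stabilizing.

```python
import numpy as np

def lqr_value_iteration(A, B, Q, R, iters=200):
    """Greedy (HDP) update: P_{j+1} = (A-BL_j)' P_j (A-BL_j) + Q + L_j'RL_j."""
    n, m = B.shape
    P = np.zeros((n, n))
    L = np.zeros((m, n))      # initial gain need not stabilize the plant
    for _ in range(iters):
        Ac = A - B @ L
        P = Ac.T @ P @ Ac + Q + L.T @ R @ L                    # simple recursion
        L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)      # greedy policy
    return P, L
```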

slide-25
SLIDE 25

Compare Value Iteration to Dynamic Programming.

Value iteration (forward in time):
V_{j+1}(x) = \sum_{u} \pi(x, u) \sum_{x'} P^u_{xx'} \big[ R^u_{xx'} + \gamma V_j(x') \big]

Dynamic programming (backward in time):
V_k(x) = \sum_{u} \pi(x, u) \sum_{x'} P^u_{xx'} \big[ R^u_{xx'} + \gamma V_{k+1}(x') \big]

One can interpret V_j(x') as an approximation or estimate of the future stage cost-to-go from the future state x'.

slide-26
SLIDE 26

A problem with DT Policy Iteration and VI

Assume measurements of x_k and x_{k+1} are available to compute u_{k+1}. Then:

Policy evaluation (VFA: V_j(x_k) = W_j^T \phi(x_k)):
W_{j+1}^T \big( \phi(x_k) - \phi(x_{k+1}) \big) = r(x_k, h_j(x_k))
Since x_{k+1} is measured, knowledge of f(x) or g(x) is not needed for the value-function update.

Policy improvement:
u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, dV_{j+1}(x_{k+1}) / dx_{k+1}
LQR case: h_{j+1}(x_k) = -L_{j+1} x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A x_k
Need to know f(x_k) AND g(x_k) (A and B in the LQR case) for the control update.

Easy to fix: use two neural networks (a critic and an actor).

slide-27
SLIDE 27

Standard Neural Network VFA for On-Line Implementation (HDP)

System: x_{k+1} = f(x_k) + g(x_k) u(x_k)
Critic NN for the value: \hat V_i(x_k, W_{Vi}) = W_{Vi}^T \phi(x_k)
Action NN for the control: \hat u_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k)   (a 2-layer NN can also be used)

HDP (value iteration):
V_{i+1}(x_k) = x_k^T Q x_k + u_i^T R u_i + V_i(x_{k+1})
u_i(x_k) = \arg\min_u \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big)

Define the target cost function
d(\phi(x_k), W_{Vi}, W_{ui}) = x_k^T Q x_k + \hat u_i^T(x_k) R \hat u_i(x_k) + \hat V_i(x_{k+1}) = x_k^T Q x_k + \hat u_i^T(x_k) R \hat u_i(x_k) + W_{Vi}^T \phi(x_{k+1})

Critic NN update: the cost equation is explicit, so use LS,
W_{V,i+1} = \arg\min_W \int \big| W^T \phi(x_k) - d(\phi(x_k), W_{Vi}, W_{ui}) \big|^2 dx
          = \Big( \int \phi(x_k) \phi^T(x_k)\, dx \Big)^{-1} \int \phi(x_k)\, d(\phi(x_k), W_{Vi}, W_{ui})\, dx
or RLS,
W_{V,i+1}^{(m+1)} = W_{V,i+1}^{(m)} + \alpha_m \phi(x_k) \big( r(x_k, u_k) + W_{Vi}^T \phi(x_{k+1}) - (W_{V,i+1}^{(m)})^T \phi(x_k) \big)

Action NN update: the DT control equation is implicit, so use gradient descent (backpropagation, P. Werbos),
W_{ui} = \arg\min_W \big( x_k^T Q x_k + \hat u^T(x_k, W) R \hat u(x_k, W) + \hat V_i( f(x_k) + g(x_k) \hat u(x_k, W) ) \big)
W_{ui}^{(j+1)} = W_{ui}^{(j)} - \beta \, \partial \big( x_k^T Q x_k + \hat u_{i(j)}^T R \hat u_{i(j)} + \hat V_i(x_{k+1}) \big) / \partial W_{ui}^{(j)}
             = W_{ui}^{(j)} - \beta \, \sigma(x_k) \big( 2 R \hat u_{i(j)}(x_k) + g^T(x_k) \nabla\phi^T(x_{k+1}) W_{Vi} \big)^T

Asma Al-Tamimi & F. Lewis

slide-28
SLIDE 28

Interesting Fact for HDP for Nonlinear Systems

The action NN \hat u_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k) is tuned by gradient descent on the implicit DT control equation,
W_{ui}^{(j+1)} = W_{ui}^{(j)} - \beta \, \sigma(x_k) \big( 2 R \hat u_{i(j)}(x_k) + g^T(x_k) \nabla\phi^T(x_{k+1}) W_{Vi} \big)^T
Note that the state drift dynamics f(x_k) is NOT needed, since:
  • 1. an NN approximation for the action is used, and
  • 2. x_{k+1} is measured in the training phase.
Only g(.) is needed. In the linear case, the policy update
h_j(x_k) = -L_j x_k = -(R + B^T P_j B)^{-1} B^T P_j A x_k
must know the system A and B matrices; here the information about A is stored in the action NN.

slide-29
SLIDE 29

Discrete-time nonlinear HJB solution using Approximate dynamic programming

  • Simulation Example 1
  • Linear system – Aircraft longitudinal dynamics
  • The HJB, i.e. ARE, Solution

A = [  1.0722   0.0954   0       -0.0541  -0.0153
       4.1534   1.1175   0       -0.8000  -0.1010
       0.1359   0.0071   1.0      0.0039   0.0097
       0        0        0        0.1353   0
       0        0        0        0        0.1353 ]

B = [ -0.0453  -0.0175
      -1.0042  -0.1131
       0.0075   0.0134
       0.8647   0
       0        0.8647 ]

Unstable, two-input system.

ARE: 0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A

P = [ 55.8348   7.6670  16.0470  -4.6754  -0.7265
       7.6670   2.3168   1.4987  -0.8309  -0.1215
      16.0470   1.4987  25.3586  -0.6709   0.0464
      -4.6754  -0.8309  -0.6709   1.5394   0.0782
      -0.7265  -0.1215   0.0464   0.0782   1.0240 ]

L = [ -4.1136  -0.7170  -0.3847   0.5277   0.0707
      -0.6315  -0.1003   0.1236   0.0653   0.0798 ]

slide-30
SLIDE 30

Discrete-time nonlinear HJB solution using Approximate dynamic programming

  • Simulation
  • The Cost function approximation – quadratic basis set
  • The Policy approximation – linear basis set

Policy (action) approximation, linear basis set:
\hat u_i(x_k) = W_{ui}^T \sigma(x_k),  \sigma(x) = [x_1  x_2  x_3  x_4  x_5]^T
W_u = [ w^u_{11}  w^u_{12}  w^u_{13}  w^u_{14}  w^u_{15}
        w^u_{21}  w^u_{22}  w^u_{23}  w^u_{24}  w^u_{25} ]

Cost (value) approximation, quadratic basis set:
\hat V_{i+1}(x_k, W_{V,i+1}) = W_{V,i+1}^T \bar\phi(x_k)
\bar\phi(x) = [x_1^2  x_1 x_2  x_1 x_3  x_1 x_4  x_1 x_5  x_2^2  x_2 x_3  x_2 x_4  x_2 x_5  x_3^2  x_3 x_4  x_3 x_5  x_4^2  x_4 x_5  x_5^2]^T
W_V = [w_{V1}  w_{V2}  ...  w_{V15}]^T

slide-31
SLIDE 31

Discrete-time nonlinear HJB solution using Approximate dynamic programming

  • Simulation

The convergence of the cost

Converged critic weights:
W_V = [55.5411, 15.2789, 31.3032, -9.3255, -1.4536, 2.3142, 2.9234, -1.6594, -0.2430, 24.8262, -1.3076, 0.0920, 1.5388, 0.1564, 1.0240]^T

Relation between the weights and the Riccati matrix (diagonal entries direct, off-diagonal entries halved):
P = [ w_{V1}      0.5 w_{V2}   0.5 w_{V3}   0.5 w_{V4}   0.5 w_{V5}
      0.5 w_{V2}  w_{V6}       0.5 w_{V7}   0.5 w_{V8}   0.5 w_{V9}
      0.5 w_{V3}  0.5 w_{V7}   w_{V10}      0.5 w_{V11}  0.5 w_{V12}
      0.5 w_{V4}  0.5 w_{V8}   0.5 w_{V11}  w_{V13}      0.5 w_{V14}
      0.5 w_{V5}  0.5 w_{V9}   0.5 w_{V12}  0.5 w_{V14}  w_{V15} ]

Actual ARE solution:
P = [ 55.8348   7.6670  16.0470  -4.6754  -0.7265
       7.6670   2.3168   1.4987  -0.8309  -0.1215
      16.0470   1.4987  25.3586  -0.6709   0.0464
      -4.6754  -0.8309  -0.6709   1.5394   0.0782
      -0.7265  -0.1215   0.0464   0.0782   1.0240 ]
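A small sketch of that weight-to-matrix mapping (quadratic-basis weights in upper-triangular order, off-diagonal entries halved), applied to the converged weights quoted above.

```python
import numpy as np

def weights_to_P(w, n=5):
    """Map quadratic-basis weights [x1^2, x1x2, ..., x5^2] back to symmetric P."""
    P = np.zeros((n, n))
    idx = 0
    for i in range(n):
        for j in range(i, n):
            P[i, j] = w[idx] if i == j else 0.5 * w[idx]
            P[j, i] = P[i, j]
            idx += 1
    return P

W_V = np.array([55.5411, 15.2789, 31.3032, -9.3255, -1.4536, 2.3142, 2.9234,
                -1.6594, -0.2430, 24.8262, -1.3076, 0.0920, 1.5388, 0.1564, 1.0240])
print(weights_to_P(W_V))   # close to the ARE solution P quoted above
```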

slide-32
SLIDE 32

Discrete-time nonlinear HJB solution using Approximate dynamic programming

  • Simulation

The convergence of the control policy

Converged actor weights:
W_u = [ 4.1068   0.7164   0.3756  -0.5274  -0.0707
        0.6330   0.1005  -0.1216  -0.0653  -0.0798 ]
(the rows of W_u correspond to -L, i.e. L_{ij} = -w^u_{ij})

Actual optimal control, from the ARE 0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A:
L = [ -4.1136  -0.7170  -0.3847   0.5277   0.0707
      -0.6315  -0.1003   0.1236   0.0653   0.0798 ]

Note: in this example the drift dynamics matrix A is NOT needed; the Riccati equation is solved online without knowing the A matrix.

slide-33
SLIDE 33

Issues with Nonlinear ADP

Batch LS solution for the critic NN update:
W_{V,i+1} = \Big( \int \phi(x_k) \phi^T(x_k)\, dx \Big)^{-1} \int \phi(x_k)\, d(\phi(x_k), W_{Vi}, W_{ui})\, dx
The integral is over a region of state space and is approximated using a set of sample points.

Selection of the NN training set: a set of points over a region vs. points along a trajectory.
Exploitation (optimal regulation) vs. exploration: for linear systems these are the same under a PE condition.
Taking sample points along a single trajectory leads to recursive least squares (RLS):
W_{V,i+1}^{(m+1)} = W_{V,i+1}^{(m)} + \alpha_m \phi(x_k) \big( r(x_k, u_k) + W_{Vi}^T \phi(x_{k+1}) - (W_{V,i+1}^{(m)})^T \phi(x_k) \big)

PE allows local smooth solution of Bellman eq.

slide-34
SLIDE 34
slide-35
SLIDE 35

The Adaptive Critic Architecture: the action network implements the control policy h_j(x_k) applied to the system; the critic network performs policy evaluation from the observed cost.

Value update (critic): V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})
Control policy update (actor): h_{j+1}(x_k) = \arg\min_{u} \big( r(x_k, u) + V_{j+1}(x_{k+1}) \big)

This leads to an ONLINE, FORWARD-IN-TIME implementation of optimal control.

Optimal Adaptive Control

Use RLS until convergence.

REAL-TIME CONTROL LOOP SLOWER EVALUATION LOOP

slide-36
SLIDE 36

Motor control 200 Hz

Oscillation is a fundamental property of neural tissue

The brain has multiple adaptive clocks with different timescales: theta rhythms (4-10 Hz; hippocampus, thalamus) for sensory processing, memory, and voluntary control of movement, and gamma rhythms (30-100 Hz; hippocampus and neocortex) for high cognitive activity.

  • consolidation of memory
  • spatial mapping of the environment – place cells

The high-frequency processing is due to the large amount of sensory data to be processed. Spinal cord.

  • D. Vrabie, F. Lewis, and Dr. Dan Levine- RL for Continuous-Time Systems
slide-37
SLIDE 37

Doya, Kimura, Kawato 2001

Il Hong Suh- Gregory Popper Skinner Darwin

Limbic system

slide-38
SLIDE 38

Cerebral cortex Motor areas Thalamus Basal ganglia Cerebellum Brainstem Spinal cord Interoceptive receptors Exteroceptive receptors Muscle contraction and movement Summary of Motor Control in the Human Nervous System

reflex Supervised learning Reinforcement Learning- dopamine (eye movement)

inferior olive

Hippocampus

Unsupervised learning Limbic System Motor control 200 Hz theta rhythms 4-10 Hz picture by E. Stingu

  • D. Vrabie

Memory functions Long term Short term

Hierarchy of multiple parallel loops

gamma rhythms 30-100 Hz,

slide-39
SLIDE 39

Adaptive Critic structure

Theta waves 4-8 Hz Reinforcement learning

Motor control 200 Hz

slide-40
SLIDE 40

Cerebral cortex Motor areas Thalamus Basal ganglia Cerebellum Brainstem Spinal cord Interoceptive receptors Exteroceptive receptors

Muscle contraction and movement

inferior olive

Hippocampus

gamma rhythms 30-100 Hz

Intense processing due to the large amount of information to be processed

Cognitive map of the environment

  • place cells -

theta rhythms 4-10 Hz

Behavior reference Information sent to the lower processing levels

Motor control 200 Hz theta rhythms 4-10 Hz

slide-41
SLIDE 41

Cerebral cortex Motor areas Thalamus Basal ganglia Cerebellum Brainstem Spinal cord Interoceptive receptors Exteroceptive receptors

Muscle contraction and movement

inferior olive

Hippocampus

theta rhythms 4-10 Hz

Behavior reference Information Critic information sent to the Actor

Actor

Cognitive map of the environment

  • place cells -

Critic

Motor control 200 Hz

Control signal

slide-42
SLIDE 42

Different methods of learning

Supervised learning Unsupervised learning Reinforcement learning

Machine learning- the formal study of learning systems

Synthesis of

  • Computational intelligence
  • Control systems
  • Neurobiology
slide-43
SLIDE 43
slide-44
SLIDE 44

Four ADP Methods proposed by Paul Werbos

Critic NN approximates:
  • Heuristic dynamic programming (HDP): the value V(x_k)
  • Dual heuristic programming (DHP): the value gradient (costate) \partial V / \partial x
  • AD heuristic dynamic programming (ADHDP) (Watkins Q learning): the Q function Q(x_k, u_k)
  • AD dual heuristic programming (ADDHP): the gradients \partial Q / \partial x, \partial Q / \partial u
An action NN approximates the control. Bertsekas: neurodynamic programming. Barto & Bradtke: Q-learning proof (imposed a settling time).

Adaptive (Approximate) Dynamic Programming

Value Iteration

slide-45
SLIDE 45

Q Learning- Watkins

Define the Q function for policy h:
Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})
where u_k is arbitrary and the policy h(.) is used after time k (timeline: at time k the state is x_k, u_k is applied, x_{k+1} is observed, and h(x) is used from then on).
Note Q_h(x_k, h(x_k)) = V_h(x_k), and the value-function recursion for the given policy h(x_k) is V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}).

Bellman equation for Q:
Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))

Simple expression of Bellman's principle:
V^*(x_k) = \min_{u_k} Q^*(x_k, u_k),  h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)

This is Action-Dependent ADP (Paul Werbos): optimal adaptive control for completely unknown DT systems.

slide-46
SLIDE 46

Q Function Definition

Specify a control policy u_j = h(x_j) for j = k+1, k+2, ..., with u_k arbitrary (the policy h(.) is used after time k).
Define the Q function: Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}); note Q_h(x_k, h(x_k)) = V_h(x_k).

Bellman equation for Q:
Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))

Optimal Q function:
Q^*(x_k, u_k) = r(x_k, u_k) + V^*(x_{k+1}) = r(x_k, u_k) + Q^*(x_{k+1}, h^*(x_{k+1}))

Optimal control solution:
V^*(x_k) = Q^*(x_k, h^*(x_k)) = \min_h Q^*(x_k, h(x_k)) = \min_{u_k} Q^*(x_k, u_k)
h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k) = \arg\min_h Q^*(x_k, h(x_k))
which is a simple expression of Bellman's principle.

slide-47
SLIDE 47

Q Learning does not need to know f(xk) or g(xk)

For the LQR, with system x_{k+1} = A x_k + B u_k and quadratic value V(x) = x^T P x = W^T \phi(x):
Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)
so Q is quadratic in x and u:
Q_h(x_k, u_k) = [x_k; u_k]^T [ Q + A^T P A   A^T P B
                               B^T P A       R + B^T P B ] [x_k; u_k]
             = [x_k; u_k]^T [ H_xx  H_xu
                              H_ux  H_uu ] [x_k; u_k]
The control update is found from \partial Q / \partial u_k = 0:
2 [ (R + B^T P B) u_k + B^T P A x_k ] = 2 [ H_uu u_k + H_ux x_k ] = 0
so
u_k = -(R + B^T P B)^{-1} B^T P A x_k = -H_uu^{-1} H_ux x_k = -L x_k
The control is found from the Q function alone: A and B are not needed.
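A small numeric sketch of this structure; A, B, P, Q, R are used here only to display the kernel H, whereas Q learning identifies H directly from data, after which the gain needs only H.

```python
import numpy as np

def q_kernel(A, B, P, Q, R):
    """Quadratic Q-function kernel H for the LQR: Q(x,u) = [x;u]' H [x;u]."""
    H_xx = Q + A.T @ P @ A
    H_xu = A.T @ P @ B
    H_ux = B.T @ P @ A
    H_uu = R + B.T @ P @ B
    return np.block([[H_xx, H_xu], [H_ux, H_uu]])

def gain_from_H(H, n):
    """Greedy policy directly from the Q kernel: L = H_uu^{-1} H_ux (no A, B needed)."""
    H_ux, H_uu = H[n:, :n], H[n:, n:]
    return np.linalg.solve(H_uu, H_ux)
```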

 

slide-48
SLIDE 48

Q Learning– Action Dependent HDP – Paul Werbos

Bradtke & Barto (1994) proved convergence for LQR Q function for any given control policy h(xk) satisfies the Bellman equation Policy Iteration Using Q Function- Recursive solution to HJB Pick stabilizing initial control policy Find Q function Update control

The Q function of any given control policy h(x_k) satisfies the Bellman equation
Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))

Policy iteration using the Q function (recursive solution of the HJB): pick a stabilizing initial control policy, then repeat:
Find the Q function: Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, h_j(x_{k+1}))
Update the control: h_{j+1}(x_k) = \arg\min_{u} Q_{j+1}(x_k, u)

Now f(x_k, u_k) is not needed.

slide-49
SLIDE 49

Implementation – DT Q-Function Policy Iteration

Q-function update for the given control policy u_k = -L_j x_k:
Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, -L_j x_{k+1})
QFA (Q-function approximation): Q(x, u) = W^T \phi(x, u); for the LQR, \phi(x, u) is a quadratic basis in (x, u). Now u is an input to the NN (Werbos: action-dependent NN).
Assume measurements of u_k, x_k and x_{k+1} are available to compute u_{k+1}. Then the update becomes a regression for the weights,
W_{j+1}^T \big( \phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1}) \big) = r(x_k, u_k)
with regression vector \phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1}). Solve for the weights using RLS or backpropagation. Since x_{k+1} is measured in the training phase, knowledge of f(x) or g(x) is not needed for the value-function update.

Bradtke and Barto
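A minimal sketch of that regression for the LQR case, assuming logged (x_k, u_k, x_{k+1}) data collected under the gain L_j with probing noise; quad_basis is a hypothetical helper building the quadratic basis of the stacked vector [x; u].

```python
import numpy as np

def quad_basis(z):
    """Quadratic basis: upper-triangular terms z_i z_j of the stacked vector z = [x; u]."""
    return np.array([z[i] * z[j] for i in range(len(z)) for j in range(i, len(z))])

def q_policy_evaluation(X, U, Xnext, L, Q, R):
    """Least-squares solve of W'(phi(x,u) - phi(x',-L x')) = x'Qx + u'Ru from data."""
    rows, targets = [], []
    for x, u, xn in zip(X, U, Xnext):
        un = -L @ xn                              # current policy at the next state
        rows.append(quad_basis(np.concatenate([x, u]))
                    - quad_basis(np.concatenate([xn, un])))
        targets.append(x @ Q @ x + u @ R @ u)
    W, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return W   # W parameterizes the Q kernel H; the new gain is -H_uu^{-1} H_ux
```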

slide-50
SLIDE 50

Q Policy Iteration (Bradtke, Ydstie, Barto): model-free policy iteration.
Q-function update: Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, -L_j x_{k+1})
in regression form W_{j+1}^T \big( \phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1}) \big) = r(x_k, u_k).
Control policy update: h_{j+1}(x_k) = \arg\min_u Q_{j+1}(x_k, u), i.e. u_{j+1} = -H_uu^{-1} H_ux x_k = -L_{j+1} x_k.
A stable initial control is needed.

Greedy Q-Function Update – Approximate Dynamic Programming. ADP Method 3: Q Learning = Action-Dependent Heuristic Dynamic Programming (ADHDP) (Paul Werbos): model-free HDP.
Greedy Q update: Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_j(x_{k+1}, h_j(x_{k+1}))
in regression form W_{j+1}^T \phi(x_k, u_k) = r(x_k, u_k) + W_j^T \phi(x_{k+1}, -L_j x_{k+1}) = target.
Update the weights by RLS or backpropagation. A stable initial control is NOT needed.

slide-51
SLIDE 51

Direct OPTIMAL ADAPTIVE CONTROL Q learning actually solves the Riccati Equation WITHOUT knowing the plant dynamics Model-free ADP Works for Nonlinear Systems

Proofs? Robustness? Comparison with adaptive control methods?

slide-52
SLIDE 52
slide-53
SLIDE 53

A Q-Learning Based Adaptive Optimal Controller Implementation for a Humanoid Robot Arm

Said Ghani Khan1, Guido Herrmann1, Frank L. Lewis2, Tony Pipe1, Chris Melhuish1

1: Bristol Robotics Laboratory, University of Bristol and University of the West of England, Bristol, UK; 2: ARRI, The University of Texas at Arlington, USA

Conference on Decision and Control (CDC) 2011, Orlando 11 December 2011

slide-54
SLIDE 54

BRL BERT II ARM

The mechanical design and manufacturing of the BERT II torso, including the hand and arm, was conducted by Elumotion (www.elumotion.com), a Bristol-based company.

slide-55
SLIDE 55

ADP Actor-Critic Scheme

Stingu et al. 2010

slide-56
SLIDE 56

The cost of control is modeled via an NN

Algorithm

slide-57
SLIDE 57

Algorithm

The function z_k(x_k, u_k, d_k) is a vector, linear in the control, the control error, and the system states.

Selected elements of the Kronecker product of z_k are used as the activation functions of a polynomial neural network (greater detail later).

Note that the control signals u_k then appear at most quadratically in the resulting approximation; this again is a practical assumption.

slide-58
SLIDE 58

The cost function is modified to include constraints

Introducing Constraints



The new cost function....

qL is the joint limit.

slide-59
SLIDE 59

The NN nodes are obtained by the Kronecker product of : Additional neurons are added to deal with the extra nonlinearity due to constraints

Introducing Constraints - Modelling Q

slide-60
SLIDE 60

Constrained Case-Experiment

slide-61
SLIDE 61
slide-62
SLIDE 62

Reinforcement Learning Approach for Tele-Robotic Interaction Interface

Jartuwat Rajruangrabin and Dan Popa Automation & Robotics Research Institute The University of Texas at Arlington

slide-63
SLIDE 63

Experiment Platform

               

Haptic device used as the input interface; the commanded end-effector pose is
q_e(t) = [ x(t)  y(t)  z(t)  \theta_1(t)  \theta_2(t)  \theta_3(t) ]^T
Systems to be controlled: a simulated 7-degree-of-freedom robotic manipulator, and a 6-degree-of-freedom robotic manipulator mounted on a differential-drive mobile robot platform.

slide-64
SLIDE 64

Past Work / Challenges

[Block diagram: the operator input u drives the interface system; its state x_p commands the robot system, and the robot-interface system output is used for operator performance evaluation.]
Interface model: \dot x_p(t) = A_p x_p(t) + B_p u(t)
A set of input/output pairs has to be specified.

  • Desired trajectory is known

What if we cannot specify the desired output trajectory directly?

  • Use Reinforcement Learning

Challenges

slide-65
SLIDE 65

Interface Mapping

What is it exactly?

The interface mapping relates the operator input x to the robot output y: y = f(x).

What can we do to get f(x)? The simplest way is to obtain a set of inputs and a set of outputs and calculate the relationship (curve fitting): a curve-fitting algorithm fed with (x, y) data produces an estimate \hat f(x).

slide-66
SLIDE 66

Static Mapping Approach

[Static mapping: (x, y) data are collected once, a curve-fitting algorithm produces \hat f(x), and this fixed map is then used as the interface.]

slide-67
SLIDE 67

Reinforcement Learning

[As before, curve fitting of (x, y) data yields \hat f(x).]

What if we cannot specify the desired output trajectory directly? Reinforcement learning: "learning by interacting with an environment." With RL we do not have to specify a desired trajectory; instead, a reward function is used.

slide-68
SLIDE 68

NN Training

A steepest-descent gradient algorithm is used in conjunction with backpropagation to train the feedforward network. The log-sigmoid function is used as the neuron activation function:
\sigma(x) = 1 / (1 + e^{-x}),  \Delta w \propto -\partial E / \partial w
slide-69
SLIDE 69

RL Implementation Haptic/Robot Arm on a Mobile Robot Mapping Model:

  • Use a non-linear dynamic model

Step 1(Initialization): Train the NN so

that the weights are optimal according to the desired trajectory

Step 2: (Online Learning) Implement

the TD(λ) learning algorithm

Reward function r_{V_R}: defined piecewise in terms of the output y relative to the extremes y_max and y_min of the commanded motion, and 0 otherwise. The learned interface mapping is propagated as y_{k+1} = f_w(y_k, x).

Objective:

  • Exp1: Implement RL with a reward function that allows the user to control the movement of the mobile platform.
  • Exp2: Reverse the direction mapping of the mobile platform, based on Exp1.

Reward function r_{V_L}: defined analogously (piecewise in y relative to y_max and y_min, and 0 otherwise).

slide-70
SLIDE 70

Experiment Result – Online TD(λ) Learning

The reward function is set so that it is maximized when Y moves along the opposite direction (with an offset).
slide-71
SLIDE 71

Experiment Result – Contour Shaping

Reverse and Scale Y - Mapping Update Through TD(λ) Learning Algorithm

slide-72
SLIDE 72

NEW ROBOTIC TREATMENT SYSTEMS FOR CHILDHOOD CEREBRAL PALSY and AUTISM

D. O. Popa (A), I. Ranatunga (A), D. Hanson (B), F. Makedon (C)

(A) Department of Electrical Engineering, University of Texas at Arlington, USA; (B) Hanson Robotics Inc., Plano, TX, USA; (C) Department of Computer Science & Engineering, University of Texas at Arlington, USA

This work was supported by: US National Science Foundation Grants #CPS 1035913, and #CNS 0923494. TxMed consortium grant: “Human-Robot Interaction System for Early Diagnosis and Treatment of Childhood Autism Spectrum Disorders (RoDiCA)”

slide-73
SLIDE 73

Introduction

 Two assistive robotic systems aimed at the treatment of children with certain motor and cognitive impairments.  In the Neptune project [1]

– Mobile manipulator for children suffering from Cerebral-Palsy. – Mobile robot base and a 6DOF robotic arm, interfaced via:

  • Wii Remote, iPad, Neuroheadset, the Kinect, and

Force sensing robotic skin

– Therapeutic outcomes

  • Hand and head gesture recognition and reward.
  • Hand motion exercises using iPad games

(CPlay, CPMaze, ProlloquoToGo) held by the robot.

 The RoDiCA project [2]

– focuses on treating cognitive impairments in children suffering from ASD – Zeno is a robotic platform developed by Hanson Robotics, based on a patented realistic skin. – Therapeutic outcomes

  • Real time subject tracking/joint attention
  • Advanced head-eye and hand coordination
  • Facial gesture recognition and synthesis
  • Data logging and analysis.


Zeno (by Hanson RoboKind Inc.) generating facial expressions and maintaining eye contact. Neptune Mobile manipulator with iPad attached.

slide-74
SLIDE 74

Advanced Control for Human Robot Interaction


Zeno Video

Neptune Control through Neural Headband

Visual HRI

Realistic & Intuitive Human‐Robot Interaction Physical HRI Recognize & Synthesize poses and gestures Adaptive Interfaces

Robot Touch HRI

slide-75
SLIDE 75

Adaptive Interfaces

 The supervisory control of multi-DOF robots is a demanding application.  If a single operator is tasked with direct control, performing coordinated tasks becomes non-intuitive.  We use a reinforcement learning TD(λ) scheme to adaptively change the mapping of DOFs from the operator user interface to the robot.


[Actor-critic interface adaptation loop: the interface input (action) drives the interface mapping and state propagation; a system-metrics evaluation of the resulting state produces the reward; the critic forms the TD error, which updates both the value function (critic) and the interface-mapping policy (actor).]
The value function and the policy are parameterized by weights w that are updated in proportion to the TD error.
slide-76
SLIDE 76

Our revels now are ended. These our actors,
As I foretold you, were all spirits, and
Are melted into air, into thin air.
The cloud-capped towers, the gorgeous palaces,
The solemn temples, the great globe itself,
Yea, all which it inherit, shall dissolve,
And, like this insubstantial pageant faded,
Leave not a rack behind. We are such stuff
As dreams are made on, and our little life
Is rounded with a sleep.
Prospero, in The Tempest, act 4, sc. 1, l. 152-6, Shakespeare

slide-77
SLIDE 77
slide-78
SLIDE 78

An Approximate Dynamic Programming Based Controller for an Underactuated 6DoF Quadrotor

Emanuel Stingu Frank Lewis

Automation & Robotics Research Institute University of Texas at Arlington

ARRI

Supported by ARO grant W91NF-05-1-0314 NSF grant ECCS-0801330

slide-79
SLIDE 79

The quadrotor has 17 states and only 4 control inputs, so it is very under-actuated. Three control loops with dynamic inversion are used to generate the 4 control signals.

[Diagram labels: x, y, z axes; amount of tilt; yaw; direction of tilt.]

[Control structure: the reference trajectory and position/velocity errors feed the translation and yaw controller (altitude control sets the thrust force; dynamic inversion maps accelerations to a desired attitude); the attitude controller uses dynamic inversion to map rotational accelerations to desired forces; the motor and propeller controller applies momentum theory to the propeller and dynamic inversion to map thrust forces to motor voltages (motor commands).]

slide-80
SLIDE 80

Approximate Dynamic Programming

Structure: a global critic and three local actors, one per control loop (translation and yaw controller, attitude controller, motor and propeller controller), driven by the state vector and references (reference trajectory, position and velocity errors, desired attitude, desired forces, motor commands); the critic supplies the actor tuning.

The actor is u_k = h(x_k, z_k), where
h(x_k, z_k) = [ h_1^T(x_k^1, z_k^1)  h_2^T(x_k^2, z_k^2)  h_3^T(x_k^3, z_k^3) ]^T
and x_k^j, z_k^j are subsets of the state and tracking-error vectors; the adaptive control vector collects the commands of the three loops,
u_{ad} = [ a_z^{ad}  \Omega^{ad}  q^{ad}  V_{mot}^{ad} ]^T
The critic is
V_h(x_k, z_k) = \sum_{j=k}^{\infty} r(x_j, z_j, u_j)

slide-81
SLIDE 81

Approximate Dynamic Programming

Once the value of the Q function at (x_{k+1}, z_{k+1}, u_{k+1}) is known, a backup of it is made into the RBF neural network by adjusting the weights W and/or by adding more neurons and reconfiguring their other parameters. This is a separate process that just needs to know the coordinates (x, z, u) and the new value to store. The Q function is represented as Q(x_k, z_k, u_k) = W^T \phi(x_k, z_k, u_k) and is evaluated against the target r(x_k, z_k, h(x_k, z_k)) + Q(x_{k+1}, z_{k+1}, u_{k+1}).

The update of the Q value is not made completely towards the new value:
Q_{stored} = Q_{old} + \delta ( Q_{new} - Q_{old} ),  0 < \delta \le 1
This slows down the learning, but adds robustness.

The policy update step is done by simply solving
\partial Q(x_k, z_k, u) / \partial u = 0
after the new Q value has been stored. The resulting value of h(x_k, z_k) is stored into the actor RBF neural network using the same mechanism as before.
slide-82
SLIDE 82

The Curse of Dimensionality

The actor acts as a nonlinear function approximator. Normally we would have u_k = h(x_k); in the quadrotor case, because the reference is not zero and the system is nonlinear, we need u_k = h(x_k, z_k). For each of the position, attitude and motor/propeller loops, the state vector includes the local states and the external states that have a strong coupling effect on the loop performance. It is easy to see that this way the input space can easily have n = 14 or more dimensions. An RBF neural network with its neurons placed on a grid with N elements in each dimension would require N^n neurons: for N = 5 and n = 14, about 6 x 10^9 neurons are required.

Placing neurons on a grid is no better than a look-up table. The solutions to reducing the number of neurons are the following:
  • preprocess the states to provide signals with physical significance as inputs;
  • combine multiple states into a lower-dimensional signal;
  • map multiple equivalent regions of the state space onto only one.
slide-83
SLIDE 83