Reinforcement Methods for Autonomous Online Learning of Optimal Robot Behaviors


slide-1
SLIDE 1

Reinforcement Methods for Autonomous Online Learning of Optimal Robot Behaviors

F.L. Lewis, Dan Popa
Automation & Robotics Research Institute (ARRI), The University of Texas at Arlington, USA

Guido Herrmann
Bristol Robotics Lab, University of Bristol, UK

Supported by: NSF (Paul Werbos), ARO (Sam Stanton), AFOSR (Fariba Fahroo)

Talk available online at http://ARRI.uta.edu/acs

slide-2
SLIDE 2

  • Optimal Control
  • Reinforcement Learning
  • Policy Iteration
  • Q Learning
  • Humanoid Robot Learning Control Using RL
  • Telerobotic Interface Learning Using RL

Invited by Rolf Johansson

slide-3
SLIDE 3

It is man's obligation to explore the most difficult questions in the clearest possible way and use reason and intellect to arrive at the best answer. Man's task is to understand patterns in nature and society. The first task is to understand the individual problem, then to analyze symptoms and causes, and only then to design treatment and controls.
Ibn Sina, 980-1037 (Avicenna)

slide-4
SLIDE 4

Importance of Feedback Control
  • Darwin: feedback and natural selection
  • Volterra: feedback and fish population balance
  • Adam Smith: feedback and the international economy
  • James Watt: feedback and the steam engine
  • Feedback and cell homeostasis

The resources available to most species for their survival are meager and limited

Nature uses Optimal control

slide-5
SLIDE 5

F.L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits & Systems Magazine, invited feature article, pp. 32-50, Third Quarter 2009; also IEEE Control Systems Magazine, to appear.

slide-6
SLIDE 6

  

Discrete-Time Optimal Control

System:
x_{k+1} = f(x_k) + g(x_k) u_k

Cost:
V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i) = r(x_k, u_k) + \sum_{i=k+1}^{\infty} r(x_i, u_i)

Example: r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k

Control policy: u_k = h(x_k) = the prescribed control input function. Example: linear state-variable feedback, u_k = -K x_k.

Bellman equation (the difference-equation equivalent of the cost):
V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}),  V_h(0) = 0
which for the quadratic cost reads
V_h(x_k) = x_k^T Q x_k + u_k^T R u_k + V_h(x_{k+1})

Bellman's principle gives the Bellman optimality equation (DT HJB):
V^*(x_k) = \min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)
h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)

Optimal control:
u(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, \partial V^*(x_{k+1}) / \partial x_{k+1}

Off-line solution; the dynamics must be known.

slide-7
SLIDE 7

DT Optimal Control – Linear Systems, Quadratic Cost (LQR)

System: x_{k+1} = A x_k + B u_k

Cost: V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T R u_i \big)

Fact: for any stabilizing state feedback u_k = -L x_k the cost is quadratic,
V(x_k) = x_k^T P x_k
for some symmetric matrix P.

HJB = DT Riccati equation:
0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A

Optimal control: u_k = -L x_k, with L = (R + B^T P B)^{-1} B^T P A
Optimal cost: V^*(x_k) = x_k^T P x_k

Off-line solution; the dynamics must be known.
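For concreteness, a minimal offline sketch of this design in Python, using SciPy's solve_discrete_are; the small A, B, Q, R below are placeholders, not the aircraft example that appears later in the talk.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Placeholder system: any stabilizable (A, B) with Q >= 0, R > 0 will do.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.eye(1)

# Offline LQR design: requires full knowledge of A and B.
P = solve_discrete_are(A, B, Q, R)                 # DT algebraic Riccati equation
L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # L = (R + B'PB)^-1 B'PA
# Optimal policy u_k = -L x_k, optimal cost V*(x_k) = x_k' P x_k
```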

slide-8
SLIDE 8

Different methods of learning

Supervised learning Unsupervised learning Reinforcement learning

Machine learning- the formal study of learning systems

Synthesis of

  • Computational intelligence
  • Control systems
  • Neurobiology

We want robot controllers that learn optimal control solutions online in real-time

slide-9
SLIDE 9

Different methods of learning

[Block diagram: an adaptive learning system applies control inputs to the system/environment, observes the outputs and a reinforcement signal, and a critic compares them with the desired performance to tune the actor.]

Reinforcement learning Ivan Pavlov 1890s Actor-Critic Learning

We want OPTIMAL performance

  • ADP- Approximate Dynamic Programming
slide-10
SLIDE 10

  

System: x_{k+1} = f(x_k) + g(x_k) u_k
Cost: V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i)

Bellman equation: V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})
(difference-equation equivalent: V_h(x_k) = x_k^T Q x_k + u_k^T R u_k + V_h(x_{k+1}))

Bellman's principle gives the Bellman optimality equation (DT HJB):
V^*(x_k) = \min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big),  h^*(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V^*(x_{k+1}) \big)
Optimal control: u(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, \partial V^*(x_{k+1}) / \partial x_{k+1}

Focus on these two eqs.

RL Policy Iterations to Solve Optimal Control Problem

slide-11
SLIDE 11

The Bellman equation can be interpreted as a consistency condition that must be satisfied by the value function at each time stage. It expresses a relation between the current value of being in state x and the value of being in the next state x', given that policy h is followed, and it captures the action, observation, evaluation, and improvement mechanisms of reinforcement learning.

Bellman Equation

Temporal Difference Idea

V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})

e_k = -V_h(x_k) + r(x_k, h(x_k)) + V_h(x_{k+1})

slide-12
SLIDE 12

Policy Evaluation and Policy Improvement

Policy Evaluation by Bellman Equation: Policy Improvement: consider algorithms that repeatedly interleave the two procedures:

Policy evaluation (Bellman equation): V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})
Policy improvement: h'(x_k) = \arg\min_u \big( r(x_k, u) + V_h(x_{k+1}) \big)
The new policy h'(x_k) is said to be greedy with respect to the value function V_h(x), and it satisfies V_{h'}(x) \le V_h(x).

At each step, one obtains a policy that is no worse than the previous policy. Can prove convergence under fairly mild conditions to the optimal value and optimal policy. Most such proofs are based on the Banach Fixed Point Theorem. One step is a contraction map.

There is a large family of algorithms that implement the policy evaluation and policy improvement procedures in various ways

Policy improvement makes V_{h'}(x) \le V_h(x). For the affine system the improved policy is
h'(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, \partial V_h(x_{k+1}) / \partial x_{k+1}
(Bertsekas and Tsitsiklis 1996; Sutton and Barto 1998).

slide-13
SLIDE 13

DT Policy Iteration to solve HJB

The cost of any given control policy h(x_k) satisfies the recursion (Bellman equation; a recursive form and consistency equation):
V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})

Recursive solution: pick a stabilizing initial control, then repeat:
Policy evaluation (solve the Bellman equation): V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})
Policy improvement: h_{j+1}(x_k) = \arg\min_{u_k} \big( r(x_k, u_k) + V_{j+1}(x_{k+1}) \big)

f(.) and g(.) do not appear. Howard (1960) proved convergence for MDPs.
The policy h_{j+1}(x_k) is said to be greedy with respect to the value function V_{j+1}(x). At each step, one obtains a policy that is no worse than the previous policy. Convergence to the optimal value and optimal policy can be proved under fairly mild conditions; most such proofs are based on the Banach fixed-point theorem, since one step is a contraction map.

(Bertsekas and Tsitsiklis 1996, Sutton and Barto 1998).

slide-14
SLIDE 14

Methods to implement Policy Iteration:
  • Exact computation: needs the full system dynamics
  • Temporal difference: for robot trajectory following
  • Monte Carlo learning: for learning episodic robot tasks

slide-15
SLIDE 15

DT Policy Iteration – Linear Systems, Quadratic Cost (LQR)

System: x_{k+1} = A x_k + B u_k = (A - B L) x_k under the policy u_k = -L x_k
For any stabilizing policy the LQR value is quadratic:
V(x_k) = \sum_{i=k}^{\infty} \big( x_i^T Q x_i + u_i^T(x_i) R u_i(x_i) \big) = x_k^T P x_k

DT policy iteration:
Policy evaluation: V_{j+1}(x_k) = x_k^T Q x_k + u_j^T(x_k) R u_j(x_k) + V_{j+1}(x_{k+1})
Policy improvement: u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, dV_{j+1}(x_{k+1}) / dx_{k+1}

Equivalent to an underlying problem, the DT LQR: policy evaluation solves the DT Lyapunov equation
(A - B L_j)^T P_{j+1} (A - B L_j) - P_{j+1} + Q + L_j^T R L_j = 0
and policy improvement gives L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A.

Hewer proved convergence in 1971. Policy iteration solves the Lyapunov equation WITHOUT knowing the system dynamics A and B.
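A minimal offline sketch of this Hewer-type iteration, assuming A and B are known so each Lyapunov equation can be solved exactly with SciPy; the online, model-free variant on the following slides replaces the Lyapunov solve with measured data.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_policy_iteration(A, B, Q, R, L0, iters=20):
    """Hewer-type policy iteration: Lyapunov solve, then gain update."""
    L = L0                                   # must be stabilizing: rho(A - B L0) < 1
    for _ in range(iters):
        Ac = A - B @ L
        # Policy evaluation: (A-BL)' P (A-BL) - P + Q + L'RL = 0
        P = solve_discrete_lyapunov(Ac.T, Q + L.T @ R @ L)
        # Policy improvement: L = (R + B'PB)^-1 B'PA
        L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, L
```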

slide-16
SLIDE 16

DT Policy Iteration – How to implement it online? (Linear systems, quadratic cost: LQR)

Policy evaluation: V_{j+1}(x_k) = x_k^T Q x_k + u_j^T(x_k) R u_j(x_k) + V_{j+1}(x_{k+1})
The LQR cost is quadratic, V_j(x) = x^T P_j x for some matrix P_j, so the evaluation step reads
x_k^T P_{j+1} x_k = x_k^T Q x_k + u_j^T R u_j + x_{k+1}^T P_{j+1} x_{k+1}
This solves the Lyapunov equation without knowing A and B.

Quadratic basis set: write x^T P x = W^T \phi(x); e.g. for two states
x^T P x = p_{11} x_1^2 + 2 p_{12} x_1 x_2 + p_{22} x_2^2 = [p_{11}  2p_{12}  p_{22}] [x_1^2  x_1 x_2  x_2^2]^T
so the evaluation step becomes a linear regression for the weights W_{j+1}:
W_{j+1}^T \big( \phi(x_k) - \phi(x_{k+1}) \big) = x_k^T Q x_k + u_j^T(x_k) R u_j(x_k)

Then update the control using
h_{j+1}(x_k) = -L_{j+1} x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A x_k
Need to know A AND B for the control update.
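A minimal sketch of that regression for a two-state example, assuming we only log (x_k, u_k, x_{k+1}) triples under the current policy; phi is the quadratic basis named above, and no A or B is used in the evaluation.

```python
import numpy as np

def phi(x):
    """Quadratic basis [x1^2, x1*x2, x2^2] for a two-state system."""
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def evaluate_policy_ls(X, U, Xnext, Q, R):
    """Batch least-squares policy evaluation from logged data:
       W'(phi(x_k) - phi(x_{k+1})) = x_k'Qx_k + u_k'Ru_k."""
    Phi = np.array([phi(x) - phi(xn) for x, xn in zip(X, Xnext)])
    d = np.array([x @ Q @ x + u @ R @ u for x, u in zip(X, U)])
    W, *_ = np.linalg.lstsq(Phi, d, rcond=None)
    # Recover P from the weights: diagonal terms direct, cross term halved.
    P = np.array([[W[0], W[1] / 2], [W[1] / 2, W[2]]])
    return P
```

The logged data must be persistently exciting, as noted a few slides later.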

 

  

slide-17
SLIDE 17

Implementation- DT Policy Iteration Nonlinear Case

Value Function Approximation (VFA)

V(x) = W^T \phi(x)
with basis functions \phi(x) and weights W.

LQR case: V(x) is quadratic,
V(x) = x^T P x = W^T \phi(x)
with quadratic basis functions \phi(x) and W^T = [p_{11}  2p_{12}  p_{22}  ...].

Nonlinear system case: use a neural network for \phi(x).
slide-18
SLIDE 18

Implementation – DT Policy Iteration

Value function update for a given control (Bellman equation):
V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})
VFA: V_j(x_k) = W_j^T \phi(x_k)

Assume measurements of x_k and x_{k+1} are available to compute u_{k+1}. Then the Bellman equation becomes a regression for the weights,
W_{j+1}^T \big( \phi(x_k) - \phi(x_{k+1}) \big) = r(x_k, h_j(x_k))
with regression vector \phi(x_k) - \phi(x_{k+1}). Solve for the weights in real time using RLS, or batch LS over many trajectories with different initial conditions over a compact set.

Then update the control using
u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, dV_{j+1}(x_{k+1}) / dx_{k+1} = -\tfrac{1}{2} R^{-1} g^T(x_k) \nabla\phi^T(x_{k+1}) W_{j+1}
Need to know g(x_k) for the control update. Since x_{k+1} is measured, knowledge of f(x) or g(x) is NOT needed for the value-function update.

slide-19
SLIDE 19

Policy evaluation: V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})
Policy improvement: u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, dV_{j+1}(x_{k+1}) / dx_{k+1}

  • 1. Select a control policy
  • 2. Find the associated cost (policy evaluation)
  • 3. Improve the control (policy improvement)

Online loop: at time k, apply u_k; observe x_k, the cost r_k, and x_{k+1}; update V by solving
W_{j+1}^T \big( \phi(x_k) - \phi(x_{k+1}) \big) = r(x_k, h_j(x_k))
until convergence to V_{j+1}; then update the control to u_{j+1}.

Needs about 10 lines of MATLAB code. This is direct optimal adaptive control: it solves the Lyapunov equation without knowing the dynamics.

slide-20
SLIDE 20

Persistence of Excitation

Regression vector must be PE

 

W_{j+1}^T \big( \phi(x_k) - \phi(x_{k+1}) \big) = r(x_k, h_j(x_k))

slide-21
SLIDE 21

The Adaptive Critic Architecture: the action network implements the control policy h_j(x_k) applied to the system; the critic network performs policy evaluation from the observed cost.

Value update (critic): V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})
Control policy update (actor): h_{j+1}(x_k) = \arg\min_{u} \big( r(x_k, u) + V_{j+1}(x_{k+1}) \big)

This leads to an ONLINE, FORWARD-IN-TIME implementation of optimal control.

Optimal Adaptive Control

Use RLS until convergence
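A minimal sketch of this online actor-critic loop with an RLS critic; step(), phi(), improve(), the probing-noise level, and the loop lengths are placeholder assumptions, not the authors' implementation.

```python
import numpy as np

def rls_step(W, S, reg, target, lam=0.99):
    """One recursive least-squares update of the critic weights W (covariance S)."""
    K = S @ reg / (lam + reg @ S @ reg)
    W = W + K * (target - reg @ W)
    S = (S - np.outer(K, reg @ S)) / lam
    return W, S

def adaptive_critic(step, phi, improve, x, L, W, n_collect=200, n_iters=10):
    """Online policy iteration: RLS critic (slower evaluation loop) wrapped
       around the real-time control loop; step(x, u) -> (x_next, cost) is the
       unknown plant, improve(W) returns the greedy gain for the critic weights W."""
    S = 100.0 * np.eye(len(W))
    for _ in range(n_iters):
        for _ in range(n_collect):
            u = -L @ x + 0.01 * np.random.randn(L.shape[0])   # probing noise for PE
            x_next, r = step(x, u)
            reg = phi(x) - phi(x_next)          # regression vector
            W, S = rls_step(W, S, reg, r)       # value (critic) update
            x = x_next
        L = improve(W)                          # policy improvement from the critic
    return W, L
```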

slide-22
SLIDE 22

Adaptive Control

[Adaptive control loop: plant, controller, output.]

  • Identify the controller: direct adaptive control
  • Identify the system model: indirect adaptive control
  • Identify the performance value V(x) = W^T \phi(x): optimal adaptive control

slide-23
SLIDE 23
slide-24
SLIDE 24

Greedy Value Function Update – Approximate Dynamic Programming
Value Iteration = Heuristic Dynamic Programming (HDP) (Paul Werbos)

Policy Iteration:
Policy evaluation: V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})   (a Lyapunov equation; an initial stabilizing control is needed)
Policy improvement: h_{j+1}(x_k) = \arg\min_{u} \big( r(x_k, u) + V_{j+1}(x_{k+1}) \big)
For the LQR, the underlying recursion (Hewer 1971) is
(A - B L_j)^T P_{j+1} (A - B L_j) - P_{j+1} + Q + L_j^T R L_j = 0

Value Iteration:
Value update: V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})   (a simple recursion; an initial stabilizing control is NOT needed)
Policy improvement: h_{j+1}(x_k) = \arg\min_{u} \big( r(x_k, u) + V_{j+1}(x_{k+1}) \big)
For the LQR, the underlying recursion is
P_{j+1} = (A - B L_j)^T P_j (A - B L_j) + Q + L_j^T R L_j
(Lancaster & Rodman proved convergence.)

Having two occurrences of the cost (V_{j+1} on the left, V_j on the right) is what allows the definition of the greedy update.
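A minimal sketch of this LQR value-iteration (HDP) recursion; A and B appear here only to display the underlying recursion, and note the initial gain L = 0 need not be stabilizing.

```python
import numpy as np

def lqr_value_iteration(A, B, Q, R, iters=200):
    """Greedy (HDP) update: P_{j+1} = (A-BL_j)' P_j (A-BL_j) + Q + L_j'RL_j."""
    n, m = B.shape
    P = np.zeros((n, n))
    L = np.zeros((m, n))      # initial gain need not stabilize the plant
    for _ in range(iters):
        Ac = A - B @ L
        P = Ac.T @ P @ Ac + Q + L.T @ R @ L                    # simple recursion
        L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)      # greedy policy
    return P, L
```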

slide-25
SLIDE 25

Compare Value Iteration to Dynamic Programming.

Value iteration (forward in time):
V_{j+1}(x) = \sum_{u} \pi(x, u) \sum_{x'} P^u_{xx'} \big[ R^u_{xx'} + \gamma V_j(x') \big]

Dynamic programming (backward in time):
V_k(x) = \sum_{u} \pi(x, u) \sum_{x'} P^u_{xx'} \big[ R^u_{xx'} + \gamma V_{k+1}(x') \big]

One can interpret V_j(x') as an approximation or estimate of the future stage cost-to-go from the future state x'.

slide-26
SLIDE 26

A problem with DT Policy Iteration and VI

Assume measurements of x_k and x_{k+1} are available to compute u_{k+1}. Then:

Policy evaluation (VFA: V_j(x_k) = W_j^T \phi(x_k)):
W_{j+1}^T \big( \phi(x_k) - \phi(x_{k+1}) \big) = r(x_k, h_j(x_k))
Since x_{k+1} is measured, knowledge of f(x) or g(x) is not needed for the value-function update.

Policy improvement:
u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g^T(x_k) \, dV_{j+1}(x_{k+1}) / dx_{k+1}
LQR case: h_{j+1}(x_k) = -L_{j+1} x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A x_k
Need to know f(x_k) AND g(x_k) (A and B in the LQR case) for the control update.

Easy to fix: use two neural networks (a critic and an actor).

slide-27
SLIDE 27

Standard Neural Network VFA for On-Line Implementation (HDP)

System: x_{k+1} = f(x_k) + g(x_k) u(x_k)
Critic NN for the value: \hat V_i(x_k, W_{Vi}) = W_{Vi}^T \phi(x_k)
Action NN for the control: \hat u_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k)   (a 2-layer NN can also be used)

HDP (value iteration):
V_{i+1}(x_k) = x_k^T Q x_k + u_i^T R u_i + V_i(x_{k+1})
u_i(x_k) = \arg\min_u \big( x_k^T Q x_k + u^T R u + V_i(x_{k+1}) \big)

Define the target cost function
d(\phi(x_k), W_{Vi}, W_{ui}) = x_k^T Q x_k + \hat u_i^T(x_k) R \hat u_i(x_k) + \hat V_i(x_{k+1}) = x_k^T Q x_k + \hat u_i^T(x_k) R \hat u_i(x_k) + W_{Vi}^T \phi(x_{k+1})

Critic NN update: the cost equation is explicit, so use LS,
W_{V,i+1} = \arg\min_W \int \big| W^T \phi(x_k) - d(\phi(x_k), W_{Vi}, W_{ui}) \big|^2 dx
          = \Big( \int \phi(x_k) \phi^T(x_k)\, dx \Big)^{-1} \int \phi(x_k)\, d(\phi(x_k), W_{Vi}, W_{ui})\, dx
or RLS,
W_{V,i+1}^{(m+1)} = W_{V,i+1}^{(m)} + \alpha_m \phi(x_k) \big( r(x_k, u_k) + W_{Vi}^T \phi(x_{k+1}) - (W_{V,i+1}^{(m)})^T \phi(x_k) \big)

Action NN update: the DT control equation is implicit, so use gradient descent (backpropagation, P. Werbos),
W_{ui} = \arg\min_W \big( x_k^T Q x_k + \hat u^T(x_k, W) R \hat u(x_k, W) + \hat V_i( f(x_k) + g(x_k) \hat u(x_k, W) ) \big)
W_{ui}^{(j+1)} = W_{ui}^{(j)} - \beta \, \partial \big( x_k^T Q x_k + \hat u_{i(j)}^T R \hat u_{i(j)} + \hat V_i(x_{k+1}) \big) / \partial W_{ui}^{(j)}
             = W_{ui}^{(j)} - \beta \, \sigma(x_k) \big( 2 R \hat u_{i(j)}(x_k) + g^T(x_k) \nabla\phi^T(x_{k+1}) W_{Vi} \big)^T

Asma Al-Tamimi & F. Lewis

slide-28
SLIDE 28

Interesting Fact for HDP for Nonlinear Systems

The action NN \hat u_i(x_k, W_{ui}) = W_{ui}^T \sigma(x_k) is tuned by gradient descent on the implicit DT control equation,
W_{ui}^{(j+1)} = W_{ui}^{(j)} - \beta \, \sigma(x_k) \big( 2 R \hat u_{i(j)}(x_k) + g^T(x_k) \nabla\phi^T(x_{k+1}) W_{Vi} \big)^T
Note that the state drift dynamics f(x_k) is NOT needed, since:
  • 1. an NN approximation for the action is used, and
  • 2. x_{k+1} is measured in the training phase.
Only g(.) is needed. In the linear case, the policy update
h_j(x_k) = -L_j x_k = -(R + B^T P_j B)^{-1} B^T P_j A x_k
must know the system A and B matrices; here the information about A is stored in the action NN.

slide-29
SLIDE 29

Discrete-time nonlinear HJB solution using Approximate dynamic programming

  • Simulation Example 1
  • Linear system – Aircraft longitudinal dynamics
  • The HJB, i.e. ARE, Solution

A = [  1.0722   0.0954   0       -0.0541  -0.0153
       4.1534   1.1175   0       -0.8000  -0.1010
       0.1359   0.0071   1.0      0.0039   0.0097
       0        0        0        0.1353   0
       0        0        0        0        0.1353 ]

B = [ -0.0453  -0.0175
      -1.0042  -0.1131
       0.0075   0.0134
       0.8647   0
       0        0.8647 ]

Unstable, two-input system.

ARE: 0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A

P = [ 55.8348   7.6670  16.0470  -4.6754  -0.7265
       7.6670   2.3168   1.4987  -0.8309  -0.1215
      16.0470   1.4987  25.3586  -0.6709   0.0464
      -4.6754  -0.8309  -0.6709   1.5394   0.0782
      -0.7265  -0.1215   0.0464   0.0782   1.0240 ]

L = [ -4.1136  -0.7170  -0.3847   0.5277   0.0707
      -0.6315  -0.1003   0.1236   0.0653   0.0798 ]

slide-30
SLIDE 30

Discrete-time nonlinear HJB solution using Approximate dynamic programming

  • Simulation
  • The Cost function approximation – quadratic basis set
  • The Policy approximation – linear basis set

Policy (action) approximation, linear basis set:
\hat u_i(x_k) = W_{ui}^T \sigma(x_k),  \sigma(x) = [x_1  x_2  x_3  x_4  x_5]^T
W_u = [ w^u_{11}  w^u_{12}  w^u_{13}  w^u_{14}  w^u_{15}
        w^u_{21}  w^u_{22}  w^u_{23}  w^u_{24}  w^u_{25} ]

Cost (value) approximation, quadratic basis set:
\hat V_{i+1}(x_k, W_{V,i+1}) = W_{V,i+1}^T \bar\phi(x_k)
\bar\phi(x) = [x_1^2  x_1 x_2  x_1 x_3  x_1 x_4  x_1 x_5  x_2^2  x_2 x_3  x_2 x_4  x_2 x_5  x_3^2  x_3 x_4  x_3 x_5  x_4^2  x_4 x_5  x_5^2]^T
W_V = [w_{V1}  w_{V2}  ...  w_{V15}]^T

slide-31
SLIDE 31

Discrete-time nonlinear HJB solution using Approximate dynamic programming

  • Simulation

The convergence of the cost

Converged critic weights:
W_V = [55.5411, 15.2789, 31.3032, -9.3255, -1.4536, 2.3142, 2.9234, -1.6594, -0.2430, 24.8262, -1.3076, 0.0920, 1.5388, 0.1564, 1.0240]^T

Relation between the weights and the Riccati matrix (diagonal entries direct, off-diagonal entries halved):
P = [ w_{V1}      0.5 w_{V2}   0.5 w_{V3}   0.5 w_{V4}   0.5 w_{V5}
      0.5 w_{V2}  w_{V6}       0.5 w_{V7}   0.5 w_{V8}   0.5 w_{V9}
      0.5 w_{V3}  0.5 w_{V7}   w_{V10}      0.5 w_{V11}  0.5 w_{V12}
      0.5 w_{V4}  0.5 w_{V8}   0.5 w_{V11}  w_{V13}      0.5 w_{V14}
      0.5 w_{V5}  0.5 w_{V9}   0.5 w_{V12}  0.5 w_{V14}  w_{V15} ]

Actual ARE solution:
P = [ 55.8348   7.6670  16.0470  -4.6754  -0.7265
       7.6670   2.3168   1.4987  -0.8309  -0.1215
      16.0470   1.4987  25.3586  -0.6709   0.0464
      -4.6754  -0.8309  -0.6709   1.5394   0.0782
      -0.7265  -0.1215   0.0464   0.0782   1.0240 ]
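A small sketch of that weight-to-matrix mapping (quadratic-basis weights in upper-triangular order, off-diagonal entries halved), applied to the converged weights quoted above.

```python
import numpy as np

def weights_to_P(w, n=5):
    """Map quadratic-basis weights [x1^2, x1x2, ..., x5^2] back to symmetric P."""
    P = np.zeros((n, n))
    idx = 0
    for i in range(n):
        for j in range(i, n):
            P[i, j] = w[idx] if i == j else 0.5 * w[idx]
            P[j, i] = P[i, j]
            idx += 1
    return P

W_V = np.array([55.5411, 15.2789, 31.3032, -9.3255, -1.4536, 2.3142, 2.9234,
                -1.6594, -0.2430, 24.8262, -1.3076, 0.0920, 1.5388, 0.1564, 1.0240])
print(weights_to_P(W_V))   # close to the ARE solution P quoted above
```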

slide-32
SLIDE 32

Discrete-time nonlinear HJB solution using Approximate dynamic programming

  • Simulation

The convergence of the control policy

Converged actor weights:
W_u = [ 4.1068   0.7164   0.3756  -0.5274  -0.0707
        0.6330   0.1005  -0.1216  -0.0653  -0.0798 ]
(the rows of W_u correspond to -L, i.e. L_{ij} = -w^u_{ij})

Actual optimal control, from the ARE 0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A:
L = [ -4.1136  -0.7170  -0.3847   0.5277   0.0707
      -0.6315  -0.1003   0.1236   0.0653   0.0798 ]

Note: in this example the drift dynamics matrix A is NOT needed; the Riccati equation is solved online without knowing the A matrix.

slide-33
SLIDE 33

Issues with Nonlinear ADP

Batch LS solution for the critic NN update:
W_{V,i+1} = \Big( \int \phi(x_k) \phi^T(x_k)\, dx \Big)^{-1} \int \phi(x_k)\, d(\phi(x_k), W_{Vi}, W_{ui})\, dx
The integral is over a region of state space and is approximated using a set of sample points.

Selection of the NN training set: a set of points over a region vs. points along a trajectory.
Exploitation (optimal regulation) vs. exploration: for linear systems these are the same under a PE condition.
Taking sample points along a single trajectory leads to recursive least squares (RLS):
W_{V,i+1}^{(m+1)} = W_{V,i+1}^{(m)} + \alpha_m \phi(x_k) \big( r(x_k, u_k) + W_{Vi}^T \phi(x_{k+1}) - (W_{V,i+1}^{(m)})^T \phi(x_k) \big)

PE allows local smooth solution of Bellman eq.

slide-34
SLIDE 34
slide-35
SLIDE 35

The Adaptive Critic Architecture: the action network implements the control policy h_j(x_k) applied to the system; the critic network performs policy evaluation from the observed cost.

Value update (critic): V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})
Control policy update (actor): h_{j+1}(x_k) = \arg\min_{u} \big( r(x_k, u) + V_{j+1}(x_{k+1}) \big)

This leads to an ONLINE, FORWARD-IN-TIME implementation of optimal control.

Optimal Adaptive Control

Use RLS until convergence.

REAL-TIME CONTROL LOOP SLOWER EVALUATION LOOP

slide-36
SLIDE 36

Motor control 200 Hz

Oscillation is a fundamental property of neural tissue

The brain has multiple adaptive clocks with different timescales: theta rhythms (4-10 Hz; hippocampus, thalamus) for sensory processing, memory, and voluntary control of movement, and gamma rhythms (30-100 Hz; hippocampus and neocortex) for high cognitive activity.

  • consolidation of memory
  • spatial mapping of the environment – place cells

The high-frequency processing is due to the large amount of sensory data to be processed. Spinal cord.

  • D. Vrabie, F. Lewis, and Dr. Dan Levine- RL for Continuous-Time Systems
slide-37
SLIDE 37

Doya, Kimura, Kawato 2001

Il Hong Suh- Gregory Popper Skinner Darwin

Limbic system

slide-38
SLIDE 38

Cerebral cortex Motor areas Thalamus Basal ganglia Cerebellum Brainstem Spinal cord Interoceptive receptors Exteroceptive receptors Muscle contraction and movement Summary of Motor Control in the Human Nervous System

reflex Supervised learning Reinforcement Learning- dopamine (eye movement)

inferior olive

Hippocampus

Unsupervised learning Limbic System Motor control 200 Hz theta rhythms 4-10 Hz picture by E. Stingu

  • D. Vrabie

Memory functions Long term Short term

Hierarchy of multiple parallel loops

gamma rhythms 30-100 Hz,

slide-39
SLIDE 39

Adaptive Critic structure

Theta waves 4-8 Hz Reinforcement learning

Motor control 200 Hz

slide-40
SLIDE 40

Cerebral cortex Motor areas Thalamus Basal ganglia Cerebellum Brainstem Spinal cord Interoceptive receptors Exteroceptive receptors

Muscle contraction and movement

inferior olive

Hippocampus

gamma rhythms 30-100 Hz

Intense processing due to the large amount of information to be processed

Cognitive map of the environment

  • place cells -

theta rhythms 4-10 Hz

Behavior reference Information sent to the lower processing levels

Motor control 200 Hz theta rhythms 4-10 Hz

slide-41
SLIDE 41

Cerebral cortex Motor areas Thalamus Basal ganglia Cerebellum Brainstem Spinal cord Interoceptive receptors Exteroceptive receptors

Muscle contraction and movement

inferior olive

Hippocampus

theta rhythms 4-10 Hz

Behavior reference Information Critic information sent to the Actor

Actor

Cognitive map of the environment

  • place cells -

Critic

Motor control 200 Hz

Control signal

slide-42
SLIDE 42

Different methods of learning

Supervised learning Unsupervised learning Reinforcement learning

Machine learning- the formal study of learning systems

Synthesis of

  • Computational intelligence
  • Control systems
  • Neurobiology
slide-43
SLIDE 43
slide-44
SLIDE 44

Four ADP Methods proposed by Paul Werbos

Critic NN approximates:
  • Heuristic dynamic programming (HDP): the value V(x_k)
  • Dual heuristic programming (DHP): the value gradient (costate) \partial V / \partial x
  • AD heuristic dynamic programming (ADHDP) (Watkins Q learning): the Q function Q(x_k, u_k)
  • AD dual heuristic programming (ADDHP): the gradients \partial Q / \partial x, \partial Q / \partial u
An action NN approximates the control. Bertsekas: neurodynamic programming. Barto & Bradtke: Q-learning proof (imposed a settling time).

Adaptive (Approximate) Dynamic Programming

Value Iteration

slide-45
SLIDE 45

Q Learning- Watkins

Define the Q function for policy h:
Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})
where u_k is arbitrary and the policy h(.) is used after time k (timeline: at time k the state is x_k, u_k is applied, x_{k+1} is observed, and h(x) is used from then on).
Note Q_h(x_k, h(x_k)) = V_h(x_k), and the value-function recursion for the given policy h(x_k) is V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}).

Bellman equation for Q:
Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))

Simple expression of Bellman's principle:
V^*(x_k) = \min_{u_k} Q^*(x_k, u_k),  h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)

This is Action-Dependent ADP (Paul Werbos): optimal adaptive control for completely unknown DT systems.

slide-46
SLIDE 46

Q Function Definition

Specify a control policy u_j = h(x_j) for j = k+1, k+2, ..., with u_k arbitrary (the policy h(.) is used after time k).
Define the Q function: Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}); note Q_h(x_k, h(x_k)) = V_h(x_k).

Bellman equation for Q:
Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))

Optimal Q function:
Q^*(x_k, u_k) = r(x_k, u_k) + V^*(x_{k+1}) = r(x_k, u_k) + Q^*(x_{k+1}, h^*(x_{k+1}))

Optimal control solution:
V^*(x_k) = Q^*(x_k, h^*(x_k)) = \min_h Q^*(x_k, h(x_k)) = \min_{u_k} Q^*(x_k, u_k)
h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k) = \arg\min_h Q^*(x_k, h(x_k))
which is a simple expression of Bellman's principle.

slide-47
SLIDE 47

Q Learning does not need to know f(xk) or g(xk)

For the LQR, with system x_{k+1} = A x_k + B u_k and quadratic value V(x) = x^T P x = W^T \phi(x):
Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1}) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)
so Q is quadratic in x and u:
Q_h(x_k, u_k) = [x_k; u_k]^T [ Q + A^T P A   A^T P B
                               B^T P A       R + B^T P B ] [x_k; u_k]
             = [x_k; u_k]^T [ H_xx  H_xu
                              H_ux  H_uu ] [x_k; u_k]
The control update is found from \partial Q / \partial u_k = 0:
2 [ (R + B^T P B) u_k + B^T P A x_k ] = 2 [ H_uu u_k + H_ux x_k ] = 0
so
u_k = -(R + B^T P B)^{-1} B^T P A x_k = -H_uu^{-1} H_ux x_k = -L x_k
The control is found from the Q function alone: A and B are not needed.
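A small numeric sketch of this structure; A, B, P, Q, R are used here only to display the kernel H, whereas Q learning identifies H directly from data, after which the gain needs only H.

```python
import numpy as np

def q_kernel(A, B, P, Q, R):
    """Quadratic Q-function kernel H for the LQR: Q(x,u) = [x;u]' H [x;u]."""
    H_xx = Q + A.T @ P @ A
    H_xu = A.T @ P @ B
    H_ux = B.T @ P @ A
    H_uu = R + B.T @ P @ B
    return np.block([[H_xx, H_xu], [H_ux, H_uu]])

def gain_from_H(H, n):
    """Greedy policy directly from the Q kernel: L = H_uu^{-1} H_ux (no A, B needed)."""
    H_ux, H_uu = H[n:, :n], H[n:, n:]
    return np.linalg.solve(H_uu, H_ux)
```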

 

slide-48
SLIDE 48

Q Learning– Action Dependent HDP – Paul Werbos

Bradtke & Barto (1994) proved convergence for LQR Q function for any given control policy h(xk) satisfies the Bellman equation Policy Iteration Using Q Function- Recursive solution to HJB Pick stabilizing initial control policy Find Q function Update control

The Q function of any given control policy h(x_k) satisfies the Bellman equation
Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))

Policy iteration using the Q function (recursive solution of the HJB): pick a stabilizing initial control policy, then repeat:
Find the Q function: Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, h_j(x_{k+1}))
Update the control: h_{j+1}(x_k) = \arg\min_{u} Q_{j+1}(x_k, u)

Now f(x_k, u_k) is not needed.

slide-49
SLIDE 49

Implementation – DT Q-Function Policy Iteration

Q-function update for the given control policy u_k = -L_j x_k:
Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, -L_j x_{k+1})
QFA (Q-function approximation): Q(x, u) = W^T \phi(x, u); for the LQR, \phi(x, u) is a quadratic basis in (x, u). Now u is an input to the NN (Werbos: action-dependent NN).
Assume measurements of u_k, x_k and x_{k+1} are available to compute u_{k+1}. Then the update becomes a regression for the weights,
W_{j+1}^T \big( \phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1}) \big) = r(x_k, u_k)
with regression vector \phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1}). Solve for the weights using RLS or backpropagation. Since x_{k+1} is measured in the training phase, knowledge of f(x) or g(x) is not needed for the value-function update.

Bradtke and Barto
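A minimal sketch of that regression for the LQR case, assuming logged (x_k, u_k, x_{k+1}) data collected under the gain L_j with probing noise; quad_basis is a hypothetical helper building the quadratic basis of the stacked vector [x; u].

```python
import numpy as np

def quad_basis(z):
    """Quadratic basis: upper-triangular terms z_i z_j of the stacked vector z = [x; u]."""
    return np.array([z[i] * z[j] for i in range(len(z)) for j in range(i, len(z))])

def q_policy_evaluation(X, U, Xnext, L, Q, R):
    """Least-squares solve of W'(phi(x,u) - phi(x',-L x')) = x'Qx + u'Ru from data."""
    rows, targets = [], []
    for x, u, xn in zip(X, U, Xnext):
        un = -L @ xn                              # current policy at the next state
        rows.append(quad_basis(np.concatenate([x, u]))
                    - quad_basis(np.concatenate([xn, un])))
        targets.append(x @ Q @ x + u @ R @ u)
    W, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return W   # W parameterizes the Q kernel H; the new gain is -H_uu^{-1} H_ux
```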

slide-50
SLIDE 50

Q Policy Iteration (Bradtke, Ydstie, Barto): model-free policy iteration.
Q-function update: Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, -L_j x_{k+1})
in regression form W_{j+1}^T \big( \phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1}) \big) = r(x_k, u_k).
Control policy update: h_{j+1}(x_k) = \arg\min_u Q_{j+1}(x_k, u), i.e. u_{j+1} = -H_uu^{-1} H_ux x_k = -L_{j+1} x_k.
A stable initial control is needed.

Greedy Q-Function Update – Approximate Dynamic Programming. ADP Method 3: Q Learning = Action-Dependent Heuristic Dynamic Programming (ADHDP) (Paul Werbos): model-free HDP.
Greedy Q update: Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_j(x_{k+1}, h_j(x_{k+1}))
in regression form W_{j+1}^T \phi(x_k, u_k) = r(x_k, u_k) + W_j^T \phi(x_{k+1}, -L_j x_{k+1}) = target.
Update the weights by RLS or backpropagation. A stable initial control is NOT needed.

slide-51
SLIDE 51

Direct OPTIMAL ADAPTIVE CONTROL Q learning actually solves the Riccati Equation WITHOUT knowing the plant dynamics Model-free ADP Works for Nonlinear Systems

Proofs? Robustness? Comparison with adaptive control methods?

slide-52
SLIDE 52
slide-53
SLIDE 53

A Q-Learning Based Adaptive Optimal Controller Implementation for a Humanoid Robot Arm

Said Ghani Khan1, Guido Herrmann1, Frank L. Lewis2, Tony Pipe1, Chris Melhuish1

1: Bristol Robotics Laboratory, University of Bristol and University of the West of England, Bristol, UK; 2: ARRI, The University of Texas at Arlington, USA

Conference on Decision and Control (CDC) 2011, Orlando 11 December 2011

slide-54
SLIDE 54

BRL BERT II ARM

The mechanical design and manufacturing of the BERT II torso, including the hand and arm, was conducted by Elumotion (www.elumotion.com), a Bristol-based company.

slide-55
SLIDE 55

ADP Actor-Critic Scheme

Stingu et al. 2010

slide-56
SLIDE 56

The cost of control is modeled via an NN

Algorithm

slide-57
SLIDE 57

Algorithm

The function z_k(x_k, u_k, d_k) is a vector, linear in the control, the control error, and the system states.

Selected elements of the Kronecker product of z_k are used as the activation functions of a polynomial neural network (greater detail later).

Note that the control signals u_k then appear at most quadratically in the resulting approximation; this again is a practical assumption.

slide-58
SLIDE 58

The cost function is modified to include constraints

Introducing Constraints



The new cost function....

qL is the joint limit.

slide-59
SLIDE 59

The NN nodes are obtained by the Kronecker product of : Additional neurons are added to deal with the extra nonlinearity due to constraints

Introducing Constraints - Modelling Q

slide-60
SLIDE 60

Constrained Case-Experiment

slide-61
SLIDE 61
slide-62
SLIDE 62

Reinforcement Learning Approach for Tele-Robotic Interaction Interface

Jartuwat Rajruangrabin and Dan Popa Automation & Robotics Research Institute The University of Texas at Arlington

slide-63
SLIDE 63

Experiment Platform

               

Haptic device used as the input interface; the commanded end-effector pose is
q_e(t) = [ x(t)  y(t)  z(t)  \theta_1(t)  \theta_2(t)  \theta_3(t) ]^T
Systems to be controlled: a simulated 7-degree-of-freedom robotic manipulator, and a 6-degree-of-freedom robotic manipulator mounted on a differential-drive mobile robot platform.

slide-64
SLIDE 64

Past Work / Challenges

[Block diagram: the operator input u drives the interface system; its state x_p commands the robot system, and the robot-interface system output is used for operator performance evaluation.]
Interface model: \dot x_p(t) = A_p x_p(t) + B_p u(t)
A set of input/output pairs has to be specified.

  • Desired trajectory is known

What if we cannot specify the desired output trajectory directly?

  • Use Reinforcement Learning

Challenges

slide-65
SLIDE 65

Interface Mapping

What is it exactly?

The interface mapping relates the operator input x to the robot output y: y = f(x).

What can we do to get f(x)? The simplest way is to obtain a set of inputs and a set of outputs and calculate the relationship (curve fitting): a curve-fitting algorithm fed with (x, y) data produces an estimate \hat f(x).

slide-66
SLIDE 66

Static Mapping Approach

[Static mapping: (x, y) data are collected once, a curve-fitting algorithm produces \hat f(x), and this fixed map is then used as the interface.]

slide-67
SLIDE 67

Reinforcement Learning

[As before, curve fitting of (x, y) data yields \hat f(x).]

What if we cannot specify the desired output trajectory directly? Reinforcement learning: "learning by interacting with an environment." With RL we do not have to specify a desired trajectory; instead, a reward function is used.

slide-68
SLIDE 68

NN Training

A steepest-descent gradient algorithm is used in conjunction with backpropagation to train the feedforward network. The log-sigmoid function is used as the neuron activation function:
\sigma(x) = 1 / (1 + e^{-x}),  \Delta w \propto -\partial E / \partial w
slide-69
SLIDE 69

RL Implementation Haptic/Robot Arm on a Mobile Robot Mapping Model:

  • Use a non-linear dynamic model

Step 1(Initialization): Train the NN so

that the weights are optimal according to the desired trajectory

Step 2: (Online Learning) Implement

the TD(λ) learning algorithm

Reward function r_{V_R}: defined piecewise in terms of the output y relative to the extremes y_max and y_min of the commanded motion, and 0 otherwise. The learned interface mapping is propagated as y_{k+1} = f_w(y_k, x).

Objective:

  • Exp1: Implement RL with a reward function that allows the user to control the movement of the mobile platform.
  • Exp2: Reverse the direction mapping of the mobile platform, based on Exp1.

Reward function r_{V_L}: defined analogously (piecewise in y relative to y_max and y_min, and 0 otherwise).

slide-70
SLIDE 70

Experiment Result – Online TD(λ) Learning

The reward function is set so that it is maximized when Y moves along the opposite direction (with an offset).
slide-71
SLIDE 71

Experiment Result – Contour Shaping

Reverse and Scale Y - Mapping Update Through TD(λ) Learning Algorithm

slide-72
SLIDE 72

NEW ROBOTIC TREATMENT SYSTEMS FOR CHILDHOOD CEREBRAL PALSY and AUTISM

D. O. Popa (A), I. Ranatunga (A), D. Hanson (B), F. Makedon (C)

(A) Department of Electrical Engineering, University of Texas at Arlington, USA; (B) Hanson Robotics Inc., Plano, TX, USA; (C) Department of Computer Science & Engineering, University of Texas at Arlington, USA

This work was supported by: US National Science Foundation Grants #CPS 1035913, and #CNS 0923494. TxMed consortium grant: “Human-Robot Interaction System for Early Diagnosis and Treatment of Childhood Autism Spectrum Disorders (RoDiCA)”

slide-73
SLIDE 73

Introduction

 Two assistive robotic systems aimed at the treatment of children with certain motor and cognitive impairments.  In the Neptune project [1]

– Mobile manipulator for children suffering from Cerebral-Palsy. – Mobile robot base and a 6DOF robotic arm, interfaced via:

  • Wii Remote, iPad, Neuroheadset, the Kinect, and

Force sensing robotic skin

– Therapeutic outcomes

  • Hand and head gesture recognition and reward.
  • Hand motion exercises using iPad games

(CPlay, CPMaze, ProlloquoToGo) held by the robot.

 The RoDiCA project [2]

– focuses on treating cognitive impairments in children suffering from ASD – Zeno is a robotic platform developed by Hanson Robotics, based on a patented realistic skin. – Therapeutic outcomes

  • Real time subject tracking/joint attention
  • Advanced head-eye and hand coordination
  • Facial gesture recognition and synthesis
  • Data logging and analysis.


Zeno (by Hanson RoboKind Inc.) generating facial expressions and maintaining eye contact. Neptune Mobile manipulator with iPad attached.

slide-74
SLIDE 74

Advanced Control for Human Robot Interaction


Zeno Video

Neptune Control through Neural Headband

Visual HRI

Realistic & Intuitive Human‐Robot Interaction Physical HRI Recognize & Synthesize poses and gestures Adaptive Interfaces

Robot Touch HRI

slide-75
SLIDE 75

Adaptive Interfaces

 The supervisory control of multi-DOF robots is a demanding application.  If a single operator is tasked with direct control, performing coordinated tasks becomes non-intuitive.  We use a reinforcement learning TD(λ) scheme to adaptively change the mapping of DOFs from the operator user interface to the robot.


[Actor-critic interface adaptation loop: the interface input (action) drives the interface mapping and state propagation; a system-metrics evaluation of the resulting state produces the reward; the critic forms the TD error, which updates both the value function (critic) and the interface-mapping policy (actor).]
The value function and the policy are parameterized by weights w that are updated in proportion to the TD error.
slide-76
SLIDE 76

Our revels now are ended. These our actors,
As I foretold you, were all spirits, and
Are melted into air, into thin air.
The cloud-capped towers, the gorgeous palaces,
The solemn temples, the great globe itself,
Yea, all which it inherit, shall dissolve,
And, like this insubstantial pageant faded,
Leave not a rack behind. We are such stuff
As dreams are made on, and our little life
Is rounded with a sleep.
Prospero, in The Tempest, act 4, sc. 1, l. 152-6, Shakespeare

slide-77
SLIDE 77
slide-78
SLIDE 78

An Approximate Dynamic Programming Based Controller for an Underactuated 6DoF Quadrotor

Emanuel Stingu Frank Lewis

Automation & Robotics Research Institute University of Texas at Arlington

ARRI

Supported by ARO grant W91NF-05-1-0314 NSF grant ECCS-0801330

slide-79
SLIDE 79

The quadrotor has 17 states and only 4 control inputs, so it is very under-actuated. Three control loops with dynamic inversion are used to generate the 4 control signals.

[Diagram labels: x, y, z axes; amount of tilt; yaw; direction of tilt.]

[Control structure: the reference trajectory and position/velocity errors feed the translation and yaw controller (altitude control sets the thrust force; dynamic inversion maps accelerations to a desired attitude); the attitude controller uses dynamic inversion to map rotational accelerations to desired forces; the motor and propeller controller applies momentum theory to the propeller and dynamic inversion to map thrust forces to motor voltages (motor commands).]

slide-80
SLIDE 80

Approximate Dynamic Programming

Structure: a global critic and three local actors, one per control loop (translation and yaw controller, attitude controller, motor and propeller controller), driven by the state vector and references (reference trajectory, position and velocity errors, desired attitude, desired forces, motor commands); the critic supplies the actor tuning.

The actor is u_k = h(x_k, z_k), where
h(x_k, z_k) = [ h_1^T(x_k^1, z_k^1)  h_2^T(x_k^2, z_k^2)  h_3^T(x_k^3, z_k^3) ]^T
and x_k^j, z_k^j are subsets of the state and tracking-error vectors; the adaptive control vector collects the commands of the three loops,
u_{ad} = [ a_z^{ad}  \Omega^{ad}  q^{ad}  V_{mot}^{ad} ]^T
The critic is
V_h(x_k, z_k) = \sum_{j=k}^{\infty} r(x_j, z_j, u_j)

slide-81
SLIDE 81

Approximate Dynamic Programming

Once the value of the Q function at (x_{k+1}, z_{k+1}, u_{k+1}) is known, a backup of it is made into the RBF neural network by adjusting the weights W and/or by adding more neurons and reconfiguring their other parameters. This is a separate process that just needs to know the coordinates (x, z, u) and the new value to store. The Q function is represented as Q(x_k, z_k, u_k) = W^T \phi(x_k, z_k, u_k) and is evaluated against the target r(x_k, z_k, h(x_k, z_k)) + Q(x_{k+1}, z_{k+1}, u_{k+1}).

The update of the Q value is not made completely towards the new value:
Q_{stored} = Q_{old} + \delta ( Q_{new} - Q_{old} ),  0 < \delta \le 1
This slows down the learning, but adds robustness.

The policy update step is done by simply solving
\partial Q(x_k, z_k, u) / \partial u = 0
after the new Q value has been stored. The resulting value of h(x_k, z_k) is stored into the actor RBF neural network using the same mechanism as before.
slide-82
SLIDE 82

The Curse of Dimensionality

The actor acts as a nonlinear function approximator. Normally we would have u_k = h(x_k); in the quadrotor case, because the reference is not zero and the system is nonlinear, we need u_k = h(x_k, z_k). For each of the position, attitude and motor/propeller loops, the state vector includes the local states and the external states that have a strong coupling effect on the loop performance. It is easy to see that this way the input space can easily have n = 14 or more dimensions. An RBF neural network with its neurons placed on a grid with N elements in each dimension would require N^n neurons: for N = 5 and n = 14, about 6 x 10^9 neurons are required.

Placing neurons on a grid is no better than a look-up table. The solutions to reducing the number of neurons are the following:
  • preprocess the states to provide signals with physical significance as inputs;
  • combine multiple states into a lower-dimensional signal;
  • map multiple equivalent regions of the state space onto only one.
slide-83
SLIDE 83