Reinforcement Learning for Continuous State and Action Spaces

SLIDE 1

MACHINE LEARNING TECHNIQUES AND APPLICATIONS – 2012

Reinforcement Learning for Continuous State and Action Spaces
Gradient Methods

SLIDE 2

Reinforcement Learning (RL)

Supervised learning, unsupervised learning: reinforcement learning lies in between. The agent is not told the correct action (as in supervised learning), but it does receive evaluative feedback in the form of a reward.

SLIDE 3

Drawbacks of classical RL

  • Curse of dimensionality: computational costs increase dramatically with the number of states.
  • Markov world: cannot handle continuous action and state spaces.
  • Model-based vs. model-free: needs a model of the world (which can be estimated through exploration).
→ Gradient methods to handle continuous state and action spaces.

SLIDE 4

RL in continuous state and action spaces

States and actions $s_t, a_t$, $t = 1 \dots T$, are continuous: $s_t \in \mathbb{R}^N$, $a_t \in \mathbb{R}^P$. One can no longer sweep through all states and actions to determine the optimal policy. Instead, one can either:

1) use function approximation to estimate the value function $V(s)$,
2) use function approximation to estimate the state-action value function $Q(s, a)$, or
3) optimize a parameterized policy $\pi(a \mid s)$ (policy search).

SLIDE 5

Policy Gradients

Parametrize the value function: $V(s; \theta)$, where $\theta$ are the open parameters.

Assume an initial estimate $V(s; \theta)$. Run an episode using a greedy policy $\pi(a \mid s; \theta)$. Compute the error on the estimate: $e = V(s_t; \theta) - E\left[\sum_k r_{t+k}\right]$. Find the optimal parameters through gradient descent on the error in the state value function: $\Delta\theta \sim \nabla_\theta V(s; \theta)\, e$.
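The slides leave the gradient step abstract; the following numpy sketch fits $V(s;\theta) = \theta^T \phi(s)$ to Monte Carlo returns by gradient descent on the squared error. The drifting toy state (standing in for episodes under a fixed policy), the reward, the RBF features, and the learning rate are all invented for illustration, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(s, centers, width=0.5):
    """RBF features phi(s) for a scalar state s."""
    return np.exp(-((s - centers) ** 2) / (2 * width ** 2))

def run_episode(T=50):
    """Toy 1-D episode: the state drifts randomly (a stand-in for a fixed
    policy); the reward peaks at the origin. Returns the visited states and
    the Monte Carlo return observed from each step."""
    s, states, rewards = 0.0, [], []
    for _ in range(T):
        states.append(s)
        rewards.append(np.exp(-s ** 2))
        s += rng.normal(0.0, 0.3)
    returns = np.cumsum(rewards[::-1])[::-1]   # undiscounted return from each step
    return np.array(states), returns

centers = np.linspace(-2, 2, 10)   # K = 10 RBF centers
theta = np.zeros_like(centers)     # open parameters of V(s; theta)
lr = 0.05

for episode in range(200):
    states, returns = run_episode()
    for s, G in zip(states, returns):
        phi = features(s, centers)
        e = theta @ phi - G            # error: estimate minus observed return
        theta -= lr * e * phi          # gradient descent on the squared error
```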

SLIDE 6

TD learning for continuous state action space

Doya, NIPS 1996

Parametrize the value function such that: $V(s;\theta) = \sum_{j=1}^{K} \theta_j \phi_j(s) = \theta^T \phi(s)$.

$\phi_j(s)$, $j = 1 \dots K$, is a set of basis functions (e.g. RBF functions). These are set by the user and are also called the features. $\theta_j$ are the weights associated to each feature; these are the unknown parameters.

SLIDE 7

TD learning for continuous state action space

Doya, NIPS 1996

Pick a set of parameters $\theta$ and compute $V(s;\theta) = \sum_{j=1}^{K} \theta_j \phi_j(s)$. Do some roll-outs of fixed episode length $T$ and gather $r(s_t)$, $t = 1 \dots T$. Estimate a new $V(s;\theta)$ through TD learning: the TD target is $\hat{V}(s_t;\theta) = r(s_t) + \gamma\, V(s_{t+1};\theta)$.

Gradient descent on the squared TD error gives:

$\theta_j \leftarrow \theta_j + \epsilon\, \delta_t\, \phi_j(s_t), \quad j = 1 \dots K,$

with the TD error $\delta_t = r(s_t) + \gamma\, V(s_{t+1};\theta) - V(s_t;\theta)$.

Other techniques for estimating non-linear regression functions have been proposed elsewhere to get a better estimate of the parameters than simple gradient descent.
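A minimal sketch of this TD update with RBF features; the toy dynamics and reward are again stand-ins, and the slide's step size ε appears as `lr`:

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(s, centers, width=0.5):
    """RBF features, set by the user ("the features")."""
    return np.exp(-((s - centers) ** 2) / (2 * width ** 2))

centers = np.linspace(-2, 2, 10)   # K basis functions
theta = np.zeros_like(centers)     # unknown weights
gamma, lr, T = 0.95, 0.05, 50

for rollout in range(200):         # roll-outs of fixed episode length
    s = rng.uniform(-2, 2)
    for t in range(T):
        s_next = s + rng.normal(0.0, 0.3)          # toy transition
        r = np.exp(-s ** 2)                        # toy reward r(s_t)
        # TD error: r + gamma * V(s_{t+1}) - V(s_t)
        delta = r + gamma * theta @ phi(s_next, centers) - theta @ phi(s, centers)
        theta += lr * delta * phi(s, centers)      # gradient step on squared TD error
        s = s_next
```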

SLIDE 8

Robotics Applications of continuous RL

Teaching a two-joint, three-link robot leg to stand up

Robot configuration. θ0: pitch angle, θ1: hip joint angle, θ2: knee joint angle, θm: the angle of the line from the center of mass to the center of the foot.

Morimoto and Doya, Robotics and Autonomous Systems, 2001

SLIDE 9

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems, 2001

  • The final goal is the upright stand-up posture.
  • Reaching the final goal is a necessary but not sufficient condition for a successful stand-up, because the robot may fall down after passing through the final goal → need to define subgoals.
  • The stand-up task is accomplished when the robot stands up and stays upright for more than 2(T + 1) seconds.

SLIDE 10

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems, 2001

Hierarchical reinforcement learning:

Upper layer: discretized state-action space; discover which subgoal should be reached, using Q-learning.
  State: pitch and joint angles.
  Actions: joint displacements (not the torques).

Lower layer: continuous state-action space; the robot learns to apply the appropriate torques to achieve each subgoal.
  State: pitch and joint angles, and their velocities.
  Actions: torques at the two joints.

SLIDE 11

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems, 2001

Reward: decompose the task into upper-level and lower-level sets of goals. Y is the height of the head of the robot at a sub-goal posture, and L is the total length of the robot.

When the robot achieves a sub-goal, the learner gets a reward < 0.5. The full reward is obtained only when the subgoals and the main goal are all achieved.

SLIDE 12

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems, 2001

State and action are continuous in time: $s(t), u(t)$. Model the value function: $V(s;\theta) = \theta^T \phi(s)$. Model the policy: $u(s) = f\!\left(\omega^T \psi(s) + \nu\right)$, with $f$ a sigmoid function and $\nu$ noise. Learn through gradient descent on the squared TD error: backpropagate the TD error $\delta(t)$ to estimate the parameters, $\theta_j \leftarrow \theta_j + \eta\, \delta(t)\, e_j(t)$, $j = 1 \dots K$, where $e_j(t)$ is the eligibility trace of parameter $\theta_j$.
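A rough, Euler-discretized sketch of such an actor-critic scheme: the critic backpropagates the TD error through eligibility traces, and the actor correlates its exploration noise with the TD error. Doya's formulation is in continuous time; the dynamics, reward, and all constants here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(s, centers, width=0.5):
    return np.exp(-((s - centers) ** 2) / (2 * width ** 2))

centers = np.linspace(-2, 2, 10)
theta = np.zeros_like(centers)    # critic weights (value function)
omega = np.zeros_like(centers)    # actor weights (policy)
trace = np.zeros_like(centers)    # eligibility traces for the critic
dt, kappa, eta, gamma = 0.02, 0.5, 0.1, 0.98

s = 0.0
for step in range(20000):
    nu = rng.normal(0.0, 0.5)                     # exploration noise
    u = np.tanh(omega @ phi(s, centers) + nu)     # sigmoid policy with noise
    s_next = s + dt * u - dt * 0.5 * s            # toy stable dynamics
    r = np.exp(-s ** 2)                           # toy reward, peaks at s = 0
    delta = r + gamma * theta @ phi(s_next, centers) - theta @ phi(s, centers)
    trace += dt * (-trace / kappa + phi(s, centers))  # traces decay and accumulate
    theta += eta * delta * trace                  # critic: TD error via the traces
    omega += eta * delta * nu * phi(s, centers)   # actor: correlate noise with TD error
    s = s_next
```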

SLIDE 13

Robotics Applications of continuous RL

  • 750 trials in simulation + 170 on real robot
  • Goal: to stand up

Morimoto and Doya, Robotics and Autonomous Systems, 2001

SLIDE 14

Learning how to swing up a pendulum

Separate the problem into two parts:
a) learning to swing the pole up,
b) learning to balance the pole upright.

Part a is learned through TD-learning. Part b is learned using optimal control. Additionally, a model of the dynamics of the inverted pendulum is estimated through locally weighted regression (estimating the parameters of a known model).

Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Robotics Applications of continuous RL

SLIDE 15

Learning how to swing up a pendulum

Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Collect a sequence $\{s_t^*, u_t^*\}_{t=1}^{T}$ from a single human demonstration. State of the system: $s_t = (x_t, \dot{x}_t, \theta_t, \dot{\theta}_t)$ (human hand trajectory and velocity + angular trajectory and velocity of the pendulum). Actions: $u_t$ (can be converted into a torque command to the robot). Reward: minimizes angular and hand displacements as well as velocities.

Robotics Applications of continuous RL

SLIDE 16

Learn first a model of the task through RL, using continuous TD learning to estimate $V(s;\theta)$: take a model $V(s;\theta) = \sum_{j=1}^{K} \theta_j \phi_j(s)$. Parameters are estimated incrementally through non-linear function approximation (locally weighted regression).

Learn then the reward: $r(x_t, u_t, t) = (x_t - x_t^*)^T Q\, (x_t - x_t^*) + u_t^T R\, u_t$, where $Q$ and $R$ are estimated so as to minimize discrepancies between human and robot trajectories. Use the reward to generate optimal trajectories through optimal control.

Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Robotics Applications of continuous RL

SLIDE 17

Learning how to swing up a pendulum

Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Robotics Applications of continuous RL

SLIDE 18

RL in continuous state and action spaces

1....

States and actions , , are continuous: , One can no longer swipe through all states and actions to determine the optimal policy. Instead, one can either

t t N P t t

t T

s a s a

 

 

: 1) use function approximation to estimate the value function V(s) 2) use function approximation to estimate the state-action value function Q(s, a) 3) optimize a parameterized policy | (policy sear a s  ch)

SLIDE 19

Least-Squares Policy Iteration

SLIDE 20

Least-Squares Policy Iteration

Approximate the state-action value function: $Q(s,a;\alpha) = \sum_{j=1}^{K} \alpha_j \phi_j(s,a) = \alpha^T \phi(s,a)$.

$\phi_j(s,a)$, $j = 1 \dots K$, with $\phi_j : \mathbb{R}^N \times \mathbb{R}^P \to \mathbb{R}$, is a set of basis functions. These are set by the user and are also called the features. $\alpha_j$ are the weights associated to each feature; these are the unknown parameters.
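When the actions are discrete (as in the bicycle example below), a common way to build such features is to stack one block of state features per action; a sketch, with the particular state terms chosen arbitrarily:

```python
import numpy as np

def q_features(s, a, n_actions):
    """phi(s, a) as one block of state features per discrete action: only the
    block of the chosen action is non-zero. The state terms here (constant,
    linear, quadratic) echo the bike example, but are illustrative."""
    phi_s = np.concatenate(([1.0], s, s ** 2))
    k = phi_s.size
    phi = np.zeros(n_actions * k)
    phi[a * k:(a + 1) * k] = phi_s
    return phi

def q_value(alpha, s, a, n_actions):
    """Q(s, a; alpha) = alpha^T phi(s, a)."""
    return alpha @ q_features(s, a, n_actions)

# e.g. q_value(alpha, np.array([0.1, -0.3]), a=2, n_actions=5)
```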

SLIDE 21

Least-Squares Policy Iteration

Perform a roll-out for $T$ time steps using policy $\pi(a \mid s;\alpha)$ (greedy search, $\operatorname{argmax}_a Q(s,a;\alpha)$, using the current estimate of $Q(s,a;\alpha)$). Collect samples $s_{1:T+1}, a_{1:T}, r_{2:T+1}$.

Determine how good an initial choice of parameters $\alpha$ is by comparing the predicted reward and the actual reward through the Bellman residual error:

$J(\alpha) = \sum_{t=1,\dots,T} \left( r_{t+1} + \gamma\, Q(s_{t+1}, a'_{t+1};\alpha) - Q(s_t, a_t;\alpha) \right)^2,$

with $a'_{t+1} = \pi(s_{t+1};\alpha)$ and $\gamma \in [0,1]$ the discount factor.

See Lagoudakis & Parr, Journal of Machine Learning Research, 2003, for various numerical solutions to determine the optimal alpha parameters.
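A compact sketch of the least-squares solve at the heart of LSPI (the LSTDQ step of Lagoudakis & Parr), alternated with greedy policy improvement. It can be used with a featurizer like the `q_features` above; the ridge term and iteration count are illustrative choices:

```python
import numpy as np

def lstdq(samples, greedy_action, featurize, k, gamma=0.95):
    """One policy-evaluation step of LSPI (LSTDQ).
    samples: list of (s, a, r, s_next) transitions.
    greedy_action(s): action the current policy takes in s.
    Solves A alpha = b so the Bellman equation holds in least squares."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        phi = featurize(s, a)
        phi_next = featurize(s_next, greedy_action(s_next))
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A + 1e-6 * np.eye(k), b)  # small ridge for stability

def lspi(samples, featurize, actions, k, n_iters=10, gamma=0.95):
    """Policy iteration: alternate LSTDQ with greedy policy improvement."""
    alpha = np.zeros(k)
    for _ in range(n_iters):
        greedy = lambda s: max(actions, key=lambda a: alpha @ featurize(s, a))
        alpha = lstdq(samples, greedy, featurize, k, gamma)
    return alpha
```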

SLIDE 22

LSPI: Example learning to ride a bike

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

Goal: learn to balance and ride a bicycle to a target position located 1 km away from the starting location.

Continuous state: a six-dimensional real-valued vector, including
  • the angle of the handlebar,
  • the vertical angle of the bicycle,
  • the angle of the bicycle to the goal.

5 discrete actions: torque applied to the handlebar (discretized to {−2, 0, +2}) and displacement of the rider (discretized to {−0.02, 0, +0.02}).

Model of the world: uniform noise on each action; a model of the bike's dynamics.

The state-action value function Q(s, a), for a fixed action a, is approximated by a linear combination of 20 basis functions (linear and quadratic combinations of the state and its 1st- and 2nd-order derivatives).

SLIDE 23

LSPI: Example learning to ride a bike

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

Collect training samples using a random policy; each episode lasts 20 steps. The learned policy is evaluated 100 times to estimate the probability of success (reaching the goal). A shaping reward was used: 1% of the change (in meters) in the distance to the goal, plus the change in the square of the vertical angle.

The policy after the first iteration balances the bicycle but fails to ride to the goal. The policy after the second iteration heads towards the goal but fails to balance. All policies thereafter balance and ride the bicycle to the goal. Crashes still happen because of noise in the model.

SLIDE 24

LSPI: Example learning to ride a bike

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

Successful policies are found after a few thousand training episodes. With 5000 training episodes (60000 samples), the probability of success is about 95% and the expected number of balancing steps is about 70000.

SLIDE 25

RL in continuous state and action spaces

1....

States and actions , , are continuous: , One can no longer swipe through all states and actions to determine the optimal policy. Instead, one can either

t t N P t t

t T

s a s a

 

 

: 1) use function approximation to estimate the value function V(s) 2) use function approximation to estimate the state-action value function Q(s, a) 3) optimize a parameterized policy | (policy sear a s  ch)

SLIDE 26

Policy Gradients

An alternative is to parametrize the stochastic policy: $\pi(a \mid s)$ is approximated by drawing from the distribution $p(a \mid s; \theta)$, where $\theta$ are the parameters of the policy.

The actor, in state $s_t$, chooses an action $a_t$ following the stochastic policy $\pi(a \mid s_t)$, drawing from the distribution $p(a \mid s_t; \theta)$.

SLIDE 27

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

Starting from a deterministic estimate of the policy, $a = \pi(s;\theta) = \theta^T \phi(s) = \sum_{j=1}^{K} \theta_j \phi_j(s)$, where the $\phi_j(s)$ are again a set of known basis functions and the $\theta_j$ the weights for each basis function.

Adding some noise $\epsilon_t$ and making the policy explicitly time-dependent, $a(s,t) = (\theta + \epsilon_t)^T \phi(s,t)$ with $\epsilon_t \sim N(0, \hat\Sigma)$, leads to the stochastic estimate (structured state-dependent exploration):

$\pi(a \mid s, t; \theta) \sim N\!\left(a \mid \theta^T \phi(s,t),\; \phi(s,t)^T \hat\Sigma\, \phi(s,t)\right)$
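A sketch of this structured exploration; here the basis functions depend only on (normalized) time, which is a simplifying assumption, and Σ̂ is taken diagonal:

```python
import numpy as np

rng = np.random.default_rng(3)

def basis(t, centers, width=0.15):
    """Gaussian basis functions over normalized time (a stand-in for phi(s, t))."""
    return np.exp(-((t - centers) ** 2) / (2 * width ** 2))

centers = np.linspace(0, 1, 10)          # K basis functions
theta = np.zeros_like(centers)           # policy parameters
sigma2 = 0.1 * np.ones_like(centers)     # diagonal of Sigma-hat

def rollout_actions(theta, T=50):
    """Structured exploration: perturb the *parameters*, a = (theta + eps)^T phi,
    rather than adding independent noise to each action. The induced action
    noise has variance phi^T Sigma-hat phi, i.e. it is state/time dependent."""
    actions, eps_used = [], []
    for t in np.linspace(0, 1, T):
        eps_t = rng.normal(0.0, np.sqrt(sigma2))   # eps_t ~ N(0, Sigma-hat)
        actions.append((theta + eps_t) @ basis(t, centers))
        eps_used.append(eps_t)
    return np.array(actions), np.array(eps_used)
```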

SLIDE 28

Policy learning by Weighted Exploration with the Returns (PoWER). Finite horizon T; episodic restarts.

Kober and Peters, Machine Learning, 2011

At each roll-out (episode), the agent gathers a tuple of rewards and associated states and actions: $\tau = (s_{1:T+1}, a_{1:T}, r_{1:T+1})$. Each roll-out is generated by using a policy with parameters $\theta$: $\tau \sim p_\theta(\tau)$, with $p_\theta(\tau) = p(s_1)\, \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t, t;\theta)$.

The cumulative reward for roll-out $\tau$ is: $R(\tau) = \frac{1}{T} \sum_{t=1}^{T} r(s_t, s_{t+1}, a_t, t)$.

SLIDE 29

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

At each new trial, the parameters are computed as the weighted average of all previous trials, plus some Gaussian noise:

$\theta_{i+1} = \frac{\sum_i w_i\, \theta_i}{\sum_i w_i} + \epsilon_i, \quad \epsilon_i \sim N(0, \hat\Sigma), \quad i = 1 \dots \text{number of trials},$

where the weights $w_i$ are given by the returns of the corresponding trials.
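A toy episodic sketch of this return-weighted averaging; the return function, the noise scale, and the rule of reusing only the ten best past trials (a stand-in for the importance sampling used in the paper) are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

K, sigma = 8, 0.3
theta = np.zeros(K)                    # current mean policy parameters
target = np.linspace(-1, 1, K)         # hidden optimum, unknown to the learner

def episode_return(params):
    """Toy stand-in for running a roll-out and measuring its return."""
    return np.exp(-np.sum((params - target) ** 2))

history = []                           # (trial parameters, return)
for trial in range(300):
    eps = rng.normal(0.0, sigma, K)        # Gaussian exploration noise
    params = theta + eps                   # new trial = mean + noise
    history.append((params, episode_return(params)))
    best = sorted(history, key=lambda x: -x[1])[:10]   # reuse the best past trials
    w = np.array([R for _, R in best])                 # weights = returns
    P = np.array([p for p, _ in best])
    theta = (w @ P) / w.sum()              # return-weighted average of trials

print("final return:", episode_return(theta))
```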

SLIDE 30

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

Teaching a robot the ball-in-a-cup task.
State: joint angles and velocities of the robot + Cartesian coordinates of the ball.
Action: joint-space accelerations.

$\pi(a \mid x, \dot{x}, t) \sim N\!\left(\theta^T \phi(x, \dot{x}, t),\; \hat\Sigma\right)$, with 31 basis functions per DOF.

SLIDE 31

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

SLIDE 32

Policy learning by Weighted Exploration with the Returns (PoWER)

SLIDE 33

Inverse Reinforcement Learning

A major difficulty in RL is to determine the reward. The reward encapsulates key information about the task, and a poor choice may lead to suboptimal solutions. IRL was proposed as a framework to estimate what the reward could be: the reward is chosen such that it best explains the observed behavior of an "expert agent", i.e. an agent that knows how to perform the task (in an optimal manner). Estimation of the policy and the reward is then done simultaneously.

SLIDE 34

Inverse Reinforcement Learning

Consider first a discrete state-action space. Assume a parametrization of the reward function: $r(s) = \sum_{j=1}^{K} \theta_j \phi_j(s) = \theta^T \phi(s)$, where the $\phi_j : S \to [0,1]$ are bounded basis functions and $\theta_j \in [0,1]$, so that the reward is also bounded: $r(s) \in [0,1]$.

Abbeel & Ng, Int. Conf. on Machine Learning, 2004

SLIDE 35

Inverse Reinforcement Learning

Given $r(s) = \theta^T \phi(s)$, the value of a given policy $\pi$ over a $T$ time-step roll-out is:

$V^\pi = E\!\left[\sum_{t=1}^{T} \gamma^t\, r(s_t) \,\middle|\, \pi\right] = E\!\left[\sum_{t=1}^{T} \gamma^t\, \theta^T \phi(s_t) \,\middle|\, \pi\right] = \theta^T\, E\!\left[\sum_{t=1}^{T} \gamma^t\, \phi(s_t) \,\middle|\, \pi\right] = \theta^T \mu(\pi)$

$\mu(\pi)$: the expected features given policy $\pi$. A particular policy is thus evaluated as a linear combination of features.
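Estimating μ(π) from sampled roll-outs is a simple Monte Carlo average; a sketch, where `phi` is any user-supplied feature map:

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.95):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t) | pi],
    from M sampled trajectories (each a list of states) under policy pi."""
    mu = None
    for states in trajectories:
        disc = np.array([gamma ** t for t in range(len(states))])
        feats = np.array([phi(s) for s in states])   # T x K feature matrix
        contrib = disc @ feats                       # sum_t gamma^t phi(s_t)
        mu = contrib if mu is None else mu + contrib
    return mu / len(trajectories)

# The value of the policy under reward r(s) = theta^T phi(s) is then simply:
# V_pi = theta @ feature_expectations(trajectories, phi)
```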

SLIDE 36

Inverse Reinforcement Learning

Step 1:
i) Record a set of M demonstrated paths (a series of roll-outs), yielding a set of $M(T+1)$ state transitions $\left\{ s_t^i \right\}_{t=1:T+1}^{i=1:M}$.
ii) Build an estimate of the optimal policy $\pi^*(s,a)$ of the demonstrator by computing the expected features for that policy: $\mu(\pi^*) = \frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T} \gamma^t\, \phi(s_t^i)$.

SLIDE 37

Inverse Reinforcement Learning

Step 2: Initialize the agent's policy $\pi^0(s,a)$, e.g. by picking a random policy, and then do, for i = 0...n iterations:
i) Approximate (e.g. via Monte Carlo) the expected features $\mu(\pi^i)$ of the policy $\pi^i$.
ii) Find the optimal weights $\theta^{i+1}$, solution of $\max_{\theta} \min_{j \in \{0,\dots,i\}} \theta^T \left( \mu(\pi^*) - \mu(\pi^j) \right)$, and let $c^{i+1}$ be the attained margin.
iii) Set the current estimate of the reward to be $r(s) = (\theta^{i+1})^T \phi(s)$ and use RL to find the optimal associated policy $\pi^{i+1}$. Terminate if $c^{i+1}$ is inferior to a threshold.

SLIDE 38

Inverse Reinforcement Learning

The algorithm is trying to find a reward function $r(s) = \theta^T \phi(s)$ such that $V^{\pi^*} \geq V^{\pi^i} + c$, i.e. a reward on which the expert does better, by a margin of $c$, than any of the $i-1$ policies found previously.

The optimization problem is equivalent to: $\max_{c,\,\theta}\; c$ with constraints $\theta^T \mu(\pi^*) \geq \theta^T \mu(\pi^j) + c$, $j = 1 \dots i-1$, and $\|\theta\|_2 \leq 1$. This is akin to maximum-margin optimization in SVMs.
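The max-min problem can be solved as a QP, as in SVMs; Abbeel & Ng also describe a simpler projection variant, one step of which is sketched here (the clipping of λ is a defensive detail added for illustration):

```python
import numpy as np

def irl_projection_step(mu_expert, mu_list, mu_bar_prev=None):
    """One step of the projection variant of Abbeel & Ng's algorithm, which
    avoids solving the max-margin QP directly. mu_expert: the expert's feature
    expectations; mu_list: those of the policies found so far."""
    mu_i = mu_list[-1]
    if mu_bar_prev is None:
        mu_bar = mu_i
    else:
        # Project mu_expert onto the segment between mu_bar_prev and mu_i
        d = mu_i - mu_bar_prev
        lam = np.clip(d @ (mu_expert - mu_bar_prev) / (d @ d), 0.0, 1.0)
        mu_bar = mu_bar_prev + lam * d
    theta = mu_expert - mu_bar        # new reward weights (margin direction)
    c = np.linalg.norm(theta)         # margin estimate; stop when below threshold
    return theta, c, mu_bar
```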

SLIDE 39

Learning an optimal policy to drive a car, or learning driving styles (nice, nasty, right-lane nice, right-lane nasty, middle lane).

Abbeel & Ng, Int. Conf. on Machine Learning, 2004

IRL Example: Driving a car

The agent's car is faster than any other car and needs to learn when and how to pass other cars.
  • 5 actions (steer onto any of the 3 lanes, or outside the road)
  • 5 features indicating where the car is
  • 10 features for discretized distances to the closest car

SLIDE 40

IRL Example: Driving a car

30 iteration steps for learning each policy through IRL (from top to bottom, left to right: nice, nasty, right-lane nice, right-lane nasty, middle lane). Learning is based on 2 minutes of expert demonstration of each driving style.

SLIDE 41

IRL Example: Driving a car

SLIDE 42

Learning without reward: Donut

Tasks: flipping a block to stand on its end; launching a ball into a basket.

The robot is provided solely with failed examples. It has no information about the task: no reward, no indication of what was incorrect.

Grollman and Billard, ICRA 2012, Best Paper Award Cognitive Robotics

SLIDE 43

Build a distribution (DONUT) that moves away from the bad demonstrations:
→ Explore around the demonstrations and use the covariance to guide the exploration.
→ Move away from things that have been visited a lot during unsuccessful demonstrations, but remain within the vicinity of the demonstrations.

2D DONUT distribution: high likelihood of drawing datapoints in the vicinity of, but away from, the demonstrated datapoints.

Original distribution of points from failed demonstrations.

Learning without reward: Donut

SLIDE 44

Learning without reward: Donut

Exploration parameter: determines how far away the peaks are from the base distribution. Base distribution: learned from the training examples (failed attempts) using a GMM. Donut distribution → exploratory policy.

Continuous state: joint angles $s$. Continuous action: joint velocities $a$, drawn from the DONUT distribution, a difference of two normal distributions built on the base model:

$a \sim D(a \mid s) = \frac{1}{\epsilon}\, N\!\left(a \mid \hat\mu(s), \hat\Sigma(s)\right) - \left(\frac{1}{\epsilon} - 1\right) N\!\left(a \mid \hat\mu(s), \hat\Sigma(s)/f\right)$
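A sketch of how one might draw from such a difference-of-Gaussians density in 1-D via rejection sampling; the exact parametrization in Grollman & Billard differs, and the constants `inv_eps` and `f` are made up:

```python
import numpy as np

rng = np.random.default_rng(5)

def gauss(x, mu, var):
    """1-D normal density."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def donut_density(x, mu, var, inv_eps=3.0, f=4.0):
    """Difference of two normals sharing a mean (wide minus narrow), clamped
    at zero, so probability mass sits around mu but not on it."""
    return np.maximum(0.0, inv_eps * gauss(x, mu, var)
                      - (inv_eps - 1.0) * gauss(x, mu, var / f))

def sample_donut(mu, var, n=1000, inv_eps=3.0, f=4.0):
    """Rejection sampling with the wide normal (scaled by inv_eps) as the
    envelope; it dominates because the subtracted term is non-negative."""
    out = []
    while len(out) < n:
        x = rng.normal(mu, np.sqrt(var))
        if rng.uniform() < donut_density(x, mu, var, inv_eps, f) / (inv_eps * gauss(x, mu, var)):
            out.append(x)
    return np.array(out)

draws = sample_donut(0.0, 1.0)
print("mean |draw|:", np.abs(draws).mean())   # mass lies away from the demonstrated mean
```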

SLIDE 45

High coherence across trials → high confidence. Little coherence across trials → low confidence.

[Figure: velocity vs. angular position (radians) of the robot's wrist]

  • Search around the demonstrations.
  • Reproduce only the parts where all demonstrators agreed.
  • Avoid regions with high uncertainty.

Learning without reward: Donut

SLIDE 46


Learning without reward: Donut

SLIDE 47

  • Finds a solution in a few trials.
  • Is comparable in efficiency to classical reinforcement learning approaches.
  • But does not need a reward!

Learning without reward: Donut

SLIDE 48

Learning without reward: Donut

Multidimensional DONUT applied to learn how to play Golf

Region in parameter space that leads to a successful hit.

State: position of the ball with respect to the hole. Action: speed and orientation of the end-effector when hitting the ball.

SLIDE 49

Learning without reward: Donut

Multidimensional DONUT: create holes in the distribution located on each unsuccessful attempt.

Lightness corresponds to the likelihood of generating arbitrary 2D parameters. Red dots are human attempts; blue dots are system-generated trials.

SLIDE 50

Learning without reward: Donut

Multidimensional DONUT applied to learn how to play Golf

Grollman and Billard, IJSR, 2012

SLIDE 51

Summary

  • RL is a good alternative to supervised learning when you do not know what the optimal solution would be.
  • Drawbacks of classical discrete state-action RL can be overcome by considering continuous value functions.
  • The reward function can be estimated through IRL.
  • Learning without rewards, from purely unsuccessful examples, is as yet a little-explored problem.