SLIDE 1
ADVANCED MACHINE LEARNING Brief Overview of Discrete and Continuous Reinforcement Learning (not part of exam material)

SLIDE 2

Forms of Learning

  • Supervised learning – where the algorithm learns a function or model that best maps a set of inputs to a set of desired outputs.
  • Reinforcement learning – where the algorithm learns a policy or model of the set of transitions across a discrete set of input-output states (Markovian world) in order to maximize a reward value (external reinforcement).
  • Unsupervised learning – where the algorithm learns a model that best represents a set of inputs without any feedback (no desired output, no external reinforcement).

SLIDE 3

Learning how to stand up

Morimoto and Doya, Robotics and Autonomous Systems , 2001

Example of RL

SLIDE 4

Reinforcement learning: Sequential Decision Problem

Problem: search for a mapping from states to actions, $f: s_t \mapsto a_t$.

Task: get rock samples. Feedback: success or failure. It is up to the robot to figure out the best solution!

What are the rewards?

Exploration: the robot has to try and explore multiple solutions to find the best. Let's try everything!

SLIDE 5

Supervised / semi-supervised learning

Problem: search for a mapping from states to actions, $f: s_t \mapsto a_t$.

In supervised learning, at each time step the expert provides pairs:

$(s_1, a_1), (s_2, a_2), (s_3, a_3), \dots, (s_T, a_T)$

In semi-supervised learning, only partial supervision is provided, e.g.:

$(s_1, ?), (s_2, a_2), (s_3, a_3), (s_4, a_4), \dots, (s_T, ?)$

The set of state-action pairs provided for training is optimal (expert teacher).

SLIDE 6

Learning how to swing up a pendulum

Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Example of RL bootstrapped with supervised learning

SLIDE 7

Reinforcement learning & supervised learning

The expert provides some examples of optimal state-action pairs and of the associated reward:

$(s_1, a_1, r_1), (s_2, a_2, r_2), (s_3, a_3, r_3), (s_4, a_4, r_4), \dots, (s_T, a_T, r_T)$

The agent searches for the solution by generating new roll-outs, i.e. action-state pairs in a neighbourhood around the expert's demonstrations. These solutions are not necessarily optimal. The expert provides a reward for these roll-outs:

$(s_1, a_1, r_1), (s_2, a_2, r_2), (s_3, a_3, r_3), \dots, (s_T, a_T, r_T)$

SLIDE 8

The Reward

The reward shapes the learning. Choosing it well is crucial to the success of learning. Imagine that you want to train a robot to learn to walk.

  • What reward would you choose for training a robot to stand-up?
  • What is the dimension of the state-action space?
  • How long would it take to learn through trial and error?
  • How is the reward helping reduce this number?

UC Berkeley, Darwin Robot

SLIDE 9

The Reward

One could choose a more complex (more informative) reward: Reward = penalty for deviation of the center of mass from the equilibrium point + reward for cyclic motion of the left and right legs + reward for in-phase motion of the upper and lower leg, etc. This reduces the search over the state-action space by looking for phase relationships between the joints.

[Figure: unconstrained search over joint angles vs. constrained search for torso motion and relative leg motion]

SLIDE 10

RL: Optimality

Reinforcement learning with discrete state-action spaces and a finite horizon can be solved in an optimal manner. We will see next how this can be done. This is no longer true for generic continuous state-action spaces. However, the same principles can be extended to continuous worlds, albeit with a loss of the optimality guarantee. (Note that you can also guarantee optimality in continuous state and action spaces, but some assumptions have to be made, e.g. Gaussian noise and a linear control policy.)

SLIDE 11

RL: Discrete State

[Grid-world figure: agent, fire pit, goal, reward]

Set of possible states in the world (environment + agent): 225 states in this example (not all shown).

SLIDE 12

RL: Discrete State

[Grid-world figure: agent, fire pit, goal]

A set of possible actions of the agent.

A policy $\pi(s, a)$ is used to choose an action $a$ from any state $s$. RL learns an optimal policy.

SLIDE 13

RL: Discrete Actions

Illustration of a policy $\pi(s, a)$.

[Grid-world figure: agent, fire pit]

Stochastic environment: transitions across states are not deterministic, $p(s_{t+1} \mid s_t, a_t)$. Rewards may also be stochastic.

Knowing $p$ requires a model of the world. It can be learned while learning the policy.

SLIDE 14

RL: the effect of the environment

[Figure panels: deterministic vs. stochastic environment]

RL takes into account the stochasticity of the environment.

SLIDE 15

RL: the effect of the environment

RL assumes that the world is first-order Markov:

$p\!\left(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots \right) = p\!\left(s_{t+1} \mid s_t, a_t\right)$

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

In words: the probability of a transition to a new state (and new reward) depends only on the current state and action, not on the history of previous states and actions. If the state and action sets are finite, it is a finite MDP. This assumption drastically reduces computation: there is no need to propagate probabilities over the whole history.

SLIDE 16

RL: the policy

Example of a greedy policy ending up in a limit cycle when the policy is poor. The agent must be able to measure how well it is doing and use this measure to update its policy. A good policy maximizes the expected reward.

Greedy policy: at each time step, the agent, being in state $s_t$, chooses an action $a_t$ by drawing from $\pi(s, a)$:

  • If $\pi(s, a)$ is equiprobable for all actions $a$, pick an action at random.
  • Otherwise pick the best action, $a = \arg\max_{a} \pi(s, a)$.
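As an illustration, a minimal Python sketch of this greedy action-selection rule for a tabular policy; the states, actions and probabilities below are made up for the example:

    import random

    # Hypothetical tabular policy: pi[state][action] = probability of choosing `action` in `state`.
    pi = {
        "s0": {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25},  # equiprobable
        "s1": {"up": 0.70, "down": 0.10, "left": 0.10, "right": 0.10},
    }

    def greedy_action(pi, state):
        """Pick argmax_a pi(s, a); break ties (e.g. the equiprobable case) at random."""
        probs = pi[state]
        best = max(probs.values())
        best_actions = [a for a, p in probs.items() if p == best]
        return random.choice(best_actions)

    print(greedy_action(pi, "s0"))  # any of the four actions
    print(greedy_action(pi, "s1"))  # "up"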

SLIDE 17

RL: exercise I

SLIDE 18

RL: Value function

The state-value function gives, for each state, an estimate of the expected reward starting from that state:

$V^{\pi}(s) = E_{\pi}\!\left[ \sum_{k \ge 1} r_{t+k} \;\middle|\; s_t = s \right]$

It depends on the agent's policy.

[Figure: reward and value function over the grid world]

SLIDE 19

RL: Value function

[Figure: a policy and the corresponding value function (greedy policy)]

SLIDE 20

RL: Value function

Discount future rewards:

$V^{\pi}(s) = E_{\pi}\!\left[ \sum_{k \ge 0} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s \right], \qquad 0 \le \gamma \le 1$

$\gamma \rightarrow 0$: shortsighted; $\gamma \rightarrow 1$: farsighted.
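To make the effect of the discount factor concrete, a small Python sketch computing the discounted return of a finite, made-up reward sequence:

    def discounted_return(rewards, gamma):
        """Compute sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    rewards = [0.0, 0.0, 1.0, 0.0, 10.0]          # hypothetical rewards along one episode
    print(discounted_return(rewards, gamma=0.0))   # shortsighted: only the first reward counts
    print(discounted_return(rewards, gamma=0.9))   # farsighted: later rewards still matter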

SLIDE 21

RL: Markov Decision Process (MDP)

[Grid-world figure: agent, fire pit, goal]

How to find the best possible policy? In an MDP, find the optimal value function: it gives the optimal policy.

SLIDE 22

RL: How to find an optimal policy ?

Find the value function.

The policy is needed to compute the expectation.

Exploit the recursive property: the Bellman equation.

SLIDE 23

RL: Bellman Equation

The return is recursive:

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \gamma^{3} r_{t+4} + \dots = r_{t+1} + \gamma \left( r_{t+2} + \gamma r_{t+3} + \gamma^{2} r_{t+4} + \dots \right) = r_{t+1} + \gamma R_{t+1}$

The Bellman equation is a recursive equation describing MDPs. So:

$V^{\pi}(s) = E_{\pi}\!\left[ R_t \mid s_t = s \right] = E_{\pi}\!\left[ r_{t+1} + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s \right]$

Or, written out without the expectation operator (assuming an MDP):

$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma\, V^{\pi}(s') \right]$   (Bellman Equation)

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

SLIDE 24

RL: Bellman Policy Evaluation

So:

$V^{\pi}(s) = E_{\pi}\!\left[ R_t \mid s_t = s \right] = E_{\pi}\!\left[ r_{t+1} + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s \right]$

Or, without the expectation operator (assuming an MDP):

$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma\, V^{\pi}(s') \right]$   (Bellman Equation)

[Backup diagram: from state $s$ with value $V^{\pi}(s)$, take action $a$ with probability $\pi(s, a)$, transition to $s'$ with probability $P_{ss'}^{a}$, receive reward $R_{ss'}^{a}$, reach value $V^{\pi}(s')$.]

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
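A minimal Python sketch of iterative policy evaluation based on the Bellman equation above, on a tiny hypothetical MDP (two states, two actions; all transition probabilities and rewards are invented for the example):

    # Hypothetical 2-state MDP: P[s][a] = list of (probability, next_state, reward).
    P = {
        "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
        "B": {"stay": [(1.0, "B", 0.0)], "go": [(1.0, "A", 0.0)]},
    }
    pi = {"A": {"stay": 0.5, "go": 0.5}, "B": {"stay": 0.5, "go": 0.5}}  # fixed policy to evaluate
    gamma = 0.9

    V = {s: 0.0 for s in P}
    for sweep in range(100):                     # repeated sweeps over all states
        for s in P:
            # Bellman backup: V(s) = sum_a pi(s,a) sum_s' P [R + gamma V(s')]
            V[s] = sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in P[s])
    print(V)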

SLIDE 25

Model-based reinforcement learning

To estimate $V$, one needs a model of the world to estimate the state transitions $P_{ss'}^{a}$ and the reward distribution $R_{ss'}^{a}$ (if stochastic):

$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma\, V^{\pi}(s') \right]$

[Backup diagram: state $s$, action $a$, reward $r$, next state $s'$.]

If the model is not known, one resorts to sample-based techniques.

SLIDE 26

RL: How to find optimal policy ? (Again!)

Find the value function: it is obtained by solving the 36 equations (one Bellman equation per state) with 36 unknowns.

SLIDE 27

RL: How to do Control ?

The state-value function gives for each state an estimate of the expected reward starting from that state; it depends on the agent's policy:

$V^{\pi}(s) = E_{\pi}\!\left[ \sum_{k \ge 0} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s \right]$

The action-value function is a measure of the expected reward when taking action $a$ in state $s$ under policy $\pi$:

$Q^{\pi}(s, a) = E_{\pi}\!\left[ \sum_{k \ge 0} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\ a_t = a \right]$

The action-value function and the state-value function are directly related:

$V^{\pi}(s) = \sum_{a \in A} \pi(s, a)\, Q^{\pi}(s, a)$

See handout on website RQV.pdf

SLIDE 28

RL: V-Q-value functions

V(s) tells you how good a state is, Q(s,a) tells you how good an action from a given state is.

SLIDE 29

RL: Dynamic Programming

Policy evaluation and improvement (Generalized Policy Iteration):
  • 1. Evaluate the policy: policy evaluation is linear.
  • 2. Improve the policy: the policy becomes greedy.

Value Iteration:
  • 1. Evaluate the policy and improve it in one step: policy evaluation is non-linear.

We need to know the models!

SLIDE 30

Policy Evaluation & Improvement

Generalized Policy Iteration

Sutton & Barto Chapter 4
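As a complement, a minimal Python sketch of the policy-improvement step: make the policy greedy with respect to the current value function. The small model and the value estimates are hypothetical (e.g. coming from a previous policy-evaluation sweep such as the one sketched earlier):

    gamma = 0.9
    # Hypothetical model: P[s][a] = list of (probability, next_state, reward).
    P = {
        "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
        "B": {"stay": [(1.0, "B", 0.0)], "go": [(1.0, "A", 0.0)]},
    }
    V = {"A": 2.0, "B": 1.0}   # assumed to come from a previous policy-evaluation step

    def greedy_policy(P, V, gamma):
        """Return the deterministic policy that is greedy with respect to V."""
        policy = {}
        for s, actions in P.items():
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in trans)
                 for a, trans in actions.items()}
            policy[s] = max(q, key=q.get)        # improvement: pick the action with the best backup
        return policy

    print(greedy_policy(P, V, gamma))            # {'A': 'go', 'B': 'go'} for these numbers

Alternating this improvement step with policy evaluation is one instance of generalized policy iteration.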

SLIDE 31

Methods for estimating the value functions

Three main methods:

  • Dynamic Programming (DP): sweeps through all states and uses the Bellman equation to estimate V(s) at each step. Works only if the model of the environment is known and the number of states is small enough.
  • Monte-Carlo (MC): approximates the true value function from sampled returns. Can be used on-line. No model of the world is necessary.
  • Temporal-Difference Learning (TD): like MC, TD methods learn directly from raw experience; like DP, TD methods bootstrap, i.e. they update estimates without waiting for a final outcome.

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

SLIDE 32

Monte-Carlo Sampling

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Simply follow the policy during many episodes and compute the average of the returns obtained for each visited state:

$R_{e}(s) = \sum_{k} \gamma^{k} r_{k}$   (return gathered after visiting $s$ in episode $e$)

$V(s) \approx \frac{R_{e_1}(s) + R_{e_2}(s) + R_{e_3}(s)}{3}$   (update of $V(s_i)$ after three episodes)

Exploration is guided by the initial probabilities and by a heuristic guiding the choice of branch in the tree.
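A minimal Python sketch of (every-visit) Monte-Carlo value estimation in this spirit: average the returns observed after each state over several made-up episodes:

    from collections import defaultdict

    gamma = 1.0   # undiscounted here, for simplicity
    # Hypothetical episodes: lists of (state, reward received after leaving that state).
    episodes = [
        [("A", 0.0), ("B", 0.0), ("C", 1.0)],
        [("A", 0.0), ("C", 1.0)],
        [("B", 0.0), ("C", -1.0)],
    ]

    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards, accumulating the return that followed each visit.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)             # every-visit Monte-Carlo

    V = {s: sum(g) / len(g) for s, g in returns.items()}
    print(V)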

SLIDE 33

Dynamic Programming

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Dynamic Programming backup:

$V(s_t) \leftarrow E_{\pi}\!\left[ r_{t+1} + \gamma\, V(s_{t+1}) \right]$

[Backup diagram: a full-width backup from $s_t$ over all actions and all successor states $s_{t+1}$.]

SLIDE 34

TD-Learning

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Monte-Carlo backup:

$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$

where $R_t$ is the actual return following state $s_t$ (the whole sampled trajectory from $s_t$ to the terminal state is used).

SLIDE 35

TD-Learning

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

TD backup:

$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \right]$

where $r_{t+1} + \gamma V(s_{t+1})$ is an estimate of the true return $R_t$ (only the one-step sampled transition from $s_t$ to $s_{t+1}$ is used).
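A minimal Python sketch of this tabular TD(0) update applied to a stream of made-up transitions:

    alpha, gamma = 0.1, 0.9
    V = {"A": 0.0, "B": 0.0, "C": 0.0}

    # Hypothetical observed transitions (s_t, r_{t+1}, s_{t+1}).
    transitions = [("A", 0.0, "B"), ("B", 1.0, "C"), ("A", 0.0, "B"), ("B", 1.0, "C")]

    for s, r, s_next in transitions:
        td_error = r + gamma * V[s_next] - V[s]   # estimate of the return minus the current value
        V[s] += alpha * td_error                  # V(s_t) <- V(s_t) + alpha * TD error

    print(V)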

SLIDE 36

Sarsa: On-Policy TD Control
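The slide only names the algorithm; as a reminder, a hedged sketch of the standard SARSA update on a Q-table. The sample quintuple is made up; in a real agent, s' and a' would come from acting with, e.g., an ε-greedy policy:

    from collections import defaultdict

    alpha, gamma = 0.1, 0.9
    Q = defaultdict(float)   # Q[(state, action)], initialised to 0

    def sarsa_update(Q, s, a, r, s_next, a_next):
        """On-policy TD control: the target uses the action a' actually taken in s'."""
        target = r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    # Hypothetical quintuple (s, a, r, s', a') observed while following the current policy.
    sarsa_update(Q, "s0", "right", 0.0, "s1", "up")
    print(dict(Q))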

SLIDE 37

DP – MC – TD

[Spectrum from offline to online:]
  • Offline (we have the models, no interaction with the environment): Bellman Equation, Policy Evaluation, Value Iteration, Dynamic Programming (DP), e.g. the Block World.
  • Online (too hard to model, physical interaction with the environment): Monte Carlo (MC), Temporal Difference (TD), SARSA, Q-Learning.

SLIDE 38

RL: exercise II

SLIDE 39

Off-line versus on-line search

Off-line search is preferred as:

  • It speeds up learning (much faster than doing a roll-out in real time)
  • It ensures that no damage is done to the hardware.

However, it is possible only when one has a very realistic simulator. Often, one bootstraps learning off-line and refines it on-line.

Morimoto and Doya, Robotics and Autonomous Systems , 2001

In the experiments teaching a two-legged robot to stand up, the authors conducted:

  • 750 trials in simulation
  • 170 trials on the real robot

SLIDE 40

Exploration versus exploitation

One may need to start acting in the real world before having attained a reasonable estimate of the optimal value function (optimal policy). For the actions to yield a reasonable outcome, it is best to act in regions of the state-action space already visited. But when doing this, one keeps sampling from the same region and does not explore new areas. Learning stagnates! To keep learning: balance exploitation (risk-averse) and exploration (risk-seeking) strategies.

[Illustration: "Risks for dummies"]
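One common way to balance the two (not the only one) is an ε-greedy rule: explore a random action with probability ε, otherwise exploit the current estimate. A minimal sketch with a hypothetical Q-table:

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """Exploit argmax_a Q(s, a) most of the time; explore a random action with probability epsilon."""
        if random.random() < epsilon:
            return random.choice(actions)                            # exploration (risk-seeking)
        return max(actions, key=lambda a: Q.get((state, a), 0.0))    # exploitation (risk-averse)

    Q = {("s0", "left"): 0.2, ("s0", "right"): 0.5}                  # hypothetical estimates
    print(epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.1))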

SLIDE 41

Drawbacks of standard RL

Curse of dimensionality: computational costs increase dramatically with the number of states, even if the problem is discrete by nature, e.g.:

  • Play Backgammon
  • Make train schedule
  • Determine slots for TV programs

The number of states may be huge ($10^{20}$ for Backgammon) and may exceed the memory of the system.

SLIDE 42

Drawbacks of standard RL

Markov World: For most real-world problems, the state and/or actions are not discrete by nature.

  • Robotics: controlling the motion of a robot (the motion and the state of the joints are continuous).
  • Finance: the values of stocks are continuous; the actions (buy or not) may be discrete.

Discretizing is possible, but deciding on the granularity of the discretization is difficult and will impact the precision of the control.

SLIDE 43

Drawbacks of standard RL

  • Curse of dimensionality: computational costs increase dramatically with the number of states.
  • Markov world: cannot handle continuous action and state spaces.
  • Model-based vs. model-free: may need a model of the world (it can be estimated through exploration).

=> Gradient methods to handle continuous state and action spaces.

SLIDE 44

RL in continuous state and action spaces

States $s_t \in \mathbb{R}^{N}$ and actions $a_t \in \mathbb{R}^{P}$, $t = 1 \dots T$, are continuous. One can no longer sweep through all states and actions to determine the optimal policy. Instead, one can either:

1) use function approximation to estimate the value function V(s),
2) use function approximation to estimate the state-action value function Q(s, a), or
3) optimize a parameterized policy $\pi(a \mid s)$ (policy search).

SLIDE 45

RL by function approximation

 

Parametrize the value function: ; V s 

Open parameters

       

1

Parametrize the value function such that: ; form a set of basis functions (e.g. RBF functions). These are set by the user and also called featur the . weights associ are th es e

T j j j K j j

V s K s s s       

  

. These are the ated to each fea unknown paramet ture ers.

55
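A minimal Python sketch of this parametrization with Gaussian RBF basis functions; the centers, width and weights are arbitrary placeholders chosen for the example:

    import numpy as np

    centers = np.array([0.0, 0.5, 1.0])   # hypothetical RBF centers (the features set by the user)
    sigma = 0.25                           # hypothetical RBF width

    def phi(s):
        """Feature vector phi(s): one Gaussian RBF per center."""
        return np.exp(-((s - centers) ** 2) / (2.0 * sigma ** 2))

    theta = np.array([0.1, 0.7, 0.3])      # the unknown weights, here arbitrary initial values

    def V(s, theta):
        """Linear value-function approximation V(s; theta) = theta^T phi(s)."""
        return theta @ phi(s)

    print(V(0.4, theta))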

SLIDE 46

Learning the value function

How to update the value function?

In Monte-Carlo learning, the target is the expected return $R_t$:

$\theta \leftarrow \theta + \alpha \left[ R_t - V(s_t; \theta) \right] \nabla_{\theta} V(s_t; \theta), \qquad \alpha: \text{learning rate}$

In TD learning, the target is $r_{t+1} + \gamma\, V(s_{t+1}; \theta)$:

$\theta \leftarrow \theta + \alpha \left[ r_{t+1} + \gamma\, V(s_{t+1}; \theta) - V(s_t; \theta) \right] \nabla_{\theta} V(s_t; \theta)$

Do roll-outs, measure the sets of states and rewards $\left\{(s_t, r_t)\right\}_{t=1}^{T}$, and update the value function using the above equations.

One can use other techniques, e.g. ML techniques for non-linear regression, to get a better estimate of the parameters than simple gradient descent; see Deisenroth et al., Foundations & Trends in Robotics, 2011, and Peters & Schaal, Neural Networks, 2008, for surveys.
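A minimal Python sketch of the TD update above for a linear parametrization V(s; θ) = θᵀφ(s), for which ∇_θ V = φ(s); the feature map, learning rate and sample transition are placeholders:

    import numpy as np

    alpha, gamma = 0.05, 0.9

    def phi(s):
        """Hypothetical feature map for a 1-D state."""
        return np.array([1.0, s, s ** 2])

    def td_update(theta, s, r, s_next):
        """theta <- theta + alpha * [r + gamma*V(s') - V(s)] * grad_theta V(s), with grad = phi(s)."""
        v_s, v_next = theta @ phi(s), theta @ phi(s_next)
        td_error = r + gamma * v_next - v_s
        return theta + alpha * td_error * phi(s)

    theta = np.zeros(3)
    theta = td_update(theta, s=0.2, r=1.0, s_next=0.4)   # one hypothetical transition
    print(theta)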

SLIDE 47

RL by function approximation: example

Value function: $V(s; \theta) = \sum_{j=1}^{K} \theta_j\, \phi_j(s)$

Example: $s$ is the Cartesian state of the robot's gripper; the $\phi_j(s)$ are features of the state, e.g.:
  • distances to the holes
  • distances to the walls

Choosing the features is not trivial and is key to success.

SLIDE 48

RL by function approximation: example

What would be a good set of weights for the robot to learn how to sink the ball into any of the holes? What would the value function look like?

Value function: $V(s; \theta) = \sum_{j=1}^{K} \theta_j\, \phi_j(s)$

Example: $s$ is the Cartesian state of the robot's gripper; the $\phi_j(s)$ are features of the state, e.g.:
  • distances to the holes
  • distances to the walls

SLIDE 49

Learning the value function

Value function: $V(s; \theta) = \theta_1\, \phi_{\mathrm{dist.walls}}(s) + \theta_2\, \phi_{\mathrm{dist.holes}}(s)$, with
$\phi_{\mathrm{dist.walls}}(s)$: 1 if the gripper hits a wall, 0 otherwise;
$\phi_{\mathrm{dist.holes}}(s)$: 1 if the ball is sunk into a hole, 0 otherwise.

Start with the initial estimate $\theta_1 = \theta_2 = 0.5$. Perform one roll-out with the greedy policy and gather the reward: -10 (hit a wall).

Update the value function by gradient descent (TD):

$\theta \leftarrow \theta + \alpha \left[ r_{t+1} + \gamma\, \theta^{T} \phi(s_{t+1}) - \theta^{T} \phi(s_t) \right] \phi(s_t)$

The update is influenced immediately by the active feature.

SLIDE 50

From value function to policy

When the actions are simply the derivative of the state ($a = \dot{s}$), e.g. the motion of a robot in 2-D space, the greedy policy can be derived by taking the gradient of the value function:

$\pi(a \mid s): \quad a_t = \dot{s}_t \sim \beta\, \nabla_{s} V(s_t), \qquad \beta: \text{scaling factor}$

[Figure: value function and the trajectories obtained by following its gradient]
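A minimal Python sketch of this idea: obtain the action as a scaled (finite-difference) gradient of a hypothetical value function over a 2-D state:

    import numpy as np

    def V(s):
        """Hypothetical smooth value function over a 2-D state, peaked at the goal [1, 1]."""
        goal = np.array([1.0, 1.0])
        return -np.sum((s - goal) ** 2)

    def policy(s, beta=0.5, eps=1e-4):
        """a = s_dot ~ beta * grad_s V(s); the gradient is estimated by central finite differences."""
        grad = np.zeros_like(s)
        for i in range(len(s)):
            d = np.zeros_like(s)
            d[i] = eps
            grad[i] = (V(s + d) - V(s - d)) / (2 * eps)
        return beta * grad

    print(policy(np.array([0.0, 0.0])))   # action pointing towards the goal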

SLIDE 51

From value function to policy

When the actions are simply the derivative of the state ($a = \dot{s}$), e.g. the motion of a robot in 2-D space, the greedy policy can be derived by taking the gradient of the value function: $\pi(a \mid s) \sim \nabla_{s} V(s)$.

This is not possible when the actions differ from the state space, e.g. an underactuated robot.

Actor-Critic:
  • Critic: update the parameters of the value function (or action-value function) using, e.g., TD learning.
  • Actor: update the policy parameters in the direction suggested by the critic, using gradient descent.

SLIDE 52

Robotics Applications of continuous RL

Teaching a two-joint, three-link robot leg to stand up. Robot state: θ0: pitch angle; θ1: hip joint angle; θ2: knee joint angle; θm: angle of the line from the center of mass to the center of the foot. Robot actions: torques actuating the two joints.

Morimoto and Doya, Robotics and Autonomous Systems , 2001

SLIDE 53

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems , 2001

GOAL: The stand-up task is accomplished when the robot stands up and stays upright for more than 2(T + 1) seconds. SUBGOALS: Reaching the final goal is a necessary but not sufficient condition of a successful stand-up, because the robot may fall down after passing through the final goal; hence the need to define subgoals.

SLIDE 54

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems, 2001

Decompose the task into upper-level and lower-level sets of goals. At the upper level, perform a coarse discretization of the state-action space and use Q-learning.

When the robot achieves a sub-goal, the learner gets a reward < 0.5. The full reward is obtained when all subgoals and the main goal are achieved.

Reward at the upper level: Y is the height of the head of the robot at a sub-goal posture, and L is the total length of the robot.

SLIDE 55

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems, 2001

The lower level learns how to achieve each subgoal. The reward is the level of achievement of the subgoal. States and actions are continuous in time: $s(t)$, $u(t)$.

Model the value function: $V(s; \theta) = \theta^{T} \phi(s)$

Model the policy: $u(s) = f\!\left( \sum_{j=1}^{K} w_j\, \phi_j(s) + e(t) \right)$, with $f$: truncated sigmoid function and $e(t)$: noise for exploration.

Critic: learn $\theta$ through gradient descent on the squared TD error.
Actor: use the TD error to estimate the update of the policy parameters $w$.

SLIDE 56

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems , 2001

  • 750 trials in simulation + 170 on real robot
  • Goal: to stand up

SLIDE 57

RL in continuous state and action spaces

States $s_t \in \mathbb{R}^{N}$ and actions $a_t \in \mathbb{R}^{P}$, $t = 1 \dots T$, are continuous. One can no longer sweep through all states and actions to determine the optimal policy. Instead, one can either:

1) use function approximation to estimate the value function V(s),
2) use function approximation to estimate the state-action value function Q(s, a), or
3) optimize a parameterized policy $\pi(a \mid s)$ (policy search).

SLIDE 58

Parametrize the Q-value function

Approximate the state-action value function:

$Q(s, a; \theta) = \sum_{j=1}^{K} \theta_j\, \phi_j(s, a) = \theta^{T} \phi(s, a)$

The $\left\{\phi_j(s, a)\right\}_{j=1}^{K}$, $\phi_j: \mathbb{R}^{N} \times \mathbb{R}^{P} \rightarrow \mathbb{R}$, are a set of basis functions. These are set by the user and are also called the features. The $\theta_j$ are the weights associated to each feature; these are the unknown parameters.

SLIDE 59

Update on Bellman residual error

Recall (see the slides on discrete RL) the update step of Q-learning:

$Q(s_t, a_t; \theta) \leftarrow r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta)$

The Bellman residual error is given by:

$L = \left[ Q(s_t, a_t; \theta) - \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta) \right) \right]^{2}$

Use the Bellman residual error to determine the parameters of the Q-value function iteratively; see the next slide.
SLIDE 60

Update on Bellman residual error

Perform a roll-out over $T$ time steps using the policy $\pi(a \mid s; \theta)$ (greedy search on $\arg\max_{a} Q(s, a; \theta)$ using the current estimate of $Q(s, a; \theta)$). Collect the samples $s_{1:T}$, $a_{1:T-1}$, $r_{2:T}$.

Determine how good an initial choice of the parameters $\theta$ is by comparing the predicted reward and the actual reward along the roll-out, through the Bellman residual error:

$L = \sum_{t=1}^{T-1} \left[ Q(s_t, a_t; \theta) - \left( r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}; \theta) \right) \right]^{2}, \qquad \gamma \in [0, 1]: \text{discount factor}$

The solution is found by least squares on this objective function.

See Lagoudakis & Parr, Journal of Machine Learning Research, 2003, which offers numerical solutions to determine the optimal $\alpha$ parameter.
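A much-simplified Python sketch of fitting linear Q-parameters by least squares on roll-out samples. The residual is linearized by treating the next state-action features as fixed, in the spirit of the least-squares methods cited above; the samples, features and γ are placeholders, not the authors' setup:

    import numpy as np

    gamma = 0.9

    def phi(s, a):
        """Hypothetical features over a 1-D state and a discrete action in {0, 1}."""
        return np.array([1.0, s, s * a, float(a)])

    # Hypothetical roll-out samples (s, a, r, s_next, a_next), with a_next chosen greedily.
    samples = [(0.0, 1, 0.0, 0.2, 1), (0.2, 1, 0.0, 0.5, 0), (0.5, 0, 1.0, 0.6, 0)]

    # Linear Bellman residual: theta^T [phi(s, a) - gamma * phi(s', a')] should match the reward r.
    X = np.array([phi(s, a) - gamma * phi(s2, a2) for s, a, r, s2, a2 in samples])
    y = np.array([r for _, _, r, _, _ in samples])

    theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution for the parameters
    print(theta)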

SLIDE 61

Example: learning to ride a bike

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

Goal: learn to balance and to ride a bicycle to a target position located 1 km away from the starting location.

Continuous state: a six-dimensional real-valued vector $S = (\theta, \dot{\theta}, \omega, \dot{\omega}, \ddot{\omega}, \psi)$, built from:
  • the angle of the handlebar,
  • the vertical angle of the bicycle,
  • the angle of the bicycle to the goal.

5 discrete actions: torque applied to the handlebar (discretized to {−2, 0, +2}) and displacement of the rider (discretized to {−0.02, 0, +0.02}).

Model of the world (simulated): uniform noise on each action; model of the bike's dynamics.

SLIDE 62

Example: learning to ride a bike

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

State: $S = (\theta, \dot{\theta}, \omega, \dot{\omega}, \ddot{\omega}, \psi)$.

Q-value function parametrized with 20 basis functions: $Q(s, a; w) = \sum_{i=1}^{20} w_i\, \phi_i(s, a)$.

Features: combinations of the state and its first and second derivatives ($1, \omega, \dot{\omega}, \omega^{2}, \dot{\omega}^{2}, \omega\dot{\omega}, \theta, \dot{\theta}, \theta^{2}, \dot{\theta}^{2}, \theta\dot{\theta}, \dots$), paired with a uniform distribution over the discrete values of $a$.

Iterate between using the policy to generate roll-outs (taking the greedy approach on the current estimate of $Q$) and updating $Q$ using the Bellman error.

SLIDE 63

Example: learning to ride a bike

Collect training samples using a random policy. Each episode lasts 20 steps. The learned policy is evaluated 100 times to estimate the probability of success.

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

The policy after the first iteration balances the bicycle, but fails to ride to the goal. The policy after the second iteration heads towards the goal, but fails to balance. All policies thereafter balance and ride the bicycle to the goal. Crashes still happen because of the noise in the model.

SLIDE 64

Example: learning to ride a bike

Successful policies are found after a few thousand training episodes. With 5000 training episodes (60000 samples), the probability of success is about 95% and the expected number of balancing steps is about 70000 steps.

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

SLIDE 65

RL in continuous state and action spaces

States $s_t \in \mathbb{R}^{N}$ and actions $a_t \in \mathbb{R}^{P}$, $t = 1 \dots T$, are continuous. One can no longer sweep through all states and actions to determine the optimal policy. Instead, one can either:

1) use function approximation to estimate the value function V(s),
2) use function approximation to estimate the state-action value function Q(s, a), or
3) optimize a parameterized policy $\pi(a \mid s)$ (policy search).

SLIDE 66

Policy Gradients

An alternative is to parametrize a stochastic policy: $\pi(a \mid s)$ is approximated by drawing from the distribution $p(a \mid s; \theta)$, where $\theta$ are the parameters of the policy.

The actor, in state $s_t$, chooses an action $a_t$ following the stochastic policy $\pi(a_t \mid s_t)$, i.e. by drawing from the distribution $p(a_t \mid s_t; \theta)$.
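A minimal Python sketch of drawing actions from such a parameterized stochastic policy, here a Gaussian whose mean is linear in hypothetical state features:

    import numpy as np

    rng = np.random.default_rng(0)

    def phi(s):
        """Hypothetical state features."""
        return np.array([1.0, s])

    theta_mean = np.array([0.2, -0.5])   # policy parameters (mean of the action distribution)
    sigma = 0.1                          # fixed exploration noise

    def sample_action(s):
        """Draw a_t ~ N(theta^T phi(s), sigma^2)."""
        return rng.normal(loc=theta_mean @ phi(s), scale=sigma)

    print(sample_action(0.8))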

SLIDE 67

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

Starting from a deterministic estimate of the policy,

$a = \pi(s; \theta) = \theta^{T} \phi(s) = \sum_{j=1}^{K} \theta_j\, \phi_j(s)$

where the $\phi_j(s)$ are again a set of known basis functions and the $\theta_j$ the weights for each basis function.

Adding some noise $\epsilon_t \sim \mathcal{N}(0, \Sigma)$ for exploration, and making the policy explicitly time-dependent, $a_t = \pi(s_t, t; \theta) = (\theta + \epsilon_t)^{T} \phi(s_t, t)$, leads to a stochastic policy:

$\pi(a \mid s_t, t; \theta) \sim \mathcal{N}\!\left( a \;\middle|\; \theta^{T} \phi(s_t, t),\; \phi(s_t, t)^{T} \Sigma\, \phi(s_t, t) \right)$

SLIDE 68

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

At each roll-out (episode), the agent gathers tuples of rewards and associated states and actions: $\left\{(s_t, a_t, r_t)\right\}_{t=1}^{T}$.

Compute an unbiased estimate of the Q-function:

$\hat{Q}^{\pi}(s_t, a_t, t) = \sum_{\tilde{t}=t}^{T} r_{\tilde{t}}\!\left(s_{\tilde{t}}, a_{\tilde{t}}\right)$

Update the parameters $\theta$ at each iteration by weighting the exploration $\epsilon_t$ with the returns,

$\theta \leftarrow \theta + \frac{E\!\left[\sum_{t=1}^{T} W(s_t, t)\, \epsilon_t\, \hat{Q}^{\pi}(s_t, a_t, t)\right]}{E\!\left[\sum_{t=1}^{T} W(s_t, t)\, \hat{Q}^{\pi}(s_t, a_t, t)\right]}, \qquad W(s_t, t) = \frac{\phi(s_t, t)\, \phi(s_t, t)^{T}}{\phi(s_t, t)^{T}\, \Sigma\, \phi(s_t, t)},$

until convergence, i.e. until the change in $\theta$ is smaller than a threshold.

SLIDE 69

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

Teaching a robot to play the ball-in-a-cup task.
State: joint angles and velocities of the robot + Cartesian coordinates of the ball.
Action: joint-space accelerations.
Reward: distance of the ball to the rim of the cup.

Policy: $\pi(a \mid x, \dot{x}, t) \sim \mathcal{N}\!\left( \theta^{T} \phi(x, \dot{x}, t),\; \Sigma \right)$, with 31 basis functions per DOF.
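A heavily simplified, episodic Python sketch of the general idea of weighting exploration by the returns: perturb the policy parameters per roll-out, evaluate the return, and move the parameters towards a return-weighted average of the perturbations. This is only a toy illustration of the spirit of PoWER on a made-up one-step objective, not the authors' algorithm:

    import numpy as np

    rng = np.random.default_rng(1)
    theta = np.zeros(2)                   # policy parameters
    sigma = 0.5                           # exploration noise
    theta_opt = np.array([1.0, -2.0])     # hypothetical (unknown to the learner) optimum

    def episode_return(theta):
        """Toy return: larger when theta is closer to the optimum."""
        return -np.sum((theta - theta_opt) ** 2)

    for iteration in range(200):
        eps = [rng.normal(0.0, sigma, size=theta.shape) for _ in range(10)]   # exploration noise
        R = np.array([episode_return(theta + e) for e in eps])                # returns of the roll-outs
        w = np.exp(R - R.max())                                               # return-based weights
        theta = theta + sum(wi * ei for wi, ei in zip(w, eps)) / w.sum()      # weighted average of the noise

    print(theta)   # should end up close to theta_opt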

SLIDE 70

Extensions of RL framework

1) A major difficulty in RL is to determine the reward. Inverse Reinforcement Learning (closely related to Inverse Optimal Control for robotics problems) was proposed as a framework to estimate what the reward could be; see:

  • Ng, Andrew Y., and Stuart J. Russell. "Algorithms for inverse reinforcement learning." ICML, 2000.
  • Ziebart, Brian D., et al. "Maximum Entropy Inverse Reinforcement Learning." AAAI, 2008.
  • Abbeel, Pieter, and Andrew Y. Ng. "Inverse reinforcement learning." Encyclopedia of Machine Learning. Springer US, 2011. 554-558.

2) Reinforcement learning always assumes that good observations of the system (positive reward) are available. This is often impractical (no expert to teach the robot). Approaches exist to learn from bad examples only, i.e. from failure ("Donut"); see:

  • Grollman, D. H., and Billard, A. "Donut as I do: Learning from failed demonstrations." Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011; and "Robot learning from failed demonstrations." International Journal of Social Robotics 4.4 (2012): 331-342.
  • Rai, A., de Chambrier, G., and Billard, A. (2013). Learning from Failed Demonstrations in Unreliable Systems. IEEE-RAS International Conference on Humanoid Robots.

SLIDE 71

Summary

  • RL was first coined for discrete state and action spaces.
  • An RL problem is entirely determined by its states, actions, rewards, state-transition probabilities, probability of reward, and state-action transitions (policy). It assumes that all probabilities are first-order Markov.
  • When the world is known, an optimal solution for the policy can be found using Dynamic Programming (DP).
  • Otherwise, iterative techniques are used to approximate the optimal solution (Monte-Carlo, TD, SARSA). This requires generating roll-outs to span the state-action space. When the policy is used right away during training, one must balance exploration with exploitation.
  • Continuous RL problems extend the discrete RL framework. They reuse the notions of value function, Q-value function and policy, but treat these as continuous functions and approximate their values, e.g. by gradient descent.