ADVANCED MACHINE LEARNING
Brief Overview of Discrete and Continuous Reinforcement Learning (not part of exam material)

Forms of Learning
Supervised learning, where the algorithm learns from labeled examples.
Morimoto and Doya, Robotics and Autonomous Systems, 2001
Task: get rock samples. Feedback: success or failure. It is up to the robot to figure out how to achieve the task.
What are the rewards?
Exploration: Have to try and explore multiple solutions to find the best!
Let’s try everything
Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997
UC Berkeley, Darwin Robot
Unconstrained search over joint angles vs. constrained search for torso motion and relative leg motion.
(Grid-world figure: Agent, Fire pit, Goal, Reward.)
Set of possible states in the world (environment + agent): 225 states in the above example (not all shown).
(Grid-world figure: Agent, Fire pit, Goal.)
A set of possible actions of the agent: a ∈ A.
A policy π is used to choose an action a from any state s. RL learns an optimal policy.
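A minimal sketch of such a state and action set, assuming a hypothetical 15×15 grid (which gives the 225 states mentioned above); all names are illustrative, not from the slides:

```python
# Hypothetical 15x15 grid world: 225 states, 4 actions.
N = 15
STATES = [(r, c) for r in range(N) for c in range(N)]               # agent positions
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic transition: move if inside the grid, otherwise stay put."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    return (r, c) if 0 <= r < N and 0 <= c < N else state

print(len(STATES))          # 225
print(step((0, 0), "up"))   # (0, 0): blocked by the wall
```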
Illustration of a policy π: s → a. (Grid-world figure: Agent, Fire pit.)
Stochastic environment: transitions across states are not deterministic, p(s_{t+1} | s_t, a_t). Rewards may also be stochastic.
Knowing p requires a model of the world. It can be learned while learning the policy.
RL takes into account the stochasticity of the environment. (Figure: deterministic vs. stochastic transitions.)
Agent-environment loop: at each step t, the agent in state s_t takes action a_t and receives reward r_{t+1} and next state s_{t+1}.
Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
At each time step, the agent, being in state s_t, chooses an action a_t by drawing from π(a | s).
If π(a | s) is equiprobable for all actions a, the agent picks an action at random.
Greedy policy: always pick the action with the highest estimated value.
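The random/greedy choice can be sketched as an epsilon-greedy rule (a minimal illustration; the function, state, and action names are made up):

```python
import random

def select_action(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)                          # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # greedy exploitation

Q = {("s0", "left"): 0.2, ("s0", "right"): 0.8}
print(select_action(Q, "s0", ["left", "right"], epsilon=0.0))  # right
```

With epsilon = 0 this is exactly the greedy policy; with epsilon = 1 it is the equiprobable random policy.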
The state-value function gives for each state an estimate of the expected (discounted) reward starting from that state:

V^π(s) = E_π[ Σ_{k=0..∞} γ^k r_{t+k+1} | s_t = s ]

It depends on the agent's policy. (Figure: reward vs. value function.)
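The discounted sum inside this definition can be computed by folding the reward sequence from the end (a minimal sketch; the reward values are made up):

```python
# Discounted return: R_t = sum_k gamma^k * r_{t+k+1}
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):   # fold from the end: g = r + gamma * g
        g = r + gamma * g
    return g

print(discounted_return([0, 0, 1], gamma=0.5))  # 0 + 0.5*(0 + 0.5*1) = 0.25
```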
(Figure: value function and the corresponding greedy policy.)
(Grid-world figure: Agent, Fire pit, Goal.)
How to find the best possible policy? In an MDP, finding the optimal value function gives the optimal policy.
Find the value function:
A policy is needed to compute the expectation.
Exploit the recursive property: the Bellman equation.
R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + …
    = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + … )
    = r_{t+1} + γ R_{t+1}

V^π(s) = E_π[ R_t | s_t = s ] = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Bellman Equation
(Backup diagram for V^π: from state s, each action a leads to successor states s' with reward r.)
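Iterative policy evaluation turns the Bellman equation into a fixed-point update. A minimal sketch on a made-up two-state MDP (the transition and reward tables are purely illustrative):

```python
# V(s) <- sum_a pi(s,a) sum_s' P^a_{ss'} * (R^a_{ss'} + gamma * V(s'))
P = {(0, "go", 1): 1.0, (1, "go", 1): 1.0}     # P^a_{ss'}: transition probabilities
R = {(0, "go", 1): 1.0, (1, "go", 1): 0.0}     # R^a_{ss'}: expected rewards
pi = {(0, "go"): 1.0, (1, "go"): 1.0}          # pi(s, a): single-action policy
gamma = 0.5
V = {0: 0.0, 1: 0.0}

for _ in range(100):  # sweep the Bellman update until convergence
    V = {s: sum(pi[(s, a)] * p * (R[(s, a, s2)] + gamma * V[s2])
                for (s0, a, s2), p in P.items() if s0 == s)
         for s in V}

print(round(V[0], 6), round(V[1], 6))  # 1.0 0.0
```

State 1 loops on itself with zero reward, so V(1) = 0; state 0 earns 1 once, so V(0) = 1 + γ·V(1) = 1.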
The state-value function gives for each state an estimate of the expected reward starting from that state; it depends on the agent's policy:

V^π(s) = E_π[ Σ_{k=0..∞} γ^k r_{t+k+1} | s_t = s ]

The action-value function conditions also on the first action taken:

Q^π(s, a) = E_π[ Σ_{k=0..∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]

The state-value function and the action-value function are directly related:

V^π(s) = Σ_{a ∈ A} π(s, a) Q^π(s, a)
Policy iteration: alternate policy evaluation and policy improvement; the policy becomes greedy with respect to the current value function (Generalized Policy Iteration).
Value iteration: combines evaluation and improvement in a single update.
For policy iteration, policy evaluation is linear; for value iteration it is non-linear (because of the max over actions). In both cases we need to know the models!
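A minimal value-iteration sketch, assuming a made-up 1-D corridor with a deterministic model (state 3 is treated as an absorbing goal with value 1; all details are illustrative):

```python
# Value iteration: V(s) <- max_a sum_s' P (R + gamma * V(s')); requires the model.
gamma = 0.9
states = range(4)

def step(s, a):                      # a in {-1, +1}; deterministic transition model
    return min(max(s + a, 0), 3)

V = [0.0] * 4
for _ in range(50):
    # Goal state pinned to value 1; elsewhere back up the best neighbor.
    V = [1.0 if s == 3 else max(0.0 + gamma * V[step(s, a)] for a in (-1, 1))
         for s in states]

print([round(v, 3) for v in V])  # [0.729, 0.81, 0.9, 1.0]
```

The value decays geometrically with the distance to the goal, and the greedy policy (move toward higher V) is optimal.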
Generalized Policy Iteration (see Sutton & Barto, Chapter 4).
Like Monte Carlo (MC) methods, TD methods learn directly from raw experience, without a model of the environment. Like Dynamic Programming (DP), TD methods bootstrap, i.e. they update estimates based on other learned estimates, without waiting for a final outcome.
Constant-α Monte Carlo: V(s_t) ← V(s_t) + α [ R_t − V(s_t) ], where R_t is the actual return following state s_t (backup over the complete episode, up to the terminal state T).
Simplest TD method, TD(0): V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ] (backup over a single step, bootstrapping from V(s_{t+1})).
Compare with dynamic programming: V(s_t) ← E_π[ r_{t+1} + γ V(s_{t+1}) ] (full backup over all possible successors, using the model).
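The TD(0) update applied to a single observed transition can be sketched as follows (state names and numbers are made up for illustration):

```python
# TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]   # the one-step TD error
    V[s] += alpha * td_error
    return td_error

V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", 0.5, "B")
print(round(delta, 3), round(V["A"], 3))  # 1.4 0.14
```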
Overview (block-world example):
- Offline, no interaction with the environment, we have the models: Bellman equation, policy evaluation, value iteration, dynamic programming (DP).
- Online, physical interaction with the environment, too hard to model: Monte Carlo (MC), temporal difference (TD), SARSA, Q-Learning.
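The two model-free updates listed above can be sketched side by side; SARSA (on-policy) bootstraps from the action actually taken next, while Q-learning (off-policy) bootstraps from the best next action. All names and numbers here are illustrative:

```python
def sarsa(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    # On-policy: uses the next action a2 actually chosen by the behavior policy.
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # Off-policy: uses the max over next actions, regardless of what is taken.
    best = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

Q = {("s", "l"): 0.0, ("s", "r"): 0.0, ("s2", "l"): 0.0, ("s2", "r"): 1.0}
q_learning(Q, "s", "r", 0.0, "s2", ["l", "r"])
print(round(Q[("s", "r")], 3))  # 0.09 = 0.1 * (0 + 0.9 * 1.0)
```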
Morimoto and Doya, Robotics and Autonomous Systems, 2001
Risks for dummies
Function approximation with K basis functions φ_j and parameters θ_j: V(s) ≈ θᵀφ(s) = Σ_{j=1..K} θ_j φ_j(s).
TD error: δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t).
Gradient-descent update of the parameters: θ ← θ + α δ_t ∇_θ V(s_t).
One can use other techniques, e.g. ML techniques for non-linear regression, to get a better estimate of the parameters than simple gradient descent; see Deisenroth et al., Foundations & Trends in Robotics, 2011, and Peters & Schaal, Neural Networks, 2008, for surveys.
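A minimal sketch of gradient-descent TD(0) with a linear value function; the polynomial basis functions here are made-up illustrations, not the ones used in the cited work:

```python
# Linear value function V(s) = theta . phi(s); for linear V, grad_theta V = phi(s).
def phi(s):
    return [1.0, s, s * s]          # illustrative polynomial features

def v(theta, s):
    return sum(t * f for t, f in zip(theta, phi(s)))

def td_grad_step(theta, s, r, s_next, alpha=0.01, gamma=0.9):
    delta = r + gamma * v(theta, s_next) - v(theta, s)             # TD error
    return [t + alpha * delta * f for t, f in zip(theta, phi(s))]  # gradient step

theta = [0.0, 0.0, 0.0]
theta = td_grad_step(theta, 1.0, 1.0, 2.0)
print([round(t, 4) for t in theta])  # [0.01, 0.01, 0.01]
```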
Reward shaped using the distances to walls and holes (dist₁ to walls, dist₂ to holes).
Value-function-based control in continuous state space: choose the action a_t that follows the gradient of the value function V at s_t, multiplied by a scaling factor.
Morimoto and Doya, Robotics and Autonomous Systems, 2001
Reward at the upper level: Y, the height of the head, where L is the total length of the robot.
See Lagoudakis & Parr, Journal of Machine Learning Research, 2003, which offers numerical solutions to determine the optimal α parameter.
Lagoudakis & Parr, Journal of Machine Learning Research, 2003
State: the bicycle's tilt and handlebar angles and their derivatives, plus the angle to the goal.
Q-value function parametrized with 20 basis functions: Q(s, a) = Σ_i θ_i φ_i(s, a).
Features: combinations of the state and its first and second derivatives (1, ω, ω̇, ω², θ, θ̇, θ², …), and a uniform distribution on discrete values of a.
The policy after the first iteration balances the bicycle but fails to ride to the goal. The policy after the second iteration heads towards the goal but fails to balance. All policies thereafter balance the bicycle and ride it to the goal. Crashes still happen because of noise in the model.
Policy parametrized with K basis functions: a = Σ_{j=1..K} θ_j φ_j(s).
Kober and Peters, Machine Learning, 2011
Stochastic policy: a | x ~ N(θᵀφ(x), Σ), with 31 basis functions per DOF.
Kober and Peters, Machine Learning, 2011
Ng, Andrew Y., and Stuart J. Russell. "Algorithms for inverse reinforcement learning." ICML, 2000. Ziebart, Brian D., et al. "Maximum entropy inverse reinforcement learning." AAAI, 2008. Abbeel, Pieter, and Andrew Y. Ng. "Inverse reinforcement learning." Encyclopedia of Machine Learning. Springer US, 2011. 554-558.
Grollman, D. H., and Billard, A. "Donut as I do: Learning from failed demonstrations." ICRA, 2011; and "Robot learning from failed demonstrations." International Journal of Social Robotics 4.4 (2012): 331-342. Rai, A., de Chambrier, G., and Billard, A. (2013). Learning from Failed Demonstrations in Unreliable Systems. IEEE-RAS International Conference on Humanoid Robots.