MACHINE LEARNING 2012 – MACHINE LEARNING TECHNIQUES AND APPLICATIONS
Reinforcement Learning for Continuous State and Action Spaces: Gradient Methods
Reinforcement Learning (RL) vs. Supervised Learning vs. Unsupervised Learning
Doya, NIPS 1996
Approximate the value function as a weighted sum of $K$ basis functions:
$V(x) = \sum_{j=1}^{K} w_j\, b_j(x)$
Doya, NIPS 1996
Continuous-time TD learning ($j = 1, \dots, K$): TD error
$\delta(t) = r(t) - \frac{1}{\tau} V(x(t)) + \dot{V}(x(t))$,
weight update $\dot{w}_j = \eta\, \delta(t)\, e_j(t)$,
with eligibility traces $\dot{e}_j(t) = -\frac{1}{\kappa}\, e_j(t) + b_j(x(t))$.
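A minimal sketch of this update, discretized with time step dt and 1-D normalized Gaussian bases (the constants eta, tau, kappa and the basis layout are illustrative choices, not Doya's reported settings):

```python
import numpy as np

K = 20                                  # number of basis functions
centers = np.linspace(-1.0, 1.0, K)    # basis centers on a 1-D state
sigma = 0.1                             # basis width (assumed)
w = np.zeros(K)                         # value-function weights
e = np.zeros(K)                         # eligibility traces
eta, tau, kappa, dt = 0.5, 1.0, 0.3, 0.01

def basis(x):
    """Normalized Gaussian basis activations b_j(x)."""
    b = np.exp(-((x - centers) ** 2) / (2 * sigma ** 2))
    return b / b.sum()

def td_step(x, x_next, r):
    """One discretized continuous-TD update; returns the TD error."""
    global w, e
    b, b_next = basis(x), basis(x_next)
    V, V_next = w @ b, w @ b_next
    Vdot = (V_next - V) / dt                  # finite-difference estimate
    delta = r - V / tau + Vdot                # continuous-time TD error
    e = (1 - dt / kappa) * e + dt * b         # decaying eligibility trace
    w = w + eta * delta * e * dt              # weight update
    return delta
```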
Robot configuration. θ0: pitch angle, θ1: hip joint angle, θ2: knee joint angle, θm: the angle of the line from the center of mass to the center of the foot.
Morimoto and Doya, Robotics and Autonomous Systems, 2001
Morimoto and Doya, Robotics and Autonomous Systems , 2001
The stand-up task requires subgoals because the robot may fall down even after passing through the final goal posture; a trial counts as successful only if the robot remains standing for more than 2(T + 1) seconds.
Morimoto and Doya, Robotics and Autonomous Systems , 2001
When the robot achieves a sub-goal, the learner receives a reward smaller than 0.5. The full reward is obtained only when all subgoals and the main goal are achieved.
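A toy encoding of this reward schedule (the per-subgoal value of 0.4 and the linear accumulation are illustrative assumptions; the paper's shaping is more detailed):

```python
def stand_up_reward(achieved_subgoals, num_subgoals, main_goal_reached):
    """Partial reward (below 0.5) for sub-goals; full reward only once
    every sub-goal and the main goal are achieved."""
    if main_goal_reached and achieved_subgoals == num_subgoals:
        return 1.0
    # each achieved sub-goal contributes a small reward below 0.5 in total
    return 0.4 * achieved_subgoals / num_subgoals
```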
Atkeson & Schaal, Int. Conf. on Machine Learning (ICML), 1997; Schaal, NIPS, 1997
Collect a sequence $\{s_t, u_t\}_{t=1}^{T}$ from a single human demonstration. State of the system: $s_t = (x_t, \dot{x}_t, \theta_t, \dot{\theta}_t)$ (human hand position and velocity, angular position and velocity of the pendulum). Actions: $u_t$, the hand acceleration (can be converted into a torque command to the robot). Reward: minimizes angular and hand displacement as well as velocities.
Learn first a model of the task through RL, using continuous TD learning to estimate the value function $V(s)$. Take a parameterized model $\hat{V}(s)$; its parameters are estimated incrementally through non-linear function approximation (locally weighted regression).
Learn then the reward: $r(x_t, u_t, t) = (x_t - x_t^{*})^{T} Q\, (x_t - x_t^{*}) + u_t^{T} R\, u_t$, where the target trajectory $x^{*}$ and the matrices $Q$ and $R$ are estimated so as to minimize discrepancies between human and robot trajectories. Use the reward to generate optimal trajectories through optimal control.
Atkeson & Schaal, Int. Conf. on Machine Learning (ICML), 1997; Schaal, NIPS, 1997
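A direct transcription of this quadratic reward; the target x_star and the matrices Q and R are placeholders to be fitted to the demonstration:

```python
import numpy as np

def quadratic_reward(x, u, x_star, Q, R):
    """r(x_t, u_t, t): penalize deviation from the demonstrated state
    x* and large commands u (to be minimized by optimal control)."""
    dx = x - x_star
    return float(dx @ Q @ dx + u @ R @ u)
```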
LSPI approximates the Q-function with a linear architecture over $K$ basis functions:
$\hat{Q}^{\pi}(s, a) = \sum_{j=1}^{K} \alpha_j\, \phi_j(s, a)$
1: 1 1: 1: 2: 1
a T T T T
1 1 1,.. 1,..
T t t T t t t t t t T t T
See Lagoudakis & Parr, Journal of Machine Learning Research, 2003, for various numerical solutions to determine the optimal $\alpha$ parameter.
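A minimal LSTDQ solve for these weights, assuming the feature map phi, the discount gamma, and the current greedy policy are given (the ridge term reg is an added numerical-stability assumption):

```python
import numpy as np

def lstdq(samples, phi, policy, gamma, k, reg=1e-6):
    """Least-squares fixed point for Q(s, a) = phi(s, a)^T alpha.
    samples: iterable of (s, a, r, s_next) transitions."""
    A = reg * np.eye(k)
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))   # action chosen by current policy
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)               # alpha = A^{-1} b
```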
Collect training samples using a random policy; each episode lasts 20 steps. The learned policy is evaluated 100 times to estimate the probability of success (reaching the goal). The shaping reward was 1% of the change (in meters) in the distance to the goal plus the change in the square of the vertical angle.
Lagoudakis & Parr, Journal of Machine Learning Research, 2003
The policy after the first iteration balances the bicycle but fails to ride to the goal. The policy after the second iteration heads toward the goal but fails to balance. All policies thereafter balance the bicycle and ride it to the goal. Crashes still occur occasionally because of noise in the model.
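A sketch of the shaping term as described above (sign conventions and argument choices are assumptions):

```python
def shaping_reward(dist_prev, dist_curr, angle_prev, angle_curr):
    """1% of the improvement in goal distance (meters) plus the
    change in the squared vertical (tilt) angle."""
    return 0.01 * ((dist_prev - dist_curr) +
                   (angle_prev ** 2 - angle_curr ** 2))
```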
The policy is a weighted sum of basis functions, $\bar{a}_t = \theta^{T} \phi(s_t, t)$, explored with additive noise: $a_t = (\theta + \epsilon_t)^{T} \phi(s_t, t)$. Returns accumulated along a rollout give $Q^{\pi}(s_t, a_t, t) = \sum_{\tilde{t}=t}^{T} r(s_{\tilde{t}}, a_{\tilde{t}})$.
Kober and Peters, Machine Learning, 2011
PoWER parameter update:
$\theta_{k+1} = \theta_k + \dfrac{\mathbb{E}\left[\sum_{t=1}^{T} \epsilon_t\, Q^{\pi}(s_t, a_t, t)\right]}{\mathbb{E}\left[\sum_{t=1}^{T} Q^{\pi}(s_t, a_t, t)\right]}$
Kober and Peters, Machine Learning, 2011
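A stripped-down version of this update for a batch of rollouts, replacing the expectations with sample averages (the importance weighting used in the paper is omitted):

```python
import numpy as np

def power_update(theta, rollouts):
    """EM-style PoWER step: average exploration noise eps_t weighted
    by returns Q_t.  rollouts: list of (eps, Q) pairs with
    eps.shape == (T, dim(theta)), Q.shape == (T,)."""
    num = np.zeros_like(theta)
    den = 0.0
    for eps, Q in rollouts:
        num += (eps * Q[:, None]).sum(axis=0)   # sum_t eps_t Q_t
        den += Q.sum()                           # sum_t Q_t
    return theta + num / max(den, 1e-12)
```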
Policy: $a \mid x, \theta \sim \mathcal{N}\left(\theta^{T} \phi(x),\, \hat{\Sigma}\right)$, with 31 basis functions per DOF.
Kober and Peters, Machine Learning, 2011
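One way to realize such a policy for a single DOF, with Gaussian bases over a normalized phase variable (centers, widths, and the noise scale are illustrative):

```python
import numpy as np

K = 31                                   # basis functions per DOF
centers = np.linspace(0.0, 1.0, K)       # centers over normalized phase
width = 0.05                             # basis width (assumed)
theta = np.zeros(K)                      # policy parameters (one DOF)
noise_std = 0.1                          # exploration noise scale (assumed)

def phi(x):
    """Normalized Gaussian basis activations."""
    b = np.exp(-((x - centers) ** 2) / (2 * width ** 2))
    return b / b.sum()

def sample_action(x, rng=np.random.default_rng()):
    """a | x, theta ~ N(theta^T phi(x), noise_std^2)."""
    return theta @ phi(x) + noise_std * rng.standard_normal()
```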
Apprenticeship learning: assume the reward is a linear combination of $K$ known features with unknown weights, $R(s) = \sum_{j=1}^{K} w_j\, \phi_j(s)$.
Abbeel & Ng, Int. Conf. on Machine Learning, 2004
Define the feature expectations of a policy $\pi$: $\mu(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_t) \mid \pi\right]$; the value of $\pi$ under reward weights $w$ is then $w^{T} \mu(\pi)$.
Estimate the expert's feature expectations from $M$ demonstrated trajectories:
$\hat{\mu}_E = \frac{1}{M} \sum_{i=1}^{M} \sum_{t=0}^{T} \gamma^{t}\, \phi(s_t^{(i)})$
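This Monte Carlo estimate in code, assuming the trajectories and the feature map phi are given:

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma, k):
    """mu = (1/M) sum_i sum_t gamma^t phi(s_t^(i)).
    trajectories: list of state sequences; k: feature dimension."""
    mu = np.zeros(k)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)
```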
At iteration $i$, find the reward weights that maximally separate the expert from all previously found policies:
$t^{(i)} = \max_{w : \|w\|_2 \leq 1}\; \min_{j \in \{0, \dots, i-1\}} w^{T} (\hat{\mu}_E - \mu^{(j)})$, with maximizer $w^{(i)}$.
Terminate when $t^{(i)} \leq \epsilon$. Otherwise compute the optimal policy $\pi^{(i)}$ for the reward $R = (w^{(i)})^{T} \phi$, estimate its feature expectations $\mu^{(i)}$, and iterate over $j = 1, \dots, i$.
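A sketch of the overall loop using the projection variant from the same paper, which avoids solving a QP; solve_mdp stands in for the inner RL solver and is an assumed callback:

```python
import numpy as np

def apprenticeship_projection(mu_E, mu0, solve_mdp, eps=1e-3, max_iter=50):
    """Iterate: w = mu_E - mu_bar; stop when ||w|| <= eps.
    solve_mdp(w) must return the feature expectations of the policy
    that is optimal for reward R(s) = w^T phi(s)."""
    mu_bar = mu0                       # feature expectations of initial policy
    for _ in range(max_iter):
        w = mu_E - mu_bar
        if np.linalg.norm(w) <= eps:   # expert matched within eps
            break
        mu = solve_mdp(w)              # inner RL step
        d = mu - mu_bar
        mu_bar = mu_bar + (d @ w) / (d @ d) * d   # orthogonal projection
    return w
```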
Grollman and Billard, ICRA 2012 (Best Paper Award, Cognitive Robotics)
2D DONUT distribution: high likelihood of drawing datapoints in the vicinity of, but away from, the demonstrated datapoints.
The exploration parameter determines how far the peaks sit from the base distribution. The base distribution is learned from the training examples (failed attempts) using a GMM; the Donut distribution built on top of it serves as the exploratory policy.
Continuous state: joint angle $s$; continuous action: joint velocity $a$, drawn from the DONUT distribution $a \sim D(a \mid s)$, constructed from the Gaussian conditional $\mathcal{N}(a \mid s)$ of the GMM learned on the failed demonstrations.
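An illustrative 1-D stand-in for donut-style exploration: push samples away from the mean of the learned Gaussian by an amount set by the exploration parameter (a simplification of the paper's construction):

```python
import numpy as np

def sample_donut(mu, sigma, explore, rng=np.random.default_rng()):
    """Sample near, but away from, the base Gaussian N(mu, sigma^2):
    the peak sits `explore` standard deviations from the mean.
    Simplified stand-in for the paper's Donut distribution."""
    side = rng.choice([-1.0, 1.0])            # pick a side of the mean
    return mu + side * explore * sigma + sigma * rng.standard_normal()
```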
[Plots: angular position (radian) of the robot's wrist]
State: position of the ball with respect to the hole. Action: speed and orientation of the end-effector when hitting the ball.
Grollman and Billard, IJSR, 2012