SLIDE 1 Value Function Approximation
for Robot Motor Control
Masashi Sugiyama1)2) Hirotaka Hachiya1)2) Christopher Towell2) Sethu Vijayakumar2)
1) Computer Science, Tokyo Institute of Technology 2) School of Informatics, University of Edinburgh
SLIDE 2
Maze Problem: Guide Robot to Goal
The robot knows its position (x, y) but not which direction to go. We do not teach the best action at each position; we only give a reward at the goal. Task: make the robot select the optimal action at every position.
[Figure: maze; possible actions: up, down, left, right; reward given at the goal]
SLIDE 3
Markov Decision Process (MDP)
An MDP consists of
- S: set of states,
- A: set of actions, e.g. {up, down, left, right},
- P(s, a, s'): transition probability,
- R(s, a): reward.
The action a the robot takes at state s is specified by a policy π: a = π(s). Goal: make the robot learn the optimal policy π*.
SLIDE 4
Definition of Optimal Policy
Action-value function Q^π(s, a): discounted sum of future rewards when taking action a in state s and following policy π thereafter:
Q^π(s, a) = E[ Σ_{t=1}^∞ γ^{t-1} r_t | s_1 = s, a_1 = a ]
Optimal value: Q*(s, a) = max_π Q^π(s, a)
Optimal policy: π*(s) = argmax_a Q*(s, a)
Question: How to compute Q*?
SLIDE 5
Policy Iteration
(Sutton & Barto, 1998)
Starting from some initial policy π, iterate Steps 1 and 2 until convergence.
Step 1. Compute Q^π for the current policy π:
Q^π(s, a) = E[ Σ_{t=1}^∞ γ^{t-1} r_t | s_1 = s, a_1 = a ]
Step 2. Update π by π(s) ← argmax_a Q^π(s, a)
Policy iteration always converges to π* if Q^π(s, a) can be computed in Step 1. Question: How to compute Q^π(s, a)?
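The two steps can be sketched as a minimal tabular policy iteration on a hypothetical 1×3 corridor (Step 1 approximates Q^π by sweeping to a fixed point, Step 2 is the greedy update):

```python
# Policy iteration on a hypothetical 1x3 corridor MDP (state 2 = absorbing goal).
GAMMA = 0.9
S = [0, 1, 2]
A = ["left", "right"]

def step(s, a):
    """Deterministic transition and reward; the goal state is absorbing."""
    if s == 2:
        return 2, 0.0
    s_next = max(s - 1, 0) if a == "left" else s + 1
    reward = 1.0 if s_next == 2 else 0.0
    return s_next, reward

def evaluate(pi, n_sweeps=200):
    """Step 1: compute Q^pi by sweeping its defining expectation to a fixed point."""
    Q = {(s, a): 0.0 for s in S for a in A}
    for _ in range(n_sweeps):
        for s in S:
            for a in A:
                s_next, r = step(s, a)
                Q[(s, a)] = r + GAMMA * Q[(s_next, pi[s_next])]
    return Q

def improve(Q):
    """Step 2: greedy update pi(s) = argmax_a Q^pi(s, a)."""
    return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}

pi = {s: "left" for s in S}     # some initial policy
for _ in range(10):              # iterate Steps 1 and 2 until convergence
    pi = improve(evaluate(pi))
# States 0 and 1 now choose "right", i.e. move toward the goal.
```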
SLIDE 6
Bellman Equation
- Q^π(s, a) can be recursively expressed by the Bellman equation:
Q^π(s, a) = R(s, a) + γ Σ_{s'} P(s, a, s') Q^π(s', π(s'))
- Q^π(s, a) can be computed by solving the Bellman equation for all (s, a) ∈ S × A.
Drawback: the dimensionality of the Bellman equation becomes huge in large state and action spaces → high computational cost.
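Since the Bellman equation is linear in Q^π, a finite MDP can be solved directly; a sketch on a toy 1×3 corridor, assuming NumPy is available. The dense |S||A|-sized solve is exactly what becomes infeasible for large spaces:

```python
# Solving the Bellman equation Q^pi = R + gamma * P_pi Q^pi as a linear
# system, on a hypothetical 1x3 corridor MDP (assumes NumPy is available).
import numpy as np

GAMMA = 0.9
S, A = [0, 1, 2], ["left", "right"]
idx = {(s, a): i for i, (s, a) in enumerate((s, a) for s in S for a in A)}
pi = {0: "right", 1: "right", 2: "right"}   # policy to evaluate

n = len(idx)
P_pi = np.zeros((n, n))   # transition matrix under pi, over (s, a) pairs
R = np.zeros(n)
for (s, a), i in idx.items():
    if s == 2:                              # goal is absorbing, zero reward
        s_next, r = 2, 0.0
    else:
        s_next = max(s - 1, 0) if a == "left" else s + 1
        r = 1.0 if s_next == 2 else 0.0
    P_pi[i, idx[(s_next, pi[s_next])]] = 1.0
    R[i] = r

# (I - gamma * P_pi) Q = R: one dense solve of size |S||A|, which is
# precisely the cost that blows up in large state/action spaces.
Q = np.linalg.solve(np.eye(n) - GAMMA * P_pi, R)
```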
SLIDE 7
Least-Squares Policy Iteration
(Lagoudakis and Parr, 2003)
Linear architecture:
Q̂^π(s, a) = Σ_{i=1}^K w_i φ_i(s, a)
- {φ_i(s, a)}_{i=1}^K: fixed basis functions
- {w_i}_{i=1}^K: parameters
- K: number of basis functions
The parameters {w_i}_{i=1}^K are learned so as to optimally approximate the Bellman equation in the least-squares sense. The number of parameters is only K (K << |S × A|): LSPI works well if we choose appropriate basis functions {φ_i}_{i=1}^K. Question: How to choose {φ_i}_{i=1}^K?
SLIDE 8
Popular Choice: Gaussian Kernel (GK)
k(s) = exp( − ED(s, s_c)² / (2σ²) )
- ED: Euclidean distance
- s_c: centre state
The smooth Gaussian tail goes over partitions (walls) of the maze.
[Figure: Gaussian kernel centred at s_c, with its tail crossing the partitions]
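As a concrete illustration, the ordinary Gaussian kernel on 2-D positions (pure Python; the states and σ are arbitrary):

```python
# Ordinary Gaussian kernel on 2-D positions, using the Euclidean distance ED.
import math

def gaussian_kernel(s, s_c, sigma=1.0):
    """k(s) = exp(-ED(s, s_c)^2 / (2 sigma^2))."""
    ed_sq = (s[0] - s_c[0]) ** 2 + (s[1] - s_c[1]) ** 2
    return math.exp(-ed_sq / (2.0 * sigma ** 2))

# Two states on opposite sides of a thin wall are close in Euclidean
# distance, so the kernel value stays large: the tail crosses the wall.
k_centre = gaussian_kernel((0.0, 0.0), (0.0, 0.0))   # 1.0 at the centre state
```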
SLIDE 9
Approximated Value Function by GK
Values around the partitions are not approximated well.
[Figure: optimal value function vs. GK approximation (log scale; 20 randomly located Gaussians)]
SLIDE 10
Policy Obtained by GK
GK provides an undesired policy around the partition.
[Figure: GK-based policy vs. optimal policy]
SLIDE 11
Aim of This Research
Gaussian tails go over the partition, so ordinary Gaussian kernels are not suited for approximating discontinuous value functions. We propose a new Gaussian kernel to overcome this problem.
SLIDE 12
State Space as a Graph
(Mahadevan, ICML 2005)
The ordinary Gaussian kernel uses the Euclidean distance ED:
k(s) = exp( − ED(s, s_c)² / (2σ²) )
The Euclidean distance does not incorporate the state space structure, so tail problems occur. We represent the state space structure by a graph and use it for defining Gaussian kernels.
SLIDE 13 13
Natural distance on graph is shortest path. We use shortest path in Gaussian function. We call this kernel geodesic Gaussian. SP can be efficiently computed by Dijkstra.
Geodesic Gaussian Kernels
⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛− =
2 2
2 ) , ( exp ) ( σ s s SP s k
c
Euclidean distance Shortest path
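A sketch of the geodesic Gaussian kernel: Dijkstra's algorithm gives the shortest-path distance SP on the state-space graph, which replaces ED in the Gaussian. The 4-node graph below is a hypothetical grid with one wall:

```python
# Geodesic Gaussian kernel: shortest-path (SP) distances on the state-space
# graph via Dijkstra, plugged into the Gaussian function.
import heapq
import math

def dijkstra(graph, source):
    """Shortest-path distance from source to every node (graph: node -> [(nbr, w)])."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def geodesic_gaussian(graph, s_c, sigma=1.0):
    """k(s) = exp(-SP(s, s_c)^2 / (2 sigma^2)) for every state s."""
    sp = dijkstra(graph, s_c)
    return {s: math.exp(-d ** 2 / (2.0 * sigma ** 2)) for s, d in sp.items()}

# 4 states; a wall blocks the direct edge between states 0 and 2, so their
# geodesic distance is 3 even though they are spatially adjacent.
graph = {0: [(1, 1.0)], 1: [(0, 1.0), (3, 1.0)],
         3: [(1, 1.0), (2, 1.0)], 2: [(3, 1.0)]}
k = geodesic_gaussian(graph, s_c=0, sigma=2.0)
```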
SLIDE 14
Example of Kernels
Geodesic Gaussian: tails do not go across the partition; values smoothly decrease along the maze.
[Figure: geodesic Gaussian vs. ordinary Gaussian kernels centred at s_c]
SLIDE 15
Comparison of Value Functions
With the geodesic Gaussian, values near the partition are well approximated and the discontinuity across the partition is preserved.
[Figure: optimal, geodesic Gaussian, and ordinary Gaussian value functions]
SLIDE 16
Comparison of Policies
Geodesic Gaussian kernels (GGKs) provide good policies near the partition.
[Figure: geodesic Gaussian vs. ordinary Gaussian policies]
SLIDE 17
Experimental Result
Ordinary Gaussian: tail problem. Geodesic Gaussian: no tail problem.
[Graph: fraction of optimal states vs. number of kernels, averaged over 100 runs]
SLIDE 18
Robot Arm Reaching
2-DOF robot arm. Task: move the end effector to reach the object.
State space: (joint 1 angle, joint 2 angle), in degrees.
Reward: +1 when the end effector reaches the object, 0 otherwise.
[Figure: arm with joints 1 and 2, end effector, object, and obstacle; state space over the two joint angles]
SLIDE 19
Robot Arm Reaching
Geodesic Gaussian: successfully avoids the obstacle and reaches the object.
Ordinary Gaussian: moves directly towards the object without avoiding the obstacle.
SLIDE 20
Khepera Robot Navigation
The Khepera robot has 8 infrared (IR) sensors measuring the distance to obstacles (sensor values: 0 to 1030).
Task: explore an unknown maze without collision.
Reward: +1 (forward); (others)
SLIDE 21
State Space and Graph
The 8-dimensional sensor state space is discretized by a self-organizing map.
[Figure: partitions and a 2-D visualization of the resulting graph]
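The discretization step can be sketched with a minimal self-organizing map; the 2-D data, grid size, and learning schedule below are illustrative stand-ins for the 8-D sensor vectors (NumPy assumed):

```python
# A minimal self-organizing map sketch: each SOM unit becomes one node of
# the state graph (toy 2-D data standing in for 8-D sensor vectors).
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(0, 1030, size=(500, 2))      # simulated sensor readings
units = rng.uniform(0, 1030, size=(9, 2))       # 3x3 grid of SOM units
grid = np.array([(i, j) for i in range(3) for j in range(3)])  # unit layout

for t in range(1000):
    x = data[rng.integers(len(data))]
    bmu = np.argmin(np.sum((units - x) ** 2, axis=1))   # best-matching unit
    # Learning rate and neighbourhood radius shrink over time.
    lr = 0.5 * (1 - t / 1000)
    radius = 1.5 * (1 - t / 1000) + 0.5
    h = np.exp(-np.sum((grid - grid[bmu]) ** 2, axis=1) / (2 * radius ** 2))
    units += lr * h[:, None] * (x - units)              # pull units toward x

# Each incoming state is then mapped to its nearest unit (a graph node).
def discretize(s):
    return int(np.argmin(np.sum((units - s) ** 2, axis=1)))
```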
SLIDE 22
Khepera Robot Navigation
Geodesic Gaussian: when facing an obstacle, makes a turn (and goes forward again).
Ordinary Gaussian: when facing an obstacle, goes backward (and then forward again).
SLIDE 23
Experimental Results
Geodesic Gaussian outperforms ordinary Gaussian.
[Graph: performance of geodesic vs. ordinary Gaussian, averaged over 30 runs]
SLIDE 24
Conclusion
- Value function approximation requires good basis functions.
- Ordinary Gaussian kernel: the tail goes over discontinuities.
- Geodesic Gaussian kernel: smooth along the state space.
Through the experiments, we showed that the geodesic Gaussian is promising in high-dimensional continuous problems!