Value Function Approximation on Non-linear Manifolds for Robot Motor Control (PowerPoint PPT Presentation)

Masashi Sugiyama 1)2), Hirotaka Hachiya 1)2), Christopher Towell 2), Sethu Vijayakumar 2)
1) Computer Science, Tokyo Institute of Technology; 2) School of Informatics, University of Edinburgh


SLIDE 1

Value Function Approximation

on Non-linear Manifolds

for Robot Motor Control

Masashi Sugiyama1)2) Hirotaka Hachiya1)2) Christopher Towell2) Sethu Vijayakumar2)

1) Computer Science, Tokyo Institute of Technology 2) School of Informatics, University of Edinburgh

SLIDE 2

Maze Problem: Guide Robot to Goal

The robot knows its position but not which direction to go. We do not teach the best action at each position; we only give a reward at the goal. Task: make the robot select the optimal action.

[Figure: maze with possible actions (up, down, left, right), robot position (x, y), and goal reward]

SLIDE 3

Markov Decision Process (MDP)

An MDP consists of

  • S = {s_i} : set of states,
  • A = {up, down, left, right} : set of actions,
  • P(s, a, s′) : transition probability,
  • R(s, a) : reward.

The action the robot takes at state s is specified by the policy a = π(s). Goal: make the robot learn the optimal policy π*.
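As a concrete illustration, the (S, A, P, R) tuple can be written down directly for a toy problem. The 4-state corridor below is a hypothetical stand-in, not the maze from the slides:

```python
# Toy (S, A, P, R) tuple for a 4-state corridor (hypothetical example).
S = [0, 1, 2, 3]          # states; state 3 is the goal
A = ["left", "right"]     # actions

def P(s, a, s_next):
    """Deterministic transition probability P(s, a, s')."""
    target = max(0, s - 1) if a == "left" else min(3, s + 1)
    return 1.0 if s_next == target else 0.0

def R(s, a):
    """Reward: +1 for the step that reaches the goal, 0 otherwise."""
    return 1.0 if (s == 2 and a == "right") else 0.0

def pi(s):
    """A deterministic policy a = pi(s): always move right."""
    return "right"
```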

SLIDE 4

Definition of Optimal Policy

Action-value function: discounted sum of future rewards when taking action a in state s and following π thereafter:

  Q^π(s, a) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a ]

Optimal value:  Q*(s, a) = max_π Q^π(s, a)
Optimal policy: π*(s) = argmax_a Q*(s, a)

  • π* is computed if Q* is given.

Question: How to compute Q*?
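The discounted sum inside the expectation is easy to compute for a finite sampled trajectory; a minimal sketch (the reward sequence is illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum of rewards: sum over t of gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A reward arriving two steps in the future is worth gamma^2 = 0.81 now.
v = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
```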

SLIDE 5

Policy Iteration

Starting from some initial policy π, iterate Steps 1 and 2 until convergence.

Step 1. Compute Q^π(s, a) for the current π.
Step 2. Update π by π(s) ← argmax_a Q^π(s, a).

Policy iteration always converges to π* if Q^π(s, a) in Step 1 can be computed.
Question: How to compute Q^π(s, a)?

(Sutton & Barto, 1998)
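The two steps can be sketched in tabular form. The 2-state MDP below is illustrative, and Step 1 here evaluates Q^π by repeated Bellman backups rather than an exact solve:

```python
# Tabular policy iteration on an illustrative 2-state MDP.
# Deterministic transitions: action a taken in state s leads to T[s][a].
gamma = 0.9
S, A = [0, 1], ["stay", "move"]
T = {0: {"stay": 0, "move": 1}, 1: {"stay": 1, "move": 0}}
R = {(1, "stay"): 1.0}                    # reward only for staying in state 1

def evaluate(pi, sweeps=200):
    """Step 1: compute Q^pi by repeated Bellman backups."""
    Q = {(s, a): 0.0 for s in S for a in A}
    for _ in range(sweeps):
        Q = {(s, a): R.get((s, a), 0.0)
                     + gamma * Q[(T[s][a], pi[T[s][a]])]
             for s in S for a in A}
    return Q

pi = {0: "stay", 1: "stay"}               # some initial policy
while True:
    Q = evaluate(pi)                                            # Step 1
    new_pi = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}    # Step 2
    if new_pi == pi:                      # converged
        break
    pi = new_pi
```

After convergence the policy moves from state 0 to state 1 and stays there, which is optimal for this toy reward.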

SLIDE 6

Bellman Equation

  • Q^π(s, a) can be recursively expressed by the Bellman equation:

    Q^π(s, a) = R(s, a) + γ Σ_{s′} P(s, a, s′) Q^π(s′, π(s′))    ∀s, ∀a

  • Q^π(s, a) can be computed by solving the Bellman equation.

Drawback: the dimensionality of the Bellman equation becomes huge in large state and action spaces (one unknown per element of S × A) → high computational cost.
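Since the Bellman equation is linear in Q^π, it can be written as (I − γP_π)q = r with one unknown per state-action pair, which is where the dimensionality cost comes from. A toy sketch with two state-action pairs (the transition matrix and rewards are illustrative):

```python
# Bellman equation as a linear system: (I - gamma * P_pi) q = r.
# Under pi, pair 0 leads to pair 1, and pair 1 loops back to itself.
gamma = 0.9
P_pi = [[0.0, 1.0],
        [0.0, 1.0]]
r = [0.0, 1.0]

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda k: abs(M[k][col]))
        M[col], M[piv] = M[piv], M[col]
        for k in range(n):
            if k != col:
                f = M[k][col] / M[col][col]
                M[k] = [x - f * y for x, y in zip(M[k], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

I_minus_gP = [[(i == j) - gamma * P_pi[i][j] for j in range(2)]
              for i in range(2)]
q = solve(I_minus_gP, r)    # q1 = 1/(1 - gamma) = 10, q0 = gamma * q1 = 9
```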

SLIDE 7

Least-Squares Policy Iteration

Linear architecture:

  Q̂^π(s, a) = Σ_{i=1}^K w_i φ_i(s, a)

  • {φ_i}_{i=1}^K : fixed basis functions
  • {w_i}_{i=1}^K : parameters
  • K : number of basis functions

{w_i}_{i=1}^K is learned so as to optimally approximate the Bellman equation in the least-squares sense. The number of parameters is only K (K ≪ |S × A|): LSPI works well if we choose appropriate {φ_i}_{i=1}^K.
Question: How to choose {φ_i}_{i=1}^K?

(Lagoudakis and Parr, 2003)
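A minimal sketch of the linear architecture itself. The basis functions, weights, and inputs below are illustrative; actual LSPI fits w from sampled transitions against the Bellman equation:

```python
# Linear architecture: Q_hat(s, a) = sum_{i=1}^K w_i * phi_i(s, a).
def q_hat(w, phi, s, a):
    """Evaluate the weighted sum of basis functions at (s, a)."""
    return sum(w_i * phi_i(s, a) for w_i, phi_i in zip(w, phi))

phi = [lambda s, a: 1.0,        # phi_1: constant feature
       lambda s, a: s + a]      # phi_2: a simple linear feature
w = [0.5, 2.0]                  # K = 2 parameters (illustrative values)

value = q_hat(w, phi, s=3.0, a=1.0)    # 0.5*1 + 2.0*(3.0 + 1.0) = 8.5
```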

SLIDE 8

Popular Choice: Gaussian Kernel (GK)

The smooth Gaussian tail goes over partitions:

  k(s, s_c) = exp( −ED(s, s_c)² / (2σ²) )

  • ED : Euclidean distance
  • s_c : centre state

[Figure: Gaussian kernel centred at s_c; its tail crosses the maze partitions]
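The ordinary Gaussian kernel is straightforward to write down; a sketch for 2-D states (sigma is arbitrary here). It also shows the tail problem: the kernel value depends only on Euclidean distance, so it is unaffected by any wall between the two states:

```python
import math

def euclidean(s, c):
    """ED(s, s_c): straight-line distance between 2-D states."""
    return math.hypot(s[0] - c[0], s[1] - c[1])

def gaussian_kernel(s, c, sigma=1.0):
    """k(s, s_c) = exp(-ED(s, s_c)^2 / (2 sigma^2))."""
    return math.exp(-euclidean(s, c) ** 2 / (2.0 * sigma ** 2))

# Two states one unit apart get kernel value exp(-0.5), even if a maze
# wall separates them: Euclidean distance is blind to the partition.
k = gaussian_kernel((0.0, 0.0), (0.0, 1.0))
```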

SLIDE 9

Approximated Value Function by GK

Values around the partitions are not approximated well. [Figure: value function approximated by GK vs. optimal value function; log scale, 20 randomly located Gaussians]

SLIDE 10

Policy Obtained by GK

GK provides an undesired policy around the partition. [Figure: GK-based policy vs. optimal policy]

SLIDE 11

Aim of This Research

Gaussian tails go over the partition, so ordinary Gaussian kernels are not suited to approximating discontinuous value functions. We propose a new Gaussian kernel to overcome this problem.

SLIDE 12

State Space as a Graph

The ordinary Gaussian kernel uses the Euclidean distance, which does not incorporate the state space structure, so tail problems occur. We represent the state space structure by a graph and use it to define Gaussian kernels.

(Mahadevan, ICML 2005)

  k(s, s_c) = exp( −ED(s, s_c)² / (2σ²) )

SLIDE 13

Geodesic Gaussian Kernels

The natural distance on a graph is the shortest path, so we use the shortest path SP in the Gaussian function and call this kernel the geodesic Gaussian. SP can be computed efficiently by Dijkstra's algorithm.

  k(s, s_c) = exp( −SP(s, s_c)² / (2σ²) )

[Figure: Euclidean distance vs. shortest path]
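A sketch of the geodesic Gaussian kernel, assuming the state space graph is given as an adjacency dict; Dijkstra's algorithm supplies SP:

```python
import heapq
import math

def dijkstra(graph, src):
    """Shortest-path distances SP(src, .) on a weighted adjacency-dict graph."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale queue entry, skip
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def geodesic_gaussian(graph, center, sigma=1.0):
    """k(s, s_c) = exp(-SP(s, s_c)^2 / (2 sigma^2)) for every state s."""
    sp = dijkstra(graph, center)
    return {s: math.exp(-d ** 2 / (2.0 * sigma ** 2)) for s, d in sp.items()}

# Illustrative 4-state graph: states 0 and 3 connect only through 1 and 2,
# so the kernel decays along the graph, never across missing edges (walls).
g = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0}}
k = geodesic_gaussian(g, center=0, sigma=2.0)
```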

SLIDE 14

Example of Kernels

Tails do not go across the partition; values smoothly decrease along the maze. [Figure: geodesic Gaussian vs. ordinary Gaussian kernels]


SLIDE 15

Comparison of Value Functions

Values near the partition are well approximated, and the discontinuity across the partition is preserved. [Figure: value functions for geodesic Gaussian, ordinary Gaussian, and optimal]

SLIDE 16

Comparison of Policies

GGKs provide good policies near the partition. [Figure: geodesic Gaussian policy vs. ordinary Gaussian policy]

SLIDE 17

Experimental Result

[Figure: fraction of optimal states vs. number of kernels, averaged over 100 runs; geodesic vs. ordinary Gaussian]

Ordinary Gaussian suffers from the tail problem; geodesic Gaussian does not.

SLIDE 18

Robot Arm Reaching

2-DOF robot arm; the state space is (joint 1 angle, joint 2 angle).

[Figure: arm with joint 1, joint 2, end effector, object, and obstacle; joint 1 (degree) from −180 to 180, joint 2 (degree) from −100 to 100]

Reward: +1 when the end effector reaches the object, 0 otherwise.

Task: move the end effector to reach the object.

SLIDE 19

Robot Arm Reaching

Geodesic Gaussian: successfully avoids the obstacle and reaches the object. Ordinary Gaussian: moves directly towards the object without avoiding the obstacle.

SLIDE 20

Khepera Robot Navigation

The Khepera has 8 IR sensors measuring the distance to obstacles (sensor values: 0 to 1030). Task: explore an unknown maze without collision.

Reward: +1 (forward), −2 (collision), 0 (others)

SLIDE 21

State Space and Graph

Discretize the 8-D state space by a self-organizing map.

[Figure: partitions and 2-D visualization of the discretized state space]
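The role of the discretization can be sketched as nearest-prototype quantization; a real self-organizing map learns its prototypes from sensor data, whereas the two prototypes below are made up:

```python
# Nearest-prototype quantization: map an 8-D sensor vector to the index of
# its closest prototype (squared Euclidean distance). The two prototypes
# here are invented sensor profiles, not learned SOM units.
def nearest_prototype(state, prototypes):
    return min(range(len(prototypes)),
               key=lambda i: sum((x - p) ** 2
                                 for x, p in zip(state, prototypes[i])))

prototypes = [[0.0] * 8, [1000.0] * 8]   # "far from walls" vs. "near walls"
idx = nearest_prototype([900.0] * 8, prototypes)
```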

SLIDE 22

Khepera Robot Navigation

Ordinary Gaussian: when facing an obstacle, goes backward (and then forward again). Geodesic Gaussian: when facing an obstacle, makes a turn (and then goes forward).

SLIDE 23

Experimental Results

Averaged over 30 runs, geodesic outperforms ordinary Gaussian.

[Figure: performance of geodesic vs. ordinary Gaussian kernels]

SLIDE 24

Conclusion

  • Value function approximation: good basis functions are needed.
  • Ordinary Gaussian kernel: the tail goes over discontinuities.
  • Geodesic Gaussian kernel: smooth along the state space.

Through the experiments, we showed that the geodesic Gaussian kernel is promising in high-dimensional continuous problems!