Function Approximation via Tile Coding: Automating Parameter Choice





SLIDE 1

Function Approximation via Tile Coding: Automating Parameter Choice

Alexander Sherstov and Peter Stone Department of Computer Sciences The University of Texas at Austin

SLIDE 2

About the Authors

Alex Sherstov Peter Stone

Thanks to Nick Jong for presenting!

SLIDE 3

Overview

  • TD reinforcement learning

– Leading abstraction for decision making
– Uses function approximation to store value function

[Diagram: agent-environment loop. The agent sends action a; the environment (transition function t, reward function r) returns a reward and new state; the agent maintains the value function Q(s, a).]

SLIDE 4

Overview


  • Existing methods

– Discretization, neural nets, radial basis, case-based, ...

[Santamaria et al., 1997]

– Trade-offs: representational power, time/space req’s, ease of use

SLIDE 5

Overview, cont.

  • "Happy medium": tile coding

– Widely used in RL

[Stone and Sutton, 2001, Santamaria et al., 1997, Sutton, 1996].

– Use in robot soccer:

[Diagram: robot soccer pipeline. Full soccer state → few continuous state variables (13) → sparse, coarse tile coding → huge binary feature vector F (about 400 1's and 40,000 0's) → linear map → action values Q(s, a)]

SLIDE 6

Our Results

  • We show that:

– Tile coding is parameter-sensitive
– Optimal parameterization depends on the problem and elapsed training time

SLIDE 7

Our Results


  • We contribute:

– An automated parameter-adjustment scheme
– Empirical validation

SLIDE 8

Background: Reinforcement Learning

  • RL problem given by S, A, t, r:

– S, set of states;
– A, set of actions;
– t : S × A → Pr(S), transition function;
– r : S × A → R, reward function.

SLIDE 9

Background: Reinforcement Learning


  • Solution:

– policy π∗ : S → A that maximizes the return Σ_{i=0}^{∞} γ^i r_i
– Q-learning: find π∗ by approximating the optimal value function Q∗ : S × A → R
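The Q-learning bullet above can be made concrete with a minimal tabular sketch. This is an illustration, not the authors' code: the `env_step`/`env_reset` interface, function names, and parameter values are all assumptions.

```python
import random

def q_learning_episode(Q, env_step, env_reset, actions,
                       alpha=0.1, gamma=0.9, epsilon=0.1):
    """Run one episode of tabular Q-learning.

    Q:        dict mapping (state, action) -> estimated value
    env_step: (state, action) -> (reward, next_state, done)
    """
    s = env_reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q.get((s, b), 0.0))
        r, s2, done = env_step(s, a)
        # TD backup toward r + gamma * max_b Q(s2, b)
        target = r if done else r + gamma * max(Q.get((s2, b), 0.0) for b in actions)
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = old + alpha * (target - old)
        s = s2
    return Q
```

Replacing the dict `Q` with a function approximator such as tile coding is exactly the step the rest of the talk addresses.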

SLIDE 10

Background: Reinforcement Learning


  • Need function approximation (FA) to generalize Q∗ to unseen situations
SLIDE 11

Background: Tile Coding

[Diagram: two overlapping tilings (Tiling #1 and Tiling #2) over a 2-D space with axes State Variable #1 and State Variable #2]

  • Maintaining arbitrary f : D → R (often D = S × A):

– D partitioned into tiles, each with a weight
– Each partition is a tiling; several used
– Given x ∈ D, sum weights of participating tiles ⇒ get f(x)
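The weight-summing scheme above can be sketched in a few lines for the univariate case with uniformly offset tilings (the setting studied later in the talk). This is an illustrative reconstruction, not the authors' code; all names are ours.

```python
def tile_indices(x, tile_width, n_tilings):
    """Return the one active tile per tiling for scalar input x.

    Tiling i is offset by i * (tile_width / n_tilings), so the
    effective resolution is w / t, as in the talk.
    """
    return [
        (i, int((x + i * tile_width / n_tilings) // tile_width))
        for i in range(n_tilings)
    ]

def evaluate(weights, x, tile_width, n_tilings):
    """f(x) = sum of the weights of the tiles x falls into."""
    return sum(weights.get(idx, 0.0)
               for idx in tile_indices(x, tile_width, n_tilings))

def update(weights, x, target, alpha, tile_width, n_tilings):
    """Move f(x) toward target; the error is shared across tilings."""
    error = target - evaluate(weights, x, tile_width, n_tilings)
    for idx in tile_indices(x, tile_width, n_tilings):
        weights[idx] = weights.get(idx, 0.0) + (alpha / n_tilings) * error
```

Because nearby inputs share tiles, an update at x also shifts f at neighboring points: that overlap is the "generalization breadth" controlled by the number of tilings t.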

SLIDE 12

Background: Tile Coding Parameters

  • We study canonical univariate tile coding:

– w, tile width (same for all tiles)
– t, # of tilings ("generalization breadth")
– r = w/t, resolution
– tilings uniformly offset

SLIDE 13

Background: Tile Coding Parameters


  • Empirical model:

– Fix resolution r, vary generalization breadth t
– Same resolution ⇒ same representational power and asymptotic performance
– But: t affects intermediate performance
– How to set t?

SLIDE 14

Testbed Domain: Grid World

  • Domain and optimal policy:

[Diagram: grid world with start and goal cells, a wall, and an abyss; arrows mark the optimal policy, with per-cell success probabilities between 0.5 and 0.8]

  • Episodic task (cliff, goal cells terminal)
  • Actions:

(d, p) ∈ {↑, ↓, →, ←} × [0, 1]

SLIDE 15

Testbed Domain, cont.

  • Move succeeds w/ prob. F(p), random o/w;

F varies from cell to cell:

[Plot: success probability F(p) as a function of p ∈ [0, 1], ranging from 0.5 to 1; the curve varies from cell to cell]

SLIDE 16

Testbed Domain, cont.


  • 2 reward functions:

– −100 cliff, +100 goal, −1 otherwise ("informative")
– +100 goal, 0 otherwise ("uninformative")

SLIDE 17

Testbed Domain, cont.


  • Use of tile coding: generalize over actions (p)
SLIDE 18

Generalization Helps Initially

[Plots: % optimal episodes completed vs. episodes for 1, 3, and 6 tilings; left: informative reward (up to 1,000 episodes), right: uninformative reward (up to 4,000 episodes)]

Generalization improves cliff avoidance.

SLIDE 19

Generalization Helps Initially, cont.

[Plots: % optimal episodes completed vs. episodes (up to 50,000) for 1, 3, and 6 tilings, at learning rates α = 0.5, 0.1, and 0.05]

Generalization improves discovery of better actions.

SLIDE 20

Generalization Hurts Eventually

[Plots: % optimal episodes completed (95–99%) vs. episodes (40,000–100,000) for 1, 3, and 6 tilings; left: informative reward, right: uninformative reward]

Generalization slows convergence.

SLIDE 21

Adaptive Generalization

  • Best to adjust generalization over time
SLIDE 22

Adaptive Generalization

  • Solution: reliability index ρ(s, a) ∈ [0, 1]

– ρ(s, a) ≈ 1 ⇒ Q(s, a) reliable (and vice versa)
– large backup error on (s, a) decreases ρ(s, a) (and vice versa)

SLIDE 23

Adaptive Generalization


  • Use of ρ(s, a):

– An update to Q(s, a) is generalized to the largest nearby region R that is unreliable on average: (1/|R|) Σ_{(s,a)∈R} ρ(s, a) ≤ 1/2
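The reliability-index rule above might be realized as follows. This is a sketch under assumed constants (the `error_scale` normalizer and the 0.9/0.1 mixing weights are ours), not the authors' exact algorithm.

```python
def update_reliability(rho, key, backup_error, error_scale=10.0):
    """Decrease rho for large backup errors, increase it for small ones.

    rho: dict mapping (state, action) -> reliability in [0, 1]
    error_scale: assumed constant separating 'large' from 'small' errors
    """
    r = rho.get(key, 0.0)
    confidence = max(0.0, 1.0 - abs(backup_error) / error_scale)
    # move rho toward 1 when the error is small, toward 0 when it is large
    rho[key] = 0.9 * r + 0.1 * confidence

def generalization_region(rho, candidates):
    """Pick the largest candidate region R that is unreliable on
    average, i.e. (1/|R|) * sum of rho(s, a) over R is <= 1/2.

    candidates: nested regions (lists of (s, a) keys), smallest first.
    """
    chosen = []
    for region in candidates:
        avg = sum(rho.get(k, 0.0) for k in region) / len(region)
        if avg <= 0.5:
            chosen = region   # still unreliable: widen generalization
        else:
            break
    return chosen
```

Early in learning rho is low everywhere, so updates generalize broadly; as estimates stabilize rho rises and the chosen region shrinks, giving the time- and space-variant behavior described on the next slides.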

SLIDE 24

Effects of Adaptive Generalization

  • Time-variant generalization

– Encourages generalization when Q(s, a) changing
– Suppresses generalization near convergence

SLIDE 25

Effects of Adaptive Generalization


  • Space-variant generalization

– Rarely-visited states benefit from generalization for a longer time

SLIDE 26

Adaptive Generalization at Work

[Plots: % optimal episodes completed vs. episodes for adaptive generalization and fixed 1, 3, and 6 tilings; left: episodes 0–1,000, right: episodes 1,000–1,000,000]

Adaptive generalization better than any fixed setting.

SLIDE 27

Conclusions

  • Precise empirical study of parameter choice in tile coding
SLIDE 28

Conclusions

  • No single setting ideal for all problems, or even throughout the learning curve on the same problem

SLIDE 29

Conclusions


  • Contributed an algorithm for adjusting parameters as needed in different regions of S × A (space-variant generalization) and at different learning stages (time-variant generalization)

SLIDE 30

Conclusions


  • Showed superiority of this adaptive technique to any fixed setting

SLIDE 31

References

[Santamaria et al., 1997] Santamaria, J. C., Sutton, R. S., and Ram, A. (1997). Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2):163–217.

[Stone and Sutton, 2001] Stone, P. and Sutton, R. S. (2001). Scaling reinforcement learning toward RoboCup soccer. In Proc. 18th International Conference on Machine Learning (ICML-01), pages 537–544. Morgan Kaufmann, San Francisco, CA.

[Sutton, 1996] Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 8, pages 1038–1044, Cambridge, MA. MIT Press.