SLIDE 1

Approximate Q-Learning

Dan Weld / University of Washington

[Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley – materials available at http://ai.berkeley.edu.]

Q Learning

Forall s, a:
    Initialize Q(s, a) = 0

Repeat forever:
    Observe the current state s
    Choose some action a
    Execute it in the real world, observing (s, a, r, s′)
    Do update:
        Q(s, a) ← (1 − α) Q(s, a) + α [r + γ max_a′ Q(s′, a′)]
    Equivalently:
        Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]
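
For concreteness, here is a minimal Python sketch of this loop. The environment interface (reset, step, actions) and the epsilon-greedy choice of action are illustrative assumptions; the slide only says "choose some action".

    import random
    from collections import defaultdict

    def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                     # Q(s, a) = 0 for all s, a

        def choose_action(s):
            # Epsilon-greedy: one common way to "choose some action a".
            if random.random() < epsilon:
                return random.choice(env.actions(s))
            return max(env.actions(s), key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = choose_action(s)
                s2, r, done = env.step(a)          # execute in the real world
                # Bootstrap only from non-terminal successor states.
                target = r if done else r + gamma * max(
                    Q[(s2, a2)] for a2 in env.actions(s2))
                # The update, in its "equivalently" form:
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s2
        return Q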

SLIDE 2

Example: Pacman

Let's say we discover through experience that this state is bad... or even this one! [Figures: two nearly identical Pacman states]

In exact Q-learning, each state's Q-values are learned separately, so discovering that one state is bad tells us nothing about the other.
SLIDE 3

Q-learning, no features, 50 learning trials: [video demo]

Q-learning, no features, 1000 learning trials: [video demo]

SLIDE 4

Feature-Based Representations

Solution: describe states with a vector of features (aka "properties")

– Features are functions from states to ℝ (often 0/1) that capture important properties of the state
– Examples (made concrete in the sketch below):

  • Distance to closest ghost or dot
  • Number of ghosts
  • 1 / (dist to dot)²
  • Is Pacman in a tunnel? (0/1)
  • … etc.
  • Is the state the exact state on this slide?

– Can also describe a q-state (s, a) with features (e.g., action moves closer to food)

How to use features? Using features, we can represent V and/or Q as follows:

V(s) = g(f1(s), f2(s), …, fn(s))
Q(s,a) = g(f1(s,a), f2(s,a), …, fn(s,a))

What should we use for g (and f)?
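
To make "features = functions from states to ℝ" concrete, here is a small Python sketch; the toy grid encoding (positions as (x, y) tuples, Manhattan distance) is an assumption for illustration, not the course's actual state API.

    from typing import List, Tuple

    Pos = Tuple[int, int]

    def manhattan(p: Pos, q: Pos) -> int:
        return abs(p[0] - q[0]) + abs(p[1] - q[1])

    # Each feature maps a state to a real number (often 0/1).
    def f_closest_dot(pacman: Pos, dots: List[Pos]) -> float:
        d = min(manhattan(pacman, dot) for dot in dots)
        return 1.0 / (d + 1) ** 2            # the "1 / (dist to dot)^2" idea

    def f_ghost_adjacent(pacman: Pos, ghosts: List[Pos]) -> float:
        return float(any(manhattan(pacman, g) <= 1 for g in ghosts))

    # A state's feature vector is just every feature evaluated on it.
    def feature_vector(pacman: Pos, dots: List[Pos], ghosts: List[Pos]) -> List[float]:
        return [f_closest_dot(pacman, dots), f_ghost_adjacent(pacman, ghosts)]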

SLIDE 5

Linear Combination

• Using a feature representation, we can write a q function (or value function) for any state using a few weights:

  V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
  Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)

• Advantage: our experience is summed up in a few powerful numbers
• Disadvantage: states sharing features may actually have very different values!

Approximate Q-Learning

• Q-learning with linear Q-functions (a sketch in code follows below):

  difference = [r + γ max_a′ Q(s′, a′)] − Q(s, a)
  Q(s, a) ← Q(s, a) + α [difference]        (exact Q's)
  wi ← wi + α [difference] fi(s, a)         (approximate Q's)

• Intuitive interpretation:
  – Adjust weights of active features
  – E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features

• Formal justification: in a few slides!

SLIDE 6

Example: Pacman Features

Q(s, a) = w1 fDOT(s, a) + w2 fGST(s, a)

fDOT(s, a) = 1 / (distance to closest food after taking a)
fGST(s, a) = distance to closest ghost after taking a

fDOT(s, NORTH) = 0.5
fGST(s, NORTH) = 1.0

Example: Q-Pacman

[Demo: approximate Q-learning Pacman, α = 0.004]
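
Plugging the slide's feature values for NORTH into the linear form, with hypothetical weights (the deck's actual weight values are not in this extraction):

    # Illustrative weights only: food attracts, ghosts repel.
    w_dot, w_gst = 4.0, -1.0
    f_dot, f_gst = 0.5, 1.0      # fDOT(s, NORTH), fGST(s, NORTH) from the slide
    q_north = w_dot * f_dot + w_gst * f_gst
    print(q_north)               # 4.0*0.5 - 1.0*1.0 = 1.0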

SLIDE 7

Video of Demo: Approximate Q-Learning -- Pacman

Sidebar: Q-Learning and Least Squares

SLIDE 8

Linear Approximation: Regression

Prediction (one feature): ŷ = w0 + w1 f1(x)
Prediction (two features): ŷ = w0 + w1 f1(x) + w2 f2(x)

[Figures: 2D and 3D regression plots]

Optimization: Least Squares

total error = Σi (yi − ŷi)² = Σi (yi − Σk wk fk(xi))²

[Figure: a fitted line showing each observation, its prediction, and the error or "residual" between them]

SLIDE 9

Minimizing Error

Approximate q update explained. Imagine we had only one point x, with features f(x), target value y, and weights w:

error(w) = ½ (y − Σk wk fk(x))²        ("target" y, "prediction" Σk wk fk(x))

∂error/∂wm = −(y − Σk wk fk(x)) fm(x)

wm ← wm + α (y − Σk wk fk(x)) fm(x)
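
Substituting the Q-learning names for "target" and "prediction" turns this least-squares step into exactly the approximate Q-learning update from a few slides back; a sketch of the substitution:

    % target y = r + \gamma \max_{a'} Q(s', a'); prediction = \sum_k w_k f_k(s,a) = Q(s,a)
    \begin{aligned}
    \mathrm{error}(w) &= \tfrac{1}{2}\Bigl(y - \sum_k w_k f_k(x)\Bigr)^2 \\
    \frac{\partial\,\mathrm{error}}{\partial w_m} &= -\Bigl(y - \sum_k w_k f_k(x)\Bigr) f_m(x) \\
    w_m &\leftarrow w_m + \alpha \bigl(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\bigr) f_m(s, a)
    \end{aligned}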

[Figure: data fit with a degree-15 polynomial]

Overfitting: Why Limiting Capacity Can Help

SLIDE 10

Simple Problem

Given: features of the current state
Predict: will Pacman die on the next step?

Just one feature. See a pattern?

• Ghost one step away, pacman dies
• Ghost one step away, pacman dies
• Ghost one step away, pacman dies
• Ghost one step away, pacman dies
• Ghost one step away, pacman lives
• Ghost more than one step away, pacman lives
• Ghost more than one step away, pacman lives
• Ghost more than one step away, pacman lives
• Ghost more than one step away, pacman lives
• Ghost more than one step away, pacman lives
• Ghost more than one step away, pacman lives

Learn: Ghost one step away → pacman dies!

SLIDE 11

What if we add more features?

• Ghost one step away, score 211, pacman dies
• Ghost one step away, score 341, pacman dies
• Ghost one step away, score 231, pacman dies
• Ghost one step away, score 121, pacman dies
• Ghost one step away, score 301, pacman lives
• Ghost more than one step away, score 205, pacman lives
• Ghost more than one step away, score 441, pacman lives
• Ghost more than one step away, score 219, pacman lives
• Ghost more than one step away, score 199, pacman lives
• Ghost more than one step away, score 331, pacman lives
• Ghost more than one step away, score 251, pacman lives

Learn: Ghost one step away AND score is NOT a prime number → pacman dies!

There's fitting, and there's overfitting

[Figure: the data fit with a degree-1 polynomial]

SLIDE 12

[Figure: the same data fit with a degree-2 polynomial]

[Figure: the same data fit with a degree-15 polynomial]

Overfitting
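
The pattern in these plots is easy to reproduce; here is a small sketch with numpy polynomial fits (the data is synthetic, not the deck's):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(1, 20, 25)
    y = 2.0 * x + rng.normal(scale=3.0, size=x.size)   # noisy line

    for degree in (1, 2, 15):
        coeffs = np.polyfit(x, y, degree)              # least-squares fit
        y_hat = np.polyval(coeffs, 25.0)               # predict outside the data
        print(degree, round(float(y_hat), 1))
    # The degree-15 fit tracks the training points closely but extrapolates
    # wildly; the degree-1 fit generalizes sensibly.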

SLIDE 13

Approximating the Q Function

• Linear approximation
• Could also use a deep neural network
  – https://www.nervanasys.com/demystifying-deep-reinforcement-learning/

DeepMind Atari: https://www.youtube.com/watch?v=V1eYniJ0Rnk

SLIDE 14

DQN Results on Atari

Slide adapted from David Silver

Approximating the Q Function

Linear approximation: Q is a single weighted sum of the features f1(s,a), f2(s,a), …, fm(s,a).

Neural approximation (nonlinear): the same features feed a network of sigmoid units (a sketch follows below):

h(z) = 1 / (1 + e^(−z))

[Figure: network diagrams mapping the features to Q, linear vs. neural]
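
A minimal numpy sketch of the nonlinear version: the same feature vector, passed through one hidden layer of sigmoid units (shapes and layer sizes are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))       # the h(z) above

    def q_neural(f, W1, b1, w2, b2):
        # f: feature vector (m,), W1: (h, m), b1: (h,), w2: (h,), b2: scalar
        hidden = sigmoid(W1 @ f + b1)         # nonlinear hidden layer
        return float(w2 @ hidden + b2)        # scalar Q estimate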

SLIDE 15

Deep Representations

• A deep representation is a composition of many functions:

  x → h1 → … → hn → y → l,   with weights w1, …, wn

• Its gradient can be backpropagated by the chain rule:

  ∂l/∂x = (∂h1/∂x) (∂h2/∂h1) … (∂hn/∂hn−1) (∂y/∂hn) (∂l/∂y)

  and the weight gradients ∂l/∂w1, …, ∂l/∂wn follow the same pattern, ending in ∂h1/∂w1, …, ∂hn/∂wn.

Slide adapted from David Silver
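
A minimal numpy sketch of this chain rule for one hidden layer; the squared-error loss and the shapes are illustrative assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_backward(x, target, W1, W2):
        # Forward phase: compose the functions x -> h1 -> y -> l.
        h1 = sigmoid(W1 @ x)                     # hidden layer
        y = W2 @ h1                              # linear output layer
        l = 0.5 * np.sum((y - target) ** 2)      # loss

        # Backward phase: multiply local Jacobians back to front (chain rule).
        dl_dy = y - target                       # dl/dy
        dl_dh1 = W2.T @ dl_dy                    # dl/dh1 = (dy/dh1)^T dl/dy
        dl_dz1 = dl_dh1 * h1 * (1.0 - h1)        # sigmoid'(z) = h(1 - h)
        dl_dW2 = np.outer(dl_dy, h1)             # dl/dW2
        dl_dW1 = np.outer(dl_dz1, x)             # dl/dW1
        return l, dl_dW1, dl_dW2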

Multi Layer Perceptron

[Figure: network with inputs [X1, X2, X3], a hidden layer, and outputs [Y1, Y2]; weights vij connect input i to hidden unit j, weights wjk connect hidden unit j to output k; each unit computes zj = Σi wij xi and then the sigmoid activation a = h(z)]

• Multiple layers
• Feed forward
• Connected weights
• 1-of-N output

SLIDE 16

Training via Stochastic Gradient Descent

• Sample the gradient of the expected loss L(w) = E[l]:

  ∂l/∂w ~ E[∂l/∂w] = ∂L(w)/∂w

• Adjust w down the sampled gradient:

  Δw ∝ −∂l/∂w

Slide adapted from David Silver
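
As a sketch, the whole scheme in a few lines of Python; grad_fn (the per-sample gradient from backpropagation) and the sample format are assumptions:

    import numpy as np

    def sgd(w, samples, grad_fn, lr=0.01, epochs=10):
        rng = np.random.default_rng(0)
        for _ in range(epochs):
            for i in rng.permutation(len(samples)):
                g = grad_fn(w, samples[i])   # sampled gradient, E[g] = dL/dw
                w = w - lr * g               # step down the sampled gradient
        return w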

Aka ... Backpropagation

[Figure: the multi-layer perceptron again, with weights vij and wjk]

• Minimize error of the calculated output
• Adjust weights by gradient descent
• Procedure:
  – Forward phase
  – Backpropagation of errors
• For each sample, over multiple epochs

SLIDE 17

Weight Sharing

A recurrent neural network shares weights between time-steps:

[Figure: unrolled RNN; the same weights w map xt → ht → yt at every step t]

A convolutional neural network shares weights between local regions:

[Figure: convolutional layer; the same filter weights w1, w2 apply to local regions of x to produce h1, h2]

Slide adapted from David Silver

Recap: Approx Q-Learning

• Optimal Q-values should obey the Bellman equation:

  Q*(s, a) = E_s′ [ r + γ max_a′ Q*(s′, a′) | s, a ]

• Treat the right-hand side, r + γ max_a′ Q(s′, a′, w), as a target
• Minimise the MSE loss by stochastic gradient descent:

  l = ( r + γ max_a′ Q(s′, a′, w) − Q(s, a, w) )²

• Converges to Q* using a table-lookup representation
• But diverges using neural networks, due to:
  – Correlations between samples
  – Non-stationary targets

Slide adapted from David Silver

SLIDE 18

Deep Q-Networks (DQN): Experience Replay

To remove correlations, build a data-set from the agent's own experience (a buffer sketch follows below):

  s1, a1, r2, s2
  s2, a2, r3, s3
  s3, a3, r4, s4        →  s, a, r, s′
  ...
  st, at, rt+1, st+1    →  st, at, rt+1, st+1

Sample experiences from the data-set and apply the update

  l = ( r + γ max_a′ Q(s′, a′, w⁻) − Q(s, a, w) )²

To deal with non-stationarity, the target parameters w⁻ are held fixed.

Slide adapted from David Silver
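
A minimal Python sketch of the replay buffer this slide describes; the capacity, batch size, and q_target callable (the frozen target network) are illustrative assumptions:

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def add(self, s, a, r, s2, done):
            self.buffer.append((s, a, r, s2, done))

        def sample(self, batch_size=32):
            # Uniform sampling breaks correlations between consecutive steps.
            return random.sample(self.buffer, batch_size)

    def td_targets(batch, q_target, gamma=0.99):
        # Targets r + gamma * max_a' Q(s', a', w-) with frozen parameters w-.
        return [r if done else r + gamma * max(q_target(s2))
                for (s, a, r, s2, done) in batch]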

DQN in Atari

• End-to-end learning of values Q(s, a) from pixels s
• Input state s is a stack of raw pixels from the last 4 frames
• Output is Q(s, a) for 18 joystick/button positions
• Reward is the change in score for that step

Network architecture and hyperparameters fixed across all games

Slide adapted from David Silver

SLIDE 19

DeepMind Resources

See also: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

That’s all for Reinforcement Learning!

• Very tough problem: how to perform any task well in an unknown, noisy environment!
• Traditionally used mostly for robotics, but…

[Figure: the reinforcement learning loop: Agent, Data (experiences with environment), Policy (how to act in the future)]

Google DeepMind: RL applied to data center power usage

SLIDE 20

That’s all for Reinforcement Learning!

Lots of open research areas:

– How to best balance exploration and exploitation?
– How to deal with cases where we don't know a good state/feature representation?

[Figure: the reinforcement learning loop: Agent, Data (experiences with environment), Policy (how to act in the future)]

Conclusion

• We're done with Part I: Search and Planning!
• We've seen how AI methods can solve problems in:
  – Search
  – Constraint Satisfaction Problems
  – Games
  – Markov Decision Problems
  – Reinforcement Learning
• Next up: Part II: Uncertainty and Learning!