
Approximate Q-Learning - PDF document



  1. Approximate Q-Learning (11/9/16). Dan Weld / University of Washington. [Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley; materials available at http://ai.berkeley.edu.] Q-Learning: for all s, a, initialize Q(s, a) = 0. Repeat forever: observe your current state s; choose some action a; execute it in the real world, observing (s, a, r, s'); do the update Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ], or equivalently Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a' Q(s',a') ].
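
To make the loop concrete, here is a minimal tabular Q-learning sketch; the environment interface (reset, step, actions) is an assumption for illustration, not code from the lecture:

    import random
    from collections import defaultdict

    def q_learning(env, episodes=1000, alpha=0.5, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                       # for all s, a: Q(s, a) = 0
        for _ in range(episodes):
            s = env.reset()                          # where are you? s
            done = False
            while not done:
                # Choose some action a (epsilon-greedy here; the slide just says "some action")
                if random.random() < epsilon:
                    a = random.choice(env.actions(s))
                else:
                    a = max(env.actions(s), key=lambda act: Q[(s, act)])
                # Execute it in the real world: observe (s, a, r, s')
                s_next, r, done = env.step(s, a)
                # Do update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
                best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions(s_next))
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q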

  2. Q-Learning (recap): for all s, a, initialize Q(s, a) = 0. Repeat forever: observe your current state s; choose some action a; execute it in the real world, observing (s, a, r, s'); do the update as on the previous slide. Example: Pacman. Let's say we discover through experience that this state is bad: [image of Pacman trapped next to two ghosts]. Or even this one! [image of a nearly identical state].

  3. [Video] Q-learning, no features, 50 learning trials. [Video] Q-learning, no features, 1000 learning trials.

  4. Feature-Based Representations. Solution: describe states with a vector of features (aka "properties"). Features are functions from states to ℝ (often 0/1) that capture important properties of the state. Examples: distance to closest ghost or dot; number of ghosts; 1 / (distance to dot)²; is Pacman in a tunnel? (0/1); is the state the exact state on this slide?; etc. We can also describe a q-state (s, a) with features (e.g., does the action move Pacman closer to food?). How do we use features? Using features we can represent V and/or Q as follows: V(s) = g(f_1(s), f_2(s), …, f_n(s)); Q(s,a) = g(f_1(s,a), f_2(s,a), …, f_n(s,a)). What should we use for g (and for f)?
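
To make the idea concrete, here is a small sketch of a feature extractor in the spirit of this slide; the grid representation, the use of Manhattan distance, and the specific feature names are illustrative assumptions, not the course's code:

    MOVES = {"NORTH": (0, 1), "SOUTH": (0, -1), "EAST": (1, 0), "WEST": (-1, 0)}

    def manhattan(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])

    def features(pacman, food, ghosts, action):
        """Return a feature vector f(s, a) as a dict of name -> value."""
        dx, dy = MOVES[action]
        nxt = (pacman[0] + dx, pacman[1] + dy)            # position after taking action a
        dist_food = min(manhattan(nxt, f) for f in food)
        dist_ghost = min(manhattan(nxt, g) for g in ghosts)
        return {
            "bias": 1.0,
            "inv-dist-to-food": 1.0 / (dist_food + 1.0),  # closer food -> larger value
            "ghost-one-step-away": 1.0 if dist_ghost <= 1 else 0.0,
            "num-ghosts": float(len(ghosts)),
        }

    # Example: f(s, NORTH) for a tiny layout
    print(features((1, 1), food=[(3, 1)], ghosts=[(1, 3)], action="NORTH"))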

  5. Linear Combination. Using a feature representation, we can write a q-function (or value function) for any state using a few weights: Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + … + w_n f_n(s,a). Advantage: our experience is summed up in a few powerful numbers. Disadvantage: states sharing features may actually have very different values! Approximate Q-Learning: Q-learning with linear Q-functions. Exact Q's: Q(s,a) ← Q(s,a) + α [difference], where difference = [ r + γ max_a' Q(s',a') ] − Q(s,a). Approximate Q's: w_i ← w_i + α [difference] f_i(s,a). Intuitive interpretation: adjust the weights of the active features; e.g., if something unexpectedly bad happens, blame the features that were on, and disprefer all states with that state's features. Formal justification: in a few slides!
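
A minimal sketch of the weight update described above, assuming features and weights are dicts keyed by feature name (as in the sketch after slide 4); the function names are illustrative, not the course API:

    def q_value(weights, feats):
        """Linear Q: Q(s,a) = sum_i w_i * f_i(s,a)."""
        return sum(weights.get(k, 0.0) * v for k, v in feats.items())

    def approx_q_update(weights, feats, reward, next_q_values, alpha=0.004, gamma=0.9):
        """One approximate Q-learning update: w_i += alpha * difference * f_i(s,a).

        next_q_values: list of Q(s',a') over legal a' (empty if s' is terminal)."""
        target = reward + gamma * (max(next_q_values) if next_q_values else 0.0)
        difference = target - q_value(weights, feats)
        for k, v in feats.items():
            weights[k] = weights.get(k, 0.0) + alpha * difference * v
        return weights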

  6. Example: Pacman Features. Q(s,a) = w_1 f_DOT(s,a) + w_2 f_GST(s,a), where f_DOT(s,a) = distance to closest food after taking a (e.g. f_DOT(s, NORTH) = 0.5) and f_GST(s,a) = distance to closest ghost after taking a (e.g. f_GST(s, NORTH) = 1.0). Example: Q-Pacman, with α = 0.004. [Demo: approximate Q-learning Pacman]
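
For concreteness, here is a worked update in the spirit of the original CS188 Q-Pacman example; the starting weights (4.0 and −1.0) and the reward of −500 for being eaten are assumptions, since those numbers are not legible in the extracted slide:

    Suppose Q(s,a) = 4.0 f_DOT(s,a) − 1.0 f_GST(s,a).
    For a = NORTH: Q(s, NORTH) = 4.0(0.5) − 1.0(1.0) = +1.0.
    Pacman moves north and is eaten: r = −500, and the episode ends, so max_a' Q(s',a') = 0.
    difference = [ r + γ max_a' Q(s',a') ] − Q(s, NORTH) = −500 − 1 = −501.
    w_1 ← 4.0 + α(−501)(0.5) = 4.0 − 1.002 ≈ 3.0
    w_2 ← −1.0 + α(−501)(1.0) = −1.0 − 2.004 ≈ −3.0
    New estimate: Q(s,a) ≈ 3.0 f_DOT(s,a) − 3.0 f_GST(s,a).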

  7. [Video of demo: Approximate Q-Learning in Pacman.] Sidebar: Q-Learning and Least Squares.

  8. Linear Approximation: Regression. [Plots: a one-feature linear fit and a two-feature planar fit; axis ticks omitted.] Prediction (one feature): ŷ = w_0 + w_1 f_1(x). Prediction (two features): ŷ = w_0 + w_1 f_1(x) + w_2 f_2(x). Optimization: Least Squares. [Plot: a fitted line with an observation y, a prediction ŷ, and the error or "residual" between them.]
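
The least-squares objective the slide names can be written out as follows (a standard restatement; the formula itself did not survive extraction):

    \[
    \text{total error} \;=\; \sum_i \big(y_i - \hat{y}_i\big)^2
    \;=\; \sum_i \Big( y_i - \sum_k w_k\, f_k(x_i) \Big)^2
    \]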

  9. Minimizing Error. Imagine we had only one point x, with features f(x), target value y, and weights w. [The slide derives the least-squares gradient step and maps it onto the approximate Q update, with r + γ max_a' Q(s',a') playing the role of the "target" and Q(s,a) the role of the "prediction".] Overfitting: Why Limiting Capacity Can Help. [Plot: a degree-15 polynomial fit oscillating wildly between the data points.]
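
A reconstruction of the derivation the slide sketches, using standard least-squares reasoning (the equations themselves did not survive extraction):

    \[
    \text{error}(w) = \tfrac{1}{2}\Big(y - \sum_k w_k f_k(x)\Big)^2,
    \qquad
    \frac{\partial\,\text{error}(w)}{\partial w_m} = -\Big(y - \sum_k w_k f_k(x)\Big) f_m(x)
    \]
    \[
    w_m \leftarrow w_m + \alpha \Big(y - \sum_k w_k f_k(x)\Big) f_m(x)
    \quad\Longrightarrow\quad
    w_m \leftarrow w_m + \alpha \Big[\underbrace{r + \gamma \max_{a'} Q(s',a')}_{\text{"target"}}
    - \underbrace{Q(s,a)}_{\text{"prediction"}}\Big] f_m(s,a)
    \]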

  10. Simple Problem. Given: features of the current state. Predict: will Pacman die on the next step? Just one feature; see a pattern? • Ghost one step away, Pacman dies • Ghost one step away, Pacman dies • Ghost one step away, Pacman dies • Ghost one step away, Pacman dies • Ghost one step away, Pacman lives • Ghost more than one step away, Pacman lives • Ghost more than one step away, Pacman lives • Ghost more than one step away, Pacman lives • Ghost more than one step away, Pacman lives • Ghost more than one step away, Pacman lives • Ghost more than one step away, Pacman lives. Learn: ghost one step away → Pacman dies!

  11. What if we add more features? • Ghost one step away, score 211, Pacman dies • Ghost one step away, score 341, Pacman dies • Ghost one step away, score 231, Pacman dies • Ghost one step away, score 121, Pacman dies • Ghost one step away, score 301, Pacman lives • Ghost more than one step away, score 205, Pacman lives • Ghost more than one step away, score 441, Pacman lives • Ghost more than one step away, score 219, Pacman lives • Ghost more than one step away, score 199, Pacman lives • Ghost more than one step away, score 331, Pacman lives • Ghost more than one step away, score 251, Pacman lives. Learn: ghost one step away AND score is NOT a prime number → Pacman dies! There's fitting, and there's … [Plot: data fit with a degree-1 polynomial.]

  12. There's fitting, and there's … [Plot: data fit with a degree-2 polynomial.] Overfitting. [Plot: a degree-15 polynomial fit oscillating wildly between the data points.]

  13. Approximating the Q Function. • Linear approximation. • Could also use a deep neural network – https://www.nervanasys.com/demystifying-deep-reinforcement-learning/ [Diagram: a network taking the state and action and outputting Q(s,a).] DeepMind Atari: https://www.youtube.com/watch?v=V1eYniJ0Rnk

  14. DQN Results on Atari. [Bar chart of DQN scores across Atari games; slide adapted from David Silver.] Approximating the Q Function. Linear approximation: the inputs f_1(s,a), f_2(s,a), …, f_m(s,a) feed directly into the output Q. Neural approximation (nonlinear): the same inputs pass through hidden units with a sigmoid activation h(z) = 1 / (1 + e^(−z)) before producing Q.

  15. Deep Representations. A deep representation is a composition of many functions: x → h_1 → … → h_n → y → l (the loss), with weights w_1, …, w_n at each stage. Its gradient can be backpropagated by the chain rule: ∂l/∂x = (∂h_1/∂x)(∂h_2/∂h_1)…(∂y/∂h_n)(∂l/∂y), and likewise for the weight gradients ∂l/∂w_1, …, ∂l/∂w_n. [Slide adapted from David Silver.] Multi-Layer Perceptron: multiple layers, feed-forward, connected weights, 1-of-N output. Inputs [X_1, X_2, X_3] feed a hidden layer j through weights v_ij, which feeds an output layer k (outputs [Y_1, Y_2]) through weights w_jk; each unit computes z = Σ_i (input_i × weight_i) and applies the sigmoid a = 1 / (1 + e^(−z)).
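
A minimal sketch of the feed-forward pass described above, with made-up weights; it follows the slide's notation (inputs x_i, hidden weights v_ij, output weights w_jk) but is not the lecture's code:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def forward(x, v, w):
        # Hidden layer: z_j = sum_i x_i * v_ij, then a_j = sigmoid(z_j)
        hidden = [sigmoid(sum(x[i] * v[i][j] for i in range(len(x))))
                  for j in range(len(v[0]))]
        # Output layer: z_k = sum_j a_j * w_jk, then y_k = sigmoid(z_k)
        return [sigmoid(sum(hidden[j] * w[j][k] for j in range(len(hidden))))
                for k in range(len(w[0]))]

    # Example with made-up weights: 3 inputs, 2 hidden units, 2 outputs.
    x = [1.0, 0.5, -0.5]
    v = [[0.1, -0.2], [0.4, 0.3], [-0.6, 0.2]]   # v[i][j]
    w = [[0.7, -0.1], [0.05, 0.9]]               # w[j][k]
    print(forward(x, v, w))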

  16. Training via Stochastic Gradient Descent. Sample the gradient of the expected loss L(w) = E[l]: the per-sample gradient ∂l/∂w satisfies E[∂l/∂w] = ∂L(w)/∂w. Adjust w down the sampled gradient: Δw ∝ −∂l/∂w. [Slide adapted from David Silver.] Aka Backpropagation: minimize the error of the calculated output and adjust the weights by gradient descent. Procedure: a forward phase, then backpropagation of errors through the weights (w_jk, then v_ij); for each sample, over multiple epochs.

  17. Weight Sharing. A recurrent neural network shares weights between time-steps (x_t → h_t → y_t, with the same w applied at t and t+1); a convolutional neural network shares weights between local regions (the same w_1, w_2 applied across the input x). [Slide adapted from David Silver.] Recap: Approximate Q-Learning. Optimal Q-values should obey the Bellman equation: Q*(s,a) = E_{s'}[ r + γ max_{a'} Q*(s',a') | s, a ]. Treat the right-hand side r + γ max_{a'} Q(s',a',w) as a target, and minimise the MSE loss by stochastic gradient descent: l = ( r + γ max_{a'} Q(s',a',w) − Q(s,a,w) )². This converges to Q* using a table-lookup representation, but diverges using neural networks due to: correlations between samples, and non-stationary targets. [Slide adapted from David Silver.]

  18. Deep Q-Networks (DQN): Experience Replay. To remove correlations, build a data-set from the agent's own experience: store transitions (s_1, a_1, r_2, s_2), (s_2, a_2, r_3, s_3), (s_3, a_3, r_4, s_4), …, (s_t, a_t, r_{t+1}, s_{t+1}). Sample experiences (s, a, r, s') from the data-set and apply the update with loss l = ( r + γ max_{a'} Q(s',a',w⁻) − Q(s,a,w) )². To deal with non-stationarity, the target parameters w⁻ are held fixed. [Slide adapted from David Silver.] DQN in Atari: end-to-end learning of values Q(s,a) from pixels s; the input state s is a stack of raw pixels from the last 4 frames; the output is Q(s,a) for 18 joystick/button positions; the reward is the change in score for that step. Network architecture and hyperparameters are fixed across all games. [Slide adapted from David Silver.]
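
A minimal sketch of the experience-replay and fixed-target ideas above; the q_net / target_net interface (predict, train_step, get/set_weights) is a placeholder assumption, not an actual library API:

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def add(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size):
            # Uniform sampling breaks the correlation between consecutive transitions.
            return random.sample(self.buffer, batch_size)

    def dqn_update(q_net, target_net, buffer, batch_size=32, gamma=0.99):
        # Regress Q(s,a,w) toward the fixed target r + gamma * max_a' Q(s',a',w-),
        # computed with the frozen target parameters w-.
        for s, a, r, s_next, done in buffer.sample(batch_size):
            target = r if done else r + gamma * max(target_net.predict(s_next))
            q_net.train_step(s, a, target)

    # Every C updates, refresh the frozen parameters:
    #     target_net.set_weights(q_net.get_weights())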

  19. DeepMind Resources. See also: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf. That's all for Reinforcement Learning! [Diagram: Data (experiences with environment) → Reinforcement Learning Agent → Policy (how to act in the future).] • Very tough problem: how to perform any task well in an unknown, noisy environment! • Traditionally used mostly for robotics, but… Google DeepMind – RL applied to data center power usage.

  20. That's all for Reinforcement Learning! [Diagram: Data (experiences with environment) → Reinforcement Learning Agent → Policy (how to act in the future).] Lots of open research areas: – How to best balance exploration and exploitation? – How to deal with cases where we don't know a good state/feature representation? Conclusion: • We're done with Part I: Search and Planning! • We've seen how AI methods can solve problems in: Search, Constraint Satisfaction Problems, Games, Markov Decision Problems, Reinforcement Learning. • Next up: Part II: Uncertainty and Learning!

