  1. Reinforcement Learning Part 2
     Yingyu Liang (yliang@cs.wisc.edu)
     Computer Sciences Department, University of Wisconsin, Madison
     [Based on slides from David Page, Mark Craven]

  2. Goals for the lecture
     you should understand the following concepts:
     • value functions and value iteration (review)
     • Q functions and Q learning
     • exploration vs. exploitation tradeoff
     • compact representations of Q functions

  3. Value function for a policy
     • given a policy π : S → A, define
         V^π(s) ≡ E[ Σ_{t=0}^∞ γ^t r_t ]
       assuming the action sequence is chosen according to π, starting at state s
     • we want the optimal policy π* where
         π* = argmax_π V^π(s)   for all s
     • we'll denote the value function for this optimal policy as V*(s)
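     As a concrete illustration (not from the slides), here is a minimal Python sketch of the discounted return Σ_t γ^t r_t whose expectation defines V^π(s); the reward sequence and γ = 0.9 are assumptions chosen to match the grid-world example used later in the lecture.

         def discounted_return(rewards, gamma=0.9):
             """Discounted sum of rewards collected along one trajectory: sum_t gamma^t * r_t."""
             return sum(gamma ** t * r for t, r in enumerate(rewards))

         # two zero-reward steps followed by reward 100 on entering the goal:
         print(discounted_return([0, 0, 100]))  # 0 + 0 + 0.9**2 * 100 ≈ 81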

  4. Value iteration for learning V*(s)
     initialize V(s) arbitrarily
     loop until policy good enough {
       loop for s ∈ S {
         loop for a ∈ A {
           Q(s, a) ← r(s, a) + γ Σ_{s'∈S} P(s' | s, a) V(s')
         }
         V(s) ← max_a Q(s, a)
       }
     }
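     A minimal Python sketch of this loop (not from the slides), assuming the model is given as NumPy arrays P[s, a, s'] for transition probabilities and r[s, a] for immediate rewards; a fixed iteration count stands in for the "until policy good enough" test.

         import numpy as np

         def value_iteration(P, r, gamma=0.9, n_iters=100):
             """P: (S, A, S) transition probabilities, r: (S, A) immediate rewards."""
             n_states = P.shape[0]
             V = np.zeros(n_states)
             for _ in range(n_iters):
                 Q = r + gamma * P @ V   # Q(s, a) = r(s, a) + γ Σ_s' P(s'|s, a) V(s')
                 V = Q.max(axis=1)       # V(s) ← max_a Q(s, a)
             Q = r + gamma * P @ V       # final Q values consistent with the returned V
             return V, Q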

  5. Q functions
     define a new function, closely related to V*:
         Q(s, a) ≡ E[ r(s, a) ] + γ E[ V*(s') | s, a ]
     if the agent knows Q(s, a), it can choose the optimal action without knowing P(s' | s, a):
         π*(s) = argmax_a Q(s, a)        V*(s) = max_a Q(s, a)
     and it can learn Q(s, a) without knowing P(s' | s, a)
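     As a small illustration (an assumption of this write-up, not part of the slides), if Q̂ is stored as a dictionary keyed by (state, action), the two identities above become one-liners:

         def greedy_action(Q, s, actions):
             """pi*(s) = argmax_a Q(s, a)."""
             return max(actions, key=lambda a: Q[(s, a)])

         def state_value(Q, s, actions):
             """V*(s) = max_a Q(s, a)."""
             return max(Q[(s, a)] for a in actions)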

  6. Q values
     [Figure: the grid world with goal state G shown three ways: r(s, a) (immediate reward) values, V*(s) values, and Q(s, a) values.]

  7. Q learning for deterministic worlds
     for each s, a initialize table entry Q̂(s, a) ← 0
     observe current state s
     do forever:
       select an action a and execute it
       receive immediate reward r
       observe the new state s'
       update table entry: Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
       s ← s'
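     A minimal Python sketch of this loop (not from the slides); the env object with reset() and step(a) returning (next state, reward), and the purely random action selection, are assumptions for illustration — better action-selection strategies are discussed a few slides below.

         import random
         from collections import defaultdict

         def q_learning_deterministic(env, actions, gamma=0.9, n_steps=10000):
             Q = defaultdict(float)              # Q̂(s, a), initialized to 0
             s = env.reset()                     # observe current state s
             for _ in range(n_steps):            # "do forever", truncated for the sketch
                 a = random.choice(actions)      # select an action a and execute it
                 s_next, r = env.step(a)         # receive reward r, observe new state s'
                 Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
                 s = s_next
             return Q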

  8. Updating Q
     [Figure: two grid snapshots, before and after the agent moves right from state s₁ to s₂; the Q̂ values of the actions available from s₂ are 63, 81, and 100.]
         Q̂(s₁, a_right) ← r + γ max_{a'} Q̂(s₂, a')
                         = 0 + 0.9 · max{63, 81, 100}
                         = 90

  9. Q's vs. V's
     • Which action do we choose when we're in a given state?
     • V's (model-based)
       – need to have a 'next state' function to generate all possible states
       – choose next state with highest V value
     • Q's (model-free)
       – need only know which actions are legal
       – generally choose next state with highest Q value

  10. Exploration vs. Exploitation
     • in order to learn about better alternatives, we shouldn't always follow the current policy (exploitation)
     • sometimes, we should select random actions (exploration)
     • one way to do this: select actions probabilistically according to
         P(aᵢ | s) = c^Q̂(s, aᵢ) / Σⱼ c^Q̂(s, aⱼ)
       where c > 0 is a constant that determines how strongly selection favors actions with higher Q values
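     A small sketch of this action-selection rule (not from the slides), again assuming Q̂ is stored as a dictionary keyed by (state, action):

         import random

         def soft_select_action(Q, s, actions, c=2.0):
             """Pick a_i with probability proportional to c**Q̂(s, a_i); larger c is greedier.
             For large Q values the weights should be computed in log space to avoid overflow."""
             weights = [c ** Q[(s, a)] for a in actions]
             return random.choices(actions, weights=weights, k=1)[0]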

  11. Q learning with a table
     As described so far, Q learning entails filling in a huge table: one column per state s₀, s₁, s₂, …, s_n, one row per action a₁, …, aₖ, and one entry per state–action pair (e.g. Q(s₂, a₃)).
     A table is a very verbose way to represent a function.

  12. Representing Q functions more compactly
     We can use some other function representation (e.g. a neural net) to compactly encode a substitute for the big table:
     • the input is an encoding of the state s; each input unit encodes a property of the state (e.g., a sensor value)
     • the outputs are Q(s, a₁), Q(s, a₂), …, Q(s, aₖ)
     • or we could have one net for each possible action
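     A minimal sketch of such a network in Python/NumPy (the single hidden layer, tanh activation, and random initialization are illustrative choices, not from the slides):

         import numpy as np

         class QNet:
             """Maps an encoding of the state to one Q̂(s, a) output per action."""
             def __init__(self, n_inputs, n_hidden, n_actions, seed=0):
                 rng = np.random.default_rng(seed)
                 self.W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))   # inputs → hidden units
                 self.W2 = rng.normal(scale=0.1, size=(n_hidden, n_actions))  # hidden units → outputs

             def predict(self, state_features):
                 """Return the vector (Q̂(s, a_1), ..., Q̂(s, a_k))."""
                 return np.tanh(state_features @ self.W1) @ self.W2

         # e.g. QNet(n_inputs=100, n_hidden=100, n_actions=10) matches the sizes on the "Q tables vs. Q nets" slide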

  13. Why use a compact Q function?
     1. Full Q table may not fit in memory for realistic problems
     2. Can generalize across states, thereby speeding up convergence (i.e. one instance 'fills' many cells in the Q table)
     Notes:
     1. When generalizing across states, cannot use α = 1
     2. Convergence proofs only apply to Q tables
     3. Some work on bounding errors caused by using compact representations (e.g. Singh & Yee, Machine Learning 1994)

  14. Q tables vs. Q nets
     Given: 100 Boolean-valued features and 10 possible actions
     • Size of Q table: 10 × 2^100 entries
     • Size of Q net (assume 100 hidden units): 100 × 100 + 100 × 10 = 11,000 weights
       (weights between inputs and hidden units, plus weights between hidden units and outputs)
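     For comparison, the two sizes above as a quick computation (a trivial check, not from the slides):

         q_table_entries = 10 * 2**100           # one entry per (state, action) pair
         q_net_weights = 100 * 100 + 100 * 10    # inputs→hidden plus hidden→outputs
         print(f"{q_table_entries:.2e} entries vs. {q_net_weights} weights")  # ≈ 1.27e+31 vs. 11000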

  15. Representing Q functions more compactly
     • we can use other regression methods to represent Q functions:
       – k-NN
       – regression trees
       – support vector regression
       – etc.

  16. Q learning with function approximation
     1. measure sensors, sense state s₀
     2. predict Q̂_n(s₀, a) for each action a
     3. select action a to take (with randomization to ensure exploration)
     4. apply action a in the real world
     5. sense new state s₁ and immediate reward r
     6. calculate the action a' that maximizes Q̂_n(s₁, a')
     7. train with new instance
          x = s₀
          y = (1 − α) Q̂(s₀, a) + α [ r + γ max_{a'} Q̂(s₁, a') ]
        i.e. calculate the Q-value you would have put into the Q table, and use it as the training label
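     A minimal sketch of this loop with a linear function approximator (not from the slides; the env/encode interfaces, the ε-greedy exploration, and the separate step size lr for the regression update are assumptions of this sketch):

         import random
         import numpy as np

         def q_learning_fa(env, encode, actions, n_features,
                           gamma=0.9, alpha=0.5, lr=0.1, epsilon=0.1, n_steps=10000):
             """Q̂(s, a) = W[a] · encode(s); one weight vector per action."""
             W = np.zeros((len(actions), n_features))
             s = env.reset()                                  # 1. sense state s0
             for _ in range(n_steps):
                 x = encode(s)
                 q = W @ x                                    # 2. predict Q̂(s0, a) for each action
                 a = (random.randrange(len(actions)) if random.random() < epsilon
                      else int(np.argmax(q)))                 # 3. select action, with exploration
                 s_next, r = env.step(actions[a])             # 4-5. act, sense s1 and reward r
                 y = (1 - alpha) * q[a] + alpha * (r + gamma * np.max(W @ encode(s_next)))  # 6-7. label
                 W[a] += lr * (y - q[a]) * x                  # one SGD step toward the training label
                 s = s_next
             return W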
