Reinforcement Learning Part 2
Yingyu Liang
yliang@cs.wisc.edu
Computer Sciences Department, University of Wisconsin, Madison
[Based on slides from David Page, Mark Craven]
Goals for the lecture — you should understand the following concepts:
• value functions and value iteration (review)
• Q functions and Q learning
• exploration vs. exploitation tradeoff
• compact representations of Q functions
Value function for a policy
• given a policy π : S → A, define

    V^π(s) = E[ Σ_{t=0}^∞ γ^t r_t ]

  assuming the action sequence is chosen according to π, starting at state s
• we want the optimal policy π*, where

    π* = argmax_π V^π(s)  for all s

• we'll denote the value function for this optimal policy as V*(s)
Value iteration for learning V*(s)

    initialize V(s) arbitrarily
    loop until policy good enough {
        loop for s ∈ S {
            loop for a ∈ A {
                Q(s, a) ← r(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V(s')
            }
            V(s) ← max_a Q(s, a)
        }
    }
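The loop above can be sketched in a few lines of Python. The tiny three-state MDP below (states, rewards, and transition probabilities) is a made-up illustration, not from the slides; only the update rule follows the pseudocode.

```python
# Value iteration sketch on a hypothetical 3-state MDP with γ = 0.9.
gamma = 0.9

S = ["s0", "s1", "goal"]
A = ["stay", "go"]
# immediate rewards r(s, a)
r = {("s0", "stay"): 0, ("s0", "go"): 0,
     ("s1", "stay"): 0, ("s1", "go"): 100,
     ("goal", "stay"): 0, ("goal", "go"): 0}
# transition model: P[(s, a)] maps next state s' -> probability P(s' | s, a)
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"goal": 1.0},
     ("goal", "stay"): {"goal": 1.0}, ("goal", "go"): {"goal": 1.0}}

V = {s: 0.0 for s in S}   # initialize V(s) arbitrarily
for _ in range(100):      # "loop until policy good enough"
    for s in S:
        Q = {a: r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
             for a in A}
        V[s] = max(Q.values())
```

For this toy model the values converge to V(s1) = 100 (one step from the reward) and V(s0) = 0.9 × 100 = 90.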
Q functions
define a new function, closely related to V*:

    Q(s, a) ≡ E[ r(s, a) ] + γ E_{s'}[ V*(s') | s, a ]

if the agent knows Q(s, a), it can choose the optimal action without knowing P(s' | s, a):

    π*(s) = argmax_a Q(s, a)        V*(s) = max_a Q(s, a)

and it can learn Q(s, a) without knowing P(s' | s, a)
[Grid-world figure: three copies of a small grid with goal state G, showing the immediate rewards r(s, a) (100 on transitions into G, 0 elsewhere), the V*(s) values (100, 90, 81, …), and the Q(s, a) values (100, 90, 81, 72, …) for γ = 0.9]
Q learning for deterministic worlds

    for each s, a initialize table entry Q̂(s, a) ← 0
    observe current state s
    do forever {
        select an action a and execute it
        receive immediate reward r
        observe the new state s'
        update table entry:  Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
        s ← s'
    }
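The algorithm above can be sketched as follows. The three-state chain environment (`step`, the reward of 100 for reaching the goal, the episode reset) is a made-up illustration; the table update is exactly the rule from the pseudocode.

```python
# Tabular Q-learning sketch for a deterministic 3-state chain:
# 0 -> 1 -> 2, where state 2 is the goal (reward 100 on entry).
import random

random.seed(0)            # for reproducibility of this sketch
gamma = 0.9
states = [0, 1, 2]
actions = ["left", "right"]

def step(s, a):
    """Deterministic transition: returns (next state, immediate reward)."""
    s2 = min(s + 1, 2) if a == "right" else max(s - 1, 0)
    return s2, (100 if s2 == 2 and s != 2 else 0)

Q = {(s, a): 0.0 for s in states for a in actions}  # initialize Q̂(s, a) to 0

s = 0
for _ in range(2000):                  # "do forever" (truncated here)
    a = random.choice(actions)         # purely exploratory action selection
    s2, r = step(s, a)
    # Q̂(s, a) <- r + γ max_a' Q̂(s', a')
    Q[(s, a)] = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    s = s2 if s2 != 2 else 0           # restart the episode at the goal
```

After enough random steps the table settles to the discounted values Q̂(1, right) = 100, Q̂(0, right) = 90, and Q̂(1, left) = Q̂(0, left) = 81.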
Updating Q̂
suppose the agent moves right from state s1 to s2 with immediate reward r = 0, and the successor state's entries are Q̂(s2, a') ∈ {63, 81, 100}:

    Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                    ← 0 + 0.9 × max{63, 81, 100}
                    = 90
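The arithmetic of this single update can be checked directly (the reward and successor Q̂ values are the ones from the worked example):

```python
# One Q-learning update: r = 0, γ = 0.9, successor Q̂ values {63, 81, 100}.
r, gamma = 0, 0.9
q_next = [63, 81, 100]
q_new = r + gamma * max(q_next)   # 0 + 0.9 * 100 = 90
```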
Q's vs. V's
which action do we choose when we're in a given state?
• V's (model-based)
  – need a 'next state' function to generate all possible next states
  – choose the action leading to the next state with the highest V value
• Q's (model-free)
  – need only know which actions are legal
  – generally choose the action with the highest Q value
Exploration vs. exploitation
• in order to learn about better alternatives, we shouldn't always follow the current policy (exploitation)
• sometimes, we should select random actions (exploration)
• one way to do this: select actions probabilistically according to

    P(a_i | s) = c^{Q̂(s, a_i)} / Σ_j c^{Q̂(s, a_j)}

  where c > 0 is a constant that determines how strongly selection favors actions with higher Q̂ values
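This selection rule can be sketched as below. The Q̂ values and the choice c = 2 are made-up illustrations; note that for large Q̂ values, `c ** q` can overflow in practice.

```python
# Probabilistic action selection: P(a_i | s) proportional to c^Q̂(s, a_i).
import random

def action_probs(q_values, c):
    """Return {action: probability} under P(a_i|s) = c^Q(s,a_i) / sum_j c^Q(s,a_j)."""
    weights = [c ** q for q in q_values.values()]
    total = sum(weights)
    return {a: w / total for a, w in zip(q_values, weights)}

q = {"left": 1.0, "right": 2.0, "up": 0.5}    # hypothetical Q̂(s, a) values
probs = action_probs(q, c=2.0)                # larger c -> greedier selection
action = random.choices(list(probs), weights=list(probs.values()))[0]
```

With c = 2 the weights are 2, 4, and √2, so "right" is most likely but every action keeps nonzero probability, which is exactly what keeps exploration alive.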
Q learning with a table
as described so far, Q learning entails filling in a huge table: one row per action a_1, …, a_k, one column per state s_0, s_1, s_2, …, s_n, with entry Q(s_i, a_j) in each cell
a table is a very verbose way to represent a function
Representing Q functions more compactly
we can use some other function representation (e.g., a neural net) to compactly encode a substitute for the big table
• the net takes an encoding of the state s as input and outputs Q(s, a_1), Q(s, a_2), …, Q(s, a_k)
• each input unit encodes a property of the state (e.g., a sensor value)
• or we could have one net for each possible action
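A minimal sketch of this architecture, in pure Python: one input per state feature, one output per action. The sizes and random weights are placeholders; a real agent would train the weights (e.g., by backpropagation) rather than leave them random.

```python
# One-hidden-layer Q net: state features in, one Q̂(s, a) per action out.
import math
import random

random.seed(0)
n_features, n_hidden, n_actions = 4, 3, 2   # hypothetical sizes

W1 = [[random.uniform(-0.1, 0.1) for _ in range(n_features)]
      for _ in range(n_hidden)]             # input -> hidden weights
W2 = [[random.uniform(-0.1, 0.1) for _ in range(n_hidden)]
      for _ in range(n_actions)]            # hidden -> output weights

def q_net(state_features):
    """Forward pass: return a list of predicted Q(s, a), one per action."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, state_features)))
              for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

s = [0.2, 0.0, 1.0, 0.5]    # encoding of the state (e.g., sensor values)
q_values = q_net(s)          # one Q̂ estimate per action
```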
Why use a compact Q function?
1. the full Q table may not fit in memory for realistic problems
2. it can generalize across states, thereby speeding up convergence (i.e., one instance 'fills' many cells in the Q table)

Notes:
1. when generalizing across states, cannot use α = 1
2. convergence proofs only apply to Q tables
3. some work exists on bounding the errors caused by using compact representations (e.g., Singh & Yee, Machine Learning 1994)
Q tables vs. Q nets
given: 100 Boolean-valued features, 10 possible actions

size of Q table: 10 × 2^100 entries
size of Q net (assume 100 hidden units):
    100 × 100 weights between inputs and hidden units
  + 100 × 10  weights between hidden units and outputs
  = 11,000 weights
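The size comparison is easy to verify directly:

```python
# Table vs. net size for 100 Boolean features and 10 actions.
table_entries = 10 * 2 ** 100            # one entry per (state, action) pair
net_weights = 100 * 100 + 100 * 10       # input->hidden plus hidden->output
# table_entries is about 1.3 * 10**31, versus 11,000 net weights
```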
Representing Q functions more compactly
• we can use other regression methods to represent Q functions:
  – k-NN
  – regression trees
  – support vector regression
  – etc.
Q learning with function approximation
1. measure sensors, sense state s_0
2. predict Q̂_n(s_0, a) for each action a
3. select action a to take (with randomization to ensure exploration)
4. apply action a in the real world
5. sense new state s_1 and immediate reward r
6. calculate the action a' that maximizes Q̂_n(s_1, a')
7. train with the new instance

    x = s_0
    y = (1 − α) Q̂(s_0, a) + α [ r + γ max_{a'} Q̂(s_1, a') ]

i.e., calculate the Q value you would have put into the Q table, and use it as the training label
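Step 7 can be sketched as below; the values of α, γ, the reward, and the Q̂ estimates are made-up illustrations of computing the training label.

```python
# Computing the function-approximation training target (step 7).
alpha, gamma = 0.5, 0.9           # learning rate and discount (hypothetical)

q_old = 40.0                      # current estimate Q̂(s0, a)
r = 10.0                          # immediate reward observed
q_next = [20.0, 50.0, 30.0]       # Q̂(s1, a') for each action a'

# y = (1 - α) Q̂(s0, a) + α [ r + γ max_a' Q̂(s1, a') ]
y = (1 - alpha) * q_old + alpha * (r + gamma * max(q_next))
# the training example is then (x = encoding of s0, target = y)
```

Here y = 0.5 × 40 + 0.5 × (10 + 0.9 × 50) = 47.5; note that with α < 1 the target only moves the estimate partway toward the bootstrapped value, matching the earlier note that α = 1 cannot be used when generalizing across states.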