Reinforcement Learning Part 2
Yingyu Liang
yliang@cs.wisc.edu
Computer Sciences Department, University of Wisconsin, Madison
[Based on slides from David Page, Mark Craven]
Goals for the lecture — you should understand the following concepts:
• value functions and value iteration (review)
• Q functions and Q learning
• exploration vs. exploitation tradeoff
• compact representations of Q functions
Value function for a policy
• given a policy π : S → A, define

    V^π(s) = E[ Σ_{t=0}^∞ γ^t r_t ]

  assuming the action sequence is chosen according to π, starting at state s
• we want the optimal policy π*, where

    π* = argmax_π V^π(s)  for all s

• we'll denote the value function for this optimal policy as V*(s)
Value iteration for learning V*(s)

    initialize V(s) arbitrarily
    loop until policy good enough {
        loop for s ∈ S {
            loop for a ∈ A {
                Q(s, a) ← r(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V(s')
            }
            V(s) ← max_a Q(s, a)
        }
    }
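The loop above can be sketched in a few lines of Python. The tiny three-state MDP below (states, rewards, and transition probabilities) is a made-up illustration, not from the slides; only the update rule follows the pseudocode.

```python
# Value iteration sketch on a hypothetical 3-state MDP with γ = 0.9.
gamma = 0.9

S = ["s0", "s1", "goal"]
A = ["stay", "go"]
# immediate rewards r(s, a)
r = {("s0", "stay"): 0, ("s0", "go"): 0,
     ("s1", "stay"): 0, ("s1", "go"): 100,
     ("goal", "stay"): 0, ("goal", "go"): 0}
# transition model: P[(s, a)] maps next state s' -> probability P(s' | s, a)
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"goal": 1.0},
     ("goal", "stay"): {"goal": 1.0}, ("goal", "go"): {"goal": 1.0}}

V = {s: 0.0 for s in S}   # initialize V(s) arbitrarily
for _ in range(100):      # "loop until policy good enough"
    for s in S:
        Q = {a: r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
             for a in A}
        V[s] = max(Q.values())
```

For this toy model the values converge to V(s1) = 100 (one step from the reward) and V(s0) = 0.9 × 100 = 90.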
Q functions
define a new function, closely related to V*:

    Q(s, a) ≡ E[ r(s, a) ] + γ E_{s'}[ V*(s') | s, a ]

if the agent knows Q(s, a), it can choose the optimal action without knowing P(s' | s, a):

    π*(s) = argmax_a Q(s, a)        V*(s) = max_a Q(s, a)

and it can learn Q(s, a) without knowing P(s' | s, a)
[Grid-world figure: three copies of a small grid with goal state G, showing the immediate rewards r(s, a) (100 on transitions into G, 0 elsewhere), the V*(s) values (100, 90, 81, …), and the Q(s, a) values (100, 90, 81, 72, …) for γ = 0.9]
Q learning for deterministic worlds

    for each s, a initialize table entry Q̂(s, a) ← 0
    observe current state s
    do forever {
        select an action a and execute it
        receive immediate reward r
        observe the new state s'
        update table entry:  Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
        s ← s'
    }
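The algorithm above can be sketched as follows. The three-state chain environment (`step`, the reward of 100 for reaching the goal, the episode reset) is a made-up illustration; the table update is exactly the rule from the pseudocode.

```python
# Tabular Q-learning sketch for a deterministic 3-state chain:
# 0 -> 1 -> 2, where state 2 is the goal (reward 100 on entry).
import random

random.seed(0)            # for reproducibility of this sketch
gamma = 0.9
states = [0, 1, 2]
actions = ["left", "right"]

def step(s, a):
    """Deterministic transition: returns (next state, immediate reward)."""
    s2 = min(s + 1, 2) if a == "right" else max(s - 1, 0)
    return s2, (100 if s2 == 2 and s != 2 else 0)

Q = {(s, a): 0.0 for s in states for a in actions}  # initialize Q̂(s, a) to 0

s = 0
for _ in range(2000):                  # "do forever" (truncated here)
    a = random.choice(actions)         # purely exploratory action selection
    s2, r = step(s, a)
    # Q̂(s, a) <- r + γ max_a' Q̂(s', a')
    Q[(s, a)] = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    s = s2 if s2 != 2 else 0           # restart the episode at the goal
```

After enough random steps the table settles to the discounted values Q̂(1, right) = 100, Q̂(0, right) = 90, and Q̂(1, left) = Q̂(0, left) = 81.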
Updating Q̂
suppose the agent moves right from state s1 to s2 with immediate reward r = 0, and the successor state's entries are Q̂(s2, a') ∈ {63, 81, 100}:

    Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                    ← 0 + 0.9 × max{63, 81, 100}
                    = 90
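The arithmetic of this single update can be checked directly (the reward and successor Q̂ values are the ones from the worked example):

```python
# One Q-learning update: r = 0, γ = 0.9, successor Q̂ values {63, 81, 100}.
r, gamma = 0, 0.9
q_next = [63, 81, 100]
q_new = r + gamma * max(q_next)   # 0 + 0.9 * 100 = 90
```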
Q's vs. V's
which action do we choose when we're in a given state?
• V's (model-based)
  – need a 'next state' function to generate all possible next states
  – choose the action leading to the next state with the highest V value
• Q's (model-free)
  – need only know which actions are legal
  – generally choose the action with the highest Q value
Exploration vs. exploitation
• in order to learn about better alternatives, we shouldn't always follow the current policy (exploitation)
• sometimes, we should select random actions (exploration)
• one way to do this: select actions probabilistically according to

    P(a_i | s) = c^{Q̂(s, a_i)} / Σ_j c^{Q̂(s, a_j)}

  where c > 0 is a constant that determines how strongly selection favors actions with higher Q̂ values
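This selection rule can be sketched as below. The Q̂ values and the choice c = 2 are made-up illustrations; note that for large Q̂ values, `c ** q` can overflow in practice.

```python
# Probabilistic action selection: P(a_i | s) proportional to c^Q̂(s, a_i).
import random

def action_probs(q_values, c):
    """Return {action: probability} under P(a_i|s) = c^Q(s,a_i) / sum_j c^Q(s,a_j)."""
    weights = [c ** q for q in q_values.values()]
    total = sum(weights)
    return {a: w / total for a, w in zip(q_values, weights)}

q = {"left": 1.0, "right": 2.0, "up": 0.5}    # hypothetical Q̂(s, a) values
probs = action_probs(q, c=2.0)                # larger c -> greedier selection
action = random.choices(list(probs), weights=list(probs.values()))[0]
```

With c = 2 the weights are 2, 4, and √2, so "right" is most likely but every action keeps nonzero probability, which is exactly what keeps exploration alive.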
Q learning with a table
as described so far, Q learning entails filling in a huge table: one row per action a_1, …, a_k, one column per state s_0, s_1, s_2, …, s_n, with entry Q(s_i, a_j) in each cell
a table is a very verbose way to represent a function
Representing Q functions more compactly
we can use some other function representation (e.g., a neural net) to compactly encode a substitute for the big table
• the net takes an encoding of the state s as input and outputs Q(s, a_1), Q(s, a_2), …, Q(s, a_k)
• each input unit encodes a property of the state (e.g., a sensor value)
• or we could have one net for each possible action
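A minimal sketch of this architecture, in pure Python: one input per state feature, one output per action. The sizes and random weights are placeholders; a real agent would train the weights (e.g., by backpropagation) rather than leave them random.

```python
# One-hidden-layer Q net: state features in, one Q̂(s, a) per action out.
import math
import random

random.seed(0)
n_features, n_hidden, n_actions = 4, 3, 2   # hypothetical sizes

W1 = [[random.uniform(-0.1, 0.1) for _ in range(n_features)]
      for _ in range(n_hidden)]             # input -> hidden weights
W2 = [[random.uniform(-0.1, 0.1) for _ in range(n_hidden)]
      for _ in range(n_actions)]            # hidden -> output weights

def q_net(state_features):
    """Forward pass: return a list of predicted Q(s, a), one per action."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, state_features)))
              for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

s = [0.2, 0.0, 1.0, 0.5]    # encoding of the state (e.g., sensor values)
q_values = q_net(s)          # one Q̂ estimate per action
```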
Why use a compact Q function?
1. the full Q table may not fit in memory for realistic problems
2. it can generalize across states, thereby speeding up convergence (i.e., one instance 'fills' many cells in the Q table)

Notes:
1. when generalizing across states, cannot use α = 1
2. convergence proofs only apply to Q tables
3. some work exists on bounding the errors caused by using compact representations (e.g., Singh & Yee, Machine Learning 1994)
Q tables vs. Q nets
given: 100 Boolean-valued features, 10 possible actions

size of Q table: 10 × 2^100 entries
size of Q net (assume 100 hidden units):
    100 × 100 weights between inputs and hidden units
  + 100 × 10  weights between hidden units and outputs
  = 11,000 weights
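The size comparison is easy to verify directly:

```python
# Table vs. net size for 100 Boolean features and 10 actions.
table_entries = 10 * 2 ** 100            # one entry per (state, action) pair
net_weights = 100 * 100 + 100 * 10       # input->hidden plus hidden->output
# table_entries is about 1.3 * 10**31, versus 11,000 net weights
```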
Representing Q functions more compactly
• we can use other regression methods to represent Q functions:
  – k-NN
  – regression trees
  – support vector regression
  – etc.
Q learning with function approximation
1. measure sensors, sense state s_0
2. predict Q̂_n(s_0, a) for each action a
3. select action a to take (with randomization to ensure exploration)
4. apply action a in the real world
5. sense new state s_1 and immediate reward r
6. calculate the action a' that maximizes Q̂_n(s_1, a')
7. train with the new instance

    x = s_0
    y = (1 − α) Q̂(s_0, a) + α [ r + γ max_{a'} Q̂(s_1, a') ]

i.e., calculate the Q value you would have put into the Q table, and use it as the training label
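Step 7 can be sketched as below; the values of α, γ, the reward, and the Q̂ estimates are made-up illustrations of computing the training label.

```python
# Computing the function-approximation training target (step 7).
alpha, gamma = 0.5, 0.9           # learning rate and discount (hypothetical)

q_old = 40.0                      # current estimate Q̂(s0, a)
r = 10.0                          # immediate reward observed
q_next = [20.0, 50.0, 30.0]       # Q̂(s1, a') for each action a'

# y = (1 - α) Q̂(s0, a) + α [ r + γ max_a' Q̂(s1, a') ]
y = (1 - alpha) * q_old + alpha * (r + gamma * max(q_next))
# the training example is then (x = encoding of s0, target = y)
```

Here y = 0.5 × 40 + 0.5 × (10 + 0.9 × 50) = 47.5; note that with α < 1 the target only moves the estimate partway toward the bootstrapped value, matching the earlier note that α = 1 cannot be used when generalizing across states.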