

SLIDE 1

Reinforcement Learning III

Dec 03, 2008

SLIDE 2

Large State Spaces

• When a problem has a large state space, we can no longer represent the U or Q functions as explicit tables.
• Even if we had enough memory:
  • Never enough training data!
  • Learning takes too long.
• What to do??

SLIDE 3

Function Approximation

• Never enough training data!
  • Must generalize what is learned from one situation to other "similar" new situations.
• Idea:
  • Instead of using a large table to represent U or Q, use a parameterized function.
    • A small number of parameters (generally exponentially fewer parameters than the number of states).
  • Learn the parameters from experience.
  • When we update the parameters based on observations in one state, the U or Q estimate also changes for other, similar states.
    • This facilitates generalization of experience.

SLIDE 4

Example

• Consider a grid problem with no obstacles and deterministic actions U/D/L/R (49 states).
• Features for state s=(x,y): f1(s)=x, f2(s)=y (just 2 features).

[Figure: 7x7 grid; the goal state in the upper-right corner has reward 10]

SLIDE 5

Linear Function Approximation

• Define a set of state features f1(s), …, fn(s).
  • The features are used as our representation of states.
  • States with similar feature values will be treated similarly.
• A common approximation is to represent U(s) as a weighted sum of the features (i.e. a linear approximation):

    Û_θ(s) = θ1 f1(s) + θ2 f2(s) + … + θn fn(s)

SLIDE 6

Example

• Consider the grid problem with no obstacles and deterministic actions U/D/L/R (49 states).
• Features for state s=(x,y): f1(s)=x, f2(s)=y (just 2 features).
• U(s) = θ0 + θ1 x + θ2 y
• Is there a good linear approximation?
  • Yes: θ0 = 10, θ1 = -1, θ2 = -1 (note the upper right is the origin).
  • U(s) = 10 - x - y subtracts the Manhattan distance from the goal reward.
• Instead of storing a table of 49 entries, we now only need to store 3 parameters (see the sketch below).

[Figure: 7x7 grid; the goal state in the upper-right corner has reward 10]
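To make the example concrete, here is a minimal Python sketch (the function names and the 0..6 grid indexing from the upper-right origin are my assumptions, not from the slides):

    # Sketch of the 49-state grid example with a linear approximator.
    theta = [10.0, -1.0, -1.0]        # theta0, theta1, theta2

    def features(s):
        x, y = s                      # f1(s) = x, f2(s) = y
        return [1.0, x, y]            # the leading 1 pairs with theta0

    def U(s):
        # U(s) = theta0 + theta1*x + theta2*y
        return sum(t * f for t, f in zip(theta, features(s)))

    # The 3-parameter function reproduces the goal reward minus
    # Manhattan distance for all 49 states.
    for x in range(7):
        for y in range(7):
            assert U((x, y)) == 10 - (x + y)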

SLIDE 7

Function Approximation Accuracy

• The approximation accuracy is fundamentally limited by the information provided by the features.
• Can we always define features that allow for a perfect linear approximation?
  • Yes. Assign each state an indicator feature (i.e. the i'th feature is 1 iff the i'th state is present, and θi represents the value of the i'th state), as sketched below.
  • Of course, this requires far too many features and gives no generalization.
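As a quick illustration of the indicator-feature construction (a sketch; the helper names are mine):

    # Sketch: one indicator (one-hot) feature per state turns the
    # "linear" approximator into an exact table, with theta[i] the
    # value of state i.
    n_states = 49
    theta = [0.0] * n_states          # one parameter per state: no savings

    def indicator_features(i):
        f = [0.0] * n_states
        f[i] = 1.0                    # i'th feature is 1 iff state i is present
        return f

    def U_table(i):
        f = indicator_features(i)
        return sum(t * fi for t, fi in zip(theta, f))   # equals theta[i]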

SLIDE 8

Changed Reward: Bad Linear Approximation

• U(s) = θ0 + θ1 x + θ2 y
• Is there a good linear approximation?
  • No.

[Figure: grid with the goal reward 10 moved to the center of the grid, so no linear function of x and y fits the values]

SLIDE 9

But What If…

• U(s) = θ0 + θ1 x + θ2 y + θ3 z
• Include a new feature z:
  • z = |3-x| + |3-y|
  • z is the distance to the goal location.
• Does this allow a good linear approximation?
  • Yes: θ0 = 10, θ1 = θ2 = 0, θ3 = -1 (so U(s) = 10 - z; checked in the sketch below).
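A minimal check of this fix (a sketch; the same assumed grid indexing as before):

    # Sketch: adding the distance feature z makes the changed-reward
    # problem linear again; theta = (10, 0, 0, -1) gives U = 10 - z.
    def U_with_z(s):
        x, y = s
        z = abs(3 - x) + abs(3 - y)   # Manhattan distance to the goal at (3, 3)
        theta = [10.0, 0.0, 0.0, -1.0]
        return sum(t * f for t, f in zip(theta, [1.0, x, y, z]))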

SLIDE 10

Linear Function Approximation

• Define a set of features f1(s), …, fn(s).
  • The features are used as our representation of states.
  • States with similar feature values will be treated similarly.
  • More complex functions require more complex features.

    Û_θ(s) = θ1 f1(s) + θ2 f2(s) + … + θn fn(s)

• Our goal is to learn good parameter values (i.e. feature weights) that approximate the value function well.
  • How can we do this?
  • Use TD-based RL and somehow update the parameters based on each experience.

SLIDE 11

TD-based RL for Linear Approximators

1. Start with initial parameter values.
2. Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE).
3. Update estimated model.
4. Perform TD update for each parameter:

    θi ← ?

5. Goto 2.

What is a "TD update" for a parameter?

SLIDE 12

Aside: Gradient Descent for Squared Error

• Suppose that we have a sequence of states and target values for each state:

    ⟨s1, u(s1)⟩, ⟨s2, u(s2)⟩, …

  • E.g. produced by the TD-based RL loop.
• Our goal is to minimize the sum of squared errors between our estimated function Û_θ(sj) (our estimated value for the j'th state) and each target value u(sj) (the target value for the j'th state):

    Ej = ½ (Û_θ(sj) - u(sj))²     (squared error of example j)

• After seeing the j'th state, the gradient descent rule tells us to update all parameters by:

    θi ← θi - α ∂Ej/∂θi,   where   ∂Ej/∂θi = (∂Ej/∂Û_θ(sj)) (∂Û_θ(sj)/∂θi)

  and α is the learning rate.

SLIDE 13

Aside: Continued

• Expanding the gradient:

    θi ← θi - α ∂Ej/∂θi = θi + α (u(sj) - Û_θ(sj)) ∂Û_θ(sj)/∂θi

  where ∂Û_θ(sj)/∂θi depends on the form of the approximator.
• For a linear approximation function:

    Û_θ(s) = θ1 f1(s) + θ2 f2(s) + … + θn fn(s)

    ∂Û_θ(sj)/∂θi = fi(sj)

• Thus the update becomes (see the sketch below):

    θi ← θi + α (u(sj) - Û_θ(sj)) fi(sj)

• For linear functions, this update is guaranteed to converge to the best approximation for a suitable learning rate schedule.
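A minimal sketch of this per-example update in Python (variable names are my own; `features` is as in the earlier sketch):

    # Sketch: gradient-descent update for a linear approximator.
    def linear_update(theta, s, target, alpha=0.1):
        f = features(s)
        u_hat = sum(t * fi for t, fi in zip(theta, f))   # current estimate
        error = target - u_hat                           # u(s_j) - U_hat(s_j)
        # theta_i <- theta_i + alpha * error * f_i(s_j)
        return [t + alpha * error * fi for t, fi in zip(theta, f)]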

SLIDE 14

TD-based RL for Linear Approximators

1. Start with initial parameter values.
2. Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE). Transition from s to s'.
3. Update estimated model.
4. Perform TD update for each parameter:

    θi ← θi + α (u(s) - Û_θ(s)) fi(s)

5. Goto 2.

What should we use for the "target value" u(s)?
• Use the TD prediction based on the next state s':

    u(s) = R(s) + γ Û_θ(s')

  This is the same as the previous TD method, only with approximation.

SLIDE 15

TD-based RL for Linear Approximators

1. Start with initial parameter values.
2. Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE).
3. Update estimated model.
4. Perform TD update for each parameter:

    θi ← θi + α (R(s) + γ Û_θ(s') - Û_θ(s)) fi(s)

5. Goto 2.

• Note that step 2 still requires the model T to select actions (the greedy policy looks ahead using T). A full loop under this scheme is sketched below.
• To avoid this, we can do the same thing for model-free Q-learning.
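Putting the loop together, here is a minimal sketch (`choose_action`, `step`, and `reward` are hypothetical environment hooks, not from the slides; `features` is as before):

    # Sketch of the TD loop with a linear approximator.
    def td_learn(theta, s0, n_steps, alpha=0.1, gamma=0.9):
        s = s0
        for _ in range(n_steps):
            a = choose_action(s, theta)              # explore/exploit (GLIE)
            s_next = step(s, a)                      # transition from s to s'
            u_hat = lambda st: sum(t * f for t, f in zip(theta, features(st)))
            error = reward(s) + gamma * u_hat(s_next) - u_hat(s)   # TD error
            theta = [t + alpha * error * fi
                     for t, fi in zip(theta, features(s))]
            s = s_next
        return theta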

SLIDE 16

Q-learning with Linear Approximators

    Q̂_θ(s,a) = θ1 f1(s,a) + θ2 f2(s,a) + … + θn fn(s,a)

Features are a function of states and actions.

1. Start with initial parameter values.
2. Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE).
3. Perform TD update for each parameter (see the sketch below):

    θi ← θi + α (R(s) + γ max_a' Q̂_θ(s',a') - Q̂_θ(s,a)) fi(s,a)

4. Goto 2.

• For both Q and U, these algorithms converge to the closest linear approximation to the optimal Q or U.
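A minimal sketch of this Q-learning variant (again with hypothetical helpers; `ACTIONS`, `q_features`, `step`, and `reward` are assumptions):

    # Sketch of Q-learning with a linear approximator over
    # state-action features.
    def q_hat(theta, s, a):
        return sum(t * f for t, f in zip(theta, q_features(s, a)))

    def q_update(theta, s, a, alpha=0.1, gamma=0.9):
        s_next = step(s, a)
        best_next = max(q_hat(theta, s_next, a2) for a2 in ACTIONS)
        error = reward(s) + gamma * best_next - q_hat(theta, s, a)
        # theta_i <- theta_i + alpha * error * f_i(s, a)
        new_theta = [t + alpha * error * fi
                     for t, fi in zip(theta, q_features(s, a))]
        return new_theta, s_next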

SLIDE 17

Summary of RL

• MDP
  • Definition of an MDP (T, R, S).
  • Solving an MDP for the optimal policy: value iteration, policy iteration.
• RL
  • Difference between RL and MDPs.
  • Different methods for passive RL: DUE, ADP, TD.
  • Different methods for active RL: ADP, Q-learning with TD learning.
  • Function approximation for large state/action spaces.

SLIDE 18

Learning Objectives

1) Students are able to apply supervised learning algorithms to prediction problems and evaluate the results.
2) Students are able to apply unsupervised learning algorithms to data analysis problems and evaluate the results.
3) Students are able to apply reinforcement learning algorithms to control problems and evaluate the results.
4) Students are able to take a description of a new problem and decide what kind of problem (supervised, unsupervised, or reinforcement) it is.