
Introduction to Artificial Intelligence

V22.0472-001 Fall 2009

Lecture 11: Reinforcement Learning 2

Rob Fergus – Dept of Computer Science, Courant Institute, NYU
Slides from Alan Fern, Daniel Weld, Dan Klein, John DeNero

Announcements

  • Assignment 2 due next Monday at midnight
  • Please send email to me about the final exam


Last Time: Q-Learning

  • In realistic situations, we cannot possibly learn about every single state!
  • Too many states to visit them all in training
  • Too many states to hold the q-tables in memory
  • Instead, we want to generalize:
  • Learn about some small number of training states from experience
  • Generalize that experience to new, similar states
  • This is a fundamental idea in machine learning, and we’ll see it over and over again


Example: Pacman

  • Let’s say we discover through experience that this state is bad:
  • In naïve q-learning, we know nothing about this state or its q-states:
  • Or even this one!


Feature-Based Representations

  • Solution: describe a state using a vector of features
  • Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  • Example features:
  • Distance to closest ghost
  • Distance to closest dot
  • Number of ghosts
  • 1 / (distance to closest dot)²
  • Is Pacman in a tunnel? (0/1)
  • … etc.
  • Can also describe a q-state (s, a) with features (e.g. action moves closer to food); a small feature-extractor sketch follows below
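As a concrete illustration (not from the lecture), here is a minimal sketch of such a feature map for a Pacman-like q-state. The state fields (ghost_positions, food_positions, in_tunnel, next_position) and the manhattan helper are hypothetical stand-ins; the only point is that features map (s, a) to a small vector of numbers.

```python
# Hypothetical sketch: extracting a feature vector from a Pacman-like q-state (s, a).
# All state fields and helper names here are invented for illustration.
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def features(state, action):
    next_pos = state.next_position(action)       # where this action would take Pacman
    dist_ghost = min(manhattan(next_pos, g) for g in state.ghost_positions)
    dist_food = min(manhattan(next_pos, f) for f in state.food_positions)
    return [
        1.0,                                      # bias feature f0(s, a) = 1
        dist_ghost,                               # distance to closest ghost
        dist_food,                                # distance to closest dot
        float(len(state.ghost_positions)),        # number of ghosts
        1.0 / (dist_food ** 2 + 1.0),             # 1 / (distance to dot)^2, kept finite
        1.0 if state.in_tunnel(next_pos) else 0.0,  # is Pacman in a tunnel? (0/1)
    ]
```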


Function Approximation

  • Never enough training data!
  • Must generalize what is learned from one situation to other “similar” new situations
  • Idea:
  • Instead of using a large table to represent V or Q, use a parameterized function
  • The number of parameters should be small compared to the number of states (generally exponentially fewer parameters)
  • Learn parameters from experience
  • When we update the parameters based on observations in one state, our V or Q estimate will also change for other similar states
  • I.e. the parameterization facilitates generalization of experience


Linear Function Approximation

  • Define a set of features f1(s), …, fn(s)
  • The features are used as our representation of states
  • States with similar feature values will be treated similarly
  • More complex functions require more complex features

V̂θ(s) = θ0 + θ1 f1(s) + θ2 f2(s) + … + θn fn(s)

  • Our goal is to learn good parameter values (i.e. feature weights) that approximate the value function well
  • How can we do this?
  • Use TD-based RL and somehow update parameters based on each experience.

TD-based RL for Linear Approximators

1. Start with initial parameter values
2. Take action according to an explore/exploit policy
3. Update estimated model
4. Perform TD update for each parameter
5. Goto 2

What is a “TD update” for a parameter θi?   θi ← ?

Aside: Gradient Descent

  • Given a function f(θ1, …, θn) of n real values θ = (θ1, …, θn), suppose we want to minimize f with respect to θ
  • A common approach to doing this is gradient descent
  • The gradient of f at point θ, denoted by ∇θ f(θ), is an n-dimensional vector that points in the direction where f increases most steeply at point θ
  • Vector calculus tells us that ∇θ f(θ) is just a vector of partial derivatives, and that we can decrease f by moving in the negative gradient direction

∂f(θ)/∂θi = lim(ε→0) [ f(θ1, …, θi−1, θi + ε, θi+1, …, θn) − f(θ) ] / ε

∇θ f(θ) = [ ∂f(θ)/∂θ1, …, ∂f(θ)/∂θn ]
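To make the aside concrete, here is a minimal numerical sketch (not from the lecture): the partial derivatives are estimated with the finite-difference definition above, and the parameters are stepped in the negative gradient direction. The example function and the step sizes are arbitrary choices.

```python
# Minimal gradient descent sketch using finite-difference partial derivatives.
def gradient(f, theta, eps=1e-6):
    grad = []
    for i in range(len(theta)):
        bumped = list(theta)
        bumped[i] += eps
        grad.append((f(bumped) - f(theta)) / eps)   # ∂f/∂θi ≈ (f(θ + ε·ei) − f(θ)) / ε
    return grad

def gradient_descent(f, theta, alpha=0.1, steps=100):
    for _ in range(steps):
        g = gradient(f, theta)
        theta = [t - alpha * gi for t, gi in zip(theta, g)]  # move against the gradient
    return theta

# Example: minimize f(θ) = (θ1 − 3)² + (θ2 + 1)²; the minimum is at (3, −1).
f = lambda th: (th[0] - 3) ** 2 + (th[1] + 1) ** 2
print(gradient_descent(f, [0.0, 0.0]))
```

Running it prints a point close to (3, −1), the minimizer of the example function.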

Aside: Gradient Descent for Squared Error

  • Suppose that we have a sequence of states and target values for each state: ⟨s1, v(s1)⟩, ⟨s2, v(s2)⟩, …
  • E.g. produced by the TD-based RL loop
  • Our goal is to minimize the sum of squared errors between our estimated function and each target value:

Ej(θ) = ( V̂θ(sj) − v(sj) )²   — the squared error of example j, where V̂θ(sj) is our estimated value for the j’th state and v(sj) is the target value for the j’th state

  • After seeing the j’th state, the gradient descent rule tells us that we can decrease the error by updating the parameters by:

θi ← θi − αj ∂Ej(θ)/∂θi,   where   ∂Ej(θ)/∂θi = ∂Ej(θ)/∂V̂θ(sj) · ∂V̂θ(sj)/∂θi = 2 ( V̂θ(sj) − v(sj) ) ∂V̂θ(sj)/∂θi

(αj is the learning rate)

Aside: continued

θi ← θi − αj ∂Ej(θ)/∂θi = θi + αj ( v(sj) − V̂θ(sj) ) ∂V̂θ(sj)/∂θi

(the factor of 2 is absorbed into the learning rate αj)

  • ∂V̂θ(sj)/∂θi depends on the form of the approximator
  • For a linear approximation function:

V̂θ(s) = θ0 + θ1 f1(s) + θ2 f2(s) + … + θn fn(s)

∂V̂θ(sj)/∂θi = fi(sj)

  • Thus the update becomes:

θi ← θi + αj ( v(sj) − V̂θ(sj) ) fi(sj)

  • For linear functions this update is guaranteed to converge to the best approximation for a suitable learning rate schedule
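A minimal sketch of this update for a linear approximator (not the lecture's code; the features function is assumed to return [1, f1(s), …, fn(s)] so that θ0 acts as the bias weight, and the target v(s) is supplied by the caller):

```python
# Gradient-descent update of a linear value approximator
#   V̂θ(s) = θ0 + θ1 f1(s) + ... + θn fn(s)
# toward a supplied target value v(s).
def v_hat(theta, feats):
    return sum(t * f for t, f in zip(theta, feats))

def squared_error_update(theta, s, target, features, alpha=0.05):
    feats = features(s)
    error = target - v_hat(theta, feats)          # v(s) − V̂θ(s)
    return [t + alpha * error * f                 # θi ← θi + α (v(s) − V̂θ(s)) fi(s)
            for t, f in zip(theta, feats)]
```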

TD-based RL for Linear Approximators

1. Start with initial parameter values
2. Take action according to an explore/exploit policy (transitioning from s to s')
3. Update estimated model
4. Perform TD update for each parameter:

θi ← θi + α ( v(s) − V̂θ(s) ) fi(s)

5. Goto 2

What should we use for the “target value” v(s)?

  • Use the TD prediction based on the next state s' — this is the same as the previous TD method, only with approximation:

v(s) = R(s) + β V̂θ(s')


TD-based RL for Linear Approximators

1. Start with initial parameter values
2. Take action according to an explore/exploit policy
3. Update estimated model
4. Perform TD update for each parameter:

θi ← θi + α ( R(s) + β V̂θ(s') − V̂θ(s) ) fi(s)

5. Goto 2
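Putting the pieces together, a hedged sketch of the full loop for learning V with a linear approximator. The env interface (reset/step), the choose_action explore/exploit policy, and the features function are hypothetical stand-ins, and the model-update step is omitted since the model-free Q-learning variant follows below.

```python
# Sketch of TD-based RL for a linear value approximator. Assumptions:
# `env` exposes reset() and step(action) -> (next_state, reward, done),
# `features(s)` returns the feature vector, and `choose_action(s, theta)`
# implements some explore/exploit policy -- all hypothetical stand-ins.
def td_value_learning(env, features, choose_action, n_features,
                      alpha=0.05, beta=0.9, episodes=1000):
    theta = [0.0] * n_features
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = choose_action(s, theta)               # explore/exploit policy
            s_next, reward, done = env.step(a)
            f_s = features(s)
            v_s = sum(t * f for t, f in zip(theta, f_s))
            v_next = 0.0 if done else sum(
                t * f for t, f in zip(theta, features(s_next)))
            td_error = reward + beta * v_next - v_s   # R(s) + β·V̂θ(s') − V̂θ(s)
            theta = [t + alpha * td_error * f         # θi ← θi + α·δ·fi(s)
                     for t, f in zip(theta, f_s)]
            s = s_next
    return theta
```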

  • Step 2 requires a model to select the action
  • For applications such as Backgammon it is easy to get a simulation-based model

  • For others it is difficult to get a good model
  • But we can do the same thing for model-free Q-learning

Q-learning with Linear Approximators

1. Start with initial parameter values
2. Take action a according to an explore/exploit policy, transitioning from s to s'
3. Perform TD update for each parameter:

θi ← θi + α ( R(s) + β maxa' Q̂θ(s', a') − Q̂θ(s, a) ) fi(s, a)

4. Goto 2

where the features are now functions of states and actions:

Q̂θ(s, a) = θ0 + θ1 f1(s, a) + θ2 f2(s, a) + … + θn fn(s, a)

  • For both Q and V, these algorithms converge to the closest linear approximation to the optimal Q or V.
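The following sketch mirrors the Q-learning algorithm above (again with a hypothetical env interface; ε-greedy is used as one possible explore/exploit policy):

```python
import random

# Q-learning with a linear approximator Q̂θ(s, a) = Σ θi·fi(s, a).
# `features(s, a)`, `env`, and `actions` are assumed stand-ins.
def q_hat(theta, feats):
    return sum(t * f for t, f in zip(theta, feats))

def linear_q_learning(env, actions, features, n_features,
                      alpha=0.05, beta=0.9, epsilon=0.1, episodes=1000):
    theta = [0.0] * n_features
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:                       # explore
                a = random.choice(actions)
            else:                                               # exploit
                a = max(actions, key=lambda act: q_hat(theta, features(s, act)))
            s_next, reward, done = env.step(a)
            best_next = 0.0 if done else max(
                q_hat(theta, features(s_next, act)) for act in actions)
            f_sa = features(s, a)
            td_error = reward + beta * best_next - q_hat(theta, f_sa)
            theta = [t + alpha * td_error * f                   # θi ← θi + α·δ·fi(s, a)
                     for t, f in zip(theta, f_sa)]
            s = s_next
    return theta
```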

Example: Tactical Battles in Wargus

  • Wargus is a real-time strategy (RTS) game
  • Tactical battles are a key aspect of the game
  • RL task: learn a policy to control n friendly agents in a battle against m enemy agents
  • The policy should be applicable to tasks with different sets and numbers of agents

[Screenshots: 5 vs. 5 and 10 vs. 10 battles]

Example: Tactical Battles in Wargus

  • States: contain information about the locations, health, and current activity of all friendly and enemy agents
  • Actions: Attack(F, E)
  • causes friendly agent F to attack enemy E
  • Policy: represented via Q-function Q(s, Attack(F, E))
  • Each decision cycle, loop through each friendly agent F and select the enemy E to attack that maximizes Q(s, Attack(F, E))
  • Q(s, Attack(F, E)) generalizes over any friendly and enemy agents F and E
  • We used a linear function approximator with Q-learning

Example: Tactical Battles in Wargus

  • Engineered a set of relational features {f1(s, Attack(F,E)), …, fn(s, Attack(F,E))}

Q̂θ(s, a) = θ0 + θ1 f1(s, a) + θ2 f2(s, a) + … + θn fn(s, a)

  • Example features:
  • # of other friendly agents that are currently attacking E
  • Health of friendly agent F
  • Health of enemy agent E
  • Difference in health values
  • Walking distance between F and E
  • Is E the enemy agent that F is currently attacking?
  • Is F the closest friendly agent to E?
  • Is E the closest enemy agent to F?
  • Features are well defined for any number of agents (a small illustrative sketch follows below)
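Purely to illustrate what "relational features that are well defined for any number of agents" might look like in code, a hedged sketch; every state field and helper name here is invented, not taken from the paper or lecture.

```python
# Hypothetical relational features for the q-state (s, Attack(F, E)).
# `s.friends` / `s.enemies` are assumed lists of agent objects with
# .health, .position and .current_target attributes; `walk_dist` is an
# assumed path-distance helper.
def wargus_features(s, F, E, walk_dist):
    return [
        1.0,                                                    # bias
        sum(1 for g in s.friends if g is not F and g.current_target is E),
        F.health,                                               # health of F
        E.health,                                               # health of E
        F.health - E.health,                                    # health difference
        walk_dist(F.position, E.position),                      # walking distance F to E
        1.0 if F.current_target is E else 0.0,                  # already attacking E?
        1.0 if F is min(s.friends, key=lambda g: walk_dist(g.position, E.position)) else 0.0,
        1.0 if E is min(s.enemies, key=lambda e: walk_dist(F.position, e.position)) else 0.0,
    ]
```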

Example: Tactical Battles in Wargus

  • Linear Q-learning in 5 vs. 5 battle

[Plot: damage differential vs. number of training episodes for linear Q-learning in a 5 vs. 5 battle]


Example: Tactical Battles in Wargus

  • Initialize the Q-function for 10 vs. 10 to the one learned for 5 vs. 5
  • Initial performance is very good, which demonstrates generalization from 5 vs. 5 to 10 vs. 10

Q-learning w/ Non-linear Approximators

1. Start with initial parameter values
2. Take action according to an explore/exploit policy
3. Perform TD update for each parameter:

θi ← θi + α ( R(s) + β maxa' Q̂θ(s', a') − Q̂θ(s, a) ) ∂Q̂θ(s, a)/∂θi

4. Goto 2

  • Q̂θ(s, a) is sometimes represented by a non-linear approximator such as a neural network; the derivative ∂Q̂θ(s, a)/∂θi is calculated in closed form
  • Typically the space has many local minima and we no longer guarantee convergence
  • Often works well in practice
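A minimal sketch of this update with a small one-hidden-layer network as the non-linear approximator. The network shape, tanh activation, and NumPy implementation are illustrative choices, not the lecture's; the caller supplies x = features(s, a) and the TD error R(s) + β·maxa' Q̂θ(s', a') − Q̂θ(s, a).

```python
import numpy as np

# Tiny non-linear Q approximator: Q̂θ(s, a) = w2 · tanh(W1 x + b1),
# where x = features(s, a) and θ = (W1, b1, w2).
class TinyQNet:
    def __init__(self, n_inputs, n_hidden, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        self.W1 = 0.1 * rng.standard_normal((n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.w2 = 0.1 * rng.standard_normal(n_hidden)

    def q(self, x):
        return float(self.w2 @ np.tanh(self.W1 @ x + self.b1))

    def td_update(self, x, td_error, alpha=0.01):
        # Closed-form gradients of Q̂ w.r.t. each parameter (chain rule),
        # then θi ← θi + α · td_error · ∂Q̂/∂θi.
        h = np.tanh(self.W1 @ x + self.b1)
        dq_dw2 = h
        dq_db1 = self.w2 * (1.0 - h ** 2)
        dq_dW1 = np.outer(dq_db1, x)
        self.w2 += alpha * td_error * dq_dw2
        self.b1 += alpha * td_error * dq_db1
        self.W1 += alpha * td_error * dq_dW1
```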

~World's Best Backgammon Player


  • Neural network with 80 hidden units
  • Used TD-updates for 300,000 games against self
  • Is one of the top (2 or 3) players in the world!

Quadruped Locomotion

  • Optimize gait of 4-legged robots over rough terrain


RL via Policy Gradient Search

  • So far all of our RL techniques have tried to learn an exact or approximate utility function or Q-function
  • I.e. learn the optimal “value” of being in a state, or of taking an action from a state
  • Value functions can often be much more complex to represent than the corresponding policy
  • Do we really care about knowing Q(s, left) = 0.3554 and Q(s, right) = 0.533?
  • Or just that “right is better than left in state s”?
  • Motivates searching directly in a parameterized policy space


Policy Search

  • Simplest policy search:
  • Start with an initial linear value function or q-function
  • Nudge each feature weight up and down and see if your policy is better than before (a hill-climbing sketch follows below)
  • Problems:
  • How do we tell the policy got better?
  • Need to run many sample episodes!
  • If there are a lot of features, this can be impractical
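A hedged sketch of this "nudge each weight" idea. The evaluate_policy function, which would run many sample episodes and return an estimate of the policy's value, is an assumed stand-in, and is exactly the expensive part the bullets above warn about.

```python
# Simplest policy search: coordinate-wise hill climbing on feature weights.
# `evaluate_policy(theta)` is assumed to run many sample episodes with the
# policy induced by theta and return an estimate of its value.
def hill_climb(theta, evaluate_policy, step=0.1, iterations=100):
    best_value = evaluate_policy(theta)
    for _ in range(iterations):
        for i in range(len(theta)):
            for delta in (+step, -step):          # nudge weight i up, then down
                candidate = list(theta)
                candidate[i] += delta
                value = evaluate_policy(candidate)
                if value > best_value:            # keep the nudge if it helped
                    theta, best_value = candidate, value
    return theta
```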



RL via Policy Gradient Ascent

  • This general approach has the following components:

1. Select a space of parameterized policies
2. Compute the gradient of the value function of the policy with respect to the parameters
3. Move the parameters in the direction of the gradient
4. Repeat these steps until we reach a local maximum

  • So we must answer the following questions:
  • How should we represent parameterized policies?
  • How can we compute the gradient?


Parameterized Policies

  • One example of a space of parametric policies is:

πθ(s) = argmaxa Q̂θ(s, a)

where Q̂θ(s, a) may be a linear function, e.g.

Q̂θ(s, a) = θ0 + θ1 f1(s, a) + θ2 f2(s, a) + … + θn fn(s, a)

  • The goal is to learn parameters θ that give a good policy
  • Note that it is not important that Q̂θ(s, a) be close to the actual Q-function
  • Rather, we only require that Q̂θ(s, a) is good at ranking actions in order of goodness

Policy Gradient Ascent

  • Let ρ(θ) be the expected value of policy πθ.
  • ρ(θ) is just the expected discounted total reward for a trajectory of πθ.
  • For simplicity assume each trajectory starts at a single initial state.
  • Our objective is to find a θ that maximizes ρ(θ)
  • Policy gradient ascent tells us to iteratively update the parameters via:

θ ← θ + α ∇θ ρ(θ)

  • Problem: ρ(θ) is generally very complex and it is rare that we can compute a closed form for the gradient of ρ(θ).
  • We will instead estimate the gradient based on experience

Gradient Estimation

  • Concern: computing or estimating the gradient of discontinuous functions can be problematic.
  • For our example parametric policy πθ(s) = argmaxa Q̂θ(s, a), is ρ(θ) continuous?
  • No.
  • There are values of θ where arbitrarily small changes cause the policy to change.
  • Since different policies can have different values, this means that changing θ can cause a discontinuous jump in ρ(θ).

Example: Discontinuous ρ(θ)

  • Consider a problem with initial state s and two actions a1 and a2
  • a1 leads to a very large terminal reward R1
  • a2 leads to a very small terminal reward R2
  • Fixing θ2 to a constant, we can plot the ranking assigned to each action by Q̂θ and the corresponding value ρ(θ) for the policy πθ(s) = argmaxa Q̂θ(s, a)

[Plots: Q̂θ(s, a1) and Q̂θ(s, a2) as functions of θ1, and ρ(θ) as a function of θ1, which jumps between R1 and R2; the discontinuity in ρ(θ) occurs where the ordering of a1 and a2 changes]

Probabilistic Policies

  • We would like to avoid policies that drastically change with small parameter changes, leading to discontinuities
  • A probabilistic policy πθ takes a state as input and returns a distribution over actions
  • Given a state s, πθ(s, a) returns the probability that πθ selects action a in s
  • Note that ρ(θ) is still well defined for probabilistic policies
  • Now uncertainty of trajectories comes from the environment and the policy
  • Importantly, if πθ(s, a) is continuous relative to changing θ, then ρ(θ) is also continuous relative to changing θ
  • A common form for probabilistic policies is the softmax function or Boltzmann exploration function (see the sketch below):

πθ(s, a) = Pr(a | s) = exp( Q̂θ(s, a) ) / Σa'∈A exp( Q̂θ(s, a') )
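A minimal sketch of such a softmax (Boltzmann) policy over a linear Q̂θ. The features function and action list are assumed inputs; subtracting the maximum score before exponentiating is a standard numerical-stability trick, not something from the slide.

```python
import math
import random

# Softmax / Boltzmann policy over a linear approximator Q̂θ(s, a).
def action_probabilities(theta, s, actions, features):
    scores = [sum(t * f for t, f in zip(theta, features(s, a))) for a in actions]
    m = max(scores)                                   # shift for numerical stability
    exps = [math.exp(q - m) for q in scores]
    z = sum(exps)
    return [e / z for e in exps]                      # Pr(a | s) for each action

def sample_action(theta, s, actions, features):
    probs = action_probabilities(theta, s, actions, features)
    return random.choices(actions, weights=probs, k=1)[0]
```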


Basic Policy Gradient Algorithm

  • Repeat until stopping condition:

1. Execute πθ for N trajectories while storing the state, action, reward sequences
2. ∇θ ← (1/N) Σj=1..N Σt=1..Tj g(sj,t, aj,t) Rj(sj,t)
3. θ ← θ + α ∇θ


  • One disadvantage of this approach is the small number of updates per amount of experience
  • It also requires a notion of trajectory rather than an infinite sequence of experience
  • Online policy gradient algorithms perform updates after each step in the environment (often learn faster)
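A hedged sketch of one iteration of this batch estimator, weighting each g(s, a) by the reward observed at that step as in the update written above. run_trajectory and grad_log_pi (which computes g(s, a) = ∇θ log πθ(s, a), derived on the next slide) are assumed helpers.

```python
# One iteration of the basic (batch) policy gradient update.
# `run_trajectory(theta)` is an assumed helper returning a list of
# (state, action, reward) steps generated by following πθ; `grad_log_pi`
# computes g(s, a) = ∇θ log πθ(s, a).
def policy_gradient_step(theta, run_trajectory, grad_log_pi, n_traj=20, alpha=0.01):
    grad = [0.0] * len(theta)
    for _ in range(n_traj):
        for s, a, r in run_trajectory(theta):
            g = grad_log_pi(theta, s, a)
            grad = [gi + r * gji for gi, gji in zip(grad, g)]   # accumulate r · g(s, a)
    grad = [gi / n_traj for gi in grad]                         # average over N trajectories
    return [t + alpha * gi for t, gi in zip(theta, grad)]       # θ ← θ + α ∇θ
```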

Computing the Gradient of Policy

  • Both algorithms require computation of g(s, a) = ∇θ( log πθ(s, a) )
  • For the Boltzmann distribution with linear approximation we have:

πθ(s, a) = exp( Q̂θ(s, a) ) / Σa'∈A exp( Q̂θ(s, a') )

where

Q̂θ(s, a) = θ0 + θ1 f1(s, a) + θ2 f2(s, a) + … + θn fn(s, a)

  • Here the partial derivatives needed for g(s, a) are (see the sketch below):

∂ log πθ(s, a) / ∂θi = fi(s, a) − Σa' πθ(s, a') fi(s, a')
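A minimal sketch of g(s, a) for this softmax-of-linear-Q̂ policy, with the action set and features function passed in explicitly (the same assumed interfaces as the earlier sketches):

```python
import math

# g(s, a) = ∇θ log πθ(s, a) for a softmax policy over a linear Q̂θ(s, a).
# Component i is fi(s, a) − Σa' πθ(s, a') fi(s, a').
def grad_log_pi(theta, s, a, actions, features):
    feats = {act: features(s, act) for act in actions}
    scores = {act: sum(t * f for t, f in zip(theta, feats[act])) for act in actions}
    m = max(scores.values())                          # numerical stability shift
    exps = {act: math.exp(q - m) for act, q in scores.items()}
    z = sum(exps.values())
    probs = {act: e / z for act, e in exps.items()}
    n = len(theta)
    expected = [sum(probs[act] * feats[act][i] for act in actions) for i in range(n)]
    return [feats[a][i] - expected[i] for i in range(n)]
```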

Controlling Helicopters

  • Policy gradient techniques have been used to create controllers for difficult helicopter maneuvers

  • For example, inverted helicopter flight.


Proactive Security

Intelligent Botnet Controller

  • Used OLPOMDP to proactively discover maximally damaging botnet attacks in peer-to-peer networks [Dejmal & Fern, 2008]

Policy Gradient Recap

  • When policies have much simpler representations than the corresponding value functions, direct search in policy space can be a good idea
  • Allows us to design complex parametric controllers and optimize the details of the parameter settings
  • Can be prone to finding local maxima
  • Many ways of dealing with this, e.g. random restarts.


Overview of AI topics

  • Search
  • Planning
  • Logic
  • Probabilities
  • Machine Learning