1. Introduction to Artificial Intelligence (V22.0472-001, Fall 2009)
   Lecture 11: Reinforcement Learning 2
   Rob Fergus – Dept of Computer Science, Courant Institute, NYU
   Slides from Alan Fern, Daniel Weld, Dan Klein, John DeNero

   Announcements
   • Assignment 2 due next Monday at midnight
   • Please send email to me about the final exam

   Last Time: Q-Learning
   • In realistic situations, we cannot possibly learn about every single state!
     • Too many states to visit them all in training
     • Too many states to hold the q-tables in memory
   • Instead, we want to generalize:
     • Learn about some small number of training states from experience
     • Generalize that experience to new, similar states
   • This is a fundamental idea in machine learning, and we'll see it over and over again

   Example: Pacman
   • Let's say we discover through experience that this state is bad
   • In naïve q-learning, we know nothing about this state or its q-states
   • Or even this one!

   Function Approximation
   • Never enough training data!
     • Must generalize what is learned from one situation to other "similar" new situations
   • Idea:
     • Instead of using a large table to represent V or Q, use a parameterized function
     • The number of parameters should be small compared to the number of states (generally exponentially fewer parameters)
     • Learn parameters from experience
     • When we update the parameters based on observations in one state, our V or Q estimate will also change for other similar states
     • I.e. the parameterization facilitates generalization of experience

   Feature-Based Representations
   • Solution: describe a state using a vector of features
     • Features are functions from states to real numbers (often 0/1) that capture important properties of the state
   • Example features:
     • Distance to closest ghost
     • Distance to closest dot
     • Number of ghosts
     • 1 / (distance to dot)^2
     • Is Pacman in a tunnel? (0/1)
     • … etc.
   • Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
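To make the feature idea concrete, here is a minimal Python sketch, assuming a hypothetical Pacman-like state; the State fields, the manhattan helper, and the feature names are illustrative stand-ins, not the course's actual code. A q-state version would simply take (state, action) and add action-dependent features such as "action moves closer to food".

```python
# Minimal sketch of a feature-based representation for a hypothetical Pacman-like state.
from dataclasses import dataclass
from typing import Dict, List, Tuple

Position = Tuple[int, int]

@dataclass
class State:
    pacman: Position
    ghosts: List[Position]
    food: List[Position]
    in_tunnel: bool

def manhattan(a: Position, b: Position) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def features(s: State) -> Dict[str, float]:
    """Map a (potentially enormous) raw state to a handful of real-valued features."""
    dist_ghost = min((manhattan(s.pacman, g) for g in s.ghosts), default=0)
    dist_dot = min((manhattan(s.pacman, f) for f in s.food), default=0)
    return {
        "dist-to-closest-ghost": float(dist_ghost),
        "inv-dist-to-dot-squared": 1.0 / (dist_dot ** 2) if dist_dot > 0 else 1.0,
        "num-ghosts": float(len(s.ghosts)),
        "in-tunnel": 1.0 if s.in_tunnel else 0.0,
    }
```

Two states that place Pacman in similar danger produce similar feature vectors, which is exactly what lets a parameterized V or Q generalize across them.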

2. Linear Function Approximation
   • Define a set of features f_1(s), …, f_n(s)
     • The features are used as our representation of states
     • States with similar feature values will be treated similarly
     • More complex functions require more complex features
   • Linear value function:
     V̂_θ(s) = θ_0 + θ_1 f_1(s) + θ_2 f_2(s) + … + θ_n f_n(s)
   • Our goal is to learn good parameter values (i.e. feature weights) that approximate the value function well
   • How can we do this?
     • Use TD-based RL and somehow update the parameters based on each experience

   TD-based RL for Linear Approximators
   1. Start with initial parameter values
   2. Take action according to an explore/exploit policy
   3. Update estimated model
   4. Perform TD update for each parameter: θ_i ← ?
   5. Goto 2
   • What is a "TD update" for a parameter?

   Aside: Gradient Descent
   • Given a function f(θ_1, …, θ_n) of n real values θ = (θ_1, …, θ_n), suppose we want to minimize f with respect to θ
   • A common approach to doing this is gradient descent
   • The gradient of f at a point θ, denoted ∇_θ f(θ), is the n-dimensional vector that points in the direction in which f increases most steeply at θ
   • Vector calculus tells us that ∇_θ f(θ) is just the vector of partial derivatives:
     ∇_θ f(θ) = [ ∂f(θ)/∂θ_1, …, ∂f(θ)/∂θ_n ]
     where ∂f(θ)/∂θ_i = lim_{ε→0} [ f(θ_1, …, θ_{i-1}, θ_i + ε, θ_{i+1}, …, θ_n) - f(θ) ] / ε
   • We can decrease f by moving in the negative gradient direction

   Aside: Gradient Descent for Squared Error
   • Suppose that we have a sequence of states and a target value for each state:
     ⟨s_1, v(s_1)⟩, ⟨s_2, v(s_2)⟩, …
     • E.g. produced by the TD-based RL loop
   • Our goal is to minimize the squared error between our estimated function and each target value:
     E_j = ½ ( V̂_θ(s_j) - v(s_j) )^2
     where E_j is the squared error of example j, V̂_θ(s_j) is our estimated value for the j'th state, and v(s_j) is the target value for the j'th state
   • After seeing the j'th state, the gradient descent rule tells us that we can decrease the error by updating the parameters by:
     θ_i ← θ_i - α ∂E_j/∂θ_i = θ_i - α ( ∂E_j/∂V̂_θ(s_j) ) ( ∂V̂_θ(s_j)/∂θ_i )
     where α is the learning rate

   Aside: continued
   • Expanding the error term:
     θ_i ← θ_i - α ( ∂E_j/∂V̂_θ(s_j) ) ( ∂V̂_θ(s_j)/∂θ_i ) = θ_i - α ( V̂_θ(s_j) - v(s_j) ) ∂V̂_θ(s_j)/∂θ_i
     • ∂V̂_θ(s_j)/∂θ_i depends on the form of the approximator
   • For a linear approximation function:
     V̂_θ(s) = θ_0 + θ_1 f_1(s) + θ_2 f_2(s) + … + θ_n f_n(s)
     so ∂V̂_θ(s_j)/∂θ_i = f_i(s_j)
   • Thus the update becomes:
     θ_i ← θ_i + α ( v(s_j) - V̂_θ(s_j) ) f_i(s_j)
   • For linear functions this update is guaranteed to converge to the best approximation for a suitable learning rate schedule

   TD-based RL for Linear Approximators
   1. Start with initial parameter values
   2. Take action according to an explore/exploit policy, transitioning from s to s'
   3. Update estimated model
   4. Perform TD update for each parameter:
      θ_i ← θ_i + α ( v(s) - V̂_θ(s) ) f_i(s)
   5. Goto 2
   • What should we use for the "target value" v(s)?
     • Use the TD prediction based on the next state s':
       v(s) = R(s) + β V̂_θ(s')
     • This is the same as the previous TD method, only with approximation
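The gradient-derived update above translates directly into a short loop. The following is a minimal sketch, assuming a hypothetical environment interface (env.reset / env.step returning next state, reward, and a done flag), a feature_fn mapping states to numpy vectors, and an externally supplied explore/exploit policy; it is one way to write the five-step algorithm with the TD target v(s) = R(s) + β V̂_θ(s'), not the lecture's own code.

```python
import numpy as np

def linear_value(theta: np.ndarray, feats: np.ndarray) -> float:
    # V_hat_theta(s) = theta . f(s); a constant bias feature of 1.0 plays the role of theta_0
    return float(theta @ feats)

def td_linear(env, feature_fn, policy, num_episodes=100, alpha=0.01, beta=0.9):
    """TD learning with a linear approximator (steps 1-5 of the slide).

    Hypothetical interfaces assumed by this sketch:
      env.reset() -> state, env.step(action) -> (next_state, reward, done)
      feature_fn(state) -> np.ndarray of fixed length n
      policy(state) -> action (the explore/exploit policy)
    """
    theta = np.zeros(len(feature_fn(env.reset())))      # 1. initial parameter values
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)                                # 2. explore/exploit action
            s_next, r, done = env.step(a)                #    transition from s to s'
            f_s = feature_fn(s)
            # TD target v(s) = R(s) + beta * V_hat(s'), taken as just R(s) at terminal states
            target = r + (0.0 if done else beta * linear_value(theta, feature_fn(s_next)))
            td_error = target - linear_value(theta, f_s)
            theta += alpha * td_error * f_s              # 4. theta_i += alpha * error * f_i(s)
            s = s_next                                   # 5. goto 2
    return theta
```

Note that only the weight vector theta is stored, however many states the environment has; the per-state table of naïve TD learning never appears.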

3. TD-based RL for Linear Approximators
   1. Start with initial parameter values
   2. Take action according to an explore/exploit policy, transitioning from s to s'
   3. Update estimated model
   4. Perform TD update for each parameter:
      θ_i ← θ_i + α ( R(s) + β V̂_θ(s') - V̂_θ(s) ) f_i(s)
   5. Goto 2
   • Step 2 requires a model to select the action
     • For applications such as Backgammon it is easy to get a simulation-based model
     • For others it is difficult to get a good model
   • But we can do the same thing for model-free Q-learning

   Q-learning with Linear Approximators
   • Q̂_θ(s, a) = θ_0 + θ_1 f_1(s, a) + θ_2 f_2(s, a) + … + θ_n f_n(s, a)
     • Features are a function of states and actions
   1. Start with initial parameter values
   2. Take action a according to an explore/exploit policy, transitioning from s to s'
   3. Perform TD update for each parameter:
      θ_i ← θ_i + α ( R(s) + β max_{a'} Q̂_θ(s', a') - Q̂_θ(s, a) ) f_i(s, a)
   4. Goto 2
   • For both Q and V, these algorithms converge to the closest linear approximation to the optimal Q or V (a code sketch of this Q-learning loop follows the Wargus example at the end of this page)

   Example: Tactical Battles in Wargus
   • Wargus is a real-time strategy (RTS) game
   • Tactical battles are a key aspect of the game
   • RL task: learn a policy to control n friendly agents in a battle against m enemy agents
     • The policy should be applicable to tasks with different sets and numbers of agents
   [Figure: screenshots of 10 vs. 10 and 5 vs. 5 battles]

   Example: Tactical Battles in Wargus
   • States: contain information about the locations, health, and current activity of all friendly and enemy agents
   • Actions: Attack(F, E)
     • causes friendly agent F to attack enemy E
   • Policy: represented via a Q-function Q(s, Attack(F, E))
     • Each decision cycle, loop through each friendly agent F and select the enemy E to attack that maximizes Q(s, Attack(F, E))
     • Q(s, Attack(F, E)) generalizes over any friendly and enemy agents F and E
   • We used a linear function approximator with Q-learning

   Example: Tactical Battles in Wargus
   • Q̂_θ(s, a) = θ_0 + θ_1 f_1(s, a) + θ_2 f_2(s, a) + … + θ_n f_n(s, a)
   • Engineered a set of relational features {f_1(s, Attack(F, E)), …, f_n(s, Attack(F, E))}
   • Example features:
     • # of other friendly agents that are currently attacking E
     • Health of friendly agent F
     • Health of enemy agent E
     • Difference in health values
     • Walking distance between F and E
     • Is E the enemy agent that F is currently attacking?
     • Is F the closest friendly agent to E?
     • Is E the closest enemy agent to F?
     • …
   • Features are well defined for any number of agents

   Example: Tactical Battles in Wargus
   • Linear Q-learning in a 5 vs. 5 battle
   [Figure: damage differential vs. training episodes]
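The following is a minimal sketch of the Q-learning-with-linear-approximators loop shown earlier on this page, assuming hypothetical env, feature_fn(s, a), and legal_actions interfaces and an epsilon-greedy explore/exploit policy. It is written in the spirit of the Wargus setup, where the features would be the relational f_i(s, Attack(F, E)) listed above, but it is illustrative code, not the actual experiment code.

```python
import numpy as np

def q_value(theta, feature_fn, s, a):
    # Q_hat_theta(s, a) = theta . f(s, a); a bias feature of 1.0 stands in for theta_0
    return float(theta @ feature_fn(s, a))

def q_learning_linear(env, feature_fn, legal_actions, num_episodes=100,
                      alpha=0.01, beta=0.9, epsilon=0.1, seed=0):
    """Q-learning with a linear approximator over state-action features.

    Hypothetical interfaces assumed by this sketch:
      env.reset() -> state, env.step(action) -> (next_state, reward, done)
      feature_fn(s, a) -> np.ndarray of fixed length
      legal_actions(s) -> non-empty list of actions available in s
    """
    rng = np.random.default_rng(seed)
    s0 = env.reset()
    theta = np.zeros(len(feature_fn(s0, legal_actions(s0)[0])))   # 1. initial parameters
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            acts = legal_actions(s)
            # 2. epsilon-greedy explore/exploit policy
            if rng.random() < epsilon:
                a = acts[rng.integers(len(acts))]
            else:
                a = max(acts, key=lambda a_: q_value(theta, feature_fn, s, a_))
            s_next, r, done = env.step(a)
            # 3. TD update: target uses max over a' of Q_hat(s', a')
            best_next = 0.0 if done else max(
                q_value(theta, feature_fn, s_next, a_) for a_ in legal_actions(s_next))
            td_error = r + beta * best_next - q_value(theta, feature_fn, s, a)
            theta += alpha * td_error * feature_fn(s, a)
            s = s_next                                             # 4. goto 2
    return theta
```

The greedy branch mirrors the Wargus decision cycle: for a given agent, evaluate Q̂ for each candidate action and pick the maximizer, which works for any number of agents because the features, not the state table, carry the representation.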
