
Announcements: CS 4100 Artificial Intelligence, Markov Decision Processes II



1. Announcements: CS 4100 Artificial Intelligence, Markov Decision Processes II
• Homework 4: MDPs (lead TA: Iris)
  • Due Mon 7 Oct at 11:59pm
• Project 2: Multi-Agent Search (lead TA: Zhaoqing)
  • Due Thu 10 Oct at 11:59pm
• Office Hours
  • Iris: Mon 10.00am-noon, RI 237
  • JW: Tue 1.40pm-2.40pm, DG 111
  • Zhaoqing: Thu 9.00am-11.00am, HS 202
  • Eli: Fri 10.00am-noon, RY 207
Jan-Willem van de Meent, Northeastern University
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Example: Grid World
• A maze-like problem
  • The agent lives in a grid
  • Walls block the agent's path
• Noisy movement: actions do not always go as planned
  • 80% of the time, the action North takes the agent North (if there is no wall there)
  • 10% of the time, North takes the agent West; 10% East
  • If there is a wall in the direction the agent would have been taken, the agent stays put
• The agent receives rewards each time step
  • Small "living" reward each step (can be negative)
  • Big rewards come at the end (good or bad)
• Goal: maximize sum of rewards

Recap: MDPs
• Markov decision processes:
  • Set of states S
  • Start state s_0
  • Set of actions A
  • Transitions P(s'|s,a) (or T(s,a,s')) and discount γ
  • Rewards R(s,a,s')
• MDP quantities so far:
  • Policy = choice of action for each state
  • Utility = sum of (discounted) rewards
• (A minimal code sketch of these MDP components follows below.)
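To make the recap concrete, here is a minimal Python sketch of the MDP components listed above (states S, actions A, transitions T(s,a,s'), rewards R(s,a,s'), discount γ). The specific states, transition probabilities, and reward values are invented for illustration and are not part of the course materials; a gridworld would be encoded in the same shape.

STATES = ["cool", "warm", "overheated"]      # hypothetical example states
ACTIONS = {"cool": ["slow", "fast"],         # legal actions per state
           "warm": ["slow", "fast"],
           "overheated": []}                 # terminal state: no actions

# Transition model: T[s][a] is a list of (next_state, probability) pairs
T = {
    "cool": {"slow": [("cool", 1.0)],
             "fast": [("cool", 0.5), ("warm", 0.5)]},
    "warm": {"slow": [("cool", 0.5), ("warm", 0.5)],
             "fast": [("overheated", 1.0)]},
    "overheated": {},
}

def R(s, a, s_next):
    # Illustrative reward function R(s, a, s'): small reward each step,
    # a big negative reward for landing in the bad terminal state.
    if s_next == "overheated":
        return -10.0
    return 1.0 if a == "slow" else 2.0

GAMMA = 0.9          # discount
START_STATE = "cool"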

2. Optimal Quantities
• The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally
• The value (utility) of a q-state (s, a): Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
• The optimal policy: π*(s) = optimal action from state s
• (In the lookahead tree: s is a state, (s, a) is a q-state, (s, a, s') is a transition)
• [Demo: gridworld values (L8D4)]
• [Figures: gridworld V*(s) values and gridworld Q*(s, a) values]

The Bellman Equations
• How to be optimal:
  • Step 1: Take correct first action
  • Step 2: Keep being optimal
• (The equations are written out below.)
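The slides show the Bellman optimality equations as images; for reference, here they are written out in LaTeX, using the T, R, γ notation from the recap slide:

V^*(s) = \max_a Q^*(s, a)

Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]

so that

V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right],
\qquad \pi^*(s) = \arg\max_a Q^*(s, a).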

3. The Bellman Equations
• Definition of "optimal utility" via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values
• These are the Bellman equations, and they characterize optimal values in a way we'll use over and over

Value Iteration
• Bellman equations characterize the optimal values V(s)
• Value iteration computes updates: V_{k+1}(s) from the successor values V_k(s') (a code sketch follows after this slide)
• Value iteration is just a fixed point solution method
  • ... though the V_k vectors are also interpretable as time-limited values

Convergence*
• How do we know the V_k vectors are going to converge?
• Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values
• Case 2: If the discount is less than 1
  • Sketch: For any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results in nearly identical search trees
  • The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
  • That last layer is at best all R_max
  • It is at worst R_min
  • But everything is discounted by γ^k that far out
  • So V_k and V_{k+1} are at most γ^k (R_max - R_min) different
  • So as k increases, the values converge

Policy Methods
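As a companion to the update rule, here is a short value-iteration sketch in Python. It assumes the illustrative MDP representation from the sketch after slide 1 (T[s][a] as a list of (s', p) pairs, a reward function R(s, a, s'), a discount gamma); the function name and the fixed iteration count are choices made here, not the course's project API.

def value_iteration(states, actions, T, R, gamma, iterations=100):
    V = {s: 0.0 for s in states}             # V_0 is zero everywhere
    for _ in range(iterations):
        V_next = {}
        for s in states:
            if not actions[s]:               # terminal state: value 0
                V_next[s] = 0.0
                continue
            # Bellman update: max over actions of the one-step lookahead
            V_next[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[s][a])
                for a in actions[s]
            )
        V = V_next                           # V_{k+1} replaces V_k
    return V

In practice one would stop once max_s |V_{k+1}(s) - V_k(s)| falls below a tolerance rather than running a fixed number of sweeps; the convergence argument on this slide is what guarantees that threshold is eventually reached when γ < 1.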

4. Fixed Policies
• Expectimax: compute the max over all actions to compute the optimal values (do the optimal action)
• For a fixed policy π(s), the tree would be simpler: only one action per state (do what π says to do)
  • ... though the tree's value would depend on which policy we use

Policy Evaluation

Utilities for a Fixed Policy
• Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π(s)
• Define the utility of a state s under a fixed policy π:
  • V^π(s) = expected total discounted rewards starting in s and following π
• Recursive relation (one-step look-ahead / Bellman equation): written out below this slide

Example: Policy Evaluation
• [Figures: gridworld values under the policies "Always Go Right" and "Always Go Forward"]
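The recursive relation referenced above appears as an image on the slide; written out in LaTeX (the standard fixed-policy Bellman equation, same notation as before), it is:

V^\pi(s) = \sum_{s'} T\big(s, \pi(s), s'\big) \left[ R\big(s, \pi(s), s'\big) + \gamma V^\pi(s') \right]

It is the optimality equation with the max over actions replaced by the single action π(s).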

5. Example: Policy Evaluation
• [Figures: resulting values for "Always Go Right" and "Always Go Forward"]

Policy Evaluation
• How do we calculate the V's for a fixed policy π?
• Idea 1: Turn the recursive Bellman equations into updates (like value iteration)
  • Efficiency: O(S^2) per iteration
• Idea 2: Without the maxes, the Bellman equations are just a linear system
  • Solve with Matlab (or your favorite linear system solver); a sketch of both ideas follows below

Policy Extraction

Computing Actions from Values
• Let's imagine we have the optimal values V*(s)
• How should we act?
  • It's not obvious!
• We need to do a mini-expectimax (one step)
• This is called policy extraction, since it finds the policy implied by the values
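Below is a sketch of the two policy-evaluation ideas, using the same illustrative MDP representation as before (T[s][a] -> [(s', p)], R(s, a, s'), discount gamma) and numpy as the "favorite linear system solver". Terminal states are simply those missing from the policy dict pi; the function names are invented here.

import numpy as np

def evaluate_policy_iterative(states, pi, T, R, gamma, iterations=100):
    # Idea 1: Bellman updates with the action fixed to pi(s) -- no max
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V = {s: (sum(p * (R(s, pi[s], s2) + gamma * V[s2])
                     for s2, p in T[s][pi[s]])
                 if s in pi else 0.0)
             for s in states}
    return V

def evaluate_policy_exact(states, pi, T, R, gamma):
    # Idea 2: without the max this is linear: (I - gamma * T_pi) v = r_pi
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T_pi, r_pi = np.zeros((n, n)), np.zeros(n)
    for s, a in pi.items():
        for s2, p in T[s][a]:
            T_pi[idx[s], idx[s2]] += p
            r_pi[idx[s]] += p * R(s, a, s2)
    v = np.linalg.solve(np.eye(n) - gamma * T_pi, r_pi)
    return {s: float(v[idx[s]]) for s in states}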

6. Computing Actions from Q-Values
• Let's imagine we have the optimal q-values
• How should we act?
  • Completely trivial to decide!
• Important lesson: actions are easier to select from q-values than from values! (a sketch of both extraction routines follows after this slide)

Policy Iteration

Problems with Value Iteration
• Value iteration repeats the Bellman updates
• Problem 1: It's slow: O(S^2 A) per iteration
• Problem 2: The "max" at each state rarely changes
• Problem 3: The policy often converges long before the values

k=0 [Demo: value iteration (L9D2); gridworld values at k = 0; Noise = 0.2, Discount = 0.9, Living reward = 0]
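A sketch of the contrast on this slide: extracting a policy from values needs the model (T and R) for a one-step mini-expectimax, while extracting it from q-values is just an argmax. The names follow the illustrative representation used earlier, and Q is assumed to be a dict keyed by (state, action) pairs.

def policy_from_values(states, actions, T, R, gamma, V):
    # one-step lookahead ("mini-expectimax") -- requires the model
    pi = {}
    for s in states:
        if not actions[s]:
            continue
        pi[s] = max(actions[s],
                    key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                      for s2, p in T[s][a]))
    return pi

def policy_from_q_values(states, actions, Q):
    # no model needed: just take the argmax action at each state
    return {s: max(actions[s], key=lambda a: Q[(s, a)])
            for s in states if actions[s]}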

7. [Gridworld value iteration demo: values at k = 1, 2, 3, 4; Noise = 0.2, Discount = 0.9, Living reward = 0]

8. [Gridworld value iteration demo: values at k = 5, 6, 7, 8; Noise = 0.2, Discount = 0.9, Living reward = 0]

9. [Gridworld value iteration demo: values at k = 9, 10, 11, 12; Noise = 0.2, Discount = 0.9, Living reward = 0]

10. [Gridworld value iteration demo: values at k = 100; Noise = 0.2, Discount = 0.9, Living reward = 0]

Policy Iteration
• Alternative approach for optimal values:
  • Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
  • Step 2: Policy improvement: update policy using one-step look-ahead with converged (but not optimal!) utilities as future values
  • Repeat steps until policy converges
• This is policy iteration (a code sketch follows after this slide)
  • It's still optimal!
  • Can converge (much) faster under some conditions

Policy Iteration
• Evaluation: For fixed current policy π, find values V^π (with policy evaluation)
  • Iterate until values converge
• Improvement: For fixed values, get a better policy (using policy extraction)
  • One-step look-ahead

Value Iteration vs Policy Iteration
• Both value iteration and policy iteration compute the same thing (all optimal values)
• In value iteration:
  • Every iteration updates both the values and (implicitly) the policy
  • We don't extract the policy, but taking the max over actions implicitly (re)computes it
• In policy iteration:
  • We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
  • After the policy is evaluated, we update the policy (slow like a value iteration pass)
  • The new policy will be better (or we're done)
• Both are dynamic programs for solving MDPs
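Putting the two steps together, here is a minimal policy-iteration sketch under the same illustrative representation (T[s][a] -> [(s', p)], R(s, a, s'), discount gamma). It uses iterative policy evaluation for Step 1; the exact linear-system solve from the earlier sketch would work equally well. The function name and iteration counts are choices made here, not the course API.

def policy_iteration(states, actions, T, R, gamma, eval_iters=50):
    # start from an arbitrary policy: first legal action in each non-terminal state
    pi = {s: actions[s][0] for s in states if actions[s]}
    while True:
        # Step 1: policy evaluation for the fixed policy pi (no max)
        V = {s: 0.0 for s in states}
        for _ in range(eval_iters):
            V = {s: (sum(p * (R(s, pi[s], s2) + gamma * V[s2])
                         for s2, p in T[s][pi[s]])
                     if s in pi else 0.0)
                 for s in states}
        # Step 2: policy improvement via one-step look-ahead on the fixed values
        new_pi = {s: max(actions[s],
                         key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                           for s2, p in T[s][a]))
                  for s in states if actions[s]}
        if new_pi == pi:        # policy stable: done (and still optimal)
            return pi, V
        pi = new_pi

For example, with the toy MDP from the first sketch, policy_iteration(STATES, ACTIONS, T, R, GAMMA) returns both the converged policy and its values.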
