
CSE 473: Markov Decision Processes (Dan Weld, 10/12/2012)


  1. Logistics
     • PS 2 due Tuesday → Thursday 10/18
     • PS 3 due Thursday 10/25

     CSE 473: Markov Decision Processes
     • Dan Weld
     • Many slides from Chris Bishop, Mausam, Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer

     Planning Agent (environment properties)
     • Static vs. Dynamic
     • Fully vs. Partially Observable
     • Deterministic vs. Stochastic
     • Perfect vs. Noisy percepts
     • Instantaneous vs. Durative actions
     • What action next?

     Markov Decision Processes (outline)
     • Planning Under Uncertainty
     • Mathematical Framework
     • Bellman Equations
     • Value Iteration
     • Real-Time Dynamic Programming
     • Policy Iteration
     • Reinforcement Learning
     • Andrey Markov (1856-1922)

     Objective of an MDP
     • Find a policy π : S → A which optimizes one of:
       • minimizes the expected cost to reach a goal
       • maximizes the expected reward
       • maximizes the expected (reward - cost)
     • given a ____ horizon: finite, infinite, or indefinite
     • with rewards discounted or undiscounted

     Review: Expectimax
     • What if we don't know what the result of an action will be? E.g.,
       • in solitaire, the next card is unknown
       • in pacman, the ghosts act randomly
     • Can do expectimax search (see the code sketch after this page):
       • Max nodes, as in minimax search
       • Chance nodes, like min nodes, except the outcome is uncertain: take the average (expectation) of the children
       • Calculate expected utilities (the example tree's leaves are 10, 4, 5, 7)
     • Today, we formalize this as a Markov Decision Process:
       • Handle intermediate rewards & infinite plans
       • More efficient processing
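     To make the chance-node rule concrete, here is a minimal expectimax sketch over an explicit tree. The node encoding (a bare number for a leaf, or a ("max" / "chance", children) pair with equally likely outcomes) is an assumption made for illustration; only the max/average rule and the 10, 4, 5, 7 example come from the slides.

```python
def expectimax(node):
    """Minimal expectimax over a tree of ('max'/'chance', children) nodes and numeric leaves."""
    if isinstance(node, (int, float)):          # leaf: return its utility
        return node
    kind, children = node
    values = [expectimax(c) for c in children]
    if kind == "max":                           # max node: pick the best child, as in minimax
        return max(values)
    return sum(values) / len(values)            # chance node: average (expectation) of children

# The slide's example: a chance node whose children are worth 10, 4, 5, 7.
print(expectimax(("chance", [10, 4, 5, 7])))    # -> 6.5
```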

  2. Grid World
     • Walls block the agent's path
     • The agent's actions may go astray (a code sketch of this noisy model follows this page):
       • 80% of the time, the North action takes the agent North (assuming no wall)
       • 10% of the time it actually goes West
       • 10% of the time it actually goes East
       • If there is a wall in the chosen direction, the agent stays put
     • A small "living" reward is received each step
     • Big rewards come at the end
     • Goal: maximize the sum of rewards

     Markov Decision Processes
     • An MDP is defined by:
       • A set of states s ∈ S
       • A set of actions a ∈ A
       • A transition function T(s, a, s'): the probability that a from s leads to s', i.e., P(s' | s, a); also called "the model"
       • A reward function R(s, a, s'); sometimes just R(s) or R(s')
       • A start state (or distribution)
       • Maybe a terminal state
     • MDPs are non-deterministic search problems
     • Reinforcement learning: MDPs where we don't know the transition or reward functions

     What is Markov about MDPs?
     • Andrey Markov (1856-1922)
     • "Markov" generally means that, conditioned on the present state, the future is independent of the past
     • For Markov decision processes, "Markov" means the action outcomes depend only on the current state:
       P(S_{t+1} = s' | S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, …) = P(S_{t+1} = s' | S_t = s_t, A_t = a_t)

     Solving MDPs
     • In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from the start to a goal
     • In an MDP, we want an optimal policy π* : S → A
       • A policy π gives an action for each state
       • An optimal policy maximizes expected utility if followed
       • It defines a reflex agent
     • (Figure: the optimal policy when R(s, a, s') = -0.03 for all non-terminal states s)

     Example Optimal Policies
     • (Figures: the optimal Grid World policy for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0)
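     Here is a minimal sketch of the noisy Grid World transition model described above. The (row, col) state encoding, the walls set, and the function name are illustrative assumptions; the 80%/10%/10% noise for North and the stay-put-on-walls rule come from the slide, and the analogous noise for the other three directions is assumed by symmetry.

```python
# Intended direction -> [(actual direction, probability)]. Only North's noise is
# stated on the slide; the other rows are assumed by symmetry.
NOISE = {
    "N": [("N", 0.8), ("W", 0.1), ("E", 0.1)],
    "S": [("S", 0.8), ("E", 0.1), ("W", 0.1)],
    "E": [("E", 0.8), ("N", 0.1), ("S", 0.1)],
    "W": [("W", 0.8), ("S", 0.1), ("N", 0.1)],
}
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}  # assumes row 0 at the top

def transition(state, action, walls, rows, cols):
    """Return {s': P(s' | state, action)} for the noisy Grid World moves."""
    dist = {}
    for actual, p in NOISE[action]:
        dr, dc = MOVES[actual]
        nxt = (state[0] + dr, state[1] + dc)
        # If the move hits a wall or leaves the grid, the agent stays put.
        if nxt in walls or not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
            nxt = state
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

# Example: moving North from (1, 0) on a 3x4 grid with a wall at (1, 1).
print(transition((1, 0), "N", walls={(1, 1)}, rows=3, cols=4))
```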

  3. Example Optimal Policies (continued)
     • (Figures: optimal policies for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0)

     Example: High-Low
     • Three card types: 2, 3, 4
     • Infinite deck, twice as many 2's
     • Start with 3 showing
     • After each card, you say "high" or "low"
     • A new card is flipped:
       • If you're right, you win the points shown on the new card
       • Ties are no-ops (no reward)
       • If you're wrong, the game ends
     • Differences from expectimax problems:
       • #1: you get rewards as you go
       • #2: you might play forever!

     High-Low as an MDP (sketched in code after this page)
     • States: 2, 3, 4, done
     • Actions: High, Low
     • Model, T(s, a, s'):
       • P(s'=4 | 4, Low) = 1/4
       • P(s'=3 | 4, Low) = 1/4
       • P(s'=2 | 4, Low) = 1/2
       • P(s'=done | 4, Low) = 0
       • P(s'=4 | 4, High) = 1/4
       • P(s'=3 | 4, High) = 0
       • P(s'=2 | 4, High) = 0
       • P(s'=done | 4, High) = 3/4
       • …
     • Rewards, R(s, a, s'):
       • the number shown on s' if the guess a was correct (e.g., s' > s ∧ a = "high"), …
       • 0 otherwise
     • Start: 3

     Search Tree: High-Low
     • (Figure: the expectimax-like tree rooted at state 3, with chance nodes for (3, High) and (3, Low) whose outcomes are labeled with their T and R values, e.g. T = 0.25, T = 0.5 and R = 0, R = 2, R = 3, R = 4)

     MDP Search Trees
     • Each MDP state gives an expectimax-like search tree:
       • s is a state
       • (s, a) is a q-state
       • (s, a, s') is called a transition
       • T(s, a, s') = P(s' | s, a)
       • R(s, a, s')
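     As a concrete version of the tables above, here is a sketch of the High-Low transition and reward functions. The function names and encodings (the string "done", lowercase action names) are illustrative assumptions; the probabilities follow the "twice as many 2's" deck, i.e. P(2) = 1/2, P(3) = P(4) = 1/4, and the rules stated on the slide.

```python
CARD_PROB = {2: 0.5, 3: 0.25, 4: 0.25}   # infinite deck, twice as many 2's

def T(s, a, s_next):
    """P(s' | s, a) for states 2, 3, 4, 'done' and actions 'high'/'low'."""
    if s == "done":
        return 1.0 if s_next == "done" else 0.0
    # A wrong guess ends the game; a right guess or a tie moves to the new card.
    p_done = sum(p for c, p in CARD_PROB.items()
                 if (a == "high" and c < s) or (a == "low" and c > s))
    if s_next == "done":
        return p_done
    correct_or_tie = (a == "high" and s_next >= s) or (a == "low" and s_next <= s)
    return CARD_PROB[s_next] if correct_or_tie else 0.0

def R(s, a, s_next):
    """Reward: points on the new card for a correct guess; 0 for ties and losses."""
    # Triples inconsistent with the guess have probability 0 under T, so their reward never matters.
    if s == "done" or s_next == "done" or s_next == s:
        return 0.0
    return float(s_next)

# Reproduces two entries from the slide: P(s'=2 | 4, Low) = 1/2, P(s'=done | 4, High) = 3/4.
print(T(4, "low", 2), T(4, "high", "done"))
```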

  4. Infinite Utilities?!
     • Problem: infinite state sequences have infinite rewards
     • Solutions:
       • Finite horizon: terminate episodes after a fixed T steps (e.g., life); this gives nonstationary policies (π depends on the time left)
       • Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "done" for High-Low)
       • Discounting: for 0 < γ < 1; a smaller γ means a smaller "horizon", i.e., a shorter-term focus

     Utilities of Sequences
     • In order to formalize the optimality of a policy, we need to understand the utilities of sequences of rewards
     • Typically we consider stationary preferences: [r, r_1, r_2, …] ≻ [r, r'_1, r'_2, …] ⟺ [r_1, r_2, …] ≻ [r'_1, r'_2, …]
     • Theorem: there are only two ways to define stationary utilities (a small sketch of the discounted case follows this page):
       • Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + …
       • Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ² r_2 + …

     Discounting
     • Typically we discount rewards by γ < 1 at each time step
       • Sooner rewards have higher utility than later rewards
       • It also helps the algorithms converge

     Recap: Defining MDPs
     • Markov decision processes:
       • States S
       • Start state s_0
       • Actions A
       • Transitions P(s' | s, a), aka T(s, a, s')
       • Rewards R(s, a, s') (and a discount γ)
     • MDP quantities so far:
       • Policy π = a function that chooses an action for each state
       • Utility (aka "return") = the sum of discounted rewards

     Optimal Utilities
     • Define the value of a state s:
       • V*(s) = the expected utility starting in s and acting optimally
     • Define the value of a q-state (s, a):
       • Q*(s, a) = the expected utility starting in s, taking action a, and thereafter acting optimally
     • Define the optimal policy:
       • π*(s) = the optimal action from state s

     Why Not Search Trees?
     • Why not solve with expectimax?
     • Problems:
       • This tree is usually infinite (why?)
       • The same states appear over and over (why?)
       • We would search once per state (why?)
     • Idea: value iteration
       • Compute optimal values for all states all at once using successive approximations
       • It will be a bottom-up dynamic program, similar in cost to memoization
       • Do all planning offline; no replanning is needed!
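     A tiny sketch of the discounted utility defined above; the function name and the default γ = 0.9 are illustrative assumptions.

```python
def discounted_utility(rewards, gamma=0.9):
    """U([r_0, r_1, r_2, ...]) = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Sooner rewards are worth more: the same +1 is worth less the later it arrives.
print(discounted_utility([1, 0, 0]))   # 1.0
print(discounted_utility([0, 0, 1]))   # 0.81
```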

  5. The Bellman Equations
     • The definition of "optimal utility" leads to a simple one-step look-ahead relationship between optimal utility values (Richard Bellman, 1920-1984)
     • Bellman equations for MDPs:
       • V*(s) = max_a Q*(s, a)
       • Q*(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]

     Bellman Backup (MDP)
     • Given an estimate of the V* function (say V_n), back up the V_n function at state s to calculate a new estimate V_{n+1}:
       • Q_{n+1}(s, a) = the value of the strategy "execute action a in s, then execute π_n subsequently", where π_n = argmax_{a ∈ Ap(s)} Q_n(s, a)
       • V_{n+1}(s) = max_a Q_{n+1}(s, a)

     Bellman Backup (example)
     • (Figure: s_0 with actions a_1, a_2, a_3 leading to successors s_1, s_2, s_3 with V_0(s_1) = 0, V_0(s_2) = 1, V_0(s_3) = 2; with γ ≈ 1:)
       • Q_1(s_0, a_1) = 2 + γ·0 ≈ 2
       • Q_1(s_0, a_2) = 5 + γ·(0.9·1 + 0.1·2) ≈ 6.1
       • Q_1(s_0, a_3) = 4.5 + γ·2 ≈ 6.5
       • V_1(s_0) = max ≈ 6.5

     Value iteration [Bellman '57]
     • Assign an arbitrary assignment of V_0 to each state
     • Repeat:
       • for all states s, compute V_{n+1}(s) by a Bellman backup at s
     • until max_s |V_{n+1}(s) - V_n(s)| < ε (ε-convergence; |V_{n+1}(s) - V_n(s)| is the residual at s)
     • (A code sketch of this loop follows this page.)

     Value Iteration
     • Idea:
       • Start with V_0*(s) = 0, which we know is right (why?)
       • Given V_i*, calculate the values for all states for depth i+1:
         V_{i+1}(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V_i(s') ]
       • This is called a value update or Bellman update
       • Repeat until convergence
     • Theorem: value iteration will converge to the unique optimal values
     • Basic idea: the approximations get refined towards the optimal values
     • The policy may converge long before the values do
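     Below is a compact sketch of the value-iteration loop just described. It assumes the MDP is supplied as a list of states, an actions(s) function, and T/R functions like the High-Low sketch earlier; the function signature and the default ε are illustrative choices, not the course's reference code.

```python
def value_iteration(states, actions, T, R, gamma=0.9, epsilon=1e-6):
    """Iterate Bellman backups over all states until the largest residual is below epsilon."""
    V = {s: 0.0 for s in states}               # start with V_0(s) = 0
    while True:
        new_V = {}
        for s in states:
            acts = actions(s)
            if not acts:                       # terminal / absorbing state: value stays 0
                new_V[s] = 0.0
                continue
            # Bellman backup: V_{n+1}(s) = max_a sum_s' T(s,a,s') [ R(s,a,s') + gamma * V_n(s') ]
            new_V[s] = max(
                sum(T(s, a, s2) * (R(s, a, s2) + gamma * V[s2]) for s2 in states)
                for a in acts
            )
        # epsilon-convergence: stop when the largest residual is small enough
        if max(abs(new_V[s] - V[s]) for s in states) < epsilon:
            return new_V
        V = new_V

# Example usage with the High-Low sketch from earlier, if its T and R are in scope:
# V = value_iteration([2, 3, 4, "done"],
#                     lambda s: [] if s == "done" else ["high", "low"],
#                     T, R, gamma=0.9)
```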
