Introduction to Artificial Intelligence
V22.0472-001 Fall 2009
Lecture 9: Markov Decision Processes
Rob Fergus – Dept of Computer Science, Courant Institute, NYU
Many slides from Dan Klein, Stuart Russell or Andrew Moore

Announcements
• Assignment 1 graded
• Come and see me after class if you have questions

Reinforcement Learning
• Basic idea:
  • Receive feedback in the form of rewards
  • The agent's utility is defined by the reward function
  • Must learn to act so as to maximize expected rewards

Grid World
• The agent lives in a grid
• Walls block the agent's path
• The agent's actions do not always go as planned:
  • 80% of the time, the action North takes the agent North (if there is no wall there)
  • 10% of the time, North takes the agent West; 10% East
  • If there is a wall in the direction the agent would have been taken, the agent stays put
• Small "living" reward each step
• Big rewards come at the end
• Goal: maximize the sum of rewards (a small code sketch of this movement model appears at the end of this section)

Markov Decision Processes
• An MDP is defined by:
  • A set of states s ∈ S
  • A set of actions a ∈ A
  • A transition function T(s, a, s'): the probability that a taken in s leads to s', i.e. P(s' | s, a); also called the model
  • A reward function R(s, a, s'); sometimes just R(s) or R(s')
  • A start state (or distribution)
  • Maybe a terminal state
• MDPs are a family of non-deterministic search problems
• Reinforcement learning: MDPs where we don't know the transition or reward functions

What is Markov about MDPs?
• Andrey Markov (1856-1922)
• "Markov" generally means that given the present state, the future and the past are independent
• For Markov decision processes, "Markov" means the dynamics are first-order Markov: the next state depends only on the current state and action
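The 80/10/10 movement noise described on the Grid World slide can be written down directly as a transition function. The following is a minimal sketch, not from the lecture: the grid size, wall layout, and helper names are made up for illustration.

# Sketch of the noisy grid-world transition model described above.
# The 80/10/10 action noise follows the slide; the grid size and wall layout
# here are invented example values.

NOISE_SIDE = 0.1          # probability of slipping to each perpendicular direction
MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
PERPENDICULAR = {'N': ('W', 'E'), 'S': ('E', 'W'), 'E': ('N', 'S'), 'W': ('S', 'N')}

WALLS = {(1, 1)}          # example wall cell
ROWS, COLS = 3, 4         # example 3x4 grid

def step(state, direction):
    """Deterministic move; stay put if a wall or the grid edge blocks it."""
    r, c = state
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    if (nr, nc) in WALLS or not (0 <= nr < ROWS and 0 <= nc < COLS):
        return state
    return (nr, nc)

def grid_T(state, action):
    """Return a dict {next_state: probability} implementing P(s' | s, a)."""
    left, right = PERPENDICULAR[action]
    outcomes = {}
    for direction, prob in [(action, 1 - 2 * NOISE_SIDE), (left, NOISE_SIDE), (right, NOISE_SIDE)]:
        s_next = step(state, direction)
        outcomes[s_next] = outcomes.get(s_next, 0.0) + prob
    return outcomes

For example, with this (made-up) layout, grid_T((2, 0), 'N') returns {(1, 0): 0.8, (2, 0): 0.1, (2, 1): 0.1}: the West slip hits the grid edge, so the agent stays put with probability 0.1.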

Solving MDPs
• In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from start to a goal
• In an MDP, we want an optimal policy π*: S → A
  • A policy π gives an action for each state
  • An optimal policy maximizes expected utility if followed
  • Defines a reflex agent
(Figure: the optimal grid-world policy when R(s, a, s') = -0.03 for all non-terminal states s)

Example Optimal Policies
(Figures: optimal grid-world policies for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0)

Example: High-Low
• Three card types: 2, 3, 4
• Infinite deck, twice as many 2's
• Start with 3 showing
• After each card, you say "high" or "low"
• A new card is flipped
• If you're right, you win the points shown on the new card
• Ties are no-ops
• If you're wrong, the game ends
• Differences from expectimax:
  • #1: you get rewards as you go
  • #2: you might play forever!

High-Low as an MDP (see the code sketch at the end of this section)
• States: 2, 3, 4, done
• Actions: High, Low
• Model T(s, a, s'):
  • P(s'=4 | 4, Low) = 1/4
  • P(s'=3 | 4, Low) = 1/4
  • P(s'=2 | 4, Low) = 1/2
  • P(s'=done | 4, Low) = 0
  • P(s'=4 | 4, High) = 1/4
  • P(s'=3 | 4, High) = 0
  • P(s'=2 | 4, High) = 0
  • P(s'=done | 4, High) = 3/4
  • …
• Rewards R(s, a, s'):
  • The number shown on s' if s ≠ s'
  • 0 otherwise
• Start: 3

Example: High-Low search tree
(Figure: the expectimax-like tree for High-Low, with branches labeled T = 0.5, 0.25, 0, 0.25 and R = 3, 4, 0, 2)

MDP Search Trees
• Each MDP state gives an expectimax-like search tree:
  • s is a state
  • (s, a) is a q-state
  • (s, a, s') is called a transition, with T(s, a, s') = P(s' | s, a) and reward R(s, a, s')
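The High-Low rules above pin down the whole MDP, so the transition model listed on the slide can be generated rather than enumerated by hand. A minimal Python sketch follows (the function and variable names are mine, not from the slides); the asserts check it against the probabilities the slide lists for state 4.

# Sketch of the High-Low MDP from the slides: infinite deck with twice as many 2's,
# so the next card is 2 with prob 1/2 and 3 or 4 with prob 1/4 each.
CARD_PROBS = {2: 0.5, 3: 0.25, 4: 0.25}
STATES = [2, 3, 4, 'done']
ACTIONS = ['High', 'Low']

def T(s, a):
    """P(s' | s, a) as a dict: a right guess (or a tie) continues from the new card; a wrong guess ends the game."""
    if s == 'done':
        return {'done': 1.0}
    dist = {2: 0.0, 3: 0.0, 4: 0.0, 'done': 0.0}
    for card, p in CARD_PROBS.items():
        correct = (a == 'High' and card > s) or (a == 'Low' and card < s)
        tie = (card == s)
        if correct or tie:
            dist[card] += p      # keep playing from the new card
        else:
            dist['done'] += p    # wrong guess: game over
    return dist

def R(s, a, s_next):
    """Reward: the points on the new card when play continues to a different card, else 0."""
    return s_next if (s_next != 'done' and s_next != s) else 0

# Matches the probabilities listed on the slide for state 4:
assert T(4, 'Low') == {2: 0.5, 3: 0.25, 4: 0.25, 'done': 0.0}
assert T(4, 'High') == {2: 0.0, 3: 0.0, 4: 0.25, 'done': 0.75}

With this model in hand, the value-iteration sketch later in the lecture can be run on High-Low directly.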

Utilities of Sequences
• In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards
• Typically we consider stationary preferences: prepending the same reward to two reward sequences does not change which sequence is preferred
• Theorem: there are only two ways to define stationary utilities
  • Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + …
  • Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ² r_2 + …  (a small numeric example appears at the end of this section)

Infinite Utilities?!
• Problem: infinite state sequences have infinite rewards
• Solutions:
  • Finite horizon: terminate episodes after a fixed T steps (e.g. life); gives nonstationary policies (π depends on the time left)
  • Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "done" for High-Low)
  • Discounting: for 0 < γ < 1, use U([r_0, r_1, …]) = Σ_t γ^t r_t; a smaller γ means a smaller "horizon" – a shorter-term focus

Discounting
• Typically we discount rewards by γ < 1 each time step
• Sooner rewards have higher utility than later rewards
• Discounting also helps the algorithms converge

Recap: Defining MDPs
• Markov decision processes:
  • States S
  • Start state s_0
  • Actions A
  • Transitions P(s'|s,a) (or T(s,a,s'))
  • Rewards R(s,a,s') (and discount γ)
• MDP quantities so far:
  • Policy = a choice of action for each state
  • Utility (or return) = the sum of discounted rewards

Optimal Utilities
• Fundamental operation: compute the values (optimal expectimax utilities) of states s
• Why? Optimal values define optimal policies!
• Define the value of a state s:
  V*(s) = expected utility starting in s and acting optimally
• Define the value of a q-state (s, a):
  Q*(s, a) = expected utility starting in s, taking action a, and thereafter acting optimally
• Define the optimal policy:
  π*(s) = the optimal action from state s

The Bellman Equations
• The definition of "optimal utility" leads to a simple one-step lookahead relationship amongst optimal utility values: optimal rewards = maximize over the first action and then follow the optimal policy
• Formally:
  V*(s) = max_a Q*(s, a)
  Q*(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
  V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
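A tiny numeric illustration of the discounted-utility definition above; the reward sequence and discount values are arbitrary example choices, not from the slides.

# Discounted utility of a reward sequence: U = r_0 + γ*r_1 + γ²*r_2 + ...
def discounted_utility(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([1, 1, 1, 1], gamma=0.9))  # ≈ 1 + 0.9 + 0.81 + 0.729 = 3.439
print(discounted_utility([1, 1, 1, 1], gamma=0.5))  # ≈ 1.875: smaller γ, nearer rewards dominate

The second call shows the "smaller horizon" point: with γ = 0.5 the later rewards contribute much less to the total than with γ = 0.9.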

Solving MDPs
• We want to find the optimal policy π*
• Proposal 1: modified expectimax search, starting from each state s
(Figure: one layer of the search tree, s → a → (s, a) → (s, a, s') → s')

Why Not Search Trees?
• Why not solve with expectimax? Problems:
  • The tree is usually infinite
  • The same states appear over and over
  • We would search once per state
• Idea: value iteration
  • Compute optimal values for all states all at once using successive approximations
  • This is a bottom-up dynamic program, similar in cost to memoization
  • Do all planning offline; no replanning needed!

Value Estimates
• Calculate estimates V_k*(s)
  • Not the optimal value of s!
  • The optimal value considering only the next k time steps (k rewards)
  • As k → ∞, it approaches the optimal value
• Why:
  • If discounting, distant rewards become negligible
  • If terminal states are reachable from everywhere, the fraction of episodes that never end becomes negligible
  • Otherwise, we can get infinite expected utility, and then this approach actually won't work
• What happened to the evaluation function?

Memoized Recursion?
• Recurrences:
  V_0*(s) = 0
  V_k*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_{k-1}*(s') ]
• Cache all function call results so you never repeat work

Value Iteration
• Problems with the recursive computation:
  • Have to keep all the V_k*(s) around all the time
  • Don't know which depth π_k(s) to ask for when planning
• Solution: value iteration (see the sketch at the end of this section)
  • Calculate values for all states, bottom-up
  • Keep increasing k until convergence
• Idea:
  • Start with V_0*(s) = 0, which we know is right (why?)
  • Given V_i*, calculate the values for all states for depth i+1:
    V_{i+1}*(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_i*(s') ]
  • This is called a value update or Bellman update
  • Repeat until convergence
• Theorem: value iteration converges to the unique optimal values
  • Basic idea: the approximations get refined towards the optimal values
  • The policy may converge long before the values do
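The Bellman update above translates almost line for line into code. Below is a sketch, reusing the hypothetical STATES, ACTIONS, T, and R helpers from the High-Low sketch earlier; the discount and stopping tolerance are example choices.

# Value iteration sketch using the T(s, a) and R(s, a, s') helpers defined in the
# High-Low sketch above. Discount and tolerance are arbitrary example values.
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                 # V_0*(s) = 0
    while True:
        V_next, delta = {}, 0.0
        for s in states:
            if s == 'done':                      # terminal state keeps value 0
                V_next[s] = 0.0
                continue
            # Bellman update: max over actions of expected reward plus discounted future value
            V_next[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a).items())
                for a in actions
            )
            delta = max(delta, abs(V_next[s] - V[s]))
        V = V_next
        if delta < tol:                          # stop once the values barely change
            return V

V_star = value_iteration(STATES, ACTIONS, T, R)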

Example: Bellman Updates
• Example: γ = 0.9, living reward = 0, noise = 0.2
(Figure: one Bellman update in the grid world; the max is achieved for a = right, other actions not shown)

Example: Value Iteration  [DEMO]
(Figures: value estimates V_2 and V_3)
• Information propagates outward from the terminal states, and eventually all states have correct value estimates

Convergence*
• Define the max-norm: ||U|| = max_s |U(s)|
• Theorem: for any two approximations U and V,
  ||U_{i+1} − V_{i+1}|| ≤ γ ||U_i − V_i||
  • I.e. any two distinct approximations must get closer to each other; so, in particular, any approximation must get closer to the true values, and value iteration converges to a unique, stable, optimal solution
• Theorem: once the change in our approximation is small, it must also be close to the correct values

Practice: Computing Actions
• Which action should we choose from state s?
  • Given optimal values V:
    π*(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
  • Given optimal q-values Q:
    π*(s) = argmax_a Q*(s, a)
• Lesson: actions are easier to select from Q's!

Recap: MDPs
• Markov decision processes:
  • States S
  • Actions A
  • Transitions P(s'|s,a) (or T(s,a,s'))
  • Rewards R(s,a,s') (and discount γ)
  • Start state s_0
• Quantities:
  • Returns = the sum of discounted rewards
  • Values = expected future returns from a state (optimal, or for a fixed policy)
  • Q-values = expected future returns from a q-state (optimal, or for a fixed policy)

Utilities for Fixed Policies
• Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π
• Define the utility of a state s under a fixed policy π:
  V^π(s) = expected total discounted reward (return) starting in s and following π
• Recursive relation (one-step look-ahead / Bellman equation):
  V^π(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
(Figure: the fixed-policy look-ahead tree s → (s, π(s)) → s'; see the sketch at the end of this section)
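A sketch of the two operations above: extracting the greedy policy from optimal values (the argmax over one-step lookaheads) and evaluating a fixed policy by iterating its Bellman equation. It reuses the hypothetical STATES, ACTIONS, T, R, and V_star names from the earlier sketches; none of these names come from the slides.

# Sketch: extract a greedy policy from optimal values, and evaluate a fixed policy.
def extract_policy(V, states, actions, T, R, gamma=0.9):
    """pi*(s) = argmax_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V(s') ]."""
    policy = {}
    for s in states:
        if s == 'done':
            continue
        policy[s] = max(
            actions,
            key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a).items())
        )
    return policy

def evaluate_policy(policy, states, T, R, gamma=0.9, tol=1e-6):
    """Iterate the fixed-policy Bellman equation until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s == 'done':
                continue
            a = policy[s]
            new_v = sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a).items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

pi_star = extract_policy(V_star, STATES, ACTIONS, T, R)
V_pi = evaluate_policy(pi_star, STATES, T, R)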
