
The Winding Path to RL: Markov Decision Processes & Reinforcement Learning (Ron Parr, Duke University). Decision Theory is the descriptive theory of optimal behavior; Markov Decision Processes are its mathematical/algorithmic realization.


  1. Markov Decision Processes & Reinforcement Learning
     Ron Parr, Duke University

     The Winding Path to RL
     • Decision Theory – Descriptive theory of optimal behavior
     • Markov Decision Processes – Mathematical/Algorithmic realization of Decision Theory (Lecture 1)
     • Reinforcement Learning – Application of learning techniques to challenges of MDPs with numerous or unknown parameters

     Covered Today
     • Decision Theory
     • MDPs
     • Algorithms for MDPs
       – Value Determination
       – Optimal Policy Selection
         • Value Iteration
         • Policy Iteration
         • Linear Programming

     Decision Theory
     What does it mean to make an optimal decision?
     • Asked by economists to study consumer behavior
     • Asked by MBAs to maximize profit
     • Asked by leaders to allocate resources
     • Asked in OR to maximize efficiency of operations
     • Asked in AI to model intelligence
     • Asked (sort of) by any intelligent person every day

     Utility Functions
     • A utility function is a mapping from world states to real numbers
     • Also called a value function
     • Rational or optimal behavior is typically viewed as maximizing expected utility:
         max_a Σ_s P(s | a) U(s)      (a = actions, s = states)

     Are Utility Functions Natural?
     • Some have argued that people don't really have utility functions
       – What is the utility of the current state?
       – What was your utility at 8:00pm last night?
     • Utility elicitation is a difficult problem
     • It's easy to communicate preferences
     • Given a plausible set of assumptions about preferences, a consistent utility function must exist
       (A more precise statement of this is a theorem.)
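A minimal sketch of the expected-utility principle max_a Σ_s P(s | a) U(s) from the slide above. The actions, states, probabilities, and utilities are illustrative assumptions, not part of the lecture:

```python
# Minimal sketch of expected-utility maximization: max_a sum_s P(s|a) * U(s).
# The states, actions, and numbers below are made up for illustration.

U = {"rich": 100.0, "poor": 0.0}            # utility of each world state
P = {                                        # P(s | a) for each action
    "invest": {"rich": 0.6, "poor": 0.4},
    "save":   {"rich": 0.2, "poor": 0.8},
}

def expected_utility(action):
    return sum(P[action][s] * U[s] for s in U)

best = max(P, key=expected_utility)
print(best, expected_utility(best))          # invest 60.0
```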

  2. Playing a Game Show
     • Assume a series of questions
       – Increasing difficulty
       – Increasing payoff
     • Choice:
       – Accept accumulated earnings and quit
       – Continue and risk losing everything
     • "Who wants to be a millionaire?"

     Swept under the rug today…
     • Utility of money (assumed 1:1)
     • How to determine costs/utilities
     • How to determine probabilities

     State Representation (simplified game)
     [Figure: a chain of states Start → 1 correct → 2 correct → 3 correct, facing questions worth
      $100, $1,000, $10,000, $100,000; a wrong answer drops to a $0 terminal state; accumulated
      earnings along the chain are $100, $1,100, $11,100, $111,100.]

     Making Optimal Decisions
     • Work backwards from the future to the present
     • Consider the $100,000 question
       – Suppose P(correct) = 1/10
       – V(stop) = $11,100
       – V(continue) = 0.9 * $0 + 0.1 * $100K = $10K
     • The optimal decision STOPS at the last step

     Working Recursively
     [Figure: the same chain with success probabilities 9/10, 3/4, 1/2, 1/10; wrong answers lead
      to $0; stop values $100, $1,100, $11,100; computed state values V = $3,747, $4,163, $5,550, $11.1K.]

     Decision Theory Summary
     • Provides a theory of optimal decisions
     • Principle of maximizing utility
     • Easy for small, tree-structured spaces with
       – Known utilities
       – Known probabilities
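A hedged sketch of the "work backwards from the future to the present" calculation for the simplified game show. The payoffs and probabilities come from the slides; the variable names are mine, and the final-question prize is taken as $100K, matching the slide's V(continue) arithmetic:

```python
# Backward induction over the four game-show questions.
p_correct = [9/10, 3/4, 1/2, 1/10]     # success probability of questions 1..4
bank      = [0, 100, 1100, 11100]      # accumulated earnings before questions 1..4
final_prize = 100_000                  # prize used in the slide's V(continue) calculation

V_next = final_prize                   # value of the state reached by answering question 4
for k in reversed(range(4)):           # questions 4, 3, 2, 1
    stop = bank[k]                     # accept accumulated earnings and quit
    cont = p_correct[k] * V_next       # a wrong answer loses everything (value 0)
    V_next = max(stop, cont)
    print(f"Before question {k+1}: V(stop)=${stop:,}, V(continue)=${cont:,.0f}, V=${V_next:,.0f}")

# Reproduces, up to rounding, the slide's values $11.1K, $5,550, $4,163, $3,747.
```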

  3. Covered Today
     • Decision Theory
     • MDPs
     • Algorithms for MDPs
       – Value Determination
       – Optimal Policy Selection
         • Value Iteration
         • Policy Iteration
         • Linear Programming

     Dealing with Loops
     Suppose you can pay $1000 (from any losing state) to play again.
     [Figure: the game-show chain with success probabilities 9/10, 3/4, 1/2, 1/10 and stop values
      $100, $1,100, $11,100; each losing state now has a $-1000 edge returning to Start.]

     From Policies to Linear Systems
     • Suppose we always pay until we win.
     • What is the value of following this policy?
         V(s0) = 0.10 (−1000 + V(s0)) + 0.90 V(s1)
         V(s1) = 0.25 (−1000 + V(s0)) + 0.75 V(s2)
         V(s2) = 0.50 (−1000 + V(s0)) + 0.50 V(s3)
         V(s3) = 0.90 (−1000 + V(s0)) + 0.10 (111,100)

     And the solution is…
     [Figure: the same chain annotated with values. Without the pay-to-replay option ("w/o cheat"):
      V = $3.7K, $4.1K, $5.6K, $11.1K. Always paying to replay: V = $82.4K, $82.6K, $83.0K, $84.4K;
      additional annotations: $-500, "Return to Start", "Continue".]
     • Is this optimal?
     • How do we find the optimal policy?

     The MDP Framework
     • State space: S
     • Action space: A
     • Transition function: P
     • Reward function: R
     • Discount factor: γ
     • Policy: π(s) → a
     • Objective: Maximize expected, discounted return (decision-theoretic optimal behavior)

     Applications of MDPs
     • AI/Computer Science
       – Robotic control (Koenig & Simmons, Thrun et al., Kaelbling et al.)
       – Air Campaign Planning (Meuleau et al.)
       – Elevator Control (Barto & Crites)
       – Computation Scheduling (Zilberstein et al.)
       – Control and Automation (Moore et al.)
       – Spoken dialogue management (Singh et al.)
       – Cellular channel allocation (Singh & Bertsekas)
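A sketch of solving the "always pay until we win" policy equations above as a linear system, assuming numpy is available. The four Bellman equations are simply rearranged into A V = b and solved directly (there is no discounting in the slide's equations):

```python
import numpy as np

# Rearranged from the slide's equations V(s_k) = p_wrong*(-1000 + V(s0)) + p_right*V(next state).
p = [0.90, 0.75, 0.50, 0.10]           # success probabilities of the four questions
prize = 111_100                        # payoff for answering the last question correctly

A = np.zeros((4, 4))
b = np.zeros(4)
for k in range(4):
    A[k, k] += 1.0
    A[k, 0] -= (1 - p[k])              # losing sends us back to s0 (after paying $1000)
    if k < 3:
        A[k, k + 1] -= p[k]            # answering correctly moves to the next state
    b[k] = -(1 - p[k]) * 1000
b[3] += p[3] * prize                   # winning the last question pays $111,100

V = np.linalg.solve(A, b)
print(V)   # roughly [82470, 82582, 82952, 84433], matching the slide's $82.4K-$84.4K
```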

  4. Applications of MDPs
     • EE/Control
       – Missile defense (Bertsekas et al.)
       – Inventory management (Van Roy et al.)
       – Football play selection (Patek & Bertsekas)
       – Packet retransmission (Feinberg et al.)

     Applications of MDPs
     • Economics/Operations Research
       – Fleet maintenance (Howard, Rust)
       – Road maintenance (Golabi et al.)
       – Nuclear plant management (Rothwell & Rust)
     • Agriculture
       – Herd management (Kristensen, Toft)

     The Markov Assumption
     • Let S_t be a random variable for the state at time t
     • P(S_t | A_{t−1} S_{t−1}, …, A_0 S_0) = P(S_t | A_{t−1} S_{t−1})
     • Markov is a special kind of conditional independence
     • The future is independent of the past given the current state

     Understanding Discounting
     • Mathematical motivation
       – Keeps values bounded
       – What if I promise you $0.01 every day you visit me?
     • Economic motivation
       – Discount comes from inflation
       – A promise of $1.00 in the future is worth $0.99 today
     • Probability of dying
       – Suppose ε probability of dying at each decision interval
       – Transition with probability ε to a state with value 0
       – Equivalent to a 1−ε discount factor

     Discounting in Practice
     • Often chosen unrealistically low
       – Faster convergence
       – Slightly myopic policies
     • Can reformulate most algorithms for average reward
       – Mathematically uglier
       – Somewhat slower run time

     Covered Today
     • Decision Theory
     • MDPs
     • Algorithms for MDPs
       – Value Determination
       – Optimal Policy Selection
         • Value Iteration
         • Policy Iteration
         • Linear Programming
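A small sketch of why discounting keeps values bounded, using the slide's penny-a-day example. Only the $0.01-per-day reward stream comes from the slides; the discount factor 0.99 and the finite horizon used to approximate the infinite sum are illustrative assumptions:

```python
# Discounted value of $0.01 per day forever, with discount factor gamma.
r, gamma, horizon = 0.01, 0.99, 100_000

direct = sum(r * gamma**t for t in range(horizon))   # truncated geometric series
closed = r / (1 - gamma)                              # closed form of the infinite sum
print(direct, closed)                                 # both about 1.0: worth roughly $1 today

# Equivalently, a per-step "probability of dying" eps, after which all rewards are 0,
# scales expected future reward by (1 - eps) per step, i.e. acts like gamma = 1 - eps.
```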

  5. Value Determination
     Determine the value of each state under policy π.
     Bellman Equation:
         V(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) V(s')
     Example: S1 has reward R = 1 and transitions to S2 with probability 0.4 and to S3 with probability 0.6, so
         V(s1) = 1 + γ (0.4 V(s2) + 0.6 V(s3))

     Matrix Form
         P_π = [ P(s1|s1,π(s1))  P(s2|s1,π(s1))  P(s3|s1,π(s1))
                 P(s1|s2,π(s2))  P(s2|s2,π(s2))  P(s3|s2,π(s2))
                 P(s1|s3,π(s3))  P(s2|s3,π(s3))  P(s3|s3,π(s3)) ]
         V = γ P_π V + R
     How do we solve this system?

     Solving for Values
         V = γ P_π V + R
     For moderate numbers of states we can solve this system exactly:
         V = (I − γ P_π)^{-1} R
     Guaranteed invertible because γ P_π has spectral radius < 1.

     Iteratively Solving for Values
         V = γ P_π V + R
     For larger numbers of states we can solve this system indirectly:
         V_{i+1} = γ P_π V_i + R
     Guaranteed convergent because γ P_π has spectral radius < 1.

     Establishing Convergence
     • Eigenvalue analysis
     • Monotonicity
       – Assume all values start pessimistic
       – One value must always increase
       – Can never overestimate
     • Contraction analysis…

     Contraction Analysis
     • Define the maximum norm: ‖V‖_∞ = max_i |V_i|
     • Consider V1 and V2 with ‖V1 − V2‖_∞ = ε
     • WLOG say V1 ≤ V2 + ε·1 (componentwise)
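A hedged sketch of value determination for a fixed policy, contrasting the exact solve V = (I − γP_π)^{-1} R with the iterative update V ← R + γ P_π V. The 3-state transition matrix and rewards are made-up numbers, except that state 1 has reward 1 and moves to states 2 and 3 with probabilities 0.4 and 0.6, as in the slide's example:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.0, 0.4, 0.6],     # P[i, j] = P(s_j | s_i, pi(s_i))
              [0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5]])
R = np.array([1.0, 0.0, 0.5])      # R(s, pi(s))

# Exact solution (fine for moderate numbers of states).
V_exact = np.linalg.solve(np.eye(3) - gamma * P, R)

# Iterative solution (successive approximation); converges because
# gamma * P has spectral radius < 1.
V = np.zeros(3)
for _ in range(500):
    V = R + gamma * P @ V

print(V_exact, V)   # the two answers agree to numerical precision
```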

  6. Contraction Analysis Contd.
     • At the next iteration, for V2:  V2' = R + γ P V2
     • For V1:
         V1' = R + γ P V1 ≤ R + γ P (V2 + ε·1) = R + γ P V2 + γ ε·1      (distribute)
     • Conclude:  ‖V2' − V1'‖_∞ ≤ γ ε

     Importance of Contraction
     • Any two value functions get closer
     • The true value function V* is a fixed point
     • The max-norm distance from V* decreases exponentially quickly with iterations:
         ‖V^{(n)} − V*‖_∞ ≤ γ^n ‖V^{(0)} − V*‖_∞

     Covered Today
     • Decision Theory
     • MDPs
     • Algorithms for MDPs
       – Value Determination
       – Optimal Policy Selection
         • Value Iteration
         • Policy Iteration
         • Linear Programming

     Finding Good Policies
     Suppose an expert told you the "value" of each state: V(S1) = 10, V(S2) = 5.
     [Figure: Action 1 moves to S1 with probability 0.7 and to S2 with probability 0.3;
      Action 2 moves to S1 with probability 0.5 and to S2 with probability 0.5.]

     Improving Policies
     • How do we get the optimal policy?
     • Need to ensure that we take the optimal action in every state:
         V(s) = max_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V(s') ]
     • Decision-theoretic optimal choice given V

     Value Iteration
     We can't solve the system directly with a max in the equation. Can we solve it by iteration?
         V_{i+1}(s) = max_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V_i(s') ]
     • Called value iteration or simply successive approximation
     • Same as value determination, but we can change actions
     • Convergence:
       – Can't do eigenvalue analysis (not linear)
       – Still monotonic
       – Still a contraction in max norm (exercise)
       – Converges exponentially quickly
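A hedged sketch of value iteration, V_{i+1}(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V_i(s') ], on a tiny two-state, two-action MDP. The transition probabilities and rewards are illustrative assumptions, not taken from the slides; the final greedy arg max over Q is the "decision-theoretic optimal choice given V" from the Improving Policies slide:

```python
import numpy as np

gamma = 0.9
# P[a, s, s'] = P(s' | s, a); R[s, a] = immediate reward (made-up numbers).
P = np.array([[[0.7, 0.3],
               [0.4, 0.6]],     # action 0
              [[0.5, 0.5],
               [0.1, 0.9]]])    # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(2)
for _ in range(1000):
    # Q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * V[s']
    Q = R + gamma * np.tensordot(P, V, axes=([2], [0])).T
    V = Q.max(axis=1)           # the max over actions makes this nonlinear

policy = Q.argmax(axis=1)       # greedy policy with respect to the converged V
print(V, policy)
```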
