
Markov Decision Processes & Reinforcement Learning (Lecture 1)

Ron Parr, Duke University

The Winding Path to RL

  • Decision Theory
    – Descriptive theory of optimal behavior
  • Markov Decision Processes
    – Mathematical/Algorithmic realization of Decision Theory
  • Reinforcement Learning
    – Application of learning techniques to the challenges of MDPs with numerous or unknown parameters

Covered Today

  • Decision Theory
  • MDPs
  • Algorithms for MDPs

– Value Determination
– Optimal Policy Selection

  • Value Iteration
  • Policy Iteration
  • Linear Programming

Decision Theory

What does it mean to make an optimal decision?

  • Asked by economists to study consumer behavior
  • Asked by MBAs to maximize profit
  • Asked by leaders to allocate resources
  • Asked in OR (operations research) to maximize efficiency of operations
  • Asked in AI to model intelligence
  • Asked (sort of) by any intelligent person every day

Utility Functions

  • A utility function is a mapping from world states to real numbers
  • Also called a value function
  • Rational or optimal behavior is typically viewed as maximizing expected utility:

$$\max_a \sum_s P(s \mid a)\, U(s)$$

where $a$ ranges over actions and $s$ over states.
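As a tiny illustration, a plain-Python sketch of this rule; the two actions, their outcome probabilities, and the utilities below are all made up:

# Expected-utility maximization: pick the action whose outcome
# distribution P(s|a) gives the highest expected utility U(s).
U = {"win": 100.0, "lose": 0.0}            # U(s): utility of each state
P = {                                      # P(s|a): outcome distribution per action
    "risky": {"win": 0.5, "lose": 0.5},
    "safe":  {"win": 0.3, "lose": 0.7},
}

def expected_utility(action):
    return sum(P[action][s] * U[s] for s in U)

best = max(P, key=expected_utility)
print(best, expected_utility(best))        # risky 50.0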

Are Utility Functions Natural?

  • Some have argued that people don’t really have utility functions
  • What is the utility of the current state?
  • What was your utility at 8:00pm last night?
  • Utility elicitation is a difficult problem
  • It’s easy to communicate preferences
  • Given a plausible set of assumptions about preferences, a consistent utility function must exist

(A more precise statement of this is a theorem.)


Swept under the rug today…

  • Utility of money (assumed 1:1)
  • How to determine costs/utilities
  • How to determine probabilities

Playing a Game Show

  • Assume a series of questions
    – Increasing difficulty
    – Increasing payoff
  • Choice:
    – Accept accumulated earnings and quit
    – Continue and risk losing everything
  • “Who wants to be a millionaire?”

State Representation (simplified game)

[Diagram: a chain of states labeled Start ($100 question), 1 correct ($1,000 question), 2 correct ($10,000 question), 3 correct ($100,000 question). A wrong answer from any state pays $0; quitting pays the accumulated winnings: $100, $1,100, $11,100, and $111,100 after the final question.]

Making Optimal Decisions

  • Work backwards from future to present
  • Consider $100,000 question

– Suppose P(correct) = 1/10
– V(stop) = $11,100
– V(continue) = 0.9 × $0 + 0.1 × $100K = $10K

  • Optimal decision STOPS at last step

Working Recursively

[Diagram: the same chain with success probabilities 9/10, 3/4, 1/2, and 1/10. Working backwards: V = $11.1K at the $100K question (stop), then V = $5,550, V = $4,163, and V = $3,747 at the start (continue at each earlier question).]
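A minimal backward-induction sketch of this computation in plain Python; following the slides, the final payoff is approximated as $100K (the exact figure is $111,100):

# Backward induction over the simplified game show.
# Each stage: (P(correct), accumulated winnings if you stop instead of playing).
stages = [(0.9, 0), (0.75, 100), (0.5, 1_100), (0.1, 11_100)]
FINAL_PAYOFF = 100_000          # the slides' approximation of $111,100

V = FINAL_PAYOFF                # value of having answered every question
for p, stop_value in reversed(stages):
    # A wrong answer pays $0, so continuing is worth p * V.
    V = max(stop_value, p * V)
print(V)                        # 3746.25, i.e. ~$3,747 at the first question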

Decision Theory Summary

  • Provides theory of optimal decisions
  • Principle of maximizing utility
  • Easy for small, tree-structured spaces with
    – Known utilities
    – Known probabilities


Covered Today

  • Decision Theory
  • MDPs
  • Algorithms for MDPs

– Value Determination
– Optimal Policy Selection

  • Value Iteration
  • Policy Iteration
  • Linear Programming

Dealing with Loops

Suppose you can pay $1,000 (from any losing state) to play again.

[Diagram: the game-show chain as before (success probabilities 9/10, 3/4, 1/2, 1/10; quit values $100, $1,100, $11,100), with $-1000 edges from the losing states back to the start.]

From Policies to Linear Systems

  • Suppose we always pay until we win.
  • What is the value of following this policy?

$$\begin{aligned}
V(s_1) &= 0.10\,(-1000 + V(s_1)) + 0.90\,V(s_2)\\
V(s_2) &= 0.25\,(-1000 + V(s_1)) + 0.75\,V(s_3)\\
V(s_3) &= 0.50\,(-1000 + V(s_1)) + 0.50\,V(s_4)\\
V(s_4) &= 0.90\,(-1000 + V(s_1)) + 0.10 \times 111{,}100
\end{aligned}$$

(The $-1000 + V(s_1)$ terms are the “return to start” branches; the remaining terms are the “continue” branches.)
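To sanity-check the solution on the next slide, a NumPy sketch that solves this 4-equation system directly (the state indexing is mine):

import numpy as np

# V = A @ V + b, rearranged to (I - A) @ V = b, then solved directly.
# Row i collects the coefficients of V(s_1)..V(s_4) in equation i.
A = np.array([
    [0.10, 0.90, 0.00, 0.00],   # V(s1) = .10(-1000 + V(s1)) + .90 V(s2)
    [0.25, 0.00, 0.75, 0.00],
    [0.50, 0.00, 0.00, 0.50],
    [0.90, 0.00, 0.00, 0.00],   # V(s4) = .90(-1000 + V(s1)) + .10 * 111100
])
b = np.array([0.10 * -1000, 0.25 * -1000, 0.50 * -1000,
              0.90 * -1000 + 0.10 * 111_100])

V = np.linalg.solve(np.eye(4) - A, b)
print(V.round())   # roughly [82470, 82581, 82952, 84433]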

And the solution is…

[Diagram: under this policy, V(start) = $82.4K, then $82.6K, $83.0K, and $84.4K at the $100K question, versus $3.7K without the pay-to-replay option.]

Is this optimal? How do we find the optimal policy?

The MDP Framework

  • State space: S
  • Action space: A
  • Transition function: P
  • Reward function: R
  • Discount factor: γ
  • Policy: π(s) → a

Objective: Maximize expected, discounted return (decision-theoretic optimal behavior)
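A minimal way to carry these five ingredients around in code (a sketch; the array shapes and the dataclass name are my assumptions, not anything fixed by the slides). The later sketches below reuse this (|A|, |S|, |S|) encoding:

import numpy as np
from dataclasses import dataclass

@dataclass
class MDP:
    P: np.ndarray     # transition function, shape (|A|, |S|, |S|): P[a, s, s']
    R: np.ndarray     # reward function, shape (|S|, |A|): R[s, a]
    gamma: float      # discount factor, 0 <= gamma < 1

# A deterministic policy pi can then be a length-|S| array of action indices,
# so that pi[s] is the action taken in state s.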

Applications of MDPs

  • AI/Computer Science

– Robotic control (Koenig & Simmons, Thrun et al., Kaelbling et al.)
– Air campaign planning (Meuleau et al.)
– Elevator control (Barto & Crites)
– Computation scheduling (Zilberstein et al.)
– Control and automation (Moore et al.)
– Spoken dialogue management (Singh et al.)
– Cellular channel allocation (Singh & Bertsekas)


Applications of MDPs

  • Economics/Operations Research

– Fleet maintenance (Howard, Rust)
– Road maintenance (Golabi et al.)
– Packet retransmission (Feinberg et al.)
– Nuclear plant management (Rothwell & Rust)

Applications of MDPs

  • EE/Control

– Missile defense (Bertsekas et al.)
– Inventory management (Van Roy et al.)
– Football play selection (Patek & Bertsekas)

  • Agriculture

– Herd management (Kristensen, Toft)

The Markov Assumption

  • Let $S_t$ be a random variable for the state at time $t$
  • $P(S_t \mid A_{t-1}, S_{t-1}, \ldots, A_0, S_0) = P(S_t \mid A_{t-1}, S_{t-1})$
  • Markov is a special kind of conditional independence
  • The future is independent of the past given the current state

Understanding Discounting

  • Mathematical motivation
    – Keeps values bounded
    – What if I promise you $0.01 every day you visit me?
  • Economic motivation
    – Discount comes from inflation
    – A promise of $1.00 in the future is worth $0.99 today
  • Probability of dying
    – Suppose ε probability of dying at each decision interval
    – Transition with probability ε to a state with value 0
    – Equivalent to a discount factor of 1−ε
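A worked version of the boundedness point: with a reward of $r$ every step, the undiscounted sum diverges, but discounting turns it into a convergent geometric series:

$$\sum_{t=0}^{\infty} \gamma^t r = \frac{r}{1-\gamma}, \qquad 0 \le \gamma < 1$$

For the $0.01-per-day promise with γ = 0.99, this caps the value at 0.01/(1 − 0.99) = $1.00.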

Discounting in Practice

  • Often chosen unrealistically low
    – Faster convergence
    – Slightly myopic policies
  • Can reformulate most algorithms for average reward
    – Mathematically uglier
    – Somewhat slower run time

Covered Today

  • Decision Theory
  • MDPs
  • Algorithms for MDPs

– Value Determination
– Optimal Policy Selection

  • Value Iteration
  • Policy Iteration
  • Linear Programming

Value Determination

Determine the value of each state under policy π, using the Bellman equation:

$$V(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V(s')$$

[Diagram: from S1 the policy reaches S2 with probability 0.4 and S3 with probability 0.6, with R = 1, giving:]

$$V(s_1) = 1 + \gamma\,(0.4\,V(s_2) + 0.6\,V(s_3))$$

Matrix Form

$$
P_\pi = \begin{pmatrix}
P(s_1 \mid s_1, \pi(s_1)) & P(s_2 \mid s_1, \pi(s_1)) & P(s_3 \mid s_1, \pi(s_1)) \\
P(s_1 \mid s_2, \pi(s_2)) & P(s_2 \mid s_2, \pi(s_2)) & P(s_3 \mid s_2, \pi(s_2)) \\
P(s_1 \mid s_3, \pi(s_3)) & P(s_2 \mid s_3, \pi(s_3)) & P(s_3 \mid s_3, \pi(s_3))
\end{pmatrix}
$$

$$V = \gamma P_\pi V + R$$

How do we solve this system?

Solving for Values

$$V = \gamma P_\pi V + R$$

For moderate numbers of states we can solve this system exactly:

$$V = (I - \gamma P_\pi)^{-1} R$$

Guaranteed invertible because $\gamma P_\pi$ has spectral radius < 1.
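A direct-solve sketch in NumPy; the 3-state transition matrix and rewards are invented, loosely echoing the earlier Bellman example:

import numpy as np

gamma = 0.9
P_pi = np.array([            # P_pi[s, s']: transitions under the fixed policy
    [0.0, 0.4, 0.6],         # made-up numbers for illustration
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
R = np.array([1.0, 0.0, 0.0])  # reward in each state under the policy

# V = (I - gamma * P_pi)^{-1} R; np.linalg.solve avoids forming the inverse.
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R)
print(V)   # [1., 0., 0.] for this particular choice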

Iteratively Solving for Values

$$V = \gamma P_\pi V + R$$

For larger numbers of states we can solve this system indirectly:

$$V_{i+1} = \gamma P_\pi V_i + R$$

Guaranteed convergent because $\gamma P_\pi$ has spectral radius < 1.
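The corresponding iteration on the same invented example; the starting guess and tolerance are arbitrary choices:

import numpy as np

gamma = 0.9
P_pi = np.array([[0.0, 0.4, 0.6],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
R = np.array([1.0, 0.0, 0.0])

V = np.zeros(3)                           # any starting guess works
while True:
    V_new = gamma * P_pi @ V + R          # one application of the backup
    if np.max(np.abs(V_new - V)) < 1e-10: # max-norm stopping test
        break
    V = V_new
print(V_new)   # converges to the direct solution [1., 0., 0.]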

Establishing Convergence

  • Eigenvalue analysis
  • Monotonicity

– Assume all values start pessimistic
– One value must always increase
– Can never overestimate

  • Contraction analysis…

Contraction Analysis

  • Define the max norm: $\|V\|_\infty = \max_i |V_i|$
  • Consider $V_1$ and $V_2$ with $\|V_1 - V_2\|_\infty = \varepsilon$
  • WLOG say $V_1 \le V_2 + \varepsilon\vec{1}$


Contraction Analysis Contd.

  • At the next iteration for $V_2$: $V_2' = R + \gamma P V_2$
  • For $V_1$, distributing $P$ over the sum (and using $P\vec{1} = \vec{1}$):

$$V_1' = R + \gamma P V_1 \le R + \gamma P (V_2 + \varepsilon\vec{1}) = R + \gamma P V_2 + \gamma\varepsilon\vec{1} = V_2' + \gamma\varepsilon\vec{1}$$

  • Conclude: $\|V_2' - V_1'\|_\infty \le \gamma\varepsilon$

Importance of Contraction

  • Any two value functions get closer
  • The true value function $V^*$ is a fixed point
  • Max-norm distance from $V^*$ decreases exponentially quickly with iterations:

$$\|V - V^*\|_\infty = \varepsilon \;\Rightarrow\; \|V^{(n)} - V^*\|_\infty \le \gamma^n \varepsilon$$

Covered Today

  • Decision Theory
  • MDPs
  • Algorithms for MDPs

– Value Determination
– Optimal Policy Selection

  • Value Iteration
  • Policy Iteration
  • Linear Programming

Finding Good Policies

Suppose an expert told you the “value” of each state: V(S1) = 10, V(S2) = 5.

[Diagram: Action 1 reaches S1 and S2 with probabilities 0.5 and 0.5; Action 2 reaches S1 and S2 with probabilities 0.7 and 0.3. Greedily, Action 2 wins: 0.7×10 + 0.3×5 = 8.5 versus 0.5×10 + 0.5×5 = 7.5.]

Improving Policies

  • How do we get the optimal policy?
  • Need to ensure that we take the optimal action in every state:

$$V(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]$$

This is the decision-theoretic optimal choice given V.

Value Iteration

We can’t solve the system directly with a max in the equation. Can we solve it by iteration?

$$V_{i+1}(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_i(s') \Big]$$

  • Called value iteration or simply successive approximation
  • Same as value determination, but we can change actions
  • Convergence:
    – Can’t do eigenvalue analysis (not linear)
    – Still monotonic
    – Still a contraction in max norm (exercise)
    – Converges exponentially quickly
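A compact value-iteration sketch in NumPy; the 2-state, 2-action MDP below is invented purely for illustration:

import numpy as np

gamma = 0.9
# P[a, s, s'] and R[s, a] for a made-up 2-state, 2-action MDP.
P = np.array([[[0.5, 0.5], [0.0, 1.0]],
              [[0.7, 0.3], [0.2, 0.8]]])
R = np.array([[1.0, 0.5],
              [0.0, 0.0]])

V = np.zeros(2)
while True:
    # Q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * V[s']
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)                 # max over actions
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V_new, Q.argmax(axis=1))   # optimal values and greedy actions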


Covered Today
  • Decision Theory
  • MDPs
  • Algorithms for MDPs

– Value Determination
– Optimal Policy Selection

  • Value Iteration
  • Policy Iteration
  • Linear Programming

Greedy Policy Construction

Pick the action with the highest expected future value:

$$\pi(s) = \arg\max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]$$

The sum is an expectation over next-state values; we write this as π = greedy(V).
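The same update as a standalone helper, sketched against the array encoding used in the earlier examples:

import numpy as np

def greedy(P, R, V, gamma):
    """pi[s] = argmax_a [ R[s,a] + gamma * sum_s' P[a,s,s'] * V[s'] ]."""
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.argmax(axis=1)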

Consider our first policy

[Diagram: the game-show chain with the $-1000 replay option. Recall: we played until the last state, then quit, with values $3.7K, $4.1K, $5.6K, $11.1K without the cheat. Is that policy greedy once the cheat option is available?]

Bootstrapping: Policy Iteration

Idea: greedy selection is useful even with a suboptimal V.

Guess V, then repeat until the policy doesn’t change:
  π = greedy(V)
  V = value of acting on π

  • Guaranteed to find the optimal policy
  • Usually takes a very small number of iterations
  • Computing the value functions is the expensive part
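A sketch of the loop in NumPy, reusing the (|A|, |S|, |S|) transition encoding from the earlier examples (the function name and details are mine):

import numpy as np

def policy_iteration(P, R, gamma):
    """P[a, s, s'], R[s, a] -> an optimal policy and its value function."""
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
        P_pi = P[pi, np.arange(n_states), :]    # row s is P[pi[s], s, :]
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):          # stop when policy is stable
            return pi, V
        pi = pi_new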

Comparing VI and PI

  • VI
    – Value changes at every step
    – Policy may change at every step
    – Many cheap iterations
  • PI
    – Alternates policy/value updates
    – Solves for the value of each policy exactly
    – Fewer, slower iterations (need to invert a matrix)
  • Convergence
    – Both are contractions in max norm
    – PI is shockingly fast in practice (why?)


Covered Today

  • Decision Theory
  • MDPs
  • Algorithms for MDPs

– Value Determination
– Optimal Policy Selection

  • Value Iteration
  • Policy Iteration
  • Linear Programming

Linear Programming

Issue: turn the non-linear max into a collection of linear constraints.

$$V(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]$$

becomes:

MINIMIZE $\sum_s V(s)$ subject to

$$\forall s, a:\quad V(s) \ge R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')$$

Weakly polynomial; slower than PI in practice. The optimal action’s constraints are tight.
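A sketch of this program with scipy.optimize.linprog, which minimizes c·x subject to A_ub x ≤ b_ub; each constraint is therefore rewritten as (γ P(· | s,a) − e_s)·V ≤ −R(s,a). The small MDP arrays are invented:

import numpy as np
from scipy.optimize import linprog

gamma = 0.9
P = np.array([[[0.5, 0.5], [0.0, 1.0]],     # P[a, s, s'], made up
              [[0.7, 0.3], [0.2, 0.8]]])
R = np.array([[1.0, 0.5],                   # R[s, a], made up
              [0.0, 0.0]])
n_actions, n_states, _ = P.shape

# One row per (s, a) pair: gamma * P(.|s,a) - e_s, so A_ub @ V <= -R[s,a].
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        row = gamma * P[a, s, :]
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[s, a])

# Minimize sum_s V(s); V is unbounded in sign, so override the default bounds.
res = linprog(c=np.ones(n_states), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
print(res.x)   # the optimal value function V*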

MDP Difficulties → RL

  • MDPs operate at the level of states
    – States = atomic events
    – We usually have exponentially (or infinitely) many of these
  • We assume P and R are known
  • Machine learning to the rescue!
    – Infer P and R (implicitly or explicitly) from data
    – Generalize from a small number of states/policies