1
Reading: Kaelbling et al. 1996 (see class website)
Markov Decision Processes (MDPs)
Machine Learning 10701/15781
Carlos Guestrin
Carnegie Mellon University
May 1st, 2006
2
Project:
Poster session: Friday May 5th 2-5pm, NSH Atrium
please arrive a little early to set up
FCEs!!!!
Please, please, please, please, please, please give
us your feedback, it helps us improve the class! ☺
http://www.cmu.edu/fce
3
People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is:
(reward now) + γ (reward in 1 time step) + γ² (reward in 2 time steps) + γ³ (reward in 3 time steps) + … (infinite sum)
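Written compactly (a restatement of the sum above, with the standard geometric-series bound added for context, not taken from the slide):

```latex
V \;=\; \sum_{t=0}^{\infty} \gamma^{t} r_t ,
\qquad
|V| \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}
\quad \text{when } |r_t| \le R_{\max} \text{ and } \gamma \in [0,1).
```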
4
Define:
  VA = expected discounted future rewards starting in state A
  VB = expected discounted future rewards starting in state B
  VT = expected discounted future rewards starting in state T
  VS = expected discounted future rewards starting in state S
  VD = expected discounted future rewards starting in state D
How do we compute VA, VB, VT, VS, VD?
[Figure: Markov chain over states A. Assistant Prof (reward 20), B. Assoc. Prof (60), T. Tenured Prof (400), S. On the Street (10), D. Dead; the arrows are labeled with transition probabilities (0.2, 0.3, 0.6, 0.7). Assume discount factor γ = 0.9.]
5
Assume discount factor γ = 0.9
[Figure: same Markov chain as above — A. Assistant Prof (20), B. Assoc. Prof (60), T. Tenured Prof (400), S. On the Street (10), D. Dead, with transition probabilities labeling the arrows.]
6
Markov Decision Process (MDP) Representation:
State space:
  Joint state x of entire system
Action space:
  Joint action a = {a1,…, an} for all agents
Reward function:
  Total reward R(x,a)
  sometimes reward can depend on action
Transition model:
  Dynamics of the entire system P(x'|x,a)
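As a concrete illustration (not from the slides), a small tabular MDP can be represented directly with arrays; the sizes and numbers below are made up for the sketch, not the peasants/footmen example:

```python
import numpy as np

# A tiny tabular MDP: |X| states, |A| actions (illustrative values only).
n_states, n_actions = 3, 2

# Reward function R(x, a): one entry per state-action pair.
R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [10.0, 10.0]])

# Transition model P(x' | x, a): P[a, x, x'] sums to 1 over x'.
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[1.0, 0.0, 0.0],   # action 0: mostly stay put
        [0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0]]
P[1] = [[0.5, 0.5, 0.0],   # action 1: drift toward the rewarding state
        [0.0, 0.5, 0.5],
        [0.0, 0.0, 1.0]]

gamma = 0.9  # discount factor

# Sanity check: every row of every P[a] is a probability distribution.
assert np.allclose(P.sum(axis=2), 1.0)
```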
7
[Figure: example trajectory through states x0, x1, x2]
8
Start from x0 and follow policy π
Vπ(x0) = Eπ[R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ⋯]
Future rewards discounted by γ ∈ [0,1)
[Figure: tree of possible trajectories — from x0 with action π(x0) the next state may be x1, x1', or x1''; each state is labeled with its reward R(·) and the policy's action π(·), continuing on through x2, x3, x4.]
9
Vπ(x0) = Eπ[R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ⋯]
Discounted value of a state:
  value of starting from x0 and continuing with policy π from then on
A recursion!
  Vπ(x) = R(x, π(x)) + γ ∑x' P(x'|x, π(x)) Vπ(x')
10
Solve by simple matrix inversion:
  Vπ = (I − γ Pπ)⁻¹ R
where Pπ is the transition matrix under policy π, i.e. Pπ[x, x'] = P(x'|x, π(x)), and R is the vector of rewards R(x, π(x)).
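A minimal sketch of this direct solve (illustrative only; the rewards, transition matrix, and policy below are made-up stand-ins, not the academic-career example in the figure):

```python
import numpy as np

n_states, gamma = 3, 0.9

# Rewards R(x, pi(x)) and transition matrix P_pi[x, x'] under a fixed policy pi
# (made-up numbers, just to have something concrete to solve).
R_pi = np.array([0.0, 1.0, 10.0])
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])

# V_pi = (I - gamma * P_pi)^{-1} R_pi  -- solve the linear system rather than
# forming the inverse explicitly.
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(V_pi)
```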
11
If you have 1,000,000 states, inverting a 1,000,000 × 1,000,000 matrix is hard!
Can solve using a simple convergent iterative approach (a.k.a. dynamic programming):
Start with some guess V0
Iteratively say:
  Vt+1 = R + γ Pπ Vt
Stop when ||Vt+1 − Vt||∞ ≤ ε
  means that ||Vπ − Vt+1||∞ ≤ ε/(1−γ)
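A sketch of this iterative evaluation, reusing the same made-up R_pi, P_pi, and γ as in the matrix-inversion snippet above:

```python
import numpy as np

gamma, eps = 0.9, 1e-6
R_pi = np.array([0.0, 1.0, 10.0])
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])

V = np.zeros(3)                              # start with some guess V0
while True:
    V_next = R_pi + gamma * P_pi @ V         # Vt+1 = R + gamma * P_pi * Vt
    if np.max(np.abs(V_next - V)) <= eps:    # stop when ||Vt+1 - Vt||_inf <= eps
        V = V_next
        break
    V = V_next

print(V)  # within eps/(1-gamma) of the true V_pi in the infinity norm
```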
12
Policy: π(x) = a
  At state x, take action a (for all agents)
  π(x0) = both peasants get wood
  π(x1) = one peasant builds barrack, other gets gold
  π(x2) = peasants get gold, footmen attack

So far, we've seen how good a policy is…
But how can we choose the best policy???
Suppose there was only one time step:
  the world is about to end!!!
  select the action that maximizes immediate reward!
13
Two time steps: address the tradeoff
  good reward now vs. better reward in the future
14
Choose actions that lead to the best value in the long run
The optimal policy π* achieves the optimal value V*
15
From evaluating a policy π to computing the optimal value V* — the Bellman equation:
  V*(x) = max_a [ R(x,a) + γ ∑x' P(x'|x,a) V*(x') ]
16
The optimal policy is greedy with respect to V*:
  π*(x) = argmax_a [ R(x,a) + γ ∑x' P(x'|x,a) V*(x') ]
17
Slightly surprising fact: There is only one V* that solves the Bellman equation!
  there may be many optimal policies that achieve V*
Surprising fact: optimal policies are good everywhere!!!
  Vπ*(x) ≥ Vπ(x), for all states x and all policies π
18
Many algorithms solve the Bellman equation
  V*(x) = max_a [ R(x,a) + γ ∑x' P(x'|x,a) V*(x') ]
  Policy iteration [Howard '60, Bellman '57]
  Value iteration [Bellman '57]
  Linear programming [Manne '60]
  …
19
Start with some guess V0
Iteratively say:
  Vt+1(x) ← max_a [ R(x,a) + γ ∑x' P(x'|x,a) Vt(x') ]
Stop when ||Vt+1 − Vt||∞ ≤ ε
  means that ||V* − Vt+1||∞ ≤ ε/(1−γ)
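A compact sketch of value iteration over a generic tabular MDP (the R and P arrays are the same made-up placeholders used earlier, not the startup example on the next slides):

```python
import numpy as np

gamma, eps = 0.9, 1e-6
n_states, n_actions = 3, 2

R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [10.0, 10.0]])           # R[x, a]
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
P[1] = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]  # P[a, x, x']

V = np.zeros(n_states)                 # start with some guess V0
while True:
    # Q[x, a] = R(x,a) + gamma * sum_x' P(x'|x,a) V(x')
    Q = R + gamma * np.einsum('axy,y->xa', P, V)
    V_next = Q.max(axis=1)             # Vt+1(x) = max_a Q[x, a]
    if np.max(np.abs(V_next - V)) <= eps:
        V = V_next
        break
    V = V_next

policy = Q.argmax(axis=1)              # greedy policy w.r.t. the converged V
print(V, policy)
```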
20
You run a startup company. In every state you must choose between Saving money or Advertising.
γ = 0.9
[Figure: 4-state MDP — Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10); in each state you choose Save (S) or Advertise (A), and each arrow is labeled with transition probability 1 or 1/2.]
21
t   Vt(PU)  Vt(PF)  Vt(RU)  Vt(RF)
(fill in the rows for t = 1, …, 6)

γ = 0.9
[Figure: same 4-state Save/Advertise MDP as above.]
Vt+1(x) = max_a [ R(x,a) + γ ∑x' P(x'|x,a) Vt(x') ]
22
t   Vt(PU)  Vt(PF)  Vt(RU)  Vt(RF)
1   0       0       10      10
2   0       4.5     14.5    19
3   2.03    6.53    25.08   18.55
4   3.852   12.20   29.63   19.26
5   7.22    15.07   32.00   20.40
6   10.03   17.65   33.58   22.43
γ = 0.9
[Figure: same 4-state Save/Advertise MDP as above.]
Vt+1(x) = max_a [ R(x,a) + γ ∑x' P(x'|x,a) Vt(x') ]
23
Start with some guess for a policy π0
Iteratively say:
  evaluate policy:  Vπt(x) = R(x, πt(x)) + γ ∑x' P(x'|x, πt(x)) Vπt(x')
  improve policy:  πt+1(x) ← argmax_a [ R(x,a) + γ ∑x' P(x'|x,a) Vπt(x') ]
Stop when the policy stops changing
  usually happens in about 10 iterations
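A sketch of policy iteration on the same made-up 3-state, 2-action arrays used in the value-iteration snippet, with the evaluation step done exactly via a linear solve:

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 3, 2
R = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])           # R[x, a]
P = np.zeros((n_actions, n_states, n_states))                  # P[a, x, x']
P[0] = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
P[1] = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]

pi = np.zeros(n_states, dtype=int)    # start with some guess pi_0
while True:
    # evaluate policy: V_pi = (I - gamma * P_pi)^{-1} R_pi
    P_pi = P[pi, np.arange(n_states)]              # P_pi[x, x'] = P(x'|x, pi(x))
    R_pi = R[np.arange(n_states), pi]              # R_pi[x] = R(x, pi(x))
    V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    # improve policy: pi_{t+1}(x) = argmax_a [ R(x,a) + gamma * sum_x' P(x'|x,a) V_pi(x') ]
    Q = R + gamma * np.einsum('axy,y->xa', P, V_pi)
    pi_next = Q.argmax(axis=1)

    if np.array_equal(pi_next, pi):   # stop when the policy stops changing
        break
    pi = pi_next

print(pi, V_pi)
```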
24
It depends.
  Lots of actions? Choose policy iteration
  Already got a fair policy? Policy iteration
  Few actions, acyclic? Value iteration
Best of both worlds: Modified Policy Iteration [Puterman]
  …a simple mix of value iteration and policy iteration
25
One variable V(x) for each state
One constraint for each state x and action a
Polynomial time solution [Manne '60]

minimize:  ∑x V(x)
subject to:  V(x) ≥ R(x,a) + γ ∑x' P(x'|x,a) V(x'),  for every state x and action a
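A sketch of this LP using scipy (again on the made-up 3-state, 2-action arrays; the objective and the ≥ constraints are rewritten in the A_ub·V ≤ b_ub form that linprog expects):

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
n_states, n_actions = 3, 2
R = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])           # R[x, a]
P = np.zeros((n_actions, n_states, n_states))                  # P[a, x, x']
P[0] = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
P[1] = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]

# One variable V(x) per state; minimize sum_x V(x).
c = np.ones(n_states)

# One constraint per (x, a):  V(x) >= R(x,a) + gamma * sum_x' P(x'|x,a) V(x')
# rewritten as  (gamma * P(.|x,a) - e_x) . V  <=  -R(x,a)
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        row = gamma * P[a, x]
        row[x] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[x, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
print(res.x)   # the optimal value function V*
```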
26
What's a Markov decision process
  states, actions, transitions, rewards
  a policy
  value function for a policy
    computing Vπ
Optimal value function and optimal policy
  Bellman equation
Solving the Bellman equation
  with value iteration, policy iteration, and linear programming
27
This lecture contains some material from
http://www.cs.cmu.edu/~awm/tutorials
28
Reading: Kaelbling et al. 1996 (see class website)
29
World: You are in state 34. Your immediate reward is 3. You have 3 possible actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 possible actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 possible actions.
30
Given a set of states X and actions A
  in some versions of the problem, the sizes of X and A are unknown
Interact with the world at each time step t:
  world gives state xt and reward rt
  you give next action at
Goal: (quickly) learn a policy that (approximately) maximizes long-term expected discounted reward
31
Yippee! I got to a state with a big reward! But which of my actions along the way actually helped me get there?? This is the Credit Assignment problem.
I'm in state 43, reward = 0, action = 2
I'm in state 39, reward = 0, action = 4
I'm in state 22, reward = 0, action = 1
I'm in state 21, reward = 0, action = 1
I'm in state 21, reward = 0, action = 1
I'm in state 13, reward = 0, action = 2
I'm in state 54, reward = 0, action = 2
I'm in state 26, reward = 100, …
32
You have visited part of the state space and found a reward of 100
  is this the best I can hope for???
Exploitation: should I stick with what I know and get a good reward,
  at the risk of missing out on some large reward somewhere?
Exploration: should I look for states with higher rewards,
  at the risk of wasting my time or collecting a lot of negative reward?
33
Model-based approaches:
  explore environment → learn model (P(x'|x,a) and R(x,a)) (almost) everywhere
  use model to plan policy, MDP-style
  approach leads to strongest theoretical results
  works quite well in practice when state space is manageable
Model-free approaches:
  don't learn a model → learn value function or policy directly
  leads to weaker theoretical results
34
Brafman & Tennenholtz 2002 (see class website)
35
Given data, learn the MDP representation:
  Dataset: observed trajectories of states, actions, and rewards
  Learn reward function: R(x,a)
  Learn transition model: P(x'|x,a)
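A sketch of the maximum-likelihood estimates from transition counts (the (x, a, r, x') tuple format is an assumption about how the dataset is stored, not something specified in the slides):

```python
from collections import defaultdict

def estimate_model(transitions):
    """transitions: list of (x, a, r, x_next) tuples observed while acting."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(x,a)][x'] = N(x,a,x')
    reward_sum = defaultdict(float)
    visits = defaultdict(int)

    for x, a, r, x_next in transitions:
        counts[(x, a)][x_next] += 1
        reward_sum[(x, a)] += r
        visits[(x, a)] += 1

    # MLE: P(x'|x,a) = N(x,a,x') / N(x,a),   R(x,a) = average observed reward
    P_hat = {sa: {xn: n / visits[sa] for xn, n in nexts.items()}
             for sa, nexts in counts.items()}
    R_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P_hat, R_hat

P_hat, R_hat = estimate_model([(0, 1, 0.0, 1), (1, 0, 1.0, 1), (0, 1, 0.0, 0)])
print(P_hat, R_hat)
```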
36
Model-based approach:
  estimate R(x,a) & P(x'|x,a)
  no credit assignment problem → by learning a model, the planning algorithm takes care of "assigning" credit
What do you plug in when you don't have enough information about a state?
  don't know the reward at a particular state?
    plug in the smallest reward (Rmin)? plug in the largest reward (Rmax)?
  don't know a particular transition probability?
37
A state may be very hard to reach
  waste a lot of time trying to learn rewards and transitions for this state
  after much effort, the state may turn out to be useless
A strong advantage of a model-based approach:
  you know which states' estimates of rewards and transitions are bad
  can (try to) plan to reach these states
  have a good estimate of how long it takes to get there
38
Optimism in the face of uncertainty!!!!
  heuristic shown to be useful long before the theory was done (e.g., Kaelbling '90)
If you don't know the reward for a particular state-action pair: assume it's the best possible, Rmax
If you don't know the transition probabilities: assume you go to a fictitious "optimistic" state x0 with R(x0,a) = Rmax and P(x0|x0,a) = 1
39
With Rmax you either:
  explore – visit a state-action pair you don't know much about, because it seems to have lots of potential
  exploit – spend all your time on states you already know well; even if unknown states were amazingly good, it's not worth it to reach them
Note: you never know if you are exploring or exploiting!!!
40
Lemma: every T time steps, either:
  Exploits: achieves near-optimal reward for these T steps, or
  Explores: with high probability, the agent visits an unknown state-action pair
    learns a little about an unknown state
T is related to the mixing time of the Markov chain defined by the MDP:
  the time it takes to (approximately) forget where you started
41
Initialization:
  Add state x0 to the MDP
  R(x,a) = Rmax, ∀x,a
  P(x0|x,a) = 1, ∀x,a
  all states (except for x0) are unknown
Repeat:
  for any visited state-action pair, set reward function to appropriate value
  if visited some state-action pair x,a enough times to estimate P(x'|x,a):
    update transition probs. P(x'|x,a) for x,a using MLE
    recompute policy
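A minimal sketch of this loop under several assumptions that go beyond the slide's outline: a tabular MDP, a known reward bound Rmax, a fixed visit threshold m for declaring a pair known, and a tiny built-in value-iteration planner.

```python
import numpy as np

def rmax_agent(n_states, n_actions, step, R_max=10.0, m=5, gamma=0.9, n_steps=500):
    """step(x, a) -> (reward, next_state): the environment we interact with."""
    S = n_states + 1                      # index n_states is the fictitious optimistic state x0
    x0 = n_states
    R = np.full((S, n_actions), R_max)    # optimistic rewards everywhere
    P = np.zeros((n_actions, S, S))
    P[:, :, x0] = 1.0                     # unknown pairs (and x0 itself) go to x0 with prob 1
    counts = np.zeros((S, n_actions, S))
    visits = np.zeros((S, n_actions))

    def plan():                           # value iteration on the current optimistic MDP
        V = np.zeros(S)
        for _ in range(200):
            V = (R + gamma * np.einsum('axy,y->xa', P, V)).max(axis=1)
        return (R + gamma * np.einsum('axy,y->xa', P, V)).argmax(axis=1)

    pi, x = plan(), 0                     # start in (arbitrary) state 0
    for _ in range(n_steps):
        a = pi[x]
        r, x_next = step(x, a)
        R[x, a] = r                       # visited pair: set reward to its observed value
        counts[x, a, x_next] += 1
        visits[x, a] += 1
        if visits[x, a] == m:             # visited enough times: plug in MLE, recompute policy
            P[a, x] = counts[x, a] / visits[x, a]
            pi = plan()
        x = x_next
    return pi

# Toy two-state environment, purely illustrative.
rng = np.random.default_rng(0)
def toy_step(x, a):
    x_next = int(rng.random() < 0.5) if a == 0 else 1
    return (1.0 if x_next == 1 else 0.0), x_next

print(rmax_agent(n_states=2, n_actions=2, step=toy_step))
```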
42
How many times are enough?
  use the Chernoff bound!
Chernoff bound:
  X1,…,Xn are i.i.d. Bernoulli trials with parameter θ
  P( |1/n ∑i Xi − θ| > ε ) ≤ 2 exp{−2nε²}
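To make the "how many times" question concrete (this derivation is mine, not on the slide): setting the bound below δ and solving for n gives the visit count needed before an estimate is ε-accurate with confidence 1−δ.

```latex
2\exp\{-2n\varepsilon^{2}\} \le \delta
\quad\Longleftrightarrow\quad
n \;\ge\; \frac{1}{2\varepsilon^{2}}\,\ln\frac{2}{\delta}
```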
43
Theorem: With prob. at least 1−δ, Rmax will reach an ε-close-to-optimal policy in time polynomial in the number of states, the number of actions, T, 1/ε, and 1/δ.
Every T steps:
  achieve near-optimal reward (great!), or
  visit an unknown state-action pair → the number of states and actions is finite, so it can't take too long before all states are known
44
If the state space is large
  the transition matrix is very large!
  requires many visits to declare a state as known
Hard to do "approximate" learning with large state spaces
  some options exist, though
45
46
Start from x0 and follow policy π
Vπ(x0) = Eπ[R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ⋯]
Future rewards discounted by γ ∈ [0,1)
[Figure: same trajectory tree as before — states xi labeled with rewards R(·) and policy actions π(·).]
47
To estimate Vπ(x), start several trajectories from x and average their discounted rewards
  Hoeffding's inequality tells you how many trajectories you need
  discounted reward → you don't have to run each trajectory forever to get a reward estimate
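A sketch of this Monte Carlo estimate (the reward_fn/sample_next simulator hooks and the horizon H are assumptions for the example; truncating at H steps changes each return by at most γ^H · Rmax):

```python
import numpy as np

def mc_value_estimate(x, policy, reward_fn, sample_next, n_traj=100, H=100,
                      gamma=0.9, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of V_pi(x): average of n_traj truncated discounted returns."""
    returns = []
    for _ in range(n_traj):
        s, total, discount = x, 0.0, 1.0
        for _ in range(H):
            total += discount * reward_fn(s)        # R(x_t), discounted by gamma^t
            s = sample_next(s, policy[s], rng)      # x_{t+1} ~ P(. | x_t, pi(x_t))
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```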
48
Resets: assumes you can restart the process from any state you want
Wasteful: the same trajectory could be used to estimate the value of every state it visits along the way
49
unbiased!! but a very bad estimate!!!
50
Idea 2: Observe a transition xt → xt+1 with reward rt+1, and approximate the expectation by a mixture of the new sample with the old estimate:
  V(xt) ← (1 − α) V(xt) + α [ rt+1 + γ V(xt+1) ]
where α is the learning rate.
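A sketch of this temporal-difference update (the episode interface and the 1/visits learning-rate schedule are assumptions for the example; that schedule satisfies the convergence conditions on the next slide):

```python
import numpy as np

def td0(n_states, episode_fn, gamma=0.9, n_episodes=1000):
    """Tabular TD(0) policy evaluation.

    episode_fn() is assumed to yield (x_t, r_{t+1}, x_{t+1}) transitions
    generated by following the fixed policy being evaluated.
    """
    V = np.zeros(n_states)
    visits = np.zeros(n_states)
    for _ in range(n_episodes):
        for x, r, x_next in episode_fn():
            visits[x] += 1
            alpha = 1.0 / visits[x]   # decaying rate: sum alpha = inf, sum alpha^2 < inf
            # mix the new sample r + gamma*V(x') with the old estimate V(x)
            V[x] = (1 - alpha) * V[x] + alpha * (r + gamma * V[x_next])
    return V
```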
51
Theorem: TD converges in the limit (with prob. 1), if:
  every state is visited infinitely often
  the learning rate decays just so:
    ∑i=1..∞ αi = ∞
    ∑i=1..∞ αi² < ∞