Markov Decision Processes and Exact Solution Methods

Markov Decision Processes and Exact Solution Methods: Value Iteration, Policy Iteration, Linear Programming
Pieter Abbeel, UC Berkeley EECS

Markov Decision Process
Assumption: agent gets to observe the state
[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]

Markov Decision Process (S, A, T, R, γ, H)
Given:
- S: set of states
- A: set of actions
- T: S × A × S × {0, 1, …, H} → [0, 1], with $T_t(s, a, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a)$
- R: S × A × S × {0, 1, …, H} → ℝ, with $R_t(s, a, s')$ = reward for $(s_{t+1} = s', s_t = s, a_t = a)$
- γ ∈ (0, 1]: discount factor
- H: horizon over which the agent will act
Goal: find π*: S × {0, 1, …, H} → A that maximizes the expected sum of rewards, i.e.,
$$\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{t=0}^{H} \gamma^t R_t(S_t, A_t, S_{t+1}) \,\middle|\, \pi\right]$$
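The tuple above can be written down directly as a small data structure. The following is a minimal sketch, not part of the original slides: the `MDP` class, the `is_valid` helper, and the two-state toy problem are all hypothetical, and T and R are assumed time-invariant for brevity (the slide allows them to depend on t).

```python
from typing import Callable, NamedTuple, Sequence


class MDP(NamedTuple):
    """Hypothetical container for (S, A, T, R, gamma, H); T and R are time-invariant here."""
    states: Sequence[str]
    actions: Sequence[str]
    T: Callable[[str, str, str], float]  # T(s, a, s') = P(s' | s, a)
    R: Callable[[str, str, str], float]  # R(s, a, s') = reward for the transition
    gamma: float                         # discount factor in (0, 1]
    H: int                               # horizon


def is_valid(mdp: MDP, tol: float = 1e-9) -> bool:
    """Check gamma's range and that T(s, a, .) is a probability distribution."""
    if not (0.0 < mdp.gamma <= 1.0) or mdp.H < 0:
        return False
    for s in mdp.states:
        for a in mdp.actions:
            probs = [mdp.T(s, a, s2) for s2 in mdp.states]
            if any(p < 0.0 for p in probs) or abs(sum(probs) - 1.0) > tol:
                return False
    return True


# Toy problem (an assumption for illustration): "stay" keeps the state,
# "go" switches state with probability 0.9; landing in "b" pays reward 1.
def toy_T(s: str, a: str, s2: str) -> float:
    if a == "stay":
        return 1.0 if s2 == s else 0.0
    return 0.9 if s2 != s else 0.1


def toy_R(s: str, a: str, s2: str) -> float:
    return 1.0 if s2 == "b" else 0.0


toy = MDP(states=["a", "b"], actions=["stay", "go"], T=toy_T, R=toy_R, gamma=0.9, H=5)
print(is_valid(toy))
```

Representing T and R as functions rather than tables keeps the sketch short; a dense array would be the more usual choice once the state space grows.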

Examples
MDP (S, A, T, R, γ, H), goal: $\max_{\pi} \mathbb{E}\left[\sum_{t=0}^{H} \gamma^t R_t(S_t, A_t, S_{t+1}) \mid \pi\right]$
- Cleaning robot
- Walking robot
- Pole balancing
- Games: Tetris, backgammon
- Server management
- Shortest path problems
- Model for animals, people

Canonical Example: Grid World
- The agent lives in a grid
- Walls block the agent's path
- The agent's actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there)
  - 10% of the time, North takes the agent West; 10% East
  - If there is a wall in the direction the agent would have been taken, the agent stays put
- Big rewards come at the end
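The noisy motion model described above can be sketched as a function that returns a distribution over next cells. This is an illustration under stated assumptions, not the code behind the slides: the names `MOVES`, `SIDEWAYS`, `step`, and `transition` are hypothetical, the 80/10/10 split for North is generalized to every action by slipping to the two perpendicular directions, and the grid boundary is treated like a wall.

```python
# Cells are (x, y) pairs with y increasing towards North.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# Perpendicular directions the agent may slip to (assumed for all actions).
SIDEWAYS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def step(state, direction, width, height, walls):
    """Deterministic move; the agent stays put if blocked by a wall or the boundary."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
    if not inside or nxt in walls:
        return state
    return nxt


def transition(state, action, width, height, walls, noise=0.2):
    """Return {next_state: probability} for taking `action` in `state`."""
    left, right = SIDEWAYS[action]
    outcomes = {}
    for direction, p in ((action, 1.0 - noise), (left, noise / 2), (right, noise / 2)):
        nxt = step(state, direction, width, height, walls)
        outcomes[nxt] = outcomes.get(nxt, 0.0) + p
    return outcomes


# Example: a 4 x 3 grid with a single wall cell at (1, 1).
walls = {(1, 1)}
print(transition((0, 0), "N", 4, 3, walls))
print(transition((0, 1), "E", 4, 3, walls))
```

With `noise=0.2`, the intended direction keeps probability 0.8 and each sideways slip gets 0.1, matching the percentages on the slide. Probabilities of outcomes that land in the same cell are merged.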

Solving MDPs
- In an MDP, we want to find an optimal policy π*: S × 0:H → A
- A policy π gives an action for each state for each time
  [Figure: one policy panel per time step, t = 0, 1, 2, 3, 4, 5 = H]
- An optimal policy maximizes the expected sum of rewards
- Contrast: if deterministic, we just need an optimal plan, or sequence of actions, from start to a goal

Outline
- Optimal Control = given an MDP (S, A, T, R, γ, H), find the optimal policy π*
- Exact Methods:
  - Value Iteration
  - Policy Iteration
  - Linear Programming
For now: discrete state-action spaces, as they are simpler for getting the main concepts across. We will consider continuous spaces later!

Value Iteration
Algorithm:
Start with $V_0^*(s) = 0$ for all s.
For i = 1, …, H
  For all states s in S:
  $$V_i^*(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_{i-1}^*(s') \right]$$
  $$\pi_i^*(s) \leftarrow \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_{i-1}^*(s') \right]$$
This is called a value update or Bellman update/back-up.
$V_i^*(s)$ = expected sum of rewards accumulated starting from state s, acting optimally for i steps
$\pi_i^*(s)$ = optimal action when in state s and getting to act for i steps
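The algorithm above can be sketched in a few lines of plain Python. This is an illustrative implementation under assumptions: the function name `value_iteration`, the callable form of T and R, and the deterministic two-state toy problem are hypothetical and not taken from the slides.

```python
def value_iteration(states, actions, T, R, gamma, H):
    """Finite-horizon value iteration.

    Returns (V, policies), where V[s] is the optimal value with H steps to go
    and policies[i - 1][s] is the optimal action with i steps to go.
    """
    V = {s: 0.0 for s in states}  # value with 0 steps to go
    policies = []
    for _ in range(H):
        new_V, pi = {}, {}
        for s in states:
            # Bellman back-up: expected reward plus discounted value of the successor.
            q = {
                a: sum(T(s, a, s2) * (R(s, a, s2) + gamma * V[s2]) for s2 in states)
                for a in actions
            }
            best = max(q, key=q.get)
            new_V[s], pi[s] = q[best], best
        V = new_V
        policies.append(pi)
    return V, policies


# Toy problem (assumed): "go" switches state, "stay" keeps it; arriving in "b" pays 1.
states, actions = ["a", "b"], ["stay", "go"]


def T(s, a, s2):
    target = s if a == "stay" else ("b" if s == "a" else "a")
    return 1.0 if s2 == target else 0.0


def R(s, a, s2):
    return 1.0 if s2 == "b" else 0.0


V, policies = value_iteration(states, actions, T, R, gamma=0.5, H=2)
print(V, policies[-1])
```

Each sweep builds `new_V` from the previous `V` only, so the values after i sweeps correspond to acting optimally for exactly i steps.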

Value Iteration in Gridworld
noise = 0.2, γ = 0.9, two terminal states with R = +1 and -1


Value Iteration Convergence
Theorem. Value iteration converges. At convergence, we have found the optimal value function V* for the discounted infinite horizon problem, which satisfies the Bellman equations
$$\forall s \in S: \quad V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$$
- Now we know how to act for infinite horizon with discounted rewards!
  - Run value iteration till convergence.
  - This produces V*, which in turn tells us how to act, namely following:
  $$\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$$
- Note: the infinite horizon optimal policy is stationary, i.e., the optimal action at a state s is the same action at all times. (Efficient to store!)
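Running value iteration "till convergence" and then reading off the stationary policy might look like the sketch below. The stopping rule (largest change below a tolerance `tol`), the function name `solve_infinite_horizon`, and the toy problem are assumptions made for illustration. The loop is only guaranteed to terminate for γ < 1.

```python
def solve_infinite_horizon(states, actions, T, R, gamma, tol=1e-9):
    """Iterate Bellman back-ups until the largest change is below tol (needs gamma < 1)."""

    def backup(V, s, a):
        return sum(T(s, a, s2) * (R(s, a, s2) + gamma * V[s2]) for s2 in states)

    V = {s: 0.0 for s in states}
    while True:
        new_V = {s: max(backup(V, s, a) for a in actions) for s in states}
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            break
    # Greedy (stationary) policy with respect to the converged values.
    policy = {s: max(actions, key=lambda a: backup(V, s, a)) for s in states}
    return V, policy


# Toy problem (assumed): "go" switches state, "stay" keeps it; arriving in "b" pays 1.
states, actions = ["a", "b"], ["stay", "go"]


def T(s, a, s2):
    target = s if a == "stay" else ("b" if s == "a" else "a")
    return 1.0 if s2 == target else 0.0


def R(s, a, s2):
    return 1.0 if s2 == "b" else 0.0


V_star, pi_star = solve_infinite_horizon(states, actions, T, R, gamma=0.5)
print(V_star, pi_star)
```

In this toy problem the agent can collect reward 1 at every step, so the values approach 1 / (1 - γ) = 2 in both states, and a single table `pi_star` indexed by state is enough to act at all times.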

Convergence: Intuition
- $V^*(s)$ = expected sum of rewards accumulated starting from state s, acting optimally for ∞ steps
- $V_H^*(s)$ = expected sum of rewards accumulated starting from state s, acting optimally for H steps
- Additional reward collected over time steps H+1, H+2, …:
  $$\gamma^{H+1} R(s_{H+1}) + \gamma^{H+2} R(s_{H+2}) + \dots \le \gamma^{H+1} R_{\max} + \gamma^{H+2} R_{\max} + \dots = \frac{\gamma^{H+1}}{1 - \gamma} R_{\max}$$
  which, for γ < 1, goes to zero as H goes to infinity
- Hence $V_H^* \to V^*$ as $H \to \infty$
For simplicity of notation, the above assumes that rewards are always greater than or equal to zero. If rewards can be negative, a similar argument holds, using max |R| and bounding from both sides.
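To get a feel for how quickly the tail bound γ^(H+1) / (1 - γ) · R_max shrinks, it can be evaluated numerically. The sketch below is only an illustration: the helper names `tail_bound` and `horizon_for_accuracy` and the example values γ = 0.9, R_max = 1 are assumptions, and γ < 1 is required.

```python
def tail_bound(gamma, r_max, H):
    """Upper bound on the reward collected after step H: gamma^(H+1) / (1 - gamma) * r_max."""
    return gamma ** (H + 1) / (1.0 - gamma) * r_max


def horizon_for_accuracy(gamma, r_max, eps):
    """Smallest H whose tail bound is at most eps (assumes 0 < gamma < 1 and eps > 0)."""
    H = 0
    while tail_bound(gamma, r_max, H) > eps:
        H += 1
    return H


gamma, r_max = 0.9, 1.0
for H in (0, 10, 50, 100):
    print(H, tail_bound(gamma, r_max, H))

H_needed = horizon_for_accuracy(gamma, r_max, eps=0.01)
print("horizon needed for a tail below 0.01:", H_needed)
```

The bound decays geometrically in H, so the horizon needed for a given accuracy grows only logarithmically in 1/eps, though it grows quickly as γ approaches 1.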

Convergence and Contractions
- Define the max-norm: $\|U\| = \max_s |U(s)|$
- Theorem: for any two approximations U and V,
  $$\|U_{i+1} - V_{i+1}\| \le \gamma \|U_i - V_i\|$$
- I.e., any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U, and value iteration converges to a unique, stable, optimal solution
- Theorem:
  $$\|U_{i+1} - U_i\| < \epsilon \;\Rightarrow\; \|U_{i+1} - U\| < \frac{2 \epsilon \gamma}{1 - \gamma}$$
- I.e., once the change in our approximation is small, it must also be close to correct
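The contraction property can be checked numerically on a small example: apply one Bellman back-up to two arbitrary value functions and compare their max-norm distance before and after. This is a sanity check, not a proof. The helper names, the toy MDP, and the two starting value functions are assumptions chosen for illustration.

```python
def bellman_backup(V, states, actions, T, R, gamma):
    """One synchronous Bellman back-up applied to the value function V."""
    return {
        s: max(
            sum(T(s, a, s2) * (R(s, a, s2) + gamma * V[s2]) for s2 in states)
            for a in actions
        )
        for s in states
    }


def max_norm_distance(U, V):
    """Max-norm of U - V: the largest absolute difference over states."""
    return max(abs(U[s] - V[s]) for s in U)


# Toy problem (assumed): "stay" keeps the state, "go" switches it with probability 0.9.
states, actions, gamma = ["a", "b"], ["stay", "go"], 0.9


def T(s, a, s2):
    if a == "stay":
        return 1.0 if s2 == s else 0.0
    return 0.9 if s2 != s else 0.1


def R(s, a, s2):
    return 1.0 if s2 == "b" else 0.0


U0 = {"a": 5.0, "b": -3.0}  # two arbitrary starting approximations
V0 = {"a": 0.0, "b": 10.0}
U1 = bellman_backup(U0, states, actions, T, R, gamma)
V1 = bellman_backup(V0, states, actions, T, R, gamma)

before = max_norm_distance(U0, V0)
after = max_norm_distance(U1, V1)
print(before, after, after <= gamma * before)
```

Repeating the back-up shrinks the gap by at least a factor of γ each time, which is why the starting values do not affect where value iteration ends up.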

Exercise 1: Effect of Discount and Noise
Behaviors:
(a) Prefer the close exit (+1), risking the cliff (-10)
(b) Prefer the close exit (+1), but avoiding the cliff (-10)
(c) Prefer the distant exit (+10), risking the cliff (-10)
(d) Prefer the distant exit (+10), avoiding the cliff (-10)
Settings:
(1) γ = 0.1, noise = 0.5
(2) γ = 0.99, noise = 0
(3) γ = 0.99, noise = 0.5
(4) γ = 0.1, noise = 0

Exercise 1 Solution
(a) Prefer the close exit (+1), risking the cliff (-10): (4) γ = 0.1, noise = 0
