

SLIDE 1

CMU-Q 15-381

Lecture 18: Reinforcement Learning I

Teacher: Gianni A. Di Caro

SLIDE 2

HOW REALISTIC ARE MDPS?

§ Assumption 1: the state is known exactly after performing an action
§ Do we always have an infinitely powerful “GPS” that tells us where we are in the world? Think of a robot moving in a building: how does it know where it is?
§ Relax the assumption: Partially Observable MDP (POMDP)
§ Assumption 2: known model of the dynamics and rewards of the world, T and R
§ Do we always know what the effect of our actions will be when chance is playing against us? Where do those numbers come from? Imagine having to fill in the T matrix for the actions of a wheeled robot on an icy surface…
§ Relax the assumption: Reinforcement Learning problems

SLIDE 3

REINFORCEMENT LEARNING

[Agent–environment loop diagram: the agent takes actions and observes states and rewards; the transition model and reward model are unknown]

Goal: Maximize expected sum of future rewards

Memoryless stochastic reward process (MRP)

SLIDE 4

MDP PLANNING VS. REINFORCEMENT LEARNING

We don’t have a simulator! We have to actually learn what happens if we take an action in a state

Drawings by Ketrina Yim

SLIDE 5

REINFORCEMENT LEARNING PROBLEM

✓ The agent can “sense” the environment (it knows the state) and has goals
✓ Learning the effect of actions from interaction with the environment
§ Trial-and-error search
§ (Delayed) rewards (advisory signals ≠ error signals)
§ What actions to take? → Exploration–exploitation dilemma
§ The agent has to generate the training set by interaction

SLIDE 6

REINFORCEMENT LEARNING

[Agent–environment loop diagram: the agent takes actions and observes states and rewards; the transition model and reward model are unknown]

Goal: Maximize expected sum of future rewards

Memoryless stochastic reward process (MRP)

SLIDE 7

PASSIVE REINFORCEMENT LEARNING

§ Before figuring out how to act, let’s first just try to figure out how good a (given) particular policy π is
§ Passive learning: the agent’s policy is fixed (i.e., in state s it always executes action π(s)) and the task is to estimate the policy’s value → learn state values V(s), or state–action values Q(s, a) → Policy evaluation

Policy evaluation in MDPs ∼ Passive RL

[Diagram: policy evaluation in MDPs uses the known (T, R) model and the Bellman equations; in passive RL the (T, R) model must be learned from experience]

SLIDE 8

PASSIVE REINFORCEMENT LEARNING

Two approaches

  • 1. Build a model

→ Solve with Value Iteration

[Agent–environment loop diagram: the agent takes actions and observes states and rewards; the transition model and reward model are unknown]

T(s,a,s’)=0.8, R(s,a,s’)=4,…

SLIDE 9

PASSIVE REINFORCEMENT LEARNING

Two approaches:

  • 1. Build a model
  • 2. Model-free:

directly estimate Vπ
[Agent–environment loop diagram: the transition and reward models are unknown]

Vπ(s1)=1.8, Vπ(s2)=2.5,…

SLIDE 10

PASSIVE RL: BUILD A MODEL

  • 1. Build a model

[Agent–environment loop diagram: the agent takes actions and observes states and rewards; the transition model and reward model are unknown]

T(s,a,s’)=0.8, R(s,a,s’)=4,…

SLIDE 11

Start at (1,1)


GRID WORLD EXAMPLE

SLIDE 12

Start at (1,1)
s=(1,1), action = tup (“try up”)

Adaptation of a drawing by Ketrina Yim

GRID WORLD EXAMPLE


SLIDE 13

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01

Adaptation of a drawing by Ketrina Yim

GRID WORLD EXAMPLE


SLIDE 14

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup

Adaptation of a drawing by Ketrina Yim

GRID WORLD EXAMPLE


SLIDE 15

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01

Adaptation of a drawing by Ketrina Yim

GRID WORLD EXAMPLE


SLIDE 16

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1

Adaptation of a drawing by Ketrina Yim

GRID WORLD EXAMPLE


SLIDE 17

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1

Adaptation of a drawing by Ketrina Yim

The gathered experience can be used to estimate the MDP’s T and R models

GRID WORLD EXAMPLE


SLIDE 18

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1

Adaptation of a drawing by Ketrina Yim

Estimate of T(<1,2>, tup, <1,3>) = 1/2 (tup was taken twice in (1,2): once the agent stayed in (1,2), once it moved to (1,3))

The gathered experience can be used to estimate the MDP’s T and R models

GRID WORLD EXAMPLE


SLIDE 19

MODEL-BASED PASSIVE REINFORCEMENT LEARNING

  • 1. Follow policy π, observe transitions and rewards
  • 2. Estimate MDP model parameters T and R given the observed transitions and rewards

§ If finite set of states and actions, can just make a table, count, and average counts

  • 3. Use the estimated MDP to do policy evaluation of π

(using Value Iteration; a minimal sketch of the whole procedure follows below)
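Below is a minimal sketch of steps 1–3 (illustrative, not the lecture's code). It assumes the gathered experience is available as a list of episodes, each a list of (s, a, r, s′) tuples, and that the fixed policy π is a dictionary mapping states to actions; all function names and the iteration count are assumptions.

```python
# Minimal sketch of model-based passive RL (illustrative, not the lecture's code).
# Assumed data layout: episodes = list of episodes, each a list of (s, a, r, s_next)
# tuples gathered while following the fixed policy pi (a dict: state -> action).
from collections import defaultdict

def estimate_model(episodes):
    """Estimate T(s,a,s') by relative frequency and R(s,a,s') by average reward."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s_next: count}
    reward_sum = defaultdict(float)                  # (s, a, s_next) -> summed reward
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[(s, a)][s_next] += 1
            reward_sum[(s, a, s_next)] += r
    T, R = {}, {}
    for (s, a), successors in counts.items():
        total = sum(successors.values())
        for s_next, c in successors.items():
            T[(s, a, s_next)] = c / total                        # observed frequency
            R[(s, a, s_next)] = reward_sum[(s, a, s_next)] / c   # average observed reward
    return T, R

def evaluate_policy(T, R, pi, states, gamma=1.0, n_iters=500):
    """Policy evaluation on the estimated MDP: iterate the Bellman equation for pi."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {s: sum(T.get((s, pi.get(s), s2), 0.0)
                    * (R.get((s, pi.get(s), s2), 0.0) + gamma * V[s2])
                    for s2 in states)
             for s in states}
    return V
```

On the grid-world episodes shown earlier, these estimates are only defined for the state–action pairs that were actually visited, which is exactly the issue raised by the question below.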

Does this give us all the parameters for an MDP?

SLIDE 20

Start at (1,1)
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1

Adaptation of a drawing by Ketrina Yim

Estimate of T(<1,2>,tright,<1,3>)? No idea! Never tried this action…

SOME PARAMETERS ARE MISSING


GRID WORLD EXAMPLE

SLIDE 21

PASSIVE MODEL-BASED RL

§ Does this give us all the parameters of the underlying MDP? No.
§ But does that matter for computing the policy value? No: we don’t need to reconstruct the whole MDP to perform policy evaluation!
§ We have all the parameters we need: we have π(s), and we can assign non-zero probabilities to all observed transitions and zero to the unobserved ones
§ We do need to visit every state s ∈ S at least once in order to solve the Bellman equations for all states

Vπ(s) = E_π[ R(s_{t+1}) + γ·Vπ(s_{t+1}) | s_t = s ] = Σ_{s′∈S} p(s′ | s, π(s)) · [ R(s, π(s), s′) + γ·Vπ(s′) ]    ∀ s ∈ S

SLIDE 22

Episode 1 (start at (1,1)):
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1
Episode 2 (start at (1,1)):
s=(1,1), action = tup, s’=(2,1), r = −0.01
s=(2,1), action = tright, s’=(3,1), r = −0.01
s=(3,1), action = tup, s’=(4,1), r = −0.01
s=(4,1), action = tleft, s’=(3,1), r = −0.01
s=(3,1), action = tup, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(4,2), r = −1

Adaptation of a drawing by Ketrina Yim

2 episodes of experience in the MDP. Use them to estimate the MDP parameters & evaluate π. Is the computed policy value likely to be correct? (1) Yes (2) No (3) Not sure


PASSIVE MODEL-BASED RL

SLIDE 23

PASSIVE REINFORCEMENT LEARNING

Two Approaches:

  • 1. Build a model
  • 2. Model-free:

directly estimate Vπ
[Agent–environment loop diagram: the transition and reward models are unknown]

Vπ(s1)=1.8, Vπ(s2)=2.5,…

SLIDE 24

Episode 1 (start at (1,1)):
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1
Episode 2 (start at (1,1)):
s=(1,1), action = tup, s’=(2,1), r = −0.01
s=(2,1), action = tright, s’=(3,1), r = −0.01
s=(3,1), action = tup, s’=(4,1), r = −0.01
s=(4,1), action = tleft, s’=(3,1), r = −0.01
s=(3,1), action = tup, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(4,2), r = −1

Adaptation of a drawing by Ketrina Yim

2 episodes of (MDP) experiences


LET’S CONSIDER AN EPISODIC SCENARIO

Estimate of Vπ(1,1)? Averaging the returns of the two episodes, e1 and e2:

V̂π(1,1) = ½ · [ (1 + 7·(−0.01)) + (−1 + 5·(−0.01)) ] = ½ · (0.93 − 1.05) = −0.06

SLIDE 25

AVERAGING OBSERVED RETURNS

§ Averaging the returns from k episodes, G_1, G_2, ⋯, G_k

§ Arithmetic average:
  V_{k+1}(s) = (1/k) · Σ_{i=1}^{k} G_i

§ Incremental arithmetic average:
  V_{k+1}(s) = V_k(s) + (1/k) · (G_k − V_k(s))

§ Incremental weighted arithmetic average:
  § Weight of an episode: w_k
  § Sum of the weights over k episodes: W_k = Σ_{i=1}^{k} w_i
  V_{k+1}(s) = V_k(s) + (w_k / W_k) · (G_k − V_k(s))
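A small numerical check (illustrative, not from the slides) that the incremental update above reproduces the batch arithmetic average; the return values are made up:

```python
# Incremental arithmetic average: V <- V + (1/k) * (G_k - V)
returns = [0.93, -1.05, 0.40]          # hypothetical episode returns G_1, G_2, G_3
V, k = 0.0, 0
for G in returns:
    k += 1
    V += (G - V) / k                   # incremental update
print(V, sum(returns) / len(returns))  # same batch average (up to floating-point rounding)
```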

SLIDE 26

AVERAGING OBSERVED RETURNS

§ Exponentially-weighted average (moving average), with constant step size α:
  V_{k+1}(s) = V_k(s) + α · (G_k − V_k(s)) = (1 − α) · V_k(s) + α · G_k

§ The weights of past returns decrease exponentially:
  V_{k+1}(s) = (1 − α)^k · V_0(s) + Σ_{i=1}^{k} α · (1 − α)^{k−i} · G_i

  V_1(s) = (1 − α) · V_0(s) + α · G_1
  V_2(s) = (1 − α) · V_1(s) + α · G_2 = (1 − α) · [(1 − α) · V_0(s) + α · G_1] + α · G_2
         = (1 − α)² · V_0(s) + α · (1 − α) · G_1 + α · G_2

(Note: constant α vs. 1/k)
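And a matching check (again illustrative, not from the slides) that the constant-α recursion equals the explicit exponentially weighted sum written above:

```python
# Exponentially-weighted (moving) average with constant step size alpha.
alpha, V0 = 0.1, 0.0
returns = [0.93, -1.05, 0.40]          # hypothetical returns G_1, G_2, G_3

V = V0
for G in returns:
    V += alpha * (G - V)               # same as (1 - alpha) * V + alpha * G

k = len(returns)
V_explicit = (1 - alpha) ** k * V0 + sum(
    alpha * (1 - alpha) ** (k - i) * G for i, G in enumerate(returns, start=1))
print(V, V_explicit)                   # identical up to floating-point rounding
```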

SLIDE 27

DIRECT UTILITY ESTIMATION:

MONTE CARLO POLICY EVALUATION

1. Sample an episode following π
2. Observe the total return G_t (the reward collected along the sequence starting from s_t)
3. Use G_t as the learning target and update the sample estimate of the value V(s_t) of the starting state:

V(s_t) ← V(s_t) + α · (G_t − V(s_t))

~ Supervised learning error-correction

(works for both stationary and non-stationary environments)
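A minimal every-visit Monte Carlo policy-evaluation sketch along these lines (illustrative, not the lecture's code); sample_episode is an assumed helper that follows the fixed policy π and returns the visited (state, reward) pairs of one episode:

```python
# Every-visit Monte Carlo policy evaluation (illustrative sketch).
from collections import defaultdict

def mc_policy_evaluation(sample_episode, pi, n_episodes=1000, gamma=1.0, alpha=0.05):
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = sample_episode(pi)        # [(s_0, r_1), (s_1, r_2), ..., (s_{T-1}, r_T)]
        G = 0.0
        for s, r in reversed(episode):      # accumulate the return backwards
            G = r + gamma * G               # return observed from state s onwards
            V[s] += alpha * (G - V[s])      # move V(s) toward the observed return
    return V
```

The slide's rule updates the starting state s_t; updating every visited state with the return observed from it onwards is the usual every-visit variant of the same idea.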

SLIDE 28

DIRECT UTILITY ESTIMATION: MONTE CARLO POLICY EVALUATION

E.g., episode e_1 with rewards r_1, r_2, ⋯, r_T → sum of discounted rewards for e_1: r_1 + γ·r_2 + ⋯ + γ^(T−1)·r_T

[Diagram: a sample state trajectory generated by π, with the reward collected at each transition and the resulting episode return]
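For concreteness, a tiny helper (an assumption, not shown on the slide) that computes this discounted sum for one episode's reward sequence:

```python
def discounted_return(rewards, gamma):
    """G = r_1 + gamma*r_2 + ... + gamma**(T-1) * r_T."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Earlier grid-world episode: seven -0.01 steps and a final +1, with gamma = 1
print(discounted_return([-0.01] * 7 + [1.0], gamma=1.0))   # approximately 0.93
```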

SLIDE 29

DIRECT UTILITY ESTIMATION: MONTE CARLO POLICY EVALUATION

Episode 1 (start at (1,1)):
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,3), r = −0.01
s=(1,3), action = tright, s’=(2,3), r = −0.01
s=(2,3), action = tright, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(3,3), r = −0.01
s=(3,3), action = tright, s’=(4,3), r = 1
Episode 2 (start at (1,1)):
s=(1,1), action = tup, s’=(2,1), r = −0.01
s=(2,1), action = tright, s’=(3,1), r = −0.01
s=(3,1), action = tup, s’=(4,1), r = −0.01
s=(4,1), action = tleft, s’=(3,1), r = −0.01
s=(3,1), action = tup, s’=(3,2), r = −0.01
s=(3,2), action = tup, s’=(4,2), r = −1

SLIDE 30

MONTE CARLO POLICY EVALUATION

Blackjack example from Sutton and Barto, pp. 93–94

SLIDE 31

MONTE CARLO POLICY EVALUATION

What information is missing when performing Monte Carlo policy evaluation for our MDP? (it’s an MDP, we just do not know its parameters …)

The recursive Bellman relations among states!

Vπ(s) = Σ_{s′∈S} p(s′ | s, π(s)) · [ R(s, π(s), s′) + γ·Vπ(s′) ]    ∀ s ∈ S

We should exploit this structure in the MDP…

SLIDE 32

VALUE ITERATION VS. MC

V(s_t) ← V(s_t) + α · (G_t − V(s_t))

For a known MDP we could generate a sample by using the T and R models

SLIDE 33

TEMPORAL DIFFERENCES (TD) POLICY EVALUATION

TD bootstraps on available state information

Update at each step! Sample + Bellman: the local learning target is r + γ·V(s′), computed from the single observed transition s → s′ and used to update V(s)

SLIDE 34

TEMPORAL DIFFERENCE LEARNING

§ No explicit model of T or R!
§ Estimate V through samples
§ Update after every experience: update V(s) after each state transition (s, a, r, s′)
§ Likely outcomes s′ will contribute updates more often
§ Don’t need episodes / terminal states, can keep updating!
§ Temporal-difference learning of values
§ Policy still fixed, still doing evaluation!
§ Move values toward a sample of V: moving average

Bellman sample of V(s): sample = R(s, π(s), s′) + γ·V(s′)
Update to V(s): V(s) ← (1 − α)·V(s) + α·sample
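A minimal tabular TD(0) sketch of exactly this sample-and-update step (illustrative, not the lecture's code); env_step is an assumed function that samples one transition of the unknown MDP:

```python
# Tabular TD(0) policy evaluation (illustrative sketch).
# Assumes env_step(s, a) -> (s_next, r, done) samples one transition, and
# pi is a dict mapping each state to the fixed policy's action.
from collections import defaultdict

def td0_policy_evaluation(env_step, pi, start_state, n_episodes=1000, alpha=0.1, gamma=1.0):
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = start_state, False
        while not done:
            s_next, r, done = env_step(s, pi[s])
            sample = r + gamma * V[s_next]               # Bellman sample of V(s)
            V[s] = (1 - alpha) * V[s] + alpha * sample   # move V(s) toward the sample
            s = s_next
    return V
```

With α = 0.1 and γ = 1 this inner update reproduces the numbers in the worked example a few slides below (the first update of Vπ((1,1)) gives −0.001).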

SLIDE 35

TD EXAMPLE: GOING HOME

Value of a state: expected time to go = (your) predicted time to go (γ = 1). How does the value of the initial state change during the episode?

MC

TD: each error is proportional to the change over time of the prediction

Sutton and Barto book 2018 draft, Example 6.1, page 122

SLIDE 36

TD Learning Example Initialize all Vπ(s) values: Vπ(s) = 0

State   Vπ(s)
(1,1)    0
(1,2)    0
(1,3)    0
(4,1)    0
(2,1)    0
(2,3)    0
(3,1)    0
(3,2)    0
(3,3)    0


TD LEARNING EXAMPLE

SLIDE 37

TD Learning Example
s=(1,1), action = tup, s’=(1,2), r = −0.01 → update Vπ((1,1))

State   Vπ(s)
(1,1)    0
(1,2)    0
(1,3)    0
(4,1)    0
(2,1)    0
(2,3)    0
(3,1)    0
(3,2)    0
(3,3)    0

sample = −0.01 + γ·Vπ((1,2)) = −0.01
Vπ((1,1)) = (1 − α)·Vπ((1,1)) + α·sample = 0.9·0 + 0.1·(−0.01) = −0.001

α=0.1, γ=1


TD LEARNING EXAMPLE

SLIDE 38

TD Learning Example
s=(1,1), action = tup, s’=(1,2), r = −0.01
s=(1,2), action = tup, s’=(1,2), r = −0.01

State   Vπ(s)
(1,1)   −0.001
(1,2)    0
(1,3)    0
(4,1)    0
(2,1)    0
(2,3)    0
(3,1)    0
(3,2)    0
(3,3)    0

α=0.1, γ=1


TD LEARNING EXAMPLE

SLIDE 39

UPDATING AND BOOTSTRAPPING

TD online — MC batch — TD batch

When can V(s_t) be updated?
MC: needs at least r_{t+1} and must wait until the terminal state s_T to complete the sample (the full return)
TD: uses r_{t+1} + γ·(V estimate of the next state) — this is the Bellman sample, available after a single step
TD batch: waits until s_T and then updates V(s_{T−1}), V(s_{T−2}), ⋯, V(s_t)

Sample backup

SLIDE 40

MC VS. TD

Value function estimation → Prediction problem

Monte Carlo Policy Evaluation vs. TD Policy Evaluation
§ Both use sample backups: based on single sampled successor states, not on the complete distribution of all successors
§ MC — no bootstrapping: the estimates for each state are independent; batch: update at the end of the episode
§ TD — bootstrapping: the estimate for one state builds upon the estimate of another state; online: incremental updating

SLIDE 41

MC CONVERGENCE TO Vπ?

§ As the number of visits to state s goes to infinity (by increasing the number of episodes), each sampled return for s is an independent, identically distributed estimate of Vπ(s) (i.e., each return is a utility value, sampled according to the policy π)
§ By the law of large numbers, the average of the episode returns converges to the expected value Vπ(s) as the number of episodes tends to infinity
§ Each return is itself an unbiased estimate of Vπ(s)
§ After n episodes, the standard deviation of the sample average decreases as 1/√n

SLIDE 42

LAW OF LARGE NUMBERS (FROM PROBABILITY THEORY)

§ We want to estimate the expected value of a random variable X
§ Consider a set of n independent realizations of the variable X: X_1, X_2, X_3, ⋯, X_n (i.e., a trial process)
§ The random variables X_i are independent and identically distributed (being all different realizations of the same random variable X): E[X_i] = μ, the (finite) expected value, for all i = 1, ⋯, n
§ Let X̄_n = (Σ_{i=1}^{n} X_i) / n be the sample average of the n variables
§ Then, for every ε > 0:  P(|X̄_n − μ| > ε) → 0, as n → ∞

While nothing is more uncertain than the duration of a single life, nothing is more certain than the average duration of a thousand lives. (Elizur Wright)

§ Describes what happens when performing the same experiment many times: after many trials, the average of the results should be close to the expected value and will be more accurate with more trials. § For Monte Carlo this means that we can learn properties of a random variable (mean, variance, etc.) simply by observing it or simulating it over many trials.
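A quick simulation (an illustration, not part of the slides): the running sample average of i.i.d. rolls of a fair die drifts toward its expected value 3.5 as the number of trials grows:

```python
import random

random.seed(0)
avg = 0.0
for n in range(1, 100_001):
    x = random.randint(1, 6)        # one realization X_n of a fair die roll
    avg += (x - avg) / n            # incremental sample average of the first n rolls
    if n in (10, 1_000, 100_000):
        print(n, round(avg, 3))     # approaches the expected value 3.5
```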

SLIDE 43

MC PROPERTIES (VS. VALUE ITERATION / DP)

§ How do we generate the episodes?

§ Direct online interaction (chance comes from nature)
§ Simulation (for many problems, e.g. Blackjack, we don’t need to know precisely all the probabilities and rewards of the MDP!)

§ The computational cost of estimating the value of a single state is independent of the number of states (no bootstrapping)
§ Focus: if only a restricted number of states are of interest, generate sample episodes starting from those states

SLIDE 44

CONVERGENCE OF TD?

§ With constant α, the weight of the k-th sample decreases exponentially with k (needed in non-stationary environments); the estimates V converge in mean to the true Vπ if 0 ≤ α ≤ 1 is constant and sufficiently small
§ Convergence depends on the learning rate α, which quantifies how much a new sample backup r + γ·V(s′) changes the current estimate V(s)
§ If 0 ≤ α_k ≤ 1 is, for each state (action), a sequence of decreasing step-size values satisfying the Robbins–Monro conditions for stochastic approximation (e.g., α_k(s) = 1/k ∀ s ∈ S):
  § Σ_{k≥1} α_k(s) = ∞ — learning steps are large enough to overcome random fluctuations and initial conditions
  § Σ_{k≥1} α_k²(s) < ∞ — learning steps get sufficiently small to guarantee convergence without fluctuating forever
  then the estimates V converge with probability 1 to the true Vπ

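As a quick check (not on the slide), the schedule α_k = 1/k mentioned above indeed satisfies both Robbins–Monro conditions:

```latex
\sum_{k=1}^{\infty} \alpha_k \;=\; \sum_{k=1}^{\infty} \frac{1}{k} \;=\; \infty
\quad \text{(the harmonic series diverges)},
\qquad
\sum_{k=1}^{\infty} \alpha_k^{2} \;=\; \sum_{k=1}^{\infty} \frac{1}{k^{2}} \;=\; \frac{\pi^{2}}{6} \;<\; \infty .
```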
SLIDE 45

LIMITATIONS OF PASSIVE LEARNING

§ The agent ultimately wants to learn how to act so as to gather high reward in the environment
§ So far we have addressed prediction problems, not control problems
§ Following a given deterministic policy gives no experience for the other actions (those not prescribed by the policy)

Active reinforcement learning: the agent decides what action to take with the goal of learning an optimal policy