Lecture 2: Making Sequences of Good Decisions Given a Model of the World
Emma Brunskill, CS234 Reinforcement Learning, Winter 2020

SLIDE 1

Lecture 2: Making Sequences of Good Decisions Given a Model of the World

Emma Brunskill

CS234 Reinforcement Learning

Winter 2020

SLIDE 2

Refresh Your Knowledge 1. Piazza Poll

In a Markov decision process, a large discount factor γ means that short-term rewards are much more influential than long-term rewards. [Enter your answer in Piazza] True / False / Don’t know

  • False. A large γ implies we weigh delayed / long term rewards more.

γ = 0 only values immediate rewards

SLIDE 3

Today’s Plan

Last Time:

Introduction; components of an agent: model, value, policy

This Time:

Making good decisions given a Markov decision process

Next Time:

Policy evaluation when we don’t have a model of how the world works

SLIDE 4

Models, Policies, Values

Model: mathematical model of the dynamics and reward
Policy: function mapping the agent’s states to actions
Value function: expected future rewards from being in a state and/or taking an action when following a particular policy

SLIDE 5

Today: Given a model of the world

Markov Processes
Markov Reward Processes (MRPs)
Markov Decision Processes (MDPs)
Evaluation and Control in MDPs

SLIDE 6

Full Observability: Markov Decision Process (MDP)

MDPs can model a huge number of interesting problems and settings

Bandits: single-state MDP
Optimal control: mostly about continuous-state MDPs
Partially observable MDPs: an MDP where the state is the history

SLIDE 7

Recall: Markov Property

Information state: sufficient statistic of history
State s_t is Markov if and only if: p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t)
Future is independent of past given present

SLIDE 8

Markov Process or Markov Chain

Memoryless random process

Sequence of random states with Markov property

Definition of Markov Process

S is a (finite) set of states (s ∈ S)
P is the dynamics/transition model that specifies p(s_{t+1} = s' | s_t = s)

Note: no rewards, no actions
If there is a finite number (N) of states, can express P as a matrix:

P = [ P(s1|s1)  P(s2|s1)  ···  P(sN|s1)
      P(s1|s2)  P(s2|s2)  ···  P(sN|s2)
        ⋮          ⋮       ⋱      ⋮
      P(s1|sN)  P(s2|sN)  ···  P(sN|sN) ]

SLIDE 9

Example: Mars Rover Markov Chain Transition Matrix, P

!" !# !$ !% !& !' !(

0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.6 0.6 0.2 0.2 0.2 0.2 0.2

P =           0.6 0.4 0.4 0.2 0.4 0.4 0.2 0.4 0.4 0.2 0.4 0.4 0.2 0.4 0.4 0.2 0.4 0.4 0.6          

SLIDE 10

Example: Mars Rover Markov Chain Episodes

!" !# !$ !% !& !' !(

0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.6 0.6 0.2 0.2 0.2 0.2 0.2

Example: sample episodes starting from s4
  s4, s5, s6, s7, s7, s7, ...
  s4, s4, s5, s4, s5, s6, ...
  s4, s3, s2, s1, ...
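As a concrete illustration, here is a minimal sketch (assuming numpy and 0-based state indices) that builds the transition matrix read off the figure above and samples short episodes like the ones listed:

```python
import numpy as np

# Mars rover Markov chain: move left/right with prob 0.4, stay with prob 0.2
# (the endpoints s1 and s7 keep a 0.6 self-loop, since "moving off the end"
# collapses onto staying). States are 0-based: s1 -> 0, ..., s7 -> 6.
P = np.zeros((7, 7))
for s in range(7):
    P[s, max(s - 1, 0)] += 0.4   # try to move left
    P[s, min(s + 1, 6)] += 0.4   # try to move right
    P[s, s] += 0.2               # stay put
assert np.allclose(P.sum(axis=1), 1.0)

def sample_episode(P, start=3, length=6, rng=np.random.default_rng(0)):
    """Roll out a short episode of the chain starting from `start`."""
    states = [start]
    for _ in range(length):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

print(sample_episode(P))   # one sample episode starting from s4 (index 3)
```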

SLIDE 11

Markov Reward Process (MRP)

Markov Reward Process is a Markov chain + rewards

Definition of Markov Reward Process (MRP):
  S is a (finite) set of states (s ∈ S)
  P is the dynamics/transition model that specifies P(s_{t+1} = s' | s_t = s)
  R is a reward function R(s_t = s) = E[r_t | s_t = s]
  Discount factor γ ∈ [0, 1]

Note: no actions
If there is a finite number (N) of states, can express R as a vector

SLIDE 12

Example: Mars Rover MRP

!" !# !$ !% !& !' !(

0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.6 0.6 0.2 0.2 0.2 0.2 0.2

Reward: +1 in s1, +10 in s7, 0 in all other states

SLIDE 13

Return & Value Function

Definition of Horizon

Number of time steps in each episode
Can be infinite
Otherwise called a finite Markov reward process

Definition of Return, Gt (for a MRP)

Discounted sum of rewards from time step t to the horizon:
G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ···

Definition of State Value Function, V (s) (for a MRP)

Expected return from starting in state s:
V(s) = E[G_t | s_t = s] = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ··· | s_t = s]

SLIDE 14

Discount Factor

Mathematically convenient (avoids infinite returns and values)
Humans often act as if there’s a discount factor < 1
γ = 0: only care about immediate reward
γ = 1: future reward is as beneficial as immediate reward
If episode lengths are always finite, can use γ = 1

SLIDE 15

Example: Mars Rover MRP

!" !# !$ !% !& !' !(

0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.6 0.6 0.2 0.2 0.2 0.2 0.2

Reward: +1 in s1, +10 in s7, 0 in all other states
Sample returns for sample 4-step episodes, γ = 1/2:
  s4, s5, s6, s7: 0 + (1/2)·0 + (1/4)·0 + (1/8)·10 = 1.25

SLIDE 16

Example: Mars Rover MRP

!" !# !$ !% !& !' !(

0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.6 0.6 0.2 0.2 0.2 0.2 0.2

Reward: +1 in s1, +10 in s7, 0 in all other states
Sample returns for sample 4-step episodes, γ = 1/2:
  s4, s5, s6, s7: 0 + (1/2)·0 + (1/4)·0 + (1/8)·10 = 1.25
  s4, s4, s5, s4: 0 + (1/2)·0 + (1/4)·0 + (1/8)·0 = 0
  s4, s3, s2, s1: 0 + (1/2)·0 + (1/4)·0 + (1/8)·1 = 0.125

SLIDE 17

Example: Mars Rover MRP

!" !# !$ !% !& !' !(

0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.6 0.6 0.2 0.2 0.2 0.2 0.2

Reward: +1 in s1, +10 in s7, 0 in all other states
Value function: expected return from starting in state s
V(s) = E[G_t | s_t = s] = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ··· | s_t = s]
Sample returns for sample 4-step episodes, γ = 1/2:
  s4, s5, s6, s7: 0 + (1/2)·0 + (1/4)·0 + (1/8)·10 = 1.25
  s4, s4, s5, s4: 0 + (1/2)·0 + (1/4)·0 + (1/8)·0 = 0
  s4, s3, s2, s1: 0 + (1/2)·0 + (1/4)·0 + (1/8)·1 = 0.125

V = [1.53 0.37 0.13 0.22 0.85 3.59 15.31]
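A small sketch of the return arithmetic above (γ = 1/2, rewards +1 in s1, +10 in s7, 0 elsewhere), with episodes written using 0-based state indices (s1 → 0):

```python
R = [1, 0, 0, 0, 0, 0, 10]   # reward in s1..s7
gamma = 0.5

def episode_return(states, R, gamma):
    """Discounted return G = sum_t gamma^t * R[s_t] of a state sequence."""
    return sum((gamma ** t) * R[s] for t, s in enumerate(states))

print(episode_return([3, 4, 5, 6], R, gamma))   # s4,s5,s6,s7 -> 1.25
print(episode_return([3, 3, 4, 3], R, gamma))   # s4,s4,s5,s4 -> 0.0
print(episode_return([3, 2, 1, 0], R, gamma))   # s4,s3,s2,s1 -> 0.125
```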

SLIDE 18

Computing the Value of a Markov Reward Process

Could estimate by simulation

Generate a large number of episodes
Average the returns
Concentration inequalities bound how quickly the average concentrates to the expected value
Requires no assumption of Markov structure

SLIDE 19

Computing the Value of a Markov Reward Process

Could estimate by simulation
Markov property yields additional structure
MRP value function satisfies:
V(s) = R(s) + γ Σ_{s'∈S} P(s'|s) V(s')
       (immediate reward) + (discounted sum of future rewards)

SLIDE 20

Matrix Form of Bellman Equation for MRP

For a finite-state MRP, we can express V(s) using a matrix equation:

[ V(s1) ]   [ R(s1) ]       [ P(s1|s1) ··· P(sN|s1) ] [ V(s1) ]
[   ⋮   ] = [   ⋮   ] + γ   [    ⋮      ⋱     ⋮     ] [   ⋮   ]
[ V(sN) ]   [ R(sN) ]       [ P(s1|sN) ··· P(sN|sN) ] [ V(sN) ]

V = R + γPV

SLIDE 21

Analytic Solution for Value of MRP

For a finite-state MRP, the matrix equation V = R + γPV can be solved directly:

V − γPV = R
(I − γP)V = R
V = (I − γP)^{-1} R

Solving directly requires taking a matrix inverse, ~O(N^3)
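A minimal numpy sketch of this analytic solution on the Mars rover MRP from the earlier slides (a linear solve is used instead of forming the inverse explicitly); the result matches the V vector quoted above:

```python
import numpy as np

# Mars rover MRP (0-based states): tridiagonal dynamics, rewards +1 / +10 at the ends.
P = np.zeros((7, 7))
for s in range(7):
    P[s, max(s - 1, 0)] += 0.4
    P[s, min(s + 1, 6)] += 0.4
    P[s, s] += 0.2
R = np.array([1.0, 0, 0, 0, 0, 0, 10])
gamma = 0.5

# Solve (I - gamma * P) V = R directly.
V = np.linalg.solve(np.eye(7) - gamma * P, R)
print(np.round(V, 2))   # approximately [ 1.53  0.37  0.13  0.22  0.85  3.59 15.31]
```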

SLIDE 22

Iterative Algorithm for Computing Value of a MRP

Dynamic programming
Initialize V_0(s) = 0 for all s
For k = 1 until convergence:
  For all s in S:
    V_k(s) = R(s) + γ Σ_{s'∈S} P(s'|s) V_{k−1}(s')
Computational complexity: O(|S|^2) per iteration (|S| = N)
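The same value function can be computed with the iterative backup above; a short sketch, reusing P, R, and gamma from the previous snippet:

```python
import numpy as np

V = np.zeros(7)
for k in range(10_000):
    V_new = R + gamma * P @ V          # V_k(s) = R(s) + γ Σ_{s'} P(s'|s) V_{k-1}(s')
    if np.max(np.abs(V_new - V)) < 1e-10:
        break                          # stop once the backup no longer changes V
    V = V_new
print(np.round(V_new, 2))              # converges to the analytic solution above
```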

SLIDE 23

Markov Decision Process (MDP)

Markov Decision Process is a Markov Reward Process + actions

Definition of MDP:
  S is a (finite) set of Markov states s ∈ S
  A is a (finite) set of actions a ∈ A
  P is the dynamics/transition model for each action, specifying P(s_{t+1} = s' | s_t = s, a_t = a)
  R is a reward function¹: R(s_t = s, a_t = a) = E[r_t | s_t = s, a_t = a]
  Discount factor γ ∈ [0, 1]

MDP is a tuple: (S, A, P, R, γ)

¹ Reward is sometimes defined as a function of the current state, or as a function of the (state, action, next state) tuple. Most frequently in this class, we will assume reward is a function of state and action.

SLIDE 24

Example: Mars Rover MDP

!" !# !$ !% !& !' !(

P(s′|s, a1) =           1 1 1 1 1 1 1           P(s′|s, a2) =           1 1 1 1 1 1 1          

2 deterministic actions

SLIDE 25

MDP Policies

Policy specifies what action to take in each state

Can be deterministic or stochastic

For generality, consider as a conditional distribution

Given a state, specifies a distribution over actions

Policy: π(a|s) = P(at = a|st = s)

SLIDE 26

MDP + Policy

MDP + π(a|s) = Markov Reward Process
Precisely, it is the MRP (S, R^π, P^π, γ), where
  R^π(s) = Σ_{a∈A} π(a|s) R(s, a)
  P^π(s'|s) = Σ_{a∈A} π(a|s) P(s'|s, a)
Implies we can use the same techniques to evaluate the value of a policy for an MDP as we used to compute the value of an MRP, by defining an MRP with R^π and P^π (see the sketch below)
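A sketch of this reduction, with array shapes chosen purely for illustration (P[a, s, s'] for the dynamics, R[s, a] for rewards, pi[s, a] for π(a|s)):

```python
import numpy as np

def mdp_to_mrp(P, R, pi):
    """Collapse an MDP and a stochastic policy into the induced MRP.

    P  : (A, S, S) array, P[a, s, s'] = P(s'|s, a)
    R  : (S, A) array,    R[s, a]
    pi : (S, A) array,    pi[s, a] = π(a|s), each row summing to 1
    """
    R_pi = np.einsum('sa,sa->s', pi, R)      # R^π(s)    = Σ_a π(a|s) R(s, a)
    P_pi = np.einsum('sa,ast->st', pi, P)    # P^π(s'|s) = Σ_a π(a|s) P(s'|s, a)
    return R_pi, P_pi
```

Once reduced, either the analytic solve or the iterative backup shown for MRPs evaluates the policy.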

SLIDE 27

MDP Policy Evaluation, Iterative Algorithm

Initialize V_0(s) = 0 for all s
For k = 1 until convergence:
  For all s in S:
    V^π_k(s) = r(s, π(s)) + γ Σ_{s'∈S} p(s'|s, π(s)) V^π_{k−1}(s')

This is a Bellman backup for a particular policy

SLIDE 28

Example: MDP 1 Iteration of Policy Evaluation, Mars Rover Example

Dynamics: p(s6|s6, a1) = 0.5, p(s7|s6, a1) = 0.5, ...
Reward: for all actions, +1 in state s1, +10 in state s7, 0 otherwise
Let π(s) = a1 ∀s; assume V_k = [1 0 0 0 0 0 10], k = 1, γ = 0.5

For all s in S:
  V^π_k(s) = r(s, π(s)) + γ Σ_{s'∈S} p(s'|s, π(s)) V^π_{k−1}(s')

V_{k+1}(s6) = r(s6, a1) + γ · 0.5 · V_k(s6) + γ · 0.5 · V_k(s7)
V_{k+1}(s6) = 0 + 0.5 · 0.5 · 0 + 0.5 · 0.5 · 10
V_{k+1}(s6) = 2.5
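A sketch of the same computation: a generic tabular policy-evaluation loop (for an induced MRP) plus the single s6 backup from the example; only the s6 transitions are given on the slide, so only that entry is reproduced:

```python
import numpy as np

def policy_evaluation(P_pi, R_pi, gamma, tol=1e-8):
    """Iterate V_k = R_pi + γ P_pi V_{k-1} to convergence for a tabular MRP."""
    V = np.zeros(len(R_pi))
    while True:
        V_new = R_pi + gamma * P_pi @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# One backup for s6 under π(s) = a1, reproducing the slide's numbers:
gamma = 0.5
V_k = np.array([1.0, 0, 0, 0, 0, 0, 10])
V_next_s6 = 0.0 + gamma * (0.5 * V_k[5] + 0.5 * V_k[6])   # r(s6,a1)=0, p(s6|s6,a1)=p(s7|s6,a1)=0.5
print(V_next_s6)   # 2.5
```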

SLIDE 29

MDP Control

Compute the optimal policy: π*(s) = arg max_π V^π(s)
There exists a unique optimal value function
The optimal policy for an MDP in an infinite-horizon problem is deterministic

SLIDE 30

Check Your Understanding

!" !# !$ !% !& !' !(

7 discrete states (location of rover) 2 actions: Left or Right How many deterministic policies are there? 27 Is the optimal policy for a MDP always unique? No, there may be two actions that have the same optimal value function

SLIDE 31

MDP Control

Compute the optimal policy: π*(s) = arg max_π V^π(s)
There exists a unique optimal value function
The optimal policy for an MDP in an infinite-horizon problem (agent acts forever) is:
  Deterministic
  Stationary (does not depend on time step)
  Unique? Not necessarily; there may be state-action pairs with identical optimal values

SLIDE 32

Policy Search

One option is searching to compute the best policy
Number of deterministic policies is |A|^{|S|}
Policy iteration is generally more efficient than enumeration

SLIDE 33

MDP Policy Iteration (PI)

Set i = 0
Initialize π_0(s) randomly for all states s
While i == 0 or ||π_i − π_{i−1}||_1 > 0 (L1-norm, measures if the policy changed for any state):
  V^{π_i} ← MDP V function policy evaluation of π_i
  π_{i+1} ← policy improvement
  i = i + 1

SLIDE 34

New Definition: State-Action Value Q

State-action value of a policy:
Q^π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^π(s')
Take action a, then follow the policy π

SLIDE 35

Policy Improvement

Compute the state-action value of a policy π_i:
  For s in S and a in A:
    Q^{π_i}(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{π_i}(s')
Compute the new policy π_{i+1}, for all s ∈ S:
  π_{i+1}(s) = arg max_a Q^{π_i}(s, a)  ∀s ∈ S

SLIDE 36

MDP Policy Iteration (PI)

Set i = 0
Initialize π_0(s) randomly for all states s
While i == 0 or ||π_i − π_{i−1}||_1 > 0 (L1-norm, measures if the policy changed for any state):
  V^{π_i} ← MDP V function policy evaluation of π_i
  π_{i+1} ← policy improvement
  i = i + 1
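A compact policy-iteration sketch for a tabular MDP, assuming the same illustrative array layout as before (P[a, s, s'], R[s, a]); policy evaluation is done with an exact linear solve, and the loop stops when the greedy policy stops changing:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve (I - γ P_pi) V = R_pi exactly.
        P_pi = P[pi, np.arange(n_states)]           # P_pi[s, s'] = P(s'|s, pi(s))
        R_pi = R[np.arange(n_states), pi]           # R_pi[s] = R(s, pi(s))
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy with respect to Q^{pi}.
        Q = R.T + gamma * P @ V                     # Q[a, s] = R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')
        pi_new = np.argmax(Q, axis=0)
        if np.array_equal(pi_new, pi):              # policy unchanged => it can never change again
            return pi, V
        pi = pi_new
```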

SLIDE 37

Delving Deeper Into Policy Improvement Step

Q^{π_i}(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{π_i}(s')

SLIDE 38

Delving Deeper Into Policy Improvement Step

Q^{π_i}(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{π_i}(s')

max_a Q^{π_i}(s, a) ≥ R(s, π_i(s)) + γ Σ_{s'∈S} P(s'|s, π_i(s)) V^{π_i}(s') = V^{π_i}(s)

π_{i+1}(s) = arg max_a Q^{π_i}(s, a)

Suppose we take π_{i+1}(s) for one action, then follow π_i forever:
  Our expected sum of rewards is at least as good as if we had always followed π_i
But the new proposed policy is to always follow π_{i+1} ...

SLIDE 39

Monotonic Improvement in Policy

Definition: V^{π_1} ≥ V^{π_2} means V^{π_1}(s) ≥ V^{π_2}(s), ∀s ∈ S
Proposition: V^{π_{i+1}} ≥ V^{π_i}, with strict inequality if π_i is suboptimal, where π_{i+1} is the new policy we get from policy improvement on π_i

SLIDE 40

Proof: Monotonic Improvement in Policy

V^{π_i}(s) ≤ max_a Q^{π_i}(s, a) = max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{π_i}(s') ]

SLIDE 41

Proof: Monotonic Improvement in Policy

V^{π_i}(s) ≤ max_a Q^{π_i}(s, a)
           = max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{π_i}(s') ]
           = R(s, π_{i+1}(s)) + γ Σ_{s'∈S} P(s'|s, π_{i+1}(s)) V^{π_i}(s')            // by the definition of π_{i+1}
           ≤ R(s, π_{i+1}(s)) + γ Σ_{s'∈S} P(s'|s, π_{i+1}(s)) [ max_{a'} Q^{π_i}(s', a') ]
           = R(s, π_{i+1}(s)) + γ Σ_{s'∈S} P(s'|s, π_{i+1}(s)) [ R(s', π_{i+1}(s')) + γ Σ_{s''∈S} P(s''|s', π_{i+1}(s')) V^{π_i}(s'') ]
           ⋮
           = V^{π_{i+1}}(s)

SLIDE 42

Policy Iteration (PI): Check Your Understanding

Note: all of the below is for finite state-action spaces

Set i = 0
Initialize π_0(s) randomly for all states s
While i == 0 or ||π_i − π_{i−1}||_1 > 0 (L1-norm, measures if the policy changed for any state):
  V^{π_i} ← MDP V function policy evaluation of π_i
  π_{i+1} ← policy improvement
  i = i + 1

If the policy doesn’t change, can it ever change again? No
Is there a maximum number of iterations of policy iteration? Yes, |A|^{|S|}, since that is the maximum number of deterministic policies, and since the policy improvement step is monotonically improving, each policy can appear in only one round of policy iteration unless it is an optimal policy.

SLIDE 43

Policy Iteration (PI): Check Your Understanding

Suppose for all s ∈ S, π_{i+1}(s) = π_i(s)
Then for all s ∈ S and a ∈ A, Q^{π_{i+1}}(s, a) = Q^{π_i}(s, a)
Recall the policy improvement step:
  Q^{π_i}(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{π_i}(s')
  π_{i+1}(s) = arg max_a Q^{π_i}(s, a)
  π_{i+2}(s) = arg max_a Q^{π_{i+1}}(s, a) = arg max_a Q^{π_i}(s, a)
Therefore the policy can never change again

SLIDE 44

MDP: Computing Optimal Policy and Optimal Value

Policy iteration computes the optimal value and policy
Value iteration is another technique

Idea: maintain the optimal value of starting in a state s if we have a finite number of steps k left in the episode
Iterate to consider longer and longer episodes

SLIDE 45

Bellman Equation and Bellman Backup Operators

The value function of a policy must satisfy the Bellman equation:
V^π(s) = R^π(s) + γ Σ_{s'∈S} P^π(s'|s) V^π(s')

Bellman backup operator:
  Applied to a value function
  Returns a new value function
  Improves the value if possible
  BV(s) = max_a [ R(s, a) + γ Σ_{s'∈S} p(s'|s, a) V(s') ]
  BV yields a value function over all states s

SLIDE 46

Value Iteration (VI)

Set k = 1
Initialize V_0(s) = 0 for all states s
Loop until [finite horizon, convergence]:
  For each state s:
    V_{k+1}(s) = max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ]
  View as a Bellman backup on the value function: V_{k+1} = B V_k
  π_{k+1}(s) = arg max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ]
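A value-iteration sketch under the same assumed layout (P[a, s, s'], R[s, a]); each pass applies the Bellman backup B to the whole value vector, and a greedy policy is extracted at the end:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R.T + gamma * P @ V        # Q[a, s] = R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')
        V_new = Q.max(axis=0)          # Bellman backup: V_{k+1} = B V_k
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=0)     # value estimate and a greedy policy
```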

SLIDE 47

Policy Iteration as Bellman Operations

The Bellman backup operator B^π for a particular policy is defined as:
B^π V(s) = R^π(s) + γ Σ_{s'∈S} P^π(s'|s) V(s')
Policy evaluation amounts to computing the fixed point of B^π
To do policy evaluation, repeatedly apply the operator until V stops changing:
V^π = B^π B^π ··· B^π V

SLIDE 48

Policy Iteration as Bellman Operations

The Bellman backup operator B^π for a particular policy is defined as:
B^π V(s) = R^π(s) + γ Σ_{s'∈S} P^π(s'|s) V(s')
To do policy improvement:
π_{k+1}(s) = arg max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{π_k}(s') ]

SLIDE 49

Going Back to Value Iteration (VI)

Set k = 1
Initialize V_0(s) = 0 for all states s
Loop until [finite horizon, convergence]:
  For each state s:
    V_{k+1}(s) = max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ]
  Equivalently, in Bellman backup notation: V_{k+1} = B V_k

To extract the optimal policy if we can act for k + 1 more steps:
π(s) = arg max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_{k+1}(s') ]

SLIDE 50

Contraction Operator

Let O be an operator, and |x| denote (any) norm of x
If |OV − OV'| ≤ |V − V'|, then O is a contraction operator

SLIDE 51

Will Value Iteration Converge?

Yes, if the discount factor γ < 1, or if we end up in a terminal state with probability 1
The Bellman backup is a contraction if the discount factor γ < 1
If we apply it to two different value functions, the distance between the value functions shrinks after applying the Bellman backup to each

SLIDE 52

Proof: Bellman Backup is a Contraction on V for γ < 1

Let ||V − V'|| = max_s |V(s) − V'(s)| be the infinity norm

||BV_k − BV_j|| = || max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ] − max_{a'} [ R(s, a') + γ Σ_{s'∈S} P(s'|s, a') V_j(s') ] ||

SLIDE 53

Proof: Bellman Backup is a Contraction on V for γ < 1

Let ||V − V'|| = max_s |V(s) − V'(s)| be the infinity norm

||BV_k − BV_j||
  = || max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ] − max_{a'} [ R(s, a') + γ Σ_{s'∈S} P(s'|s, a') V_j(s') ] ||
  ≤ || max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') − R(s, a) − γ Σ_{s'∈S} P(s'|s, a) V_j(s') ] ||
  = || max_a [ γ Σ_{s'∈S} P(s'|s, a) (V_k(s') − V_j(s')) ] ||
  ≤ || max_a [ γ Σ_{s'∈S} P(s'|s, a) ||V_k − V_j|| ] ||
  = || max_a [ γ ||V_k − V_j|| Σ_{s'∈S} P(s'|s, a) ] ||
  = γ ||V_k − V_j||

Note: even if all the inequalities were equalities, this is still a contraction as long as γ < 1
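A quick numerical sanity check of this bound on a randomly generated MDP (the sizes and γ below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)                  # normalize rows into distributions
R = rng.random((n_states, n_actions))

def bellman_backup(V):
    return (R.T + gamma * P @ V).max(axis=0)       # (B V)(s) = max_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')]

for _ in range(1000):
    V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
    lhs = np.max(np.abs(bellman_backup(V1) - bellman_backup(V2)))
    rhs = gamma * np.max(np.abs(V1 - V2))
    assert lhs <= rhs + 1e-12
print("||BV1 - BV2|| <= γ ||V1 - V2|| held on all random samples")
```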

SLIDE 54

Opportunities for Out-of-Class Practice

Homework question: prove that value iteration converges to a unique solution for discrete state and action spaces with γ < 1
Does the initialization of values in value iteration impact anything?

SLIDE 55

Value Iteration for Finite Horizon H

V_k = optimal value if making k more decisions
π_k = optimal policy if making k more decisions

Initialize V_0(s) = 0 for all states s
For k = 1 : H
  For each state s:
    V_{k+1}(s) = max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ]
    π_{k+1}(s) = arg max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ]

SLIDE 56

Check Your Understanding: Finite Horizon Policies

Set k = 1
Initialize V_0(s) = 0 for all states s
Loop until k == H:
  For each state s:
    V_{k+1}(s) = max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ]
    π_{k+1}(s) = arg max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ]

Is the optimal policy stationary (independent of time step) in finite-horizon tasks? In general, no

SLIDE 57

Value vs Policy Iteration

Value iteration:

Compute optimal value for horizon = k

Note this can be used to compute optimal policy if horizon = k

Increment k

Policy iteration

Compute the infinite-horizon value of a policy
Use it to select another (better) policy
Closely related to a very popular method in RL: policy gradient

SLIDE 58

What You Should Know

Define MP, MRP, MDP, Bellman operator, contraction, model, Q-value, policy

Be able to implement:
  Value iteration
  Policy iteration

Give pros and cons of different policy evaluation approaches
Be able to prove contraction properties
Limitations of the presented approaches and Markov assumptions

Which policy evaluation methods require the Markov assumption?

SLIDE 59

Where We Are

Last Time:

Introduction; components of an agent: model, value, policy

This Time:

Making good decisions given a Markov decision process

Next Time:

Policy evaluation when we don’t have a model of how the world works

SLIDE 60

Policy Improvement

Compute the state-action value of a policy π_i:
Q^{π_i}(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{π_i}(s')

Note:
max_a Q^{π_i}(s, a) = max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{π_i}(s') ]
                    ≥ R(s, π_i(s)) + γ Σ_{s'∈S} P(s'|s, π_i(s)) V^{π_i}(s') = V^{π_i}(s)

Define the new policy, for all s ∈ S:
π_{i+1}(s) = arg max_a Q^{π_i}(s, a)  ∀s ∈ S

SLIDE 61

Policy Iteration (PI)

Set i = 0
Initialize π_0(s) randomly for all states s
While i == 0 or ||π_i − π_{i−1}||_1 > 0 (L1-norm):
  Policy evaluation of π_i
  i = i + 1
  Policy improvement:
    Q^{π_i}(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{π_i}(s')
    π_{i+1}(s) = arg max_a Q^{π_i}(s, a)

SLIDE 62

Policy Evaluation: Example & Check Your Understanding

!" !# !$ !% !& !' !(

Two actions Reward: for all actions, +1 in state s1, +10 in state s7, 0 otherwise Let π(s) = a1 ∀s. γ = 0. What is the value of this policy? Recall that V π

k (s) = r(s, π(s)) + γ

  • s′∈S

p(s′|s, π(s))V π

k−1(s′)

V π = [1 0 0 0 0 0 10]
