Reinforcement Learning++ Emma Brunskill (today) Ariel - PowerPoint PPT Presentation

Reinforcement Learning++
Emma Brunskill (today), Ariel Procaccia

Recall MDPs: What You Should Know
• Definition
• How to define for a problem
• Value iteration and policy iteration
  – How to implement
  – Convergence guarantees
  – Computational complexity

Reinforcement Learning
[Diagram: agent, action, state, reward; transition model (?) and reward model (?)]
Goal: Maximize expected sum of future rewards

Recap of Last Time
• Model-based RL when actions are selected randomly
  – Estimate a model of the dynamics and rewards from data (e.g. T(s1|s2,a2) ≈ 0.3)
  – Do MDP planning given those estimated models
• Q-learning
  – No model of dynamics and rewards
  – Directly estimate the state-action value function

Q-Learning
• At each step, for current state s and action a taken
  – Observe r and s'
  – Update Q(s,a)

  $\text{sampleQ}(s,a) = R(s,a,s') + \gamma \max_{a'} Q(s',a')$
  $Q(s,a) = (1-\alpha)\,Q(s,a) + \alpha \cdot \text{sampleQ}(s,a)$

• Intuition: using samples to approximate
  – Future rewards
  – Expectation over next states due to transition model uncertainty
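The two update equations above can be sketched as a single tabular update. This is a minimal illustration, not code from the lecture: the function name `q_learning_update`, the dictionary-based Q table, and the default values of `alpha` and `gamma` are all assumptions made for the example.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    # sampleQ(s, a) = r + gamma * max_a' Q(s', a')
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Q(s, a) = (1 - alpha) * Q(s, a) + alpha * sampleQ(s, a)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q[(s, a)]

# Hypothetical two-action problem; unseen (state, action) pairs start at 0.
Q = defaultdict(float)
actions = ["left", "right"]
new_value = q_learning_update(Q, "s1", "right", r=1.0, s_next="s2", actions=actions)
```

Because the update only needs the observed (s, a, r, s') tuple, no transition or reward model is ever built.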

Q-Learning Properties
• If acting randomly, Q-learning converges* to optimal state-action values, and therefore also finds the optimal policy
• Off-policy learning
  – Can act in one way
  – But learn the values of another policy (the optimal one!)

Towards Gathering High Reward
• Fortunately, acting randomly is sufficient, but not necessary, to learn the optimal values and policy

How to Act?
• Initialize s to a starting state
• Initialize Q(s,a) values
• For t=1,2,…
  – Choose a = argmax_a Q(s,a)
  – Observe s', r(s,a,s')
  – Update/Compute Q values

Is this Approach Guaranteed to Learn Optimal Policy?
• Initialize s to a starting state
• Initialize Q(s,a) values
• For t=1,2,…
  – Choose a = argmax_a Q(s,a)
  – Observe s', r(s,a,s')
  – Update/Compute Q values (using model-based or Q-learning approach)

1. Yes    2. No    3. Not sure

To Explore or Exploit?
Slide adapted from Klein and Abbeel. Drawing by Ketrina Yim.

Simple Approach: ε-greedy
• With probability 1-ε
  – Choose argmax_a Q(s,a)
• With probability ε
  – Select random action
• Guaranteed to compute optimal policy
• But even after millions of steps, still won't always be following the computed policy (the argmax_a Q(s,a))
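The selection rule above can be sketched in a few lines. The function name `epsilon_greedy`, the dictionary Q table, and the `rng` parameter are illustrative assumptions, not part of the slides.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    # Greedy choice: argmax_a Q(s, a); missing entries are treated as 0.
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

# Hypothetical Q values for a single state.
Q = {("s1", "left"): 0.2, ("s1", "right"): 0.7}
greedy_action = epsilon_greedy(Q, "s1", ["left", "right"], epsilon=0.0)
```

With a fixed ε > 0 the agent keeps taking random actions forever, which is why it never fully follows the greedy policy it has computed.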

Greedy in Limit of Infinite Exploration (GLIE)
• ε-greedy approach
• But decay epsilon over time
• Eventually will be following optimal policy almost all the time
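The slide does not say how epsilon should decay. One commonly used schedule, assumed here purely for illustration, is ε_t = 1/t: it shrinks to zero, so the agent becomes greedy in the limit, while the exploration probabilities still sum to infinity, so exploration never stops entirely. The name `glie_epsilon` is hypothetical.

```python
def glie_epsilon(t):
    """Exploration rate for step t >= 1 under the assumed 1/t schedule."""
    return 1.0 / t

# Exploration rate at a few sample steps.
schedule = [glie_epsilon(t) for t in (1, 2, 10, 100)]
```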

How should we evaluate the performance of an algorithm?

How should we evaluate the performance of an algorithm?
• Computational efficiency
• How much reward gathered under algorithm?

The Speed of Learning and Speeding Learning

Objectives for an RL Algorithm
• Asymptotic guarantees
  – In the limit, converge to a policy identical to the one that would be optimal if the unknown model parameters were known

Objectives for an RL Algorithm
• Asymptotic guarantees
  – In the limit, converge to a policy identical to the one that would be optimal if the unknown model parameters were known
  – Q-learning! (under what conditions?)
• Probably Approximately Correct
  – On all but a finite number of samples, choose an action whose expected reward is close to the expected reward of the action that would be taken if the model parameters were known
  – E^3 (Kearns & Singh), R-MAX (Brafman & Tennenholtz)

Model-Based RL
• Given data seen so far
• Build an explicit model of the MDP
• Compute policy for it
• Select action for current state given policy, observe next state and reward
• Repeat

R-max (Brafman & Tennenholtz)
[Figure: states S1, S2, …]

R-max is Model-based RL
[Diagram: cycle between "Think hard: estimate models & compute policies" and "Act in world"]
R-max leverages optimism under uncertainty!

R-max Algorithm: Initialize: Define "Known" MDP

Known/Unknown (U = unknown):
  S1    S2    S3    S4    …
  U     U     U     U
  U     U     U     U
  U     U     U     U
  U     U     U     U

Reward:
  S1    S2    S3    S4    …
  Rmax  Rmax  Rmax  Rmax
  Rmax  Rmax  Rmax  Rmax
  Rmax  Rmax  Rmax  Rmax
  Rmax  Rmax  Rmax  Rmax

Transition counts:
  S1    S2    S3    S4    …
  0     0     0     0
  0     0     0     0
  0     0     0     0
  0     0     0     0

In the "known" MDP, any unknown (s,a) pair has its dynamics set as a self loop and reward = Rmax.
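The initialization described on this slide can be sketched as follows. The function name `init_rmax`, the nested-list layout indexed as `[state][action]`, and the 4-by-4 size are assumptions made for the example; the slide itself only specifies the table contents.

```python
def init_rmax(n_states, n_actions, r_max):
    """Build the initial tables: nothing known, zero counts, optimistic rewards."""
    known = [[False] * n_actions for _ in range(n_states)]
    counts = [[0] * n_actions for _ in range(n_states)]
    # Every unknown (s, a) pair gets the optimistic reward r_max.
    reward = [[r_max] * n_actions for _ in range(n_states)]
    # Every unknown (s, a) pair is a self loop: stay in s with probability 1.
    trans = [[[1.0 if s2 == s else 0.0 for s2 in range(n_states)]
              for _ in range(n_actions)]
             for s in range(n_states)]
    return known, counts, reward, trans

known, counts, reward, trans = init_rmax(n_states=4, n_actions=4, r_max=1.0)
```

Setting every unexplored pair to the best possible reward is what makes unvisited state-action pairs attractive to the planner.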

R-max Algorithm
[Flowchart: Plan in known MDP]

R-max: Planning
• Compute optimal policy π_known for "known" MDP

Exercise: What Will Initial Value of Q(s,a) be for each (s,a) Pair in the Known MDP? What is the Policy?

Known/Unknown (U = unknown):
  S1    S2    S3    S4    …
  U     U     U     U
  U     U     U     U
  U     U     U     U
  U     U     U     U

Reward:
  S1    S2    S3    S4    …
  Rmax  Rmax  Rmax  Rmax
  Rmax  Rmax  Rmax  Rmax
  Rmax  Rmax  Rmax  Rmax
  Rmax  Rmax  Rmax  Rmax

Transition counts:
  S1    S2    S3    S4    …
  0     0     0     0
  0     0     0     0
  0     0     0     0
  0     0     0     0

In the "known" MDP, any unknown (s,a) pair has its dynamics set as a self loop and reward = Rmax.
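A sketch of the calculation, assuming an infinite-horizon discounted objective with discount factor γ < 1 (the discount is not stated on this slide): since every (s,a) pair is a self loop that pays Rmax on every step, the value is a geometric series.

```latex
Q(s,a) \;=\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}
\qquad \text{for every } (s,a).
```

Under that assumption all actions are tied in every state, so any policy is optimal in the initial known MDP.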

R-max Algorithm
[Flowchart: Plan in known MDP → Act using policy]
• Given optimal policy π_known for "known" MDP
• Take best action for current state π_known(s), transition to new state s' and get reward r

R-max Algorithm
[Flowchart: Plan in known MDP → Act using policy → Update state-action counts]

Update Known MDP

Known/Unknown (U = unknown):
  S1    S2    S3    S4    …
  U     U     U     U
  U     U     U     U
  U     U     U     U
  U     U     U     U

Reward:
  S1    S2    S3    S4    …
  Rmax  Rmax  Rmax  Rmax
  Rmax  Rmax  Rmax  Rmax
  Rmax  Rmax  Rmax  Rmax
  Rmax  Rmax  Rmax  Rmax

Transition counts:
  S1    S2    S3    S4    …
  0     0     0     0
  0     0     1     0
  0     0     0     0
  0     0     0     0

Increment counts for the state-action tuple just visited.

Update Known MDP

Known/Unknown (U = unknown, K = known):
  S1    S2    S3    S4    …
  U     U     U     U
  U     U     K     U
  U     U     U     U
  U     U     U     U

Reward:
  S1    S2    S3    S4    …
  Rmax  Rmax  Rmax  Rmax
  Rmax  Rmax  R     Rmax
  Rmax  Rmax  Rmax  Rmax
  Rmax  Rmax  Rmax  Rmax

Transition counts:
  S1    S2    S3    S4    …
  3     3     4     3
  2     4     5     0
  4     0     4     4
  2     2     4     1

If counts for (s,a) > N, (s,a) becomes known: use observed data to estimate the transition and reward model for (s,a) when planning.

Estimating MDP Model for a (s,a) Pair Given Data
• Transition model estimation
• Reward model estimation
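One standard way to fill in both estimates, assumed here rather than taken from the slide, is the count-based maximum-likelihood estimate. The symbols are introduced for this sketch: n(s,a) is the number of times (s,a) was tried, n(s,a,s') is how many of those tries landed in s', and r_i is the reward observed on the i-th try.

```latex
\hat{T}(s' \mid s, a) \;=\; \frac{n(s, a, s')}{n(s, a)},
\qquad
\hat{R}(s, a) \;=\; \frac{1}{n(s, a)} \sum_{i=1}^{n(s,a)} r_i .
```

For example, if (s,a) was tried 5 times and reached s' on 3 of them, this gives an estimated transition probability of 3/5 = 0.6.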

R-max Algorithm
[Flowchart: Plan in known MDP → Act using policy → Update state-action counts → Update known MDP dynamics & reward models → repeat]
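The full loop can be sketched on a tabular problem as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `value_iteration`, `rmax`, and `toy_step`, the discount `gamma`, the fixed number of planning sweeps, and the two-state toy environment are all hypothetical. It replans only when a pair becomes known, which matches the loop because the known MDP changes at no other time.

```python
def value_iteration(trans, reward, gamma, sweeps=200):
    """Return a greedy policy for the MDP given by trans[s][a][s2] and reward[s][a]."""
    n_s, n_a = len(reward), len(reward[0])
    v = [0.0] * n_s

    def q(s, a):
        return reward[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(trans[s][a]))

    for _ in range(sweeps):
        v = [max(q(s, a) for a in range(n_a)) for s in range(n_s)]
    return [max(range(n_a), key=lambda a: q(s, a)) for s in range(n_s)]


def rmax(step, n_s, n_a, r_max, n_threshold, gamma=0.9, horizon=200, s0=0):
    """Run the R-max loop for `horizon` steps; step(s, a) returns (next_state, reward)."""
    # Known MDP starts fully unknown: self loops that pay r_max.
    trans = [[[float(s2 == s) for s2 in range(n_s)] for _ in range(n_a)]
             for s in range(n_s)]
    reward = [[r_max] * n_a for _ in range(n_s)]
    counts = [[0] * n_a for _ in range(n_s)]
    next_counts = [[[0] * n_s for _ in range(n_a)] for _ in range(n_s)]
    reward_sums = [[0.0] * n_a for _ in range(n_s)]

    policy = value_iteration(trans, reward, gamma)       # plan in known MDP
    s = s0
    for _ in range(horizon):
        a = policy[s]                                    # act using policy
        s2, r = step(s, a)
        if counts[s][a] <= n_threshold:                  # pair is still unknown
            counts[s][a] += 1                            # update counts
            next_counts[s][a][s2] += 1
            reward_sums[s][a] += r
            if counts[s][a] > n_threshold:               # pair becomes known
                n = counts[s][a]
                trans[s][a] = [c / n for c in next_counts[s][a]]
                reward[s][a] = reward_sums[s][a] / n
                policy = value_iteration(trans, reward, gamma)   # replan
        s = s2
    return policy, counts


def toy_step(s, a):
    """Hypothetical 2-state world: action 0 stays, action 1 switches state."""
    s2 = s if a == 0 else 1 - s
    return s2, (1.0 if (s == 1 and a == 0) else 0.0)


policy, counts = rmax(toy_step, n_s=2, n_a=2, r_max=1.0, n_threshold=2)
```

In the toy run the agent first exhausts the unrewarding self loop in state 0, is then drawn to the still-optimistic switch action, and settles on staying in state 1 once that pair is known to pay the maximum reward.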

R-max Behavior
