

  1. Lecture 11: Fast Reinforcement Learning. Emma Brunskill, CS234 Reinforcement Learning, Winter 2018. With many slides from or derived from David Silver.

  2. Class Structure. Last time: Midterm! This time: Exploration and Exploitation. Next time: Batch RL.

  3. Atari: Focus on the x-axis (figure).

  4. Other Areas: Health, Education, ... Asymptotic convergence to a good/optimal policy is not enough.

  5. Table of Contents: 1 Metrics for evaluating RL algorithms; 2 Exploration and Exploitation; 3 Principles for RL Exploration; 4 Multi-Armed Bandits; 5 MDPs; 6 Principles for RL Exploration.

  6. Performance Criteria of RL Algorithms
     - Empirical performance
     - Convergence (to something ...)
     - Asymptotic convergence to optimal policy
     - Finite sample guarantees: probably approximately correct (PAC)
     - Regret (with respect to optimal decisions given the information available)
     - PAC uniform


  8. Strategic Exploration. To get stronger guarantees on performance, we need strategic exploration.

  9. Table of Contents: 1 Metrics for evaluating RL algorithms; 2 Exploration and Exploitation; 3 Principles for RL Exploration; 4 Multi-Armed Bandits; 5 MDPs; 6 Principles for RL Exploration.

  10. Exploration vs. Exploitation Dilemma. Online decision-making involves a fundamental choice: exploitation (make the best decision given current information) vs. exploration (gather more information). The best long-term strategy may involve short-term sacrifices: gather enough information to make the best overall decision.

  11. Examples
     - Restaurant selection: go off-campus, or eat at Treehouse (again)
     - Online advertisements: show the most successful ad, or show a different ad
     - Oil drilling: drill at the best known location, or drill at a new location
     - Game playing: play the move you believe is best, or play an experimental move

  12. Table of Contents: 1 Metrics for evaluating RL algorithms; 2 Exploration and Exploitation; 3 Principles for RL Exploration; 4 Multi-Armed Bandits; 5 MDPs; 6 Principles for RL Exploration.

  13. Principles: Naive Exploration; Optimistic Initialization; Optimism in the Face of Uncertainty; Probability Matching; Information State Search.

  14. Table of Contents: 1 Metrics for evaluating RL algorithms; 2 Exploration and Exploitation; 3 Principles for RL Exploration; 4 Multi-Armed Bandits; 5 MDPs; 6 Principles for RL Exploration.

  15. MABs. We will introduce the various principles for multi-armed bandits (MABs) first, instead of for generic reinforcement learning. MABs are a subclass of reinforcement learning, and simpler (as we will see shortly).

  16. Multi-Armed Bandits. A multi-armed bandit is a tuple $(\mathcal{A}, \mathcal{R})$: $\mathcal{A}$ is a known set of $m$ actions, and $\mathcal{R}^a(r) = \mathbb{P}[r \mid a]$ is an unknown probability distribution over rewards. At each step $t$ the agent selects an action $a_t \in \mathcal{A}$ and the environment generates a reward $r_t \sim \mathcal{R}^{a_t}$. Goal: maximize the cumulative reward $\sum_{\tau=1}^{t} r_\tau$.
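To make the tuple $(\mathcal{A}, \mathcal{R})$ concrete, here is a minimal bandit environment sketch in Python. The Bernoulli reward distributions, the class name, and the specific arm means are illustrative assumptions, not from the slides.

```python
import numpy as np

class BernoulliBandit:
    """A multi-armed bandit (A, R): m known actions, unknown reward distributions."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)  # hidden success probability of each arm
        self.m = len(self.means)                     # |A|: number of actions
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        """The agent selects action a; the environment returns r ~ R^a (here Bernoulli)."""
        return float(self.rng.random() < self.means[a])

# Illustrative instance with three arms; arm 2 is the (hidden) optimal arm.
bandit = BernoulliBandit(means=[0.3, 0.5, 0.7])
```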

  17. Greedy Algorithm. We consider algorithms that estimate $\hat{Q}_t(a) \approx Q(a)$. Estimate the value of each action by Monte-Carlo evaluation: $\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{\tau=1}^{t} r_\tau \mathbf{1}(a_\tau = a)$. The greedy algorithm selects the action with the highest estimated value: $a_t^* = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a)$. Greedy can lock onto a suboptimal action, forever.
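A minimal sketch of the greedy rule above, assuming the `BernoulliBandit` environment from the previous snippet; the function and variable names are mine, not from the lecture.

```python
def run_greedy(bandit, num_steps=1000):
    """Greedy bandit: Monte-Carlo estimate Q_hat(a), always pick the argmax."""
    N = np.zeros(bandit.m)            # N_t(a): number of times each arm was pulled
    Q = np.zeros(bandit.m)            # Q_hat_t(a): empirical mean reward per arm
    total_reward = 0.0
    for _ in range(num_steps):
        a = int(np.argmax(Q))         # greedy choice (ties broken by lowest index)
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]     # incremental form of the Monte-Carlo average
        total_reward += r
    return Q, N, total_reward
```

Because $\hat{Q}$ starts at zero and no exploration is forced, an early lucky payoff on a poor arm can keep it at the argmax indefinitely, which is the lock-in failure the slide describes.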

  18. $\epsilon$-Greedy Algorithm. With probability $1 - \epsilon$ select $a = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a)$; with probability $\epsilon$ select a random action. This will always make a sub-optimal decision an $\epsilon$ fraction of the time. We already used this in prior homeworks.
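The selection rule as a small helper, reusing the estimates `Q` from the greedy sketch above; the function name and the `rng` argument are mine.

```python
def epsilon_greedy_action(Q, epsilon, rng):
    """With probability 1 - epsilon exploit argmax Q_hat; with probability epsilon explore."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniformly random action
    return int(np.argmax(Q))               # exploit: current best estimate
```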

  19. Optimistic Initialization. Simple and practical idea: initialize $Q(a)$ to a high value. Update action values by incremental Monte-Carlo evaluation, starting with $N(a) > 0$: $\hat{Q}_t(a_t) = \hat{Q}_{t-1}(a_t) + \frac{1}{N_t(a_t)}\left(r_t - \hat{Q}_{t-1}(a_t)\right)$. This encourages systematic exploration early on, but can still lock onto a suboptimal action (depends on how high $Q$ is initialized).
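A sketch of optimistic initialization with the incremental update above, again on the illustrative `BernoulliBandit`; the particular values `q_init=1.0` and `n_init=1` are my assumptions for a reward range of $[0, 1]$.

```python
def run_optimistic_greedy(bandit, num_steps=1000, q_init=1.0, n_init=1):
    """Greedy with optimistically initialized Q_hat and N(a) > 0."""
    N = np.full(bandit.m, n_init, dtype=float)   # pretend each arm was already pulled n_init times
    Q = np.full(bandit.m, q_init, dtype=float)   # ...and paid out q_init each time (optimism)
    for _ in range(num_steps):
        a = int(np.argmax(Q))
        r = bandit.pull(a)
        N[a] += 1
        # Q_hat_t(a_t) = Q_hat_{t-1}(a_t) + (r_t - Q_hat_{t-1}(a_t)) / N_t(a_t)
        Q[a] += (r - Q[a]) / N[a]
        # An arm's optimistic estimate only decays once it is actually tried,
        # so every arm gets sampled early on.
    return Q, N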

  20. Decaying $\epsilon_t$-Greedy Algorithm. Pick a decay schedule for $\epsilon_1, \epsilon_2, \ldots$. Consider the following schedule: for $c > 0$ and $d = \min_{a : \Delta_a > 0} \Delta_a$, set $\epsilon_t = \min\left\{1, \frac{c\,|\mathcal{A}|}{d^2 t}\right\}$.
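The schedule written out in Python, assuming access to the true arm means so the gaps can be computed; as the slides note shortly, this makes it an oracle baseline rather than a practical algorithm. Names and defaults are mine.

```python
def epsilon_schedule(t, true_means, c=1.0):
    """epsilon_t = min(1, c*|A| / (d^2 * t)), with d the smallest nonzero gap (t >= 1)."""
    true_means = np.asarray(true_means, dtype=float)
    gaps = true_means.max() - true_means      # Delta_a for every arm
    d = gaps[gaps > 0].min()                  # requires knowing the gaps
    return min(1.0, c * len(true_means) / (d ** 2 * t))
```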

  21. How to Compare these Methods?
     - Empirical performance
     - Convergence (to something ...)
     - Asymptotic convergence to optimal policy
     - Finite sample guarantees: probably approximately correct (PAC)
     - Regret (with respect to optimal decisions given the information available): a very common criterion for bandit algorithms, and also frequently considered for reinforcement learning methods
     - PAC uniform

  22. Regret. The action-value is the mean reward for action $a$: $Q(a) = \mathbb{E}[r \mid a]$. The optimal value is $V^* = Q(a^*) = \max_{a \in \mathcal{A}} Q(a)$. Regret is the opportunity loss for one step: $l_t = \mathbb{E}[V^* - Q(a_t)]$. Total regret is the total opportunity loss: $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} \left(V^* - Q(a_\tau)\right)\right]$. Maximizing cumulative reward $\iff$ minimizing total regret.

  23. Evaluating Regret. The count $N_t(a)$ is the expected number of selections of action $a$. The gap $\Delta_a$ is the difference in value between action $a$ and the optimal action $a^*$: $\Delta_a = V^* - Q(a)$. Regret is a function of gaps and counts: $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} \left(V^* - Q(a_\tau)\right)\right] = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)]\left(V^* - Q(a)\right) = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)]\,\Delta_a$. A good algorithm ensures small counts for large gaps. But: the gaps are not known.
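In simulation the experimenter knows the true means, so the gap-times-count decomposition can be evaluated directly even though the algorithm never sees it. A minimal sketch, using the empirical counts from the runs above as a stand-in for $\mathbb{E}[N_t(a)]$; the helper name is mine.

```python
def total_regret(counts, true_means):
    """L_t = sum_a N_t(a) * Delta_a, with Delta_a = V* - Q(a)."""
    true_means = np.asarray(true_means, dtype=float)
    gaps = true_means.max() - true_means
    return float(np.dot(counts, gaps))

# Example: regret of the greedy run from earlier.
# Q, N, _ = run_greedy(bandit)
# print(total_regret(N, bandit.means))
```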

  24. Types of Regret Bounds. Problem independent: bound how regret grows as a function of $T$, the total number of time steps the algorithm operates for. Problem dependent: bound regret as a function of the number of times each arm is pulled and the gap between the reward of the pulled arm and the true optimal arm.

  25. "Good": Sublinear or Below Regret. Explore forever: linear total regret. Explore never: linear total regret. Is it possible to achieve sublinear regret?

  26. Greedy Bandit Algorithms and Optimistic Initialization
     - Greedy: linear total regret
     - Constant $\epsilon$-greedy: linear total regret
     - Decaying $\epsilon$-greedy: sublinear regret, but the schedule for decaying $\epsilon$ requires knowledge of the gaps, which are unknown
     - Optimistic initialization: sublinear regret if values are initialized sufficiently optimistically, else linear regret
     Check your understanding: why does fixed $\epsilon$-greedy have linear regret? (Do a proof sketch.)
