CSE 473: Artificial Intelligence

Reinforcement Learning

Dan Weld / University of Washington

Image from https://towardsdatascience.com/reinforcement-learning-multi-arm-bandit-implementation-5399ef67b24b [Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley – materials available at http://ai.berkeley.edu.]


Reinforcement Learning

§ Still assume there is a Markov decision process (MDP):
  § A set of states s ∈ S
  § A set of actions (per state) A
  § A model T(s, a, s')
  § A reward function R(s, a, s') & discount γ
§ Still looking for a policy π(s)
§ New twist: don't know T or R
  § I.e., we don't know which states are good or what the actions do
  § Must actually try actions and states out to learn



Offline (MDPs) vs. Online (RL)

§ Planning (Offline Solution): know T, R (the model is given)
§ Online Learning (RL): don't know T, R
§ Monte Carlo Planning: don't know T, R, but have a simulator
  § Differences from online RL: with MC planning 1) dying is ok; 2) you have a (re)set button
  § Most people call this RL as well

Reminder: Q-Value Iteration (for MDPs with known T, R)

Bellman backup over Q-values, where Vk(s') = Maxa' Qk(s', a'):

  Qk+1(s, a) ← Σs' T(s, a, s') [ R(s, a, s') + γ Vk(s') ]

§ Forall s, a
  § Initialize Q0(s, a) = 0 (no time steps left means an expected reward of zero)
§ K = 0
§ Repeat (do Bellman backups)
  § For every (s, a) pair, apply the backup above
  § K += 1
§ Until convergence (i.e., Q values don't change much)

We know this (T and R) when planning…. In RL we can only sample this (the experienced transitions and rewards). Problem: what if we don't know T, R?
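Not from the deck: a minimal Python sketch of the loop above for a tiny made-up MDP, with T and R supplied explicitly as dictionaries (all names here are illustrative).

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Q-value iteration for an MDP with known T and R (a sketch).
    T[s][a] is a list of (s_next, prob); R[(s, a, s_next)] is the reward."""
    Q = {(s, a): 0.0 for s in states for a in actions}   # Q0(s, a) = 0
    while True:
        Q_new = {}
        for s in states:
            for a in actions:
                # Bellman backup: sum over s' of T(s,a,s') * [R(s,a,s') + gamma * max_a' Q(s',a')]
                Q_new[(s, a)] = sum(
                    p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in T[s][a]
                )
        if max(abs(Q_new[k] - Q[k]) for k in Q) < tol:    # Q values don't change much
            return Q_new
        Q = Q_new

# Toy 2-state example (made up for illustration):
states, actions = ["s0", "s1"], ["stay", "go"]
T = {"s0": {"stay": [("s0", 1.0)], "go": [("s1", 1.0)]},
     "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]}}
R = {(s, a, s2): (1.0 if s2 == "s1" else 0.0) for s in states for a in actions for s2 in states}
print(q_value_iteration(states, actions, T, R))
```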


Reminder: Q Learning (for reinforcement learning)

§ Forall s, a
  § Initialize Q(s, a) = 0
§ Repeat forever
  § Where are you? s. Choose some action a, e.g. using ε-greedy or by maximizing Qe(s, a)
  § Execute it in the real world: (s, a, r, s')
  § Do update:
      difference ← [r + γ Maxa' Q(s', a')] − Q(s, a)
      Q(s, a) ← Q(s, a) + α (difference)

Problem: we don't want to store a table of Q(·,·)
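A tabular Q-learning sketch matching the update above (not from the deck). The environment interface (`env.reset()`, `env.step(s, a)`) and the ε-greedy choice are assumptions made for illustration; α is the learning rate.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, gamma=0.9, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy action selection (a sketch).
    Assumes env.reset() -> s and env.step(s, a) -> (r, s_next, done)."""
    Q = defaultdict(float)                        # Q(s, a) = 0 for all s, a
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Choose some action a, e.g. epsilon-greedy on the current Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            r, s_next, done = env.step(s, a)      # execute in the (real or simulated) world
            # difference <- [r + gamma * max_a' Q(s', a')] - Q(s, a)
            difference = r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)]
            Q[(s, a)] += alpha * difference
            s = s_next
    return Q
```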

Reminder: Approximate Q Learning

Represent Q with a linear combination of features instead of a table:

  Q(s, a) = w1 f1(s, a) + w2 f2(s, a) + … + wn fn(s, a)

§ Forall i
  § Initialize wi = 0
§ Repeat forever
  § Where are you? s. Choose some action a (Wait?! Which one? How?)
  § Execute it in the real world: (s, a, r, s')
  § Do update:
      difference ← [r + γ Maxa' Q(s', a')] − Q(s, a)
      Forall i do: wi ← wi + α (difference) fi(s, a)
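A sketch of the approximate Q-learning update with linear features, as described above (not from the deck); the feature function `features(s, a)` and the surrounding loop are assumptions made for illustration.

```python
def approx_q_update(w, features, s, a, r, s_next, actions, gamma=0.9, alpha=0.1):
    """One approximate Q-learning update, where Q(s, a) = sum_i w_i * f_i(s, a).
    `w` is the weight list and `features(s, a)` returns the feature values f_i(s, a)."""
    def q(state, action):
        return sum(wi * fi for wi, fi in zip(w, features(state, action)))
    # difference <- [r + gamma * max_a' Q(s', a')] - Q(s, a)
    difference = r + gamma * max(q(s_next, a2) for a2 in actions) - q(s, a)
    # Forall i: w_i <- w_i + alpha * difference * f_i(s, a)
    return [wi + alpha * difference * fi for wi, fi in zip(w, features(s, a))]
```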


Exploration vs. Exploitation




Questions

§ How to explore?
  § Random Exploration
  § Uniform exploration
  § Epsilon Greedy
  § Exploration Functions (such as UCB)
  § Thompson Sampling
§ When to exploit?
§ How to even think about this tradeoff?

Video of Demo Crawler Bot

More demos at: http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html


Epsilon-Greedy

§ With (small) probability ε, act randomly
§ With (large) probability 1 − ε, act on the current policy
§ Maybe decrease ε over time
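A minimal ε-greedy selection sketch (an illustration, not from the deck); the optional decay schedule follows the "maybe decrease ε over time" point and is just one common choice.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With (small) probability epsilon act randomly; otherwise act on the current policy (greedy on Q)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def decayed_epsilon(epsilon_0, decay, t):
    """One common way to decrease epsilon over time."""
    return epsilon_0 / (1.0 + decay * t)
```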

Evaluation

§ Is epsilon-greedy good?
§ Could any method be better?
§ How should we even THINK about this question?


Regret

§ Even if you learn the optimal policy, you still make mistakes along the way!
§ Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful sub-optimality, and optimal (expected) rewards
§ Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal

Two KINDS of Regret

§ Cumulative Regret:

§ Goal: achieve near optimal cumulative lifetime reward (in expectation)

§ Simple Regret:

§ Goal: quickly identify policy with high reward (in expectation)



Regret

[Figure: reward vs. time. One curve shows the reward from choosing the optimal action each time; below it, the reward of an exploration policy. An exploration policy that minimizes cumulative regret minimizes the (red) area between the two curves.]

Regret

[Figure: reward vs. time, with a marker at a future time t ("you are here" lies before t; you care about performance at times after t). An exploration policy that minimizes simple regret explores before t so as to minimize the (red) area after t.]


Offline (MDPs) vs. Online (RL)

§ Online Learning (RL): don't know T, R; acting in the real world, so minimize Cumulative Regret
§ Monte Carlo Planning: don't know T, R, but have a simulator, so minimize Simple Regret


RL on Single State MDP

§ Suppose the MDP has a single state and k actions
  § Can sample rewards of actions using a call to the simulator
  § Sampling action a is like pulling a slot machine arm with random payoff function R(s, a)

[Figure: a single state s with arms a1, a2, …, ak and random payoffs R(s, a1), R(s, a2), …, R(s, ak): the Multi-Armed Bandit Problem]

Slide adapted from Alan Fern (OSU)
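A tiny simulated bandit to make the single-state setup concrete (illustrative only; the Bernoulli payoffs and class name are assumptions, not from the deck).

```python
import random

class BernoulliBandit:
    """Single-state MDP with k arms: pulling arm a returns reward 1 with probability p[a]."""
    def __init__(self, p):
        self.p = p                              # true (unknown to the learner) payoff probabilities

    def pull(self, a):
        return 1.0 if random.random() < self.p[a] else 0.0

bandit = BernoulliBandit([0.2, 0.5, 0.7])       # k = 3 arms with made-up payoffs
print([bandit.pull(a) for a in range(3)])       # one sample from each arm
```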


Multi-Armed Bandits

§ Bandit algorithms are not just useful as components for RL & Monte-Carlo planning
§ Pure bandit problems arise in many applications
§ Applicable whenever:
  § a set of independent options with unknown utilities
  § a cost for sampling options or a limit on total samples
  § we want to find the best option or maximize the utility of the samples

Slide adapted from Alan Fern (OSU)


Multi-Armed Bandits: Example 1

Clinical Trials

§ Arms = possible treatments
§ Arm Pulls = application of a treatment to an individual
§ Rewards = outcome of the treatment
§ Objective = maximize cumulative reward = maximize benefit to the trial population (or find the best treatment quickly)

Slide adapted from Alan Fern (OSU)



Multi-Armed Bandits: Example 2


§ Online Advertising

§ Arms = different ads/ad-types for a web page
§ Arm Pulls = displaying an ad upon a page access
§ Rewards = click-through
§ Objective = maximize cumulative reward = maximum clicks (or find the best ad quickly)

Multi-Armed Bandit: Possible Objectives

§ PAC Objective:

§ find a near optimal arm w/ high probability

§ Cumulative Regret:

§ achieve near optimal cumulative reward over lifetime of pulling (in expectation)

§ Simple Regret:

§ quickly identify an arm with high reward (in expectation)


Slide adapted from Alan Fern (OSU)



Cumulative Regret Objective

§ Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step)
  § Optimal (in expectation) is to pull the optimal arm n times
  § Pull arms uniformly? (UniformBandit) ??

Slide adapted from Alan Fern (OSU)


Cumulative Regret Objective

§ Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step)
  § Optimal (in expectation) is to pull the optimal arm n times
  § UniformBandit is a poor choice: it wastes time on bad arms
  § Must balance exploring all arms to find good payoffs and exploiting current knowledge (pulling the best arm)

Slide adapted from Alan Fern (OSU)



Idea

• The problem is uncertainty… How to quantify?
• Error bars

If an arm has been sampled n times, then with probability at least 1 − δ:

  | μ̂ − μ | < sqrt( log(2/δ) / (2n) )

where μ is the arm's true mean reward and μ̂ its empirical mean.

Slide adapted from Travis Mandel (UW)

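A short sketch of the error bar above, computing an empirical mean and its confidence radius for rewards in [0, 1] (illustrative; the sample list and function name are made up).

```python
import math

def error_bar(samples, delta):
    """Hoeffding-style bound: with probability >= 1 - delta,
    |empirical mean - true mean| < sqrt(log(2/delta) / (2*n)) for rewards in [0, 1]."""
    n = len(samples)
    mean = sum(samples) / n
    radius = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return mean, radius

mean, radius = error_bar([1, 0, 1, 1, 0, 1, 1, 1], delta=0.05)
print(f"mean = {mean:.2f} +/- {radius:.2f}")
```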



Given Error bars, how do we act?

  • Optimism under uncertainty!
  • Why? If bad, we will soon find out!

Slide adapted from Travis Mandel (UW)


One last wrinkle

• How to set the confidence δ?
• Decrease it over time

If an arm has been sampled n times, then with probability at least 1 − δ:

  | μ̂ − μ | < sqrt( log(2/δ) / (2n) )

• A standard choice is δ = 1/t^4, where t is the total number of pulls so far; plugging it into the bound gives (up to constants) the sqrt(2 log(t) / n) bonus that UCB uses on the next slide.

Slide adapted from Travis Mandel (UW)



Upper Confidence Bound (UCB)

• 1. Play each arm once
• 2. Play the arm i that maximizes:  μ̂ᵢ + sqrt( 2 log(t) / nᵢ )
   (μ̂ᵢ = empirical mean reward of arm i, nᵢ = number of times arm i has been pulled, t = total number of pulls so far)
• 3. Repeat Step 2 forever

Slide adapted from Travis Mandel (UW)
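A compact UCB sketch following the three steps above (illustrative, not from the deck; `pull(a)` is an assumed reward sampler, here made-up Bernoulli arms).

```python
import math
import random

def ucb(pull, k, total_pulls=1000):
    """UCB: play each arm once, then repeatedly play the arm maximizing
    empirical_mean_i + sqrt(2 * log(t) / n_i), where t is the total number of pulls so far."""
    counts = [0] * k                  # n_i
    sums = [0.0] * k                  # total reward observed from arm i
    for a in range(k):                # 1. play each arm once
        sums[a] += pull(a)
        counts[a] += 1
    for t in range(k + 1, total_pulls + 1):
        # 2. play the arm maximizing empirical mean + exploration bonus
        a = max(range(k), key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]))
        sums[a] += pull(a)
        counts[a] += 1
    return [sums[i] / counts[i] for i in range(k)], counts

# Example with made-up Bernoulli arms:
probs = [0.2, 0.5, 0.7]
means, counts = ucb(lambda a: 1.0 if random.random() < probs[a] else 0.0, k=3)
print(means, counts)
```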


UCB Performance Guarantee

[Auer, Cesa-Bianchi, & Fischer, 2002]

Theorem: The expected cumulative regret of UCB, E[Reg_n], after n arm pulls is bounded by O(log n)

Is this good?
• Yes. The average per-step regret, E[Reg_n]/n, is O( log(n) / n )
• Theorem: No algorithm can achieve a better expected regret (up to constant factors)

Slide adapted from Alan Fern (OSU)



Putting UCB into Q-Learning !!!!

How to deal with multiple states????? (The multi-armed bandit assumes ONE state)

Recall UCB for the single-state case:
• 1. Play each arm once
• 2. Play the arm a that maximizes:  μ̂ₐ + sqrt( 2 log(t) / nₐ )
   (the first term is the expected reward; the second is a bonus for exploration)
• 3. Repeat Step 2 forever

UCB Balances Exploration & Exploitation

Single-state UCB score, for comparison:  ExpReward(a) + sqrt( 2 log(t) / NumberOfTimesExecuted(a) )

§ Forall s, a
  § Initialize Q(s, a) = 0, n_sa = 0
§ Repeat forever
  § Where are you? s. Choose the action with the highest Qe(s, a)
  § Execute it in the real world: (s, a, r, s')
  § Do update: n_sa += 1
      difference ← [r + γ Maxa' Qe(s', a')] − Qe(s, a)
      Qe(s, a) ← Qe(s, a) + α (difference)

Let n_sa be the number of times one has executed a in s, and let t = Σ_sa n_sa.
Let Qe(s, a) = Q(s, a) + sqrt( 2 log(t) / (1 + n_sa) ); the added term rewards exploration, but converges to zero.
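A sketch of this count-based exploration bonus folded into Q-learning (one reasonable reading of the slide, not its exact code): the learned table is Q, and the bonus is added on top when selecting actions and when bootstrapping. The environment interface is an assumption as before.

```python
import math
from collections import defaultdict

def q_learning_with_ucb_bonus(env, actions, episodes=1000, gamma=0.9, alpha=0.1):
    """Q-learning acting greedily on Qe(s, a) = Q(s, a) + sqrt(2*log(t) / (1 + n_sa)),
    where n_sa counts executions of a in s and t is the total experience so far.
    Assumes env.reset() -> s and env.step(s, a) -> (r, s_next, done)."""
    Q = defaultdict(float)
    n = defaultdict(int)

    def total():
        return sum(n.values()) + 1                      # t (offset to avoid log(0))

    def Qe(s, a):
        return Q[(s, a)] + math.sqrt(2.0 * math.log(total()) / (1 + n[(s, a)]))

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = max(actions, key=lambda a2: Qe(s, a2))  # choose action with highest Qe
            r, s_next, done = env.step(s, a)
            n[(s, a)] += 1
            # difference <- [r + gamma * max_a' Qe(s', a')] - Q(s, a)
            difference = r + gamma * max(Qe(s_next, a2) for a2 in actions) - Q[(s, a)]
            Q[(s, a)] += alpha * difference
            s = s_next
    return Q
```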


Video of Demo Q-learning – Exploration Function – Crawler


What Else ….

§ UCB is great when we care about cumulative regret
  § I.e., when the agent is acting in the real world
§ But sometimes all we care about is finding a good arm quickly
  § E.g., when we are training in a simulator
§ In these cases, "Simple Regret" is the better objective


Two KINDS of Regret

§ Cumulative Regret:

§ achieve near optimal cumulative lifetime reward (in expectation)

§ Simple Regret:

§ quickly identify policy with high reward (in expectation)


Simple Regret Objective

Protocol: At time step n the algorithm picks an "exploration" arm aₙ to pull and observes reward rₙ, and also picks an arm index it thinks is best, ĵₙ (aₙ, ĵₙ, and rₙ are random variables). If interrupted at time n the algorithm returns ĵₙ.

Expected Simple Regret (E[SReg_n]): the difference between R* and the expected reward of the arm ĵₙ selected by our strategy at time n:

  E[SReg_n] = R* − E[ R(ĵₙ) ]


How to Minimize Simple Regret?

What about UCB for simple regret?

• Theorem: The expected simple regret of UCB after n arm pulls is upper bounded by O(n^−c) for a constant c.

Seems good, but we can do much better (at least in theory).

§ Intuitively: UCB puts too much emphasis on pulling the best arm
§ After an arm is looking good, maybe better to see if ∃ a better arm


Incremental Uniform (or Round Robin)

Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19), 1832-1852

Algorithm:
§ At round n pull the arm with index (n mod k) + 1
§ At round n return the arm (if asked) with the largest average reward

• Theorem: The expected simple regret of Uniform after n arm pulls is upper bounded by O(e^−cn) for a constant c.
• This bound is exponentially decreasing in n!
• Compare to the polynomial rate for UCB, O(n^−c).
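A sketch of incremental uniform (round-robin) pulling with a final recommendation of the best empirical arm (illustrative; `pull(a)` is the same assumed sampler as before).

```python
import random

def uniform_bandit(pull, k, rounds=1000):
    """At round n pull arm (n mod k); if interrupted, recommend the arm with the largest average reward."""
    counts = [0] * k
    sums = [0.0] * k
    for n in range(rounds):
        a = n % k                                 # round-robin arm choice
        sums[a] += pull(a)
        counts[a] += 1
    return max(range(k), key=lambda i: sums[i] / counts[i] if counts[i] else 0.0)

probs = [0.2, 0.5, 0.7]                           # made-up Bernoulli arms
best = uniform_bandit(lambda a: 1.0 if random.random() < probs[a] else 0.0, k=3)
print("recommended arm:", best)
```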


Can we do even better?

Algorithm ε-Greedy (parameter ε, 0 < ε < 1), here with ε = 1/2:
§ At round n, with probability 1/2 pull the arm with the best average reward so far, otherwise pull one of the other arms at random
§ At round n return the arm (if asked) with the largest average reward

Tolpin, D. & Shimony, S. E. (2012). MCTS Based on Simple Regret. AAAI Conference on Artificial Intelligence.

• Theorem: The expected simple regret of ε-Greedy for ε = 0.5 after n arm pulls is upper bounded by O(e^−cn) for a constant c that is larger than the constant for Uniform (this holds for large enough n).
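A sketch of the 0.5-greedy strategy above (illustrative; same assumed `pull(a)` sampler, and at least two arms).

```python
import random

def half_greedy_bandit(pull, k, rounds=1000):
    """With probability 1/2 pull the arm with the best average so far, otherwise pull one of the
    other arms at random; recommend the arm with the largest average reward."""
    counts = [1] * k
    sums = [pull(a) for a in range(k)]            # start with one pull per arm
    for _ in range(rounds - k):
        best = max(range(k), key=lambda i: sums[i] / counts[i])
        if random.random() < 0.5:
            a = best
        else:
            a = random.choice([i for i in range(k) if i != best])
        sums[a] += pull(a)
        counts[a] += 1
    return max(range(k), key=lambda i: sums[i] / counts[i])

probs = [0.2, 0.5, 0.7]
best = half_greedy_bandit(lambda a: 1.0 if random.random() < probs[a] else 0.0, k=3)
print("recommended arm:", best)
```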

Summary of Bandits in Theory

PAC Objective:
§ UniformBandit is a simple PAC algorithm
§ MedianElimination improves by a factor of log(k) and is optimal up to constant factors

Cumulative Regret:
§ Uniform is very bad!
§ UCB is optimal (up to constant factors)
§ Thompson Sampling is also optimal; often performs better in practice

Simple Regret:
§ UCB shown to reduce regret at a polynomial rate
§ Uniform reduces at an exponential rate
§ 0.5-Greedy may have an even better exponential rate


That’s all for Reinforcement Learning!

§ Very tough problem: How to perform any task well in an unknown, noisy environment!
§ Traditionally used mostly for robotics, but…

[Figure: a Reinforcement Learning Agent turns Data (experiences with the environment) into a Policy (how to act in the future). Example: Google DeepMind, RL applied to data center power usage.]