

  1. CSE 473: Artificial Intelligence: Reinforcement Learning
     Dan Weld / University of Washington
     Image from https://towardsdatascience.com/reinforcement-learning-multi-arm-bandit-implementation-5399ef67b24b
     [Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley; materials available at http://ai.berkeley.edu.]

     Reinforcement Learning
     § Still assume there is a Markov decision process (MDP):
       § A set of states s ∈ S
       § A set of actions (per state) A
       § A model T(s, a, s')
       § A reward function R(s, a, s') and a discount γ
     § Still looking for a policy π(s)
     § New twist: don't know T or R
       § I.e., we don't know which states are good or what the actions do
       § Must actually try actions and states out to learn

  2. Offline (MDPs) vs. Online (RL)
     § Planning (Offline Solution): know T, R
     § Monte Carlo Planning: have a simulator of T, R, but don't know T, R (most people call this RL as well)
     § Online Learning (RL): don't know T, R
     § Differences: with MC planning, 1) dying is OK; 2) you have a (re)set button

     Reminder: Q-Value Iteration (for MDPs with known T, R)
     § Forall s, a: initialize Q_0(s, a) = 0   (no time steps left means an expected reward of zero)
     § k = 0
     § Repeat: do Bellman backups. For every (s, a) pair:
         Q_{k+1}(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]   where V_k(s') = max_{a'} Q_k(s', a')
       then k += 1
     § Until convergence, i.e., the Q values don't change much
     § Problem: what if we don't know T, R? We know the form of this backup, and we can sample the transitions instead of summing over T
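As a concrete illustration of the Q-value iteration backup above, here is a minimal Python sketch on a tiny hypothetical two-state MDP; the states, actions, T, and R below are invented for illustration and are not from the slides.

```python
# Tiny hypothetical MDP: T[s][a] lists (next_state, probability); R maps (s, a, s') to reward.
T = {
    "A": {"stay": [("A", 1.0)], "go": [("B", 0.9), ("A", 0.1)]},
    "B": {"stay": [("B", 1.0)], "go": [("A", 1.0)]},
}
R = {("A", "go", "B"): 1.0}      # every other transition has reward 0
gamma = 0.9

# Forall s, a: initialize Q_0(s, a) = 0.
Q = {(s, a): 0.0 for s in T for a in T[s]}

# Repeat Bellman backups until the Q values stop changing much.
for _ in range(1000):
    newQ = {}
    for s in T:
        for a in T[s]:
            # Q_{k+1}(s,a) = sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * max_{a'} Q_k(s',a') ]
            newQ[(s, a)] = sum(
                p * (R.get((s, a, s2), 0.0) + gamma * max(Q[(s2, a2)] for a2 in T[s2]))
                for s2, p in T[s][a]
            )
    converged = max(abs(newQ[key] - Q[key]) for key in Q) < 1e-6
    Q = newQ
    if converged:
        break

print(Q)
```

Note that the backup needs T and R explicitly, which is exactly what the next slide's Q-learning avoids by sampling transitions.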

  3. Reminder: Q-Learning (for Reinforcement Learning)
     § Forall s, a: initialize Q(s, a) = 0
     § Repeat forever:
       § Where are you? s. Choose some action a, e.g. using ε-greedy or by maximizing Q(s, a)
       § Execute it in the real world: observe (s, a, r, s')
       § Do update:
           difference ← [r + γ max_{a'} Q(s', a')] − Q(s, a)
           Q(s, a) ← Q(s, a) + β (difference)
     § Problem: we don't want to store a table of Q(·, ·)

     Reminder: Approximate Q-Learning
     Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)
     § Forall i: initialize w_i = 0
     § Repeat forever:
       § Where are you? s. Choose some action a
       § Execute it in the real world: observe (s, a, r, s')
       § Do update:
           difference ← [r + γ max_{a'} Q(s', a')] − Q(s, a)
           Q(s, a) ← Q(s, a) + β (difference)
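A minimal Python sketch of the tabular Q-learning loop above. The `ToyEnv` environment and its `reset`/`step` interface are illustrative assumptions added so the sketch runs; they are not part of the slides.

```python
import random
from collections import defaultdict

class ToyEnv:
    """Hypothetical 2-state chain, used only to make the sketch runnable."""
    def reset(self):
        return 0
    def step(self, s, a):
        # From state 0, "go" reaches state 1 and pays reward 1; everything else pays 0.
        if s == 0 and a == "go":
            return 1.0, 1
        return 0.0, 0

def q_learning(env, actions, steps=10_000, gamma=0.9, beta=0.1, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy choice of actions."""
    Q = defaultdict(float)            # Forall s, a: Q(s, a) starts at 0
    s = env.reset()                   # "Where are you? s."
    for _ in range(steps):
        # Choose some action a, e.g. epsilon-greedy on the current Q.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        r, s_next = env.step(s, a)    # execute it in the world: (s, a, r, s')
        # difference <- [r + gamma * max_{a'} Q(s', a')] - Q(s, a)
        difference = r + gamma * max(Q[(s_next, a_)] for a_ in actions) - Q[(s, a)]
        Q[(s, a)] += beta * difference   # Q(s, a) <- Q(s, a) + beta * (difference)
        s = s_next
    return Q

print(dict(q_learning(ToyEnv(), actions=["stay", "go"])))
```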

  4. Reminder: Approximate Q-Learning
     Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)
     § Forall i: initialize w_i = 0
     § Repeat forever:
       § Where are you? s. Choose some action a   (Wait?! Which one? How?)
       § Execute it in the real world: observe (s, a, r, s')
       § Do update:
           difference ← [r + γ max_{a'} Q(s', a')] − Q(s, a)
           Forall i: w_i ← w_i + β (difference) f_i(s, a)
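A minimal Python sketch of the linear (approximate) Q-learning update above; the `env` and `features` interfaces are assumed placeholders, not from the slides.

```python
import random

def approx_q_learning(env, actions, features, n_features, steps=5000,
                      gamma=0.9, beta=0.01, epsilon=0.1):
    """Q-learning with a linear approximation Q(s, a) = sum_i w_i * f_i(s, a)."""
    w = [0.0] * n_features                       # Forall i: initialize w_i = 0

    def Q(s, a):
        return sum(wi * fi for wi, fi in zip(w, features(s, a)))

    s = env.reset()
    for _ in range(steps):
        # Choose some action a (here: epsilon-greedy on the approximate Q).
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q(s, a_))
        r, s_next = env.step(s, a)               # execute it in the world: (s, a, r, s')
        # difference <- [r + gamma * max_{a'} Q(s', a')] - Q(s, a)
        difference = r + gamma * max(Q(s_next, a_) for a_ in actions) - Q(s, a)
        f = features(s, a)
        for i in range(n_features):
            # Forall i: w_i <- w_i + beta * (difference) * f_i(s, a)
            w[i] += beta * difference * f[i]
        s = s_next
    return w
```

The same hypothetical ToyEnv from the previous sketch could drive this, e.g. with one indicator feature per (state, action) pair, in which case the weight update reduces exactly to the tabular update.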

  5. Exploration vs. Exploitation

     Questions
     § How to explore?
       § Uniform exploration
       § Epsilon-greedy
       § Exploration functions (such as UCB)
       § Thompson sampling
     § When to exploit?
     § How to even think about this tradeoff?

  6. Questions
     § How to explore?
       § Random exploration
       § Uniform exploration
       § Epsilon-greedy
       § Exploration functions (such as UCB)
       § Thompson sampling
     § When to exploit?
     § How to even think about this tradeoff?

     Video of Demo: Crawler Bot
     More demos at: http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html

  7. Epsilon-Greedy
     § With (small) probability ε, act randomly
     § With (large) probability 1 − ε, act on the current policy
     § Maybe decrease ε over time

     Evaluation
     § Is epsilon-greedy good?
     § Could any method be better?
     § How should we even THINK about this question?
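A small sketch of the epsilon-greedy choice described above. The decay schedule ε = ε0 / t is just one illustrative way to decrease ε over time, not something prescribed by the slides.

```python
import random

def epsilon_greedy(Q, s, actions, t, epsilon0=1.0):
    """With (small) probability epsilon act randomly; otherwise act on the current Q.
    epsilon = epsilon0 / t is one illustrative way to decrease epsilon over time."""
    epsilon = epsilon0 / max(t, 1)
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit current estimates

# Example: once t is large, we almost always exploit the better-looking action.
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.7}
print(epsilon_greedy(Q, "s0", ["left", "right"], t=1000))
```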

  8. Regret
     § Even if you learn the optimal policy, you still make mistakes along the way!
     § Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful sub-optimality, and the optimal (expected) rewards
     § Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal

     Two Kinds of Regret
     § Cumulative regret: goal is to achieve near-optimal cumulative lifetime reward (in expectation)
     § Simple regret: goal is to quickly identify a policy with high reward (in expectation)
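As a tiny worked example of cumulative regret, the numbers below are invented for illustration: regret accrues every time the learner takes an action other than the best one, even while it is still learning.

```python
# Cumulative regret = (expected reward of always acting optimally)
#                   - (expected reward actually collected while learning).
true_means = {"a1": 0.2, "a2": 0.8}          # hypothetical expected reward per action
pulls = ["a1", "a1", "a2", "a2", "a2"]       # what the learner actually did

best = max(true_means.values())
cumulative_regret = sum(best - true_means[a] for a in pulls)
print(cumulative_regret)                     # 2 * (0.8 - 0.2) = 1.2
```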

  9. Regret
     [Figure: reward vs. time. The top curve is the reward from choosing the optimal action each time; the gap between it and the learner's reward curve is the regret. An exploration policy that minimizes cumulative regret minimizes this (red) area over the whole lifetime.]

     [Figure: reward vs. time, with a marker "you are here" and a future time t after which you care about performance. An exploration policy that minimizes simple regret explores so as to minimize the (red) area after time t.]

  10. Offline (MDPs) vs. Online (RL)
      § Monte Carlo Planning (simulator): don't know T, R; minimize simple regret
      § Online Learning (RL): don't know T, R; minimize cumulative regret

      RL on a Single-State MDP
      § Suppose the MDP has a single state and k actions
      § Can sample rewards of actions using calls to a simulator
      § Sampling action a is like pulling a slot machine arm with a random payoff function R(s, a)
      § This is the Multi-Armed Bandit problem: one state s and arms a_1, a_2, …, a_k with payoffs R(s, a_1), R(s, a_2), …, R(s, a_k)
      Slide adapted from Alan Fern (OSU)
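A minimal sketch of this single-state, k-armed bandit view: pulling arm i returns one draw from its random payoff. The Bernoulli payoff probabilities below are an illustrative assumption.

```python
import random

class Bandit:
    """Single-state MDP with k actions: pulling arm i returns a random payoff."""
    def __init__(self, success_probs):
        self.success_probs = success_probs   # hypothetical Bernoulli payoff per arm

    def pull(self, i):
        # Sampling action a_i is one draw from the random payoff function R(s, a_i).
        return 1.0 if random.random() < self.success_probs[i] else 0.0

bandit = Bandit([0.3, 0.5, 0.7])              # k = 3 arms
rewards = [bandit.pull(i) for i in range(3)]  # try each arm once
print(rewards)
```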

  11. Multi-Armed Bandits
      § Bandit algorithms are not just useful as components for RL & Monte Carlo planning
      § Pure bandit problems arise in many applications
      § Applicable whenever:
        § there is a set of independent options with unknown utilities
        § there is a cost for sampling options, or a limit on total samples
        § we want to find the best option or maximize the utility of our samples
      Slide adapted from Alan Fern (OSU)

      Multi-Armed Bandits: Example 1: Clinical Trials
      § Arms = possible treatments
      § Arm pulls = application of a treatment to an individual
      § Rewards = outcome of the treatment
      § Objective = maximize cumulative reward = maximize benefit to the trial population (or find the best treatment quickly)
      Slide adapted from Alan Fern (OSU)

  12. Multi-Armed Bandits: Example 2: Online Advertising
      § Arms = different ads/ad-types for a web page
      § Arm pulls = displaying an ad upon a page access
      § Rewards = click-throughs
      § Objective = maximize cumulative reward = maximize clicks (or find the best ad quickly)

      Multi-Armed Bandit: Possible Objectives
      § PAC objective: find a near-optimal arm with high probability
      § Cumulative regret: achieve near-optimal cumulative reward over the lifetime of pulling (in expectation)
      § Simple regret: quickly identify an arm with high reward (in expectation)
      Slide adapted from Alan Fern (OSU)

  13. Cumulative Regret Objective
      § Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step)
      § Optimal (in expectation) is to pull the optimal arm n times
      § Pull arms uniformly (UniformBandit)? A poor choice: it wastes time on bad arms
      § Must balance exploring all arms to find good payoffs against exploiting current knowledge (pulling the best arm)
      Slide adapted from Alan Fern (OSU)

  14. Idea
      § The problem is uncertainty. How do we quantify it?
      § Error bars: if an arm has been sampled n times, then with probability at least 1 − ε:
          |μ̂ − μ| < sqrt( log(2/ε) / (2n) )
        where μ̂ is the arm's empirical mean reward and μ is its true mean
      Slide adapted from Travis Mandel (UW)

      Given error bars, how do we act?
      Slide adapted from Travis Mandel (UW)
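A small sketch of computing this error bar from samples, assuming rewards are bounded in [0, 1] (a Hoeffding-style bound consistent with the form on the slide).

```python
import math

def confidence_interval(samples, epsilon=0.05):
    """With probability at least 1 - epsilon,
    |empirical mean - true mean| < sqrt(log(2/epsilon) / (2 n)),
    assuming each sample lies in [0, 1]."""
    n = len(samples)
    mean = sum(samples) / n
    radius = math.sqrt(math.log(2 / epsilon) / (2 * n))
    return mean - radius, mean + radius

print(confidence_interval([1, 0, 1, 1, 0, 1, 1, 1, 0, 1]))
```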

  15. Given error bars, how do we act?
      § Optimism under uncertainty!
      § Why? If the arm is actually bad, we will soon find out!
      Slide adapted from Travis Mandel (UW)

      One last wrinkle
      § How do we set the confidence ε?
      § Decrease it over time (toward 0): if an arm has been sampled n times, then with probability at least 1 − ε,
          |μ̂ − μ| < sqrt( log(2/ε) / (2n) )
      Slide adapted from Travis Mandel (UW)

  16. Upper Confidence Bound (UCB)
      1. Play each arm once
      2. Play the arm i that maximizes:
           μ̂_i + sqrt( 2 log(t) / n_i )
         where μ̂_i is arm i's empirical mean reward, n_i is how many times arm i has been pulled, and t is the total number of pulls so far
      3. Repeat step 2 forever
      Slide adapted from Travis Mandel (UW)

      UCB Performance Guarantee [Auer, Cesa-Bianchi, & Fischer, 2002]
      § Theorem: the expected cumulative regret of UCB after n arm pulls, E[Reg_n], is bounded by O(log n)
      § Is this good? Yes: the average per-step regret is O(log(n) / n)
      § Theorem: no algorithm can achieve a better expected regret (up to constant factors)
      Slide adapted from Alan Fern (OSU)
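A minimal sketch of the UCB rule above on simulated Bernoulli arms; the arm probabilities and the finite horizon are illustrative assumptions (the slide's "repeat forever" becomes a fixed number of pulls here).

```python
import math
import random

def ucb1(success_probs, horizon=10_000):
    """UCB1: play each arm once, then always pull the arm maximizing
    empirical_mean_i + sqrt(2 * log(t) / n_i)."""
    k = len(success_probs)
    counts = [0] * k          # n_i: how many times arm i has been pulled
    sums = [0.0] * k          # total reward collected from arm i

    def pull(i):              # hypothetical Bernoulli payoff R(s, a_i)
        return 1.0 if random.random() < success_probs[i] else 0.0

    for i in range(k):        # step 1: play each arm once
        sums[i] += pull(i)
        counts[i] += 1
    for t in range(k + 1, horizon + 1):   # steps 2-3: apply the UCB rule each pull
        ucb = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
               for i in range(k)]
        i = max(range(k), key=lambda j: ucb[j])
        sums[i] += pull(i)
        counts[i] += 1
    return counts             # most pulls should concentrate on the best arm

print(ucb1([0.3, 0.5, 0.7]))
```

The exploration bonus sqrt(2 log(t) / n_i) shrinks as an arm is pulled more, which is what keeps the O(log n) cumulative regret bound above.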
