CSE 473: Artificial Intelligence

Reinforcement Learning

Dan Weld / University of Washington

Image from https://towardsdatascience.com/reinforcement-learning-multi-arm-bandit-implementation-5399ef67b24b [Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley – materials available at http://ai.berkeley.edu.]


Reinforcement Learning

§ Still assume there is a Markov decision process (MDP):
  § A set of states s ∈ S
  § A set of actions (per state) A
  § A model T(s, a, s')
  § A reward function R(s, a, s') & discount γ
§ Still looking for a policy π(s)
§ New twist: don't know T or R
  § I.e., we don't know which states are good or what the actions do
  § Must actually try actions and states out to learn



Offline (MDPs) vs. Online (RL)

§ Planning (Offline Solution): know T, R (the model is given)
§ Online Learning (RL): don't know T, R
§ Monte Carlo Planning: don't know T, R, but have a simulator
  § Differences from online RL: with MC planning 1) dying is ok; 2) you have a (re)set button
  § Most people call this RL as well

Reminder: Q-Value Iteration (for MDPs with known T, R)

Bellman backup over Q-values, where Vk(s') = Maxa' Qk(s', a'):

  Qk+1(s, a) ← Σs' T(s, a, s') [ R(s, a, s') + γ Vk(s') ]

§ Forall s, a
  § Initialize Q0(s, a) = 0 (no time steps left means an expected reward of zero)
§ K = 0
§ Repeat (do Bellman backups)
  § For every (s, a) pair, apply the backup above
  § K += 1
§ Until convergence (i.e., Q values don't change much)

We know this (T and R) when planning…. In RL we can only sample this (the experienced transitions and rewards). Problem: what if we don't know T, R?
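Not from the deck: a minimal Python sketch of the loop above for a tiny made-up MDP, with T and R supplied explicitly as dictionaries (all names here are illustrative).

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Q-value iteration for an MDP with known T and R (a sketch).
    T[s][a] is a list of (s_next, prob); R[(s, a, s_next)] is the reward."""
    Q = {(s, a): 0.0 for s in states for a in actions}   # Q0(s, a) = 0
    while True:
        Q_new = {}
        for s in states:
            for a in actions:
                # Bellman backup: sum over s' of T(s,a,s') * [R(s,a,s') + gamma * max_a' Q(s',a')]
                Q_new[(s, a)] = sum(
                    p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in T[s][a]
                )
        if max(abs(Q_new[k] - Q[k]) for k in Q) < tol:    # Q values don't change much
            return Q_new
        Q = Q_new

# Toy 2-state example (made up for illustration):
states, actions = ["s0", "s1"], ["stay", "go"]
T = {"s0": {"stay": [("s0", 1.0)], "go": [("s1", 1.0)]},
     "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]}}
R = {(s, a, s2): (1.0 if s2 == "s1" else 0.0) for s in states for a in actions for s2 in states}
print(q_value_iteration(states, actions, T, R))
```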


Reminder: Q Learning (for reinforcement learning)

§ Forall s, a
  § Initialize Q(s, a) = 0
§ Repeat forever
  § Where are you? s. Choose some action a, e.g. using ε-greedy or by maximizing Qe(s, a)
  § Execute it in the real world: (s, a, r, s')
  § Do update:
      difference ← [r + γ Maxa' Q(s', a')] − Q(s, a)
      Q(s, a) ← Q(s, a) + α (difference)

Problem: we don't want to store a table of Q(·,·)
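A tabular Q-learning sketch matching the update above (not from the deck). The environment interface (`env.reset()`, `env.step(s, a)`) and the ε-greedy choice are assumptions made for illustration; α is the learning rate.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, gamma=0.9, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy action selection (a sketch).
    Assumes env.reset() -> s and env.step(s, a) -> (r, s_next, done)."""
    Q = defaultdict(float)                        # Q(s, a) = 0 for all s, a
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Choose some action a, e.g. epsilon-greedy on the current Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            r, s_next, done = env.step(s, a)      # execute in the (real or simulated) world
            # difference <- [r + gamma * max_a' Q(s', a')] - Q(s, a)
            difference = r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)]
            Q[(s, a)] += alpha * difference
            s = s_next
    return Q
```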

Reminder: Approximate Q Learning

Represent Q with a linear combination of features instead of a table:

  Q(s, a) = w1 f1(s, a) + w2 f2(s, a) + … + wn fn(s, a)

§ Forall i
  § Initialize wi = 0
§ Repeat forever
  § Where are you? s. Choose some action a (Wait?! Which one? How?)
  § Execute it in the real world: (s, a, r, s')
  § Do update:
      difference ← [r + γ Maxa' Q(s', a')] − Q(s, a)
      Forall i do: wi ← wi + α (difference) fi(s, a)
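A sketch of the approximate Q-learning update with linear features, as described above (not from the deck); the feature function `features(s, a)` and the surrounding loop are assumptions made for illustration.

```python
def approx_q_update(w, features, s, a, r, s_next, actions, gamma=0.9, alpha=0.1):
    """One approximate Q-learning update, where Q(s, a) = sum_i w_i * f_i(s, a).
    `w` is the weight list and `features(s, a)` returns the feature values f_i(s, a)."""
    def q(state, action):
        return sum(wi * fi for wi, fi in zip(w, features(state, action)))
    # difference <- [r + gamma * max_a' Q(s', a')] - Q(s, a)
    difference = r + gamma * max(q(s_next, a2) for a2 in actions) - q(s, a)
    # Forall i: w_i <- w_i + alpha * difference * f_i(s, a)
    return [wi + alpha * difference * fi for wi, fi in zip(w, features(s, a))]
```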


Exploration vs. Exploitation




Questions

§ How to explore?
  § Random Exploration
  § Uniform exploration
  § Epsilon Greedy
  § Exploration Functions (such as UCB)
  § Thompson Sampling
§ When to exploit?
§ How to even think about this tradeoff?

Video of Demo Crawler Bot

More demos at: http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html


Epsilon-Greedy

§ With (small) probability ε, act randomly
§ With (large) probability 1 − ε, act on the current policy
§ Maybe decrease ε over time
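A minimal ε-greedy selection sketch (an illustration, not from the deck); the optional decay schedule follows the "maybe decrease ε over time" point and is just one common choice.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With (small) probability epsilon act randomly; otherwise act on the current policy (greedy on Q)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def decayed_epsilon(epsilon_0, decay, t):
    """One common way to decrease epsilon over time."""
    return epsilon_0 / (1.0 + decay * t)
```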

Evaluation

§ Is epsilon-greedy good?
§ Could any method be better?
§ How should we even THINK about this question?


Regret

§ Even if you learn the optimal policy, you still make mistakes along the way!
§ Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful sub-optimality, and optimal (expected) rewards
§ Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal

Two KINDS of Regret

§ Cumulative Regret:

§ Goal: achieve near optimal cumulative lifetime reward (in expectation)

§ Simple Regret:

§ Goal: quickly identify policy with high reward (in expectation)



Regret

[Figure: reward vs. time. One curve shows the reward from choosing the optimal action each time; below it, the reward of an exploration policy. An exploration policy that minimizes cumulative regret minimizes the (red) area between the two curves.]

Regret

[Figure: reward vs. time, with a marker at a future time t ("you are here" lies before t; you care about performance at times after t). An exploration policy that minimizes simple regret explores before t so as to minimize the (red) area after t.]


Offline (MDPs) vs. Online (RL)

§ Online Learning (RL): don't know T, R; acting in the real world, so minimize Cumulative Regret
§ Monte Carlo Planning: don't know T, R, but have a simulator, so minimize Simple Regret


RL on Single State MDP

§ Suppose the MDP has a single state and k actions
  § Can sample rewards of actions using a call to the simulator
  § Sampling action a is like pulling a slot machine arm with random payoff function R(s, a)

[Figure: a single state s with arms a1, a2, …, ak and random payoffs R(s, a1), R(s, a2), …, R(s, ak): the Multi-Armed Bandit Problem]

Slide adapted from Alan Fern (OSU)
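A tiny simulated bandit to make the single-state setup concrete (illustrative only; the Bernoulli payoffs and class name are assumptions, not from the deck).

```python
import random

class BernoulliBandit:
    """Single-state MDP with k arms: pulling arm a returns reward 1 with probability p[a]."""
    def __init__(self, p):
        self.p = p                              # true (unknown to the learner) payoff probabilities

    def pull(self, a):
        return 1.0 if random.random() < self.p[a] else 0.0

bandit = BernoulliBandit([0.2, 0.5, 0.7])       # k = 3 arms with made-up payoffs
print([bandit.pull(a) for a in range(3)])       # one sample from each arm
```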


Multi-Armed Bandits

§ Bandit algorithms are not just useful as components for RL & Monte-Carlo planning
§ Pure bandit problems arise in many applications
§ Applicable whenever:
  § a set of independent options with unknown utilities
  § a cost for sampling options or a limit on total samples
  § we want to find the best option or maximize the utility of the samples

Slide adapted from Alan Fern (OSU)


Multi-Armed Bandits: Example 1

Clinical Trials

§ Arms = possible treatments
§ Arm Pulls = application of a treatment to an individual
§ Rewards = outcome of the treatment
§ Objective = maximize cumulative reward = maximize benefit to the trial population (or find the best treatment quickly)

Slide adapted from Alan Fern (OSU)



Multi-Armed Bandits: Example 2


§ Online Advertising

§ Arms = different ads/ad-types for a web page
§ Arm Pulls = displaying an ad upon a page access
§ Rewards = click-through
§ Objective = maximize cumulative reward = maximum clicks (or find the best ad quickly)

Multi-Armed Bandit: Possible Objectives

§ PAC Objective:

§ find a near optimal arm w/ high probability

§ Cumulative Regret:

§ achieve near optimal cumulative reward over lifetime of pulling (in expectation)

§ Simple Regret:

§ quickly identify an arm with high reward (in expectation)


Slide adapted from Alan Fern (OSU)



Cumulative Regret Objective

§ Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step)
  § Optimal (in expectation) is to pull the optimal arm n times
  § Pull arms uniformly? (UniformBandit) ??

Slide adapted from Alan Fern (OSU)


Cumulative Regret Objective

§ Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step)
  § Optimal (in expectation) is to pull the optimal arm n times
  § UniformBandit is a poor choice: it wastes time on bad arms
  § Must balance exploring all arms to find good payoffs and exploiting current knowledge (pulling the best arm)

Slide adapted from Alan Fern (OSU)



Idea

• The problem is uncertainty… How to quantify?
• Error bars

If an arm has been sampled n times, then with probability at least 1 − δ:

  | μ̂ − μ | < sqrt( log(2/δ) / (2n) )

where μ is the arm's true mean reward and μ̂ its empirical mean.

Slide adapted from Travis Mandel (UW)

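A short sketch of the error bar above, computing an empirical mean and its confidence radius for rewards in [0, 1] (illustrative; the sample list and function name are made up).

```python
import math

def error_bar(samples, delta):
    """Hoeffding-style bound: with probability >= 1 - delta,
    |empirical mean - true mean| < sqrt(log(2/delta) / (2*n)) for rewards in [0, 1]."""
    n = len(samples)
    mean = sum(samples) / n
    radius = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return mean, radius

mean, radius = error_bar([1, 0, 1, 1, 0, 1, 1, 1], delta=0.05)
print(f"mean = {mean:.2f} +/- {radius:.2f}")
```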



Given Error bars, how do we act?

  • Optimism under uncertainty!
  • Why? If bad, we will soon find out!

Slide adapted from Travis Mandel (UW)


One last wrinkle

• How to set the confidence δ?
• Decrease it over time

If an arm has been sampled n times, then with probability at least 1 − δ:

  | μ̂ − μ | < sqrt( log(2/δ) / (2n) )

• A standard choice is δ = 1/t^4, where t is the total number of pulls so far; plugging it into the bound gives (up to constants) the sqrt(2 log(t) / n) bonus that UCB uses on the next slide.

Slide adapted from Travis Mandel (UW)



Upper Confidence Bound (UCB)

• 1. Play each arm once
• 2. Play the arm i that maximizes:  μ̂ᵢ + sqrt( 2 log(t) / nᵢ )
   (μ̂ᵢ = empirical mean reward of arm i, nᵢ = number of times arm i has been pulled, t = total number of pulls so far)
• 3. Repeat Step 2 forever

Slide adapted from Travis Mandel (UW)
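A compact UCB sketch following the three steps above (illustrative, not from the deck; `pull(a)` is an assumed reward sampler, here made-up Bernoulli arms).

```python
import math
import random

def ucb(pull, k, total_pulls=1000):
    """UCB: play each arm once, then repeatedly play the arm maximizing
    empirical_mean_i + sqrt(2 * log(t) / n_i), where t is the total number of pulls so far."""
    counts = [0] * k                  # n_i
    sums = [0.0] * k                  # total reward observed from arm i
    for a in range(k):                # 1. play each arm once
        sums[a] += pull(a)
        counts[a] += 1
    for t in range(k + 1, total_pulls + 1):
        # 2. play the arm maximizing empirical mean + exploration bonus
        a = max(range(k), key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]))
        sums[a] += pull(a)
        counts[a] += 1
    return [sums[i] / counts[i] for i in range(k)], counts

# Example with made-up Bernoulli arms:
probs = [0.2, 0.5, 0.7]
means, counts = ucb(lambda a: 1.0 if random.random() < probs[a] else 0.0, k=3)
print(means, counts)
```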


UCB Performance Guarantee

[Auer, Cesa-Bianchi, & Fischer, 2002]

Theorem: The expected cumulative regret of UCB, E[Reg_n], after n arm pulls is bounded by O(log n)

Is this good?
• Yes. The average per-step regret, E[Reg_n]/n, is O( log(n) / n )
• Theorem: No algorithm can achieve a better expected regret (up to constant factors)

Slide adapted from Alan Fern (OSU)



Putting UCB into Q-Learning !!!!

How to deal with multiple states????? (The multi-armed bandit assumes ONE state)

Recall UCB for the single-state case:
• 1. Play each arm once
• 2. Play the arm a that maximizes:  μ̂ₐ + sqrt( 2 log(t) / nₐ )
   (the first term is the expected reward; the second is a bonus for exploration)
• 3. Repeat Step 2 forever

UCB Balances Exploration & Exploitation

Single-state UCB score, for comparison:  ExpReward(a) + sqrt( 2 log(t) / NumberOfTimesExecuted(a) )

§ Forall s, a
  § Initialize Q(s, a) = 0, n_sa = 0
§ Repeat forever
  § Where are you? s. Choose the action with the highest Qe(s, a)
  § Execute it in the real world: (s, a, r, s')
  § Do update: n_sa += 1
      difference ← [r + γ Maxa' Qe(s', a')] − Qe(s, a)
      Qe(s, a) ← Qe(s, a) + α (difference)

Let n_sa be the number of times one has executed a in s, and let t = Σ_sa n_sa.
Let Qe(s, a) = Q(s, a) + sqrt( 2 log(t) / (1 + n_sa) ); the added term rewards exploration, but converges to zero.
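A sketch of this count-based exploration bonus folded into Q-learning (one reasonable reading of the slide, not its exact code): the learned table is Q, and the bonus is added on top when selecting actions and when bootstrapping. The environment interface is an assumption as before.

```python
import math
from collections import defaultdict

def q_learning_with_ucb_bonus(env, actions, episodes=1000, gamma=0.9, alpha=0.1):
    """Q-learning acting greedily on Qe(s, a) = Q(s, a) + sqrt(2*log(t) / (1 + n_sa)),
    where n_sa counts executions of a in s and t is the total experience so far.
    Assumes env.reset() -> s and env.step(s, a) -> (r, s_next, done)."""
    Q = defaultdict(float)
    n = defaultdict(int)

    def total():
        return sum(n.values()) + 1                      # t (offset to avoid log(0))

    def Qe(s, a):
        return Q[(s, a)] + math.sqrt(2.0 * math.log(total()) / (1 + n[(s, a)]))

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = max(actions, key=lambda a2: Qe(s, a2))  # choose action with highest Qe
            r, s_next, done = env.step(s, a)
            n[(s, a)] += 1
            # difference <- [r + gamma * max_a' Qe(s', a')] - Q(s, a)
            difference = r + gamma * max(Qe(s_next, a2) for a2 in actions) - Q[(s, a)]
            Q[(s, a)] += alpha * difference
            s = s_next
    return Q
```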


Video of Demo Q-learning – Exploration Function – Crawler


What Else ….

§ UCB is great when we care about cumulative regret
  § I.e., when the agent is acting in the real world
§ But sometimes all we care about is finding a good arm quickly
  § E.g., when we are training in a simulator
§ In these cases, "Simple Regret" is the better objective


Two KINDS of Regret

§ Cumulative Regret:

§ achieve near optimal cumulative lifetime reward (in expectation)

§ Simple Regret:

§ quickly identify policy with high reward (in expectation)


Simple Regret Objective

Protocol: At time step n the algorithm picks an "exploration" arm aₙ to pull and observes reward rₙ, and also picks an arm index it thinks is best, ĵₙ (aₙ, ĵₙ, and rₙ are random variables). If interrupted at time n the algorithm returns ĵₙ.

Expected Simple Regret (E[SReg_n]): the difference between R* and the expected reward of the arm ĵₙ selected by our strategy at time n:

  E[SReg_n] = R* − E[ R(ĵₙ) ]


How to Minimize Simple Regret?

What about UCB for simple regret?

• Theorem: The expected simple regret of UCB after n arm pulls is upper bounded by O(n^−c) for a constant c.

Seems good, but we can do much better (at least in theory).

§ Intuitively: UCB puts too much emphasis on pulling the best arm
§ After an arm is looking good, maybe better to see if ∃ a better arm


Incremental Uniform (or Round Robin)

Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19), 1832-1852

Algorithm:
§ At round n pull the arm with index (n mod k) + 1
§ At round n return the arm (if asked) with the largest average reward

• Theorem: The expected simple regret of Uniform after n arm pulls is upper bounded by O(e^−cn) for a constant c.
• This bound is exponentially decreasing in n!
• Compare to the polynomial rate for UCB, O(n^−c).
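A sketch of incremental uniform (round-robin) pulling with a final recommendation of the best empirical arm (illustrative; `pull(a)` is the same assumed sampler as before).

```python
import random

def uniform_bandit(pull, k, rounds=1000):
    """At round n pull arm (n mod k); if interrupted, recommend the arm with the largest average reward."""
    counts = [0] * k
    sums = [0.0] * k
    for n in range(rounds):
        a = n % k                                 # round-robin arm choice
        sums[a] += pull(a)
        counts[a] += 1
    return max(range(k), key=lambda i: sums[i] / counts[i] if counts[i] else 0.0)

probs = [0.2, 0.5, 0.7]                           # made-up Bernoulli arms
best = uniform_bandit(lambda a: 1.0 if random.random() < probs[a] else 0.0, k=3)
print("recommended arm:", best)
```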


Can we do even better?

Algorithm ε-Greedy (parameter ε, 0 < ε < 1), here with ε = 1/2:
§ At round n, with probability 1/2 pull the arm with the best average reward so far, otherwise pull one of the other arms at random
§ At round n return the arm (if asked) with the largest average reward

Tolpin, D. & Shimony, S. E. (2012). MCTS Based on Simple Regret. AAAI Conference on Artificial Intelligence.

• Theorem: The expected simple regret of ε-Greedy for ε = 0.5 after n arm pulls is upper bounded by O(e^−cn) for a constant c that is larger than the constant for Uniform (this holds for large enough n).
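A sketch of the 0.5-greedy strategy above (illustrative; same assumed `pull(a)` sampler, and at least two arms).

```python
import random

def half_greedy_bandit(pull, k, rounds=1000):
    """With probability 1/2 pull the arm with the best average so far, otherwise pull one of the
    other arms at random; recommend the arm with the largest average reward."""
    counts = [1] * k
    sums = [pull(a) for a in range(k)]            # start with one pull per arm
    for _ in range(rounds - k):
        best = max(range(k), key=lambda i: sums[i] / counts[i])
        if random.random() < 0.5:
            a = best
        else:
            a = random.choice([i for i in range(k) if i != best])
        sums[a] += pull(a)
        counts[a] += 1
    return max(range(k), key=lambda i: sums[i] / counts[i])

probs = [0.2, 0.5, 0.7]
best = half_greedy_bandit(lambda a: 1.0 if random.random() < probs[a] else 0.0, k=3)
print("recommended arm:", best)
```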

Summary of Bandits in Theory

PAC Objective:
§ UniformBandit is a simple PAC algorithm
§ MedianElimination improves by a factor of log(k) and is optimal up to constant factors

Cumulative Regret:
§ Uniform is very bad!
§ UCB is optimal (up to constant factors)
§ Thompson Sampling is also optimal; often performs better in practice

Simple Regret:
§ UCB shown to reduce regret at a polynomial rate
§ Uniform reduces at an exponential rate
§ 0.5-Greedy may have an even better exponential rate


That’s all for Reinforcement Learning!

§ Very tough problem: How to perform any task well in an unknown, noisy environment!
§ Traditionally used mostly for robotics, but…

[Figure: a Reinforcement Learning Agent turns Data (experiences with the environment) into a Policy (how to act in the future). Example: Google DeepMind, RL applied to data center power usage.]