SLIDE 1

What we learned last time

  • 1. Intelligence is the computational part of the ability to achieve goals
  • looking deeper: 1) it's a continuum, 2) it's an appearance, 3) it varies with observer and purpose
  • 2. We will (probably) figure out how to make intelligent systems in our lifetimes; it will change everything
  • 3. But prior to that it will probably change our careers, as companies gear up to take advantage of the economic opportunities
  • 4. This course has a demanding workload

SLIDE 2

Multi-arm Bandits

Sutton and Barto, Chapter 2

The simplest reinforcement learning problem

SLIDE 3

You are the algorithm! (bandit1)

  • Action 1 — Reward is always 8
  • value of action 1 is q∗(1) = 8
  • Action 2 — 88% chance of 0, 12% chance of 100!
  • value of action 2 is q∗(2) = .88 × 0 + .12 × 100 = 12
  • Action 3 — Randomly between -10 and 35, equiprobable
  • value of action 3 is q∗(3) = 12.5
  • Action 4 — a third 0, a third 20, and a third from {8,9,…, 18}
  • value of action 4 is q∗(4) = (1/3) × 0 + (1/3) × 20 + (1/3) × 13 = 0 + 20/3 + 13/3 = 33/3 = 11

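To sanity-check these values, the expectations can be computed directly. A minimal Python sketch (the distributions are read off the bullets above; whether action 3 is discrete or continuous does not change its mean):

```python
import numpy as np

# True values q*(a) for the four actions of the bandit1 exercise.
q1 = 8.0                                        # action 1: reward is always 8
q2 = 0.88 * 0 + 0.12 * 100                      # action 2: 88% chance of 0, 12% chance of 100
q3 = np.mean(np.arange(-10, 36))                # action 3: equiprobable over -10..35
q4 = (0 + 20 + np.mean(np.arange(8, 19))) / 3   # action 4: a third 0, a third 20, a third uniform on {8,...,18}

print(q1, q2, q3, q4)                           # 8.0 12.0 12.5 11.0
```
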
SLIDE 4

The k-armed Bandit Problem

  • On each of an infinite sequence of time steps, t = 1, 2, 3, …, you choose an action At from k possibilities, and receive a real-valued reward Rt
  • The reward depends only on the action taken; it is identically, independently distributed (i.i.d.), with true values

    q∗(a) ≐ E[Rt | At = a], ∀a ∈ {1, . . . , k}

  • These true values are unknown. The distribution is unknown
  • Nevertheless, you must maximize your total reward
  • You must both try actions to learn their values (explore), and prefer those that appear best (exploit)

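To make the setup concrete, here is a minimal sketch of a stationary k-armed bandit environment in Python. The class name and the Gaussian reward model are my own assumptions (chosen to match the 10-armed testbed used later), not something fixed by the problem statement:

```python
import numpy as np

class KArmedBandit:
    """A stationary k-armed bandit: the reward depends only on the action and is i.i.d."""

    def __init__(self, k=10, seed=None):
        self.k = k
        self.rng = np.random.default_rng(seed)
        # True action values q*(a), unknown to the learner.
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def step(self, action):
        # R_t ~ N(q*(A_t), 1): drawn independently each time the action is taken.
        return self.rng.normal(self.q_star[action], 1.0)
```
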
SLIDE 5

The Exploration/Exploitation Dilemma

  • Suppose you form action-value estimates Qt(a) ≈ q∗(a), ∀a
  • Define the greedy action at time t as A∗t ≐ arg maxₐ Qt(a)
  • If At = A∗t then you are exploiting; if At ≠ A∗t then you are exploring
  • You can’t do both, but you need to do both
  • You can never stop exploring, but maybe you should explore less with time. Or maybe not.

SLIDE 6

Action-Value Methods

  • Methods that learn action-value estimates and nothing else
  • For example, estimate action values as sample averages (a code sketch follows after this list):

    Qt(a) ≐ (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
          = Σ_{i=1}^{t−1} Ri · 1_{Ai=a} / Σ_{i=1}^{t−1} 1_{Ai=a}

  • The sample-average estimates converge to the true values if the action is taken an infinite number of times:

    lim_{Nt(a)→∞} Qt(a) = q∗(a),   where Nt(a) is the number of times action a has been taken by time t

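A minimal sketch of the sample-average estimate, computed from a log of past actions and rewards (the function and variable names are my own):

```python
import numpy as np

def sample_average_estimates(actions, rewards, k):
    """Q_t(a) = (sum of rewards when a was taken) / (number of times a was taken),
    with Q_t(a) = 0 for actions never taken."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards)
    Q = np.zeros(k)
    for a in range(k):
        taken = actions == a              # indicator 1_{A_i = a}
        if taken.any():
            Q[a] = rewards[taken].mean()
    return Q
```
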
SLIDE 7

ε-Greedy Action Selection

  • In greedy action selection, you always exploit
  • In ε-greedy, you are usually greedy, but with probability ε you instead pick an action at random (possibly the greedy action again)
  • This is perhaps the simplest way to balance exploration and exploitation (see the sketch below)

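A minimal sketch of ε-greedy selection given current estimates Q (names are mine; note that the random choice may re-select the greedy action, exactly as the bullet says):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng=None):
    """With probability epsilon choose uniformly at random; otherwise choose a
    greedy action, breaking ties randomly."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    greedy_actions = np.flatnonzero(Q == np.max(Q))
    return int(rng.choice(greedy_actions))
```
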
SLIDE 8

A simple bandit algorithm

Initialize, for a = 1 to k:
    Q(a) ← 0
    N(a) ← 0
Repeat forever:
    A ← arg maxₐ Q(a) with probability 1 − ε (breaking ties randomly),
        or a random action with probability ε
    R ← bandit(A)
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1/N(A)) [ R − Q(A) ]

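A runnable Python transcription of this pseudocode, as a sketch: `bandit` stands for any function mapping an action to a sampled reward (for instance the hypothetical KArmedBandit.step sketched earlier), and the infinite loop is truncated to a fixed number of steps:

```python
import numpy as np

def simple_bandit(bandit, k, epsilon, steps=1000, seed=0):
    """Epsilon-greedy with incremental sample-average updates."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(k)              # Q(a) <- 0
    N = np.zeros(k)              # N(a) <- 0
    for _ in range(steps):       # "repeat forever", truncated
        if rng.random() < epsilon:
            A = int(rng.integers(k))                           # random action with prob. epsilon
        else:
            A = int(rng.choice(np.flatnonzero(Q == Q.max())))  # greedy, ties broken randomly
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]    # Q(A) <- Q(A) + (1/N(A)) [R - Q(A)]
    return Q, N
```
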
SLIDE 9

What we learned last time

  • 1. Multi-armed bandits are a simplification of the real problem
  • 1. they have action and reward (a goal), but no input or sequentiality
  • 2. A fundamental exploitation-exploration tradeoff arises in bandits
  • 1. ε-greedy action selection is the simplest way of trading off
  • 3. Learning action values is a key part of solution methods
  • 4. The 10-armed testbed illustrates all
SLIDE 10

One Bandit Task from the 10-armed Testbed

[Figure: the reward distribution of each of the 10 actions, centered on its true value q∗(1), …, q∗(10); axes: Reward distribution vs. Action]

  • q∗(a) ∼ N(0, 1)
  • Rt ∼ N(q∗(a), 1)
  • Run for 1000 steps
  • Repeat the whole thing 2000 times with different bandit tasks

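A minimal sketch of this testbed protocol (2000 independently drawn 10-armed tasks, 1000 steps each, averaging the reward at each step across tasks); the function name and the ε-greedy learner inside are my own choices:

```python
import numpy as np

def run_testbed(epsilon, runs=2000, steps=1000, k=10, seed=0):
    """Average reward per step, averaged over independently drawn bandit tasks."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        q_star = rng.normal(0.0, 1.0, k)        # q*(a) ~ N(0, 1) for this task
        Q, N = np.zeros(k), np.zeros(k)
        for t in range(steps):
            if rng.random() < epsilon:
                A = int(rng.integers(k))
            else:
                A = int(rng.choice(np.flatnonzero(Q == Q.max())))
            R = rng.normal(q_star[A], 1.0)      # R_t ~ N(q*(A_t), 1)
            N[A] += 1
            Q[A] += (R - Q[A]) / N[A]
            avg_reward[t] += R
    return avg_reward / runs
```
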
SLIDE 11

ε-Greedy Methods on the 10-Armed Testbed

SLIDE 12

What we learned last time

  • 1. Multi-armed bandits are a simplification of the real problem
  • 1. they have action and reward (a goal), but no input or sequentiality
  • 2. The exploitation-exploration tradeoff arises in bandits
  • 1. ε-greedy action selection is the simplest way of trading off
  • 3. Learning action values is a key part of solution methods
  • 4. The 10-armed testbed illustrates all
  • 5. Learning as averaging – a fundamental learning rule
SLIDE 13

Averaging ⟶ learning rule

  • To simplify notation, let us focus on one action
  • We consider only its rewards, and its estimate after n+1 rewards:
  • How can we do this incrementally (without storing all the rewards)?
  • Could store a running sum and count (and divide), or equivalently:
  • This is a standard form for learning/update rules:

Qn ≐ (R1 + R2 + · · · + Rn−1) / (n − 1)

Qn+1 = Qn + (1/n) [ Rn − Qn ]

NewEstimate ← OldEstimate + StepSize [ Target − OldEstimate ]

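The standard form is a one-line update. A minimal sketch, with a tiny worked example on hypothetical rewards confirming that the 1/n step size reproduces the sample average:

```python
def incremental_update(old_estimate, target, step_size):
    """NewEstimate <- OldEstimate + StepSize * [Target - OldEstimate]"""
    return old_estimate + step_size * (target - old_estimate)

# With step size 1/n the update keeps an exact running mean of the rewards.
Q, n = 0.0, 0
for R in [8.0, 12.0, 10.0]:            # hypothetical rewards for one action
    n += 1
    Q = incremental_update(Q, R, 1.0 / n)
print(Q)                                # 10.0, the mean of the three rewards
```
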
SLIDE 14

Derivation of incremental update

Qn ≐ (R1 + R2 + · · · + Rn−1) / (n − 1)

Qn+1 = (1/n) Σ_{i=1}^{n} Ri
     = (1/n) ( Rn + Σ_{i=1}^{n−1} Ri )
     = (1/n) ( Rn + (n − 1) · (1/(n − 1)) Σ_{i=1}^{n−1} Ri )
     = (1/n) ( Rn + (n − 1) Qn )
     = (1/n) ( Rn + n Qn − Qn )
     = Qn + (1/n) [ Rn − Qn ]

SLIDE 15

Averaging ⟶ learning rule

  • To simplify notation, let us focus on one action
  • We consider only its rewards, and its estimate after n+1 rewards:
  • How can we do this incrementally (without storing all the rewards)?
  • Could store a running sum and count (and divide), or equivalently:
  • This is a standard form for learning/update rules:

Qn ≐ (R1 + R2 + · · · + Rn−1) / (n − 1)

Qn+1 = Qn + (1/n) [ Rn − Qn ]

NewEstimate ← OldEstimate + StepSize [ Target − OldEstimate ]

SLIDE 16

Tracking a Non-stationary Problem

  • Suppose the true action values change slowly over time
  • then we say that the problem is nonstationary
  • In this case, sample averages are not a good idea (Why?)
  • Better is an “exponential, recency-weighted average” (sketched in code below):

    Qn+1 = Qn + α [ Rn − Qn ]
         = (1 − α)ⁿ Q1 + Σ_{i=1}^{n} α (1 − α)^(n−i) Ri

    where α is a constant step-size parameter, 0 < α ≤ 1

  • There is bias due to Q1 that becomes smaller over time

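A minimal sketch of the constant step-size update for tracking a nonstationary action value (names are mine):

```python
import numpy as np

def exponential_recency_weighted_average(rewards, alpha=0.1, Q1=0.0):
    """Q_{n+1} = Q_n + alpha * (R_n - Q_n): recent rewards get weight alpha*(1-alpha)^(n-i),
    and the initial estimate Q1 keeps weight (1-alpha)^n, which shrinks over time."""
    Q = Q1
    history = []
    for R in rewards:
        Q += alpha * (R - Q)
        history.append(Q)
    return np.array(history)
```
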
SLIDE 17

Standard stochastic approximation convergence conditions

  • To assure convergence with probability 1:

    Σ_{n=1}^{∞} αn(a) = ∞   and   Σ_{n=1}^{∞} αn²(a) < ∞

  • e.g., αn = 1/n works, but αn = 1/n² does not
  • if αn = n^(−p), p ∈ (0, 1), then convergence is at the optimal rate: O(1/√n)

SLIDE 18

Optimistic Initial Values

  • All methods so far depend on Q1(a), i.e., they are biased. So far we have used Q1(a) = 0
  • Suppose we initialize the action values optimistically (Q1(a) = 5), e.g., on the 10-armed testbed (with α = 0.1)

[Figure: % Optimal action (0% to 100%) vs. Steps (1 to 1000) on the 10-armed testbed, comparing optimistic, greedy (Q0 = 5, ε = 0) with realistic, ε-greedy (Q0 = 0, ε = 0.1)]

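A minimal sketch of one run with a chosen initial value and constant step size, which could be used to reproduce a comparison like the figure above; the function name and the single-task setup are my own assumptions:

```python
import numpy as np

def run_with_initial_values(q_star, Q1, epsilon, alpha=0.1, steps=1000, seed=0):
    """One bandit run; returns whether the optimal action was chosen at each step."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    Q = np.full(k, float(Q1))            # optimistic if Q1 exceeds any plausible reward
    chose_optimal = np.zeros(steps, dtype=bool)
    for t in range(steps):
        if rng.random() < epsilon:
            A = int(rng.integers(k))
        else:
            A = int(rng.choice(np.flatnonzero(Q == Q.max())))
        R = rng.normal(q_star[A], 1.0)
        Q[A] += alpha * (R - Q[A])       # constant step size, so the Q1 bias fades
        chose_optimal[t] = (A == int(np.argmax(q_star)))
    return chose_optimal

# e.g. compare run_with_initial_values(q_star, Q1=5, epsilon=0.0)   # optimistic, greedy
#      with    run_with_initial_values(q_star, Q1=0, epsilon=0.1)   # realistic, eps-greedy
```
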
SLIDE 19

Upper Confidence Bound (UCB) action selection

  • A clever way of reducing exploration over time
  • Estimate an upper bound on the true action values
  • Select the action with the largest (estimated) upper bound

    At ≐ argmaxₐ [ Qt(a) + c √( log t / Nt(a) ) ]

[Figure: Average reward vs. Steps on the 10-armed testbed, comparing UCB (c = 2) with ε-greedy (ε = 0.1)]

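A minimal sketch of UCB action selection given the current estimates and counts (names are mine; untried actions are treated as having an infinite upper bound and are therefore taken first):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(log(t) / N(a)) ]."""
    N = np.asarray(N, dtype=float)
    if np.any(N == 0):
        return int(np.argmin(N))                    # an untried action: its bound is maximal
    bounds = np.asarray(Q) + c * np.sqrt(np.log(t) / N)
    return int(np.argmax(bounds))
```
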
SLIDE 20

Gradient-Bandit Algorithms

  • Let Ht(a) be a learned preference for taking action a

    Pr{At = a} ≐ e^Ht(a) / Σ_{b=1}^{k} e^Ht(b) ≐ πt(a)

    Ht+1(a) ≐ Ht(a) + α ( Rt − R̄t ) ( 1_{At=a} − πt(a) ), ∀a

    where R̄t ≐ (1/t) Σ_{i=1}^{t} Ri is the average reward, used as a baseline

[Figure: % Optimal action (0% to 100%) vs. Steps (250 to 1000) for the gradient bandit with and without the baseline, at α = 0.1 and α = 0.4]

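A minimal sketch of the gradient-bandit update with softmax action preferences and the average-reward baseline; `reward_fn` is a hypothetical stand-in for the bandit, and the vectorized update applies the rule to all actions at once:

```python
import numpy as np

def gradient_bandit(reward_fn, k, alpha=0.1, steps=1000, use_baseline=True, seed=0):
    """Learn preferences H(a); pi_t = softmax(H), updated by
    H(a) += alpha * (R_t - Rbar_t) * (1_{A_t=a} - pi_t(a))."""
    rng = np.random.default_rng(seed)
    H = np.zeros(k)                          # action preferences H_t(a)
    avg_R = 0.0                              # running average reward, Rbar_t
    for t in range(1, steps + 1):
        pi = np.exp(H - H.max())             # softmax, shifted for numerical stability
        pi /= pi.sum()
        A = int(rng.choice(k, p=pi))
        R = reward_fn(A)
        avg_R += (R - avg_R) / t             # Rbar_t = (1/t) * sum_{i<=t} R_i
        baseline = avg_R if use_baseline else 0.0
        indicator = np.zeros(k)
        indicator[A] = 1.0
        H += alpha * (R - baseline) * (indicator - pi)
    return H
```
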
SLIDE 21

Derivation of gradient-bandit algorithm

In exact gradient ascent:

    Ht+1(a) ≐ Ht(a) + α ∂E[Rt] / ∂Ht(a),     (1)

where:

    E[Rt] ≐ Σ_b πt(b) q∗(b)

    ∂E[Rt] / ∂Ht(a) = ∂/∂Ht(a) [ Σ_b πt(b) q∗(b) ]
                    = Σ_b q∗(b) ∂πt(b)/∂Ht(a)
                    = Σ_b ( q∗(b) − Xt ) ∂πt(b)/∂Ht(a),

where Xt does not depend on b, because Σ_b ∂πt(b)/∂Ht(a) = 0.

SLIDE 22

∂E[Rt] / ∂Ht(a) = Σ_b ( q∗(b) − Xt ) ∂πt(b)/∂Ht(a)
                = Σ_b πt(b) ( q∗(b) − Xt ) ( ∂πt(b)/∂Ht(a) ) / πt(b)
                = E[ ( q∗(At) − Xt ) ( ∂πt(At)/∂Ht(a) ) / πt(At) ]
                = E[ ( Rt − R̄t ) ( ∂πt(At)/∂Ht(a) ) / πt(At) ],

where here we have chosen Xt = R̄t and substituted Rt for q∗(At), which is permitted because E[Rt|At] = q∗(At). For now assume ∂πt(b)/∂Ht(a) = πt(b) ( 1_{a=b} − πt(a) ). Then:

                = E[ ( Rt − R̄t ) πt(At) ( 1_{a=At} − πt(a) ) / πt(At) ]
                = E[ ( Rt − R̄t ) ( 1_{a=At} − πt(a) ) ].

    Ht+1(a) = Ht(a) + α ( Rt − R̄t ) ( 1_{a=At} − πt(a) )     (from (1), QED)

SLIDE 23

Thus it remains only to show that

    ∂πt(b)/∂Ht(a) = πt(b) ( 1_{a=b} − πt(a) ).

Recall the standard quotient rule for derivatives:

    ∂/∂x [ f(x) / g(x) ] = ( (∂f(x)/∂x) g(x) − f(x) (∂g(x)/∂x) ) / g(x)²

Using this, we can write...

SLIDE 24

∂πt(b)/∂Ht(a) = ∂/∂Ht(a) πt(b)
              = ∂/∂Ht(a) [ e^Ht(b) / Σ_{c=1}^{k} e^Ht(c) ]
              = ( (∂e^Ht(b)/∂Ht(a)) Σ_{c=1}^{k} e^Ht(c) − e^Ht(b) ∂(Σ_{c=1}^{k} e^Ht(c))/∂Ht(a) ) / ( Σ_{c=1}^{k} e^Ht(c) )²     (by the Quotient Rule)
              = ( 1_{a=b} e^Ht(b) Σ_{c=1}^{k} e^Ht(c) − e^Ht(b) e^Ht(a) ) / ( Σ_{c=1}^{k} e^Ht(c) )²     (since ∂eˣ/∂x = eˣ)
              = 1_{a=b} e^Ht(b) / Σ_{c=1}^{k} e^Ht(c) − e^Ht(b) e^Ht(a) / ( Σ_{c=1}^{k} e^Ht(c) )²
              = 1_{a=b} πt(b) − πt(b) πt(a)
              = πt(b) ( 1_{a=b} − πt(a) ).     (Q.E.D.)

SLIDE 25

Summary Comparison of Bandit Algorithms

[Figure: parameter study. Average reward over the first 1000 steps (y-axis, roughly 1 to 1.5) vs. parameter setting (ε, α, c, or Q0; x-axis, 1/128 to 4 on a log scale), for ε-greedy, UCB, gradient bandit, and greedy with optimistic initialization (α = 0.1)]

SLIDE 26

Conclusions

  • These are all simple methods
  • but they are complicated enough—we will build on them
  • we should understand them completely
  • there are still open questions
  • Our first algorithms that learn from evaluative feedback
  • and thus must balance exploration and exploitation
  • Our first algorithms that appear to have a goal, that learn to maximize reward by trial and error

SLIDE 27

Our first dimensions!

  • Problems vs Solution Methods
  • Evaluative vs Instructive
  • Associative vs Non-associative

Bandits? Problem or Solution?

SLIDE 28

Problem space

                        Single State                Associative
  Instructive feedback
  Evaluative feedback

SLIDE 29

Problem space

                        Single State                Associative
  Instructive feedback
  Evaluative feedback   Bandits
                        (Function optimization)

SLIDE 30

Problem space

                        Single State                Associative
  Instructive feedback                              Supervised learning
  Evaluative feedback   Bandits
                        (Function optimization)

SLIDE 31

Problem space

                        Single State                Associative
  Instructive feedback  Averaging                   Supervised learning
  Evaluative feedback   Bandits
                        (Function optimization)

SLIDE 32

Problem space

                        Single State                Associative
  Instructive feedback  Averaging                   Supervised learning
  Evaluative feedback   Bandits                     Associative Search
                        (Function optimization)     (Contextual bandits)