

SLIDE 1

Neural Contextual Bandits with UCB-based Exploration

Dongruo Zhou¹  Lihong Li²  Quanquan Gu¹

¹Department of Computer Science, UCLA  ²Google Research


SLIDE 4

Outline

◮ Background
  ◮ Contextual bandit problem
  ◮ Deep neural networks
◮ Algorithm – NeuralUCB
  ◮ Use a neural network to learn the reward
  ◮ Use the neural network's gradient to explore
  ◮ Upper confidence bound strategy
◮ Main theory
  ◮ Neural tangent kernel matrix and effective dimension
  ◮ Õ(√T) regret

SLIDE 5

Background – decision-making problems

Decision-making problems are everywhere!
◮ As a gambler in a casino facing a row of slot machines, you...
  ◮ have a limited budget and want to maximize the payoff!
  ◮ Which arm to pull?
◮ As a movie recommender, you need to...
  ◮ recommend movies based on users' interests, to maximize the purchase rate.
  ◮ Which movie to recommend?

[Figures: (a) Slot machine, (b) Movie recommendation]


SLIDE 9

Background – contextual bandit

K-armed contextual bandit problem: movie recommendation. At round t,
◮ The agent observes K d-dimensional contextual vectors (user's movie purchase history) {x_{t,a} ∈ R^d | a ∈ [K]}
◮ The agent selects an action a_t and receives a reward r_{t,a_t} (recommends some movie; the user chooses to purchase or not)
◮ The goal is to minimize the pseudo-regret

  R_T = E[ Σ_{t=1}^T ( r_{t,a_t^*} − r_{t,a_t} ) ],

  where a_t^* = argmax_{a∈[K]} E[r_{t,a}] is the optimal action at round t
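The interaction protocol and pseudo-regret above can be sketched in a few lines. Everything in this snippet is illustrative: the linear reward model `theta_true` and the uniform placeholder policy are made up, standing in for the unknown h and for NeuralUCB respectively.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, d = 1000, 5, 8
theta_true = rng.standard_normal(d)            # hidden reward parameter (toy stand-in for h)

regret = 0.0
for t in range(T):
    X = rng.standard_normal((K, d))            # K contextual vectors x_{t,a}
    mean_rewards = X @ theta_true              # E[r_{t,a}] under the toy model
    a_star = int(np.argmax(mean_rewards))      # optimal action a*_t
    a_t = int(rng.integers(K))                 # placeholder policy; NeuralUCB would go here
    regret += mean_rewards[a_star] - mean_rewards[a_t]   # pseudo-regret increment
```

A policy that learns the rewards should make `regret / T` shrink toward 0; the uniform policy keeps it bounded away from 0.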


SLIDE 12

Background – contextual linear bandit

r_{t,a_t} = ⟨θ*, x_{t,a_t}⟩ + ξ_t, where ξ_t is ν-sub-Gaussian
◮ Build a confidence set for θ* and use the optimism-in-the-face-of-uncertainty (OFU) principle
◮ Leads to Õ(d√T) regret (Abbasi-Yadkori et al. 2011)
◮ Strongly depends on the linear structure!


SLIDE 16

Background – general reward function

r_{t,a_t} = h(x_{t,a_t}) + ξ_t, with 0 ≤ h(x) ≤ 1 and ξ_t ν-sub-Gaussian
◮ Includes many popular contextual bandit problems
  ◮ Linear bandit: h(x) = ⟨θ, x⟩, where ‖θ‖₂ ≤ 1, ‖x‖₂ ≤ 1
  ◮ Generalized linear bandit: h(x) = g(⟨θ, x⟩), where ‖θ‖₂ ≤ 1, ‖x‖₂ ≤ 1, |g′| ≤ 1

We do not know what h is... Use a universal function approximator, such as a neural network!


SLIDE 19

Background – neural network

Fully connected neural network:

  f(x; θ) = √m · W_L σ(W_{L−1} σ(··· σ(W_1 x)))

◮ σ(x) = max{x, 0} is the ReLU activation function
◮ The W_i are the weight matrices
  ◮ W_1 ∈ R^{m×d}
  ◮ W_i ∈ R^{m×m}, 2 ≤ i ≤ L − 1
  ◮ W_L ∈ R^{1×m}


SLIDE 21

Background – neural network

Fully connected neural network:

  f(x; θ) = √m · W_L σ(W_{L−1} σ(··· σ(W_1 x)))

◮ σ(x) = max{x, 0} is the ReLU activation function
◮ θ = [vec(W_1)⊤, ..., vec(W_L)⊤]⊤ ∈ R^p, with p = m + md + m²(L − 1)
◮ Gradient of the neural network: g(x; θ) = ∇_θ f(x; θ) ∈ R^p
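The forward pass above is easy to write down directly. This NumPy sketch uses a generic Gaussian initialization just to produce numbers (the special NeuralUCB initialization comes later); the width, depth, and 1/√m scaling here are illustrative choices, not the paper's exact setup. One sanity check: a ReLU network of this form is positively homogeneous in its input, so f(cx) = c·f(x) for c > 0.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_params(d, m, L, rng):
    """Random weights for an L-layer, width-m fully connected ReLU net."""
    Ws = [rng.standard_normal((m, d)) / np.sqrt(m)]                         # W_1 ∈ R^{m×d}
    Ws += [rng.standard_normal((m, m)) / np.sqrt(m) for _ in range(L - 2)]  # W_i ∈ R^{m×m}
    Ws.append(rng.standard_normal((1, m)) / np.sqrt(m))                     # W_L ∈ R^{1×m}
    return Ws

def f(x, Ws):
    """f(x; θ) = √m · W_L σ(W_{L−1} σ(··· σ(W_1 x)))."""
    m = Ws[0].shape[0]
    h = x
    for W in Ws[:-1]:
        h = relu(W @ h)                 # hidden layers with ReLU
    return float(np.sqrt(m) * (Ws[-1] @ h))

rng = np.random.default_rng(0)
d, m, L = 4, 16, 3
Ws = init_params(d, m, L, rng)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                  # unit-norm context, as assumed in the slides
y = f(x, Ws)                            # a scalar prediction
```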


SLIDE 24

Question

◮ Neural network-based contextual bandit algorithms exist (Riquelme et al. 2018; Zahavy and Mannor 2019)
  ◮ ... but they come with no theoretical guarantee

Can we design a provably efficient neural network-based algorithm that learns a general reward function?

Yes! NeuralUCB
◮ A neural network models the reward function; a UCB strategy drives exploration
◮ Theoretical guarantee: Õ(√T) regret
◮ Matches the regret bound for the linear setting (Abbasi-Yadkori et al. 2011)


SLIDE 27

NeuralUCB – initialization

◮ Special initialization of θ_0
  ◮ For 1 ≤ l ≤ L − 1, W_l = (W, 0; 0, W) is block diagonal, with entries W_{i,j} ∼ N(0, 4/m)
  ◮ For layer L, W_L = (w⊤, −w⊤), with entries w_i ∼ N(0, 2/m)
◮ Normalization of the contexts {x_i}: for any 1 ≤ i ≤ TK, ‖x_i‖₂ = 1 and [x_i]_j = [x_i]_{j+d/2}
  ◮ For any unit vector x, construct x′ = (x; x)/√2

These choices guarantee that f(x_i; θ_0) = 0!
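The zero-at-initialization property can be verified numerically. A minimal sketch of the symmetric initialization: duplicated context halves pass through identical block-diagonal weights, so the hidden representation is itself duplicated, and the antisymmetric last layer (w⊤, −w⊤) cancels the two halves exactly. Dimensions and seeds are arbitrary.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def neural_ucb_init(d, m, L, rng):
    """W_l = (W, 0; 0, W) with entries ~ N(0, 4/m); W_L = (w⊤, −w⊤) with w ~ N(0, 2/m)."""
    assert d % 2 == 0 and m % 2 == 0
    Ws = []
    W = rng.normal(0.0, np.sqrt(4.0 / m), size=(m // 2, d // 2))
    Ws.append(np.block([[W, np.zeros_like(W)], [np.zeros_like(W), W]]))      # W_1
    for _ in range(L - 2):
        W = rng.normal(0.0, np.sqrt(4.0 / m), size=(m // 2, m // 2))
        Ws.append(np.block([[W, np.zeros_like(W)], [np.zeros_like(W), W]]))  # W_l
    w = rng.normal(0.0, np.sqrt(2.0 / m), size=(1, m // 2))
    Ws.append(np.concatenate([w, -w], axis=1))                               # W_L
    return Ws

def f(x, Ws):
    m = Ws[0].shape[0]
    h = x
    for W in Ws[:-1]:
        h = relu(W @ h)
    return float(np.sqrt(m) * (Ws[-1] @ h))

rng = np.random.default_rng(0)
d, m, L = 8, 32, 3
Ws = neural_ucb_init(d, m, L, rng)
x = rng.standard_normal(d // 2)
x /= np.linalg.norm(x)
x_dup = np.concatenate([x, x]) / np.sqrt(2)   # x′ = (x; x)/√2, still unit norm
out = f(x_dup, Ws)                            # cancels to 0 up to float error
```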


SLIDE 31

NeuralUCB – upper confidence bounds

At round t, NeuralUCB will...
◮ Observe {x_{t,a}}_{a=1}^K
◮ Compute an upper confidence bound for each arm a:

  U_{t,a} = f(x_{t,a}; θ_{t−1})   [mean]
          + γ_{t−1} √( g(x_{t,a}; θ_{t−1})⊤ Z_{t−1}^{−1} g(x_{t,a}; θ_{t−1}) / m )   [variance]

  Compared with LinUCB (Li et al. 2010):

  U_{t,a} = ⟨x_{t,a}, θ_{t−1}⟩   [mean]
          + γ_{t−1} √( x_{t,a}⊤ Z_{t−1}^{−1} x_{t,a} )   [variance]

◮ Select a_t = argmax_{a∈[K]} U_{t,a}, play a_t, and observe the reward r_{t,a_t}
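The index computation and arm selection above reduce to a quadratic form per arm. In this sketch the network outputs `preds` and the gradient features `g` are synthetic random placeholders (computing real g(x; θ) needs the network from the earlier slides); only the UCB arithmetic is the point.

```python
import numpy as np

def ucb_indices(g, Z_inv, preds, gamma, m):
    """U_a = f(x_a; θ) + γ · √( g_a⊤ Z⁻¹ g_a / m ) for each arm a.

    g: (K, p) rows of gradient features, preds: (K,) network predictions."""
    bonus = np.sqrt(np.einsum("ap,pq,aq->a", g, Z_inv, g) / m)  # per-arm quadratic form
    return preds + gamma * bonus

rng = np.random.default_rng(1)
K, p, m, gamma = 4, 6, 16, 1.0
g = rng.standard_normal((K, p))        # stand-in for g(x_{t,a}; θ_{t−1})
preds = rng.standard_normal(K)         # stand-in for f(x_{t,a}; θ_{t−1})
Z = np.eye(p) + (g.T @ g) / m          # Z built from rank-one updates g g⊤ / m, λ = 1
U = ucb_indices(g, np.linalg.inv(Z), preds, gamma, m)
a_t = int(np.argmax(U))                # play the optimistic arm
```

Since Z is positive definite, the exploration bonus is nonnegative, so U_{t,a} always sits at or above the mean prediction, which is exactly the optimism in OFU.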


SLIDE 34

NeuralUCB – update parameter

After receiving the reward, NeuralUCB will...
◮ Update Z_t:

  Z_t = Z_{t−1} + g(x_{t,a_t}; θ_{t−1}) g(x_{t,a_t}; θ_{t−1})⊤ / m

◮ Update θ_t using gradient descent
  ◮ Define the loss function L(θ) as

    L(θ) = Σ_{i=1}^t ( f(x_{i,a_i}; θ) − r_{i,a_i} )² / 2 + mλ ‖θ − θ^{(0)}‖₂² / 2

  ◮ Run J steps of gradient descent on L(θ) starting from θ_0, and take θ_t as the last iterate:

    θ^{(0)} = θ_0,  θ^{(j+1)} = θ^{(j)} − η ∇L(θ^{(j)}),  θ_t = θ^{(J)}
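A minimal sketch of the J-step gradient-descent update, using a linear stand-in f(x; θ) = θ·x so the gradient has a closed form (the real algorithm differentiates through the network); data, step size, and dimensions are arbitrary.

```python
import numpy as np

def gd_update(X, r, theta0, lam, m, eta, J):
    """J steps of GD on L(θ) = Σ_i (f(x_i;θ) − r_i)²/2 + mλ‖θ − θ^{(0)}‖²/2,
    with the linear stand-in f(x; θ) = θ·x."""
    theta = theta0.copy()
    for _ in range(J):
        resid = X @ theta - r                          # f(x_i; θ) − r_i
        grad = X.T @ resid + m * lam * (theta - theta0)  # ∇L(θ)
        theta -= eta * grad
    return theta

rng = np.random.default_rng(2)
t, p, m, lam = 20, 5, 16, 0.1
X = rng.standard_normal((t, p))                        # past contexts x_{i,a_i}
r = rng.standard_normal(t)                             # past rewards r_{i,a_i}
theta0 = np.zeros(p)

def loss(th):
    return 0.5 * np.sum((X @ th - r) ** 2) + 0.5 * m * lam * np.sum((th - theta0) ** 2)

theta_J = gd_update(X, r, theta0, lam, m, eta=1e-2, J=200)
```

The regularizer anchors θ to its initialization θ^{(0)}, which is what lets the overparameterized analysis stay in the neighborhood where the NTK approximation holds.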

SLIDE 36

NeuralUCB – confidence radius

After updating the neural network, NeuralUCB computes γ_t:
◮ Under the overparameterized setting (m ≫ 1),

  γ_t = Õ( √λ S + ν √( log( det Z_t / det(λI) ) + 2 log(1/δ) )   [confidence radius]
        + (λ + tL)(1 − ηmλ)^{J/2} √(t/λ) )   [function approximation error]

Compared with LinUCB:

  γ_t = Õ( √λ S + ν √( log( det Z_t / det(λI) ) + 2 log(1/δ) ) )

— no function approximation error term!
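The two terms of γ_t can be evaluated numerically. This sketch follows the reconstructed formula above (the exact constants are absorbed into the Õ and are my choice here); note how the approximation-error term decays geometrically in J, so enough gradient-descent steps make it negligible.

```python
import numpy as np

def gamma_t(Z, lam, S, nu, delta, t, L, eta, m, J):
    """Returns (confidence radius, function approximation error), up to constants."""
    p = Z.shape[0]
    _, logdet_Z = np.linalg.slogdet(Z)                    # stable log det Z_t
    log_ratio = logdet_Z - p * np.log(lam)                # log(det Z_t / det(λI))
    radius = np.sqrt(lam) * S + nu * np.sqrt(log_ratio + 2 * np.log(1 / delta))
    approx_err = (lam + t * L) * (1 - eta * m * lam) ** (J / 2) * np.sqrt(t / lam)
    return radius, approx_err

Z = 1.5 * np.eye(10)                                      # toy Z_t ⪰ λI with λ = 1
r_small_J = gamma_t(Z, lam=1.0, S=1.0, nu=1.0, delta=0.05, t=100, L=2, eta=1e-3, m=100, J=10)
r_big_J = gamma_t(Z, lam=1.0, S=1.0, nu=1.0, delta=0.05, t=100, L=2, eta=1e-3, m=100, J=1000)
```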


SLIDE 41

Main theory – assumptions

Assumption
There exists λ_0 > 0 such that H ⪰ λ_0 I, where H is the neural tangent kernel matrix (Jacot et al. 2018; Cao and Gu 2019) on the contexts {x_i}_{i=1}^{TK}.

◮ Satisfied if no two contexts in {x_i}_{i=1}^{TK} are parallel.

Definition
The effective dimension d̃ of the neural tangent kernel matrix on the contexts {x_i}_{i=1}^{TK} is defined as

  d̃ = log det(I + H/λ) / log(1 + TK/λ).

◮ Notion adapted from Valko et al. (2013) and Yang and Wang (2019)
◮ d̃ ∼ log T in several special cases (Valko et al. 2013)


SLIDE 45

Main theory – regret bound

Theorem
Let h = [h(x_i)]_{i=1}^{TK} ∈ R^{TK}. Set J = Θ̃(TL/λ), η = Θ((mTL + mλ)^{−1}), and S = √(2 h⊤H^{−1}h). Under the overparameterized setting (m ≫ 1), with probability at least 1 − δ,

  R_T = Õ( √( d̃ T · max{ d̃, S² } ) ).

◮ h belongs to the RKHS H spanned by H ⇒ S ≤ ‖h‖_H
◮ R_T does not depend on p, the dimension of the dynamic feature mapping g(x; θ)
◮ Recovers the regret of linear contextual bandits, Õ(d√T) (Abbasi-Yadkori et al. 2011)


SLIDE 49

Takeaway message

◮ NeuralUCB uses the neural network f(x; θ_t) to predict and its gradient g(x; θ_t) to explore
◮ NeuralUCB achieves Õ(√T) regret, matching the result for the linear setting
◮ NeuralUCB also works well empirically

[Figure: cumulative regret over 10,000 rounds for LinUCB, KernelUCB, BootstrappedNN, Neural ε-Greedy0, NeuralUCB0, Neural ε-Greedy, and NeuralUCB]

Thank you!