SLIDE 1

Stay With Me: Lifetime Maximization Through Heteroscedastic Linear Bandits With Reneging

Ping-Chun Hsieh¹, Xi Liu¹, Anirban Bhattacharya², and P. R. Kumar¹

¹Department of ECE, Texas A&M University

²Department of Statistics, Texas A&M University

ICML 2019

Poster @ Pacific Ballroom #124


SLIDE 3

Lifetime Maximization: Continuing The Play

  • A finite game is played for the purpose of winning.

  • An infinite game is for the purpose of continuing the play.

→ Lifetime maximization


SLIDE 4

Why Lifetime Maximization?

Medical treatments · Portfolio selection · Cloud services

Salient features of these applications:

  1. Each participant has a satisfaction level.
  2. A participant drops if the outcomes are not satisfactory.
  3. The outcomes depend heavily on the contextual information of the participant.


SLIDE 5

Model: Linear Bandits With Reneging

  1. {xt,a}a∈A are pairwise participant-action contexts (observed by the platform when participant t arrives).
  2. Outcome rt,a is conditionally independent given the context and has mean θ∗⊤xt,a.
  3. Participant t keeps interacting with the platform as long as rt,a ≥ βt. Otherwise, the participant drops.
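To make the interaction protocol concrete, here is a minimal simulation sketch. It is an illustration, not from the slides: the unit-variance Gaussian noise and every name in it are assumptions. Since each outcome clears the threshold independently, the lifetime of a participant served by a fixed action is geometric.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_lifetime(theta_star, contexts, beta, policy,
                      sigma=1.0, max_rounds=10_000):
    """Play rounds for one participant until an outcome falls below beta.

    contexts: array of shape (num_actions, d), one context per action.
    policy:   callable mapping contexts -> action index.
    Returns the number of rounds survived before reneging (the lifetime).
    """
    for lifetime in range(max_rounds):
        a = policy(contexts)                                  # platform picks an action
        r = theta_star @ contexts[a] + sigma * rng.normal()   # noisy linear outcome
        if r < beta:                                          # participant reneges
            return lifetime
    return max_rounds
```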



SLIDE 7

Heteroscedastic Outcomes

  • Heteroscedasticity: outcome variations can be wildly different across different participants and actions.

  • Example: two actions, 1 (red) and 2 (blue), and a participant satisfaction level β. (The slide's figure is not reproduced here; a numerical version follows below.)

  • Heteroscedasticity is widely studied in econometrics, and is usually captured through regression on variance.
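A small check of the two-action example (the means, variances, and threshold below are made up for illustration): when both actions share the same mean outcome and the satisfaction level β sits above that mean, the higher-variance action is the one less likely to trigger reneging.

```python
from scipy.stats import norm

beta = 1.5                   # satisfaction level (assumed)
mu = 1.0                     # both actions share this mean outcome
sigma1, sigma2 = 0.5, 2.0    # action 1 (red): low variance; action 2 (blue): high variance

# Reneging probability P{r < beta} for each action
p1 = norm.cdf((beta - mu) / sigma1)
p2 = norm.cdf((beta - mu) / sigma2)
print(f"P(renege | action 1) = {p1:.3f}")   # 0.841
print(f"P(renege | action 2) = {p2:.3f}")   # 0.599 -> the high-variance action is safer here
```

So ranking actions by mean outcome alone is not enough: the variance has to be modeled as well, which is exactly what the next slide adds.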


SLIDE 8

Model: Heteroscedastic Bandits With Reneging

  1. {xt,a}a∈A are pairwise participant-action contexts (observed by the platform when participant t arrives).
  2. Outcome rt,a is conditionally independent given the context and satisfies rt,a ∼ N(θ∗⊤xt,a, f(φ∗⊤xt,a)).
  3. Participant t keeps interacting with the platform as long as rt,a ≥ βt. Otherwise, the participant drops.
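A minimal sampler for this outcome model. The slides treat the variance link f as a known abstract function; the choice f = exp below, which keeps variances positive, is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(z):
    """Assumed variance link: any known positive function could play this role."""
    return np.exp(z)

def draw_outcome(theta_star, phi_star, x):
    """Sample r ~ N(theta*^T x, f(phi*^T x)): mean AND variance depend on the context."""
    mean = theta_star @ x
    std = np.sqrt(f(phi_star @ x))
    return mean + std * rng.normal()
```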



SLIDE 10

Oracle Policy and Regret

  • The oracle policy πoracle already knows θ∗ and φ∗.

  • For each participant t, πoracle keeps choosing the action that minimizes the reneging probability P{rt,a < βt | xt,a}.

  • Hence, πoracle is a fixed policy.

  • For T participants, define

    Regretπ(T) = (total expected lifetime under πoracle) − (total expected lifetime under π).
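Under the Gaussian model the reneging probability has a closed form, so the oracle takes only a few lines (a sketch, with the same assumed link f = exp as before; minimizing the reneging probability maximizes the expected lifetime because the lifetime is geometric with mean 1/P{renege}):

```python
import numpy as np
from scipy.stats import norm

def renege_prob(theta, phi, x, beta, f=np.exp):
    """P{r < beta | x} when r ~ N(theta^T x, f(phi^T x)); f = exp is an assumed link."""
    return norm.cdf((beta - theta @ x) / np.sqrt(f(phi @ x)))

def oracle_action(theta_star, phi_star, contexts, beta):
    """The oracle knows theta* and phi*: pick the action with the smallest reneging probability."""
    probs = [renege_prob(theta_star, phi_star, x, beta) for x in contexts]
    return int(np.argmin(probs))
```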



SLIDE 14

Proposed Algorithm: HR-UCB

  • When participant t arrives, obtain estimators θ̂, φ̂ with confidence intervals Cθ, Cφ based on past observations.

  • For each action a, construct a UCB index

    QtHR(xt,a) = [Φ((βt − θ̂⊤xt,a) / √(f(φ̂⊤xt,a)))]⁻¹ + Δ(Cθ, Cφ, xt,a),    (1)

    where the first term is the estimated expected lifetime and the second term is a confidence interval for the lifetime.

  • Apply the action arg maxa QtHR(xt,a). (A code sketch follows the list below.)

Main technical challenges:

  1. Design estimators θ̂, φ̂ under heteroscedasticity.
  2. Derive the confidence intervals Cθ, Cφ for θ̂, φ̂.
  3. Convert Cθ, Cφ into a confidence interval for the lifetime.
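A sketch of index (1) and the resulting action choice. Illustrative only: the link f = exp is the same assumption as above, and the bonus Δ is passed in, to be instantiated from the theorems on the last slide.

```python
import numpy as np
from scipy.stats import norm

def hr_ucb_index(theta_hat, phi_hat, x, beta, bonus, f=np.exp):
    """Index (1): estimated expected lifetime [Phi(...)]^{-1} plus an exploration bonus."""
    p = norm.cdf((beta - theta_hat @ x) / np.sqrt(f(phi_hat @ x)))
    return 1.0 / max(p, 1e-12) + bonus          # floor avoids division by zero

def hr_ucb_action(theta_hat, phi_hat, contexts, beta, bonuses):
    """Apply arg max_a QtHR(xt,a)."""
    scores = [hr_ucb_index(theta_hat, phi_hat, x, beta, b)
              for x, b in zip(contexts, bonuses)]
    return int(np.argmax(scores))
```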



SLIDE 16

Estimators of θ∗ and φ∗ (Challenge 1)

  • Generalized least squares estimator (Wooldridge, 2015): with any n outcome observations,

    θ̂n = (Xn⊤Xn + λI)⁻¹ Xn⊤ r,
    φ̂n = (Xn⊤Xn + λI)⁻¹ Xn⊤ f⁻¹(ε̂ ∘ ε̂),

    where
    • Xn is the matrix of the n applied contexts,
    • r is the vector of the n observed outcomes,
    • ε̂(xt,a) = rt,a − θ̂n⊤xt,a is the estimated residual with respect to θ̂n.

  • Nice property (Abbasi-Yadkori et al., 2011): let Vn = Xn⊤Xn + λI. For any δ > 0, with probability at least 1 − δ, for all n ∈ ℕ,

    ‖θ̂n − θ∗‖Vn ≤ Cθ(δ, n) = O(√(log(1/δ) + log n)).
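A sketch of this two-stage estimator (the inverse link f⁻¹ = log matches the exp link assumed earlier, and the floor on the squared residuals is a numerical guard, not part of the slides):

```python
import numpy as np

def gls_estimators(X, r, lam=1.0, f_inv=np.log):
    """Stage 1: ridge-regress outcomes r on contexts X to get theta_hat.
    Stage 2: ridge-regress f^{-1}(squared residuals) on X to get phi_hat."""
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)             # V_n = X_n^T X_n + lambda * I
    theta_hat = np.linalg.solve(V, X.T @ r)
    eps = r - X @ theta_hat                   # estimated residuals w.r.t. theta_hat
    z = f_inv(np.maximum(eps**2, 1e-12))      # f^{-1}(eps ∘ eps), floored for the log
    phi_hat = np.linalg.solve(V, X.T @ z)
    return theta_hat, phi_hat, V
```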



SLIDE 19

Main Technical Contributions (Challenges 2 & 3)

Theorem. For any δ > 0, with probability at least 1 − 2δ,

    ‖φ̂n − φ∗‖Vn ≤ Cφ(δ, n) = O(√(log(1/δ) + log n)), ∀n ∈ ℕ.    (2)

  • The proof is more involved since φ̂n depends on the residual ε̂.

Theorem. The quantity

    Δ(Cθ(δ, n), Cφ(δ, n), x) := (k1 Cθ(δ, n) + k2 Cφ(δ, n)) · ‖x‖Vn⁻¹

is a confidence interval with respect to the lifetime, where k1, k2 are constants independent of the past history and of x.

Theorem. Under the HR-UCB policy,

    Regret(T) = O(√(T(log T)³)).
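For completeness, the lifetime confidence width from the second theorem is immediate to compute once Cθ and Cφ are known (a sketch: k1 and k2 exist by the theorem, but their values come from the paper's constants, so they are placeholders here):

```python
import numpy as np

def lifetime_bonus(C_theta, C_phi, x, V, k1=1.0, k2=1.0):
    """Delta(C_theta, C_phi, x) = (k1*C_theta + k2*C_phi) * ||x||_{V^{-1}}."""
    x_norm = np.sqrt(x @ np.linalg.solve(V, x))   # ||x||_{V^{-1}} via a linear solve
    return (k1 * C_theta + k2 * C_phi) * x_norm
```

Plugging this in as the bonus Δ in index (1) connects the estimators, the confidence bounds, and the regret theorem.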
