 
Stay With Me: Lifetime Maximization Through Heteroscedastic Linear Bandits With Reneging
Ping-Chun Hsieh 1, Xi Liu 1, Anirban Bhattacharya 2, and P. R. Kumar 1
1 Department of ECE, Texas A&M University
2 Department of Statistics, Texas A&M University
ICML 2019 Poster @ Pacific Ballroom #124
Lifetime Maximization: Continuing The Play
• A finite game is played for the purpose of winning.
• An infinite game is for the purpose of continuing the play.
Lifetime maximization corresponds to the infinite game: the goal is to keep the play going.
Why Lifetime Maximization?
Example applications: medical treatments, portfolio selection, cloud services.
Salient features of these applications:
1. Each participant has a satisfaction level.
2. A participant drops if the outcomes are not satisfactory.
3. The outcomes depend heavily on the contextual information of the participant.
Model: Linear Bandits With Reneging
1. $\{x_{t,a}\}_{a \in \mathcal{A}}$ are pairwise participant-action contexts (observed by the platform when participant $t$ arrives).
2. Outcome $r_{t,a}$ is conditionally independent given the context and has mean $\theta_*^\top x_{t,a}$.
3. Participant $t$ keeps interacting with the platform as long as $r_{t,a} \geq \beta_t$. Otherwise, the participant drops.
Heteroscedastic Outcomes
• Heteroscedasticity: outcome variations can be wildly different across different participants and actions.
• Example:
  • Two actions, 1 (red) and 2 (blue)
  • Participant satisfaction level = $\beta$
• Heteroscedasticity is widely studied in econometrics, and is usually captured through regression on variance.
Model: Heteroscedastic Bandits With Reneging
1. $\{x_{t,a}\}_{a \in \mathcal{A}}$ are pairwise participant-action contexts (observed by the platform when participant $t$ arrives).
2. Outcome $r_{t,a}$ is conditionally independent given the context and satisfies $r_{t,a} \sim \mathcal{N}\big(\theta_*^\top x_{t,a},\, f(\phi_*^\top x_{t,a})\big)$.
3. Participant $t$ keeps interacting with the platform if $r_{t,a} \geq \beta_t$. Otherwise, the participant drops.
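The model above can be simulated in a few lines. The sketch below is illustrative only: the variance link `f` is not specified on the slide, so `math.exp` is a hypothetical choice that keeps the variance positive; the function name, the `max_rounds` cap, and the convention of counting the round in which the participant drops are likewise assumptions.

```python
import math
import random


def simulate_lifetime(theta, phi, x, beta, rng, f=math.exp, max_rounds=10_000):
    """Simulate one participant who always receives the action with context x.

    Returns the number of rounds played, counting the round in which the
    outcome first falls below beta (an assumed counting convention).
    """
    mean = sum(t_i * x_i for t_i, x_i in zip(theta, x))
    # f maps phi^T x to the outcome variance; exp is a hypothetical link
    # used here only because it guarantees a positive variance.
    variance = f(sum(p_i * x_i for p_i, x_i in zip(phi, x)))
    std = math.sqrt(variance)
    for round_index in range(1, max_rounds + 1):
        outcome = rng.gauss(mean, std)
        if outcome < beta:
            return round_index  # the participant reneges after this round
    return max_rounds  # truncated: no reneging within the simulation cap
```

With a mean well above `beta` and a tiny variance the participant stays until the cap; with a mean below `beta` the participant leaves after the first round.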
Oracle Policy and Regret
• The oracle policy $\pi^{\text{oracle}}$ already knows $\theta_*$ and $\phi_*$.
• For each participant $t$, $\pi^{\text{oracle}}$ keeps choosing the action that minimizes the reneging probability $P\{r_{t,a} < \beta_t \mid x_{t,a}\}$.
• Hence, $\pi^{\text{oracle}}$ is a fixed policy.
• For $T$ participants, define
$$\text{Regret}_{\pi}(T) = (\text{the total expected lifetime under } \pi^{\text{oracle}}) - (\text{the total expected lifetime under } \pi).$$
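A minimal sketch of the oracle's choice under the Gaussian outcome model, where the reneging probability of an action is $\Phi\big((\beta_t - \theta_*^\top x_{t,a}) / \sqrt{f(\phi_*^\top x_{t,a})}\big)$. Assumptions made here and not stated on the slide: `f` is taken to be `math.exp`, outcomes across rounds are i.i.d. given the context (so the lifetime under a fixed action is geometric with mean $1/p$), and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def reneging_probability(theta, phi, x, beta, f=math.exp):
    """P{r < beta | x} when r ~ N(theta^T x, f(phi^T x))."""
    mean = sum(t_i * x_i for t_i, x_i in zip(theta, x))
    # exp is a hypothetical variance link; the slide leaves f unspecified.
    std = math.sqrt(f(sum(p_i * x_i for p_i, x_i in zip(phi, x))))
    return NormalDist().cdf((beta - mean) / std)


def oracle_action(theta, phi, contexts, beta, f=math.exp):
    """Return the action with the smallest reneging probability and the
    expected lifetime 1/p of repeating it (assumes i.i.d. rounds)."""
    probabilities = {
        action: reneging_probability(theta, phi, x, beta, f)
        for action, x in contexts.items()
    }
    best = min(probabilities, key=probabilities.get)
    p = probabilities[best]
    expected_lifetime = math.inf if p == 0.0 else 1.0 / p
    return best, expected_lifetime
```

When two actions share the same mean, the low-variance action is preferred if the mean exceeds the satisfaction level, and the high-variance action is preferred if the mean falls short of it. This is why the variance has to be learned alongside the mean.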
Proposed Algorithm: HR-UCB
• When participant $t$ arrives, obtain estimators $\hat\theta, \hat\phi$ with confidence intervals $C_\theta, C_\phi$ based on past observations.
• For each action $a$, construct a UCB index as
$$Q_t^{\text{HR}}(x_{t,a}) = \underbrace{\left(\Phi\left(\frac{\beta_t - \hat\theta^\top x_{t,a}}{\sqrt{f(\hat\phi^\top x_{t,a})}}\right)\right)^{-1}}_{\text{estimated expected lifetime}} + \underbrace{\Delta(C_\theta, C_\phi, x_{t,a})}_{\text{confidence interval for lifetime}} \tag{1}$$
• Apply the action $\arg\max_a Q_t^{\text{HR}}(x_{t,a})$.
Main technical challenges
1. Design estimators $\hat\theta, \hat\phi$ under heteroscedasticity.
2. Derive the confidence intervals $C_\theta, C_\phi$ for $\hat\theta, \hat\phi$.
3. Convert $C_\theta, C_\phi$ into the confidence interval of lifetime.
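The index in (1) and the arg max step can be sketched as follows. This is not the authors' implementation: the confidence term $\Delta$ is passed in as a precomputed per-action `bonus`, the variance link defaults to the hypothetical `math.exp`, and the `floor` that guards against division by a vanishing probability is an added numerical assumption.

```python
import math
from statistics import NormalDist


def hr_ucb_index(theta_hat, phi_hat, x, beta, bonus, f=math.exp, floor=1e-12):
    """Estimated expected lifetime 1/Phi(.) plus a confidence bonus."""
    mean_hat = sum(t_i * x_i for t_i, x_i in zip(theta_hat, x))
    # exp is a hypothetical variance link; the slide leaves f unspecified.
    std_hat = math.sqrt(f(sum(p_i * x_i for p_i, x_i in zip(phi_hat, x))))
    p_hat = NormalDist().cdf((beta - mean_hat) / std_hat)
    # floor is an assumed guard so that the index stays finite.
    return 1.0 / max(p_hat, floor) + bonus


def select_action(theta_hat, phi_hat, contexts, beta, bonuses):
    """Compute the index of every action and return the arg max."""
    indices = {
        action: hr_ucb_index(theta_hat, phi_hat, x, beta, bonuses[action])
        for action, x in contexts.items()
    }
    return max(indices, key=indices.get), indices
```

With zero bonuses the rule is greedy with respect to the current estimates; a large enough bonus on a poorly explored action makes the rule pick that action instead.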
Estimators of $\theta_*$ and $\phi_*$ (Challenge 1)
• Generalized least squares estimator (Wooldridge, 2015): with any $n$ outcome observations,
$$\hat\theta_n = \left(X_n^\top X_n + \lambda I\right)^{-1} X_n^\top r, \qquad \hat\phi_n = \left(X_n^\top X_n + \lambda I\right)^{-1} X_n^\top f^{-1}(\hat\varepsilon \circ \hat\varepsilon).$$
  • $X_n$ is the matrix of $n$ applied contexts.
  • $r$ is the vector of $n$ observed outcomes.
  • $\hat\varepsilon(x_{t,a}) = r_{t,a} - \hat\theta_n^\top x_{t,a}$ is the estimated residual with respect to $\hat\theta_n$, and $\circ$ denotes the elementwise product.
• Nice property (Abbasi-Yadkori et al., 2011): let $V_n = X_n^\top X_n + \lambda I$. For any $\delta > 0$, with probability at least $1 - \delta$, for all $n \in \mathbb{N}$,
$$\|\hat\theta_n - \theta_*\|_{V_n} \leq C_\theta(\delta, n) = O\left(\log\left(\tfrac{1}{\delta}\right) + \log n\right).$$
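The two estimators share the same regularized Gram matrix, so they can be computed together. The NumPy sketch below follows the formulas above under one loudly stated assumption: the slide does not specify $f$, so `f_inv` defaults to the identity (that is, a variance that is linear in the context), purely for illustration. The function name and the default `lam` are hypothetical.

```python
import numpy as np


def estimate_parameters(X, r, lam=1.0, f_inv=lambda v: v):
    """Regularized least-squares estimates of theta and phi.

    X: (n, d) matrix of applied contexts; r: (n,) observed outcomes.
    f_inv: inverse of the variance link f, applied elementwise. The identity
    default is a hypothetical choice, not taken from the slides.
    """
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)
    theta_hat = np.linalg.solve(V, X.T @ r)
    # Estimated residuals with respect to theta_hat.
    residuals = r - X @ theta_hat
    # Regress f^{-1}(squared residuals) on the same contexts.
    phi_hat = np.linalg.solve(V, X.T @ f_inv(residuals ** 2))
    return theta_hat, phi_hat, V
```

`np.linalg.solve` is used instead of forming $V_n^{-1}$ explicitly; the returned `V` is the matrix $V_n$ that appears in the confidence bounds.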
Main Technical Contributions (Challenges 2 & 3)
Theorem. For any $\delta > 0$, with probability at least $1 - 2\delta$, we have
$$\|\hat\phi_n - \phi_*\|_{V_n} \leq C_\phi(\delta, n) = O\left(\log\left(\tfrac{1}{\delta}\right) + \log n\right), \quad \forall n \in \mathbb{N}. \tag{2}$$
• The proof is more involved since $\hat\phi_n$ depends on the residual $\hat\varepsilon$.
Theorem. $\Delta(C_\theta(\delta, n), C_\phi(\delta, n), x) := \big(k_1 C_\theta(\delta, n) + k_2 C_\phi(\delta, n)\big) \cdot \|x\|_{V_n^{-1}}$ is a confidence interval with respect to lifetime, where $k_1, k_2$ are constants independent of past history and $x$.
Theorem. Under the HR-UCB policy, $\text{Regret}(T) = O\left(\sqrt{T (\log T)^3}\right)$.
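The confidence width $\Delta$ from the second theorem is a scaled $V_n^{-1}$-norm of the context, which is cheap to evaluate. In the sketch below the constants `k1`, `k2` and the radii `c_theta`, `c_phi` are placeholders supplied by the caller; their actual values come from the analysis and are not given on the slides.

```python
import numpy as np


def lifetime_confidence_width(x, V, c_theta, c_phi, k1=1.0, k2=1.0):
    """Delta = (k1 * C_theta + k2 * C_phi) * ||x||_{V^{-1}}.

    k1 and k2 default to 1.0 only as placeholders; the slides state that
    they are constants but do not give their values.
    """
    x = np.asarray(x, dtype=float)
    # ||x||_{V^{-1}} = sqrt(x^T V^{-1} x); solve avoids forming the inverse.
    weighted_norm = float(np.sqrt(x @ np.linalg.solve(V, x)))
    return (k1 * c_theta + k2 * c_phi) * weighted_norm
```

For fixed radii, the width shrinks as more observations along the direction of `x` accumulate in `V`.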