

slide-1
SLIDE 1

MVA-RL Course

The Exploration-Exploitation Dilemma

  • A. LAZARIC (SequeL Team @INRIA-Lille)

ENS Cachan - Master 2 MVA

SequeL – INRIA Lille

slide-2
SLIDE 2

The Exploration-Exploitation Dilemma

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 2/95

slide-3
SLIDE 3

The Exploration-Exploitation Dilemma

Tools Stochastic Multi-Armed Bandit Contextual Linear Bandit Other Multi-Armed Bandit Problems

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 2/95

slide-4
SLIDE 4

Learning the Optimal Policy

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at according to a suitable exploration policy
  3.2 Observe next state xt+1 and reward rt
  3.3 Compute the temporal difference δt (e.g., Q-learning)
  3.4 Update the Q-function: Q(xt, at) = Q(xt, at) + α(xt, at) δt
  3.5 Set t = t + 1

EndWhile
EndFor
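A minimal sketch of this loop in Python, assuming an ε-greedy exploration policy and a toy tabular environment with a `reset()`/`step()` interface (the environment API and hyperparameters are illustrative assumptions, not part of the slides):

```python
import numpy as np

def q_learning(env, n_episodes, n_states, n_actions,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        x, done = env.reset(), False
        while not done:
            # exploration policy: epsilon-greedy over the current Q estimates
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[x]))
            x_next, r, done = env.step(a)
            # temporal difference for Q-learning
            delta = r + gamma * np.max(Q[x_next]) - Q[x, a]
            Q[x, a] += alpha * delta
            x = x_next
    return Q
```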

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 3/95

slide-5
SLIDE 5

Learning the Optimal Policy

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at = arg maxa Q(xt, a)
  3.2 Observe next state xt+1 and reward rt
  3.3 Compute the temporal difference δt (e.g., Q-learning)
  3.4 Update the Q-function: Q(xt, at) = Q(xt, at) + α(xt, at) δt
  3.5 Set t = t + 1

EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 4/95

slide-6
SLIDE 6

Learning the Optimal Policy

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at = arg maxa Q(xt, a)
  3.2 Observe next state xt+1 and reward rt
  3.3 Compute the temporal difference δt (e.g., Q-learning)
  3.4 Update the Q-function: Q(xt, at) = Q(xt, at) + α(xt, at) δt
  3.5 Set t = t + 1

EndWhile
EndFor

⇒ no convergence

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 4/95

slide-7
SLIDE 7

Learning the Optimal Policy

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at ∼ U(A)
  3.2 Observe next state xt+1 and reward rt
  3.3 Compute the temporal difference δt (e.g., Q-learning)
  3.4 Update the Q-function: Q(xt, at) = Q(xt, at) + α(xt, at) δt
  3.5 Set t = t + 1

EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 5/95

slide-8
SLIDE 8

Learning the Optimal Policy

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at ∼ U(A)
  3.2 Observe next state xt+1 and reward rt
  3.3 Compute the temporal difference δt (e.g., Q-learning)
  3.4 Update the Q-function: Q(xt, at) = Q(xt, at) + α(xt, at) δt
  3.5 Set t = t + 1

EndWhile
EndFor

⇒ very poor rewards

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 5/95

slide-9
SLIDE 9

The Exploration-Exploitation Dilemma

Tools Contextual Linear Bandit Stochastic Multi-Armed Bandit Other Multi-Armed Bandit Problems

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 6/95

slide-10
SLIDE 10

Mathematical Tools

Concentration Inequalities

Proposition (Chernoff-Hoeffding Inequality)

Let Xi ∈ [ai, bi] be n independent random variables with means µi = E[Xi]. Then

P( | Σ_{i=1..n} (Xi − µi) | ≥ ε ) ≤ 2 exp( −2ε² / Σ_{i=1..n} (bi − ai)² ).
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 7/95

slide-11
SLIDE 11

Mathematical Tools

Concentration Inequalities

Proof.

P( Σ_{i=1..n} (Xi − µi) ≥ ε )
  = P( exp( s Σ_{i=1..n} (Xi − µi) ) ≥ exp(sε) )
  ≤ e^{−sε} E[ exp( s Σ_{i=1..n} (Xi − µi) ) ]            (Markov inequality)
  = e^{−sε} Π_{i=1..n} E[ exp( s (Xi − µi) ) ]            (independent random variables)
  ≤ e^{−sε} Π_{i=1..n} exp( s² (bi − ai)² / 8 )           (Hoeffding's lemma)
  = exp( −sε + s² Σ_{i=1..n} (bi − ai)² / 8 ).

If we choose s = 4ε / Σ_{i=1..n} (bi − ai)², the result follows.
A similar argument holds for P( Σ_{i=1..n} (Xi − µi) ≤ −ε ).
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 8/95

slide-12
SLIDE 12

Mathematical Tools

Concentration Inequalities

Finite-sample guarantee:

P( | (1/n) Σ_{t=1..n} Xt − E[X1] | > ε ) ≤ 2 exp( −2nε² / (b − a)² )

where the left-hand side measures the deviation, ε is the accuracy, and the right-hand side is the confidence.
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 9/95

slide-13
SLIDE 13

Mathematical Tools

Concentration Inequalities

Finite-sample guarantee:

P( | (1/n) Σ_{t=1..n} Xt − E[X1] | > (b − a) √( log(2/δ) / (2n) ) ) ≤ δ
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 10/95

slide-14
SLIDE 14

Mathematical Tools

Concentration Inequalities

Finite-sample guarantee:

P( | (1/n) Σ_{t=1..n} Xt − E[X1] | > ε ) ≤ δ    if    n ≥ (b − a)² log(2/δ) / (2ε²).
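A quick numeric check of this sample-size condition (the values of b − a, ε, and δ below are illustrative assumptions):

```python
import math

b_minus_a, eps, delta = 1.0, 0.05, 0.01
# n >= (b-a)^2 log(2/delta) / (2 eps^2) guarantees accuracy eps with confidence 1 - delta
n_min = (b_minus_a ** 2) * math.log(2 / delta) / (2 * eps ** 2)
print(math.ceil(n_min))  # ~1060 samples for |error| <= 0.05 with probability 0.99
```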

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 11/95

slide-15
SLIDE 15

Mathematical Tools

The Exploration-Exploitation Dilemma

Tools Stochastic Multi-Armed Bandit Contextual Linear Bandit Other Multi-Armed Bandit Problems

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 12/95

slide-16
SLIDE 16

Mathematical Tools

Reducing RL down to Multi-Armed Bandit

Definition (Markov decision process)

A Markov decision process is defined as a tuple M = (X, A, p, r):

◮ X is the state space
◮ A is the action space
◮ p(y|x, a) is the transition probability
◮ r(x, a, y) is the reward of transition (x, a, y)

⇒ r(a) is the reward of action a

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 13/95

slide-17
SLIDE 17

Mathematical Tools

Notice For coherence with the bandit literature we use the notation

◮ i = 1, . . . , K: set of possible actions (arms)
◮ t = 1, . . . , n: time
◮ It: action selected at time t
◮ Xi,t: reward of action i at time t

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 14/95

slide-18
SLIDE 18

Mathematical Tools

Learning the Optimal Policy

Objective: learn the optimal policy π∗ as efficiently as possible

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 15/95

slide-19
SLIDE 19

Mathematical Tools

Learning the Optimal Policy

Objective: learn the optimal policy π∗ as efficiently as possible

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at
  3.2 Observe next state xt+1 and reward rt
  3.3 Set t = t + 1

EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 15/95

slide-20
SLIDE 20

Mathematical Tools

The Multi–armed Bandit Protocol

The learner has i = 1, . . . , K arms (actions) At each round t = 1, . . . , n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 16/95

slide-21
SLIDE 21

Mathematical Tools

The Multi–armed Bandit Protocol

The learner has i = 1, . . . , K arms (actions) At each round t = 1, . . . , n

◮ At the same time

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 16/95

slide-22
SLIDE 22

Mathematical Tools

The Multi–armed Bandit Protocol

The learner has i = 1, . . . , K arms (actions) At each round t = 1, . . . , n

◮ At the same time

◮ The environment chooses a vector of rewards {Xi,t}_{i=1..K}

◮ The learner chooses an arm It

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 16/95

slide-23
SLIDE 23

Mathematical Tools

The Multi–armed Bandit Protocol

The learner has i = 1, . . . , K arms (actions) At each round t = 1, . . . , n

◮ At the same time

◮ The environment chooses a vector of rewards {Xi,t}_{i=1..K}

◮ The learner chooses an arm It

◮ The learner receives a reward XIt,t

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 16/95

slide-24
SLIDE 24

Mathematical Tools

The Multi–armed Bandit Protocol

The learner has i = 1, . . . , K arms (actions) At each round t = 1, . . . , n

◮ At the same time

◮ The environment chooses a vector of rewards {Xi,t}_{i=1..K}

◮ The learner chooses an arm It

◮ The learner receives a reward XIt,t
◮ The environment does not reveal the rewards of the other arms

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 16/95

slide-25
SLIDE 25

Mathematical Tools

The Multi–armed Bandit Game (cont’d)

The regret

Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 17/95

slide-26
SLIDE 26

Mathematical Tools

The Multi–armed Bandit Game (cont’d)

The regret

Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

The expectation summarizes any possible source of randomness (either in X or in the algorithm).

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 17/95

slide-27
SLIDE 27

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-28
SLIDE 28

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-29
SLIDE 29

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms Problem 2: Whenever the learner pulls a bad arm, it suffers some regret

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-30
SLIDE 30

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms Problem 2: Whenever the learner pulls a bad arm, it suffers some regret ⇒ the learner should reduce the regret by repeatedly pulling the best arm

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-31
SLIDE 31

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms Problem 2: Whenever the learner pulls a bad arm, it suffers some regret ⇒ the learner should reduce the regret by repeatedly pulling the best arm Challenge: The learner should solve two opposite problems!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-32
SLIDE 32

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms ⇒ exploration Problem 2: Whenever the learner pulls a bad arm, it suffers some regret ⇒ the learner should reduce the regret by repeatedly pulling the best arm Challenge: The learner should solve two opposite problems!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-33
SLIDE 33

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms ⇒ exploration Problem 2: Whenever the learner pulls a bad arm, it suffers some regret ⇒ the learner should reduce the regret by repeatedly pulling the best arm ⇒ exploitation Challenge: The learner should solve two opposite problems!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-34
SLIDE 34

Mathematical Tools

The Exploration–Exploitation Lemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner ⇒ the learner should gain information by repeatedly pulling all the arms ⇒ exploration Problem 2: Whenever the learner pulls a bad arm, it suffers some regret ⇒ the learner should reduce the regret by repeatedly pulling the best arm ⇒ exploitation Challenge: The learner should solve the exploration-exploitation dilemma!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 18/95

slide-35
SLIDE 35

Mathematical Tools

The Multi–armed Bandit Game (cont’d)

Examples

◮ Packet routing
◮ Clinical trials
◮ Web advertising
◮ Computer games
◮ Resource mining
◮ ...

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 19/95

slide-36
SLIDE 36

Mathematical Tools

The Stochastic Multi–armed Bandit Problem

Definition

The environment is stochastic

◮ Each arm has a distribution νi bounded in [0, 1] and

characterized by an expected value µi

◮ The rewards are i.i.d. Xi,t ∼ νi (as in the MDP model)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 20/95

slide-37
SLIDE 37

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-38
SLIDE 38

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-39
SLIDE 39

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = max_{i=1,...,K} (n µi) − E[ Σ_{t=1..n} XIt,t ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-40
SLIDE 40

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = max_{i=1,...,K} (n µi) − Σ_{i=1..K} E[Ti,n] µi

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-41
SLIDE 41

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = n µi∗ − Σ_{i=1..K} E[Ti,n] µi

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-42
SLIDE 42

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = Σ_{i≠i∗} E[Ti,n] (µi∗ − µi)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-43
SLIDE 43

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = Σ_{i≠i∗} E[Ti,n] ∆i

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-44
SLIDE 44

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Notation

◮ Number of times arm i has been pulled after n rounds

  Ti,n = Σ_{t=1..n} I{It = i}

◮ Regret

  Rn(A) = Σ_{i≠i∗} E[Ti,n] ∆i

◮ Gap ∆i = µi∗ − µi

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 21/95

slide-45
SLIDE 45

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Rn(A) = Σ_{i≠i∗} E[Ti,n] ∆i    ⇒    we only need to study the expected number of pulls of the suboptimal arms

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 22/95

slide-46
SLIDE 46

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Optimism in Face of Uncertainty Learning (OFUL) Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm.

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 23/95

slide-47
SLIDE 47

Mathematical Tools

The Stochastic Multi–armed Bandit Problem (cont’d)

Optimism in Face of Uncertainty Learning (OFUL) Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm. Why it works:

◮ If the best possible world is correct ⇒ no regret
◮ If the best possible world is wrong ⇒ the reduction in the uncertainty is maximized

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 23/95

slide-48
SLIDE 48

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm

The idea

[Figure: reward estimates and confidence intervals for 4 arms pulled 10, 73, 3, and 23 times; x-axis: arms, y-axis: reward]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 24/95

slide-49
SLIDE 49

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm

Show time!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 25/95

slide-50
SLIDE 50

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

At each round t = 1, . . . , n

◮ Compute the score of each arm i

  Bi = (optimistic score of arm i)

◮ Pull arm

  It = arg max_{i=1,...,K} Bi

◮ Update the number of pulls TIt,t = TIt,t−1 + 1 and the other statistics

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 26/95

slide-51
SLIDE 51

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

The score (with parameters ρ and δ) Bi = (optimistic score of arm i)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 27/95

slide-52
SLIDE 52

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

The score (with parameters ρ and δ) Bi,s,t = (optimistic score of arm i if pulled s times up to round t)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 27/95

slide-53
SLIDE 53

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

The score (with parameters ρ and δ): Bi,s,t = (optimistic score of arm i if pulled s times up to round t)

Optimism in face of uncertainty:
Current knowledge: average reward µ̂i,s
Current uncertainty: number of pulls s

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 27/95

slide-54
SLIDE 54

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

The score (with parameters ρ and δ): Bi,s,t = knowledge + optimism bonus (driven by the uncertainty)

Optimism in face of uncertainty:
Current knowledge: average reward µ̂i,s
Current uncertainty: number of pulls s

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 27/95

slide-55
SLIDE 55

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

The score (with parameters ρ and δ):

  Bi,s,t = µ̂i,s + ρ √( log(1/δ) / (2s) )

Optimism in face of uncertainty:
Current knowledge: average reward µ̂i,s
Current uncertainty: number of pulls s

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 27/95

slide-56
SLIDE 56

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

At each round t = 1, . . . , n

◮ Compute the score of each arm i

  Bi,t = µ̂i,Ti,t + ρ √( log(t) / (2 Ti,t) )

◮ Pull arm

  It = arg max_{i=1,...,K} Bi,t

◮ Update the number of pulls TIt,t = TIt,t−1 + 1 and µ̂i,Ti,t
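A minimal sketch of this UCB loop for Bernoulli arms (the `means` vector and the initialization round are illustrative assumptions):

```python
import numpy as np

def ucb(means, n, rho=1.0):
    K = len(means)
    pulls = np.zeros(K, dtype=int)
    mu_hat = np.zeros(K)
    for t in range(1, n + 1):
        if t <= K:                                   # pull each arm once to initialize
            i = t - 1
        else:
            bonus = rho * np.sqrt(np.log(t) / (2 * pulls))
            i = int(np.argmax(mu_hat + bonus))       # optimistic score B_{i,t}
        x = float(np.random.rand() < means[i])       # Bernoulli reward
        pulls[i] += 1
        mu_hat[i] += (x - mu_hat[i]) / pulls[i]      # incremental mean update
    return pulls, mu_hat

pulls, mu_hat = ucb([0.4, 0.5, 0.7], n=10_000)
```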

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 28/95

slide-57
SLIDE 57

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Theorem

Let X1, . . . , Xn be i.i.d. samples from a distribution bounded in [a, b]. Then for any δ ∈ (0, 1)

P( | (1/n) Σ_{t=1..n} Xt − E[X1] | > (b − a) √( log(2/δ) / (2n) ) ) ≤ δ
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 29/95

slide-58
SLIDE 58

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

After s pulls of arm i,

P( E[Xi] ≤ (1/s) Σ_{t=1..s} Xi,t + √( log(1/δ) / (2s) ) ) ≥ 1 − δ
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 30/95

slide-59
SLIDE 59

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

After s pulls of arm i,

P( µi ≤ µ̂i,s + √( log(1/δ) / (2s) ) ) ≥ 1 − δ
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 30/95

slide-60
SLIDE 60

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

After s pulls of arm i,

P( µi ≤ µ̂i,s + √( log(1/δ) / (2s) ) ) ≥ 1 − δ

⇒ UCB uses an upper confidence bound on the expectation

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 30/95

slide-61
SLIDE 61

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Theorem

For any set of K arms with distributions bounded in [0, b], if δ = 1/t, then UCB(ρ) with ρ > 1 achieves a regret

Rn(A) ≤ Σ_{i≠i∗} [ (4b² / ∆i) ρ log(n) + ∆i ( 3/2 + 1 / (2(ρ − 1)) ) ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 31/95

slide-62
SLIDE 62

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Let K = 2 with i∗ = 1. Then

Rn(A) ≤ O( (ρ log(n)) / ∆ )

Remark 1: the cumulative regret slowly increases as log(n)
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 32/95

slide-63
SLIDE 63

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Let K = 2 with i∗ = 1. Then

Rn(A) ≤ O( (ρ log(n)) / ∆ )

Remark 1: the cumulative regret slowly increases as log(n)

Remark 2: the smaller the gap, the bigger the regret... why?

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 32/95

slide-64
SLIDE 64

Mathematical Tools

The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Show time (again)!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 33/95

slide-65
SLIDE 65

Mathematical Tools

The Worst–case Performance

Remark: the regret bound is distribution–dependent

Rn(A; ∆) ≤ O( (ρ log(n)) / ∆ )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 34/95

slide-66
SLIDE 66

Mathematical Tools

The Worst–case Performance

Remark: the regret bound is distribution–dependent

Rn(A; ∆) ≤ O( (ρ log(n)) / ∆ )

Meaning: the algorithm is able to adapt to the specific problem at hand!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 34/95

slide-67
SLIDE 67

Mathematical Tools

The Worst–case Performance

Remark: the regret bound is distribution–dependent

Rn(A; ∆) ≤ O( (ρ log(n)) / ∆ )

Meaning: the algorithm is able to adapt to the specific problem at hand!

Worst-case performance: what is the distribution which leads to the worst possible performance of UCB? What is the distribution–free performance of UCB?

Rn(A) = sup_∆ Rn(A; ∆)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 34/95

slide-68
SLIDE 68

Mathematical Tools

The Worst–case Performance

Problem: it seems like if ∆ → 0 then the regret tends to infinity...

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 35/95

slide-69
SLIDE 69

Mathematical Tools

The Worst–case Performance

Problem: it seems like if ∆ → 0 then the regret tends to infinity...
... nonsense, because the regret is defined as Rn(A; ∆) = E[T2,n] ∆

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 35/95

slide-70
SLIDE 70

Mathematical Tools

The Worst–case Performance

Problem: it seems like if ∆ → 0 then the regret tends to infinity...
... nonsense, because the regret is defined as Rn(A; ∆) = E[T2,n] ∆, so if ∆ is small, the regret is also small...

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 35/95

slide-71
SLIDE 71

Mathematical Tools

The Worst–case Performance

Problem: it seems like if ∆ → 0 then the regret tends to infinity...
... nonsense, because the regret is defined as Rn(A; ∆) = E[T2,n] ∆, so if ∆ is small, the regret is also small...

In fact

Rn(A; ∆) = min{ O( (ρ log(n)) / ∆ ), E[T2,n] ∆ }
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 35/95

slide-72
SLIDE 72

Mathematical Tools

The Worst–case Performance

Then

Rn(A) = sup_∆ Rn(A; ∆) = sup_∆ min{ O( (ρ log(n)) / ∆ ), n∆ } ≈ √n    for ∆ = √(1/n)
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 36/95

slide-73
SLIDE 73

Mathematical Tools

Tuning the confidence δ of UCB

Remark: UCB is an anytime algorithm (δ = 1/t)

  Bi,s,t = µ̂i,s + ρ √( log(t) / (2s) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 37/95

slide-74
SLIDE 74

Mathematical Tools

Tuning the confidence δ of UCB

Remark: UCB is an anytime algorithm (δ = 1/t)

  Bi,s,t = µ̂i,s + ρ √( log(t) / (2s) )

Remark: if the time horizon n is known, then the optimal choice is δ = 1/n

  Bi,s,t = µ̂i,s + ρ √( log(n) / (2s) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 37/95

slide-75
SLIDE 75

Mathematical Tools

Tuning the confidence δ of UCB (cont’d)

Intuition: UCB should pull the suboptimal arms

◮ Enough: so as to understand which arm is the best ◮ Not too much: so as to keep the regret as small as possible

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 38/95

slide-76
SLIDE 76

Mathematical Tools

Tuning the confidence δ of UCB (cont’d)

Intuition: UCB should pull the suboptimal arms

◮ Enough: so as to understand which arm is the best ◮ Not too much: so as to keep the regret as small as possible

The confidence 1 − δ has the following impact (similar for ρ)

◮ Big 1 − δ: high level of exploration ◮ Small 1 − δ: high level of exploitation

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 38/95

slide-77
SLIDE 77

Mathematical Tools

Tuning the confidence δ of UCB (cont’d)

Intuition: UCB should pull the suboptimal arms

◮ Enough: so as to understand which arm is the best ◮ Not too much: so as to keep the regret as small as possible

The confidence 1 − δ has the following impact (similar for ρ)

◮ Big 1 − δ: high level of exploration ◮ Small 1 − δ: high level of exploitation

Solution: depending on the time horizon, we can tune how to trade-off between exploration and exploitation

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 38/95

slide-78
SLIDE 78

Mathematical Tools

UCB Proof

Let's dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]

  E = { ∀i, s : | µ̂i,s − µi | ≤ √( log(1/δ) / (2s) ) }

By Chernoff-Hoeffding, P[E] ≥ 1 − nKδ.
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 39/95

slide-79
SLIDE 79

Mathematical Tools

UCB Proof

Let's dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]

  E = { ∀i, s : | µ̂i,s − µi | ≤ √( log(1/δ) / (2s) ) }

By Chernoff-Hoeffding, P[E] ≥ 1 − nKδ.

At time t we pull arm i [algorithm]

  Bi,Ti,t−1 ≥ Bi∗,Ti∗,t−1

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 39/95

slide-80
SLIDE 80

Mathematical Tools

UCB Proof

Let's dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]

  E = { ∀i, s : | µ̂i,s − µi | ≤ √( log(1/δ) / (2s) ) }

By Chernoff-Hoeffding, P[E] ≥ 1 − nKδ.

At time t we pull arm i [algorithm]

  µ̂i,Ti,t−1 + √( log(1/δ) / (2 Ti,t−1) ) ≥ µ̂i∗,Ti∗,t−1 + √( log(1/δ) / (2 Ti∗,t−1) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 39/95

slide-81
SLIDE 81

Mathematical Tools

UCB Proof

Let's dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]

  E = { ∀i, s : | µ̂i,s − µi | ≤ √( log(1/δ) / (2s) ) }

By Chernoff-Hoeffding, P[E] ≥ 1 − nKδ.

At time t we pull arm i [algorithm]

  µ̂i,Ti,t−1 + √( log(1/δ) / (2 Ti,t−1) ) ≥ µ̂i∗,Ti∗,t−1 + √( log(1/δ) / (2 Ti∗,t−1) )

On the event E we have [math]

  µi + 2 √( log(1/δ) / (2 Ti,t−1) ) ≥ µi∗

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 39/95

slide-82
SLIDE 82

Mathematical Tools

UCB Proof (cont’d)

Assume t is the last time arm i is pulled; then Ti,n = Ti,t−1 + 1, thus

  µi + 2 √( log(1/δ) / (2 (Ti,n − 1)) ) ≥ µi∗

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 40/95

slide-83
SLIDE 83

Mathematical Tools

UCB Proof (cont’d)

Assume t is the last time arm i is pulled; then Ti,n = Ti,t−1 + 1, thus

  µi + 2 √( log(1/δ) / (2 (Ti,n − 1)) ) ≥ µi∗

Reordering [math]

  Ti,n ≤ log(1/δ) / (2∆i²) + 1

under event E, and thus with probability 1 − nKδ.

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 40/95

slide-84
SLIDE 84

Mathematical Tools

UCB Proof (cont’d)

Assume t is the last time arm i is pulled; then Ti,n = Ti,t−1 + 1, thus

  µi + 2 √( log(1/δ) / (2 (Ti,n − 1)) ) ≥ µi∗

Reordering [math]

  Ti,n ≤ log(1/δ) / (2∆i²) + 1

under event E, and thus with probability 1 − nKδ.

Moving to the expectation [statistics]

  E[Ti,n] = E[Ti,n I_E] + E[Ti,n I_{E^C}]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 40/95

slide-85
SLIDE 85

Mathematical Tools

UCB Proof (cont’d)

Assume t is the last time arm i is pulled; then Ti,n = Ti,t−1 + 1, thus

  µi + 2 √( log(1/δ) / (2 (Ti,n − 1)) ) ≥ µi∗

Reordering [math]

  Ti,n ≤ log(1/δ) / (2∆i²) + 1

under event E, and thus with probability 1 − nKδ.

Moving to the expectation [statistics]

  E[Ti,n] ≤ log(1/δ) / (2∆i²) + 1 + n (nKδ)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 40/95

slide-86
SLIDE 86

Mathematical Tools

UCB Proof (cont’d)

Assume t is the last time arm i is pulled; then Ti,n = Ti,t−1 + 1, thus

  µi + 2 √( log(1/δ) / (2 (Ti,n − 1)) ) ≥ µi∗

Reordering [math]

  Ti,n ≤ log(1/δ) / (2∆i²) + 1

under event E, and thus with probability 1 − nKδ.

Moving to the expectation [statistics]

  E[Ti,n] ≤ log(1/δ) / (2∆i²) + 1 + n (nKδ)

Trading off the two terms with δ = 1/n², we obtain the score µ̂i,Ti,t−1 + √( 2 log(n) / (2 Ti,t−1) ) and ...

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 40/95

slide-87
SLIDE 87

Mathematical Tools

UCB Proof (cont’d)

Trading off the two terms with δ = 1/n², we obtain the score

  µ̂i,Ti,t−1 + √( 2 log(n) / (2 Ti,t−1) )

and

  E[Ti,n] ≤ log(n) / ∆i² + 1 + K

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 41/95

slide-88
SLIDE 88

Mathematical Tools

Tuning the confidence δ of UCB (cont’d)

Multi–armed Bandit: the same for δ = 1/t and δ = 1/n...

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 42/95

slide-89
SLIDE 89

Mathematical Tools

Tuning the confidence δ of UCB (cont’d)

Multi–armed Bandit: the same for δ = 1/t and δ = 1/n... ... almost (i.e., in expectation)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 42/95

slide-90
SLIDE 90

Mathematical Tools

Tuning the confidence δ of UCB (cont’d)

The value–at–risk of the regret for UCB-anytime

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 43/95

slide-91
SLIDE 91

Mathematical Tools

Tuning the ρ of UCB (cont’d)

UCB values (for the δ = 1/n algorithm)

  Bi,s = µ̂i,s + ρ √( log(n) / (2s) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 44/95

slide-92
SLIDE 92

Mathematical Tools

Tuning the ρ of UCB (cont’d)

UCB values (for the δ = 1/n algorithm)

  Bi,s = µ̂i,s + ρ √( log(n) / (2s) )

Theory:
◮ ρ < 0.5: polynomial regret w.r.t. n
◮ ρ > 0.5: logarithmic regret w.r.t. n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 44/95

slide-93
SLIDE 93

Mathematical Tools

Tuning the ρ of UCB (cont’d)

UCB values (for the δ = 1/n algorithm)

  Bi,s = µ̂i,s + ρ √( log(n) / (2s) )

Theory:
◮ ρ < 0.5: polynomial regret w.r.t. n
◮ ρ > 0.5: logarithmic regret w.r.t. n

Practice: ρ = 0.2 is often the best choice

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 44/95

slide-94
SLIDE 94

Mathematical Tools

Tuning the ρ of UCB (cont’d)

UCB values (for the δ = 1/n algorithm)

  Bi,s = µ̂i,s + ρ √( log(n) / (2s) )

Theory:
◮ ρ < 0.5: polynomial regret w.r.t. n
◮ ρ > 0.5: logarithmic regret w.r.t. n

Practice: ρ = 0.2 is often the best choice

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 44/95

slide-95
SLIDE 95

Mathematical Tools

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate c.i.

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 45/95

slide-96
SLIDE 96

Mathematical Tools

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate c.i. Algorithm

◮ Compute the score of each arm i

  Bi,t = µ̂i,Ti,t + ρ √( log(t) / (2 Ti,t) )

◮ Pull arm

  It = arg max_{i=1,...,K} Bi,t

◮ Update the number of pulls TIt,t and µ̂i,Ti,t

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 45/95

slide-97
SLIDE 97

Mathematical Tools

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate c.i. Algorithm

◮ Compute the score of each arm i

  Bi,t = µ̂i,Ti,t + √( σ̂²i,Ti,t log(t) / Ti,t ) + 8 log(t) / (3 Ti,t)

◮ Pull arm

  It = arg max_{i=1,...,K} Bi,t

◮ Update the number of pulls TIt,t, µ̂i,Ti,t and σ̂²i,Ti,t

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 45/95

slide-98
SLIDE 98

Mathematical Tools

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate c.i. Algorithm

◮ Compute the score of each arm i

  Bi,t = µ̂i,Ti,t + √( σ̂²i,Ti,t log(t) / Ti,t ) + 8 log(t) / (3 Ti,t)

◮ Pull arm

  It = arg max_{i=1,...,K} Bi,t

◮ Update the number of pulls TIt,t, µ̂i,Ti,t and σ̂²i,Ti,t

Regret: Rn ≤ O( (1/∆) log(n) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 45/95

slide-99
SLIDE 99

Mathematical Tools

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate c.i. Algorithm

◮ Compute the score of each arm i

Bi,t = ˆ µi,Ti,t +

σ2

i,Ti,t log t

Ti,t + 8 log t 3Ti,t

◮ Pull arm

It = arg max

i=1,...,K Bi,t

◮ Update the number of pulls TIt,t, ˆ

µi,Ti,t and ˆ σ2

i,Ti,t

Regret Rn ≤ O σ2 ∆ log n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 45/95

slide-100
SLIDE 100

Mathematical Tools

Improvements: KL-UCB

Idea: use even tighter c.i. based on the Kullback–Leibler divergence

  d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 46/95

slide-101
SLIDE 101

Mathematical Tools

Improvements: KL-UCB

Idea: use even tighter c.i. based on the Kullback–Leibler divergence

  d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))

Algorithm: compute the score of each arm i (convex optimization)

  Bi,t = max{ q ∈ [0, 1] : Ti,t d( µ̂i,Ti,t , q ) ≤ log(t) + c log(log(t)) }
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 46/95

slide-102
SLIDE 102

Mathematical Tools

Improvements: KL-UCB

Idea: use even tighter c.i. based on the Kullback–Leibler divergence

  d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))

Algorithm: compute the score of each arm i (convex optimization)

  Bi,t = max{ q ∈ [0, 1] : Ti,t d( µ̂i,Ti,t , q ) ≤ log(t) + c log(log(t)) }

Regret: pulls to suboptimal arms

  E[Ti,n] ≤ (1 + ε) log(n) / d(µi, µ∗) + C1 log(log(n)) + C2(ε) / n^β(ε)

where d(µi, µ∗) > 2∆i²
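A sketch of how this index can be computed by bisection for Bernoulli arms (the constant c and the tolerance are assumptions; the KL divergence d(p, q) is increasing in q for q ≥ p, so bisection applies):

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, pulls, t, c=0.0, tol=1e-6):
    # largest q in [mu_hat, 1] such that pulls * d(mu_hat, q) <= log(t) + c log(log(t))
    level = math.log(t) + c * math.log(max(math.log(t), 1.0))
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * kl_bernoulli(mu_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```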

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 46/95

slide-103
SLIDE 103

Mathematical Tools

Improvements: Thompson strategy

Idea: Use a Bayesian approach to estimate the means {µi}i

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 47/95

slide-104
SLIDE 104

Mathematical Tools

Improvements: Thompson strategy

Idea: use a Bayesian approach to estimate the means {µi}i

Algorithm: assuming Bernoulli arms and a Beta prior on the means

◮ Compute Di,t = Beta(Si,t + 1, Fi,t + 1)

◮ Draw a mean sample µ̃i,t ∼ Di,t

◮ Pull arm It = arg max_i µ̃i,t

◮ If XIt,t = 1 update SIt,t+1 = SIt,t + 1, else update FIt,t+1 = FIt,t + 1

Regret:

  lim_{n→∞} Rn / log(n) = Σ_{i=1..K} ∆i / d(µi, µ∗)
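A minimal sketch of this Thompson strategy for Bernoulli arms with Beta(1, 1) priors (the `means` vector is an illustrative assumption):

```python
import numpy as np

def thompson(means, n):
    K = len(means)
    S = np.zeros(K)  # successes per arm
    F = np.zeros(K)  # failures per arm
    for _ in range(n):
        theta = np.random.beta(S + 1, F + 1)        # one posterior sample per arm
        i = int(np.argmax(theta))                   # pull the arm with the highest sample
        x = float(np.random.rand() < means[i])      # Bernoulli reward
        S[i] += x
        F[i] += 1 - x
    return S, F
```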

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 47/95

slide-105
SLIDE 105

Mathematical Tools

The Lower Bound

Theorem

For any stochastic bandit {νi}, any algorithm A has a regret such that

  lim_{n→∞} Rn / log(n) ≥ Σ_{i≠i∗} ∆i / inf_ν KL(νi, ν)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 48/95

slide-106
SLIDE 106

Mathematical Tools

The Lower Bound

Theorem

For any stochastic bandit {νi}, any algorithm A has a regret such that

  lim_{n→∞} Rn / log(n) ≥ Σ_{i≠i∗} ∆i / inf_ν KL(νi, ν)

Problem: this is just asymptotic

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 48/95

slide-107
SLIDE 107

Mathematical Tools

The Lower Bound

Theorem

For any stochastic bandit {νi}, any algorithm A has a regret such that

  lim_{n→∞} Rn / log(n) ≥ Σ_{i≠i∗} ∆i / inf_ν KL(νi, ν)

Problem: this is just asymptotic

Open question: what is the finite-time lower bound?

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 48/95

slide-108
SLIDE 108

Mathematical Tools

The Exploration-Exploitation Dilemma

Tools Stochastic Multi-Armed Bandit Contextual Linear Bandit Other Multi-Armed Bandit Problems

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 49/95

slide-109
SLIDE 109

Mathematical Tools

The Contextual Linear Bandit Problem

Motivating Example: news recommendation

◮ Different users may have different preferences ◮ Different news may have different characteristics ◮ The set of available news may change over time ◮ We want to minimise the regret w.r.t. the best news for each

user

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 50/95

slide-110
SLIDE 110

Mathematical Tools

The Linear Bandit Problem

Limitations of MAB:

◮ Arms are independent ◮ Each single arm has to be tested at least once ◮ Regret scales linearly with K

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 51/95

slide-111
SLIDE 111

Mathematical Tools

The Linear Bandit Problem

Limitations of MAB:

◮ Arms are independent ◮ Each single arm has to be tested at least once ◮ Regret scales linearly with K

Linear bandit approach:

◮ Embed arms in R^d (each arm a is mapped to a feature vector φa ∈ R^d)

◮ The reward varies linearly with the arm

  E[r(a)] = φa⊤ θ∗

where θ∗ ∈ R^d is unknown.

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 51/95

slide-112
SLIDE 112

Mathematical Tools

The Linear Bandit Problem

Limitations of MAB:

◮ Arms are independent ◮ Each single arm has to be tested at least once ◮ Regret scales linearly with K

Linear bandit approach:

◮ Embed arms in R^d (each arm a is mapped to a feature vector φa ∈ R^d)

◮ The reward varies linearly with the arm

  E[r(a)] = φa⊤ θ∗

where θ∗ ∈ R^d is unknown.

Remark: if d = A and φa = ea, then it coincides with MAB

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 51/95

slide-113
SLIDE 113

Mathematical Tools

The Linear Bandit Problem

The problem: at each time t = 1, . . . , n

◮ The learner chooses an arm at and receives a reward r(at)

The optimal arm: a∗ = arg max_{a∈A} E[r(a)] = arg max_{a∈A} φa⊤ θ∗

The regret:

  Rn = E[ Σ_{t=1..n} rt(a∗) ] − E[ Σ_{t=1..n} rt(at) ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 52/95

slide-114
SLIDE 114

Mathematical Tools

The Linear Bandit Problem

The MAB approach: the value of an arm is estimated by µ̂i,t

Exploiting the linear assumption:

◮ Estimate θ∗ using regularized least squares

  θ̂n = arg min_θ Σ_{t=1..n} ( φat⊤ θ − rt(at) )² + λ ‖θ‖²₂

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 53/95

slide-115
SLIDE 115

Mathematical Tools

The Linear Bandit Problem

The MAB approach: the value of an arm is estimated by µ̂i,t

Exploiting the linear assumption:

◮ Estimate θ∗ using regularized least squares

  θ̂n = arg min_θ Σ_{t=1..n} ( φat⊤ θ − rt(at) )² + λ ‖θ‖²₂

◮ Closed-form solution

  An = Σ_{t=1..n} φat φat⊤ + λ I,    bn = Σ_{t=1..n} φat rt(at)    ⇒    θ̂n = An⁻¹ bn

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 53/95

slide-116
SLIDE 116

Mathematical Tools

The Linear Bandit Problem

The MAB approach: the value of an arm is estimated by µ̂i,t

Exploiting the linear assumption:

◮ Estimate θ∗ using regularized least squares

  θ̂n = arg min_θ Σ_{t=1..n} ( φat⊤ θ − rt(at) )² + λ ‖θ‖²₂

◮ Closed-form solution

  An = Σ_{t=1..n} φat φat⊤ + λ I,    bn = Σ_{t=1..n} φat rt(at)    ⇒    θ̂n = An⁻¹ bn

◮ Estimate of the value of arm a

  r̂n(a) = φa⊤ θ̂n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 53/95

slide-117
SLIDE 117

Mathematical Tools

The Linear Bandit Problem

The MAB approach: construct confidence intervals √( log(1/δ) / Ti,n )

Exploiting the linear assumption:

◮ The estimate r̂n(a) of an arm may be accurate when “similar” arms have been selected (even if Tn(a) = 0!)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 54/95

slide-118
SLIDE 118

Mathematical Tools

The Linear Bandit Problem

The MAB approach: construct confidence intervals √( log(1/δ) / Ti,n )

Exploiting the linear assumption:

◮ The estimate r̂n(a) of an arm may be accurate when “similar” arms have been selected (even if Tn(a) = 0!)

◮ Confidence intervals

  | r(a) − r̂n(a) | ≤ αn √( φa⊤ An⁻¹ φa )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 54/95

slide-119
SLIDE 119

Mathematical Tools

The Linear Bandit Problem

The MAB approach: construct confidence intervals √( log(1/δ) / Ti,n )

Exploiting the linear assumption:

◮ The estimate r̂n(a) of an arm may be accurate when “similar” arms have been selected (even if Tn(a) = 0!)

◮ Confidence intervals

  | r(a) − r̂n(a) | ≤ αn √( φa⊤ An⁻¹ φa )

◮ Tuning of the confidence interval

  αn = B √( d log( (1 + nL/λ) / δ ) ) + λ^{1/2} ‖θ∗‖₂

Remark: the confidence interval reduces to the MAB one when all arms are orthogonal
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 54/95

slide-120
SLIDE 120

Mathematical Tools

The Linear Bandit Problem

The MAB approach – UCB: pull arm It = arg max_i µ̂i,t + √( log(1/δ) / Ti,t )

Exploiting the linear assumption:

◮ At each time step t, select arm

  at = arg max_{a∈A} φa⊤ θ̂t + αt √( φa⊤ At⁻¹ φa )
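A minimal sketch of this optimistic selection rule with a shared ridge-regression estimate (the arm-feature matrix `Phi`, the schedule for `alpha`, and the rank-one inverse update are assumptions of the sketch, not prescribed by the slides):

```python
import numpy as np

def linucb_select(Phi, A_inv, b, alpha):
    """Phi: (K, d) arm features; A_inv: inverse of the regularized design matrix; b: sum of phi * reward."""
    theta_hat = A_inv @ b
    widths = np.sqrt(np.einsum('kd,de,ke->k', Phi, A_inv, Phi))   # sqrt(phi^T A^-1 phi) per arm
    return int(np.argmax(Phi @ theta_hat + alpha * widths))

def linucb_update(A_inv, b, phi, r):
    # Sherman-Morrison rank-one update of A^-1 after observing (phi, r)
    Av = A_inv @ phi
    A_inv = A_inv - np.outer(Av, Av) / (1.0 + phi @ Av)
    return A_inv, b + r * phi
```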

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 55/95

slide-121
SLIDE 121

Mathematical Tools

The Linear Bandit Problem

The MAB approach – UCB: regret O( K log(n)/∆ ) or O( √( K n log(K) ) )

Exploiting the linear assumption:

◮ Regret bound

  Rn = O( d log(n) √n )
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 56/95

slide-122
SLIDE 122

Mathematical Tools

The Linear Bandit Problem

The MAB approach – TS:

◮ Compute a posterior over µi
◮ Draw a sample µ̃i from the posterior
◮ Select arm It = arg max_i µ̃i

Exploiting the linear assumption:

◮ Regret bound

  Rn = O( d log(n) √n )
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 57/95

slide-123
SLIDE 123

Mathematical Tools

The Contextual Linear Bandit Problem

Limitations of MAB:

◮ The value of an arm is fixed ◮ No side-information / context is used

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 58/95

slide-124
SLIDE 124

Mathematical Tools

The Contextual Linear Bandit Problem

Limitations of MAB:

◮ The value of an arm is fixed ◮ No side-information / context is used

Contextual linear bandit approach:

◮ Finite arms
◮ Define a context x ∈ X
◮ The reward varies linearly with the context

  E[r(x, a)] = φx⊤ θ∗a

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 58/95

slide-125
SLIDE 125

Mathematical Tools

The Contextual Linear Bandit Problem

Limitations of MAB:

◮ The value of an arm is fixed ◮ No side-information / context is used

Contextual linear bandit approach:

◮ Finite arms
◮ Define a context x ∈ X
◮ The reward varies linearly with the context

  E[r(x, a)] = φx⊤ θ∗a

Extensions:

◮ Embed arms in R^d and E[r(x, a)] = φx,a⊤ θ∗a
◮ Let the arm set change over time: At

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 58/95

slide-126
SLIDE 126

Mathematical Tools

The Contextual Linear Bandit Problem

The problem: at each time t = 1, . . . , n

◮ User xt arrives and a set of news At is provided
◮ The user xt together with a news item a ∈ At is described by a feature vector φxt,a
◮ The learner chooses a news item at ∈ At and receives a reward rt(xt, at)

The optimal news: at each time t = 1, . . . , n, the optimal news item is

  a∗t = arg max_{a∈At} E[ rt(xt, a) ]

The regret:

  Rn = E[ Σ_{t=1..n} rt(xt, a∗t) ] − E[ Σ_{t=1..n} rt(xt, at) ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 59/95

slide-127
SLIDE 127

Mathematical Tools

The Contextual Linear Bandit Problem

The linear regression estimate:

◮ Ta = {t : at = a}
◮ Construct the design matrix Da ∈ R^{|Ta|×d} of all the contexts observed when action a has been taken
◮ Construct the reward vector ca ∈ R^{|Ta|} of all the rewards observed when action a has been taken
◮ Estimate θa as

  θ̂a = (Da⊤ Da + I)⁻¹ Da⊤ ca

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 60/95

slide-128
SLIDE 128

Mathematical Tools

The Contextual Linear Bandit Problem

Optimism in face of uncertainty: the LinUCB algorithm

◮ Chernoff-Hoeffding in this case becomes

  | φx,a⊤ θ̂a − r(x, a) | ≤ α √( φx,a⊤ (Da⊤ Da + I)⁻¹ φx,a )

◮ and the UCB strategy is

  at = arg max_{a∈At} φx,a⊤ θ̂a + α √( φx,a⊤ (Da⊤ Da + I)⁻¹ φx,a )
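A minimal sketch of this per-arm ("disjoint") LinUCB rule, keeping one ridge-regression estimate θ̂a per action; the class layout, dimensions, and the value of α are assumptions of the sketch:

```python
import numpy as np

class DisjointLinUCB:
    def __init__(self, n_actions, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_actions)]     # D_a^T D_a + I
        self.b = [np.zeros(d) for _ in range(n_actions)]   # D_a^T c_a

    def select(self, phi, available):
        """phi[a]: feature vector of (context, action a); available: candidate actions."""
        best, best_score = None, -np.inf
        for a in available:
            A_inv = np.linalg.inv(self.A[a])
            theta_a = A_inv @ self.b[a]
            score = phi[a] @ theta_a + self.alpha * np.sqrt(phi[a] @ A_inv @ phi[a])
            if score > best_score:
                best, best_score = a, score
        return best

    def update(self, a, phi_a, reward):
        self.A[a] += np.outer(phi_a, phi_a)
        self.b[a] += reward * phi_a
```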

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 61/95

slide-129
SLIDE 129

Mathematical Tools

The Contextual Linear Bandit Problem

The evaluation problem

◮ Online evaluation: too expensive ◮ Offline evaluation: how to use the logged data?

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 62/95

slide-130
SLIDE 130

Mathematical Tools

The Contextual Linear Bandit Problem

Evaluation from logged data

◮ Assumption 1: contexts and rewards are i.i.d. from a stationary distribution, (x1, . . . , xK, r1, . . . , rK) ∼ D

◮ Assumption 2: the logging strategy is random

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 63/95

slide-131
SLIDE 131

Mathematical Tools

The Contextual Linear Bandit Problem

Evaluation from logged data: given a bandit strategy π, a desired number of samples T, and an (infinite) stream of logged data

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 64/95

slide-132
SLIDE 132

Mathematical Tools

The Exploration-Exploitation Dilemma

Tools Stochastic Multi-Armed Bandit Contextual Linear Bandit Other Multi-Armed Bandit Problems

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 65/95

slide-133
SLIDE 133

Mathematical Tools

The Exploration-Exploitation Dilemma

Tools Stochastic Multi-Armed Bandit Contextual Linear Bandit Other Multi-Armed Bandit Problems

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 66/95

slide-134
SLIDE 134

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

Motivating Examples

◮ Find the best shortest path in a limited number of days ◮ Maximize the confidence about the best treatment after a

finite number of patients

◮ Discover the best advertisements after a training phase ◮ ...

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 67/95

slide-135
SLIDE 135

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

Objective: given a fixed budget n, return the best arm i∗ = arg maxi µi at the end of the experiment

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 68/95

slide-136
SLIDE 136

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

Objective: given a fixed budget n, return the best arm i∗ = arg max_i µi at the end of the experiment

Measure of performance: the probability of error

  P[ Jn ≠ i∗ ] ≤ Σ_{i=1..N} exp( − Ti,n ∆i² )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 68/95

slide-137
SLIDE 137

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

Objective: given a fixed budget n, return the best arm i∗ = arg max_i µi at the end of the experiment

Measure of performance: the probability of error

  P[ Jn ≠ i∗ ] ≤ Σ_{i=1..N} exp( − Ti,n ∆i² )

Algorithm idea: mimic the behavior of the optimal static allocation

  Ti,n = ( (1/∆i²) / Σ_{j=1..N} (1/∆j²) ) · n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 68/95

slide-138
SLIDE 138

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

◮ Divide the budget into N − 1 phases. Define ( loḡ(N) = 1/2 + Σ_{i=2..N} 1/i )

  nk = (1 / loḡ(N)) · (n − N) / (N + 1 − k)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 69/95

slide-139
SLIDE 139

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

◮ Divide the budget into N − 1 phases. Define ( loḡ(N) = 1/2 + Σ_{i=2..N} 1/i )

  nk = (1 / loḡ(N)) · (n − N) / (N + 1 − k)

◮ Set of active arms Ak at phase k (A1 = {1, . . . , N})

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 69/95

slide-140
SLIDE 140

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

◮ Divide the budget into N − 1 phases. Define ( loḡ(N) = 1/2 + Σ_{i=2..N} 1/i )

  nk = (1 / loḡ(N)) · (n − N) / (N + 1 − k)

◮ Set of active arms Ak at phase k (A1 = {1, . . . , N})
◮ For each phase k = 1, . . . , N − 1

  ◮ For each arm i ∈ Ak, pull arm i for nk − nk−1 rounds

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 69/95

slide-141
SLIDE 141

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

◮ Divide the budget into N − 1 phases. Define ( loḡ(N) = 1/2 + Σ_{i=2..N} 1/i )

  nk = (1 / loḡ(N)) · (n − N) / (N + 1 − k)

◮ Set of active arms Ak at phase k (A1 = {1, . . . , N})
◮ For each phase k = 1, . . . , N − 1

  ◮ For each arm i ∈ Ak, pull arm i for nk − nk−1 rounds
  ◮ Remove the empirically worst arm: Ak+1 = Ak \ { arg min_{i∈Ak} µ̂i,nk }

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 69/95

slide-142
SLIDE 142

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

◮ Divide the budget into N − 1 phases. Define ( loḡ(N) = 1/2 + Σ_{i=2..N} 1/i )

  nk = (1 / loḡ(N)) · (n − N) / (N + 1 − k)

◮ Set of active arms Ak at phase k (A1 = {1, . . . , N})
◮ For each phase k = 1, . . . , N − 1

  ◮ For each arm i ∈ Ak, pull arm i for nk − nk−1 rounds
  ◮ Remove the empirically worst arm: Ak+1 = Ak \ { arg min_{i∈Ak} µ̂i,nk }

◮ Return the only remaining arm, Jn = AN
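A minimal sketch of Successive Rejects with the phase lengths defined above (the Bernoulli reward model inside the loop and the ceiling on nk are assumptions of the sketch):

```python
import numpy as np

def successive_rejects(means, n):
    N = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, N + 1))
    nk = lambda k: int(np.ceil((n - N) / (log_bar * (N + 1 - k))))
    active = list(range(N))
    pulls, sums = np.zeros(N), np.zeros(N)
    prev = 0
    for k in range(1, N):
        for i in active:
            for _ in range(nk(k) - prev):            # pull each active arm n_k - n_{k-1} times
                x = float(np.random.rand() < means[i])
                pulls[i] += 1; sums[i] += x
        prev = nk(k)
        worst = min(active, key=lambda i: sums[i] / pulls[i])
        active.remove(worst)                          # discard the empirically worst arm
    return active[0]                                  # J_n
```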

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 69/95

slide-143
SLIDE 143

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The Successive Reject Algorithm

Theorem

The Successive Rejects algorithm has a probability of error

  P[ Jn ≠ i∗ ] ≤ ( N(N − 1) / 2 ) exp( − (n − N) / ( loḡ(N) H2 ) )

with H2 = max_{i=1,...,N} i ∆_{(i)}⁻².

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 70/95

slide-144
SLIDE 144

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The UCB-E Algorithm

◮ Define an exploration parameter a
◮ Compute

  Bi,s = µ̂i,s + √( a / s )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 71/95

slide-145
SLIDE 145

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The UCB-E Algorithm

◮ Define an exploration parameter a
◮ Compute

  Bi,s = µ̂i,s + √( a / s )

◮ Select

  It = arg max_i Bi,s

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 71/95

slide-146
SLIDE 146

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The UCB-E Algorithm

◮ Define an exploration parameter a
◮ Compute

  Bi,s = µ̂i,s + √( a / s )

◮ Select

  It = arg max_i Bi,s

◮ At the end, return

  Jn = arg max_i µ̂i,Ti,n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 71/95

slide-147
SLIDE 147

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

The UCB-E Algorithm

Theorem

The UCB-E algorithm with a = (25/36) · (n − N) / H1 has a probability of error

  P[ Jn ≠ i∗ ] ≤ 2 n N exp( − 2a / 25 )

with H1 = Σ_{i=1..N} 1/∆i².

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 72/95

slide-148
SLIDE 148

Other Stochastic Multi-arm Bandit Problems

The Best Arm Identification Problem

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 73/95

slide-149
SLIDE 149

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Motivating Examples

◮ N production lines ◮ The test of the performance of a line is expensive ◮ We want an accurate estimation of the performance of each

production line

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 74/95

slide-150
SLIDE 150

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means µ̂i,n which is as accurate as possible for all the arms

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 75/95

slide-151
SLIDE 151

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means µ̂i,n which is as accurate as possible for all the arms

Notice: given an arm with mean µi and variance σi², if it is pulled Ti,n times, then

  Li,n = E[ ( µ̂i,Ti,n − µi )² ] = σi² / Ti,n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 75/95

slide-152
SLIDE 152

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means µ̂i,n which is as accurate as possible for all the arms

Notice: given an arm with mean µi and variance σi², if it is pulled Ti,n times, then

  Li,n = E[ ( µ̂i,Ti,n − µi )² ] = σi² / Ti,n

  Ln = max_i Li,n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 75/95

slide-153
SLIDE 153

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Problem: what is the allocation of pulls (T1,n, . . . , TN,n) (such that Σ_i Ti,n = n) which minimizes the loss?

  (T∗1,n, . . . , T∗N,n) = arg min_{(T1,n,...,TN,n)} Ln

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 76/95

slide-154
SLIDE 154

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Problem: what is the allocation of pulls (T1,n, . . . , TN,n) (such that Σ_i Ti,n = n) which minimizes the loss?

  (T∗1,n, . . . , T∗N,n) = arg min_{(T1,n,...,TN,n)} Ln

Answer:

  T∗i,n = ( σi² / Σ_{j=1..N} σj² ) · n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 76/95

slide-155
SLIDE 155

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Problem: what is the allocation of pulls (T1,n, . . . , TN,n) (such that Σ_i Ti,n = n) which minimizes the loss?

  (T∗1,n, . . . , T∗N,n) = arg min_{(T1,n,...,TN,n)} Ln

Answer:

  T∗i,n = ( σi² / Σ_{j=1..N} σj² ) · n

  L∗n = ( Σ_{i=1..N} σi² ) / n = Σ / n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 76/95

slide-156
SLIDE 156

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return the an estimate of the means ˆ µi,t which is as accurate as possible for all the arms

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 77/95

slide-157
SLIDE 157

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means µ̂i,n which is as accurate as possible for all the arms

Measure of performance: the regret on the quadratic error

  Rn(A) = max_i Li,n(A) − ( Σ_{i=1..N} σi² ) / n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 77/95

slide-158
SLIDE 158

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Objective: given a fixed budget n, return an estimate of the means µ̂i,n which is as accurate as possible for all the arms

Measure of performance: the regret on the quadratic error

  Rn(A) = max_i Li,n(A) − ( Σ_{i=1..N} σi² ) / n

Algorithm idea: mimic the behavior of the optimal allocation

  Ti,n = ( σi² / Σ_{j=1..N} σj² ) · n = λi n

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 77/95

slide-159
SLIDE 159

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

A UCB–based strategy. At each time step t = 1, . . . , n

◮ Estimate

  σ̂²i,Ti,t−1 = (1 / Ti,t−1) Σ_{s=1..Ti,t−1} X²s,i − µ̂²i,Ti,t−1

◮ Compute

  Bi,t = (1 / Ti,t−1) ( σ̂²i,Ti,t−1 + 5 √( log(1/δ) / (2 Ti,t−1) ) )

◮ Pull arm

  It = arg max_i Bi,t
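A minimal sketch of this variance-driven allocation (pulling each arm twice for initialization and representing arms as sampling callables are assumptions of the sketch):

```python
import numpy as np

def active_allocation(samplers, n, delta=0.1):
    """samplers: list of zero-argument callables returning one sample of each arm."""
    K = len(samplers)
    obs = [[samplers[i](), samplers[i]()] for i in range(K)]   # two pulls per arm to start
    for _ in range(2 * K, n):
        B = []
        for i in range(K):
            s = len(obs[i])
            var_hat = np.var(obs[i])                            # empirical variance sigma_hat^2
            B.append((var_hat + 5 * np.sqrt(np.log(1 / delta) / (2 * s))) / s)
        i = int(np.argmax(B))                                   # pull the arm with the largest score
        obs[i].append(samplers[i]())
    return [np.mean(o) for o in obs], [len(o) for o in obs]
```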

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 78/95

slide-160
SLIDE 160

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Theorem

The UCB–based algorithm achieves a regret

  Rn(A) ≤ 98 log(n) / ( n^{3/2} λ_min^{5/2} ) + O( log(n) / n² )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 79/95

slide-161
SLIDE 161

Other Stochastic Multi-arm Bandit Problems

The Active Bandit Problem

Theorem

The UCB–based algorithm achieves a regret

  Rn(A) ≤ 98 log(n) / ( n^{3/2} λ_min^{5/2} ) + O( log(n) / n² )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 79/95

slide-162
SLIDE 162

Other Stochastic Multi-arm Bandit Problems

The Exploration-Exploitation Dilemma

Tools Stochastic Multi-Armed Bandit Contextual Linear Bandit Other Multi-Armed Bandit Problems Bonus: Reinforcement Learning

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 80/95

slide-163
SLIDE 163

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

For i = 1, . . . , n

  • 1. Set t = 0
  • 2. Set initial state x0
  • 3. While (xt not terminal)

  3.1 Take action at according to a suitable exploration policy
  3.2 Observe next state xt+1 and reward rt
  3.3 Compute the temporal difference δt (e.g., Q-learning)
  3.4 Update the Q-function: Q(xt, at) = Q(xt, at) + α(xt, at) δt
  3.5 Set t = t + 1

EndWhile
EndFor

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 81/95

slide-164
SLIDE 164

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

The regret in MAB

  Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 82/95

slide-165
SLIDE 165

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

The regret in MAB

  Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

⇒ Rn(A) = max_π E[ Σ_{t=1..n} r(xt, π(xt)) ] − E[ Σ_{t=1..n} r(xt, at) ]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 82/95

slide-166
SLIDE 166

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

The regret in MAB

  Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

⇒ Rn(A) = max_π E[ Σ_{t=1..n} r(xt, π(xt)) ] − E[ Σ_{t=1..n} r(xt, at) ]

⇒ not correct: actions influence the state as well!
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 82/95

slide-167
SLIDE 167

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

The regret in MAB

  Rn(A) = max_{i=1,...,K} E[ Σ_{t=1..n} Xi,t ] − E[ Σ_{t=1..n} XIt,t ]

⇒ Rn(A) = max_π E[ Σ_{t=1..n} r(xt, π(xt)) ] − E[ Σ_{t=1..n} r(xt, at) ]

⇒ not correct: actions influence the state as well!

The regret in RL

  Rn(A) = max_π E[ Σ_{t=1..n} r(x∗t, π(x∗t)) ] − E[ Σ_{t=1..n} r(xt, at) ],    x∗t ∼ p( · | x∗t−1, π∗(x∗t−1) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 82/95

slide-168
SLIDE 168

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

Idea: can we adapt UCB (that already works in MAB, contextual bandit) here?

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 83/95

slide-169
SLIDE 169

Other Stochastic Multi-arm Bandit Problems

Learning the Optimal Policy

Idea: can we adapt UCB (that already works in MAB, contextual bandit) here? Yes!

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 83/95

slide-170
SLIDE 170

Other Stochastic Multi-arm Bandit Problems

Exploration-Exploitation in RL

◮ A policy π is defined as π : X → A

◮ The long-term average reward of a policy is

  ρπ(M) = lim_{n→∞} E[ (1/n) Σ_{t=1..n} rt ]

◮ Optimal policy

  π∗(M) = arg max_π ρπ(M)    ⇒    ρ∗(M) = ρ^{π∗(M)}(M)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 84/95

slide-171
SLIDE 171

Other Stochastic Multi-arm Bandit Problems

Exploration-Exploitation in RL

◮ A policy π is defined as π : X → A

◮ The long-term average reward of a policy is

  ρπ(M) = lim_{n→∞} E[ (1/n) Σ_{t=1..n} rt ]

◮ Optimal policy

  π∗(M) = arg max_π ρπ(M)    ⇒    ρ∗(M) = ρ^{π∗(M)}(M)

◮ Exploration-exploitation dilemma

  ◮ Explore the environment to estimate its parameters
  ◮ Exploit the estimates to collect reward

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 84/95

slide-172
SLIDE 172

Other Stochastic Multi-arm Bandit Problems

Exploration-Exploitation in RL

[Figure: learning curve of the per-step reward over steps, converging to ρ∗; the gap to ρ∗ accumulates into the regret]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 85/95

slide-173
SLIDE 173

Other Stochastic Multi-arm Bandit Problems

Exploration-Exploitation in RL

[Figure: learning curve of the per-step reward over steps, converging to ρ∗; the gap to ρ∗ accumulates into the regret]

Cumulative regret

  Rn = n ρ∗ − Σ_{t=1..n} rt

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 85/95

slide-174
SLIDE 174

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

[Figure: space of MDPs — estimated MDP M̂t, true MDP M∗, and optimistic MDP M̃t inside the high-confidence set Mt, with gains ρ∗(M̃t), ρ∗(M), ρ∗; optimism in face of uncertainty ⇒ play π∗(M̃t)]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 86/95

slide-175
SLIDE 175

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

[Figure: space of MDPs — estimated MDP M̂t, true MDP M∗, and optimistic MDP M̃t inside the high-confidence set Mt, with gains ρ∗(M̃t), ρ∗(M), ρ∗; optimism in face of uncertainty ⇒ play π∗(M̃t)]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 86/95

slide-176
SLIDE 176

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

[Figure: at a later time t′, the high-confidence set Mt′ has shrunk around the true MDP M∗; the optimistic MDP M̃t′ and its policy π∗(M̃t′) move closer to the true optimum ρ∗(M)]
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 86/95

slide-177
SLIDE 177

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

[Figure: after n steps, the high-confidence set Mn concentrates around the true MDP M∗, so ρ∗(M̃n) approaches ρ∗(M) and π∗(M̃n) approaches the optimal policy]

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 86/95

slide-178
SLIDE 178

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 87/95

slide-179
SLIDE 179

Other Stochastic Multi-arm Bandit Problems

The UCRL2 Algorithm

Initialize episode k

  • 1. Current time tk
  • 2. Let Nk(x, a) = |{τ < tk : xτ = x, aτ = a}|
  • 3. Let Rk(x, a) = Σ_{t<tk} rt I{xt = x, at = a}
  • 4. Let Pk(x, a, x′) = |{τ < tk : xτ = x, aτ = a, xτ+1 = x′}|
  • 5. Compute r̂k(x, a) = Rk(x, a) / Nk(x, a),    p̂k(x, a, x′) = Pk(x, a, x′) / Nk(x, a)

Compute optimistic policy

  • 1. Let Mk = { M̃ : | r̃(x, a) − r̂k(x, a) | ≤ Br(x, a);  ‖ p̃(·|x, a) − p̂k(·|x, a) ‖₁ ≤ Bp(x, a) }
  • 2. Compute π̃k = arg max_π max_{M̃∈Mk} ρ(π; M̃)

Execute π̃k until at least one state-action counter is doubled

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 88/95

slide-180
SLIDE 180

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

Set of plausible MDPs Mk = {M̃}: confidence intervals built using Chernoff bounds

  Br(x, a) ≈ √( log(XA/δ) / Nk(x, a) );    Bp(x, a) ≈ √( X log(XA/δ) / Nk(x, a) )

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 89/95

slide-181
SLIDE 181

Other Stochastic Multi-arm Bandit Problems

Upper-confidence Bound for RL (UCRL)

Set of plausible MDPs Mk = {M̃}: confidence intervals built using Chernoff bounds

  Br(x, a) ≈ √( log(XA/δ) / Nk(x, a) );    Bp(x, a) ≈ √( X log(XA/δ) / Nk(x, a) )

Computation of the optimistic optimal policy π̃k:

  π̃k = arg max_π max_{M̃∈Mk} ρπ(M̃)

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 89/95

slide-182
SLIDE 182

Other Stochastic Multi-arm Bandit Problems

The Extended Value Iteration Algorithm

Planning in average-reward MDPs

◮ The optimal Bellman equation: optimal gain ρ∗ and bias u∗

  u∗(x) + ρ∗ = max_a [ r(x, a) + Σ_{x′} p(x′|x, a) u∗(x′) ]

◮ Value iteration (given v0)

  vn(x) = max_a [ r(x, a) + Σ_{x′} p(x′|x, a) vn−1(x′) ]

  until span(vn − vn−1) ≤ ε

◮ Guarantees of the greedy policy

  πn(x) = arg max_a [ r(x, a) + Σ_{x′} p(x′|x, a) vn−1(x′) ]    ⇒    |gπn − g∗| ≤ ε
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 90/95

slide-183
SLIDE 183

Other Stochastic Multi-arm Bandit Problems

The Extended Value Iteration Algorithm

Planning in optimistic average-reward MDPs

◮ The optimal Bellman equation: optimal gain ρ̃ and bias ũ

  ũ(x) + ρ̃ = max_a max_{r̃(x,a)} max_{p̃(·|x,a)} [ r̃(x, a) + Σ_{x′} p̃(x′|x, a) ũ(x′) ]

◮ Value iteration (given v0)

  vn(x) = max_a max_{r̃(x,a)} max_{p̃(·|x,a)} [ r̃(x, a) + Σ_{x′} p̃(x′|x, a) vn−1(x′) ]

        = max_a max_{p̃(·|x,a)} [ r̃⁺(x, a) + Σ_{x′} p̃(x′|x, a) vn−1(x′) ]      ( r̃⁺ = r̂ + Br, the largest plausible reward )

        = max_a [ r̃⁺(x, a) + max_{p̃(·|x,a)} Σ_{x′} p̃(x′|x, a) vn−1(x′) ]      (simple LP)

◮ LP problem: assign the largest possible probability mass of p̃(·|x, a), within the L1 ball of radius Bp(x, a) around p̂(·|x, a), to the next states x′ with the highest vn−1(x′)
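A small sketch of that inner maximization (shift at most Bp/2 of probability mass onto the best next state and remove it from the worst ones); the implementation details are assumptions, not prescribed by the slides:

```python
import numpy as np

def optimistic_transition(p_hat, v, beta):
    """Return the optimistic p_tilde within L1 distance beta of p_hat, maximizing p_tilde . v."""
    p = p_hat.copy()
    best = int(np.argmax(v))
    p[best] = min(1.0, p_hat[best] + beta / 2.0)    # add mass to the best next state
    for s in np.argsort(v):                         # remove mass from the worst states first
        if s == best:
            continue
        excess = max(p.sum() - 1.0, 0.0)
        if excess <= 0:
            break
        p[s] = max(0.0, p[s] - excess)
    return p
```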

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 91/95

slide-184
SLIDE 184

Other Stochastic Multi-arm Bandit Problems

The Regret

Theorem

UCRL2 run over n steps in an MDP with diameter D, X states, and A actions suffers a regret

  Rn = O( D X √(A n) )

where the diameter is D = max_{x,x′} min_π E[ Tπ(x, x′) ].
  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 92/95

slide-185
SLIDE 185

Other Stochastic Multi-arm Bandit Problems

Posterior Sampling for Reinforcement Learning (PSRL)

Initialize episode k

  • 1. Current time tk
  • 2. Let Nk(x, a) = |{τ < tk : xτ = x, aτ = a}|
  • 3. Compute the posterior over r(x, a) and p(·|x, a)

Compute random policy

  • 1. Let M̃k = { r̃k, p̃k } with r̃k, p̃k sampled from their posteriors
  • 2. Compute the optimal policy π̃k = arg max_π ρπ(M̃k)

Execute π̃k until at least one state-action counter is doubled

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 93/95

slide-186
SLIDE 186

Other Stochastic Multi-arm Bandit Problems

Bibliography I

  • A. LAZARIC – Reinforcement Learning

Fall 2017 - 94/95

slide-187
SLIDE 187

Other Stochastic Multi-arm Bandit Problems

Reinforcement Learning

Alessandro Lazaric alessandro.lazaric@inria.fr sequel.lille.inria.fr