 
Introduction Bandits Syllabus Measures Experiments Conclusion
Weaknesses of Exp3: Shifting Rewards
◮ Exp3 closely matches the best single-arm strategy over the whole trajectory.
◮ For curriculum learning, a good strategy often changes:
  ◮ Easier cases in training data will provide high rewards during early training, but have diminishing returns.
  ◮ Over time, more difficult cases will provide higher rewards.
CSC2541 - Scalable and Flexible Models of Uncertainty 6/18
The Exp3.S Algorithm for Shifting Rewards
◮ Addresses issues of Exp3 by encouraging exploration with probability ε and by mixing weights additively:
  $$\pi_t^{\text{Exp3.S}}(i) = (1-\epsilon)\,\frac{e^{w_{t,i}}}{\sum_{j=1}^{N} e^{w_{t,j}}} + \frac{\epsilon}{N}$$
  $$w_{t,i} = \log\left[(1-\alpha_t)\exp\left(w_{t-1,i} + \eta\,\tilde{r}_{t-1,i}\right) + \frac{\alpha_t}{N-1}\sum_{j \neq i}\exp\left(w_{t-1,j} + \eta\,\tilde{r}_{t-1,j}\right)\right]$$
◮ This effectively decays the importance of old rewards and allows the model to react faster to changing scenarios.
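The policy and weight update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the slide does not define the reward estimate r̃, so the sketch assumes the usual importance-weighted estimate (observed reward divided by the probability of the played arm, zero for all other arms), and it treats α and η as plain arguments. The function names are hypothetical.

```python
import numpy as np

def exp3s_policy(w, eps):
    """pi(i) = (1 - eps) * softmax(w)_i + eps / N."""
    z = np.exp(w - w.max())              # subtract the max for numerical stability
    return (1.0 - eps) * z / z.sum() + eps / len(w)

def exp3s_update(w, arm, reward, pi, eta, alpha):
    """One Exp3.S weight update after playing `arm` and observing `reward`."""
    n = len(w)
    r_tilde = np.zeros(n)
    r_tilde[arm] = reward / pi[arm]      # assumed importance-weighted reward estimate
    v = w + eta * r_tilde
    m = v.max()
    e = np.exp(v - m)                    # exp(v_i), rescaled by exp(-m)
    others = e.sum() - e                 # sum over j != i, for every i at once
    return m + np.log((1.0 - alpha) * e + alpha / (n - 1) * others)
```

With `alpha = 0` the update reduces to adding `eta * r_tilde` to the weights; a positive `alpha` leaks weight from the other arms into each arm, which is what lets the policy recover quickly when rewards shift.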
Learning a Syllabus over Tasks
◮ Given: separate tasks with unknown difficulties
◮ We want to maximize the rate of learning:
  1. At each timestep t, we sample a task index k from $\pi_t$.
  2. We then sample a data batch from this task: $\{x^k_{[0..B]}, \hat{y}^k_{[0..B]}\}$
  3. A measure of learning progress ν and the effort τ (computation time, input size, etc.) are calculated.
  4. The rate of learning is $r_t = \nu / \tau$ and is re-scaled to $[-1, 1]$.
  5. Parameters w of the policy π are updated using Exp3.S.
Learning Progress Measures
◮ It is computationally expensive (or intractable) to measure the global impact of training on a particular sample.
◮ We desire proxies for progress that depend only on the current sample or a single extra sample.
◮ The paper proposes two types of progress measures:
  ◮ Loss-driven: compares predictions before/after training.
  ◮ Complexity-driven: information-theoretic view of learning.
Prediction Gain
◮ Prediction Gain is the change in sample loss before and after training on a sample batch x, where $\theta_x$ denotes the parameters after training on x:
  $$\nu_{PG} = L(x, \theta) - L(x, \theta_x)$$
◮ Moreover, when training using gradient descent:
  $$\Delta\theta \propto -\nabla L(x, \theta)$$
◮ Hence, we have a Gradient Prediction Gain approximation:
  $$\nu_{GPG} = L(x, \theta) - L(x, \theta_x) \approx -\nabla L(x, \theta) \cdot \Delta\theta \propto \|\nabla L(x, \theta)\|^2$$
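To make the first-order approximation concrete, the sketch below computes the Prediction Gain and its gradient-based estimate for a single gradient step. The linear model, the halved mean squared error loss, and the learning rate `lr` are all assumptions chosen for illustration; the slides do not fix a model.

```python
import numpy as np

def loss(theta, X, y):
    """Halved mean squared error of a linear model on the batch (X, y)."""
    resid = X @ theta - y
    return 0.5 * np.mean(resid ** 2)

def grad(theta, X, y):
    """Gradient of `loss` with respect to theta."""
    return X.T @ (X @ theta - y) / len(y)

def prediction_gain(theta, X, y, lr):
    """Return (PG, first-order GPG estimate) for one gradient step on (X, y)."""
    g = grad(theta, X, y)
    theta_x = theta - lr * g                      # parameters after training on x
    pg = loss(theta, X, y) - loss(theta_x, X, y)  # exact change in sample loss
    gpg = lr * float(g @ g)                       # -grad . delta_theta = lr * ||grad||^2
    return pg, gpg

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0])
pg, gpg = prediction_gain(np.zeros(4), X, y, lr=1e-3)
```

For a small step the two values agree closely; the gap is the second-order term that the approximation drops.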
Bias-Variance Trade-Off
◮ Prediction Gain is a biased estimate of the expected change in loss due to training on a sample x:
  $$\mathbb{E}_{x' \sim \text{Task}_k}\left[L(x', \theta) - L(x', \theta_x)\right]$$
◮ In particular, it favors tasks that have high variance.
◮ This is because the loss on the sample itself decreases after training, even though the loss for other samples from the task could increase.
◮ An unbiased estimate is the Self Prediction Gain:
  $$\nu_{SPG} = L(x', \theta) - L(x', \theta_x), \quad x' \sim D_k$$
◮ $\nu_{SPG}$ naturally has higher variance due to the sampling of $x'$.
Shifting Gears: Complexity in Stochastic VI
◮ Consider the objective in stochastic variational inference, where $P_\phi$ is a variational posterior over parameters θ and $Q_\psi$ is a prior over θ:
  $$L_{VI} = \underbrace{\mathrm{KLD}(P_\phi \,\|\, Q_\psi)}_{\text{Model Complexity}} + \underbrace{\sum_{x' \in D} \mathbb{E}_{\theta \sim P_\phi}\left[L(x', \theta)\right]}_{\text{Data Compression under } P_\phi}$$
◮ Training trades off a better ability to compress data against higher model complexity. We expect that complexity increases the most from highly generalizable data points.
Variational Complexity Gain
◮ The Variational Complexity Gain after training on a sample batch x is the change in KL divergence:
  $$\nu_{VCG} = \mathrm{KLD}(P_{\phi_x} \,\|\, Q_{\psi_x}) - \mathrm{KLD}(P_\phi \,\|\, Q_\psi)$$
◮ We can design P and Q to have a closed-form KLD. Example: both diagonal Gaussian.
◮ In non-variational settings, when using L2 regularization (Gaussian prior on weights), we can define the L2 Gain:
  $$\nu_{L2G} = \|\theta_x\|^2 - \|\theta\|^2$$
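For the diagonal Gaussian case mentioned above, the KL divergence has a well-known closed form, so both gains reduce to a few array operations. The following is a minimal sketch; the function names and the `(mu, sigma)` tuple convention are assumptions made for illustration.

```python
import numpy as np

def kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) for diagonal Gaussians P = N(mu_p, sigma_p^2), Q = N(mu_q, sigma_q^2)."""
    mu_p, sigma_p, mu_q, sigma_q = map(np.asarray, (mu_p, sigma_p, mu_q, sigma_q))
    return float(np.sum(
        np.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    ))

def variational_complexity_gain(p_old, q_old, p_new, q_new):
    """nu_VCG = KL(P_new || Q_new) - KL(P_old || Q_old); each argument is (mu, sigma)."""
    return kl_diag_gaussians(*p_new, *q_new) - kl_diag_gaussians(*p_old, *q_old)

def l2_gain(theta_old, theta_new):
    """nu_L2G = ||theta_new||^2 - ||theta_old||^2."""
    return float(np.sum(np.square(theta_new)) - np.sum(np.square(theta_old)))
```

A positive gain means the posterior moved further from the prior (or the weights grew) as a result of training on the batch.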
Gradient Variational Complexity Gain
◮ The Gradient Variational Complexity Gain is the directional derivative of the KLD along the gradient descent direction of the data loss:
  $$\nu_{GVCG} \propto \nabla_\phi \mathrm{KLD}(P_\phi \,\|\, Q_\psi) \cdot \nabla_\phi \mathbb{E}_{\theta \sim P_\phi}\left[L(x, \theta)\right]$$
◮ Other loss terms are not dependent on x.
◮ This gain worked well experimentally, perhaps because the curvature of model complexity is typically flatter than that of the loss.

Example Experiment: Generated Text
◮ 11 datasets were generated using increasingly complex language models. Policies gravitated towards complexity.
Figure credit: Automated Curriculum Learning for Neural Networks

Experimental Highlights
◮ Uniformly sampling across tasks, while inefficient, was a very strong benchmark. Perhaps learning is dominated by gradients from tasks that drive progress.
◮ For variational loss, GVCG yielded higher complexity and faster training than uniform sampling in one experiment.
◮ Strategies observed: a policy would focus on a task until completion. Loss would reduce on unseen (related) tasks!

Summary of Ideas
◮ Discussed several progress measures that can be evaluated using training samples or one extra sample.
◮ By evaluating progress from each training example, a multi-armed bandit determines a stochastic policy over which task to train from next, to maximize progress.
◮ The bandit needs to be non-stationary. Simpler tasks dominate early on (especially for Prediction Gain), while difficult tasks contain most of the complexity.

Takeaways
◮ Better learning efficiency can be achieved with the right measure of progress, but this involves experimentation.
◮ Final overall loss was better in one out of six experiments. A research direction is to find better local minima.
◮ Most promising: Prediction Gain for MLE problems, and Gradient Variational Complexity Gain for VI.
◮ Variational Complexity Gain was noisy and performed worse than its gradient analogue. Determining why is an open question. It could be due to terms independent of x.
Finite-time Analysis of the Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, Paul Fischer
Presented by Eric Langlois
October 20, 2017

Exploration vs. Exploitation
◮ In reinforcement learning, an agent must maximize long-term reward.
◮ It needs to balance exploiting what it already knows against exploring to discover better strategies.

Multi-Armed Bandit
◮ K slot machines, each with a static reward distribution $p_i$.
◮ A policy selects which machine to play given the history.
◮ The n-th play of machine $i \in \{1, \dots, K\}$ is a random variable $X_{i,n}$ with mean $\mu_i$.
◮ Goal: maximize total reward.
Regret
How do we measure the quality of a policy?
◮ $T_i(n)$: number of times machine i is played in the first n plays.
◮ Regret: expected under-performance compared to optimal play. The regret after n steps is
  $$\text{Regret} = \mathbb{E}\left[\sum_{i=1}^{K} T_i(n)\,\Delta_i\right], \qquad \Delta_i = \mu^* - \mu_i, \qquad \mu^* = \max_{1 \le i \le K} \mu_i$$
◮ Uniform random policy: linear regret
◮ ε-greedy policy (with constant ε): linear regret

Asymptotically Optimal Regret
◮ Lai and Robbins (1985) proved there exist policies with
  $$\mathbb{E}[T_i(n)] \le \left(\frac{1}{D(p_i \,\|\, p^*)} + o(1)\right)\ln n$$
  where $p_i$ is the probability distribution of machine i and D is the KL divergence.
◮ This asymptotically achieves logarithmic regret.
◮ They also proved that logarithmic regret is optimal.
◮ Agrawal (1995): asymptotically optimal policies in terms of the sample mean instead of the KL divergence.

Upper Confidence Bound Algorithms
[Figure: a reward distribution on [0, 1] with its mean and upper confidence bound marked]
◮ Core idea: optimism in the face of uncertainty.
◮ Select the arm with the highest upper confidence bound.
◮ Assumption: the distribution has support in $[0, 1]$.
UCB1
Initialization: Play each machine once.
Loop: Play the machine i maximizing
$$\bar{x}_i + \sqrt{\frac{2\ln n}{n_i}}$$
◮ $\bar{x}_i$: mean observed reward from machine i.
◮ $n_i$: number of times machine i has been played so far.
◮ $n$: total number of plays done so far.
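The two phases above translate almost directly into code. The sketch below is one possible rendering in plain Python; the `pull` callback (returning a reward in [0, 1] for a given arm) and the function name are assumptions introduced for the example.

```python
import math

def ucb1(pull, n_arms, n_plays):
    """Run UCB1 for n_plays; pull(i) returns a reward in [0, 1]. Returns play counts."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(n_plays):             # t = total number of plays done so far
        if t < n_arms:
            arm = t                      # initialization: play each machine once
        else:
            arm = max(
                range(n_arms),
                key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # running mean update
    return counts
```

A typical use passes a closure over a seeded random generator, for example `ucb1(lambda i: float(rng.random() < probs[i]), len(probs), 1000)` with `rng = random.Random(0)`.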
UCB1 Demo
[Animation frames omitted. Selection counts for the three machines (panels top to bottom) after n plays:]
n       Panel 1   Panel 2   Panel 3
3       1         1         1
4       1         2         1
5       1         2         2
6       2         2         2
7       2         3         2
50      7         18        25
100     11        34        55
1000    32        261       707
10000   57        931       9012
UCB1: Regret Bound (Theorem 1)
For all K > 1, if policy UCB1 is run on K machines having arbitrary reward distributions $P_1, \dots, P_K$ with support in $[0, 1]$, then its expected regret after any number n of plays is at most
$$\left[8\sum_{i:\,\mu_i < \mu^*} \frac{\ln n}{\Delta_i}\right] + \left(1 + \frac{\pi^2}{3}\right)\left(\sum_{i=1}^{K} \Delta_i\right)$$
UCB1: Definitions for Proof of Bound
◮ $I_t$: random variable equal to the machine played at time t.
◮ $\bar{X}_{i,n}$: random variable for the observed mean reward from n plays of machine i:
  $$\bar{X}_{i,n} = \frac{1}{n}\sum_{t=1}^{n} X_{i,t}$$
◮ An asterisk superscript refers to the (first) optimal machine, e.g. $T^*(n)$ and $\bar{X}^*_n$.
◮ Braces denote the indicator function of their contents.
◮ The number of plays of machine i up to time n under UCB1 is therefore
  $$T_i(n) = 1 + \sum_{t=K+1}^{n} \{I_t = i\}$$

UCB1: Proof of Regret Bound
$$T_i(n) = 1 + \sum_{t=K+1}^{n} \{I_t = i\} \le \ell + \sum_{t=K+1}^{n} \{I_t = i,\ T_i(t-1) \ge \ell\}$$
◮ Strategy: for every sub-optimal arm i, establish a bound on the total number of plays as a function of n.
◮ Assume we have seen ℓ plays of machine i so far and consider the number of remaining plays.

UCB1: Proof of Regret Bound
$$T_i(n) \le \ell + \sum_{t=K+1}^{n} \{I_t = i,\ T_i(t-1) \ge \ell\} \le \ell + \sum_{t=K+1}^{n} \left\{\bar{X}^*_{T^*(t-1)} + c_{t-1,T^*(t-1)} \le \bar{X}_{i,T_i(t-1)} + c_{t-1,T_i(t-1)},\ T_i(t-1) \ge \ell\right\}$$
◮ Let $c_{t,s} = \sqrt{2\ln t / s}$ be the UCB offset term.
◮ Machine i is selected if its UCB, $\bar{X}_{i,T_i(t-1)} + c_{t-1,T_i(t-1)}$, is the largest of all machines.
◮ In particular, it must be at least as large as the UCB of the optimal machine.

UCB1: Proof of Regret Bound
$$T_i(n) \le \ell + \sum_{t=K+1}^{n} \left\{\bar{X}^*_{T^*(t-1)} + c_{t-1,T^*(t-1)} \le \bar{X}_{i,T_i(t-1)} + c_{t-1,T_i(t-1)},\ T_i(t-1) \ge \ell\right\} \le \ell + \sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{s_i=\ell}^{t-1} \left\{\bar{X}^*_s + c_{t,s} \le \bar{X}_{i,s_i} + c_{t,s_i}\right\}$$
◮ We do not care about the particular number of times machine i and machine ∗ have been seen.
◮ The indicator is upper bounded by summing over all possible assignments of $T^*(t-1) = s$ and $T_i(t-1) = s_i$.
◮ Relax the bounds on t as well.

UCB1: Proof of Regret Bound
$$T_i(n) \le \ell + \sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{s_i=\ell}^{t-1} \left\{\bar{X}^*_s + c_{t,s} \le \bar{X}_{i,s_i} + c_{t,s_i}\right\}$$
The event $\bar{X}^*_s + c_{t,s} \le \bar{X}_{i,s_i} + c_{t,s_i}$ implies at least one of the following:
$$\bar{X}^*_s \le \mu^* - c_{t,s} \quad (1)$$
$$\bar{X}_{i,s_i} \ge \mu_i + c_{t,s_i} \quad (2)$$
$$\mu^* < \mu_i + 2c_{t,s_i} \quad (3)$$
Chernoff-Hoeffding Bound
Let $Z_1, \dots, Z_n$ be i.i.d. random variables with mean μ and domain $[0, 1]$. Let $\bar{Z}_n = \frac{1}{n}(Z_1 + \cdots + Z_n)$. Then for all $a \ge 0$,
$$P\left(\bar{Z}_n \ge \mu + a\right) \le e^{-2na^2}, \qquad P\left(\bar{Z}_n \le \mu - a\right) \le e^{-2na^2}$$
Applied to inequalities (1) and (2), these give the bounds
$$P\left(\bar{X}^*_s \le \mu^* - c_{t,s}\right) \le \exp\left(-2s \cdot \frac{2\ln t}{s}\right) = t^{-4}$$
$$P\left(\bar{X}_{i,s_i} \ge \mu_i + c_{t,s_i}\right) \le t^{-4}$$
UCB1: Proof of Regret Bound
The final inequality, $\mu^* < \mu_i + 2c_{t,s_i}$, is based on the width of the confidence interval. For $t \le n$, it is false when $s_i$ is large enough. If (3) holds, then
$$\Delta_i = \mu^* - \mu_i < 2\sqrt{\frac{2\ln t}{s_i}} \;\Rightarrow\; \frac{\Delta_i^2}{4} < \frac{2\ln t}{s_i} \;\Rightarrow\; s_i < \frac{8\ln t}{\Delta_i^2} \le \frac{8\ln n}{\Delta_i^2}$$
◮ In the regret bound summation $s_i \ge \ell$, so we set $\ell = \lceil 8\ln n / \Delta_i^2 \rceil \le 8\ln n / \Delta_i^2 + 1$.
◮ Inequality (3) then contributes nothing to the bound.

UCB1: Proof of Regret Bound
With $\ell = \lceil 8\ln n / \Delta_i^2 \rceil$ we have the bound on $\mathbb{E}[T_i(n)]$:
$$\mathbb{E}[T_i(n)] \le \ell + \sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{s_i=\ell}^{t-1} \left(P\left(\bar{X}^*_s \le \mu^* - c_{t,s}\right) + P\left(\bar{X}_{i,s_i} \ge \mu_i + c_{t,s_i}\right)\right) \le \ell + \sum_{t=1}^{\infty}\sum_{s=1}^{t}\sum_{s_i=1}^{t} 2t^{-4} \le \frac{8\ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$$
Substituted into the regret formula, this gives our bound.
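The constant $\pi^2/3$ in the last step comes from evaluating the triple sum, and the theorem follows by weighting each arm's play count by its gap. Spelled out, using the standard identity $\sum_{t \ge 1} t^{-2} = \pi^2/6$:

```latex
\sum_{t=1}^{\infty}\sum_{s=1}^{t}\sum_{s_i=1}^{t} 2t^{-4}
  = \sum_{t=1}^{\infty} t \cdot t \cdot 2t^{-4}
  = 2\sum_{t=1}^{\infty} t^{-2}
  = 2 \cdot \frac{\pi^2}{6}
  = \frac{\pi^2}{3}

\text{Regret}
  = \sum_{i:\,\mu_i < \mu^*} \Delta_i \, \mathbb{E}[T_i(n)]
  \le \sum_{i:\,\mu_i < \mu^*} \Delta_i \left(\frac{8\ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3}\right)
  \le 8\sum_{i:\,\mu_i < \mu^*} \frac{\ln n}{\Delta_i}
      + \left(1 + \frac{\pi^2}{3}\right)\sum_{i=1}^{K} \Delta_i
```

The final inequality only extends the second sum to all K machines, which is harmless because optimal machines have $\Delta_i = 0$.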
UCB1-Tuned
◮ UCB1: $\mathbb{E}[T_i(n)] \le \frac{8\ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$
◮ The constant factor $8/\Delta_i^2$ is sub-optimal. Optimal: $1/(2\Delta_i^2)$.
◮ In practice the performance of UCB1 can be improved further by using the confidence bound
  $$\bar{X}_{i,n_i} + \sqrt{\frac{\ln n}{n_i}\min\left\{\frac{1}{4},\ V_i(n_i)\right\}}$$
  where
  $$V_i(s) = \left(\frac{1}{s}\sum_{\tau=1}^{s} X_{i,\tau}^2\right) - \bar{X}_{i,s}^2 + \sqrt{\frac{2\ln t}{s}}$$
◮ No proof of regret bound.
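As a rough sketch of how the tuned index could be computed for a single arm, the function below takes that arm's observed rewards and the total number of plays n. It assumes the variance bound is evaluated at decision time, so the t inside $V_i$ is taken equal to n; the function name is hypothetical.

```python
import math

def ucb1_tuned_index(rewards, n):
    """UCB1-Tuned index for one arm, given its observed rewards and total plays n."""
    s = len(rewards)
    mean = sum(rewards) / s
    variance = max(0.0, sum(r * r for r in rewards) / s - mean ** 2)
    v = variance + math.sqrt(2.0 * math.log(n) / s)   # V_i(s), with t taken as n
    return mean + math.sqrt(math.log(n) / s * min(0.25, v))
```

Because the factor `min(0.25, v)` never exceeds 1/4, this index is never larger than the plain UCB1 index for the same arm, which is why the tuned variant explores less aggressively.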
Other Policies
◮ UCB2: more complicated; gets arbitrarily close to the optimal constant factor on regret.
◮ UCB1-NORMAL: UCB1 adapted for normally distributed rewards.
◮ $\epsilon_n$-GREEDY: ε-greedy policy with decaying ε:
  $$\epsilon_n = \min\left\{1,\ \frac{cK}{d^2 n}\right\}, \quad \text{where } c > 0 \text{ and } 0 < d \le \min_{i:\,\mu_i < \mu^*} \Delta_i$$
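The decaying schedule is simple enough to show directly. In this sketch the helper names and the `rng` argument are assumptions; `means` holds the current empirical mean reward of each arm.

```python
import random

def epsilon_n(n, c, d, n_arms):
    """Exploration probability eps_n = min(1, c*K / (d^2 * n)) for play n >= 1."""
    return min(1.0, c * n_arms / (d * d * n))

def eps_n_greedy_choice(means, n, c, d, rng=random):
    """With probability eps_n pick a uniformly random arm, else the best empirical mean."""
    k = len(means)
    if rng.random() < epsilon_n(n, c, d, k):
        return rng.randrange(k)                    # explore
    return max(range(k), key=lambda i: means[i])   # exploit
```

Early on the schedule is clipped at 1, so the policy explores uniformly; once n exceeds cK/d², the exploration rate decays like 1/n. The dependence on d is the tuning sensitivity: d must lower-bound the smallest gap, which is unknown in practice.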
Experiments
[Figures omitted. Two machines: Bernoulli 0.9 and 0.8. 10 machines: Bernoulli 0.9, 0.8, ..., 0.8.]
Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning 47.2-3 (2002): 235-256.

Comparisons
◮ UCB1-Tuned nearly always far outperforms UCB1.
◮ $\epsilon_n$-GREEDY performs very well if tuned correctly, and poorly otherwise. It also performs poorly if there are many suboptimal machines.
◮ UCB1-Tuned is nearly as good as the best $\epsilon_n$-GREEDY without any tuning required.
◮ UCB2 is similar to UCB1-Tuned but slightly worse.
A Tutorial on Thompson Sampling
Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen
Presenters: Mingjie Mai, Feng Chi
October 20, 2017

Outline
◮ Example problems
◮ Algorithms and applications to example problems
◮ Approximations for complex models
◮ Practical modeling considerations
◮ Limitations
◮ Further example: Reinforcement learning in Markov Decision Problems

Exploitation vs Exploration
◮ Restaurant Selection
◮ Online Banner Advertisements
◮ Oil Drilling
◮ Game Playing
◮ Multi-armed bandit problem
Formal Bandit Problems
Bandit problems can be seen as a generalization of supervised learning, where we have:
◮ Actions $x_t \in \mathcal{X}$
◮ An unknown probability distribution over rewards: $(p_1, \dots, p_K)$
◮ At each step, we pick one $x_t$,
◮ observe the response $y_t$,
◮ and receive the instantaneous reward $r_t = r(y_t)$.
◮ The goal is to maximize the expected cumulative reward $\mathbb{E}\left[\sum_t r_t\right]$.

Regret
◮ The optimal action is $x^*_t = \arg\max_{x_t \in \mathcal{X}} \mathbb{E}[r \mid x_t]$
◮ The regret is the opportunity loss for one step: $\mathbb{E}\left[\mathbb{E}[r \mid x^*_t] - \mathbb{E}[r \mid x_t]\right]$
◮ The total regret is the total opportunity loss: $\mathbb{E}\left[\sum_{\tau=1}^{t}\left(\mathbb{E}[r \mid x^*_\tau] - \mathbb{E}[r \mid x_\tau]\right)\right]$
◮ Maximize cumulative reward ≡ minimize total regret
Bernoulli Bandit
◮ Action: $x_t \in \{1, 2, \dots, K\}$
◮ Success probabilities: $(\theta_1, \dots, \theta_K)$, where $\theta_k \in [0, 1]$
◮ Observation, given $x_t = k$: $y_t = 1$ with probability $\theta_k$, and $y_t = 0$ otherwise
◮ Reward: $r_t(y_t) = y_t$
◮ Prior belief: $\theta_k \sim \mathrm{Beta}(\alpha_k, \beta_k)$

Algorithms
The data observed up to time t: $H_t = \{(x_1, y_1), \dots, (x_{t-1}, y_{t-1})\}$
◮ Greedy
  ◮ $\hat{\theta} = \mathbb{E}[\theta \mid H_{t-1}]$
  ◮ $x_t = \arg\max_k \hat{\theta}_k$
◮ ε-Greedy
  ◮ $\hat{\theta} = \mathbb{E}[\theta \mid H_{t-1}]$
  ◮ $x_t = \arg\max_k \hat{\theta}_k$ with probability $1 - \epsilon$, and $x_t \sim \mathrm{unif}(\{1, \dots, K\})$ otherwise
◮ Thompson Sampling
  ◮ each $\hat{\theta}_k$ is sampled from $P(\theta_k \mid H_{t-1})$
  ◮ $x_t = \arg\max_k \hat{\theta}_k$
Computing Posteriors with Bernoulli Bandit
◮ Prior belief: $\theta_k \sim \mathrm{Beta}(\alpha_k, \beta_k)$
◮ At each time period, we apply action $x_t$, and a reward $r_t \in \{0, 1\}$ is generated with success probability $P(r_t = 1 \mid x_t, \theta) = \theta_{x_t}$.
◮ Update the distribution according to Bayes' rule.
◮ Due to the conjugacy property of the beta distribution we have:
  $$(\alpha_k, \beta_k) \leftarrow \begin{cases} (\alpha_k, \beta_k) & \text{if } x_t \neq k \\ (\alpha_k, \beta_k) + (r_t, 1 - r_t) & \text{if } x_t = k \end{cases}$$
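Putting the sampling rule and the conjugate update together gives a compact simulation loop. The sketch below is illustrative only: the Beta(1, 1) starting prior, the function names, and the simulated environment `theta` are assumptions, not part of the slides.

```python
import numpy as np

def thompson_step(alpha, beta, rng):
    """Sample theta_hat_k ~ Beta(alpha_k, beta_k) for every arm; act greedily on the sample."""
    theta_hat = rng.beta(alpha, beta)
    return int(np.argmax(theta_hat))

def update_posterior(alpha, beta, arm, reward):
    """Conjugate update: (alpha_k, beta_k) += (r_t, 1 - r_t) for the played arm only."""
    alpha[arm] += reward
    beta[arm] += 1 - reward

def run_thompson(theta, n_steps, seed=0):
    """Simulate Thompson sampling on a Bernoulli bandit with true success rates `theta`."""
    rng = np.random.default_rng(seed)
    k = len(theta)
    alpha, beta = np.ones(k), np.ones(k)     # Beta(1, 1) uniform prior (an assumption)
    counts = np.zeros(k, dtype=int)
    for _ in range(n_steps):
        arm = thompson_step(alpha, beta, rng)
        reward = int(rng.random() < theta[arm])
        update_posterior(alpha, beta, arm, reward)
        counts[arm] += 1
    return counts, alpha, beta
```

Swapping `rng.beta(alpha, beta)` for the posterior mean `alpha / (alpha + beta)` turns the same loop into the greedy algorithm, which makes the two easy to compare.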
Side by Side Comparison

Performance Comparison
Figure: Probability that (a) the greedy algorithm and (b) Thompson sampling select an action, with $\theta_1 = 0.9$, $\theta_2 = 0.8$, $\theta_3 = 0.7$.

Performance Comparison
Figure: Regret from applying greedy and Thompson sampling algorithms to the three-armed Bernoulli bandit: (a) $\theta = (0.9, 0.8, 0.7)$; (b) average over random θ.

Online Shortest Path
Figure: Shortest Path Problem.
Online Shortest Path - Independent Travel Time
Given a graph $G = (V, E, v_s, v_d)$, where $v_s, v_d \in V$, we have:
◮ Mean travel time: $\theta_e$ for $e \in E$
◮ Action: $x_t = (e_1, \dots, e_M)$, a path from $v_s$ to $v_d$
◮ Observation: $(y_{t,e_1} \mid \theta_{e_1}, \dots, y_{t,e_M} \mid \theta_{e_M})$ are independent, where $\ln(y_{t,e} \mid \theta_e) \sim N(\ln\theta_e - \tilde{\sigma}^2/2,\ \tilde{\sigma}^2)$, so that $\mathbb{E}[y_{t,e} \mid \theta_e] = \theta_e$
◮ Reward: $r_t = -\sum_{e \in x_t} y_{t,e}$
◮ Prior belief: $\ln(\theta_e) \sim N(\mu_e, \sigma_e^2)$, also independent.
Online Shortest Path - Independent Travel Time
◮ At each iteration t, we hold posterior parameters $(\mu_e, \sigma_e)$ for each $e \in E$.
◮ Greedy algorithm: $\hat{\theta}_e = \mathbb{E}_p[\theta_e] = e^{\mu_e + \sigma_e^2/2}$
◮ Thompson sampling: draw $\hat{\theta}_e \sim \mathrm{logNormal}(\mu_e, \sigma_e^2)$
◮ Pick an action x to maximize $\mathbb{E}_{q_{\hat{\theta}}}[r(y_t) \mid x_t = x] = -\sum_{e \in x} \hat{\theta}_e$
  ◮ This can be solved via Dijkstra's algorithm.
◮ Observe $y_{t,e}$ and update the parameters:
  $$(\mu_e, \sigma_e^2) \leftarrow \left(\frac{\frac{1}{\sigma_e^2}\mu_e + \frac{1}{\tilde{\sigma}^2}\left(\ln y_{t,e} + \frac{\tilde{\sigma}^2}{2}\right)}{\frac{1}{\sigma_e^2} + \frac{1}{\tilde{\sigma}^2}},\ \frac{1}{\frac{1}{\sigma_e^2} + \frac{1}{\tilde{\sigma}^2}}\right)$$
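The per-edge update is a precision-weighted average in log space, which a short function makes explicit. This is a sketch under the slide's model; `var` stands for $\sigma_e^2$, `noise_var` for $\tilde{\sigma}^2$, and the function names are made up for the example.

```python
import math

def update_edge(mu, var, y, noise_var):
    """Posterior (mu, var) of ln(theta_e) after observing travel time y on edge e."""
    precision = 1.0 / var + 1.0 / noise_var
    new_mu = (mu / var + (math.log(y) + noise_var / 2.0) / noise_var) / precision
    return new_mu, 1.0 / precision

def sample_theta(mu, var, rng):
    """Thompson sample of the mean travel time: theta_hat ~ logNormal(mu, var)."""
    return math.exp(rng.gauss(mu, math.sqrt(var)))
```

For instance, starting from `mu = -0.5`, `var = 1.0` and observing `y = 1.0` with `noise_var = 1.0` moves the posterior to mean 0 and variance 0.5. The sampled `theta_hat` values can then be used as edge weights for a shortest-path solver.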
Binomial Bridge
◮ Apply the above algorithm to a binomial bridge with twenty stages, which has 184,756 paths.
◮ $\mu_e = -\frac{1}{2}$, $\sigma_e^2 = 1$ so that $\mathbb{E}[\theta_e] = 1$ for each $e \in E$, and $\tilde{\sigma}^2 = 1$
Figure: A binomial bridge with six stages.

Figure: Performance of Thompson sampling and ε-greedy algorithms in the shortest path problem: (a) regret; (b) total travel time/optimal.
Online Shortest Path - Correlated Travel Time
◮ Independent $\theta_e \sim \mathrm{logNormal}(\mu_e, \sigma_e^2)$
◮ $y_{t,e} = \zeta_{t,e}\,\eta_t\,\nu_{t,\ell(e)}\,\theta_e$
  ◮ $\zeta_{t,e}$ is an idiosyncratic factor associated with edge e (road construction/closure, accident, etc.).
  ◮ $\eta_t$ is a factor common to all edges (weather, etc.).
  ◮ $\ell(e)$ indicates whether edge e resides in the lower half of the bridge.
  ◮ $\nu_{t,0}, \nu_{t,1}$ are factors that bear a common influence on edges in the upper or lower halves (signal problems).

Online Shortest Path - Correlated Travel Time
◮ Prior setup:
  ◮ Take $\zeta_{t,e}, \eta_t, \nu_{t,1}, \nu_{t,0}$ to be independent $\mathrm{logNormal}(-\tilde{\sigma}^2/6,\ \tilde{\sigma}^2/3)$.
  ◮ We only need to estimate $\theta_e$; the marginal $y_{t,e} \mid \theta$ is the same as in the independent case, but the joint distribution over $y_t \mid \theta$ differs.
◮ Correlated observations induce dependencies in the posterior, although mean travel times are independent.

Online Shortest Path - Correlated Travel Time
◮ Let $\phi, z_t \in \mathbb{R}^N$ be defined by $\phi_e = \ln\theta_e$ and
  $$z_{t,e} = \begin{cases} \ln y_{t,e} & \text{if } e \in x_t \\ 0 & \text{otherwise.} \end{cases}$$
◮ Define a $|x_t| \times |x_t|$ covariance matrix $\tilde{\Sigma}$ with elements, for $e, e' \in x_t$,
  $$\tilde{\Sigma}_{e,e'} = \begin{cases} \tilde{\sigma}^2 & \text{for } e = e' \\ 2\tilde{\sigma}^2/3 & \text{for } e \neq e',\ \ell(e) = \ell(e') \\ \tilde{\sigma}^2/3 & \text{otherwise,} \end{cases}$$
◮ and an $N \times N$ concentration matrix
  $$\tilde{C}_{e,e'} = \begin{cases} \tilde{\Sigma}^{-1}_{e,e'} & \text{if } e, e' \in x_t \\ 0 & \text{otherwise.} \end{cases}$$
Online Shortest Path - Correlated Travel Time
◮ Apply Thompson sampling:
  ◮ At each iteration t, sample a vector $\hat{\phi}$ from $N(\mu, \Sigma)$, then set $\hat{\theta}_e = \exp(\hat{\phi}_e)$ for each $e \in E$ (since $\phi_e = \ln\theta_e$).
  ◮ An action x is selected to maximize $\mathbb{E}_{q_{\hat{\theta}}}[r(y_t) \mid x_t = x] = -\sum_{e \in x} \hat{\theta}_e$, using Dijkstra's algorithm or an alternative.
◮ With $\tilde{C}$ defined for $e, e' \in E$, the posterior distribution of φ is normal with a mean vector μ and covariance matrix Σ that can be updated according to
  $$(\mu, \Sigma) \leftarrow \left(\left(\Sigma^{-1} + \tilde{C}\right)^{-1}\left(\Sigma^{-1}\mu + \tilde{C} z_t\right),\ \left(\Sigma^{-1} + \tilde{C}\right)^{-1}\right)$$
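The concentration matrix and the joint posterior update can be sketched with NumPy as follows. The function names, the integer edge indexing, and the `lower` array (the ℓ(e) flag for each edge) are assumptions made for the example; the explicit matrix inverses are kept for clarity rather than efficiency.

```python
import numpy as np

def concentration_matrix(path, lower, noise_var, n_edges):
    """N x N matrix C~: inverse of the path covariance Sigma~, zero off the path."""
    path = np.asarray(path, dtype=int)
    lower = np.asarray(lower)
    same_half = lower[path][:, None] == lower[path][None, :]
    cov = np.where(same_half, 2.0 * noise_var / 3.0, noise_var / 3.0)
    np.fill_diagonal(cov, noise_var)              # diagonal entries are sigma~^2
    C = np.zeros((n_edges, n_edges))
    C[np.ix_(path, path)] = np.linalg.inv(cov)
    return C

def update_correlated(mu, Sigma, C_tilde, z):
    """(mu, Sigma) <- ((Sigma^-1 + C~)^-1 (Sigma^-1 mu + C~ z), (Sigma^-1 + C~)^-1)."""
    Sigma_inv = np.linalg.inv(Sigma)
    new_Sigma = np.linalg.inv(Sigma_inv + C_tilde)
    new_mu = new_Sigma @ (Sigma_inv @ mu + C_tilde @ z)
    return new_mu, new_Sigma
```

Edges that are not on the chosen path have zero rows and columns in `C_tilde`, so their entries in `z` have no effect; they are still updated indirectly whenever the current Σ correlates them with edges on the path.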
Figure: Performance of two versions of Thompson sampling in the shortest path problem with correlated travel time: (a) regret; (b) total travel time/optimal.

Approximations of Posterior Sampling for Complex Models
◮ Gibbs Sampling
◮ Langevin Monte Carlo
◮ Sampling from a Laplace Approximation
◮ Bootstrapping
Gibbs Sampling
◮ History: $H_{t-1} = ((x_1, y_1), \dots, (x_{t-1}, y_{t-1}))$
◮ Start with an initial guess $\hat{\theta}^0$.
◮ In the n-th iteration, sample each k-th component according to
  $$\hat{\theta}^n_k \sim f^{n,k}_{t-1}(\theta_k), \qquad f^{n,k}_{t-1}(\theta_k) \propto f_{t-1}\left((\hat{\theta}^n_1, \dots, \hat{\theta}^n_{k-1}, \theta_k, \hat{\theta}^{n-1}_{k+1}, \dots, \hat{\theta}^{n-1}_K)\right)$$
◮ After N iterations, $\hat{\theta}^N$ is taken to be the approximate posterior sample.
Langevin Monte Carlo
◮ Let g be the posterior distribution.
◮ Euler method for simulating Langevin dynamics:
  $$\theta_{n+1} = \theta_n + \epsilon\,\nabla \ln g(\theta_n) + \sqrt{2\epsilon}\,W_n, \quad n \in \mathbb{N}$$
◮ $W_1, W_2, \dots$ are i.i.d. standard normal random variables and $\epsilon > 0$ is a small step size.
◮ Stochastic gradient Langevin Monte Carlo: use sampled minibatches of data to compute approximate gradients.
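A minimal sketch of the Euler scheme is shown below. It uses the standard discretization with noise scale $\sqrt{2\epsilon}$, for which the chain's stationary distribution approximates g; the function name and the `grad_log_g` callback (returning $\nabla \ln g$) are assumptions for the example.

```python
import numpy as np

def langevin_sample(grad_log_g, theta0, step, n_iters, rng):
    """Run n_iters Euler steps of Langevin dynamics from theta0; return the final state."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iters):
        noise = rng.standard_normal(theta.shape)
        theta = theta + step * grad_log_g(theta) + np.sqrt(2.0 * step) * noise
    return theta
```

As a sanity check, for a standard normal target (`grad_log_g = lambda th: -th`) many independent chains should end with mean near 0 and variance near 1, up to a small discretization bias that shrinks with the step size.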
Sampling from a Laplace Approximation
◮ Assume the posterior g is unimodal and its log density $\ln g(\theta)$ is strictly concave around its mode $\bar{\theta}$.
◮ A second-order Taylor approximation to the log-density gives
  $$\ln g(\theta) \approx \ln g(\bar{\theta}) - \frac{1}{2}(\theta - \bar{\theta})^\top C\,(\theta - \bar{\theta}), \quad \text{where } C = -\nabla^2 \ln g(\bar{\theta}).$$
◮ Approximate the density g using a Gaussian distribution with mean $\bar{\theta}$ and covariance $C^{-1}$:
  $$\tilde{g}(\theta) = \sqrt{|C / 2\pi|}\; e^{-\frac{1}{2}(\theta - \bar{\theta})^\top C\,(\theta - \bar{\theta})}$$
Bootstrapping
◮ History: $H_{t-1} = ((x_1, y_1), \dots, (x_{t-1}, y_{t-1}))$
◮ Uniformly sample with replacement from $H_{t-1}$
◮ Hypothetical history $\hat{H}_{t-1} = ((\hat{x}_1, \hat{y}_1), \dots, (\hat{x}_{t-1}, \hat{y}_{t-1}))$
◮ Maximize the likelihood of θ under $\hat{H}_{t-1}$
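For the Bernoulli bandit, maximizing the likelihood under the resampled history just means taking per-arm success frequencies, so the procedure above can be sketched briefly. The function name and the fallback value of 0.5 for arms missing from the resample are assumptions; the slides do not say how such arms are handled.

```python
import numpy as np

def bootstrap_sample(history, n_arms, rng):
    """history: list of (arm, outcome) pairs. Returns the MLE of theta under a resampled history."""
    theta_hat = np.full(n_arms, 0.5)   # fallback for arms absent from the resample (assumption)
    if not history:
        return theta_hat
    idx = rng.integers(0, len(history), size=len(history))   # uniform, with replacement
    arms = np.array([history[i][0] for i in idx])
    outcomes = np.array([history[i][1] for i in idx], dtype=float)
    for k in range(n_arms):
        mask = arms == k
        if mask.any():
            theta_hat[k] = outcomes[mask].mean()   # Bernoulli MLE = success frequency
    return theta_hat
```

The returned vector plays the role of the posterior sample in Thompson sampling: the agent acts greedily with respect to it and appends the new observation to the history.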
Bernoulli Bandit
Figure: Regret of approximation methods versus exact Thompson sampling (Bernoulli bandit)

Online Shortest Path
Figure: Regret of approximation methods versus exact Thompson sampling (online shortest path)

Practical Modeling Considerations
◮ Prior distribution specification
◮ Constraints and context
◮ Nonstationary systems

Prior Distribution Specification
◮ Prior: a distribution over plausible values
◮ Misspecified prior vs informative prior
◮ Thoughtful choice of prior based on past experience can improve learning performance