Information-Based Objective Functions for Active Data Selection

Information-Based Objective Functions for Active Data Selection, by David J. C. MacKay. Presented by Aditya Sanghi and Grant Watson. Motivation: active learning, a learning algorithm which is able to interactively query for more data.


  1. Weaknesses of Exp3: Shifting Rewards
     - Exp3 closely matches the best single-arm strategy over the whole trajectory.
     - For curriculum learning, a good strategy often changes:
       - Easier cases in the training data will provide high rewards during early training, but have diminishing returns.
       - Over time, more difficult cases will provide higher rewards.

  2. The Exp3.S Algorithm for Shifting Rewards
     - Addresses the issues of Exp3 by encouraging exploration with probability $\epsilon$ and by mixing weights additively:
       $$\pi_t^{\text{Exp3.S}}(i) = (1-\epsilon)\,\frac{e^{w_{t,i}}}{\sum_{j=1}^N e^{w_{t,j}}} + \frac{\epsilon}{N}$$
       $$w_{t,i} = \log\left[(1-\alpha_t)\exp\{w_{t-1,i} + \eta\,\tilde r_{t-1,i}\} + \frac{\alpha_t}{N-1}\sum_{j\neq i}\exp\{w_{t-1,j} + \eta\,\tilde r_{t-1,j}\}\right]$$
     - This effectively decays the importance of old rewards and allows the model to react faster to changing scenarios.
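A minimal numpy sketch of the Exp3.S policy and weight update above; the importance-weighted reward estimate $\tilde r$ and the hyperparameter handling are assumptions, not taken from the slides:

```python
import numpy as np

def exp3s_probs(w, eps):
    """Exp3.S policy: softmax over weights mixed with uniform exploration."""
    p = np.exp(w - w.max())              # subtract max for numerical stability
    p /= p.sum()
    return (1 - eps) * p + eps / len(w)

def exp3s_update(w, arm, reward, prob, eta, alpha):
    """Additive weight mixing: old rewards decay, enabling fast adaptation."""
    N = len(w)
    r_tilde = np.zeros(N)
    r_tilde[arm] = reward / prob         # importance-weighted reward estimate
    v = np.exp(w + eta * r_tilde)        # exp{w_{t-1,j} + eta * r_tilde_{t-1,j}}
    total = v.sum()
    # log[(1-alpha) exp{w_i + eta r_i} + alpha/(N-1) * sum_{j != i} exp{w_j + eta r_j}]
    return np.log((1 - alpha) * v + alpha / (N - 1) * (total - v))
```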

  3. Learning a Syllabus over Tasks
     - Given: separate tasks with unknown difficulties.
     - We want to maximize the rate of learning:
       1. At each timestep $t$, sample a task index $k$ from $\pi_t$.
       2. Sample a data batch from this task: $\{x^k_{[0..B]}, \hat y^k_{[0..B]}\}$.
       3. Calculate a measure of learning progress $\nu$ and the effort $\tau$ (computation time, input size, etc.).
       4. The rate of learning is $r_t = \nu / \tau$, re-scaled to $[-1, 1]$.
       5. Update the parameters $w$ of the policy $\pi$ using Exp3.S.
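A sketch of this syllabus loop, reusing the Exp3.S helpers above; `num_tasks`, `num_steps`, `sample_batch`, and `progress` are hypothetical placeholders, wall time stands in for the effort $\tau$, and the paper's quantile-based re-scaling is simplified to a clip:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(num_tasks)                  # Exp3.S weights

for t in range(num_steps):
    p = exp3s_probs(w, eps=0.05)
    k = rng.choice(num_tasks, p=p)       # 1. sample a task index from pi_t
    batch = sample_batch(k)              # 2. sample a data batch from task k
    start = time.time()
    nu = progress(batch)                 # 3. learning progress nu ...
    tau = time.time() - start            #    ... and effort tau (wall time here)
    r = np.clip(nu / tau, -1.0, 1.0)     # 4. rate of learning, forced into [-1, 1]
    w = exp3s_update(w, k, r, p[k], eta=0.1, alpha=1e-3)  # 5. Exp3.S update
```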

  4. Learning Progress Measures
     - It is computationally expensive (or intractable) to measure the global impact of training on a particular sample.
     - We desire proxies for progress that depend only on the current sample or a single extra sample.
     - The paper proposes two types of progress measures:
       - Loss-driven: compares predictions before and after training.
       - Complexity-driven: an information-theoretic view of learning.

  5. Prediction Gain
     - Prediction Gain is the change in sample loss before and after training on a sample batch $x$:
       $$\nu_{PG} = L(x, \theta) - L(x, \theta_x)$$
     - Moreover, when training using gradient descent: $\Delta\theta \propto -\nabla L(x, \theta)$.
     - Hence, we have a Gradient Prediction Gain approximation:
       $$\nu_{GPG} = L(x, \theta) - L(x, \theta_x) \approx -\nabla L(x, \theta) \cdot \Delta\theta \propto \|\nabla L(x, \theta)\|^2$$
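As a sketch, $\nu_{GPG}$ reduces to a squared gradient norm. Shown here for a toy squared-error loss with an analytic gradient; the linear model and data layout are illustrative assumptions:

```python
import numpy as np

def loss_and_grad(theta, X, y):
    """Toy sample loss L(x, theta): mean squared error of a linear model."""
    resid = X @ theta - y
    return 0.5 * np.mean(resid ** 2), X.T @ resid / len(y)

def gradient_prediction_gain(theta, X, y):
    """nu_GPG is proportional to the squared norm of the sample-loss gradient."""
    _, grad = loss_and_grad(theta, X, y)
    return float(grad @ grad)
```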

  6. Bias-Variance Trade-Off
     - Prediction Gain is a biased estimate of the expected change in loss due to training on a sample $x$:
       $$\mathbb{E}_{x' \sim \text{Task}_k}\left[L(x', \theta) - L(x', \theta_x)\right]$$
     - In particular, it favors tasks that have high variance.
     - This is because the sample's own loss decreases after training, even though the loss on other samples from the task could increase.
     - An unbiased estimate is the Self Prediction Gain:
       $$\nu_{SPG} = L(x', \theta) - L(x', \theta_x), \qquad x' \sim D_k$$
     - $\nu_{SPG}$ naturally has higher variance due to the sampling of $x'$.
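The contrast between the two estimators, as a sketch with hypothetical helpers `loss(theta, x)`, `train_on(theta, x)` (returning the post-update parameters), and `sample(task)`:

```python
def prediction_gain(theta, x):
    # Biased: the same batch x is used for both training and evaluation.
    theta_x = train_on(theta, x)
    return loss(theta, x) - loss(theta_x, x)

def self_prediction_gain(theta, x, task):
    # Unbiased but noisier: evaluate on an independent sample x' from the task.
    x_prime = sample(task)
    theta_x = train_on(theta, x)
    return loss(theta, x_prime) - loss(theta_x, x_prime)
```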

  7. Shifting Gears: Complexity in Stochastic VI
     - Consider the objective in stochastic variational inference, where $P_\phi$ is a variational posterior over parameters $\theta$ and $Q_\psi$ is a prior over $\theta$:
       $$L_{VI} = \underbrace{\text{KL}(P_\phi \,\|\, Q_\psi)}_{\text{model complexity}} + \underbrace{\sum_{x' \in D} \mathbb{E}_{\theta \sim P_\phi}[L(x', \theta)]}_{\text{data compression under } P_\phi}$$
     - Training trades off better ability to compress data against higher model complexity. We expect complexity to increase the most from highly generalizable data points.

  8. Variational Complexity Gain
     - The Variational Complexity Gain after training on a sample batch $x$ is the change in KL divergence:
       $$\nu_{VCG} = \text{KL}(P_{\phi_x} \,\|\, Q_{\psi_x}) - \text{KL}(P_\phi \,\|\, Q_\psi)$$
     - We can design $P$ and $Q$ to have a closed-form KL divergence. Example: both diagonal Gaussians.
     - In non-variational settings, when using L2 regularization (a Gaussian prior on the weights), we can define the L2 Gain:
       $$\nu_{L2G} = \|\theta_x\|^2 - \|\theta\|^2$$
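A sketch of $\nu_{VCG}$ and $\nu_{L2G}$ for the diagonal-Gaussian case mentioned above, where the KL divergence has a closed form; the representation of each distribution as a (mean, variance) pair of arrays is an assumption:

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(P || Q) for diagonal Gaussians, summed over dimensions."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def variational_complexity_gain(post0, prior0, post1, prior1):
    """nu_VCG: change in KL(posterior || prior) from one training step.
    Each argument is a (mean, variance) pair; the prior parameters psi
    may also have changed during the step."""
    return (kl_diag_gaussians(*post1, *prior1)
            - kl_diag_gaussians(*post0, *prior0))

def l2_gain(theta_before, theta_after):
    """nu_L2G: the non-variational analogue under a Gaussian weight prior."""
    return float(np.sum(theta_after ** 2) - np.sum(theta_before ** 2))
```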

  9. Gradient Variational Complexity Gain
     - The Gradient Variational Complexity Gain is the directional derivative of the KL divergence along the gradient descent direction of the data loss:
       $$\nu_{GVCG} \propto \nabla_\phi \text{KL}(P_\phi \,\|\, Q_\psi) \cdot \nabla_\phi \mathbb{E}_{\theta \sim P_\phi}[L(x, \theta)]$$
     - The other loss terms do not depend on $x$.
     - This gain worked well experimentally, perhaps because the curvature of the model complexity is typically flatter than that of the loss.

  10. Example Experiment: Generated Text
     - 11 datasets were generated using increasingly complex language models. Policies gravitated towards complexity.
     (Figure credit: Automated Curriculum Learning for Neural Networks.)

  11. Experimental Highlights
     - Uniform sampling across tasks, while inefficient, was a very strong benchmark. Perhaps learning is dominated by gradients from the tasks that drive progress.
     - For the variational loss, GVCG yielded higher complexity and faster training than uniform sampling in one experiment.
     - Strategies observed: a policy would focus on one task until completion, and the loss would also decrease on unseen (related) tasks.

  12. Summary of Ideas
     - Discussed several progress measures that can be evaluated using the training samples or one extra sample.
     - By evaluating the progress from each training example, a multi-armed bandit determines a stochastic policy, over which task to train on next, to maximize progress.
     - The bandit needs to be non-stationary: simpler tasks dominate early on (especially for Prediction Gain), while difficult tasks contain most of the complexity.

  13. Takeaways
     - Better learning efficiency can be achieved with the right measure of progress, but finding it involves experimentation.
     - The final overall loss was better in one out of six experiments. Finding better local minima is a research direction.
     - Most promising: Prediction Gain for MLE problems, and Gradient Variational Complexity Gain for VI.
     - Variational Complexity Gain was noisy and performed worse than its gradient analogue. Determining why is an open question; it could be due to terms independent of $x$.

  14. Finite-time Analysis of the Multiarmed Bandit Problem, by Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Presented by Eric Langlois, October 20, 2017.

  15. Exploration vs. Exploitation
     - In reinforcement learning, we must maximize long-term reward.
     - Need to balance exploiting what we know already vs. exploring to discover better strategies.

  16. Multi-Armed Bandit
     - $K$ slot machines, each with a static reward distribution $p_i$.
     - A policy selects machines to play given the history.
     - The $n$th play of machine $i \in \{1, \dots, K\}$ is a random variable $X_{i,n}$ with mean $\mu_i$.
     - Goal: maximize total reward.

  17. Regret
     How do we measure the quality of a policy?
     - $T_i(n)$: the number of times machine $i$ is played in the first $n$ plays.
     - Regret: expected under-performance compared to optimal play. The regret after $n$ plays is
       $$\text{Regret} = \mathbb{E}\left[\sum_{i=1}^K T_i(n)\,\Delta_i\right], \qquad \Delta_i = \mu^* - \mu_i, \qquad \mu^* = \max_{1 \le i \le K} \mu_i$$
     - Uniform random policy: linear regret.
     - $\epsilon$-greedy policy: linear regret.

  18. Asymptotically Optimal Regret
     - Lai and Robbins (1985) proved there exist policies with
       $$\mathbb{E}[T_i(n)] \le \left(\frac{1}{D(p_i \,\|\, p^*)} + o(1)\right) \ln n$$
       where $p_i$ is the reward distribution of machine $i$.
     - Such policies asymptotically achieve logarithmic regret.
     - They also proved that logarithmic regret is optimal.
     - Agrawal (1995): asymptotically optimal policies expressed in terms of sample means instead of KL divergences.

  19. Upper Confidence Bound Algorithms
     [Figure: a reward distribution with its mean and upper confidence bound marked.]
     - Core idea: optimism in the face of uncertainty.
     - Select the arm with the highest upper confidence bound.
     - Assumption: distributions have support in $[0, 1]$.

  20. UCB1
     Initialization: play each machine once.
     Loop: play the machine $i$ maximizing
     $$\bar x_i + \sqrt{\frac{2 \ln n}{n_i}}$$
     - $\bar x_i$: mean observed reward from machine $i$.
     - $n_i$: number of times machine $i$ has been played so far.
     - $n$: total number of plays done so far.
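A compact sketch of UCB1 as stated above; `pull` is a hypothetical function returning a reward in $[0, 1]$ for a machine index, and the example machines are illustrative:

```python
import numpy as np

def ucb1(pull, K, n_plays):
    """Play each machine once, then repeatedly play the machine maximizing
    mean observed reward + sqrt(2 ln n / n_i)."""
    sums = np.array([pull(i) for i in range(K)], dtype=float)  # initialization
    counts = np.ones(K)
    for n in range(K + 1, n_plays + 1):
        ucb = sums / counts + np.sqrt(2.0 * np.log(n) / counts)
        i = int(np.argmax(ucb))
        sums[i] += pull(i)
        counts[i] += 1
    return counts

# Example: three Bernoulli machines (rewards have support in [0, 1]).
rng = np.random.default_rng(0)
means = [0.2, 0.5, 0.7]
counts = ucb1(lambda i: float(rng.binomial(1, means[i])), K=3, n_plays=10_000)
```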

  21.-29. UCB1 Demo
     Nine snapshots of UCB1 on three machines, showing the per-machine selection counts as plays accumulate (the plotted reward distributions are omitted):

     Plays n:      3     4     5     6     7     50    100   1000   10000
     Machine 1:    1     1     1     2     2     7     11    32     57
     Machine 2:    1     2     2     2     3     18    34    261    931
     Machine 3:    1     1     2     2     2     25    55    707    9012

     Play increasingly concentrates on a single machine, while the others continue to be sampled at a diminishing rate.

  30. UCB1: Regret Bound (Theorem 1)
     For all $K > 1$, if policy UCB1 is run on $K$ machines having arbitrary reward distributions $P_1, \dots, P_K$ with support in $[0, 1]$, then its expected regret after any number $n$ of plays is at most
     $$\left[8 \sum_{i:\,\mu_i < \mu^*} \frac{\ln n}{\Delta_i}\right] + \left(1 + \frac{\pi^2}{3}\right) \sum_{i=1}^K \Delta_i$$

  31. UCB1: Definitions for Proof of Bound
     - $I_t$: indicator RV equal to the machine played at time $t$.
     - $\bar X_{i,n}$: RV of the observed mean reward from $n$ plays of machine $i$:
       $$\bar X_{i,n} = \frac{1}{n} \sum_{t=1}^n X_{i,t}$$
     - An asterisk superscript refers to the (first) optimal machine, e.g. $T^*(n)$ and $\bar X^*_n$.
     - Braces denote the indicator function of their contents.
     - The number of plays of machine $i$ after time $n$ under UCB1 is therefore
       $$T_i(n) = 1 + \sum_{t=K+1}^n \{I_t = i\}$$

  32. UCB1: Proof of Regret Bound
     $$T_i(n) = 1 + \sum_{t=K+1}^n \{I_t = i\} \;\le\; \ell + \sum_{t=K+1}^n \{I_t = i,\; T_i(t-1) \ge \ell\}$$
     - Strategy: for every sub-optimal arm $i$, establish a bound on the total number of plays as a function of $n$.
     - Assume we have already seen $\ell$ plays of machine $i$ and bound the number of remaining plays.

  33. UCB1: Proof of Regret Bound
     $$T_i(n) \le \ell + \sum_{t=K+1}^n \left\{\bar X^*_{T^*(t-1)} + c_{t-1,\,T^*(t-1)} \le \bar X_{i,\,T_i(t-1)} + c_{t-1,\,T_i(t-1)},\; T_i(t-1) \ge \ell\right\}$$
     - Let $c_{t,s} = \sqrt{2 \ln t / s}$ be the UCB offset term.
     - Machine $i$ is selected if its UCB $= \bar X_{i,\,T_i(t-1)} + c_{t-1,\,T_i(t-1)}$ is the largest of all machines.
     - In particular, it must be larger than the UCB of the optimal machine.

  34. UCB1: Proof of Regret Bound
     $$T_i(n) \le \ell + \sum_{t=1}^\infty \sum_{s=1}^{t-1} \sum_{s_i=\ell}^{t-1} \left\{\bar X^*_s + c_{t,s} \le \bar X_{i,s_i} + c_{t,s_i}\right\}$$
     - We do not care about the particular number of times machine $i$ and machine $*$ have been played.
     - The probability is upper bounded by summing over all possible assignments $T^*(t-1) = s$ and $T_i(t-1) = s_i$.
     - The bounds on $t$ are relaxed as well.

  35. UCB1: Proof of Regret Bound
     $$T_i(n) \le \ell + \sum_{t=1}^\infty \sum_{s=1}^{t-1} \sum_{s_i=\ell}^{t-1} \left\{\bar X^*_s + c_{t,s} \le \bar X_{i,s_i} + c_{t,s_i}\right\}$$
     The event $\bar X^*_s + c_{t,s} \le \bar X_{i,s_i} + c_{t,s_i}$ implies at least one of the following:
     $$\bar X^*_s \le \mu^* - c_{t,s} \quad (1)$$
     $$\bar X_{i,s_i} \ge \mu_i + c_{t,s_i} \quad (2)$$
     $$\mu^* < \mu_i + 2 c_{t,s_i} \quad (3)$$

  36. Chernoff-Hoeffding Bound
     Let $Z_1, \dots, Z_n$ be i.i.d. random variables with mean $\mu$ and domain $[0, 1]$, and let $\bar Z_n = \frac{1}{n}(Z_1 + \dots + Z_n)$. Then for all $a \ge 0$,
     $$P\left(\bar Z_n \ge \mu + a\right) \le e^{-2na^2}, \qquad P\left(\bar Z_n \le \mu - a\right) \le e^{-2na^2}$$
     Applied to inequalities (1) and (2), these give the bounds
     $$P\left(\bar X^*_s \le \mu^* - c_{t,s}\right) \le \exp\left(-2s \cdot \frac{2 \ln t}{s}\right) = t^{-4}$$
     $$P\left(\bar X_{i,s_i} \ge \mu_i + c_{t,s_i}\right) \le t^{-4}$$

  37. UCB1: Proof of Regret Bound
     The final inequality, $\mu^* < \mu_i + 2 c_{t,s_i}$, is based on the width of the confidence interval. For $t \le n$, it is false once $s_i$ is large enough:
     $$\Delta_i = \mu^* - \mu_i \le 2\sqrt{\frac{2 \ln t}{s_i}} \;\Rightarrow\; \frac{\Delta_i^2}{4} \le \frac{2 \ln t}{s_i} \;\Rightarrow\; s_i < \frac{8 \ln t}{\Delta_i^2}$$
     - In the regret bound summation $s_i \ge \ell$, so we set $\ell = \lceil 8 \ln n / \Delta_i^2 \rceil$.
     - Inequality (3) then contributes nothing to the bound.

  38. UCB1: Proof of Regret Bound
     With $\ell = \lceil 8 \ln n / \Delta_i^2 \rceil$ we have the bound on $\mathbb{E}[T_i(n)]$:
     $$\mathbb{E}[T_i(n)] \le \ell + \sum_{t=1}^\infty \sum_{s=1}^{t-1} \sum_{s_i=\ell}^{t-1} \left[P\left(\bar X^*_s \le \mu^* - c_{t,s}\right) + P\left(\bar X_{i,s_i} \ge \mu_i + c_{t,s_i}\right)\right]$$
     $$\le \ell + \sum_{t=1}^\infty \sum_{s=1}^{t} \sum_{s_i=1}^{t} 2 t^{-4} \;\le\; \frac{8 \ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$$
     Substituted into the regret formula, this gives our bound.

  39. UCB1-Tuned
     - UCB1: $\mathbb{E}[T_i(n)] \le \frac{8 \ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$
     - The constant factor $\frac{8}{\Delta_i^2}$ is sub-optimal; the optimal constant is $\frac{1}{2\Delta_i^2}$.
     - In practice the performance of UCB1 can be improved further by using the confidence bound
       $$\bar X_{i,s} + \sqrt{\frac{\ln n}{n_i} \min\left\{\frac{1}{4},\; V_i(n_i)\right\}}$$
       where
       $$V_i(s) = \left(\frac{1}{s} \sum_{\tau=1}^s X_{i,\tau}^2\right) - \bar X_{i,s}^2 + \sqrt{\frac{2 \ln t}{s}}$$
     - No proof of a regret bound is known.
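A sketch of the UCB1-Tuned index above; keeping per-machine reward lists so the variance estimate $V_i$ can be computed is an assumed bookkeeping choice:

```python
import numpy as np

def ucb1_tuned_index(reward_lists, n):
    """UCB1-Tuned: replace the 2 ln n / n_i exploration term of UCB1 with
    (ln n / n_i) * min(1/4, V_i(n_i)).  reward_lists[i] holds the rewards
    observed so far from machine i."""
    idx = np.empty(len(reward_lists))
    for i, rewards in enumerate(reward_lists):
        r = np.asarray(rewards, dtype=float)
        n_i = len(r)
        # V_i(s) = (mean of squares - squared mean) + sqrt(2 ln n / s)
        v = r.var() + np.sqrt(2.0 * np.log(n) / n_i)
        idx[i] = r.mean() + np.sqrt(np.log(n) / n_i * min(0.25, v))
    return idx
```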

  40. Other Policies
     - UCB2: more complicated; gets arbitrarily close to the optimal constant factor on the regret.
     - UCB1-NORMAL: UCB1 adapted for normally distributed rewards.
     - $\epsilon_n$-GREEDY: an $\epsilon$-greedy policy with decaying $\epsilon$:
       $$\epsilon_n = \min\left\{1,\; \frac{cK}{d^2 n}\right\}, \qquad c > 0, \quad 0 < d \le \min_{i:\,\mu_i < \mu^*} \Delta_i$$

  41. Experiments
     - Two machines: Bernoulli 0.9 and 0.8.
     - Ten machines: Bernoulli 0.9, 0.8, ..., 0.8.
     Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning 47.2-3 (2002): 235-256.

  42. Comparisons
     - UCB1-Tuned nearly always far outperforms UCB1.
     - $\epsilon_n$-GREEDY performs very well if tuned correctly, and poorly otherwise. It also performs poorly when there are many suboptimal machines.
     - UCB1-Tuned is nearly as good as the best $\epsilon_n$-GREEDY without requiring any tuning.
     - UCB2 is similar to UCB1-Tuned but slightly worse.

  43. A Tutorial on Thompson Sampling, by Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. Presented by Mingjie Mai and Feng Chi, October 20, 2017.

  44. Outline
     - Example problems
     - Algorithms and applications to the example problems
     - Approximations for complex models
     - Practical modeling considerations
     - Limitations
     - Further example: reinforcement learning in Markov decision problems

  45. Exploitation vs. Exploration
     - Restaurant selection
     - Online banner advertisements
     - Oil drilling
     - Game playing
     - The multi-armed bandit problem

  46. Formal Bandit Problems
     Bandit problems can be seen as a generalization of supervised learning, where:
     - Actions are $x_t \in \mathcal{X}$.
     - There is an unknown probability distribution over rewards: $(p_1, \dots, p_K)$.
     - At each step we pick one $x_t$, observe a response $y_t$, and receive the instantaneous reward $r_t = r(y_t)$.
     - The goal is to maximize the mean cumulative reward $\mathbb{E}\left[\sum_t r_t\right]$.

  47. Regret
     - The optimal action is $x^*_t = \arg\max_{x_t \in \mathcal{X}} \mathbb{E}[r \mid x_t]$.
     - The regret is the opportunity loss for one step: $\mathbb{E}\left[\mathbb{E}[r \mid x^*_t] - \mathbb{E}[r \mid x_t]\right]$.
     - The total regret is the total opportunity loss: $\mathbb{E}\left[\sum_{\tau=1}^t \left(\mathbb{E}[r \mid x^*_\tau] - \mathbb{E}[r \mid x_\tau]\right)\right]$.
     - Maximizing cumulative reward $\equiv$ minimizing total regret.

  48. Bernoulli Bandit
     - Action: $x_t \in \{1, 2, \dots, K\}$
     - Success probabilities: $(\theta_1, \dots, \theta_K)$, where $\theta_k \in [0, 1]$
     - Observation:
       $$y_t = \begin{cases} 1 & \text{w.p. } \theta_k \\ 0 & \text{otherwise} \end{cases}$$
     - Reward: $r_t(y_t) = y_t$
     - Prior belief: $\theta_k \sim \text{Beta}(\alpha_k, \beta_k)$

  49. Algorithms
     The data observed up to time $t$: $H_t = \{(x_1, y_1), \dots, (x_{t-1}, y_{t-1})\}$.
     - Greedy: $\hat\theta = \mathbb{E}[\theta \mid H_{t-1}]$, then $x_t = \arg\max_k \hat\theta_k$.
     - $\epsilon$-Greedy: $\hat\theta = \mathbb{E}[\theta \mid H_{t-1}]$, then
       $$x_t = \begin{cases} \arg\max_k \hat\theta_k & \text{w.p. } 1 - \epsilon \\ \text{unif}(\{1, \dots, K\}) & \text{otherwise} \end{cases}$$
     - Thompson Sampling: $\hat\theta_k$ is sampled from $P(\theta_k \mid H_{t-1})$, then $x_t = \arg\max_k \hat\theta_k$.

  50. Computing Posteriors with the Bernoulli Bandit
     - Prior belief: $\theta_k \sim \text{Beta}(\alpha_k, \beta_k)$.
     - At each time period, apply action $x_t$; the reward $r_t \in \{0, 1\}$ is generated with success probability $P(r_t = 1 \mid x_t, \theta) = \theta_{x_t}$.
     - Update the distribution according to Bayes' rule.
     - Due to the conjugacy property of the beta distribution, we have:
       $$(\alpha_k, \beta_k) \leftarrow \begin{cases} (\alpha_k, \beta_k) & \text{if } x_t \ne k \\ (\alpha_k, \beta_k) + (r_t, 1 - r_t) & \text{if } x_t = k \end{cases}$$
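A sketch of Thompson sampling with this conjugate update on the three-armed Bernoulli bandit used in the figures ($\theta = (0.9, 0.8, 0.7)$); the uniform Beta(1, 1) priors, seed, and horizon are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([0.9, 0.8, 0.7])   # unknown success probabilities
alpha = np.ones(3)                       # Beta(1, 1) priors
beta = np.ones(3)

for t in range(1000):
    theta_hat = rng.beta(alpha, beta)    # one posterior sample per arm
    x = int(np.argmax(theta_hat))        # Thompson sampling action
    r = rng.binomial(1, theta_true[x])   # observe the Bernoulli reward
    alpha[x] += r                        # conjugate Beta update
    beta[x] += 1 - r
```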

  51. Side-by-Side Comparison
     [Figure omitted.]

  52. Performance Comparison
     Figure: probability that (a) the greedy algorithm and (b) Thompson sampling select each action, with $\theta_1 = 0.9$, $\theta_2 = 0.8$, $\theta_3 = 0.7$.

  53. Performance Comparison
     Figure: regret from applying the greedy and Thompson sampling algorithms to the three-armed Bernoulli bandit, for (a) $\theta = (0.9, 0.8, 0.7)$ and (b) averaged over random $\theta$.

  54. Online Shortest Path
     Figure: the shortest path problem.

  55. Online Shortest Path: Independent Travel Times
     Given a graph $G = (V, E, v_s, v_d)$, where $v_s, v_d \in V$, we have:
     - Mean travel time: $\theta_e$ for $e \in E$.
     - Action: $x_t = (e_1, \dots, e_M)$, a path from $v_s$ to $v_d$.
     - Observation: $(y_{t,e_1} \mid \theta_{e_1}, \dots, y_{t,e_M} \mid \theta_{e_M})$ are independent, where $\ln(y_{t,e}) \mid \theta_e \sim N\!\left(\ln\theta_e - \frac{\tilde\sigma^2}{2},\, \tilde\sigma^2\right)$, so that $\mathbb{E}[y_{t,e} \mid \theta_e] = \theta_e$.
     - Reward: $r_t = -\sum_{e \in x_t} y_{t,e}$.
     - Prior belief: $\ln(\theta_e) \sim N(\mu_e, \sigma_e^2)$, also independent.

  56. Online Shortest Path: Independent Travel Times
     - Each iteration $t$ starts with posterior parameters $(\mu_e, \sigma_e^2)$ for each $e \in E$.
     - Greedy algorithm: $\hat\theta_e = \mathbb{E}_p[\theta_e] = e^{\mu_e + \sigma_e^2 / 2}$.
     - Thompson sampling: draw $\hat\theta_e \sim \text{logNormal}(\mu_e, \sigma_e^2)$.
     - Pick an action $x$ to maximize $\mathbb{E}_{q_{\hat\theta}}[r(y_t) \mid x_t = x] = -\sum_{e \in x_t} \hat\theta_e$; this can be solved via Dijkstra's algorithm.
     - Observe $y_{t,e}$ and update the parameters:
       $$(\mu_e, \sigma_e^2) \leftarrow \left(\frac{\frac{\mu_e}{\sigma_e^2} + \frac{\ln y_{t,e} + \tilde\sigma^2/2}{\tilde\sigma^2}}{\frac{1}{\sigma_e^2} + \frac{1}{\tilde\sigma^2}},\; \frac{1}{\frac{1}{\sigma_e^2} + \frac{1}{\tilde\sigma^2}}\right)$$
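A sketch of this per-edge conjugate update; the function and argument names are assumed, with `var_noise` playing the role of $\tilde\sigma^2$:

```python
import numpy as np

def lognormal_edge_update(mu_e, var_e, y, var_noise):
    """Update the posterior N(mu_e, var_e) over ln(theta_e) after observing
    travel time y on edge e; under conjugacy, precisions add."""
    prec = 1.0 / var_e + 1.0 / var_noise
    mu_new = (mu_e / var_e + (np.log(y) + var_noise / 2.0) / var_noise) / prec
    return mu_new, 1.0 / prec
```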

  57. Binomial Bridge
     - Apply the above algorithm to a binomial bridge with six stages and 184,756 paths.
     - $\mu_e = -\frac{1}{2}$, $\sigma_e^2 = 1$ so that $\mathbb{E}[\theta_e] = 1$ for each $e \in E$, and $\tilde\sigma^2 = 1$.
     Figure: a binomial bridge with six stages.

  58. Figure: performance of Thompson sampling and $\epsilon$-greedy algorithms in the shortest path problem, showing (a) regret and (b) total travel time relative to optimal.

  59. Online Shortest Path: Correlated Travel Times
     - Independent prior: $\theta_e \sim \text{logNormal}(\mu_e, \sigma_e^2)$.
     - $y_{t,e} = \zeta_{t,e}\, \eta_t\, \nu_{t,\ell(e)}\, \theta_e$, where:
       - $\zeta_{t,e}$ is an idiosyncratic factor associated with edge $e$ (road construction or closure, an accident, etc.);
       - $\eta_t$ is a factor common to all edges (weather, etc.);
       - $\ell(e)$ indicates whether edge $e$ resides in the lower half of the bridge;
       - $\nu_{t,0}, \nu_{t,1}$ are factors bearing a common influence on the edges in the upper or lower halves (signal problems).

  60. Online Shortest Path: Correlated Travel Times
     - Prior setup: take $\zeta_{t,e}, \eta_t, \nu_{t,1}, \nu_{t,0}$ to be independent $\text{logNormal}(-\tilde\sigma^2/6,\; \tilde\sigma^2/3)$.
     - We only need to estimate $\theta_e$; the marginal $y_{t,e} \mid \theta$ is the same as in the independent case, but the joint distribution over $y_t \mid \theta$ differs.
     - Correlated observations induce dependencies in the posterior, even though the mean travel times are independent.

  61. Online Shortest Path: Correlated Travel Times
     - Let $\phi, z_t \in \mathbb{R}^N$ be defined by $\phi_e = \ln\theta_e$ and
       $$z_{t,e} = \begin{cases} \ln y_{t,e} & \text{if } e \in x_t \\ 0 & \text{otherwise} \end{cases}$$
     - Define a $|x_t| \times |x_t|$ covariance matrix $\tilde\Sigma$ with elements
       $$\tilde\Sigma_{e,e'} = \begin{cases} \tilde\sigma^2 & \text{for } e = e' \\ 2\tilde\sigma^2/3 & \text{for } e \ne e',\; \ell(e) = \ell(e') \\ \tilde\sigma^2/3 & \text{otherwise} \end{cases}$$
       for $e, e' \in x_t$, and an $N \times N$ concentration matrix
       $$C_{e,e'} = \begin{cases} \tilde\Sigma^{-1}_{e,e'} & \text{if } e, e' \in x_t \\ 0 & \text{otherwise} \end{cases}$$

  62. Online Shortest Path: Correlated Travel Times
     - Apply Thompson sampling: at each iteration $t$, sample a vector $\hat\phi$ from $N(\mu, \Sigma)$, then set $\hat\theta_e = e^{\hat\phi_e}$ for each $e \in E$.
     - An action $x$ is selected to maximize $\mathbb{E}_{q_{\hat\theta}}[r(y_t) \mid x_t = x] = -\sum_{e \in x_t} \hat\theta_e$, using Dijkstra's algorithm or an alternative.
     - The posterior distribution of $\phi$ is normal, with a mean vector $\mu$ and covariance matrix $\Sigma$ that can be updated according to
       $$(\mu, \Sigma) \leftarrow \left(\left(\Sigma^{-1} + C\right)^{-1}\left(\Sigma^{-1}\mu + C z_t\right),\; \left(\Sigma^{-1} + C\right)^{-1}\right)$$
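The joint update is a standard Gaussian precision-weighted combination; a minimal sketch, using dense matrix inverses (fine for the bridge's modest $N$):

```python
import numpy as np

def correlated_posterior_update(mu, Sigma, C, z):
    """Posterior update for phi = ln(theta): precision matrices add, and the
    new mean is the precision-weighted combination of prior mean and data."""
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_new = np.linalg.inv(Sigma_inv + C)
    mu_new = Sigma_new @ (Sigma_inv @ mu + C @ z)
    return mu_new, Sigma_new
```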

  63. Figure: performance of two versions of Thompson sampling in the shortest path problem with correlated travel times, showing (a) regret and (b) total travel time relative to optimal.

  64. Approximations of Posterior Sampling for Complex Models
     - Gibbs sampling
     - Langevin Monte Carlo
     - Sampling from a Laplace approximation
     - Bootstrapping

  65. Gibbs Sampling
     - History: $H_{t-1} = ((x_1, y_1), \dots, (x_{t-1}, y_{t-1}))$.
     - Start with an initial guess $\hat\theta^0$.
     - At each iteration $n$, sample each $k$th component according to
       $$\hat\theta^n_k \sim f^{n,k}_{t-1}(\theta_k), \qquad f^{n,k}_{t-1}(\theta_k) \propto f_{t-1}\big((\hat\theta^n_1, \dots, \hat\theta^n_{k-1}, \theta_k, \hat\theta^{n-1}_{k+1}, \dots, \hat\theta^{n-1}_K)\big)$$
     - After $N$ iterations, $\hat\theta^N$ is taken to be the approximate posterior sample.

  66. Langevin Monte Carlo
     - Let $g$ be the posterior density.
     - Euler method for simulating Langevin dynamics:
       $$\theta_{n+1} = \theta_n + \epsilon \nabla \ln g(\theta_n) + \sqrt{2\epsilon}\, W_n, \qquad n \in \mathbb{N}$$
     - $W_1, W_2, \dots$ are i.i.d. standard normal random variables and $\epsilon > 0$ is a small step size.
     - Stochastic gradient Langevin Monte Carlo: use sampled minibatches of data to compute an approximate gradient.
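A sketch of the Euler discretization above; `grad_log_g` is a hypothetical callable returning $\nabla \ln g(\theta)$, and the step size and iteration count are arbitrary:

```python
import numpy as np

def langevin_sample(grad_log_g, theta0, eps=1e-3, n_steps=5000, seed=0):
    """Simulate Langevin dynamics; theta_n approaches a sample from g."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta + eps * grad_log_g(theta) + np.sqrt(2.0 * eps) * noise
    return theta
```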

  67. Sampling from a Laplace Approximation
     - Assume the posterior $g$ is unimodal and its log density $\ln g(\theta)$ is strictly concave around its mode $\bar\theta$.
     - A second-order Taylor approximation to the log density gives
       $$\ln g(\theta) \approx \ln g(\bar\theta) - \frac{1}{2}(\theta - \bar\theta)^\top C\, (\theta - \bar\theta), \qquad C = -\nabla^2 \ln g(\bar\theta)$$
     - This yields an approximation to the density $g$ using a Gaussian distribution with mean $\bar\theta$ and covariance $C^{-1}$:
       $$\tilde g(\theta) = \sqrt{|C / 2\pi|}\; e^{-\frac{1}{2}(\theta - \bar\theta)^\top C\, (\theta - \bar\theta)}$$

  68. Bootstrapping
     - History: $H_{t-1} = ((x_1, y_1), \dots, (x_{t-1}, y_{t-1}))$.
     - Uniformly sample with replacement from $H_{t-1}$ to form a hypothetical history $\hat H_{t-1} = ((\hat x_1, \hat y_1), \dots, (\hat x_{t-1}, \hat y_{t-1}))$.
     - Maximize the likelihood of $\theta$ under $\hat H_{t-1}$.

  69. Bernoulli Bandit
     Figure: regret of approximation methods versus exact Thompson sampling (Bernoulli bandit).

  70. Online Shortest Path
     Figure: regret of approximation methods versus exact Thompson sampling (online shortest path).

  71. Practical Modeling Considerations
     - Prior distribution specification
     - Constraints and context
     - Nonstationary systems

  72. Prior Distribution Specification
     - A prior is a distribution over plausible values.
     - Misspecified priors vs. informative priors.
     - A thoughtful choice of prior based on past experience can improve learning performance.
