SLIDE 1

Thompson Sampling on Symmetric α-Stable Bandits

Abhimanyu Dubey and Alex Pentland

Massachusetts Institute of Technology dubeya@mit.edu

IJCAI 2019 August 14, 2019

Abhimanyu Dubey (MIT) Thompson Sampling on α-Stable Bandits IJCAI 2019 August 14, 2019 1 / 14

SLIDE 2

Introduction

Multi-Armed Bandits

Figure: Bernoulli bandit problem.

SLIDE 3

Introduction

Stochastic Multi-Armed Bandits

K actions (“arms”) that return rewards rk sampled i.i.d. from K different distributions, each with mean µk. The problem proceeds in rounds; at each round t, the agent chooses action a(t) and obtains a randomly drawn reward ra(t)(t) from the corresponding distribution.

The goal is to minimize the regret, where nk(T) denotes the number of times arm k is pulled up to round T:

R(T) = T · µ∗ − Σ_{k∈[K]} µk E[nk(T)] = Σ_{k∈[K]} (µ∗ − µk) E[nk(T)]

The first term is the best possible average reward, the second term is the average reward obtained, and each summand (µ∗ − µk) E[nk(T)] is the average “loss” incurred by suboptimal decisions.
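The regret decomposition can be evaluated directly from arm means and pull counts; the sketch below uses made-up values for illustration (the function name and numbers are not from the talk).

```python
# Hypothetical illustration of the regret decomposition
# R(T) = sum_k (mu_star - mu_k) * E[n_k(T)]; all numbers are made up.

def regret(means, pulls):
    """Regret of a pull-count profile against the best arm's mean."""
    mu_star = max(means)
    return sum((mu_star - mu_k) * n_k for mu_k, n_k in zip(means, pulls))

# Three arms with means 0.9, 0.5, 0.2; T = 100 rounds split as 80/15/5 pulls.
# Regret = (0.9 - 0.5)*15 + (0.9 - 0.2)*5 = 6 + 3.5 = 9.5
total_regret = regret([0.9, 0.5, 0.2], [80, 15, 5])
```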

SLIDE 4

Introduction

Some Intuition

Exploration: An agent can choose different arms to obtain a better estimate of their average rewards.
Exploitation: An agent can repeatedly choose the arm it believes to be optimal.
Priors: The agent may have prior information about the reward distributions.
The central dilemma is to balance exploration and exploitation, and to efficiently use prior information (if available).

SLIDE 5

Thompson Sampling

Thompson Sampling

Earliest heuristic for the problem [Tho33]; uses a Bayesian approach.
Assume a prior distribution over the reward parameters, θk ∼ p(·|ηk).
For t ∈ [T], sample parameters for each arm from the posterior: θ̂k ∼ p(θk | ηk, rk(1), rk(2), ...).
Choose the action that maximizes the mean given the posterior samples. If µk = fk(θ̂k) for some function fk, then:

a(t) = arg max_{k∈[K]} µk = arg max_{k∈[K]} fk(θ̂k)

Update the posterior for the chosen arm with the reward received.
Performs very well in practice [CL11], with continued theoretical interest in its optimality [AG13, AG12].
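For intuition, here is a minimal sketch of the sample-act-update loop above in the classic Beta-Bernoulli conjugate case (as in the Bernoulli bandit of Slide 2); the arm means, horizon, and seed are illustrative choices, not the talk's setup.

```python
import random

def thompson_sampling(true_means, T, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns pull counts per arm."""
    rng = random.Random(seed)
    K = len(true_means)
    a = [1.0] * K  # Beta(1, 1) prior: success pseudo-counts
    b = [1.0] * K  # Beta(1, 1) prior: failure pseudo-counts
    pulls = [0] * K
    for _ in range(T):
        # Sample a plausible mean from each arm's posterior ...
        theta = [rng.betavariate(a[k], b[k]) for k in range(K)]
        # ... and play the arm whose sample is largest.
        k = max(range(K), key=lambda i: theta[i])
        r = 1 if rng.random() < true_means[k] else 0
        # Conjugate Beta posterior update for the chosen arm only.
        a[k] += r
        b[k] += 1 - r
        pulls[k] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.5, 0.8], T=2000)  # arm 2 should dominate
```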

SLIDE 6

Heavy-Tailed Distributions

Most of the existing analysis of the problem assumes well-behaved rewards:
  • bounded support (e.g. rewards are in [0, 1])
  • sub-Gaussian tails (decaying faster than a Gaussian’s)
There is evidence, however, that real-world data exhibit very heavy tails:
  • stock prices [Nol03]
  • presence times in online networks [VOD+06]
  • labels in social media [MGR+18]
We want to design machine learning algorithms that are robust to heavy tails and provide more accurate decision-making in real-world scenarios.

SLIDE 7

α-Stable Distributions

α-Stable distributions comprise all continuous distributions that are closed under linear combinations: if X and Y are independent copies of a stable random variable, then aX + bY is also stable (up to location and scale), e.g. the Gaussian, Lévy, and Cauchy distributions.
They do not (in general) admit an analytical density function.
They do not admit moments of order higher than α, where α ≤ 2, i.e. they have infinite variance and are heavy-tailed (except for α = 2).
The empirical mean exhibits polynomial deviations, i.e. we cannot use typical Chernoff bounds (the MGF does not exist).
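The polynomial tail decay is easy to see numerically in the one symmetric member of the family with a closed-form CDF, the standard Cauchy (α = 1). This sketch (arbitrary seed, sample size, and threshold) draws Cauchy samples by inverse-transform sampling and checks the empirical tail mass against the exact value P(|X| > t) = (2/π) arctan(1/t) ≈ 2/(πt), a polynomial rather than exponential decay.

```python
import math
import random

# Standard Cauchy (the alpha = 1 stable law) has CDF F(x) = 1/2 + arctan(x)/pi,
# so inverse-transform sampling gives x = tan(pi * (u - 1/2)) for u ~ Uniform(0, 1).
def sample_cauchy(rng):
    return math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(42)
n, t = 100_000, 10.0
xs = [sample_cauchy(rng) for _ in range(n)]

# Empirical tail mass P(|X| > 10) vs. the exact polynomial tail.
tail_hat = sum(abs(x) > t for x in xs) / n
tail_exact = (2 / math.pi) * math.atan(1 / t)  # ~0.063; a Gaussian tail at 10 sigma is ~1e-23
```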

SLIDE 8

This Work

Problem Setting: We are given a K-armed stochastic bandit problem where rewards are drawn i.i.d. from symmetric α-stable distributions with α ∈ (1, 2).

Contributions:
  • An efficient algorithm for approximate Bayesian inference under α-stable rewards.
  • Finite-time regret bounds for naive posterior sampling using the empirical mean.
  • Near-optimal regret bounds for posterior sampling using a robust mean estimator.

SLIDE 9

Our Approach: Algorithm

Symmetric α-stable distributions can be represented as scale mixtures of normals, i.e. as a continuous mixture of a Gaussian density with a known α-stable mixing density:

pX(x|µ) = ∫_0^∞ N(x | µ, λσ²) · pΛ(λ) dλ

where pX is the (intractable) α-stable density and pΛ is a known α-stable density over the auxiliary scale variable λ. This implies that, given a sample of the auxiliary variable λ, the conditional density of rewards is Gaussian:

p(x | λ, µ) = N(x | µ, λσ²)

Thus we can use a conjugate Gaussian prior µ ∼ N(µ0, η²) on the mean reward, and the conditional posterior satisfies p(µ | x, λ) ∝ N(x | µ, λσ²) · N(µ | µ0, η²).

SLIDE 10

α-Thompson Sampling

Input: Arms k ∈ [K], priors N(µ0k, σ²) for each arm.
Set Dk = 1, Nk = 0 for each arm k.
For each iteration t ∈ [1, T]:
  • Draw µ̄k(t) ∼ N((µ0k + Nk)/Dk, σ²/Dk) for each arm k.
  • Choose arm At = arg max_{k∈[K]} µ̄k(t), and get reward rt.
  • Draw λk(t) using a rejection sampler.
  • For the chosen arm, set Dk = Dk + 1/λk(t), Nk = Nk + rt/λk(t).
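A minimal runnable sketch of the bookkeeping above, under a loud assumption: the talk's rejection sampler for the auxiliary variable is not reproduced, and `sample_lambda` is a hypothetical stand-in that returns 1, which collapses the method to ordinary Gaussian Thompson Sampling. Only the structure of the Dk/Nk updates is illustrated.

```python
import random

def sample_lambda(rng):
    # Hypothetical stand-in for the talk's rejection sampler over the
    # auxiliary scale lambda; returning 1 gives plain Gaussian Thompson Sampling.
    return 1.0

def alpha_ts(reward_fns, T, sigma=1.0, mu0=0.0, seed=0):
    rng = random.Random(seed)
    K = len(reward_fns)
    D = [1.0] * K   # accumulated precision D_k
    N = [0.0] * K   # accumulated weighted reward N_k
    pulls = [0] * K
    for _ in range(T):
        # Draw a posterior mean sample mu_bar_k ~ N((mu0 + N_k)/D_k, sigma^2/D_k).
        mu_bar = [rng.gauss((mu0 + N[k]) / D[k], sigma / D[k] ** 0.5) for k in range(K)]
        a = max(range(K), key=lambda k: mu_bar[k])
        r = reward_fns[a](rng)
        lam = sample_lambda(rng)
        # Conditionally-Gaussian posterior update for the chosen arm only.
        D[a] += 1.0 / lam
        N[a] += r / lam
        pulls[a] += 1
    return pulls

# Two Gaussian arms with means 0.0 and 1.0 (illustrative rewards).
pulls = alpha_ts([lambda g: g.gauss(0.0, 1.0), lambda g: g.gauss(1.0, 1.0)], T=2000)
```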

SLIDE 11

Our Approach: Analysis

α-stable distributions do not admit an analytic probability density. We therefore work with the characteristic function (the Fourier transform of the density):

φ(z) = ∫_{−∞}^{∞} p(x) e^{−2πizx} dx

We derive concentration results that bound how fast the empirical mean converges to the true mean under α-stable distributions. These results are of independent interest in robust machine learning theory, and are critical for our regret analysis. Using the concentration results, we bound the Bayes regret of our algorithm via an upper-confidence-bound decomposition.

SLIDE 12

Our Contributions

α-Thompson Sampling: We use an auxiliary variable to obtain an efficient algorithm for posterior updates, and subsequently perform Thompson Sampling when rewards come from α-stable distributions. We obtain the first problem-independent bound, of order O(K^{1/(1+ε)} · T^{(1+ε)/(1+2ε)}), on the finite-time Bayes regret of this algorithm.

Robust α-Thompson Sampling: Using a robust mean estimator, we propose a version of Thompson Sampling for α-stable bandits that incurs a tight Bayesian regret of Õ((KT)^{1/(1+ε)}), matching the lower bound up to logarithmic factors.

SLIDE 13

References

Shipra Agrawal and Navin Goyal, Analysis of Thompson sampling for the multi-armed bandit problem, Conference on Learning Theory, 2012.

Shipra Agrawal and Navin Goyal, Further optimal regret bounds for Thompson sampling, Artificial Intelligence and Statistics, 2013, pp. 99–107.

Olivier Chapelle and Lihong Li, An empirical evaluation of Thompson sampling, Advances in Neural Information Processing Systems, 2011, pp. 2249–2257.

Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten, Exploring the limits of weakly supervised pretraining, European Conference on Computer Vision (ECCV), September 2018.

John P. Nolan, Modeling financial data with stable distributions, Handbook of Heavy Tailed Distributions in Finance, Elsevier, 2003, pp. 105–130.

William R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika 25 (1933), no. 3/4, 285–294.

Alexei Vázquez, João Gama Oliveira, Zoltán Dezső, Kwang-Il Goh, Imre Kondor, and Albert-László Barabási, Modeling bursts and heavy tails in human dynamics, Physical Review E 73 (2006), no. 3, 036127.

SLIDE 14

The End
