



  1. Multi-armed bandits. S. Bubeck, N. Cesa-Bianchi, Foundations and Trends in Machine Learning, 2012. * Real title: Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems.

  2. Overview • Stochastic setting, adversarial setting, extensions & connections. • Applications: clinical trials, ad placement, packet routing, video games. • Flexible theoretical framework with rigorous guarantees. • Feedback is only partially observable. • This is not supervised learning!

  3. Multi-armed bandit setting • Casino with K slot machines • T rounds, timesteps t = 1, …, T • For t = 1, …, T we choose a slot machine / arm k to pull / ad to show • Stochastic: if we pull arm k at timestep t, we receive (only) the reward r_{t,k}, sampled i.i.d. from the distribution of arm k • For example normal: N(μ_k, 1) • Generally, we don't know the distributions, or we may only know their class.
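
The setting above fits in a few lines of code. Below is a minimal sketch of a stochastic bandit environment with Gaussian arms N(μ_k, 1); the class name GaussianBandit and its interface are illustrative choices, not taken from the slides or the survey.

```python
import numpy as np

class GaussianBandit:
    """Stochastic K-armed bandit with rewards r_{t,k} ~ N(mu_k, 1)."""

    def __init__(self, means, rng=None):
        self.means = np.asarray(means, dtype=float)  # unknown to the learner
        self.K = len(self.means)
        self.rng = rng if rng is not None else np.random.default_rng()

    def pull(self, k):
        # Bandit feedback: only the reward of the pulled arm is observed.
        return self.rng.normal(self.means[k], 1.0)

# Example: a 3-armed bandit; the learner never sees these means.
bandit = GaussianBandit(means=[0.1, 0.5, 0.3])
print(bandit.pull(1))
```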

  4. Multi-armed bandit setting • We want to maximize our gains. • Say our algorithm chooses machine A_t in round t • Gain: G_A = Σ_{t=1}^{T} r_{t,A_t} • Performance measure: regret. • Regret compares our performance to the best fixed action. • Always playing the best arm (with mean μ* = max_k μ_k) gives expected gain T·μ* • Regret: R = T·μ* − G_A. Low regret = high gain = good.
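
As a small illustration of the regret definition R = T·μ* − G_A, the sketch below runs an arbitrary policy for T rounds and compares its gain to always playing the best arm. It assumes the GaussianBandit class from the previous sketch; the uniformly random policy is purely a placeholder.

```python
import numpy as np

def run_policy(bandit, choose_arm, T):
    """Play for T rounds; return the realized gain G_A and the regret T*mu* - G_A."""
    gain = 0.0
    for t in range(T):
        gain += bandit.pull(choose_arm(t))
    mu_star = bandit.means.max()      # mean of the best fixed arm
    return gain, T * mu_star - gain   # regret R = T*mu* - G_A

# Placeholder policy: pick arms uniformly at random.
rng = np.random.default_rng(0)
bandit = GaussianBandit(means=[0.1, 0.5, 0.3])
gain, regret = run_policy(bandit, lambda t: int(rng.integers(bandit.K)), T=1000)
```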

  5. Stochastic setting • Naïve greedy strategy: try all arms once (explore), then keep pulling the arm that looked best (exploit). • Why can this fail? We may get unlucky and observe misleadingly low samples (the blue samples in the slide's figure) in the first exploration round. • Balance exploration & exploitation • Exploration: try each arm enough to be certain you have a 'good' one • Exploitation: pull the good arm often enough to maximize your reward • Algorithms: UCB, Thompson Sampling. Optimism in the face of uncertainty.
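
A minimal sketch of the UCB1 rule (optimism in the face of uncertainty), assuming the GaussianBandit-style interface from the earlier sketch; the exploration constant c and the variable names are illustrative choices.

```python
import numpy as np

def ucb1(bandit, T, c=2.0):
    """UCB1: play the arm maximizing empirical mean + exploration bonus."""
    K = bandit.K
    counts = np.zeros(K)   # number of pulls per arm
    sums = np.zeros(K)     # sum of observed rewards per arm
    gain = 0.0
    for t in range(T):
        if t < K:
            k = t          # exploration: try every arm once
        else:
            bonus = np.sqrt(c * np.log(t) / counts)     # shrinks as an arm is pulled more
            k = int(np.argmax(sums / counts + bonus))   # optimistic index
        r = bandit.pull(k)
        counts[k] += 1
        sums[k] += r
        gain += r
    return gain
```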

  6. Adversarial setting • Problem with the stochastic setting: we assume the distributions of the arms are fixed. • For example, in the advertising application the reward distribution can change over time. • Adversary chooses the rewards: no assumptions are made about the rewards r_{t,k} ∈ [0,1]. • If we (deterministically) choose arm k at time t, the adversary can set r_{t,k} very low. • Solve by randomization! For each t, choose a distribution p_t over all K arms. • By surprising the adversary, we can still get low regret. • Algorithms: EXP3, EXP3.P, FPL. • Pessimistic / 'play it safe' / always spread your chances.
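
A compact sketch of EXP3 (in its loss-based form) for the adversarial setting with rewards in [0,1]; the learning-rate choice and the reward_fn interface are assumptions made here for illustration, not taken verbatim from the survey.

```python
import numpy as np

def exp3(reward_fn, K, T, eta=None):
    """EXP3: exponential weights on importance-weighted loss estimates.

    reward_fn(t, k) returns the adversarially chosen reward r_{t,k} in [0,1]
    of the single arm k that was actually pulled.
    """
    eta = eta if eta is not None else np.sqrt(2.0 * np.log(K) / (T * K))
    rng = np.random.default_rng()
    weights = np.ones(K)
    gain = 0.0
    for t in range(T):
        p = weights / weights.sum()        # distribution p_t over the K arms
        k = rng.choice(K, p=p)             # randomize to keep the adversary guessing
        r = reward_fn(t, k)
        gain += r
        loss_est = (1.0 - r) / p[k]        # unbiased estimate of the loss of arm k
        weights[k] *= np.exp(-eta * loss_est)
    return gain
```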

  7. Extensions & Connections • Contextual bandit: use 'side information'. • In each round we receive information about the user (in the advertising example). • Naïve solution: run a separate bandit for each user category (cluster users first). • More closely related to supervised learning, since we can use 'features'. • Non-stationary bandit: distributions change slowly, instead of rewards being determined by an adversary. • The adversarial assumption may be too pessimistic, but i.i.d. may be too optimistic. • Connection to reinforcement learning: an MDP with only one state. • Full-information setting: we observe the rewards of all arms in all rounds. • Online learning / Hedge algorithm / exponential weights algorithm, EXP2.
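
For contrast with the bandit case, here is a minimal sketch of the Hedge / exponential-weights update in the full-information setting, where the losses of all arms are revealed every round; the loss_matrix interface and the learning-rate choice are assumptions for illustration.

```python
import numpy as np

def hedge(loss_matrix, eta=None):
    """Hedge: full-information exponential weights.

    loss_matrix has shape (T, K): losses in [0,1] of every arm in every round,
    all of which are observed (unlike the bandit setting).
    """
    T, K = loss_matrix.shape
    eta = eta if eta is not None else np.sqrt(2.0 * np.log(K) / T)
    weights = np.ones(K)
    total_loss = 0.0
    for t in range(T):
        p = weights / weights.sum()               # play distribution p_t
        total_loss += p @ loss_matrix[t]          # expected loss this round
        weights *= np.exp(-eta * loss_matrix[t])  # update using all observed losses
    return total_loss
```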
