SLIDE 1

Multi-armed bandits

S. Bubeck, N. Cesa-Bianchi, Foundations and Trends in Machine Learning, 2012

* Real title: regret analysis of stochastic and nonstochastic multi-armed bandit problems

SLIDE 2

Overview

  • Stochastic, adversarial, extensions & connections.
  • Applications:
  • Clinical trails, Ad placement, Package routing, Video games
  • Flexible theoretical framework with rigorous guarantees
  • Feedback is partially observable
  • This is not supervised learning!
SLIDE 3

Multi-armed bandit setting

  • Casino, 𝐿 slot machines
  • π‘ˆ rounds, timesteps 𝑒 = 1, … , π‘ˆ
  • For 𝑒 = 1, … π‘ˆ we choose a slot machine / arm to pull / ad to show
  • If we pull arm 𝑙 at timestep t, we receive (only) reward 𝑠

𝑒 𝑙

  • Stochastic: reward 𝑠

𝑒 𝑙 i.i.d. sampled from distribution of arm 𝑙

  • For example normal: 𝑂 πœˆπ‘™, 1
  • Generally, we don’t know the distribution, or we may only know the class.

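To make the protocol concrete, here is a minimal Python sketch of the stochastic setting above. The arm means MUS, the values K = 3 and T = 1000, and the uniformly random placeholder policy are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of a stochastic bandit environment (illustrative values).
import random

MUS = [0.2, 0.5, 0.8]   # hypothetical (unknown to the learner) arm means mu_i
K = len(MUS)            # number of arms
T = 1000                # number of rounds

def pull(i: int) -> float:
    """Pull arm i and receive one i.i.d. reward X_{i,t} ~ N(mu_i, 1)."""
    return random.gauss(MUS[i], 1.0)

for t in range(T):
    i = random.randrange(K)   # placeholder policy: pull a uniformly random arm
    x = pull(i)               # only the reward of the pulled arm is observed
```

The learner never sees the rewards of the arms it did not pull; this is the partial feedback that distinguishes bandits from supervised learning.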

SLIDE 4

Multi-armed bandit setting

  • We want to maximize our gains.
  • Say our algorithm chooses machine 𝐡𝑒 in round 𝑒
  • Gain: 𝐻𝐡 = σ𝑒=1

π‘ˆ

𝑠

𝑒 𝐡𝑒

  • Performance measure: regret.
  • Regret compares our performance to best fixed action.
  • Always play best arm οƒ  in expectation gain is π‘ˆπœˆβˆ—
  • Regret: 𝑆 = π‘ˆπœˆβˆ— βˆ’ 𝐻𝐡. Low regret = High gain = Good.

[Figure: arm reward distributions; ΞΌ* marks the best action]
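Below is a minimal sketch of this gain/regret bookkeeping, reusing the hypothetical Gaussian arms and the random placeholder policy from the previous sketch (all constants are illustrative).

```python
# Minimal sketch of the gain G and regret R defined on this slide.
import random

MUS = [0.2, 0.5, 0.8]            # hypothetical (unknown) arm means
K, T = len(MUS), 1000
mu_star = max(MUS)               # mu*: mean reward of the best arm

gain = 0.0                       # G = sum over t of X_{I_t, t}
for t in range(T):
    I_t = random.randrange(K)    # arm chosen by our (placeholder) algorithm
    gain += random.gauss(MUS[I_t], 1.0)   # observed reward X_{I_t, t}

regret = T * mu_star - gain      # R = T*mu* - G
print(f"gain G = {gain:.1f}, regret R = {regret:.1f}")
```

With the uniform placeholder policy the regret grows linearly in T; the point of the algorithms on the next slides is to keep it sublinear.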

SLIDE 5

Stochastic setting

  • NaΓ―ve greedy strategy:
  • Try all arms once (explore)
  • Afterward, continue pulling best arm (exploit)
  • Why can this fail?
  • We may get unlucky and observe the blue samples in the first round
  • Balance exploration & exploitation
  • Exploration: try enough arms to be certain you have a β€˜good’ one
  • Exploitation: pull the good arm enough to maximize your reward
  • Algorithms: UCB, Thompson Sampling. Optimism in face of uncertainty.

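As an illustration of optimism in the face of uncertainty, here is a minimal UCB1 sketch. The slides name UCB in general, not this exact index or these constants, and the Gaussian environment is the same illustrative assumption as before.

```python
# Minimal UCB1 sketch: empirical mean plus a confidence bonus per arm.
import math
import random

MUS = [0.2, 0.5, 0.8]            # hypothetical (unknown) arm means
K, T = len(MUS), 1000

counts = [0] * K                 # n_i: number of times arm i was pulled
sums = [0.0] * K                 # running sum of rewards observed from arm i

for t in range(1, T + 1):
    if t <= K:
        i = t - 1                # explore: try every arm once
    else:
        # Pull the arm with the highest optimistic index:
        # empirical mean + sqrt(2 ln t / n_i).
        i = max(range(K),
                key=lambda a: sums[a] / counts[a]
                              + math.sqrt(2.0 * math.log(t) / counts[a]))
    x = random.gauss(MUS[i], 1.0)   # observe the chosen arm's reward only
    counts[i] += 1
    sums[i] += x
```

The confidence bonus shrinks as an arm is pulled more often, so rarely tried arms keep getting revisited: exactly the exploration/exploitation balance described above.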

SLIDE 6

Adversarial setting

  • Problem with stochastic setting:
  • We assume the distribution of the arms are fixed.
  • For example for the advertisements, the reward distribution can change over time.
  • Adversary chooses rewards: no assumptions made about rewards 𝑠

𝑙 𝑒 ∈ [0,1]

  • If we choose arm k at time t, adversary can set 𝑠

𝑙 𝑒 very low.

  • Solve by randomization! For each t, choose distribution π‘žπ‘’ over all k arms.
  • By surprising adversary, we can still get low regret.
  • Algorithms: EXP3, EXP3.P, FPL.
  • Pessimistic / β€˜play it safe’ / always spread your chances.

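Here is a minimal EXP3 sketch for the adversarial setting. Only the general scheme (a randomized distribution p_t, exponential weights, and importance-weighted reward estimates) comes from the slide; the exploration parameter gamma, the horizon, and the toy adversarial_reward function are illustrative assumptions.

```python
# Minimal EXP3 sketch for adversarial rewards X_{i,t} in [0, 1].
import math
import random

K, T = 3, 1000
gamma = 0.1                      # exploration parameter (illustrative value)
weights = [1.0] * K

def adversarial_reward(i: int, t: int) -> float:
    """Hypothetical stand-in for whatever reward the adversary fixed for arm i."""
    return 1.0 if (t + i) % K == 0 else 0.0

for t in range(T):
    total = sum(weights)
    # Mix the weight distribution with uniform exploration to get p_t.
    probs = [(1 - gamma) * w / total + gamma / K for w in weights]
    i = random.choices(range(K), weights=probs)[0]   # sample I_t ~ p_t
    x = adversarial_reward(i, t)                     # only this reward is observed
    x_hat = x / probs[i]                             # importance-weighted estimate
    weights[i] *= math.exp(gamma * x_hat / K)        # exponential weight update
```

Sampling from p_t is what keeps the adversary from predicting, and therefore punishing, the next pull.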

SLIDE 7

Extensions & Connections

  • Contextual bandit: use β€˜side information’.
  • In each round we receive information from the user (in advertising example)
  • NaΓ―ve solution: run a bandit for each user category (cluster users first)
  • More related to supervised learning since we can use β€˜features’
  • Non-stationary bandit:
  • Distributions change slowly instead of rewards being determined by adversary.
  • Adversary assumption may be too pessimistic, but i.i.d. too optimistic.
  • Connection to reinforcement learning: MDP with only 1 state.
  • Full-information setting
  • We observe all rewards of all arms in all rounds
  • Online learning / Hedge algorithm / Exponential weights algorithm, EXP2.