Multi-agent learning
Multi-armed bandit algorithms
Gerard Vreeswijk, Intelligent Software Systems, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.
Thursday 30th April, 2020
Author: Gerard Vreeswijk. Slides last modified on April 30th, 2020 at 13:45 Multi-agent learning: Multi-armed bandit algorithms, slide 2
Contents
■ Introduction, motivation,
■ Online vs. offline (batch)
■ Simple (but common) MAB
■ A well-known MAB algorithm
■ Some remarks on the analysis
Explorer vs. exploiter
■ Select a restaurant from N
■ Select a movie channel from N
■ Distribute load among servers.
■ Choose a medical treatment
■ Adaptive routing to optimize
■ Financial portfolio
■ . . .
■ This formula is correct. However, every time Qn is computed, all n rewards observed so far have to be summed again.
■ It would be better to have an update formula that computes the new estimate Qn from the previous estimate Qn−1 and the latest reward.
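A minimal sketch of such an incremental update: the new average is obtained from the old one and the latest reward in O(1), without re-summing. The class and method names here are illustrative, not from the slides.

```python
class IncrementalAverage:
    def __init__(self):
        self.n = 0     # number of rewards seen so far
        self.q = 0.0   # current estimate Qn

    def update(self, reward):
        self.n += 1
        self.q += (reward - self.q) / self.n  # Qn = Qn-1 + (rn - Qn-1)/n
        return self.q

avg = IncrementalAverage()
for r in [1.0, 0.0, 1.0, 1.0]:
    avg.update(r)
assert abs(avg.q - 0.75) < 1e-12  # identical to (1 + 0 + 1 + 1) / 4
```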
Update rule: new value = old value + learning rate × (target value − old value).
Author: Gerard Vreeswijk. Slides last modified on April 30th, 2020 at 13:45 Multi-agent learning: Multi-armed bandit algorithms, slide 12
lea rning rate average geometri averageAuthor: Gerard Vreeswijk. Slides last modified on April 30th, 2020 at 13:45 Multi-agent learning: Multi-armed bandit algorithms, slide 12
■ Amplitude of correction is determined by the
lea rning rate. average geometri averageAuthor: Gerard Vreeswijk. Slides last modified on April 30th, 2020 at 13:45 Multi-agent learning: Multi-armed bandit algorithms, slide 12
■ Amplitude of correction is determined by the
lea rning rate.■ To compute the
average, the learning rate is 1/n (decreases!). geometri averageAuthor: Gerard Vreeswijk. Slides last modified on April 30th, 2020 at 13:45 Multi-agent learning: Multi-armed bandit algorithms, slide 12
■ Amplitude of correction is determined by the
lea rning rate.■ To compute the
average, the learning rate is 1/n (decreases!).■ Learning rate can also be a constant 0 ≤ λ ≤ 1
geometri averageAuthor: Gerard Vreeswijk. Slides last modified on April 30th, 2020 at 13:45 Multi-agent learning: Multi-armed bandit algorithms, slide 12
■ Amplitude of correction is determined by the
lea rning rate.■ To compute the
average, the learning rate is 1/n (decreases!).■ Learning rate can also be a constant 0 ≤ λ ≤ 1 ⇒
geometri average.Author: Gerard Vreeswijk. Slides last modified on April 30th, 2020 at 13:45 Multi-agent learning: Multi-armed bandit algorithms, slide 13
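A sketch of the constant-learning-rate variant: a fixed λ in [0, 1] yields a geometric (exponentially recency-weighted) average rather than the sample mean. Function name is illustrative.

```python
def geometric_update(q, reward, lam):
    # One step of Q <- Q + lam * (reward - Q) with constant lam.
    return q + lam * (reward - q)

q = 0.0
for r in [1.0, 1.0, 1.0]:
    q = geometric_update(q, r, lam=0.5)
assert abs(q - 0.875) < 1e-12  # equals 1 - (1 - 0.5)**3
```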
Exploitation vs. exploration
■ Greedy: exploit the action that is optimal thus far.
■ ε-Greedy: let 0 < ε ≤ 1 be close to 0; with probability ε explore a random action, otherwise exploit. Since Σ∞i=1 ε is infinite, it follows from the second Borel–Cantelli lemma that every action is explored infinitely many times (so, by the law of large numbers, the estimates converge).
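A minimal ε-greedy selection rule, as a sketch (function and variable names are my own):

```python
import random

def epsilon_greedy(q_values, eps, rng=random):
    if rng.random() < eps:                  # explore: uniform random action
        return rng.randrange(len(q_values))
    return q_values.index(max(q_values))    # exploit: greedy action

random.seed(0)
q = [0.2, 0.9, 0.5]
picks = [epsilon_greedy(q, eps=0.1) for _ in range(1000)]
assert picks.count(1) > 800          # the greedy arm dominates...
assert set(picks) == {0, 1, 2}       # ...but every arm keeps being explored
```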
■ At the outset, an unrealistically high quality is attributed to every slot machine: Q0 = high.
■ As usual, for every slot machine
■ Without exception, always exploit
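A sketch of optimistic initialisation: every arm starts at an unrealistically high Q0, so pure exploitation still ends up trying each arm at least once. The reward model (Bernoulli arms) and all names are illustrative assumptions.

```python
import random

random.seed(1)
true_means = [0.3, 0.7, 0.5]
Q = [5.0] * len(true_means)        # Q0 = high (optimistic)
N = [0] * len(true_means)

for t in range(300):
    i = Q.index(max(Q))            # without exception, always exploit
    r = 1.0 if random.random() < true_means[i] else 0.0
    N[i] += 1
    Q[i] += (r - Q[i]) / N[i]      # incremental sample-average update

assert all(n > 0 for n in N)       # optimism forced every arm to be tried
```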
From: “Reinforcement Learning (...)”, Sutton and Barto, Sec. 2.8, p. 41.
■ Q-learning is like ε-Greedy learning, but with a moving average.
■ Q-learning possesses two parameters: an exploration rate, ε, and a learning or adaptation rate, λ.
■ Exercise: what if ε is small and λ is large? The other way around?
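The two parameters can be combined in one loop, sketched below: ε controls how often a random action is explored, λ controls how fast the moving average adapts. Names and the Bernoulli reward model are illustrative.

```python
import random

def step(Q, eps, lam, pull, rng=random):
    # eps-greedy selection, then a constant-learning-rate (moving-average) update.
    i = rng.randrange(len(Q)) if rng.random() < eps else Q.index(max(Q))
    r = pull(i)
    Q[i] += lam * (r - Q[i])
    return i, r

random.seed(2)
Q = [0.0, 0.0]
for _ in range(500):
    step(Q, eps=0.1, lam=0.1,
         pull=lambda i: float(random.random() < [0.2, 0.8][i]))
assert Q[1] > Q[0]   # the better arm ends up with the higher estimate
```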
■ Greedily: exploit the action that is optimal thus far.
■ Proportional: select randomly, proportional to the expected value: Pr(i) = Qi / ∑Nj=1 Qj.
■ Through softmax (or Boltzmann, or Gibbs, or mixed logit, or quantal response): Pr(i) = eQi/τ / ∑Nj=1 eQj/τ, where τ > 0 is the temperature.
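A sketch of the softmax rule (function name illustrative; the max is subtracted only for numerical stability and cancels in the ratio):

```python
import math

def softmax_probs(Q, tau):
    m = max(q / tau for q in Q)
    w = [math.exp(q / tau - m) for q in Q]
    z = sum(w)
    return [x / z for x in w]

probs = softmax_probs([1.0, 2.0], tau=1.0)
assert abs(sum(probs) - 1.0) < 1e-12
assert probs[1] > probs[0]                 # higher Q, higher probability
# High temperature -> near-uniform; low temperature -> near-greedy:
assert abs(softmax_probs([1.0, 2.0], tau=100.0)[0] - 0.5) < 0.01
assert softmax_probs([1.0, 2.0], tau=0.05)[1] > 0.999
```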
Pr(i) = eQi/τ / ∑Nj=1 eQj/τ.
■ τ → ∞: selection approaches uniformly random (pure exploration).
■ τ = 1: plain softmax over the Q-values.
■ τ ↓ 0: selection approaches greedy (pure exploitation).
■ UCB is short for upper confidence bound.
■ Proposed by Lai et al. in “Asymptotically efficient adaptive allocation rules” (1985).
■ Idea: keep track of confidence intervals for each action. At any time, play the action whose confidence interval has the highest upper bound.
■ Algorithm: execute each action once. Then, at each round t, choose the action i that maximises
x̄it + √( 2 ln t / nit ),
where x̄it is action i’s average at round t, and nit is the number of times action i has been played.
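A sketch of UCB1 along these lines (names and the Bernoulli reward model are illustrative):

```python
import math
import random

def ucb1_choose(avgs, counts, t):
    # Pick the arm maximising average + sqrt(2 ln t / n_i).
    return max(range(len(avgs)),
               key=lambda i: avgs[i] + math.sqrt(2 * math.log(t) / counts[i]))

random.seed(3)
means = [0.2, 0.8]
avgs, counts = [0.0, 0.0], [0, 0]
for i in range(2):                      # initialisation: each arm once
    counts[i] = 1
    avgs[i] = float(random.random() < means[i])
for t in range(3, 1000):
    i = ucb1_choose(avgs, counts, t)
    r = float(random.random() < means[i])
    counts[i] += 1
    avgs[i] += (r - avgs[i]) / counts[i]
assert counts[1] > counts[0]            # better arm is played far more often
assert min(counts) > 5                  # ...yet every arm keeps being explored
```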
Author: Gerard Vreeswijk. Slides last modified on April 30th, 2020 at 13:45 Multi-agent learning: Multi-armed bandit algorithms, slide 21
b
b
b
b
b
b
b
Author: Gerard Vreeswijk. Slides last modified on April 30th, 2020 at 13:45 Multi-agent learning: Multi-armed bandit algorithms, slide 22
Author: Gerard Vreeswijk. Slides last modified on April 30th, 2020 at 13:45 Multi-agent learning: Multi-armed bandit algorithms, slide 23
■ Hoeffding’s inequality for i.i.d. random variables X1, . . . , Xt with mean μ: Pr{μ ≥ X̄ + d} ≤ exp(−2td²).
■ For action i this would be Pr{μ ≥ x̄it + d} ≤ exp(−2 nit d²).
■ The probability that the true mean lies outside [x̄it − d, x̄it + d] goes to zero as nit grows. To pick d, set exp(−2 nit d²) equal to an expression that goes to zero, e.g. exp(−2 nit d²) = t−4. Solving for d gives d = √( 2 ln t / nit ), the UCB1 bonus.
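A quick numerical check of that derivation (the values of t and n are arbitrary): with d = √(2 ln t / n), the Hoeffding bound exp(−2nd²) indeed equals t⁻⁴.

```python
import math

t, n = 1000, 38
d = math.sqrt(2 * math.log(t) / n)
# exp(-2 n d^2) = exp(-4 ln t) = t^-4
assert abs(math.exp(-2 * n * d * d) - t ** -4) < 1e-18
```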
■ All arms are pulled infinitely often. (Hint: what happens with an arm that is pulled only finitely often? Its exploration bonus keeps growing.)
■ Let μ1, . . . , μK be the means of the distributions of the K arms.
■ Let I = argmaxi=1,...,K{μi}. So I is the set of indices of all optimal arms.
■ It can be proven that the number of times a sub-optimal arm is played grows only logarithmically in t.
■ Bounds can be loose. Suppose C = 8 and N is 20. Then
■ UCB comes in variants. UCB1 was discussed here.
■ Idea: for each arm, use a hypothesis about its hidden bias.
■ For every action, start out with a uniform prior Beta(1, 1).
■ At every round, sample all posteriors; play the action whose sample is largest.
■ Update: compute the action’s posterior.¹
¹ Each arm i is associated with the outcome of tossing a biased coin (heads = 1, tails = 0) with fixed (and hidden) bias 0 ≤ θi ≤ 1.
Arm 1, with posterior Beta(3, 2); Arm 2, Beta(1, 1); Arm 3, Beta(2, 4); Arm 4, Beta(2, 7).
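Thompson sampling for this coin-tossing setup can be sketched as follows. After s heads and f tails, the Beta(1, 1) prior becomes the posterior Beta(1 + s, 1 + f); `random.betavariate` draws from it. The hidden biases chosen here are arbitrary.

```python
import random

random.seed(4)
theta = [0.3, 0.6]          # hidden biases of the coins
succ = [0, 0]
fail = [0, 0]
for _ in range(2000):
    samples = [random.betavariate(1 + succ[i], 1 + fail[i])
               for i in range(len(theta))]
    i = samples.index(max(samples))   # play the arm with the largest sample
    if random.random() < theta[i]:
        succ[i] += 1
    else:
        fail[i] += 1
pulls = [succ[i] + fail[i] for i in range(len(theta))]
assert pulls[1] > pulls[0]  # sampling concentrates on the better arm
```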
Shifting gears: the adversarial bandit
■ Adversarial behaviour seems to
■ An adversarial bandit (Auer and
■ In the worst case, an adversarial
■ “Cumulative” algorithms, like
Author: Gerard Vreeswijk. Slides last modified on April 30th, 2020 at 13:45 Multi-agent learning: Multi-armed bandit algorithms, slide 28
Exp3
■ Exp3 is short for “exponential-weight algorithm for exploration and exploitation”.
■ Proposed by Auer et al. (2002) in “The nonstochastic multiarmed bandit problem”.
■ Exp3 may be seen as a volatile Softmax.
■ It is an adversarial algorithm, meaning that it should perform well in adversarial (worst-case) environments.
■ Idea: maintain a vector of action weights (w1, . . . , wK). Actions are drawn with probabilities proportional to their weights, mixed with a uniform exploration term governed by an egalitarianism factor γ:

    pi(t) = (1 − γ) wi(t) / ∑j=1..K wj(t) + γ/K
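A minimal sketch of Exp3 along these lines, assuming rewards in [0, 1]. The probability mixing, the importance-weighted reward estimate and the exponential weight update follow Auer et al. (2002); the concrete function names, parameter defaults and the rescaling step (which only guards against overflow) are illustrative:

```python
import math
import random

def exp3(reward_fn, n_arms, rounds=5000, gamma=0.1, seed=0):
    """Exp3 for rewards in [0, 1], with egalitarianism factor gamma."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    total_reward = 0.0
    for t in range(rounds):
        w_sum = sum(weights)
        # Mix normalised weights with uniform exploration.
        probs = [(1 - gamma) * w / w_sum + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        x = reward_fn(arm, t)            # observed reward, assumed in [0, 1]
        total_reward += x
        x_hat = x / probs[arm]           # importance-weighted estimate: unbiased
        weights[arm] *= math.exp(gamma * x_hat / n_arms)
        # Rescale so weights stay bounded; probs are unaffected by a
        # common factor, since they depend only on w_i / sum(w).
        w_max = max(weights)
        weights = [w / w_max for w in weights]
    return total_reward, weights

biases = [0.2, 0.5, 0.8]
coin = random.Random(1)
reward, weights = exp3(lambda arm, t: float(coin.random() < biases[arm]), n_arms=3)
```

Dividing the observed reward by the probability of the chosen arm compensates for the fact that only one arm is observed per round, so that the weight of every arm grows in proportion to its (estimated) cumulative reward.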
A time series (Xt) is called:
■ strictly stationary if, for each fixed h, the vectors (Xt, Xt+1, . . . , Xt+h) have the same joint distribution for every t;
■ weakly stationary if its mean E[Xt] is constant and its autocovariance Cov(Xt, Xt+h) depends only on the lag h, not on t.
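A quick numerical illustration of these definitions (not from the slides): a ±1 random walk has Var(Xt) = t, so the distribution of Xt depends on t and the walk is not even weakly stationary. The simulation below estimates Var(X100) and Var(X1000) from independent sample paths:

```python
import random

def random_walks(n_paths=500, length=1000, seed=0):
    """Simulate independent +/-1 random walks.

    Returns the walk positions at t = 100 and t = length across all paths.
    """
    rng = random.Random(seed)
    early, late = [], []
    for _ in range(n_paths):
        s = 0
        for t in range(1, length + 1):
            s += 1 if rng.random() < 0.5 else -1
            if t == 100:
                early.append(s)
        late.append(s)
    return early, late

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

early, late = random_walks()
# var(early) should be near 100 and var(late) near 1000: the marginal
# distribution of X_t changes with t, violating (weak) stationarity.
# An i.i.d. noise sequence, by contrast, has the same variance at every t.
```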
[Figure: unevenly spaced reward observations over time, one series per arm (Arm 1 – Arm 5)]
Regular (evenly spaced) vs. unevenly spaced time series
■ Lots of theories, methods and
■ Unevenly spaced time series are
■ Rather than to overengineer
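One simple way to summarise rewards from an unevenly spaced series — a sketch of a standard technique, not necessarily the approach intended on this slide — is to weight each observation by the time elapsed since it was made, e.g. with an exponential (half-life) decay. The function name and parameters below are illustrative:

```python
def time_decayed_mean(observations, now, half_life=10.0):
    """Exponentially weighted mean of (timestamp, value) pairs.

    Each value is discounted by 2 ** (-(now - t) / half_life), so the
    weighting depends on elapsed time rather than on sample index --
    which is what makes it applicable to unevenly spaced series.
    """
    num = den = 0.0
    for t, v in observations:
        w = 2.0 ** (-(now - t) / half_life)
        num += w * v
        den += w
    return num / den

# Observations at irregular times: recent rewards dominate the estimate.
obs = [(0.0, 1.0), (2.0, 1.0), (19.0, 0.0), (20.0, 0.0)]
m = time_decayed_mean(obs, now=20.0)
```

Because the discount is a function of timestamps only, gaps in the series need no special handling: a long silence simply shrinks the influence of everything observed before it.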