Multi-agent learning
Multi-armed bandit algorithms

Gerard Vreeswijk, Intelligent Software Systems, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.

Thursday 30th April, 2020

Contents

■ Introduction, motivation, practical applications.
■ Online vs. offline (batch) processing of data.
■ Simple (but common) MAB algorithms:
  • ε-Greedy.
  • Q-learning with exploration rate ε and learning rate λ.
  • Boltzmann (a.k.a. Softmax, Gibbs, mixed logit, quantal response) with temperature τ.
  • UCB (upper confidence bound). (Parameterless.)
  • Thompson sampling.
■ A well-known MAB algorithm that works well in adversarial circumstances:
  • Exp3 (exponential-weight algorithm for exploration and exploitation) with egalitarian factor γ.
■ Some remarks on the analysis of unevenly spaced time series.

MAB algorithms are only interested in rewards per action

Row is the protagonist. From a full payoff matrix

         a     b     c     d     e
    A  1, 0  5, 6  1, 0  9, 7  7, 2
    B  4, 6  4, 2  1, 8  7, 2  9, 7
    C  1, 0  7, 2  9, 7  3, 4  4, 6
    D  3, 7  5, 2  5, 3  9, 7  1, 8
    E  1, 0  7, 2  4, 6  1, 2  2, 0

to a view in which the antagonist's payoffs are unknown

         a     b     c     d     e
    A  1, ?  5, ?  1, ?  9, ?  7, ?
    B  4, ?  4, ?  1, ?  7, ?  9, ?
    C  1, ?  7, ?  9, ?  3, ?  4, ?
    D  3, ?  5, ?  5, ?  9, ?  1, ?
    E  1, ?  7, ?  4, ?  1, ?  2, ?

to "don't care what the antagonist does": per action, only a reward sequence remains.

    A  reward sequence r1, r2, . . .
    B  reward sequence r5, . . .
    C  reward sequence r3, r7, r8, . . .
    D  reward sequence r4, r9, r10, r11, r12, . . .
    E  reward sequence r6, . . .
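This reduction can be made concrete in code. Below is a minimal sketch (mine, not the lecture's) of a bandit environment in Python: the agent observes only a reward for the pulled arm, never a payoff matrix or the antagonist's move. The class name and the Gaussian reward model are illustrative assumptions.

    import random

    class BanditEnvironment:
        # Hypothetical wrapper: everything except a per-pull reward is hidden.
        def __init__(self, means):
            self._means = means              # unknown to the agent

        def pull(self, action):
            # The agent sees only this number: r_t for the chosen arm.
            return random.gauss(self._means[action], 1.0)

    env = BanditEnvironment({"A": 1.0, "B": 4.0, "C": 1.0, "D": 3.0, "E": 1.0})
    r = env.pull("D")                        # one entry of D's reward sequence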

Introduction

The multi-armed bandit.

http://en.wikipedia.org/wiki/Multi-armed_bandit

The multi-armed bandit problem

Which slot machine to choose?

MAB problem: random questions

■ Given: an array of N slot machines. Random questions:

  1. How long do you stick with a slot machine?
  2. Do you try many machines, or opt for security?
  3. Do you exploit success, or do you explore the possibilities?
  4. Is there something we can assume about the distribution of the payouts? Constant mean? Constant variance? Stationary? Does a machine "shift gears" every now and then?

Experiment

    Yield Machine 1   Yield Machine 2   Yield Machine 3
          8                 7                20
          8                11                 1
          8                 8
          8                 9
          8
    Average: 8        Average: 8.75     Average: 10.5

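As a quick arithmetic check, assuming the eleven yields fill the three columns top to bottom as in the table above, a few lines of Python reproduce the averages:

    yields = {
        "Machine 1": [8, 8, 8, 8, 8],
        "Machine 2": [7, 11, 8, 9],
        "Machine 3": [20, 1],
    }
    for machine, rs in yields.items():
        print(machine, sum(rs) / len(rs))    # 8.0, 8.75, 10.5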

Exploration vs. exploitation

■ Problem. You are at the beginning of a new study year. Every fellow student is interesting as a possible new friend. How do you divide your time between your classmates to optimise your happiness? Strategies:

  1. Make friends whe{n|r}ever possible. You are an explorer.
  2. Stick to the nearest fellow-student. You are an exploiter.
  3. What most people would do: first explore, then "exploit".

We ignore / abstract away from:

  1. How the quality of friendships is measured.
  2. That personalities of friends may change (so-called "non-stationary search").

Other practical problems

■ Select a restaurant from N alternatives.
■ Select a movie channel from N recommendations.
■ Distribute load among servers.
■ Choose a medical treatment from N alternatives.
■ Adaptive routing to optimize network flow.
■ Financial portfolio management.
■ . . .

Computation of the quality (offline version)

A reasonable measure for the quality of an action a after n tries, Qn, would be its average payoff:

    Qn  =def  (r1 + · · · + rn) / n

Data comes in gradually.

■ This formula is correct. However, every time Qn is computed, all of r1, . . . , rn must be retrieved. This is batch learning.
■ It would be better to have an update formula that computes the new average from the old average and the new incoming value. That would be online learning.

Computation of the quality (online version)

    Qn  =  (r1 + · · · + rn) / n
        =  (r1 + · · · + rn−1) / n  +  rn / n
        =  ((r1 + · · · + rn−1) / (n − 1)) · ((n − 1) / n)  +  rn / n
        =  Qn−1 · (n − 1) / n  +  rn / n
        =  (1 − 1/n) · Qn−1  +  (1/n) · rn
        =  Qn−1  +  (1/n) · (rn − Qn−1)

In words:

    new value  =  old value  +  learning rate · (goal value − old value),

where Qn−1 is the old value, rn the goal value, 1/n the learning rate, rn − Qn−1 the error, and the added term the correction.

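The derivation translates directly into code. Here is a sketch (function names are mine) confirming that the online update reproduces the batch average:

    def batch_quality(rewards):
        # Batch learning: all of r_1, ..., r_n are touched every time.
        return sum(rewards) / len(rewards)

    def online_quality(q_prev, r_n, n):
        # Online learning: old value + learning rate * (goal - old value).
        return q_prev + (r_n - q_prev) / n

    rewards = [8, 7, 20, 8, 11]
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q = online_quality(q, r, n)              # only q and n are stored
    assert abs(q - batch_quality(rewards)) < 1e-9   # both give 10.8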

Progress of the quality of one action

■ The amplitude of the correction is determined by the learning rate.
■ To compute the average, the learning rate is 1/n (it decreases!).
■ The learning rate can also be a constant 0 ≤ λ ≤ 1 ⇒ geometric average.

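A sketch of both choices (illustrative values; with a constant rate the weights of past rewards decay geometrically, hence the name):

    def running_average(rewards):
        q, n = 0.0, 0
        for r in rewards:
            n += 1
            q += (r - q) / n        # learning rate 1/n: the plain average
        return q

    def geometric_average(rewards, lam=0.5):
        q = rewards[0]
        for r in rewards[1:]:
            q += lam * (r - q)      # constant rate: recent rewards dominate
        return q

    data = [8, 8, 8, 8, 20]         # a late outlier
    print(running_average(data))    # 10.4 : every reward weighs 1/5
    print(geometric_average(data))  # 14.0 : the newest reward weighs 0.5
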
Action selection: greedy and epsilon-greedy

■ Greedy: exploit the action that is optimal thus far.

    pi  =def  1 if ai is optimal thus far, 0 otherwise.

■ ε-Greedy: let 0 < ε ≤ 1 be close to 0.

  1. Exploitation: a fraction 1 − ε of the time, choose an optimal action.
  2. Exploration: at other times, choose a random action.

■ Because ∑_{i=1}^{∞} ε is infinite, it follows from the second Borel-Cantelli lemma that every action is explored infinitely many times a.s. So, by the law of large numbers, the estimated value of an action converges to its true value.
■ All this a.s. (= with probability 1). In particular, it is not certain.

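A minimal ε-greedy selector, as a sketch (Q-estimates kept in a list; ties among optimal actions broken uniformly, an assumption the slide leaves open):

    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        if random.random() < epsilon:
            # Exploration: a uniformly random action (may happen to be optimal).
            return random.randrange(len(q_values))
        # Exploitation: one of the actions that is optimal thus far.
        best = max(q_values)
        return random.choice([i for i, q in enumerate(q_values) if q == best])

    action = epsilon_greedy([8.0, 8.75, 10.5], epsilon=0.1)
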
Action selection: optimistic initial values

An alternative to ε-greedy is to work with optimistic initial values (a code sketch follows the questions below).

■ At the outset, an unrealistically high quality is attributed to every slot machine: Q_0^k = high, for 1 ≤ k ≤ N.
■ As usual, for every slot machine its average profit is maintained.
■ Without exception, always exploit machines with the highest Q-values.

Some questions:

  1. Initially, many actions are tried ⇒ all actions are tried?
  2. How high should "high" be?
  3. Can we still speak of exploration?
  4. ε-greedy: Pr( every action is explored infinitely many times ) = 1. Also with optimism?
  5. Is optimism (as a method) suitable to explore an array of (possibly) infinitely many slot machines? Why (not)?
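A sketch of the method (the value of HIGH and the stand-in environment are assumptions; note that after one pull the running average forgets HIGH entirely, which is why initially every machine gets tried):

    import random

    HIGH = 100.0                       # "unrealistically high" initial quality
    q = [HIGH] * 5                     # Q_0^k = high for every machine k
    n = [0] * 5                        # pull counts

    def payout(k):                     # stand-in slot machines
        return random.gauss([8, 9, 10, 7, 6][k], 1.0)

    for t in range(200):
        k = q.index(max(q))            # always exploit the highest Q-value
        n[k] += 1
        q[k] += (payout(k) - q[k]) / n[k]   # maintain the average profit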

Optimistic initial values vs. ε-greedy

[Figure: performance comparison of optimistic initial values and ε-greedy. From: "Reinforcement Learning (...)", Sutton and Barto, Sec. 2.8, p. 41.]

Q-learning

■ Q-learning is like ε-greedy learning, but with a moving average. Algorithm:

  1. At round t, with probability 1 − ε choose uniformly among the actions that are optimal thus far.
  2. Update Arm i's estimate at round t, Qi(t), as

        Qi(t) = (1 − λ) Qi(t − 1) + λ ri,   if Arm i is pulled with reward ri,
        Qi(t) = Qi(t − 1),                  otherwise.

■ Q-learning possesses two parameters: an exploration rate, ε, and a learning (or adaptation) rate, λ. A practical disadvantage of having two parameters is that tuning the algorithm takes more time.
■ Exercise: what if ε is small and λ is large? The other way around? (A sketch of one round follows below.)

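A sketch of one round of this algorithm (the environment and the function names are assumptions):

    import random

    def q_learning_step(q, epsilon, lam, reward_fn):
        if random.random() < epsilon:               # explore
            i = random.randrange(len(q))
        else:                                       # exploit, ties uniform
            best = max(q)
            i = random.choice([j for j, v in enumerate(q) if v == best])
        r = reward_fn(i)
        q[i] = (1 - lam) * q[i] + lam * r           # moving-average update;
        return i, r                                 # other arms unchanged

    q = [0.0, 0.0, 0.0]
    for t in range(1000):
        q_learning_step(q, epsilon=0.1, lam=0.05,
                        reward_fn=lambda i: random.gauss([1, 2, 3][i], 1.0))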

Action selection

■ Greedily: exploit the action that is optimal thus far.

    pi  =def  1 if ai is optimal thus far, 0 otherwise.

■ Proportional: select randomly, proportional to the expected payoff:

    pi  =def  Qi / (Q1 + · · · + Qn).

■ Through softmax (or Boltzmann, or Gibbs, or mixed logit, or quantal response):

    pi  =def  e^(Qi/τ) / ∑j e^(Qj/τ),

where the parameter τ is often called the temperature.

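A sketch of softmax selection (subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the slide's formula):

    import math
    import random

    def softmax_probs(q_values, tau=1.0):
        # p_i = exp(Q_i / tau) / sum_j exp(Q_j / tau)
        m = max(q_values)                               # stabilises exp()
        w = [math.exp((q - m) / tau) for q in q_values]
        s = sum(w)
        return [x / s for x in w]

    def softmax_select(q_values, tau=1.0):
        return random.choices(range(len(q_values)),
                              weights=softmax_probs(q_values, tau))[0]

    print(softmax_probs([8.0, 8.75, 10.5]))             # favours the third arm
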
Effect of the temperature parameter

The softmax function:

    pi  =def  e^(Qi/τ) / ∑j e^(Qj/τ).

This function favours successful actions. How much depends on τ:

    τ → ∞ : choice becomes uniformly random (pure exploration);
    τ = 1 : probabilities proportional to e^(Qi);
    τ ↓ 0 : choice becomes greedy (all mass on the optimal actions).

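A small numeric illustration of the three regimes (the Q-values are arbitrary):

    import math

    def softmax_probs(q_values, tau):
        m = max(q_values)
        w = [math.exp((q - m) / tau) for q in q_values]
        return [x / sum(w) for x in w]

    q = [8.0, 8.75, 10.5]
    for tau in (100.0, 1.0, 0.01):
        print(tau, [round(p, 3) for p in softmax_probs(q, tau)])
    # tau = 100 : roughly uniform choice (exploration)
    # tau = 1   : the best arm dominates, others keep some mass
    # tau = 0.01: essentially all mass on the best arm (greedy)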

UCB

■ UCB is short for upper confidence bounds.
■ Proposed by Lai and Robbins in "Asymptotically efficient adaptive allocation rules", in: Advances in Applied Mathematics, Vol. 6, Nr. 1 (1985), pp. 4-22.
■ Idea: keep track of a confidence interval for each action. At any time, choose the action whose confidence interval has the highest upper bound. Often advocated as optimism in the face of uncertainty.
■ Algorithm: execute each action once. Then, at each round t, choose one of the actions with the highest value of

    X̄_t^i + √( 2 ln t / n_t^i ),

where X̄_t^i is action i's average reward at round t, and n_t^i is the number of times action i has been executed up to round t.

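A sketch of this rule (first pull every arm once, then maximise average plus bonus; the Bernoulli environment in the demo call is an assumption):

    import math
    import random

    def ucb1(reward_fn, n_arms, horizon):
        counts = [0] * n_arms              # n_t^i
        means = [0.0] * n_arms             # Xbar_t^i
        for t in range(1, horizon + 1):
            if t <= n_arms:
                i = t - 1                  # first, execute each action once
            else:                          # then maximise mean + bonus
                i = max(range(n_arms), key=lambda a:
                        means[a] + math.sqrt(2 * math.log(t) / counts[a]))
            r = reward_fn(i)
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]
        return means, counts

    means, counts = ucb1(lambda i: float(random.random() < [0.3, 0.5, 0.7][i]),
                         n_arms=3, horizon=1000)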

UCB: idea

[Figure: empirical means (dots) and confidence intervals for Actions 1-7.]

Many actions have identical empirical means. On the basis of the highest empirical mean only, Action 4 and Action 5 would be equally optimal. However, the variation in the rewards of Action 5 is higher, hence its confidence interval is wider, hence its UCB is higher; therefore, choose Action 5: optimism in the face of uncertainty.

UCB: demo

[Demo slide: content not recoverable from the extraction.]

UCB: derivation

Hoeffding's inequality for i.i.d. random variables X1, . . . , Xt with mean µ and values in [0, 1] says that, for any d ≥ 0,

    Pr{ µ ≥ X̄_t + d } ≤ exp(−2 t d²),

where X̄_t = (X1 + · · · + Xt)/t is the empirical mean of the random variables at round t.

■ For action i this would be Pr{ µ ≥ X̄_t^i + d } ≤ exp(−2 n_t^i d²).
■ The probability that the true mean lies outside [ X̄_t^i − d, X̄_t^i + d ] goes to zero as t → ∞ if we set exp(−2 n_t^i d²) equal to an expression that goes to zero as t → ∞. The term t⁻⁴ is convenient here. Set exp(−2 n_t^i d²) = t⁻⁴. Isolating d yields

    d = √( 2 ln t / n_t^i ).

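Spelling out that last isolation step (plain algebra, no new assumptions):

    \exp(-2 n^i_t d^2) = t^{-4}
    \;\Longrightarrow\; -2 n^i_t d^2 = -4 \ln t
    \;\Longrightarrow\; d^2 = \frac{2 \ln t}{n^i_t}
    \;\Longrightarrow\; d = \sqrt{\frac{2 \ln t}{n^i_t}}.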

UCB: discussion

■ All arms are pulled infinitely often. (Hint: what happens with an arm's UCB if it is not pulled?)
■ Let µ1, . . . , µK be the means of the distributions of the K arms.
■ Let I = argmax_{i=1,...,K} { µi }. So I is the set of indices of all optimal arms.
■ It can be proven that the number of times a sub-optimal arm is played in N rounds can be bounded from above by C ln N, where C is some constant. It has also been proven that this is the best possible bound, up to C.
■ Bounds can be loose. Suppose C = 4 and N = 20. Then C ln N ≈ 12, so it is possible to play about 12 out of 20 times sub-optimally.
■ UCB comes in variants. UCB1 was discussed here.

slide-134
SLIDE 134

Thompson sampling (posterior sampling)

■ Idea: for each arm, use a parameterised probability density function as a hypothesis for its rewards. E.g., for the Bernoulli MAB problem¹ this could be the beta distributions Betai(αi, βi), with parameters αi, βi > 0.

■ For every action, start out with a neutral distribution, the prior PDF. For B-MAB, this should be Beta(1, 1).

■ At every round, sample all hypotheses, and choose the arm which is sampled highest. B-MAB: the mean of Beta(α, β) is 1/(1 + β/α).

■ Update: compute the action's posterior PDF by letting the reward change the parameters of its associated PDF, through a Bayesian update. B-MAB: if arm 5 pays 0, do β5 = β5 + 1 for the associated PDF. (If arm 5 pays 1, do α5 = α5 + 1.)

¹ Each arm i is associated with the outcome of tossing a biased coin (heads = 1, tails = 0) with fixed (and hidden) bias 0 ≤ θi ≤ 1.
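A minimal Python sketch of this loop for the Bernoulli bandit (not part of the slides; the pull interface and the hidden biases in the usage line are illustrative assumptions):

    import random

    def thompson_bernoulli(pull, K, T):
        """Play T rounds of Thompson sampling; pull(i) returns 0 or 1."""
        alpha = [1] * K  # every arm starts with the neutral prior Beta(1, 1)
        beta = [1] * K
        for _ in range(T):
            # sample one hypothesis per arm and play the arm sampled highest
            samples = [random.betavariate(alpha[i], beta[i]) for i in range(K)]
            i = samples.index(max(samples))
            r = pull(i)
            # Bayesian update: alpha counts successes, beta counts failures
            if r == 1:
                alpha[i] += 1
            else:
                beta[i] += 1
        return alpha, beta

    # Illustrative use: hidden biases θ = (0.3, 0.5, 0.7).
    alpha, beta = thompson_bernoulli(lambda i: int(random.random() < (0.3, 0.5, 0.7)[i]), K=3, T=10000)

The posteriors that come out follow the convention of the next slide: αi = #successes + 1 and βi = #failures + 1.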

slide-135
SLIDE 135

Thompson sampling on a Bernoulli bandit

[Figure: the four posterior PDFs. Arm 1, with posterior Beta(3, 2); Arm 2, Beta(1, 1); Arm 3, Beta(2, 4); Arm 4, Beta(2, 7).]

Posterior PDFs after pulling Arm 1 three times with two successes, Arm 2 zero times, Arm 3 four times with one success, and Arm 4 seven times with one success. (Notice: α = #successes + 1, β = #failures + 1.)
slide-142
SLIDE 142

Adversarial bandits

■ Adversarial behaviour seems to be the norm in multi-agent learning: opponents may change strategies at any time (called shifting gears in poker), for example to the end of being unpredictable. Think of anti-coordination games, such as chicken, or matching pennies.

■ An adversarial bandit (Auer and Cesa-Bianchi, 1998) may use any algorithm to assign rewards.

■ At best, an adversarial bandit is benign. If it is smart [omniscient], it may frustrate a learning algorithm [to the max]. Small exercise: suppose you are a benign adversarial bandit playing against a fictitious player who plays row. How would you play?

          H       T
    H   1, 0   −1, 0
    T  −1, 0    1, 0

■ "Cumulative" algorithms, like ǫ-Greedy, UCB, or Thompson, respond slowly to sudden changes.

slide-148
SLIDE 148

Exp3

■ Exp3 is short for exponential-weight algorithm for exploration and exploitation.

■ Proposed by Auer et al. (2002) in "The nonstochastic multiarmed bandit problem", SIAM Journal on Computing, Vol. 32, Nr. 1.

■ Exp3 may be seen as a volatile Softmax.

■ It is an adversarial algorithm, meaning that it should perform well in environments where payoffs for actions may suddenly change.

■ Idea: maintain a vector of action weights (w1, . . . , wK). Actions are chosen probabilistically, proportional to their weights:

  pk(t) =Def (1 − γ) · wk(t) / ( w1(t) + · · · + wK(t) ) + γ/K,

where 0 ≤ γ ≤ 1 is the egalitarianism factor (check what happens if γ = 0 or γ = 1).
slide-153
SLIDE 153

Exp3 (continued)

So far so good. But how are the weights computed? If

  r̂i(t) =Def  ri(t)/pi(t)   if i is chosen at t,
               0             otherwise,

denotes the estimated reward, i.e., the reward of action i weighed by its surprise (i.e., multiplied by the reciprocal of its probability to occur), then weights are computed as

  wi(t + 1) =Def wi(t) · exp( γ · r̂i(t) / K ).

Important: rewards are supposed to lie in [0, 1]. (Scale payoffs if necessary.) Exp3 is a so-called weak no-regret algorithm, which means that the average regret vanishes almost surely. Both the mixing rule from the previous slide and this weight update appear in the sketch below.
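A minimal Python sketch of Exp3 (not part of the slides; the pull interface and the default γ = 0.1 are illustrative assumptions; pull(i) must return a reward in [0, 1]):

    import math
    import random

    def exp3(pull, K, T, gamma=0.1):
        """Play T rounds of Exp3 on K arms; pull(i) returns a reward in [0, 1]."""
        w = [1.0] * K  # action weights w1, ..., wK
        for _ in range(T):
            total = sum(w)
            # mix the weight-proportional distribution with the uniform one
            p = [(1 - gamma) * w[i] / total + gamma / K for i in range(K)]
            i = random.choices(range(K), weights=p)[0]
            r = pull(i)
            r_hat = r / p[i]  # estimated reward: weigh the reward by its surprise
            w[i] *= math.exp(gamma * r_hat / K)
        return w

Since every pi(t) ≥ γ/K > 0, the surprise factor 1/pi(t) stays bounded, which keeps the weight update stable.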

slide-161
SLIDE 161

Stationary time series

A time series {Xt}t is called strictly stationary if, for each fixed h, the vectors (Xt, Xt+1, . . . , Xt+h) are identically distributed for all t. It is called weakly stationary of order two if it has a fixed mean and, for each fixed h, the covariances Cov(Xt, Xt+h) are equal for all t.

Strict: all statistics are time-independent. (Unrealistic.) Weak: constant mean, variance and covariance; other statistics are free to change. (Common.)

[Figure: true values scattered around their trend (= the optimal prediction), together with the average and the moving average.]
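Weak stationarity can be checked empirically. A small illustration (not part of the slides; all parameters are assumptions): an AR(1) process is weakly stationary, so its mean and lag-1 covariance estimated over an early and a late window should roughly agree.

    import random
    import statistics

    def ar1(n, phi=0.5):
        """Simulate X_t = phi * X_{t-1} + noise, a weakly stationary series."""
        x, out = 0.0, []
        for _ in range(n):
            x = phi * x + random.gauss(0, 1)
            out.append(x)
        return out

    def lag1_cov(xs):
        m = statistics.fmean(xs)
        return statistics.fmean((a - m) * (b - m) for a, b in zip(xs, xs[1:]))

    series = ar1(100000)
    early, late = series[:50000], series[50000:]
    print(statistics.fmean(early), statistics.fmean(late))  # both close to 0
    print(lag1_cov(early), lag1_cov(late))  # both close to phi / (1 - phi**2), about 0.67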

slide-162
SLIDE 162

Non-stationary time series

[Figure: a drifting series of true values with their trend (= the optimal prediction), the average, and the moving average.]

With non-stationary time series, the average (a.k.a. empirical mean), updated as T ← ((n − 1)/n)·T + r/n, and the moving average (a.k.a. rolling average, geometric mean, or exponentially smoothed mean), updated as T ← (1 − γ)·T + γ·r, are bad predictors.
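Both predictors are one-line updates, so the failure is easy to reproduce. A small sketch (not part of the slides; the linear drift and γ = 0.1 are illustrative assumptions):

    import random

    avg, mov, n = 0.0, 0.0, 0
    for t in range(1, 201):
        r = 0.05 * t + random.gauss(0, 1)  # non-stationary payoff: drifts upward
        n += 1
        avg = (n - 1) / n * avg + r / n    # empirical mean
        mov = 0.9 * mov + 0.1 * r          # exponentially smoothed mean, gamma = 0.1
    print(avg, mov)  # current level is about 10; avg lags near 5, mov near 9.5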

slide-165
SLIDE 165

MAB with non-stationary payoffs

[Figure: per-round payoff curves for Arms 1–5, with the actually received payments marked as points.]

The payoffs constitute an unevenly spaced time series, and this, of course, is what the observer sees. True payments per arm per round, had the arm been pulled (solid lines); received payments per arm per round (points). Due to prediction errors, sometimes a "wrong" arm is pulled (verify!).

SLIDE 173

Status of MAB algorithms in MAL

Author: Gerard Vreeswijk. Slides last modified on April 30th, 2020 at 13:45 Multi-agent learning: Multi-armed bandit algorithms, slide 33

■ Lots of theories, methods and

techniques are known to analyse and predict time series.

Regula r, a.k.a. evenly spa ed time series are best understood. This

is probably due to its inherent properties and efforts in finance and stock prediction.

Unevenly spa ed time series are

less well understood. Some techniques:

  • Interpolate empty intervals /

transform to evenly-spaced

  • series. (“Traces” is a Python

library based on this principle.)

  • Techniques that take

irregular time series “as they are” include state space analysis, Kalman filtering, autoregression, and stochastic differential equations, to name a few.

■ Rather than to overengineer

MAB algorithms, for MAL it is perhaps better

to tak e advantage
  • f
the game
  • ntext (own payoff

matrix, opponent moves,

  • pponent’s hypothesized

strategy).
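The interpolation technique is simple enough to spell out. A hedged sketch of linear interpolation onto an even grid in plain Python (a library such as "Traces" automates this; to_even_grid is a hypothetical helper, not that library's API):

    def to_even_grid(points, step):
        """points: sorted (time, value) pairs; returns values at t0, t0+step, ..., tn."""
        (t0, _), (tn, _) = points[0], points[-1]
        out, j, t = [], 0, t0
        while t <= tn:
            while points[j + 1][0] < t:  # advance to the segment bracketing t
                j += 1
            (ta, va), (tb, vb) = points[j], points[j + 1]
            out.append(va + (vb - va) * (t - ta) / (tb - ta))
            t += step
        return out

    # Illustrative use: three unevenly spaced observations, resampled at step 1.0.
    print(to_even_grid([(0.0, 1.0), (0.7, 2.0), (3.0, 0.5)], step=1.0))

Once resampled, the whole toolbox for regular time series applies again.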