Multi-agent learning
Multi-armed bandit algorithms

Gerard Vreeswijk, Intelligent Software Systems, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.

Thursday 30th April, 2020

Contents

■ Introduction, motivation, practical applications.
■ Online vs. offline (batch) processing of data.
■ Simple (but common) MAB algorithms:
  • ε-Greedy.
  • Q-learning with exploration rate ε and learning rate λ.
  • Boltzmann (a.k.a. Softmax, Gibbs, mixed logit, quantal response) with temperature τ.
  • UCB (upper confidence bound). (Parameterless.)
  • Thompson sampling.
■ A well-known MAB algorithm that works well in adversarial circumstances:
  • Exp3 (exponential-weight algorithm for exploration and exploitation) with egalitarian factor γ.
■ Some remarks on the analysis of unevenly spaced time series.

MAB algorithms are only interested in rewards per action

Row is the protagonist. From a full payoff matrix

         a     b     c     d     e
    A  1, 0  5, 6  1, 0  9, 7  7, 2
    B  4, 6  4, 2  1, 8  7, 2  9, 7
    C  1, 0  7, 2  9, 7  3, 4  4, 6
    D  3, 7  5, 2  5, 3  9, 7  1, 8
    E  1, 0  7, 2  4, 6  1, 2  2, 0

to a view in which the antagonist's payoffs are unknown

         a     b     c     d     e
    A  1, ?  5, ?  1, ?  9, ?  7, ?
    B  4, ?  4, ?  1, ?  7, ?  9, ?
    C  1, ?  7, ?  9, ?  3, ?  4, ?
    D  3, ?  5, ?  5, ?  9, ?  1, ?
    E  1, ?  7, ?  4, ?  1, ?  2, ?

to "don't care what the antagonist does": per action, only a reward sequence remains.

    A  reward sequence r1, r2, . . .
    B  reward sequence r5, . . .
    C  reward sequence r3, r7, r8, . . .
    D  reward sequence r4, r9, r10, r11, r12, . . .
    E  reward sequence r6, . . .
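This reduction can be made concrete in code. Below is a minimal sketch (mine, not the lecture's) of a bandit environment in Python: the agent observes only a reward for the pulled arm, never a payoff matrix or the antagonist's move. The class name and the Gaussian reward model are illustrative assumptions.

    import random

    class BanditEnvironment:
        # Hypothetical wrapper: everything except a per-pull reward is hidden.
        def __init__(self, means):
            self._means = means              # unknown to the agent

        def pull(self, action):
            # The agent sees only this number: r_t for the chosen arm.
            return random.gauss(self._means[action], 1.0)

    env = BanditEnvironment({"A": 1.0, "B": 4.0, "C": 1.0, "D": 3.0, "E": 1.0})
    r = env.pull("D")                        # one entry of D's reward sequence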

Introduction

The multi-armed bandit.

http://en.wikipedia.org/wiki/Multi-armed_bandit

The multi-armed bandit problem

Which slot machine to choose?

MAB problem: random questions

■ Given: an array of N slot machines. Random questions:

  1. How long do you stick with a slot machine?
  2. Do you try many machines, or opt for security?
  3. Do you exploit success, or do you explore the possibilities?
  4. Is there something we can assume about the distribution of the payouts? Constant mean? Constant variance? Stationary? Does a machine "shift gears" every now and then?

Experiment

    Yield Machine 1   Yield Machine 2   Yield Machine 3
          8                 7                20
          8                11                 1
          8                 8
          8                 9
          8
    Average: 8        Average: 8.75     Average: 10.5

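As a quick arithmetic check, assuming the eleven yields fill the three columns top to bottom as in the table above, a few lines of Python reproduce the averages:

    yields = {
        "Machine 1": [8, 8, 8, 8, 8],
        "Machine 2": [7, 11, 8, 9],
        "Machine 3": [20, 1],
    }
    for machine, rs in yields.items():
        print(machine, sum(rs) / len(rs))    # 8.0, 8.75, 10.5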

Exploration vs. exploitation

■ Problem. You are at the beginning of a new study year. Every fellow student is interesting as a possible new friend. How do you divide your time between your classmates to optimise your happiness? Strategies:

  1. Make friends whe{n|r}ever possible. You are an explorer.
  2. Stick to the nearest fellow-student. You are an exploiter.
  3. What most people would do: first explore, then "exploit".

We ignore / abstract away from:

  1. How the quality of friendships is measured.
  2. That personalities of friends may change (so-called "non-stationary search").

Other practical problems

■ Select a restaurant from N alternatives.
■ Select a movie channel from N recommendations.
■ Distribute load among servers.
■ Choose a medical treatment from N alternatives.
■ Adaptive routing to optimize network flow.
■ Financial portfolio management.
■ . . .

Computation of the quality (offline version)

A reasonable measure for the quality of an action a after n tries, Qn, would be its average payoff:

    Qn  =def  (r1 + · · · + rn) / n

Data comes in gradually.

■ This formula is correct. However, every time Qn is computed, all of r1, . . . , rn must be retrieved. This is batch learning.
■ It would be better to have an update formula that computes the new average from the old average and the new incoming value. That would be online learning.

Computation of the quality (online version)

    Qn  =  (r1 + · · · + rn) / n
        =  (r1 + · · · + rn−1) / n  +  rn / n
        =  ((r1 + · · · + rn−1) / (n − 1)) · ((n − 1) / n)  +  rn / n
        =  Qn−1 · (n − 1) / n  +  rn / n
        =  (1 − 1/n) · Qn−1  +  (1/n) · rn
        =  Qn−1  +  (1/n) · (rn − Qn−1)

In words:

    new value  =  old value  +  learning rate · (goal value − old value),

where Qn−1 is the old value, rn the goal value, 1/n the learning rate, rn − Qn−1 the error, and the added term the correction.

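The derivation translates directly into code. Here is a sketch (function names are mine) confirming that the online update reproduces the batch average:

    def batch_quality(rewards):
        # Batch learning: all of r_1, ..., r_n are touched every time.
        return sum(rewards) / len(rewards)

    def online_quality(q_prev, r_n, n):
        # Online learning: old value + learning rate * (goal - old value).
        return q_prev + (r_n - q_prev) / n

    rewards = [8, 7, 20, 8, 11]
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q = online_quality(q, r, n)              # only q and n are stored
    assert abs(q - batch_quality(rewards)) < 1e-9   # both give 10.8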

Progress of the quality of one action

■ The amplitude of the correction is determined by the learning rate.
■ To compute the average, the learning rate is 1/n (it decreases!).
■ The learning rate can also be a constant 0 ≤ λ ≤ 1 ⇒ geometric average.

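A sketch of both choices (illustrative values; with a constant rate the weights of past rewards decay geometrically, hence the name):

    def running_average(rewards):
        q, n = 0.0, 0
        for r in rewards:
            n += 1
            q += (r - q) / n        # learning rate 1/n: the plain average
        return q

    def geometric_average(rewards, lam=0.5):
        q = rewards[0]
        for r in rewards[1:]:
            q += lam * (r - q)      # constant rate: recent rewards dominate
        return q

    data = [8, 8, 8, 8, 20]         # a late outlier
    print(running_average(data))    # 10.4 : every reward weighs 1/5
    print(geometric_average(data))  # 14.0 : the newest reward weighs 0.5
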
Action selection: greedy and epsilon-greedy

■ Greedy: exploit the action that is optimal thus far.

    pi  =def  1 if ai is optimal thus far, 0 otherwise.

■ ε-Greedy: let 0 < ε ≤ 1 be close to 0.

  1. Exploitation: a fraction 1 − ε of the time, choose an optimal action.
  2. Exploration: at other times, choose a random action.

■ Because ∑_{i=1}^{∞} ε is infinite, it follows from the second Borel-Cantelli lemma that every action is explored infinitely many times a.s. So, by the law of large numbers, the estimated value of an action converges to its true value.
■ All this a.s. (= with probability 1). In particular, it is not certain.

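A minimal ε-greedy selector, as a sketch (Q-estimates kept in a list; ties among optimal actions broken uniformly, an assumption the slide leaves open):

    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        if random.random() < epsilon:
            # Exploration: a uniformly random action (may happen to be optimal).
            return random.randrange(len(q_values))
        # Exploitation: one of the actions that is optimal thus far.
        best = max(q_values)
        return random.choice([i for i, q in enumerate(q_values) if q == best])

    action = epsilon_greedy([8.0, 8.75, 10.5], epsilon=0.1)
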
Action selection: optimistic initial values

An alternative to ε-greedy is to work with optimistic initial values (a code sketch follows the questions below).

■ At the outset, an unrealistically high quality is attributed to every slot machine: Q_0^k = high, for 1 ≤ k ≤ N.
■ As usual, for every slot machine its average profit is maintained.
■ Without exception, always exploit machines with the highest Q-values.

Some questions:

  1. Initially, many actions are tried ⇒ all actions are tried?
  2. How high should "high" be?
  3. Can we still speak of exploration?
  4. ε-greedy: Pr( every action is explored infinitely many times ) = 1. Also with optimism?
  5. Is optimism (as a method) suitable to explore an array of (possibly) infinitely many slot machines? Why (not)?
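A sketch of the method (the value of HIGH and the stand-in environment are assumptions; note that after one pull the running average forgets HIGH entirely, which is why initially every machine gets tried):

    import random

    HIGH = 100.0                       # "unrealistically high" initial quality
    q = [HIGH] * 5                     # Q_0^k = high for every machine k
    n = [0] * 5                        # pull counts

    def payout(k):                     # stand-in slot machines
        return random.gauss([8, 9, 10, 7, 6][k], 1.0)

    for t in range(200):
        k = q.index(max(q))            # always exploit the highest Q-value
        n[k] += 1
        q[k] += (payout(k) - q[k]) / n[k]   # maintain the average profit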

Optimistic initial values vs. ε-greedy

[Figure: performance comparison of optimistic initial values and ε-greedy. From: "Reinforcement Learning (...)", Sutton and Barto, Sec. 2.8, p. 41.]

Q-learning

■ Q-learning is like ε-greedy learning, but with a moving average. Algorithm:

  1. At round t, with probability 1 − ε choose uniformly among the actions that are optimal thus far.
  2. Update Arm i's estimate at round t, Qi(t), as

        Qi(t) = (1 − λ) Qi(t − 1) + λ ri,   if Arm i is pulled with reward ri,
        Qi(t) = Qi(t − 1),                  otherwise.

■ Q-learning possesses two parameters: an exploration rate, ε, and a learning (or adaptation) rate, λ. A practical disadvantage of having two parameters is that tuning the algorithm takes more time.
■ Exercise: what if ε is small and λ is large? The other way around? (A sketch of one round follows below.)

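A sketch of one round of this algorithm (the environment and the function names are assumptions):

    import random

    def q_learning_step(q, epsilon, lam, reward_fn):
        if random.random() < epsilon:               # explore
            i = random.randrange(len(q))
        else:                                       # exploit, ties uniform
            best = max(q)
            i = random.choice([j for j, v in enumerate(q) if v == best])
        r = reward_fn(i)
        q[i] = (1 - lam) * q[i] + lam * r           # moving-average update;
        return i, r                                 # other arms unchanged

    q = [0.0, 0.0, 0.0]
    for t in range(1000):
        q_learning_step(q, epsilon=0.1, lam=0.05,
                        reward_fn=lambda i: random.gauss([1, 2, 3][i], 1.0))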

Action selection

■ Greedily: exploit the action that is optimal thus far.

    pi  =def  1 if ai is optimal thus far, 0 otherwise.

■ Proportional: select randomly, proportional to the expected payoff:

    pi  =def  Qi / (Q1 + · · · + Qn).

■ Through softmax (or Boltzmann, or Gibbs, or mixed logit, or quantal response):

    pi  =def  e^(Qi/τ) / ∑j e^(Qj/τ),

where the parameter τ is often called the temperature.

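A sketch of softmax selection (subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the slide's formula):

    import math
    import random

    def softmax_probs(q_values, tau=1.0):
        # p_i = exp(Q_i / tau) / sum_j exp(Q_j / tau)
        m = max(q_values)                               # stabilises exp()
        w = [math.exp((q - m) / tau) for q in q_values]
        s = sum(w)
        return [x / s for x in w]

    def softmax_select(q_values, tau=1.0):
        return random.choices(range(len(q_values)),
                              weights=softmax_probs(q_values, tau))[0]

    print(softmax_probs([8.0, 8.75, 10.5]))             # favours the third arm
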
Effect of the temperature parameter

The softmax function:

    pi  =def  e^(Qi/τ) / ∑j e^(Qj/τ).

This function favours successful actions. How much depends on τ:

    τ → ∞ : choice becomes uniformly random (pure exploration);
    τ = 1 : probabilities proportional to e^(Qi);
    τ ↓ 0 : choice becomes greedy (all mass on the optimal actions).

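A small numeric illustration of the three regimes (the Q-values are arbitrary):

    import math

    def softmax_probs(q_values, tau):
        m = max(q_values)
        w = [math.exp((q - m) / tau) for q in q_values]
        return [x / sum(w) for x in w]

    q = [8.0, 8.75, 10.5]
    for tau in (100.0, 1.0, 0.01):
        print(tau, [round(p, 3) for p in softmax_probs(q, tau)])
    # tau = 100 : roughly uniform choice (exploration)
    # tau = 1   : the best arm dominates, others keep some mass
    # tau = 0.01: essentially all mass on the best arm (greedy)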

UCB

■ UCB is short for upper confidence bounds.
■ Proposed by Lai and Robbins in "Asymptotically efficient adaptive allocation rules", in: Advances in Applied Mathematics, Vol. 6, Nr. 1 (1985), pp. 4-22.
■ Idea: keep track of a confidence interval for each action. At any time, choose the action whose confidence interval has the highest upper bound. Often advocated as optimism in the face of uncertainty.
■ Algorithm: execute each action once. Then, at each round t, choose one of the actions with the highest value of

    X̄_t^i + √( 2 ln t / n_t^i ),

where X̄_t^i is action i's average reward at round t, and n_t^i is the number of times action i has been executed up to round t.

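A sketch of this rule (first pull every arm once, then maximise average plus bonus; the Bernoulli environment in the demo call is an assumption):

    import math
    import random

    def ucb1(reward_fn, n_arms, horizon):
        counts = [0] * n_arms              # n_t^i
        means = [0.0] * n_arms             # Xbar_t^i
        for t in range(1, horizon + 1):
            if t <= n_arms:
                i = t - 1                  # first, execute each action once
            else:                          # then maximise mean + bonus
                i = max(range(n_arms), key=lambda a:
                        means[a] + math.sqrt(2 * math.log(t) / counts[a]))
            r = reward_fn(i)
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]
        return means, counts

    means, counts = ucb1(lambda i: float(random.random() < [0.3, 0.5, 0.7][i]),
                         n_arms=3, horizon=1000)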

UCB: idea

[Figure: empirical means (dots) and confidence intervals for Actions 1-7.]

Many actions have identical empirical means. On the basis of the highest empirical mean only, Action 4 and Action 5 would be equally optimal. However, the variation in the rewards of Action 5 is higher, hence its confidence interval is wider, hence its UCB is higher; therefore, choose Action 5: optimism in the face of uncertainty.

UCB: demo

[Demo slide: content not recoverable from the extraction.]

UCB: derivation

Hoeffding's inequality for i.i.d. random variables X1, . . . , Xt with mean µ and values in [0, 1] says that, for any d ≥ 0,

    Pr{ µ ≥ X̄_t + d } ≤ exp(−2 t d²),

where X̄_t = (X1 + · · · + Xt)/t is the empirical mean of the random variables at round t.

■ For action i this would be Pr{ µ ≥ X̄_t^i + d } ≤ exp(−2 n_t^i d²).
■ The probability that the true mean lies outside [ X̄_t^i − d, X̄_t^i + d ] goes to zero as t → ∞ if we set exp(−2 n_t^i d²) equal to an expression that goes to zero as t → ∞. The term t⁻⁴ is convenient here. Set exp(−2 n_t^i d²) = t⁻⁴. Isolating d yields

    d = √( 2 ln t / n_t^i ).

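Spelling out that last isolation step (plain algebra, no new assumptions):

    \exp(-2 n^i_t d^2) = t^{-4}
    \;\Longrightarrow\; -2 n^i_t d^2 = -4 \ln t
    \;\Longrightarrow\; d^2 = \frac{2 \ln t}{n^i_t}
    \;\Longrightarrow\; d = \sqrt{\frac{2 \ln t}{n^i_t}}.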

UCB: discussion

■ All arms are pulled infinitely often. (Hint: what happens with an arm's UCB if it is not pulled?)
■ Let µ1, . . . , µK be the means of the distributions of the K arms.
■ Let I = argmax_{i=1,...,K} { µi }. So I is the set of indices of all optimal arms.
■ It can be proven that the number of times a sub-optimal arm is played in N rounds can be bounded from above by C ln N, where C is some constant. It has also been proven that this is the best possible bound, up to C.
■ Bounds can be loose. Suppose C = 4 and N = 20. Then C ln N ≈ 12, so it is possible to play about 12 out of 20 times sub-optimally.
■ UCB comes in variants. UCB1 was discussed here.

slide-134
SLIDE 134

Thompson sampling (posterior sampling)

■ Idea: for each arm, use a parameterised probability density function as a hypothesis for its rewards. E.g., for the Bernoulli MAB problem¹ this could be the beta distributions Betai(αi, βi), with parameters αi, βi > 0.

■ For every action, start out with a neutral distribution, the prior PDF. For B-MAB, this should be Beta(1, 1).

■ At every round, sample all hypotheses, and choose the arm which is sampled highest. B-MAB: the mean of Beta(α, β) is 1/(1 + β/α).

■ Update: compute the action's posterior PDF by letting the reward change the parameters of its associated PDF, through a Bayesian update. B-MAB: if arm 5 pays 0, do β5 = β5 + 1 for the associated PDF. (If arm 5 pays 1, do α5 = α5 + 1.)

¹ Each arm i is associated with the outcome of tossing a biased coin (heads = 1, tails = 0) with fixed (and hidden) bias 0 ≤ θi ≤ 1.
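A minimal Python sketch of this loop for the Bernoulli bandit (not part of the slides; the pull interface and the hidden biases in the usage line are illustrative assumptions):

    import random

    def thompson_bernoulli(pull, K, T):
        """Play T rounds of Thompson sampling; pull(i) returns 0 or 1."""
        alpha = [1] * K  # every arm starts with the neutral prior Beta(1, 1)
        beta = [1] * K
        for _ in range(T):
            # sample one hypothesis per arm and play the arm sampled highest
            samples = [random.betavariate(alpha[i], beta[i]) for i in range(K)]
            i = samples.index(max(samples))
            r = pull(i)
            # Bayesian update: alpha counts successes, beta counts failures
            if r == 1:
                alpha[i] += 1
            else:
                beta[i] += 1
        return alpha, beta

    # Illustrative use: hidden biases θ = (0.3, 0.5, 0.7).
    alpha, beta = thompson_bernoulli(lambda i: int(random.random() < (0.3, 0.5, 0.7)[i]), K=3, T=10000)

The posteriors that come out follow the convention of the next slide: αi = #successes + 1 and βi = #failures + 1.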

slide-135
SLIDE 135

Thompson sampling on a Bernoulli bandit

[Figure: the four posterior PDFs. Arm 1, with posterior Beta(3, 2); Arm 2, Beta(1, 1); Arm 3, Beta(2, 4); Arm 4, Beta(2, 7).]

Posterior PDFs after pulling Arm 1 three times with two successes, Arm 2 zero times, Arm 3 four times with one success, and Arm 4 seven times with one success. (Notice: α = #successes + 1, β = #failures + 1.)
slide-142
SLIDE 142

Adversarial bandits

■ Adversarial behaviour seems to be the norm in multi-agent learning: opponents may change strategies at any time (called shifting gears in poker), for example to the end of being unpredictable. Think of anti-coordination games, such as chicken, or matching pennies.

■ An adversarial bandit (Auer and Cesa-Bianchi, 1998) may use any algorithm to assign rewards.

■ At best, an adversarial bandit is benign. If it is smart [omniscient], it may frustrate a learning algorithm [to the max]. Small exercise: suppose you are a benign adversarial bandit playing against a fictitious player who plays row. How would you play?

          H       T
    H   1, 0   −1, 0
    T  −1, 0    1, 0

■ "Cumulative" algorithms, like ǫ-Greedy, UCB, or Thompson, respond slowly to sudden changes.

slide-148
SLIDE 148

Exp3

■ Exp3 is short for exponential-weight algorithm for exploration and exploitation.

■ Proposed by Auer et al. (2002) in "The nonstochastic multiarmed bandit problem", SIAM Journal on Computing, Vol. 32, Nr. 1.

■ Exp3 may be seen as a volatile Softmax.

■ It is an adversarial algorithm, meaning that it should perform well in environments where payoffs for actions may suddenly change.

■ Idea: maintain a vector of action weights (w1, . . . , wK). Actions are chosen probabilistically, proportional to their weights:

  pk(t) =Def (1 − γ) · wk(t) / ( w1(t) + · · · + wK(t) ) + γ/K,

where 0 ≤ γ ≤ 1 is the egalitarianism factor (check what happens if γ = 0 or γ = 1).
slide-153
SLIDE 153

Exp3 (continued)

So far so good. But how are the weights computed? If

  r̂i(t) =Def  ri(t)/pi(t)   if i is chosen at t,
               0             otherwise,

denotes the estimated reward, i.e., the reward of action i weighed by its surprise (i.e., multiplied by the reciprocal of its probability to occur), then weights are computed as

  wi(t + 1) =Def wi(t) · exp( γ · r̂i(t) / K ).

Important: rewards are supposed to lie in [0, 1]. (Scale payoffs if necessary.) Exp3 is a so-called weak no-regret algorithm, which means that the average regret vanishes almost surely. Both the mixing rule from the previous slide and this weight update appear in the sketch below.
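A minimal Python sketch of Exp3 (not part of the slides; the pull interface and the default γ = 0.1 are illustrative assumptions; pull(i) must return a reward in [0, 1]):

    import math
    import random

    def exp3(pull, K, T, gamma=0.1):
        """Play T rounds of Exp3 on K arms; pull(i) returns a reward in [0, 1]."""
        w = [1.0] * K  # action weights w1, ..., wK
        for _ in range(T):
            total = sum(w)
            # mix the weight-proportional distribution with the uniform one
            p = [(1 - gamma) * w[i] / total + gamma / K for i in range(K)]
            i = random.choices(range(K), weights=p)[0]
            r = pull(i)
            r_hat = r / p[i]  # estimated reward: weigh the reward by its surprise
            w[i] *= math.exp(gamma * r_hat / K)
        return w

Since every pi(t) ≥ γ/K > 0, the surprise factor 1/pi(t) stays bounded, which keeps the weight update stable.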

slide-161
SLIDE 161

Stationary time series

A time series {Xt}t is called strictly stationary if, for each fixed h, the vectors (Xt, Xt+1, . . . , Xt+h) are identically distributed for all t. It is called weakly stationary of order two if it has a fixed mean and, for each fixed h, the covariances Cov(Xt, Xt+h) are equal for all t.

Strict: all statistics are time-independent. (Unrealistic.) Weak: constant mean, variance and covariance; other statistics are free to change. (Common.)

[Figure: true values scattered around their trend (= the optimal prediction), together with the average and the moving average.]
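Weak stationarity can be checked empirically. A small illustration (not part of the slides; all parameters are assumptions): an AR(1) process is weakly stationary, so its mean and lag-1 covariance estimated over an early and a late window should roughly agree.

    import random
    import statistics

    def ar1(n, phi=0.5):
        """Simulate X_t = phi * X_{t-1} + noise, a weakly stationary series."""
        x, out = 0.0, []
        for _ in range(n):
            x = phi * x + random.gauss(0, 1)
            out.append(x)
        return out

    def lag1_cov(xs):
        m = statistics.fmean(xs)
        return statistics.fmean((a - m) * (b - m) for a, b in zip(xs, xs[1:]))

    series = ar1(100000)
    early, late = series[:50000], series[50000:]
    print(statistics.fmean(early), statistics.fmean(late))  # both close to 0
    print(lag1_cov(early), lag1_cov(late))  # both close to phi / (1 - phi**2), about 0.67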

slide-162
SLIDE 162

Non-stationary time series

[Figure: a drifting series of true values with their trend (= the optimal prediction), the average, and the moving average.]

With non-stationary time series, the average (a.k.a. empirical mean), updated as T ← ((n − 1)/n)·T + r/n, and the moving average (a.k.a. rolling average, geometric mean, or exponentially smoothed mean), updated as T ← (1 − γ)·T + γ·r, are bad predictors.
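Both predictors are one-line updates, so the failure is easy to reproduce. A small sketch (not part of the slides; the linear drift and γ = 0.1 are illustrative assumptions):

    import random

    avg, mov, n = 0.0, 0.0, 0
    for t in range(1, 201):
        r = 0.05 * t + random.gauss(0, 1)  # non-stationary payoff: drifts upward
        n += 1
        avg = (n - 1) / n * avg + r / n    # empirical mean
        mov = 0.9 * mov + 0.1 * r          # exponentially smoothed mean, gamma = 0.1
    print(avg, mov)  # current level is about 10; avg lags near 5, mov near 9.5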

slide-165
SLIDE 165

MAB with non-stationary payoffs

[Figure: per-round payoff curves for Arms 1–5, with the actually received payments marked as points.]

The payoffs constitute an unevenly spaced time series, and this, of course, is what the observer sees. True payments per arm per round, had the arm been pulled (solid lines); received payments per arm per round (points). Due to prediction errors, sometimes a "wrong" arm is pulled (verify!).

SLIDE 173

Status of MAB algorithms in MAL

Author: Gerard Vreeswijk. Slides last modified on April 30th, 2020 at 13:45 Multi-agent learning: Multi-armed bandit algorithms, slide 33

■ Lots of theories, methods and

techniques are known to analyse and predict time series.

Regula r, a.k.a. evenly spa ed time series are best understood. This

is probably due to its inherent properties and efforts in finance and stock prediction.

Unevenly spa ed time series are

less well understood. Some techniques:

  • Interpolate empty intervals /

transform to evenly-spaced

  • series. (“Traces” is a Python

library based on this principle.)

  • Techniques that take

irregular time series “as they are” include state space analysis, Kalman filtering, autoregression, and stochastic differential equations, to name a few.

■ Rather than to overengineer

MAB algorithms, for MAL it is perhaps better

to tak e advantage
  • f
the game
  • ntext (own payoff

matrix, opponent moves,

  • pponent’s hypothesized

strategy).
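The interpolation technique is simple enough to spell out. A hedged sketch of linear interpolation onto an even grid in plain Python (a library such as "Traces" automates this; to_even_grid is a hypothetical helper, not that library's API):

    def to_even_grid(points, step):
        """points: sorted (time, value) pairs; returns values at t0, t0+step, ..., tn."""
        (t0, _), (tn, _) = points[0], points[-1]
        out, j, t = [], 0, t0
        while t <= tn:
            while points[j + 1][0] < t:  # advance to the segment bracketing t
                j += 1
            (ta, va), (tb, vb) = points[j], points[j + 1]
            out.append(va + (vb - va) * (t - ta) / (tb - ta))
            t += step
        return out

    # Illustrative use: three unevenly spaced observations, resampled at step 1.0.
    print(to_even_grid([(0.0, 1.0), (0.7, 2.0), (3.0, 0.5)], step=1.0))

Once resampled, the whole toolbox for regular time series applies again.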