Multi-agent learning

Teaching strategies

Gerard Vreeswijk, Intelligent Software Systems, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.

Thursday 18th June, 2020

Plan for Today


Part I: Preliminaries

1. Teacher possesses memory of k = 0 rounds: Bully.
2. Teacher possesses memory of k = 1 round: Godfather.
3. Teacher possesses memory of k > 1 rounds: {lenient, strict} Godfather.
4. Teacher is represented by a finite machine: Godfather++.

Part II: Crandall & Goodrich (2005): SPaM, an algorithm that claims to integrate follower and teacher algorithms.

1. Three points of criticism of Godfather++.
2. Core idea of SPaM: combine teacher and follower capabilities.
3. Notion of guilt to trigger switches between teaching and following.
Literature


Michael L. Littman and Peter Stone (2001). “Leading best-response strategies in repeated games”. Research note.

One of the first papers, if not the first paper, that mentions Bully and Godfather.

Michael L. Littman and Peter Stone (2005). “A polynomial-time Nash equilibrium algorithm for repeated games”. Decision Support Systems, Vol. 39, pp. 55-66.

Paper that describes Godfather++.

Jacob W. Crandall and Michael A. Goodrich (2005). “Learning to teach and follow in repeated games”. In AAAI Workshop on Multiagent Learning, Pittsburgh, PA.

Paper that attempts to combine Fictitious Play and a modified Godfather++ to define an algorithm that “knows” when to teach and when to follow.

Doran Chakraborty and Peter Stone (2008). “Online Multiagent Learning against Memory Bounded Adversaries”. Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Artificial Intelligence Vol. 5212, pp. 211-226.


Taxonomy of possible adversaries


(Taken from Chakraborty and Stone, 2008):

Adversaries
  • Joint-action based
      • k-Markov: 1. Best response; 2. Godfather; 3. Bully
      • Dependent on entire history: 1. Fictitious play; 2. Grim opponent; 3. WoLF-PHC
  • Joint-strategy based
      • Previous-step joint strategy: 1. IGA; 2. WoLF-IGA; 3. ReDVaLeR
      • Entire history of joint strategies: 1. No-regret learners

Bully


Play any strategy that gives you the highest payoff, assuming that your opponent is a mindless follower.

Example of finding a pure Bully strategy (row player's payoff listed first):

         L     M     R
    T  3, 6  8, 6  7, 3
    C  8, 1  6, 3  7, 3
    B  3, 5  9, 2  7, 5

1. Find, for every action of yourself, the best response(s) of your opponent. This yields L and M for T, M and R for C, and L and R for B.
2. For these opponent actions, you will receive 3 and 8 for T, 6 and 7 for C, and 3 and 7 for B. Now choose one, and only one, of the actions with the highest security value. Here that would be C, with security value 6.
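To make the recipe concrete, here is a minimal Python sketch (mine, not from the slides) that computes a pure Bully action for a bimatrix game; the arrays encode the example above, and the names `U_ROW`, `U_COL` and `pure_bully` are invented for illustration.

```python
import numpy as np

# Payoffs of the example above: rows = {T, C, B}, columns = {L, M, R}.
U_ROW = np.array([[3, 8, 7],
                  [8, 6, 7],
                  [3, 9, 7]])
U_COL = np.array([[6, 6, 3],
                  [1, 3, 3],
                  [5, 2, 5]])

def pure_bully(u_me, u_opp, action_names):
    """Return the action maximising the payoff guaranteed against a
    best-responding ("mindless follower") opponent, with its security value."""
    best = None
    for i, name in enumerate(action_names):
        br = np.flatnonzero(u_opp[i] == u_opp[i].max())  # opponent's best responses (ties included)
        security = u_me[i, br].min()                     # worst case over those responses
        if best is None or security > best[1]:
            best = (name, security)
    return best

print(pure_bully(U_ROW, U_COL, "TCB"))  # ('C', 6), as in the example
```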


Idea for an app to learn to play {against} Bully


Play against the computer. At the outset, the computer initialises to either Bully (with a probability of 50%) or pure fictitious play; you cannot see which. After that, the computer does not change strategy. Try to press your regret down within as few rounds as possible.
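A sketch (mine) of how such an app's opponent might be wired up, assuming actions are integer indices; the 50% coin flip and the fixed-for-the-match choice are from the slide, while the class name and interface are invented.

```python
import random
import numpy as np

class AppOpponent:
    """Hidden opponent for the proposed app: Bully or pure fictitious play,
    chosen once by a fair coin and then fixed for the whole match."""

    def __init__(self, u_opp, bully_action):
        self.is_bully = random.random() < 0.5   # hidden 50% coin flip at the outset
        self.bully_action = bully_action        # precomputed pure Bully action
        self.u_opp = u_opp                      # opponent's payoffs: rows = your actions
        self.counts = np.zeros(u_opp.shape[0])  # empirical counts of your actions

    def play(self):
        if self.is_bully:
            return self.bully_action
        if self.counts.sum() == 0:              # fictitious play: arbitrary first move
            return 0
        freq = self.counts / self.counts.sum()
        return int(np.argmax(freq @ self.u_opp))  # best response to your empirical mix

    def observe(self, your_action):
        self.counts[your_action] += 1
```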

Bully: precise definition


Play any strategy that gives you the highest payoff, assuming that your opponent is a mindless follower.

Surprisingly difficult to capture in an exact definition. The notion of best response helps us out:

$$\mathrm{Bully}_i =_{\mathrm{Def}} \arg\max_{s_i \in S_i} \min\{\, u_i(s_i, s_{-i}) \mid s_{-i} \in \mathrm{BR}(s_i) \,\}.$$

  • Rightmost inner part: the best response(s) of the opponent to s_i.
  • Middle inner part (from the min onwards): the payoff guaranteed for bullying the opponent with s_i.
  • Entire formula: choose the s_i that maximises this guaranteed payoff. Recognise the maxmin, i.e. the security value, in this formula!
Bully: precise definition (in parts)


Let BR(s_i) be the set of all best responses to i's strategy s_i:

$$\mathrm{BR}(s_i) =_{\mathrm{Def}} \arg\max_{s_{-i}} \{\, u_{-i}(s_i, s_{-i}) \mid s_{-i} \in S_{-i} \,\}.$$

Let Bully_i(s_i) be the payoff guaranteed for playing s_i against mindless followers (i.e., best responders):

$$\mathrm{Bully}_i(s_i) =_{\mathrm{Def}} \min\{\, u_i(s_i, s_{-i}) \mid s_{-i} \in \mathrm{BR}(s_i) \,\}.$$

The set of bully strategies is formed by:

$$\mathrm{Bully}_i =_{\mathrm{Def}} \arg\max_{s_i \in S_i} \mathrm{Bully}_i(s_i).$$

Bully is stateless (a.k.a. memoryless, i.e., memory of k = 0 rounds), hence keeps playing the same action throughout.

Godfather (Littman and Stone, 2001)


A strategy [a function H → ∆(A) from histories to mixed strategies] that makes its opponent an offer that it cannot refuse.

Capitalises on the Folk theorem for repeated games with (not necessarily SGP) Nash equilibria.

A pair of strategies (s_i, s_{-i}) is called a targetable pair if each player, by playing its half of the pair, gets more than its security value (maxmin).

Godfather chooses a targetable pair.

1. If the opponent plays its half of the targetable pair in one stage, Godfather plays its half in the next stage.
2. Otherwise, Godfather falls back forever to the (mixed) strategy that forces the opponent to achieve at most its security value.

Godfather needs a memory of k = 1 (one round).
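For the Prisoners' dilemma, a minimal sketch (mine, not from the slides) of this teacher, taking mutual cooperation as the targetable pair and defection as the punishment that holds the opponent to its security value; the `play`/`observe` interface is an assumption.

```python
class Godfather:
    """Teacher for the iterated Prisoners' dilemma: offer the targetable
    pair (C, C); punish any deviation forever by defecting, which holds
    the opponent to its security value."""

    def __init__(self, my_half="C", opp_half="C", punishment="D"):
        self.my_half, self.opp_half = my_half, opp_half
        self.punishment = punishment
        self.triggered = False  # set after the first deviation, never reset

    def play(self):
        return self.punishment if self.triggered else self.my_half

    def observe(self, opp_action):
        if opp_action != self.opp_half:
            self.triggered = True  # fall back forever
```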

Folk theorem for NE in repeated games


■ Feasible payoffs (striped): payoff combos that can be obtained by jointly repeating patterns of actions (more accurately: patterns of action profiles).

■ Enforceable payoffs (shaded): no one goes below their minmax.

Theorem. If (x, y) is both feasible and enforceable, then (x, y) is the payoff in a Nash equilibrium of the infinitely repeated G with average payoffs. Conversely, if (x, y) is the payoff in any Nash equilibrium of the infinitely repeated G with average payoffs, then (x, y) is enforceable.

[Figure: payoff space with both axes running from 1 to 5, showing the feasible (striped) and enforceable (shaded) regions and the point (3, 3).]
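As a small illustration (mine, not from the slides), enforceability can be checked directly from the payoff matrices if we restrict the punisher to pure strategies; the general definition allows mixed minmax strategies, which would require a linear program.

```python
import numpy as np

def pure_minmax(u_me):
    """My minmax value when the punisher is restricted to pure strategies:
    for each opponent action (column) take my best response, then let the
    opponent pick the worst such column."""
    return u_me.max(axis=0).min()

def enforceable(x, y, u_row, u_col):
    """(x, y) is enforceable if neither player is pushed below minmax."""
    return x >= pure_minmax(u_row) and y >= pure_minmax(u_col.T)

# Prisoners' dilemma payoffs; both minmax values are 1, so the
# mutual-cooperation payoff (3, 3) from the figure is enforceable.
u_row = np.array([[3, 0],
                  [5, 1]])
print(enforceable(3, 3, u_row, u_row.T))  # True
```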

Variations on Godfather with memory k > 1


(Taken from Chakraborty and Stone, 2008):

Godfather-lenient plays its part of a targetable pair if, within the last k actions, the opponent played its own half of the pair at least once. Otherwise it executes the threat (but no longer forever).

Godfather-strict plays its part of a targetable pair if, within the last k actions, the opponent always played its own half of the pair.
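A minimal sketch (mine) of the two variants; the only difference is an `any` versus an `all` over the last-k window. Class and parameter names are invented.

```python
from collections import deque

class GodfatherK:
    """Godfather with a sliding window over the opponent's last k actions.
    lenient=True: cooperate if the opponent played its half at least once;
    lenient=False (strict): only if it played its half every time."""

    def __init__(self, k, lenient, my_half="C", opp_half="C", punishment="D"):
        self.window = deque(maxlen=k)
        self.lenient = lenient
        self.my_half, self.opp_half = my_half, opp_half
        self.punishment = punishment

    def play(self):
        if not self.window:  # nothing observed yet: open with the offer
            return self.my_half
        hits = [a == self.opp_half for a in self.window]
        ok = any(hits) if self.lenient else all(hits)
        return self.my_half if ok else self.punishment

    def observe(self, opp_action):
        self.window.append(opp_action)
```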

Godfather++ (Littman & Stone, 2005)


The name “Godfather++” is due to Crandall (2005).

Capitalises on the Folk theorem for repeated games with (not necessarily SGP) Nash equilibria.

Godfather++ is a polynomial-time algorithm for constructing a finite state machine. This FSM represents a strategy which plays a Nash equilibrium of a repeated 2-player game with averaged payoffs.

  • Not for finitely repeated games.
  • Not for infinitely repeated games with discounted payoffs.
  • Not for n-player games, n > 2.

Michael L. Littman and Peter Stone (2005). “A polynomial-time Nash equilibrium algorithm for repeated games”. Decision Support Systems, Vol. 39, pp. 55-66.

Finite machine for “two tits for tat”


[Figure: a finite state machine with a start state playing C and two further states playing D; transitions are labelled with action profiles such as (C, C) and (D, C), with “∗” edges for all other profiles.]

A finite state machine for the Prisoners' dilemma.

Personal actions determine states. Action profiles determine transitions between states. The “∗” represents an “else”, in the sense of “all other action profiles”.
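The machine can be coded as a transition table keyed by action profiles, with “∗” as the default edge. The sketch below is my reconstruction of two tits for tat; the state names and exact edge labels are assumptions.

```python
# Transition table: state -> {action profile: next state}, with "*" as "else".
# States are named by the action the machine plays in them (D1, D2 both play D).
TWO_TITS_FOR_TAT = {
    "C":  {("C", "C"): "C", "*": "D1"},  # any defection starts the punishment
    "D1": {"*": "D2"},                   # first retaliatory defection
    "D2": {("D", "C"): "C", "*": "D1"},  # forgive a cooperator, else re-punish
}

def step(state, my_action, opp_action, machine=TWO_TITS_FOR_TAT):
    edges = machine[state]
    return edges.get((my_action, opp_action), edges["*"])

# One run: the opponent defects once and then cooperates.
state = "C"
for opp in ["D", "C", "C", "C"]:
    mine = state[0]        # the first letter of the state name is its action
    print(mine, opp)
    state = step(state, mine, opp)
# Plays C, then exactly two D's, then returns to C.
```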

The use of counting nodes


[Figure: a counting node, drawn as a single node labelled “c, a_i” with an (a_i, a_{-i}) exit above and a “∗” exit below; it abbreviates a chain of c ordinary nodes that each play a_i, linked by (a_i, a_{-i}) transitions, with “∗” transitions collecting deviations.]

Upon entry:

  • If the action profile (a_i, a_{-i}) is played exactly c times, then take the exit above.
  • If the column player deviates in round d, keep playing a_i for the remaining c − (d + 1) rounds. Finally, take the exit below.

Because integers up to c can be expressed in (roughly) log c bits, the size of the finite machine is polynomial in log c.
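A counting node can be simulated with a counter instead of c unrolled states, which is exactly the log c compression. A sketch (mine; the exit and interface names are invented):

```python
class CountingNode:
    """Simulate a counting node with a log-c-bit counter instead of c
    unrolled states. Plays a_i for c rounds, then reports which exit to take:
    'above' if the profile (a_i, a_opp) occurred every round, else 'below'."""

    def __init__(self, c, a_i, a_opp):
        self.remaining = c
        self.a_i, self.a_opp = a_i, a_opp
        self.deviated = False

    def play(self):
        return self.a_i

    def observe(self, opp_action):
        """Return 'above' or 'below' when the count runs out, else None."""
        if opp_action != self.a_opp:
            self.deviated = True   # keep playing a_i for the remaining rounds
        self.remaining -= 1
        if self.remaining == 0:
            return "below" if self.deviated else "above"
        return None
```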
Pair of strategies that is a Nash equilibrium in a repeated game


[Figure: two mirrored finite state machines, one per player. Row's machine: counting nodes a1 (r1 rounds) and a2 (r2 rounds) linked by (a1, b1) and (a2, b2) transitions, with “∗” transitions into a punishment node αrow held for max{βrow, βcol} rounds. Col's machine is the mirror image with b1, b2 and αcol.]

Nodes a1 and a2 are the actions that row must play (in sync with col): first r1 × a1, then r2 × a2, then r1 × a1, etc.

If the opponent deviates, then retaliate with αrow for max{βrow, βcol} rounds.

The two automata always run in sync, no matter who deviates first. It can (easily) be deduced that, for each player, deviating at any node is detrimental ⇒ a Nash equilibrium of the repeated game.
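A sketch (mine) of one player's side of the pair: cycle through the coordinated blocks and, on deviation, hold the minmax action for the punishment period. The convention that the cycle restarts after punishment is my assumption.

```python
class FolkMachine:
    """One player's half of the equilibrium pair: cycle through the
    coordinated blocks; on any deviation, hold the minmax action for
    punish_rounds = max(beta_row, beta_col) rounds, then restart."""

    def __init__(self, my_blocks, opp_blocks, punish_action, punish_rounds):
        # my_blocks = [(a1, r1), (a2, r2)]; opp_blocks = [(b1, r1), (b2, r2)].
        self.my_seq = [a for a, r in my_blocks for _ in range(r)]
        self.opp_seq = [b for b, r in opp_blocks for _ in range(r)]
        self.t = 0
        self.punish_action = punish_action
        self.punish_rounds = punish_rounds
        self.punish_left = 0

    def play(self):
        if self.punish_left > 0:
            return self.punish_action
        return self.my_seq[self.t % len(self.my_seq)]

    def observe(self, opp_action):
        if self.punish_left > 0:
            self.punish_left -= 1
            if self.punish_left == 0:
                self.t = 0  # restart the coordinated cycle after punishing
        elif opp_action != self.opp_seq[self.t % len(self.opp_seq)]:
            self.punish_left = self.punish_rounds
        else:
            self.t += 1
```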

The devil and the details...


The point is that all parameters can be determined analytically, in polynomial time.

1. The coordinated action profiles (a1, b1), (a2, b2) and their durations of play r1, r2. Nash says: take the strategy pair (s1, s2) that maximises the product of the players' advantages. This pair can be obtained (or at least approximated) by playing the convex combination

$$\frac{r_1}{r_1 + r_2}\,(a_1, b_1) + \frac{r_2}{r_1 + r_2}\,(a_2, b_2)$$

for r1, r2 not too large. The pair (s1, s2) is obtained by looping through (A²)² (all pairs of pairs of actions).

2. The strategy and duration of punishment (αrow, αcol and βrow, βcol, respectively).

  • αrow and αcol are the minmax strategies of the stage game.
  • βrow and βcol depend on turning points to “get even”. These are determined by (i) the average payoff for cooperating and (ii) an upper bound on the largest possible payoff from a single round of freeriding.
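A brute-force sketch (mine) of the parameter search: loop through (A²)² and small durations, maximising the product of advantages over the minmax values. The paper derives r1, r2 analytically; the small grid over durations here is only for illustration.

```python
import itertools
import numpy as np

def best_targetable_alternation(u_row, u_col, mm_row, mm_col, r_max=5):
    """Loop through (A^2)^2 and small durations (r1, r2), maximising the
    product of the players' advantages over their minmax values."""
    profiles = list(itertools.product(range(u_row.shape[0]),
                                      range(u_row.shape[1])))
    best, best_val = None, -np.inf
    for p1, p2, r1, r2 in itertools.product(profiles, profiles,
                                            range(1, r_max + 1),
                                            range(1, r_max + 1)):
        w = r1 / (r1 + r2)
        x = w * u_row[p1] + (1 - w) * u_row[p2]  # row's average payoff
        y = w * u_col[p1] + (1 - w) * u_col[p2]  # col's average payoff
        val = (x - mm_row) * (y - mm_col)        # product of advantages
        if x > mm_row and y > mm_col and val > best_val:
            best, best_val = ((p1, r1), (p2, r2)), val
    return best, best_val

# Prisoners' dilemma: the search settles on repeating (C, C) = profile (0, 0).
u_row = np.array([[3, 0], [5, 1]])
print(best_targetable_alternation(u_row, u_row.T, 1, 1))
```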


Part II: Crandall & Goodrich (2005)
