

SLIDE 1

Multi-agent learning

Satisficing strategies

Ronald Chu, Geertjan van Vliet, Technical Artificial Intelligence, Universiteit Utrecht. Last update: Thursday 24th March, 2011 at 12:53h.

SLIDE 2

Outline

  • What is satisficing
  • Satisficing in the repeated prisoner’s dilemma (RPD)
  • Satisficing in the multi-agent social dilemma (MASD)

Stimpson et al. (2001): Satisficing and Learning Cooperation in the Prisoner’s Dilemma
Stimpson et al. (2003): Learning to Cooperate in a Social Dilemma: A Satisficing Approach to Bargaining

SLIDE 3

Satisficing (1)

Optimize: choose the best available option.
Satisfice: choose an option that meets a certain aspiration level. Such an option does not have to be unique or in any way the best.

Why satisficing?

  • No information needed except:

– The available actions
– The payoff of the last action

  • Aspiration level is adaptive

SLIDE 4

Satisficing (2)

At time t player A has a strategy pair (At, αt)

  • Action At ∈ {C, D}
  • Aspiration level αt
  • Payoff Rt

Strategy is updated each round

  • At+1 = At iff Rt ≥ αt, otherwise the action is flipped (At+1 ≠ At)
  • αt+1 = λαt + (1 − λ)Rt where 0 ≤ λ ≤ 1
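
A minimal sketch of this update rule in Python (the 'C'/'D' encoding and the function name are illustrative, not from the paper):

```python
def satisficing_step(action, aspiration, payoff, lam):
    """One satisficing update for a two-action game.

    Keep the action if the payoff met the aspiration level, otherwise
    flip to the other action; in both cases relax the aspiration toward
    the observed payoff with learning rate lam (0 <= lam <= 1).
    """
    if payoff >= aspiration:
        next_action = action                          # satisfied: repeat
    else:
        next_action = 'D' if action == 'C' else 'C'   # unsatisfied: flip
    next_aspiration = lam * aspiration + (1 - lam) * payoff
    return next_action, next_aspiration
```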

SLIDE 5

Satisficing (3)

 t  TFT  At  Rt   αt        t  TFT  At  Rt   αt
 0   C   C    3  4.00      10   C   D    4  1.72
 1   C   D    4  3.50      11   D   D    2  2.86
 2   D   D    2  3.75      12   D   C    1  2.43
 3   D   C    1  2.88      13   C   D    4  1.71
 4   C   D    4  1.94      14   D   D    2  2.86
 5   D   D    2  2.97      15   D   C    1  2.43
 6   D   C    1  2.48      16   C   D    4  1.71
 7   C   D    4  1.74      17   D   D    2  2.86
 8   D   D    2  2.87      18   D   C    1  2.43
 9   D   C    1  2.44      19   C   D    4  1.71

Satisficing strategy (with λ = 0.5) against a tit-for-tat strategy.
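The trace above can be reproduced with a short simulation. The payoffs (DC = 4, CC = 3, DD = 2, CD = 1, from the satisficing player's perspective) are read off the table:

```python
# Payoff to the satisficing player A: (A's action, opponent's action) -> reward.
PAYOFF = {('C', 'C'): 3, ('D', 'C'): 4, ('D', 'D'): 2, ('C', 'D'): 1}

def run_vs_tit_for_tat(a='C', alpha=4.0, lam=0.5, rounds=20):
    tft = 'C'                            # tit-for-tat opens cooperatively
    for t in range(rounds):
        r = PAYOFF[(a, tft)]
        print(f"t={t:2d}  tft={tft}  A={a}  R={r}  alpha={alpha:.2f}")
        tft_next = a                     # tit-for-tat copies A's last move
        if r < alpha:                    # unsatisfied: flip the action
            a = 'D' if a == 'C' else 'C'
        alpha = lam * alpha + (1 - lam) * r
        tft = tft_next

run_vs_tit_for_tat()
```

The run settles into the three-round cycle visible in the right half of the table: with these payoffs and λ = 0.5, satisficing never locks into cooperation against tit-for-tat.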

SLIDE 6

Repeated Prisoner’s Dilemma (1)

A two-player two-action social dilemma

  • Initial research focuses on Nash equilibrium
  • Mutual cooperation is rational in the repeated prisoner’s dilemma (Axelrod, 1984)

  • Research usually assumes that the agent knows:

– the structure of the game
– the decisions of the opponent(s)
– the payoffs of the opponent(s)
– that opponents’ actions affect the outcomes

SLIDE 7

Repeated Prisoner’s Dilemma (2)

Extend notation for a two-player game:

  • second player has strategy pair (Bt, βt)
  • both players have the same learning rate λ

Payoff matrix for the PD is generalized:

  • σ payoff for mutual cooperation
  • δ payoff for mutual defection
  • 0 < δ < σ < 1
  • 0.5 < σ (so mutual cooperation beats alternating exploitation, which averages (1 + 0)/2)

          C          D
C      (σ, σ)     (0, 1)
D      (1, 0)     (δ, δ)

(Row player’s payoff listed first.)

SLIDE 8

RPD Experiment

Several possible outcomes:

  • Convergence to a fixed strategy
  • Convergence to some action cycle
  • No convergence.

Stimpson et al. ran 5,000 runs of the repeated PD, with payoffs, initial actions, initial aspirations, and learning rates drawn uniformly at random from bounded ranges.
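
A sketch of such an experiment (the exact sampling ranges and run length are assumptions, not taken from the paper):

```python
import random

def rpd_run(sigma, delta, a, b, alpha, beta, lam, rounds=1_000):
    """Two satisficing players in the generalized PD; returns the final action pair."""
    payoff = {('C', 'C'): sigma, ('C', 'D'): 0.0,   # own action listed first
              ('D', 'C'): 1.0,   ('D', 'D'): delta}
    for _ in range(rounds):
        ra, rb = payoff[(a, b)], payoff[(b, a)]
        na = a if ra >= alpha else ('D' if a == 'C' else 'C')
        nb = b if rb >= beta else ('D' if b == 'C' else 'C')
        alpha = lam * alpha + (1 - lam) * ra
        beta = lam * beta + (1 - lam) * rb
        a, b = na, nb
    return a, b

runs, cc = 5_000, 0
for _ in range(runs):
    delta = random.uniform(0.0, 0.5)                 # 0 < delta < sigma
    sigma = random.uniform(0.5, 1.0)                 # 0.5 < sigma < 1
    cc += rpd_run(sigma, delta,
                  random.choice('CD'), random.choice('CD'),
                  alpha=random.random(), beta=random.random(),
                  lam=random.uniform(0.5, 0.99)) == ('C', 'C')
print(f"ended in CC in {100 * cc / runs:.1f}% of runs")
```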

SLIDE 9

RPD Experiment: results

Observed outcomes:

  • 74% CC
  • 25% DD-DC-DD-CD
  • 1% DD
  • 0% DD-CC-DC

There are several parameters influencing convergence:

  • Payoffs
  • Initial aspirations
  • Initial actions
  • Learning rate

SLIDE 10

RPD Experiment: payoffs

SLIDE 11

RPD Experiment: initial aspirations

SLIDE 12

RPD Experiment: initial actions

Convergence to mutual cooperation by initial action pair:

  • 81.6% CC
  • 81.6% DD
  • 73.7% Random
  • 66.7% DC or CD

SLIDE 13

RPD Experiment: learning rate

SLIDE 14

RPD Experiment: conclusion (1)

The satisficing strategy converges to mutual cooperation in the RPD given:

  • 1. Big difference between mutual cooperation and defection payoffs
  • 2. High initial aspirations
  • 3. Similar initial behavior
  • 4. Slow learning rate

Stimpson et al. ran 5,000 runs with these parameters and observed 100% convergence to mutual cooperation.

SLIDE 15

RPD Experiment: conclusion (2)

In our 5,000 runs there was 94.1% convergence to mutual cooperation (94.8% with a maximum of 100,000 rounds).

SLIDE 16

Multi-agent social dilemma

  • Introduction
  • Satisficing algorithm

SLIDE 17

Multi-agent social dilemma (1)

Basic characteristics

  • Choice between selfish goal or group goal
  • Benefits from both group goal and selfish goal
  • Multi-action, multi-agent (more than 2x2)
  • Repeated game
  • Individual defection is the best option as long as other agents contribute

SLIDE 18

Multi-agent social dilemma (2)

Game structure.

  • We have M + 1 actions and N agents (N = |A|)
  • Each agent i ∈ A contributes ci units to the group, ci ∈ ℕ, 0 ≤ ci ≤ M
  • Reward received:

Ri(c) = kg (∑j∈A cj) + ks (M − ci)

  • Dynamics depend on the weight kg of the group goal versus the weight ks of the selfish goal; both are assumed constant and the same for all agents
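
A direct transcription of this reward in Python (the example values, and the normalization kg = 1/(NM), ks = k/M introduced on later slides, are illustrative):

```python
def masd_reward(i, c, M, kg, ks):
    """Reward for agent i: group term over everyone's contributions
    plus a selfish term for the units agent i kept for itself."""
    return kg * sum(c) + ks * (M - c[i])

# Hypothetical example: N = 3 agents with M = 10 units each; agent 2 free-rides.
N, M, k = 3, 10, 0.6
c = [10, 10, 0]
for i in range(N):
    print(i, masd_reward(i, c, M, kg=1 / (N * M), ks=k / M))
# Agents 0 and 1 get 0.67; the free-riding agent 2 gets 1.27.
```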

SLIDE 19

MASD Satisficing algorithm (1)

  • Ideally converges to (M, . . . , M)
  • All agents need to be satisficed to converge to an action profile
  • One agent will give up playing M if another changes strategy
  • Works best with:

– Initial aspirations higher than the best possible reward
– Slow learning rate

SLIDE 20

MASD Satisficing algorithm (2)

At time t, player i ∈ A has a strategy pair (Aᵢᵗ, αᵢᵗ)

  • Action Aᵢᵗ ∈ {0, . . . , M}
  • Aspiration level αᵢᵗ
  • Payoff Rᵢᵗ resulting from its strategy at t − 1

Strategy is updated each round

  • Aᵢᵗ⁺¹ = Aᵢᵗ iff Rᵢᵗ ≥ αᵢᵗ, otherwise a new action is chosen uniformly at random
  • αᵢᵗ⁺¹ = λαᵢᵗ + (1 − λ)Rᵢᵗ
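
A sketch of one synchronous round of this update for all agents (function and variable names are illustrative):

```python
import random

def masd_round(actions, aspirations, rewards, M, lam):
    """Satisficing update for every agent, given last round's rewards."""
    next_actions = [a if r >= asp else random.randint(0, M)   # keep or re-draw
                    for a, asp, r in zip(actions, aspirations, rewards)]
    next_aspirations = [lam * asp + (1 - lam) * r             # relax aspiration
                        for asp, r in zip(aspirations, rewards)]
    return next_actions, next_aspirations
```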

SLIDE 21

MASD Satisficing algorithm: example

Example run with M = 10, k = 0.6, and λ = 0.99.

SLIDE 22

Break

SLIDE 23

MASD Reward function

Choosing kg and ks.

  • Ri(c) = kg (∑j∈A cj) + ks (M − ci)
  • Make the reward range independent of N and M:

    kg = 1/(NM),  ks = 1/M

  • Introduce a weight factor k for the selfish goal, so that ks = k/M.

SLIDE 24

MASD Reward function

Interesting values for k, given kg = 1/(NM) and ks = k/M:

  • Goals are equally important when kg = ks ⇔ k = 1/N.
  • When k > 1, the selfish goal is always preferred by any agent (Exercise).
  • We study 1/N < k < 1, which means that ks > kg.

SLIDE 25

MASD Reward function

Final reward function.

  • Insert the new constants and write c−i = ∑j∈A\{i} cj for the contribution of the other agents:

    Ri(c) = (ci + c−i)/(NM) + (k/M)(M − ci)    (1)

  • Dividing by (1 − k) and dropping the constant k:

    Ri(c) = ((1 − kN) ci + c−i) / (NM (1 − k))    (2)

  • With this normalization, full cooperation gives each agent reward 1 and full defection gives 0.
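
A quick numeric check (sample values arbitrary) that (2) is the increasing affine transformation ((1) − k)/(1 − k) of (1), so both reward functions induce the same preferences:

```python
def r_raw(ci, c_rest, N, M, k):        # equation (1)
    return (ci + c_rest) / (N * M) + (k / M) * (M - ci)

def r_norm(ci, c_rest, N, M, k):       # equation (2)
    return ((1 - k * N) * ci + c_rest) / (N * M * (1 - k))

N, M, k = 4, 10, 0.6
for ci in range(M + 1):
    for c_rest in range((N - 1) * M + 1):
        assert abs(r_norm(ci, c_rest, N, M, k)
                   - (r_raw(ci, c_rest, N, M, k) - k) / (1 - k)) < 1e-12
print("equation (2) == (equation (1) - k) / (1 - k) on all profiles")
```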

SLIDE 26

MASD Properties

  • When do we have the best individual reward?
  • When do we have the best/worst average reward (over agents) ?
  • Is there a Nash Equilibrium?
  • What is the Pareto optimal solution and why? (Exercise)

SLIDE 27

MASD Properties

  • Using the reward function

    Ri(c) = (ci + c−i)/(NM) + (k/M)(M − ci) = kg (ci + c−i) + ks (M − ci)

  • Best individual reward for i: everybody contributes except agent i
  • Best average: everybody contributes, giving Ri = NM/(NM) = 1 for each agent i ∈ A
  • Worst average: nobody contributes, giving Ri = Mk/M = k per agent (1/N < k < 1)
  • Nash equilibrium: nobody contributes, since an increase in contribution returns kg versus a loss of ks > kg
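
These properties can be checked numerically (the small example values are assumed):

```python
def reward(ci, c_rest, N, M, k):
    """MASD reward with kg = 1/(N*M) and ks = k/M."""
    return (ci + c_rest) / (N * M) + (k / M) * (M - ci)

N, M, k = 3, 5, 0.6        # 1/N < k < 1, so ks > kg
# Whatever the others contribute, ci = 0 maximizes the reward, because a unit
# of contribution gains kg but loses ks > kg: full defection is a Nash equilibrium.
for c_rest in (0, M, (N - 1) * M):
    best = max(range(M + 1), key=lambda ci: reward(ci, c_rest, N, M, k))
    print(f"others contribute {c_rest:2d}: best response ci = {best}")
# Full cooperation gives each agent 1; full defection gives k.
print(reward(M, (N - 1) * M, N, M, k), reward(0, 0, N, M, k))
```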

SLIDE 28

Performance criteria

  • 1. Pareto efficiency in self-play
  • 2. Avoiding exploitability by other (selfish) agents.

SLIDE 29

Satisficing: Experiment 1

  • Initial values: Aspirations from the range [1.5,2.0].
  • Variables: M ∈ {0, 1, . . . , 20}, λ ∈ {0.95, 0.99}, k = 0.6
  • N = 2 (?)
  • Measurement: Average reward over all agents over 500 games for each M.
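
A sketch of this experiment (the run length, the use of the raw reward from equation (1), and the subset of M values are assumptions):

```python
import random

def masd_selfplay(N, M, k, lam, rounds=2_000):
    """Self-play MASD; returns the average reward per agent in the final round."""
    acts = [random.randint(0, M) for _ in range(N)]
    asps = [random.uniform(1.5, 2.0) for _ in range(N)]  # initial aspirations
    rews = [0.0] * N
    for _ in range(rounds):
        total = sum(acts)
        rews = [total / (N * M) + (k / M) * (M - a) for a in acts]
        acts = [a if r >= asp else random.randint(0, M)
                for a, asp, r in zip(acts, asps, rews)]
        asps = [lam * asp + (1 - lam) * r for asp, r in zip(asps, rews)]
    return sum(rews) / N

N, k = 2, 0.6
for lam in (0.95, 0.99):
    for M in (1, 5, 10, 20):                     # subset of M = 0..20
        avg = sum(masd_selfplay(N, M, k, lam) for _ in range(500)) / 500
        print(f"lambda={lam}  M={M:2d}  average reward {avg:.3f}")
```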

SLIDE 30

Satisficing: Result 1

SLIDE 31

Observations

  • Faster learning rate gives lower average reward.
  • Increasing M reduces the average reward initially, but it remains fairly high.

SLIDE 32

Explanations

  • Faster learning rate gives lower average reward

    – Larger adaptations, lower aspirations, lower end states

  • Having more actions available reduces the average reward initially, but it remains fairly high

    – States other than full mutual cooperation may occur that still have high average rewards

SLIDE 33

Satisficing: Experiment 2

  • Initial values: Aspirations from the range [Rmax, 2Rmax].
  • Variables: N ∈ {2, 3, . . . , 9}, M ∈ {1, 3, 5}, λ = 0.99
  • k selected randomly from (1/N, 1)
  • Measurement: Average reward over all agents over 500 games for each N, M

SLIDE 34

Satisficing: Result 2

SLIDE 35

Observations

  • Deterioration towards average reward 0.5 in a society of 10 agents with 6 options (agents still contribute)

  • Larger society has lower average reward and thus less cooperation
  • Increase of options reduces average reward (as seen in experiment 1)

SLIDE 36

Explanations

  • Deterioration towards average reward 0.5 in a society of 10 agents with 6 options (agents still contribute)
  • A larger society reduces the average reward and there is less cooperation (on average)

    – The lower average could be associated with lower aspirations
    – Why do lower aspiration levels occur?
    – More agents need to be satisficed ⇒ longer time for adaptation increases the satisficing region ⇒ lower aspiration levels

SLIDE 37

Performance of satisficing in MASD

  • 1. Pareto efficiency in self-play?
  • Convergence to mutual cooperation is likely (at least in some form)
  • 2. Avoiding exploitability by other (selfish) agents?
  • In a society of selfish agents, the algorithm is likely to converge to the single-play Nash equilibrium.

SLIDE 38

Satisficing: Against selfish agents

SLIDE 39

Satisficing: Against selfish agents

  • The part of the reward function regarding the choices of opponents reduces to 0.
  • The selfish part becomes negative for positive contributions.
  • Aspiration levels drop, but stay above the lowest possible reward level.
  • At some point Ri might be higher than the aspiration level,
  • or the algorithm converges towards the lowest level possible.
  • This only shows that convergence to a solution is possible.

SLIDE 40

Satisficing: Against selfish agents

SLIDE 41

Satisficing: Against selfish agents

We know: against selfish agents the algorithm converges to some solution.

  • Under certain conditions the convergence target is the Nash equilibrium.
  • In some time window t0 → t1, the Nash equilibrium is the only good solution.
  • Later, as the aspiration drops further and further, exploitable solutions become satisficing too.

SLIDE 42

Satisficing: Against selfish agents

SLIDE 43

Conclusions

  • The multi-agent social dilemma offers a platform similar to the prisoner’s dilemma for multiple agents and actions
  • The adapted satisficing algorithm often learns mutual cooperation, reverts to the Nash equilibrium against selfish agents, and is therefore not exploitable
  • Other algorithms studied by Stimpson in the paper do not match this performance

SLIDE 44

Discussion: Variations

Stimpson made several choices that may influence the algorithm’s performance or his results:

  • Equal weight vectors for all agents regarding the selfish and group goals
  • Equal learning rate for all agents
  • Uniformly choose new action when not satisficed
  • Very high initial aspiration levels
  • What is the performance against other learning agents?
