SLIDE 1

Implicit Imitation in Multiagent Reinforcement Learning

Bob Price and Craig Boutilier
Slides: Dana Dahlstrom, CSE 254, UCSD, 2002.04.23

SLIDE 2

Overview

  • In imitation, a learner observes a mentor in action.
  • The approach proposed in this paper is to extract a model of the environment from observations of a mentor’s state trajectory.
  • Estimated transition probabilities, combined with the (given) reward function, are used in prioritized sweeping to compute an action-value function and a concomitant policy.
  • Empirical results show this approach yields improved performance and convergence compared to a non-imitating agent using prioritized sweeping alone.

SLIDE 3

Background

  • Other multi-agent learning schemes include:

    – explicit teaching (or demonstration)
    – sharing of privileged information
    – elaborate psychological imitation theory

  • All of these require explicit communication, and usually voluntary cooperation by the mentor.

  • A common thread: the observer explores, guided by the mentor.

SLIDE 4

Implicit Imitation

In implicit imitation, the learner observes the mentor’s state transitions.

  • No demands are made of the mentor beyond ordinary behavior.

    – no requisite cooperation
    – no explicit communication

  • The learner can take advantage of multiple mentors.
  • The learner is not forced to follow in the mentor’s footsteps.

– can learn from both positive and negative examples

  • It isn’t necessary for the learner to know the mentor’s actions.

SLIDE 5

Model

A preliminary assumption: the learner and mentor(s) act concurrently in a single environment, but their actions are noninteracting.

  • Therefore the underlying multi-agent Markov decision process (MMDP) can be factored into separate single-agent MDPs:

    M_o = ⟨S_o, A_o, Pr_o, R_o⟩,  M_m = ⟨S_m, A_m, Pr_m, R_m⟩

SLIDE 6

Assumptions

  • The learner and mentor have identical state spaces: S_o = S_m = S
  • All the mentor’s actions are available to the learner: A_o ⊇ A_m
  • The mentor’s transition probabilities apply to the learner:

    Pr_o(t|s, a) = Pr_m(t|s, a)   ∀s, t ∈ S, ∀a ∈ A_m

  • The learner knows its reward function R_o(s) up front.

SLIDE 7

Further Assumptions

  • The learner can observe the mentor’s state transitions ⟨s, t⟩.
  • The environment constitutes a discounted infinite-horizon context with discount factor γ.

  • An agent’s actions do not influence its reward structure.

SLIDE 8

The Reinforcement Learning Task

The learner’s task is to learn a policy π : S → A_o which maximizes the total discounted reward. The value of a state in this regard is given by the Bellman equation:

    V(s) = R_o(s) + γ max_{a∈A_o} Σ_{t∈S} Pr_o(t|s, a) V(t)    (1)

  • Given samples ⟨s, a, t⟩ the agent could:
    – learn an action-value function directly via Q-learning
    – estimate Pr_o and solve for V in Equation 1 (sketched below)
  • Prioritized sweeping converges on a solution to the Bellman equation as its estimate of Pr_o improves.
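
As a rough illustration of the model-based option above (estimate Pr_o, then back values up via Equation 1), here is a minimal Python sketch. It assumes a tabular setting; the names (V, R_o, Pr_o_hat) are illustrative, not from the paper.

```python
def bellman_backup(s, V, R_o, Pr_o_hat, actions, gamma=0.95):
    """One Bellman backup at state s (Equation 1) using the learner's
    estimated transition model.

    Pr_o_hat[(s, a)] is a dict {successor t: estimated Pr_o(t | s, a)};
    V and R_o are dicts keyed by state.
    Returns the backed-up value and the corresponding greedy action.
    """
    best_val, best_action = float('-inf'), None
    for a in actions:
        q = R_o[s] + gamma * sum(p * V[t] for t, p in Pr_o_hat[(s, a)].items())
        if q > best_val:
            best_val, best_action = q, a
    return best_val, best_action
```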

SLIDE 9

Utilizing Observations

The learner can employ observations of the mentor in different ways:

  • to update its estimate of the transition probabilities Pr_o
  • to determine the order to apply Bellman backups
    – the priority in prioritized sweeping

Other possible uses they mention but don’t explore:

  • to infer a policy directly
  • to directly compute the state-value function, or constraints on it

SLIDE 10

Mentor Transition Probability Estimation

Assuming the mentor uses a stationary, deterministic policy π_m, Pr_m(t|s) = Pr_m(t|s, π_m(s)).

  • In this case, the mentor’s transition probabilities can be estimated by the observed frequency quotient

    Pr_m(t|s) ≈ f_m(s, t) / Σ_{t′∈S} f_m(s, t′)

  • This estimate converges to Pr_m(t|s) for all t if the mentor visits state s infinitely many times.
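
A minimal sketch of this frequency estimator, assuming the learner simply tallies each observed mentor transition ⟨s, t⟩ (class and method names are illustrative):

```python
from collections import defaultdict

class MentorModel:
    """Estimate Pr_m(t | s) from observed mentor transitions."""

    def __init__(self):
        # counts[s][t] is f_m(s, t): the number of observed s -> t transitions
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, s, t):
        self.counts[s][t] += 1

    def prob(self, s, t):
        total = sum(self.counts[s].values())   # sum over t' of f_m(s, t')
        return self.counts[s][t] / total if total else 0.0
```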

SLIDE 11

Observation-Augmented State Value

The following augmented Bellman equation specifies the learner’s state-value function:

    V(s) = R_o(s) + γ max { max_{a∈A_o} Σ_{t∈S} Pr_o(t|s, a) V(t),  Σ_{t∈S} Pr_m(t|s) V(t) }    (2)

  • Since π_m(s) ∈ A_o and Pr_o(t|s, π_m(s)) = Pr_m(t|s), Equation 2 simplifies to Equation 1.
  • Extension to multiple mentors is straightforward.
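
A sketch of the augmented backup in Equation 2, assuming the same dict-based estimates as in the earlier sketches (Pr_m_hat[s] maps successors to the estimated Pr_m(t|s)); with several mentors, the outer max would simply range over one such mentor term per mentor.

```python
def augmented_backup(s, V, R_o, Pr_o_hat, Pr_m_hat, actions, gamma=0.95):
    """Augmented Bellman backup (Equation 2): the mentor's observed transition
    distribution enters the max as one extra candidate term."""
    learner_term = max(
        sum(p * V[t] for t, p in Pr_o_hat[(s, a)].items()) for a in actions
    )
    mentor_term = sum(p * V[t] for t, p in Pr_m_hat[s].items())
    return R_o[s] + gamma * max(learner_term, mentor_term)
```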

SLIDE 12

Confidence Estimation

In practice, the learner must rely on estimates of Pr_o(t|s, a) and Pr_m(t|s). Equation 2 does not account for the unreliability of these estimates.

  • Assume a Dirichlet prior over the parameters of the multinomial distributions Pr_o(t|s, a) and Pr_m(t|s).
  • Use experience and mentor observations to construct lower bounds v⁻ and v⁻_m on V(s) within a suitable confidence interval.
  • If v⁻_m < v⁻, then ignore mentor observations: either the mentor’s policy is suboptimal or confidence in the estimate of Pr_m is too low.
  • This reasoning holds even for stationary stochastic mentor policies.
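
One simple way to realize such a lower bound is a sampling-based sketch, not the authors’ exact procedure: draw transition distributions from the Dirichlet posterior and take a low percentile of the resulting one-step backed-up values. All names and defaults here are illustrative assumptions.

```python
import numpy as np

def backed_up_value_lower_bound(counts, V_next, reward, gamma=0.95,
                                prior=1.0, n_samples=1000, confidence=0.95):
    """Lower confidence bound on a one-step backed-up value.

    counts[i] is the number of observed transitions to the i-th successor
    state, and V_next[i] is that state's current value estimate. Transition
    distributions are sampled from the Dirichlet posterior Dir(counts + prior).
    """
    alpha = np.asarray(counts, dtype=float) + prior
    samples = np.random.dirichlet(alpha, size=n_samples)   # sampled Pr(. | s, a)
    values = reward + gamma * samples @ np.asarray(V_next)
    return float(np.percentile(values, 100 * (1 - confidence)))

# If the mentor-based bound v_m^- falls below the learner's own bound v^-,
# the mentor term is dropped from the augmented backup at that state.
```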

SLIDE 13

Accommodating Action Costs

When the reward function R_o(s, a) depends on the action, how can it be applied to mentor observations without knowing the mentor’s action?

  • Let κ(s) denote an action whose transition distribution at state s has minimum Kullback–Leibler (KL) distance from Pr_m(t|s):

    κ(s) = argmin_{a∈A_o} Σ_{t∈S} Pr_o(t|s, a) log [ Pr_o(t|s, a) / Pr_m(t|s) ]    (3)

  • Using the guessed mentor action κ(s), Equation 2 can be rewritten as:

    V(s) = max { max_{a∈A_o} [ R_o(s, a) + γ Σ_{t∈S} Pr_o(t|s, a) V(t) ],
                 R_o(s, κ(s)) + γ Σ_{t∈S} Pr_m(t|s) V(t) }
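
A sketch of the κ(s) computation in Equation 3, assuming the dict-based estimates used above and a small smoothing constant to keep the logarithms finite (all names are illustrative):

```python
import numpy as np

def kappa(s, Pr_o_hat, Pr_m_hat, actions, states, eps=1e-12):
    """Guess the mentor's action at s: the learner action whose estimated
    transition distribution is closest, in KL divergence, to Pr_m(. | s)."""
    p_m = np.array([Pr_m_hat[s].get(t, 0.0) for t in states]) + eps
    p_m /= p_m.sum()

    def kl(a):
        p_o = np.array([Pr_o_hat[(s, a)].get(t, 0.0) for t in states]) + eps
        p_o /= p_o.sum()
        return float(np.sum(p_o * np.log(p_o / p_m)))   # KL(Pr_o(.|s,a) || Pr_m(.|s))

    return min(actions, key=kl)
```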

SLIDE 14

Focusing

The learner can focus its attention on the states visited by the mentor by doing a (possibly augmented) Bellman backup for each mentor transition.

  • If the mentor visits interesting regions of the state space, the learner’s attention is drawn there.
  • Computational effort is directed toward parts of the state space where the estimated Pr_m(t|s) is changing, and hence where the estimated Pr_o(t|s, a) may change.

  • Computation is focused where the model is likely more accurate.

SLIDE 15

Prioritized Sweeping

  • More than one backup is performed for each transition:

    – A priority queue of state-action pairs is maintained, with the pair whose value would change the most at the head of the queue.
    – When the highest-priority backup is performed, its predecessors may be inserted into the queue (or their priorities may be updated).

  • To incorporate implicit imitation:

    – use augmented backups à la Equation 2 in lieu of the Q-update rule
    – do backups for mentor transitions as well as learner transitions (see the sketch below)
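
A simplified sketch of the sweep loop, assuming a `backup(s)` function such as the augmented backup above and a precomputed predecessor map. For brevity it queues states rather than state-action pairs, and it pushes duplicate queue entries instead of updating priorities in place.

```python
import heapq
from itertools import count

def prioritized_sweep(V, predecessors, backup, seed_states, n_backups=10, theta=1e-4):
    """Perform a fixed budget of prioritized Bellman backups.

    backup(s) returns the new value of state s; predecessors[s] is the set of
    states that can transition into s. Larger value changes get higher priority.
    """
    tie = count()                                        # tie-breaker: never compare states
    queue = [(0.0, next(tie), s) for s in seed_states]   # (negated priority, tie, state)
    heapq.heapify(queue)
    for _ in range(n_backups):
        if not queue:
            break
        _, _, s = heapq.heappop(queue)
        new_v = backup(s)
        delta = abs(new_v - V[s])
        V[s] = new_v
        if delta > theta:                                # big change: revisit predecessors
            for p in predecessors.get(s, ()):
                heapq.heappush(queue, (-delta, next(tie), p))
    return V
```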

SLIDE 16

Implicit Imitation in Q-Learning

  • Augment the action space with a “fictitious” action a_m ∈ A_o.
  • For each transition ⟨s, t⟩ use the update rule:

    Q(s, a) ← (1 − α) Q(s, a) + α [ R_o(t) + γ max_{a′∈A_o} Q(t, a′) ]

    wherein a = a_m for observed mentor transitions, and a is the action performed by the learner otherwise.
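
A sketch of this update, assuming a dict-based Q-table and a sentinel A_MENTOR standing in for the fictitious action a_m (both names are illustrative); `actions` should be the augmented action set so the max ranges over a_m as well.

```python
A_MENTOR = 'a_m'   # fictitious action standing in for "whatever the mentor did"

def q_update(Q, s, a, t, R_o, actions, alpha=0.1, gamma=0.95):
    """One Q-learning update for transition (s, t).

    Q is a dict (or defaultdict) keyed by (state, action). Pass a = A_MENTOR
    for an observed mentor transition; otherwise a is the learner's own action.
    """
    target = R_o[t] + gamma * max(Q[(t, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```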

SLIDE 17

Action Selection

  • An ε-greedy action selection policy ensures exploration:

    – with probability ε, pick an action uniformly at random
    – with probability 1 − ε, pick the greedy action

  • They let ε decay over time.
  • They define the “greedy action” as the a whose estimated distribution Pr_o(t|s, a) has minimum KL distance from Pr_m(t|s).
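
A short sketch of the selection rule; by default the exploiting choice maximizes Q, but the KL-based choice described above (e.g. the kappa function from the earlier sketch) could be passed in instead. Decaying ε over time is left to the caller.

```python
import random

def select_action(s, Q, actions, epsilon, greedy=None):
    """epsilon-greedy action selection over a list of actions."""
    if random.random() < epsilon:
        return random.choice(actions)             # explore uniformly at random
    if greedy is not None:
        return greedy(s)                          # e.g. lambda s: kappa(s, ...)
    return max(actions, key=lambda a: Q[(s, a)])  # default: exploit the Q-values
```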

SLIDE 18

Experimental Setup

They simulate an expert mentor, an imitation learner, and a non-imitation, prioritized sweeping control learner in the same environment.

  • The mentor follows an ε-greedy policy with ε on the order of 0.01.
  • The imitation learner and the control learner use the same parameters, and a fixed number of backups per sample.
  • The environments are stochastic grid worlds with eight-connectivity, but with movement only in the four cardinal directions.

  • All the results shown are averages over 10 runs.

SLIDE 19

[Plot: Imitation, Control, and (Imitation − Control) curves; goals over previous 1000 time steps vs. time step.]

Figure 1: Performance in a 10 × 10 grid world with 10% noisy actions.

SLIDE 20

[Plot: goals (imitation − control) over previous 1000 time steps vs. time step, for 10×10 with 10% noise, 13×13 with 10% noise, and 10×10 with 40% noise.]

Figure 2: Imitation vs. control for different grid-world parameters.

SLIDE 21

[Diagram: grid world with cells labeled +5, +5, +1, +5, +5.]

Figure 3: A grid world with misleading priors.

SLIDE 22

[Plot: Control, Imitation, and (Imitation − Control) curves; goals over previous 1000 time steps vs. time step.]

Figure 4: Performance in the grid world of Figure 3.

SLIDE 23

Figure 5: A “complex maze” grid world.

SLIDE 24

[Plot: Imitation, Control, and (Imitation − Control) curves; goals over previous 1000 time steps vs. time step (× 100,000).]

Figure 6: Performance in the grid world of Figure 5.

SLIDE 25

[Diagram: grid world with cells marked * and regions labeled 1–5.]

Figure 7: A “perilous shortcut” grid world.

SLIDE 26

[Plot: (Imitation − Control), Imitation, and Control curves; goals over previous 1000 time steps vs. time step (× 10,000).]

Figure 8: Performance in the grid world of Figure 7.

SLIDE 27

[Diagram: grid world showing the first mentor’s trajectory, the second mentor’s trajectory, and the task to learn, with regions labeled 1–5.]

Figure 9: A grid world with multiple mentors whose trajectories are different from, but overlapping with, the learner’s target trajectory.

SLIDE 28

[Plot: Control, Imitation, and (Imitation − Control) curves; goals over previous 1000 time steps vs. time step.]

Figure 10: Performance in the grid world of Figure 9.

SLIDE 29

Summary: Assumptions

  • Multiple agents’ actions are noninteracting.
  • The learner and mentor have “similar” capabilities:
    – identical state spaces
    – all actions the mentor can take are available to the learner
    – all of the mentor’s transition probabilities apply to the learner

  • The learner knows its reward function up front.
  • The learner can observe the mentor’s state transitions.

    – for convergence, the observation period is indefinite
    – the mentor’s policy is stationary

  • Agents’ actions do not affect the reward structure.

SLIDE 30

Summary: Results

Implicit imitation via model extraction shows:

  • improvement over standard learning (given an expert mentor)
  • tolerance to noise (Figures 1 and 2)
  • the ability to integrate subskills from multiple mentors (Figure 10)
  • benefits that increase with problem difficulty (Figures 5 and 6)
