- 1. Algorithms for Inverse Reinforcement Learning
- 2. Apprenticeship learning via Inverse Reinforcement Learning
Algorithms for Inverse Reinforcement Learning
Andrew Ng and Stuart Russell
Motivation
- Given: (1) measurements of an agent's
behavior over time, in a variety of circumstances, (2) if needed, measurements of the sensory inputs to that agent; (3) if available, a model of the environment.
- Determine: the reward function being optimized.
Why?
- Reason #1: Computational models for animal
and human learning.
- “In examining animal and human behavior we
must consider the reward function as an unknown to be ascertained through empirical investigation.”
- Particularly true of multiattribute reward functions (e.g. bee foraging: amount of nectar vs. flight time vs. risk from wind/predators)
Why?
- Reason #2: Agent construction.
- “An agent designer [...] may only have a very rough idea of the reward function whose optimization would generate 'desirable' behavior.”
- e.g. “Driving well”
- Apprenticeship learning: is recovering the expert's underlying reward function more “parsimonious” than learning the expert's policy?
Possible applications in multi-agent systems
- In multi-agent adversarial games, learning the opponents' reward functions that guide their actions, in order to devise strategies against them.
- example
- In mechanism design, learning each agent’s
reward function from histories to manipulate its actions.
- and more?
Inverse Reinforcement Learning (1) – MDP Recap
- An MDP is represented as a tuple $(S, A, \{P_{sa}\}, \gamma, R)$
Note: $R$ is bounded in absolute value by $R_{\max}$
- Value function for policy $\pi$:
- Q-function:
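The two definitions, filled in here from the standard MDP formulation the paper uses (the value of a policy from a start state, and the Q-function under reward R):

$$
V^{\pi}(s_1) = \mathbb{E}\big[ R(s_1) + \gamma R(s_2) + \gamma^2 R(s_3) + \cdots \,\big|\, \pi \big]
$$
$$
Q^{\pi}(s, a) = R(s) + \gamma\, \mathbb{E}_{s' \sim P_{sa}}\big[ V^{\pi}(s') \big]
$$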
Inverse Reinforcement Learning (1) – MDP Recap
- Bellman Equation:
- Bellman Optimality:
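Written out in the notation above (standard results, restated here for completeness):

$$
V^{\pi}(s) = R(s) + \gamma \sum_{s'} P_{s\pi(s)}(s')\, V^{\pi}(s')
$$
$$
\pi \text{ is optimal} \;\iff\; \pi(s) \in \arg\max_{a \in A} Q^{\pi}(s, a) \quad \text{for all } s
$$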
Inverse Reinforcement Learning (2) Finite State Space
- Reward function solution set ($a_1$ is the optimal action in every state)
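A sketch of the characterization this slide refers to (from the paper, with $P_a$ denoting the transition matrix of action $a$): the policy $\pi(s) \equiv a_1$ is optimal if and only if

$$
(P_{a_1} - P_a)(I - \gamma P_{a_1})^{-1} R \succeq 0 \quad \text{for all } a \in A \setminus \{a_1\}.
$$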
Inverse Reinforcement Learning (2) Finite State Space
Many reward functions R satisfy the inequality (e.g. R = 0); which one is the best solution?
- 1. Make deviation from the observed policy $\pi$ as costly as possible:
- 2. Make the reward function as simple as possible
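A sketch of how these two heuristics enter the objective ($\lambda$ is an adjustable penalty weight):

$$
\text{maximize} \;\; \sum_{s \in S} \Big( Q^{\pi}(s, a_1) - \max_{a \in A \setminus \{a_1\}} Q^{\pi}(s, a) \Big) \;-\; \lambda \lVert R \rVert_1
$$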
Inverse Reinforcement Learning (2) Finite State Space
- Linear Programming Formulation:
(Diagram: among the actions a1, a2, …, an, which R makes the value under a1 maximally exceed the values under the other actions?)
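A sketch of the resulting linear program (following the paper's formulation, where $P_a(i)$ denotes the $i$-th row of $P_a$ and $N$ is the number of states):

$$
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{N} \min_{a \in A \setminus \{a_1\}} \big\{ (P_{a_1}(i) - P_{a}(i))(I - \gamma P_{a_1})^{-1} R \big\} \;-\; \lambda \lVert R \rVert_1 \\
\text{subject to} \quad & (P_{a_1} - P_a)(I - \gamma P_{a_1})^{-1} R \succeq 0 \quad \forall a \in A \setminus \{a_1\} \\
& |R_i| \le R_{\max}, \quad i = 1, \dots, N
\end{aligned}
$$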
Inverse Reinforcement Learning (3) Large State Space
- Linear approximation of the reward function (in the driving example, basis functions can be collision, staying in the right lane, etc.)
- Let $V_i^{\pi}$ be the value function of policy $\pi$ when the reward is $R = \phi_i$; by linearity, $V^{\pi} = \alpha_1 V_1^{\pi} + \cdots + \alpha_d V_d^{\pi}$
- For R to make $\pi$ optimal:
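Spelled out (a sketch in the paper's notation): with $R(s) = \alpha_1 \phi_1(s) + \cdots + \alpha_d \phi_d(s)$, optimality of $\pi(s) \equiv a_1$ requires

$$
\mathbb{E}_{s' \sim P_{s a_1}}\big[ V^{\pi}(s') \big] \;\ge\; \mathbb{E}_{s' \sim P_{s a}}\big[ V^{\pi}(s') \big] \quad \text{for all } s \in S,\; a \in A \setminus \{a_1\}.
$$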
Inverse Reinforcement Learning (3) Large State Spaces
- In an infinite or very large state space, it is usually not possible to check all constraints:
- Choose a finite subset S0 of all states
- Linear programming formulation: find the $\alpha_i$ that maximize the objective below
- For $x \ge 0$, $p(x) = x$; otherwise $p(x) = 2x$ (constraint violations are penalized more heavily)
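A sketch of this LP (again in the paper's notation, with $\pi(s) \equiv a_1$ the observed policy and $p(\cdot)$ the penalty above):

$$
\begin{aligned}
\text{maximize} \quad & \sum_{s \in S_0} \min_{a \in A \setminus \{a_1\}} \; p\Big( \mathbb{E}_{s' \sim P_{s a_1}}\big[ V^{\pi}(s') \big] - \mathbb{E}_{s' \sim P_{s a}}\big[ V^{\pi}(s') \big] \Big) \\
\text{subject to} \quad & |\alpha_i| \le 1, \quad i = 1, \dots, d
\end{aligned}
$$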
Inverse Reinforcement Learning (4) IRL from Sample Trajectories
- If the policy $\pi^*$ is only accessible through a set of sampled trajectories (e.g. the driving demo in the 2nd paper)
- Assume we start from a dummy state $s_0$ (whose next-state distribution is according to D).
- When the reward is $R = \phi_i$, estimate the value of a trajectory with state sequence $(s_0, s_1, s_2, \dots)$ as:
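A sketch of the Monte-Carlo estimate (in practice averaged over the $m$ observed trajectories):

$$
\hat{V}_i^{\pi}(s_0) = \sum_{t \ge 0} \gamma^{t} \phi_i(s_t), \qquad \hat{V}^{\pi}(s_0) = \alpha_1 \hat{V}_1^{\pi}(s_0) + \cdots + \alpha_d \hat{V}_d^{\pi}(s_0)
$$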
Inverse Reinforcement Learning (4) IRL from Sample Trajectories
- Assume we have some set of policies $\{\pi_1, \dots, \pi_K\}$
- Linear programming formulation (sketched below)
- The above optimization gives a new reward R; we then compute the optimal policy $\pi_{K+1}$ under R and add it to the set of policies
- Reiterate
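A sketch of this LP (it maximizes how much the observed policy $\pi^*$ outperforms the $K$ policies found so far, under the estimated values):

$$
\begin{aligned}
\text{maximize} \quad & \sum_{k=1}^{K} p\Big( \hat{V}^{\pi^*}(s_0) - \hat{V}^{\pi_k}(s_0) \Big) \\
\text{subject to} \quad & |\alpha_i| \le 1, \quad i = 1, \dots, d
\end{aligned}
$$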
Discrete Gridworld Experiment
- 5x5 grid world
- Agent starts in bottom-left square.
- Reward of 1 in the upper-right square.
- Actions = N, W, S, E (30% chance of a random action instead)
Discrete Gridworld Results
Mountain Car Experiment #1
- Car starts in valley, goal is at the top of hill
- Reward is -1 per “step” until goal is reached
- State = car's x-position & velocity (continuous!)
- Function approx. class: all linear combinations of 26 evenly spaced Gaussian-shaped basis functions
Mountain Car Experiment #2
- Goal is in bottom of valley
- Car starts... not sure. Top of hill?
- Reward is 1 in the goal area, 0 elsewhere
- γ = 0.99
- State = car's x-position & velocity (continuous!)
- Function approx. class: all linear combinations of 26 evenly spaced Gaussian-shaped basis functions
Mountain Car Results
(Plots of results for Experiment #1 and Experiment #2)
Continuous Gridworld Experiment
- State space is now [0,1] x [0,1] continuous grid
- Actions: 0.2 movement in any direction + noise
in x and y coordinates of [-0.1,0.1]
- Reward 1 in region [0.8,1] x [0.8,1], 0
elsewhere
- γ = 0.9
- Function approx. class: all linear combinations of a 15x15 array of 2-D Gaussian-shaped basis functions
- m=5000 trajectories of 30 steps each per policy
Continuous Gridworld Results
3%–10% error when comparing the fitted reward's optimal policy with the true optimal policy
However, no significant difference in the quality of the policies (measured using the true reward function)
Apprenticeship Learning via Inverse Reinforcement Learning
Pieter Abbeel & Andrew Y. Ng
Algorithm
- For t = 1, 2, …
Inverse RL step:
Estimate the expert's reward function $R(s) = w^T \phi(s)$ such that under $R(s)$ the expert performs better than all previously found policies $\{\pi_i\}$.
RL step:
Compute the optimal policy $\pi_t$ for the estimated reward $w$.
Courtesy of Pieter Abbeel
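A minimal Python sketch of this loop. The max-margin IRL step (detailed on the next slide) needs a QP solver, so this sketch instead uses the simpler "projection" variant of the IRL step from the same paper; `rl_solver` and `feature_expectations` are hypothetical, environment-specific callables assumed to be supplied by the user.

```python
import numpy as np

def apprenticeship_learning(mu_expert, rl_solver, feature_expectations,
                            initial_policy, epsilon=1e-2, max_iters=30):
    """Sketch of the apprenticeship-learning loop (projection variant).

    mu_expert             -- expert feature expectations mu(pi_E), shape (d,)
    rl_solver(w)          -- hypothetical RL step: returns an optimal policy
                             for the reward R(s) = w . phi(s)
    feature_expectations  -- hypothetical estimator: maps a policy to mu(pi), shape (d,)
    """
    policies = [initial_policy]
    mu = feature_expectations(initial_policy)   # mu(pi_0)
    mu_bar = mu.copy()                          # projected point, mu_bar^(0)
    w = mu_expert - mu_bar
    for _ in range(max_iters):
        w = mu_expert - mu_bar                  # IRL step (projection version)
        if np.linalg.norm(w) <= epsilon:        # expert's feature expectations (nearly) matched
            break
        policy = rl_solver(w)                   # RL step for reward R(s) = w . phi(s)
        mu = feature_expectations(policy)
        policies.append(policy)
        # Project mu_expert onto the line through mu_bar and mu(pi_t);
        # the new mu_bar is the point on that line closest to mu_expert.
        d = mu - mu_bar
        mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d
    return policies, w
```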
Algorithm: IRL step
- The IRL step as an optimization:
$$
\max_{t,\, w : \lVert w \rVert_2 \le 1} \; t \qquad \text{s.t.} \quad V_w(\pi_E) \ge V_w(\pi_i) + t, \quad i = 1, \dots, t-1
$$
- $t$ = margin of the expert's performance over the performance of previously found policies.
$$
V_w(\pi) = \mathbb{E}\Big[\textstyle\sum_t \gamma^t R(s_t) \,\Big|\, \pi\Big] = \mathbb{E}\Big[\textstyle\sum_t \gamma^t w^T \phi(s_t) \,\Big|\, \pi\Big] = w^T\, \mathbb{E}\Big[\textstyle\sum_t \gamma^t \phi(s_t) \,\Big|\, \pi\Big] = w^T \mu(\pi)
$$
- $\mu(\pi) = \mathbb{E}\big[\sum_t \gamma^t \phi(s_t) \,\big|\, \pi\big]$ are the “feature expectations”
Courtesy of Pieter Abbeel
Feature Expectation Closeness and Performance
- If we can find a policy $\pi$ such that $\lVert \mu(\pi_E) - \mu(\pi) \rVert_2 \le \epsilon$,
- then for any underlying reward $R^*(s) = w^{*T} \phi(s)$ with $\lVert w^* \rVert_2 \le 1$, we have that
$$
|V_{w^*}(\pi_E) - V_{w^*}(\pi)| = |w^{*T} \mu(\pi_E) - w^{*T} \mu(\pi)| \;\le\; \lVert w^* \rVert_2 \, \lVert \mu(\pi_E) - \mu(\pi) \rVert_2 \;\le\; \epsilon.
$$
Courtesy of Pieter Abbeel
IRL step as Support Vector Machine
The IRL step finds the maximum-margin hyperplane separating two sets of points: $\mu(\pi_E)$ on one side and the $\mu(\pi_i)$ on the other. The resulting margin $|w^{*T} \mu(\pi_E) - w^{*T} \mu(\pi)| = |V_{w^*}(\pi_E) - V_{w^*}(\pi)|$ is the maximal difference between the expert policy's value function and the best previously found policy's value function.
(Diagram: in the feature-expectation plane $(\mu_1, \mu_2)$, successive points $\mu(\pi_0), \mu(\pi_1), \mu(\pi_2)$ approach $\mu(\pi_E)$, with separating weight vectors $w^{(1)}, w^{(2)}, w^{(3)}$; $U_w(\pi) = w^T \mu(\pi)$.)
Courtesy of Pieter Abbeel
Gridworld Experiment
- 128 x 128 grid world divided into 64 regions,
each of size 16 x 16 (“macrocells”).
- A small number of macrocells have positive
rewards.
- For each macrocell, there is one feature $\Phi_i(s)$ indicating whether state s is in macrocell i.
- Algorithm was also run on the subset of
features Φi(s) that correspond to non-zero rewards.
Gridworld Results
(Plots: performance vs. number of trajectories; distance to expert vs. number of iterations)
Car Driving Experiment
- No explicit reward function at all!
- Expert demonstrates proper policy via 2 min. of
driving time on simulator (1200 data points).
- 5 different “driver types” tried.
- Features: which lane the car is in, distance to
closest car in current lane.
- Algorithm run for 30 iterations; the final policy was hand-picked.
- Movie Time! (Expert left, IRL right)