SLIDE 1
  • 1. Algorithms for Inverse Reinforcement Learning
  • 2. Apprenticeship Learning via Inverse Reinforcement Learning

SLIDE 2

Algorithms for Inverse Reinforcement Learning

Andrew Ng and Stuart Russell

SLIDE 3

Motivation

  • Given: (1) measurements of an agent's behavior over time, in a variety of circumstances; (2) if needed, measurements of the sensory inputs to that agent; (3) if available, a model of the environment.
  • Determine: the reward function being optimized.

SLIDE 4

Why?

  • Reason #1: Computational models for animal and human learning.
  • “In examining animal and human behavior we must consider the reward function as an unknown to be ascertained through empirical investigation.”
  • Particularly true of multi-attribute reward functions (e.g. bee foraging: amount of nectar vs. flight time vs. risk from wind/predators).

SLIDE 5

Why?

  • Reason #2: Agent construction.
  • “An agent designer [...] may only have a very rough idea of the reward function whose optimization would generate 'desirable' behavior.”
  • e.g. “Driving well”
  • Apprenticeship learning: is recovering the expert's underlying reward function more “parsimonious” than learning the expert's policy?

SLIDE 6

Possible applications in multi-agent systems

  • In multi-agent adversarial games, learning opponents’ reward functions that guide their actions, to devise strategies against them.
  • example
  • In mechanism design, learning each agent’s reward function from histories to manipulate its actions.
  • and more?

SLIDE 7

Inverse Reinforcement Learning (1) – MDP Recap

  • An MDP is represented as a tuple (S, A, {P_sa}, γ, R). Note: R is bounded by R_max.
  • Value function for policy π: V^π(s1) = E[ R(s1) + γ R(s2) + γ² R(s3) + ⋯ | π ]
  • Q-function: Q^π(s, a) = R(s) + γ E_{s′∼P_sa}[ V^π(s′) ]
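
For a finite MDP these quantities can be computed exactly by solving the linear Bellman system. A minimal NumPy sketch (the 4-state chain and its transition matrices are made up for illustration, not from the paper):

```python
import numpy as np

# Exact policy evaluation: V^pi solves (I - gamma * P_pi) V = R.
gamma = 0.9
R = np.array([0.0, 0.0, 0.0, 1.0])        # reward depends only on the state
P = {                                      # P[a][s, s'] = Pr(s' | s, a); illustrative chain
    "left":  np.array([[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], float),
    "right": np.array([[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]], float),
}
pi = ["right"] * 4                         # a fixed deterministic policy

P_pi = np.array([P[pi[s]][s] for s in range(4)])
V = np.linalg.solve(np.eye(4) - gamma * P_pi, R)   # Bellman equation in matrix form
Q = {a: R + gamma * P[a] @ V for a in P}           # Q^pi(s,a) = R(s) + gamma * E[V^pi(s')]
print(V, Q["right"])
```
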
SLIDE 8

Inverse Reinforcement Learning (1) – MDP Recap

  • Bellman Equation: V^π(s) = R(s) + γ Σ_{s′} P_{sπ(s)}(s′) V^π(s′)
  • Bellman Optimality: π is optimal iff, for all s, π(s) ∈ argmax_{a∈A} Q^π(s, a)

SLIDE 9

Inverse Reinforcement Learning (2) Finite State Space

  • Reward function solution set (a1 is the optimal action): R is consistent with a1 being optimal in every state iff
    (P_{a1} − P_a)(I − γ P_{a1})^{−1} R ⪰ 0  for all a ∈ A \ {a1}
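
This condition is straightforward to check numerically; a minimal sketch (function name and tolerance are mine):

```python
import numpy as np

def in_solution_set(R, P_a1, P_others, gamma, tol=1e-9):
    """Check (P_a1 - P_a) @ inv(I - gamma * P_a1) @ R >= 0 for every
    non-optimal action a, i.e. whether R makes a1 optimal in every state."""
    inv = np.linalg.inv(np.eye(len(R)) - gamma * P_a1)
    return all(np.all((P_a1 - P_a) @ inv @ R >= -tol) for P_a in P_others)
```
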
SLIDE 10

Inverse Reinforcement Learning (2) Finite State Space

There are many solutions R that satisfy the inequality (e.g. R = 0). Which one might be the best?

  • 1. Make deviation from π as costly as possible: maximize Σ_s ( Q^π(s, a1) − max_{a≠a1} Q^π(s, a) )
  • 2. Make the reward function as simple as possible, e.g. with a −λ‖R‖1 penalty

SLIDE 11

Inverse Reinforcement Learning (2) Finite State Space

  • Linear Programming formulation: choose R to

    maximize  Σ_s min_{a≠a1} { (P_{a1}(s) − P_a(s)) (I − γ P_{a1})^{−1} R } − λ‖R‖1
    s.t.  (P_{a1} − P_a)(I − γ P_{a1})^{−1} R ⪰ 0  for all a ∈ A \ {a1},  |R(s)| ≤ R_max

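A sketch of this LP with scipy.optimize.linprog; the variable split x = [R, t, u] (t the per-state margins, u = |R| for the ℓ1 penalty) is my own encoding of the slide's objective, not code from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def irl_lp(P, a1, gamma, Rmax=1.0, lam=1.0):
    """Finite-state IRL LP. P: dict action -> (n x n) transition matrix;
    a1: the action observed to be optimal in every state."""
    n = P[a1].shape[0]
    inv = np.linalg.inv(np.eye(n) - gamma * P[a1])
    G = [(P[a1] - P[a]) @ inv for a in P if a != a1]   # (G_a R)_s = value margin at s

    rows, b = [], []
    for Ga in G:
        for s in range(n):                 # t_s <= (G_a R)_s for every non-optimal a
            r = np.zeros(3 * n); r[:n] = -Ga[s]; r[n + s] = 1.0
            rows.append(r); b.append(0.0)
        rows.append(np.hstack([-Ga, np.zeros((n, 2 * n))]))   # G_a R >= 0 (solution set)
        b.extend([0.0] * n)
    I, Z = np.eye(n), np.zeros((n, n))
    rows.append(np.hstack([I, Z, -I])); b.extend([0.0] * n)   #  R <= u
    rows.append(np.hstack([-I, Z, -I])); b.extend([0.0] * n)  # -R <= u

    c = np.hstack([np.zeros(n), -np.ones(n), lam * np.ones(n)])  # min -sum(t) + lam*sum(u)
    res = linprog(c, A_ub=np.vstack([np.atleast_2d(r) for r in rows]), b_ub=np.array(b),
                  bounds=[(-Rmax, Rmax)] * n + [(None, None)] * n + [(0.0, None)] * n)
    return res.x[:n]                       # the recovered reward vector
```

Called as, e.g., irl_lp(P, "right", 0.9) with the toy chain from the earlier sketch.
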
SLIDE 12

Inverse Reinforcement Learning (3) Large State Space

  • Linear approximation of the reward function (in the driving example, basis functions can be collision, staying in the right lane, … etc):
    R(s) = α1 φ1(s) + α2 φ2(s) + ⋯ + αd φd(s)
  • Let V_i^π be the value function of policy π when the reward is R = φi. By linearity,
    V^π = α1 V_1^π + α2 V_2^π + ⋯ + αd V_d^π
  • For R to make π optimal:
    E_{s′∼P_{sa1}}[ V^π(s′) ] ≥ E_{s′∼P_{sa}}[ V^π(s′) ]  for all s and all a

SLIDE 13

Inverse Reinforcement Learning (3) Large State Spaces

  • In an infinite or very large state space it is usually not possible to check all constraints, so:
  • Choose a finite subset S0 of all states.
  • Linear Programming formulation: find the αi that maximize
    Σ_{s∈S0} min_{a≠a1} p( E_{s′∼P_{sa1}}[V^π(s′)] − E_{s′∼P_{sa}}[V^π(s′)] ),  |αi| ≤ 1
  • where p(x) = x if x ≥ 0, and p(x) = 2x otherwise (violated constraints are penalized more heavily).
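
The penalty is simple to write down; a minimal sketch (the 2× weight on violations is from the slide, the array layout is my assumption):

```python
import numpy as np

def p(x):
    """p(x) = x for x >= 0, else 2x: violated constraints cost twice as much."""
    return np.where(x >= 0.0, x, 2.0 * x)

# Objective over the sampled subset S0: for each sampled state, penalize the
# worst margin over the non-optimal actions, then sum. `margins` has shape
# (|S0|, |A|-1); each entry is linear in the alphas.
def objective(margins):
    return p(margins.min(axis=1)).sum()
```
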
SLIDE 14

Inverse Reinforcement Learning (4) IRL from Sample Trajectories

  • If π is only accessible through a set of sampled trajectories (e.g. the driving demo in the 2nd paper):
  • Assume we start from a dummy state s0 (whose next-state distribution is according to D).
  • In the case that the reward is R = φi, estimate the value of a trajectory state sequence (s0, s1, s2, …) as
    V̂_i^π(s0) = φi(s0) + γ φi(s1) + γ² φi(s2) + ⋯
    (averaged over the sampled trajectories), so that V̂^π(s0) = Σ_i αi V̂_i^π(s0).

SLIDE 15

Inverse Reinforcement Learning (4) IRL from Sample Trajectories

  • Assume we have some set of policies {π1, …, πk}.
  • Linear Programming formulation: find the αi that maximize
    Σ_k p( V̂^{π*}(s0) − V̂^{πk}(s0) ),  |αi| ≤ 1
  • The above optimization gives a new reward R; we then compute the policy πk+1 that is optimal under R, add it to the set of policies, and reiterate.
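
Putting the slide together as a loop; estimate_vhats, solve_alpha_lp, and optimal_policy are hypothetical stand-ins for the trajectory estimates above, the LP of this slide, and any RL solver:

```python
def sample_based_irl(expert_vhats, estimate_vhats, solve_alpha_lp, optimal_policy,
                     pi0, iters=20):
    """expert_vhats: per-basis estimates Vhat_i(s0) from the expert's trajectories.
    estimate_vhats(pi): the same estimates from trajectories sampled under pi."""
    policies, vhats = [pi0], [estimate_vhats(pi0)]
    alpha = None
    for _ in range(iters):
        # IRL step: make the expert look better than all policies found so far.
        alpha = solve_alpha_lp(expert_vhats, vhats)
        # RL step: R(s) = sum_i alpha_i * phi_i(s); the best response is a new policy.
        pi = optimal_policy(alpha)
        policies.append(pi)
        vhats.append(estimate_vhats(pi))
    return alpha, policies
```
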
SLIDE 16

Discrete Gridworld Experiment

  • 5x5 grid world
  • Agent starts in bottom-left square.
  • Reward of 1 in the upper-right square.
  • Actions = N, W, S, E (30% chance the movement is in a random direction)

SLIDE 17

Discrete Gridworld Results

SLIDE 18

Mountain Car Experiment #1

  • Car starts in the valley; the goal is at the top of the hill
  • Reward is -1 per “step” until the goal is reached
  • State = car's x-position & velocity (continuous!)
  • Function approx. class: all linear combinations of 26 evenly spaced Gaussian-shaped basis functions

SLIDE 19

Mountain Car Experiment #2

  • Goal is in bottom of valley
  • Car starts... not sure. Top of hill?
  • Reward is 1 in the goal area, 0 elsewhere
  • γ = 0.99
  • State = car's x-position & velocity (continuous!)
  • Function approx. class: all linear combinations of 26 evenly spaced Gaussian-shaped basis functions

SLIDE 20

Mountain Car Results

(Figures: results for experiments #1 and #2.)

SLIDE 21

Continuous Gridworld Experiment

  • State space is now the [0,1] x [0,1] continuous grid
  • Actions: 0.2 movement in any direction, plus noise in the x and y coordinates of [-0.1, 0.1]
  • Reward is 1 in the region [0.8,1] x [0.8,1], 0 elsewhere
  • γ = 0.9
  • Function approx. class: all linear combinations of a 15x15 array of 2-D Gaussian-shaped basis functions
  • m = 5000 trajectories of 30 steps each per policy

SLIDE 22

Continuous Gridworld Results

3%–10% error when comparing the fitted reward's optimal policy with the true optimal policy.

However, no significant difference in the quality of the policies (measured using the true reward function).

SLIDE 23

Apprenticeship Learning via Inverse Reinforcement Learning

Pieter Abbeel & Andrew Y. Ng

SLIDE 24

Algorithm

  • For t = 1,2,…

Inverse RL step: Estimate the expert’s reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {πi}.

RL step: Compute the optimal policy πt for the estimated reward w.

Courtesy of Pieter Abbeel
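
A minimal sketch of this outer loop; feature_expectations, rl_solver, and irl_step (the max-margin optimization of the next slide) are problem-specific stand-ins, not the paper's code:

```python
def apprenticeship_learning(mu_expert, feature_expectations, rl_solver, irl_step,
                            pi0, iters=30, eps=1e-3):
    """mu_expert: the expert's feature expectations mu(pi_E)."""
    policies, mus = [pi0], [feature_expectations(pi0)]
    w = None
    for _ in range(iters):
        # Inverse RL step: weights under which the expert beats all found policies.
        w, margin = irl_step(mu_expert, mus)
        if margin < eps:                   # expert's performance is nearly matched
            break
        pi = rl_solver(w)                  # RL step: optimal policy for R = w . phi
        policies.append(pi)
        mus.append(feature_expectations(pi))
    return w, policies
```
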

SLIDE 25

Algorithm: IRL step

  • Maximize t over (w, t), subject to ‖w‖2 ≤ 1
  • s.t. V_w(πE) ≥ V_w(πi) + t,  i = 1, …, t−1
  • t = margin of the expert’s performance over the performance of previously found policies.
  • V_w(π) = E[ Σ_t γ^t R(st) | π ] = E[ Σ_t γ^t w^T φ(st) | π ]
           = w^T E[ Σ_t γ^t φ(st) | π ]
           = w^T μ(π)
  • μ(π) = E[ Σ_t γ^t φ(st) | π ] are the “feature expectations”

Courtesy of Pieter Abbeel
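
The IRL step is a small convex program; a direct transcription of this slide using cvxpy (the library choice is mine):

```python
import cvxpy as cp

def irl_step(mu_expert, mus):
    """Maximize the margin t s.t. w . mu_E >= w . mu_i + t for every found
    policy, subject to ||w||_2 <= 1, as on this slide."""
    w = cp.Variable(len(mu_expert))
    t = cp.Variable()
    constraints = [w @ mu_expert >= w @ mu_i + t for mu_i in mus]
    constraints.append(cp.norm(w, 2) <= 1)
    cp.Problem(cp.Maximize(t), constraints).solve()
    return w.value, t.value
```

This returns the (w, margin) pair consumed by the outer loop sketched after slide 24.
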

SLIDE 26

Feature Expectation Closeness and Performance

  • If we can find a policy π such that ‖μ(πE) − μ(π)‖2 ≤ ε,
  • then for any underlying reward R*(s) = w*^T φ(s) (with ‖w*‖2 ≤ 1),
  • we have that
    |V_w*(πE) − V_w*(π)| = |w*^T μ(πE) − w*^T μ(π)|
                         ≤ ‖w*‖2 ‖μ(πE) − μ(π)‖2   (Cauchy–Schwarz)
                         ≤ ε.

Courtesy of Pieter Abbeel

SLIDE 27

IRL step as Support Vector Machine

The IRL step finds the maximum-margin hyperplane separating μ(πE) from the points μ(πi):
|w*^T μ(πE) − w*^T μ(π)| = |V_w*(πE) − V_w*(π)| = the maximal difference between the expert policy’s value function and the second-best policy’s value function.

SLIDE 28

(Figure: iterates in feature-expectation space — points μ(π0), μ(π1), μ(π2) approaching μ(πE), with successive weight vectors w(1), w(2), w(3), on axes μ1, μ2.)

U_w(π) = w^T μ(π)

Courtesy of Pieter Abbeel

SLIDE 29

Gridworld Experiment

  • 128 x 128 grid world divided into 64 regions, each of size 16 x 16 (“macrocells”).
  • A small number of macrocells have positive rewards.
  • For each macrocell i, there is one feature Φi(s) indicating whether state s is in macrocell i.
  • The algorithm was also run on the subset of features Φi(s) that correspond to non-zero rewards.

SLIDE 30

Gridworld Results

(Figures: performance vs. number of trajectories; distance to expert vs. number of iterations.)

SLIDE 31

Car Driving Experiment

  • No explicit reward function at all!
  • Expert demonstrates the proper policy via 2 min. of driving time on the simulator (1200 data points).
  • 5 different “driver types” tried.
  • Features: which lane the car is in, distance to the closest car in the current lane.
  • Algorithm run for 30 iterations; the final policy was hand-picked.
  • Movie Time! (Expert left, IRL right)

SLIDE 32

Demo-1 Nice

SLIDE 33

Demo-2 Right Lane Nasty

SLIDE 34

Car Driving Results