Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations - PowerPoint PPT Presentation

SLIDE 1

Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations

Daniel Brown*, Wonjoon Goo*, Prabhat Nagarajan, and Scott Niekum

SLIDE 2

Inverse Reinforcement Learning

Current approaches …

  • 1. Can’t do better than the demonstrator. We find a reward function that explains the ranking, allowing for extrapolation.
  • 2. Are hard to scale to complex problems.

SLIDE 3

Inverse Reinforcement Learning

IRL via Ranked Demonstrations

Current approaches …

  • 1. Can’t do better than the demonstrator. We find a reward function that explains the ranking, allowing for extrapolation.
  • 2. Are hard to scale to complex problems.

SLIDE 4

Inverse Reinforcement Learning

Current approaches …

  • 1. Can’t do better than the demonstrator. We find a reward function that explains the ranking, allowing for extrapolation.
  • 2. Are hard to scale to complex problems.

IRL via Ranked Demonstrations

SLIDE 5

Inverse Reinforcement Learning

Current approaches …

  • 1. Can’t do better than the demonstrator. We find a reward function that explains the ranking, allowing for extrapolation.
  • 2. Are hard to scale to complex problems. Inverse Reinforcement Learning becomes standard binary classification (see the loss sketched below).

IRL via Ranked Demonstrations
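
The reduction to classification is the pairwise ranking loss from the T-REX paper, written here up to notation: a parameterized state reward $\hat{r}_\theta$ is trained so that the predicted return of the higher-ranked trajectory wins a softmax over each pair, with $\tau_i \prec \tau_j$ meaning $\tau_j$ is ranked higher.

\[
\mathcal{L}(\theta) = -\sum_{\tau_i \prec \tau_j} \log
\frac{\exp \sum_{s \in \tau_j} \hat{r}_\theta(s)}
     {\exp \sum_{s \in \tau_i} \hat{r}_\theta(s) + \exp \sum_{s \in \tau_j} \hat{r}_\theta(s)}
\]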

SLIDE 6

Trajectory-ranked Reward Extrapolation (T-REX)

SLIDE 7

Trajectory-ranked Reward Extrapolation (T-REX)

SLIDE 8

Trajectory-ranked Reward Extrapolation (T-REX)

Given ranked demonstrations, how do we train the reward function?
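
One way to make this concrete is the minimal PyTorch sketch below. The small MLP RewardNet and the trex_loss helper are illustrative stand-ins (the paper's Atari experiments use a CNN): the network scores each state, the predicted return is the per-state reward summed over a trajectory, and training is ordinary binary classification for which trajectory of a pair is ranked higher.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardNet(nn.Module):
    """Illustrative state-reward network; the paper uses a CNN for Atari."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states):            # states: (T, obs_dim)
        return self.net(states).sum()     # predicted return of the trajectory

def trex_loss(reward_net, traj_low, traj_high):
    """Binary classification: which trajectory of the pair is ranked higher?
    Cross-entropy over predicted returns, with the higher-ranked trajectory
    (index 1) as the correct class."""
    logits = torch.stack([reward_net(traj_low), reward_net(traj_high)])
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
```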

SLIDE 9

Trajectory-ranked Reward Extrapolation (T-REX)

SLIDE 10

Trajectory-ranked Reward Extrapolation (T-REX)

SLIDE 11

Trajectory-ranked Reward Extrapolation (T-REX)

SLIDE 12

Trajectory-ranked Reward Extrapolation (T-REX)

SLIDE 13

Trajectory-ranked Reward Extrapolation (T-REX)

SLIDE 14

Trajectory-ranked Reward Extrapolation (T-REX)

SLIDE 15

Trajectory-ranked Reward Extrapolation (T-REX)

SLIDE 16

Trajectory-ranked Reward Extrapolation (T-REX)

SLIDE 17

Trajectory-ranked Reward Extrapolation (T-REX)

We subsample trajectories to create a large dataset of weakly labeled pairs!
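
A minimal sketch of that subsampling step, assuming ranked_trajs is a list of per-trajectory state arrays sorted worst-to-best (the paper also constrains the relative start times of the two snippets, which this sketch omits):

```python
import random

def sample_snippet_pair(ranked_trajs, snippet_len=50):
    """Draw one weakly labeled pair: the label comes from the full-trajectory
    ranking, so it is only approximately correct for short snippets."""
    i, j = sorted(random.sample(range(len(ranked_trajs)), 2))
    low, high = ranked_trajs[i], ranked_trajs[j]

    def snippet(traj):
        start = random.randint(0, max(0, len(traj) - snippet_len))
        return traj[start:start + snippet_len]

    return snippet(low), snippet(high)    # (lower-ranked, higher-ranked)
```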

SLIDE 18

Trajectory-ranked Reward Extrapolation (T-REX)

  • Simple:
  • IRL as binary classification.
  • No human supervision during policy learning.
  • No inner-loop MDP solver.
  • No inference-time data collection (e.g. GAIL).
  • No action labels required.

SLIDE 19

Trajectory-ranked Reward Extrapolation (T-REX)

  • Simple:
  • IRL as binary classification.
  • No human supervision during policy learning.
  • No inner-loop MDP solver.
  • No inference-time data collection (e.g. GAIL).
  • No action labels required.
  • Scales to high-dimensional tasks (e.g. Atari games).

SLIDE 20

Trajectory-ranked Reward Extrapolation (T-REX)

  • Simple:
  • IRL as binary classification.
  • No human supervision during policy learning.
  • No inner-loop MDP solver.
  • No inference-time data collection (e.g. GAIL).
  • No action labels required.
  • Scales to high-dimensional tasks (e.g. Atari games).
  • Can produce policies much better than the demonstrator (see the end-to-end sketch below).
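
Putting the pieces together, a hypothetical end-to-end sketch that reuses RewardNet, trex_loss, and sample_snippet_pair from the earlier sketches; the synthetic demonstrations, learning rate, and step count are placeholders, not the paper's settings:

```python
import numpy as np
import torch

# Synthetic stand-in for ranked demonstrations (worst to best); in practice
# these come from a suboptimal demonstrator plus preference rankings.
ranked_trajs = [np.random.randn(200, 8).astype(np.float32) for _ in range(12)]

reward_net = RewardNet(obs_dim=8)
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-4)

for step in range(3000):
    low, high = sample_snippet_pair(ranked_trajs)
    loss = trex_loss(reward_net,
                     torch.from_numpy(low), torch.from_numpy(high))
    opt.zero_grad()
    loss.backward()
    opt.step()

# The frozen reward_net then stands in for the environment reward, and a
# standard RL algorithm (the paper uses PPO) optimizes a policy against it,
# with no further human supervision.
```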

SLIDE 21

T-REX Policy Performance

SLIDE 22

T-REX on HalfCheetah

Best demo (88.97) vs. T-REX (143.40)

SLIDE 23

Reward Extrapolation

T-REX can extrapolate beyond the performance of the best demo.

[Plots: HalfCheetah, Hopper, Ant]

SLIDE 24

Results: Atari Games

T-REX outperforms the best demonstration on 7 out of 8 games!

SLIDE 25

T-REX on Enduro

Best demo (84) vs. T-REX (520)

SLIDE 26

Come see our poster @ Pacific Ballroom #47

  • Robust to noisy ranking labels
  • Automatic ranking by watching a learner improve at a task
  • Human demos / ranking labels
  • Reward function visualization

SLIDE 27

Come see our poster @ Pacific Ballroom #47

  • Robust to noisy ranking labels
  • Automatic ranking by watching a learner improve at a task
  • Human demos / ranking labels
  • Reward function visualization

T-REX