

SLIDE 1

On the Feasibility of Learning Human Biases for Reward Inference

Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca Dragan

SLIDE 2

A conversation amongst IRL researchers

SLIDE 3

A conversation amongst IRL researchers

"To deal with suboptimal demos, let's model the human as noisily rational." [Ziebart et al., 2008]

SLIDE 4

A conversation amongst IRL researchers

"Then you are limited to human performance, since you don't know how the human made a mistake." [Christiano, 2015]

SLIDE 5

A conversation amongst IRL researchers

"We can model human biases:"

  • Myopia
  • Hyperbolic time discounting
  • Sparse noise
  • Risk sensitivity

[Evans et al., 2016], [Zheng et al., 2014], [Majumdar et al., 2017]

[Figure: graphical model with state s and reward r feeding into action a]

π(a|s) ∝ exp(β Q(s, a; r))
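As a concrete reference for the equation above, here is a minimal NumPy sketch of the noisily rational (Boltzmann) model; the function name and array shapes are illustrative assumptions, not from the slides:

    import numpy as np

    def boltzmann_policy(Q, beta):
        # pi(a|s) proportional to exp(beta * Q(s, a; r)), computed per state.
        # Q: (num_states, num_actions) action values for a fixed reward r.
        # beta: rationality; beta -> infinity recovers the optimal policy,
        # beta = 0 gives uniformly random actions.
        logits = beta * Q
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        return probs / probs.sum(axis=1, keepdims=True)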

SLIDE 6

A conversation amongst IRL researchers

"Your human model will inevitably be misspecified." [Steinhardt and Evans, 2017]

SLIDE 7

A conversation amongst IRL researchers

"Hmm, maybe we can learn the systematic biases from data? Then we could correct for these biases during IRL."

SLIDE 8

A conversation amongst IRL researchers

"That's impossible without additional assumptions." [Armstrong and Mindermann, 2017]

SLIDE 9

Learning a policy isn’t sufficient

Biases are a part of cognition, and are not in the policy π; they are in the planning algorithm D that created the policy π.

[Figure: graphical model with world w, reward r, planner D, policy π, states s, and actions a]

We consider a multi-task setting so that we can learn D from examples.
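To make the distinction concrete, a minimal sketch under assumed tabular-MDP conventions (the helper names and the myopia example are illustrative; boltzmann_policy is the earlier sketch): the planner D maps a reward to a policy, so a bias coded into D shows up in every policy it produces.

    import numpy as np

    def value_iteration(T, reward, horizon, gamma=0.9):
        # Tabular Q-iteration truncated at `horizon` backups.
        # T: transitions (S, A, S); reward: (S,). A short horizon models myopia.
        S, A, _ = T.shape
        Q = np.zeros((S, A))
        for _ in range(horizon):
            V = Q.max(axis=1)               # greedy state values
            Q = T @ (reward + gamma * V)    # one Bellman backup per iteration
        return Q

    def planner_D(T, reward, horizon=3, beta=1.0):
        # The planner D: reward in, policy out. The bias (myopia, via the
        # truncated horizon) lives here, not in the policy it returns.
        return boltzmann_policy(value_iteration(T, reward, horizon), beta)

Observing the output policy on one task cannot separate the truncated horizon inside planner_D from the reward it was given; across many tasks the horizon is shared while rewards vary, which is the leverage the multi-task setting provides.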

SLIDE 10

Architecture

The planner is a differentiable model, parameterized by θ, that maps a reward R to a policy; the loss measures how well that policy predicts the demonstrations. To learn the biased planner, minimize the loss over θ. To perform IRL, freeze the planner and minimize the loss over R.
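As a sketch of what "minimize over θ" and "minimize over R" might look like in code (the module below is a toy differentiable planner, not the paper's exact architecture; the learned transition table and soft value iteration are assumptions):

    import torch
    import torch.nn as nn

    class DifferentiablePlanner(nn.Module):
        # theta = a learned transition table; soft value iteration maps a
        # reward vector to per-state action logits, so gradients flow both
        # to theta and to the reward.
        def __init__(self, num_states, num_actions, num_iters=10, gamma=0.9):
            super().__init__()
            self.T_logits = nn.Parameter(
                torch.zeros(num_states, num_actions, num_states))
            self.num_iters = num_iters
            self.gamma = gamma

        def forward(self, reward):                 # reward: (num_states,)
            T = self.T_logits.softmax(dim=-1)      # learned transitions (theta)
            Q = torch.zeros(self.T_logits.shape[:2])
            for _ in range(self.num_iters):
                V = torch.logsumexp(Q, dim=1)       # soft backup keeps it smooth
                Q = T @ (reward + self.gamma * V)
            return Q                                # pi(a|s) = softmax(Q[s])

    def demo_nll(planner, reward, states, actions):
        # Negative log-likelihood of demonstrated actions under the planner's
        # predicted policy; minimized over theta or over the reward.
        logits = planner(reward)
        return nn.functional.cross_entropy(logits[states], actions)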

SLIDE 11

Algorithms

Algorithm 1: Some known rewards
  1. On tasks with known rewards, learn the planner
  2. Freeze the planner and learn the reward on the remaining tasks

Algorithm 2: "Near" optimal
  1. Use Algorithm 1 to mimic a simulated optimal agent
  2. Finetune the planner and reward jointly on human demonstrations
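Putting the two phases together, a hedged sketch of Algorithm 1 using the toy planner above (the datasets, step counts, and learning rates are placeholders, not values from the paper):

    planner = DifferentiablePlanner(num_states=25, num_actions=4)

    # Step 1: on tasks with known rewards, learn the planner (theta).
    opt_theta = torch.optim.Adam(planner.parameters(), lr=1e-2)
    for _ in range(200):
        for reward, states, actions in known_reward_tasks:   # assumed dataset
            opt_theta.zero_grad()
            demo_nll(planner, reward, states, actions).backward()
            opt_theta.step()

    # Step 2: freeze the planner; learn the reward on the remaining tasks.
    for p in planner.parameters():
        p.requires_grad_(False)
    R = torch.zeros(25, requires_grad=True)
    opt_R = torch.optim.Adam([R], lr=1e-1)
    for _ in range(200):
        opt_R.zero_grad()
        demo_nll(planner, R, new_states, new_actions).backward()  # assumed demos
        opt_R.step()

Algorithm 2 differs mainly in where the data comes from: first run the same procedure against demonstrations from a simulated optimal agent, then unfreeze θ and finetune it jointly with R on the human demonstrations.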

SLIDE 12

Experiments

We developed five simulated human biases to test our algorithms.
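As one example of how such a bias can be simulated, hyperbolic time discounting replaces the exponential weight γ^t on a reward t steps away with 1/(1 + k·t); a minimal sketch (the constant k and the horizon are illustrative, not the paper's settings):

    import numpy as np

    def hyperbolic_weights(horizon, k=1.0):
        # Weight on a reward t steps in the future: 1 / (1 + k*t).
        # Unlike gamma**t, these weights are time-inconsistent: preferences
        # between two future rewards can flip as they draw closer, which is
        # the bias being simulated.
        t = np.arange(horizon)
        return 1.0 / (1.0 + k * t)

    print(hyperbolic_weights(5))    # -> 1, 0.5, 0.333..., 0.25, 0.2
    print(0.9 ** np.arange(5))      # exponential: 1, 0.9, 0.81, 0.729, 0.6561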

SLIDE 13

(Some) Results

[Chart: results for four learned models: Optimal, Boltzmann, Known rewards (Algorithm 1), "Near" optimal (Algorithm 2)]

Our algorithms perform better on average, compared to a learned Optimal or Boltzmann model... but an exact model of the demonstrator does much better, hitting 98%.

SLIDE 14

Conclusion

Learning systematic biases has the potential to improve reward inference, but differentiable planners need to become significantly better before this will be feasible.