On the Feasibility of Learning Human Biases for Reward Inference


  1. On the Feasibility of Learning Human Biases for Reward Inference
     Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca Dragan

  2. A conversation amongst IRL researchers

  3. A conversation amongst IRL researchers
     [Ziebart et al, 2008] To deal with suboptimal demos, let’s model the human as noisily rational.

  4. [Christiano, 2015] Then you are limited to human performance, since you don’t know how the human made a mistake.

  5. [Evans et al, 2016], [Zheng et al, 2014], [Majumdar et al, 2017] We can model human biases: π(a|s) ∝ e^{β Q(s, a; r)}.
     Examples: myopia, hyperbolic time discounting, sparse noise, risk sensitivity.
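The noisily rational (Boltzmann) model on this slide, π(a|s) ∝ e^{β Q(s, a; r)}, is just a softmax over Q-values scaled by a rationality coefficient β. A minimal sketch (function name and toy Q-values are ours, not from the talk):

```python
import numpy as np

def boltzmann_policy(q_values, beta=1.0):
    """Noisily rational policy: pi(a|s) proportional to exp(beta * Q(s, a; r)).

    q_values: Q(s, a; r) for each action a in a fixed state s.
    beta: rationality coefficient. beta -> infinity recovers the optimal
          policy; beta = 0 gives uniformly random actions.
    """
    logits = beta * np.asarray(q_values, dtype=float)
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Higher beta concentrates probability mass on the highest-Q action.
print(boltzmann_policy([1.0, 2.0, 0.5], beta=0.1))
print(boltzmann_policy([1.0, 2.0, 0.5], beta=10.0))
```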

  6. [Steinhardt and Evans, 2017] Your human model will inevitably be misspecified.

  7. Hmm, maybe we can learn the systematic biases from data? Then we could correct for these biases during IRL.

  8. [Armstrong and Mindermann, 2017] That’s impossible without additional assumptions.

  9. Learning a policy isn’t sufficient
     Biases are a part of cognition: they live in the planning algorithm D that created the policy π, not in the policy itself. We consider a multi-task setting so that we can learn D from examples.

  10. Architecture
      To learn the biased planner, minimize over the planner parameters θ. To perform IRL, minimize over the reward R.

  11. Algorithms
      Algorithm 1 (some known rewards): 1. On tasks with known rewards, learn the planner. 2. Freeze the planner and learn the reward on the remaining tasks.
      Algorithm 2 (”near” optimal): 1. Use Algorithm 1 to mimic a simulated optimal agent. 2. Finetune the planner and reward jointly on human demonstrations.
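Algorithm 1’s two phases can be illustrated with a toy, non-sequential version (every name and number below is illustrative, not from the paper): the simulated “human” is a Boltzmann planner softmax(θ·r) whose rationality θ stands in for the planner parameters. Phase 1 fits θ on a task with a known reward; phase 2 freezes θ and fits the reward on a new task:

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

# Simulated "human": a Boltzmann planner pi(a) = softmax(theta * r) whose
# rationality theta is unknown to us. One task's reward is known.
theta_true = 2.0
r_known = np.array([1.0, 0.0, -1.0])    # task with a known reward
r_hidden = np.array([-0.5, 1.5, 0.0])   # task whose reward we must infer
demo_known = softmax(theta_true * r_known)    # demo action distributions
demo_hidden = softmax(theta_true * r_hidden)

# Phase 1: on the known-reward task, learn the planner parameter theta by
# gradient descent on the cross-entropy (dCE/dlogits = softmax - data).
theta = 1.0
for _ in range(2000):
    g = softmax(theta * r_known) - demo_known
    theta -= 0.1 * g @ r_known            # chain rule: logits = theta * r

# Phase 2: freeze the planner and learn the reward on the remaining task.
r_hat = np.zeros(3)
for _ in range(2000):
    g = softmax(theta * r_hat) - demo_hidden
    r_hat -= 0.1 * theta * g              # chain rule: logits = theta * r

print(theta)   # converges toward theta_true
print(r_hat)   # recovers r_hidden only up to a constant shift
```

Note the shift ambiguity in phase 2: softmax is invariant to adding a constant to all logits, so the reward is identified only up to an additive constant even with the planner frozen.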

  12. Experiments We developed five simulated human biases to test our algorithms.

  13. (Some) Results
      [Plot legend: Optimal, Boltzmann, Known rewards, ”Near” optimal.] Our algorithms perform better on average compared to a learned Optimal or Boltzmann model... but an exact model of the demonstrator does much better, hitting 98%.

  14. Conclusion
      Learning systematic biases has the potential to improve reward inference, but differentiable planners need to become significantly better before this will be feasible.
