On the Feasibility of Learning Human Biases for Reward Inference
Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca Dragan
A conversation amongst IRL researchers
To deal with suboptimal demos, let’s model the human as noisily rational [Ziebart et al, 2008]

π(a | s) ∝ exp(β Q(s, a; r))     (see the code sketch after this conversation)

[Figure: graphical model with nodes for the state s, the reward r, and the action a]

[Christiano, 2015] Then you are limited to human performance, since you don't know how the human made a mistake.

We can model human biases: [Evans et al, 2016], [Zheng et al, 2014], [Majumdar et al, 2017]

[Steinhardt and Evans, 2017] Your human model will inevitably be misspecified.

Hmm, maybe we can learn the systematic biases from data? Then we could correct for these biases during IRL.

[Armstrong and Mindermann, 2017] That's impossible without additional assumptions.
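For concreteness, the noisily rational model from the conversation can be written in a few lines. This is a minimal sketch for a single state with tabular Q-values; the example Q-values and the value of the rationality coefficient β are illustrative assumptions, not values from the paper.

```python
import numpy as np

def boltzmann_policy(q_values, beta=1.0):
    """Noisily rational action distribution: pi(a|s) proportional to exp(beta * Q(s, a; r)).

    q_values: Q(s, a; r) for each action a in one fixed state s.
    beta: rationality coefficient; large beta approaches the optimal policy,
          beta = 0 gives uniformly random actions.
    """
    logits = beta * np.asarray(q_values, dtype=float)
    logits -= logits.max()          # stabilise the exponentials
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: three actions whose Q-values under some reward r are 1.0, 0.5 and 0.0.
print(boltzmann_policy([1.0, 0.5, 0.0], beta=2.0))
```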
We consider a multi-task setting so that we can learn the planner D from examples.

Biases are a part of cognition: they are not in the policy π, but in the planning algorithm D that created the policy π.
To learn the biased planner, minimize the imitation loss over the planner parameters θ. To perform IRL, minimize the same loss over the reward R.
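A minimal PyTorch sketch of these two minimizations (the Planner architecture, tensor shapes, and random stand-in demonstrations are illustrative assumptions, not the paper's planner): the same imitation loss is first minimized over the planner parameters θ, and then, with θ frozen, over an unknown reward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_STATES, N_ACTIONS = 8, 4

class Planner(nn.Module):
    """Toy stand-in for D_theta: maps a reward vector to action logits per state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_STATES, 64), nn.ReLU(),
            nn.Linear(64, N_STATES * N_ACTIONS),
        )

    def forward(self, reward):                        # reward: (N_STATES,)
        return self.net(reward).view(N_STATES, N_ACTIONS)

def imitation_loss(planner, reward, demo_actions):
    """Cross-entropy between the planner's policy for `reward` and the demonstrated actions."""
    return F.cross_entropy(planner(reward), demo_actions)

# --- Learn the biased planner: minimize the loss over theta ----------------
planner = Planner()
opt_theta = torch.optim.Adam(planner.parameters(), lr=1e-2)
known_reward = torch.randn(N_STATES)                  # stand-in for a task with known reward
demo_actions = torch.randint(N_ACTIONS, (N_STATES,))  # stand-in for demonstrations on that task
for _ in range(200):
    opt_theta.zero_grad()
    imitation_loss(planner, known_reward, demo_actions).backward()
    opt_theta.step()

# --- Perform IRL: freeze theta and minimize the same loss over R -----------
for p in planner.parameters():
    p.requires_grad_(False)
inferred_reward = torch.zeros(N_STATES, requires_grad=True)
opt_r = torch.optim.Adam([inferred_reward], lr=1e-2)
new_demo_actions = torch.randint(N_ACTIONS, (N_STATES,))  # demos on a task with unknown reward
for _ in range(200):
    opt_r.zero_grad()
    imitation_loss(planner, inferred_reward, new_demo_actions).backward()
    opt_r.step()
```

The point of the sketch is only that the learned planner is an ordinary differentiable function of the reward, so the same gradient machinery serves both planner learning and reward inference.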
Algorithm 1: Some known rewards
1. On tasks with known rewards, learn the planner.
2. Freeze the planner and learn the reward on the remaining tasks.

Algorithm 2: "Near" optimal
1. Use Algorithm 1 to mimic a simulated optimal agent.
2. Finetune the planner and reward jointly on human demonstrations.
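Structurally, the two algorithms are just different schedules over the same two minimizations. A schematic sketch, where fit_planner, infer_reward, and finetune_jointly are hypothetical stand-ins for optimization loops like the one above (they are not functions from the paper's code):

```python
def algorithm_1(known_reward_tasks, unknown_reward_tasks, fit_planner, infer_reward):
    """Some known rewards: learn the planner on tasks whose rewards are known,
    then freeze it and infer the rewards of the remaining tasks."""
    planner = fit_planner(known_reward_tasks)            # minimize over theta
    rewards = [infer_reward(planner, task)               # minimize over R, planner frozen
               for task in unknown_reward_tasks]
    return planner, rewards


def algorithm_2(simulated_optimal_tasks, human_demo_tasks,
                fit_planner, infer_reward, finetune_jointly):
    """'Near' optimal: initialise by mimicking a simulated optimal agent
    (Algorithm 1 on simulated demonstrations), then finetune the planner and
    the rewards jointly on the human demonstrations."""
    planner, _ = algorithm_1(simulated_optimal_tasks, [], fit_planner, infer_reward)
    return finetune_jointly(planner, human_demo_tasks)   # minimize over theta and R together
```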
We developed five simulated human biases to test our algorithms.
Our algorithms perform better on average than a learned Optimal or Boltzmann model.

[Chart: performance under each assumption about the demonstrator: Optimal, Boltzmann, Known rewards, "Near" optimal]

But an exact model of the demonstrator does much better, hitting 98%.
Learning systematic biases has the potential to improve reward inference, but differentiable planners need to become significantly better before this will be feasible.