Cooperative Inverse Reinforcement Learning Dylan Hadfield-Menell - PowerPoint PPT Presentation

Cooperative Inverse Reinforcement Learning Dylan Hadfield-Menell CS237: Reinforcement Learning May 31, 2017

The Value Alignment Problem Example taken from Eliezer Yudkowsky’s NYU talk

The Value Alignment Problem

Action Selection in Agents: Ideal Observe Update Plan Act Observe Act

Action Selection in Agents: Reality Desired Behavior Observe Act Objective Encoding Challenge: how do we account for errors and failures in the encoding of an objective?

The Value Alignment Problem How do we make sure that the agents we build pursue ends that we actually intend?

Reward Engineering is Hard

What could go wrong? “…a computer-controlled radiation therapy machine….massively overdosed 6 people. These accidents have been describes as the worst in the 35-year history of medical accelerators.”

Reward Engineering is Hard At best, reinforcement learning and similar approaches reduce the problem of generating useful behavior to that of designing a ‘good’ reward function.

Reward Engineering is Hard True (Complicated) Reward Function ∼ R ∗ R Observed (likely incorrect) Reward Function

Why is reward engineering hard? ξ 0 ξ 1 ξ 2 ξ ∗ = argmax r ( ξ ) ξ 4 ξ 5 ξ 3 ξ ∈ Ξ

Why is reward engineering hard? ∼ r ∗ r ξ 0 ξ 1 ξ 2 ξ 4 ξ 5 ξ 3 0 1 2 3 4 5 6 7

Why is reward engineering hard? ∼ r ∗ ξ 0 ξ 1 ξ 2 r ξ 4 ξ 5 ξ 3 ξ 7 ξ 6 0 1 2 3 4 5 6 7

Negative Side Effects ? ? “Get money”

Reward Hacking 2 0 2 0 ? ? 0 1 5 0 “Get points”

Analogy: Computer Security

Solution 1: Blacklist Disallowed Characters Input Clean Text Text

Solution 2: Whitelist Filter of Allowed Characters Input Clean Text Text

Goal Reduce the extent to which system designers have to play whack-a-mole

Inspiration: Pragmatics 🎪 😋 😏 😏 🎪 🎪 😋 😏 😏 😋 😏 😏 “Glasses” “Hat”

Inspiration: Pragmatics 🎪 🎪 😋 😏 😏 😋 😏 😏 “Glasses” “Hat” “Glasses” 🎪 😋 😏 😏

Inspiration: Pragmatics 🎪 🎪 😋 😏 😏 😋 😏 😏 “Glasses” “Hat” “My friend has glasses” “Glasses” 🎪 😋 😏 😏 🎪 😋 😏 😏

Notation ξ trajectory φ features weights ξ 0 ξ 1 ξ 2 w linear reward function ξ 4 ξ 5 ξ 3 R ( ξ ; w ) = w > φ ( ξ )

Literal Reward Interpretation > φ ( ξ ) ⇣ ⇠ ⌘ ⇠ π ( ξ ) ∝ exp w v ξ 4 ξ 5 selects trajectories in proportion to proxy reward evaluation

Designing Reward for Literal Interpretation Assumption: rewarded behavior has high true utility in the training situations

Designing Reward for Literal Interpretation Literal optimizer’s trajectory distribution conditioned on . ∼ w ⇣ ⌘ ⇠ ⇠ E [ w ⇤> φ ( ξ ) | ξ ∼ w | w ⇤ ) ∝ exp P ( π ] True reward received for each trajectory

Inverting Reward Design ∼ ∼ P ( w ∗ | w ) ∝ P ( w | w ∗ ) P ( w ∗ )

Inverting Reward Design ∼ ∼ P ( w ∗ | w ) ∝ P ( w | w ∗ ) P ( w ∗ ) Key Idea: At test time, interpret reward functions in the context of an ‘intended’ situation

Experiment M test Measure how often the agent New state Three types of selects introduced In the states in the trajectories with ‘testing’ MDP training MDP the new state Domain: Lavaland ∼ π

Negative Side Effects   1 0 0 0 ? ? 0 1 0 0   “Get   0 0 1 0 money”   0 0 0 1

Reward Hacking 2 0 2 0 ?   1 1 0 0 0 ? 0 0 1 1 0 0 1   5 0   “Get 0 0 0 0 1   points” 1 0 0 1 0

Challenge: Missing Latent Rewards Proxy reward k = 0 I s µ k function is only trained k = 1 for the state types k = 2 observed during k = 3 training φ s Σ k

Results Sampled-Proxy Sampled-Z MaxEnt Z Mean Proxy 0.68 0.52 0.41 0.4 0.21 0.19 0.15 0.11 0.1 0.07 0.06 0.03 0.01 0.01 0 Negative Side E ff ect Reward Hacking Missing Latent Reward

On the folly of rewarding A and hoping for B “Whether dealing with monkeys, rats, or human beings, it is hardly controversial to state that most organisms seek information concerning what activities are rewarded, and then seek to do (or at least pretend to do) those things, often to the virtual exclusion of activities not rewarded…. Nevertheless, numerous examples exist of reward systems that are fouled up in that behaviors which are rewarded are those which the rewarder is trying to discourage…. ” – Kerr, 1975

The Principal-Agent Problem Agent Principal

A Simple Principal-Agent Problem ■ Principal and Agent negotiate contract ■ Agent selects effort ■ Value generated for principal, wages paid to agent

A Simple Principal Agent Problem

Misaligned Principal Agent Problem Value to Principal Performance Measure [Baker 2002]

Misaligned Principal Agent Problem Scale Alignment [Baker 2002]

Principal Agent vs Value Alignment ■ Incentive Compatibility is a fundamental constraint on (human or artificial) agent behavior ■ PA model has fundamental misalignment because humans have differing objectives ■ Primary source of misalignment in VA is extrapolation Although we may want to view algorithmic restrictions as a fundamental ■ misalignment ■ Recent news: Principal Agent models was awarded the 2016 Nobel prize in Economics

The Value Alignment Problem

Can we intervene? vs Better question: do our agents want us to intervene

The Off-Switch Game

The Off-Switch Game Desired Behavior Disobedient Behavior

A trivial agent that ‘wants’ intervention

The Off Switch Game Desired Behavior Disobedient Behavior Non-Functional Behavior

The Off-Switch Game

The Off-Switch Game Non-Functional Desired Behavior Disobedient Behavior Behavior

Why have an off-switch? Desired Behavior Observe Act Objective The system designer has uncertainty about the correct Encoding objective, this is never represented to the robot! This step might go wrong

The Structure of a Solution Infer the desired behavior Desired from the human’s actions Behavior Observe Human Act Observe World Distribution over Objectives

Inverse Reinforcement Learning ■ Given MDP without reward function ■ Determine Observations of optimal behavior The reward function being optimized [Ng and Russell 2000]

Can we use IRL to infer objectives? Observe Human Desired Behavior Bayesian IRL Observe World Act Inferred Objective Distribution over Objectives

IRL Issue #1 Don’t want the robot to imitate the human

IRL Issue #2: Assumes Human is Oblivious IRL assumes the human is unaware she is being observed one way mirror

IRL Issue #3 Action selection is independent of reward uncertainty Implicit Assumption: Robot gets no more information about the objective

Proposal: Robot Plays Cooperative Game ■ Cooperative Inverse Reinforcement Learning [Hadfield-Menell et al. NIPS 2016] ■ ■ Two players: ■ Both players maximize a shared reward function, but only observes the actual reward signal; only knows a prior distribution on reward functions learns the reward parameters by observing ■

Cooperative Inverse Reinforcement Learning Environment Hadfield-Menell et al. NIPS 2016

The Off-Switch Game

Intuition “Probably better to make coffee, but I should ask the human, just in case I’m wrong” “Probably better to switch off, but I should ask the human, just in case I’m wrong”

Theorem 1 A rational human is a sufficient to incentivize the robot to let itself be switched off

Incentives for the Robot vs vs

Theorem 1: Sufficient Conditions rational

Theorem 2 If the robot knows the utility evaluations in the off switch game with certainty, then a rational human is necessary to incentivize obedient behavior

Conclusion Uncertainty about the objective is crucial to incentivizing cooperative behaviors.

When is obedience a bad idea? vs

Robot Uncertainty vs Human Suboptimality

Incentives for Designers Population statistics on preferences i.e., market research Evidence about preferences from interaction with a particular customer Question: is it a good idea to `lie’ to the agent and tell it that the variance of is ?

Incentives for Designers

Cooperative Inverse Reinforcement Learning Dylan Hadfield-Menell - PowerPoint PPT Presentation

Cooperative Inverse Reinforcement Learning Dylan Hadfield-Menell CS237: Reinforcement Learning May 31, 2017 The Value Alignment Problem Example taken from Eliezer Yudkowskys NYU talk The Value Alignment Problem The Value Alignment Problem

1. Algorithms for Inverse Reinforcement Learning 2. Apprenticeship learning via Inverse

Reinforcement Learning AIMA Chapters: 21.1, 21.2, 21.3. Sutton and Barto, Reinforcement Learning:

Reinforcement Learning Timothy Chou Charlie Tong Vincent Zhuang April 19, 2016 Reinforcement

RL Overview of topics About Reinforcement Learning The Reinforcement Learning Problem

Cooperative Web Caching Cooperative Web Caching Cooperative Caching Cooperative Caching

Reinforcement Learning UMaine COS 470/570 Introduction to AI Why reinforcement learning?

Reinforcement Learning and Simulation-Based Search David Silver Reinforcement Learning and

Reinforcement Learning Reinforcement Learning Reinforcement Learning in a nutshell g Imagine

Safe Reinforcement Learning Philip S. Thomas Stanford CS234: Reinforcement Learning, Guest

Dynamic Inverse Problems: Schmitt Efficient Algorithms and Approximate Inverse Problems

Statistical Inverse Problems and abstract inverse problems examples Instrumental Variables

Inverse Reinforcement Learning CS 294-112: Deep Reinforcement Learning Sergey Levine Todays

Autonomous Navigation CSE 571 Inverse Optimal Control (Inverse Reinforcement Learning) Many

Advanced planning for autonomous vehicles using reinforcement learning and deep inverse

Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from

Cooperative Game Theory Outline Introduction Relationship between Non-cooperative and

Lecture 5: Procedure & Process Models 2017-05-15 Prof. Dr. Andreas Podelski, Dr. Bernd

Process Mining; Concepts and Case Studies Milad Naeimaei (M.A, B.Eng) Ali Bozorgi (Ph.D) 6-Case

Kap. 12 Workflow Management in ERP-Systemen 12.1 Workflow Management: Konzepte 12.2 Einbindung

Strengthening the Rational Closure for Description Logics: an overview Laura Giordano and

EasKey A language for mouse/keyboard operation Keqiu Hu Xiaoyu Huang Jinqi Huang Zongheng Wang

Intro to Electronics Week 4 Intro to Electronics, Week 4 Last updated Oct. 31, 2012 1 Make an

channel and fault attacks Jasper van Woudenberg @jzvw January 10, 2019 1 Our vision

Computer Networks Kurtis Heimerl kheimerl@cs Sixto (Joshua) Rios jrios777@cs Zhitao (Reid)

Cooperative Inverse Reinforcement Learning Dylan Hadfield-Menell - PowerPoint PPT Presentation

Cooperative Inverse Reinforcement Learning Dylan Hadfield-Menell CS237: Reinforcement Learning May 31, 2017 The Value Alignment Problem Example taken from Eliezer Yudkowskys NYU talk The Value Alignment Problem The Value Alignment Problem

1. Algorithms for Inverse Reinforcement Learning 2. Apprenticeship learning via Inverse

Reinforcement Learning AIMA Chapters: 21.1, 21.2, 21.3. Sutton and Barto, Reinforcement Learning:

Reinforcement Learning Timothy Chou Charlie Tong Vincent Zhuang April 19, 2016 Reinforcement

RL Overview of topics About Reinforcement Learning The Reinforcement Learning Problem

Cooperative Web Caching Cooperative Web Caching Cooperative Caching Cooperative Caching

Reinforcement Learning UMaine COS 470/570 Introduction to AI Why reinforcement learning?

Reinforcement Learning and Simulation-Based Search David Silver Reinforcement Learning and

Reinforcement Learning Reinforcement Learning Reinforcement Learning in a nutshell g Imagine

Safe Reinforcement Learning Philip S. Thomas Stanford CS234: Reinforcement Learning, Guest

Dynamic Inverse Problems: Schmitt Efficient Algorithms and Approximate Inverse Problems

Statistical Inverse Problems and abstract inverse problems examples Instrumental Variables

Inverse Reinforcement Learning CS 294-112: Deep Reinforcement Learning Sergey Levine Todays

Autonomous Navigation CSE 571 Inverse Optimal Control (Inverse Reinforcement Learning) Many

Advanced planning for autonomous vehicles using reinforcement learning and deep inverse

Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from

Cooperative Game Theory Outline Introduction Relationship between Non-cooperative and

Lecture 5: Procedure &amp; Process Models 2017-05-15 Prof. Dr. Andreas Podelski, Dr. Bernd

Process Mining; Concepts and Case Studies Milad Naeimaei (M.A, B.Eng) Ali Bozorgi (Ph.D) 6-Case

Kap. 12 Workflow Management in ERP-Systemen 12.1 Workflow Management: Konzepte 12.2 Einbindung

Strengthening the Rational Closure for Description Logics: an overview Laura Giordano and

EasKey A language for mouse/keyboard operation Keqiu Hu Xiaoyu Huang Jinqi Huang Zongheng Wang

Intro to Electronics Week 4 Intro to Electronics, Week 4 Last updated Oct. 31, 2012 1 Make an

channel and fault attacks Jasper van Woudenberg @jzvw January 10, 2019 1 Our vision

Computer Networks Kurtis Heimerl kheimerl@cs Sixto (Joshua) Rios jrios777@cs Zhitao (Reid)

Lecture 5: Procedure & Process Models 2017-05-15 Prof. Dr. Andreas Podelski, Dr. Bernd