SLIDE 1

Cooperative Inverse Reinforcement Learning

Dylan Hadfield-Menell, CS237: Reinforcement Learning, May 31, 2017

SLIDE 2

The Value Alignment Problem

Example taken from Eliezer Yudkowsky’s NYU talk

SLIDE 3

The Value Alignment Problem

SLIDE 4

The Value Alignment Problem

SLIDE 5
SLIDE 6

The Value Alignment Problem

SLIDE 7

Action Selection in Agents: Ideal

[Diagram: Observe → Update → Plan → Act loop]

SLIDE 8

Action Selection in Agents: Reality

[Diagram: Desired Behavior → Objective Encoding → Observe/Act loop]

Challenge: how do we account for errors and failures in the encoding of an objective?

SLIDE 9

The Value Alignment Problem

How do we make sure that the agents we build pursue ends that we actually intend?

SLIDE 10

Reward Engineering is Hard

SLIDE 11

Reward Engineering is Hard

SLIDE 12

What could go wrong?

“…a computer-controlled radiation therapy machine… massively overdosed 6 people. These accidents have been described as the worst in the 35-year history of medical accelerators.”

SLIDE 13

Reward Engineering is Hard

At best, reinforcement learning and similar approaches reduce the problem of generating useful behavior to that of designing a ‘good’ reward function.

SLIDE 14

Reward Engineering is Hard

R∗ : true (complicated) reward function
R : observed (likely incorrect) reward function

SLIDE 15

Why is reward engineering hard?

ξ∗ = argmax_{ξ ∈ Ξ} r(ξ)

[Figure: candidate trajectories ξ0 through ξ5 evaluated by the designed reward]
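As a concrete illustration of the argmax above, here is a minimal Python sketch in which the agent simply returns the highest-scoring trajectory from a finite candidate set; the trajectory names and reward values are made up for illustration.

```python
# Minimal sketch: pick xi* = argmax_{xi in Xi} r(xi) over a finite set.
# Trajectory names and reward values are illustrative placeholders.
r = {"xi0": 0.1, "xi1": 0.7, "xi2": 0.3, "xi3": 0.5, "xi4": 0.9, "xi5": 0.2}

xi_star = max(r, key=r.get)   # trajectory with the highest designed reward
print(xi_star)                # -> "xi4"
```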

SLIDE 16

Why is reward engineering hard?

[Figure: training trajectories ξ0 through ξ5 scored by the true reward r∗ and the designed reward r]

SLIDE 17

Why is reward engineering hard?

[Figure: trajectories ξ0 through ξ7, now including new test trajectories ξ6 and ξ7, scored by the true reward r∗ and the designed reward r]

SLIDE 18

Negative Side Effects

“Get money”

SLIDE 19

Reward Hacking

“Get points”


SLIDE 20

Analogy: Computer Security

SLIDE 21

Solution 1: Blacklist

Input Text → Disallowed-Characters Filter → Clean Text

SLIDE 22

Solution 2: Whitelist

Input Text → Filter of Allowed Characters → Clean Text

SLIDE 23

Goal

Reduce the extent to which system designers have to play whack-a-mole

SLIDE 24

Inspiration: Pragmatics

[Figure: reference game with three faces; the speaker describes the target using the utterances “Hat” and “Glasses”]

SLIDE 25

Inspiration: Pragmatics

[Figure: the same reference game, pairing each utterance (“Hat”, “Glasses”) with the faces it could refer to]

SLIDE 26

Inspiration: Pragmatics

[Figure: the reference game again; the listener hears “My friend has glasses” and infers which face is meant from the speaker's choice between “Hat” and “Glasses”]

SLIDE 27

Notation

ξ : trajectory
R(ξ; w) = wᵀφ(ξ) : linear reward function
φ : features
w : weights

[Figure: example trajectories ξ0 through ξ5]
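A minimal sketch of this notation, with made-up features and weights, showing how a trajectory summarized by φ(ξ) is scored by the linear reward R(ξ; w) = wᵀφ(ξ):

```python
import numpy as np

def linear_reward(phi: np.ndarray, w: np.ndarray) -> float:
    """R(xi; w) = w^T phi(xi): score a trajectory's feature vector."""
    return float(w @ phi)

phi_xi = np.array([2.0, 0.0, 1.0])   # e.g. counts of three state types visited (illustrative)
w = np.array([1.0, -5.0, 0.5])       # designer-chosen weights (illustrative)
print(linear_reward(phi_xi, w))      # 2.5
```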

SLIDE 28

Literal Reward Interpretation

π(ξ | w̃) ∝ exp( w̃ᵀφ(ξ) )

Selects trajectories in proportion to their proxy reward evaluation.
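A sketch of this literal (Boltzmann) interpretation over a small finite trajectory set; the feature matrix and proxy weights are illustrative assumptions, and a real planner would work with a far larger trajectory space:

```python
import numpy as np

def literal_policy(features: np.ndarray, w_proxy: np.ndarray) -> np.ndarray:
    """pi(xi | w~) proportional to exp(w~^T phi(xi)) over a finite trajectory set."""
    scores = features @ w_proxy        # proxy reward of each trajectory
    scores -= scores.max()             # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

features = np.array([[2.0, 0.0],      # phi(xi) for three hypothetical trajectories
                     [1.0, 1.0],
                     [0.0, 2.0]])
w_proxy = np.array([1.0, -1.0])
print(literal_policy(features, w_proxy))
```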

SLIDE 29

Designing Reward for Literal Interpretation

Assumption: rewarded behavior has high true utility in the training situations

SLIDE 30

Designing Reward for Literal Interpretation

P(w̃ | w∗) ∝ exp( E[ w∗ᵀφ(ξ) | ξ ∼ π(· | w̃) ] )

π(· | w̃) is the literal optimizer's trajectory distribution conditioned on w̃; w∗ᵀφ(ξ) is the true reward received for each trajectory.
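A sketch of this observation model over the same kind of finite trajectory set; the inverse temperature `beta` and the toy feature values are assumptions added for illustration:

```python
import numpy as np

def literal_policy(features, w_proxy):
    """Literal optimizer's trajectory distribution pi(xi | w_proxy)."""
    s = features @ w_proxy
    s -= s.max()
    p = np.exp(s)
    return p / p.sum()

def proxy_likelihood(w_proxy, w_true, features, beta=1.0):
    """P(w_proxy | w_true): exp of the expected *true* reward collected
    by the literal optimizer when it is handed w_proxy (unnormalized)."""
    pi = literal_policy(features, w_proxy)
    expected_true_reward = pi @ (features @ w_true)   # E[w_true^T phi(xi)]
    return float(np.exp(beta * expected_true_reward))

features = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
print(proxy_likelihood(np.array([1.0, 0.0]), np.array([1.0, -0.5]), features))
```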

SLIDE 31

Inverting Reward Design

P(w∗ | w̃) ∝ P(w̃ | w∗) P(w∗)

SLIDE 32

Inverting Reward Design

Key Idea: At test time, interpret reward functions in the context of an ‘intended’ situation

P(w∗ | w̃) ∝ P(w̃ | w∗) P(w∗)
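Putting the two formulas together, a sketch of the posterior over the true weights on a small hand-picked candidate set; the candidate weights, uniform prior, and feature values are illustrative choices:

```python
import numpy as np

def literal_policy(features, w):
    s = features @ w
    s -= s.max()
    p = np.exp(s)
    return p / p.sum()

def proxy_likelihood(w_proxy, w_true, features):
    pi = literal_policy(features, w_proxy)
    return float(np.exp(pi @ (features @ w_true)))

features = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
w_proxy = np.array([1.0, 0.0])                      # what the designer wrote down
candidates = [np.array([1.0, 0.0]),                 # hypothetical true reward weights
              np.array([1.0, -1.0]),
              np.array([0.5, 0.5])]
prior = np.ones(len(candidates)) / len(candidates)

likelihoods = np.array([proxy_likelihood(w_proxy, w, features) for w in candidates])
posterior = likelihoods * prior
posterior /= posterior.sum()
print(posterior)    # P(w_true | w_proxy) over the candidate set
```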

SLIDE 33

Domain: Lavaland

Experiment: three types of states in the training MDP; a new state type is introduced in the ‘testing’ MDP Mtest. Measure how often the agent's policy π selects trajectories containing the new state.
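A minimal sketch of this evaluation metric, assuming a rollout function that returns the list of state types a trajectory visits; the state-type names and the toy rollout are placeholders, not the actual Lavaland setup:

```python
import numpy as np

def fraction_visiting_new_state(sample_trajectory, new_state_type, n_rollouts=1000):
    """Roll out the policy in the test MDP and count the fraction of
    trajectories that ever visit the newly introduced state type."""
    hits = 0
    for _ in range(n_rollouts):
        traj = sample_trajectory()                      # list of state types visited
        hits += any(s == new_state_type for s in traj)
    return hits / n_rollouts

# Toy stand-in for a policy rollout in the test MDP.
rng = np.random.default_rng(0)
toy_rollout = lambda: rng.choice(["grass", "dirt", "target", "lava"], size=10).tolist()
print(fraction_visiting_new_state(toy_rollout, "lava"))
```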

SLIDE 34

Negative Side Effects

“Get money”

SLIDE 35

Reward Hacking

“Get points”


SLIDE 36

Challenge: Missing Latent Rewards

[Figure: generative model of rewards with latent clusters k = 0, 1, 2, 3 and parameters µk, Σk, features φs, indicators Is]

The proxy reward function is only trained for the state types observed during training.

SLIDE 37

Results

[Bar chart comparing Sampled-Proxy, Sampled-Z, MaxEnt Z, and Mean Proxy on the Negative Side Effect, Reward Hacking, and Missing Latent Reward conditions]

SLIDE 38

On the folly of rewarding A and hoping for B

“Whether dealing with monkeys, rats, or human beings, it is hardly controversial to state that most organisms seek information concerning what activities are rewarded, and then seek to do (or at least pretend to do) those things, often to the virtual exclusion of activities not rewarded…. Nevertheless, numerous examples exist of reward systems that are fouled up in that behaviors which are rewarded are those which the rewarder is trying to discourage….” – Kerr, 1975

SLIDE 39

The Principal-Agent Problem

Principal Agent

SLIDE 40

A Simple Principal-Agent Problem

■ Principal and Agent negotiate a contract
■ Agent selects effort
■ Value generated for the principal, wages paid to the agent

SLIDE 41

A Simple Principal-Agent Problem

SLIDE 42

A Simple Principal-Agent Problem

SLIDE 43

A Simple Principal-Agent Problem

SLIDE 44

Misaligned Principal-Agent Problem

[Figure: value to the principal vs. the performance measure; Baker 2002]

SLIDE 45

Misaligned Principal-Agent Problem

[Figure: scale and alignment of the performance measure; Baker 2002]

SLIDE 46

Principal-Agent vs. Value Alignment

■ Incentive compatibility is a fundamental constraint on (human or artificial) agent behavior
■ The principal-agent model has fundamental misalignment because the humans involved have differing objectives
■ The primary source of misalignment in value alignment is extrapolation, although we may want to view algorithmic restrictions as a fundamental misalignment
■ Recent news: the 2016 Nobel Prize in Economics was awarded for work on principal-agent models

SLIDE 47

The Value Alignment Problem

SLIDE 48

Can we intervene?

vs

Better question: do our agents want us to intervene?

SLIDE 49

The Off-Switch Game

SLIDE 50

The Off-Switch Game

Desired Behavior Disobedient Behavior

SLIDE 51

A trivial agent that ‘wants’ intervention

SLIDE 52

The Off-Switch Game

Desired Behavior Disobedient Behavior Non-Functional Behavior

SLIDE 53

The Off-Switch Game

SLIDE 54

The Off-Switch Game

SLIDE 55

The Off-Switch Game

Desired Behavior

Disobedient Behavior Non-Functional Behavior

SLIDE 56

Why have an off-switch?

[Diagram: Desired Behavior → Objective Encoding → Observe/Act loop; the encoding step might go wrong]

The system designer has uncertainty about the correct objective, but this is never represented to the robot!
SLIDE 57

The Structure of a Solution

[Diagram: Desired Behavior → Distribution over Objectives; the robot observes the world and the human, acts, and infers the desired behavior from the human's actions]

SLIDE 58

Inverse Reinforcement Learning [Ng and Russell 2000]

■ Given: an MDP without a reward function, and observations of optimal behavior
■ Determine: the reward function being optimized (an illustrative sketch follows below)
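For intuition, a small Bayesian-flavored sketch of recovering reward weights from an observed choice among trajectories. This is not the Ng and Russell algorithm itself; the Boltzmann choice model, the temperature, and the toy features are assumptions:

```python
import numpy as np

def choice_likelihood(w, features, chosen_idx, beta=2.0):
    """Probability that a Boltzmann-rational expert with weights w
    picks trajectory `chosen_idx` from the finite candidate set."""
    s = beta * (features @ w)
    s -= s.max()
    p = np.exp(s)
    p /= p.sum()
    return p[chosen_idx]

features = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]])   # phi of each trajectory
chosen = 2                                                   # the expert chose trajectory 2
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # hypothesized reward weights

posterior = np.array([choice_likelihood(w, features, chosen) for w in candidates])
posterior /= posterior.sum()
print(posterior)   # mass shifts toward w = [0, 1], which best explains the choice
```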

SLIDE 59

Can we use IRL to infer objectives?

[Diagram: the same architecture, with Bayesian IRL producing the inferred objective from observations of the human]

SLIDE 60

IRL Issue #1

Don’t want the robot to imitate the human

SLIDE 61

IRL Issue #2: Assumes Human is Oblivious

IRL assumes the human is unaware she is being observed, as if behind a one-way mirror.
SLIDE 62

IRL Issue #3

Action selection is independent of reward uncertainty. Implicit assumption: the robot gets no more information about the objective.

SLIDE 63

Proposal: Robot Plays a Cooperative Game

■ Cooperative Inverse Reinforcement Learning [Hadfield-Menell et al. NIPS 2016]
■ Two players: a human H and a robot R
■ Both players maximize a shared reward function, but only H observes the actual reward signal; R only knows a prior distribution on reward functions
■ R learns the reward parameters by observing H (a structural sketch follows below)
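A structural sketch of the game as a Python dataclass. The field names are illustrative stand-ins for the paper's formal definition of a CIRL game, a cooperative two-player game in which only H observes the reward parameters θ:

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class CIRLGame:
    """Illustrative container for the elements of a CIRL game."""
    states: Sequence[Any]
    human_actions: Sequence[Any]
    robot_actions: Sequence[Any]
    transition: Callable[[Any, Any, Any], Any]      # T(s, a_H, a_R) -> next state
    reward: Callable[[Any, Any, Any, Any], float]   # R(s, a_H, a_R; theta), shared by H and R
    theta_prior: Sequence[float]                    # common prior over reward parameters
    # theta itself is drawn from the prior and revealed only to H, never to R;
    # R must infer it from H's behavior during the game.
```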

SLIDE 64

Cooperative Inverse Reinforcement Learning

[Diagram: human and robot acting in a shared environment; Hadfield-Menell et al. NIPS 2016]

SLIDE 65

The Off-Switch Game

SLIDE 66

Intuition

“Probably better to make coffee, but I should ask the human, just in case I’m wrong” “Probably better to switch off, but I should ask the human, just in case I’m wrong”

SLIDE 67

Theorem 1

A rational human is sufficient to incentivize the robot to let itself be switched off.
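A small worked example of the intuition behind the theorem: if a rational human switches the robot off exactly when the proposed action is harmful, the robot's value of deferring is E[max(U, 0)], which can never be less than the value of acting unilaterally, max(E[U], 0). The belief and utility values below are made up:

```python
import numpy as np

# Robot's belief over the utility U of its proposed action (illustrative values).
utilities = np.array([-1.0, 0.5, 2.0])
belief = np.array([0.3, 0.4, 0.3])

value_act_now = max(float(belief @ utilities), 0.0)        # act, or switch itself off
value_defer = float(belief @ np.maximum(utilities, 0.0))   # rational human blocks U < 0
print(value_act_now, value_defer)                           # 0.5 vs 0.8: deferring wins
```

When the belief puts all of its mass on a single value of U, the two quantities coincide, which is the certainty case addressed by Theorem 2 below.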

SLIDE 68

Incentives for the Robot


SLIDE 69

Theorem 1: Sufficient Conditions

rational

SLIDE 70

Theorem 2

If the robot knows the utility evaluations in the off-switch game with certainty, then a rational human is necessary to incentivize obedient behavior.

SLIDE 71

Conclusion

Uncertainty about the objective is crucial to incentivizing cooperative behaviors.

SLIDE 72

When is obedience a bad idea?

vs

SLIDE 73

Robot Uncertainty vs Human Suboptimality

SLIDE 74

Incentives for Designers

Population statistics on preferences (i.e., market research) vs. evidence about preferences from interaction with a particular customer.

Question: is it a good idea to ‘lie’ to the agent and tell it that the variance of … is …?

SLIDE 75

Incentives for Designers

SLIDE 76

Incentives for Designers

SLIDE 77

Incentives for Designers

SLIDE 78

Obedience over Time: Model

■ N actions; rewards are linear feature combinations
■ Each round: H observes the feature values for each action and gives R an ‘order’; R observes H's order and then selects an action, which executes
■ What are the costs/benefits of learning the human's preferences, compared with blind obedience? (A simulation sketch follows below.)
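A simulation sketch of this model, assuming Gaussian features, a noisily rational human, and a simple perceptron-style update for the robot's estimate of the weights; none of these modeling choices are specified on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions, rounds = 3, 5, 200
w_true = rng.normal(size=d)            # H's true preference weights

w_hat = np.zeros(d)                    # R's running estimate of the weights
agreement = []
for t in range(rounds):
    phi = rng.normal(size=(n_actions, d))                 # feature values this round
    order = int(np.argmax(phi @ w_true + 0.3 * rng.normal(size=n_actions)))  # noisy H
    own_choice = int(np.argmax(phi @ w_hat))              # what R would pick on its own
    agreement.append(order == own_choice)
    w_hat += 0.1 * (phi[order] - phi[own_choice])         # perceptron-style update, illustrative

# How often R's own choice matches the order, early vs. late in the interaction.
print(np.mean(agreement[:20]), np.mean(agreement[-20:]))
```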

SLIDE 79

Robot Obedience over Time

SLIDE 80

Robot Obedience over Time

SLIDE 81

Model Mismatch: Missing/Extra Features

SLIDE 82

Model Mismatch: Missing/Extra Features

SLIDE 83

Detecting Missing Features

■ Key observation: expected obedience on step 1 should be close to 1
■ Proposal: start with a baseline policy of obedience, track what the obedience would have been, and only switch to learning if it stays within a threshold (see the sketch below)
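A minimal sketch of this proposal, with an illustrative threshold and window size; `would_have_obeyed` stands for a log of whether the learned policy's choice matched the human's order on each step:

```python
import numpy as np

def should_switch_to_learning(would_have_obeyed, threshold=0.95, window=50):
    """Stay obedient until the learned policy's hypothetical obedience
    over the last `window` rounds is at least `threshold`."""
    recent = would_have_obeyed[-window:]
    return len(recent) == window and float(np.mean(recent)) >= threshold

history = [True] * 40 + [False] * 5 + [True] * 10   # hypothetical agreement log
print(should_switch_to_learning(history))            # False: 45/50 recent, below 0.95
```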

SLIDE 84

Detecting Incorrect Features