

  1. Cooperative Inverse Reinforcement Learning Dylan Hadfield-Menell CS237: Reinforcement Learning May 31, 2017

  2. The Value Alignment Problem Example taken from Eliezer Yudkowsky’s NYU talk

  3. The Value Alignment Problem

  4. The Value Alignment Problem

  5. The Value Alignment Problem

  6. Action Selection in Agents: Ideal. (Diagram: an Observe → Update → Plan → Act loop, abstracted as Observe → Act.)

  7. Action Selection in Agents: Reality. (Diagram: Desired Behavior → Objective Encoding → Objective → Observe → Act.) Challenge: how do we account for errors and failures in the encoding of an objective?

  8. The Value Alignment Problem How do we make sure that the agents we build pursue ends that we actually intend?

  9. Reward Engineering is Hard

  10. Reward Engineering is Hard

  11. What could go wrong? “…a computer-controlled radiation therapy machine… massively overdosed 6 people. These accidents have been described as the worst in the 35-year history of medical accelerators.”

  12. Reward Engineering is Hard At best, reinforcement learning and similar approaches reduce the problem of generating useful behavior to that of designing a ‘good’ reward function.

  13. Reward Engineering is Hard. R*: the true (complicated) reward function; R̃: the observed (likely incorrect) reward function.

  14. Why is reward engineering hard? ξ* = argmax_{ξ ∈ Ξ} r(ξ). (Diagram: candidate trajectories ξ0 … ξ5.)

  15. Why is reward engineering hard? (Plot: proxy reward r̃ and true reward r* for trajectories ξ0 … ξ5.)

  16. Why is reward engineering hard? (Plot: proxy reward r̃ and true reward r* for trajectories ξ0 … ξ7, with ξ6 and ξ7 newly added.)

  17. Negative Side Effects. (Illustration; proxy objective: “Get money”.)

  18. Reward Hacking. (Illustration with point values; proxy objective: “Get points”.)

  19. Analogy: Computer Security

  20. Solution 1: Blacklist. (Diagram: Input Text → filter of disallowed characters → Clean Text.)

  21. Solution 2: Whitelist. (Diagram: Input Text → filter of allowed characters → Clean Text.)
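A tiny concrete version of the blacklist/whitelist analogy on slides 20–21, written as a sketch (the character sets and the example input are my own, not from the slides). The point is that a blacklist only removes the failure modes the designer anticipated, while a whitelist fails closed on anything new.

```python
# Blacklist vs. whitelist character filtering (illustrative only).

BLACKLIST = set("<>;")                    # characters the designer thought to disallow
WHITELIST = set("abcdefghijklmnopqrstuvwxyz ")

def blacklist_clean(text: str) -> str:
    """Remove only the explicitly disallowed characters."""
    return "".join(c for c in text if c not in BLACKLIST)

def whitelist_clean(text: str) -> str:
    """Keep only the explicitly allowed characters."""
    return "".join(c for c in text.lower() if c in WHITELIST)

payload = "drop table users; -- '\x00"
print(blacklist_clean(payload))   # unanticipated characters (quote, null byte) slip through
print(whitelist_clean(payload))   # only explicitly allowed characters remain
```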

  22. Goal Reduce the extent to which system designers have to play whack-a-mole

  23. Inspiration: Pragmatics. (Reference-game illustration: a set of faces; candidate utterances “Glasses” and “Hat”.)

  24. Inspiration: Pragmatics. (Reference game continued: interpreting the utterance “Glasses” against the alternative “Hat”.)

  25. Inspiration: Pragmatics. (Reference game continued: utterances “Glasses”, “Hat”, and “My friend has glasses”.)

  26. Notation: ξ — trajectory; φ — features; w — weights; linear reward function R(ξ; w) = wᵀφ(ξ). (Diagram: example trajectories ξ0 … ξ5.)

  27. Literal Reward Interpretation: π(ξ | w̃) ∝ exp(w̃ᵀφ(ξ)) — selects trajectories in proportion to the proxy reward evaluation.
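A minimal sketch of this literal interpretation, assuming a small finite trajectory set with hand-made feature values (all numbers hypothetical): the agent picks trajectories with probability proportional to the exponentiated proxy reward w̃ᵀφ(ξ).

```python
import numpy as np

def literal_policy(Phi, w_proxy):
    """Return pi(xi | w_proxy) proportional to exp(w_proxy^T phi(xi))."""
    scores = Phi @ w_proxy                  # proxy reward of each trajectory
    scores -= scores.max()                  # numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Toy example: 4 trajectories, 3 features each (hypothetical numbers).
Phi = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.5, 0.5, 0.0],
                [0.0, 0.0, 1.0]])
w_proxy = np.array([1.0, 0.5, 0.0])         # designer's (possibly wrong) weights

print(literal_policy(Phi, w_proxy))
```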

  28. Designing Reward for Literal Interpretation Assumption: rewarded behavior has high true utility in the training situations

  29. Designing Reward for Literal Interpretation: P(w̃ | w*) ∝ exp( E[ w*ᵀφ(ξ) | ξ ∼ π(· | w̃) ] ) — the expectation is over the literal optimizer’s trajectory distribution conditioned on w̃, and w*ᵀφ(ξ) is the true reward received for each trajectory.

  30. Inverting Reward Design: P(w* | w̃) ∝ P(w̃ | w*) P(w*)

  31. Inverting Reward Design: P(w* | w̃) ∝ P(w̃ | w*) P(w*). Key Idea: at test time, interpret reward functions in the context of an ‘intended’ situation.
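A sketch of inverting reward design under the observation model on slide 29, using hypothetical candidate true rewards and a uniform prior. Note that the full IRD likelihood also carries a normalizing constant over possible proxy rewards, which this sketch omits for brevity.

```python
import numpy as np

def literal_policy(Phi, w):
    """pi(xi | w) proportional to exp(w^T phi(xi)) over a finite trajectory set."""
    s = Phi @ w
    s -= s.max()
    p = np.exp(s)
    return p / p.sum()

def ird_posterior(Phi, w_proxy, W_candidates, prior, beta=1.0):
    """P(w* | w~) ∝ exp(beta * E[w*^T phi(xi) | xi ~ pi(.|w~)]) * P(w*).

    The normalizer over alternative proxy rewards is omitted to keep the sketch short.
    """
    expected_phi = literal_policy(Phi, w_proxy) @ Phi   # E[phi(xi)] under the literal agent
    log_post = beta * (W_candidates @ expected_phi) + np.log(prior)
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# Hypothetical features, proxy weights, and candidate true rewards.
Phi = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]])
w_proxy = np.array([1.0, 0.5, 0.0])
W_candidates = np.array([[1.0, 0.5, 0.0],    # matches the proxy
                         [1.0, 0.5, -2.0],   # agrees on seen features, penalizes the unseen one
                         [0.0, 1.0, 0.0]])
prior = np.ones(len(W_candidates)) / len(W_candidates)
print(ird_posterior(Phi, w_proxy, W_candidates, prior))
```

Candidates that differ only on features the proxy never exercised receive similar posterior mass, which is the desired behavior: the posterior stays uncertain exactly where the designer gave no evidence.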

  32. Experiment. Domain: Lavaland. There are three types of states in the training MDP; a new state type is introduced in the ‘testing’ MDP M_test. Measure how often the agent π̃ selects trajectories with the new state.
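One possible way to use the resulting posterior at test time (the slides do not spell out the planner, so this is an illustrative choice, not the paper's method): score each trajectory by its worst-case reward over posterior samples of w*, so a trajectory that relies on a feature the designer never weighted, such as the new Lavaland state, looks risky. All numbers are hypothetical.

```python
import numpy as np

# Posterior samples of candidate true reward weights (hypothetical).
W_samples = np.array([[1.0, 0.5,  0.0],
                      [1.0, 0.5, -2.0],
                      [0.9, 0.6, -0.5]])

# Test-time trajectories; the second one touches the unseen third feature.
trajectories = {
    "stay_on_known_terrain": np.array([1.0, 0.0, 0.0]),
    "cross_new_state":       np.array([1.2, 0.0, 1.0]),
}

for name, phi in trajectories.items():
    rewards = W_samples @ phi                 # reward under each sampled w*
    print(name, "worst-case reward:", rewards.min())
```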

  33. Negative Side Effects. (Proxy objective: “Get money”; diagram: state types encoded as one-hot indicator features — a 4×4 identity matrix — with ‘?’ marking the new state type.)

  34. Reward Hacking. (Proxy objective: “Get points”; diagram: point values and indicator features for the state types, with ‘?’ marking the new state type.)

  35. Challenge: Missing Latent Rewards. The proxy reward function is only trained for the state types observed during training. (Diagram: latent state type k ∈ {0, 1, 2, 3} with parameters μ_k, Σ_k generating the observed features φ_s.)

  36. Results. (Bar chart comparing Sampled-Proxy, Sampled-Z, MaxEnt Z, and Mean Proxy on three conditions: Negative Side Effect, Reward Hacking, and Missing Latent Reward.)

  37. On the folly of rewarding A and hoping for B “Whether dealing with monkeys, rats, or human beings, it is hardly controversial to state that most organisms seek information concerning what activities are rewarded, and then seek to do (or at least pretend to do) those things, often to the virtual exclusion of activities not rewarded…. Nevertheless, numerous examples exist of reward systems that are fouled up in that behaviors which are rewarded are those which the rewarder is trying to discourage…. ” – Kerr, 1975

  38. The Principal-Agent Problem Agent Principal

  39. A Simple Principal-Agent Problem ■ Principal and Agent negotiate contract ■ Agent selects effort ■ Value generated for principal, wages paid to agent
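A minimal numeric sketch of this contracting story under a standard textbook setup that the slide itself does not specify: effort e costs e²/2, generates value e for the principal, and the contract pays a linear wage α·e.

```python
import numpy as np

def agent_best_effort(alpha, efforts):
    """Agent picks effort maximizing wage minus effort cost."""
    payoff = alpha * efforts - efforts**2 / 2
    return efforts[np.argmax(payoff)]

def principal_profit(alpha, efforts):
    e = agent_best_effort(alpha, efforts)
    return e - alpha * e          # value generated minus wages paid

efforts = np.linspace(0, 2, 201)
for alpha in (0.25, 0.5, 0.75):
    print(f"alpha={alpha}: effort={agent_best_effort(alpha, efforts):.2f}, "
          f"principal profit={principal_profit(alpha, efforts):.3f}")
```

Under these assumptions the agent's best response is e = α, so the principal's profit α(1 − α) peaks at α = 0.5: the contract, not the principal's true objective, determines the effort supplied.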

  40. A Simple Principal Agent Problem

  41. A Simple Principal Agent Problem

  42. A Simple Principal Agent Problem

  43. Misaligned Principal-Agent Problem. (Plot: Value to Principal vs. Performance Measure.) [Baker 2002]

  44. Misaligned Principal-Agent Problem. (Plot: scale and alignment of the performance measure.) [Baker 2002]

  45. Principal-Agent vs. Value Alignment ■ Incentive compatibility is a fundamental constraint on (human or artificial) agent behavior ■ The PA model has fundamental misalignment because humans have differing objectives ■ The primary source of misalignment in VA is extrapolation ■ Although we may want to view algorithmic restrictions as a fundamental misalignment ■ Recent news: work on principal-agent models was awarded the 2016 Nobel Prize in Economics

  46. The Value Alignment Problem

  47. Can we intervene? vs. Better question: do our agents want us to intervene?

  48. The Off-Switch Game

  49. The Off-Switch Game Desired Behavior Disobedient Behavior

  50. A trivial agent that ‘wants’ intervention

  51. The Off-Switch Game: Desired Behavior, Disobedient Behavior, Non-Functional Behavior

  52. The Off-Switch Game

  53. The Off-Switch Game

  54. The Off-Switch Game. (Payoff comparison: Non-Functional Behavior, Desired Behavior, Disobedient Behavior.)

  55. Why have an off-switch? (Diagram: Desired Behavior → Objective Encoding → Observe → Act, where the encoding step might go wrong.) The system designer has uncertainty about the correct objective, but this uncertainty is never represented to the robot!

  56. The Structure of a Solution: infer the desired behavior from the human’s actions. (Diagram: Observe Human + Observe World → Distribution over Objectives → Act → Desired Behavior.)

  57. Inverse Reinforcement Learning ■ Given: an MDP without a reward function, and observations of optimal behavior ■ Determine: the reward function being optimized [Ng and Russell 2000]
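The slide states the IRL problem; below is a minimal Bayesian-flavoured sketch of one way to solve a toy instance (this is not the Ng and Russell 2000 linear-programming formulation). Candidate reward weights are scored by how likely the observed demonstrations are under a Boltzmann-optimal demonstrator; all numbers are hypothetical.

```python
import numpy as np

def irl_posterior(Phi_all, demo_idx, W_candidates, beta=2.0):
    """Posterior over candidate weights given demonstrated trajectory indices."""
    posts = []
    for w in W_candidates:
        scores = beta * (Phi_all @ w)
        scores -= scores.max()
        p = np.exp(scores) / np.exp(scores).sum()     # demonstrator's policy over trajectories
        posts.append(np.prod(p[demo_idx]))            # likelihood of the observed demos
    posts = np.array(posts)
    return posts / posts.sum()                        # uniform prior over candidates

Phi_all = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.7, 0.7]])
demo_idx = [2, 2]                                     # the human repeatedly picked trajectory 2
W_candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(irl_posterior(Phi_all, demo_idx, W_candidates))
```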

  58. Can we use IRL to infer objectives? (Diagram: Observe Human + Observe World → Bayesian IRL → Distribution over Objectives / Inferred Objective → Act → Desired Behavior.)

  59. IRL Issue #1: We don’t want the robot to imitate the human

  60. IRL Issue #2: Assumes the Human is Oblivious. IRL assumes the human is unaware she is being observed. (Illustration: a one-way mirror.)

  61. IRL Issue #3 Action selection is independent of reward uncertainty Implicit Assumption: Robot gets no more information about the objective

  62. Proposal: Robot Plays a Cooperative Game ■ Cooperative Inverse Reinforcement Learning [Hadfield-Menell et al. NIPS 2016] ■ Two players: the human and the robot ■ Both players maximize a shared reward function, but only the human observes the actual reward signal; the robot only knows a prior distribution on reward functions and learns the reward parameters by observing the human

  63. Cooperative Inverse Reinforcement Learning. (Diagram: the two players interacting with a shared environment.) Hadfield-Menell et al. NIPS 2016
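For reference, the CIRL game in Hadfield-Menell et al. (NIPS 2016) is a two-player game specified by a tuple roughly of the form ⟨S, {A_H, A_R}, T, Θ, R, P₀, γ⟩. A plain container capturing those pieces might look like the sketch below (field names are my own, not the paper's notation):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class CIRLGame:
    states: Sequence            # S: world states
    human_actions: Sequence     # A_H: actions available to the human
    robot_actions: Sequence     # A_R: actions available to the robot
    transition: Callable        # T(s' | s, a_H, a_R)
    reward_params: Sequence     # Theta: possible reward parameters
    reward: Callable            # R(s, a_H, a_R; theta), shared by both players
    prior: Callable             # P0 over (initial state, theta); only the human observes theta
    gamma: float                # discount factor
```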

  64. The Off-Switch Game

  65. Intuition “Probably better to make coffee, but I should ask the human, just in case I’m wrong” “Probably better to switch off, but I should ask the human, just in case I’m wrong”

  66. Theorem 1: A rational human is sufficient to incentivize the robot to let itself be switched off

  67. Incentives for the Robot. (Comparison illustration.)

  68. Theorem 1: Sufficient Conditions — the human is rational

  69. Theorem 2: If the robot knows the utility evaluations in the off-switch game with certainty, then a rational human is necessary to incentivize obedient behavior

  70. Conclusion Uncertainty about the objective is crucial to incentivizing cooperative behaviors.
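A small numeric illustration of that conclusion and of Theorem 1, under assumed payoffs: the robot can act now (uncertain utility U_a), switch itself off (utility 0), or wait and defer to a rational human who lets it act exactly when U_a > 0. Waiting is then worth E[max(U_a, 0)] ≥ max(E[U_a], 0), so uncertainty about U_a is exactly what makes deferring attractive.

```python
import numpy as np

rng = np.random.default_rng(0)
U_a = rng.normal(loc=0.3, scale=1.0, size=100_000)   # robot's belief over U_a (hypothetical)

act_now    = U_a.mean()                   # commit without asking
switch_off = 0.0
wait       = np.maximum(U_a, 0.0).mean()  # defer to a rational human

print(f"act now: {act_now:.3f}  switch off: {switch_off:.3f}  wait: {wait:.3f}")
```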

  71. When is obedience a bad idea? (Comparison illustration.)

  72. Robot Uncertainty vs Human Suboptimality
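Extending the previous sketch with an imperfect human, which is my own illustrative assumption: with probability ε the human presses the wrong button. As ε grows, or as the robot becomes more confident that U_a > 0, the value of deferring falls below the value of simply acting, which is the tension this slide points at.

```python
import numpy as np

rng = np.random.default_rng(0)

def value_of_waiting(mean, std, eps, n=100_000):
    """Expected utility of deferring to a human who errs with probability eps."""
    U_a = rng.normal(mean, std, size=n)
    rational = np.maximum(U_a, 0.0)              # human decides correctly
    mistaken = np.minimum(U_a, 0.0)              # human decides incorrectly
    return ((1 - eps) * rational + eps * mistaken).mean()

for eps in (0.0, 0.1, 0.3):
    for std in (1.0, 0.1):                       # std: how uncertain the robot is about U_a
        wait = value_of_waiting(mean=0.3, std=std, eps=eps)
        act = 0.3                                # E[U_a] if the robot just acts
        print(f"eps={eps}, std={std}: wait={wait:.3f}, act={act:.3f}")
```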

  73. Incentives for Designers Population statistics on preferences i.e., market research Evidence about preferences from interaction with a particular customer Question: is it a good idea to `lie’ to the agent and tell it that the variance of is ?

  74. Incentives for Designers

  75. Incentives for Designers

  76. Incentives for Designers
