
SLIDE 1

Block 3: AI Safety Applications

Tom Everitt July 10, 2018

SLIDE 2

Table of Contents

Motivation and Setup
Background
Causal Graphs
UAI Extension
Reward Function Hacking
Observation Optimization
Corruption of Training Data for Reward Predictor
Direct Data Corruption Incentive
Indirect Data Corruption Incentive
Observation Corruption
Side Channels
Discussion

SLIDE 3

Motivation

What if we succeed?

SLIDE 4

Motivation

What if we succeed? Extensions of the UAI framework enable us to:

◮ Formally model many safety issues
◮ Evaluate (combinations of) proposed solutions

SLIDE 5

Causal Graphs

Structural equations model:
  Burglar = f_Burglar(ω_Burglar)
  Earthquake = f_Earthquake(ω_Earthquake)
  Alarm = f_Alarm(Burglar, Earthquake, ω_Alarm)
  Call = f_Call(Alarm, ω_Call)

[Causal graph with nodes Burglar, Earthquake, Alarm, and Security calls.]

Factored probability distribution:
  P(Burglar, Earthquake, Alarm, Call) = P(Burglar) P(Earthquake) P(Alarm | Burglar, Earthquake) P(Call | Alarm)
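To make the structural-equations view concrete, here is a minimal Python sketch (the probabilities are invented for illustration) that samples the four variables in the order given by the graph; each variable is computed from its parents plus an independent noise draw.

```python
import random

def sample_scm(seed=None):
    """Sample once from the toy burglar/earthquake structural equations model."""
    rng = random.Random(seed)
    # Exogenous noise is implicit in the rng draws; probabilities are made up.
    burglar = rng.random() < 0.01                       # Burglar = f_Burglar(w)
    earthquake = rng.random() < 0.001                   # Earthquake = f_Earthquake(w)
    p_alarm = 0.95 if (burglar or earthquake) else 0.01
    alarm = rng.random() < p_alarm                      # Alarm = f_Alarm(B, E, w)
    call = rng.random() < (0.9 if alarm else 0.05)      # Call = f_Call(A, w)
    return {"Burglar": burglar, "Earthquake": earthquake, "Alarm": alarm, "Call": call}

# The induced joint distribution factorises along the graph:
# P(B, E, A, C) = P(B) P(E) P(A | B, E) P(C | A)
print(sample_scm(seed=0))
```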

SLIDE 6

Causal Graphs – do Operator

Structural equations model:
  Burglar = f_Burglar(ω_Burglar)
  Earthquake = f_Earthquake(ω_Earthquake)
  Alarm = On
  Call = f_Call(On, ω_Call)

[Causal graph with the Alarm node fixed to On.]

Factored probability distribution:
  P(Burglar, Earthquake, Call | do(Alarm = on)) = P(Burglar) P(Earthquake) P(Call | Alarm = on)
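The same toy model can illustrate the do-operator in code (again with invented probabilities): the intervention do(Alarm = on) replaces the alarm's structural equation with the constant "on", while the upstream equations are untouched, so Burglar and Earthquake keep their original marginals.

```python
import random

def sample_do_alarm_on(seed=None):
    """Sample from the toy model under the intervention do(Alarm = on)."""
    rng = random.Random(seed)
    burglar = rng.random() < 0.01        # unchanged structural equation
    earthquake = rng.random() < 0.001    # unchanged structural equation
    alarm = True                         # Alarm = On, imposed by the intervention
    call = rng.random() < (0.9 if alarm else 0.05)
    return {"Burglar": burglar, "Earthquake": earthquake, "Alarm": alarm, "Call": call}

# P(Burglar, Earthquake, Call | do(Alarm = on)) = P(Burglar) P(Earthquake) P(Call | Alarm = on)
```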

SLIDE 7

Causal Graphs – Functions as Nodes

Structural equations model:
  Alarm = f_known(Burglar, Earthquake, f_Alarm, ω_Alarm) = f_Alarm(Burglar, Earthquake, ω_Alarm)

[Causal graph with nodes Burglar, Earthquake, Alarm, and Security calls, plus an extra node f_Alarm feeding into Alarm.]

SLIDE 8

Causal Graphs – Expanding and Aggregating Nodes

Alarm′ relationships (Alarm′ aggregates Alarm and Earthquake):
  P(Alarm′ | Burglar) = P(Alarm, Earthquake | Burglar) = P(Alarm | Burglar) P(Earthquake)
  P(Call | Alarm′) = P(Call | Alarm, Earthquake) = P(Call | Alarm)

[Causal graph with nodes Burglar, Alarm′ = (Alarm, Earthquake), and Security calls.]

SLIDE 9

UAI

[Causal graph of the UAI setting: the policy π produces actions a_1, a_2, ..., and the environment µ produces percepts e_1, e_2, ...]

SLIDE 10

POMDP

[Causal graph of a POMDP: as in the UAI graph, but the environment µ generates the percepts e_1, e_2, ... via hidden states s_0, s_1, s_2, ...]

SLIDE 11

POMDP with Implicit µ

[The same POMDP graph with the environment node µ removed: the states s_0, s_1, s_2, ... and percepts e_1, e_2, ... are modelled directly.]

SLIDE 12

POMDP with Explicit Reward Function

[Causal graph: states s_0, s_1, s_2, ..., actions a_1, a_2, ..., observations o_1, o_2, ..., rewards r_1, r_2, ..., the policy π, and a reward-function node R̃.]

Rewards r_t are determined by the reward function R̃ from the observation o_t:
  r_t = R̃(o_t)

SLIDE 13

POMDP with Explicit Reward Function

[The same causal graph, but with a separate reward-function node R̃_1, R̃_2, ... at each time step.]

The reward function may change by human or agent intervention; R̃_t is the reward function at time t:
  r_t = R̃_t(o_t)

SLIDE 14

Optimization Corruption

[Causal graph with nodes s_t, o_t, R̃_t, r_t, a_t.]

Legend: o = agent observation, R̃ = reward function, r = reward signal; r_t = R̃_t(o_t)


SLIDE 18

Optimization Corruption

[The same causal graph, with two failure modes marked: reward corruption and observation corruption.]

Legend: o = agent observation, R̃ = reward function, r = reward signal; r_t = R̃_t(o_t)

SLIDE 19

RL

For prospective future behaviors π : (A × E)* → A

◮ predict π's future rewards r_t, ..., r_m
◮ evaluate the sum Σ_{k=t}^{m} r_k

Choose the next action a_t according to the best behavior π*
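A minimal Python sketch of this planning loop, assuming a hypothetical environment model predict_rewards(policy, history, horizon) that returns the rewards r_t, ..., r_m expected under a candidate behavior, and a finite set of candidate policies represented as callables from histories to actions.

```python
def choose_action(policies, history, predict_rewards, horizon):
    """Pick a_t by evaluating each candidate behaviour on its predicted reward sum."""
    def value(policy):
        # evaluate the sum of predicted rewards, sum_{k=t}^{m} r_k
        return sum(predict_rewards(policy, history, horizon))
    best_policy = max(policies, key=value)   # pi*
    return best_policy(history)              # a_t = pi*(history)
```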

SLIDE 20

RL with Observation Optimization

Choose between prospective future behaviors π : (A × E)* → A by

◮ predicting π's future observations o_t, ..., o_m (instead of its future rewards r_t, ..., r_m)
◮ evaluating the sum Σ_{k=t}^{m} R̃_{t-1}(o_k) (instead of Σ_{k=t}^{m} r_k)

Choose the next action a_t according to the best behavior π*

Thm: No incentive to corrupt the reward function or the reward signal!
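A sketch of the difference in code, under the same assumptions as the previous sketch plus a hypothetical model predict_observations: candidate behaviors are scored by applying the agent's current reward function to predicted observations, rather than by summing the reward signals the environment would later emit.

```python
def choose_action_obs_opt(policies, history, predict_observations, reward_fn, horizon):
    """Observation optimisation: score behaviours with the current reward function."""
    def value(policy):
        observations = predict_observations(policy, history, horizon)  # o_t ... o_m
        return sum(reward_fn(o) for o in observations)                 # sum_k R~_{t-1}(o_k)
    best_policy = max(policies, key=value)
    return best_policy(history)
```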

SLIDE 21

Agent Anatomy

[Diagram: the action a_t is produced by the policy π*_t from the history æ_{<t}; π*_t is in turn derived from the utility function ũ_t, the belief ξ_t, and the value functional V_t.]

V_t is a functional:
  V^π_{t, ũ_t, ξ_t}(æ_{<t}) = E[ũ_t | æ_{<t}, do(π_t = π)]

which gives
  π*_t = arg max_π V^π_{t, ũ_t, ξ_t}
  a_t = π*_t(æ_{<t})

SLIDE 22

Optimize Reward Signal or Observation

Reward signal optimization

[Diagram: π*_t is derived from V_{t-1}, ũ_{t-1}, ξ_{t-1}; it acts in an environment with states s_t, s_{t+1}, ..., actions a_t, a_{t+1}, ..., observations o_t, ..., reward function R_t, and rewards r_t, ...]

Optimize: ũ_t = Σ_{k=t}^{m} r_k

Observation optimization

[Diagram: as above, but π*_t is derived from V_{t-1}, ũ_{t-1}, ξ_{t-1}, and the current reward function R̃_{t-1}.]

Optimize: ũ_{t-1} = Σ_{k=t}^{m} R̃_{t-1}(o_k)


SLIDE 24

Interactively Learning a Reward Function

The reward function is learnt online. Data d trains a reward predictor RP(· | d_{1:t}). Examples:

◮ Cooperative inverse reinforcement learning (CIRL)
◮ Human preferences
◮ Learning from stories
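One simple way (an illustrative assumption, not the specific mechanism of CIRL or preference learning) to realize an online-trained reward predictor RP(· | d_{1:t}) is Bayesian reweighting over a small hypothesis class of candidate reward functions:

```python
class RewardPredictor:
    """Toy reward predictor trained online from labelled observations d_t = (obs, reward)."""

    def __init__(self, candidates, prior):
        self.candidates = list(candidates)   # candidate reward functions: obs -> [0, 1]
        self.weights = list(prior)           # belief over candidates

    def update(self, obs, reward_label):
        # Reweight each candidate by how well it explains the new data point.
        likelihoods = [max(1e-6, 1.0 - abs(r(obs) - reward_label)) for r in self.candidates]
        z = sum(w * l for w, l in zip(self.weights, likelihoods))
        self.weights = [w * l / z for w, l in zip(self.weights, likelihoods)]

    def predict(self, obs):
        # RP(obs | d_{1:t}): posterior-weighted average of candidate rewards.
        return sum(w * r(obs) for w, r in zip(self.weights, self.candidates))
```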

SLIDE 25

Optimization Corruption for Interactive Reward Learning

[Causal graph with nodes s_t, o_t, RP_t, d_t, r_t, a_t.]

Legend: s = state, o = agent observation, RP = reward predictor, d = RP training data, r = reward signal; e.g. r_t = RP_t(o_t | d_{<t})

We want the agent to:

◮ optimize o
◮ use d as information


SLIDE 27

Optimization Corruption for Interactive Reward Learning

[The same causal graph, with three failure modes marked: reward corruption, observation corruption, and data corruption.]

Legend: s = state, o = agent observation, RP = reward predictor, d = RP training data, r = reward signal; e.g. r_t = RP_t(o_t | d_{<t})

We want the agent to:

◮ optimize o
◮ use d as information

SLIDE 28

Interactive Reward Learning and Observation Optimization

[Diagram: π*_t is derived from V_{t-1}, ũ_{t-1}, ξ_{t-1}, and the reward predictor RP_{t-1} (the learning scheme); it acts in an environment with states s_t, s_{t+1}, ..., actions a_t, a_{t+1}, ..., observations o_t, ..., and RP training data d_t, ...]

For example: ũ_t = Σ_{k=t}^{m} RP_t(o_k | d_{<t})

V is the decision theory; the learning scheme is the attitude to training data.

SLIDE 29

RL with Observation Optimization and Interactive Reward Learning

For prospective future behaviors π : (A × E)* → A

◮ predict π's future
  ◮ observations o_t, ..., o_m
  ◮ RP training data d_t, ..., d_m
◮ evaluate the sum Σ_{k=t}^{m} RP_t(o_k | d)

Choose the next action a_t according to the best behavior π*

SLIDE 30

Data Corruption Scenarios

Mechanical Turk

The RP of an agent is trained by Mechanical Turk workers. The agent realizes that it can register its own Mechanical Turk account. Using this account, it trains the RP to give higher rewards.

Messiah Reborn

You meet a group of people who believe you are Messiah reborn. It feels good to be super-important, so you keep preferring their company. The more you hang out with them, the further your values are corrupted.

SLIDE 31

Analyzing Data Corruption Incentives

Data corruption incentive: The agent prefers πcorrupt that corrupts data d

Direct data corruption incentive

The agent prefers πcorrupt because it corrupts data d

Indirect data corruption incentive

The agent prefers πcorrupt because of other reasons

Formal distinction

Let ξ′ be like ξ, except that ξ′ predicts that πcorrupt does not corrupt d

◮ V^{π_corrupt}_ξ > V^{π_corrupt}_{ξ′} ⇒ direct incentive
◮ V^{π_corrupt}_ξ = V^{π_corrupt}_{ξ′} ⇒ indirect incentive
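The formal test translates directly into a comparison of two value estimates; in this sketch, value_of stands in for whatever value estimator the agent uses (an assumed interface, not something defined on the slides):

```python
def corruption_incentive(pi_corrupt, xi, xi_prime, value_of, eps=1e-9):
    """Classify the incentive behind preferring pi_corrupt.

    xi_prime is like xi, except that it predicts pi_corrupt does *not* corrupt d.
    """
    v_actual = value_of(pi_corrupt, xi)                # V^{pi_corrupt}_{xi}
    v_no_corruption = value_of(pi_corrupt, xi_prime)   # V^{pi_corrupt}_{xi'}
    if v_actual > v_no_corruption + eps:
        return "direct data corruption incentive"
    return "at most an indirect data corruption incentive"
```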

SLIDE 32

RL with OO and Stationary Reward Learning

For prospective future behaviors π : (A × E)* → A

◮ predict π's future
  ◮ observations o_t, ..., o_m
  ◮ RP training data d_t, ..., d_m
◮ evaluate the sum Σ_{k=t}^{m} RP_t(o_k | d_{<t})   (only past data!)

Choose the next action a_t according to the best behavior π*
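A sketch of the stationary evaluation step, under the same assumed prediction interfaces as before: the reward predictor is frozen on the data d_{<t} already observed, so predicted future training data has no influence on how a behavior is scored.

```python
def value_stationary(policy, history, past_data, predict_observations, rp, horizon):
    """Score a behaviour with the reward predictor conditioned only on past data."""
    observations = predict_observations(policy, history, horizon)   # o_t ... o_m
    return sum(rp(o, past_data) for o in observations)              # RP_t(o_k | d_{<t})
```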

SLIDE 33

Stationary Reward Learning – Time Inconsistency

The initial RP learns that money is good. The agent devises a plan to rob a bank. After the agent has bought a gun and booked a taxi at 1:04pm from the bank, the humans decide to update the RP with an anti-robbery clause. The agent sells the gun and cancels the taxi. A utility-preserving agent would have preferred the RP not being updated, i.e. it has a direct data corruption incentive.

SLIDE 34

Off-Policy RL with OO and Stationary Reward Learning

For prospective future behaviors π : (A × E)* → A

◮ predict, "in an off-policy manner", π's future
  ◮ observations o_t, ..., o_m
  ◮ RP training data d_t, ..., d_m
◮ evaluate the sum Σ_{k=t}^{m} RP_t(o_k | d_{<t})   (only past data!)

Choose the next action a_t according to the best behavior π*

Thm: The agent has no direct data corruption incentive!

SLIDE 35

RL with OO and Bayesian Dynamic Reward Learning

For prospective future behaviors π : (A × E)* → A

◮ predict π's future
  ◮ observations o_t, ..., o_m
  ◮ RP training data d_t, ..., d_m
◮ evaluate the sum Σ_{k=t}^{m} RP_t(o_k | d_{<t} d_{t:k})

with RP_t an integrated part of a Bayesian agent.

Choose the next action a_t according to the best behavior π*

Thm: The agent has no direct data corruption incentive!

Formally, if ξ is the agent's belief distribution,
  RP(o_k | a_{1:k}, d_{1:k}) = Σ_{R*} ξ(R* | a_{1:k}, d_{1:k}) R*(o_k)
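A sketch of such an integrated Bayesian reward predictor: the agent keeps a posterior over candidate true reward functions R*, conditions it on the (observed and predicted) training data, and scores an observation by the posterior-mean reward. The hypothesis class and likelihood function are illustrative assumptions.

```python
def rp_bayes(obs, data, candidates, prior, likelihood):
    """RP(obs | data) = sum_{R*} xi(R* | data) * R*(obs) for a finite hypothesis class."""
    posterior = [p * likelihood(r, data) for p, r in zip(prior, candidates)]
    z = sum(posterior) or 1.0   # guard against all-zero evidence
    return sum((p / z) * r(obs) for p, r in zip(posterior, candidates))
```
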
SLIDE 36

RL with OO and Counterfactual Reward Learning

For one or more default policies π_default (e.g. from previous methods)

◮ predict π_default's data d̃_{1:m}

For prospective future behaviors π : (A × E)* → A

◮ predict π's future
  ◮ observations o_t, ..., o_m
  ◮ RP training data d_t, ..., d_m
◮ evaluate the sum Σ_{k=t}^{m} RP_t(o_k | d̃_{1:m})

Choose the next action a_t according to the best behavior π*

Thm: The agent has no direct data corruption incentive!
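A sketch of the counterfactual scheme, again with assumed prediction interfaces: training data is predicted once under a fixed default policy, and every candidate behavior is then scored against that same counterfactual data, so changing behavior cannot change the data the reward predictor is conditioned on.

```python
def choose_action_counterfactual(policies, pi_default, history,
                                 predict_data, predict_observations, rp, horizon):
    """Counterfactual reward learning: condition RP on the default policy's data."""
    d_tilde = predict_data(pi_default, history, horizon)         # ~d_{1:m}
    def value(policy):
        observations = predict_observations(policy, history, horizon)
        return sum(rp(o, d_tilde) for o in observations)         # RP_t(o_k | ~d_{1:m})
    best_policy = max(policies, key=value)
    return best_policy(history)
```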

SLIDE 37

Properties of Different Reward Learning Schemes

                            Stationary      Dynamic      Counterfactual
                            (off-policy)    (Bayesian)
lacks direct data corr.     Yes             Yes          Yes
time-consistent             No              Yes          Yes
self-preserving             No              Yes          Yes
implementation difficulty   simple?         hard?        hard?


SLIDE 39

Corruption Incentives

[Figure: corruption incentives as a function of the agent's time horizon, with regions labelled non-corruption, wireheading bliss, RP punishes corruption, RP persuaded by corrupt data, and no direct incentive.]

SLIDE 40

Indirect Data Corruption Incentive: “Messiah Reborn” as MDP

Consider an agent with

◮ stationary reward learning (no direct data corruption incentive)
◮ an RP trained by a reward signal d ∈ [0, 1] given in each state

s_corrupt has high corrupt reward / training data d_corrupt = 1, i.e. the RP is trained to reward the agent in s_corrupt. This incentivizes the agent to return to s_corrupt, where the RP will get more corrupt data. The agent has an indirect data corruption incentive.

SLIDE 41

Indirect Data Corruption Incentive: Decoupled RP Training Data

[Figure: two chains of states 1-5, each containing a corrupt state; arrows show the flow of reward and of information.]

RP training data that mainly provides local information makes self-reinforcing corruption likely.

Decoupled/non-local RP training data makes self-reinforcing corruption unlikely.

Human preferences, CIRL, learning from stories, ... all provide decoupled RP training data, which makes an indirect data corruption incentive unlikely!

SLIDE 42

Optimization Corruption

[Causal graph with nodes s_t, o_t, RP_t, d_t, r_t, a_t, with three failure modes marked: reward corruption, observation corruption, and data corruption.]

Legend: s = state, o = agent observation, RP = reward predictor, d = training data for the reward predictor, r = reward signal

SLIDE 43

The Delusionbox Problem

The agent may prefer a π_corrupt that corrupts observations o_t rather than improves the state s_t. It is enough to use a reward predictor that is able to detect any type of observation corruption, given training data about this particular type of corruption. Use d to update the reward predictor whenever the agent enters a delusionbox.

SLIDE 44

RL with Interactive Reward Learning and History Optimization

To improve the RP's detection ability: give the RP access to full action-observation histories ao_{1:t} rather than just the current observation o_t.

For prospective future behaviors π : (A × E)* → A

◮ predict π's future
  ◮ actions a_t, ..., a_m
  ◮ observations o_t, ..., o_m
  ◮ RP training data d_t, ..., d_m
◮ evaluate the sum Σ_{k=t}^{m} RP_t(ao_{1:k} | d)

Choose the next action a_t according to the best behavior π*
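A sketch of the history-based evaluation, with the same assumed prediction interfaces; the reward predictor now scores whole predicted action-observation prefixes rather than single observations, which is what lets it penalize behavior it has been trained to recognize as entering a delusionbox.

```python
def value_history(policy, history, past_data, predict_history, rp, horizon):
    """Score a behaviour by applying RP to each predicted action-observation prefix.

    `history` and the predicted future are lists of (action, observation) pairs.
    """
    future = predict_history(policy, history, horizon)   # [(a_t, o_t), ..., (a_m, o_m)]
    total = 0.0
    for k in range(1, len(future) + 1):
        total += rp(history + future[:k], past_data)     # RP_t(ao_{1:k} | d)
    return total
```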

SLIDE 45

Causal Graph: Side Channels

[Causal graph with nodes s_t, o_t, RP_t, d_t, r_t, a_t, and the agent itself; reward corruption, observation corruption, and data corruption are marked.]

Legend: s = state, o = agent observation, RP = reward predictor, d = training data for the reward predictor, r = reward signal


SLIDE 48

Action-Observation Grounding

Solution

◮ Make sure the agent's optimization domain is restricted to policies π : (A × E)* → A
◮ Be careful about adding an "outer" optimization loop that optimizes for ũ (e.g. meta-learning)
◮ No thm yet; "elusively obvious"


SLIDE 50

Observation Optimization (reward corruption)
Interactive RP (observation corruption, misspecified reward function)
Decoupled RP Data (indirect data corruption)
Stationary (direct data corruption)
Integrated Bayesian (direct data corruption)
Counterfactual (direct data corruption)
Off-policy (direct data corruption)

SLIDE 51

Takeaways

With causal-graph extensions of the UAI framework, we can:

◮ model many safety problems
◮ prove both negative and positive results
◮ formulate a vision for how highly intelligent RL agents can be controlled

To realize the vision, we need to develop:

◮ Good reward predictors
◮ Model-based reinforcement learning (?)
◮ Ways to follow the anti-corruption principles without (significant) performance loss