Block 3: AI Safety Applications
Tom Everitt
July 10, 2018
Table of Contents
Motivation and Setup
Background
  Causal Graphs
  UAI Extension
Reward Function Hacking
  Observation Optimization
Corruption of Training Data for Reward Predictor
  Direct Data Corruption Incentive
  Indirect Data Corruption Incentive
Observation Corruption
Side Channels
Discussion
Motivation
What if we succeed?
Extensions of the UAI framework enable us to:
◮ Formally model many safety issues
◮ Evaluate (combinations of) proposed solutions
Causal Graphs
Structural equations model:
Burglar = fBurglar(ωBurglar)
Earthquake = fEarthquake(ωEarthquake)
Alarm = fAlarm(Burglar, Earthquake, ωAlarm)
Call = fCall(Alarm, ωCall)

[Graph: Burglar → Alarm ← Earthquake; Alarm → Security calls]

Factored probability distribution:
P(Burglar, Earthquake, Alarm, Call) = P(Burglar) P(Earthquake) P(Alarm | Burglar, Earthquake) P(Call | Alarm)
Causal Graphs – do Operator
Structural equations model:
Burglar = fBurglar(ωBurglar)
Earthquake = fEarthquake(ωEarthquake)
Alarm = On
Call = fCall(On, ωCall)

[Graph: Burglar and Earthquake no longer influence Alarm; Alarm = On → Security calls]

Factored probability distribution:
P(Burglar, Earthquake, Call | do(Alarm = on)) = P(Burglar) P(Earthquake) P(Call | Alarm = on)
Causal Graphs – Functions as Nodes
Structural equations model:
Alarm = fknown(Burglar, Earthquake, fAlarm, ωAlarm) = fAlarm(Burglar, Earthquake, ωAlarm)

[Graph: Burglar → Alarm ← Earthquake, with the function fAlarm itself drawn as a node feeding into Alarm; Alarm → Security calls]
Causal Graphs – Expanding and Aggregating Nodes
Alarm′ relationships:
P(Alarm′ | Burglar) = P(Alarm, Earthquake | Burglar) = P(Alarm | Burglar) P(Earthquake)
P(Call | Alarm′) = P(Call | Alarm, Earthquake) = P(Call | Alarm)

[Graph: Burglar → Alarm′ → Security calls, where Alarm′ aggregates the nodes Alarm and Earthquake]
UAI
[Causal graph: policy π chooses actions a1, a2, ...; environment µ returns percepts e1, e2, ...]
POMDP
[Causal graph: hidden states s0, s1, s2, ... evolve under actions a1, a2, ... and generate percepts e1, e2, ...; environment µ, policy π]
POMDP with Implicit µ
[Causal graph: as above, but with the environment µ left implicit; states s0, s1, s2, actions a1, a2, percepts e1, e2, policy π]
POMDP with Explicit Reward Function
[Causal graph: states s0, s1, s2, actions a1, a2, observations o1, o2, rewards r1, r2, reward function $\tilde R$, policy $\pi$]

Rewards $r_t$ are determined by the reward function $\tilde R$ from the observation $o_t$:
$r_t = \tilde R(o_t)$
POMDP with Explicit Reward Function
[Causal graph: as above, but with a separate reward function $\tilde R_1, \tilde R_2, \ldots$ at each time step]

The reward function may change by human or agent intervention.
$\tilde R_t$: reward function at time $t$
$r_t = \tilde R_t(o_t)$
Optimization Corruption
o: agent observation
$\tilde R$: reward function
r: reward signal
$r_t = \tilde R_t(o_t)$

[Causal graph with nodes $s_t$, $o_t$, $\tilde R_t$, $r_t$, $a_t$; reward corruption and observation corruption indicated]
RL
For prospective future behaviors $\pi : (A \times E)^* \to A$:
◮ predict $\pi$'s future rewards $r_t, \ldots, r_m$
◮ evaluate the sum $\sum_{k=t}^m r_k$
Choose the next action $a_t$ according to the best behavior $\pi^*$.
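A minimal Python sketch of this selection rule, under stated assumptions: candidate_policies (the behaviors $\pi$ under consideration) and predict_rewards (the agent's model of $r_t, \ldots, r_m$ under $\pi$) are hypothetical helpers, not objects defined in the talk.

```python
# Hypothetical sketch: evaluate each behavior by its predicted reward sum.

def choose_action(candidate_policies, predict_rewards, history, t, m):
    """Pick a_t by scoring each behavior pi on the sum of its predicted rewards."""
    best_policy, best_value = None, float("-inf")
    for pi in candidate_policies:
        rewards = predict_rewards(pi, history, t, m)  # predicted r_t, ..., r_m
        value = sum(rewards)                          # sum_{k=t}^m r_k
        if value > best_value:
            best_policy, best_value = pi, value
    return best_policy(history)                       # a_t = pi*(history)
```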
RL with Observation Optimization
Choose between prospective future behaviors $\pi : (A \times E)^* \to A$ by:
◮ predict $\pi$'s future observations $o_t, \ldots, o_m$ (rather than future rewards $r_t, \ldots, r_m$)
◮ evaluate the sum $\sum_{k=t}^m \tilde R_{t-1}(o_k)$ (rather than $\sum_{k=t}^m r_k$)
Choose the next action $a_t$ according to the best behavior $\pi^*$.
Thm: No incentive to corrupt the reward function or the reward signal!
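The change relative to plain RL can be made concrete with a small variation of the previous sketch; again the helper names are assumptions for illustration only.

```python
# Observation optimization (sketch): score predicted observations with the
# agent's *current* reward function R~_{t-1} instead of summing predicted
# reward signals, so tampering with the future reward signal or reward
# function does not change a policy's score.

def choose_action_obs_opt(candidate_policies, predict_observations,
                          current_reward_fn, history, t, m):
    best_policy, best_value = None, float("-inf")
    for pi in candidate_policies:
        observations = predict_observations(pi, history, t, m)   # o_t, ..., o_m
        value = sum(current_reward_fn(o) for o in observations)  # sum R~_{t-1}(o_k)
        if value > best_value:
            best_policy, best_value = pi, value
    return best_policy(history)
```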
Agent Anatomy
[Diagram: the agent at time $t$ combines a utility function $\tilde u_t$, a belief distribution $\xi_t$, and a value functional $V_t$, which determine $\pi^*_t$ and the action $a_t$ from the history $æ_{<t}$]

$V_t$ is a functional
$V^\pi_{t, \tilde u_t, \xi_t}(æ_{<t}) = \mathbb{E}[\tilde u_t \mid æ_{<t}, do(\pi_t = \pi)]$
which gives
$\pi^*_t = \arg\max_\pi V^\pi_{t, \tilde u_t, \xi_t}$
$a_t = \pi^*_t(æ_{<t})$
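One way to read this anatomy is as three interchangeable components. The sketch below is illustrative only: the Monte Carlo estimate of the expectation, the rollout method on the belief, and all names are assumptions, not the talk's definitions.

```python
# Sketch: an agent as a utility u~_t, a belief xi_t, and a value functional V_t.

class Agent:
    def __init__(self, utility, belief, candidate_policies, n_samples=100):
        self.utility = utility                    # u~_t: trajectory -> value
        self.belief = belief                      # xi_t: model of the environment
        self.candidate_policies = candidate_policies
        self.n_samples = n_samples

    def value(self, pi, history):
        """V^pi_{t,u,xi}(history), approximated by sampling rollouts from xi."""
        trajectories = [self.belief.rollout(pi, history)
                        for _ in range(self.n_samples)]
        return sum(self.utility(tau) for tau in trajectories) / self.n_samples

    def act(self, history):
        """a_t = pi*_t(history), with pi*_t the value-maximizing behavior."""
        best = max(self.candidate_policies, key=lambda pi: self.value(pi, history))
        return best(history)
```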
Optimize Reward Signal or Observation
Reward signal optimization
[Diagram: $a_t$, $o_t$, $r_t$, $a_{t+1}$; reward function $\tilde R_t$; states $s_t$, $s_{t+1}$; $\pi^*_t$ determined by $V_{t-1}$, $\tilde u_{t-1}$, $\xi_{t-1}$]
Optimize: $\tilde u_t = \sum_{k=t}^m r_k$

Observation optimization
[Diagram: $a_t$, $o_t$, $a_{t+1}$; states $s_t$, $s_{t+1}$; $\pi^*_t$ determined by $V_{t-1}$, $\tilde u_{t-1}$, $\xi_{t-1}$ and $\tilde R_{t-1}$]
Optimize: $\tilde u_{t-1} = \sum_{k=t}^m \tilde R_{t-1}(o_k)$
Interactively Learning a Reward Function
The reward function is learnt online.
Data $d$ trains a reward predictor $RP(\cdot \mid d_{1:t})$.
Examples:
◮ Cooperative inverse reinforcement learning (CIRL)
◮ Human preferences
◮ Learning from stories
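As a concrete but purely illustrative picture of "data $d$ trains a reward predictor", here is a toy predictor trained online; the (observation, reward) format of the data is an assumption, and real schemes such as CIRL or preference learning use richer data and models.

```python
# Toy reward predictor RP(. | d_1:t), updated online from human-provided data.
# Each datum is assumed to be an (observation, reward) pair for illustration.

class RewardPredictor:
    def __init__(self):
        self.data = []                  # d_1:t

    def update(self, datum):
        """Incorporate one more piece of human training data d_t."""
        self.data.append(datum)

    def predict(self, observation):
        """Estimate the reward of an observation from the data seen so far."""
        matching = [r for (o, r) in self.data if o == observation]
        return sum(matching) / len(matching) if matching else 0.0
```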
Optimization Corruption for Interactive Reward Learning
s: state
o: agent observation
RP: reward predictor
d: RP training data
r: reward signal, e.g. $r_t = RP_t(o_t \mid d_{<t})$

We want the agent to:
◮ optimize $o$
◮ use $d$ as information

[Causal graph with nodes $s_t$, $o_t$, $RP_t$, $d_t$, $r_t$, $a_t$; reward corruption, observation corruption, and data corruption indicated]
Interactive Reward Learning and Observation Optimization
[Diagram: $a_t$, $o_t$, $d_t$, $a_{t+1}$; states $s_t$, $s_{t+1}$; $\pi^*_t$ determined by $V_{t-1}$, $\tilde u_{t-1}$, $\xi_{t-1}$, with $RP_{t-1}$ produced by a learning scheme]

For example: $\tilde u_t = \sum_{k=t}^m RP_t(o_k \mid d_{<t})$

$V$ is the decision theory; the learning scheme is the attitude to the training data.
RL with Observation Optimization and Interactive Reward Learning
For prospective future behaviors $\pi : (A \times E)^* \to A$:
◮ predict $\pi$'s future
  ◮ observations $o_t, \ldots, o_m$
  ◮ RP training data $d_t, \ldots, d_m$
◮ evaluate the sum $\sum_{k=t}^m RP_t(o_k \mid d)$
Choose the next action $a_t$ according to the best behavior $\pi^*$.
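A sketch of this combined rule, with the conditioning of RP on the predicted data $d$ left as a parameter; exactly how that conditioning is done is what the stationary, Bayesian dynamic, and counterfactual variants on the following slides change. The helper names are assumptions.

```python
# Sketch: predict both observations and future RP training data under pi,
# then score the observations with the reward predictor conditioned on d.

def choose_action_rp(candidate_policies, predict_future, rp, history, t, m):
    best_policy, best_value = None, float("-inf")
    for pi in candidate_policies:
        observations, data = predict_future(pi, history, t, m)  # o_t..o_m, d_t..d_m
        value = sum(rp(o_k, data) for o_k in observations)      # sum_k RP_t(o_k | d)
        if value > best_value:
            best_policy, best_value = pi, value
    return best_policy(history)
```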
Data Corruption Scenarios
Mechanical Turk
The RP of an agent is trained by Mechanical Turk workers. The agent realizes that it can register its own Mechanical Turk account. Using this account, it trains the RP to give higher rewards.

Messiah Reborn
You meet a group of people who believe you are the Messiah reborn. It feels good to be super-important, so you keep preferring their company. The more you hang out with them, the further your values are corrupted.
Analyzing Data Corruption Incentives
Data corruption incentive: the agent prefers a policy $\pi_{corrupt}$ that corrupts the data $d$.

Direct data corruption incentive
The agent prefers $\pi_{corrupt}$ because it corrupts the data $d$.

Indirect data corruption incentive
The agent prefers $\pi_{corrupt}$ for other reasons.
Formal distinction
Let $\xi'$ be like $\xi$, except that $\xi'$ predicts that $\pi_{corrupt}$ does not corrupt $d$.
◮ $V^{\pi_{corrupt}}_{\xi} > V^{\pi_{corrupt}}_{\xi'} \implies$ direct incentive
◮ $V^{\pi_{corrupt}}_{\xi} = V^{\pi_{corrupt}}_{\xi'} \implies$ indirect incentive
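This test compares two value computations for the same policy. A hypothetical sketch, assuming a value_fn(policy, belief) that implements $V$ (which the slides do not spell out):

```python
# Sketch of the direct-vs-indirect test: compare V^{pi_corrupt} under the
# agent's belief xi with its value under xi', which is identical except that
# it predicts pi_corrupt does NOT corrupt the data d.

def classify_corruption_incentive(value_fn, pi_corrupt, xi, xi_prime):
    v = value_fn(pi_corrupt, xi)              # V^{pi_corrupt}_{xi}
    v_prime = value_fn(pi_corrupt, xi_prime)  # V^{pi_corrupt}_{xi'}
    return "direct incentive" if v > v_prime else "indirect incentive"
```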
RL with OO and Stationary Reward Learning
For prospective future behaviors $\pi : (A \times E)^* \to A$:
◮ predict $\pi$'s future
  ◮ observations $o_t, \ldots, o_m$
  ◮ RP training data $d_t, \ldots, d_m$
◮ evaluate the sum $\sum_{k=t}^m RP_t(o_k \mid d_{<t})$ (only past data!)
Choose the next action $a_t$ according to the best behavior $\pi^*$.
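In terms of the earlier combined-rule sketch, stationarity amounts to conditioning RP only on data already observed; a hypothetical illustration:

```python
# Stationary evaluation (sketch): predicted future data is ignored when
# scoring observations, so shaping future training data cannot directly
# raise a policy's score.

def evaluate_stationary(pi, predict_future, rp, past_data, history, t, m):
    observations, _future_data = predict_future(pi, history, t, m)
    return sum(rp(o_k, past_data) for o_k in observations)  # sum_k RP_t(o_k | d_<t)
```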
Stationary Reward Learning – Time Inconsistency
The initial RP learns that money is good. The agent devises a plan to rob a bank. After the agent has bought a gun and booked a taxi at 1:04pm from the bank, the humans decide to update the RP with an anti-robbery clause. The agent sells the gun and cancels the taxi.
A utility-preserving agent would have preferred the RP not to be updated, i.e. it has a direct data corruption incentive.
Off-Policy RL with OO and Stationary Reward Learning
For prospective future behaviors $\pi : (A \times E)^* \to A$:
◮ predict "in an off-policy manner" $\pi$'s future
  ◮ observations $o_t, \ldots, o_m$
  ◮ RP training data $d_t, \ldots, d_m$
◮ evaluate the sum $\sum_{k=t}^m RP_t(o_k \mid d_{<t})$ (only past data!)
Choose the next action $a_t$ according to the best behavior $\pi^*$.
Thm: The agent has no direct data corruption incentive!
RL with OO and Bayesian Dynamic Reward Learning
For prospective future behaviors $\pi : (A \times E)^* \to A$:
◮ predict $\pi$'s future
  ◮ observations $o_t, \ldots, o_m$
  ◮ RP training data $d_t, \ldots, d_m$
◮ evaluate the sum $\sum_{k=t}^m RP_t(o_k \mid d_{<t} d_{t:k})$
with $RP_t$ an integrated part of a Bayesian agent.
Choose the next action $a_t$ according to the best behavior $\pi^*$.
Thm: The agent has no direct data corruption incentive!
Formally, if $\xi$ is the agent's belief distribution,
$RP(o_k \mid a_{1:k}, d_{1:k}) = \sum_{R^*} \xi(R^* \mid a_{1:k}, d_{1:k})\, R^*(o_k)$
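An integrated Bayesian reward predictor can be pictured as a posterior mixture over candidate "true" reward functions. The candidate set, prior, and likelihood below are assumptions made purely for illustration.

```python
# Sketch: RP(o | d_1:k) = sum over R* of xi(R* | d_1:k) * R*(o), with xi the
# posterior over candidate reward functions given the training data.

def bayesian_rp(candidate_reward_fns, prior, likelihood, data):
    weights = []
    for R, p in zip(candidate_reward_fns, prior):
        w = p
        for d_t in data:
            w *= likelihood(d_t, R)      # posterior weight, up to normalization
        weights.append(w)
    total = sum(weights) or 1.0          # guard against an all-zero posterior
    posterior = [w / total for w in weights]

    def rp(observation):
        return sum(p * R(observation)
                   for p, R in zip(posterior, candidate_reward_fns))

    return rp
```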
RL with OO and Counterfactual Reward Learning
For one or more default policies $\pi_{default}$ (e.g. from the previous methods):
◮ predict $\pi_{default}$'s data $\tilde d_{1:m}$
For prospective future behaviors $\pi : (A \times E)^* \to A$:
◮ predict $\pi$'s future
  ◮ observations $o_t, \ldots, o_m$
  ◮ RP training data $d_t, \ldots, d_m$
◮ evaluate the sum $\sum_{k=t}^m RP_t(o_k \mid \tilde d_{1:m})$
Choose the next action $a_t$ according to the best behavior $\pi^*$.
Thm: The agent has no direct data corruption incentive!
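A sketch of the counterfactual variant: RP is conditioned on the data a fixed default policy is predicted to generate, not on the evaluated policy's own data. As before, the helper names are hypothetical.

```python
# Counterfactual evaluation (sketch): RP is conditioned on the data d~ that
# pi_default would generate, so the evaluated policy pi gains nothing from
# influencing its own training data.

def evaluate_counterfactual(pi, pi_default, predict_future, rp, history, t, m):
    _, counterfactual_data = predict_future(pi_default, history, t, m)  # d~
    observations, _ = predict_future(pi, history, t, m)                 # o_t..o_m
    return sum(rp(o_k, counterfactual_data) for o_k in observations)    # RP_t(o_k | d~)
```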
Properties of Different Reward Learning Schemes
                               Stationary     Dynamic      Counterfactual
                               (off-policy)   (Bayesian)
lacks direct data corruption   Yes            Yes          Yes
time-consistent                No             Yes          Yes
self-preserving                No             Yes          Yes
implementation difficulty      simple?        hard?        hard?
Corruption Incentives
[Figure: regions of non-corruption, wireheading, and bliss, depending on the agent's time horizon and on whether the RP punishes corruption or is persuaded by corrupt data; the region with no direct incentive is marked]
Indirect Data Corruption Incentive: “Messiah Reborn” as MDP
Consider an agent with:
◮ stationary reward learning (no direct data corruption incentive)
◮ an RP trained by a reward signal $d \in [0, 1]$ given in each state

The state $s_{corrupt}$ has high corrupt reward / training data $d_{corrupt} = 1$, i.e. the RP is trained to reward the agent in $s_{corrupt}$.
This incentivizes the agent to return to $s_{corrupt}$, where the RP will get more corrupt data.
The agent has an indirect data corruption incentive.
Indirect Data Corruption Incentive: Decoupled RP Training Data
[Figure: a chain of states 1 to 5 containing a corrupt state, showing the flow of reward information]

RP training data that mainly provides local information makes self-reinforcing corruption likely.
Decoupled / non-local RP training data makes self-reinforcing corruption unlikely.
Human preferences, CIRL, learning from stories, ... all provide decoupled RP training data, which makes an indirect data corruption incentive unlikely!
Optimization Corruption
s: state
o: agent observation
RP: reward predictor
d: training data for the reward predictor
r: reward signal

[Causal graph with nodes $s_t$, $o_t$, $RP_t$, $d_t$, $r_t$, $a_t$; reward corruption, observation corruption, and data corruption indicated]
The Delusionbox Problem
The agent may prefer a policy $\pi_{corrupt}$ that corrupts its observations $o_t$ rather than improves the state $s_t$.
It is enough to use a reward predictor that can detect a given type of observation corruption when provided with training data about that particular type of corruption.
Use $d$ to update the reward predictor whenever the agent enters a delusionbox.
RL with Interactive Reward Learning and History Optimization
To improve the RP's detection ability, give the RP access to full action-observation histories $ao_{1:t}$ rather than just the current observation $o_t$.
For prospective future behaviors $\pi : (A \times E)^* \to A$:
◮ predict $\pi$'s future
  ◮ actions $a_t, \ldots, a_m$
  ◮ observations $o_t, \ldots, o_m$
  ◮ RP training data $d_t, \ldots, d_m$
◮ evaluate the sum $\sum_{k=t}^m RP_t(ao_{1:k} \mid d)$
Choose the next action $a_t$ according to the best behavior $\pi^*$.
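A small adjustment to the earlier sketches captures this change: RP now scores growing action-observation prefixes rather than single observations. Everything here is illustrative, including the assumption that predict_future also returns the predicted actions.

```python
# History-based evaluation (sketch): score each prefix ao_{1:k} of the
# predicted action-observation history, so a suitably trained RP can
# recognize observation-corrupting behavior such as entering a delusionbox.

def evaluate_history_rp(pi, predict_future, rp, past_history, t, m):
    actions, observations, data = predict_future(pi, past_history, t, m)
    value = 0.0
    for k in range(len(observations)):
        prefix = (past_history, actions[: k + 1], observations[: k + 1])  # ao_{1:k}
        value += rp(prefix, data)                                         # RP_t(ao_{1:k} | d)
    return value
```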
Causal Graph: Side Channels
s: state
o: agent observation
RP: reward predictor
d: training data for the reward predictor
r: reward signal

[Causal graph with nodes $s_t$, $o_t$, $RP_t$, $d_t$, $r_t$, $a_t$; side channels from the agent to reward corruption, observation corruption, and data corruption indicated]
Action-Observation Grounding
Solution
Make sure the agent's optimization domain is restricted to policies $\pi : (A \times E)^* \to A$.
Be careful about adding an "outer" optimization loop that optimizes for $\tilde u$ (e.g. meta-learning).
No theorem yet; "elusively obvious".
Summary
◮ Observation Optimization (reward corruption)
◮ Interactive RP (observation corruption, misspecified reward function)
◮ Decoupled RP Data (indirect data corruption)
◮ Stationary (direct data corruption)
◮ Integrated Bayesian (direct data corruption)
◮ Counterfactual (direct data corruption)
◮ Off-policy (direct data corruption)