Adversarial Robustness for Aligned AI
Ian Goodfellow, Staff Research Scientist. NIPS 2017 Workshop on Aligned Artificial Intelligence.
Many thanks to Catherine Olsson for feedback on drafts
The Alignment Problem
(Goodfellow 2017)
(This is now fixed. Don’t try it!)
Solve the adversarial robustness problem first.
Adversarial robustness is a tool for building safety mechanisms, rather than a first principle (like low-impact, reversibility, etc.)
Suppose we train a model of “human preferences” the same way we train models “to categorize images” or “to categorize sentences”.
This is especially a concern for RL, where an agent maximizes a reward: is a learned preference model reliable enough to be used for this purpose?
Are today’s machine learning models robust?
Timeline:
- “Adversarial Classification” (Dalvi et al., 2004): fool a spam filter
- “Evasion Attacks Against Machine Learning at Test Time” (Biggio et al., 2013): fool neural nets
- Szegedy et al. (2013): fool ImageNet classifiers imperceptibly
- Goodfellow et al. (2014): cheap, closed-form attack
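The “cheap, closed-form attack” above is the fast gradient sign method (FGSM). As a hedged sketch (not the paper's code), here is FGSM against a toy logistic-regression model, where the input gradient of the loss has a closed form:

```python
import numpy as np

def fgsm(x, w, b, y, eps):
    """Fast gradient sign method against logistic regression
    p(y=1|x) = sigmoid(w.x + b); the input gradient of the
    cross-entropy loss is (p - y) * w, so one step suffices."""
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    grad_x = (p - y) * w              # dLoss/dx in closed form
    return x + eps * np.sign(grad_x)  # max-norm-bounded perturbation

def prob(x, w, b):
    """Model's probability that x belongs to class 1."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

# Toy model and a confidently, correctly classified input.
rng = np.random.default_rng(0)
w = rng.normal(size=100)
b = 0.0
x = 0.1 * np.sign(w)   # w.x > 0, so the model says class 1
y = 1.0

x_adv = fgsm(x, w, b, y, eps=0.25)
print(prob(x, w, b), prob(x_adv, w, b))  # confidence collapses
```

Each coordinate moves by at most eps, yet the prediction flips; in high dimensions the many small moves add up in the logit.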
Maximizing model’s estimate of human preference for input to be categorized as “airplane”
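Why does directly maximizing a model's preference estimate fail? Gradient ascent finds inputs the model scores highly, not inputs a human actually likes. A minimal illustration, using a hypothetical linear “preference” score as a stand-in for a trained model:

```python
import numpy as np

# Hypothetical learned preference score s(x) = w.x (linear for clarity;
# w is a stand-in for a trained scoring model, not anything from the talk).
rng = np.random.default_rng(1)
w = rng.normal(size=64)

x = np.zeros(64)   # start from a blank input
scores = []
for _ in range(100):
    x += 0.1 * w   # gradient ascent: ds/dx = w
    scores.append(float(np.dot(w, x)))

print(scores[0], scores[-1])  # the score grows without bound
```

Nothing constrains x to look like anything a human has ever preferred; the optimizer simply exploits the scoring function.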
What about sampling from the set of things humans have liked before?
(GANs and other generative models)
A human may still dislike them all.
[Figure: generated samples for the classes “Welsh Springer Spaniel”, “Palace”, and “Pizza” (Miyato et al., 2017)]
This is better than the adversarial panda, but still not a satisfying safety mechanism.
[Figure: GAN-generated faces (Karras et al., 2017)]
Some proposed safety mechanisms rely on the agent having low confidence in some scenarios (e.g., Hadfield-Menell et al., 2017). But adversarial examples can induce much higher confidence than naturally occurring, correctly processed examples.
[Figure: adversarial attacks on neural network policies (Huang et al., 2017)]
Common machine learning building blocks are not robust. Confidence-based safeguards fail under exactly the same situations as adversarial attack: models are wrong under adversarial attack, and have higher confidence when wrong.
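That a model can be more confident when wrong than when right shows up even in a linear softmax classifier. A sketch with toy random weights (not a trained network):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy linear softmax classifier with 3 classes (weights are random
# stand-ins, not a trained network).
rng = np.random.default_rng(2)
W = rng.normal(size=(3, 50))

x = 0.05 * np.sign(W[0])     # weakly but correctly classified as class 0
p_clean = softmax(W @ x)

# Perturb along the gradient of (logit_1 - logit_0) to flip the decision.
eps = 0.25
x_adv = x + eps * np.sign(W[1] - W[0])
p_adv = softmax(W @ x_adv)

print(p_clean.argmax(), p_clean.max())  # correct class, moderate confidence
print(p_adv.argmax(), p_adv.max())      # wrong class, higher confidence
```

Any safeguard that trusts high-confidence predictions is defeated by exactly this kind of input.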
Work toward robustness: Jacob Buckman*, Aurko Roy*, Colin Raffel, Ian Goodfellow (*joint first author)
[Figure: logits (the argument to the softmax) plotted along an adversarial direction, with the x-axis spanning roughly −10 to 10; the near-linear extrapolation exposes vulnerabilities. Plot from “Explaining and Harnessing Adversarial Examples”, Goodfellow et al., 2014]
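The plot's point is that logits extrapolate almost linearly along the adversarial direction. For an exactly linear model this is an identity, which a few lines of numpy verify (illustrative sketch, not the plot's original code):

```python
import numpy as np

# For a linear model, the logit along the sign-of-gradient direction is
# w.(x + eps*sign(w)) = w.x + eps*sum(|w|): exactly a straight line in eps.
rng = np.random.default_rng(3)
w = rng.normal(size=200)
x = rng.normal(size=200)

epsilons = np.linspace(-10.0, 10.0, 9)
logits = np.array([np.dot(w, x + e * np.sign(w)) for e in epsilons])

slopes = np.diff(logits) / np.diff(epsilons)
print(slopes)  # constant: the logit extrapolates linearly, as in the plot
```

The slope, sum(|w|), grows with the input dimension, which is why tiny per-pixel perturbations can move the logit so far.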
5 years ago, this would have been SOTA
6 years ago, this would have been SOTA
Ensemble adversarial training: Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, Patrick McDaniel
Best defense so far on ImageNet: ensemble adversarial training.
It was used as at least part of all top-10 entries in dev round 3 of the NIPS 2017 adversarial defenses competition.
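Ensemble adversarial training augments each training batch with adversarial examples crafted against a fixed set of pre-trained source models, not only the model being trained. A schematic numpy sketch of that data pipeline, using toy linear models as stand-ins for pre-trained networks (it illustrates the procedure, not its ImageNet-scale benefit):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_batch(X, w, y, eps):
    """FGSM for a bias-free logistic model with weights w."""
    p = sigmoid(X @ w)
    grad = (p - y)[:, None] * w        # per-example input gradient
    return X + eps * np.sign(grad)

# Toy linearly separable data.
rng = np.random.default_rng(4)
n, d, eps, lr = 512, 20, 0.1, 0.1
true_w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ true_w > 0).astype(float)

# Static "ensemble" of pre-trained source models (noisy copies of true_w
# stand in for independently trained networks).
ensemble = [true_w + 0.5 * rng.normal(size=d) for _ in range(3)]

# Each step trains on clean examples plus adversarial examples crafted
# against an ensemble member, so the defender does not only see attacks
# generated from its own gradients.
w = np.zeros(d)
for step in range(300):
    src = ensemble[step % len(ensemble)]
    X_adv = fgsm_batch(X, src, y, eps)
    Xb = np.vstack([X, X_adv])
    yb = np.concatenate([y, y])
    w -= lr * Xb.T @ (sigmoid(Xb @ w) - yb) / len(yb)

acc_clean = float(((X @ w > 0) == (y == 1)).mean())
print(acc_clean)
```

Decoupling attack generation from the defender's own gradients is the design point: a model trained only against its own attacks tends to overfit to them, while transferred attacks from other models cover more of the threat.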
The open problem: reward-maximizers will visit states that it may not be feasible to build sufficiently accurate models of.
CleverHans adversarial example library: https://github.com/tensorflow/cleverhans