Adversarial Examples and Adversarial Training
Ian Goodfellow, OpenAI Research Scientist Security Seminar, Stanford University, 2017-01-17
In this presentation:
- “Intriguing Properties of Neural Networks,” Szegedy et al, 2013
- “Explaining and Harnessing Adversarial Examples,” Goodfellow et al, 2014
- “Adversarial Perturbations of Deep Neural Networks,” Warde-Farley and Goodfellow, 2016
- “Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples,” Papernot et al, 2016
- “Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples,” Papernot et al, 2016
- “Adversarial Perturbations Against Deep Neural Networks for Malware Classification,” Grosse et al, 2016 (not my own work)
- “Distributional Smoothing with Virtual Adversarial Training,” Miyato et al, 2015 (not my own work)
- “Adversarial Training Methods for Semi-Supervised Text Classification,” Miyato et al, 2016
- “Adversarial Examples in the Physical World,” Kurakin et al, 2016
How can adversarial examples be used to compromise machine learning systems?
How can adversarial examples be used to improve machine learning, even when there is no adversary?
Since 2013, deep neural networks have matched human performance at...
...recognizing objects and faces... (Szegedy et al, 2014; Taigman et al, 2013)
...solving CAPTCHAs and reading addresses... (Goodfellow et al, 2013; Goodfellow et al, 2013)
...and other tasks.
Timeline:
- “Adversarial Classification,” Dalvi et al, 2004: fool spam filters
- “Evasion Attacks Against Machine Learning at Test Time,” Biggio et al, 2013: fool neural nets
- Szegedy et al, 2013: fool ImageNet classifiers imperceptibly
- Goodfellow et al, 2014: cheap, closed-form attack
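That closed-form attack is the fast gradient sign method (FGSM) of Goodfellow et al, 2014: for model parameters θ, input x, label y, and training cost J,

\[
\tilde{x} = x + \epsilon \cdot \operatorname{sign}\!\big(\nabla_x J(\theta, x, y)\big)
\]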
Modern deep nets are very piecewise linear:
- Rectified linear unit
- Carefully tuned sigmoid
- Maxout
- LSTM
[Figure: argument to the softmax]
Clean example + perturbation = corrupted example. All three perturbations shown have L2 norm 3.96, which is actually small: we typically use 7 (e.g., a max-norm perturbation of ε = 0.25 on a 28×28 input has L2 norm 0.25 · √784 = 7). One perturbation of this size changes the true class, a random perturbation does not change the class, and another changes the input to a “rubbish class.”
Adversarial examples are not noise
(collaboration with David Warde-Farley and Nicolas Papernot)
(“Clever Hans, Clever Algorithms,” Bob Sturm)
[Figure: logistic regression weights, signs of the weights, clean examples, and adversarial examples]
(Andrej Karpathy, “Breaking Linear Classifiers on ImageNet”)
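To make the linear-classifier attack concrete, here is a hedged, self-contained sketch of FGSM against a toy logistic regression. The dimensions, weights, and ε = 0.25 are illustrative stand-ins, not the talk's exact setup.

```python
import numpy as np

# FGSM on logistic regression: x_adv = x + eps * sign(grad_x loss).
# For cross-entropy loss with p = sigmoid(w.x + b), grad_x = (p - y) * w,
# so the attack reduces to adding eps * sign(w) with a class-dependent sign.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    p = sigmoid(np.dot(w, x) + b)        # model's P(y = 1 | x)
    grad_x = (p - y) * w                 # gradient of the loss w.r.t. x
    return x + eps * np.sign(grad_x)

rng = np.random.RandomState(0)
w, b = rng.randn(100), 0.0               # stand-in "trained" weights
x, y = rng.randn(100), 1.0               # one input with true label 1
x_adv = fgsm(x, y, w, b, eps=0.25)
print("clean  P(y=1):", sigmoid(np.dot(w, x) + b))
print("attack P(y=1):", sigmoid(np.dot(w, x_adv) + b))
```

Each pixel moves by only ±ε, but the dot product with w shifts by ε·Σ|wᵢ|, which grows with dimension; this is the excessive-linearity story in miniature.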
[Figure (Papernot 2016)]
Black-box attack via transferability:
1. Target model with unknown weights, machine learning algorithm, and training set; maybe non-differentiable.
2. Train your own substitute model mimicking the target, with a known, differentiable function.
3. Craft adversarial examples against the substitute.
4. Deploy the adversarial examples against the target; the transferability property results in them succeeding. (A toy version of the loop is sketched below.)
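A minimal, self-contained sketch of that loop. A hidden linear classifier stands in for the remote target and a logistic regression for the substitute; the real attack (Papernot et al, 2016) queries an actual API and trains a small neural net, but the logic is the same.

```python
import numpy as np

rng = np.random.RandomState(0)
dim = 20
w_target = rng.randn(dim)                        # unknown to the attacker

def query_target(X):
    """Label-only oracle access to the target model."""
    return (X.dot(w_target) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_substitute(X, y, steps=500, lr=0.5):
    """Train a logistic-regression substitute on the queried labels."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X.dot(w))
        w -= lr * X.T.dot(p - y) / len(X)        # gradient step
    return w

X = rng.randn(50, dim)                           # initial seed inputs
for _ in range(5):                               # Jacobian-based augmentation
    y = query_target(X)                          # label inputs via the oracle
    w_sub = fit_substitute(X, y)                 # mimic the decision boundary
    grads = (sigmoid(X.dot(w_sub)) - y)[:, None] * w_sub   # d loss / d x
    X = np.vstack([X, X + 0.1 * np.sign(grads)]) # explore near the boundary

# Craft FGSM examples against the substitute; transferability makes many
# of them fool the target as well.
y = query_target(X)
grads = (sigmoid(X.dot(w_sub)) - y)[:, None] * w_sub
X_adv = X + 0.5 * np.sign(grads)
print("target error on adversarial inputs:",
      (query_target(X_adv) != y).mean())
```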
[Figure: cross-training-data transferability; panels labeled strong, weak, and intermediate (Papernot 2016)]
[Figure: these are concentric circles, not intertwined spirals (Pinna and Gregory, 2002)]
Black-box transferability attacks have been demonstrated against remotely hosted models (MetaMind, Amazon, Google).
Adversarial examples can be printed on paper and still fool machine learning systems that perceive them through a camera (Kurakin et al, 2016).
Hypothetical Attacks on Autonomous Vehicles
- Denial of service: confusing object.
- Harm others: adversarial input recognized as “open space on the road.”
- Harm self / passengers: adversarial input recognized as “navigable road.”
Failed defenses:
- Weight decay
- Adding noise at test time
- Adding noise at train time
- Dropout
- Ensembles
- Multiple glimpses
- Generative pretraining
- Removing the perturbation with an autoencoder
- Error correcting codes
- Confidence-reducing perturbation at test time
- Various non-linear units
- Double backprop
Both of these two-class mixture models implement roughly the same marginal over x, with very different posteriors over the classes. The likelihood criterion cannot strongly prefer one to the other, and in many cases will prefer the bad one.
Neural nets can represent either function; maximum likelihood doesn’t cause them to learn the right one. But we can fix that...
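As a concrete illustration of this likelihood blind spot (my example, not from the talk): two two-class Gaussian mixtures can share one marginal while swapping the class posteriors.

\[
p(x) = \tfrac{1}{2}\,\mathcal{N}(x; -1, 1) + \tfrac{1}{2}\,\mathcal{N}(x; +1, 1)
\]

Model A takes \(p(x \mid y{=}0) = \mathcal{N}(x; -1, 1)\) and \(p(x \mid y{=}1) = \mathcal{N}(x; +1, 1)\); Model B swaps the two components. The likelihood of unlabeled x is identical for both, yet their posteriors \(p(y \mid x)\) disagree everywhere, so the likelihood criterion over x alone cannot pick the right classifier.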
[Figure: test misclassification rate (log scale) vs. training time over 300 epochs, for Train=Clean/Test=Clean, Train=Clean/Test=Adv, Train=Adv/Test=Clean, and Train=Adv/Test=Adv]
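A hedged sketch of one step of the training procedure behind these curves, reusing the logistic-regression FGSM from above for brevity (the plotted results use neural nets); α, ε, and the learning rate are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_training_step(w, X, y, eps=0.25, alpha=0.5, lr=0.1):
    """One gradient step on alpha * loss(clean) + (1 - alpha) * loss(FGSM)."""
    p = sigmoid(X.dot(w))
    # FGSM against the current weights; as is standard, the perturbation
    # is treated as a constant when differentiating the loss below.
    X_adv = X + eps * np.sign((p - y)[:, None] * w)
    p_adv = sigmoid(X_adv.dot(w))
    grad = (alpha * X.T.dot(p - y)
            + (1 - alpha) * X_adv.T.dot(p_adv - y)) / len(X)
    return w - lr * grad

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X.dot(rng.randn(10)) > 0).astype(float)
w = np.zeros(10)
for _ in range(100):
    w = adversarial_training_step(w, X, y)
```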
Linear models like logistic regression and SVMs cannot learn a step function, so adversarial training is less useful for them, acting very similarly to weight decay. Neural nets, in contrast, can actually become more secure than other models: adversarially trained neural nets have the best empirical success rate on adversarial examples of any machine learning model.
[Figure: an example labeled as “bird” is given an adversarial perturbation that decreases the bird probability; the perturbed image still has the same true label (bird)]
Virtual adversarial training: for an unlabeled example, the model guesses it’s probably a bird, maybe a plane. An adversarial perturbation is crafted to change that guess. The model is then trained so that the new guess matches the old guess (probably bird, maybe plane). A sketch of the loss follows.
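A minimal sketch of the virtual adversarial training loss (Miyato et al, 2015) for a linear softmax model. Finite differences stand in for the paper's backprop-based power iteration, and eps, xi, and h are illustrative constants.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

def vat_loss(W, x, eps=2.0, xi=0.1, h=1e-5):
    p = softmax(W.dot(x))                # current guess; no label needed
    d = np.random.randn(x.size)
    d /= np.linalg.norm(d)               # random initial direction
    # One power-iteration step: the gradient of the KL at x + xi*d points
    # toward the direction that most easily changes the model's guess.
    base = kl(p, softmax(W.dot(x + xi * d)))
    g = np.empty_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (kl(p, softmax(W.dot(x + xi * d + step))) - base) / h
    r_adv = eps * g / np.linalg.norm(g)  # virtual adversarial perturbation
    # Penalize changing the guess: new prediction should match the old one.
    return kl(p, softmax(W.dot(x + r_adv)))

rng = np.random.RandomState(0)
W, x = rng.randn(3, 5), rng.randn(5)
print("VAT loss:", vat_loss(W, x))
```

Because the loss never touches a label, it can be minimized on unlabeled data, which is what makes this a semi-supervised method.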
RCV1 misclassification rate (%), zoomed in for legibility:

Earlier SOTA:               7.70
SOTA:                       7.20
Our baseline:               7.40
Adversarial:                7.12
Virtual Adversarial:        7.05
Both:                       6.97
Both + bidirectional model: 6.68
Universal engineering machine (model-based optimization): make new inventions by finding an input that maximizes the model’s predicted performance. The catch is extrapolation: the optimizer pushes inputs outside the training data, where the model’s predictions are unreliable.
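An illustrative toy (mine, not from the talk) of why extrapolation breaks naive model-based optimization: gradient ascent on the input of a misspecified model happily walks out of the training data.

```python
import numpy as np

# The "model" is a linear fit to samples of a quadratic ground truth;
# ascent on its input reaches the extrapolation region, where the model's
# high predicted score means nothing. All numbers are made up.

rng = np.random.RandomState(0)
x_train = rng.uniform(0.0, 1.0, 100)
y_train = 1.0 - x_train ** 2              # true performance peaks at x = 0

w, b = np.polyfit(x_train, y_train, 1)    # deliberately misspecified model

x = 0.5
for _ in range(100):
    x += 0.1 * w                          # gradient of (w * x + b) w.r.t. x
print("optimized x:        ", x)          # far outside the training data
print("model's prediction: ", w * x + b)  # looks great to the model
print("true performance:   ", 1.0 - x ** 2)  # actually terrible
```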
Adversarial training provides regularization and semi-supervised learning.
The out-of-domain input problem is a bottleneck for model-based optimization generally.
Open-source library available at: https://github.com/openai/cleverhans
Built on top of TensorFlow (Theano support anticipated). Standard implementations of attacks, for adversarial training and reproducible benchmarks.