Adversaries & Interpretability
SIDN: An IAP Practicum
Shibani Santurkar · Dimitris Tsipras
gradient-science.org

Outline for today
1. Simple gradient explanations
   Exercise 1: Gradient saliency
   Exercise 2: SmoothGrad
2. Adversarial …
[Diagram: input x → “pile of linear algebra” → predictions (Dog 95%, Bird 2%, …, Primate 4%, Truck 0%)]
→ Conceptually: highlights important pixels
→ Basic method: visualize gradients for different inputs (see the sketch below)
→ What is the dimension of the gradient?
→ Optional: does model architecture affect the visualization?

[Figure: original image alongside its gradient saliency map]
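A minimal PyTorch sketch of the vanilla-gradient exercise. The model, a normalized input batch x, and labels y are assumed given; the channel aggregation at the end is one common visualization choice, not the only one:

```python
import torch
import torch.nn.functional as F

def gradient_saliency(model, x, y):
    """Gradient of the loss w.r.t. the input pixels.

    The gradient has the same shape as x (e.g. N x 3 x H x W):
    one value per input pixel, which answers the dimension question.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return x.grad.detach()

# One way to visualize: collapse the color channels, e.g.
# saliency = gradient_saliency(model, x, y).abs().max(dim=1).values
```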
SmoothGrad: average gradients from multiple (nearby) inputs
[Smilkov et al. 2017]
[Diagram: add noise to the input N times, average the resulting gradients]
→ Basic method: visualize SmoothGrad for different inputs (see the sketch below)
→ Does visual quality improve over the vanilla gradient?
→ Play with the number of samples (N) and the variance (σ)
[Figure: original image alongside its SmoothGrad saliency map]
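A minimal SmoothGrad sketch, reusing the setup from the gradient_saliency sketch above. N (n_samples) and σ (sigma) are the knobs the exercise asks about; the default values here are illustrative:

```python
import torch
import torch.nn.functional as F

def smoothgrad(model, x, y, n_samples=50, sigma=0.1):
    """Average the input gradient over n_samples noisy copies of x."""
    x = x.clone().detach()
    grad_sum = torch.zeros_like(x)
    for _ in range(n_samples):
        # Add Gaussian noise, then take the gradient as before
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        loss = F.cross_entropy(model(noisy), y)
        loss.backward()
        grad_sum += noisy.grad.detach()
    return grad_sum / n_samples
```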
“pig” (91%) + 0.005 × perturbation = “airliner” (99%)
[Biggio et al. 2013; Szegedy et al. 2013]
→ Method: gradient descent to increase the loss w.r.t. the true label
  (or: pick an incorrect class and make the model predict it)
→ How far do we need to go from the original input?
→ Play with the attack parameters (steps, step size, epsilon); a sketch follows the formula below
Perturbation: δ′ = arg max_{‖δ‖₂ ≤ ε} ℓ(θ; x + δ, y)
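A minimal sketch of this attack as projected gradient descent (PGD) in the ℓ₂ ball. The model, a normalized batch x, and labels y are assumed given; clamping back to the valid pixel range is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def pgd_l2(model, x, y, eps=0.5, step_size=0.1, steps=20):
    """Maximize the loss over perturbations with ||delta||_2 <= eps."""
    delta = torch.zeros_like(x)
    for _ in range(steps):
        delta = delta.detach().requires_grad_(True)
        F.cross_entropy(model(x + delta), y).backward()
        # Normalized gradient-ascent step on the loss
        g = delta.grad
        g = g / g.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        delta = delta + step_size * g
        # Project back onto the l2 ball of radius eps
        n = delta.flatten(1).norm(dim=1).clamp_min(1e-12)
        delta = delta * (eps / n).clamp(max=1.0).view(-1, 1, 1, 1)
    return (x + delta).detach()
```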
[Diagram: features split into useful features (used to classify) and useless features; the perturbation manipulates features to fool the model]
[Diagram, built up over three slides: take a training set (cats vs. dogs); perturb each image towards the other class to form a new, “mislabelled” training set; train a classifier on it; evaluate on the original test set]

The classifier trained on the “mislabelled” data still does well on the original test set (e.g., 78% on CIFAR-10 cats vs. dogs)
[Diagram, refined across three slides: features split into useless features and useful features (used to classify); the useful features split further into robust features and non-robust features]
Pre-generated datasets: github.com/MadryLab/constructed-datasets
Adversarial examples & training library: github.com/MadryLab/robustness
→ Predictive linear directions [Jetley et al. 2018]
→ High-frequency components [Yin et al. 2019]
→ Human-meaningless does not mean useless
→ Are we improving explanations or hiding things?
→ Better visual quality might have nothing to do with the model
[Adebayo et al. 2018]
Can hide too much!
Standard Training: min_θ 𝔼_{(x,y)∼D} [ ℓ(θ; x, y) ]

Robust Training: min_θ 𝔼_{(x,y)∼D} [ max_{δ∈Δ} ℓ(θ; x + δ, y) ]

Δ: set of invariances (e.g., the ℓ₂ ball ‖δ‖₂ ≤ ε)
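A minimal adversarial-training sketch of the robust objective: the inner max is approximated with the pgd_l2 sketch from earlier (so Δ is the ℓ₂ ε-ball), and the outer min is an ordinary optimizer step. The model, optimizer, and data loader are assumed given:

```python
import torch.nn.functional as F

def robust_train_epoch(model, loader, optimizer, eps=0.5):
    model.train()
    for x, y in loader:
        # Inner maximization: find a worst-case perturbation via PGD
        x_adv = pgd_l2(model, x, y, eps=eps, step_size=eps / 5, steps=10)
        # Outer minimization: standard training step on the adv. input
        optimizer.zero_grad()  # also clears grads accumulated by the attack
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
```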
→ Once again: gradient descent to increase the loss
  (or: pick an incorrect class and make the model predict it)
→ How easy is it to change the model's prediction?
  (compare to standard models)
→ Again, play with the attack parameters (steps, step size, epsilon)
Perturbation: δ′ = arg max_{‖δ‖₂ ≤ ε} ℓ(θ; x + δ, y)
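One convenient way to get a robust model to attack is the MadryLab robustness library linked above. The sketch below follows the library's documented make_and_restore_model / AttackerModel interface; the paths are placeholders, and pretrained robust checkpoints are available from that repo:

```python
from robustness import model_utils, datasets

# Load a robustly trained CIFAR model (paths are placeholders)
ds = datasets.CIFAR('/path/to/cifar')
model, _ = model_utils.make_and_restore_model(
    arch='resnet50', dataset=ds,
    resume_path='/path/to/robust_checkpoint.pt')
model.eval()

# The wrapped model can also generate adversarial examples itself:
attack_kwargs = {'constraint': '2',   # l2-bounded attack
                 'eps': 0.5, 'step_size': 0.1, 'iterations': 20}
# With x, y a batch of images and labels from the dataset:
# _, x_adv = model(x, y, make_adv=True, **attack_kwargs)
```

The earlier pgd_l2 sketch applies here as well; compare how much larger ε must be before a robust model's prediction flips.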
→ Goal: modify input so that model prediction changes
→ What does the modified input look like?
Target class: “Primate”
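A sketch of this targeted variant: it is identical to the earlier pgd_l2 except that we descend the loss on a chosen target label instead of ascending it on the true one. Here target_class is the index of the chosen class (e.g., the primate class):

```python
import torch
import torch.nn.functional as F

def pgd_l2_targeted(model, x, target_class, eps=1.0, step_size=0.2, steps=20):
    """Perturb x within the eps-ball so the model predicts target_class."""
    y_t = torch.full((x.shape[0],), target_class,
                     dtype=torch.long, device=x.device)
    delta = torch.zeros_like(x)
    for _ in range(steps):
        delta = delta.detach().requires_grad_(True)
        F.cross_entropy(model(x + delta), y_t).backward()
        g = delta.grad
        g = g / g.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        delta = delta - step_size * g   # minus sign: move towards the target
        n = delta.flatten(1).norm(dim=1).clamp_min(1e-12)
        delta = delta * (eps / n).clamp(max=1.0).view(-1, 1, 1, 1)
    return (x + delta).detach()
```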
→ Visualize gradients for different inputs
→ Compare to the gradient (and SmoothGrad) for standard models
[Diagram: input → features → linear classifier → predicted class]
→ Extract the feature representation from the model
  (What are its dimensions?)
→ Write a loss that maximizes individual neurons in the feature representation
→ As before: use gradient descent to find inputs that maximize this loss (see the sketch below)
→ Optional: repeat for standard models
→ Optional: start the optimization from noise instead
[Figure: top-activating test images vs. loss-maximizing inputs for neurons 200, 500, and 1444]
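A minimal sketch of the neuron-maximization exercise. Here get_features is a hypothetical stand-in for however you extract the feature representation (e.g., a forward hook on the layer before the linear classifier):

```python
import torch

def maximize_neuron(model, get_features, x0, neuron, steps=200, lr=0.1):
    """Gradient-ascend the activation of one feature neuron.

    x0 can be a real image, or noise for the optional variant
    (x0 = torch.randn_like(image)).
    """
    x = x0.clone().detach().requires_grad_(True)
    for _ in range(steps):
        # get_features: hypothetical helper returning an N x D feature batch
        act = get_features(model, x)[:, neuron].sum()
        grad, = torch.autograd.grad(act, x)
        x = (x + lr * grad).detach().requires_grad_(True)
    return x.detach()
```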
Based on joint work with
Logan Engstrom Andrew Ilyas Alexander Turner Aleksander Mądry Brandon Tran
gradsci.org · github.com/MadryLab/robustness