SLIDE 1 The Odds are Odd: A Statistical Test for Detecting Adversarial Examples
Kevin Roth*, Yannic Kilcher*, Thomas Hofmann
ETH Zürich
Poster #62
SLIDE 2
Log-Odds & Adversarial Examples
SLIDE 3
Log-Odds & Adversarial Examples
Adversarial examples cause atypically large feature-space perturbations along the weight-difference direction
SLIDE 4-8 Adversarial Cone
[Figure build-up: natural example x*, adversarial example x_adv, and random perturbation directions, with the predicted probability of the true class P_y*(.) shaded from 1 to 0]
SLIDE 9 Adversarial Cone
Adversarial examples are embedded in a cone-like structure
SLIDE 10-12 Adversarial Cone
Noise as a probing instrument
[Figure build-up: softmax outputs of the classifier along the ray x_adv + t · noise]
SLIDE 13 Main Idea: Log-Odds Robustness
The robustness properties of the log-odds f_y,z(x + η) under noise differ depending on whether x is a natural or an adversarial example: the noise-induced change f_y,z(x + η) − f_y,z(x) tends to have a characteristic direction if x is adversarial, whereas it tends not to have a specific direction if x is natural.
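To make the noise probing concrete, here is a minimal PyTorch sketch (illustrative, not the authors' reference implementation; the Gaussian noise source, the stand-in `model`, and all parameter values are assumptions) of how the noise-perturbed log-odds can be estimated:

```python
# Minimal sketch (illustrative): estimate the noise-perturbed log-odds change
#   g_{y,z}(x) = E_eta[ f_{y,z}(x + eta) - f_{y,z}(x) ],
# where f_{y,z}(x) = F_z(x) - F_y(x) and F(x) are the classifier's logits.
import torch

def perturbed_log_odds(model, x, y, z, noise_std=0.1, n_samples=256):
    with torch.no_grad():
        logits = model(x.unsqueeze(0)).squeeze(0)
        f_clean = logits[z] - logits[y]

        # Draw a batch of noisy copies x + eta, eta ~ N(0, noise_std^2 I).
        eta = noise_std * torch.randn(n_samples, *x.shape)
        noisy_logits = model(x.unsqueeze(0) + eta)
        f_noisy = noisy_logits[:, z] - noisy_logits[:, y]

    # Large positive values: noise systematically moves the log-odds away
    # from the predicted class y and towards candidate class z.
    return (f_noisy - f_clean).mean().item()

# Toy usage with a stand-in model (a trained CNN in practice):
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(3, 32, 32)
print(perturbed_log_odds(model, x, y=0, z=1))
```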
SLIDE 14
Main Idea: Log-Odds Robustness
[Figure: distribution of noise-induced log-odds changes, natural vs. adversarial examples]
Noise can partially undo the effect of an adversarial perturbation and directionally revert the log-odds towards the true class y*
SLIDE 15
Statistical Test & Corrected Classification
We propose to use the noise-perturbed pairwise log-odds ḡ_y,z(x) = E_η[f_y,z(x + η) − f_y,z(x)] to test whether x, classified as y, should be thought of as a manipulated example of true class z:
adversarial if max_{z≠y} [ḡ_y,z(x) − τ_y,z] ≥ 0, with thresholds τ_y,z calibrated on natural examples
Corrected classification: ŷ(x) = argmax_z [ḡ_y,z(x) − τ_y,z]
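A sketch of how the test and the corrected classifier could be wired together; the helper name, the Gaussian noise source, and the threshold calibration are assumptions for illustration, not the paper's exact procedure:

```python
# Sketch (illustrative): flag x as adversarial if some class z beats its
# calibrated threshold, and reclassify via the argmax of the thresholded
# perturbed log-odds. tau is a (num_classes, num_classes) tensor of
# thresholds tau[y, z], e.g. calibrated to ~1% FPR on held-out clean data.
import torch

def detect_and_correct(model, x, tau, noise_std=0.1, n_samples=256):
    with torch.no_grad():
        logits = model(x.unsqueeze(0)).squeeze(0)
        y = int(logits.argmax())                    # predicted class

        eta = noise_std * torch.randn(n_samples, *x.shape)
        noisy = model(x.unsqueeze(0) + eta)         # (n_samples, num_classes)

        # Pairwise perturbed log-odds, averaged over noise draws:
        # g_bar[z] = E_eta[ f_{y,z}(x + eta) - f_{y,z}(x) ].
        f_clean = logits - logits[y]
        f_noisy = noisy - noisy[:, y:y + 1]
        g_bar = (f_noisy - f_clean).mean(dim=0)

        score = g_bar - tau[y]
        score[y] = 0.0                              # class y itself is neutral
        is_adversarial = bool(score.max() > 0)      # some z != y beats its threshold
        corrected = int(score.argmax())             # equals y on clean inputs
    return is_adversarial, corrected
```

On a clean input every score for z ≠ y stays below its threshold, so the argmax falls back to the original prediction y; on an adversarial input the true class typically wins.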
SLIDE 16 Detection Rates & Corrected Classification
- Our statistical test detects nearly all adversarial examples at a false-positive rate of ~1%
- Our correction method successfully reclassifies almost all adversarial examples
- The drop in performance on clean samples is negligible
SLIDE 17 Detection Rates & Corrected Classification
The detection rate increases with attack strength; corrected classification compensates for the decay in uncorrected accuracy as the attack strength increases.
[Figure: detection rate and accuracy vs. attack strength ε]
SLIDE 18 Defending against Defense-Aware Attacks
- The attacker has full knowledge of the defense: it crafts perturbations that work in expectation under the noise source used for detection
- Detection rates and corrected accuracies remain remarkably high
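For intuition, a sketch of such a defense-aware attack in the spirit of expectation-over-transformation PGD, where each gradient step averages the loss over draws from the detector's noise source (all names and parameter values are illustrative assumptions):

```python
# Sketch (illustrative): L_inf PGD whose gradient is averaged over the
# detection noise, so the perturbation fools the model "in expectation".
import torch
import torch.nn.functional as F

def noise_aware_pgd(model, x, y_true, eps=0.03, step=0.007, n_steps=40,
                    noise_std=0.1, n_noise=16):
    x_adv = x.clone()
    targets = torch.full((n_noise,), y_true, dtype=torch.long)
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        # Expected loss under the noise source used for detection.
        eta = noise_std * torch.randn(n_noise, *x.shape)
        loss = F.cross_entropy(model(x_adv.unsqueeze(0) + eta), targets)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()          # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)    # stay in the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)               # valid pixel range
    return x_adv.detach()
```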
SLIDE 19 Kevin Roth Yannic Kilcher Thomas Hofmann
Thank You
Follow-Up Work: "Adversarial Training Generalizes Data-dependent Spectral Norm Regularization" (poster #62)
ICML Workshop on Generalization (June 14)
SLIDE 20
SLIDE 21 References
The approaches most related to our work are those that detect whether the input has been perturbed, either by detecting characteristic regularities in the adversarial perturbations themselves or in the network activations they induce:
- Grosse, Kathrin, et al. "On the (statistical) detection of adversarial examples." (2017).
- Metzen, Jan Hendrik, et al. "On detecting adversarial perturbations." (2017).
- Feinman, Reuben, et al. "Detecting adversarial samples from artifacts." (2017).
- Xu, Weilin, David Evans, and Yanjun Qi. "Feature squeezing: Detecting adversarial examples in deep neural networks." (2017).
- Song, Yang, et al. "PixelDefend: Leveraging generative models to understand and defend against adversarial examples." (2017).
- Carlini, Nicholas, and David Wagner. "Adversarial examples are not easily detected: Bypassing ten detection methods." (2017).