

SLIDE 1

Confidence-Calibrated Adversarial Training

Generalizing to Unseen Attacks

David Stutz, Matthias Hein, Bernt Schiele

SLIDE 2

Problem: Robustness to various adversarial examples. Adversarial training on L∞ adversarial examples:

[Figure: SVHN, confidence vs. L∞ perturbation in the adversarial direction, for the correct and the adversarial class; robust for ≤ǫ (seen); training ǫ=0.03.]

2-Minute Overview

Confidence-Calibrated Adversarial Training – David Stutz

SLIDE 3

Problem: Robustness to various adversarial examples. Adversarial training on L∞ adversarial examples:

[Figure: SVHN, confidence vs. L∞ perturbation in the adversarial direction; robust for ≤ǫ (seen), not robust for >ǫ (unseen); training ǫ=0.03.]

SLIDE 4

Problem: Robustness to various adversarial examples. Adversarial training on L∞ adversarial examples:

[Figure: SVHN, confidence vs. L2 perturbation in the adversarial direction; not robust against the (unseen) L2 attack.]

SLIDE 5

Summary of adversarial training:

[Figure: confidence vs. L∞ perturbation (robust for ≤ǫ seen, not robust for >ǫ unseen, training ǫ=0.03) and vs. L2 perturbation (not robust against the unseen L2 attack).]

◮ High-confidence on adversarial examples (≤ǫ).
◮ No generalization to larger/other Lp perturbations.
◮ Behavior not meaningful for arbitrarily large ǫ.

SLIDE 6

Confidence-calibrated adversarial training (L∞ only):

[Figure: SVHN, confidence vs. L∞ perturbation in the adversarial direction; ≤ǫ (seen); training ǫ=0.03.]

SLIDE 7

Confidence-calibrated adversarial training (L∞ only):

[Figure: SVHN, confidence vs. L∞ perturbation in the adversarial direction; robust by rejecting below the confidence threshold, for ≤ǫ (seen) and >ǫ (unseen); training ǫ=0.03.]

SLIDE 8

Confidence-calibrated adversarial training (L∞ only):

[Figure: SVHN, confidence vs. L2 perturbation in the adversarial direction; robust by rejecting below the confidence threshold against the (unseen) L2 attack.]

SLIDE 9

Adversarial training:

[Figure: confidence vs. L∞ perturbation; robust for ≤ǫ (seen), not robust for >ǫ (unseen); training ǫ=0.03.]

◮ High-confidence on adversarial examples.
◮ No robustness to unseen perturbations.

Confidence-calibrated adversarial training:

[Figure: confidence vs. L∞ perturbation; robust by rejecting below the confidence threshold, for ≤ǫ (seen) and >ǫ (unseen); training ǫ=0.03.]

◮ Low-confidence on adversarial examples.
◮ Robustness to unseen perturbations by confidence thresholding.

SLIDE 10

More details:
Paper & code: davidstutz.de/ccat
Contact: david.stutz@mpi-inf.mpg.de

Interested?

SLIDE 11

More details:
Paper & code: davidstutz.de/ccat
Contact: david.stutz@mpi-inf.mpg.de

Outline:

  • 1. Problems of adversarial training
  • 2. Confidence-calibrated adversarial training
  • 3. Confidence-thresholded robust test error
  • 4. Results on SVHN and CIFAR10

SLIDE 12

Min-max formulation:

min_w E_{p(x,y)} [ max_{‖δ‖∞ ≤ ǫ} L(f(x + δ; w), y) ]

◮ The classifier minimizes cross-entropy and thus yields high confidence on adversarial examples.
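In practice the inner maximization is approximated with projected gradient descent (PGD). A minimal NumPy sketch on a toy linear softmax classifier, just to make the ascent step and the projection onto the ǫ-ball concrete (all function names are mine, not from the paper's code; a real implementation would use a deep network and autodiff):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(W, x, y):
    # L(f(x; w), y) for a toy linear softmax classifier f(x; w) = Wx
    return -np.log(softmax(W @ x)[y])

def pgd_linf(W, x, y, eps=0.03, alpha=0.005, steps=40):
    """Approximately solve max_{||delta||_inf <= eps} L(f(x + delta; w), y)
    by gradient ascent with sign steps, projected onto the L_inf ball."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        p = softmax(W @ (x + delta))
        grad_logits = p.copy()
        grad_logits[y] -= 1.0                    # dL/dlogits = softmax - one_hot(y)
        grad_x = W.T @ grad_logits               # chain rule through the linear model
        delta = delta + alpha * np.sign(grad_x)  # ascent step on the loss
        delta = np.clip(delta, -eps, eps)        # project onto the L_inf ball
    return delta
```

The projection via `np.clip` is what makes the result an L∞ adversarial example with ‖δ‖∞ ≤ ǫ; the loss can only be increased inside that ball.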

Problems of Adversarial Training

SLIDE 13

Min-max formulation:

min_w E_{p(x,y)} [ max_{‖δ‖∞ ≤ ǫ} L(f(x + δ; w), y) ]

◮ The classifier minimizes cross-entropy and thus yields high confidence on adversarial examples.

[Figure: confidence vs. L∞ perturbation in the adversarial direction (robust for ≤ǫ seen, not robust for >ǫ unseen, training ǫ=0.03) and vs. L2 perturbation (not robust against the unseen L2 attack).]

◮ Robustness does not generalize to unseen attacks.

SLIDE 14

1. Transition to uniform distribution on adversarial examples within the ǫ-ball:

[Figure: confidence vs. L∞ perturbation in the (adversarial) direction, from −0.04 to 0.04; training ǫ=0.03 marked on both sides.]

◮ Low-confidence extrapolated beyond ǫ-ball.

Confidence-Calibrated Adversarial Training

SLIDE 15

1. Transition to low confidence on adversarial examples within the ǫ-ball.
2. Reject low-confidence (adversarial) examples via confidence thresholding:

[Figure: confidence vs. L∞ perturbation with the confidence threshold and reject region (training ǫ=0.03); histogram of confidence on adversarial examples (CCAT) with the reject region.]

SLIDE 16
  • 1. Compute high-confidence adversarial examples:

δ̃ = argmax_{‖δ‖∞ ≤ ǫ} max_{k ≠ y} f_k(x + δ; w)    (f_k: confidence of class k)

  • 2. Impose the target distribution via the cross-entropy loss:

ỹ = λ one_hot(y) + (1 − λ) 1/K,    λ = (1 − min(1, ‖δ‖∞/ǫ))^ρ

[Figure: target distribution ỹ against the L∞ perturbation ‖δ‖∞; transition from one-hot to completely uniform within the ǫ-ball.]
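The target distribution of step 2 is a one-liner in code; a small NumPy sketch (function name and signature are my own, not from the released code):

```python
import numpy as np

def ccat_target(y, delta, eps, K, rho=1.0):
    """Soft target y~ = lam * one_hot(y) + (1 - lam) * 1/K,
    with lam = (1 - min(1, ||delta||_inf / eps)) ** rho."""
    lam = (1.0 - min(1.0, np.abs(delta).max() / eps)) ** rho
    target = np.full(K, (1.0 - lam) / K)  # uniform component
    target[y] += lam                       # one-hot component
    return target
```

At δ = 0 this reduces to the one-hot label; once ‖δ‖∞ reaches ǫ it is exactly uniform, so low confidence is enforced at the boundary of the ǫ-ball and extrapolated beyond it.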

1 Transition to Low Confidence


SLIDE 18

[Figure: SVHN, confidence vs. L∞ perturbation in the adversarial direction; ≤ǫ (seen); training ǫ=0.03.]

2 Robustness by Confidence Thresholding

SLIDE 19

[Figure: SVHN, confidence vs. L∞ perturbation in the adversarial direction; robust by rejecting below the confidence threshold, for ≤ǫ (seen) and >ǫ (unseen); training ǫ=0.03.]

SLIDE 20

[Figure: SVHN, confidence vs. L2 perturbation in the adversarial direction; robust by rejecting below the confidence threshold against the (unseen) L2 attack.]

SLIDE 21

Adversarial training:

[Figure: confidence along the interpolation between a test example x and another example x′.]

Confidence-calibrated adversarial training:

[Figure: confidence over the interpolation factor κ between x and x′; confidence drops between the examples.]

2 Meaningful Extrapolation of Confidence

SLIDE 22

Confidence-calibrated adversarial training:

1. Transition: low confidence on adversarial examples.
2. Reject low-confidence (adversarial) examples.

[Figure: confidence vs. L∞ perturbation (confidence threshold, robust by rejecting, ≤ǫ seen and >ǫ unseen, training ǫ=0.03) and vs. L2 perturbation (unseen L2 attack, robust by rejecting).]

◮ Robustness to previously unseen perturbations.

Summary: Generalizable Robustness

SLIDE 23

RErr = error on test examples that are “attacked”.

Adversarial Training (AT): 57.3% RErr
Ours (CCAT): 97.8% RErr

“Standard” Robust Test Error RErr

SLIDE 24

RErr = error on test examples that are “attacked”.

Adversarial Training (AT): 57.3% RErr

[Figure: histogram of confidence on adversarial examples, AT; total 539/1000.]

Ours (CCAT): 97.8% RErr

[Figure: histogram of confidence on adversarial examples, CCAT; total 949/1000.]

SLIDE 25

Confidence-thresholded RErr = error on test examples that are “attacked” and pass confidence thresholding.

Adversarial Training (AT): 56.0% (−1.3%)

[Figure: histogram of confidence on adversarial examples, AT, with the reject region below the threshold.]

Ours (CCAT): 39.1% (−58.7%)

[Figure: histogram of confidence on adversarial examples, CCAT, with the reject region below the threshold.]
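As a sketch, the confidence-thresholded error can be computed from per-example confidences and correctness flags. This is my own simplification of the metric, counting an attacked test example as an error only if it is misclassified and still passes the threshold:

```python
import numpy as np

def thresholded_robust_error(confidences, is_correct, tau):
    """Fraction of attacked test examples that are misclassified AND
    pass the rejection threshold (confidence >= tau)."""
    confidences = np.asarray(confidences, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    return float(np.mean((confidences >= tau) & ~is_correct))
```

With tau = 0 (no rejection) this recovers a standard robust error; raising tau can only lower it, which is why thresholding barely helps AT but reduces CCAT's error drastically.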

Confidence-Thresholded RErr

SLIDE 26

◮ Independent of adversarial examples.
◮ Avoid incorrectly rejecting (clean) test examples.

Confidence threshold at 99% true positive rate (TPR):

[Figure: confidence histogram on clean test examples (CCAT) with the 99% TPR threshold, rejecting at most 1% of correct test examples; confidence histogram on adversarial examples (CCAT).]
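Choosing the threshold needs only the confidences of correctly classified clean examples: pick τ so that at least 99% of them are kept. A NumPy sketch (the helper name is hypothetical):

```python
import numpy as np

def threshold_at_tpr(correct_confidences, tpr=0.99):
    """Largest tau such that at least `tpr` of the correctly classified
    clean examples keep confidence >= tau (i.e. are not rejected)."""
    c = np.sort(np.asarray(correct_confidences, dtype=float))
    n_reject = int(np.floor((1.0 - tpr) * len(c)))  # reject at most this many
    return c[n_reject]
```

Because the threshold never looks at adversarial examples, it cannot be gamed by the attack used at evaluation time.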

Determine Confidence Threshold

SLIDE 27

Datasets: SVHN, CIFAR10, 1000 test examples. Per-example, worst-case (thresholded) RErr across:

Attack            Iterations   Restarts
PGD               200-1000     10-50
Query-Limited†    1000         11
Simple†           1000         10
Square†           5000         1
Geometry†         1000         1
Random†           –            5000

(† Black-box attacks.)

◮ Attacks adapted to maximize confidence.
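“Adapted to maximize confidence” means the attacks optimize the confidence of the most likely wrong class, mirroring the δ̃ objective from the training procedure; as a tiny sketch (naming is mine):

```python
import numpy as np

def confidence_objective(probs, y):
    """Attack objective: confidence of the most likely wrong class,
    i.e. max_{k != y} f_k(x + delta; w)."""
    probs = np.asarray(probs, dtype=float)
    return float(np.delete(probs, y).max())
```

Maximizing this objective, rather than the cross-entropy loss, is what makes the attacks a fair test of a rejection-based defense: they actively try to push adversarial examples above the confidence threshold.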

Results

SLIDE 28

SVHN: RErr ↓ in % at 99% TPR

          L∞ ǫ=0.03 (seen)
AT        56.0
CCAT      39.1

(Lower RErr ↓ means “better” robustness.)

SVHN: Generalization to Unseen Attacks

SLIDE 29

SVHN: RErr ↓ in % at 99% TPR

          L∞ ǫ=0.03   L∞ ǫ=0.06   L2 ǫ=2     L1 ǫ=24    L0 ǫ=10
          (seen)      (unseen)    (unseen)   (unseen)   (unseen)
AT        56.0
CCAT      39.1

(Lower RErr ↓ means “better” robustness.)

SLIDE 30

SVHN: RErr ↓ in % at 99% TPR

          L∞ ǫ=0.03   L∞ ǫ=0.06   L2 ǫ=2     L1 ǫ=24    L0 ǫ=10
          (seen)      (unseen)    (unseen)   (unseen)   (unseen)
AT        56.0        88.4        99.4       99.5       73.6
CCAT      39.1        53.1        29.0       31.7       3.5

(Lower RErr ↓ means “better” robustness.)

SLIDE 31

CIFAR10: RErr ↓ in % at 99% TPR

          L∞ ǫ=0.03   L∞ ǫ=0.06   L2 ǫ=2     L1 ǫ=24    L0 ǫ=10
          (seen)      (unseen)    (unseen)   (unseen)   (unseen)
AT        62.7        93.7        98.4       98.4       72.4
CCAT      67.9        92.0        51.8       58.5       20.3

(Lower RErr ↓ means “better” robustness.)

CIFAR10: Generalization to Unseen Attacks

SLIDE 32

CIFAR10: RErr, FPR and CErr at 99% TPR

          adv. frames   distal     corrupted
          (unseen)      (unseen)   (unseen)
          RErr ↓        FPR ↓      CErr ↓
Normal    96.6          83.3       12.3
AT        78.7          75.0       16.2
CCAT      65.1          –          8.5

(FPR: false positive rate, fraction of non-rejected adv. examples.)
(CErr: test error on corrupted examples after thresholding.)

“Unconventional” Attacks

SLIDE 33

SVHN: Err ↓ in %

          no reject   99% TPR
Normal    3.6         2.6
AT        3.4         2.5
CCAT      2.9         2.1

CIFAR10: Err ↓ in %

          no reject   99% TPR
Normal    8.3         7.4
AT        16.6        15.5
CCAT      10.1        6.7

(Err: test error before and after thresholding.)

Improved Accuracy

SLIDE 34

Low-confidence on adversarial examples and beyond.

◮ Robustness generalizes to unseen attacks.
◮ Accuracy improves.

Adversarial training:

[Figure: confidence vs. L∞ perturbation; robust for ≤ǫ (seen), not robust for >ǫ (unseen); training ǫ=0.03.]

Ours:

[Figure: confidence vs. L∞ perturbation; robust by rejecting below the confidence threshold, for ≤ǫ (seen) and >ǫ (unseen); training ǫ=0.03.]

Paper & code: davidstutz.de/ccat

Confidence-Calibrated Adversarial Training
