SLIDE 1

Virtual Adversarial Training

Buse Gul Atli

Aalto University, Department of Science buse.atli@aalto.fi

May 21, 2019

Buse Gul Atli (Aalto University) Virtual Adversarial Training May 21, 2019 1 / 27

SLIDE 2

What We Will Cover

1. Miyato et al., Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning, 2018

2. Miyato et al., Distributional Smoothing with Virtual Adversarial Training, 2015

SLIDE 3

Overfitting vs Underfitting

- Poor design of the model
- Noise in the training set

SLIDE 4

Regularization

- Smoothing the output distribution w.r.t. spatial/temporal inputs
- L1 and L2 regularization
- Applying random perturbations to input and hidden layers
- Dropout in NNs

SLIDE 5

Adversarial Training1

- Adds noise to the image, where the noise is in the adversarial direction
- The model's probability of correct classification is reduced in the adversarial direction

[Figure: a clean image labeled as a dog; the adversarially perturbed image still has the same label (dog)]

1. Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy, Explaining and Harnessing Adversarial Examples, 2015

SLIDE 6

Adversarial Training

- Adds noise to the image, where the noise is in the adversarial direction
- Improves the generalization performance
- Robustness against adversarial perturbation

SLIDE 7

Adversarial Training

L_adv(x_l, θ) := D[q(y|x_l), p(y|x_l + r_adv, θ)]   (1)

where r_adv := argmax_{r : ‖r‖₂ ≤ ε} D[q(y|x_l), p(y|x_l + r, θ)]   (2)

r_adv ≈ ε g/‖g‖₂,   g = ∇_{x_l} D[h(y; y_l), p(y|x_l, θ)]   (3)

r_adv ≈ ε sign(g) when the norm is L∞
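As a concrete illustration of Eqs. 2-3, here is a minimal NumPy sketch on a toy logistic-regression model. All names, and the closed-form cross-entropy gradient g = (p − y)·w, are specific to this toy setup, not taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_perturbation(x, y, w, eps, norm="l2"):
    """Eqs. 2-3: step of size eps along the loss gradient w.r.t. the input.
    For this toy model p(y=1|x) = sigmoid(w.x), the cross-entropy gradient
    has the closed form g = (p - y) * w."""
    p = sigmoid(w @ x)
    g = (p - y) * w
    if norm == "linf":
        return eps * np.sign(g)                       # r_adv = eps * sign(g)
    return eps * g / (np.linalg.norm(g) + 1e-12)      # r_adv = eps * g/||g||_2

# Perturbing in the adversarial direction lowers the correct-class probability
rng = np.random.default_rng(0)
w, x, y = rng.normal(size=5), rng.normal(size=5), 1.0
r = adversarial_perturbation(x, y, w, eps=0.1)
print(sigmoid(w @ x), sigmoid(w @ (x + r)))  # second value is smaller
```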

SLIDE 8

Virtual Adversarial Training

How can we modify the adversarial training loss in Eq. 1 when full label information is not available?

- The adversarial perturbation is intended to change the model's current guess
- The new guess should match the old guess (probably dog, maybe stick)

[Figure: an unlabeled image; the model guesses it is probably a dog, maybe a stick]

SLIDE 9

Virtual Adversarial Training

x∗ denotes both labeled (x_l) and unlabeled (x_ul) samples

L_adv(x∗, θ) := D[q(y|x∗), p(y|x∗ + r_adv, θ)]

where r_adv := argmax_{r : ‖r‖₂ ≤ ε} D[q(y|x∗), p(y|x∗ + r, θ)]

SLIDE 10

Virtual Adversarial Training

Replace q(y|x) with the current estimate p(y|x, θ̂)

Local Distributional Smoothness (LDS):

LDS(x∗, θ) := D[p(y|x∗, θ̂), p(y|x∗ + r_adv, θ)]   (4)

where r_adv := argmax_{r : ‖r‖₂ ≤ ε} D[p(y|x∗, θ̂), p(y|x∗ + r, θ̂)]   (5)

SLIDE 11

Virtual Adversarial Training

LDS(x∗, θ) := D[p(y|x∗, θ̂), p(y|x∗ + r_adv, θ)]

where r_adv := argmax_{r : ‖r‖₂ ≤ ε} D[p(y|x∗, θ̂), p(y|x∗ + r, θ̂)]   (6)

R_vadv(D_l, D_ul, θ) := (1/(N_l + N_ul)) Σ_{x∗ ∈ D_l, D_ul} LDS(x∗, θ)   (7)

SLIDE 12

Virtual Adversarial Training

LDS(x∗, θ) := D[p(y|x∗, θ̂), p(y|x∗ + r_adv, θ)]

where r_adv := argmax_{r : ‖r‖₂ ≤ ε} D[p(y|x∗, θ̂), p(y|x∗ + r, θ̂)]

R_vadv(D_l, D_ul, θ) := (1/(N_l + N_ul)) Σ_{x∗ ∈ D_l, D_ul} LDS(x∗, θ)

Full objective function:

ℓ(D_l, θ) + α R_vadv(D_l, D_ul, θ)   (8)
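Eqs. 7-8 combine into one scalar loss. Below is a minimal NumPy sketch assuming a linear softmax classifier and a precomputed perturbation matrix R_adv (computing r_adv itself is the subject of the following slides); the function and variable names are illustrative, not the paper's:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def kl(P, Q):
    """Row-wise KL divergence D[P, Q] (the choice of D used later)."""
    return np.sum(P * (np.log(P + 1e-12) - np.log(Q + 1e-12)), axis=-1)

def vat_objective(W, X_l, Y_l, X_ul, R_adv, alpha=1.0):
    """Eq. 8: cross-entropy on labeled data + alpha * R_vadv, where R_vadv
    (Eq. 7) averages LDS over labeled AND unlabeled samples."""
    P_l = softmax(X_l @ W)
    ce = -np.mean(np.log(P_l[np.arange(len(Y_l)), Y_l] + 1e-12))
    X_all = np.vstack([X_l, X_ul])
    lds = kl(softmax(X_all @ W), softmax((X_all + R_adv) @ W))  # Eq. 4
    return ce + alpha * np.mean(lds)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
X_l, Y_l = rng.normal(size=(5, 3)), rng.integers(0, 4, size=5)
X_ul = rng.normal(size=(7, 3))
loss = vat_objective(W, X_l, Y_l, X_ul, R_adv=np.zeros((12, 3)))
```

With zero perturbations the LDS term vanishes and only the supervised cross-entropy remains; any nonzero perturbation can only add a non-negative KL penalty.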

SLIDE 13

VAT : Fast Approximation Methods for radv

The linear approximation in Eq. 3 cannot be performed for LDS, written as D(r, x∗, θ̂)

Use a second-order Taylor approximation instead, since ∇_r D(r, x∗, θ̂) = 0 at r = 0:

D(r, x, θ̂) ≈ (1/2) rᵀ H(x, θ̂) r   (9)

where H(x, θ̂) is the Hessian of D with respect to r.
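The fact that ∇_r D vanishes at r = 0 (which licenses the second-order expansion in Eq. 9) can be checked numerically: if D is locally quadratic in r, halving r should shrink D by roughly 4x. A small NumPy sketch with an assumed linear-softmax model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_to_perturbed(x, W, r):
    """D(r, x) = KL[p(y|x) || p(y|x+r)] for a linear-softmax model."""
    p = softmax(x @ W)
    q = softmax((x + r) @ W)
    return np.sum(p * (np.log(p) - np.log(q)))

# Because grad_r D = 0 at r = 0, D is locally quadratic in r,
# so halving the perturbation shrinks the divergence by ~4x.
rng = np.random.default_rng(0)
W, x = rng.normal(size=(4, 3)), rng.normal(size=4)
r = 1e-3 * rng.normal(size=4)
ratio = kl_to_perturbed(x, W, r) / kl_to_perturbed(x, W, r / 2)
print(ratio)  # close to 4
```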

SLIDE 14

VAT : Fast Approximation Methods for radv

D(r, x, θ̂) ≈ (1/2) rᵀ H(x, θ̂) r

r_adv ≈ argmax_r {rᵀ H(x, θ̂) r ; ‖r‖₂ ≤ ε} = ε ū(x, θ̂)   (10)

where v̄ = v/‖v‖₂ and u(x, θ̂) is the dominant eigenvector of H(x, θ̂).

SLIDE 15

VAT : Fast Approximation Methods for radv

Computing the eigenvectors of the Hessian is O(n³); use the power iteration method instead, normalizing d after each step:

d ← Hd   (11)

Approximate the Hessian-vector product by a finite difference of gradients with a small ξ (note ∇_r D(r, x, θ̂)|_{r=0} = 0):

Hd ≈ (∇_r D(r, x, θ̂)|_{r=ξd} − ∇_r D(r, x, θ̂)|_{r=0}) / ξ = ∇_r D(r, x, θ̂)|_{r=ξd} / ξ   (12)

so each iteration becomes

d ← ∇_r D(r, x, θ̂)|_{r=ξd}   (13)
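The power-iteration step of Eq. 11 can be sketched directly in NumPy. Here an explicit matrix stands in for H so the result can be checked against an eigendecomposition; VAT itself never forms H, using the finite-difference product of Eq. 12 instead:

```python
import numpy as np

def dominant_direction(H, n_iters=1, seed=0):
    """Power iteration (Eq. 11): d <- normalize(H d).
    Each step needs only a matrix-vector product, which VAT replaces
    by a finite difference of gradients (Eq. 12)."""
    rng = np.random.default_rng(seed)
    d = rng.normal(size=H.shape[0])
    d /= np.linalg.norm(d)
    for _ in range(n_iters):
        d = H @ d
        d /= np.linalg.norm(d)
    return d

# Sanity check: for diag(5, 1, 0.5) the dominant eigenvector is e1
H = np.diag([5.0, 1.0, 0.5])
d = dominant_direction(H, n_iters=20)
print(np.abs(d))  # approaches [1, 0, 0]
```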

SLIDE 16

VAT : Fast Approximation Methods for radv

r_adv ≈ ε g/‖g‖₂   (14)

g = ∇_r D[p(y|x, θ̂), p(y|x + r, θ̂)]|_{r=ξd}   (15)

KL divergence is used for the choice of D.

SLIDE 17

VAT Example

VAT forces the model to be smooth around points with large LDS values. After 100 updates, the model predicts the same label for the set of points that belong to the same cluster.

SLIDE 18

VAT vs. Other Regularization Methods

SLIDE 19

Random Perturbation Training (RPT) and Conditional Entropy (VAT+EntMin)

VAT can be written as

R^(K)(θ, D_l, D_ul) := (1/(N_l + N_ul)) Σ_{x ∈ D_l, D_ul} E_{r_K}[ D[p(y|x, θ̂), p(y|x + r_K, θ)] ]   (16)

R^(0): RPT (smooths the function isotropically)

Conditional entropy:

R_cent = H(Y|X) = −(1/(N_l + N_ul)) Σ_{x ∈ D_l, D_ul} Σ_y p(y|x, θ) log p(y|x, θ)   (17)
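The conditional-entropy term of Eq. 17 is straightforward to compute from the model's predicted distributions. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def conditional_entropy(P):
    """Eq. 17: mean over samples of -sum_y p(y|x) log p(y|x).
    Minimizing this pushes predictions toward confident,
    low-entropy outputs on unlabeled data."""
    return -np.mean(np.sum(P * np.log(P + 1e-12), axis=1))

uniform = np.full((2, 4), 0.25)          # maximally uncertain predictions
confident = np.array([[0.97, 0.01, 0.01, 0.01],
                      [0.01, 0.97, 0.01, 0.01]])
print(conditional_entropy(uniform), conditional_entropy(confident))
```

The uniform distribution attains the maximum value log 4 ≈ 1.386; confident predictions score much lower, which is what the regularizer rewards.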

SLIDE 20

VAT Performance on Semi-Supervised Learning

VAT and data augmentation can be used together

- Without augmentation (DGM = deep generative models, FM = feature matching)
- With augmentation (translation and horizontal flip)

SLIDE 21

Effects of Perturbation Size ǫ and Regularization Coefficient α

For small ε, the hyper-parameter α plays a similar role as ε

It is therefore enough to search over ε (rather than over both α and ε), because for small perturbations

max_r {D(r, x, θ) ; ‖r‖₂ ≤ ε} ≈ max_r {(1/2) rᵀ H(x, θ) r ; ‖r‖₂ ≤ ε} = (1/2) ε² λ₁(x, θ)   (18)

where λ₁(x, θ) is the dominant eigenvalue of H(x, θ).

SLIDE 22

Effects of Perturbation Size ǫ

SLIDE 23

Effect of the Number of the Power Iterations K

The power iteration method converges slowly if there is an eigenvalue close in magnitude to the dominant eigenvalue. Convergence may therefore depend on the spectrum of the Hessian matrix.

SLIDE 24

Robustness of the VAT-trained Model Against Perturbed Images

The VAT-trained model behaves more naturally under perturbation than the model trained without VAT.

SLIDE 25

VAT: Contributions

- Applicable to semi-supervised learning tasks
- Applicable to any parametric model for which we can compute gradients w.r.t. the input and the model parameters
- Small number of hyper-parameters
- Increased robustness against adversarial examples; the model acts more naturally at different noise levels

SLIDE 26

More info

- Performance on semi-supervised image classification benchmarks
- Adversarial training methods for semi-supervised text classification

SLIDE 27

Implementation

Semi-supervised learning with VAT on SVHN

- 1,000 labeled samples, 72,257 unlabeled samples, no data augmentation
- Batch size for cross-entropy loss: 32; batch size for LDS: 128
- ~48,000 updates in training
- ADAM optimization, base learning rate = 0.001, linearly decayed over the last 16,000 updates
- α = 1, ε = 2.5, ξ = 1e−6
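The learning-rate schedule above can be sketched as follows (a minimal sketch under the stated numbers; the helper name is hypothetical, and the slide does not specify the schedule shape beyond "linearly decayed over the last 16,000 updates"):

```python
def learning_rate(step, base_lr=0.001, total=48000, decay_last=16000):
    """Constant base_lr for the first (total - decay_last) updates,
    then a linear decay to 0 over the last decay_last updates."""
    decay_start = total - decay_last
    if step < decay_start:
        return base_lr
    return base_lr * (total - step) / decay_last

print(learning_rate(0), learning_rate(40000), learning_rate(48000))
# 0.001 0.0005 0.0
```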
