

slide-1
SLIDE 1

Understanding and Robustifying Differentiable Architecture Search

Arber Zela¹, Thomas Elsken²,¹, Tonmoy Saikia¹, Yassine Marrakchi¹, Thomas Brox¹ & Frank Hutter¹,²

¹ Department of Computer Science, University of Freiburg

{zelaa, saikiat, marrakch, brox, fh}@cs.uni-freiburg.de

² Bosch Center for Artificial Intelligence

Thomas.Elsken@de.bosch.com

February 19, 2020

Accepted as Oral at ICLR 2020

Arber Zela RobustDARTS February 19, 2020 1

slide-2
SLIDE 2

The Choice of Architecture Matters

Performance improvements on various tasks are mostly due to novel architectural design choices.

Figure: Larger circles indicate more network parameters [Canziani et al. 2017]

slide-3
SLIDE 3

The Choice of Architecture Matters

Performance improvements on various tasks are mostly due to novel architectural design choices.

Figure: Inception-v4 modules [Szegedy et al. '17]

Designing network architectures is hard and requires a lot of human effort.

  • Can we automate this design process?

slide-4
SLIDE 4

Towards efficient Neural Architecture Search (NAS)

RL & Evolution for NAS by Google Brain [Quoc Le's group, '16-'18]

  • New state-of-the-art results for CIFAR-10, ImageNet, Penn Treebank
  • Large computational demands: 800 GPUs for 2 weeks; 12800 architectures evaluated
  • Code not public

Figure taken from FastAI

slide-5
SLIDE 5

Towards efficient Neural Architecture Search (NAS)

RL & Evolution for NAS by Google Brain [Quoc Le's group, '16-'18]

  • New state-of-the-art results for CIFAR-10, ImageNet, Penn Treebank
  • Large computational demands: 800 GPUs for 2 weeks; 12800 architectures evaluated
  • Code not public

Weight sharing/One-shot NAS [Pham et al. '18; Bender et al. '18; Liu et al. '19; Xie et al. '19; Cai et al. '19; Zhang et al. '19]

  • All possible architectures are subgraphs of a large supergraph (the one-shot model)
  • Weights are shared between different architectures with common edges/nodes in the supergraph
  • Search costs reduced to < 1 GPU day

slide-6
SLIDE 6

Differentiable NAS (DARTS) [Liu et al. ‘19]

Neural Network as Directed Acyclic Graph

  • Nodes: fixed operators (element-wise addition, concatenation) on feature maps
  • Edges: operations (sep conv 3×3, sep conv 5×5, dil conv 3×3, dil conv 5×5, max pool 3×3, avg pool 3×3, identity and zero)

Between 2 nodes: categorical choice of which operation to use

  • Relax this discrete space to a continuous representation using a convex combination of these choices (MixedOps) → one-shot model
  • Use SGD to search in the space of architectures.


slide-8
SLIDE 8

Differentiable Architecture Search (DARTS) [Liu et al. ‘19]

$$x^{(j)} = \sum_{i<j} \tilde{o}^{(i,j)}\left(x^{(i)}\right) = \sum_{i<j} \sum_{o \in \mathcal{O}} \frac{e^{\alpha_o^{(i,j)}}}{\sum_{o' \in \mathcal{O}} e^{\alpha_{o'}^{(i,j)}}} \, o\left(x^{(i)}\right)$$

$$o^{(i,j)} \in \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$$

Figure: (d) Search start, with every operation weight equal to 0.33. (e) Search end, with operation weights ranging from 0.03 to 0.84. (f) Final cell.
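The MixedOp relaxation and the final argmax discretization can be sketched in a few lines. This is only a minimal illustration, not the DARTS implementation: the three operations in `OPS` are hypothetical scalar stand-ins for the real convolution and pooling operations, and `mixed_op` and `discretize` are names chosen here.

```python
import math

def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate operations acting on a scalar "feature map".
OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}

def mixed_op(x, alphas):
    """Softmax-weighted (convex) combination of all candidate operations on one edge."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, OPS.values()))

def discretize(alphas):
    """Pick the operation with the largest architecture parameter (argmax)."""
    names = list(OPS)
    best = max(range(len(alphas)), key=lambda k: alphas[k])
    return names[best]

# At search start all alphas are equal, so each operation has weight 1/3.
start_output = mixed_op(1.0, [0.0, 0.0, 0.0])
# After search the alphas differ and the edge keeps only its strongest operation.
chosen = discretize([0.1, 2.0, -1.0])
```

With equal parameters the edge output is the plain average of the three operations; after discretization only the selected operation remains, which is where the one-shot model and the final architecture can start to disagree.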


slide-10
SLIDE 10

DARTS: Architecture Optimization

Optimizing both L_train and L_valid corresponds to a bilevel optimization problem:

$$\min_{\alpha} \; \left\{ f(\alpha) \triangleq L_{valid}\left(w^*(\alpha), \alpha\right) \right\} \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w} L_{train}(w, \alpha)$$

where

  • α → architectural weights
  • w → operation weights

Approximate $w^*(\alpha) \approx w - \xi \nabla_w L_{train}(w, \alpha)$

The optimization alternates between:

1. Update w by $\nabla_w L_{train}(w, \alpha)$
2. Update α by $\nabla_\alpha L_{valid}\left(w - \xi \nabla_w L_{train}(w, \alpha), \alpha\right)$
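The alternating scheme can be illustrated on a toy problem with scalar w and α. Everything below is an assumption made for illustration: the two quadratic losses, the step sizes and the function names are not from the slides. In this toy, L_valid depends on α only through the one-step lookahead of w, so the α-gradient comes entirely from the ξ term.

```python
# Toy bilevel problem (hypothetical losses chosen for illustration):
#   L_train(w, a) = 0.5 * (w - a)^2   -> inner optimum w*(a) = a
#   L_valid(w, a) = 0.5 * (w - 1)^2   -> outer optimum at a = 1

def grad_w_train(w, a):
    """Gradient of L_train with respect to w."""
    return w - a

def valid_loss_after_step(w, a, xi):
    """L_valid evaluated at the one-step approximation w - xi * grad_w L_train."""
    w_step = w - xi * grad_w_train(w, a)
    return 0.5 * (w_step - 1.0) ** 2

def grad_a_valid(w, a, xi):
    """Gradient of valid_loss_after_step with respect to a (chain rule: d w_step / d a = xi)."""
    w_step = w - xi * grad_w_train(w, a)
    return (w_step - 1.0) * xi

def search(steps=2000, lr_w=0.1, lr_a=0.1, xi=0.1):
    """Alternate between a w-update on L_train and an alpha-update on L_valid."""
    w, a = 0.0, 0.0
    for _ in range(steps):
        w -= lr_w * grad_w_train(w, a)      # step 1: update operation weights
        a -= lr_a * grad_a_valid(w, a, xi)  # step 2: update architectural weights
    return w, a

w_final, a_final = search()
```

Both variables approach 1, the point where the inner solution w*(α) = α also minimizes the validation loss.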

slide-11
SLIDE 11

Works quite well on many benchmarks

Original CNN space:

  • 8 operations on each MixedOp
  • 28 MixedOps in total
  • > 10^23 possible architectures

< 3% test error on CIFAR-10 in less than 1 GPU day of search


slide-13
SLIDE 13

But not always...

  • S1: This search space uses a different set of two operators per edge, derived by iteratively running DARTS and pruning unimportant operations.
  • S2: {3 × 3 SepConv, SkipConnect}
  • S3: {3 × 3 SepConv, SkipConnect, Zero}
  • S4: {3 × 3 SepConv, Noise}

Figure: cells found by DARTS in these spaces. Three of the cells consist almost entirely of skip_connect operations (a single sep_conv_3x3 appears in one of them); the fourth contains three sep_conv_3x3 and five noise operations.


slide-15
SLIDE 15

Architecture overfitting

S5: Very small search space with known global optimum. All 81 possible architectures were trained 3 independent times using the default DARTS settings.

Architectural parameters start overfitting to the validation set.

Figure: DARTS test regret and RS-ws test regret (left axis, Test regret (%)) and DARTS one-shot val. error (right axis, Validation error (%)) vs. search epoch. L2 factor: 0.0003


slide-17
SLIDE 17

Architecture overfitting

What would be a good feature for detecting overfitting without training and evaluating the architectures from scratch (too expensive!)?

HINT: the flatness/sharpness of minima, e.g. in large vs. small batch size training of NNs, is a good indicator of generalization. [2]

[2] Hessian-based Analysis of Large Batch Training and Robustness to Adversaries. Yao et al. NeurIPS '18


slide-19
SLIDE 19

Generalization of architectures and sharpness of minima

Compute the full Hessian $\nabla^2_\alpha L_{val}$ on a randomly sampled mini-batch from the validation set.

The dominant eigenvalue (EV) starts increasing at the point where the architecture generalization error starts increasing.

Figure: three panels vs. search epoch for S1, S2, S3 and S4: one-shot validation error (%), test error (%), and dominant eigenvalue.

slide-20
SLIDE 20

Generalization of architectures and sharpness of minima

Compute the full Hessian $\nabla^2_\alpha L_{val}$ on a randomly sampled mini-batch from the validation set.

The dominant eigenvalue (EV) starts increasing at the point where the architecture generalization error starts increasing.

High correlation between generalization and the dominant eigenvalue (EV).

Figure: test error (%) vs. average dominant eigenvalue for S1 on C10 (average over the EV trajectory). Pearson corr. coef.: 0.867, p-value: 0.00000
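A dominant eigenvalue can also be estimated without forming the full Hessian. The sketch below swaps in power iteration with finite-difference Hessian-vector products, which is not what the slides describe (they compute the full Hessian); the toy quadratic `toy_loss` and all function names are hypothetical. Power iteration returns the eigenvalue of largest magnitude, which matches the dominant eigenvalue only when that eigenvalue is positive.

```python
import math

def toy_loss(alpha):
    """Hypothetical validation loss: quadratic with Hessian diag(3, 1)."""
    return 0.5 * (3.0 * alpha[0] ** 2 + alpha[1] ** 2)

def grad(f, x, eps=1e-5):
    """Central finite-difference gradient of f at x."""
    g = []
    for k in range(len(x)):
        xp, xm = list(x), list(x)
        xp[k] += eps
        xm[k] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

def hvp(f, x, v, eps=1e-3):
    """Hessian-vector product H v via a central difference of gradients."""
    xp = [xi + eps * vi for xi, vi in zip(x, v)]
    xm = [xi - eps * vi for xi, vi in zip(x, v)]
    gp, gm = grad(f, xp), grad(f, xm)
    return [(a - b) / (2 * eps) for a, b in zip(gp, gm)]

def dominant_eigenvalue(f, x, iters=100):
    """Power iteration on the Hessian of f at x; returns the Rayleigh quotient."""
    n = len(x)
    v = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(iters):
        hv = hvp(f, x, v)
        lam = sum(a * b for a, b in zip(v, hv))
        norm = math.sqrt(sum(a * a for a in hv))
        if norm == 0.0:
            break
        v = [a / norm for a in hv]
    return lam

lam_max = dominant_eigenvalue(toy_loss, [0.3, -0.2])
```

For the toy quadratic the estimate approaches 3, the larger of the two Hessian eigenvalues. In a real search the loss would be the validation loss on one mini-batch, viewed as a function of α with the weights held fixed.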

slide-21
SLIDE 21

Early Stopping and Meta-regularization

Goal: Keep the dominant eigenvalue at a low value

  • Early stop whenever the EV increases rapidly
  • Regularize the inner problem

Figure: moving average (MA) of the max. eigenvalue vs. epoch on S1 cifar10, for dp = 0.0, 0.2, 0.4, 0.6 (left panel) and for L2 = 0.0003, 0.0009, 0.0027, 0.0081, 0.0243 (right panel).

Benchmark    DARTS           DARTS-ES
C10   S1     4.66 ± 0.71     3.05 ± 0.07
      S2     4.42 ± 0.40     3.41 ± 0.14
      S3     4.12 ± 0.85     3.71 ± 1.14
      S4     6.95 ± 0.18     4.17 ± 0.21
C100  S1     29.93 ± 0.41    28.90 ± 0.81
      S2     28.75 ± 0.92    24.68 ± 1.43
      S3     29.01 ± 0.24    26.99 ± 1.79
      S4     24.77 ± 1.51    23.90 ± 2.01
SVHN  S1     9.88 ± 5.50     2.80 ± 0.09
      S2     3.69 ± 0.12     2.68 ± 0.18
      S3     4.00 ± 1.01     2.78 ± 0.29
      S4     2.90 ± 0.02     2.55 ± 0.15
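One way to turn "early stop whenever the EV increases rapidly" into a rule is to compare a smoothed eigenvalue with its value a few epochs earlier. The sketch below is a hedged illustration: the window size, the lookback `k`, the threshold `ratio` and the function names are assumptions made here, not values stated on the slides.

```python
def moving_average(values, window):
    """Trailing moving average; early entries average over what is available."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def early_stop_epoch(eigenvalues, k=5, window=5, ratio=0.75):
    """Return the epoch to roll back to, or None if the EV never rises rapidly.

    The search stops at the first epoch i where the smoothed eigenvalue from
    k epochs earlier is less than `ratio` times the current smoothed value.
    All three parameters are illustrative assumptions.
    """
    smoothed = moving_average(eigenvalues, window)
    for i in range(k, len(smoothed)):
        if smoothed[i] > 0 and smoothed[i - k] / smoothed[i] < ratio:
            return i - k
    return None

# Hypothetical trace: flat for 10 epochs, then a steady increase.
trace = [0.2] * 10 + [0.2 + 0.1 * j for j in range(1, 11)]
stop_at = early_stop_epoch(trace)
flat = early_stop_epoch([0.2] * 20)
```

Returning the earlier epoch `i - k` rather than `i` means the architecture parameters are taken from before the eigenvalue started to grow.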


slide-23
SLIDE 23

How does the curvature relate to generalization?

Sharp minima are much more sensitive to variations in the input space. DARTS discretizes (i.e. takes the argmax over α) to get the final architecture.

Figure: Eigenvalues vs. Accuracy Drop. Validation accuracy drop (%) vs. dominant eigenvalue; Spearman corr. coef.: 0.736

Evaluate the found architectures with the search model weights. Report the accuracy drop relative to the search model performance.

slide-24
SLIDE 24

How does the curvature relate to generalization?

Sharp minima are much more sensitive to variations in the input space. DARTS discretizes (i.e. takes the argmax over α) to get the final architecture.

Figure: Taken from SDARTS-RS [Chen & Hsieh, 2020]
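The link between curvature and the discretization step can be made explicit with a second-order Taylor expansion. This is a sketch under stated assumptions: $\alpha^*$ denotes the α found by the search and is assumed to be a local minimum of $L_{valid}$ (so the gradient term vanishes), $\alpha^{disc}$ denotes the discretized α, $H$ is the Hessian $\nabla^2_\alpha L_{valid}$ at $\alpha^*$, and higher-order terms are ignored. The symbols $\alpha^{disc}$, $H$ and $\lambda_{max}$ are notation introduced here.

```latex
\begin{aligned}
L_{valid}(w^*, \alpha^{disc})
  &\approx L_{valid}(w^*, \alpha^*)
   + \tfrac{1}{2}\,(\alpha^{disc} - \alpha^*)^{\top} H \,(\alpha^{disc} - \alpha^*) \\
  &\le L_{valid}(w^*, \alpha^*)
   + \tfrac{1}{2}\,\lambda_{max}\,\lVert \alpha^{disc} - \alpha^* \rVert_2^2
\end{aligned}
```

Under these assumptions, a smaller dominant eigenvalue $\lambda_{max}$ gives a tighter bound on how much the validation loss can grow when α is moved to its discretized value.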

slide-25
SLIDE 25

Benchmark Results

Empirical evaluation of practical robustified versions of DARTS. Each entry is the test error after retraining the selected architecture as usual. The best method for each setting is boldface and underlined, the second best boldface.

Benchmark    RS-ws   DARTS   R-DARTS(DP)   R-DARTS(L2)   DARTS-ES   DARTS-ADA
C10   S1     3.23    3.84    3.11          2.78          3.01       3.10
      S2     3.66    4.85    3.48          3.31          3.26       3.35
      S3     2.95    3.34    2.93          2.51          2.74       2.59
      S4     8.07    7.20    3.58          3.56          3.71       4.84
C100  S1     23.30   29.46   25.93         24.25         28.37      24.03
      S2     21.21   26.05   22.30         22.24         23.25      23.52
      S3     23.75   28.90   22.36         23.99         23.73      23.37
      S4     28.19   22.85   22.18         21.94         21.26      23.20
SVHN  S1     2.59    4.58    2.55          4.79          2.72       2.53
      S2     2.72    3.53    2.52          2.51          2.60       2.54
      S3     2.87    3.41    2.49          2.48          2.50       2.50
      S4     3.46    3.05    2.61          2.50          2.51       2.46

slide-26
SLIDE 26

More results

Effect of regularization for disparity estimation. Search was conducted on FlyingThings3D (FT) and then evaluated on both FT and Sintel.

Aug. Scale   One-shot valid EPE   FT test EPE   Sintel test EPE   Params (M)
0.0          4.49                 3.83          5.69              9.65
0.1          3.53                 3.75          5.97              9.65
0.5          3.28                 3.37          5.22              9.43
1.0          4.61                 3.12          5.47              12.46
1.5          5.23                 2.60          4.15              12.57
2.0          7.45                 2.33          3.76              12.25

L2 reg. factor   One-shot valid EPE   FT test EPE   Sintel test EPE   Params (M)
3 × 10^-4        3.95                 3.25          6.13              11.00
9 × 10^-4        5.97                 2.30          4.12              13.92
27 × 10^-4       4.25                 2.72          4.83              10.29
81 × 10^-4       4.61                 2.34          3.85              12.16

DARTS vs. RobustDARTS on the original DARTS search spaces. We show mean ± stddev for 5 repetitions.

Benchmark   DARTS           R-DARTS(L2)
C10         2.91 ± 0.25     2.95 ± 0.21
C100        20.58 ± 0.44    18.01 ± 0.26
SVHN        2.46 ± 0.09     2.17 ± 0.09
PTB         58.64           57.59

slide-27
SLIDE 27

Conclusions

1. We identify 12 NAS benchmarks in which standard DARTS yields degenerate architectures with poor test performance.

2. We show that there is a strong correlation between the sharpness of minima and the architecture's generalization error.

3. Based on these observations, we propose regularizers at the architectural level, such as:

  • EV-based early stopping
  • (Adaptive) regularization in the inner objective of DARTS