
Understanding and Robustifying Differentiable Architecture Search



  1. Understanding and Robustifying Differentiable Architecture Search
     Arber Zela¹, Thomas Elsken²,¹, Tonmoy Saikia¹, Yassine Marrakchi¹, Thomas Brox¹ & Frank Hutter¹,²
     ¹ Department of Computer Science, University of Freiburg, {zelaa, saikiat, marrakch, brox, fh}@cs.uni-freiburg.de
     ² Bosch Center for Artificial Intelligence, Thomas.Elsken@de.bosch.com
     February 19, 2020. Accepted as an oral presentation at ICLR 2020.

  2. The Choice of Architecture Matters
     - Performance improvements on various tasks are mostly due to novel architectural design choices.
     - Figure: comparison of CNN architectures; larger circles indicate more network parameters [Canziani et al. 2017].

  3. The Choice of Architecture Matters
     - Performance improvements on various tasks are mostly due to novel architectural design choices.
     - Figure: Inception-v4 modules [Szegedy et al. '17].
     - Designing network architectures is hard and requires considerable human effort. Can we automate this design process?

  4. Towards efficient Neural Architecture Search (NAS)
     - RL and evolution for NAS by Google Brain [Quoc Le's group, '16-'18]:
       - New state-of-the-art results on CIFAR-10, ImageNet and Penn Treebank.
       - Large computational demands: 800 GPUs for 2 weeks; 12,800 architectures evaluated.
       - Code not public.
     - Figure taken from FastAI.

  5. Towards efficient Neural Architecture Search (NAS)
     - RL and evolution for NAS by Google Brain [Quoc Le's group, '16-'18]:
       - New state-of-the-art results on CIFAR-10, ImageNet and Penn Treebank.
       - Large computational demands: 800 GPUs for 2 weeks; 12,800 architectures evaluated.
       - Code not public.
     - Weight sharing / one-shot NAS [Pham et al. '18; Bender et al. '18; Liu et al. '19; Xie et al. '19; Cai et al. '19; Zhang et al. '19]:
       - All possible architectures are subgraphs of a large supergraph (the one-shot model).
       - Weights are shared between different architectures with common edges/nodes in the supergraph.
       - Search cost reduced to < 1 GPU day.

  6. Differentiable NAS (DARTS) [Liu et al. '19]
     - A neural network is represented as a directed acyclic graph:
       - Nodes: fixed operators (element-wise addition, concatenation) on feature maps.
       - Edges: operations (sep conv 3×3, sep conv 5×5, dil conv 3×3, dil conv 5×5, max pool 3×3, avg pool 3×3, identity and zero).
     - Between two nodes there is a categorical choice of which operation to use.
     - Relax this discrete space to a continuous representation using a convex combination of these choices (MixedOps), yielding the one-shot model (a code sketch follows below).
     - Use SGD to search in the space of architectures.
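
     To make the continuous relaxation concrete, here is a minimal PyTorch-style sketch of a mixed operation on a single edge. The class name MixedOp and the candidate_ops argument are assumptions for this sketch; the official DARTS code stores the architecture weights in a shared tensor for all edges rather than per edge, so treat this as an illustration, not the reference implementation.

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """One edge of the one-shot model: a softmax-weighted sum of all
    candidate operations (the continuous relaxation of the categorical choice)."""

    def __init__(self, candidate_ops):
        super().__init__()
        # candidate_ops: list of nn.Modules, e.g. sep_conv_3x3, skip_connect, ...
        self.ops = nn.ModuleList(candidate_ops)
        # one architecture weight alpha_o per candidate operation on this edge
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(candidate_ops)))

    def forward(self, x):
        # softmax turns the alphas into a convex combination of the operations
        weights = torch.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

     At initialization the alphas are near zero, so the softmax weights are close to uniform, which matches the 0.33 edge weights shown at the start of search on the next slide.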

  7. Differentiable Architecture Search (DARTS) [Liu et al. '19]
     - Each edge (i, j) carries a mixed operation, a softmax-weighted combination of the candidate operations, and each node sums the outputs of its incoming edges:
       x^(j) = Σ_{i<j} õ^(i,j)(x^(i)),   where   õ^(i,j)(x) = Σ_{o∈O} [ exp(α_o^(i,j)) / Σ_{o'∈O} exp(α_{o'}^(i,j)) ] · o(x)
     - Figure: a small cell with nodes 0, 1, 2. (a) Search start: the operation weights on every edge are uniform (0.33 each). (b) Search end: the weights have become peaked (e.g. 0.84 / 0.03 / 0.13 on one edge).

  8. Differentiable Architecture Search (DARTS) [Liu et al. '19]
     - Mixed operation on each edge, as on the previous slide:
       x^(j) = Σ_{i<j} õ^(i,j)(x^(i)),   where   õ^(i,j)(x) = Σ_{o∈O} [ exp(α_o^(i,j)) / Σ_{o'∈O} exp(α_{o'}^(i,j)) ] · o(x)
     - After search, each edge keeps the operation with the largest architecture weight:
       o^(i,j) ∈ argmax_{o∈O} α_o^(i,j)
     - Figure: (d) search start, (e) search end, (f) the final discrete cell obtained by the argmax.
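
     A correspondingly small sketch of the discretization step, reusing the hypothetical MixedOp class from the earlier sketch: each edge keeps the candidate with the largest architecture weight. DARTS additionally restricts every intermediate node to its two strongest incoming edges, which is omitted here.

```python
import torch

def discretize(mixed_op, op_names):
    """Pick the operation with the largest architecture weight on one edge,
    i.e. o^(i,j) = argmax_o alpha_o^(i,j)."""
    best = torch.argmax(mixed_op.alpha).item()
    return op_names[best]

# Usage sketch: op_names must be ordered like the candidate_ops given to MixedOp.
# chosen = discretize(edge, ["sep_conv_3x3", "skip_connect", "max_pool_3x3"])
```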

  9. DARTS: Architecture Optimization
     - Optimizing both L_train and L_valid corresponds to a bilevel optimization problem:
       min_α  f(α) := L_valid(w*(α), α)
       s.t.   w*(α) = argmin_w L_train(w, α)
     where
       - α: architectural weights
       - w: operation weights

  10. DARTS: Architecture Optimization
     - Optimizing both L_train and L_valid corresponds to a bilevel optimization problem:
       min_α  f(α) := L_valid(w*(α), α)
       s.t.   w*(α) = argmin_w L_train(w, α)
     where
       - α: architectural weights
       - w: operation weights
     - Approximate the inner solution by a single gradient step: w*(α) ≈ w − ξ ∇_w L_train(w, α).
     - The optimization alternates between (sketched in code below):
       1. Update w by ∇_w L_train(w, α)
       2. Update α by ∇_α L_valid(w − ξ ∇_w L_train(w, α), α)
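
     The alternating scheme above can be sketched in a few lines of PyTorch. The version below is the first-order variant (ξ = 0), so the α step uses the current weights w directly rather than the virtual step w − ξ∇_w L_train; the second-order update would need an extra unrolled gradient step that is omitted here. All names (model, criterion, the two optimizers) are assumptions for the sketch, not the authors' API.

```python
def search_step(model, criterion, train_batch, valid_batch,
                w_optimizer, alpha_optimizer):
    """One DARTS search iteration, first-order approximation (xi = 0).
    w_optimizer steps only the operation weights w; alpha_optimizer steps
    only the architecture weights alpha."""
    x_train, y_train = train_batch
    x_valid, y_valid = valid_batch

    # 1) Update the operation weights w on the training loss L_train.
    w_optimizer.zero_grad()
    criterion(model(x_train), y_train).backward()
    w_optimizer.step()

    # 2) Update the architecture weights alpha on the validation loss L_valid.
    alpha_optimizer.zero_grad()
    criterion(model(x_valid), y_valid).backward()
    alpha_optimizer.step()
```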

  11. Works quite well on many benchmarks
     - Original CNN search space: 8 candidate operations on each MixedOp, 28 MixedOps in total, i.e. more than 10^23 possible architectures.
     - < 3% test error on CIFAR-10 in less than 1 GPU day of search.
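
     As a rough sanity check of the size quoted above, under the assumption that the 8 candidate operations could be chosen independently on each of the 28 MixedOps: the naive count is 8^28. The true number of distinct architectures also depends on which edges survive the final pruning, so this is only a ballpark figure consistent with the "> 10^23" claim.

```python
# Naive count: 8 independent choices on each of 28 MixedOps (ballpark only).
print(8 ** 28)   # 19342813113834066795298816 ≈ 1.9e25, i.e. > 10^23
```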

  12. But not always...
     - S1: a different set of two operations per edge, derived by iteratively running DARTS and pruning unimportant operations.
     - S2: {3×3 SepConv, SkipConnect}
     - S3: {3×3 SepConv, SkipConnect, Zero}
     - S4: {3×3 SepConv, Noise}

  13. But not always...
     - S1: a different set of two operations per edge, derived by iteratively running DARTS and pruning unimportant operations.
     - S2: {3×3 SepConv, SkipConnect}
     - S3: {3×3 SepConv, SkipConnect, Zero}
     - S4: {3×3 SepConv, Noise}
     - Figure: the cells found by DARTS in these spaces are degenerate, dominated by skip connections (and noise operations in S4) with only a few sep_conv_3x3 edges.
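
     For reference, the reduced operation sets S2-S4 written out as plain Python lists, using the operation names from the public DARTS code (where the zero operation is called "none"). S1 is left out because it assigns a different pair of operations to every edge rather than one global set; the dictionary name below is ours, not the paper's.

```python
# Reduced search spaces from the RobustDARTS analysis (S1 omitted, see above).
SEARCH_SPACES = {
    "S2": ["sep_conv_3x3", "skip_connect"],
    "S3": ["sep_conv_3x3", "skip_connect", "none"],  # "none" = the zero operation
    "S4": ["sep_conv_3x3", "noise"],                 # noise op outputs random noise instead of its input
}
```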

  14. Architecture overfitting
     - S5: a very small search space with a known global optimum; 81 possible architectures, trained 3 independent times using the default DARTS settings.

  15. Architecture overfitting
     - S5: a very small search space with a known global optimum; 81 possible architectures, trained 3 independent times using the default DARTS settings.
     - The architectural parameters start overfitting to the validation set.
     - Figure (L2 factor 0.0003): DARTS test regret (%), RS-ws test regret (%) and DARTS one-shot validation error (%) plotted over 50 search epochs.

  16. Architecture overfitting
     - What would be a good feature for detecting this overfitting without training and evaluating the architectures from scratch (too expensive!)?

  17. Architecture overfitting
     - What would be a good feature for detecting this overfitting without training and evaluating the architectures from scratch (too expensive!)?
     - HINT: the flatness/sharpness of minima, e.g. in large- vs. small-batch training of neural networks, is a good indicator of generalization.²
     ² Hessian-based Analysis of Large Batch Training and Robustness to Adversaries. Yao et al., NeurIPS '18.

  18. Generalization of architectures and sharpness of minima
     - Compute the full Hessian ∇²_α L_valid of the validation loss with respect to the architectural parameters on a randomly sampled mini-batch from the validation set (a code sketch follows below).
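
     A minimal sketch of this computation in PyTorch, assuming the architecture parameters have been flattened into a single tensor and valid_loss_fn evaluates L_valid on one fixed validation mini-batch as a function of α. Computing the full Hessian is feasible here because α has only a few hundred entries; this is an illustration under those assumptions, not the authors' exact code.

```python
import torch

def dominant_eigenvalue(valid_loss_fn, alpha):
    """Largest eigenvalue of the Hessian of L_valid w.r.t. the (flattened)
    architecture parameters alpha, computed on one validation mini-batch."""
    hess = torch.autograd.functional.hessian(valid_loss_fn, alpha)
    hess = hess.reshape(alpha.numel(), alpha.numel())
    # The Hessian is symmetric, so eigvalsh returns real eigenvalues (ascending).
    return torch.linalg.eigvalsh(hess)[-1].item()

# Usage sketch (all names below are placeholders):
# x_val, y_val = next(iter(valid_loader))                  # one random mini-batch
# loss_fn = lambda a: criterion(model(x_val, alphas=a), y_val)
# ev_max = dominant_eigenvalue(loss_fn, alpha_flat)
```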

  19. Generalization of architectures and sharpness of minima
     - Compute the full Hessian ∇²_α L_valid on a randomly sampled mini-batch from the validation set.
     - The dominant eigenvalue (EV) starts increasing at the point where the architecture generalization error starts increasing.
     - Figure: test error (%), one-shot validation error (%) and dominant eigenvalue vs. search epoch (0-50) for the search spaces S1-S4.

  20. Generalization of architectures and sharpness of minima
     - Compute the full Hessian ∇²_α L_valid on a randomly sampled mini-batch from the validation set.
     - The dominant EV starts increasing at the point where the architecture generalization error starts increasing.
     - High correlation between generalization and the dominant eigenvalue (EV).
     - Figure (S1, CIFAR-10, averaged over the EV trajectory): test error (%) vs. average dominant eigenvalue; Pearson corr. coef. 0.867, p-value 0.00000.
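
     The reported correlation is a standard Pearson coefficient between the per-run average dominant eigenvalue and the final test error of the corresponding architecture. A hedged sketch of that computation follows; the two lists are dummy placeholders for illustration only, not the paper's data.

```python
from scipy.stats import pearsonr

# One entry per DARTS run on S1/CIFAR-10: average dominant EV over the search
# trajectory, and the test error of the selected architecture. Placeholder values.
avg_dominant_ev = [0.16, 0.21, 0.27, 0.33, 0.39]
test_error      = [3.1, 3.6, 4.2, 4.8, 5.4]

r, p = pearsonr(avg_dominant_ev, test_error)
print(f"Pearson corr. coef.: {r:.3f}, p-value: {p:.5f}")
```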
