SLIDE 1

Recovery Guarantees for One-hidden-layer Neural Networks

Kai Zhong∗

Joint work with Zhao Song∗, Prateek Jain†, Peter L. Bartlett‡, Inderjit S. Dhillon∗

∗UT-Austin, †MSR India, ‡UC Berkeley

SLIDE 2

Learning Neural Networks is Hard

The objective functions of neural networks are highly non-convex, so gradient-descent-based methods are only guaranteed to reach local optima.

SLIDE 3

Learning Neural Networks is Hard

Good News

When the size of the network is very large, no need to worry about bad local minima. Every local minimum is a global minimum or close to a global minimum. [Choromanska et al. ’15, Nguyen & Hein ’17, etc.]

SLIDE 4

Learning Neural Networks is Hard

Good News

When the size of the network is very large, no need to worry about bad local minima. Every local minimum is a global minimum or close to a global minimum. [Choromanska et al. ’15, Nguyen & Hein ’17, etc.]

Bad News

These results typically require over-parameterization, which may lead to overfitting!

SLIDE 5

Learning Neural Networks is Hard

Good News

When the size of the network is very large, no need to worry about bad local minima. Every local minimum is a global minimum or close to a global minimum. [Choromanska et al. ’15, Nguyen & Hein ’17, etc.]

Bad News

These results typically require over-parameterization, which may lead to overfitting!

Can we learn a neural net without over-parameterization?

SLIDE 6

Recover A Neural Network

Assume the data follows a specified neural network model. Try to recover this model.

SLIDE 7

Model: One-hidden-layer Neural Network

Assume n samples S = {(x_j, y_j)}_{j=1,2,··· ,n} ⊂ ℝ^d × ℝ are sampled i.i.d. from the distribution D: x ∼ N(0, I), y = ∑_{i=1}^k v_i* · φ(w_i*^⊤ x), where φ(z) is the activation function, k is the number of hidden nodes, and {w_i*, v_i*}_{i=1,2,··· ,k} are the underlying ground-truth parameters.
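To make the data model concrete, below is a minimal NumPy sketch that draws samples from D with a ReLU activation; the sizes d, k, n and the planted parameters are illustrative choices, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 5000              # input dim, hidden width, sample count (illustrative)

# Planted ground-truth parameters: W* has columns w_i*, and v* lies in {-1, +1}^k.
W_star = rng.normal(size=(d, k))
v_star = rng.choice([-1.0, 1.0], size=k)

def phi(z):
    """Activation function; ReLU is one choice satisfying the properties used later."""
    return np.maximum(z, 0.0)

# x ~ N(0, I_d), y = sum_i v_i* * phi(w_i*^T x)
X = rng.normal(size=(n, d))
y = phi(X @ W_star) @ v_star       # shape (n,)
```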

SLIDE 8

General Issues and Our Contribution

Can we recover the model? How many samples are required? (Sample Complexity) And how much time? (Computational Complexity)

SLIDE 9

General Issues and Our Contribution

Can we recover the model?

Yes, by gradient descent following tensor method initialization

How many samples are required? (Sample Complexity) And how much time? (Computational Complexity)

SLIDE 10

General Issues and Our Contribution

Can we recover the model?

Yes, by gradient descent following tensor method initialization

How many samples are required? (Sample Complexity)

|S| > d · log(1/ε) · poly(k, λ), where ε is the precision and λ is a condition number of W*.

And how much time? (Computational Complexity)

SLIDE 11

General Issues and Our Contribution

Can we recover the model?

Yes, by gradient descent following tensor method initialization

How many samples are required? (Sample Complexity)

|S| > d · log(1/ε) · poly(k, λ), where ε is the precision and λ is a condition number of W*.

And how much time? (Computational Complexity)

|S| · d · poly(k, λ)

SLIDE 12

General Issues and Our Contribution

Can we recover the model?

Yes, by gradient descent following tensor method initialization

How many samples are required? (Sample Complexity)

|S| > d · log(1/ε) · poly(k, λ), where ε is the precision and λ is a condition number of W*.

And how much time? (Computational Complexity)

|S| · d · poly(k, λ)

This is the first recovery guarantee with both sample complexity and computational complexity linear in the input dimension and logarithmic in the inverse precision 1/ε.

SLIDE 13

Objective Function

Given v_i* and a sample set S, consider the L2 loss

f_S(W) = 1/(2|S|) · ∑_{(x,y)∈S} ( ∑_{i=1}^k v_i* · φ(w_i^⊤ x) − y )².

SLIDE 14

Objective Function

Given v_i* and a sample set S, consider the L2 loss

f_S(W) = 1/(2|S|) · ∑_{(x,y)∈S} ( ∑_{i=1}^k v_i* · φ(w_i^⊤ x) − y )².

We show it is locally strongly convex near the ground truth!
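As a concrete reference for this objective, here is a small NumPy sketch of f_S(W), reusing the X, y, v_star, phi names from the data-generation sketch above (those names are illustrative, not from the talk).

```python
import numpy as np

def f_S(W, X, y, v_star, phi):
    """Empirical L2 loss: 1/(2|S|) * sum_{(x,y) in S} (sum_i v_i* phi(w_i^T x) - y)^2."""
    residuals = phi(X @ W) @ v_star - y     # one residual per sample
    return 0.5 * np.mean(residuals ** 2)

# Example: the loss vanishes at the ground truth and is positive elsewhere,
# i.e. f_S(W_star, X, y, v_star, phi) == 0.0 up to floating point.
```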

SLIDE 15

Approach

SLIDE 16

Local Strong Convexity (LSC)

∇²f(W) is positive definite (p.d.) for W ∈ A ⇒ f(W) is LSC in area A.

SLIDE 17

Local Strong Convexity (LSC)

∇²f(W) is positive definite (p.d.) for W ∈ A ⇒ f(W) is LSC in area A.

Consider the minimal eigenvalue of the expected Hessian at the ground truth,

λ_min(∇²f_D(W*)) = min_{∑_j ‖a_j‖² = 1} E[ ( ∑_j φ′(w_j*^⊤ x) · x^⊤ a_j )² ],

where f_D is the expected risk.

SLIDE 18

Local Strong Convexity (LSC)

∇²f(W) is positive definite (p.d.) for W ∈ A ⇒ f(W) is LSC in area A.

Consider the minimal eigenvalue of the expected Hessian at the ground truth,

λ_min(∇²f_D(W*)) = min_{∑_j ‖a_j‖² = 1} E[ ( ∑_j φ′(w_j*^⊤ x) · x^⊤ a_j )² ],

where f_D is the expected risk.

λ_min(∇²f_D(W*)) ≥ 0 always holds.

SLIDE 19

Local Strong Convexity (LSC)

∇²f(W) is positive definite (p.d.) for W ∈ A ⇒ f(W) is LSC in area A.

Consider the minimal eigenvalue of the expected Hessian at the ground truth,

λ_min(∇²f_D(W*)) = min_{∑_j ‖a_j‖² = 1} E[ ( ∑_j φ′(w_j*^⊤ x) · x^⊤ a_j )² ],

where f_D is the expected risk.

λ_min(∇²f_D(W*)) ≥ 0 always holds.

Does λ_min(∇²f_D(W*)) > 0 always hold?

SLIDE 20

Local Strong Convexity (LSC)

∇²f(W) is positive definite (p.d.) for W ∈ A ⇒ f(W) is LSC in area A.

Consider the minimal eigenvalue of the expected Hessian at the ground truth,

λ_min(∇²f_D(W*)) = min_{∑_j ‖a_j‖² = 1} E[ ( ∑_j φ′(w_j*^⊤ x) · x^⊤ a_j )² ],

where f_D is the expected risk.

λ_min(∇²f_D(W*)) ≥ 0 always holds.

Does λ_min(∇²f_D(W*)) > 0 always hold? No.
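The minimal eigenvalue above can be estimated numerically: for a direction A = [a_1, ..., a_k] the quadratic form equals vec(A)^⊤ E[g(x)g(x)^⊤] vec(A), where g(x) stacks φ′(w_j*^⊤ x)·x over j, so λ_min is the smallest eigenvalue of that dk × dk moment matrix. The Monte-Carlo sketch below is an illustration only (it implicitly takes v_i* = 1, as on the next slides); the sizes and sample counts are made up.

```python
import numpy as np

def hessian_min_eig(W_star, phi_prime, n_mc=20000, rng=None):
    """Monte-Carlo estimate of lambda_min of the expected Hessian at W*."""
    if rng is None:
        rng = np.random.default_rng(0)
    d, k = W_star.shape
    H = np.zeros((d * k, d * k))
    for _ in range(n_mc):
        x = rng.normal(size=d)
        # g stacks phi'(w_j*^T x) * x for j = 1..k
        g = np.concatenate([phi_prime(W_star[:, j] @ x) * x for j in range(k)])
        H += np.outer(g, g)
    return np.linalg.eigvalsh(H / n_mc)[0]

# ReLU example: the smallest eigenvalue is strictly positive for a generic W*.
W_star = np.random.default_rng(1).normal(size=(4, 3))
print(hessian_min_eig(W_star, lambda z: float(z > 0)))
```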

SLIDE 21

Two Examples when LSC doesn’t Hold

Set v_i* = 1 and W* = I (k = d).

1 When φ(z) = z,

λ_min(∇²f_D(W*)) = min_{∑_j ‖a_j‖² = 1} E[ ( ∑_j x^⊤ a_j )² ] = 0.

The minimum is achieved when ∑_j a_j = 0.

SLIDE 22

Two Examples when LSC doesn’t Hold

Set v_i* = 1 and W* = I (k = d).

2 When φ(z) = z²,

λ_min(∇²f_D(W*)) = 4 · min_{∑_j ‖a_j‖² = 1} E[ ⟨xx^⊤, A⟩² ] = 0,

where A = [a_1, a_2, · · · , a_d] ∈ ℝ^{d×d}. The minimum is achieved when A = −A^⊤.
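As a quick numerical sanity check of these two degenerate cases, one can reuse the hessian_min_eig sketch from above with W* = I (again illustrative code, not from the talk): the linear and quadratic activations give a vanishing smallest eigenvalue, while ReLU stays bounded away from zero.

```python
import numpy as np

d = 5
W_star = np.eye(d)                      # W* = I, k = d, v_i* = 1
derivatives = {
    "linear (phi(z) = z)":      lambda z: 1.0,
    "quadratic (phi(z) = z^2)": lambda z: 2.0 * z,
    "ReLU":                     lambda z: float(z > 0),
}
for name, dphi in derivatives.items():
    print(name, hessian_min_eig(W_star, dphi, n_mc=5000))
# linear/quadratic: ~0 (flat directions with sum_j a_j = 0, resp. A = -A^T)
# ReLU: strictly positive
```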

SLIDE 23

When LSC Holds

1 φ(z) satisfies three properties.

P1 (Non-negative and homogeneously bounded derivative): 0 ≤ φ′(z) ≤ L_1·|z|^p for some constants L_1 > 0 and p ≥ 0.

Figure: activations satisfying P1: max(z, 0), tanh(z), max(z, 0.1z)

Figure: activations not satisfying P1: max(−z, 0), z², e^z

SLIDE 24

When LSC Holds

1 φ(z) satisfies three properties.

P2 (“Non-linearity”¹): for any σ > 0, we have ρ(σ) > 0, where

ρ(σ) := min{ α_{2,0} − α_{1,0}² − α_{1,1}²,  α_{2,2} − α_{1,1}² − α_{1,2}²,  α_{1,0}·α_{1,2} − α_{1,1}² }

and α_{i,j} := E_{z∼N(0,1)}[ (φ′(σz))^i · z^j ].

activation        ρ(0.1)    ρ(1)      ρ(10)
ReLU              0.091 for every σ
leaky ReLU        0.089 for every σ
squared ReLU      0.27σ² (depends on σ)
erf               1.9E-4    5.2E-2    2.5E-5
tanh              1.8E-4    4.9E-2    5.1E-5
linear, quadratic: P2 fails (ρ(σ) ≤ 0)

¹Best name we could find; ρ(σ) still needs more understanding.
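To make P2 more tangible, here is a small Monte-Carlo sketch (entirely illustrative, with arbitrary sample sizes) that estimates ρ(σ) from the α_{i,j} moments defined above; for ReLU it reproduces the 0.091 entry in the table.

```python
import numpy as np

def rho(sigma, phi_prime, n_mc=200_000, rng=None):
    """Estimate rho(sigma) = min of the three moment expressions defining P2."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = rng.normal(size=n_mc)
    def alpha(i, j):
        return np.mean(phi_prime(sigma * z) ** i * z ** j)
    return min(alpha(2, 0) - alpha(1, 0) ** 2 - alpha(1, 1) ** 2,
               alpha(2, 2) - alpha(1, 1) ** 2 - alpha(1, 2) ** 2,
               alpha(1, 0) * alpha(1, 2) - alpha(1, 1) ** 2)

relu_prime = lambda z: (z > 0).astype(float)
print(rho(1.0, relu_prime))   # approximately 0.091 for ReLU
```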

SLIDE 25

When LSC Holds

1 φ(z) satisfies three properties.

P3: φ′′(z) satisfies one of the following two properties,

(a) Smoothness: |φ′′(z)| ≤ L_2 for all z, for some constant L_2, or
(b) Piece-wise linearity: φ′′(z) = 0 except at e points, where e is a finite constant.

Figure: activations satisfying P3: max(z, 0), max(z, 0.1z) (piece-wise linear); tanh(z), max(z, 0)² (smooth)

Figure: activations not satisfying P3: e^z; φ(z) = 0 if z < 0, z⁴ + 4z otherwise

SLIDE 26

Three Properties in Summary

P1 Non-negative and homogeneously bounded derivative
P2 “Non-linearity”
P3 (a) Smoothness, or (b) Piece-wise linearity

name           φ(z)                             P1   P2   P3.a   P3.b   P1,2,3
ReLU           max{z, 0}                        ✓    ✓    ✗      ✓      ✓
leaky ReLU     max{z, 0.01z}                    ✓    ✓    ✗      ✓      ✓
squared ReLU   max{z, 0}²                       ✓    ✓    ✓      ✗      ✓
sigmoid        1/(1 + e^{−z})                   ✓    ✓    ✓      ✗      ✓
tanh           (e^z − e^{−z})/(e^z + e^{−z})    ✓    ✓    ✓      ✗      ✓
erf            ∫₀^z e^{−t²} dt                  ✓    ✓    ✓      ✗      ✓
linear         z                                ✓    ✗    ✓      ✓      ✗
quadratic      z²                               ✗    ✗    ✓      ✗      ✗

SLIDE 27

Local Strong Convexity

Definition

Let σ_i (i = 1, 2, · · · , k) denote the i-th singular value of W* ∈ ℝ^{d×k}. Define κ = σ_1/σ_k and λ = (∏_{i=1}^k σ_i)/σ_k^k.

Theorem

Let

1 φ(z) satisfy Properties 1, 2, 3 with ρ(σ_k) > 0,
2 |S| ≥ d · poly(k, λ)/ρ²(σ_k),
3 ‖W − W*‖ ≤ ρ²(σ_k)/poly(λ, k).

Then there exist two positive constants m_0 = Θ(ρ(σ_k)/(κ²λ)) and M_0 = Θ(k·σ_1^{2p}) such that, w.h.p.,

m_0·I ⪯ ∇²f_S(W) ⪯ M_0·I.

SLIDE 28

Linear Convergence of Gradient Descent

For smooth activations, gradient descent has linear convergence.

Corollary

Let φ(z) satisfy Properties 1, 2, 3(a), and let |S| and W satisfy the conditions in the above theorem. Let W† = W − (1/M_0)·∇f_S(W); then w.h.p.

‖W† − W*‖²_F ≤ (1 − m_0/M_0)·‖W − W*‖²_F.
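To illustrate the update rule W† = W − (1/M_0)·∇f_S(W), here is a rough NumPy sketch of the gradient and the descent loop, reusing the illustrative names from the earlier sketches; since M_0 is not computed here, a hand-tuned constant step size stands in for 1/M_0, and a smooth activation such as the squared ReLU is assumed.

```python
import numpy as np

def grad_f_S(W, X, y, v_star, phi, phi_prime):
    """Gradient of f_S w.r.t. W; column i is (1/n) * sum_j r_j * v_i* * phi'(w_i^T x_j) * x_j."""
    n = X.shape[0]
    Z = X @ W                                    # pre-activations, shape (n, k)
    r = phi(Z) @ v_star - y                      # residuals, shape (n,)
    return X.T @ (r[:, None] * phi_prime(Z) * v_star[None, :]) / n

def gradient_descent(W0, X, y, v_star, phi, phi_prime, step=0.05, iters=1000):
    W = W0.copy()
    for _ in range(iters):
        W -= step * grad_f_S(W, X, y, v_star, phi, phi_prime)
    return W

# Smooth activation example (squared ReLU): phi(z) = max(z,0)^2, phi'(z) = 2*max(z,0).
squared_relu = lambda z: np.maximum(z, 0.0) ** 2
squared_relu_prime = lambda z: 2.0 * np.maximum(z, 0.0)
```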

SLIDE 29

Initialization by Tensor Method

Definition

φ(z) is called q-homogeneous if φ(σ · z) = σ^q · φ(z) for some constant q and any σ > 0.

Fact

If (x, y) is sampled from D: x ∼ N(0, I), y = ∑_i v_i* · φ(w_i*^⊤ x), and φ(z) is q-homogeneous, then

E[y · (x ⊗ x ⊗ x − x ⊗̃ I)] = ∑_i c · v_i* · ‖w_i*‖^{q−3} · w_i* ⊗ w_i* ⊗ w_i*,

where v ⊗̃ I = ∑_{j=1}^d [v ⊗ e_j ⊗ e_j + e_j ⊗ v ⊗ e_j + e_j ⊗ e_j ⊗ v].
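Below is a sketch of how the empirical moment tensor M̂_3 could be formed from samples, following the definition of x ⊗̃ I above; the einsum-based construction is an illustration, not the authors' implementation.

```python
import numpy as np

def empirical_M3(X, y):
    """Estimate M3 = E[y * (x ⊗ x ⊗ x − x ⊗̃ I)] from samples (rows of X, entries of y)."""
    n, d = X.shape
    T = np.einsum('j,ja,jb,jc->abc', y, X, X, X) / n    # E[y * x ⊗ x ⊗ x]
    m1 = (y[:, None] * X).mean(axis=0)                  # E[y * x]
    I = np.eye(d)
    # E[y * (x ⊗̃ I)]_{abc} = m1_a I_{bc} + m1_b I_{ac} + m1_c I_{ab}
    T -= (np.einsum('a,bc->abc', m1, I)
          + np.einsum('b,ac->abc', m1, I)
          + np.einsum('c,ab->abc', m1, I))
    return T
```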

SLIDE 30

Estimate Parameters Using Tensor Decomposition

W.l.o.g. we can assume v_i* ∈ {−1, 1}, due to the homogeneity. Setting M_3 := E[y · (x ⊗ x ⊗ x − x ⊗̃ I)], we can

1 compute an empirical estimate M̂_3 of M_3 from samples,
2 run tensor decomposition on M̂_3,
3 recover v_i* ∈ {−1, 1} exactly and approximate w_i*.
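Putting steps 1 to 3 together, a rough initialization sketch might look like the following; tensorly's parafac is just one off-the-shelf CP decomposition (an assumption, not necessarily what the authors used), and only the unit directions of the w_i* are extracted here, with signs and magnitudes left to the homogeneity argument above.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def tensor_init_directions(X, y, k):
    """Rough estimates of the directions w_i*/||w_i*|| via a rank-k CP decomposition."""
    M3_hat = empirical_M3(X, y)                      # from the sketch above
    weights, factors = parafac(tl.tensor(M3_hat), rank=k)
    A = tl.to_numpy(factors[0])                      # (d, k); columns ≈ ±(w_i* directions)
    return A / np.linalg.norm(A, axis=0, keepdims=True)
```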

SLIDE 31

Overall Theoretical Guarantees

Theorem

Let the activation function be homogeneous and satisfy Properties 1, 2, 3(a). Then for any ε > 0, if |S| ≥ O(d · log(1/ε) · poly(k, λ)), the tensor method followed by gradient descent takes O(|S| · d · poly(k, λ)) time and outputs W and v satisfying

‖W − W*‖_F ≤ O(ε)  and  v_i = v_i*.

The proof mainly relies on
• the matrix Bernstein inequality,
• an error bound for non-orthogonal tensor decomposition from [Kuleshov-Chaganty-Liang ’15],
• linear convergence of gradient descent.

SLIDE 32

Take-home Message and Future Work

Take-home message

1 The squared loss of one-hidden-layer neural nets is locally strongly convex near the ground truth w.r.t. the first-layer parameters.

2 The tensor method is able to initialize the parameters inside the local strong convexity region.

3 Sample and computational complexities are linear in the input dimension and logarithmic in the inverse precision.

Future work

1 One-hidden-layer nets have low capacity. Multiple layers?

2 The tensor method depends heavily on the Gaussian assumption. Random initialization?
