SLIDE 1

Recovery Guarantees for One-hidden-layer Neural Networks

Kai Zhong∗

Joint work with Zhao Song∗, Prateek Jain†, Peter L. Bartlett‡, Inderjit S. Dhillon∗

∗UT-Austin, †MSR India, ‡UC Berkeley

SLIDE 2

Learning Neural Networks is Hard

The objective functions of neural networks are highly non-convex, so gradient-descent-based methods are only guaranteed to reach local optima.

SLIDE 3

Learning Neural Networks is Hard

Good News

When the size of the network is very large, no need to worry about bad local minima. Every local minimum is a global minimum or close to a global minimum. [Choromanska et al. ’15, Nguyen & Hein ’17, etc.]

SLIDE 4

Learning Neural Networks is Hard

Good News

When the size of the network is very large, no need to worry about bad local minima. Every local minimum is a global minimum or close to a global minimum. [Choromanska et al. ’15, Nguyen & Hein ’17, etc.]

Bad News

These results typically require over-parameterization, which may lead to overfitting!

SLIDE 5

Learning Neural Networks is Hard

Good News

When the size of the network is very large, no need to worry about bad local minima. Every local minimum is a global minimum or close to a global minimum. [Choromanska et al. ’15, Nguyen & Hein ’17, etc.]

Bad News

These results typically require over-parameterization, which may lead to overfitting!

Can we learn a neural net without over-parameterization?

SLIDE 6

Recover A Neural Network

Assume the data follows a specified neural network model. Try to recover this model.

SLIDE 7

Model: One-hidden-layer Neural Network

Assume n samples S = {(x_j, y_j)}_{j=1,2,··· ,n} ⊂ ℝ^d × ℝ are sampled i.i.d. from the distribution D: x ∼ N(0, I), y = ∑_{i=1}^k v_i* · φ(w_i*^⊤ x), where φ(z) is the activation function, k is the number of hidden nodes, and {w_i*, v_i*}_{i=1,2,··· ,k} are the underlying ground-truth parameters.
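To make the data model concrete, below is a minimal NumPy sketch that draws samples from D with a ReLU activation; the sizes d, k, n and the planted parameters are illustrative choices, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 5000              # input dim, hidden width, sample count (illustrative)

# Planted ground-truth parameters: W* has columns w_i*, and v* lies in {-1, +1}^k.
W_star = rng.normal(size=(d, k))
v_star = rng.choice([-1.0, 1.0], size=k)

def phi(z):
    """Activation function; ReLU is one choice satisfying the properties used later."""
    return np.maximum(z, 0.0)

# x ~ N(0, I_d), y = sum_i v_i* * phi(w_i*^T x)
X = rng.normal(size=(n, d))
y = phi(X @ W_star) @ v_star       # shape (n,)
```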

SLIDE 8

General Issues and Our Contribution

Can we recover the model? How many samples are required? (Sample Complexity) And how much time? (Computational Complexity)

SLIDE 9

General Issues and Our Contribution

Can we recover the model?

Yes, by gradient descent following tensor method initialization

How many samples are required? (Sample Complexity) And how much time? (Computational Complexity)

SLIDE 10

General Issues and Our Contribution

Can we recover the model?

Yes, by gradient descent following tensor method initialization

How many samples are required? (Sample Complexity)

|S| > d · log(1/ε) · poly(k, λ), where ε is the precision and λ is a condition number of W*.

And how much time? (Computational Complexity)

SLIDE 11

General Issues and Our Contribution

Can we recover the model?

Yes, by gradient descent following tensor method initialization

How many samples are required? (Sample Complexity)

|S| > d · log(1/ε) · poly(k, λ), where ε is the precision and λ is a condition number of W*.

And how much time? (Computational Complexity)

|S| · d · poly(k, λ)

SLIDE 12

General Issues and Our Contribution

Can we recover the model?

Yes, by gradient descent following tensor method initialization

How many samples are required? (Sample Complexity)

|S| > d · log(1/ε) · poly(k, λ), where ε is the precision and λ is a condition number of W*.

And how much time? (Computational Complexity)

|S| · d · poly(k, λ)

This is the first recovery guarantee with both sample complexity and computational complexity linear in the input dimension and logarithmic in the inverse precision 1/ε.

SLIDE 13

Objective Function

Given v_i* and a sample set S, consider the L2 loss

f_S(W) = 1/(2|S|) · ∑_{(x,y)∈S} ( ∑_{i=1}^k v_i* · φ(w_i^⊤ x) − y )².

SLIDE 14

Objective Function

Given v_i* and a sample set S, consider the L2 loss

f_S(W) = 1/(2|S|) · ∑_{(x,y)∈S} ( ∑_{i=1}^k v_i* · φ(w_i^⊤ x) − y )².

We show it is locally strongly convex near the ground truth!
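As a concrete reference for this objective, here is a small NumPy sketch of f_S(W), reusing the X, y, v_star, phi names from the data-generation sketch above (those names are illustrative, not from the talk).

```python
import numpy as np

def f_S(W, X, y, v_star, phi):
    """Empirical L2 loss: 1/(2|S|) * sum_{(x,y) in S} (sum_i v_i* phi(w_i^T x) - y)^2."""
    residuals = phi(X @ W) @ v_star - y     # one residual per sample
    return 0.5 * np.mean(residuals ** 2)

# Example: the loss vanishes at the ground truth and is positive elsewhere,
# i.e. f_S(W_star, X, y, v_star, phi) == 0.0 up to floating point.
```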

SLIDE 15

Approach

SLIDE 16

Local Strong Convexity (LSC)

∇²f(W) is positive definite (p.d.) for W ∈ A ⇒ f(W) is LSC in area A.

SLIDE 17

Local Strong Convexity (LSC)

∇²f(W) is positive definite (p.d.) for W ∈ A ⇒ f(W) is LSC in area A.

Consider the minimal eigenvalue of the expected Hessian at the ground truth,

λ_min(∇²f_D(W*)) = min_{∑_j ‖a_j‖² = 1} E[ ( ∑_j φ′(w_j*^⊤ x) · x^⊤ a_j )² ],

where f_D is the expected risk.

SLIDE 18

Local Strong Convexity (LSC)

∇²f(W) is positive definite (p.d.) for W ∈ A ⇒ f(W) is LSC in area A.

Consider the minimal eigenvalue of the expected Hessian at the ground truth,

λ_min(∇²f_D(W*)) = min_{∑_j ‖a_j‖² = 1} E[ ( ∑_j φ′(w_j*^⊤ x) · x^⊤ a_j )² ],

where f_D is the expected risk.

λ_min(∇²f_D(W*)) ≥ 0 always holds.

SLIDE 19

Local Strong Convexity (LSC)

∇²f(W) is positive definite (p.d.) for W ∈ A ⇒ f(W) is LSC in area A.

Consider the minimal eigenvalue of the expected Hessian at the ground truth,

λ_min(∇²f_D(W*)) = min_{∑_j ‖a_j‖² = 1} E[ ( ∑_j φ′(w_j*^⊤ x) · x^⊤ a_j )² ],

where f_D is the expected risk.

λ_min(∇²f_D(W*)) ≥ 0 always holds.

Does λ_min(∇²f_D(W*)) > 0 always hold?

SLIDE 20

Local Strong Convexity (LSC)

∇²f(W) is positive definite (p.d.) for W ∈ A ⇒ f(W) is LSC in area A.

Consider the minimal eigenvalue of the expected Hessian at the ground truth,

λ_min(∇²f_D(W*)) = min_{∑_j ‖a_j‖² = 1} E[ ( ∑_j φ′(w_j*^⊤ x) · x^⊤ a_j )² ],

where f_D is the expected risk.

λ_min(∇²f_D(W*)) ≥ 0 always holds.

Does λ_min(∇²f_D(W*)) > 0 always hold? No.
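The minimal eigenvalue above can be estimated numerically: for a direction A = [a_1, ..., a_k] the quadratic form equals vec(A)^⊤ E[g(x)g(x)^⊤] vec(A), where g(x) stacks φ′(w_j*^⊤ x)·x over j, so λ_min is the smallest eigenvalue of that dk × dk moment matrix. The Monte-Carlo sketch below is an illustration only (it implicitly takes v_i* = 1, as on the next slides); the sizes and sample counts are made up.

```python
import numpy as np

def hessian_min_eig(W_star, phi_prime, n_mc=20000, rng=None):
    """Monte-Carlo estimate of lambda_min of the expected Hessian at W*."""
    if rng is None:
        rng = np.random.default_rng(0)
    d, k = W_star.shape
    H = np.zeros((d * k, d * k))
    for _ in range(n_mc):
        x = rng.normal(size=d)
        # g stacks phi'(w_j*^T x) * x for j = 1..k
        g = np.concatenate([phi_prime(W_star[:, j] @ x) * x for j in range(k)])
        H += np.outer(g, g)
    return np.linalg.eigvalsh(H / n_mc)[0]

# ReLU example: the smallest eigenvalue is strictly positive for a generic W*.
W_star = np.random.default_rng(1).normal(size=(4, 3))
print(hessian_min_eig(W_star, lambda z: float(z > 0)))
```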

SLIDE 21

Two Examples when LSC doesn’t Hold

Set v_i* = 1 and W* = I (k = d).

1 When φ(z) = z,

λ_min(∇²f_D(W*)) = min_{∑_j ‖a_j‖² = 1} E[ ( ∑_j x^⊤ a_j )² ] = 0.

The minimum is achieved when ∑_j a_j = 0.

SLIDE 22

Two Examples when LSC doesn’t Hold

Set v_i* = 1 and W* = I (k = d).

2 When φ(z) = z²,

λ_min(∇²f_D(W*)) = 4 · min_{∑_j ‖a_j‖² = 1} E[ ⟨xx^⊤, A⟩² ] = 0,

where A = [a_1, a_2, · · · , a_d] ∈ ℝ^{d×d}. The minimum is achieved when A = −A^⊤.
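As a quick numerical sanity check of these two degenerate cases, one can reuse the hessian_min_eig sketch from above with W* = I (again illustrative code, not from the talk): the linear and quadratic activations give a vanishing smallest eigenvalue, while ReLU stays bounded away from zero.

```python
import numpy as np

d = 5
W_star = np.eye(d)                      # W* = I, k = d, v_i* = 1
derivatives = {
    "linear (phi(z) = z)":      lambda z: 1.0,
    "quadratic (phi(z) = z^2)": lambda z: 2.0 * z,
    "ReLU":                     lambda z: float(z > 0),
}
for name, dphi in derivatives.items():
    print(name, hessian_min_eig(W_star, dphi, n_mc=5000))
# linear/quadratic: ~0 (flat directions with sum_j a_j = 0, resp. A = -A^T)
# ReLU: strictly positive
```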

SLIDE 23

When LSC Holds

1 φ(z) satisfies three properties.

P1 (Non-negative and homogeneously bounded derivative): 0 ≤ φ′(z) ≤ L_1·|z|^p for some constants L_1 > 0 and p ≥ 0.

Figure: activations satisfying P1: max(z, 0), tanh(z), max(z, 0.1z)

Figure: activations not satisfying P1: max(−z, 0), z², e^z

SLIDE 24

When LSC Holds

1 φ(z) satisfies three properties.

P2 (“Non-linearity”¹): for any σ > 0, we have ρ(σ) > 0, where

ρ(σ) := min{ α_{2,0} − α_{1,0}² − α_{1,1}²,  α_{2,2} − α_{1,1}² − α_{1,2}²,  α_{1,0}·α_{1,2} − α_{1,1}² }

and α_{i,j} := E_{z∼N(0,1)}[ (φ′(σz))^i · z^j ].

activation        ρ(0.1)    ρ(1)      ρ(10)
ReLU              0.091 for every σ
leaky ReLU        0.089 for every σ
squared ReLU      0.27σ² (depends on σ)
erf               1.9E-4    5.2E-2    2.5E-5
tanh              1.8E-4    4.9E-2    5.1E-5
linear, quadratic: P2 fails (ρ(σ) ≤ 0)

¹Best name we could find; ρ(σ) still needs more understanding.
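To make P2 more tangible, here is a small Monte-Carlo sketch (entirely illustrative, with arbitrary sample sizes) that estimates ρ(σ) from the α_{i,j} moments defined above; for ReLU it reproduces the 0.091 entry in the table.

```python
import numpy as np

def rho(sigma, phi_prime, n_mc=200_000, rng=None):
    """Estimate rho(sigma) = min of the three moment expressions defining P2."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = rng.normal(size=n_mc)
    def alpha(i, j):
        return np.mean(phi_prime(sigma * z) ** i * z ** j)
    return min(alpha(2, 0) - alpha(1, 0) ** 2 - alpha(1, 1) ** 2,
               alpha(2, 2) - alpha(1, 1) ** 2 - alpha(1, 2) ** 2,
               alpha(1, 0) * alpha(1, 2) - alpha(1, 1) ** 2)

relu_prime = lambda z: (z > 0).astype(float)
print(rho(1.0, relu_prime))   # approximately 0.091 for ReLU
```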

SLIDE 25

When LSC Holds

1 φ(z) satisfies three properties.

P3: φ′′(z) satisfies one of the following two properties,

(a) Smoothness: |φ′′(z)| ≤ L_2 for all z, for some constant L_2, or
(b) Piece-wise linearity: φ′′(z) = 0 except at e points, where e is a finite constant.

Figure: activations satisfying P3: max(z, 0), max(z, 0.1z) (piece-wise linear); tanh(z), max(z, 0)² (smooth)

Figure: activations not satisfying P3: e^z; φ(z) = 0 if z < 0, z⁴ + 4z otherwise

SLIDE 26

Three Properties in Summary

P1 Non-negative and homogeneously bounded derivative
P2 “Non-linearity”
P3 (a) Smoothness, or (b) Piece-wise linearity

name           φ(z)                             P1   P2   P3.a   P3.b   P1,2,3
ReLU           max{z, 0}                        ✓    ✓    ✗      ✓      ✓
leaky ReLU     max{z, 0.01z}                    ✓    ✓    ✗      ✓      ✓
squared ReLU   max{z, 0}²                       ✓    ✓    ✓      ✗      ✓
sigmoid        1/(1 + e^{−z})                   ✓    ✓    ✓      ✗      ✓
tanh           (e^z − e^{−z})/(e^z + e^{−z})    ✓    ✓    ✓      ✗      ✓
erf            ∫₀^z e^{−t²} dt                  ✓    ✓    ✓      ✗      ✓
linear         z                                ✓    ✗    ✓      ✓      ✗
quadratic      z²                               ✗    ✗    ✓      ✗      ✗

SLIDE 27

Local Strong Convexity

Definition

Let σ_i (i = 1, 2, · · · , k) denote the i-th singular value of W* ∈ ℝ^{d×k}. Define κ = σ_1/σ_k and λ = (∏_{i=1}^k σ_i)/σ_k^k.

Theorem

Let

1 φ(z) satisfy Properties 1, 2, 3 with ρ(σ_k) > 0,
2 |S| ≥ d · poly(k, λ)/ρ²(σ_k),
3 ‖W − W*‖ ≤ ρ²(σ_k)/poly(λ, k).

Then there exist two positive constants m_0 = Θ(ρ(σ_k)/(κ²λ)) and M_0 = Θ(k·σ_1^{2p}) such that, w.h.p.,

m_0·I ⪯ ∇²f_S(W) ⪯ M_0·I.

SLIDE 28

Linear Convergence of Gradient Descent

For smooth activations, gradient descent has linear convergence.

Corollary

Let φ(z) satisfy Properties 1, 2, 3(a), and let |S| and W satisfy the conditions in the above theorem. Let W† = W − (1/M_0)·∇f_S(W); then w.h.p.

‖W† − W*‖²_F ≤ (1 − m_0/M_0)·‖W − W*‖²_F.
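To illustrate the update rule W† = W − (1/M_0)·∇f_S(W), here is a rough NumPy sketch of the gradient and the descent loop, reusing the illustrative names from the earlier sketches; since M_0 is not computed here, a hand-tuned constant step size stands in for 1/M_0, and a smooth activation such as the squared ReLU is assumed.

```python
import numpy as np

def grad_f_S(W, X, y, v_star, phi, phi_prime):
    """Gradient of f_S w.r.t. W; column i is (1/n) * sum_j r_j * v_i* * phi'(w_i^T x_j) * x_j."""
    n = X.shape[0]
    Z = X @ W                                    # pre-activations, shape (n, k)
    r = phi(Z) @ v_star - y                      # residuals, shape (n,)
    return X.T @ (r[:, None] * phi_prime(Z) * v_star[None, :]) / n

def gradient_descent(W0, X, y, v_star, phi, phi_prime, step=0.05, iters=1000):
    W = W0.copy()
    for _ in range(iters):
        W -= step * grad_f_S(W, X, y, v_star, phi, phi_prime)
    return W

# Smooth activation example (squared ReLU): phi(z) = max(z,0)^2, phi'(z) = 2*max(z,0).
squared_relu = lambda z: np.maximum(z, 0.0) ** 2
squared_relu_prime = lambda z: 2.0 * np.maximum(z, 0.0)
```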

SLIDE 29

Initialization by Tensor Method

Definition

φ(z) is called q-homogeneous if φ(σ · z) = σ^q · φ(z) for some constant q and any σ > 0.

Fact

If (x, y) is sampled from D: x ∼ N(0, I), y = ∑_i v_i* · φ(w_i*^⊤ x), and φ(z) is q-homogeneous, then

E[y · (x ⊗ x ⊗ x − x ⊗̃ I)] = ∑_i c · v_i* · ‖w_i*‖^{q−3} · w_i* ⊗ w_i* ⊗ w_i*,

where v ⊗̃ I = ∑_{j=1}^d [v ⊗ e_j ⊗ e_j + e_j ⊗ v ⊗ e_j + e_j ⊗ e_j ⊗ v].
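Below is a sketch of how the empirical moment tensor M̂_3 could be formed from samples, following the definition of x ⊗̃ I above; the einsum-based construction is an illustration, not the authors' implementation.

```python
import numpy as np

def empirical_M3(X, y):
    """Estimate M3 = E[y * (x ⊗ x ⊗ x − x ⊗̃ I)] from samples (rows of X, entries of y)."""
    n, d = X.shape
    T = np.einsum('j,ja,jb,jc->abc', y, X, X, X) / n    # E[y * x ⊗ x ⊗ x]
    m1 = (y[:, None] * X).mean(axis=0)                  # E[y * x]
    I = np.eye(d)
    # E[y * (x ⊗̃ I)]_{abc} = m1_a I_{bc} + m1_b I_{ac} + m1_c I_{ab}
    T -= (np.einsum('a,bc->abc', m1, I)
          + np.einsum('b,ac->abc', m1, I)
          + np.einsum('c,ab->abc', m1, I))
    return T
```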

SLIDE 30

Estimate Parameters Using Tensor Decomposition

W.l.o.g. we can assume v_i* ∈ {−1, 1}, due to the homogeneity. Setting M_3 := E[y · (x ⊗ x ⊗ x − x ⊗̃ I)], we can

1 compute an empirical estimate M̂_3 of M_3 from samples,
2 run tensor decomposition on M̂_3,
3 recover v_i* ∈ {−1, 1} exactly and approximate w_i*.
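Putting steps 1 to 3 together, a rough initialization sketch might look like the following; tensorly's parafac is just one off-the-shelf CP decomposition (an assumption, not necessarily what the authors used), and only the unit directions of the w_i* are extracted here, with signs and magnitudes left to the homogeneity argument above.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def tensor_init_directions(X, y, k):
    """Rough estimates of the directions w_i*/||w_i*|| via a rank-k CP decomposition."""
    M3_hat = empirical_M3(X, y)                      # from the sketch above
    weights, factors = parafac(tl.tensor(M3_hat), rank=k)
    A = tl.to_numpy(factors[0])                      # (d, k); columns ≈ ±(w_i* directions)
    return A / np.linalg.norm(A, axis=0, keepdims=True)
```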

SLIDE 31

Overall Theoretical Guarantees

Theorem

Let the activation function be homogeneous and satisfy Properties 1, 2, 3(a). Then for any ε > 0, if |S| ≥ O(d · log(1/ε) · poly(k, λ)), the tensor method followed by gradient descent takes O(|S| · d · poly(k, λ)) time and outputs W and v satisfying

‖W − W*‖_F ≤ O(ε)  and  v_i = v_i*.

The proof mainly relies on
• the matrix Bernstein inequality,
• an error bound for non-orthogonal tensor decomposition from [Kuleshov-Chaganty-Liang ’15],
• linear convergence of gradient descent.

SLIDE 32

Take-home Message and Future Work

Take-home message

1 The squared loss of one-hidden-layer neural nets is locally strongly convex near the ground truth w.r.t. the first-layer parameters.

2 The tensor method is able to initialize the parameters inside the local strong convexity region.

3 Sample and computational complexities are linear in the input dimension and logarithmic in the inverse precision.

Future work

1 One-hidden-layer nets have low capacity. Multiple layers?

2 The tensor method depends heavily on the Gaussian assumption. Random initialization?
