SLIDE 1

Collapse of Deep and Narrow ReLU Neural Nets

Lu Lu, Yeonjong Shin, Yanhui Su, George Karniadakis
Division of Applied Mathematics, Brown University
Scientific Machine Learning, ICERM, January 28, 2019

SLIDE 2

Overview

1. Introduction
2. Examples
3. Theoretical analysis
4. Asymmetric initialization (Shin)

SLIDES 3-8

Introduction

Shallow NNs (single hidden layer)

◮ universal approximation theorem

Deep (& narrow) NNs

◮ Better than shallow NNs (of comparable size)
◮ size_deep / size_shallow ≈ ε^d [Mhaskar & Poggio, 2016]

⇒ Deep & narrow

ReLU(x) := max(x, 0)

◮ Width limit? For continuous functions [0, 1]^{d_in} → ℝ^{d_out} [Hanin & Sellke, 2017]:
  d_in + 1 ≤ minimal width ≤ d_in + d_out
◮ Depth limit?

SLIDES 9-12

Introduction

Training of NNs
◮ NP-hard [Sima, 2002]
◮ Local minima [Fukumizu & Amari, 2000]
◮ Bad saddle points [Kawaguchi, 2016]

ReLU
◮ Dying ReLU neuron: stuck on the negative side

Deep ReLU nets?

Dying ReLU network

NN is a constant function after initialization

Collapse

NN converges to the “mean” state of the target function during training
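Whether a freshly initialized network is born dead can be checked directly: if some hidden ReLU layer outputs zero for every input in the domain, the whole network is a constant function before training starts. A minimal sketch (assuming PyTorch, a dense sample of the domain as a stand-in for all of D, and the library's default initialization; none of these are settings taken from the talk):

```python
import torch
import torch.nn as nn

def make_relu_net(din=1, dout=1, width=2, depth=10):
    """Deep and narrow fully connected ReLU network."""
    layers, n_in = [], din
    for _ in range(depth - 1):
        layers += [nn.Linear(n_in, width), nn.ReLU()]
        n_in = width
    layers += [nn.Linear(n_in, dout)]
    return nn.Sequential(*layers)

@torch.no_grad()
def is_born_dead(net, x):
    """True if some hidden ReLU layer is zero for every sampled input,
    i.e. the network is a constant function on the sampled domain."""
    h = x
    for layer in net:
        h = layer(h)
        if isinstance(layer, nn.ReLU) and torch.all(h == 0):
            return True
    return False

x = torch.linspace(-1, 1, 1000).unsqueeze(1)   # dense sample of the domain D
print("born dead:", is_born_dead(make_relu_net(), x))
```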

SLIDE 13

Overview

1. Introduction
2. Examples
3. Theoretical analysis
4. Asymmetric initialization (Shin)

SLIDES 14-15

1D Examples

f(x) = |x|

$$|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x) = \begin{pmatrix} 1 & 1 \end{pmatrix}\, \mathrm{ReLU}\!\left( \begin{pmatrix} 1 \\ -1 \end{pmatrix} x \right)$$

This is an exact 2-layer representation of width 2. Train a 10-layer ReLU NN with width 2 (MSE loss, any optimizer):
◮ Collapse to the mean value (A): ∼93% of runs
◮ Partial collapse (B)

[Figure: trained NN vs. y = |x|; panel A: full collapse to the mean value, panel B: partial collapse.]
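A minimal sketch of this experiment (assuming PyTorch; the optimizer, learning rate, and iteration count are illustrative choices, not the exact settings behind the reported numbers): train a 10-layer, width-2 ReLU network on f(x) = |x| with an MSE loss and compare the fit with the mean of the targets.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 10-layer ReLU network of width 2 (9 hidden layers + linear output)
layers, n_in = [], 1
for _ in range(9):
    layers += [nn.Linear(n_in, 2), nn.ReLU()]
    n_in = 2
net = nn.Sequential(*layers, nn.Linear(2, 1))

x = torch.linspace(-1, 1, 200).unsqueeze(1)    # training data for f(x) = |x|
y = x.abs()

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(5000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

# In a collapsed run the prediction is (nearly) the constant y.mean() ~ 0.5;
# the talk reports ~93% of runs collapsing to the mean value.
print("prediction range:", net(x).min().item(), net(x).max().item())
print("mean of targets: ", y.mean().item())
```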

SLIDE 16

1D Examples

f(x) = x sin(5x)

[Figure: trained NN vs. y = x sin(5x); four representative runs A-D.]

f(x) = 1_{x>0} + 0.2 sin(5x)

[Figure: trained NN vs. y = 1_{x>0} + 0.2 sin(5x); four representative runs A-D.]

SLIDE 17

2D Examples

$$f(x) = \begin{pmatrix} |x_1 + x_2| \\ |x_1 - x_2| \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}\, \mathrm{ReLU}\!\left( \begin{pmatrix} 1 & 1 \\ -1 & -1 \\ 1 & -1 \\ -1 & 1 \end{pmatrix} x \right)$$

[Figure: first output y1 = |x1 + x2| vs. the trained NN, two representative runs A and B.]
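The exact two-layer representation above can be checked numerically. A small NumPy sketch (the matrices are written out explicitly here, using |t| = ReLU(t) + ReLU(-t) for each output):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0)

# inner layer: rows produce x1+x2, -(x1+x2), x1-x2, -(x1-x2)
W1 = np.array([[ 1,  1],
               [-1, -1],
               [ 1, -1],
               [-1,  1]], dtype=float)
# outer layer: |t| = ReLU(t) + ReLU(-t), applied to each pair of rows
W2 = np.array([[1, 1, 0, 0],
               [0, 0, 1, 1]], dtype=float)

x = np.random.default_rng(0).standard_normal((1000, 2))
f_net = relu(x @ W1.T) @ W2.T                      # width-4, 2-layer ReLU net
f_true = np.stack([np.abs(x[:, 0] + x[:, 1]),
                   np.abs(x[:, 0] - x[:, 1])], axis=1)
print(np.allclose(f_net, f_true))                  # True: exact representation
```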

SLIDE 18

Loss

Mean squared error (MSE) ⇒ mean
Mean absolute error (MAE) ⇒ median

[Figure: collapsed NN fits for y = |x| (A), y = x sin(5x) (B), and y = 1_{x>0} + 0.2 sin(5x) (C), trained with MSE vs. MAE loss.]
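A quick check of this claim (a NumPy sketch with f(x) = |x| as the target; the grid search over constants is only for illustration): among constant predictors, the squared error is minimized by the sample mean and the absolute error by the sample median, which is where a collapsed network ends up.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.abs(rng.uniform(-1, 1, 2000))      # target values f(x) = |x|

cs = np.linspace(y.min(), y.max(), 1001)  # candidate constant predictors
mse = ((y[None, :] - cs[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - cs[:, None]).mean(axis=1)

print("MSE minimizer:", cs[mse.argmin()], " mean:  ", y.mean())
print("MAE minimizer:", cs[mae.argmin()], " median:", np.median(y))
```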

SLIDE 19

Overview

1. Introduction
2. Examples
3. Theoretical analysis
4. Asymmetric initialization (Shin)

SLIDE 20

Setup

Feed-forward ReLU neural network N^L : ℝ^{d_in} → ℝ^{d_out} with L layers. In layer ℓ:

◮ N_ℓ neurons (N_0 = d_in, N_L = d_out)
◮ Weight W^ℓ: an N_ℓ × N_{ℓ−1} matrix
◮ Bias b^ℓ ∈ ℝ^{N_ℓ}

Input: x ∈ ℝ^{d_in}. Neural activity in layer ℓ: N^ℓ(x) ∈ ℝ^{N_ℓ}, with
$$\mathcal{N}^1(x) = W^1 x + b^1, \qquad \mathcal{N}^\ell(x) = W^\ell\, \varphi\big(\mathcal{N}^{\ell-1}(x)\big) + b^\ell \quad \text{for } 2 \le \ell \le L,$$
where φ is the ReLU applied componentwise.
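The recursion above translates line for line into code. A minimal NumPy sketch (the standard-normal weights and zero biases are one illustrative choice of symmetric initialization, not a prescribed scheme):

```python
import numpy as np

def init_params(layer_sizes, rng):
    """Weights from a symmetric distribution (standard normal), biases zero."""
    Ws = [rng.standard_normal((n_out, n_in))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    bs = [np.zeros(n_out) for n_out in layer_sizes[1:]]
    return Ws, bs

def forward(x, Ws, bs):
    """N^1(x) = W^1 x + b^1;  N^l(x) = W^l ReLU(N^{l-1}(x)) + b^l for l = 2..L."""
    N = Ws[0] @ x + bs[0]
    for W, b in zip(Ws[1:], bs[1:]):
        N = W @ np.maximum(N, 0) + b   # phi = componentwise ReLU
    return N

rng = np.random.default_rng(0)
Ws, bs = init_params([2, 5, 5, 5, 1], rng)   # d_in = 2, three hidden layers, d_out = 1
print(forward(np.array([0.3, -0.7]), Ws, bs))
```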

SLIDE 21

Setup

Training data: T = {(x_i, f(x_i))}_{1≤i≤M} ⊂ D ≡ B_r(0) = {x ∈ ℝ^{d_in} : ‖x‖₂ ≤ r}

Loss function:
$$\mathcal{L}(\theta, \mathcal{T}) = \sum_{i=1}^{M} \ell\big(\mathcal{N}^L(x_i; \theta), f(x_i)\big), \qquad \theta = \{W^\ell, b^\ell\}_{1 \le \ell \le L}$$

SLIDE 22

N^L will eventually die in probability as L → ∞

Theorem 1

Let N^L(x) be a ReLU NN with L layers having N_1, …, N_L neurons. Suppose

1. the weights are independently initialized from a symmetric distribution around 0, and
2. the biases are either drawn from a symmetric distribution or set to zero.

Then
$$P\big(\mathcal{N}^L(x) \text{ dies}\big) \le 1 - \prod_{\ell=1}^{L-1}\Big(1 - (1/2)^{N_\ell}\Big).$$
Furthermore, assuming N_ℓ = N for all ℓ,
$$\lim_{L \to \infty} P\big(\mathcal{N}^L(x) \text{ dies}\big) = 1, \qquad \lim_{N \to \infty} P\big(\mathcal{N}^L(x) \text{ dies}\big) = 0.$$

SLIDES 23-24

Proof

Lemma 1

Let N^L(x) be a ReLU NN of L layers. Suppose the weights are drawn independently from distributions satisfying P(W^ℓ_j z = 0) = 0 for any nonzero z ∈ ℝ^{N_{ℓ−1}} and any j-th row W^ℓ_j of W^ℓ. Then
$$P\big(\mathcal{N}^L(x) \text{ dies}\big) = P\Big(\exists\, \ell \in \{1, \dots, L-1\} \ \text{s.t.}\ \varphi\big(\mathcal{N}^\ell(x)\big) = 0 \ \ \forall x \in D\Big).$$

For a given x,
$$P\Big(W^j_s\, \varphi\big(\mathcal{N}^{j-1}(x)\big) + b^j_s < 0 \,\Big|\, \tilde{A}^c_{j-1,x}\Big) = \frac{1}{2}, \qquad \tilde{A}^c_{\ell,x} = \big\{\varphi\big(\mathcal{N}^j(x)\big) \ne 0 \ \ \forall\, 1 \le j < \ell\big\}.$$

SLIDES 25-27

Dead Networks would Collapse

Theorem 2

Suppose the ReLU NN dies. Then for any loss L, the network is optimized to a constant function by any gradient-based method.

Proof.
◮ Lemma 1 ⇒ ∃ ℓ ∈ {1, …, L−1} s.t. φ(N^ℓ(x)) = 0 for all x ∈ D.
◮ The gradients of L with respect to the weights and biases in layers 1, …, ℓ vanish.
◮ The layers after ℓ receive the constant input 0, so the network remains a constant function of x throughout training.
◮ Assuming the training data are i.i.d. from P_D, the optimized network is
$$\mathcal{N}^L(x; \theta^*) = \operatorname*{argmin}_{c \in \mathbb{R}^{N_L}} \ \mathbb{E}_{x \sim P_D}\big[\ell(c, f(x))\big].$$
◮ MSE/L² ⇒ E[f(x)]; MAE/L¹ ⇒ median of f(x).

SLIDE 28

Probability of Dying when d_in = 1

Theorem 3

Let N^L(x) be a bias-free ReLU NN with L ≥ 2 layers, each having N neurons, and d_in = 1. Suppose the weights are independently initialized from continuous symmetric distributions around 0. Then
$$1 - \prod_{\ell=1}^{L-1}\Big(1 - (1/2)^{N}\Big) \;\ge\; P\big(\mathcal{N}^L(x) \text{ dies}\big) \;\ge\; 1 - (P_{22})^{L-2} - \frac{(1 - 2^{-N+1})(1 - 2^{-N})}{1 + (N-1)2^{-N}}\Big((P_{22})^{L-2} - (P_{33})^{L-2}\Big),$$
where $P_{22} = 1 - \frac{1}{2^N}$ and $P_{33} = 1 - \frac{1}{2^{N-1}} - \frac{N-1}{4^N}$.

SLIDE 29

Numerical Test

A ReLU NN with d_in = 1:
◮ Weights randomly initialized from symmetric distributions
◮ Biases initialized to 0
◮ More likely to die when the network is deeper and narrower

[Figure: probability of dying vs. number of hidden layers (up to 20) for widths 2, 3, 4, 5, and 10.]
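This test can be reproduced with a short Monte Carlo simulation; a sketch (standard-normal weights, a uniform grid as a stand-in for the domain, and the trial count are illustrative choices, not the exact settings of the plotted curves):

```python
import numpy as np

def born_dead(widths, x, rng):
    """One bias-free ReLU net with the given hidden widths and d_in = 1.
    Returns True if some hidden layer is zero on every sample in x."""
    h, n_in = x.reshape(1, -1), 1
    for n_out in widths:
        W = rng.standard_normal((n_out, n_in))  # symmetric distribution around 0
        h = np.maximum(W @ h, 0)
        if not h.any():                          # a dead layer kills the network
            return True
        n_in = n_out
    return False

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 101)
for width in (2, 3, 4, 5, 10):
    for depth in (5, 10, 20):
        p = np.mean([born_dead([width] * depth, x, rng) for _ in range(2000)])
        print(f"width {width:2d}, {depth:2d} hidden layers: P(die) ~ {p:.2f}")
```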

SLIDE 30

Safe Operating Region for a ReLU NN

Keep the dying probability < 10% or 1%

[Figure: safe region: maximum number of hidden layers vs. width (2 to 16) keeping the dying probability below approximately 1% and 10%; approximation and simulation curves.]
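One way to turn Theorem 1 into such a depth limit is to require the single-point bound 1 − (1 − 2^{−N})^{L−1} to stay below the target probability (an assumption about what the "Approx." curves plot; the slide does not spell out the formula). Because that expression bounds the dying probability from above, the resulting depth limit is conservative:

```python
import numpy as np

def max_safe_hidden_layers(width, p_max):
    """Largest number of hidden layers n such that the estimate
    1 - (1 - 2**-width)**n stays below p_max."""
    # (1 - 2**-N)**n >= 1 - p_max  =>  n <= log(1 - p_max) / log(1 - 2**-N)
    return int(np.floor(np.log(1 - p_max) / np.log(1 - 2.0 ** -width)))

for width in range(2, 17):
    print(f"width {width:2d}: <= {max_safe_hidden_layers(width, 0.10):4d} hidden layers (10%), "
          f"<= {max_safe_hidden_layers(width, 0.01):4d} (1%)")
```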

SLIDE 31

Overview

1. Introduction
2. Examples
3. Theoretical analysis
4. Asymmetric initialization (Shin)

SLIDE 32

References

Fukumizu, K., & Amari, S. (2000). Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3), 317-327.
Sima, J. (2002). Training a single sigmoidal neuron is hard. Neural Computation, 14(11), 2709-2728.
Kawaguchi, K. (2016). Deep learning without poor local minima. NIPS, pp. 586-594.
Mhaskar, H. N., & Poggio, T. (2016). Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06), 829-848.
Hanin, B., & Sellke, M. (2017). Approximating continuous functions by ReLU nets of minimal width. arXiv:1710.11278.
Lu, L., Su, Y., & Karniadakis, G. E. (2018). Collapse of deep and narrow neural nets. arXiv:1808.04947.
