SLIDE 1

Efficiently Training Sum-Product Neural Networks using Forward Greedy Selection

Shai Shalev-Shwartz

School of CS and Engineering, The Hebrew University of Jerusalem

Greedy Algorithms, Frank-Wolfe and Friends — A modern perspective, Lake Tahoe, December 2013 Based on joint work with Ohad Shamir

Shai Shalev-Shwartz (Hebrew U) Greedy for Neural Networks Dec’13 1 / 25

SLIDE 2

Neural Networks

A single neuron with activation function σ : R → R (figure: inputs x1, . . . , x5 with weights v1, . . . , v5, output σ(⟨v, x⟩)). Usually, σ is taken to be a sigmoidal function
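As a concrete illustration (mine, not from the slides; the weight and input values are arbitrary), a single sigmoidal neuron computing σ(⟨v, x⟩):

```python
import math

# A single neuron: output = sigma(<v, x>) with the sigmoidal
# activation sigma(a) = 1 / (1 + e^{-a}).
def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def neuron(v, x):
    # inner product of weights and inputs, passed through sigma
    return sigmoid(sum(vi * xi for vi, xi in zip(v, x)))

v = [0.5, -1.0, 2.0, 0.0, 1.5]   # illustrative weights
x = [1.0, 1.0, 0.5, -2.0, 0.0]   # illustrative input
print(neuron(v, x))  # sigma(0.5) ≈ 0.6225
```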

SLIDE 3

Neural Networks

A multilayer neural network of depth 3 and size 6 (figure: input layer x1, . . . , x5; two hidden layers; output layer)

SLIDE 6

Why Are Deep Neural Networks Great?

Because “A” used it to do “B”

Classic explanation: Neural Networks are universal approximators — every Lipschitz function f : [−1, 1]^d → [−1, 1] can be approximated by a neural network

Not convincing, because:
- It can be shown that the size of the network must be exponential in d, so why should we care about such large networks?
- Many other universal approximators exist (nearest neighbor, boosting with decision stumps, SVM with RBF kernels), so why should we prefer neural networks?

SLIDE 7

Why Are Deep Neural Networks Great? A Statistical Learning Perspective

Goal: Learn a function h : X → Y based on training examples S = ((x1, y1), . . . , (xm, ym)) ∈ (X × Y)^m

No-Free-Lunch Theorem: For any algorithm A and any sample size m, there exists a distribution D over X × Y and a function h∗ such that h∗ is perfect w.r.t. D, but with high probability over S ∼ D^m the output of A is very bad

Prior knowledge: We must bias the learner toward “reasonable” functions — a hypothesis class H ⊂ Y^X

What should H be?

SLIDE 9

Why Are Deep Neural Networks Great? A Statistical Learning Perspective

Consider all functions over {0, 1}^d that can be executed in time at most T(d)

Theorem: The class H_NN of neural networks of depth O(T(d)) and size O(T(d)^2) contains all functions that can be executed in time at most T(d)

A great hypothesis class:
- With sufficiently large network depth and size, we can express all functions we would ever want to learn
- Sample complexity behaves nicely and is well understood (see Anthony & Bartlett 1999)

End of story? The computational barrier: But, how do we train neural networks?

SLIDE 10

Neural Networks — The computational barrier

It is NP-hard to implement ERM for a depth-2 network with k ≥ 3 hidden neurons whose activation function is sigmoidal or sign (Blum and Rivest 1992, Bartlett and Ben-David 2002)

Current approaches: back-propagation, possibly with unsupervised pre-training and other bells and whistles — no theoretical guarantees, and often requires manual tweaking

SLIDE 11

Outline

How to circumvent hardness?

1

Over-specification
- Extreme over-specification eliminates local (non-global) minima
- Hardness of improperly learning a two-layer network with k = ω(1) hidden neurons

2

Change the activation function (sum-product networks)
- Efficiently learning sum-product networks of depth 2 using Forward Greedy Selection
- Hardness of learning deep sum-product networks

SLIDE 13

Circumventing Hardness using Over-specification

Yann LeCun:

- Fix a network architecture and generate data according to it
- Backpropagation fails to recover the parameters
- However, if we enlarge the network size, backpropagation works just fine

Maybe we can efficiently learn neural networks using over-specification?

SLIDE 15

Extremely over-specified Networks have no local (non-global) minima

Let X ∈ R^{d×m} be a data matrix of m examples. Consider a network with:
- N internal neurons
- v — the weights of all but the last layer
- F(v; X) — the evaluations of the internal neurons over the data matrix X
- w — the weights connecting the internal neurons to the output neuron
The output of the network is w⊤F(v; X)

Theorem: If N ≥ m, and under mild conditions on F, the optimization problem min_{w,v} ‖w⊤F(v; X) − y‖² has no local (non-global) minima

Proof idea: W.h.p. over a perturbation of v, F(v; X) has full rank. For such v, if we are not at a global minimum, then by changing w alone we can decrease the objective
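The proof idea is easy to check numerically. A small sketch (mine, with random data standing in for the neuron evaluations): when N ≥ m, the N × m matrix F is full rank w.h.p., so least squares in the outer weights w alone drives the objective to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 20, 30                       # m examples, N >= m internal neurons
F = rng.standard_normal((N, m))     # stand-in for F(v; X), full rank w.h.p.
y = rng.standard_normal(m)          # arbitrary targets

# Solve min_w ||w^T F - y||^2: a linear least-squares problem in w
w, *_ = np.linalg.lstsq(F.T, y, rcond=None)
residual = np.linalg.norm(F.T @ w - y)
print(residual)  # ~0: the global minimum is reached by adjusting w only
```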

SLIDE 17

Is over-specification enough?

But, such large networks will lead to overfitting
Maybe there is a clever trick that circumvents overfitting (regularization, dropout, ...)?

Theorem (Daniely, Linial, S.): Even if the data is perfectly generated by a neural network of depth 2 with only k = ω(1) neurons in the hidden layer, there is no efficient algorithm that can achieve small test error

Corollary: over-specification alone is not enough for efficient learnability

SLIDE 21

Proof Idea: Hardness of Improper Learning

Improper learning: the learner tries to learn some hypothesis h∗ ∈ H but is not restricted to output a hypothesis from H

How to show hardness? Technical novelty: a new method for deriving lower bounds for improper learning, which relies on average-case complexity assumptions

The technique yields new hardness results for improper learning of:
- DNFs (open problem since Kearns & Valiant 1989)
- Intersections of ω(1) halfspaces (Klivans & Sherstov 2006 showed hardness for d^c halfspaces)
- Constant approximation ratio for agnostically learning halfspaces (previously, only hardness of exact learning was known)

Can also be used to establish Computational-Statistical Tradeoffs (Daniely, Linial, S., NIPS’13)

SLIDE 22

Outline

How to circumvent hardness?

1

Over-specification
- Extreme over-specification eliminates local (non-global) minima
- Hardness of improperly learning a two-layer network with k = ω(1) hidden neurons

2

Change the activation function (sum-product networks)
- Efficiently learning sum-product networks of depth 2 using Forward Greedy Selection
- Hardness of learning deep sum-product networks

SLIDE 23

Circumventing hardness — sum-product networks

Simpler non-linearity: replace the sigmoidal activation function by the square function σ(a) = a²
The network implements polynomials, where the depth corresponds to the degree
The size of the network (number of neurons) determines generalization properties and evaluation time
Can we efficiently learn the class of polynomial networks of small size?

SLIDE 24

Depth 2 polynomial network

(Figure: input layer x1, . . . , x5; hidden layer computing ⟨v1, x⟩², ⟨v2, x⟩², ⟨v3, x⟩²; output layer computing Σi wi ⟨vi, x⟩²)
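The depth-2 polynomial network in the figure can be sketched in a few lines (my illustration; the shapes, names, and random values are assumptions, not from the talk):

```python
import numpy as np

# Depth-2 polynomial ("sum-product") network:
#   output(x) = sum_i w_i * <v_i, x>^2
def poly_net(x, V, w):
    # V: r x d matrix whose rows are the unit-norm hidden directions v_i
    # w: r weights connecting hidden neurons to the output
    return w @ (V @ x) ** 2

rng = np.random.default_rng(0)
d, r = 5, 3
V = rng.standard_normal((r, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # enforce ||v_i|| = 1
w = rng.standard_normal(r)
x = rng.standard_normal(d)
print(poly_net(x, V, w))
```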

SLIDE 27

Depth 2 polynomial networks

Corresponding hypothesis class:

H = { x ↦ Σ_{i=1}^{r} wi ⟨vi, x⟩² : ‖w‖ = O(1), ∀i ‖vi‖ = 1 }

ERM is still NP-hard
But here, a slight over-specification works!
Using d² hidden neurons suffices (trivial)
Can we do better?

SLIDE 28

Forward Greedy Selection

Goal: minimize R(w) s.t. ‖w‖₀ ≤ r
Define: for I ⊆ [d], w_I = argmin_{w : supp(w) ⊆ I} R(w)

Forward Greedy Selection (FGS)

Start with I = ∅
For t = 1, 2, . . .
- Pick j s.t. |∇j R(w_I)| ≥ (1 − τ) maxi |∇i R(w_I)|
- Let J = I ∪ {j}
- Replacement step: let I be any set s.t. R(w_I) ≤ R(w_J) and |I| ≤ |J|
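A minimal sketch of FGS (mine, not the talk's code), specialized to a least-squares objective R(w) = ‖Aw − b‖², with τ = 0 and the trivial replacement step I = J (which is allowed, since R(w_J) ≤ R(w_J) and |J| ≤ |J|):

```python
import numpy as np

def fgs(A, b, r):
    """Forward greedy selection for R(w) = ||A w - b||^2 with tau = 0
    and the trivial replacement step I = J."""
    d = A.shape[1]
    I = []
    w = np.zeros(d)
    for _ in range(r):
        grad = 2 * A.T @ (A @ w - b)       # gradient of R at w_I
        j = int(np.argmax(np.abs(grad)))   # greedy coordinate choice
        if j not in I:
            I.append(j)
        # w_I = argmin of R over vectors supported on I
        w = np.zeros(d)
        w[I], *_ = np.linalg.lstsq(A[:, I], b, rcond=None)
    return w, I

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
w_true = np.zeros(20)
w_true[[2, 7, 11]] = [1.0, -2.0, 0.5]
b = A @ w_true                       # noiseless 3-sparse target
w, I = fgs(A, b, 3)
print(sorted(I), np.linalg.norm(A @ w - b))
```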

SLIDE 30

Analysis of Forward Greedy Selection

Theorem

Assume that R is β-smooth w.r.t. some norm for which ‖ej‖ = 1 for all j. Then, for every ε and every w̄, if FGS is run for

k ≥ 2β‖w̄‖₁² / ((1 − τ)² ε)

iterations, it outputs w with R(w) ≤ R(w̄) + ε and ‖w‖₀ ≤ k.

Remarks

- The dimension of w has no effect on the theorem!
- w̄ is any vector (not necessarily the “optimum”)
- If ‖w̄‖₂ = O(1) then ‖w̄‖₁² ≤ ‖w̄‖₀ ‖w̄‖₂² = O(‖w̄‖₀)
- The bound depends nicely on τ

SLIDE 31

Forward Greedy Selection for Sum-Product Networks

Let S be the Euclidean sphere of R^d
Think of w as a mapping from S to R
Each hypothesis in H corresponds to some w̄ with ‖w̄‖₀ = r and ‖w̄‖₂ = O(1)
R(w̄) is the training loss of the hypothesis in H corresponding to w̄
Applying FGS for k = Ω(r/ε) iterations would yield a network with O(r/ε) hidden neurons and loss at most R(w̄) + ε

Main caveat: at each iteration we need to find v s.t. |∇v R(w)| ≥ (1 − τ) max_{u∈S} |∇u R(w)|

Luckily, this is an eigenvalue problem:

∇v R(w) = v⊤ ( E_{(x,y)} [ ℓ′( Σ_{u∈supp(w)} wu ⟨u, x⟩², y ) x x⊤ ] ) v
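Since the greedy step reduces to finding the direction maximizing a quadratic form, it can be done with power iteration in time linear in the data. A generic sketch (mine, on an explicit symmetric PSD matrix standing in for the data-dependent M):

```python
import numpy as np

def leading_eigvec(M, iters=200, seed=0):
    """Approximate leading eigenvector of a symmetric PSD matrix by
    power iteration (for an indefinite M, iterate on M @ M instead)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
M = A @ A.T                          # random symmetric PSD matrix
v = leading_eigvec(M)
rayleigh = v @ M @ v                 # approximates the top eigenvalue
print(rayleigh, np.linalg.eigvalsh(M)[-1])
```

Each power-iteration step only needs matrix-vector products, and for M = E[ℓ′(·) x x⊤] the product M v can be computed as an average of ℓ′(·) ⟨x, v⟩ x over the sample, so the cost per step is linear in the data size.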

SLIDE 33

The resulting algorithm

Greedy Efficient Component Optimization (GECO):

Initialize V = [ ], w = [ ]
For t = 1, 2, . . . , T
- Let M = E_{(x,y)} [ ℓ′( Σi wi ⟨vi, x⟩², y ) x x⊤ ]
- V = [V v], where v is an approximate leading eigenvector of M
- Let B = argmin_{B ∈ R^{t×t}} E_{(x,y)} [ ℓ( (x⊤V) B (V⊤x), y ) ]
- Update w = eigenvalues(B) and V = V · eigenvectors(B)

Remarks:
- Finding an approximate leading eigenvector takes linear time
- The overall complexity depends linearly on the size of the data
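The loop above can be sketched for the squared loss ℓ(p, y) = (p − y)², where ℓ′(p, y) = 2(p − y) and the B-step is itself a linear least-squares problem. This is my toy implementation over a finite sample, not the paper's code; all names and the sanity-check data are assumptions:

```python
import numpy as np

def geco(X, y, T, power_iters=50, seed=0):
    """Toy GECO for the squared loss l(p, y) = (p - y)^2 on a finite
    sample (X: m x d, y: m)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    V = np.zeros((d, 0))
    w = np.zeros(0)
    for t in range(1, T + 1):
        # M = (1/m) sum_i l'(pred_i, y_i) x_i x_i^T, with l' = 2(pred - y)
        pred = (X @ V) ** 2 @ w
        lp = 2 * (pred - y)
        M = (X * lp[:, None]).T @ X / m
        # approximate leading eigenvector: power iteration on M @ M,
        # since M may be indefinite
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        for _ in range(power_iters):
            v = M @ (M @ v)
            v /= np.linalg.norm(v)
        V = np.column_stack([V, v])
        # B-step: least squares over all t x t matrices B of
        # sum_i ((x_i^T V) B (V^T x_i) - y_i)^2
        Z = X @ V                                  # m x t
        feats = np.einsum('mi,mj->mij', Z, Z).reshape(m, -1)
        b, *_ = np.linalg.lstsq(feats, y, rcond=None)
        B = b.reshape(t, t)
        B = (B + B.T) / 2                          # symmetrize
        # re-express V B V^T in its eigenbasis
        w, Q = np.linalg.eigh(B)                   # w = eigenvalues(B)
        V = V @ Q                                  # V = V * eigenvectors(B)
    return V, w

# sanity check: fit a rank-1 target y = <v*, x>^2
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
v_star = np.array([1.0, 0, 0, 0, 0])
y = (X @ v_star) ** 2
V, w = geco(X, y, T=2)
print(np.mean(((X @ V) ** 2 @ w - y) ** 2))  # training loss of the greedy fit
```

Note that after the eigendecomposition the predictor Σk wk ⟨Vk, x⟩² equals (x⊤V)B(V⊤x) exactly, so the re-parameterization keeps the B-step's loss unchanged.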

SLIDE 34

Deeper sum-product networks?

Learning depth-2 sigmoidal networks is hard even if we allow over-specification
Learning depth-2 sum-product networks is tractable if we allow a slight over-specification

What about higher degrees?

Theorem (Livni, Shamir, S.): It is hard to learn polynomial networks of depth poly(d) even if their size is poly(d).

Proof idea: It is possible to approximate the sigmoid function with a polynomial of degree poly(d)

SLIDE 35

Deeper sum-product networks?

Distributional assumptions: Is it easier to learn under certain distributional assumptions?
Vanishing Component Analysis (Livni et al. 2013):
- Efficiently finds the generating ideal of the data
- Can be used to efficiently construct deep features
- Theoretical guarantees still not satisfactory

SLIDE 36

Summary

Sigmoidal deep networks are great statistically but cannot be trained efficiently
Sum-product networks of depth 2 can be trained efficiently using forward greedy selection
Very deep sum-product networks cannot be trained efficiently

Open problems:
- Is it possible to train sum-product networks of depth 3? What about depth log(d)?
- Find a combination of network architecture and distributional assumptions that is useful in practice and leads to efficient algorithms

SLIDE 37

Thanks! Collaborators

Search for efficient algorithms for deep learning: Ohad Shamir
GECO: Alon Gonen and Ohad Shamir; based on a previous paper with Tong Zhang and Nati Srebro
Lower bounds: Amit Daniely and Nati Linial
VCA: Livni, Lehavi, Nachlieli, Schein, Globerson

SLIDE 38

Shameless Advertisement
