SLIDE 1

Efficiently Training Sum-Product Neural Networks using Forward Greedy Selection

Shai Shalev-Shwartz

School of CS and Engineering, The Hebrew University of Jerusalem

Greedy Algorithms, Frank-Wolfe and Friends — A modern perspective, Lake Tahoe, December 2013 Based on joint work with Ohad Shamir

Shai Shalev-Shwartz (Hebrew U) Greedy for Neural Networks Dec’13 1 / 25

SLIDE 2

Neural Networks

A single neuron with activation function σ : R → R (figure: inputs x1, . . . , x5 with weights v1, . . . , v5, output σ(⟨v, x⟩)). Usually, σ is taken to be a sigmoidal function
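As a concrete illustration (mine, not from the slides; the weight and input values are arbitrary), a single sigmoidal neuron computing σ(⟨v, x⟩):

```python
import math

# A single neuron: output = sigma(<v, x>) with the sigmoidal
# activation sigma(a) = 1 / (1 + e^{-a}).
def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def neuron(v, x):
    # inner product of weights and inputs, passed through sigma
    return sigmoid(sum(vi * xi for vi, xi in zip(v, x)))

v = [0.5, -1.0, 2.0, 0.0, 1.5]   # illustrative weights
x = [1.0, 1.0, 0.5, -2.0, 0.0]   # illustrative input
print(neuron(v, x))  # sigma(0.5) ≈ 0.6225
```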

SLIDE 3

Neural Networks

A multilayer neural network of depth 3 and size 6 (figure: input layer x1, . . . , x5; two hidden layers; output layer)

SLIDE 6

Why Are Deep Neural Networks Great?

Because “A” used it to do “B”

Classic explanation: Neural Networks are universal approximators — every Lipschitz function f : [−1, 1]^d → [−1, 1] can be approximated by a neural network

Not convincing, because:
- It can be shown that the size of the network must be exponential in d, so why should we care about such large networks?
- Many other universal approximators exist (nearest neighbor, boosting with decision stumps, SVM with RBF kernels), so why should we prefer neural networks?

SLIDE 7

Why Are Deep Neural Networks Great? A Statistical Learning Perspective

Goal: Learn a function h : X → Y based on training examples S = ((x1, y1), . . . , (xm, ym)) ∈ (X × Y)^m

No-Free-Lunch Theorem: For any algorithm A and any sample size m, there exists a distribution D over X × Y and a function h∗ such that h∗ is perfect w.r.t. D, but with high probability over S ∼ D^m the output of A is very bad

Prior knowledge: We must bias the learner toward “reasonable” functions — a hypothesis class H ⊂ Y^X

What should H be?

SLIDE 9

Why Are Deep Neural Networks Great? A Statistical Learning Perspective

Consider all functions over {0, 1}^d that can be executed in time at most T(d)

Theorem: The class H_NN of neural networks of depth O(T(d)) and size O(T(d)^2) contains all functions that can be executed in time at most T(d)

A great hypothesis class:
- With sufficiently large network depth and size, we can express all functions we would ever want to learn
- Sample complexity behaves nicely and is well understood (see Anthony & Bartlett 1999)

End of story? The computational barrier: But, how do we train neural networks?

SLIDE 10

Neural Networks — The computational barrier

It is NP-hard to implement ERM for a depth-2 network with k ≥ 3 hidden neurons whose activation function is sigmoidal or sign (Blum and Rivest 1992, Bartlett and Ben-David 2002)

Current approaches: back-propagation, possibly with unsupervised pre-training and other bells and whistles — no theoretical guarantees, and often requires manual tweaking

SLIDE 11

Outline

How to circumvent hardness?

1

Over-specification
- Extreme over-specification eliminates local (non-global) minima
- Hardness of improperly learning a two-layer network with k = ω(1) hidden neurons

2

Change the activation function (sum-product networks)
- Efficiently learning sum-product networks of depth 2 using Forward Greedy Selection
- Hardness of learning deep sum-product networks

SLIDE 13

Circumventing Hardness using Over-specification

Yann LeCun:

- Fix a network architecture and generate data according to it
- Backpropagation fails to recover the parameters
- However, if we enlarge the network size, backpropagation works just fine

Maybe we can efficiently learn neural networks using over-specification?

SLIDE 15

Extremely over-specified Networks have no local (non-global) minima

Let X ∈ R^{d×m} be a data matrix of m examples. Consider a network with:
- N internal neurons
- v — the weights of all but the last layer
- F(v; X) — the evaluations of the internal neurons over the data matrix X
- w — the weights connecting the internal neurons to the output neuron
The output of the network is w⊤F(v; X)

Theorem: If N ≥ m, and under mild conditions on F, the optimization problem min_{w,v} ‖w⊤F(v; X) − y‖² has no local (non-global) minima

Proof idea: W.h.p. over a perturbation of v, F(v; X) has full rank. For such v, if we are not at a global minimum, then by changing w alone we can decrease the objective
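The proof idea is easy to check numerically. A small sketch (mine, with random data standing in for the neuron evaluations): when N ≥ m, the N × m matrix F is full rank w.h.p., so least squares in the outer weights w alone drives the objective to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 20, 30                       # m examples, N >= m internal neurons
F = rng.standard_normal((N, m))     # stand-in for F(v; X), full rank w.h.p.
y = rng.standard_normal(m)          # arbitrary targets

# Solve min_w ||w^T F - y||^2: a linear least-squares problem in w
w, *_ = np.linalg.lstsq(F.T, y, rcond=None)
residual = np.linalg.norm(F.T @ w - y)
print(residual)  # ~0: the global minimum is reached by adjusting w only
```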

SLIDE 17

Is over-specification enough?

But, such large networks will lead to overfitting
Maybe there is a clever trick that circumvents overfitting (regularization, dropout, ...)?

Theorem (Daniely, Linial, S.): Even if the data is perfectly generated by a neural network of depth 2 with only k = ω(1) neurons in the hidden layer, there is no efficient algorithm that can achieve small test error

Corollary: over-specification alone is not enough for efficient learnability

SLIDE 21

Proof Idea: Hardness of Improper Learning

Improper learning: the learner tries to learn some hypothesis h∗ ∈ H but is not restricted to output a hypothesis from H

How to show hardness? Technical novelty: a new method for deriving lower bounds for improper learning, which relies on average-case complexity assumptions

The technique yields new hardness results for improper learning of:
- DNFs (open problem since Kearns & Valiant 1989)
- Intersections of ω(1) halfspaces (Klivans & Sherstov 2006 showed hardness for d^c halfspaces)
- Constant approximation ratio for agnostically learning halfspaces (previously, only hardness of exact learning was known)

Can also be used to establish Computational-Statistical Tradeoffs (Daniely, Linial, S., NIPS’13)

SLIDE 22

Outline

How to circumvent hardness?

1

Over-specification
- Extreme over-specification eliminates local (non-global) minima
- Hardness of improperly learning a two-layer network with k = ω(1) hidden neurons

2

Change the activation function (sum-product networks)
- Efficiently learning sum-product networks of depth 2 using Forward Greedy Selection
- Hardness of learning deep sum-product networks

SLIDE 23

Circumventing hardness — sum-product networks

Simpler non-linearity: replace the sigmoidal activation function by the square function σ(a) = a²
The network implements polynomials, where the depth corresponds to the degree
The size of the network (number of neurons) determines generalization properties and evaluation time
Can we efficiently learn the class of polynomial networks of small size?

SLIDE 24

Depth 2 polynomial network

(Figure: input layer x1, . . . , x5; hidden layer computing ⟨v1, x⟩², ⟨v2, x⟩², ⟨v3, x⟩²; output layer computing Σi wi ⟨vi, x⟩²)
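The depth-2 polynomial network in the figure can be sketched in a few lines (my illustration; the shapes, names, and random values are assumptions, not from the talk):

```python
import numpy as np

# Depth-2 polynomial ("sum-product") network:
#   output(x) = sum_i w_i * <v_i, x>^2
def poly_net(x, V, w):
    # V: r x d matrix whose rows are the unit-norm hidden directions v_i
    # w: r weights connecting hidden neurons to the output
    return w @ (V @ x) ** 2

rng = np.random.default_rng(0)
d, r = 5, 3
V = rng.standard_normal((r, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # enforce ||v_i|| = 1
w = rng.standard_normal(r)
x = rng.standard_normal(d)
print(poly_net(x, V, w))
```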

SLIDE 27

Depth 2 polynomial networks

Corresponding hypothesis class:

H = { x ↦ Σ_{i=1}^{r} wi ⟨vi, x⟩² : ‖w‖ = O(1), ∀i ‖vi‖ = 1 }

ERM is still NP-hard
But here, a slight over-specification works!
Using d² hidden neurons suffices (trivial)
Can we do better?

SLIDE 28

Forward Greedy Selection

Goal: minimize R(w) s.t. ‖w‖₀ ≤ r
Define: for I ⊆ [d], w_I = argmin_{w : supp(w) ⊆ I} R(w)

Forward Greedy Selection (FGS)

Start with I = ∅
For t = 1, 2, . . .
- Pick j s.t. |∇j R(w_I)| ≥ (1 − τ) maxi |∇i R(w_I)|
- Let J = I ∪ {j}
- Replacement step: let I be any set s.t. R(w_I) ≤ R(w_J) and |I| ≤ |J|
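A minimal sketch of FGS (mine, not the talk's code), specialized to a least-squares objective R(w) = ‖Aw − b‖², with τ = 0 and the trivial replacement step I = J (which is allowed, since R(w_J) ≤ R(w_J) and |J| ≤ |J|):

```python
import numpy as np

def fgs(A, b, r):
    """Forward greedy selection for R(w) = ||A w - b||^2 with tau = 0
    and the trivial replacement step I = J."""
    d = A.shape[1]
    I = []
    w = np.zeros(d)
    for _ in range(r):
        grad = 2 * A.T @ (A @ w - b)       # gradient of R at w_I
        j = int(np.argmax(np.abs(grad)))   # greedy coordinate choice
        if j not in I:
            I.append(j)
        # w_I = argmin of R over vectors supported on I
        w = np.zeros(d)
        w[I], *_ = np.linalg.lstsq(A[:, I], b, rcond=None)
    return w, I

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
w_true = np.zeros(20)
w_true[[2, 7, 11]] = [1.0, -2.0, 0.5]
b = A @ w_true                       # noiseless 3-sparse target
w, I = fgs(A, b, 3)
print(sorted(I), np.linalg.norm(A @ w - b))
```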

SLIDE 30

Analysis of Forward Greedy Selection

Theorem

Assume that R is β-smooth w.r.t. some norm for which ‖ej‖ = 1 for all j. Then, for every ε and every w̄, if FGS is run for

k ≥ 2β‖w̄‖₁² / ((1 − τ)² ε)

iterations, it outputs w with R(w) ≤ R(w̄) + ε and ‖w‖₀ ≤ k.

Remarks

- The dimension of w has no effect on the theorem!
- w̄ is any vector (not necessarily the “optimum”)
- If ‖w̄‖₂ = O(1) then ‖w̄‖₁² ≤ ‖w̄‖₀ ‖w̄‖₂² = O(‖w̄‖₀)
- The bound depends nicely on τ

SLIDE 31

Forward Greedy Selection for Sum-Product Networks

Let S be the Euclidean sphere of R^d
Think of w as a mapping from S to R
Each hypothesis in H corresponds to some w̄ with ‖w̄‖₀ = r and ‖w̄‖₂ = O(1)
R(w̄) is the training loss of the hypothesis in H corresponding to w̄
Applying FGS for k = Ω(r/ε) iterations would yield a network with O(r/ε) hidden neurons and loss at most R(w̄) + ε

Main caveat: at each iteration we need to find v s.t. |∇v R(w)| ≥ (1 − τ) max_{u∈S} |∇u R(w)|

Luckily, this is an eigenvalue problem:

∇v R(w) = v⊤ ( E_{(x,y)} [ ℓ′( Σ_{u∈supp(w)} wu ⟨u, x⟩², y ) x x⊤ ] ) v
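Since the greedy step reduces to finding the direction maximizing a quadratic form, it can be done with power iteration in time linear in the data. A generic sketch (mine, on an explicit symmetric PSD matrix standing in for the data-dependent M):

```python
import numpy as np

def leading_eigvec(M, iters=200, seed=0):
    """Approximate leading eigenvector of a symmetric PSD matrix by
    power iteration (for an indefinite M, iterate on M @ M instead)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
M = A @ A.T                          # random symmetric PSD matrix
v = leading_eigvec(M)
rayleigh = v @ M @ v                 # approximates the top eigenvalue
print(rayleigh, np.linalg.eigvalsh(M)[-1])
```

Each power-iteration step only needs matrix-vector products, and for M = E[ℓ′(·) x x⊤] the product M v can be computed as an average of ℓ′(·) ⟨x, v⟩ x over the sample, so the cost per step is linear in the data size.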

SLIDE 33

The resulting algorithm

Greedy Efficient Component Optimization (GECO):

Initialize V = [ ], w = [ ]
For t = 1, 2, . . . , T
- Let M = E_{(x,y)} [ ℓ′( Σi wi ⟨vi, x⟩², y ) x x⊤ ]
- V = [V v], where v is an approximate leading eigenvector of M
- Let B = argmin_{B ∈ R^{t×t}} E_{(x,y)} [ ℓ( (x⊤V) B (V⊤x), y ) ]
- Update w = eigenvalues(B) and V = V · eigenvectors(B)

Remarks:
- Finding an approximate leading eigenvector takes linear time
- The overall complexity depends linearly on the size of the data
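The loop above can be sketched for the squared loss ℓ(p, y) = (p − y)², where ℓ′(p, y) = 2(p − y) and the B-step is itself a linear least-squares problem. This is my toy implementation over a finite sample, not the paper's code; all names and the sanity-check data are assumptions:

```python
import numpy as np

def geco(X, y, T, power_iters=50, seed=0):
    """Toy GECO for the squared loss l(p, y) = (p - y)^2 on a finite
    sample (X: m x d, y: m)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    V = np.zeros((d, 0))
    w = np.zeros(0)
    for t in range(1, T + 1):
        # M = (1/m) sum_i l'(pred_i, y_i) x_i x_i^T, with l' = 2(pred - y)
        pred = (X @ V) ** 2 @ w
        lp = 2 * (pred - y)
        M = (X * lp[:, None]).T @ X / m
        # approximate leading eigenvector: power iteration on M @ M,
        # since M may be indefinite
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        for _ in range(power_iters):
            v = M @ (M @ v)
            v /= np.linalg.norm(v)
        V = np.column_stack([V, v])
        # B-step: least squares over all t x t matrices B of
        # sum_i ((x_i^T V) B (V^T x_i) - y_i)^2
        Z = X @ V                                  # m x t
        feats = np.einsum('mi,mj->mij', Z, Z).reshape(m, -1)
        b, *_ = np.linalg.lstsq(feats, y, rcond=None)
        B = b.reshape(t, t)
        B = (B + B.T) / 2                          # symmetrize
        # re-express V B V^T in its eigenbasis
        w, Q = np.linalg.eigh(B)                   # w = eigenvalues(B)
        V = V @ Q                                  # V = V * eigenvectors(B)
    return V, w

# sanity check: fit a rank-1 target y = <v*, x>^2
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
v_star = np.array([1.0, 0, 0, 0, 0])
y = (X @ v_star) ** 2
V, w = geco(X, y, T=2)
print(np.mean(((X @ V) ** 2 @ w - y) ** 2))  # training loss of the greedy fit
```

Note that after the eigendecomposition the predictor Σk wk ⟨Vk, x⟩² equals (x⊤V)B(V⊤x) exactly, so the re-parameterization keeps the B-step's loss unchanged.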

SLIDE 34

Deeper sum-product networks?

Learning depth-2 sigmoidal networks is hard even if we allow over-specification
Learning depth-2 sum-product networks is tractable if we allow a slight over-specification

What about higher degrees?

Theorem (Livni, Shamir, S.): It is hard to learn polynomial networks of depth poly(d) even if their size is poly(d).

Proof idea: It is possible to approximate the sigmoid function with a polynomial of degree poly(d)

SLIDE 35

Deeper sum-product networks?

Distributional assumptions: Is it easier to learn under certain distributional assumptions?
Vanishing Component Analysis (Livni et al. 2013):
- Efficiently finds the generating ideal of the data
- Can be used to efficiently construct deep features
- Theoretical guarantees still not satisfactory

SLIDE 36

Summary

Sigmoidal deep networks are great statistically but cannot be trained efficiently
Sum-product networks of depth 2 can be trained efficiently using forward greedy selection
Very deep sum-product networks cannot be trained efficiently

Open problems:
- Is it possible to train sum-product networks of depth 3? What about depth log(d)?
- Find a combination of network architecture and distributional assumptions that is useful in practice and leads to efficient algorithms

SLIDE 37

Thanks! Collaborators

Search for efficient algorithms for deep learning: Ohad Shamir
GECO: Alon Gonen and Ohad Shamir; based on a previous paper with Tong Zhang and Nati Srebro
Lower bounds: Amit Daniely and Nati Linial
VCA: Livni, Lehavi, Nachlieli, Schein, Globerson

SLIDE 38

Shameless Advertisement
