
Principled Deep Neural Network Training through Linear Programming - PowerPoint PPT Presentation



  1. Principled Deep Neural Network Training through Linear Programming
  Daniel Bienstock (IEOR, Columbia University), Gonzalo Muñoz (IVADO, Polytechnique Montréal), Sebastian Pokutta (ISyE, Georgia Tech)
  January 9, 2019

  2. “...I’m starting to look at machine learning problems”
  Oktay Günlük’s research interests, Aussois 2019

  3. Goal of this talk

  6. Goal of this talk
  • Deep Learning is receiving significant attention due to its impressive performance.
  • Unfortunately, only recent results regarding the complexity of training deep neural networks have been obtained.
  • Our goal: to show that large classes of Neural Networks can be trained to near optimality using linear programs whose size is linear in the data.

  9. Empirical Risk Minimization problem
  Given:
  • D data points (x̂_i, ŷ_i), i = 1, ..., D
  • x̂_i ∈ R^n, ŷ_i ∈ R^m
  • A loss function ℓ : R^m × R^m → R (not necessarily convex)
  Compute f : R^n → R^m (in some class F) to solve
  min_{f ∈ F} ∑_{i=1}^{D} ℓ(f(x̂_i), ŷ_i)   (+ optional regularizer Φ(f))

  10. Empirical Risk Minimization problem
  min_{f ∈ F} ∑_{i=1}^{D} ℓ(f(x̂_i), ŷ_i)   (+ optional regularizer Φ(f))
  Examples:
  • Linear Regression. f(x) = Ax + b with ℓ_2-loss.
  • Binary Classification. Varying f architectures and cross-entropy loss: ℓ(p, y) = −y log(p) − (1 − y) log(1 − p)
  • Neural Networks with k layers. f(x) = T_{k+1} ∘ σ ∘ T_k ∘ σ ∘ ... ∘ σ ∘ T_1(x), each T_j affine.
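To make the ERM objective concrete, here is a minimal numpy sketch (my own illustration, not from the slides) that evaluates the empirical risk for the linear-regression example, f(x) = Ax + b with squared ℓ_2-loss; the data, dimensions, and parameter values are made up.

```python
import numpy as np

def empirical_risk(f, loss, xs, ys):
    """Sum of loss(f(x_i), y_i) over the data set, as in the ERM objective."""
    return sum(loss(f(x), y) for x, y in zip(xs, ys))

# Toy data: D = 100 points, x_i in R^3, y_i in R^2 (sizes chosen for illustration).
rng = np.random.default_rng(0)
D, n, m = 100, 3, 2
xs = rng.uniform(-1, 1, size=(D, n))
ys = rng.uniform(-1, 1, size=(D, m))

# Example model class: linear regression f(x) = Ax + b with squared l2-loss.
A = rng.uniform(-1, 1, size=(m, n))
b = rng.uniform(-1, 1, size=m)
f = lambda x: A @ x + b
sq_loss = lambda p, y: float(np.sum((p - y) ** 2))

print(empirical_risk(f, sq_loss, xs, ys))
```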

  12. Function parameterization
  We assume the family F (statisticians’ hypothesis) is parameterized: there exists f such that
  F = { f(x, θ) : θ ∈ Θ ⊆ [−1, 1]^N }.
  Thus, THE problem becomes
  min_{θ ∈ Θ} ∑_{i=1}^{D} ℓ(f(x̂_i, θ), ŷ_i)
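A small sketch (again my own, not from the slides) of what the parameterized form changes: the decision variable is now θ ∈ Θ ⊆ [−1, 1]^N and the ERM objective is a function of θ alone. The family f(x, θ) = θ·x used here is deliberately trivial, just to show the shape of the problem.

```python
import numpy as np

rng = np.random.default_rng(1)
D, n, N = 50, 4, 4                 # toy sizes; here theta happens to have N = n entries
xs = rng.uniform(-1, 1, (D, n))
ys = rng.uniform(-1, 1, D)

def f(x, theta):
    # A deliberately simple parameterized family: f(x, theta) = theta . x
    return float(theta @ x)

def objective(theta):
    # ERM objective as a function of theta only: sum_i loss(f(x_i, theta), y_i)
    theta = np.clip(theta, -1.0, 1.0)   # enforce Theta as a subset of [-1, 1]^N
    return sum((f(x, theta) - y) ** 2 for x, y in zip(xs, ys))

# Crude "training": sample feasible thetas and keep the best objective value.
best = min(objective(rng.uniform(-1, 1, N)) for _ in range(1000))
print(best)
```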

  13. What we know for Neural Nets

  17. Neural Networks
  • D data points (x̂_i, ŷ_i), 1 ≤ i ≤ D, x̂_i ∈ R^n, ŷ_i ∈ R^m
  • f = T_{k+1} ∘ σ ∘ T_k ∘ σ ∘ ... ∘ σ ∘ T_1
  • Each T_i affine: T_i(y) = A_i y + b_i
  • A_1 is n × w, A_{k+1} is w × m, A_i is w × w otherwise.
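A minimal numpy sketch of this architecture (my illustration; dimensions and the choice of σ = ReLU are assumptions for the example). Row vectors are used so the weight shapes match the slide exactly: A_1 is n × w, the middle A_i are w × w, and A_{k+1} is w × m.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, As, bs):
    """f = T_{k+1} o sigma o T_k o ... o sigma o T_1, with T_i(y) = y @ A_i + b_i."""
    y = x
    for A, b in zip(As[:-1], bs[:-1]):
        y = relu(y @ A + b)          # sigma applied after each of T_1, ..., T_k
    return y @ As[-1] + bs[-1]       # output layer T_{k+1} has no activation

# Toy instance: k = 2 hidden layers, n = 3 inputs, width w = 5, m = 2 outputs.
rng = np.random.default_rng(2)
n, w, m, k = 3, 5, 2, 2
shapes = [(n, w)] + [(w, w)] * (k - 1) + [(w, m)]
As = [rng.uniform(-1, 1, s) for s in shapes]
bs = [rng.uniform(-1, 1, s[1]) for s in shapes]

x_hat = rng.uniform(-1, 1, n)
print(forward(x_hat, As, bs))        # a point in R^m
```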

  18. Hardness Results
  Theorem (Blum and Rivest 1992). Let x̂_i ∈ R^n, ŷ_i ∈ {0, 1}, ℓ ∈ {absolute value, 2-norm squared} and σ a threshold function. Then training is NP-hard even in this simple network: (network figure omitted)
  Theorem (Boob, Dey and Lan 2018). Let x̂_i ∈ R^n, ŷ_i ∈ {0, 1}, ℓ a norm and σ(t) = max{0, t} a ReLU activation. Then training is NP-hard in the same network.

  21. Exact Training Complexity
  Theorem (Arora, Basu, Mianjy and Mukherjee 2018). If k = 1 (one “hidden layer”), m = 1 and ℓ is convex, there is an exact training algorithm of complexity O(2^w D^{nw} poly(D, n, w)).
  Polynomial in the size of the data set, for fixed n, w.
  Also in that paper:
  “we are not aware of any complexity results which would rule out the possibility of an algorithm which trains to global optimality in time that is polynomial in the data size”
  “Perhaps an even better breakthrough would be to get optimal training algorithms for DNNs with two or more hidden layers and this seems like a substantially harder nut to crack”
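To see why this is polynomial in the data size only for a fixed architecture, here is a quick back-of-the-envelope evaluation (mine, purely illustrative) of the dominant term 2^w · D^{nw}: for fixed n and w it is a fixed-degree polynomial in D, but the degree nw blows up with the network width.

```python
# Dominant term of the stated bound, 2^w * D^(n*w), for fixed n and w.
def dominant_term(D, n, w):
    return (2 ** w) * (D ** (n * w))

n, w = 2, 3                          # a tiny fixed architecture: degree n*w = 6 in D
for D in (10, 100, 1000):
    print(D, dominant_term(D, n, w))

# Doubling w to 6 doubles the degree to 12, so D = 1000 is already astronomically worse.
print(dominant_term(1000, 2, 6))
```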

  25. What we’ll prove
  There exists a polytope:
  • whose size depends linearly on D
  • that encodes approximately all possible training problems coming from (x̂_i, ŷ_i)_{i=1}^{D} ⊆ [−1, 1]^{(n+m)D}.
  Spoiler: Theory-only results

  26. Our Hammer

  29. Treewidth
  Treewidth is a parameter that measures how tree-like a graph is.
  Definition. Given a chordal graph G, we say its treewidth is ω if its clique number is ω + 1.
  • Trees have treewidth 1
  • Cycles have treewidth 2
  • K_n has treewidth n − 1
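The three examples on the slide can be checked numerically; below is a sketch using networkx's min-degree treewidth heuristic (an upper bound in general, but exact on these small graphs). The code and the library choice are my illustration, not part of the talk.

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

# Treewidth of the slide's three examples, via the min-degree heuristic.
examples = {
    "path/tree P_6": nx.path_graph(6),       # trees have treewidth 1
    "cycle C_6": nx.cycle_graph(6),          # cycles have treewidth 2
    "complete K_6": nx.complete_graph(6),    # K_n has treewidth n - 1
}
for name, G in examples.items():
    width, _decomposition = treewidth_min_degree(G)
    print(name, "treewidth (heuristic upper bound):", width)
```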

  31. Approximate optimization of well-behaved functions
  Prototype problem:
  min c^T x
  s.t. f_i(x) ≤ 0, i = 1, ..., m
  x ∈ [0, 1]^n
  Toolset:
  • Each f_i is “well-behaved”: Lipschitz constant L_i over [0, 1]^n
  • Intersection graph: an edge whenever two variables appear in the same f_i
  For example: (a system of constraints on x_1, ..., x_6 and the resulting intersection graph on nodes 1 to 6; figure omitted)
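A small pure-Python sketch of the intersection-graph construction (my own example; the constraint supports below are made up, not the ones from the slide's figure): one node per variable, and an edge whenever two variables appear together in some constraint f_i.

```python
from itertools import combinations

def intersection_graph(supports, num_vars):
    """Nodes are variables 1..num_vars; an edge joins two variables
    whenever some constraint f_i contains both of them."""
    edges = set()
    for support in supports:                      # support = variables of one f_i
        for u, v in combinations(sorted(support), 2):
            edges.add((u, v))
    return {"nodes": list(range(1, num_vars + 1)), "edges": sorted(edges)}

# Made-up example with six variables; each set lists the variables of one constraint.
constraint_supports = [{1, 2}, {2, 3}, {3, 4, 5}, {5, 6}, {1, 6}]
print(intersection_graph(constraint_supports, num_vars=6))
```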
