

SLIDE 1

When can Deep Networks avoid the curse of dimensionality and other theoretical puzzles

Tomaso Poggio, MIT, CBMM

SLIDE 2

CBMM's focus is the Science and the Engineering of Intelligence

We aim to make progress in understanding intelligence; that is, in understanding how the brain makes the mind, how the brain works, and how to build intelligent machines. We believe that the science of intelligence will enable better engineering of intelligence.

BCS VC meeting, 2017

SLIDE 3

Key role of Machine learning: history

Third Annual NSF Site Visit, June 8–9, 2016

SLIDE 4

CBMM: one of the motivations

Key recent advances in the engineering of intelligence have their roots in basic research on the brain.

SLIDE 5

It is time for a theory of deep learning

SLIDE 6

SLIDE 7

SLIDE 8

ReLU approximation by a univariate polynomial preserves the properties of deep nets

SLIDE 9

SLIDE 10

Deep Networks: Three theory questions

  • Approximation Theory: When and why are deep networks better than shallow networks?
  • Optimization: What is the landscape of the empirical risk?
  • Learning Theory: How can deep learning not overfit?

SLIDE 11

Theory I: Why and when are deep networks better than shallow networks?

Theorem (informal statement)

Suppose that a function of d variables is compositional. Both shallow and deep networks can approximate f equally well, but the number of parameters of the shallow network depends exponentially on the dimension d, as $O(\varepsilon^{-d})$, whereas for the deep network it is dimension independent, i.e. $O(\varepsilon^{-2})$.

$$g(x) = \sum_{i=1}^{r} c_i \big( \langle w_i, x \rangle + b_i \big)_+$$

$$f(x_1, x_2, \ldots, x_8) = g_3\big( g_{21}( g_{11}(x_1, x_2),\, g_{12}(x_3, x_4) ),\; g_{22}( g_{11}(x_5, x_6),\, g_{12}(x_7, x_8) ) \big)$$

Mhaskar, Poggio, Liao, 2016
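For concreteness, here is a minimal sketch (my illustration, not from the slides; the constituent functions g11, g12, g21, g22, g3 are invented placeholders) of the binary-tree compositional structure used throughout the talk. Each constituent function depends on only two variables, which is what a deep network with a matching architecture exploits.

```python
# Minimal sketch of the binary-tree compositional function
# f(x1,...,x8) = g3(g21(g11(x1,x2), g12(x3,x4)),
#                   g22(g11(x5,x6), g12(x7,x8))).
# The constituent functions below are arbitrary illustrative choices.
import numpy as np

def g11(a, b): return np.tanh(a + 2.0 * b)
def g12(a, b): return np.tanh(a - b)
def g21(a, b): return a * b
def g22(a, b): return np.maximum(a, b)
def g3(a, b):  return a + b

def f(x):
    """Hierarchically local compositional function of d = 8 variables:
    every node of the tree sees only 2 inputs."""
    x1, x2, x3, x4, x5, x6, x7, x8 = x
    return g3(g21(g11(x1, x2), g12(x3, x4)),
              g22(g11(x5, x6), g12(x7, x8)))

print(f(np.linspace(-1.0, 1.0, 8)))   # scalar output for one 8-dim input
```

A deep network mirroring this graph only ever has to approximate bivariate functions, one per tree node, while a shallow network must treat f as a generic function of all 8 variables at once.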

SLIDE 12

Deep and shallow networks: universality

Cybenko, Girosi, …

$$\varphi(x) = \sum_{i=1}^{r} c_i \big( \langle w_i, x \rangle + b_i \big)_+$$

SLIDE 13

Classical learning theory and Kernel Machines (Regularization in RKHS)

$$\min_{f \in \mathcal{H}_K} \; \frac{1}{\ell} \sum_{i=1}^{\ell} V\big(y_i, f(x_i)\big) + \lambda \, \| f \|_K^2$$

implies

$$f(x) = \sum_{i=1}^{\ell} \alpha_i \, K(x, x_i)$$

The equation includes splines, Radial Basis Functions and Support Vector Machines (depending on the choice of V).

RKHS were explicitly introduced in learning theory by Girosi (1997) and Vapnik (1998). Moody and Darken (1989) and Broomhead and Lowe (1988) introduced RBF networks to learning theory. Poggio and Girosi (1989) introduced Tikhonov regularization in learning theory and worked (implicitly) with RKHS. RKHS were used earlier in approximation theory (e.g., Parzen, 1952–1970; Wahba, 1990).
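A minimal sketch of this representer-theorem solution, assuming a Gaussian kernel and the square loss for V (my choices; the data below are synthetic): with V the square loss, minimizing the regularized functional above reduces to solving the linear system (K + λℓI)α = y.

```python
# Kernel regularization sketch: f(x) = sum_i alpha_i K(x, x_i),
# with alpha solving (K + lambda * ell * I) alpha = y for square-loss V.
import numpy as np

def K(a, b, sigma=1.0):
    """Gaussian (RBF) kernel matrix between row-sets a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (20, 2))        # ell = 20 training points in d = 2
y = np.sin(X[:, 0]) + X[:, 1] ** 2     # illustrative synthetic target
lam, ell = 1e-3, len(X)

alpha = np.linalg.solve(K(X, X) + lam * ell * np.eye(ell), y)

def f(x_new):
    """Evaluate the RKHS solution at new points."""
    return K(np.atleast_2d(x_new), X) @ alpha

print(f(np.array([0.3, -0.2])))
```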

SLIDE 14

Classical kernel machines are equivalent to shallow networks

$$f(x) = \sum_{i} c_i \, K(x, x_i) + b$$

Kernel machines can be "written" as shallow networks: the value of $K(x, x_i)$ corresponds to the "activity" of the "unit" for the input $x$, and the $c_i$ correspond to the "weights".

[Figure: one-hidden-layer network with kernel units $K$ and weights $c_1, \ldots, c_N$ combined by a sum, mapping input $X$ to output $Y = f$]

SLIDE 15

Curse of dimensionality

$$y = f(x_1, x_2, \ldots, x_8)$$

Both shallow and deep networks can approximate a generic function of d variables equally well, and the number of parameters in both cases depends exponentially on d, as $O(\varepsilon^{-d})$.

Mhaskar, Poggio, Liao, 2016

SLIDE 16

Generic functions: $f(x_1, x_2, \ldots, x_8)$

Compositional functions: $f(x_1, x_2, \ldots, x_8) = g_3\big( g_{21}( g_{11}(x_1, x_2),\, g_{12}(x_3, x_4) ),\; g_{22}( g_{11}(x_5, x_6),\, g_{12}(x_7, x_8) ) \big)$

Mhaskar, Poggio, Liao, 2016

SLIDE 17

Hierarchically local compositionality

Theorem (informal statement)

Suppose that a function of d variables is hierarchically, locally compositional. Both shallow and deep networks can approximate f equally well, but the number of parameters of the shallow network depends exponentially on the dimension d, as $O(\varepsilon^{-d})$, whereas for the deep network it is $O(d\,\varepsilon^{-2})$.

$$f(x_1, x_2, \ldots, x_8) = g_3\big( g_{21}( g_{11}(x_1, x_2),\, g_{12}(x_3, x_4) ),\; g_{22}( g_{11}(x_5, x_6),\, g_{12}(x_7, x_8) ) \big)$$

Mhaskar, Poggio, Liao, 2016
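To make the gap concrete, a back-of-the-envelope count (my arithmetic, an illustrative reading of the theorem, taking d = 8 and target accuracy ε = 0.1):

```latex
N_{\text{shallow}} = O(\varepsilon^{-d}) = O(0.1^{-8}) \sim 10^{8} \text{ units},
\qquad
N_{\text{deep}} = \underbrace{(d-1)}_{\text{nodes of the binary tree}} \cdot O(\varepsilon^{-2})
= 7 \cdot O(10^{2}) \sim 10^{3} \text{ units}.
```

Each of the d − 1 = 7 tree nodes is a bivariate function, so its approximation cost is the fixed two-dimensional cost $O(\varepsilon^{-2})$, independent of d.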

SLIDE 18

Proof

SLIDE 19

Microstructure of compositionality

[Figure: target function vs. approximating function/network]

SLIDE 20

Locality of constituent functions is key: CIFAR

SLIDE 21

Remarks

SLIDE 22

Old results on Boolean functions are closely related

  • A classical theorem [Sipser, 1986; Hastad, 1987] shows that deep circuits are more efficient in representing certain Boolean functions than shallow circuits.
  • Hastad proved that highly-variable functions (in the sense of having high frequencies in their Fourier spectrum), in particular the parity function, cannot even be decently approximated by small constant-depth circuits.

SLIDE 23

Lower Bounds

  • The main result of [Telgarsky, 2016, COLT] says that there are functions with many oscillations that cannot be represented by shallow networks with linear complexity but can be represented with low complexity by deep networks.
  • Older examples exist: consider a function which is a linear combination of n tensor-product Chui–Wang spline wavelets, where each wavelet is a tensor-product cubic spline. It was shown by Chui and Mhaskar that it is impossible to implement such a function using a shallow neural network with a sigmoidal activation function using O(n) neurons, but a deep network with the activation function $(x_+)^2$ can do so. In this case, as we mentioned, there is a formal proof of a gap between deep and shallow networks. Similarly, Eldan and Shamir show other cases with separations that are exponential in the input dimension.
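The Telgarsky-style separation can be made concrete with the standard sawtooth construction (my illustration, not taken from the slides): the "tent" map T(x) = 2x on [0, 1/2] and 2(1 − x) on [1/2, 1] costs only 2 ReLU units, and composing it k times yields 2^(k−1) oscillation peaks with O(k) units in depth, whereas a shallow ReLU network in one dimension needs roughly one unit per linear piece, i.e. exponentially many in k.

```python
# Depth buys oscillations: k-fold composition of a 2-unit ReLU tent map
# gives 2^(k-1) peaks, i.e. 2^k linear pieces, from only O(k) units.
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def tent(x):
    """One tent map, exact on [0, 1], built from two ReLU units."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_tent(x, k):
    """k-fold composition: a depth-k network with 2k hidden units."""
    for _ in range(k):
        x = tent(x)
    return x

x = np.linspace(0.0, 1.0, 1001)
y = deep_tent(x, 5)
# Count strict local maxima; expect 2^(5-1) = 16 peaks.
peaks = np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:]))
print(peaks)
```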

SLIDE 24

Open problem: why are compositional functions important for perception?

They seem to occur in computations on text, speech, images… why?

Conjecture (with Max Tegmark): the locality of the Hamiltonians of physics induces compositionality in natural signals such as images.

The connectivity in our brain implies that our perception is limited to compositional functions.

SLIDE 25

Locality of Computation

Why are compositional functions important? Which one of these reasons: Physics? Neuroscience? <=== Evolution?

What is special about locality of computation? Locality in "space"? Locality in "time"?

SLIDE 26

Deep Networks: Three theory questions

  • Approximation Theory: When and why are deep networks better than shallow networks?
  • Optimization: What is the landscape of the empirical risk?
  • Learning Theory: How can deep learning not overfit?

SLIDE 27

Theory II: What is the landscape of the empirical risk?

Observation

Replacing the ReLUs with a univariate polynomial approximation, Bezout's theorem implies that the system of polynomial equations corresponding to zero empirical error has a very large number of degenerate solutions. The global zero-minimizers correspond to flat minima in many dimensions (generically, unlike local minima). Thus SGD is biased towards finding global minima of the empirical risk.

Liao, Poggio, 2017

SLIDE 28

Bezout theorem

$$p(x_i) - y_i = 0 \quad \text{for } i = 1, \ldots, n$$

The set of polynomial equations above, with k the degree of p(x), has a number of distinct zeros (counting points at infinity, using projective space, assigning an appropriate multiplicity to each intersection point, and excluding degenerate cases) equal to the product of the degrees of the equations, $Z = k^n$. As in the linear case, when the system of equations is underdetermined – as many equations as data points but more unknowns (the weights) – the theorem says that there are an infinite number of global minima, in the form of Z regions of zero empirical error.
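A toy numerical sketch of this underdetermined regime (my construction, not the authors' experiment; learning rate and step count are ad hoc): a one-hidden-layer network with squared activation, $p(x) = \sum_j v_j (\langle w_j, x\rangle + b_j)^2$, has W = 4r weights here; with r = 8 units and only n = 4 data points the zero-error system is underdetermined, and plain gradient descent from different seeds typically lands at different (near-)zero-loss weight vectors, consistent with many degenerate global minima.

```python
# Overparametrized polynomial network: many distinct zero-error solutions.
import numpy as np

X = np.array([[-1.0, 0.2], [-0.3, -0.8], [0.4, 0.5], [1.0, -0.1]])  # n = 4
Y = np.array([0.5, -0.2, 0.1, 0.8])

def train(seed, r=8, lr=0.02, steps=50_000):
    g = np.random.default_rng(seed)
    w = 0.5 * g.normal(size=(r, 2))          # W = 4r = 32 >> n = 4 weights
    b = 0.5 * g.normal(size=r)
    v = 0.5 * g.normal(size=r)
    for _ in range(steps):
        pre = X @ w.T + b                    # (n, r) pre-activations
        e = (v * pre ** 2).sum(1) - Y        # residuals p(x_i) - y_i
        gv = (pre ** 2).T @ e / len(X)       # gradients of 0.5 * MSE
        gw = (2.0 * v * pre * e[:, None]).T @ X / len(X)
        gb = (2.0 * v * pre).T @ e / len(X)
        v, w, b = v - lr * gv, w - lr * gw, b - lr * gb
    e = (v * (X @ w.T + b) ** 2).sum(1) - Y
    return np.concatenate([w.ravel(), b, v]), 0.5 * (e ** 2).mean()

p1, l1 = train(1)
p2, l2 = train(2)
print(l1, l2)                    # both near-zero empirical risk (typically)
print(np.linalg.norm(p1 - p2))   # yet clearly different weight vectors
```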

SLIDE 29

Global and local zeros

$$f(x_i) - y_i = 0 \quad \text{for } i = 1, \ldots, n$$

Global zeros: n equations in W unknowns, with W >> n. Local zeros: W equations in W unknowns.

SLIDE 30

Langevin equation

$$\frac{df}{dt} = -\gamma_t \, \nabla V\big(f(t), z(t)\big) + \gamma'_t \, dB(t)$$

with the Boltzmann distribution as asymptotic "solution":

$$p(f) \sim \frac{1}{Z} \, e^{-\frac{U(f)}{T}}$$
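A sketch of the discretized dynamics (my discretization; the double-well potential is a toy stand-in, not from the slides): the update is w ← w − η V′(w) + √(2ηT) ε with ε ~ N(0, 1), and for small η the long-run iterates are distributed approximately as the Boltzmann distribution p(w) ~ (1/Z) exp(−V(w)/T).

```python
# Discretized Langevin dynamics on a toy double well V(w) = (w^2 - 1)^2.
import numpy as np

def dV(w):
    return 4.0 * w * (w ** 2 - 1.0)    # gradient of the double well

eta, T = 1e-3, 0.5
rng = np.random.default_rng(0)
w, samples = 0.0, []
for step in range(200_000):
    w += -eta * dV(w) + np.sqrt(2.0 * eta * T) * rng.normal()
    if step >= 50_000:                 # discard burn-in
        samples.append(w)

# The two wells are symmetric, so their Boltzmann weights are equal and
# roughly half the samples should sit in each; the exact value fluctuates.
print(np.mean(np.array(samples) > 0.0))
```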

SLIDE 31

SGD

SLIDE 32

This is an analogy NOT a theorem

SLIDE 33


 GDL selects larger volume minima

SLIDE 34


 GDL and SGD

SLIDE 35


 Concentration because of high dimensionality

SLIDE 36

SGDL and SGD observation: summary

  • SGDL finds, with very high probability, large-volume, flat zero-minimizers; empirically, SGD behaves in a similar way.
  • Flat minimizers correspond to degenerate zero-minimizers and thus to global minimizers.

Poggio, Rakhlin, Golowich, Zhang, Liao, 2017

SLIDE 37

Deep Networks: Three theory questions

  • Approximation Theory: When and why are deep networks better than shallow networks?
  • Optimization: What is the landscape of the empirical risk?
  • Learning Theory: How can deep learning not overfit?

SLIDE 38

Problem of overfitting

Regularization, or something similar, is expected to be needed to control overfitting.

SLIDE 39

Deep polynomial networks show the same puzzles

From now on we study polynomial networks!

Poggio et al., 2017

SLIDE 40


 Good generalization with less data than # weights

Poggio et al., 2017

SLIDE 41

Randomly labeled data

following Zhang et al., 2016 (ICLR); Poggio et al., 2017

SLIDE 42

No overfitting!

Explaining this figure is our main goal!

Poggio et al., 2017

SLIDE 43

No overfitting with GD

SLIDE 44

Implicit regularization by GD+SGD (linear case, no hidden layer)

[Figure: linear network with inputs $x_1, x_2, \ldots, x_{d-1}, x_d$ and weights W]

$$W = Y X^{\dagger}$$

The minimum-norm solution is the limit, for $\lambda \to 0$, of the regularized solution.
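A minimal demo of this claim (the fact is standard; the code and data are mine): for an underdetermined linear problem, gradient descent initialized at W = 0 converges to the minimum-norm solution $W = Y X^{\dagger}$, with no explicit regularizer.

```python
# GD from zero init on underdetermined least squares finds the
# pseudoinverse (minimum-norm) solution.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50                          # n data points, d >> n unknowns
X = rng.normal(size=(d, n))            # columns are the training inputs x_i
Y = rng.normal(size=(1, n))            # training targets

W, eta = np.zeros((1, d)), 0.01        # starting at W = 0 is the key
for _ in range(50_000):
    W -= eta * (W @ X - Y) @ X.T / n   # GD on the square loss

W_minnorm = Y @ np.linalg.pinv(X)      # min-norm solution Y X^+
print(np.linalg.norm(W - W_minnorm))   # approximately 0
```

The reason: every GD update lies in the span of the data, so the iterates never acquire a component in the null space of the problem, and the zero-loss solution they reach is the one of minimum norm.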

SLIDE 45

Implicit regularization by GD: the number of iterations controls λ

Rosasco, Villa, 2015
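A rough sketch of this point (my illustration, not the paper's construction): for least squares, running GD for t iterations behaves approximately like Tikhonov regularization with λ ≈ 1/(ηt), so early stopping acts as implicit regularization; the correspondence is approximate, not exact.

```python
# Early stopping vs. ridge: t GD steps roughly match lambda = 1/(eta * t).
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)   # noisy targets

eta, t = 0.001, 500
w_gd = np.zeros(d)
for _ in range(t):
    w_gd -= eta * X.T @ (X @ w_gd - y)          # GD, stopped after t steps

lam = 1.0 / (eta * t)                           # matched ridge parameter
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.linalg.norm(w_gd - w_ridge) / np.linalg.norm(w_ridge))  # small
```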

SLIDE 46

Deep linear network

[Figure: two-layer linear network with weight matrices $W_1$, $W_2$]

Ganguli, Saxe et al., 2015; Baldi and Hornik, 1989

SLIDE 47

Deep linear networks

$$W_2 W_1 = A$$

Remark: the factorization implies redundant parameters (e.g. $W_2 W_1 = (W_2 G)(G^{-1} W_1)$ for any invertible $G$), which are controlled if the null space is empty.

SLIDE 48

Deep linear network: GD as regularizer

GD regularizes deep linear networks as it does for linear networks

SLIDE 49

Deep nonlinear (degree 2) networks
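For concreteness, a minimal forward pass of such a network (my sketch; the layer sizes and names are arbitrary): the ReLU is replaced by the degree-2 activation z → z², so the whole network computes an explicit polynomial of its inputs.

```python
# One-hidden-layer network with the degree-2 polynomial activation z -> z^2.
import numpy as np

rng = np.random.default_rng(0)

def poly_layer(x, W, b):
    return (W @ x + b) ** 2            # univariate polynomial activation

d, h = 8, 16                            # input dimension, hidden width
W1, b1 = 0.1 * rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = 0.1 * rng.normal(size=(1, h)), np.zeros(1)

x = rng.normal(size=d)
out = W2 @ poly_layer(x, W1, b1) + b2  # a degree-2 polynomial in x
print(out)
```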

SLIDE 50

Linearized dynamics around a stable solution $W^*$, used to study its stability when perturbations are small

SLIDE 51

Deep nonlinear networks: conjecture

The conclusion about the extension to multilayer networks with polynomial activation is thus similar to the linear case and can be summarized as follows: for low-noise data and a degenerate global minimum $W^*$, GD on a polynomial multilayer network avoids overfitting without explicit regularization, despite overparametrization.

SLIDE 52

Three theory questions: summary

  • Approximation theorems: for hierarchically compositional functions, deep but not shallow networks avoid the curse of dimensionality, because of the locality of the constituent functions.
  • Optimization remarks: Bezout's theorem suggests many global minima, which are found by SGD with high probability relative to local minima.
  • Learning theory results and conjectures: unlike the case of a linear network, the data dictate, because of the regularizing dynamics of GD, the number of effective parameters, which is in general smaller than the number of weights.