Approximation power of deep networks
Matus Telgarsky <mjt@illinois.edu> (with help from many friends!)
Goal: in some prediction problem, replace f : R^d → R with neural network g : R^d → R.

Primary setting: statistical learning theory, thus compare

∫ ℓ(f(x), y) dP(x, y)   vs.   ∫ ℓ(g(x), y) dP(x, y).

◮ Upper bounds: if ℓ(·, y) is 1-Lipschitz,

∫ |ℓ(g(x), y) − ℓ(f(x), y)| dP(x, y) ≤ ∫ |g(x) − f(x)| dP(x, y);

we make this small everywhere (universal/uniform/L∞(P) apx), or in L1(P).

◮ Lower bounds: we want large error on a large set; as a surrogate, |g − f| large in L1(P) or L1(Unif).
By deep networks we mostly mean

x → A_L σ_{L−1}(· · · σ_1(A_1 x + b_1) · · ·) + b_L,

where the nonlinearity/activation/transfer σ_i is applied coordinate-wise. There are many conventions; we will briefly discuss others. We'll mostly stick to the ReLU z → max{0, z} (Fukushima '80); it's easy to convert.
Elementary universal approximation. Classical universal approximation. Benefits of depth.
Sobolev spaces. Odds & ends.
Univariate functions via step activations

x → 2·1[x − 3 ≥ 0] + 1[x − 5 ≥ 0] + 2·1[x − 7 ≥ 0] − 1[x − 13 ≥ 0] + · · ·

Remark. By contrast, polynomials struggle with flat regions.
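This sum-of-steps construction is easy to run; a minimal sketch using the weights and thresholds from the display above:

```python
import numpy as np

def step(z):
    # Heaviside step activation: 1[z >= 0].
    return (z >= 0).astype(float)

# One hidden layer of 4 step units:
# x -> 2*1[x-3>=0] + 1[x-5>=0] + 2*1[x-7>=0] - 1[x-13>=0]
weights = np.array([2.0, 1.0, 2.0, -1.0])
thresholds = np.array([3.0, 5.0, 7.0, 13.0])

def net(x):
    x = np.asarray(x, dtype=float)
    return step(x[..., None] - thresholds) @ weights

# Piecewise constant: flat between thresholds, jumps at them.
print(net([0.0, 4.0, 6.0, 10.0, 20.0]))  # -> [0. 2. 3. 5. 4.]
```

Note the flat regions come for free: between thresholds the output is exactly constant.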
Smooth univariate functions via step activations

Approach #1: subdivide the range; Lip/ǫ steps.

Approach #2: by the FTC, for x ≥ 0,

f(x) = f(0) + ∫_0^x f′(b) db = f(0) + ∫_0^∞ 1[x − b ≥ 0] f′(b) db.

This is a density over infinitely many steps/nodes! Sample: avg Lip/ǫ² steps.

Remarks. ◮ Infinite width network! ◮ Refined average-case estimate! (Captures flat regions.)
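Approach #2 can be simulated: normalize |f′| into a density over thresholds, sample thresholds from it, and average signed steps. A sketch with the illustrative target f = sin on [0, π] (the grid size, node count N, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
f, fprime = np.sin, np.cos

# Density over thresholds b proportional to |f'(b)| on [0, pi].
grid = np.linspace(0.0, np.pi, 10_000)
raw = np.abs(fprime(grid))
dens = raw / raw.sum()
Z = raw.sum() * (grid[1] - grid[0])     # total mass of |f'| (Riemann sum)

N = 5_000
b = rng.choice(grid, size=N, p=dens)    # sampled step locations
signs = np.sign(fprime(b))

def net(x):
    # Monte Carlo estimate of f(0) + int 1[x - b >= 0] f'(b) db.
    x = np.asarray(x, dtype=float)
    return f(0.0) + (Z / N) * ((x[..., None] - b >= 0) * signs).sum(axis=-1)

xs = np.linspace(0.0, np.pi, 200)
err = np.max(np.abs(net(xs) - f(xs)))   # shrinks like 1/sqrt(N)
```

The sampled network places more steps where f changes quickly and none where it is flat, which is the "refined average-case estimate" of the remark.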
Univariate functions via ReLU activations

Include a ReLU z → max{0, z} at each change of slope. How about smooth functions? For x ≥ 0,

f(x) = f(0) + σr(x) f′(0) + ∫_0^∞ σr(x − b) f′′(b) db.

Need to sample: avg smooth/ǫ² ReLU! (In some sense optimal (Savarese-Evron-Soudry-Srebro '19).)
Multivariate, but finitely many points

With probability 1, a random line has unique projections. We've reduced to the univariate case.

Caveats: ◮ Representation size may have blown up. ◮ Not our original goal.
Approximate a multivariate box.

Supporting hyperplanes! . . . oops. (Figure: the sum of the four halfspace indicators equals 4 inside the box, but 3 or 2 in the outside regions.)

Fix #1: product halfspaces together! (we'll return to this...)
Fix #2: add a layer, thresholding at 3.5!
...how about one ReLU/hidden layer?
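Fix #2 can be written out directly: sum the four halfspace indicators of a box in the first layer, then threshold at 3.5 in a second layer. A minimal sketch with step activations (the box [0, 1]² and the test points are illustrative):

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

# Halfspaces whose intersection is the box [0,1]^2:
# x1 >= 0,  1 - x1 >= 0,  x2 >= 0,  1 - x2 >= 0.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
c = np.array([0.0, 1.0, 0.0, 1.0])

def layer1(x):
    # Sum of halfspace indicators: 4 inside the box, 3 or 2 outside.
    return step(x @ A.T + c).sum(axis=-1)

def box(x):
    # Second layer: threshold the sum at 3.5.
    return step(layer1(np.asarray(x, dtype=float)) - 3.5)

pts = np.array([[0.5, 0.5], [2.0, 0.5], [-1.0, -1.0], [0.5, 2.0]])
print(box(pts))  # -> [1. 0. 0. 0.]
```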
Approximate a multivariate ball.

Fix #3: add all the hyperplanes! The resulting radial function is constant within the ball, and attenuates away from it.

Bad news: good apx seems to require 2^d nodes. . . (We'll come back to this.)
Combinations of radial bumps. Normalize bumps/RBFs into a density p; convolve with f:

|f(x) − ∫ f(z) p(x − z) dz| = |f(x) − ∫ f(x − z) p(z) dz|
= |∫ f(x) p(z) dz − ∫ f(x − z) p(z) dz|
≤ ∫ |f(x) − f(x − z)| p(z) dz,

which is small if p(z) ≈ 0 for large z. Size estimate: (d·Lip/ǫ)^O(d). (Mhaskar-Michelli '92, BJTX '19.)
So far:
◮ Easy univariate constructions.
◮ 3-layer box constructions over R^d: size (Lip/ǫ)^O(d).
◮ 2-layer RBF convolutions over R^d: size (d·Lip/ǫ)^O(d).

Remarks.
◮ Impractical constructions! Bad Lipschitz constants.
◮ Contrast with polynomials: flat pieces.
◮ Usefulness of infinite width! Note also (for standard Gaussian a): E σr(aᵀx) = ½ E|aᵀx| = ‖x‖/√(2π).
◮ Poor complexity measures outside univariate!
Interlude: three questions

1. Are fixed DN architectures closed under addition?
2. Can RNNs model Turing Machines? (Diagram: recurrent cell f, with states s1, s2, s3, outputs y1, y2, y3, inputs x1, x2, x3.)
3. Given continuous g : R^d → R, can we construct custom univariate activations so that

g(x) =? Σ_{i=0}^{2d} f_i( Σ_{j=1}^{d} h_{i,j}(x_j) ) ?
Elementary universal approximation. Classical universal approximation. Benefits of depth.
Sobolev spaces. Odds & ends.
Bumps via multiplication

(Plot on [−1.5, 1.5]: cos(x)^p for p = 1, 2, 4, 8, 16, 32.)

Univariate bump: cos(x)^p for large p. Multivariate bump:

1[‖x‖∞ ≤ 1] = Π_{i=1}^{d} 1[|x_i| ≤ 1],   and correspondingly   Π_{i=1}^{d} cos(x_i)^p.

To remove the product:

cos(x) cos(x) = ½ (cos(2x) + 1),
2 cos(x1) cos(x2) = cos(x1 + x2) + cos(x1 − x2).
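Both product-to-sum identities can be checked numerically; iterating the second turns the d-fold cosine product into a sum of 2^{d−1} cosines (a sketch; the sample point is arbitrary):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
x = rng.standard_normal(3)

# 2 cos(a) cos(b) = cos(a + b) + cos(a - b): one product becomes a sum.
a, b = x[0], x[1]
lhs = 2.0 * np.cos(a) * np.cos(b)
rhs = np.cos(a + b) + np.cos(a - b)
assert abs(lhs - rhs) < 1e-12

# Iterating: prod_i cos(x_i) = 2^{-(d-1)} * sum over sign patterns
# s in {+1,-1}^{d-1} of cos(x_1 + s_2 x_2 + ... + s_d x_d).
d = len(x)
total = sum(np.cos(x[0] + sum(s * xi for s, xi in zip(signs, x[1:])))
            for signs in product([1, -1], repeat=d - 1))
assert abs(np.prod(np.cos(x)) - total / 2 ** (d - 1)) < 1e-12
```

The exponential blow-up in the number of summands (2^{d−1}) is the price of removing the product with a single linear layer.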
Weierstrass approximation theorem

Theorem (Weierstrass, 1885). Polynomials can uniformly approximate continuous functions over compact sets.

Remarks. ◮ Not a consequence of interpolation: must control behavior between interpolants. ◮ Proofs are interesting; e.g., Bernstein (Bernstein polynomials and tail bounds), Weierstrass (Gaussian smoothing gives analytic functions). . . . ◮ Stone-Weierstrass theorem: polynomial-like function families (e.g., closed under multiplication) also approximate continuous functions.
Theorem (Hornik-Stinchcombe-White '89). Let σ : R → R be given with lim_{z→−∞} σ(z) = 0 and lim_{z→+∞} σ(z) = 1, and define Hσ := { x → σ(aᵀx − b) : (a, b) ∈ R^{d+1} }. Then span(Hσ) uniformly approximates continuous functions on compact sets.

Proof #1. Hcos is closed under products since 2 cos(a) cos(b) = cos(a + b) + cos(a − b). Now uniformly approximate a fixed element of Hcos with span(Hσ). (Univariate fitting.)

Proof #2. Hexp is closed under products since e^a e^b = e^{a+b}. Now uniformly approximate a fixed element of Hexp with span(Hσ). (Univariate fitting.)

Remarks.
◮ ReLU is fine: use σ(z) := σr(z) − σr(z − 1).
◮ Size estimate: expanding terms, we seem to get (Lip/ǫ)^Ω(d).
◮ Best conditions on σ (Leshno-Lin-Pinkus-Schocken '93): the theorem holds iff σ is not a polynomial.
◮ Inner hint about DN: no need for explicit multiplication?
Other proofs.
◮ (Cybenko '89.) Assume for contradiction that you miss some functions. By duality, 0 = ∫ σ(aᵀx − b) dµ(x) for some nonzero signed measure µ and all (a, b). Using Fourier, one can show this implies µ = 0. . .
◮ (Leshno-Lin-Pinkus-Schocken '93.) If σ is a polynomial, . . . ; else can (roughly) get derivatives of all orders, hence polynomials of all orders.
◮ (Barron '93.) Consider the activation x → exp(i aᵀx), with infinite width form ∫ exp(i aᵀx) f̂(a) da. Take the real part and sample (Maurey) to get g ∈ span(Hcos); convert to span(Hσ) as before.
◮ (Funahashi '89.) Also Fourier, measure-theoretic.
"Universal approximation" (uniform approximation of continuous functions on compact sets).
◮ Elementary proof: RBF (Mhaskar-Michelli '92; BJTX '19).
◮ Slick proof: Stone-Weierstrass and Hcos or Hexp (Hornik-Stinchcombe-White '89).
◮ Proof with size estimates beating (Lip/ǫ)^d, indeed via the norm of the Fourier transform of the gradient, related to a "sampling measure": (Barron '93).

Remarks.
◮ Exhibits nothing special about DN; indeed, the same proofs work for boosting, RBF SVMs, . . .
◮ Size estimates huge (soon we'll see d^Ω(d)).
◮ Proofs use nice representation "tricks" (e.g., Leshno et al.'s "iff not polynomial").
Elementary universal approximation. Classical universal approximation. Benefits of depth.
Sobolev spaces. Odds & ends.
Radial functions are easy with two ReLU layers

Consider f(‖x‖₂²) with Lipschitz constant Lip.
◮ Pick h(x) ≈ǫ ‖x‖₂² = Σᵢ xᵢ² with d·Lip/ǫ ReLU.
◮ Pick g ≈ǫ f with Lip/ǫ ReLU; then

|f(‖x‖₂²) − g(h(x))| ≤ |f(‖x‖₂²) − f(h(x))| + |f(h(x)) − g(h(x))| ≤ Lip·|‖x‖₂² − h(x)| + ǫ ≤ 2ǫ.

Remarks.
◮ Final size of g ◦ h is poly(Lip, d, 1/ǫ).
◮ Proof style is "typical"/lazy; (problematically) pays with the Lipschitz constant.
◮ That was easy/intuitive; how about the 2 layer case?...
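The two-stage construction g ◦ h can be sketched with off-the-shelf piecewise-linear interpolation standing in for the 1-d ReLU approximators (the radial profile f = sin, dimension, and knot grids are all illustrative):

```python
import numpy as np

d = 5
f = np.sin                                  # a 1-Lipschitz "radial profile"

# Stage 1: h(x) ~ ||x||^2 = sum_i x_i^2, built coordinate-wise from a
# piecewise-linear (hence ReLU-expressible) approximation of t -> t^2.
knots = np.linspace(-1.0, 1.0, 101)
def h(x):
    return np.interp(x, knots, knots ** 2).sum(axis=-1)

# Stage 2: g ~ f, piecewise linear on the range of h (here [0, d]).
gknots = np.linspace(0.0, d, 201)
def g(t):
    return np.interp(t, gknots, f(gknots))

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=(1000, d))
err = np.max(np.abs(g(h(x)) - f((x ** 2).sum(axis=-1))))
```

Both stages are piecewise linear, so each is realizable by one ReLU layer; the composition uses two, exactly as in the slide's bound.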
Radial functions are not easy with only one ReLU layer (I)

Theorem (Eldan-Shamir, 2015). There exists a radial function f, expressible with two ReLU layers of width poly(d), and a probability measure P so that every g with a single ReLU layer of width 2^{O(d)} satisfies

∫ (f(x) − g(x))² dP(x) ≥ Ω(1).

Proof hints. Apply the Fourier isometry and consider the transforms. The transform of g is supported on a small set of tubes; the transform of f has large mass they can't reach.
Radial functions are not easy with only one ReLU layer (II)

Theorem (Daniely, 2017). Let (x, x′) ∼ P be uniform on two sphere surfaces, and define h(x, x′) = sin(π d³ xᵀx′). For any g with a single ReLU layer of width d^{O(d)} and weight magnitude O(2^d),

∫ (h(x, x′) − g(x, x′))² dP(x, x′) ≥ Ω(1),

and h can be approximated to accuracy ǫ by f with two ReLU layers of size poly(d, 1/ǫ).

Proof hints. Spherical harmonics reduce this to a univariate problem; apply region counting.
Approximation of high-dimensional radial functions

(A radial function contour plot.) If we can approximate each shell, we can approximate the overall function.

Let's approximate a single shell; consider x → 1[‖x‖ ∈ [1 − 1/d, 1]], which has a constant fraction of sphere volume. Can't cut too deeply; we get bad error on the inner zero part. . . . . . but then we need to cover exponentially many caps.
Let's go back to the drawing board; what do shallow representations do exceptionally badly? One weakness: their complexity scales with #bumps.
Consider the tent map

∆(x) := σr(2x) − σr(4x − 2) = { 2x, x ∈ [0, 1/2);  2(1 − x), x ∈ [1/2, 1] }.

(Plots on [0, 1]: ∆; ∆² = ∆ ◦ ∆; ∆ᵏ.)

What is the effect of composition? f(∆(x)):
x ∈ [0, 1/2) ⟹ f(2x), i.e. f squeezed into [0, 1/2];
x ∈ [1/2, 1] ⟹ f(2(1 − x)), i.e. f reversed and squeezed.

∆ᵏ uses O(k) layers & nodes, but has O(2ᵏ) bumps.
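The exponential oscillation count is easy to verify numerically: ∆ᵏ crosses the level 1/2 exactly 2ᵏ times (a sketch; the grid resolution is arbitrary but must out-resolve the pieces):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
tent = lambda x: relu(2.0 * x) - relu(4.0 * x - 2.0)  # Delta on [0, 1]

def tent_k(x, k):
    # k-fold composition: O(k) ReLU layers, exponentially many oscillations.
    for _ in range(k):
        x = tent(x)
    return x

xs = np.linspace(0.0, 1.0, 2 ** 16 + 1)
for k in range(1, 8):
    above = tent_k(xs, k) > 0.5
    crossings = int(np.sum(above[1:] != above[:-1]))
    assert crossings == 2 ** k   # Delta^k crosses 1/2 exactly 2^k times
```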
Theorem (T '15). Let #layers k ≥ 1 be given. There exists a ReLU network f : [0, 1] → [0, 1] with 4 distinct parameters, 3k² + 9 nodes, and 2k² + 6 layers, such that every ReLU network g : R → R with ≤ k layers and ≤ 2ᵏ nodes satisfies

∫_{[0,1]} |f(x) − g(x)| dx ≥ 1/32.

Proof.
1. g with few oscillations can't apx an oscillatory, regular f.
2. There exists a regular, oscillatory f. (f = ∆^{k²+3}.)
3. Width m, depth L ⟹ few (O(m)^L) oscillations.

Rediscovered many times; (T '15) gives an elementary univariate argument; multivariate arguments in (Warren '68), (Arnold ?), (Montufar-Pascanu-Cho-Bengio '14), (BT '18), . . .
g with few oscillations; f highly oscillatory, regular ⟹ ∫_{[0,1]} |g − f| large.

(Figure: a piecewise-linear g with few pieces against the rapidly oscillating f; each triangle of f that g fails to cross contributes constant area.)

Let's use f = ∆^{k²+3}.
Story from benefits of depth:
◮ Certain radial functions have a polynomial-width representation with 2 ReLU layers, but require exponential width with 1 ReLU layer.
◮ ∆^{k²+3} can be written with O(k²) depth and O(1) width, but requires width Ω(2ᵏ) if the depth is O(k).

Remarks.
◮ ∆ᵏ is 2ᵏ-Lipschitz; possibly nonsensical, unrealistic.
◮ These results have stood a few years now; many "technical" questions remain, also "realistic" questions.
Elementary universal approximation. Classical universal approximation. Benefits of depth.
Sobolev spaces. Odds & ends.
h_k := piecewise-affine interpolation of x² at {0, 1/2ᵏ, 2/2ᵏ, . . . , 2ᵏ/2ᵏ}.

(Plots: h₁; h₂; h₁ − h₂.)

Thus h_k(x) = x − Σ_{i≤k} ∆ⁱ(x)/4ⁱ.

◮ h_k needs k = O(ln(1/ǫ)) to ǫ-apx x → x² (Yarotsky '16), with matching lower bounds.
◮ Squaring implies multiplication via polarization: xᵀy = ½ (‖x + y‖² − ‖x‖² − ‖y‖²).
◮ This implies efficient approximation of polynomials; can we do more?
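Both the telescoping identity and the 4^{−k} error rate can be checked numerically (a sketch; grid sizes are arbitrary):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
tent = lambda x: relu(2.0 * x) - relu(4.0 * x - 2.0)

def h(x, k):
    # h_k(x) = x - sum_{i<=k} Delta^i(x) / 4^i.
    out, t = np.array(x, dtype=float), np.array(x, dtype=float)
    for i in range(1, k + 1):
        t = tent(t)           # Delta^i via repeated composition
        out = out - t / 4 ** i
    return out

xs = np.linspace(0.0, 1.0, 4097)
for k in range(1, 8):
    # h_k interpolates x^2 at the dyadic grid {j / 2^k} ...
    grid = np.arange(2 ** k + 1) / 2 ** k
    assert np.max(np.abs(h(grid, k) - grid ** 2)) < 1e-12
    # ... and the uniform error decays like 4^{-k}.
    assert np.max(np.abs(h(xs, k) - xs ** 2)) <= 4.0 ** (-k)
```

Each extra layer of ∆ quarters the error, which is exactly the k = O(ln(1/ǫ)) rate in the first bullet.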
Theorem (Yarotsky '16). Let dimension d and smoothness order r be given. Given f : [0, 1]^d → R with all rth order derivatives bounded by 1, there exists a network g with C_{d,r} ln(e/ǫ) layers and C_{d,r} ǫ^{−d/r} ln(e/ǫ) nodes so that

sup_{x∈[0,1]^d} |f(x) − g(x)| ≤ ǫ.

Proof. The conditions imply accurate local Taylor expansions. Therefore f can be written as a linear combination over this basis: polynomials multiplied by local bumps.

Remarks.
◮ There is depth, but it is function-independent: only the basis coefficients use f.
◮ In that sense this is a shallow representation.
◮ The Lipschitz constant is possibly bad: ∆ composed log₂(1/ǫ) times is (1/ǫ)-Lipschitz, and the bumps are ǫ^{−d/r}-Lipschitz.
◮ There is parallel and subsequent work with similar proof ideas and Lipschitz constants: (Safran-Shamir '16), (Petersen-Voigtlaender '17), (Schmidt-Hieber '17).
◮ Another appearance of polynomials in DN: sum-product networks. These were the first to have a depth separation (Delalleau-Bengio '11).
◮ DN can approximate polynomials efficiently, but the reverse is false: a single ReLU requires degree 1/ǫ.
◮ Polynomials cannot handle flat regions well; this is used above, and in approximating rational functions (T '17).
◮ Corresponding lower bounds indicate depth is needed.
Interlude: three questions

1. Are fixed DN architectures closed under addition? No: add together perturbed copies of ∆ᵏ.
2. Can RNNs model Turing Machines? (Diagram: recurrent cell f, with states s1, s2, s3, outputs y1, y2, y3, inputs x1, x2, x3.) Hint. ReLU networks can do exact Boolean formulae. Set f to the state transition table, encode the tape on s.
3. Given continuous g : R^d → R, can we construct custom univariate activations so that

g(x) =? Σ_{i=0}^{2d} f_i( Σ_{j=1}^{d} h_{i,j}(x_j) ) ?

Hint? Contradicts a Hilbert problem?
Elementary universal approximation. Classical universal approximation. Benefits of depth.
Sobolev spaces. Odds & ends.
Generative modeling

Typical setup: pushforward measure g#µ, meaning sample x ∼ µ, output g(x). Many easy constructions have bad/∞ Lipschitz constants! E.g., mapping uniform into [0, 1/2] ∪ (3/2, 2]. Some literature: (Arora-Ge-Liang-Ma-Zhang '17, BT '18, Bai-Ma-Risteski '19, Elchanan's talk this week!)
Randomly initialized networks

Approximation fact in recent optimization papers: a small perturbation of a random initialization gives any function you want! (Du-Lee-Li-Wang-Zhai '18, Allen-Zhu-Li-Song '18). There is residual error from the noise; approximating high-Lipschitz functions is problematic! (BJTX '19.)
Randomly sampled networks

Theorem. With probability ≥ 1 − 1/e,

sup_{‖x‖₂≤1} | ∫ σr(aᵀx − b) dµ(a, b) − (‖µ‖₁/N) Σ_{i=1}^{N} σr(aᵢᵀx − bᵢ) | ≤ O( B‖µ‖₁/√N ),

where the support of µ has ‖(a, b)‖ ≤ B.

Proof. Invoke Rademacher complexity, but swap inputs and parameters. (Koiran-Gurvits '97, Sun-Gilbert-Tewari '18, BJTX '19.) Also Maurey's Lemma (Barron '93).
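The sampling scheme behind the theorem can be sketched: draw nodes (aᵢ, bᵢ) from (normalized) µ and average; the uniform deviation over ‖x‖₂ ≤ 1 should shrink like 1/√N. Here µ is an illustrative uniform measure over M random unit directions (all sizes and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(z, 0.0)

d, M = 4, 2000                    # ambient dimension; "infinite width" proxy
A = rng.standard_normal((M, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit directions
b = rng.uniform(-1.0, 1.0, size=M)

def wide(x):
    # Proxy for the integral against mu: average over all M nodes.
    return relu(x @ A.T - b).mean(axis=-1)

# Subsample N of the nodes uniformly (mu is uniform here).
N = 200
idx = rng.choice(M, size=N, replace=False)
def sampled(x):
    return relu(x @ A[idx].T - b[idx]).mean(axis=-1)

x = rng.standard_normal((500, d))
x /= np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1.0)  # ||x|| <= 1
err = np.max(np.abs(wide(x) - sampled(x)))   # expect O(1/sqrt(N))
```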
Adversarial stability

Adversarial examples lower bound the Lipschitz constant. . . . . . but a bad Lipschitz constant can be good for adversarial examples! Given the existence of adversarial examples, is uniform approximation too stringent?
Turing machines and RNNs

(Diagram: recurrent cell f, with states s1, s2, s3, outputs y1, y2, y3, inputs x1, x2, x3.)

◮ Make f the TM state transition table, s the tape.
◮ x → 1[x ≥ 0] is not computable; bits need a special encoding within s.
◮ Use a robust "Cantor-like" encoding. (Siegelmann-Sontag '94.)
Kolmogorov-Arnold '56

There exist continuous ((h_{i,j})_{i=0}^{2d})_{j=1}^{d} : R → R so that, for any continuous g : R^d → R, there exist continuous (f_i)_{i=0}^{2d} : R → R with

g(x) = Σ_{i=0}^{2d} f_i( Σ_{j=1}^{d} h_{i,j}(x_j) ).

Step 1. Fix target accuracy ǫ > 0.
Step 2. Choose f : R → R and nearly injective Q : R^d → R with g ≈ f(Q(x)).
Step 3. Replace the near-injection Q : R^d → R with Σ_j h_j(x_j). (Figure: a grid with marks at 1, 2, 3, 4 and 2√2, 3√2, 4√2.)
Step 4. Replace f(Σ_j h_j(x_j)) with staggered versions Σ_i f_i(Σ_j h_{i,j}(x_j)); for any x ∈ [0, 1]^d, ≥ half are correct.
Step 5. Embed the solutions for infinitely many ǫ into one.
Main story.
◮ Can fit continuous functions in various ways; the size is bad ((d·Lip/ǫ)^O(d)).
◮ Composition and depth bring some concrete benefits; exponential reductions in width!
◮ Polynomials may be efficiently approximated, but also some non-polynomials (Sobolev balls, rational functions, flat regions, . . . ).

Remarks.
◮ Refined depth separations (e.g., a single new layer) and practical depth separations are still elusive.
◮ Refined, average-case complexity measures are elusive.
Elementary universal approximation. Classical universal approximation. Benefits of depth.
Sobolev spaces. Odds & ends.
Thanks. . . any questions?