
Approximation power of deep networks Matus Telgarsky - PowerPoint PPT Presentation

Approximation power of deep networks. Matus Telgarsky <mjt@illinois.edu> (with help from many friends!) Goal: in some prediction problem, replace f : R^d → R with a neural network g : R^d → R.


  1. Bumps via multiplication. (Plot: cos(x)^p for p = 1, 2, 4, 8, 16, 32 over x ∈ [−1.5, 1.5].) Univariate bump: cos(x)^p for large p. Multivariate bump: ∏_{i=1}^d cos(x_i)^p; note 1[ ‖x‖_∞ ≤ 1 ] = ∏_{i=1}^d 1[ |x_i| ≤ 1 ]. To remove the product: cos(x) cos(x) = (1/2) cos(2x) + 1/2, and more generally 2 cos(x_1) cos(x_2) = cos(x_1 + x_2) + cos(x_1 − x_2).
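
To make the bump construction concrete, here is a small numerical sketch (my own, not from the slides; the dimension d = 5 and sharpness p = 32 are arbitrary choices): the product of univariate cos(x_i)^p bumps is large only when every coordinate is near 0, i.e., it behaves like a multivariate bump.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 5, 32  # arbitrary dimension and bump sharpness

def bump(x):
    # Multivariate bump: product over coordinates of the univariate bump cos(x_i)^p.
    return np.prod(np.cos(x) ** p, axis=-1)

near_origin = rng.uniform(-0.1, 0.1, size=(1000, d))   # all coordinates small
far_from_it = rng.uniform(0.8, 1.2, size=(1000, d))    # all coordinates of size ~1
print(bump(near_origin).mean())  # O(1): the bump is "on"
print(bump(far_from_it).mean())  # essentially 0: the bump is "off"
```

The product-to-sum identities above are what let a network of cosine units realize such products without any explicit multiplication.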

  2. Weierstrass approximation theorem Theorem (Weierstrass, 1885). Polynomials can uniformly approximate continuous functions over compact sets.

  3. Weierstrass approximation theorem. Theorem (Weierstrass, 1885). Polynomials can uniformly approximate continuous functions over compact sets. Remarks. ◮ Not a consequence of interpolation: must control behavior between interpolation points. ◮ Proofs are interesting; e.g., Bernstein (Bernstein polynomials and tail bounds), Weierstrass (Gaussian smoothing gives analytic functions), ... ◮ Stone-Weierstrass theorem: polynomial-like function families (e.g., closed under multiplication) also approximate continuous functions.

  4. Theorem (Hornik-Stinchcombe-White ’89). Let σ : R → R be given with lim_{z→−∞} σ(z) = 0, lim_{z→+∞} σ(z) = 1, and define H_σ := { x ↦ σ(aᵀx − b) : (a, b) ∈ R^{d+1} }. Then span(H_σ) uniformly approximates continuous functions on compact sets.

  5. Theorem (Hornik-Stinchcombe-White ’89). Let σ : R → R be given with lim_{z→−∞} σ(z) = 0, lim_{z→+∞} σ(z) = 1, and define H_σ := { x ↦ σ(aᵀx − b) : (a, b) ∈ R^{d+1} }. Then span(H_σ) uniformly approximates continuous functions on compact sets. Proof #1. H_cos is closed under products since 2 cos(a) cos(b) = cos(a + b) + cos(a − b). Now uniformly approximate a fixed element of H_cos with span(H_σ). (Univariate fitting.) Proof #2. H_exp is closed under products since e^a e^b = e^{a+b}. Now uniformly approximate a fixed element of H_exp with span(H_σ). (Univariate fitting.)
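
For concreteness, the closure under products used in Proof #1 is just the product-to-sum identity applied to two generic elements of H_cos (my own spelling-out of the step):

```latex
\[
2\cos(a^\top x - b)\,\cos(c^\top x - e)
  = \cos\bigl((a+c)^\top x - (b+e)\bigr) + \cos\bigl((a-c)^\top x - (b-e)\bigr),
\]
```

so products of elements of span(H_cos) stay in span(H_cos), which is exactly the closure property Stone-Weierstrass needs.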

  6. Theorem (Hornik-Stinchcombe-White ’89). Let σ : R → R be given with lim_{z→−∞} σ(z) = 0, lim_{z→+∞} σ(z) = 1, and define H_σ := { x ↦ σ(aᵀx − b) : (a, b) ∈ R^{d+1} }. Then span(H_σ) uniformly approximates continuous functions on compact sets.

  7. Theorem (Hornik-Stinchcombe-White ’89). Let σ : R → R be given with lim_{z→−∞} σ(z) = 0, lim_{z→+∞} σ(z) = 1, and define H_σ := { x ↦ σ(aᵀx − b) : (a, b) ∈ R^{d+1} }. Then span(H_σ) uniformly approximates continuous functions on compact sets. Remarks. ◮ ReLU is fine: use σ(z) := σ_r(z) − σ_r(z − 1). ◮ Size estimate: expanding terms, one seems to get (Lip/ε)^{Ω(d)}. ◮ Best conditions on σ (Leshno-Lin-Pinkus-Schocken ’93): the theorem holds iff σ is not a polynomial. ◮ Hint about DN: no need for explicit multiplication?
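
A quick numerical check (my own sketch, not part of the deck) that the difference of two ReLUs in the first remark is a valid σ for the theorem: it is 0 far to the left, 1 far to the right, and ramps linearly in between.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigma(z):
    # sigma(z) = relu(z) - relu(z - 1): equals 0 for z <= 0, z on [0, 1], and 1 for z >= 1,
    # so lim_{z->-inf} sigma(z) = 0 and lim_{z->+inf} sigma(z) = 1, as the theorem requires.
    return relu(z) - relu(z - 1.0)

z = np.array([-100.0, -1.0, 0.0, 0.25, 0.5, 1.0, 100.0])
print(sigma(z))  # [0. 0. 0. 0.25 0.5 1. 1.]
```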

  8. Other proofs. ◮ (Cybenko ’89.) Assume for contradiction that you miss some functions. By duality, 0 = ∫ σ(aᵀx − b) dμ(x) for some nonzero signed measure μ and all (a, b). Using Fourier analysis, can show this implies μ = 0, a contradiction. ◮ (Leshno-Lin-Pinkus-Schocken ’93.) If σ is a polynomial, ...; else can (roughly) obtain derivatives of all orders, hence polynomials of all orders. ◮ (Barron ’93.) Consider the activation x ↦ exp(i aᵀx) and the infinite-width form ∫ exp(i aᵀx) f̂(a) da. Take the real part and sample (Maurey) to get g ∈ span(H_cos); convert to span(H_σ) as before. ◮ (Funahashi ’89.) Also Fourier, measure-theoretic.

  9. “Universal approximation” (uniform approximation of continuous functions on compact sets). ◮ Elementary proof: RBF (Mhaskar-Micchelli ’92; BJTX ’19). ◮ Slick proof: Stone-Weierstrass and H_cos or H_exp (Hornik-Stinchcombe-White ’89). ◮ Proof with size estimates beating (Lip/ε)^d, instead scaling with the norm of the Fourier transform of the gradient, related to a “sampling measure” (Barron ’93). Remarks. ◮ Exhibits nothing special about DN; indeed, the same proofs work for boosting, RBF SVM, ... ◮ Size estimates are huge (soon we’ll see d^{Ω(d)}). ◮ Proofs use nice representation “tricks” (e.g., Leshno et al.’s “iff not a polynomial”).

  10. Elementary universal approximation. Classical universal approximation. Benefits of depth. Sobolev spaces. Odds & ends.

  11. Radial functions are easy with two ReLU layers. Consider f(‖x‖_2^2) with Lipschitz constant Lip. ◮ Pick h(x) ≈_ε ‖x‖_2^2 = Σ_i x_i^2 with d · Lip/ε ReLUs.

  12. Radial functions are easy with two ReLU layers. Consider f(‖x‖_2^2) with Lipschitz constant Lip. ◮ Pick h(x) ≈_ε ‖x‖_2^2 = Σ_i x_i^2 with d · Lip/ε ReLUs. ◮ Pick g ≈_ε f with Lip/ε ReLUs; then |f(‖x‖_2^2) − g(h(x))| ≤ |f(‖x‖_2^2) − f(h(x))| + |f(h(x)) − g(h(x))| ≤ Lip · |‖x‖_2^2 − h(x)| + ε ≤ 2ε.

  13. Radial functions are easy with two ReLU layers. Consider f(‖x‖_2^2) with Lipschitz constant Lip. ◮ Pick h(x) ≈_ε ‖x‖_2^2 = Σ_i x_i^2 with d · Lip/ε ReLUs. ◮ Pick g ≈_ε f with Lip/ε ReLUs; then |f(‖x‖_2^2) − g(h(x))| ≤ |f(‖x‖_2^2) − f(h(x))| + |f(h(x)) − g(h(x))| ≤ Lip · |‖x‖_2^2 − h(x)| + ε ≤ 2ε. Remarks. ◮ Final size of g ∘ h is poly(Lip, d, 1/ε). ◮ Proof style is “typical”/lazy; (problematically) pays with the Lipschitz constant. ◮ That was easy/intuitive; how about the 2-layer case?...
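
Below is a rough numerical sketch of the two-stage construction (my own; the radial profile f = cos, the grid sizes, and the dimension are arbitrary choices). Piecewise-linear interpolants on uniform grids stand in for the two ReLU layers, since a univariate piecewise-linear function with m pieces is realizable by an m-ReLU layer: h sums per-coordinate approximations of x_i^2, and g approximates f on the range of h.

```python
import numpy as np

d = 10
f = np.cos  # example radial profile: the target is x -> f(||x||_2^2)

# h(x) ~ ||x||_2^2: sum over coordinates of a piecewise-linear interpolant of t -> t^2
# (a uniform-grid piecewise-linear function is what a small one-layer ReLU net computes).
grid = np.linspace(-1.0, 1.0, 65)
def h(x):
    return np.sum(np.interp(x, grid, grid ** 2), axis=-1)

# g ~ f on [0, d], the range of ||x||_2^2 for x in [-1, 1]^d.
tgrid = np.linspace(0.0, float(d), 257)
def g(t):
    return np.interp(t, tgrid, f(tgrid))

x = np.random.default_rng(0).uniform(-1.0, 1.0, size=(2000, d))
target = f(np.sum(x ** 2, axis=-1))
print(np.max(np.abs(target - g(h(x)))))  # small; shrinks as both grids are refined
```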

  14. Radial functions are not easy with only one ReLU layer (I). Theorem (Eldan-Shamir, 2015). There exists a radial function f, expressible with two ReLU layers of width poly(d), and a probability measure P so that every g with a single ReLU layer of width 2^{O(d)} satisfies ∫ (f(x) − g(x))^2 dP(x) ≥ Ω(1).

  15. Radial functions are not easy with only one ReLU layer (I). Theorem (Eldan-Shamir, 2015). There exists a radial function f, expressible with two ReLU layers of width poly(d), and a probability measure P so that every g with a single ReLU layer of width 2^{O(d)} satisfies ∫ (f(x) − g(x))^2 dP(x) ≥ Ω(1). Proof hints. Apply Fourier isometry and consider the transforms. The transform of g is supported on a small set of tubes; the transform of f has large mass they can’t reach.

  16. Radial functions are not easy with only one ReLU layer (II). Theorem (Daniely, 2017). Let (x, x′) ∼ P be uniform on two sphere surfaces, and define h(x, x′) = sin(π d^3 xᵀx′). For any g with a single ReLU layer of width d^{O(d)} and weight magnitudes O(2^d), ∫ (h(x, x′) − g(x, x′))^2 dP(x, x′) ≥ Ω(1), and h can be approximated to accuracy ε by f with two ReLU layers of size poly(d, 1/ε).

  17. Radial functions are not easy with only one ReLU layer (II). Theorem (Daniely, 2017). Let (x, x′) ∼ P be uniform on two sphere surfaces, and define h(x, x′) = sin(π d^3 xᵀx′). For any g with a single ReLU layer of width d^{O(d)} and weight magnitudes O(2^d), ∫ (h(x, x′) − g(x, x′))^2 dP(x, x′) ≥ Ω(1), and h can be approximated to accuracy ε by f with two ReLU layers of size poly(d, 1/ε). Proof hints. Spherical harmonics reduce this to a univariate problem; apply region counting.

  18. Approximation of high-dimensional radial functions (A radial function contour plot.) If we can approximate each shell, we can approximate the overall function.

  19. Approximation of a high-dimensional radial shell. Let’s approximate a single shell; consider x ↦ 1[ ‖x‖ ∈ [1 − 1/d, 1] ], which has a constant fraction of the sphere’s volume.

  20. Approximation of a high-dimensional radial shell. Let’s approximate a single shell; consider x ↦ 1[ ‖x‖ ∈ [1 − 1/d, 1] ], which has a constant fraction of the sphere’s volume. Can’t cut too deeply; get bad error on the inner zero part...

  21. Approximation of a high-dimensional radial shell. Let’s approximate a single shell; consider x ↦ 1[ ‖x‖ ∈ [1 − 1/d, 1] ], which has a constant fraction of the sphere’s volume. Can’t cut too deeply; get bad error on the inner zero part... but then we need to cover exponentially many caps.

  22. Let’s go back to the drawing board; what do shallow representations do exceptionally badly?

  23. Let’s go back to the drawing board; what do shallow representations do exceptionally badly? One weakness: their complexity scales with #bumps.

  24. Consider the tent map ∆(x) := σ_r(2x) − σ_r(4x − 2) = { 2x, x ∈ [0, 1/2); 2(1 − x), x ∈ [1/2, 1] }. (Plot: ∆.)

  25. Consider the tent map ∆(x) := σ_r(2x) − σ_r(4x − 2) = { 2x, x ∈ [0, 1/2); 2(1 − x), x ∈ [1/2, 1] }. (Plot: ∆.) What is the effect of composition? x ∈ [0, 1/2) ⇒ f(∆(x)) = f(2x): f squeezed into [0, 1/2]; x ∈ [1/2, 1] ⇒ f(∆(x)) = f(2(1 − x)): f reversed and squeezed.

  26. Consider the tent map ∆(x) := σ_r(2x) − σ_r(4x − 2) = { 2x, x ∈ [0, 1/2); 2(1 − x), x ∈ [1/2, 1] }. (Plots: ∆ and ∆^2 = ∆ ∘ ∆.) What is the effect of composition? x ∈ [0, 1/2) ⇒ f(∆(x)) = f(2x): f squeezed into [0, 1/2]; x ∈ [1/2, 1] ⇒ f(∆(x)) = f(2(1 − x)): f reversed and squeezed.

  27. Consider the tent map ∆(x) := σ_r(2x) − σ_r(4x − 2) = { 2x, x ∈ [0, 1/2); 2(1 − x), x ∈ [1/2, 1] }. (Plots: ∆, ∆^2 = ∆ ∘ ∆, and ∆^k.) What is the effect of composition? x ∈ [0, 1/2) ⇒ f(∆(x)) = f(2x): f squeezed into [0, 1/2]; x ∈ [1/2, 1] ⇒ f(∆(x)) = f(2(1 − x)): f reversed and squeezed.

  28. Consider the tent map ∆(x) := σ_r(2x) − σ_r(4x − 2) = { 2x, x ∈ [0, 1/2); 2(1 − x), x ∈ [1/2, 1] }. (Plots: ∆, ∆^2 = ∆ ∘ ∆, and ∆^k.) What is the effect of composition? x ∈ [0, 1/2) ⇒ f(∆(x)) = f(2x): f squeezed into [0, 1/2]; x ∈ [1/2, 1] ⇒ f(∆(x)) = f(2(1 − x)): f reversed and squeezed. ∆^k uses O(k) layers & nodes, but has O(2^k) bumps.
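
A tiny sketch (mine, not from the slides) of the key fact: composing the tent map k times is a ReLU network with O(k) layers and constant width, yet it has 2^{k−1} bumps; the bump count is read off numerically from local maxima on a fine grid.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    # Delta(x) = relu(2x) - relu(4x - 2): 2x on [0, 1/2), 2(1 - x) on [1/2, 1].
    return relu(2.0 * x) - relu(4.0 * x - 2.0)

def tent_k(x, k):
    # k-fold composition Delta^k: k ReLU layers, each of constant width.
    for _ in range(k):
        x = tent(x)
    return x

k = 6
x = np.linspace(0.0, 1.0, 200001)
y = tent_k(x, k)
peaks = int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))  # count local maxima ("bumps")
print(peaks, 2 ** (k - 1))  # 32 bumps from only k = 6 compositions
```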

  29. Theorem (T ’15). Let #layers k ≥ 1 be given.

  30. Theorem (T ’15). Let #layers k ≥ 1 be given. Exists ReLU network f : [0, 1] → [0, 1] with 4 distinct parameters, 3k^2 + 9 nodes, 2k^2 + 6 layers,

  31. Theorem (T ’15). Let #layers k ≥ 1 be given. Exists ReLU network f : [0, 1] → [0, 1] with 4 distinct parameters, 3k^2 + 9 nodes, 2k^2 + 6 layers, such that every ReLU network g : R^d → R with ≤ k layers, ≤ 2^k nodes

  32. Theorem (T ’15). Let #layers k ≥ 1 be given. Exists ReLU network f : [0, 1] → [0, 1] with 4 distinct parameters, 3k^2 + 9 nodes, 2k^2 + 6 layers, such that every ReLU network g : R^d → R with ≤ k layers, ≤ 2^k nodes satisfies ∫_{[0,1]} |f(x) − g(x)| dx ≥ 1/32.

  33. Theorem (T ’15). Let #layers k ≥ 1 be given. Exists ReLU network f : [0, 1] → [0, 1] with 4 distinct parameters, 3k^2 + 9 nodes, 2k^2 + 6 layers, such that every ReLU network g : R^d → R with ≤ k layers, ≤ 2^k nodes satisfies ∫_{[0,1]} |f(x) − g(x)| dx ≥ 1/32. Proof. 1. g with few oscillations can’t apx an oscillatory, regular f. 2. There exists a regular, oscillatory f. (f = ∆^{k^2+3}.) 3. Width m, depth L ⇒ few (O(m^L)) oscillations. Rediscovered many times; (T ’15) gives an elementary univariate argument; multivariate arguments in (Warren ’68), (Arnold ?), (Montufar, Pascanu, Cho, Bengio ’14), (BT ’18), ...

  34. g with few oscillations; f highly oscillatory ⇒ ∫_{[0,1]} |g − f| large. (Plot: f and g.)

  35. g with few oscillations; f highly oscillatory ⇒ ∫_{[0,1]} |g − f| large. (Plot: f and g.)

  36. g with few oscillations; f highly oscillatory ⇒ (?) ∫_{[0,1]} |g − f| large. (Plot: g and f.)

  37. g with few oscillations; f highly oscillatory, regular ⇒ ∫_{[0,1]} |g − f| large. (Plot: g and f.) Let’s use f = ∆^{k^2+3}.

  38. g with few oscillations; f highly oscillatory, regular ⇒ ∫_{[0,1]} |g − f| large. (Plot: g and f.) Let’s use f = ∆^{k^2+3}.

  39. g with few oscillations; f highly oscillatory, regular ⇒ ∫_{[0,1]} |g − f| large. (Plot: g and f.) Let’s use f = ∆^{k^2+3}.

  40. g with few oscillations; f highly oscillatory, regular ⇒ ∫_{[0,1]} |g − f| large. (Plot: g and f.) Let’s use f = ∆^{k^2+3}.

  41. g with few oscillations; f highly oscillatory, regular ⇒ ∫_{[0,1]} |g − f| large. (Plot: g and f.) Let’s use f = ∆^{k^2+3}.

  42. Story from benefits of depth: ◮ Certain radial functions have a polynomial-width representation with 2 ReLU layers, but need exponential width with 1 ReLU layer. ◮ ∆^{k^2+3} can be written with O(k^2) depth and O(1) width, but requires width Ω(2^k) if the depth is O(k).

  43. Story from benefits of depth: ◮ Certain radial functions have a polynomial-width representation with 2 ReLU layers, but need exponential width with 1 ReLU layer. ◮ ∆^{k^2+3} can be written with O(k^2) depth and O(1) width, but requires width Ω(2^k) if the depth is O(k). Remarks. ◮ ∆^k is 2^k-Lipschitz; possibly nonsensical, unrealistic. ◮ These results have stood for a few years now; many “technical” questions remain, and also “realistic” questions.

  44. Elementary universal approximation. Classical universal approximation. Benefits of depth. Sobolev spaces. Odds & ends.

  45. h_k := piecewise-affine interpolation of x^2 at {0, 1/2^k, 2/2^k, ..., 2^k/2^k}. (Plot: h_1.)

  46. h_k := piecewise-affine interpolation of x^2 at {0, 1/2^k, 2/2^k, ..., 2^k/2^k}. (Plots: h_1 and h_2.)

  47. h_k := piecewise-affine interpolation of x^2 at {0, 1/2^k, 2/2^k, ..., 2^k/2^k}. (Plots: h_1, h_2, and h_1 − h_2.)

  48. h_k := piecewise-affine interpolation of x^2 at {0, 1/2^k, 2/2^k, ..., 2^k/2^k}. (Plots: h_1, h_2, and h_1 − h_2.)

  49. h_k := piecewise-affine interpolation of x^2 at {0, 1/2^k, 2/2^k, ..., 2^k/2^k}. (Plots: h_1, h_2, and h_1 − h_2.)

  50. h_k := piecewise-affine interpolation of x^2 at {0, 1/2^k, 2/2^k, ..., 2^k/2^k}. (Plots: h_1, h_2, and h_1 − h_2.) Thus h_k(x) = x − Σ_{i≤k} ∆^i(x)/4^i.

  51. h_k := piecewise-affine interpolation of x^2 at {0, 1/2^k, 2/2^k, ..., 2^k/2^k}. (Plots: h_1 and h_2.) Thus h_k(x) = x − Σ_{i≤k} ∆^i(x)/4^i.

  52. h_k := piecewise-affine interpolation of x^2 at {0, 1/2^k, 2/2^k, ..., 2^k/2^k}. (Plots: h_1 and h_2.) Thus h_k(x) = x − Σ_{i≤k} ∆^i(x)/4^i. ◮ h_k needs k = O(ln(1/ε)) to ε-approximate x ↦ x^2 (Yarotsky ’16), with matching lower bounds. ◮ Squaring implies multiplication via polarization: xᵀy = (‖x + y‖^2 − ‖x‖^2 − ‖y‖^2)/2.

  53. h_k := piecewise-affine interpolation of x^2 at {0, 1/2^k, 2/2^k, ..., 2^k/2^k}. (Plots: h_1 and h_2.) Thus h_k(x) = x − Σ_{i≤k} ∆^i(x)/4^i. ◮ h_k needs k = O(ln(1/ε)) to ε-approximate x ↦ x^2 (Yarotsky ’16), with matching lower bounds. ◮ Squaring implies multiplication via polarization: xᵀy = (‖x + y‖^2 − ‖x‖^2 − ‖y‖^2)/2. ◮ This implies efficient approximation of polynomials; can we do more?
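
To close the loop, a short check (my own; the k values are arbitrary) that h_k really approximates squaring at the claimed rate, and that the scalar version of the polarization identity then gives multiplication from three squarings.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    return relu(2.0 * x) - relu(4.0 * x - 2.0)

def h(x, k):
    # h_k(x) = x - sum_{i <= k} Delta^i(x) / 4^i, a ReLU network approximating x^2 on [0, 1].
    out = np.asarray(x, dtype=float).copy()
    t = np.asarray(x, dtype=float).copy()
    for i in range(1, k + 1):
        t = tent(t)
        out = out - t / 4.0 ** i
    return out

x = np.linspace(0.0, 1.0, 10001)
for k in (2, 4, 6, 8):
    print(k, np.max(np.abs(h(x, k) - x ** 2)), 4.0 ** -(k + 1))  # error matches 4^-(k+1)

def approx_mul(a, b, k=10):
    # Polarization (scalar case): a*b = ((a+b)^2 - a^2 - b^2) / 2, each square via h_k.
    # Inputs assumed in [0, 1]; rescale by 2 so that a + b lands back in [0, 1].
    def sq(t):
        return 4.0 * h(t / 2.0, k)
    return (sq(a + b) - sq(a) - sq(b)) / 2.0

print(approx_mul(0.3, 0.7), 0.3 * 0.7)  # ~0.21
```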
