  1. Deep Approximation via Deep Learning. Zuowei Shen, Department of Mathematics, National University of Singapore.

  2. Outline: (1) Introduction of approximation theory; (2) Approximation of functions by compositions; (3) Approximation rate in terms of the number of neurons.

  3. Outline: (1) Introduction of approximation theory; (2) Approximation of functions by compositions; (3) Approximation rate in terms of the number of neurons.

  4. A brief introduction. For a given function f : R^d → R and ε > 0, approximation is to find a simple function g such that ‖f − g‖ < ε.

  5. A brief introduction. For a given function f : R^d → R and ε > 0, approximation is to find a simple function g such that ‖f − g‖ < ε. The function g : R^n → R can be as simple as g(x) = a · x. To make sense of this approximation, we need to find a map T : R^d → R^n such that ‖f − g ∘ T‖ < ε.

  6. A brief introduction. For a given function f : R^d → R and ε > 0, approximation is to find a simple function g such that ‖f − g‖ < ε. The function g : R^n → R can be as simple as g(x) = a · x. To make sense of this approximation, we need to find a map T : R^d → R^n such that ‖f − g ∘ T‖ < ε. In practice, we only have sample data {(x_i, f(x_i))}_{i=1}^m of f, so one needs to develop algorithms to find T.

  7. A brief introduction. For a given function f : R^d → R and ε > 0, approximation is to find a simple function g such that ‖f − g‖ < ε. The function g : R^n → R can be as simple as g(x) = a · x. To make sense of this approximation, we need to find a map T : R^d → R^n such that ‖f − g ∘ T‖ < ε. In practice, we only have sample data {(x_i, f(x_i))}_{i=1}^m of f, so one needs to develop algorithms to find T.
     (1) Classical approximation: T is independent of f and of the data, while n depends on ε.
     (2) Learning: T is learned from the data and determined by a few parameters; n depends on ε.
     (3) Deep learning: T is fully learned from the data with a huge number of parameters; T is a composition of many simple maps, and n can be independent of ε.
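
A minimal sketch of the classical setting on this slide (the target f, the monomial feature map T, and the sample grid are illustrative assumptions, not taken from the talk): T is fixed in advance, and only the coefficient vector a of g(x) = a · x is fitted by least squares to the samples {(x_i, f(x_i))}.

```python
import numpy as np

# Minimal sketch (illustrative choices, not from the slides): approximate
# f : R -> R from samples {(x_i, f(x_i))} by g(T(x)), where g(x) = a . x and
# T is a *fixed* feature map, i.e. the classical/linear setting.

def f(x):                        # illustrative target function
    return np.sin(2 * np.pi * x)

def T(x, n=8):                   # fixed feature map T : R -> R^n (monomial features)
    return np.stack([x**k for k in range(n)], axis=-1)

# sample data {(x_i, f(x_i))}, i = 1..m
x = np.linspace(0.0, 1.0, 50)
y = f(x)

# find a in R^n by least squares so that g(T(x)) = a . T(x) ~ f(x)
a, *_ = np.linalg.lstsq(T(x), y, rcond=None)

approx = T(x) @ a
print("max error on the samples:", np.max(np.abs(approx - y)))
```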

  8. Classical approximation. Linear approximation: given a finite fixed set of generators {φ_1, ..., φ_n}, e.g. splines, wavelet frames, finite elements, or generators in reproducing kernel Hilbert spaces, define T = [φ_1, φ_2, ..., φ_n]^⊤ : R^d → R^n and g(x) = a · x. The linear approximation is to find a ∈ R^n such that g ∘ T = Σ_{i=1}^n a_i φ_i ∼ f. It is linear because f_1 ∼ g_1, f_2 ∼ g_2 ⇒ f_1 + f_2 ∼ g_1 + g_2.

  9. Classical approximation. Linear approximation: given a finite fixed set of generators {φ_1, ..., φ_n}, e.g. splines, wavelet frames, finite elements, or generators in reproducing kernel Hilbert spaces, define T = [φ_1, φ_2, ..., φ_n]^⊤ : R^d → R^n and g(x) = a · x. The linear approximation is to find a ∈ R^n such that g ∘ T = Σ_{i=1}^n a_i φ_i ∼ f. It is linear because f_1 ∼ g_1, f_2 ∼ g_2 ⇒ f_1 + f_2 ∼ g_1 + g_2. The best n-term approximation: given a dictionary D that can have infinitely many generators, e.g. D = {φ_i}_{i=1}^∞, define T = [φ_1, φ_2, ...]^⊤ : R^d → R^∞ and g(x) = a · x. The best n-term approximation of f is to find a with n nonzero terms such that g ∘ T ∼ f is the best approximation among all n-term choices. It is nonlinear because f_1 ∼ g_1, f_2 ∼ g_2 does not imply f_1 + f_2 ∼ g_1 + g_2, as the supports of a_1 and a_2 depend on f_1 and f_2.
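
A small worked example (not on the slide) of this nonlinearity: with n = 1, take f_1 = φ_1 + 0.1 φ_3 and f_2 = φ_2 + 0.1 φ_4. The best 1-term approximations are g_1 ∘ T = φ_1 and g_2 ∘ T = φ_2, but the best 1-term approximation of f_1 + f_2 keeps only one of φ_1, φ_2, so it cannot equal g_1 ∘ T + g_2 ∘ T; the selected support changes with the target function.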

  10. Examples. Consider the function space L^2(R^d), and let {φ_i}_{i=1}^∞ be an orthonormal basis of L^2(R^d).

  11. Examples. Consider the function space L^2(R^d), and let {φ_i}_{i=1}^∞ be an orthonormal basis of L^2(R^d). Linear approximation: for a given n, T = [φ_1, ..., φ_n]^⊤ and g(x) = a · x, where a_j = ⟨f, φ_j⟩. Denote H = span{φ_1, ..., φ_n} ⊆ L^2(R^d). Then g ∘ T = Σ_{i=1}^n ⟨f, φ_i⟩ φ_i is the orthogonal projection of f onto the space H and is the best approximation of f from the space H.

  12. Examples. Consider the function space L^2(R^d), and let {φ_i}_{i=1}^∞ be an orthonormal basis of L^2(R^d). Linear approximation: for a given n, T = [φ_1, ..., φ_n]^⊤ and g(x) = a · x, where a_j = ⟨f, φ_j⟩. Denote H = span{φ_1, ..., φ_n} ⊆ L^2(R^d). Then g ∘ T = Σ_{i=1}^n ⟨f, φ_i⟩ φ_i is the orthogonal projection of f onto the space H and is the best approximation of f from the space H. g ∘ T provides a good approximation of f when the sequence {⟨f, φ_j⟩}_{j=1}^∞ decays fast as j → +∞.

  13. Examples. Consider the function space L^2(R^d), and let {φ_i}_{i=1}^∞ be an orthonormal basis of L^2(R^d). Linear approximation: for a given n, T = [φ_1, ..., φ_n]^⊤ and g(x) = a · x, where a_j = ⟨f, φ_j⟩. Denote H = span{φ_1, ..., φ_n} ⊆ L^2(R^d). Then g ∘ T = Σ_{i=1}^n ⟨f, φ_i⟩ φ_i is the orthogonal projection of f onto the space H and is the best approximation of f from the space H. g ∘ T provides a good approximation of f when the sequence {⟨f, φ_j⟩}_{j=1}^∞ decays fast as j → +∞. Therefore:
     (1) Linear approximation provides a good approximation for smooth functions.
     (2) Advantage: it is a good approximation scheme when d is small, the domain is simple, and the function form is complicated but smooth.
     (3) Disadvantage: it does not do well if d is big and/or the domain of f is complex.
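
A minimal sketch of this orthogonal projection, under assumed choices (the orthonormal cosine basis of L^2([0, 1]), an illustrative target f, and numerical integration for the inner products a_j = ⟨f, φ_j⟩; none of these specifics are from the talk):

```python
import numpy as np

# Minimal sketch (assumed setup, not from the slides): linear approximation of
# f in L^2([0, 1]) by orthogonal projection onto the span of the first n
# elements of the orthonormal cosine basis phi_0 = 1, phi_j = sqrt(2) cos(pi j x).

def f(x):
    return np.abs(x - 0.3)          # an illustrative, only mildly smooth, target

def phi(j, x):
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

x = np.linspace(0.0, 1.0, 4001)     # grid used to approximate the inner products
n = 10

# a_j = <f, phi_j>, computed by numerical integration (trapezoidal rule)
a = np.array([np.trapz(f(x) * phi(j, x), x) for j in range(n)])

# g(T(x)) = sum_j a_j phi_j(x): the orthogonal projection of f onto H
projection = sum(a[j] * phi(j, x) for j in range(n))
print("L2 error of the projection:", np.sqrt(np.trapz((f(x) - projection) ** 2, x)))
```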

  14. Examples. The best n-term approximation: T = (φ_j)_{j=1}^∞ : R^d → R^∞ and g(x) = a · x, where each a_j is given by a_j = ⟨f, φ_j⟩ for the n largest terms in the sequence {|⟨f, φ_j⟩|}_{j=1}^∞, and a_j = 0 otherwise.

  15. Examples. The best n-term approximation: T = (φ_j)_{j=1}^∞ : R^d → R^∞ and g(x) = a · x, where each a_j is given by a_j = ⟨f, φ_j⟩ for the n largest terms in the sequence {|⟨f, φ_j⟩|}_{j=1}^∞, and a_j = 0 otherwise. The approximation of f by g ∘ T depends less on the decay of the sequence {|⟨f, φ_j⟩|}_{j=1}^∞. Therefore:
     (1) The best n-term approximation is better than the linear approximation when f is nonsmooth.
     (2) It is not a good scheme if d is big and/or the domain of f is complex.
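
The best n-term rule can be sketched in the same assumed setup (again not from the talk), keeping only the n coefficients of largest magnitude among many candidates instead of the first n:

```python
import numpy as np

# Minimal sketch (assumed setup): best n-term approximation in the same cosine
# basis, keeping only the n coefficients <f, phi_j> of largest magnitude among
# the first N computed, instead of the first n as in linear approximation.

def f(x):
    return np.abs(x - 0.3)

def phi(j, x):
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

x = np.linspace(0.0, 1.0, 4001)
N, n = 200, 10                                   # candidate terms, kept terms

coeffs = np.array([np.trapz(f(x) * phi(j, x), x) for j in range(N)])
keep = np.argsort(np.abs(coeffs))[-n:]           # indices of the n largest |<f, phi_j>|

a = np.zeros(N)
a[keep] = coeffs[keep]                           # a_j = <f, phi_j> on the kept support, 0 otherwise
best_n_term = sum(a[j] * phi(j, x) for j in keep)
print("L2 error of the best n-term approximation:",
      np.sqrt(np.trapz((f(x) - best_n_term) ** 2, x)))
```

For the same budget n, the retained support adapts to f, which is exactly why the scheme is nonlinear and why it copes better with nonsmooth targets.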

  16. Approximation for deep learning. Given data {(x_i, f(x_i))}_{i=1}^m:
     (1) The key of deep learning is to construct a T from the given data and a chosen g.

  17. Approximation for deep learning. Given data {(x_i, f(x_i))}_{i=1}^m:
     (1) The key of deep learning is to construct a T from the given data and a chosen g.
     (2) T can simplify the domain of f through a change of variables while keeping the key features of the domain of f, so that

  18. Approximation for deep learning. Given data {(x_i, f(x_i))}_{i=1}^m:
     (1) The key of deep learning is to construct a T from the given data and a chosen g.
     (2) T can simplify the domain of f through a change of variables while keeping the key features of the domain of f, so that
     (3) it is robust to approximate f by g ∘ T.
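
A minimal PyTorch sketch of this idea (the architecture, target function, and optimizer settings are assumptions for illustration only, not the construction in the talk): T is a learned composition of simple maps, g is a linear readout, and both are fitted to the samples {(x_i, f(x_i))}.

```python
import torch
from torch import nn

# Minimal sketch (assumed architecture and training setup, not the authors'
# construction): T is a composition of simple maps learned from the samples
# {(x_i, f(x_i))}, and g(x) = a . x is a linear readout on top of T.

def f(x):                                        # illustrative target on R^2
    return torch.sin(3 * x[:, 0]) * torch.cos(2 * x[:, 1])

d, n, m = 2, 16, 512
x = torch.rand(m, d)                             # sample points x_i
y = f(x)                                         # values f(x_i)

T = nn.Sequential(                               # T : R^d -> R^n, a composition of simple maps
    nn.Linear(d, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, n),
)
g = nn.Linear(n, 1, bias=False)                  # g(x) = a . x

opt = torch.optim.Adam(list(T.parameters()) + list(g.parameters()), lr=1e-2)
for step in range(2000):                         # fit g o T to the samples
    opt.zero_grad()
    loss = nn.functional.mse_loss(g(T(x)).squeeze(-1), y)
    loss.backward()
    opt.step()

print("final training MSE of g o T:", loss.item())
```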

  19. Classical approximation vs deep learning. For both the linear and the best n-term approximations, T is fixed. Neither of them is suitable for approximating f when f is defined on a complex domain, e.g. a manifold in a very high dimensional space.

  20. Classical approximation vs deep learning. For both the linear and the best n-term approximations, T is fixed. Neither of them is suitable for approximating f when f is defined on a complex domain, e.g. a manifold in a very high dimensional space. For deep learning, T is constructed from and adapted to the given data. T changes variables and maps the domain of f to match that of a simple function g. It is normally used to approximate f with a complex domain.

  21. Classical approximation vs deep learning. For both the linear and the best n-term approximations, T is fixed. Neither of them is suitable for approximating f when f is defined on a complex domain, e.g. a manifold in a very high dimensional space. For deep learning, T is constructed from and adapted to the given data. T changes variables and maps the domain of f to match that of a simple function g. It is normally used to approximate f with a complex domain. What is the mathematics behind this? Setting: construct a measurable map T : R^d → R^n and a simple function g (e.g. g = a · x) from data such that the features of the domain of f can be rearranged by T to match those of g, so that g ∘ T provides a good approximation of f.
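
A toy illustration of the change of variables, under assumed choices (the unit circle as the complex domain and a hand-written T rather than a learned one; not from the talk): f lives on a 1-D manifold embedded in R^2, T maps each point to its angle, and a simple one-variable g then matches f exactly.

```python
import numpy as np

# Minimal sketch (assumed example, not from the slides) of "T changes variables":
# f lives on the unit circle, a 1-D manifold embedded in R^2.  A hand-crafted
# T maps each point to its angle, after which a simple one-variable rule g
# suffices; in deep learning such a T would instead be learned from data.

theta = np.random.uniform(0.0, 2 * np.pi, size=1000)
points = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # samples on the circle in R^2

def f(p):                                # target defined on the circle
    return np.cos(3 * np.arctan2(p[:, 1], p[:, 0]))

def T(p):                                # change of variables: circle -> angle in R
    return np.arctan2(p[:, 1], p[:, 0])

def g(t):                                # simple function of the new variable
    return np.cos(3 * t)

print("max |f - g o T| on the samples:", np.max(np.abs(f(points) - g(T(points)))))
```

In deep learning, the map T playing this role would be learned from the samples rather than written down by hand.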

  22. Outline: (1) Introduction of approximation theory; (2) Approximation of functions by compositions; (3) Approximation rate in terms of the number of neurons.

  23. Approximation by compositions (with Qianxiao Li and Cheng Tai). Question 1: for given f and g, is there a measurable T : R^d → R^n such that f = g ∘ T?

  24. Approximation by compositions (with Qianxiao Li and Cheng Tai). Question 1: for given f and g, is there a measurable T : R^d → R^n such that f = g ∘ T? Answer: yes! We have proven the following. Theorem: let f : R^d → R and g : R^n → R, and assume Im(f) ⊆ Im(g) and g is continuous. Then there exists a measurable map T : R^d → R^n such that f = g ∘ T a.e. This is an existence proof; T cannot be written out analytically. This leads to the following relaxed question.
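
A one-dimensional illustration (not from the slides) of why such a T can exist: take g = sin, so Im(g) = [−1, 1]. For any measurable f : R^d → R with values in [−1, 1], the map T(x) = arcsin(f(x)) is measurable and g ∘ T = sin(arcsin(f)) = f everywhere. In general, one can think of the statement as requiring a measurable selection of preimages under g, which typically has no closed form; this matches the slide's remark that T cannot be written out analytically.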

  25. Approximation by compositions. Question 2: for an arbitrarily given ε > 0, can one construct a measurable T : R^d → R^n such that ‖f − g ∘ T‖ ≤ ε?
