Approximation and Non-parametric Estimation of ResNet-type Convolutional Neural Networks - PowerPoint PPT Presentation



  1. Poster: 13th June, Pacific Ballroom #77 (Paper Link). Approximation and Non-parametric Estimation of ResNet-type Convolutional Neural Networks. Kenta Oono 1,2, Taiji Suzuki 1,3. {kenta_oono, taiji}@mist.i.u-tokyo.ac.jp. 1. The University of Tokyo 2. Preferred Networks, Inc. 3. RIKEN AIP. Thirty-sixth International Conference on Machine Learning (ICML 2019), June 13th 2019, Long Beach, CA, U.S.A.

  2. Key Takeaway Q. Why do ResNet-type CNNs work well?

  3. Key Takeaway Q. Why do ResNet-type CNNs work well? A. Hidden sparse structure promotes good performance.

  4. Problem Setting We consider a non-parametric regression problem: y = f°(x) + ξ. f°: true function (e.g., Hölder, Barron, Besov class); ξ: Gaussian noise.

  5. Problem Setting We consider a non-parametric regression problem: y = f°(x) + ξ. f°: true function (e.g., Hölder, Barron, Besov class); ξ: Gaussian noise. Given n i.i.d. samples, we pick an estimator f̂ from the hypothesis class ℱ, which is a set of functions realized by CNNs with a specified architecture.

  6. Problem Setting We consider a non-parametric regression problem: y = f°(x) + ξ. f°: true function (e.g., Hölder, Barron, Besov class); ξ: Gaussian noise. Given n i.i.d. samples, we pick an estimator f̂ from the hypothesis class ℱ, which is a set of functions realized by CNNs with a specified architecture. Goal: evaluate the estimation error ℛ(f̂) := E_X |f̂(X) − f°(X)|².
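To make the setting concrete, here is a minimal toy sketch (not from the slides or the paper) of the regression model and of evaluating ℛ(f̂). The true function, the noise level, and the stand-in hypothesis class (polynomials fitted by least squares, rather than the CNN classes ℱ studied in the paper) are all illustrative placeholders.

import numpy as np

rng = np.random.default_rng(0)

# True (unknown) regression function f°; any smooth toy function will do here.
def f_true(x):
    return np.sin(2 * np.pi * x)

# n i.i.d. samples from the model y = f°(x) + ξ, with Gaussian noise ξ.
n = 200
x = rng.uniform(0.0, 1.0, size=n)
y = f_true(x) + 0.1 * rng.standard_normal(n)

# Stand-in hypothesis class: degree-9 polynomials fitted by least squares.
# (The paper's ℱ is a class of CNNs instead; this is only for illustration.)
f_hat = np.polynomial.Polynomial.fit(x, y, deg=9)

# Estimation error ℛ(f̂) = E_X |f̂(X) − f°(X)|², approximated by Monte Carlo on fresh inputs.
x_fresh = rng.uniform(0.0, 1.0, size=100_000)
risk = np.mean((f_hat(x_fresh) - f_true(x_fresh)) ** 2)
print(f"estimated risk R(f_hat) ~= {risk:.5f}")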

  7. Prior Work ℛ(f̂) ≾ inf_{f∈ℱ} ∥f − f°∥² + Õ(L_ℱ / n) (first term: approximation error; second term: model complexity). n: sample size. ℱ: set of functions realizable by CNNs with a specified architecture. f°: true function (e.g., Hölder, Barron, Besov, etc.). Õ(⋅): O-notation ignoring logarithmic terms.
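For readers of this transcript, the same bound written out in LaTeX, under the standard assumption (not stated explicitly on this slide) that f̂ is the least-squares estimator over ℱ; constants and logarithmic factors are suppressed:

\[
\widehat{f} \in \operatorname*{arg\,min}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2,
\qquad
\mathcal{R}(\widehat{f}) \;\lesssim\;
\underbrace{\inf_{f \in \mathcal{F}} \lVert f - f^{\circ} \rVert^{2}}_{\text{approximation error}}
\;+\;
\underbrace{\widetilde{O}\!\left( \frac{L_{\mathcal{F}}}{n} \right)}_{\text{model complexity}}
\]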

  8. Prior Work ℛ(f̂) ≾ inf_{f∈ℱ} ∥f − f°∥² + Õ(L_ℱ / n) (approximation error + model complexity).
  | CNN type | Parameter size L_ℱ | Minimax optimality | Discrete optimization |
  | General  | # of all weights   | Sub-optimal ☹      | –                     |
  n: sample size. ℱ: set of functions realizable by CNNs with a specified architecture. f°: true function (e.g., Hölder, Barron, Besov, etc.). Õ(⋅): O-notation ignoring logarithmic terms.

  9. Prior Work ℛ(f̂) ≾ inf_{f∈ℱ} ∥f − f°∥² + Õ(L_ℱ / n) (approximation error + model complexity).
  | CNN type | Parameter size L_ℱ    | Minimax optimality | Discrete optimization |
  | General  | # of all weights      | Sub-optimal ☹      | –                     |
  | Sparse*  | # of non-zero weights | Optimal ☺          | Needed ☹              |
  * e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]
  n: sample size. ℱ: set of functions realizable by CNNs with a specified architecture. f°: true function (e.g., Hölder, Barron, Besov, etc.). Õ(⋅): O-notation ignoring logarithmic terms.

  10. Prior Work ℛ(f̂) ≾ inf_{f∈ℱ} ∥f − f°∥² + Õ(L_ℱ / n) (approximation error + model complexity).
  | CNN type | Parameter size L_ℱ    | Minimax optimality | Discrete optimization |
  | General  | # of all weights      | Sub-optimal ☹      | –                     |
  | Sparse*  | # of non-zero weights | Optimal ☺          | Needed ☹              |
  | ResNet   | # of all weights      | Optimal ☺          | Not needed ☺          |
  * e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]
  n: sample size. ℱ: set of functions realizable by CNNs with a specified architecture. f°: true function (e.g., Hölder, Barron, Besov, etc.). Õ(⋅): O-notation ignoring logarithmic terms.

  11. Contribution ResNet-type CNNs can achieve minimax-optimal rates without unrealistic constraints.
  | CNN type | Parameter size L_ℱ    | Minimax optimality | Discrete optimization |
  | General  | # of all weights      | Sub-optimal ☹      | –                     |
  | Sparse*  | # of non-zero weights | Optimal ☺          | Needed ☹              |
  | ResNet   | # of all weights      | Optimal ☺          | Not needed ☺          |
  * e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]

  12. Contribution ResNet-type CNNs can achieve minimax-optimal rates without unrealistic constraints.
  | CNN type | Parameter size L_ℱ    | Minimax optimality | Discrete optimization |
  | General  | # of all weights      | Sub-optimal ☹      | –                     |
  | Sparse*  | # of non-zero weights | Optimal ☺          | Needed ☹              |
  | ResNet   | # of all weights      | Optimal ☺          | Not needed ☺          |
  * e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]
  Key Observation: Known optimal FNNs have block-sparse structures.
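As a rough illustration of what "ResNet-type CNN" refers to on these slides, here is a minimal PyTorch sketch of a stack of convolutional residual blocks followed by a single fully-connected layer. The class names, channel counts, kernel sizes, and depths are my own placeholder choices, not the exact architecture analyzed in the paper.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual block: a small convolutional subnetwork plus an identity skip connection."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
        )

    def forward(self, x):
        # Identity skip connection characteristic of ResNet-type CNNs.
        return x + self.body(x)

class ResNetTypeCNN(nn.Module):
    """Stack of residual conv blocks followed by one final fully-connected layer."""
    def __init__(self, channels: int = 8, num_blocks: int = 4, input_len: int = 32):
        super().__init__()
        self.lift = nn.Conv1d(1, channels, kernel_size=1)  # lift the single input channel
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
        self.head = nn.Linear(channels * input_len, 1)     # final fully-connected layer

    def forward(self, x):                 # x: (batch, 1, input_len)
        h = self.blocks(self.lift(x))
        return self.head(h.flatten(1))    # scalar regression output

model = ResNetTypeCNN()
out = model(torch.randn(16, 1, 32))       # -> shape (16, 1)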

  13. Block-sparse FNN [Figure: forward pass of a block-sparse FNN built from M parallel fully-connected blocks FC_σ^{W_1, b_1}, …, FC_σ^{W_M, b_M}, combined by weights w_1, …, w_M and a bias b] f_FNN := Σ_{m=1}^{M} w_m FC_σ^{W_m, b_m}(x) + b

  14. Block-sparse FNN [Figure: same block-sparse FNN diagram] f_FNN := Σ_{m=1}^{M} w_m FC_σ^{W_m, b_m}(x) + b. Known best-approximating FNNs are block-sparse when the true function is Barron [Klusowski & Barron, 18], Hölder [Yarotsky, 17; Schmidt-Hieber, 17], or Besov [Suzuki, 19].
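A minimal PyTorch sketch (my own illustration, not the authors' code) of the block-sparse FNN f_FNN := Σ_{m=1}^{M} w_m FC_σ^{W_m, b_m}(x) + b from the slide: M independent fully-connected ReLU blocks with no connections between blocks, combined only by a final weighted sum. Widths, depths, and names are assumed for illustration.

import torch
import torch.nn as nn

class BlockSparseFNN(nn.Module):
    """f(x) = sum_m w_m * FC_m(x) + b, where each FC_m is an independent ReLU FNN block."""
    def __init__(self, in_dim: int, num_blocks: int = 4, width: int = 16, depth: int = 3):
        super().__init__()
        def make_block():
            layers, d = [], in_dim
            for _ in range(depth):
                layers += [nn.Linear(d, width), nn.ReLU()]
                d = width
            layers.append(nn.Linear(d, 1))  # each block outputs a scalar FC_m(x)
            return nn.Sequential(*layers)
        self.blocks = nn.ModuleList([make_block() for _ in range(num_blocks)])
        self.w = nn.Parameter(torch.ones(num_blocks))  # aggregation weights w_m
        self.b = nn.Parameter(torch.zeros(1))          # bias b

    def forward(self, x):                              # x: (batch, in_dim)
        outs = torch.cat([blk(x) for blk in self.blocks], dim=1)  # (batch, M)
        return outs @ self.w + self.b                             # (batch,)

model = BlockSparseFNN(in_dim=10)
y = model(torch.randn(32, 10))  # -> shape (32,)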

