Approximation and Non-parametric Estimation of ResNet-type Convolutional Neural Networks - PowerPoint PPT Presentation



  1. Poster: 13th June, Pacific Ballroom #77 (Paper Link). Approximation and Non-parametric Estimation of ResNet-type Convolutional Neural Networks. Kenta Oono 1,2, Taiji Suzuki 1,3. {kenta_oono, taiji}@mist.i.u-tokyo.ac.jp. 1. The University of Tokyo 2. Preferred Networks, Inc. 3. RIKEN AIP. Thirty-sixth International Conference on Machine Learning (ICML 2019), June 13th 2019, Long Beach, CA, U.S.A.

  2. Key Takeaway Q. Why do ResNet-type CNNs work well?

  3. Key Takeaway Q. Why do ResNet-type CNNs work well? A. Hidden sparse structure promotes good performance.

  4. Problem Setting We consider a non-parametric regression problem: y = f°(x) + ξ. f°: true function (e.g., Hölder, Barron, Besov class); ξ: Gaussian noise.

  5. Problem Setting We consider a non-parametric regression problem: y = f°(x) + ξ. f°: true function (e.g., Hölder, Barron, Besov class); ξ: Gaussian noise. Given n i.i.d. samples, we pick an estimator f̂ from the hypothesis class ℱ, which is a set of functions realized by CNNs with a specified architecture.

  6. Problem Setting We consider a non-parametric regression problem: y = f°(x) + ξ. f°: true function (e.g., Hölder, Barron, Besov class); ξ: Gaussian noise. Given n i.i.d. samples, we pick an estimator f̂ from the hypothesis class ℱ, which is a set of functions realized by CNNs with a specified architecture. Goal: evaluate the estimation error ℛ(f̂) := E_X |f̂(X) − f°(X)|².
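To make the setting concrete, here is a minimal toy sketch (not from the slides or the paper) of the regression model and of evaluating ℛ(f̂). The true function, the noise level, and the stand-in hypothesis class (polynomials fitted by least squares, rather than the CNN classes ℱ studied in the paper) are all illustrative placeholders.

import numpy as np

rng = np.random.default_rng(0)

# True (unknown) regression function f°; any smooth toy function will do here.
def f_true(x):
    return np.sin(2 * np.pi * x)

# n i.i.d. samples from the model y = f°(x) + ξ, with Gaussian noise ξ.
n = 200
x = rng.uniform(0.0, 1.0, size=n)
y = f_true(x) + 0.1 * rng.standard_normal(n)

# Stand-in hypothesis class: degree-9 polynomials fitted by least squares.
# (The paper's ℱ is a class of CNNs instead; this is only for illustration.)
f_hat = np.polynomial.Polynomial.fit(x, y, deg=9)

# Estimation error ℛ(f̂) = E_X |f̂(X) − f°(X)|², approximated by Monte Carlo on fresh inputs.
x_fresh = rng.uniform(0.0, 1.0, size=100_000)
risk = np.mean((f_hat(x_fresh) - f_true(x_fresh)) ** 2)
print(f"estimated risk R(f_hat) ~= {risk:.5f}")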

  7. Prior Work ℛ(f̂) ≾ inf_{f∈ℱ} ∥f − f°∥² + Õ(L_ℱ / n) (first term: approximation error; second term: model complexity). n: sample size. ℱ: set of functions realizable by CNNs with a specified architecture. f°: true function (e.g., Hölder, Barron, Besov, etc.). Õ(⋅): O-notation ignoring logarithmic terms.
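For readers of this transcript, the same bound written out in LaTeX, under the standard assumption (not stated explicitly on this slide) that f̂ is the least-squares estimator over ℱ; constants and logarithmic factors are suppressed:

\[
\widehat{f} \in \operatorname*{arg\,min}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2,
\qquad
\mathcal{R}(\widehat{f}) \;\lesssim\;
\underbrace{\inf_{f \in \mathcal{F}} \lVert f - f^{\circ} \rVert^{2}}_{\text{approximation error}}
\;+\;
\underbrace{\widetilde{O}\!\left( \frac{L_{\mathcal{F}}}{n} \right)}_{\text{model complexity}}
\]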

  8. Prior Work ℛ(f̂) ≾ inf_{f∈ℱ} ∥f − f°∥² + Õ(L_ℱ / n) (approximation error + model complexity).
  | CNN type | Parameter size L_ℱ | Minimax optimality | Discrete optimization |
  | General  | # of all weights   | Sub-optimal ☹      | –                     |
  n: sample size. ℱ: set of functions realizable by CNNs with a specified architecture. f°: true function (e.g., Hölder, Barron, Besov, etc.). Õ(⋅): O-notation ignoring logarithmic terms.

  9. Prior Work ℛ(f̂) ≾ inf_{f∈ℱ} ∥f − f°∥² + Õ(L_ℱ / n) (approximation error + model complexity).
  | CNN type | Parameter size L_ℱ    | Minimax optimality | Discrete optimization |
  | General  | # of all weights      | Sub-optimal ☹      | –                     |
  | Sparse*  | # of non-zero weights | Optimal ☺          | Needed ☹              |
  * e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]
  n: sample size. ℱ: set of functions realizable by CNNs with a specified architecture. f°: true function (e.g., Hölder, Barron, Besov, etc.). Õ(⋅): O-notation ignoring logarithmic terms.

  10. Prior Work ℛ(f̂) ≾ inf_{f∈ℱ} ∥f − f°∥² + Õ(L_ℱ / n) (approximation error + model complexity).
  | CNN type | Parameter size L_ℱ    | Minimax optimality | Discrete optimization |
  | General  | # of all weights      | Sub-optimal ☹      | –                     |
  | Sparse*  | # of non-zero weights | Optimal ☺          | Needed ☹              |
  | ResNet   | # of all weights      | Optimal ☺          | Not needed ☺          |
  * e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]
  n: sample size. ℱ: set of functions realizable by CNNs with a specified architecture. f°: true function (e.g., Hölder, Barron, Besov, etc.). Õ(⋅): O-notation ignoring logarithmic terms.

  11. Contribution ResNet-type CNNs can achieve minimax-optimal rates without unrealistic constraints.
  | CNN type | Parameter size L_ℱ    | Minimax optimality | Discrete optimization |
  | General  | # of all weights      | Sub-optimal ☹      | –                     |
  | Sparse*  | # of non-zero weights | Optimal ☺          | Needed ☹              |
  | ResNet   | # of all weights      | Optimal ☺          | Not needed ☺          |
  * e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]

  12. Contribution ResNet-type CNNs can achieve minimax-optimal rates without unrealistic constraints.
  | CNN type | Parameter size L_ℱ    | Minimax optimality | Discrete optimization |
  | General  | # of all weights      | Sub-optimal ☹      | –                     |
  | Sparse*  | # of non-zero weights | Optimal ☺          | Needed ☹              |
  | ResNet   | # of all weights      | Optimal ☺          | Not needed ☺          |
  * e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]
  Key Observation: Known optimal FNNs have block-sparse structures.
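As a rough illustration of what "ResNet-type CNN" refers to on these slides, here is a minimal PyTorch sketch of a stack of convolutional residual blocks followed by a single fully-connected layer. The class names, channel counts, kernel sizes, and depths are my own placeholder choices, not the exact architecture analyzed in the paper.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual block: a small convolutional subnetwork plus an identity skip connection."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
        )

    def forward(self, x):
        # Identity skip connection characteristic of ResNet-type CNNs.
        return x + self.body(x)

class ResNetTypeCNN(nn.Module):
    """Stack of residual conv blocks followed by one final fully-connected layer."""
    def __init__(self, channels: int = 8, num_blocks: int = 4, input_len: int = 32):
        super().__init__()
        self.lift = nn.Conv1d(1, channels, kernel_size=1)  # lift the single input channel
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
        self.head = nn.Linear(channels * input_len, 1)     # final fully-connected layer

    def forward(self, x):                 # x: (batch, 1, input_len)
        h = self.blocks(self.lift(x))
        return self.head(h.flatten(1))    # scalar regression output

model = ResNetTypeCNN()
out = model(torch.randn(16, 1, 32))       # -> shape (16, 1)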

  13. Block-sparse FNN [Figure: forward pass of a block-sparse FNN built from M parallel fully-connected blocks FC_σ^{W_1, b_1}, …, FC_σ^{W_M, b_M}, combined by weights w_1, …, w_M and a bias b] f_FNN := Σ_{m=1}^{M} w_m FC_σ^{W_m, b_m}(x) + b

  14. Block-sparse FNN [Figure: same block-sparse FNN diagram] f_FNN := Σ_{m=1}^{M} w_m FC_σ^{W_m, b_m}(x) + b. Known best-approximating FNNs are block-sparse when the true function is Barron [Klusowski & Barron, 18], Hölder [Yarotsky, 17; Schmidt-Hieber, 17], or Besov [Suzuki, 19].
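A minimal PyTorch sketch (my own illustration, not the authors' code) of the block-sparse FNN f_FNN := Σ_{m=1}^{M} w_m FC_σ^{W_m, b_m}(x) + b from the slide: M independent fully-connected ReLU blocks with no connections between blocks, combined only by a final weighted sum. Widths, depths, and names are assumed for illustration.

import torch
import torch.nn as nn

class BlockSparseFNN(nn.Module):
    """f(x) = sum_m w_m * FC_m(x) + b, where each FC_m is an independent ReLU FNN block."""
    def __init__(self, in_dim: int, num_blocks: int = 4, width: int = 16, depth: int = 3):
        super().__init__()
        def make_block():
            layers, d = [], in_dim
            for _ in range(depth):
                layers += [nn.Linear(d, width), nn.ReLU()]
                d = width
            layers.append(nn.Linear(d, 1))  # each block outputs a scalar FC_m(x)
            return nn.Sequential(*layers)
        self.blocks = nn.ModuleList([make_block() for _ in range(num_blocks)])
        self.w = nn.Parameter(torch.ones(num_blocks))  # aggregation weights w_m
        self.b = nn.Parameter(torch.zeros(1))          # bias b

    def forward(self, x):                              # x: (batch, in_dim)
        outs = torch.cat([blk(x) for blk in self.blocks], dim=1)  # (batch, M)
        return outs @ self.w + self.b                             # (batch,)

model = BlockSparseFNN(in_dim=10)
y = model(torch.randn(32, 10))  # -> shape (32,)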

