SLIDE 1

Approximation and Non-parametric Estimation of ResNet-type Convolutional Neural Networks

Kenta Oono¹,² Taiji Suzuki¹,³


Poster: 13th June, Pacific Ballroom #77

↑Paper Link

{kenta_oono, taiji}@mist.i.u-tokyo.ac.jp

¹ The University of Tokyo  ² Preferred Networks, Inc.  ³ RIKEN AIP

Thirty-sixth International Conference on Machine Learning (ICML 2019) June 13th 2019, Long Beach, CA, U.S.A.

SLIDE 2

Key Takeaway

  • Q. Why do ResNet-type CNNs work well?


SLIDE 3

Key Takeaway

  • Q. Why do ResNet-type CNNs work well?

  • A. A hidden sparse structure promotes good performance.


SLIDE 4

Problem Setting


We consider a non-parametric regression problem:

$Y = f^{\circ}(X) + \xi$

$f^{\circ}$: true function (e.g., Hölder, Barron, Besov class); $\xi$: Gaussian noise

SLIDE 5

Problem Setting


We consider a non-parametric regression problem:

$Y = f^{\circ}(X) + \xi$

$f^{\circ}$: true function (e.g., Hölder, Barron, Besov class); $\xi$: Gaussian noise

Given $n$ i.i.d. samples, we pick an estimator $\hat{f}$ from the hypothesis class $\mathcal{F}$, which is a set of functions realized by CNNs with a specified architecture.

SLIDE 6

Problem Setting


We consider a non-parametric regression problem:

$Y = f^{\circ}(X) + \xi$

$f^{\circ}$: true function (e.g., Hölder, Barron, Besov class); $\xi$: Gaussian noise

Given $n$ i.i.d. samples, we pick an estimator $\hat{f}$ from the hypothesis class $\mathcal{F}$, which is a set of functions realized by CNNs with a specified architecture.

Goal: evaluate the estimation error $\mathcal{R}(\hat{f}) := \mathbb{E}_X \bigl|\hat{f}(X) - f^{\circ}(X)\bigr|^2$.
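To make the setting concrete, here is a minimal, self-contained sketch (our own illustration, not from the slides; the true function, the noise level, and the polynomial class standing in for the CNN class $\mathcal{F}$ are all hypothetical choices):

```python
import numpy as np

# Sketch of the regression setting Y = f°(X) + ξ.
rng = np.random.default_rng(0)

def f_true(x):                      # a smooth "true" function (our choice)
    return np.sin(2 * np.pi * x)

n = 200                             # sample size
X = rng.uniform(0.0, 1.0, size=n)   # i.i.d. inputs
Y = f_true(X) + 0.1 * rng.standard_normal(n)   # Gaussian noise ξ

# Least-squares estimator f̂ picked from a simple hypothesis class F
# (degree-5 polynomials, standing in for the CNN class on the slide).
f_hat = np.poly1d(np.polyfit(X, Y, deg=5))

# Monte-Carlo estimate of R(f̂) = E_X |f̂(X) − f°(X)|².
X_test = rng.uniform(0.0, 1.0, size=10_000)
risk = np.mean((f_hat(X_test) - f_true(X_test)) ** 2)
print(f"estimated R(f_hat) = {risk:.4f}")
```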

SLIDE 7

Prior Work

ℛ " # ≾ inf(∈ℱ ∥ # − #° ∥.

/ + 1

2(4ℱ/6) Approximation Error Model Complexity

Oono and Suzuki, Jun 13th #77 7

6: Sample size ℱ: Set of functions realizable by CNNs with a specified architecture #°: True function (e.g., Hölder, Barron, Besov etc.) 1 2(9): 2-notation ignoring logarithmic terms.
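For intuition (a standard calculation, not spelled out on the slide): if $f^{\circ}$ is $\beta$-Hölder on $[0,1]^D$, networks with $W$ parameters can approximate it to accuracy roughly $W^{-\beta/D}$ (up to logs), and balancing the two terms of the bound recovers the optimal rate:

```latex
\[
  \mathcal{R}(\hat{f})
    \;\lesssim\; W^{-2\beta/D} + \tilde{O}(W/n),
  \qquad
  W \asymp n^{\frac{D}{2\beta + D}}
  \;\Longrightarrow\;
  \mathcal{R}(\hat{f})
    \;\lesssim\; \tilde{O}\bigl(n^{-\frac{2\beta}{2\beta + D}}\bigr),
\]
```

which is the minimax-optimal rate for the $\beta$-Hölder class.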

SLIDE 8

Prior Work

$\mathcal{R}(\hat{f}) \;\lesssim\; \underbrace{\inf_{f \in \mathcal{F}} \|f - f^{\circ}\|_{\infty}^{2}}_{\text{Approximation Error}} \;+\; \underbrace{\tilde{O}(W_{\mathcal{F}} / n)}_{\text{Model Complexity}}$

| CNN type | Parameter size $W_{\mathcal{F}}$ | Minimax optimality | Discrete optimization |
|---|---|---|---|
| General | # of all weights | Sub-optimal ☹ | |

$n$: sample size; $\mathcal{F}$: set of functions realizable by CNNs with a specified architecture; $f^{\circ}$: true function (e.g., Hölder, Barron, Besov, etc.); $\tilde{O}(\cdot)$: $O$-notation ignoring logarithmic terms.

SLIDE 9

Prior Work

$\mathcal{R}(\hat{f}) \;\lesssim\; \underbrace{\inf_{f \in \mathcal{F}} \|f - f^{\circ}\|_{\infty}^{2}}_{\text{Approximation Error}} \;+\; \underbrace{\tilde{O}(W_{\mathcal{F}} / n)}_{\text{Model Complexity}}$

| CNN type | Parameter size $W_{\mathcal{F}}$ | Minimax optimality | Discrete optimization |
|---|---|---|---|
| General | # of all weights | Sub-optimal ☹ | |
| Sparse* | # of non-zero weights | Optimal ☺ | Needed ☹ |

* e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]

$n$: sample size; $\mathcal{F}$: set of functions realizable by CNNs with a specified architecture; $f^{\circ}$: true function (e.g., Hölder, Barron, Besov, etc.); $\tilde{O}(\cdot)$: $O$-notation ignoring logarithmic terms.

SLIDE 10

Prior Work

$\mathcal{R}(\hat{f}) \;\lesssim\; \underbrace{\inf_{f \in \mathcal{F}} \|f - f^{\circ}\|_{\infty}^{2}}_{\text{Approximation Error}} \;+\; \underbrace{\tilde{O}(W_{\mathcal{F}} / n)}_{\text{Model Complexity}}$

| CNN type | Parameter size $W_{\mathcal{F}}$ | Minimax optimality | Discrete optimization |
|---|---|---|---|
| General | # of all weights | Sub-optimal ☹ | |
| Sparse* | # of non-zero weights | Optimal ☺ | Needed ☹ |
| ResNet | # of all weights | Optimal ☺ | Not Needed ☺ |

* e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]

$n$: sample size; $\mathcal{F}$: set of functions realizable by CNNs with a specified architecture; $f^{\circ}$: true function (e.g., Hölder, Barron, Besov, etc.); $\tilde{O}(\cdot)$: $O$-notation ignoring logarithmic terms.

SLIDE 11

Contribution

ResNet-type CNNs can achieve minimax-optimal rates without unrealistic constraints.

| CNN type | Parameter size $W_{\mathcal{F}}$ | Minimax optimality | Discrete optimization |
|---|---|---|---|
| General | # of all weights | Sub-optimal ☹ | |
| Sparse* | # of non-zero weights | Optimal ☺ | Needed ☹ |
| ResNet | # of all weights | Optimal ☺ | Not Needed ☺ |

* e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]

SLIDE 12

Contribution

ResNet-type CNNs can achieve minimax-optimal rates without unrealistic constraints.

Key Observation: Known optimal FNNs have block-sparse structures.

| CNN type | Parameter size $W_{\mathcal{F}}$ | Minimax optimality | Discrete optimization |
|---|---|---|---|
| General | # of all weights | Sub-optimal ☹ | |
| Sparse* | # of non-zero weights | Optimal ☺ | Needed ☹ |
| ResNet | # of all weights | Optimal ☺ | Not Needed ☺ |

* e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]

SLIDE 13

Block-sparse FNN


[Figure: a block-sparse FNN. In the forward pass, the input is processed by $M$ parallel blocks $\mathrm{FC}_{\sigma}^{(m)}$ with parameters $(W_m, b_m)$, and their outputs are combined with weights $w_1, \dots, w_M$ and a bias $b$.]

$\mathrm{FNN}(x) := \sum_{m=1}^{M} w_m^{\top}\, \mathrm{FC}_{\sigma}^{(m)}(x) - b$
SLIDE 14

Block-sparse FNN


[Figure: the block-sparse FNN, as on the previous slide.]

$\mathrm{FNN}(x) := \sum_{m=1}^{M} w_m^{\top}\, \mathrm{FC}_{\sigma}^{(m)}(x) - b$

Known best-approximating FNNs are block-sparse when the true function is Barron [Klusowski & Barron, 18], Hölder [Yarotsky, 17; Schmidt-Hieber, 17], or Besov [Suzuki, 19].
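To make the block-sparse structure concrete, here is a minimal PyTorch sketch (our own illustration; the widths, depth, and ReLU as $\sigma$ are assumed choices, not the paper's construction):

```python
import torch
import torch.nn as nn

class BlockSparseFNN(nn.Module):
    """Sketch of a block-sparse FNN: M parallel fully connected blocks
    FC_sigma^(m), whose scalar outputs are summed (the weights w_m are the
    blocks' final linear layers) and shifted by a bias b."""

    def __init__(self, in_dim: int, width: int, depth: int, num_blocks: int):
        super().__init__()

        def make_block() -> nn.Sequential:
            layers = [nn.Linear(in_dim, width), nn.ReLU()]
            for _ in range(depth - 1):
                layers += [nn.Linear(width, width), nn.ReLU()]
            layers.append(nn.Linear(width, 1, bias=False))  # plays the role of w_m
            return nn.Sequential(*layers)

        self.blocks = nn.ModuleList(make_block() for _ in range(num_blocks))
        self.b = nn.Parameter(torch.zeros(1))  # the output bias b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FNN(x) = sum_m w_m^T FC_sigma^(m)(x) - b
        return sum(block(x) for block in self.blocks) - self.b

fnn = BlockSparseFNN(in_dim=8, width=16, depth=3, num_blocks=4)
print(fnn(torch.randn(2, 8)).shape)  # -> torch.Size([2, 1])
```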

SLIDE 15

Block-sparse FNN to ResNet-type CNN


[Figure: the block-sparse FNN (top) is transformed into a ResNet-type CNN (bottom): residual blocks $\mathrm{Conv}_{\sigma}^{(m)} + \mathrm{id}$ with parameters $(w_m, b_m)$, followed by an identity-activation layer $\mathrm{FC}_{\mathrm{id}}$ with parameters $(W, b)$.]

$\mathrm{FNN}(x) := \sum_{m=1}^{M} w_m^{\top}\, \mathrm{FC}_{\sigma}^{(m)}(x) - b$

$\mathrm{CNN} := \mathrm{FC}_{\mathrm{id}} \circ (\mathrm{Conv}_{\sigma}^{(M)} + \mathrm{id}) \circ \cdots \circ (\mathrm{Conv}_{\sigma}^{(1)} + \mathrm{id}) \circ P$  ($P$: the initial padding layer)

Known best-approximating FNNs are block-sparse when the true function is Barron [Klusowski & Barron, 18], Hölder [Yarotsky, 17; Schmidt-Hieber, 17], or Besov [Suzuki, 19].

SLIDE 16

Block-sparse FNN to ResNet-type CNN


[Figure: the block-sparse FNN (↑ minimax optimal) is transformed into a ResNet-type CNN.]

$\mathrm{CNN} := \mathrm{FC}_{\mathrm{id}} \circ (\mathrm{Conv}_{\sigma}^{(M)} + \mathrm{id}) \circ \cdots \circ (\mathrm{Conv}_{\sigma}^{(1)} + \mathrm{id}) \circ P$

Known best-approximating FNNs are block-sparse when the true function is Barron [Klusowski & Barron, 18], Hölder [Yarotsky, 17; Schmidt-Hieber, 17], or Besov [Suzuki, 19].

SLIDE 17

Block-sparse FNN to ResNet-type CNN


[Figure: the block-sparse FNN (↑ minimax optimal) is transformed into a ResNet-type CNN (↑ minimax optimal, too!).]

Known best-approximating FNNs are block-sparse when the true function is Barron [Klusowski & Barron, 18], Hölder [Yarotsky, 17; Schmidt-Hieber, 17], or Besov [Suzuki, 19].

SLIDE 18

Block-sparse FNN to ResNet-type CNN

Theorem

For any block-sparse FNN with $M$ blocks, there exists a ResNet-type CNN with $M$ residual blocks which has $O(M)$ more parameters and which is identical (as a function) to the FNN.

[Figure: the block-sparse FNN (left) and the equivalent ResNet-type CNN (right).]
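A minimal PyTorch sketch of the ResNet-type architecture the theorem targets (our own illustration with assumed sizes; the theorem's actual construction chooses the convolution filters so that the CNN equals a given FNN exactly, which this sketch does not implement):

```python
import torch
import torch.nn as nn

class ResNetTypeCNN(nn.Module):
    """Sketch of a ResNet-type CNN: a chain of residual blocks
    (Conv_sigma + id), then a final identity-activation linear readout
    FC_id. The initial padding layer P from the slide is omitted."""

    def __init__(self, channels: int, length: int, num_blocks: int, depth: int):
        super().__init__()

        def make_block() -> nn.Sequential:
            layers = []
            for _ in range(depth):
                layers += [nn.Conv1d(channels, channels, kernel_size=3,
                                     padding=1), nn.ReLU()]
            return nn.Sequential(*layers)

        self.blocks = nn.ModuleList(make_block() for _ in range(num_blocks))
        self.fc = nn.Linear(channels * length, 1)  # FC_id readout (W, b)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length)
        for block in self.blocks:
            x = block(x) + x              # residual connection: Conv_sigma + id
        return self.fc(x.flatten(1))      # identity activation at the end

cnn = ResNetTypeCNN(channels=4, length=8, num_blocks=3, depth=2)
print(cnn(torch.randn(2, 4, 8)).shape)   # -> torch.Size([2, 1])
```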
SLIDE 19

Optimality of ResNet-type CNNs


Theorem (e.g., Hölder Case)

Suppose the true function $f^{\circ}$ is $\beta$-Hölder. There exists a set of ResNet-type CNNs $\mathcal{F}$ such that:

SLIDE 20

Optimality of ResNet-type CNNs


Theorem (e.g., Hölder Case)

Suppose the true function $f^{\circ}$ is $\beta$-Hölder. There exists a set of ResNet-type CNNs $\mathcal{F}$ such that:

  • $\mathcal{F}$ does NOT have sparse constraints
  • the estimator $\hat{f}$ of $\mathcal{F}$ achieves the minimax-optimal estimation error rate (up to log factors).

SLIDE 21

Optimality of ResNet-type CNNs


Theorem (e.g., Hölder Case)

Suppose the true function $f^{\circ}$ is $\beta$-Hölder. There exists a set of ResNet-type CNNs $\mathcal{F}$ such that:

  • $\mathcal{F}$ does NOT have sparse constraints
  • the estimator $\hat{f}$ of $\mathcal{F}$ achieves the minimax-optimal estimation error rate (up to log factors).

☺ Minimax optimal!  ☺ No discrete optimization!

SLIDE 22

Optimality of ResNet-type CNNs


Theorem (e.g., Hölder Case)

Suppose the true function $f^{\circ}$ is $\beta$-Hölder. There exists a set of ResNet-type CNNs $\mathcal{F}$ such that:

  • $\mathcal{F}$ does NOT have sparse constraints
  • the estimator $\hat{f}$ of $\mathcal{F}$ achieves the minimax-optimal estimation error rate (up to log factors).

☺ Minimax optimal!  ☺ No discrete optimization!

Note

  • Using the same strategy, we can prove that ResNet-type CNNs achieve the same rate as FNNs for the Barron class, etc.
  • We also remove unrealistic constraints on channel size (see the paper).
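For reference (a classical fact about nonparametric regression, not printed on the slide), "minimax optimal" for the $\beta$-Hölder class on $[0,1]^D$ refers to the benchmark rate:

```latex
\[
  \inf_{\hat{f}} \;
  \sup_{f^{\circ}\ \beta\text{-H\"older on } [0,1]^D}
    \mathbb{E}\,\mathcal{R}(\hat{f})
  \;\asymp\; n^{-\frac{2\beta}{2\beta + D}},
\]
```

so the theorem's $\hat{f}$ attains $\tilde{O}\bigl(n^{-2\beta/(2\beta+D)}\bigr)$ without any sparsity constraint on $\mathcal{F}$.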

SLIDE 23

Conclusion

ResNet-type CNNs can achieve minimax-optimal rates in several function classes without implausible constraints.

| CNN type | Parameter size $W_{\mathcal{F}}$ | Minimax optimality | Discrete optimization |
|---|---|---|---|
| General | # of all weights | Sub-optimal ☹ | |
| Sparse* | # of non-zero weights | Optimal ☺ | Needed ☹ |
| ResNet | # of all weights | Optimal ☺ | Not Needed ☺ |

[Figure: the block-sparse FNN (↑ minimax optimal) and the equivalent ResNet-type CNN (↑ minimax optimal, too!).]

Poster: 13th June, Pacific Ballroom #77
↑Paper Link