

slide-1
SLIDE 1

Optimization Problems for Neural Networks

Chih-Jen Lin

National Taiwan University Last updated: May 25, 2020

Chih-Jen Lin (National Taiwan Univ.) 1 / 78

slide-2
SLIDE 2

Outline

1. Regularized linear classification
2. Optimization problem for fully-connected networks
3. Optimization problem for convolutional neural networks (CNN)
4. Discussion

Chih-Jen Lin (National Taiwan Univ.) 2 / 78

slide-3
SLIDE 3

Regularized linear classification

Outline

1. Regularized linear classification
2. Optimization problem for fully-connected networks
3. Optimization problem for convolutional neural networks (CNN)
4. Discussion

Chih-Jen Lin (National Taiwan Univ.) 3 / 78

slide-4
SLIDE 4

Regularized linear classification

Minimizing Training Errors

Basically, a classification method starts with minimizing the training errors:

$$\min_{\text{model}} \; (\text{training errors})$$

That is, all or most training data with labels should be correctly classified by our model. A model can be a decision tree, a neural network, or other types.

Chih-Jen Lin (National Taiwan Univ.) 4 / 78

slide-5
SLIDE 5

Regularized linear classification

Minimizing Training Errors (Cont’d)

For simplicity, let's consider the model to be a vector $w$. That is, the decision function is $\mathrm{sgn}(w^T x)$. For any data $x$, the predicted label is

$$\begin{cases} 1 & \text{if } w^T x \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

Chih-Jen Lin (National Taiwan Univ.) 5 / 78

slide-6
SLIDE 6

Regularized linear classification

Minimizing Training Errors (Cont’d)

The two-dimensional situation

(Figure: two groups of points in the plane separated by the line $w^T x = 0$.)

This seems to be quite restricted, but practically $x$ is in a much higher dimensional space.

Chih-Jen Lin (National Taiwan Univ.) 6 / 78

slide-7
SLIDE 7

Regularized linear classification

Minimizing Training Errors (Cont’d)

To characterize the training error, we need a loss function $\xi(w; y, x)$ for each instance $(y, x)$, where $y = \pm 1$ is the label and $x$ is the feature vector. Ideally we should use the 0–1 training loss:

$$\xi(w; y, x) = \begin{cases} 1 & \text{if } y\, w^T x < 0, \\ 0 & \text{otherwise} \end{cases}$$

Chih-Jen Lin (National Taiwan Univ.) 7 / 78

slide-8
SLIDE 8

Regularized linear classification

Minimizing Training Errors (Cont’d)

However, this function is discontinuous, so the optimization problem becomes difficult.

(Figure: the 0–1 loss $\xi(w; y, x)$ plotted against $-y\, w^T x$.)

We need continuous approximations.

Chih-Jen Lin (National Taiwan Univ.) 8 / 78

slide-9
SLIDE 9

Regularized linear classification

Common Loss Functions

Hinge loss ($l_1$ loss):
$$\xi_{L1}(w; y, x) \equiv \max(0, 1 - y\, w^T x) \qquad (1)$$
Logistic loss:
$$\xi_{LR}(w; y, x) \equiv \log(1 + e^{-y w^T x}) \qquad (2)$$
Support vector machines (SVM): Eq. (1). Logistic regression (LR): Eq. (2).
SVM and LR are two very fundamental classification methods.
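For concreteness, here is a minimal numerical sketch of the two losses (1) and (2) in Python/NumPy; the vectors and helper names below are made up only for illustration.

```python
import numpy as np

def hinge_loss(w, y, x):
    """Hinge (l1) loss of Eq. (1): max(0, 1 - y w^T x)."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def logistic_loss(w, y, x):
    """Logistic loss of Eq. (2): log(1 + exp(-y w^T x))."""
    return np.log1p(np.exp(-y * np.dot(w, x)))

# A made-up instance: both losses are small when y * w^T x is large and positive.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, 1.5])
for y in (+1, -1):
    print(y, hinge_loss(w, y, x), logistic_loss(w, y, x))
```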

Chih-Jen Lin (National Taiwan Univ.) 9 / 78

slide-10
SLIDE 10

Regularized linear classification

Common Loss Functions (Cont’d)

(Figure: the hinge loss $\xi_{L1}$ and logistic loss $\xi_{LR}$ plotted against $-y\, w^T x$.)

Logistic regression is closely related to SVM. Their performance is usually similar.

Chih-Jen Lin (National Taiwan Univ.) 10 / 78

slide-11
SLIDE 11

Regularized linear classification

Common Loss Functions (Cont’d)

However, minimizing training losses may not give a good model for future prediction. Overfitting occurs.

Chih-Jen Lin (National Taiwan Univ.) 11 / 78

slide-12
SLIDE 12

Regularized linear classification

Overfitting

See the illustration in the next slide. For classification, you can easily achieve 100% training accuracy, but this is useless. When training on a data set, we should avoid underfitting (to get a small training error) and avoid overfitting (to get a small testing error).

Chih-Jen Lin (National Taiwan Univ.) 12 / 78

slide-13
SLIDE 13

Regularized linear classification

(Figure: ● and ▲: training data; ○ and △: testing data.)

Chih-Jen Lin (National Taiwan Univ.) 13 / 78

slide-14
SLIDE 14

Regularized linear classification

Regularization

To minimize the training error we manipulate the $w$ vector so that it fits the data. To avoid overfitting we need a way to make $w$'s values less extreme. One idea is to make $w$'s values closer to zero. We can add, for example,
$$\frac{w^T w}{2} \quad \text{or} \quad \|w\|_1$$
to the function that is minimized.

Chih-Jen Lin (National Taiwan Univ.) 14 / 78

slide-15
SLIDE 15

Regularized linear classification

General Form of Linear Classification

Training data $\{y_i, x_i\}$, $x_i \in R^n$, $i = 1, \dots, l$, $y_i = \pm 1$
$l$: # of data, $n$: # of features

$$\min_{w} f(w), \qquad f(w) \equiv \frac{w^T w}{2} + C \sum_{i=1}^{l} \xi(w; y_i, x_i)$$

$w^T w / 2$: regularization term
$\xi(w; y, x)$: loss function
$C$: regularization parameter (chosen by users)
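As a sketch, this objective can be evaluated directly. The snippet below assumes NumPy, uses the logistic loss (2), and generates random data purely for illustration.

```python
import numpy as np

def f(w, X, y, C):
    """f(w) = w^T w / 2 + C * sum_i log(1 + exp(-y_i w^T x_i)), i.e., the logistic loss case."""
    margins = y * (X @ w)                  # y_i * w^T x_i for all i
    reg = 0.5 * np.dot(w, w)               # regularization term
    loss = np.sum(np.log1p(np.exp(-margins)))
    return reg + C * loss

rng = np.random.default_rng(0)
l, n = 20, 5                               # made-up sizes: l instances, n features
X = rng.standard_normal((l, n))            # rows are x_i^T
y = np.where(rng.standard_normal(l) > 0, 1.0, -1.0)
print(f(np.zeros(n), X, y, C=1.0))         # at w = 0 every loss term is log(2)
```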

Chih-Jen Lin (National Taiwan Univ.) 15 / 78

slide-16
SLIDE 16

Optimization problem for fully-connected networks

Outline

1. Regularized linear classification
2. Optimization problem for fully-connected networks
3. Optimization problem for convolutional neural networks (CNN)
4. Discussion

Chih-Jen Lin (National Taiwan Univ.) 16 / 78

slide-17
SLIDE 17

Optimization problem for fully-connected networks

Multi-class Classification I

Our training set includes $(\boldsymbol{y}_i, x_i)$, $i = 1, \dots, l$. $x_i \in R^{n_1}$ is the feature vector and $\boldsymbol{y}_i \in R^K$ is the label vector. As the label is now a vector, we change (label, instance) from $(y_i, x_i)$ to $(\boldsymbol{y}_i, x_i)$.
$K$: # of classes
If $x_i$ is in class $k$, then
$$\boldsymbol{y}_i = [\underbrace{0, \dots, 0}_{k-1}, 1, 0, \dots, 0]^T \in R^K$$

Chih-Jen Lin (National Taiwan Univ.) 17 / 78

slide-18
SLIDE 18

Optimization problem for fully-connected networks

Multi-class Classification II

A neural network maps each feature vector to one of the class labels by the connection of nodes.

Chih-Jen Lin (National Taiwan Univ.) 18 / 78

slide-19
SLIDE 19

Optimization problem for fully-connected networks

Fully-connected Networks

Between two layers a weight matrix maps inputs (the previous layer) to outputs (the next layer).

(Figure: nodes of consecutive layers, e.g. A1, B1, C1 in one layer fully connected to A2, B2, ... in the next.)

Chih-Jen Lin (National Taiwan Univ.) 19 / 78

slide-20
SLIDE 20

Optimization problem for fully-connected networks

Operations Between Two Layers I

The weight matrix $W^m$ at the $m$th layer is
$$W^m = \begin{bmatrix} w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\ w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\ \vdots & \vdots & \ddots & \vdots \\ w^m_{n_{m+1} 1} & w^m_{n_{m+1} 2} & \cdots & w^m_{n_{m+1} n_m} \end{bmatrix}_{n_{m+1} \times n_m}$$
$n_m$: # input features at layer $m$
$n_{m+1}$: # output features at layer $m$, or # input features at layer $m + 1$
$L$: number of layers

Chih-Jen Lin (National Taiwan Univ.) 20 / 78

slide-21
SLIDE 21

Optimization problem for fully-connected networks

Operations Between Two Layers II

$n_1$ = # of features, $n_{L+1}$ = # of classes
Let $z^m$ be the input of the $m$th layer, $z^1 = x$, and $z^{L+1}$ be the output.
From the $m$th layer to the $(m+1)$th layer:
$$s^m = W^m z^m, \qquad z^{m+1}_j = \sigma(s^m_j), \; j = 1, \dots, n_{m+1},$$
where $\sigma(\cdot)$ is the activation function.

Chih-Jen Lin (National Taiwan Univ.) 21 / 78

slide-22
SLIDE 22

Optimization problem for fully-connected networks

Operations Between Two Layers III

Usually people include a bias term
$$b^m = \begin{bmatrix} b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_{m+1}} \end{bmatrix}_{n_{m+1} \times 1},$$
so that
$$s^m = W^m z^m + b^m$$
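A minimal sketch of this per-layer operation in NumPy, with made-up sizes and a sigmoid activation chosen only for illustration (the slides have not fixed a particular $\sigma$):

```python
import numpy as np

def layer_forward(W, b, z, sigma):
    """One fully-connected layer: s = W z + b, then the activation applied element-wise."""
    s = W @ z + b
    return sigma(s)

# Made-up sizes: n_m = 4 inputs, n_{m+1} = 3 outputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
z = rng.standard_normal(4)
z_next = layer_forward(W, b, z, lambda s: 1.0 / (1.0 + np.exp(-s)))
print(z_next.shape)   # (3,)
```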

Chih-Jen Lin (National Taiwan Univ.) 22 / 78

slide-23
SLIDE 23

Optimization problem for fully-connected networks

Operations Between Two Layers IV

The activation function is usually an $R \to R$ transformation. As we are interested in optimization, let's not worry about why it's needed.
We collect all variables:
$$\theta = \begin{bmatrix} \mathrm{vec}(W^1) \\ b^1 \\ \vdots \\ \mathrm{vec}(W^L) \\ b^L \end{bmatrix} \in R^n$$

Chih-Jen Lin (National Taiwan Univ.) 23 / 78

slide-24
SLIDE 24

Optimization problem for fully-connected networks

Operations Between Two Layers V

$n$: total # variables $= (n_1 + 1) n_2 + \cdots + (n_L + 1) n_{L+1}$
The $\mathrm{vec}(\cdot)$ operator stacks the columns of a matrix into a vector.

Chih-Jen Lin (National Taiwan Univ.) 24 / 78

slide-25
SLIDE 25

Optimization problem for fully-connected networks

Optimization Problem I

We solve the following optimization problem: $\min_\theta f(\theta)$, where
$$f(\theta) = \frac{1}{2} \theta^T \theta + C \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); \boldsymbol{y}_i, x_i).$$
$C$: regularization parameter
$z^{L+1}(\theta) \in R^{n_{L+1}}$: last-layer output vector of $x$
$\xi(z^{L+1}; \boldsymbol{y}, x)$: loss function. Example: $\xi(z^{L+1}; \boldsymbol{y}, x) = \|z^{L+1} - \boldsymbol{y}\|^2$

Chih-Jen Lin (National Taiwan Univ.) 25 / 78

slide-26
SLIDE 26

Optimization problem for fully-connected networks

Optimization Problem II

The formulation is the same as for linear classification. However, the loss function is more complicated. Further, it's non-convex.
Note that in the earlier discussion we considered a single instance. In the training process we actually have, for $i = 1, \dots, l$,
$$s^{m,i} = W^m z^{m,i}, \qquad z^{m+1,i}_j = \sigma(s^{m,i}_j), \; j = 1, \dots, n_{m+1}.$$
This makes the training more complicated.
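A small illustrative sketch (made-up sizes, NumPy): one common way to handle the per-instance products is to stack the vectors $z^{m,i}$ of all instances as columns of a matrix, so that all $l$ products become one matrix-matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
l, n_m, n_mp1 = 6, 4, 3                    # made-up: l instances, layer widths
W = rng.standard_normal((n_mp1, n_m))
b = rng.standard_normal((n_mp1, 1))
Zm = rng.standard_normal((n_m, l))         # column i is z^{m,i}

Sm = W @ Zm + b                            # column i is s^{m,i} = W z^{m,i} + b
Zm1 = np.maximum(Sm, 0)                    # element-wise activation (ReLU here, for illustration)

# Same result as looping over the l instances one by one.
for i in range(l):
    assert np.allclose(Zm1[:, i], np.maximum(W @ Zm[:, i] + b.ravel(), 0))
```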

Chih-Jen Lin (National Taiwan Univ.) 26 / 78

slide-27
SLIDE 27

Optimization problem for convolutional neural networks (CNN)

Outline

1. Regularized linear classification
2. Optimization problem for fully-connected networks
3. Optimization problem for convolutional neural networks (CNN)
4. Discussion

Chih-Jen Lin (National Taiwan Univ.) 27 / 78

slide-28
SLIDE 28

Optimization problem for convolutional neural networks (CNN)

Why CNN? I

There are many types of neural networks, and they are suitable for different types of problems. While deep learning is hot, it's not always better than other learning methods.
For example, fully-connected networks were evaluated on general classification data (e.g., data from the UCI machine learning repository). They are not consistently better than random forests or SVM; see the comparisons (Meyer et al., 2003; Fernández-Delgado et al., 2014; Wang et al., 2018).

Chih-Jen Lin (National Taiwan Univ.) 28 / 78

slide-29
SLIDE 29

Optimization problem for convolutional neural networks (CNN)

Why CNN? II

We are interested in CNN because it's shown to be significantly better than others on image data. That's one of the main reasons deep learning became popular.
To study optimization algorithms, of course we want to consider an “established” network. That's why CNN was chosen for our discussion.
However, the problem is that operations in CNN are more complicated than in fully-connected networks. Most books/papers only give explanations without detailed mathematical forms.

Chih-Jen Lin (National Taiwan Univ.) 29 / 78

slide-30
SLIDE 30

Optimization problem for convolutional neural networks (CNN)

Why CNN? III

To study the optimization, we need some clean formulations. So let's give it a try here.

Chih-Jen Lin (National Taiwan Univ.) 30 / 78

slide-31
SLIDE 31

Optimization problem for convolutional neural networks (CNN)

Convolutional Neural Networks I

Consider a $K$-class classification problem with training data $(\boldsymbol{y}_i, Z^{1,i})$, $i = 1, \dots, l$.
$\boldsymbol{y}_i$: label vector; $Z^{1,i}$: input image
If $Z^{1,i}$ is in class $k$, then
$$\boldsymbol{y}_i = [\underbrace{0, \dots, 0}_{k-1}, 1, 0, \dots, 0]^T \in R^K.$$
CNN maps each image $Z^{1,i}$ to $\boldsymbol{y}_i$.

Chih-Jen Lin (National Taiwan Univ.) 31 / 78

slide-32
SLIDE 32

Optimization problem for convolutional neural networks (CNN)

Convolutional Neural Networks II

Typically, CNN consists of multiple convolutional layers followed by fully-connected layers. Input and output of a convolutional layer are assumed to be images.

Chih-Jen Lin (National Taiwan Univ.) 32 / 78

slide-33
SLIDE 33

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers I

For the current layer, let the input be an image $Z^{\text{in}}$ of size $a^{\text{in}} \times b^{\text{in}} \times d^{\text{in}}$.
$a^{\text{in}}$: height, $b^{\text{in}}$: width, $d^{\text{in}}$: # channels

(Figure: the input drawn as an $a^{\text{in}} \times b^{\text{in}} \times d^{\text{in}}$ box.)

Chih-Jen Lin (National Taiwan Univ.) 33 / 78

slide-34
SLIDE 34

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers II

The goal is to generate an output image $Z^{\text{out},i}$ of $d^{\text{out}}$ channels of $a^{\text{out}} \times b^{\text{out}}$ images.
Consider $d^{\text{out}}$ filters. Filter $j \in \{1, \dots, d^{\text{out}}\}$ has dimensions $h \times h \times d^{\text{in}}$:
$$\begin{bmatrix} w^j_{1,1,1} & \cdots & w^j_{1,h,1} \\ \vdots & & \vdots \\ w^j_{h,1,1} & \cdots & w^j_{h,h,1} \end{bmatrix} \; \cdots \; \begin{bmatrix} w^j_{1,1,d^{\text{in}}} & \cdots & w^j_{1,h,d^{\text{in}}} \\ \vdots & & \vdots \\ w^j_{h,1,d^{\text{in}}} & \cdots & w^j_{h,h,d^{\text{in}}} \end{bmatrix}.$$

Chih-Jen Lin (National Taiwan Univ.) 34 / 78

slide-35
SLIDE 35

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers III

$h$: filter height/width (layer index omitted)

(Figure: a $3 \times 3$ input channel with entries indexed $(1,1,1), \dots, (3,3,1)$, and the corresponding $2 \times 2$ output entries $s^{\text{out},i}_{1,1,j}$, $s^{\text{out},i}_{1,2,j}$, $s^{\text{out},i}_{2,1,j}$, $s^{\text{out},i}_{2,2,j}$.)

To compute the $j$th channel of the output, we scan the input from top-left to bottom-right to obtain the sub-images of size $h \times h \times d^{\text{in}}$.

Chih-Jen Lin (National Taiwan Univ.) 35 / 78

slide-36
SLIDE 36

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers IV

We then calculate the inner product between each sub-image and the $j$th filter. For example, if we start from the upper left corner of the input image, the first sub-image of channel $d$ is
$$\begin{bmatrix} z^i_{1,1,d} & \cdots & z^i_{1,h,d} \\ \vdots & & \vdots \\ z^i_{h,1,d} & \cdots & z^i_{h,h,d} \end{bmatrix}.$$

Chih-Jen Lin (National Taiwan Univ.) 36 / 78

slide-37
SLIDE 37

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers V

We then calculate
$$\sum_{d=1}^{d^{\text{in}}} \left\langle \begin{bmatrix} z^i_{1,1,d} & \cdots & z^i_{1,h,d} \\ \vdots & & \vdots \\ z^i_{h,1,d} & \cdots & z^i_{h,h,d} \end{bmatrix}, \begin{bmatrix} w^j_{1,1,d} & \cdots & w^j_{1,h,d} \\ \vdots & & \vdots \\ w^j_{h,1,d} & \cdots & w^j_{h,h,d} \end{bmatrix} \right\rangle + b_j, \qquad (3)$$
where $\langle \cdot, \cdot \rangle$ means the sum of component-wise products between two matrices. This value becomes the $(1, 1)$ position of channel $j$ of the output image.
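A direct sketch of (3) for the $(1, 1)$ output position, assuming the image is stored as a NumPy array z[a, b, d] and the filter as w[a, b, d]; sizes and values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
a_in, b_in, d_in, h = 5, 5, 3, 3               # made-up sizes
z = rng.standard_normal((a_in, b_in, d_in))    # z[a, b, d], 0-based indices
w = rng.standard_normal((h, h, d_in))          # filter j
b_j = 0.1

# Eq. (3): sum over channels of <top-left h x h sub-image, filter>, plus the bias.
val = sum(np.sum(z[0:h, 0:h, d] * w[:, :, d]) for d in range(d_in)) + b_j
# Equivalent single expression over all channels at once:
assert np.isclose(val, np.sum(z[0:h, 0:h, :] * w) + b_j)
print(val)
```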

Chih-Jen Lin (National Taiwan Univ.) 37 / 78

slide-38
SLIDE 38

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers VI

Next, we use other sub-images to produce values at other positions of the output image.
Let the stride $s$ be the number of pixels moved vertically or horizontally to get sub-images. For the $(2, 1)$ position of the output image, we move down $s$ pixels vertically to obtain the following sub-image:
$$\begin{bmatrix} z^i_{1+s,1,d} & \cdots & z^i_{1+s,h,d} \\ \vdots & & \vdots \\ z^i_{h+s,1,d} & \cdots & z^i_{h+s,h,d} \end{bmatrix}.$$

Chih-Jen Lin (National Taiwan Univ.) 38 / 78

slide-39
SLIDE 39

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers VII

The $(2, 1)$ position of channel $j$ of the output image is
$$\sum_{d=1}^{d^{\text{in}}} \left\langle \begin{bmatrix} z^i_{1+s,1,d} & \cdots & z^i_{1+s,h,d} \\ \vdots & & \vdots \\ z^i_{h+s,1,d} & \cdots & z^i_{h+s,h,d} \end{bmatrix}, \begin{bmatrix} w^j_{1,1,d} & \cdots & w^j_{1,h,d} \\ \vdots & & \vdots \\ w^j_{h,1,d} & \cdots & w^j_{h,h,d} \end{bmatrix} \right\rangle + b_j. \qquad (4)$$

Chih-Jen Lin (National Taiwan Univ.) 39 / 78

slide-40
SLIDE 40

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers VIII

The output image sizes $a^{\text{out}}$ and $b^{\text{out}}$ are, respectively, the numbers of positions to which we can move the filter vertically and horizontally:
$$a^{\text{out}} = \left\lfloor \frac{a^{\text{in}} - h}{s} \right\rfloor + 1, \qquad b^{\text{out}} = \left\lfloor \frac{b^{\text{in}} - h}{s} \right\rfloor + 1 \qquad (5)$$
Rationale of (5): vertically, the last row of each sub-image is
$$h, \; h + s, \; \dots, \; h + \Delta s \le a^{\text{in}}$$

Chih-Jen Lin (National Taiwan Univ.) 40 / 78

slide-41
SLIDE 41

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers IX

Thus $\Delta = \left\lfloor \dfrac{a^{\text{in}} - h}{s} \right\rfloor$.
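A quick sanity check of (5), assuming Python's integer division acts as the floor for these nonnegative values:

```python
def conv_output_size(a_in, b_in, h, s):
    """Eq. (5): how many times an h x h filter with stride s fits vertically/horizontally."""
    return (a_in - h) // s + 1, (b_in - h) // s + 1

# e.g. a 5 x 5 input with a 3 x 3 filter: stride 1 -> 3 x 3 output; stride 2 -> 2 x 2 output.
print(conv_output_size(5, 5, 3, 1))   # (3, 3)
print(conv_output_size(5, 5, 3, 2))   # (2, 2)
```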

Chih-Jen Lin (National Taiwan Univ.) 41 / 78

slide-42
SLIDE 42

Optimization problem for convolutional neural networks (CNN)

Matrix Operations I

For efficient implementations, we should conduct convolutional operations by matrix-matrix and matrix-vector operations We will go back to this issue later

Chih-Jen Lin (National Taiwan Univ.) 42 / 78

slide-43
SLIDE 43

Optimization problem for convolutional neural networks (CNN)

Matrix Operations II

Let's collect images of all channels as the input
$$Z^{\text{in},i} = \begin{bmatrix} z^i_{1,1,1} & z^i_{2,1,1} & \cdots & z^i_{a^{\text{in}}, b^{\text{in}}, 1} \\ \vdots & \vdots & \ddots & \vdots \\ z^i_{1,1,d^{\text{in}}} & z^i_{2,1,d^{\text{in}}} & \cdots & z^i_{a^{\text{in}}, b^{\text{in}}, d^{\text{in}}} \end{bmatrix} \in R^{d^{\text{in}} \times a^{\text{in}} b^{\text{in}}}.$$

Chih-Jen Lin (National Taiwan Univ.) 43 / 78

slide-44
SLIDE 44

Optimization problem for convolutional neural networks (CNN)

Matrix Operations III

Let all filters
$$W = \begin{bmatrix} w^1_{1,1,1} & w^1_{2,1,1} & \cdots & w^1_{h,h,d^{\text{in}}} \\ \vdots & \vdots & \ddots & \vdots \\ w^{d^{\text{out}}}_{1,1,1} & w^{d^{\text{out}}}_{2,1,1} & \cdots & w^{d^{\text{out}}}_{h,h,d^{\text{in}}} \end{bmatrix} \in R^{d^{\text{out}} \times hh d^{\text{in}}}$$
be the variables (parameters) of the current layer.

Chih-Jen Lin (National Taiwan Univ.) 44 / 78

slide-45
SLIDE 45

Optimization problem for convolutional neural networks (CNN)

Matrix Operations IV

Usually a bias term is considered:
$$b = \begin{bmatrix} b_1 \\ \vdots \\ b_{d^{\text{out}}} \end{bmatrix} \in R^{d^{\text{out}} \times 1}$$
Operations at a layer:
$$S^{\text{out},i} = W \phi(Z^{\text{in},i}) + b \mathbf{1}^T_{a^{\text{out}} b^{\text{out}}} \in R^{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}, \qquad (6)$$

Chih-Jen Lin (National Taiwan Univ.) 45 / 78

slide-46
SLIDE 46

Optimization problem for convolutional neural networks (CNN)

Matrix Operations V

where
$$\mathbf{1}_{a^{\text{out}} b^{\text{out}}} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \in R^{a^{\text{out}} b^{\text{out}} \times 1}.$$
$\phi(Z^{\text{in},i})$ collects all sub-images in $Z^{\text{in},i}$ into a matrix.

Chih-Jen Lin (National Taiwan Univ.) 46 / 78

slide-47
SLIDE 47

Optimization problem for convolutional neural networks (CNN)

Matrix Operations VI

Specifically,
$$\phi(Z^{\text{in},i}) = \begin{bmatrix} z^i_{1,1,1} & z^i_{1+s,1,1} & \cdots & z^i_{1+(a^{\text{out}}-1)s,\,1+(b^{\text{out}}-1)s,\,1} \\ z^i_{2,1,1} & z^i_{2+s,1,1} & \cdots & z^i_{2+(a^{\text{out}}-1)s,\,1+(b^{\text{out}}-1)s,\,1} \\ \vdots & \vdots & \ddots & \vdots \\ z^i_{h,h,1} & z^i_{h+s,h,1} & \cdots & z^i_{h+(a^{\text{out}}-1)s,\,h+(b^{\text{out}}-1)s,\,1} \\ \vdots & \vdots & & \vdots \\ z^i_{h,h,d^{\text{in}}} & z^i_{h+s,h,d^{\text{in}}} & \cdots & z^i_{h+(a^{\text{out}}-1)s,\,h+(b^{\text{out}}-1)s,\,d^{\text{in}}} \end{bmatrix} \in R^{hh d^{\text{in}} \times a^{\text{out}} b^{\text{out}}}$$
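A loop-based sketch of $\phi$ and of (6), assuming the layout of slides 43–47 ($Z$ stored as a $d^{\text{in}} \times a^{\text{in}} b^{\text{in}}$ matrix with pixel $(a, b)$ in column $(b-1) a^{\text{in}} + a$, and rows/columns of $\phi(Z)$ ordered as above). The function name `phi`, the sizes, and the values are made up; a real implementation would avoid the Python loops.

```python
import numpy as np

def phi(Z, a_in, b_in, h, s):
    """Collect all h x h x d_in sub-images of Z (d_in x a_in*b_in) as columns."""
    d_in = Z.shape[0]
    a_out, b_out = (a_in - h) // s + 1, (b_in - h) // s + 1
    cols = []
    for q in range(b_out):              # horizontal position of the sub-image
        for p in range(a_out):          # vertical position (changes fastest)
            col = []
            for d in range(d_in):       # channel is the slowest index within a column
                for v in range(h):      # column inside the sub-image
                    for u in range(h):  # row inside the sub-image (fastest)
                        col.append(Z[d, (q * s + v) * a_in + (p * s + u)])
            cols.append(col)
    return np.array(cols).T             # (h*h*d_in) x (a_out*b_out)

rng = np.random.default_rng(0)
a_in, b_in, d_in, d_out, h, s = 4, 4, 2, 3, 2, 2      # made-up sizes
Z = rng.standard_normal((d_in, a_in * b_in))
W = rng.standard_normal((d_out, h * h * d_in))
b = rng.standard_normal((d_out, 1))

a_out, b_out = (a_in - h) // s + 1, (b_in - h) // s + 1
S = W @ phi(Z, a_in, b_in, h, s) + b @ np.ones((1, a_out * b_out))   # Eq. (6)
print(S.shape)   # (d_out, a_out*b_out) = (3, 4)
```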

Chih-Jen Lin (National Taiwan Univ.) 47 / 78

slide-48
SLIDE 48

Optimization problem for convolutional neural networks (CNN)

Activation Function I

Next, an activation function scales each element of $S^{\text{out},i}$ to obtain the output matrix $Z^{\text{out},i}$:
$$Z^{\text{out},i} = \sigma(S^{\text{out},i}) \in R^{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}. \qquad (7)$$
For CNN, commonly the ReLU activation function
$$\sigma(x) = \max(x, 0) \qquad (8)$$
is used. Later we need $\sigma(x)$ to be differentiable, but the ReLU function is not.

Chih-Jen Lin (National Taiwan Univ.) 48 / 78

slide-49
SLIDE 49

Optimization problem for convolutional neural networks (CNN)

Activation Function II

Past works such as Krizhevsky et al. (2012) assume
$$\sigma'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
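A one-line sketch of (8) and of this assumed derivative in NumPy (note that $x = 0$ is put in the "otherwise" case):

```python
import numpy as np

def relu(x):
    """Eq. (8): sigma(x) = max(x, 0), applied element-wise."""
    return np.maximum(x, 0.0)

def relu_grad(x):
    """The assumed derivative: 1 where x > 0, and 0 otherwise (including x = 0)."""
    return (x > 0).astype(float)

x = np.array([-1.5, 0.0, 2.0])
print(relu(x), relu_grad(x))    # [0. 0. 2.]  [0. 0. 1.]
```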

Chih-Jen Lin (National Taiwan Univ.) 49 / 78

slide-50
SLIDE 50

Optimization problem for convolutional neural networks (CNN)

The Function φ(Z in,i) I

In the matrix-matrix product $W \phi(Z^{\text{in},i})$, each element is the inner product between a filter and a sub-image.
We need to represent $\phi(Z^{\text{in},i})$ in an explicit form. This is important for subsequent calculations.
Clearly $\phi$ is a linear mapping, so there exists a 0/1 matrix $P_\phi$ such that
$$\phi(Z^{\text{in},i}) \equiv \mathrm{mat}\left(P_\phi \mathrm{vec}(Z^{\text{in},i})\right)_{hh d^{\text{in}} \times a^{\text{out}} b^{\text{out}}}, \quad \forall i, \qquad (9)$$

Chih-Jen Lin (National Taiwan Univ.) 50 / 78

slide-51
SLIDE 51

Optimization problem for convolutional neural networks (CNN)

The Function φ(Z in,i) II

$\mathrm{vec}(M)$: all of $M$'s columns concatenated into a vector $v$:
$$\mathrm{vec}(M) = \begin{bmatrix} M_{:,1} \\ \vdots \\ M_{:,b} \end{bmatrix} \in R^{ab \times 1}, \quad \text{where } M \in R^{a \times b}$$
$\mathrm{mat}(v)$ is the inverse of $\mathrm{vec}(M)$:
$$\mathrm{mat}(v)_{a \times b} = \begin{bmatrix} v_1 & \cdots & v_{(b-1)a+1} \\ \vdots & \ddots & \vdots \\ v_a & \cdots & v_{ba} \end{bmatrix} \in R^{a \times b}, \qquad (10)$$

Chih-Jen Lin (National Taiwan Univ.) 51 / 78

slide-52
SLIDE 52

Optimization problem for convolutional neural networks (CNN)

The Function φ(Z in,i) III

where $v \in R^{ab \times 1}$.
$P_\phi$ is a huge matrix:
$$P_\phi \in R^{hh d^{\text{in}} a^{\text{out}} b^{\text{out}} \times d^{\text{in}} a^{\text{in}} b^{\text{in}}}, \qquad \phi : R^{d^{\text{in}} \times a^{\text{in}} b^{\text{in}}} \to R^{hh d^{\text{in}} \times a^{\text{out}} b^{\text{out}}}$$
Later we will check implementation details. Past works using the form (9) include, for example, Vedaldi and Lenc (2015).
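Since $\phi$ is linear, one way to materialize $P_\phi$ for a small example is to apply $\phi$ to every standard basis vector of $\mathrm{vec}(Z^{\text{in}})$: column $j$ of $P_\phi$ is $\mathrm{vec}(\phi(\mathrm{mat}(e_j)))$. The sketch below assumes the sub-image ordering of slide 47 and uses made-up sizes; practical implementations would instead build $P_\phi$ (or avoid forming it) through index arithmetic.

```python
import numpy as np

def phi(Z, a_in, b_in, h, s):
    # Same sub-image gathering as in the earlier sketch (slide 47 ordering).
    d_in = Z.shape[0]
    a_out, b_out = (a_in - h) // s + 1, (b_in - h) // s + 1
    cols = [[Z[d, (q*s+v)*a_in + (p*s+u)]
             for d in range(d_in) for v in range(h) for u in range(h)]
            for q in range(b_out) for p in range(a_out)]
    return np.array(cols).T

def vec(M):
    return M.reshape(-1, order='F')          # stack columns, as in (10)

def mat(v, a, b):
    return v.reshape((a, b), order='F')      # inverse of vec

a_in, b_in, d_in, h, s = 4, 4, 2, 2, 2       # made-up sizes
a_out, b_out = (a_in - h) // s + 1, (b_in - h) // s + 1

# Column j of P_phi is vec(phi(mat(e_j))): probe the linear map with basis vectors.
n_in = d_in * a_in * b_in
P_phi = np.zeros((h * h * d_in * a_out * b_out, n_in))
for j in range(n_in):
    e = np.zeros(n_in); e[j] = 1.0
    P_phi[:, j] = vec(phi(mat(e, d_in, a_in * b_in), a_in, b_in, h, s))

# Check Eq. (9) on a random input.
Z = np.random.default_rng(0).standard_normal((d_in, a_in * b_in))
lhs = phi(Z, a_in, b_in, h, s)
rhs = mat(P_phi @ vec(Z), h * h * d_in, a_out * b_out)
print(np.allclose(lhs, rhs))                 # True
```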

Chih-Jen Lin (National Taiwan Univ.) 52 / 78

slide-53
SLIDE 53

Optimization problem for convolutional neural networks (CNN)

Optimization Problem I

We collect all weights into a vector variable $\theta$:
$$\theta = \begin{bmatrix} \mathrm{vec}(W^1) \\ b^1 \\ \vdots \\ \mathrm{vec}(W^L) \\ b^L \end{bmatrix} \in R^n, \qquad n: \text{total \# variables}$$
The output of the last layer $L$ is a vector $z^{L+1,i}(\theta)$. Consider any loss function, such as the squared loss
$$\xi_i(\theta) = \|z^{L+1,i}(\theta) - \boldsymbol{y}_i\|^2.$$

Chih-Jen Lin (National Taiwan Univ.) 53 / 78

slide-54
SLIDE 54

Optimization problem for convolutional neural networks (CNN)

Optimization Problem II

The optimization problem is
$$\min_\theta f(\theta), \qquad \text{where } f(\theta) = \frac{1}{2C} \theta^T \theta + \frac{1}{l} \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); \boldsymbol{y}_i, Z^{1,i})$$
$C$: regularization parameter.
The formulation is almost the same as that for fully-connected networks.

Chih-Jen Lin (National Taiwan Univ.) 54 / 78

slide-55
SLIDE 55

Optimization problem for convolutional neural networks (CNN)

Optimization Problem III

Note that we divide the sum of training losses by the number of training data. Thus the second term becomes the average training loss.
With the optimization problem written down, there is still a long way to a real implementation.
Further, CNN involves additional operations in practice: padding and pooling. We will explain them.

Chih-Jen Lin (National Taiwan Univ.) 55 / 78

slide-56
SLIDE 56

Optimization problem for convolutional neural networks (CNN)

Zero Padding I

To better control the size of the output image, before the convolutional operation we may enlarge the input image to have zero values around the border. This technique is called zero-padding in CNN training. An illustration:

Chih-Jen Lin (National Taiwan Univ.) 56 / 78

slide-57
SLIDE 57

Optimization problem for convolutional neural networks (CNN)

Zero Padding II

(Figure: an input image of size $a^{\text{in}} \times b^{\text{in}}$ surrounded by a border of zeros of width $p$ on every side.)

Chih-Jen Lin (National Taiwan Univ.) 57 / 78

slide-58
SLIDE 58

Optimization problem for convolutional neural networks (CNN)

Zero Padding III

The size of the new image is changed from $a^{\text{in}} \times b^{\text{in}}$ to $(a^{\text{in}} + 2p) \times (b^{\text{in}} + 2p)$, where $p$ is specified by users.
The operation can be treated as a layer mapping an input $Z^{\text{in},i}$ to an output $Z^{\text{out},i}$. Let $d^{\text{out}} = d^{\text{in}}$.

Chih-Jen Lin (National Taiwan Univ.) 58 / 78

slide-59
SLIDE 59

Optimization problem for convolutional neural networks (CNN)

Zero Padding IV

There exists a 0/1 matrix $P_{\text{pad}} \in R^{d^{\text{out}} a^{\text{out}} b^{\text{out}} \times d^{\text{in}} a^{\text{in}} b^{\text{in}}}$ so that the padding operation can be represented by
$$Z^{\text{out},i} \equiv \mathrm{mat}\left(P_{\text{pad}} \mathrm{vec}(Z^{\text{in},i})\right)_{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}. \qquad (11)$$
Implementation details will be discussed later.
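A sketch of the padding operation (11) on the $d \times a^{\text{in}} b^{\text{in}}$ matrix layout used earlier; the reshapes recover each channel as an image, pad it with $p$ zeros on every side, and flatten it back. The helper name and sizes are made up.

```python
import numpy as np

def zero_pad(Z, a_in, b_in, p):
    """Pad each channel of Z (d x a_in*b_in, pixel (a,b) in column (b-1)*a_in + a) with p zeros."""
    d = Z.shape[0]
    a_out, b_out = a_in + 2 * p, b_in + 2 * p
    out = np.zeros((d, a_out * b_out))
    for c in range(d):
        img = Z[c].reshape(b_in, a_in).T       # back to an a_in x b_in image
        out[c] = np.pad(img, p).T.reshape(-1)  # pad, then flatten column by column again
    return out

Z = np.arange(6.0).reshape(1, 6)               # one channel, 2 x 3 image (a_in=2, b_in=3)
print(zero_pad(Z, a_in=2, b_in=3, p=1).reshape(5, 4).T)  # the 4 x 5 zero-padded image
```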

Chih-Jen Lin (National Taiwan Univ.) 59 / 78

slide-60
SLIDE 60

Optimization problem for convolutional neural networks (CNN)

Pooling I

To reduce the computational cost, a dimension reduction is often applied by a pooling step after convolutional operations. Usually we consider an operation that can (approximately) extract rotational or translational invariance features. Examples: average pooling, max pooling, and stochastic pooling. Let's consider max pooling as an illustration.

Chih-Jen Lin (National Taiwan Univ.) 60 / 78

slide-61
SLIDE 61

Optimization problem for convolutional neural networks (CNN)

Pooling II

An example:
Image A:
$$\begin{bmatrix} 2 & 3 & 6 & 8 \\ 5 & 4 & 9 & 7 \\ 1 & 2 & 6 & 0 \\ 4 & 3 & 2 & 1 \end{bmatrix} \to \begin{bmatrix} 5 & 9 \\ 4 & 6 \end{bmatrix}$$
Image B:
$$\begin{bmatrix} 3 & 2 & 3 & 6 \\ 4 & 5 & 4 & 9 \\ 2 & 1 & 2 & 6 \\ 3 & 4 & 3 & 2 \end{bmatrix} \to \begin{bmatrix} 5 & 9 \\ 4 & 6 \end{bmatrix}$$

Chih-Jen Lin (National Taiwan Univ.) 61 / 78

slide-62
SLIDE 62

Optimization problem for convolutional neural networks (CNN)

Pooling III

B is derived by shifting A by 1 pixel in the horizontal direction. We split the two images into four $2 \times 2$ sub-images and choose the max value from each sub-image. Because only some elements in each sub-image are changed, the maximal value is likely the same or similar. This is called translational invariance. For our example, the two output images from A and B are the same.
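A quick numerical check of this example, assuming NumPy; the non-overlapping $2 \times 2$ max pooling is done with a reshape, and the helper name is made up.

```python
import numpy as np

def max_pool2x2(img):
    """Non-overlapping 2 x 2 max pooling of a 2-D image."""
    a, b = img.shape
    return img.reshape(a // 2, 2, b // 2, 2).max(axis=(1, 3))

A = np.array([[2, 3, 6, 8],
              [5, 4, 9, 7],
              [1, 2, 6, 0],
              [4, 3, 2, 1]])
B = np.array([[3, 2, 3, 6],          # A shifted right by one pixel
              [4, 5, 4, 9],
              [2, 1, 2, 6],
              [3, 4, 3, 2]])
print(max_pool2x2(A))                                    # [[5 9] [4 6]]
print(np.array_equal(max_pool2x2(A), max_pool2x2(B)))    # True
```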

Chih-Jen Lin (National Taiwan Univ.) 62 / 78

slide-63
SLIDE 63

Optimization problem for convolutional neural networks (CNN)

Pooling IV

For mathematical representation, we consider the operation as a layer mapping an input $Z^{\text{in},i}$ to an output $Z^{\text{out},i}$.
In practice, pooling is considered an operation at the end of the convolutional layer. We partition every channel of $Z^{\text{in},i}$ into non-overlapping sub-regions by $h \times h$ filters. Because the sub-regions are disjoint, the stride $s$ for sliding the filters is equal to $h$.

Chih-Jen Lin (National Taiwan Univ.) 63 / 78

slide-64
SLIDE 64

Optimization problem for convolutional neural networks (CNN)

Pooling V

This partition step is a special case of how we generate sub-images in convolutional operations. By the same definition as (9) we can generate the matrix
$$\phi(Z^{\text{in},i}) = \mathrm{mat}\left(P_\phi \mathrm{vec}(Z^{\text{in},i})\right)_{hh \times d^{\text{out}} a^{\text{out}} b^{\text{out}}}, \qquad (12)$$
where
$$a^{\text{out}} = \left\lfloor \frac{a^{\text{in}}}{h} \right\rfloor, \quad b^{\text{out}} = \left\lfloor \frac{b^{\text{in}}}{h} \right\rfloor, \quad d^{\text{out}} = d^{\text{in}}. \qquad (13)$$

Chih-Jen Lin (National Taiwan Univ.) 64 / 78

slide-65
SLIDE 65

Optimization problem for convolutional neural networks (CNN)

Pooling VI

This is the same as the calculation in (5), since
$$\left\lfloor \frac{a^{\text{in}} - h}{h} \right\rfloor + 1 = \left\lfloor \frac{a^{\text{in}}}{h} \right\rfloor$$
Note that here we consider $hh \times d^{\text{out}} a^{\text{out}} b^{\text{out}}$ rather than $hh d^{\text{out}} \times a^{\text{out}} b^{\text{out}}$ because we can then do a max operation on each column.

Chih-Jen Lin (National Taiwan Univ.) 65 / 78

slide-66
SLIDE 66

Optimization problem for convolutional neural networks (CNN)

Pooling VII

To select the largest element of each sub-region, there exists a 0/1 matrix
$$M^i \in R^{d^{\text{out}} a^{\text{out}} b^{\text{out}} \times hh d^{\text{out}} a^{\text{out}} b^{\text{out}}}$$
so that each row of $M^i$ selects a single element from $\mathrm{vec}(\phi(Z^{\text{in},i}))$. Therefore,
$$Z^{\text{out},i} = \mathrm{mat}\left(M^i \mathrm{vec}(\phi(Z^{\text{in},i}))\right)_{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}. \qquad (14)$$

Chih-Jen Lin (National Taiwan Univ.) 66 / 78

slide-67
SLIDE 67

Optimization problem for convolutional neural networks (CNN)

Pooling VIII

A comparison with (6) shows that $M^i$ plays a role similar to the weight matrix $W$.
While $M^i$ is 0/1, it is not a constant: the positions of its 1's depend on the values of $\phi(Z^{\text{in},i})$.
By combining (12) and (14), we have
$$Z^{\text{out},i} = \mathrm{mat}\left(P^i_{\text{pool}} \mathrm{vec}(Z^{\text{in},i})\right)_{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}, \qquad (15)$$
where
$$P^i_{\text{pool}} = M^i P_\phi \in R^{d^{\text{out}} a^{\text{out}} b^{\text{out}} \times d^{\text{in}} a^{\text{in}} b^{\text{in}}}. \qquad (16)$$

Chih-Jen Lin (National Taiwan Univ.) 67 / 78

slide-68
SLIDE 68

Optimization problem for convolutional neural networks (CNN)

Summary of a Convolutional Layer I

For implementation, padding and pooling are (optional) parts of the convolutional layers. We discuss the details of considering all operations together.
The whole convolutional layer involves the following procedure:
$$Z^{m,i} \to \text{padding by (11)} \to \text{convolutional operations by (6), (7)} \to \text{pooling by (15)} \to Z^{m+1,i}, \qquad (17)$$

Chih-Jen Lin (National Taiwan Univ.) 68 / 78

slide-69
SLIDE 69

Optimization problem for convolutional neural networks (CNN)

Summary of a Convolutional Layer II

where $Z^{m,i}$ and $Z^{m+1,i}$ are the input and output of the $m$th layer, respectively.
Let the following symbols denote image sizes at different stages of the convolutional layer:
$a^m, b^m$: size in the beginning
$a^m_{\text{pad}}, b^m_{\text{pad}}$: size after padding
$a^m_{\text{conv}}, b^m_{\text{conv}}$: size after convolution
The following table indicates what $a^{\text{in}}, b^{\text{in}}, d^{\text{in}}$ and $a^{\text{out}}, b^{\text{out}}, d^{\text{out}}$ are at each stage.

Chih-Jen Lin (National Taiwan Univ.) 69 / 78

slide-70
SLIDE 70

Optimization problem for convolutional neural networks (CNN)

Summary of a Convolutional Layer III

Operation          | Input            | Output
Padding: (11)      | Z^{m,i}          | pad(Z^{m,i})
Convolution: (6)   | pad(Z^{m,i})     | S^{m,i}
Convolution: (7)   | S^{m,i}          | sigma(S^{m,i})
Pooling: (15)      | sigma(S^{m,i})   | Z^{m+1,i}

Operation          | a^in, b^in, d^in              | a^out, b^out, d^out
Padding: (11)      | a^m, b^m, d^m                 | a^m_pad, b^m_pad, d^m
Convolution: (6)   | a^m_pad, b^m_pad, d^m         | a^m_conv, b^m_conv, d^{m+1}
Convolution: (7)   | a^m_conv, b^m_conv, d^{m+1}   | a^m_conv, b^m_conv, d^{m+1}
Pooling: (15)      | a^m_conv, b^m_conv, d^{m+1}   | a^{m+1}, b^{m+1}, d^{m+1}

Chih-Jen Lin (National Taiwan Univ.) 70 / 78

slide-71
SLIDE 71

Optimization problem for convolutional neural networks (CNN)

Summary of a Convolutional Layer IV

Let the filter size, mapping matrices, and weight matrices at the $m$th layer be
$$h^m, \; P^m_{\text{pad}}, \; P^m_\phi, \; P^{m,i}_{\text{pool}}, \; W^m, \; b^m.$$
From (11), (6), (7), (15), all operations can be summarized as
$$S^{m,i} = W^m \mathrm{mat}\left(P^m_\phi P^m_{\text{pad}} \mathrm{vec}(Z^{m,i})\right)_{h^m h^m d^m \times a^m_{\text{conv}} b^m_{\text{conv}}} + b^m \mathbf{1}^T_{a^m_{\text{conv}} b^m_{\text{conv}}}$$
$$Z^{m+1,i} = \mathrm{mat}\left(P^{m,i}_{\text{pool}} \mathrm{vec}(\sigma(S^{m,i}))\right)_{d^{m+1} \times a^{m+1} b^{m+1}}, \qquad (18)$$
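An operational sketch of the whole procedure (17): padding, convolution (6)–(7) with ReLU, then non-overlapping max pooling. For readability it works on 3-D arrays $z[a, b, d]$ with explicit loops rather than the $\mathrm{vec}/\mathrm{mat}$ form of (18); all names and sizes are made up.

```python
import numpy as np

def conv_layer(Z, W, b, pad, stride, pool):
    """Z: (a, b, d_in); W: (d_out, h, h, d_in); b: (d_out,). Returns the pooled output."""
    Zp = np.pad(Z, ((pad, pad), (pad, pad), (0, 0)))          # padding, as in (11)
    a_in, b_in, _ = Zp.shape
    d_out, h = W.shape[0], W.shape[1]
    a_c = (a_in - h) // stride + 1                            # sizes after convolution, Eq. (5)
    b_c = (b_in - h) // stride + 1
    S = np.zeros((a_c, b_c, d_out))
    for j in range(d_out):                                    # convolution, Eq. (3)/(6)
        for p in range(a_c):
            for q in range(b_c):
                sub = Zp[p*stride:p*stride+h, q*stride:q*stride+h, :]
                S[p, q, j] = np.sum(sub * W[j]) + b[j]
    Zc = np.maximum(S, 0)                                     # activation, Eq. (7)/(8)
    a_p, b_p = a_c // pool, b_c // pool                       # non-overlapping max pooling
    return Zc[:a_p*pool, :b_p*pool, :].reshape(a_p, pool, b_p, pool, d_out).max(axis=(1, 3))

rng = np.random.default_rng(0)
Z = rng.standard_normal((6, 6, 3))                            # made-up 6 x 6 input with 3 channels
W = rng.standard_normal((4, 3, 3, 3))                         # 4 filters of size 3 x 3 x 3
b = rng.standard_normal(4)
print(conv_layer(Z, W, b, pad=1, stride=1, pool=2).shape)     # (3, 3, 4)
```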

Chih-Jen Lin (National Taiwan Univ.) 71 / 78

slide-72
SLIDE 72

Optimization problem for convolutional neural networks (CNN)

Fully-Connected Layer I

Assume $L^c$ is the number of convolutional layers. The input vector of the first fully-connected layer is
$$z^{m,i} = \mathrm{vec}(Z^{m,i}), \quad i = 1, \dots, l, \; m = L^c + 1.$$
In each of the fully-connected layers ($L^c < m \le L$), we consider the weight matrix and bias vector between layers $m$ and $m + 1$.

Chih-Jen Lin (National Taiwan Univ.) 72 / 78

slide-73
SLIDE 73

Optimization problem for convolutional neural networks (CNN)

Fully-Connected Layer II

Weight matrix:
$$W^m = \begin{bmatrix} w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\ w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\ \vdots & \vdots & \ddots & \vdots \\ w^m_{n_{m+1} 1} & w^m_{n_{m+1} 2} & \cdots & w^m_{n_{m+1} n_m} \end{bmatrix}_{n_{m+1} \times n_m} \qquad (19)$$
Bias vector:
$$b^m = \begin{bmatrix} b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_{m+1}} \end{bmatrix}_{n_{m+1} \times 1}$$

Chih-Jen Lin (National Taiwan Univ.) 73 / 78

slide-74
SLIDE 74

Optimization problem for convolutional neural networks (CNN)

Fully-Connected Layer III

Here $n_m$ and $n_{m+1}$ are the numbers of nodes in layers $m$ and $m + 1$, respectively.
If $z^{m,i} \in R^{n_m}$ is the input vector, the following operations are applied to generate the output vector $z^{m+1,i} \in R^{n_{m+1}}$:
$$s^{m,i} = W^m z^{m,i} + b^m, \qquad (20)$$
$$z^{m+1,i}_j = \sigma(s^{m,i}_j), \quad j = 1, \dots, n_{m+1}. \qquad (21)$$

Chih-Jen Lin (National Taiwan Univ.) 74 / 78

slide-75
SLIDE 75

Discussion

Outline

1. Regularized linear classification
2. Optimization problem for fully-connected networks
3. Optimization problem for convolutional neural networks (CNN)
4. Discussion

Chih-Jen Lin (National Taiwan Univ.) 75 / 78

slide-76
SLIDE 76

Discussion

Challenges in NN Optimization

The objective function is non-convex. It may have many local minima. It's known that global optimization is much more difficult than local minimization. The problem structure is very complicated. In this course we will have first-hand experience handling these difficulties.

Chih-Jen Lin (National Taiwan Univ.) 76 / 78

slide-77
SLIDE 77

Discussion

Formulation I

We have written all CNN operations in matrix/vector forms. This is useful in deriving the gradient. Are our representation symbols good enough? Can we do better? You can say that this is only a matter of notation, but given the wide use of CNN, a good formulation can be extremely useful.

Chih-Jen Lin (National Taiwan Univ.) 77 / 78

slide-78
SLIDE 78

Discussion

References I

M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105, 2012.

D. Meyer, F. Leisch, and K. Hornik. The support vector machine under test. Neurocomputing, 55:169–186, 2003.

A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 689–692, 2015.

C.-C. Wang, K.-L. Tan, C.-T. Chen, Y.-H. Lin, S. S. Keerthi, D. Mahajan, S. Sundararajan, and C.-J. Lin. Distributed Newton methods for deep learning. Neural Computation, 30(6):1673–1724, 2018. URL http://www.csie.ntu.edu.tw/~cjlin/papers/dnn/dsh.pdf.

Chih-Jen Lin (National Taiwan Univ.) 78 / 78