

slide-1
SLIDE 1

Optimization Problems for Neural Networks

Chih-Jen Lin

National Taiwan University Last updated: May 25, 2020

Chih-Jen Lin (National Taiwan Univ.) 1 / 78

slide-2
SLIDE 2

Outline

1. Regularized linear classification
2. Optimization problem for fully-connected networks
3. Optimization problem for convolutional neural networks (CNN)
4. Discussion

Chih-Jen Lin (National Taiwan Univ.) 2 / 78

slide-3
SLIDE 3

Regularized linear classification

Outline

1. Regularized linear classification
2. Optimization problem for fully-connected networks
3. Optimization problem for convolutional neural networks (CNN)
4. Discussion

Chih-Jen Lin (National Taiwan Univ.) 3 / 78

slide-4
SLIDE 4

Regularized linear classification

Minimizing Training Errors

Basically, a classification method starts with minimizing the training errors:

$$\min_{\text{model}} \; (\text{training errors})$$

That is, all or most training data with labels should be correctly classified by our model. A model can be a decision tree, a neural network, or other types.

Chih-Jen Lin (National Taiwan Univ.) 4 / 78

slide-5
SLIDE 5

Regularized linear classification

Minimizing Training Errors (Cont’d)

For simplicity, let's consider the model to be a vector $w$. That is, the decision function is $\mathrm{sgn}(w^T x)$. For any data $x$, the predicted label is

$$\begin{cases} 1 & \text{if } w^T x \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

Chih-Jen Lin (National Taiwan Univ.) 5 / 78

slide-6
SLIDE 6

Regularized linear classification

Minimizing Training Errors (Cont’d)

The two-dimensional situation

(Figure: two groups of points in the plane separated by the line $w^T x = 0$.)

This seems to be quite restricted, but practically $x$ is in a much higher dimensional space.

Chih-Jen Lin (National Taiwan Univ.) 6 / 78

slide-7
SLIDE 7

Regularized linear classification

Minimizing Training Errors (Cont’d)

To characterize the training error, we need a loss function $\xi(w; y, x)$ for each instance $(y, x)$, where $y = \pm 1$ is the label and $x$ is the feature vector. Ideally we should use the 0–1 training loss:

$$\xi(w; y, x) = \begin{cases} 1 & \text{if } y\, w^T x < 0, \\ 0 & \text{otherwise} \end{cases}$$

Chih-Jen Lin (National Taiwan Univ.) 7 / 78

slide-8
SLIDE 8

Regularized linear classification

Minimizing Training Errors (Cont’d)

However, this function is discontinuous, so the optimization problem becomes difficult.

(Figure: the 0–1 loss $\xi(w; y, x)$ plotted against $-y\, w^T x$.)

We need continuous approximations.

Chih-Jen Lin (National Taiwan Univ.) 8 / 78

slide-9
SLIDE 9

Regularized linear classification

Common Loss Functions

Hinge loss ($l_1$ loss):
$$\xi_{L1}(w; y, x) \equiv \max(0, 1 - y\, w^T x) \qquad (1)$$
Logistic loss:
$$\xi_{LR}(w; y, x) \equiv \log(1 + e^{-y w^T x}) \qquad (2)$$
Support vector machines (SVM): Eq. (1). Logistic regression (LR): Eq. (2).
SVM and LR are two very fundamental classification methods.
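For concreteness, here is a minimal numerical sketch of the two losses (1) and (2) in Python/NumPy; the vectors and helper names below are made up only for illustration.

```python
import numpy as np

def hinge_loss(w, y, x):
    """Hinge (l1) loss of Eq. (1): max(0, 1 - y w^T x)."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def logistic_loss(w, y, x):
    """Logistic loss of Eq. (2): log(1 + exp(-y w^T x))."""
    return np.log1p(np.exp(-y * np.dot(w, x)))

# A made-up instance: both losses are small when y * w^T x is large and positive.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, 1.5])
for y in (+1, -1):
    print(y, hinge_loss(w, y, x), logistic_loss(w, y, x))
```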

Chih-Jen Lin (National Taiwan Univ.) 9 / 78

slide-10
SLIDE 10

Regularized linear classification

Common Loss Functions (Cont’d)

(Figure: the hinge loss $\xi_{L1}$ and logistic loss $\xi_{LR}$ plotted against $-y\, w^T x$.)

Logistic regression is closely related to SVM. Their performance is usually similar.

Chih-Jen Lin (National Taiwan Univ.) 10 / 78

slide-11
SLIDE 11

Regularized linear classification

Common Loss Functions (Cont’d)

However, minimizing training losses may not give a good model for future prediction. Overfitting occurs.

Chih-Jen Lin (National Taiwan Univ.) 11 / 78

slide-12
SLIDE 12

Regularized linear classification

Overfitting

See the illustration in the next slide. For classification, you can easily achieve 100% training accuracy, but this is useless. When training on a data set, we should avoid underfitting (to get a small training error) and avoid overfitting (to get a small testing error).

Chih-Jen Lin (National Taiwan Univ.) 12 / 78

slide-13
SLIDE 13

Regularized linear classification

(Figure: ● and ▲: training data; ○ and △: testing data.)

Chih-Jen Lin (National Taiwan Univ.) 13 / 78

slide-14
SLIDE 14

Regularized linear classification

Regularization

To minimize the training error we manipulate the $w$ vector so that it fits the data. To avoid overfitting we need a way to make $w$'s values less extreme. One idea is to make $w$'s values closer to zero. We can add, for example,
$$\frac{w^T w}{2} \quad \text{or} \quad \|w\|_1$$
to the function that is minimized.

Chih-Jen Lin (National Taiwan Univ.) 14 / 78

slide-15
SLIDE 15

Regularized linear classification

General Form of Linear Classification

Training data $\{y_i, x_i\}$, $x_i \in R^n$, $i = 1, \dots, l$, $y_i = \pm 1$
$l$: # of data, $n$: # of features

$$\min_{w} f(w), \qquad f(w) \equiv \frac{w^T w}{2} + C \sum_{i=1}^{l} \xi(w; y_i, x_i)$$

$w^T w / 2$: regularization term
$\xi(w; y, x)$: loss function
$C$: regularization parameter (chosen by users)
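As a sketch, this objective can be evaluated directly. The snippet below assumes NumPy, uses the logistic loss (2), and generates random data purely for illustration.

```python
import numpy as np

def f(w, X, y, C):
    """f(w) = w^T w / 2 + C * sum_i log(1 + exp(-y_i w^T x_i)), i.e., the logistic loss case."""
    margins = y * (X @ w)                  # y_i * w^T x_i for all i
    reg = 0.5 * np.dot(w, w)               # regularization term
    loss = np.sum(np.log1p(np.exp(-margins)))
    return reg + C * loss

rng = np.random.default_rng(0)
l, n = 20, 5                               # made-up sizes: l instances, n features
X = rng.standard_normal((l, n))            # rows are x_i^T
y = np.where(rng.standard_normal(l) > 0, 1.0, -1.0)
print(f(np.zeros(n), X, y, C=1.0))         # at w = 0 every loss term is log(2)
```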

Chih-Jen Lin (National Taiwan Univ.) 15 / 78

slide-16
SLIDE 16

Optimization problem for fully-connected networks

Outline

1. Regularized linear classification
2. Optimization problem for fully-connected networks
3. Optimization problem for convolutional neural networks (CNN)
4. Discussion

Chih-Jen Lin (National Taiwan Univ.) 16 / 78

slide-17
SLIDE 17

Optimization problem for fully-connected networks

Multi-class Classification I

Our training set includes $(\boldsymbol{y}_i, x_i)$, $i = 1, \dots, l$. $x_i \in R^{n_1}$ is the feature vector and $\boldsymbol{y}_i \in R^K$ is the label vector. As the label is now a vector, we change (label, instance) from $(y_i, x_i)$ to $(\boldsymbol{y}_i, x_i)$.
$K$: # of classes
If $x_i$ is in class $k$, then
$$\boldsymbol{y}_i = [\underbrace{0, \dots, 0}_{k-1}, 1, 0, \dots, 0]^T \in R^K$$

Chih-Jen Lin (National Taiwan Univ.) 17 / 78

slide-18
SLIDE 18

Optimization problem for fully-connected networks

Multi-class Classification II

A neural network maps each feature vector to one of the class labels by the connection of nodes.

Chih-Jen Lin (National Taiwan Univ.) 18 / 78

slide-19
SLIDE 19

Optimization problem for fully-connected networks

Fully-connected Networks

Between two layers a weight matrix maps inputs (the previous layer) to outputs (the next layer).

(Figure: nodes of consecutive layers, e.g. A1, B1, C1 in one layer fully connected to A2, B2, ... in the next.)

Chih-Jen Lin (National Taiwan Univ.) 19 / 78

slide-20
SLIDE 20

Optimization problem for fully-connected networks

Operations Between Two Layers I

The weight matrix $W^m$ at the $m$th layer is
$$W^m = \begin{bmatrix} w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\ w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\ \vdots & \vdots & \ddots & \vdots \\ w^m_{n_{m+1} 1} & w^m_{n_{m+1} 2} & \cdots & w^m_{n_{m+1} n_m} \end{bmatrix}_{n_{m+1} \times n_m}$$
$n_m$: # input features at layer $m$
$n_{m+1}$: # output features at layer $m$, or # input features at layer $m + 1$
$L$: number of layers

Chih-Jen Lin (National Taiwan Univ.) 20 / 78

slide-21
SLIDE 21

Optimization problem for fully-connected networks

Operations Between Two Layers II

$n_1$ = # of features, $n_{L+1}$ = # of classes
Let $z^m$ be the input of the $m$th layer, $z^1 = x$, and $z^{L+1}$ be the output.
From the $m$th layer to the $(m+1)$th layer:
$$s^m = W^m z^m, \qquad z^{m+1}_j = \sigma(s^m_j), \; j = 1, \dots, n_{m+1},$$
where $\sigma(\cdot)$ is the activation function.

Chih-Jen Lin (National Taiwan Univ.) 21 / 78

slide-22
SLIDE 22

Optimization problem for fully-connected networks

Operations Between Two Layers III

Usually people include a bias term
$$b^m = \begin{bmatrix} b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_{m+1}} \end{bmatrix}_{n_{m+1} \times 1},$$
so that
$$s^m = W^m z^m + b^m$$
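A minimal sketch of this per-layer operation in NumPy, with made-up sizes and a sigmoid activation chosen only for illustration (the slides have not fixed a particular $\sigma$):

```python
import numpy as np

def layer_forward(W, b, z, sigma):
    """One fully-connected layer: s = W z + b, then the activation applied element-wise."""
    s = W @ z + b
    return sigma(s)

# Made-up sizes: n_m = 4 inputs, n_{m+1} = 3 outputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
z = rng.standard_normal(4)
z_next = layer_forward(W, b, z, lambda s: 1.0 / (1.0 + np.exp(-s)))
print(z_next.shape)   # (3,)
```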

Chih-Jen Lin (National Taiwan Univ.) 22 / 78

slide-23
SLIDE 23

Optimization problem for fully-connected networks

Operations Between Two Layers IV

The activation function is usually an $R \to R$ transformation. As we are interested in optimization, let's not worry about why it's needed.
We collect all variables:
$$\theta = \begin{bmatrix} \mathrm{vec}(W^1) \\ b^1 \\ \vdots \\ \mathrm{vec}(W^L) \\ b^L \end{bmatrix} \in R^n$$

Chih-Jen Lin (National Taiwan Univ.) 23 / 78

slide-24
SLIDE 24

Optimization problem for fully-connected networks

Operations Between Two Layers V

$n$: total # variables $= (n_1 + 1) n_2 + \cdots + (n_L + 1) n_{L+1}$
The $\mathrm{vec}(\cdot)$ operator stacks the columns of a matrix into a vector.

Chih-Jen Lin (National Taiwan Univ.) 24 / 78

slide-25
SLIDE 25

Optimization problem for fully-connected networks

Optimization Problem I

We solve the following optimization problem: $\min_\theta f(\theta)$, where
$$f(\theta) = \frac{1}{2} \theta^T \theta + C \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); \boldsymbol{y}_i, x_i).$$
$C$: regularization parameter
$z^{L+1}(\theta) \in R^{n_{L+1}}$: last-layer output vector of $x$
$\xi(z^{L+1}; \boldsymbol{y}, x)$: loss function. Example: $\xi(z^{L+1}; \boldsymbol{y}, x) = \|z^{L+1} - \boldsymbol{y}\|^2$

Chih-Jen Lin (National Taiwan Univ.) 25 / 78

slide-26
SLIDE 26

Optimization problem for fully-connected networks

Optimization Problem II

The formulation is the same as for linear classification. However, the loss function is more complicated. Further, it's non-convex.
Note that in the earlier discussion we considered a single instance. In the training process we actually have, for $i = 1, \dots, l$,
$$s^{m,i} = W^m z^{m,i}, \qquad z^{m+1,i}_j = \sigma(s^{m,i}_j), \; j = 1, \dots, n_{m+1}.$$
This makes the training more complicated.
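A small illustrative sketch (made-up sizes, NumPy): one common way to handle the per-instance products is to stack the vectors $z^{m,i}$ of all instances as columns of a matrix, so that all $l$ products become one matrix-matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
l, n_m, n_mp1 = 6, 4, 3                    # made-up: l instances, layer widths
W = rng.standard_normal((n_mp1, n_m))
b = rng.standard_normal((n_mp1, 1))
Zm = rng.standard_normal((n_m, l))         # column i is z^{m,i}

Sm = W @ Zm + b                            # column i is s^{m,i} = W z^{m,i} + b
Zm1 = np.maximum(Sm, 0)                    # element-wise activation (ReLU here, for illustration)

# Same result as looping over the l instances one by one.
for i in range(l):
    assert np.allclose(Zm1[:, i], np.maximum(W @ Zm[:, i] + b.ravel(), 0))
```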

Chih-Jen Lin (National Taiwan Univ.) 26 / 78

slide-27
SLIDE 27

Optimization problem for convolutional neural networks (CNN)

Outline

1. Regularized linear classification
2. Optimization problem for fully-connected networks
3. Optimization problem for convolutional neural networks (CNN)
4. Discussion

Chih-Jen Lin (National Taiwan Univ.) 27 / 78

slide-28
SLIDE 28

Optimization problem for convolutional neural networks (CNN)

Why CNN? I

There are many types of neural networks, and they are suitable for different types of problems. While deep learning is hot, it's not always better than other learning methods.
For example, fully-connected networks were evaluated on general classification data (e.g., data from the UCI machine learning repository). They are not consistently better than random forests or SVM; see the comparisons (Meyer et al., 2003; Fernández-Delgado et al., 2014; Wang et al., 2018).

Chih-Jen Lin (National Taiwan Univ.) 28 / 78

slide-29
SLIDE 29

Optimization problem for convolutional neural networks (CNN)

Why CNN? II

We are interested in CNN because it's shown to be significantly better than others on image data. That's one of the main reasons deep learning became popular.
To study optimization algorithms, of course we want to consider an “established” network. That's why CNN was chosen for our discussion.
However, the problem is that operations in CNN are more complicated than in fully-connected networks. Most books/papers only give explanations without detailed mathematical forms.

Chih-Jen Lin (National Taiwan Univ.) 29 / 78

slide-30
SLIDE 30

Optimization problem for convolutional neural networks (CNN)

Why CNN? III

To study the optimization, we need some clean formulations. So let's give it a try here.

Chih-Jen Lin (National Taiwan Univ.) 30 / 78

slide-31
SLIDE 31

Optimization problem for convolutional neural networks (CNN)

Convolutional Neural Networks I

Consider a $K$-class classification problem with training data $(\boldsymbol{y}_i, Z^{1,i})$, $i = 1, \dots, l$.
$\boldsymbol{y}_i$: label vector; $Z^{1,i}$: input image
If $Z^{1,i}$ is in class $k$, then
$$\boldsymbol{y}_i = [\underbrace{0, \dots, 0}_{k-1}, 1, 0, \dots, 0]^T \in R^K.$$
CNN maps each image $Z^{1,i}$ to $\boldsymbol{y}_i$.

Chih-Jen Lin (National Taiwan Univ.) 31 / 78

slide-32
SLIDE 32

Optimization problem for convolutional neural networks (CNN)

Convolutional Neural Networks II

Typically, CNN consists of multiple convolutional layers followed by fully-connected layers. Input and output of a convolutional layer are assumed to be images.

Chih-Jen Lin (National Taiwan Univ.) 32 / 78

slide-33
SLIDE 33

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers I

For the current layer, let the input be an image $Z^{\text{in}}$ of size $a^{\text{in}} \times b^{\text{in}} \times d^{\text{in}}$.
$a^{\text{in}}$: height, $b^{\text{in}}$: width, $d^{\text{in}}$: # channels

(Figure: the input drawn as an $a^{\text{in}} \times b^{\text{in}} \times d^{\text{in}}$ box.)

Chih-Jen Lin (National Taiwan Univ.) 33 / 78

slide-34
SLIDE 34

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers II

The goal is to generate an output image $Z^{\text{out},i}$ of $d^{\text{out}}$ channels of $a^{\text{out}} \times b^{\text{out}}$ images.
Consider $d^{\text{out}}$ filters. Filter $j \in \{1, \dots, d^{\text{out}}\}$ has dimensions $h \times h \times d^{\text{in}}$:
$$\begin{bmatrix} w^j_{1,1,1} & \cdots & w^j_{1,h,1} \\ \vdots & & \vdots \\ w^j_{h,1,1} & \cdots & w^j_{h,h,1} \end{bmatrix} \; \cdots \; \begin{bmatrix} w^j_{1,1,d^{\text{in}}} & \cdots & w^j_{1,h,d^{\text{in}}} \\ \vdots & & \vdots \\ w^j_{h,1,d^{\text{in}}} & \cdots & w^j_{h,h,d^{\text{in}}} \end{bmatrix}.$$

Chih-Jen Lin (National Taiwan Univ.) 34 / 78

slide-35
SLIDE 35

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers III

$h$: filter height/width (layer index omitted)

(Figure: a $3 \times 3$ input channel with entries indexed $(1,1,1), \dots, (3,3,1)$, and the corresponding $2 \times 2$ output entries $s^{\text{out},i}_{1,1,j}$, $s^{\text{out},i}_{1,2,j}$, $s^{\text{out},i}_{2,1,j}$, $s^{\text{out},i}_{2,2,j}$.)

To compute the $j$th channel of the output, we scan the input from top-left to bottom-right to obtain the sub-images of size $h \times h \times d^{\text{in}}$.

Chih-Jen Lin (National Taiwan Univ.) 35 / 78

slide-36
SLIDE 36

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers IV

We then calculate the inner product between each sub-image and the $j$th filter. For example, if we start from the upper left corner of the input image, the first sub-image of channel $d$ is
$$\begin{bmatrix} z^i_{1,1,d} & \cdots & z^i_{1,h,d} \\ \vdots & & \vdots \\ z^i_{h,1,d} & \cdots & z^i_{h,h,d} \end{bmatrix}.$$

Chih-Jen Lin (National Taiwan Univ.) 36 / 78

slide-37
SLIDE 37

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers V

We then calculate
$$\sum_{d=1}^{d^{\text{in}}} \left\langle \begin{bmatrix} z^i_{1,1,d} & \cdots & z^i_{1,h,d} \\ \vdots & & \vdots \\ z^i_{h,1,d} & \cdots & z^i_{h,h,d} \end{bmatrix}, \begin{bmatrix} w^j_{1,1,d} & \cdots & w^j_{1,h,d} \\ \vdots & & \vdots \\ w^j_{h,1,d} & \cdots & w^j_{h,h,d} \end{bmatrix} \right\rangle + b_j, \qquad (3)$$
where $\langle \cdot, \cdot \rangle$ means the sum of component-wise products between two matrices. This value becomes the $(1, 1)$ position of channel $j$ of the output image.
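A direct sketch of (3) for the $(1, 1)$ output position, assuming the image is stored as a NumPy array z[a, b, d] and the filter as w[a, b, d]; sizes and values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
a_in, b_in, d_in, h = 5, 5, 3, 3               # made-up sizes
z = rng.standard_normal((a_in, b_in, d_in))    # z[a, b, d], 0-based indices
w = rng.standard_normal((h, h, d_in))          # filter j
b_j = 0.1

# Eq. (3): sum over channels of <top-left h x h sub-image, filter>, plus the bias.
val = sum(np.sum(z[0:h, 0:h, d] * w[:, :, d]) for d in range(d_in)) + b_j
# Equivalent single expression over all channels at once:
assert np.isclose(val, np.sum(z[0:h, 0:h, :] * w) + b_j)
print(val)
```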

Chih-Jen Lin (National Taiwan Univ.) 37 / 78

slide-38
SLIDE 38

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers VI

Next, we use other sub-images to produce values at other positions of the output image.
Let the stride $s$ be the number of pixels moved vertically or horizontally to get sub-images. For the $(2, 1)$ position of the output image, we move down $s$ pixels vertically to obtain the following sub-image:
$$\begin{bmatrix} z^i_{1+s,1,d} & \cdots & z^i_{1+s,h,d} \\ \vdots & & \vdots \\ z^i_{h+s,1,d} & \cdots & z^i_{h+s,h,d} \end{bmatrix}.$$

Chih-Jen Lin (National Taiwan Univ.) 38 / 78

slide-39
SLIDE 39

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers VII

The $(2, 1)$ position of channel $j$ of the output image is
$$\sum_{d=1}^{d^{\text{in}}} \left\langle \begin{bmatrix} z^i_{1+s,1,d} & \cdots & z^i_{1+s,h,d} \\ \vdots & & \vdots \\ z^i_{h+s,1,d} & \cdots & z^i_{h+s,h,d} \end{bmatrix}, \begin{bmatrix} w^j_{1,1,d} & \cdots & w^j_{1,h,d} \\ \vdots & & \vdots \\ w^j_{h,1,d} & \cdots & w^j_{h,h,d} \end{bmatrix} \right\rangle + b_j. \qquad (4)$$

Chih-Jen Lin (National Taiwan Univ.) 39 / 78

slide-40
SLIDE 40

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers VIII

The output image sizes $a^{\text{out}}$ and $b^{\text{out}}$ are, respectively, the numbers of positions to which we can move the filter vertically and horizontally:
$$a^{\text{out}} = \left\lfloor \frac{a^{\text{in}} - h}{s} \right\rfloor + 1, \qquad b^{\text{out}} = \left\lfloor \frac{b^{\text{in}} - h}{s} \right\rfloor + 1 \qquad (5)$$
Rationale of (5): vertically, the last row of each sub-image is
$$h, \; h + s, \; \dots, \; h + \Delta s \le a^{\text{in}}$$

Chih-Jen Lin (National Taiwan Univ.) 40 / 78

slide-41
SLIDE 41

Optimization problem for convolutional neural networks (CNN)

Convolutional Layers IX

Thus $\Delta = \left\lfloor \dfrac{a^{\text{in}} - h}{s} \right\rfloor$.
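A quick sanity check of (5), assuming Python's integer division acts as the floor for these nonnegative values:

```python
def conv_output_size(a_in, b_in, h, s):
    """Eq. (5): how many times an h x h filter with stride s fits vertically/horizontally."""
    return (a_in - h) // s + 1, (b_in - h) // s + 1

# e.g. a 5 x 5 input with a 3 x 3 filter: stride 1 -> 3 x 3 output; stride 2 -> 2 x 2 output.
print(conv_output_size(5, 5, 3, 1))   # (3, 3)
print(conv_output_size(5, 5, 3, 2))   # (2, 2)
```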

Chih-Jen Lin (National Taiwan Univ.) 41 / 78

slide-42
SLIDE 42

Optimization problem for convolutional neural networks (CNN)

Matrix Operations I

For efficient implementations, we should conduct convolutional operations by matrix-matrix and matrix-vector operations We will go back to this issue later

Chih-Jen Lin (National Taiwan Univ.) 42 / 78

slide-43
SLIDE 43

Optimization problem for convolutional neural networks (CNN)

Matrix Operations II

Let's collect images of all channels as the input
$$Z^{\text{in},i} = \begin{bmatrix} z^i_{1,1,1} & z^i_{2,1,1} & \cdots & z^i_{a^{\text{in}}, b^{\text{in}}, 1} \\ \vdots & \vdots & \ddots & \vdots \\ z^i_{1,1,d^{\text{in}}} & z^i_{2,1,d^{\text{in}}} & \cdots & z^i_{a^{\text{in}}, b^{\text{in}}, d^{\text{in}}} \end{bmatrix} \in R^{d^{\text{in}} \times a^{\text{in}} b^{\text{in}}}.$$

Chih-Jen Lin (National Taiwan Univ.) 43 / 78

slide-44
SLIDE 44

Optimization problem for convolutional neural networks (CNN)

Matrix Operations III

Let all filters
$$W = \begin{bmatrix} w^1_{1,1,1} & w^1_{2,1,1} & \cdots & w^1_{h,h,d^{\text{in}}} \\ \vdots & \vdots & \ddots & \vdots \\ w^{d^{\text{out}}}_{1,1,1} & w^{d^{\text{out}}}_{2,1,1} & \cdots & w^{d^{\text{out}}}_{h,h,d^{\text{in}}} \end{bmatrix} \in R^{d^{\text{out}} \times hh d^{\text{in}}}$$
be the variables (parameters) of the current layer.

Chih-Jen Lin (National Taiwan Univ.) 44 / 78

slide-45
SLIDE 45

Optimization problem for convolutional neural networks (CNN)

Matrix Operations IV

Usually a bias term is considered:
$$b = \begin{bmatrix} b_1 \\ \vdots \\ b_{d^{\text{out}}} \end{bmatrix} \in R^{d^{\text{out}} \times 1}$$
Operations at a layer:
$$S^{\text{out},i} = W \phi(Z^{\text{in},i}) + b \mathbf{1}^T_{a^{\text{out}} b^{\text{out}}} \in R^{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}, \qquad (6)$$

Chih-Jen Lin (National Taiwan Univ.) 45 / 78

slide-46
SLIDE 46

Optimization problem for convolutional neural networks (CNN)

Matrix Operations V

where
$$\mathbf{1}_{a^{\text{out}} b^{\text{out}}} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \in R^{a^{\text{out}} b^{\text{out}} \times 1}.$$
$\phi(Z^{\text{in},i})$ collects all sub-images in $Z^{\text{in},i}$ into a matrix.

Chih-Jen Lin (National Taiwan Univ.) 46 / 78

slide-47
SLIDE 47

Optimization problem for convolutional neural networks (CNN)

Matrix Operations VI

Specifically,
$$\phi(Z^{\text{in},i}) = \begin{bmatrix} z^i_{1,1,1} & z^i_{1+s,1,1} & \cdots & z^i_{1+(a^{\text{out}}-1)s,\,1+(b^{\text{out}}-1)s,\,1} \\ z^i_{2,1,1} & z^i_{2+s,1,1} & \cdots & z^i_{2+(a^{\text{out}}-1)s,\,1+(b^{\text{out}}-1)s,\,1} \\ \vdots & \vdots & \ddots & \vdots \\ z^i_{h,h,1} & z^i_{h+s,h,1} & \cdots & z^i_{h+(a^{\text{out}}-1)s,\,h+(b^{\text{out}}-1)s,\,1} \\ \vdots & \vdots & & \vdots \\ z^i_{h,h,d^{\text{in}}} & z^i_{h+s,h,d^{\text{in}}} & \cdots & z^i_{h+(a^{\text{out}}-1)s,\,h+(b^{\text{out}}-1)s,\,d^{\text{in}}} \end{bmatrix} \in R^{hh d^{\text{in}} \times a^{\text{out}} b^{\text{out}}}$$
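A loop-based sketch of $\phi$ and of (6), assuming the layout of slides 43–47 ($Z$ stored as a $d^{\text{in}} \times a^{\text{in}} b^{\text{in}}$ matrix with pixel $(a, b)$ in column $(b-1) a^{\text{in}} + a$, and rows/columns of $\phi(Z)$ ordered as above). The function name `phi`, the sizes, and the values are made up; a real implementation would avoid the Python loops.

```python
import numpy as np

def phi(Z, a_in, b_in, h, s):
    """Collect all h x h x d_in sub-images of Z (d_in x a_in*b_in) as columns."""
    d_in = Z.shape[0]
    a_out, b_out = (a_in - h) // s + 1, (b_in - h) // s + 1
    cols = []
    for q in range(b_out):              # horizontal position of the sub-image
        for p in range(a_out):          # vertical position (changes fastest)
            col = []
            for d in range(d_in):       # channel is the slowest index within a column
                for v in range(h):      # column inside the sub-image
                    for u in range(h):  # row inside the sub-image (fastest)
                        col.append(Z[d, (q * s + v) * a_in + (p * s + u)])
            cols.append(col)
    return np.array(cols).T             # (h*h*d_in) x (a_out*b_out)

rng = np.random.default_rng(0)
a_in, b_in, d_in, d_out, h, s = 4, 4, 2, 3, 2, 2      # made-up sizes
Z = rng.standard_normal((d_in, a_in * b_in))
W = rng.standard_normal((d_out, h * h * d_in))
b = rng.standard_normal((d_out, 1))

a_out, b_out = (a_in - h) // s + 1, (b_in - h) // s + 1
S = W @ phi(Z, a_in, b_in, h, s) + b @ np.ones((1, a_out * b_out))   # Eq. (6)
print(S.shape)   # (d_out, a_out*b_out) = (3, 4)
```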

Chih-Jen Lin (National Taiwan Univ.) 47 / 78

slide-48
SLIDE 48

Optimization problem for convolutional neural networks (CNN)

Activation Function I

Next, an activation function scales each element of $S^{\text{out},i}$ to obtain the output matrix $Z^{\text{out},i}$:
$$Z^{\text{out},i} = \sigma(S^{\text{out},i}) \in R^{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}. \qquad (7)$$
For CNN, commonly the ReLU activation function
$$\sigma(x) = \max(x, 0) \qquad (8)$$
is used. Later we need $\sigma(x)$ to be differentiable, but the ReLU function is not.

Chih-Jen Lin (National Taiwan Univ.) 48 / 78

slide-49
SLIDE 49

Optimization problem for convolutional neural networks (CNN)

Activation Function II

Past works such as Krizhevsky et al. (2012) assume
$$\sigma'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
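A one-line sketch of (8) and of this assumed derivative in NumPy (note that $x = 0$ is put in the "otherwise" case):

```python
import numpy as np

def relu(x):
    """Eq. (8): sigma(x) = max(x, 0), applied element-wise."""
    return np.maximum(x, 0.0)

def relu_grad(x):
    """The assumed derivative: 1 where x > 0, and 0 otherwise (including x = 0)."""
    return (x > 0).astype(float)

x = np.array([-1.5, 0.0, 2.0])
print(relu(x), relu_grad(x))    # [0. 0. 2.]  [0. 0. 1.]
```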

Chih-Jen Lin (National Taiwan Univ.) 49 / 78

slide-50
SLIDE 50

Optimization problem for convolutional neural networks (CNN)

The Function φ(Z in,i) I

In the matrix-matrix product $W \phi(Z^{\text{in},i})$, each element is the inner product between a filter and a sub-image.
We need to represent $\phi(Z^{\text{in},i})$ in an explicit form. This is important for subsequent calculations.
Clearly $\phi$ is a linear mapping, so there exists a 0/1 matrix $P_\phi$ such that
$$\phi(Z^{\text{in},i}) \equiv \mathrm{mat}\left(P_\phi \mathrm{vec}(Z^{\text{in},i})\right)_{hh d^{\text{in}} \times a^{\text{out}} b^{\text{out}}}, \quad \forall i, \qquad (9)$$

Chih-Jen Lin (National Taiwan Univ.) 50 / 78

slide-51
SLIDE 51

Optimization problem for convolutional neural networks (CNN)

The Function φ(Z in,i) II

$\mathrm{vec}(M)$: all of $M$'s columns concatenated into a vector $v$:
$$\mathrm{vec}(M) = \begin{bmatrix} M_{:,1} \\ \vdots \\ M_{:,b} \end{bmatrix} \in R^{ab \times 1}, \quad \text{where } M \in R^{a \times b}$$
$\mathrm{mat}(v)$ is the inverse of $\mathrm{vec}(M)$:
$$\mathrm{mat}(v)_{a \times b} = \begin{bmatrix} v_1 & \cdots & v_{(b-1)a+1} \\ \vdots & \ddots & \vdots \\ v_a & \cdots & v_{ba} \end{bmatrix} \in R^{a \times b}, \qquad (10)$$

Chih-Jen Lin (National Taiwan Univ.) 51 / 78

slide-52
SLIDE 52

Optimization problem for convolutional neural networks (CNN)

The Function φ(Z in,i) III

where $v \in R^{ab \times 1}$.
$P_\phi$ is a huge matrix:
$$P_\phi \in R^{hh d^{\text{in}} a^{\text{out}} b^{\text{out}} \times d^{\text{in}} a^{\text{in}} b^{\text{in}}}, \qquad \phi : R^{d^{\text{in}} \times a^{\text{in}} b^{\text{in}}} \to R^{hh d^{\text{in}} \times a^{\text{out}} b^{\text{out}}}$$
Later we will check implementation details. Past works using the form (9) include, for example, Vedaldi and Lenc (2015).
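Since $\phi$ is linear, one way to materialize $P_\phi$ for a small example is to apply $\phi$ to every standard basis vector of $\mathrm{vec}(Z^{\text{in}})$: column $j$ of $P_\phi$ is $\mathrm{vec}(\phi(\mathrm{mat}(e_j)))$. The sketch below assumes the sub-image ordering of slide 47 and uses made-up sizes; practical implementations would instead build $P_\phi$ (or avoid forming it) through index arithmetic.

```python
import numpy as np

def phi(Z, a_in, b_in, h, s):
    # Same sub-image gathering as in the earlier sketch (slide 47 ordering).
    d_in = Z.shape[0]
    a_out, b_out = (a_in - h) // s + 1, (b_in - h) // s + 1
    cols = [[Z[d, (q*s+v)*a_in + (p*s+u)]
             for d in range(d_in) for v in range(h) for u in range(h)]
            for q in range(b_out) for p in range(a_out)]
    return np.array(cols).T

def vec(M):
    return M.reshape(-1, order='F')          # stack columns, as in (10)

def mat(v, a, b):
    return v.reshape((a, b), order='F')      # inverse of vec

a_in, b_in, d_in, h, s = 4, 4, 2, 2, 2       # made-up sizes
a_out, b_out = (a_in - h) // s + 1, (b_in - h) // s + 1

# Column j of P_phi is vec(phi(mat(e_j))): probe the linear map with basis vectors.
n_in = d_in * a_in * b_in
P_phi = np.zeros((h * h * d_in * a_out * b_out, n_in))
for j in range(n_in):
    e = np.zeros(n_in); e[j] = 1.0
    P_phi[:, j] = vec(phi(mat(e, d_in, a_in * b_in), a_in, b_in, h, s))

# Check Eq. (9) on a random input.
Z = np.random.default_rng(0).standard_normal((d_in, a_in * b_in))
lhs = phi(Z, a_in, b_in, h, s)
rhs = mat(P_phi @ vec(Z), h * h * d_in, a_out * b_out)
print(np.allclose(lhs, rhs))                 # True
```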

Chih-Jen Lin (National Taiwan Univ.) 52 / 78

slide-53
SLIDE 53

Optimization problem for convolutional neural networks (CNN)

Optimization Problem I

We collect all weights into a vector variable $\theta$:
$$\theta = \begin{bmatrix} \mathrm{vec}(W^1) \\ b^1 \\ \vdots \\ \mathrm{vec}(W^L) \\ b^L \end{bmatrix} \in R^n, \qquad n: \text{total \# variables}$$
The output of the last layer $L$ is a vector $z^{L+1,i}(\theta)$. Consider any loss function, such as the squared loss
$$\xi_i(\theta) = \|z^{L+1,i}(\theta) - \boldsymbol{y}_i\|^2.$$

Chih-Jen Lin (National Taiwan Univ.) 53 / 78

slide-54
SLIDE 54

Optimization problem for convolutional neural networks (CNN)

Optimization Problem II

The optimization problem is
$$\min_\theta f(\theta), \qquad \text{where } f(\theta) = \frac{1}{2C} \theta^T \theta + \frac{1}{l} \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); \boldsymbol{y}_i, Z^{1,i})$$
$C$: regularization parameter.
The formulation is almost the same as that for fully-connected networks.

Chih-Jen Lin (National Taiwan Univ.) 54 / 78

slide-55
SLIDE 55

Optimization problem for convolutional neural networks (CNN)

Optimization Problem III

Note that we divide the sum of training losses by the number of training data. Thus the second term becomes the average training loss.
With the optimization problem written down, there is still a long way to a real implementation.
Further, CNN involves additional operations in practice: padding and pooling. We will explain them.

Chih-Jen Lin (National Taiwan Univ.) 55 / 78

slide-56
SLIDE 56

Optimization problem for convolutional neural networks (CNN)

Zero Padding I

To better control the size of the output image, before the convolutional operation we may enlarge the input image to have zero values around the border. This technique is called zero-padding in CNN training. An illustration:

Chih-Jen Lin (National Taiwan Univ.) 56 / 78

slide-57
SLIDE 57

Optimization problem for convolutional neural networks (CNN)

Zero Padding II

(Figure: an input image of size $a^{\text{in}} \times b^{\text{in}}$ surrounded by a border of zeros of width $p$ on every side.)

Chih-Jen Lin (National Taiwan Univ.) 57 / 78

slide-58
SLIDE 58

Optimization problem for convolutional neural networks (CNN)

Zero Padding III

The size of the new image is changed from $a^{\text{in}} \times b^{\text{in}}$ to $(a^{\text{in}} + 2p) \times (b^{\text{in}} + 2p)$, where $p$ is specified by users.
The operation can be treated as a layer mapping an input $Z^{\text{in},i}$ to an output $Z^{\text{out},i}$. Let $d^{\text{out}} = d^{\text{in}}$.

Chih-Jen Lin (National Taiwan Univ.) 58 / 78

slide-59
SLIDE 59

Optimization problem for convolutional neural networks (CNN)

Zero Padding IV

There exists a 0/1 matrix $P_{\text{pad}} \in R^{d^{\text{out}} a^{\text{out}} b^{\text{out}} \times d^{\text{in}} a^{\text{in}} b^{\text{in}}}$ so that the padding operation can be represented by
$$Z^{\text{out},i} \equiv \mathrm{mat}\left(P_{\text{pad}} \mathrm{vec}(Z^{\text{in},i})\right)_{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}. \qquad (11)$$
Implementation details will be discussed later.
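A sketch of the padding operation (11) on the $d \times a^{\text{in}} b^{\text{in}}$ matrix layout used earlier; the reshapes recover each channel as an image, pad it with $p$ zeros on every side, and flatten it back. The helper name and sizes are made up.

```python
import numpy as np

def zero_pad(Z, a_in, b_in, p):
    """Pad each channel of Z (d x a_in*b_in, pixel (a,b) in column (b-1)*a_in + a) with p zeros."""
    d = Z.shape[0]
    a_out, b_out = a_in + 2 * p, b_in + 2 * p
    out = np.zeros((d, a_out * b_out))
    for c in range(d):
        img = Z[c].reshape(b_in, a_in).T       # back to an a_in x b_in image
        out[c] = np.pad(img, p).T.reshape(-1)  # pad, then flatten column by column again
    return out

Z = np.arange(6.0).reshape(1, 6)               # one channel, 2 x 3 image (a_in=2, b_in=3)
print(zero_pad(Z, a_in=2, b_in=3, p=1).reshape(5, 4).T)  # the 4 x 5 zero-padded image
```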

Chih-Jen Lin (National Taiwan Univ.) 59 / 78

slide-60
SLIDE 60

Optimization problem for convolutional neural networks (CNN)

Pooling I

To reduce the computational cost, a dimension reduction is often applied by a pooling step after convolutional operations. Usually we consider an operation that can (approximately) extract rotational or translational invariance features. Examples: average pooling, max pooling, and stochastic pooling. Let's consider max pooling as an illustration.

Chih-Jen Lin (National Taiwan Univ.) 60 / 78

slide-61
SLIDE 61

Optimization problem for convolutional neural networks (CNN)

Pooling II

An example:
Image A:
$$\begin{bmatrix} 2 & 3 & 6 & 8 \\ 5 & 4 & 9 & 7 \\ 1 & 2 & 6 & 0 \\ 4 & 3 & 2 & 1 \end{bmatrix} \to \begin{bmatrix} 5 & 9 \\ 4 & 6 \end{bmatrix}$$
Image B:
$$\begin{bmatrix} 3 & 2 & 3 & 6 \\ 4 & 5 & 4 & 9 \\ 2 & 1 & 2 & 6 \\ 3 & 4 & 3 & 2 \end{bmatrix} \to \begin{bmatrix} 5 & 9 \\ 4 & 6 \end{bmatrix}$$

Chih-Jen Lin (National Taiwan Univ.) 61 / 78

slide-62
SLIDE 62

Optimization problem for convolutional neural networks (CNN)

Pooling III

B is derived by shifting A by 1 pixel in the horizontal direction. We split the two images into four $2 \times 2$ sub-images and choose the max value from each sub-image. Because only some elements in each sub-image are changed, the maximal value is likely the same or similar. This is called translational invariance. For our example, the two output images from A and B are the same.
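A quick numerical check of this example, assuming NumPy; the non-overlapping $2 \times 2$ max pooling is done with a reshape, and the helper name is made up.

```python
import numpy as np

def max_pool2x2(img):
    """Non-overlapping 2 x 2 max pooling of a 2-D image."""
    a, b = img.shape
    return img.reshape(a // 2, 2, b // 2, 2).max(axis=(1, 3))

A = np.array([[2, 3, 6, 8],
              [5, 4, 9, 7],
              [1, 2, 6, 0],
              [4, 3, 2, 1]])
B = np.array([[3, 2, 3, 6],          # A shifted right by one pixel
              [4, 5, 4, 9],
              [2, 1, 2, 6],
              [3, 4, 3, 2]])
print(max_pool2x2(A))                                    # [[5 9] [4 6]]
print(np.array_equal(max_pool2x2(A), max_pool2x2(B)))    # True
```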

Chih-Jen Lin (National Taiwan Univ.) 62 / 78

slide-63
SLIDE 63

Optimization problem for convolutional neural networks (CNN)

Pooling IV

For mathematical representation, we consider the operation as a layer mapping an input $Z^{\text{in},i}$ to an output $Z^{\text{out},i}$.
In practice, pooling is considered an operation at the end of the convolutional layer. We partition every channel of $Z^{\text{in},i}$ into non-overlapping sub-regions by $h \times h$ filters. Because the sub-regions are disjoint, the stride $s$ for sliding the filters is equal to $h$.

Chih-Jen Lin (National Taiwan Univ.) 63 / 78

slide-64
SLIDE 64

Optimization problem for convolutional neural networks (CNN)

Pooling V

This partition step is a special case of how we generate sub-images in convolutional operations. By the same definition as (9) we can generate the matrix
$$\phi(Z^{\text{in},i}) = \mathrm{mat}\left(P_\phi \mathrm{vec}(Z^{\text{in},i})\right)_{hh \times d^{\text{out}} a^{\text{out}} b^{\text{out}}}, \qquad (12)$$
where
$$a^{\text{out}} = \left\lfloor \frac{a^{\text{in}}}{h} \right\rfloor, \quad b^{\text{out}} = \left\lfloor \frac{b^{\text{in}}}{h} \right\rfloor, \quad d^{\text{out}} = d^{\text{in}}. \qquad (13)$$

Chih-Jen Lin (National Taiwan Univ.) 64 / 78

slide-65
SLIDE 65

Optimization problem for convolutional neural networks (CNN)

Pooling VI

This is the same as the calculation in (5), since
$$\left\lfloor \frac{a^{\text{in}} - h}{h} \right\rfloor + 1 = \left\lfloor \frac{a^{\text{in}}}{h} \right\rfloor$$
Note that here we consider $hh \times d^{\text{out}} a^{\text{out}} b^{\text{out}}$ rather than $hh d^{\text{out}} \times a^{\text{out}} b^{\text{out}}$ because we can then do a max operation on each column.

Chih-Jen Lin (National Taiwan Univ.) 65 / 78

slide-66
SLIDE 66

Optimization problem for convolutional neural networks (CNN)

Pooling VII

To select the largest element of each sub-region, there exists a 0/1 matrix
$$M^i \in R^{d^{\text{out}} a^{\text{out}} b^{\text{out}} \times hh d^{\text{out}} a^{\text{out}} b^{\text{out}}}$$
so that each row of $M^i$ selects a single element from $\mathrm{vec}(\phi(Z^{\text{in},i}))$. Therefore,
$$Z^{\text{out},i} = \mathrm{mat}\left(M^i \mathrm{vec}(\phi(Z^{\text{in},i}))\right)_{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}. \qquad (14)$$

Chih-Jen Lin (National Taiwan Univ.) 66 / 78

slide-67
SLIDE 67

Optimization problem for convolutional neural networks (CNN)

Pooling VIII

A comparison with (6) shows that $M^i$ plays a role similar to the weight matrix $W$.
While $M^i$ is 0/1, it is not a constant: the positions of its 1's depend on the values of $\phi(Z^{\text{in},i})$.
By combining (12) and (14), we have
$$Z^{\text{out},i} = \mathrm{mat}\left(P^i_{\text{pool}} \mathrm{vec}(Z^{\text{in},i})\right)_{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}, \qquad (15)$$
where
$$P^i_{\text{pool}} = M^i P_\phi \in R^{d^{\text{out}} a^{\text{out}} b^{\text{out}} \times d^{\text{in}} a^{\text{in}} b^{\text{in}}}. \qquad (16)$$

Chih-Jen Lin (National Taiwan Univ.) 67 / 78

slide-68
SLIDE 68

Optimization problem for convolutional neural networks (CNN)

Summary of a Convolutional Layer I

For implementation, padding and pooling are (optional) parts of the convolutional layers. We discuss the details of considering all operations together.
The whole convolutional layer involves the following procedure:
$$Z^{m,i} \to \text{padding by (11)} \to \text{convolutional operations by (6), (7)} \to \text{pooling by (15)} \to Z^{m+1,i}, \qquad (17)$$

Chih-Jen Lin (National Taiwan Univ.) 68 / 78

slide-69
SLIDE 69

Optimization problem for convolutional neural networks (CNN)

Summary of a Convolutional Layer II

where $Z^{m,i}$ and $Z^{m+1,i}$ are the input and output of the $m$th layer, respectively.
Let the following symbols denote image sizes at different stages of the convolutional layer:
$a^m, b^m$: size in the beginning
$a^m_{\text{pad}}, b^m_{\text{pad}}$: size after padding
$a^m_{\text{conv}}, b^m_{\text{conv}}$: size after convolution
The following table indicates what $a^{\text{in}}, b^{\text{in}}, d^{\text{in}}$ and $a^{\text{out}}, b^{\text{out}}, d^{\text{out}}$ are at each stage.

Chih-Jen Lin (National Taiwan Univ.) 69 / 78

slide-70
SLIDE 70

Optimization problem for convolutional neural networks (CNN)

Summary of a Convolutional Layer III

Operation          | Input            | Output
Padding: (11)      | Z^{m,i}          | pad(Z^{m,i})
Convolution: (6)   | pad(Z^{m,i})     | S^{m,i}
Convolution: (7)   | S^{m,i}          | sigma(S^{m,i})
Pooling: (15)      | sigma(S^{m,i})   | Z^{m+1,i}

Operation          | a^in, b^in, d^in              | a^out, b^out, d^out
Padding: (11)      | a^m, b^m, d^m                 | a^m_pad, b^m_pad, d^m
Convolution: (6)   | a^m_pad, b^m_pad, d^m         | a^m_conv, b^m_conv, d^{m+1}
Convolution: (7)   | a^m_conv, b^m_conv, d^{m+1}   | a^m_conv, b^m_conv, d^{m+1}
Pooling: (15)      | a^m_conv, b^m_conv, d^{m+1}   | a^{m+1}, b^{m+1}, d^{m+1}

Chih-Jen Lin (National Taiwan Univ.) 70 / 78

slide-71
SLIDE 71

Optimization problem for convolutional neural networks (CNN)

Summary of a Convolutional Layer IV

Let the filter size, mapping matrices, and weight matrices at the $m$th layer be
$$h^m, \; P^m_{\text{pad}}, \; P^m_\phi, \; P^{m,i}_{\text{pool}}, \; W^m, \; b^m.$$
From (11), (6), (7), (15), all operations can be summarized as
$$S^{m,i} = W^m \mathrm{mat}\left(P^m_\phi P^m_{\text{pad}} \mathrm{vec}(Z^{m,i})\right)_{h^m h^m d^m \times a^m_{\text{conv}} b^m_{\text{conv}}} + b^m \mathbf{1}^T_{a^m_{\text{conv}} b^m_{\text{conv}}}$$
$$Z^{m+1,i} = \mathrm{mat}\left(P^{m,i}_{\text{pool}} \mathrm{vec}(\sigma(S^{m,i}))\right)_{d^{m+1} \times a^{m+1} b^{m+1}}, \qquad (18)$$
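An operational sketch of the whole procedure (17): padding, convolution (6)–(7) with ReLU, then non-overlapping max pooling. For readability it works on 3-D arrays $z[a, b, d]$ with explicit loops rather than the $\mathrm{vec}/\mathrm{mat}$ form of (18); all names and sizes are made up.

```python
import numpy as np

def conv_layer(Z, W, b, pad, stride, pool):
    """Z: (a, b, d_in); W: (d_out, h, h, d_in); b: (d_out,). Returns the pooled output."""
    Zp = np.pad(Z, ((pad, pad), (pad, pad), (0, 0)))          # padding, as in (11)
    a_in, b_in, _ = Zp.shape
    d_out, h = W.shape[0], W.shape[1]
    a_c = (a_in - h) // stride + 1                            # sizes after convolution, Eq. (5)
    b_c = (b_in - h) // stride + 1
    S = np.zeros((a_c, b_c, d_out))
    for j in range(d_out):                                    # convolution, Eq. (3)/(6)
        for p in range(a_c):
            for q in range(b_c):
                sub = Zp[p*stride:p*stride+h, q*stride:q*stride+h, :]
                S[p, q, j] = np.sum(sub * W[j]) + b[j]
    Zc = np.maximum(S, 0)                                     # activation, Eq. (7)/(8)
    a_p, b_p = a_c // pool, b_c // pool                       # non-overlapping max pooling
    return Zc[:a_p*pool, :b_p*pool, :].reshape(a_p, pool, b_p, pool, d_out).max(axis=(1, 3))

rng = np.random.default_rng(0)
Z = rng.standard_normal((6, 6, 3))                            # made-up 6 x 6 input with 3 channels
W = rng.standard_normal((4, 3, 3, 3))                         # 4 filters of size 3 x 3 x 3
b = rng.standard_normal(4)
print(conv_layer(Z, W, b, pad=1, stride=1, pool=2).shape)     # (3, 3, 4)
```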

Chih-Jen Lin (National Taiwan Univ.) 71 / 78

slide-72
SLIDE 72

Optimization problem for convolutional neural networks (CNN)

Fully-Connected Layer I

Assume $L^c$ is the number of convolutional layers. The input vector of the first fully-connected layer is
$$z^{m,i} = \mathrm{vec}(Z^{m,i}), \quad i = 1, \dots, l, \; m = L^c + 1.$$
In each of the fully-connected layers ($L^c < m \le L$), we consider the weight matrix and bias vector between layers $m$ and $m + 1$.

Chih-Jen Lin (National Taiwan Univ.) 72 / 78

slide-73
SLIDE 73

Optimization problem for convolutional neural networks (CNN)

Fully-Connected Layer II

Weight matrix:
$$W^m = \begin{bmatrix} w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\ w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\ \vdots & \vdots & \ddots & \vdots \\ w^m_{n_{m+1} 1} & w^m_{n_{m+1} 2} & \cdots & w^m_{n_{m+1} n_m} \end{bmatrix}_{n_{m+1} \times n_m} \qquad (19)$$
Bias vector:
$$b^m = \begin{bmatrix} b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_{m+1}} \end{bmatrix}_{n_{m+1} \times 1}$$

Chih-Jen Lin (National Taiwan Univ.) 73 / 78

slide-74
SLIDE 74

Optimization problem for convolutional neural networks (CNN)

Fully-Connected Layer III

Here $n_m$ and $n_{m+1}$ are the numbers of nodes in layers $m$ and $m + 1$, respectively.
If $z^{m,i} \in R^{n_m}$ is the input vector, the following operations are applied to generate the output vector $z^{m+1,i} \in R^{n_{m+1}}$:
$$s^{m,i} = W^m z^{m,i} + b^m, \qquad (20)$$
$$z^{m+1,i}_j = \sigma(s^{m,i}_j), \quad j = 1, \dots, n_{m+1}. \qquad (21)$$

Chih-Jen Lin (National Taiwan Univ.) 74 / 78

slide-75
SLIDE 75

Discussion

Outline

1. Regularized linear classification
2. Optimization problem for fully-connected networks
3. Optimization problem for convolutional neural networks (CNN)
4. Discussion

Chih-Jen Lin (National Taiwan Univ.) 75 / 78

slide-76
SLIDE 76

Discussion

Challenges in NN Optimization

The objective function is non-convex. It may have many local minima. It's known that global optimization is much more difficult than local minimization. The problem structure is very complicated. In this course we will have first-hand experience handling these difficulties.

Chih-Jen Lin (National Taiwan Univ.) 76 / 78

slide-77
SLIDE 77

Discussion

Formulation I

We have written all CNN operations in matrix/vector forms. This is useful in deriving the gradient. Are our representation symbols good enough? Can we do better? You can say that this is only a matter of notation, but given the wide use of CNN, a good formulation can be extremely useful.

Chih-Jen Lin (National Taiwan Univ.) 77 / 78

slide-78
SLIDE 78

Discussion

References I

M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105, 2012.

D. Meyer, F. Leisch, and K. Hornik. The support vector machine under test. Neurocomputing, 55:169–186, 2003.

A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 689–692, 2015.

C.-C. Wang, K.-L. Tan, C.-T. Chen, Y.-H. Lin, S. S. Keerthi, D. Mahajan, S. Sundararajan, and C.-J. Lin. Distributed Newton methods for deep learning. Neural Computation, 30(6):1673–1724, 2018. URL http://www.csie.ntu.edu.tw/~cjlin/papers/dnn/dsh.pdf.

Chih-Jen Lin (National Taiwan Univ.) 78 / 78