

SLIDE 1

CS 6316 Machine Learning

Neural Networks

Yangfeng Ji

Department of Computer Science University of Virginia

SLIDE 2

Overview

1. From Logistic Regression to Neural Networks
2. Expressive Power of Neural Networks
3. Learning Neural Networks
4. Computation Graph

SLIDE 3

From Logistic Regression to Neural Networks


SLIDE 5

Logistic Regression

◮ A unified form for y ∈ {−1, +1}:

p(Y = +1 | x) = 1 / (1 + exp(−⟨w, x⟩))  (1)

◮ The sigmoid function σ(a) with a ∈ ℝ:

σ(a) = 1 / (1 + exp(−a))  (2)
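Equations (1)–(2) translate directly into a few lines of Python; a minimal sketch (the function names are my own):

```python
import math

def sigmoid(a):
    """Sigmoid function, Equation (2): sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def lr_prob(w, x):
    """Logistic regression probability, Equation (1): p(Y = +1 | x)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
```

Note that sigmoid(0) = 0.5 and the function increases monotonically toward 1, which is why ⟨w, x⟩ = 0 marks the decision boundary.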

SLIDE 6

Graphical Representation

◮ A specific example of LR:

p(Y = 1 | x) = σ(∑_{j=1}^{4} w_j x_{·,j})  (3)

◮ The graphical representation of this LR model:

[Figure: input layer x₁, x₂, x₃, x₄ fully connected to output layer y]

SLIDE 7

Capacity of a LR

Logistic regression gives a linear decision boundary.

[Figure: a linear decision boundary in the (x₁, x₂) plane]


SLIDE 9

From LR to Neural Networks

Building upon logistic regression, a simple neural network can be constructed as

z_k = σ(∑_{j=1}^{d} w^{(1)}_{k,j} x_{·,j}),  k ∈ [K]  (4)

P(y = 1 | x) = σ(∑_{k=1}^{K} w^{(o)}_k z_k)  (5)

◮ x ∈ ℝ^d: d-dimensional input
◮ y ∈ {−1, +1} (binary classification problem)
◮ {w^{(1)}_{k,j}} and {w^{(o)}_k} are the two sets of parameters
◮ K is the number of hidden units, each of which has the same form as a LR.
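Equations (4)–(5) can be sketched in pure Python (names are mine: `W1` is a list of K weight rows, `w_o` the output weights):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def two_layer_forward(W1, w_o, x):
    """P(y = 1 | x) per Equations (4)-(5): a hidden layer of K
    logistic-regression-like units followed by one output unit."""
    # Equation (4): z_k = sigma(sum_j w1[k][j] * x[j]) for each hidden unit k
    z = [sigmoid(sum(wkj * xj for wkj, xj in zip(row, x))) for row in W1]
    # Equation (5): P(y = 1 | x) = sigma(sum_k w_o[k] * z[k])
    return sigmoid(sum(wk * zk for wk, zk in zip(w_o, z)))
```

With symmetric weights that cancel in the output layer, e.g. `two_layer_forward([[1, 0], [0, 1]], [0.5, -0.5], [2, 2])`, the output is exactly σ(0) = 0.5, which is an easy sanity check.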


SLIDE 11

Mathematical Formulation

◮ Element-wise formulation:

z_k = σ(∑_{j=1}^{d} w^{(1)}_{k,j} x_{·,j}),  k ∈ [K]  (6)

P(y = +1 | x) = σ(∑_{k=1}^{K} w^{(o)}_k z_k)  (7)

◮ Matrix-vector formulation:

z = σ(W^{(1)} x)  (8)

P(y = +1 | x) = σ((w^{(o)})^T z)  (9)

where W^{(1)} ∈ ℝ^{K×d} and w^{(o)} ∈ ℝ^K
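The matrix-vector form (8)–(9) is how the network would typically be written with NumPy; a sketch assuming NumPy is available (shapes are noted in the comments):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W1, w_o, x):
    """Equations (8)-(9): z = sigma(W1 x), P(y = +1 | x) = sigma(w_o^T z)."""
    z = sigmoid(W1 @ x)      # W1: (K, d), x: (d,)  ->  z: (K,)
    return sigmoid(w_o @ z)  # w_o: (K,)            ->  scalar

# Example shapes: K = 3 hidden units, d = 2 inputs
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
w_o = rng.normal(size=3)
x = np.array([1.0, -1.0])
p = forward(W1, w_o, x)
```

The matrix form computes exactly the same quantity as the element-wise sums in Equations (6)–(7), just vectorized over the K hidden units.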

SLIDE 12

Graphical Representation

[Figure: input layer x_{·,1}, …, x_{·,4}; hidden layer z_1, …, z_5; output layer y]

◮ Depth: 2 (two-layer neural network)
◮ Width: 5 (the maximal number of units in each layer)

SLIDE 13

Hypothesis Space

The hypothesis space of neural networks is usually defined by the architecture of the network, which includes

◮ the nodes in the network,
◮ the connections in the network, and
◮ the activation function (e.g., σ)

[Figure: the same two-layer network, with input x_{·,1}, …, x_{·,4}, hidden units z_1, …, z_5, and output y]


SLIDE 16

Other Activation Functions

[Figures: (a) Sign function, (b) Tanh function, (c) ReLU function [Jarrett et al., 2009]]
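The three activations in the figure can be sketched as follows (for the sign function I use the convention sign(0) = 0, which varies by text):

```python
import math

def sign(a):
    """Sign function (here with the convention sign(0) = 0)."""
    return (a > 0) - (a < 0)

def tanh(a):
    """Tanh: squashes inputs into (-1, 1), zero-centered unlike sigmoid."""
    return math.tanh(a)

def relu(a):
    """ReLU [Jarrett et al., 2009]: max(0, a)."""
    return max(0.0, a)
```

Unlike the sigmoid, sign is non-differentiable and tanh/ReLU have different saturation behavior, which matters for the gradient-based learning discussed later.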

SLIDE 17

Another Network/Hypothesis Space

By simply increasing the number of layers or the number of hidden units, we can create another hypothesis space.

[Figure: a network with input layer x_{·,1}, …, x_{·,4}, two hidden layers, and output layer y]

SLIDE 18

Expressive Power of Neural Networks


SLIDE 20

Two-layer NNs with Sign Function

Consider a neural network defined by the following functions

z_k = sign(∑_{j=1}^{d} w^{(1)}_{k,j} x_{·,j}),  k ∈ [K]  (10)

h(x) = sign(∑_{k=1}^{K} w^{(o)}_k z_k)  (11)

where sign(a) is the sign function. h(x) can be rewritten as

h(x) = sign(∑_{k=1}^{K} w^{(o)}_k · sign(∑_{j=1}^{d} w^{(1)}_{k,j} x_{·,j}))  (12)

SLIDE 21

Decision Boundary

h(x) is defined by a combination of K linear predictors.

[Figure: a piecewise-linear decision boundary in the (x₁, x₂) plane]

A similar conclusion applies to other activation functions. [Shalev-Shwartz and Ben-David, 2014, Page 274]


SLIDE 23

Universal Approximation Theorem

Restrict the inputs to be binary: x_{·,j} ∈ {−1, +1} for all j ∈ [d].

Universal Approximation Theorem

For every d, there exists a two-layer neural network (Equations 10–11) such that its hypothesis space contains all functions from {−1, +1}^d to {−1, +1}.

◮ The minimal size of a network that satisfies the theorem is exponential in d
◮ Similar results hold for σ as the activation function

[Shalev-Shwartz and Ben-David, 2014, Section 20.3]
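To make the construction behind the theorem concrete, here is a toy instance of my own (not from the slides): a two-layer sign network realizing XOR on {−1, +1}², with the bias absorbed into the weighted sum as a constant, a common convention. One hidden unit fires only on each positive example, and the output unit ORs them; extending this per-positive-example construction to an arbitrary f: {−1, +1}^d → {−1, +1} is exactly why the network size can grow exponentially in d.

```python
def sign(a):
    return 1 if a >= 0 else -1  # never evaluated at exactly 0 below

def xor_net(x1, x2):
    """Two-layer sign network in the spirit of Equations (10)-(11),
    computing XOR on {-1, +1}. Each hidden unit fires (+1) only when
    the input equals one positive example; the output unit takes an OR."""
    z1 = sign(x1 - x2 - 1)   # +1 only for x = (+1, -1)
    z2 = sign(-x1 + x2 - 1)  # +1 only for x = (-1, +1)
    return sign(z1 + z2 + 1)  # +1 iff at least one hidden unit fired
```

No single linear predictor can separate these four points, so this is the smallest illustration of a sign network exceeding the capacity of a single LR-style unit.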

SLIDE 24

Learning Neural Networks


SLIDE 26

Neural Network Predictions

Consider a binary classification problem with Y ∈ {−1, +1}.

◮ A two-layer neural network gives the following prediction:

P(Y = +1 | x) = σ((w^{(o)})^T σ(W^{(1)} x))  (13)

where {w^{(o)}, W^{(1)}} are the parameters

◮ Assuming the ground-truth label is y, let's introduce an empirical distribution

q(Y = y′ | x) = δ(y′, y) = 1 if y′ = y, and 0 otherwise  (14)


SLIDE 29

Cross Entropy

Given one data point, the loss function of a neural network is usually defined as the cross entropy between the prediction distribution p and the empirical distribution q:

H(q, p) = −q(Y = +1 | x) log p(Y = +1 | x) − q(Y = −1 | x) log p(Y = −1 | x)  (15)

Since q is defined with a delta function, depending on y we have

H(q, p) = −log p(Y = +1 | x) if y = +1, and −log p(Y = −1 | x) if y = −1  (16)

It is equivalent to the negative log-likelihood (NLL) function used in learning LR.
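Equations (15)–(16) in code: since q puts all its mass on the true label, the cross entropy collapses to the NLL of the observed label (a sketch; the function name is mine):

```python
import math

def cross_entropy(p_plus, y):
    """H(q, p) per Equation (16). p_plus is p(Y = +1 | x); q puts all
    mass on the true label y, so H(q, p) = -log p(Y = y | x)."""
    return -math.log(p_plus) if y == +1 else -math.log(1.0 - p_plus)
```

At p_plus = 0.5 the loss is log 2 regardless of y; a confident correct prediction drives the loss toward 0, a confident wrong one blows it up.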


SLIDE 32

ERM

◮ Given a set of training examples S = {(x_i, y_i)}_{i=1}^{m}, the loss function is defined as

L(θ) = −∑_{i=1}^{m} log p(y_i | x_i)  (17)

where θ indicates all the parameters in a network.

◮ For example, θ = {w^{(o)}, W^{(1)}} for the previously defined two-layer neural network
◮ Just like learning a LR, we can use a gradient-based learning algorithm


SLIDE 35

Gradient-based Learning

A simple sketch of gradient-based learning¹:

1. Compute the gradient of θ: ∂L(θ)/∂θ
2. Update the parameters with the gradient

θ^{(new)} ← θ^{(old)} − η · ∂L(θ)/∂θ |_{θ = θ^{(old)}}  (18)

where η is the learning rate

3. Go back to step 1 until convergence

¹ More detail will be discussed in the next lecture
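The three steps above can be sketched as a loop. The toy one-parameter loss L(θ) = (θ − 3)², with gradient 2(θ − 3), is my own choice purely so the result is easy to verify; any differentiable loss and its gradient slot in the same way:

```python
def gradient_descent(grad, theta, eta=0.1, steps=200):
    """Steps 1-3: compute the gradient, apply the update of Equation (18)
    with learning rate eta, repeat (a fixed step budget stands in for a
    convergence check)."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)  # Equation (18)
    return theta

# Toy loss L(theta) = (theta - 3)^2, so dL/dtheta = 2 * (theta - 3)
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
```

Each update is a contraction toward the minimizer θ = 3 here; for neural networks the loss is non-convex, so the same loop only finds a local optimum.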


SLIDE 38

Gradient Computation

Consider the two-layer neural network with one training example (x, y). To further simplify the computation, we assume y = +1:

log p(y | x) = log σ((w^{(o)})^T σ(W^{(1)} x))  (19)

The gradient with respect to w^{(o)} is

∂L(θ)/∂w^{(o)} = (∂ log σ(·)/∂σ(·)) · (∂σ((w^{(o)})^T σ(W^{(1)} x)) / ∂((w^{(o)})^T σ(W^{(1)} x))) · (∂((w^{(o)})^T σ(W^{(1)} x)) / ∂w^{(o)})
= (1 − σ((w^{(o)})^T σ(W^{(1)} x))) · σ(W^{(1)} x)  (20)

which is in a similar form as the LR updating equation.
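Equation (20) can be sanity-checked against a finite-difference approximation; a sketch with NumPy (assumed available), taking L(θ) = log p(y = +1 | x) as on this slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_p(w_o, W1, x):
    """Equation (19): log sigma(w_o^T sigma(W1 x))."""
    return np.log(sigmoid(w_o @ sigmoid(W1 @ x)))

rng = np.random.default_rng(1)
W1, w_o, x = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=2)

# Equation (20): (1 - sigma(w_o^T sigma(W1 x))) * sigma(W1 x)
z = sigmoid(W1 @ x)
grad_analytic = (1.0 - sigmoid(w_o @ z)) * z

# Central finite differences on each coordinate of w_o
eps = 1e-6
grad_numeric = np.array([
    (log_p(w_o + eps * e, W1, x) - log_p(w_o - eps * e, W1, x)) / (2 * eps)
    for e in np.eye(3)
])
```

This kind of numerical gradient check is a standard way to catch errors in hand-derived backward passes.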


SLIDE 40

Gradient Computation (II)

The gradient with respect to W^{(1)} is

∂L(θ)/∂W^{(1)} = (∂ log σ(·)/∂σ(·)) · (∂σ((w^{(o)})^T σ(W^{(1)} x)) / ∂((w^{(o)})^T σ(W^{(1)} x))) · (∂((w^{(o)})^T σ(W^{(1)} x)) / ∂σ(W^{(1)} x)) · (∂σ(W^{(1)} x) / ∂(W^{(1)} x)) · (∂(W^{(1)} x) / ∂W^{(1)})  (21)

◮ Both of them are applications of the chain rule from calculus, plus some derivatives of basic functions
◮ In the neural network literature, this is called the back-propagation algorithm [Rumelhart et al., 1986]

SLIDE 41

Computation Graph

SLIDE 42

Forward Operations

Consider the example of a two-layer neural network:

P(Y = +1 | x) = σ((w^{(o)})^T σ(W^{(1)} x))  (22)

A neural network is a composition of some basic functions and operations, for example

◮ σ(·)
◮ matrix transpose: (w^{(o)})^T
◮ matrix-vector multiplication: W^{(1)} x

SLIDE 43

Forward Graph

The computation graph of the two-layer neural network²:

x → W^{(1)} x → σ → (w^{(o)})^T z → σ → p(Y | x)

with the parameters W^{(1)} and w^{(o)} feeding into the corresponding nodes.

² For simplicity, the transpose operation is omitted from the graph

SLIDE 44

Backward Operations

Similarly, the gradients of neural network parameters are computed with a series of backward operations, each associated with the derivative of some basic function. For example:

∂σ(x)/∂x = σ(x)(1 − σ(x))

∂(a^T x)/∂x = a

∂ log(x)/∂x = 1/x

∂(Wx)/∂W: the gradient of (Wx)_k with respect to the k-th row of W is x^T, i.e., a matrix with x^T stacked in each row
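These basic derivatives in code, each easy to check by hand or by finite differences (a sketch; names are mine):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def d_sigmoid(a):
    """d sigma / d a = sigma(a) * (1 - sigma(a))."""
    return sigmoid(a) * (1.0 - sigmoid(a))

def d_dot_wrt_x(a):
    """Gradient of a^T x with respect to x is simply a."""
    return list(a)

def d_log(x):
    """d log(x) / d x = 1 / x."""
    return 1.0 / x
```

Back-propagation multiplies exactly these local derivatives along the path from the loss back to each parameter.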

SLIDE 45

Backward Graph

With the chain rule, the gradient of the loss function with respect to any parameter can be computed backward, step by step along the path:

x → W^{(1)} x → σ → (w^{(o)})^T z → σ → −log p(Y | x)

with backward edges ∂(W^{(1)} x), ∂σ, ∂((w^{(o)})^T z), and ∂σ carrying the gradients down to ∂W^{(1)} and ∂w^{(o)}.


SLIDE 47

Computation Graph

Perform the forward/backward steps with a graph of basic operations (e.g., PyTorch, TensorFlow).

[Figure: the forward graph and the backward graph from the previous two slides, shown together]

◮ Modular implementation: implement each module with its forward/backward operations together
◮ Automatic differentiation: automatically run the backward step
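A minimal sketch of the "modular implementation" idea in pure Python (my own toy design, not the PyTorch or TensorFlow API): each module stores during its forward pass whatever its backward pass needs, and gradients flow back through the chain of modules.

```python
import math

class Sigmoid:
    """Module with paired forward/backward operations."""
    def forward(self, a):
        self.out = 1.0 / (1.0 + math.exp(-a))
        return self.out
    def backward(self, grad_out):
        # local derivative: d sigma / d a = sigma(a) * (1 - sigma(a))
        return grad_out * self.out * (1.0 - self.out)

class Dot:
    """Computes w^T x for fixed weights w; backward returns the
    gradient with respect to w (the input gradient is omitted here)."""
    def __init__(self, w):
        self.w = w
    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return sum(wi * xi for wi, xi in zip(self.w, self.x))
    def backward(self, grad_out):
        return [grad_out * xi for xi in self.x]

# Forward: p = sigma(w^T x); backward: dp/dw, computed module by module
dot, sig = Dot([0.5, -0.25]), Sigmoid()
p = sig.forward(dot.forward([1.0, 2.0]))
grad_w = dot.backward(sig.backward(1.0))
```

Automatic differentiation frameworks generalize exactly this: they record the graph of forward operations and replay it in reverse, calling each module's backward.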

SLIDE 48

What is Deep Learning?

Definition

Deep Learning is building a system by assembling parameterized modules into a (possibly dynamic) computation graph, and training it to perform a task by optimizing the parameters using a gradient-based method. [LeCun, 2020, AAAI 2020 Keynote]

SLIDE 49

Reference

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proceedings of the 12th International Conference on Computer Vision, pages 2146–2153. IEEE.

LeCun, Y. (2020). Self-supervised learning. AAAI 2020 Keynote.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.

Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.