

  1. CS 6316 Machine Learning: Neural Networks
     Yangfeng Ji, Department of Computer Science, University of Virginia

  2. Overview
     1. From Logistic Regression to Neural Networks
     2. Expressive Power of Neural Networks
     3. Learning Neural Networks
     4. Computation Graph

  3. From Logistic Regression to Neural Networks

  5. Logistic Regression
     ◮ A unified form for y ∈ {−1, +1}:
       $p(Y = +1 \mid x) = \frac{1}{1 + \exp(-\langle w, x \rangle)}$   (1)
     ◮ The sigmoid function σ(a) with a ∈ ℝ:
       $\sigma(a) = \frac{1}{1 + \exp(-a)}$   (2)
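Equations (1) and (2) are straightforward to sketch in code. Below is a minimal NumPy version (the function names `sigmoid` and `lr_prob` are ours, not from the slides):

```python
import numpy as np

def sigmoid(a):
    """Sigmoid function from Equation (2): 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def lr_prob(w, x):
    """Logistic regression from Equation (1): p(Y = +1 | x)."""
    return sigmoid(np.dot(w, x))

# A zero score <w, x> gives an uninformative prediction of 0.5
w = np.array([1.0, -1.0])
x = np.array([0.5, 0.5])
print(lr_prob(w, x))  # 0.5
```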

  6. Graphical Representation
     ◮ A specific example of LR:
       $p(Y = 1 \mid x) = \sigma\left(\sum_{j=1}^{4} w_j x_{\cdot,j}\right)$   (3)
     ◮ The graphical representation of this LR model:
       [Figure: input layer x₁, …, x₄ fully connected to a single output node y]

  7. Capacity of an LR
     Logistic regression gives a linear decision boundary.
     [Figure: a linear decision boundary separating two classes in the (x₁, x₂) plane]

  9. From LR to Neural Networks
     Building on logistic regression, a simple neural network can be constructed as
       $z_k = \sigma\left(\sum_{j=1}^{d} w^{(1)}_{k,j} x_{\cdot,j}\right), \quad k \in [K]$   (4)
       $P(y = +1 \mid x) = \sigma\left(\sum_{k=1}^{K} w^{(o)}_k z_k\right)$   (5)
     ◮ x ∈ ℝᵈ: d-dimensional input
     ◮ y ∈ {−1, +1} (binary classification problem)
     ◮ $\{w^{(1)}_{k,j}\}$ and $\{w^{(o)}_k\}$ are the two sets of parameters, and
     ◮ K is the number of hidden units, each of which has the same form as an LR

  11. Mathematical Formulation
     ◮ Element-wise formulation:
       $z_k = \sigma\left(\sum_{j=1}^{d} w^{(1)}_{k,j} x_{\cdot,j}\right), \quad k \in [K]$   (6)
       $P(y = +1 \mid x) = \sigma\left(\sum_{k=1}^{K} w^{(o)}_k z_k\right)$   (7)
     ◮ Matrix-vector formulation:
       $z = \sigma(W^{(1)} x)$   (8)
       $P(y = +1 \mid x) = \sigma\left((w^{(o)})^{\mathsf T} z\right)$   (9)
       where $W^{(1)} \in \mathbb{R}^{K \times d}$ and $w^{(o)} \in \mathbb{R}^{K}$
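The matrix-vector formulation of Equations (8)-(9) maps directly onto NumPy. A minimal sketch (the function name `forward` and the random test values are ours):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W1, wo, x):
    """Two-layer network from Equations (8)-(9).

    W1: (K, d) first-layer weights; wo: (K,) output weights; x: (d,) input.
    """
    z = sigmoid(W1 @ x)       # Equation (8): hidden activations
    return sigmoid(wo @ z)    # Equation (9): P(y = +1 | x)

rng = np.random.default_rng(0)
K, d = 5, 4
p = forward(rng.normal(size=(K, d)), rng.normal(size=K), rng.normal(size=d))
print(0.0 < p < 1.0)  # True: the output is a valid probability
```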

  12. Graphical Representation
     [Figure: input layer x_{·,1}, …, x_{·,4}; hidden layer z₁, …, z₅; output node y]
     ◮ Depth: 2 (a two-layer neural network)
     ◮ Width: 5 (the maximal number of units in any layer)

  13. Hypothesis Space
     The hypothesis space of neural networks is usually defined by the architecture of the network, which includes
     ◮ the nodes in the network,
     ◮ the connections in the network, and
     ◮ the activation function (e.g., σ).
     [Figure: the two-layer network with input layer x_{·,1}, …, x_{·,4}, hidden layer z₁, …, z₅, and output y]

  16. Other Activation Functions
     [Figure: (a) sign function, (b) tanh function, (c) ReLU function]
     [Jarrett et al., 2009]
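The three activation functions can be written directly in NumPy. Note that NumPy's `np.sign(0)` returns 0, one common convention for the sign function; the function names below are ours:

```python
import numpy as np

def sign(a):
    """Sign function: -1, 0, or +1 element-wise."""
    return np.sign(a)

def tanh(a):
    """Hyperbolic tangent: a smooth activation with range (-1, 1)."""
    return np.tanh(a)

def relu(a):
    """ReLU [Jarrett et al., 2009]: max(0, a) element-wise."""
    return np.maximum(0.0, a)

a = np.array([-2.0, 0.0, 2.0])
print(sign(a))  # [-1.  0.  1.]
print(relu(a))  # [0. 0. 2.]
```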

  17. Another Network/Hypothesis Space
     Simply by increasing the number of layers or the number of hidden units, we can create another hypothesis space.
     [Figure: a deeper network with input layer x_{·,1}, …, x_{·,4}, two hidden layers, and output y]

  18. Expressive Power of Neural Networks

  20. Two-layer NNs with Sign Function
     Consider a neural network defined by the following functions:
       $z_k = \mathrm{sign}\left(\sum_{j=1}^{d} w^{(1)}_{k,j} x_{\cdot,j}\right), \quad k \in [K]$   (10)
       $h(x) = \mathrm{sign}\left(\sum_{k=1}^{K} w^{(o)}_k z_k\right)$   (11)
     where sign(a) is the sign function. h(x) can be rewritten as
       $h(x) = \mathrm{sign}\left(\sum_{k=1}^{K} w^{(o)}_k \cdot \mathrm{sign}\left(\sum_{j=1}^{d} w^{(1)}_{k,j} x_{\cdot,j}\right)\right)$   (12)
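A quick sketch verifying that the flattened form in Equation (12) agrees with the two-step form of Equations (10)-(11); the names `h` and `h_flat` and the random test values are ours:

```python
import numpy as np

def h(W1, wo, x):
    """Two-layer sign network, Equations (10)-(11)."""
    z = np.sign(W1 @ x)                 # Equation (10): hidden sign units
    return np.sign(wo @ z)              # Equation (11): output sign unit

def h_flat(W1, wo, x):
    """The single rewritten expression, Equation (12)."""
    return np.sign(np.sum(wo * np.sign(W1 @ x)))

rng = np.random.default_rng(1)
W1, wo = rng.normal(size=(5, 4)), rng.normal(size=5)
x = rng.normal(size=4)
print(h(W1, wo, x) == h_flat(W1, wo, x))  # True: (12) matches (10)-(11)
```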

  21. Decision Boundary
     h(x) is defined by a combination of K linear predictors.
     [Figure: a piecewise-linear decision boundary in the (x₁, x₂) plane]
     A similar conclusion applies to other activation functions.
     [Shalev-Shwartz and Ben-David, 2014, Page 274]

  23. Universal Approximation Theorem
     Restrict the inputs to be binary: x_{·,j} ∈ {−1, +1} for all j ∈ [d].
     Universal Approximation Theorem. For every d, there exists a two-layer neural network (Equations 10-11) such that its hypothesis space contains all functions from {−1, +1}ᵈ to {−1, +1}.
     ◮ The minimal size of a network that satisfies the theorem is exponential in d
     ◮ Similar results hold for σ as the activation function
     [Shalev-Shwartz and Ben-David, 2014, Section 20.3]

  24. Learning Neural Networks

  26. Neural Network Predictions
     Consider a binary classification problem with label set Y = {−1, +1}.
     ◮ A two-layer neural network gives the following prediction:
       $P(Y = +1 \mid x) = \sigma\left((w^{(o)})^{\mathsf T} \sigma(W^{(1)} x)\right)$   (13)
       where $\{w^{(o)}, W^{(1)}\}$ are the parameters
     ◮ Assuming the ground-truth label is y, let's introduce an empirical distribution
       $q(Y = y' \mid x) = \delta(y', y) = \begin{cases} 1 & y' = y \\ 0 & y' \neq y \end{cases}$   (14)

  29. Cross Entropy
     Given one data point, the loss function of a neural network is usually defined as the cross entropy between the empirical distribution q and the prediction distribution p:
       $H(q, p) = -q(Y = +1 \mid x) \log p(Y = +1 \mid x) - q(Y = -1 \mid x) \log p(Y = -1 \mid x)$   (15)
     Since q is defined with a delta function, depending on y we have
       $H(q, p) = \begin{cases} -\log p(Y = +1 \mid x) & y = +1 \\ -\log p(Y = -1 \mid x) & y = -1 \end{cases}$   (16)
     This is equivalent to the negative log-likelihood (NLL) function used in learning LR.
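Equations (15)-(16) can be checked numerically: when q puts all its mass on the observed label, the cross entropy collapses to the NLL. A minimal sketch (the function name `cross_entropy` and the example probability 0.8 are ours):

```python
import numpy as np

def cross_entropy(q_pos, p_pos):
    """H(q, p) over {-1, +1}, Equation (15).

    q_pos = q(Y = +1 | x), p_pos = p(Y = +1 | x)."""
    eps = 1e-12  # guard against log(0)
    return -(q_pos * np.log(p_pos + eps)
             + (1.0 - q_pos) * np.log(1.0 - p_pos + eps))

p_pos = 0.8
# y = +1: q puts all mass on +1, so H(q, p) = -log p(Y = +1 | x)
print(np.isclose(cross_entropy(1.0, p_pos), -np.log(p_pos)))      # True
# y = -1: q puts all mass on -1, so H(q, p) = -log p(Y = -1 | x)
print(np.isclose(cross_entropy(0.0, p_pos), -np.log(1 - p_pos)))  # True
```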

  32. ERM
     ◮ Given a set of training examples $S = \{(x_i, y_i)\}_{i=1}^{m}$, the loss function is defined as
       $L(\theta) = -\sum_{i=1}^{m} \log p(y_i \mid x_i)$   (17)
       where θ denotes all the parameters in a network
     ◮ For example, $\theta = \{w^{(o)}, W^{(1)}\}$ for the previously defined two-layer neural network
     ◮ Just like learning an LR, we can use a gradient-based learning algorithm

  35. Gradient-based Learning
     A simple sketch of gradient-based learning¹:
     1. Compute the gradient of the loss with respect to θ, $\frac{\partial L(\theta)}{\partial \theta}$
     2. Update the parameters with the gradient:
        $\theta^{(\text{new})} \leftarrow \theta^{(\text{old})} - \eta \cdot \left.\frac{\partial L(\theta)}{\partial \theta}\right|_{\theta = \theta^{(\text{old})}}$   (18)
        where η is the learning rate
     3. Go back to step 1 until convergence
     ¹ More details will be discussed in the next lecture
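The three steps above can be sketched end-to-end for the two-layer network. Since backpropagation is deferred to the next lecture, this sketch uses a finite-difference gradient as a stand-in; all names, the toy dataset, and the hyperparameters (K = 3, η = 0.05) are our assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(theta, X, y):
    """Loss L(theta) from Equation (17) for a small two-layer network.

    theta packs W1 (K x d) and wo (K); K is fixed to 3 here."""
    K = 3
    d = X.shape[1]
    W1 = theta[:K * d].reshape(K, d)
    wo = theta[K * d:]
    p = sigmoid(sigmoid(X @ W1.T) @ wo)   # P(y = +1 | x_i) for every example
    p_y = np.where(y == 1, p, 1.0 - p)    # probability of the observed label
    return -np.sum(np.log(p_y + 1e-12))

def numerical_grad(f, theta, eps=1e-6):
    """Central-difference gradient: a stand-in for backprop (next lecture)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

# A toy dataset: the label is the sign of the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = np.sign(X[:, 0])

theta = 0.1 * rng.normal(size=3 * 4 + 3)   # K*d first-layer + K output weights
eta = 0.05                                 # learning rate
loss0 = nll(theta, X, y)
for _ in range(100):                       # steps 1-3: gradient, update, repeat
    theta = theta - eta * numerical_grad(lambda t: nll(t, X, y), theta)
print(nll(theta, X, y) < loss0)  # True: the loss decreases
```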

  36. Gradient Computation
     Consider the two-layer neural network with one training example (x, y). To further simplify the computation, assume y = +1:
       $\log p(y \mid x) = \log \sigma\left((w^{(o)})^{\mathsf T} \sigma(W^{(1)} x)\right)$   (19)
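Differentiating Equation (19) by the chain rule repeatedly uses the identity d log σ(a)/da = 1 − σ(a), which follows from σ′(a) = σ(a)(1 − σ(a)). A quick finite-difference check of that identity (the test point a = 0.7 is arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Identity used when differentiating (19): d/da log(sigma(a)) = 1 - sigma(a)
a = 0.7
eps = 1e-6
numeric = (np.log(sigmoid(a + eps)) - np.log(sigmoid(a - eps))) / (2 * eps)
analytic = 1.0 - sigmoid(a)
print(np.isclose(numeric, analytic))  # True
```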
