Neural Networks Part 2
Yingyu Liang (yliang@cs.wisc.edu)
Computer Sciences Department, University of Wisconsin, Madison
[Based on slides from Jerry Zhu, Mohit Gupta]
Limited power of a single neuron
• Perceptron: $a = g\left(\sum_d w_d x_d\right)$
• Activation function $g$: linear, step, sigmoid
• Decision boundary is linear even for a nonlinear $g$
• XOR problem: no single neuron can represent XOR (it is not linearly separable)
[Diagram: inputs $1, x_1, \dots, x_D$ weighted by $w_0, w_1, \dots, w_D$ feed one unit computing $a = g\left(\sum_d x_d w_d\right)$]
slide 3
Limited power of a single neuron
• XOR problem
• Wait! If one can represent AND, OR, and NOT, one can represent any logic circuit (including XOR) by connecting them
• Question: how? (see the sketch below)
slide 4
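One possible answer, as a minimal sketch: wire step-activation perceptrons for OR, AND, and NOT together so that XOR(x1, x2) = AND(OR(x1, x2), NOT(AND(x1, x2))). The specific weights below are hand-picked assumptions; any weights realizing the same truth tables would work.

```python
import numpy as np

def perceptron(w, b):
    """One unit with a step activation: returns 1 iff w . x + b > 0."""
    return lambda x: int(np.dot(w, x) + b > 0)

# Hand-picked weights (assumptions, not from the slides).
AND = perceptron(np.array([1.0, 1.0]), -1.5)   # fires only if both inputs are 1
OR  = perceptron(np.array([1.0, 1.0]), -0.5)   # fires if at least one input is 1
NOT = perceptron(np.array([-1.0]), 0.5)        # fires if the input is 0

def XOR(x1, x2):
    # XOR(x1, x2) = AND( OR(x1, x2), NOT(AND(x1, x2)) )
    o = OR(np.array([x1, x2]))
    n = NOT(np.array([AND(np.array([x1, x2]))]))
    return AND(np.array([o, n]))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", XOR(x1, x2))   # prints 0, 1, 1, 0
```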
Multi-layer neural networks
• Standard way to connect Perceptrons
• Example: 1 hidden layer, 1 output layer
• Hidden units: $a_j^{(2)} = g\left(\sum_d w_{jd}^{(2)} x_d\right)$ for $j = 1, 2, 3$
• Output unit: $a = g\left(\sum_j w_j^{(3)} a_j^{(2)}\right)$
[Diagram: Layer 1 (input) $x_1, x_2$; Layer 2 (hidden) three units connected to the inputs by weights $w_{jd}^{(2)}$; Layer 3 (output) one unit connected to the hidden units by weights $w_j^{(3)}$]
slide 9
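A minimal forward pass for this 2-3-1 network, as a sketch; the sigmoid activation and the specific weight values are assumptions for illustration only.

```python
import numpy as np

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed weights, chosen arbitrarily for illustration.
W2 = np.array([[0.5, -0.3],      # w_{jd}^{(2)}: 3 hidden units x 2 inputs
               [0.8,  0.1],
               [-0.2, 0.7]])
w3 = np.array([0.4, -0.6, 0.9])  # w_j^{(3)}: output weights over 3 hidden units

x = np.array([1.0, 2.0])         # one input point (x_1, x_2)

a2 = g(W2 @ x)   # hidden activations a_j^{(2)} = g(sum_d w_{jd}^{(2)} x_d)
a = g(w3 @ a2)   # output a = g(sum_j w_j^{(3)} a_j^{(2)})
print(a2, a)
```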
Neural net for $K$-way classification
• Use $K$ output units
• Training: encode a label $y$ by an indicator vector
  ▪ class 1 = (1, 0, 0, …, 0), class 2 = (0, 1, 0, …, 0), etc.
• Test: choose the class corresponding to the largest output unit
[Diagram: the hidden activations feed $K$ output units $a_k = g\left(\sum_j w_{kj}^{(3)} a_j^{(2)}\right)$, $k = 1, \dots, K$, through weights $w_{kj}^{(3)}$]
slide 11
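A small sketch of both conventions; the class count and the output values below are made up for illustration.

```python
import numpy as np

K = 3  # number of classes (assumed)

def one_hot(label, K):
    """Encode a 0-indexed class label as an indicator vector."""
    y = np.zeros(K)
    y[label] = 1.0
    return y

print(one_hot(1, K))            # class 2 -> [0. 1. 0.]

# At test time: pick the class whose output unit is largest.
a = np.array([0.2, 0.7, 0.4])   # assumed network outputs a_1..a_K
predicted_class = np.argmax(a)  # -> 1, i.e. class 2
```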
The (unlimited) power of neural networks
• In theory we don't need too many layers:
  ▪ a 1-hidden-layer net with enough hidden units can represent any continuous function of the inputs with arbitrary accuracy
  ▪ a 2-hidden-layer net can even represent discontinuous functions
slide 14
Learning in neural networks
• Again we will minimize the error ($K$ outputs):
  $E = \sum_{x \in D} E_x$, where $E_x = \|y - a\|^2 = \sum_{k=1}^{K} (a_k - y_k)^2$
• $x$: one training point in the training set $D$
• $a_k$: the $k$-th output for the training point $x$
• $y_k$: the $k$-th element of the label indicator vector for $x$
[Diagram: the outputs $a_1, \dots, a_K$ are compared against the indicator vector $y = (1, 0, \dots, 0)$]
• Our variables are all the weights $w$ on all the edges
  ▪ Apparent difficulty: we don't know the "correct" outputs of the hidden units
  ▪ It turns out to be OK: we can still do gradient descent (see the sketch below); the trick we need is the chain rule
  ▪ The resulting algorithm is known as back-propagation
slide 17
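A minimal sketch of that gradient-descent loop over all the weights; the learning rate, step count, and the grad_E callback (which back-propagation will supply) are assumptions, not part of the slides.

```python
import numpy as np

def gradient_descent(w, grad_E, alpha=0.1, steps=1000):
    """Minimize E over all edge weights w (flattened into one vector).

    grad_E(w) returns dE/dw at the current weights; for a neural net,
    back-propagation computes it via the chain rule (next slides).
    """
    for _ in range(steps):
        w = w - alpha * grad_E(w)   # step against the gradient
    return w

# Tiny sanity check on a function with a known gradient: E(w) = ||w||^2.
w_opt = gradient_descent(np.array([3.0, -2.0]), grad_E=lambda w: 2 * w)
print(w_opt)   # approaches (0, 0)
```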
Gradient (on one data point)
[Diagram: a 4-layer network; the Layer (4) output unit computes $a_1 = g\left(z_1^{(4)}\right)$ from $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, and the error is $E_x = \|y - a\|^2$]
• We want to compute $\frac{\partial E_x}{\partial w_{11}^{(4)}}$
• By the chain rule:
  $\frac{\partial E_x}{\partial w_{11}^{(4)}} = \frac{\partial E_x}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1^{(4)}} \cdot \frac{\partial z_1^{(4)}}{\partial w_{11}^{(4)}}$
• The three factors:
  $\frac{\partial E_x}{\partial a_1} = 2(a_1 - y_1)$, $\quad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\left(z_1^{(4)}\right)$, $\quad \frac{\partial z_1^{(4)}}{\partial w_{11}^{(4)}} = a_1^{(3)}$
• Putting them together:
  $\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, g'\left(z_1^{(4)}\right) a_1^{(3)}$
• For the sigmoid, $g'(z) = g(z)\left(1 - g(z)\right)$, so
  $\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, g\left(z_1^{(4)}\right)\left(1 - g\left(z_1^{(4)}\right)\right) a_1^{(3)} = 2(a_1 - y_1)\, a_1 (1 - a_1)\, a_1^{(3)}$
• All of these quantities can be computed from the network activations
slide 27
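A minimal numeric check of the final formula, as a sketch; the activation is assumed sigmoid, and the layer-3 activations, weights, and label are made-up values for illustration.

```python
import numpy as np

def g(z):
    """Sigmoid; note g'(z) = g(z) * (1 - g(z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed values, for illustration only.
a3 = np.array([0.6, 0.3])   # layer-3 activations a_1^{(3)}, a_2^{(3)}
w4 = np.array([0.5, -0.4])  # weights w_11^{(4)}, w_12^{(4)}
y1 = 1.0                    # target for output unit 1

def E_x(w):
    # Only output unit 1 depends on w_11^{(4)}, so we model its term of ||y - a||^2.
    z1 = w @ a3             # z_1^{(4)} = w_11 a_1^{(3)} + w_12 a_2^{(3)}
    a1 = g(z1)              # a_1 = g(z_1^{(4)})
    return (a1 - y1) ** 2

# Gradient from the derived formula: 2(a_1 - y_1) a_1 (1 - a_1) a_1^{(3)}
a1 = g(w4 @ a3)
grad_formula = 2 * (a1 - y1) * a1 * (1 - a1) * a3[0]

# Finite-difference check of dE_x / dw_11^{(4)}
eps = 1e-6
grad_numeric = (E_x(w4 + np.array([eps, 0.0])) -
                E_x(w4 - np.array([eps, 0.0]))) / (2 * eps)

print(grad_formula, grad_numeric)  # should agree to ~6 decimal places
```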