CZECH TECHNICAL UNIVERSITY IN PRAGUE
Faculty of Electrical Engineering
Department of Cybernetics

Neural Networks

Petr Pošík
Czech Technical University in Prague, Faculty of Electrical Engineering, Dept. of Cybernetics

P. Pošík © 2017, Artificial Intelligence – 1 / 32
Introduction and Rehearsal

Outline:
■ Intro: Notation • Multiple regression • Logistic regression • Gradient descent • Ex: Grad. for MR • Ex: Grad. for LR • Relations to NN
■ Multilayer FFN
■ Gradient Descent
■ Regularization
■ Other NNs
■ Summary
Notation

In supervised learning, we work with
■ an observation described by a vector $x = (x_1, \dots, x_D)$,
■ the corresponding true value of the dependent variable $y$, and
■ the prediction of a model $\hat{y} = f_w(x)$, where the model parameters are in vector $w$.
■ Very often, we use homogeneous coordinates and matrix notation, and represent the whole training data set as $T = (X, y)$, where

    $X = \begin{pmatrix} x^{(1)} & 1 \\ \vdots & \vdots \\ x^{(|T|)} & 1 \end{pmatrix}$, and $y = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(|T|)} \end{pmatrix}$.

Learning then amounts to finding such model parameters $w^*$ which minimize a certain loss (or energy) function:

    $w^* = \arg\min_w J(w, T)$
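The homogeneous-coordinate convention above (a constant 1 appended to every observation so that the bias becomes an ordinary weight) can be sketched in plain Python; the data values below are made up for illustration:

```python
# Build the design matrix X in homogeneous coordinates: each row is an
# observation x^(i) with a constant 1 appended, so the bias term is
# handled by an ordinary weight w_{D+1}.

def design_matrix(observations):
    """Append a homogeneous coordinate 1 to every observation vector."""
    return [list(x) + [1.0] for x in observations]

observations = [(2.0, 3.0), (1.0, 0.5), (4.0, 1.0)]  # three 2-D points
X = design_matrix(observations)
# X == [[2.0, 3.0, 1.0], [1.0, 0.5, 1.0], [4.0, 1.0, 1.0]]
```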
Multiple linear regression

Multiple linear regression model:

    $\hat{y} = f_w(x) = w_1 x_1 + w_2 x_2 + \dots + w_D x_D = x w^T$

The minimum of

    $J_{MSE}(w) = \frac{1}{|T|} \sum_{i=1}^{|T|} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$

is given by

    $w^* = (X^T X)^{-1} X^T y$,

or found by numerical optimization.

Multiple regression as a linear neuron:
[Figure: inputs $x_1, x_2, x_3$ weighted by $w$ feed a single linear unit that outputs $\hat{y}$.]
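A minimal sketch of the closed-form solution, in plain Python (no numpy): rather than forming the inverse explicitly, we solve the equivalent linear system $(X^T X)\,w = X^T y$ by Gaussian elimination. The data are made up, generated noise-free from $y = 2x_1 - x_2 + 0.5$ so the recovered weights are known:

```python
# Fit multiple linear regression via the normal equation
# w* = (X^T X)^{-1} X^T y, solved as the system (X^T X) w = X^T y.

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Homogeneous coordinates: last column of 1s plays the role of the bias.
X = [[1.0, 1.0, 1.0], [2.0, 0.0, 1.0], [0.0, 3.0, 1.0], [1.0, 2.0, 1.0]]
y = [1.5, 4.5, -2.5, 0.5]          # generated from y = 2*x1 - x2 + 0.5

XtX = matmul(transpose(X), X)
Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in transpose(X)]
w = solve(XtX, Xty)                # recovers approximately [2.0, -1.0, 0.5]
```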
Logistic regression

Logistic regression model:

    $\hat{y} = f_w(x) = g(x w^T)$,

where

    $g(z) = \frac{1}{1 + e^{-z}}$

is the sigmoid (a.k.a. logistic) function.

■ There is no explicit equation for the optimal weights.
■ The only option is to find the optimum numerically, usually by some form of gradient descent.

Logistic regression as a non-linear neuron:
[Figure: inputs $x_1, x_2, x_3$ weighted by $w$ feed a unit computing $g(x w^T)$, which outputs $\hat{y}$.]
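The sigmoid is easy to sketch directly from its definition; note that it squashes any real input into $(0, 1)$, so the output can be read as a class-1 probability with $g(0) = 0.5$ at the decision boundary:

```python
import math

# The sigmoid (logistic) function g(z) = 1 / (1 + exp(-z)).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# g maps the whole real line into the open interval (0, 1):
assert sigmoid(0.0) == 0.5
assert 0.0 < sigmoid(-10.0) < 0.5 < sigmoid(10.0) < 1.0
```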
Gradient descent algorithm

■ Given a function $J(w)$ that should be minimized,
■ start with a guess of $w$, and change it so that $J(w)$ decreases, i.e.,
■ update our current guess of $w$ by taking a step in the direction opposite to the gradient:

    $w \leftarrow w - \eta \nabla J(w)$, i.e.,
    $w_d \leftarrow w_d - \eta \frac{\partial}{\partial w_d} J(w)$,

where all $w_d$ are updated simultaneously and $\eta$ is a learning rate (step size).
■ For cost functions given as a sum across the training examples,

    $J(w) = \sum_{i=1}^{|T|} E(w, x^{(i)}, y^{(i)})$,

we can concentrate on a single training example, because

    $\frac{\partial}{\partial w_d} J(w) = \sum_{i=1}^{|T|} \frac{\partial}{\partial w_d} E(w, x^{(i)}, y^{(i)})$,

and we can drop the indices over the training data set: $E = E(w, x, y)$.
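The update rule above can be sketched as a short loop. The objective here, $J(w) = (w_1 - 3)^2 + (w_2 + 1)^2$ with its hand-derived gradient, and the choices of $\eta$ and the iteration count, are illustrative assumptions, not from the slides:

```python
# Generic gradient descent: repeatedly step against the gradient.

def grad_J(w):
    # Gradient of J(w) = (w1 - 3)^2 + (w2 + 1)^2, minimized at [3, -1].
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

def gradient_descent(grad, w, eta=0.1, steps=100):
    for _ in range(steps):
        g = grad(w)
        # Update all coordinates simultaneously: w_d <- w_d - eta * dJ/dw_d
        w = [wd - eta * gd for wd, gd in zip(w, g)]
    return w

w_star = gradient_descent(grad_J, [0.0, 0.0])  # converges toward [3, -1]
```

With a learning rate that is too large the iterates diverge; too small and convergence is needlessly slow, which is why $\eta$ is treated as a tunable hyperparameter.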
Example: Gradient for multiple regression and squared loss

[Figure: linear neuron with inputs $x_1, x_2, x_3$, weights $w$, and output $\hat{y}$.]

Assuming the squared error loss

    $E(w, x, y) = \frac{1}{2}(y - \hat{y})^2 = \frac{1}{2}(y - x w^T)^2$,

we can compute the derivatives using the chain rule as

    $\frac{\partial E}{\partial w_d} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_d}$, where

    $\frac{\partial E}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}} \frac{1}{2}(y - \hat{y})^2 = -(y - \hat{y})$, and

    $\frac{\partial \hat{y}}{\partial w_d} = \frac{\partial (x w^T)}{\partial w_d} = x_d$,

and thus

    $\frac{\partial E}{\partial w_d} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_d} = -(y - \hat{y})\, x_d$.
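The derived gradient $-(y - \hat{y})\,x_d$ can be sanity-checked against a numerical finite-difference derivative; the example numbers below are made up:

```python
# Analytic gradient of the squared loss E = 0.5 * (y - x.w)^2 for one
# training example, checked against central finite differences.

def predict(w, x):
    return sum(wd * xd for wd, xd in zip(w, x))

def loss(w, x, y):
    return 0.5 * (y - predict(w, x)) ** 2

def grad(w, x, y):
    err = y - predict(w, x)          # (y - yhat)
    return [-err * xd for xd in x]   # dE/dw_d = -(y - yhat) * x_d

w, x, y = [0.5, -0.2, 0.1], [1.0, 2.0, 3.0], 1.0
analytic = grad(w, x, y)

eps = 1e-6
numeric = []
for d in range(len(w)):
    wp = w[:]; wp[d] += eps
    wm = w[:]; wm[d] -= eps
    numeric.append((loss(wp, x, y) - loss(wm, x, y)) / (2 * eps))
# analytic and numeric agree to within finite-difference error
```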
Example: Gradient for logistic regression and cross-entropy loss

[Figure: non-linear neuron with inputs $x_1, x_2, x_3$, activation $a = x w^T$, and output $\hat{y} = g(a)$.]

Nonlinear activation function:

    $g(a) = \frac{1}{1 + e^{-a}}$

Note that $g'(a) = g(a)(1 - g(a))$.

Assuming the cross-entropy loss

    $E(w, x, y) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$, where $\hat{y} = g(a) = g(x w^T)$,

we can compute the derivatives using the chain rule as

    $\frac{\partial E}{\partial w_d} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a} \frac{\partial a}{\partial w_d}$, where

    $\frac{\partial E}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = -\frac{y - \hat{y}}{\hat{y}(1 - \hat{y})}$,

    $\frac{\partial \hat{y}}{\partial a} = \hat{y}(1 - \hat{y})$, and $\frac{\partial a}{\partial w_d} = \frac{\partial (x w^T)}{\partial w_d} = x_d$,

and thus

    $\frac{\partial E}{\partial w_d} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a} \frac{\partial a}{\partial w_d} = -(y - \hat{y})\, x_d$.
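The remarkable outcome above, that the cross-entropy gradient of the sigmoid unit collapses to the same $-(y - \hat{y})\,x_d$ form as the linear neuron with squared loss, can also be verified numerically. The example numbers are made up:

```python
import math

# Analytic gradient of the cross-entropy loss for a sigmoid unit,
# checked against central finite differences.

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(w, x, y):
    yhat = sigmoid(sum(wd * xd for wd, xd in zip(w, x)))
    return -y * math.log(yhat) - (1 - y) * math.log(1 - yhat)

def grad(w, x, y):
    yhat = sigmoid(sum(wd * xd for wd, xd in zip(w, x)))
    return [-(y - yhat) * xd for xd in x]   # same form as the linear neuron

w, x, y = [0.2, -0.4, 0.1], [1.0, 0.5, 1.0], 1.0
analytic = grad(w, x, y)

eps = 1e-6
numeric = []
for d in range(len(w)):
    wp = w[:]; wp[d] += eps
    wm = w[:]; wm[d] -= eps
    numeric.append((cross_entropy(wp, x, y) - cross_entropy(wm, x, y)) / (2 * eps))
# analytic and numeric agree to within finite-difference error
```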
Relations to neural networks

■ Above, we derived training algorithms (based on gradient descent) for a linear regression model and a linear classification model.
■ Note the similarity with the perceptron algorithm ("just add a certain part of a misclassified training example to the weight vector").
■ Units like those above are used as building blocks for more complex/flexible models!