
Neural Networks. Petr Pošík, Czech Technical University in Prague



  1. CZECH TECHNICAL UNIVERSITY IN PRAGUE
Faculty of Electrical Engineering, Department of Cybernetics
Neural Networks
Petr Pošík, © 2017

  2. Introduction and Rehearsal

  3–5. Notation

In supervised learning, we work with
■ an observation described by a vector $x = (x_1, \dots, x_D)$,
■ the corresponding true value of the dependent variable $y$, and
■ the prediction of a model $\hat{y} = f_w(x)$, where the model parameters are in vector $w$.

■ Very often, we use homogeneous coordinates and matrix notation, and represent the whole training data set as $T = (X, y)$, where

$$X = \begin{pmatrix} x^{(1)} & 1 \\ \vdots & \vdots \\ x^{(|T|)} & 1 \end{pmatrix}, \qquad y = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(|T|)} \end{pmatrix}.$$

Learning then amounts to finding such model parameters $w^*$ which minimize a certain loss (or energy) function:

$$w^* = \arg\min_w J(w, T)$$
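To make the notation concrete, here is a minimal NumPy sketch (the toy data and the helper name add_homogeneous_coordinate are ours, purely for illustration):

```python
import numpy as np

def add_homogeneous_coordinate(X_raw):
    """Append the constant 1 to each observation, turning rows
    (x_1, ..., x_D) into (x_1, ..., x_D, 1) as on the slide."""
    return np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])

# A toy training set T = (X, y) with |T| = 3 examples and D = 2 features.
X = add_homogeneous_coordinate(np.array([[0.5, 1.2],
                                         [1.0, 0.3],
                                         [2.1, 0.8]]))
y = np.array([1.0, 0.0, 1.0])
```

With this convention, the last component of the weight vector $w$ plays the role of the bias term, so linear models can be written compactly as $x w^T$.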

  6–7. Multiple linear regression

Multiple linear regression model:

$$\hat{y} = f_w(x) = w_1 x_1 + w_2 x_2 + \dots + w_D x_D = x w^T$$

The minimum of

$$J_{MSE}(w) = \frac{1}{|T|} \sum_{i=1}^{|T|} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$

is given by

$$w^* = (X^T X)^{-1} X^T y,$$

or found by numerical optimization.

Multiple regression as a linear neuron:
[Figure: inputs $x_1, \dots, x_D$ with weights $w_d$ are summed by a single linear unit to produce $\hat{y}$.]
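A sketch of the closed-form fit above (the function names are illustrative; np.linalg.solve is applied to the normal equations $X^T X w = X^T y$ rather than forming the explicit inverse, which evaluates the same formula in a numerically safer way):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equations (X^T X) w = X^T y,
    i.e. evaluate w* = (X^T X)^{-1} X^T y without an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict(X, w):
    # ŷ = x w^T for every row of X (X already holds the homogeneous 1)
    return X @ w
```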

  8–9. Logistic regression

Logistic regression model:

$$\hat{y} = f(w, x) = g(x w^T),$$

where

$$g(z) = \frac{1}{1 + e^{-z}}$$

is the sigmoid (a.k.a. logistic) function.

■ There is no explicit equation for the optimal weights.
■ The only option is to find the optimum numerically, usually by some form of gradient descent.

Logistic regression as a non-linear neuron:
[Figure: inputs $x_1, \dots, x_D$ with weights $w_d$ feed a unit computing $g(x w^T)$ to produce $\hat{y}$.]
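A minimal sketch of the model itself (training is deferred to the gradient-descent slides that follow; the function names are ours):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    # ŷ = g(x w^T) for every row of X
    return sigmoid(X @ w)
```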

  10. Gradient descent algorithm

■ Given a function $J(w)$ that should be minimized,
■ start with a guess of $w$, and change it so that $J(w)$ decreases, i.e.
■ update the current guess of $w$ by taking a step in the direction opposite to the gradient:

$$w \leftarrow w - \eta \nabla J(w), \quad \text{i.e.} \quad w_d \leftarrow w_d - \eta \frac{\partial}{\partial w_d} J(w),$$

where all $w_d$'s are updated simultaneously and $\eta$ is a learning rate (step size).

■ For cost functions given as a sum across the training examples,

$$J(w) = \sum_{i=1}^{|T|} E(w, x^{(i)}, y^{(i)}),$$

we can concentrate on a single training example, because

$$\frac{\partial}{\partial w_d} J(w) = \sum_{i=1}^{|T|} \frac{\partial}{\partial w_d} E(w, x^{(i)}, y^{(i)}),$$

and we can drop the indices over the training data set: $E = E(w, x, y)$.
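The update rule, written out as a small sketch (the signature and the fixed number of steps are our simplifications; in practice one would use a convergence test):

```python
import numpy as np

def gradient_descent(grad_J, w0, eta=0.1, n_steps=1000):
    """Minimize J by repeated steps w <- w - eta * grad J(w).

    grad_J: function returning the gradient of the loss at w
    eta:    learning rate (step size)
    """
    w = w0.copy()
    for _ in range(n_steps):
        w = w - eta * grad_J(w)  # all w_d are updated simultaneously
    return w
```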

  11. Example: Gradient for multiple regression and squared loss

[Figure: the linear neuron from slide 7: inputs $x_d$, weights $w_d$, output $\hat{y}$.]

Assuming the squared error loss

$$E(w, x, y) = \frac{1}{2}(y - \hat{y})^2 = \frac{1}{2}(y - x w^T)^2,$$

we can compute the derivatives using the chain rule as

$$\frac{\partial E}{\partial w_d} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_d}, \quad \text{where}$$

$$\frac{\partial E}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}} \frac{1}{2}(y - \hat{y})^2 = -(y - \hat{y}), \quad \text{and} \quad \frac{\partial \hat{y}}{\partial w_d} = \frac{\partial\, x w^T}{\partial w_d} = x_d,$$

and thus

$$\frac{\partial E}{\partial w_d} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_d} = -(y - \hat{y})\, x_d.$$
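The resulting per-example gradient, $-(y - \hat{y})\, x_d$, in code (a sketch for a single training example; summing over $i$ gives the full batch gradient):

```python
import numpy as np

def squared_loss_grad(w, x, y):
    """Gradient of E = 1/2 (y - x w^T)^2 for one example (x, y)."""
    y_hat = x @ w            # ŷ = x w^T
    return -(y - y_hat) * x  # ∂E/∂w_d = -(y - ŷ) x_d
```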

  12–13. Example: Gradient for logistic regression and crossentropy loss

Nonlinear activation function:

$$g(a) = \frac{1}{1 + e^{-a}}$$

Note that $g'(a) = g(a)\,(1 - g(a))$.

[Figure: the non-linear neuron from slide 9: inputs $x_d$, weights $w_d$, activation $a = x w^T$, output $\hat{y} = g(a)$.]

Assuming the crossentropy loss

$$E(w, x, y) = -y \log \hat{y} - (1 - y)\log(1 - \hat{y}), \quad \text{where} \quad \hat{y} = g(a) = g(x w^T),$$

we can compute the derivatives using the chain rule as

$$\frac{\partial E}{\partial w_d} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a} \frac{\partial a}{\partial w_d}, \quad \text{where}$$

$$\frac{\partial E}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = -\frac{y - \hat{y}}{\hat{y}(1 - \hat{y})}, \quad \frac{\partial \hat{y}}{\partial a} = \hat{y}(1 - \hat{y}), \quad \text{and} \quad \frac{\partial a}{\partial w_d} = \frac{\partial\, x w^T}{\partial w_d} = x_d,$$

and thus

$$\frac{\partial E}{\partial w_d} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a} \frac{\partial a}{\partial w_d} = -(y - \hat{y})\, x_d.$$
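Note the cancellation in the derivation: the $\hat{y}(1 - \hat{y})$ factor from the sigmoid cancels against the denominator of $\partial E / \partial \hat{y}$, so the gradient has exactly the same form as for the linear neuron with squared loss. A one-example sketch:

```python
import numpy as np

def crossentropy_loss_grad(w, x, y):
    """Gradient of the cross-entropy loss for one example (x, y)."""
    y_hat = 1.0 / (1.0 + np.exp(-(x @ w)))  # ŷ = g(x w^T)
    return -(y - y_hat) * x                 # ∂E/∂w_d = -(y - ŷ) x_d
```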

  14. Relations to neural networks

■ Above, we derived training algorithms (based on gradient descent) for a linear regression model and a linear classification model.
■ Note the similarity with the perceptron algorithm (“just add a certain part of a misclassified training example to the weight vector”).
■ Units like those above are used as building blocks for more complex/flexible models!

