

slide-1
SLIDE 1

Aykut Erdem // Hacettepe University // Fall 2019

Lecture 12:

Computational Graph Backpropagation

BBM406

Fundamentals of 
 Machine Learning

Illustration: 3Blue1Brown

slide-2
SLIDE 2

Last time… 
 Multilayer Perceptron

2

  • Layer Representation
  • (typically) iterate between a linear mapping Wx and a nonlinear function
  • Loss function to measure quality of the estimate so far

y_i = W_i x_i,   x_{i+1} = σ(y_i)

[Figure: a chain of layers x_1 → x_2 → x_3 → x_4 → y with weight matrices W_1, W_2, W_3, W_4, followed by the loss l(y, y_i)]

slide by Alex Smola
slide-3
SLIDE 3

Last time… Forward Pass

  • Output of the network can be written as:

(j indexing hidden units, k indexing the output units, D number of inputs)

  • Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)

3

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

  • o_k(x) = g( w_{k0} + Σ_{j=1}^{J} h_j(x) w_{kj} ),  where  h_j(x) = f( v_{j0} + Σ_{i=1}^{D} x_i v_{ji} )

  σ(z) = 1 / (1 + exp(−z)),   tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)),   ReLU(z) = max(0, z)

slide-4
SLIDE 4

Last time… Forward Pass in Python

  • Example code for a forward pass for a 3-layer network in Python (a sketch is given below)
  • Can be implemented efficiently using matrix operations
  • Example above: W1 is a matrix of size 4 × 3, W2 is 4 × 4. What about the biases and W3?
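The code shown on the slide is not reproduced in this extraction; below is a minimal sketch in the spirit of the cs231n example linked on this slide, assuming sigmoid activations and randomly initialized weights and biases (the shapes of W3 and the biases are assumptions answering the question above):

import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))   # activation function (sigmoid), applied elementwise

x = np.random.randn(3, 1)                                # random input vector of three numbers (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)    # first layer: 4x3 weights, 4x1 bias
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)    # second layer: 4x4 weights, 4x1 bias
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)    # output layer (assumed 1x4 weights, 1x1 bias)

h1 = f(np.dot(W1, x) + b1)               # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)              # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3                # output (1x1)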

4

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

[http://cs231n.github.io/neural-networks-1/]

slide-5
SLIDE 5

Backpropagation

5

slide-6
SLIDE 6

Recap: Loss function/Optimization

6

[Figure: a few training images with the class scores assigned to them by the current score function]

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
We defined a (linear) score function.

TODO:
  • 1. Define a loss function that quantifies our unhappiness with the scores across the training data.
  • 2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

slide-7
SLIDE 7

7

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-8
SLIDE 8

8

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-9
SLIDE 9

9

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-10
SLIDE 10

10

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-11
SLIDE 11

11

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-12
SLIDE 12

12

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-13
SLIDE 13

13

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-14
SLIDE 14

14

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-15
SLIDE 15

15

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-16
SLIDE 16

Softmax Classifier (Multinomial Logistic Regression)

16

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
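The formulas on the softmax slides are not reproduced in this extraction. As a reminder, the softmax classifier interprets the scores as unnormalized log-probabilities of the classes and minimizes the cross-entropy loss L_i = −log( exp(s_{y_i}) / Σ_j exp(s_j) ). A tiny numerically stable sketch (the score values and the choice of correct class are hypothetical):

import numpy as np

scores = np.array([3.2, 5.1, -1.7])      # hypothetical unnormalized class scores for one example
probs = np.exp(scores - np.max(scores))  # shift scores for numerical stability
probs /= np.sum(probs)                   # softmax probabilities
loss = -np.log(probs[0])                 # cross-entropy loss if class 0 is the correct class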
slide-17
SLIDE 17

Optimization

17

slide-18
SLIDE 18

Gradient Descent

18

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-19
SLIDE 19

Mini-batch Gradient Descent

  • only use a small portion of the training set to compute the gradient

19

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-20
SLIDE 20

Mini-batch Gradient Descent

  • only use a small portion of the training set to compute the gradient

20

there are also more fancy update formulas (momentum, Adagrad, RMSProp, Adam, …)
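The update loop itself is not reproduced in this extraction; a minimal runnable sketch of vanilla mini-batch SGD on toy data (the data, the squared-error loss, and all names here are assumptions for illustration, not the lecture's classifier loss):

import numpy as np

X, y = np.random.randn(1000, 3), np.random.randn(1000)   # toy dataset
weights, step_size = np.zeros(3), 1e-2

for step in range(100):
    idx = np.random.choice(len(X), 256, replace=False)    # sample a mini-batch of 256 examples
    Xb, yb = X[idx], y[idx]
    preds = Xb.dot(weights)
    loss = np.mean((preds - yb) ** 2)                      # loss on the mini-batch
    weights_grad = 2 * Xb.T.dot(preds - yb) / len(yb)      # gradient of the batch loss
    weights += -step_size * weights_grad                   # vanilla parameter update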

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-21
SLIDE 21

The effects of different update formulas

21

(image credits to Alec Radford)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-22
SLIDE 22

22

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-23
SLIDE 23

23

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-24
SLIDE 24

Computational Graph

24

[Graph: inputs x and W feed a * node that produces the scores s; the scores go into a hinge loss node, W also goes into a regularization node R, and the two are added (+) to give the total loss L]

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-25
SLIDE 25

Convolutional Network (AlexNet)

[Graph: input image and weights flow through the network to the loss]

25

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-26
SLIDE 26

Neural Turing Machine

[Graph: input and tape flow through the network to the loss]

26

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-27
SLIDE 27

27

e.g. x = -2, y = 5, z = -4

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-28
SLIDE 28

28

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-29
SLIDE 29

29

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-30
SLIDE 30

30

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-31
SLIDE 31

31

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-32
SLIDE 32

32

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-33
SLIDE 33

33

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-34
SLIDE 34

34

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-35
SLIDE 35

35

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-36
SLIDE 36

36

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-37
SLIDE 37

37

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Chain rule:

slide-38
SLIDE 38

38

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-39
SLIDE 39

39

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Chain rule:
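The function being differentiated on these slides is not reproduced in this extraction. Assuming the standard example that accompanies these values, f(x, y, z) = (x + y) · z with the intermediate q = x + y, the forward and backward passes work out as follows:

Forward:  q = x + y = 3,   f = q · z = −12
Backward (chain rule):
  ∂f/∂z = q = 3
  ∂f/∂q = z = −4
  ∂f/∂x = ∂f/∂q · ∂q/∂x = (−4) · 1 = −4
  ∂f/∂y = ∂f/∂q · ∂q/∂y = (−4) · 1 = −4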

slide-40
SLIDE 40

40

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

f

activations

slide-41
SLIDE 41

41

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

f

activations

“local gradient”

slide-42
SLIDE 42

42

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

f

activations gradients

“local gradient”

slide-43
SLIDE 43

43

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

f

activations gradients

“local gradient”

slide-44
SLIDE 44

44

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

f

activations gradients

“local gradient”

slide-45
SLIDE 45

45

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

f

activations gradients

“local gradient”
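In symbols: for a gate that computes z = f(x, y) somewhere inside the circuit, with final loss L, backpropagation multiplies the gate's local gradients by the gradient arriving from above, ∂L/∂x = (∂z/∂x) · (∂L/∂z) and ∂L/∂y = (∂z/∂y) · (∂L/∂z), and passes these back to the gate's inputs.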

slide-46
SLIDE 46

46

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-47
SLIDE 47

47

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-48
SLIDE 48

48

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-49
SLIDE 49

49

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-50
SLIDE 50

50

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-51
SLIDE 51

51

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-52
SLIDE 52

52

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-53
SLIDE 53

53

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-54
SLIDE 54

54

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-55
SLIDE 55

55

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

(-1) * (-0.20) = 0.20

slide-56
SLIDE 56

56

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-57
SLIDE 57

57

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

[local gradient] × [its gradient]:  [1] × [0.2] = 0.2  and  [1] × [0.2] = 0.2  (both inputs!)

slide-58
SLIDE 58

58

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-59
SLIDE 59

59

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

[local gradient] × [its gradient]:  x0: [2] × [0.2] = 0.4,  w0: [−1] × [0.2] = −0.2

0.40
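The expression being differentiated in this running example is not reproduced in this extraction. The numbers on these slides are consistent with the standard two-dimensional sigmoid-neuron example f(w, x) = 1 / (1 + exp(−(w0·x0 + w1·x1 + w2))) evaluated at w0 = 2, x0 = −1, w1 = −3, x1 = −2, w2 = −3 (an assumption, stated here so the intermediate values above can be followed): the forward pass gives 0.73 at the output, the gradient flowing back into the dot product is 0.2, and multiplying by the local gradients gives ∂f/∂x0 = w0 · 0.2 = 0.4 and ∂f/∂w0 = x0 · 0.2 = −0.2, as shown.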

slide-60
SLIDE 60

60

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

sigmoid function sigmoid gate

0.40

slide-61
SLIDE 61

61

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

sigmoid function / sigmoid gate:  (0.73) * (1 − 0.73) = 0.2

0.40
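These slides use the fact that the sigmoid's derivative can be written in terms of its output: dσ(x)/dx = (1 − σ(x)) · σ(x). With the forward-pass output 0.73, the local gradient is (1 − 0.73) · 0.73 ≈ 0.2, and multiplying by the gradient flowing from above (1.0 at the output) gives the 0.2 that then propagates into the rest of the expression.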

slide-62
SLIDE 62

Patterns in backward flow

  • add gate: gradient distributor
  • max gate: gradient router
  • mul gate: gradient… “switcher”?
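A minimal sketch of why these patterns hold, for scalar inputs (the helper functions below are illustrative, not from the slides): the local derivatives of each gate determine how the upstream gradient dz is passed back.

# add gate: z = x + y, so dz/dx = dz/dy = 1 and the upstream gradient
# is distributed unchanged to both inputs
def add_backward(dz):
    return dz, dz

# max gate: z = max(x, y), so the gradient is routed only to the larger input
def max_backward(x, y, dz):
    return (dz, 0.0) if x >= y else (0.0, dz)

# mul gate: z = x * y, so each input receives the upstream gradient
# scaled ("switched") by the other input's value
def mul_backward(x, y, dz):
    return y * dz, x * dz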

62

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-63
SLIDE 63

63

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Gradients add at branches

+
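In other words, when a value x feeds several downstream nodes z_1, …, z_k, the multivariate chain rule sums the contributions coming back along each branch: ∂L/∂x = Σ_i (∂z_i/∂x) · (∂L/∂z_i).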

slide-64
SLIDE 64

Implementation: forward/backward API

64

Graph (or Net) object. (Rough pseudo code)
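The pseudocode on this slide is not reproduced in this extraction; a rough sketch of what such a Graph/Net object can look like (all names, and the assumption that gates expose forward(), backward(), and output, are illustrative):

class Net(object):
    def __init__(self, gates):
        self.gates = gates                 # gate objects already in topological order

    def forward(self):
        for gate in self.gates:            # run each gate once its inputs are ready
            gate.forward()
        return self.gates[-1].output       # the final gate outputs the loss

    def backward(self):
        for gate in reversed(self.gates):  # visit gates in reverse order
            gate.backward()                # each gate applies the chain rule locally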

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-65
SLIDE 65

Implementation: forward/backward API

65

(x,y,z are scalars)

* x y z

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-66
SLIDE 66

Implementation: forward/backward API

66

(x,y,z are scalars)

* x y z

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
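A minimal runnable sketch of the forward/backward API for a single multiply gate with scalar inputs (class and variable names are assumptions in the spirit of the slide):

class MultiplyGate(object):
    def forward(self, x, y):
        z = x * y
        self.x = x        # cache the inputs: they are needed in the backward pass
        self.y = y
        return z

    def backward(self, dz):
        dx = self.y * dz  # [dz/dx] * [gradient on z from above]
        dy = self.x * dz  # [dz/dy] * [gradient on z from above]
        return dx, dy

gate = MultiplyGate()
z = gate.forward(-2.0, 5.0)   # forward pass: z = x * y = -10.0
dx, dy = gate.backward(1.0)   # backward pass with upstream gradient 1.0: dx = 5.0, dy = -2.0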
slide-67
SLIDE 67

Summary

  • neural nets will be very large: no hope of writing down the gradient formula by hand for all parameters
  • backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
  • implementations maintain a graph structure, where the nodes implement the forward() / backward() API
  • forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
  • backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs

67

slide-68
SLIDE 68

Where are we now…

68

Mini-batch SGD

Loop:
  1. Sample a batch of data
  2. Forward prop it through the graph, get loss
  3. Backprop to calculate the gradients
  4. Update the parameters using the gradient

slide-69
SLIDE 69

Next Lecture: 
 
 Introduction to Deep Learning

69