

SLIDE 1
CSI5180. Machine Learning for Bioinformatics Applications

Fundamentals of Machine Learning — Gradient Descent

by

Marcel Turcotte

Version November 6, 2019

SLIDE 2

Preamble

SLIDE 3

Preamble

Fundamentals of Machine Learning — Gradient Descent

In this lecture, we focus on an essential building block for most learning algorithms: the optimization algorithm.

General objective:

  • Describe the fundamental concepts of machine learning

SLIDE 4

Learning objectives

  • In your own words, explain the role of the optimization algorithm for solving a linear regression problem.
  • Describe the function of the (partial) derivative in the gradient descent algorithm.
  • Clarify the role of the learning rate, a hyper-parameter.
  • Compare the batch, stochastic, and mini-batch gradient descent algorithms.

Reading:

Largely based on Géron 2019, §4.

SLIDE 5

Plan

  • 1. Preamble
  • 2. Mathematics
  • 3. Problem
  • 4. Building blocks
  • 5. Prologue
SLIDE 6

Gradient Descent - Andrew Ng (1/4)

https://youtu.be/F6GSRDoB-Cg

SLIDE 7

Gradient Descent - Andrew Ng (2/4)

https://youtu.be/YovTqTY-PYY

SLIDE 8

Gradient Descent - Andrew Ng (3/4)

https://youtu.be/66rql7He62g

SLIDE 9

Normal Equation - Andrew Ng (4/4)

https://youtu.be/B-Ks01zR4HY

SLIDE 10

Mathematics

SLIDE 11

3Blue1Brown

Essence of linear algebra

A series of 15 videos (10 to 15 minutes per video) providing “[a] geometric understanding of matrices, determinants, eigen-stuffs and more.”

6,662,732 views as of September 30, 2019.

Essence of calculus

A series of 12 videos (15 to 20 minutes per video): “The goal here is to make calculus feel like something that you yourself could have discovered.”

2,309,726 views as of September 30, 2019.

SLIDE 12

Problem

SLIDE 13

Supervised learning - regression

The data set is a collection of labelled examples.

{(x_i, y_i)}_{i=1}^{N}

Each x_i is a feature vector with D dimensions. x_i^(j) is the value of feature j of example i, for j ∈ 1 … D and i ∈ 1 … N.

The label y_i is a real number.

Problem: given the data set as input, create a “model” that can be used to predict the value of y for an unseen x.

SLIDE 14

QSAR

QSAR stands for Quantitative Structure-Activity Relationship. As a machine learning problem:

  • Each x_i is a chemical compound.
  • y_i is the biological activity of the compound x_i.

Examples of biological activity include toxicology and biodegradability.

[Figure: example compounds paired with measured activity values, e.g. 0.615, 0.125, 1.140, …, 0.941]

SLIDE 15

HIV-1 reverse transcriptase inhibitors

Viira, B., García-Sosa, A. T. & Maran, U. Chemical structure and correlation analysis of HIV-1 NNRT and NRT inhibitors and database-curated, published inhibition constants with chemical structure in diverse datasets. J Mol Graph Model 76:205-223 (2017).

Each compound (example) in ChemDB has features such as the number of atoms, area, solvation, coulombic, molecular weight, XLogP, etc.

A possible solution, a model, would look something like this:

    ŷ = 44.418 − 35.133 × x^(1) − 13.518 × x^(2) + 0.766 × x^(3)

SLIDE 16

Building blocks

SLIDES 17-20

Building blocks

In general, a learning algorithm has the following building blocks.

  • A model, often consisting of a set of weights whose values will be “learnt”.
  • An objective function. In the case of a regression, this is often a loss function, a function that quantifies the prediction error. The Root Mean Square Error (RMSE) is a common loss function for regression problems:

        RMSE = √( (1/N) Σ_{i=1}^{N} [h(x_i) − y_i]² )

  • An optimization algorithm.
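As a concrete illustration (not on the original slides), here is a minimal NumPy sketch of the MSE loss for a linear model; the design matrix X is assumed to carry a leading column of ones so that theta[0] acts as the intercept θ0:

    import numpy as np

    def mse(theta, X, y):
        """Mean squared error of the linear model h(x) = X @ theta.

        X is assumed to be (N, D+1) with a leading column of ones."""
        residuals = X @ theta - y          # h(x_i) - y_i for every example
        return np.mean(residuals ** 2)     # (1/N) * sum of squared residuals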

SLIDE 21

Optimization

Until some termination criterion is met¹:

  • Evaluate the loss function, comparing h(x_i) to y_i.
  • Make small changes to the weights, in a way that reduces the value of the loss function.

⇒ Let’s derive a concrete algorithm called gradient descent.

¹ E.g. the value of the loss function no longer decreases, or a maximum number of iterations is reached.

SLIDES 22-23

Derivative

[Figure: plot of a single-variable function]

The derivative of a real function describes how changes to the input value(s) will affect the output value. We focus on a single (input) variable function for now.

SLIDE 24

Derivative

[Figure: a line tangent to the graph of the function at a point]

When evaluated at a single point, the derivative of a single variable function can be seen as a line tangent to the graph of the function.

SLIDE 25

Derivative

[Figure: tangent line with positive slope]

When the slope of the tangent line is positive (when the derivative is positive), this means that increasing the value of the input variable will increase the value of the output. Furthermore, the magnitude of the derivative indicates how fast or slow the output will change.

SLIDE 26

Derivative

[Figure: tangent line with negative slope]

When the slope of the tangent line is negative (when the derivative is negative), this means that increasing the value of the input variable will decrease the value of the output. Furthermore, the magnitude of the derivative indicates how fast or slow the output will change.

SLIDES 27-29

Recall

A linear model assumes that the value of the label, ŷ_i, can be expressed as a linear combination of the feature values, x_i^(j):

    ŷ_i = h(x_i) = θ0 + θ1 x_i^(1) + θ2 x_i^(2) + … + θD x_i^(D)

The Root Mean Square Error (RMSE) is a common loss function for regression problems:

    RMSE = √( (1/N) Σ_{i=1}^{N} [h(x_i) − y_i]² )

In practice, minimizing the Mean Squared Error (MSE) is easier and gives the same result, since the square root is monotonically increasing:

    MSE = (1/N) Σ_{i=1}^{N} [h(x_i) − y_i]²

SLIDES 30-32

Gradient descent - single value

Our model: h(x_i) = θ0 + θ1 x_i^(1)

Our loss function: J(θ0, θ1) = (1/N) Σ_{i=1}^{N} [h(x_i) − y_i]²

Problem: find the values of θ0 and θ1 that minimize J.

SLIDES 33-38

Gradient descent - single value

Gradient descent:

  • Initialization: θ0 and θ1 - either with random values or zeros.
  • Loop: repeat until convergence: { θj := θj − α (∂/∂θj) J(θ0, θ1), for j = 0 and j = 1 }

α is called the learning rate - this is the size of each step.

(∂/∂θj) J(θ0, θ1) is the partial derivative of J with respect to θj.

For the algorithm to be mathematically sound, all the θj must be updated simultaneously.
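As an illustration (not from the original slides), a minimal NumPy sketch of this loop; the synthetic data, α = 0.01, and the 1,000-iteration cap are assumptions chosen for the example:

    import numpy as np

    # Synthetic data: y ≈ 3 + 2x plus noise (illustrative assumption).
    rng = np.random.default_rng(0)
    x = rng.uniform(-10, 10, size=100)
    y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=100)

    theta0, theta1 = 0.0, 0.0   # initialization with zeros
    alpha = 0.01                # learning rate
    N = len(x)

    for _ in range(1000):       # fixed iteration count as the termination criterion
        error = theta0 + theta1 * x - y
        grad0 = (2.0 / N) * np.sum(error)      # dJ/d(theta0)
        grad1 = (2.0 / N) * np.sum(x * error)  # dJ/d(theta1)
        # Simultaneous update of both weights:
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

    print(theta0, theta1)  # should approach 3.0 and 2.0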

SLIDE 39

Partial derivatives

Given

    J(θ0, θ1) = (1/N) Σ_{i=1}^{N} [h(x_i) − y_i]² = (1/N) Σ_{i=1}^{N} [θ0 + θ1 x_i − y_i]²

we have

    ∂J(θ0, θ1)/∂θ0 = (2/N) Σ_{i=1}^{N} (θ0 + θ1 x_i − y_i)

and

    ∂J(θ0, θ1)/∂θ1 = (2/N) Σ_{i=1}^{N} x_i (θ0 + θ1 x_i − y_i)
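A quick numerical sanity check of these partial derivatives (not on the original slides) using central finite differences; the data and the point (θ0, θ1) are arbitrary assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-1, 1, size=10)
    y = rng.uniform(-1, 1, size=10)

    def J(theta0, theta1):
        return np.mean((theta0 + theta1 * x - y) ** 2)

    theta0, theta1, eps = 0.5, -0.3, 1e-6
    analytic = (2 / len(x)) * np.sum(theta0 + theta1 * x - y)   # dJ/d(theta0) above
    numeric = (J(theta0 + eps, theta1) - J(theta0 - eps, theta1)) / (2 * eps)
    print(analytic, numeric)  # the two values should agree closely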

SLIDE 40

Multivariate linear regression

    h(x_i) = θ0 + θ1 x_i^(1) + θ2 x_i^(2) + θ3 x_i^(3) + · · · + θD x_i^(D)

x_i^(j) = value of the feature j in the ith example
D = the number of features

SLIDE 41

Gradient descent - multivariate

The new loss function is

    J(θ0, θ1, …, θD) = (1/N) Σ_{i=1}^{N} (h(x_i) − y_i)²

Its partial derivative:

    ∂J(θ)/∂θj = (2/N) Σ_{i=1}^{N} x_i^(j) (θ · x_i − y_i)

where θ and x_i are vectors, and θ · x_i is a vector operation!

SLIDES 42-43

Gradient vector

The vector containing the partial derivatives of J (with respect to θj, for j ∈ {0, 1, …, D}) is called the gradient vector:

    ∇θJ(θ) = [ ∂J(θ)/∂θ0, ∂J(θ)/∂θ1, …, ∂J(θ)/∂θD ]ᵀ

This vector gives the direction of the steepest ascent. It gives its name to the gradient descent algorithm:

    θ′ = θ − α ∇θJ(θ)

SLIDE 44

Gradient descent - multivariate

The gradient descent algorithm becomes:

repeat until convergence: {
    θj := θj − α (∂/∂θj) J(θ0, θ1, …, θD), for j ∈ [0, …, D] (update simultaneously)
}

SLIDE 45

Gradient descent - multivariate

The gradient descent algorithm becomes:

repeat until convergence: {
    θ0 := θ0 − α (2/N) Σ_{i=1}^{N} x_i^(0) (h(x_i) − y_i)
    θ1 := θ1 − α (2/N) Σ_{i=1}^{N} x_i^(1) (h(x_i) − y_i)
    θ2 := θ2 − α (2/N) Σ_{i=1}^{N} x_i^(2) (h(x_i) − y_i)
    · · ·
}

(Here x_i^(0) = 1 by convention, so the first update is the one for the intercept θ0.)
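In vector form, the whole loop above is a single matrix expression. A minimal sketch (not from the original slides), assuming X is the (N, D+1) design matrix with a leading column of ones:

    import numpy as np

    def batch_gradient_descent(X, y, alpha=0.1, n_iterations=1000):
        """alpha and n_iterations are illustrative defaults, not values
        from the slides."""
        N, d = X.shape
        theta = np.zeros(d)                       # initialization with zeros
        for _ in range(n_iterations):
            gradient = (2.0 / N) * X.T @ (X @ theta - y)
            theta = theta - alpha * gradient      # simultaneous update of all theta_j
        return theta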

SLIDE 46

Assumptions

What were our assumptions?

The (objective/loss) function is differentiable.

SLIDES 47-51

Local vs. global

  • A function is convex if, for any pair of points on the graph of the function, the line connecting these two points lies above or on the graph².
  • A convex function has a single minimum.
  • The loss function for linear regression (MSE) is convex.
  • For functions that are not convex, the gradient descent algorithm converges to a local minimum.
  • The loss functions generally used with linear or logistic regression, and with Support Vector Machines (SVM), are convex, but not the ones for artificial neural networks.

² It would be convex downward, or concave, if those lines were below or on the graph of the function.

SLIDE 52

Local vs. global

Source: https://commons.wikimedia.org/wiki/File:Extrema_example.svg

SLIDE 53

About the learning rate

[Figure: gradient descent steps along the graph of the loss function]

Small steps, low values for α, will make the algorithm converge slowly. Large steps might cause the algorithm to diverge. Notice how the algorithm slows down naturally when approaching a minimum.
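A tiny numeric illustration of divergence (an assumption for illustration, not from the slides): for J(θ) = θ², the update θ := θ − α · 2θ = (1 − 2α)θ shrinks |θ| only when α < 1.

    # For J(theta) = theta**2, each step multiplies theta by (1 - 2 * alpha).
    theta = 1.0
    alpha = 1.1                      # too large: |1 - 2 * alpha| = 1.2 > 1
    for _ in range(5):
        theta = theta - alpha * 2 * theta
        print(theta)                 # magnitude grows by 1.2x per step: divergence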

SLIDE 54

Batch gradient descent

To be more precise, this algorithm is known as batch gradient descent since, at each iteration, it processes the “whole batch” of training examples. The literature suggests that the algorithm might take more time to converge if the features are on different scales.

SLIDES 55-56

Batch gradient descent - drawback

The batch gradient descent algorithm becomes very slow as the number of training examples increases. This is because all the training data is seen at each iteration. The algorithm is generally run for a fixed number of iterations, say 1,000.

SLIDES 57-62

Stochastic Gradient Descent

The stochastic gradient descent algorithm randomly selects one training instance to calculate its gradient.

    import numpy as np

    epochs = 10
    for epoch in range(epochs):
        for i in range(N):                    # N is the number of training examples
            selection = np.random.randint(N)  # pick one training instance at random
            # Calculate the gradient using the selected example
            # Update the weights

  • This allows it to work with large training sets.
  • Its trajectory is not as regular as that of the batch algorithm.
  • Because of its bumpy trajectory, it is often better at finding the global minimum, when compared to batch.
  • Its bumpy trajectory makes it bounce around the local minima.
  • A way around this is to decrease the learning rate as the number of epochs increases - this is called a learning schedule. A fuller sketch follows below.
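A fuller, runnable sketch of SGD with a simple learning schedule (not from the original slides); the schedule t0/(t + t1) and all hyper-parameter values are illustrative assumptions in the style of Géron 2019:

    import numpy as np

    def sgd(X, y, epochs=50, t0=5.0, t1=50.0):
        """X is assumed to be (N, D+1) with a leading column of ones."""
        N, d = X.shape
        theta = np.zeros(d)
        rng = np.random.default_rng(42)
        for epoch in range(epochs):
            for i in range(N):
                selection = rng.integers(N)              # one random training instance
                xi, yi = X[selection], y[selection]
                gradient = 2.0 * xi * (xi @ theta - yi)  # gradient on a single example
                eta = t0 / (epoch * N + i + t1)          # learning rate decays over time
                theta = theta - eta * gradient
        return theta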

SLIDE 63

Stochastic Gradient Descent (SGD)

It is important that the examples are either selected randomly or shuffled before running the algorithm, to make sure that the algorithm converges towards the global minimum.

SLIDES 64-67

Mini-batch gradient descent

At each step, rather than selecting one training example, as SGD does, mini-batch gradient descent randomly selects a small number of training examples to compute the gradients.

  • Its trajectory is more regular, compared to SGD.
  • As the size of the mini-batches increases, the algorithm becomes more and more similar to batch gradient descent, which uses all the examples at each step.
  • It can take advantage of the hardware acceleration of matrix operations, in particular on GPUs.
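A minimal sketch of the mini-batch variant (not from the original slides); batch_size, alpha, and epochs are illustrative assumptions:

    import numpy as np

    def minibatch_gd(X, y, batch_size=32, alpha=0.01, epochs=100):
        """X is assumed to be (N, D+1) with a leading column of ones."""
        N, d = X.shape
        theta = np.zeros(d)
        rng = np.random.default_rng(0)
        for _ in range(epochs):
            indices = rng.permutation(N)             # shuffle once per epoch
            for start in range(0, N, batch_size):
                batch = indices[start:start + batch_size]
                Xb, yb = X[batch], y[batch]
                gradient = (2.0 / len(batch)) * Xb.T @ (Xb @ theta - yb)
                theta = theta - alpha * gradient
        return theta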

SLIDE 68

Batch, stochastic, and mini-batch

Source: Géron 2019, Figure 4.11

SLIDES 69-73

Normal Equation

Briefly…

  • For some loss functions, a closed-form solution exists, i.e. the problem can be solved analytically.
  • This is the case for a quadratic function, such as the mean squared error (MSE).
  • However, this involves computing an inverse matrix, which in turn involves computing the singular value decomposition (SVD) of the matrix.
  • Such algorithms have a computational time complexity between O(D^2.4) and O(D^3), where D is the number of features.
  • However, these algorithms are linear with respect to the number of examples, N.
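A sketch of the closed-form solution for the MSE, θ = (XᵀX)⁻¹Xᵀy (the normal equation); using NumPy's SVD-based pseudoinverse rather than an explicit inverse is an implementation choice assumed here for numerical stability:

    import numpy as np

    def normal_equation(X, y):
        """Least-squares weights in closed form; np.linalg.pinv computes an
        SVD-based pseudoinverse, more stable than inverting X.T @ X."""
        return np.linalg.pinv(X) @ y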

SLIDE 74

Summary

  • Normal Equation is very slow when the number of features is large, say 100,000. However, the algorithm scales linearly with the number of examples.
  • Batch gradient descent is slow and cannot be run on large data sets where out-of-core support is needed, but it can work with a large number of features.
  • Stochastic gradient descent is fast and can handle a large number of examples.
  • Mini-batch gradient descent is fast, can handle a large number of examples, and takes advantage of hardware acceleration.

All three gradient descent variants are implemented by SGDRegressor in Scikit-Learn.
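A minimal usage sketch of SGDRegressor (not from the original slides); the synthetic data and hyper-parameter values are illustrative assumptions:

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-10, 10, size=(100, 1))
    y = 3.0 + 2.0 * X[:, 0] + rng.normal(scale=1.0, size=100)

    reg = SGDRegressor(max_iter=1000, eta0=0.01, random_state=0)
    reg.fit(X, y)
    print(reg.intercept_, reg.coef_)  # should approach 3.0 and [2.0]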

SLIDE 75

Optimization and deep nets

We will briefly revisit the subject when talking about deep artificial neural networks for which specialized optimization algorithms exist.

  • Momentum Optimization
  • Nesterov Accelerated Gradient
  • AdaGrad
  • RMSProp
  • Adam and Nadam

SLIDE 76

Final word

Optimization is a vast subject. Other algorithms exist and are used in other contexts, including particle swarm optimization (PSO), genetic algorithms (GAs), and artificial bee colony (ABC) algorithms.

SLIDES 77-80

Linear regression - summary

  • A linear model assumes that the value of the label, ŷ_i, can be expressed as a linear combination of the feature values, x_i^(j):

        ŷ_i = h(x_i) = θ0 + θ1 x_i^(1) + θ2 x_i^(2) + … + θD x_i^(D)

  • The Mean Squared Error (MSE) is a common loss function for regression problems:

        MSE = (1/N) Σ_{i=1}^{N} [h(x_i) − y_i]²

  • Batch, stochastic, or mini-batch gradient descent can be used to find “optimal” values for the weights, θj for j ∈ {0, 1, …, D}.
  • The result is a regressor: a function that can be used to predict the y value (the label) for some unseen example x.

SLIDE 81

TO DO 2020

Consider saying a few words about autodiff - see Géron 2019, §D.

SLIDE 82

Prologue

SLIDES 83-87

Summary

  • An optimization algorithm is used to find “optimal” values for the parameters of the linear model, so as to minimize the value of the loss function.
  • The gradient of the loss function plays a central role in the gradient descent algorithm. For each feature weight, it informs about the sign and magnitude of the required change.
  • The learning rate controls how fast or slow the algorithm learns. The algorithm might diverge if the learning rate is too high.
  • Batch gradient descent has a smooth trajectory, but becomes very slow when the number of examples is large.
  • Stochastic and mini-batch gradient descent are good alternatives that can handle large numbers of training examples.

SLIDE 88

Next module

Feature engineering, data imputation, dimensionality reduction.

SLIDE 89

References

Aurélien Géron. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, 2nd edition, 2019.

Andriy Burkov. The Hundred-Page Machine Learning Book. Andriy Burkov, 2019.

SLIDE 90

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca School of Electrical Engineering and Computer Science (EECS) University of Ottawa