CSC 311: Introduction to Machine Learning
Lecture 2 - Linear Methods for Regression, Optimization
Roger Grosse, Chris Maddison, Juhan Bae, Silviu Pitis
University of Toronto, Fall 2020
◮ Task: predict scalar-valued targets (e.g. stock prices)
◮ Architecture: linear function of the inputs
Linear regression exemplifies the modular approach we will use throughout the course:
◮ choose a model describing the relationships between variables of interest
◮ define a loss function quantifying how bad the fit to the data is
◮ choose a regularizer saying how much we prefer different candidate models (or explanations of the data)
◮ fit a model that minimizes the loss function and satisfies the constraint/penalty imposed by the regularizer, possibly using optimization
The model is a linear function of the inputs: y = w⊤x + b, where
◮ y is the prediction
◮ w is the weight vector
◮ b is the bias (or intercept)
[Figure: a line fitted to the training data; x: features on the horizontal axis, y: response on the vertical axis]
[Figure: left, the raw data alone; right, the fitted line through the same data (x: features, y: response)]
Squared error loss: L(y, t) = ½(y − t)²
Cost function: the loss averaged over all N training examples,
J(w, b) = 1/(2N) ∑_{i=1}^{N} (y^(i) − t^(i))²
Expanding the model inside the cost, for each example i:
y^(i) = ∑_j wj xj^(i) + b
J(w, b) = 1/(2N) ∑_{i=1}^{N} (∑_j wj xj^(i) + b − t^(i))²
Benefits of vectorizing the computation:
◮ Cut down on Python interpreter overhead
◮ Use highly optimized linear algebra libraries (hardware support)
◮ Matrix multiplication is very fast on a GPU (Graphics Processing Unit)
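As an illustration (a minimal NumPy sketch, not code from the lecture), here are loop-based and vectorized versions of the cost computation; X is the N × D data matrix, w the weights, b the bias, and t the targets:

    import numpy as np

    def cost_loop(w, b, X, t):
        # Naive version: one Python-level iteration per training example.
        N = X.shape[0]
        total = 0.0
        for i in range(N):
            y_i = np.dot(w, X[i]) + b   # prediction for example i
            total += (y_i - t[i]) ** 2
        return total / (2 * N)

    def cost_vectorized(w, b, X, t):
        # One matrix-vector product computes all N predictions at once.
        y = X @ w + b
        return np.mean((y - t) ** 2) / 2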
◮ to show z∗ minimizes f(z), show that ∀z, f(z) ≥ f(z∗)
◮ to show that a = b, show that a ≥ b and b ≥ a
◮ multivariate generalization: set the partial derivatives to zero (or equivalently, set the gradient to zero)
◮ Consider any z = Xw
◮ By the Pythagorean theorem and the orthogonality of t − z∗ to the range of X: ‖t − z‖² = ‖t − z∗‖² + ‖z∗ − z‖² ≥ ‖t − z∗‖², so z∗ = Xw∗ is the point in the range of X closest to t
Computing the partial derivatives directly:
∂y/∂wj = ∂/∂wj (∑_{j′} w_{j′} x_{j′} + b) = xj
∂J/∂wj = (1/N) ∑_{i=1}^{N} (y^(i) − t^(i)) xj^(i)
∂J/∂b = (1/N) ∑_{i=1}^{N} (y^(i) − t^(i))
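Setting these partial derivatives to zero yields the normal equations X⊤Xw = X⊤t, whose solution is the direct formula quoted later, w = (X⊤X)−1X⊤t. A minimal NumPy sketch (assuming a column of ones has been appended to X so the bias is absorbed into w):

    import numpy as np

    def fit_direct(X, t):
        # Solve the normal equations X^T X w = X^T t.
        # np.linalg.solve is numerically preferable to forming the inverse.
        return np.linalg.solve(X.T @ X, X.T @ t)

    # To include the bias, append a column of ones to X first, e.g.:
    # X1 = np.hstack([X, np.ones((X.shape[0], 1))]); w = fit_direct(X1, t)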
The Hessian of f is the matrix of second partial derivatives, with (i, j) entry ∂²f/∂wi∂wj.
◮ Recall from multivariable calculus that, for continuously differentiable f, ∂²f/∂wi∂wj = ∂²f/∂wj∂wi, so the Hessian is symmetric.
◮ Recall from linear algebra that the eigenvalues of a symmetric matrix are real.
◮ If all of the eigenvalues are positive, we say the Hessian is positive definite.
◮ A critical point (∇f(w) = 0) of a continuously differentiable function is a local minimum if the Hessian there is positive definite.
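For the linear regression cost J(w) = 1/(2N) ‖Xw − t‖², the Hessian is H = X⊤X/N, which is always positive semidefinite. A quick numerical check (a sketch, not lecture code):

    import numpy as np

    def hessian_eigenvalues(X):
        # Hessian of J(w) = 1/(2N) ||Xw - t||^2 is H = X^T X / N (constant in w).
        N = X.shape[0]
        H = X.T @ X / N
        # eigvalsh is for symmetric matrices; eigenvalues come back real, ascending.
        return np.linalg.eigvalsh(H)

    # All eigenvalues are >= 0; if they are strictly positive, the Hessian is
    # positive definite and the critical point is the unique minimum.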
Feature maps let linear regression fit nonlinear functions: map the input through a fixed feature map ψ(x) and fit y = w⊤ψ(x); for example, polynomial regression uses ψ(x) = (1, x, x², …, x^M).
[Figure: polynomial fits of degree M = 0, M = 3, and M = 9 to the same dataset; x on the horizontal axis, t on the vertical axis]
[Figure: the degree M = 9 polynomial fit (x vs. t)]
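A sketch of the degree-M polynomial fit as linear regression on the feature map ψ(x) = (1, x, x², …, x^M); the function names here are illustrative:

    import numpy as np

    def polynomial_features(x, M):
        # Map each scalar input to (1, x, x^2, ..., x^M).
        return np.vander(x, M + 1, increasing=True)

    def fit_polynomial(x, t, M):
        Psi = polynomial_features(x, M)
        # np.linalg.lstsq handles the ill-conditioned systems that arise
        # for large M more gracefully than the normal equations.
        w, *_ = np.linalg.lstsq(Psi, t, rcond=None)
        return w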
◮ Regularizer: a function that quantifies how much we prefer one hypothesis over another
◮ Note: To be precise, the L2 norm is Euclidean distance, so we're really regularizing the squared Euclidean norm of the weight vector.
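With this regularizer, the penalized cost 1/(2N) ‖Xw − t‖² + (λ/2)‖w‖² still has a closed-form minimizer. A minimal sketch; note that the factor of N on λ follows from averaging the data term (conventions differ), and in practice the bias column is usually left unpenalized:

    import numpy as np

    def fit_ridge(X, t, lam):
        # Minimize 1/(2N) ||Xw - t||^2 + (lam/2) ||w||^2.
        # Averaging the data term puts a factor of N on lam in the normal equations.
        N, D = X.shape
        A = X.T @ X + lam * N * np.eye(D)
        return np.linalg.solve(A, X.T @ t)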
Two ways to fit the model:
◮ direct solution (set derivatives to zero)
◮ gradient descent (next topic)
◮ if ∂J/∂wj > 0, then increasing wj increases J
◮ if ∂J/∂wj < 0, then increasing wj decreases J
◮ We'll see later how to tune the learning rate, but values are typically small, e.g. 0.01 or 0.0001
◮ If cost is the sum of N individual losses rather than their average, the gradient is N times larger, so the learning rate must be scaled down accordingly
◮ The gradient ∇J points in the direction of fastest increase in J, so gradient descent steps in the opposite direction: w ← w − α∇J(w)
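Putting the update rule together for linear regression (a minimal sketch; the learning rate and iteration count are illustrative, not values from the lecture):

    import numpy as np

    def gradient_descent(X, t, alpha=0.01, num_iters=1000):
        N, D = X.shape
        w, b = np.zeros(D), 0.0
        for _ in range(num_iters):
            err = X @ w + b - t            # y - t for all N examples
            w = w - alpha * X.T @ err / N  # dJ/dw = (1/N) X^T (y - t)
            b = b - alpha * err.mean()     # dJ/db = (1/N) sum(y - t)
        return w, b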
◮ GD can be applied to a much broader set of models
◮ GD can be easier to implement than direct solutions
◮ For regression in high-dimensional spaces, GD is more efficient than the direct solution:
◮ Linear regression solution: (X⊤X)−1X⊤t
◮ Matrix inversion is an O(D³) algorithm
◮ Each GD update costs O(ND)
◮ Or less with stochastic GD (SGD, in a few slides)
◮ Huge difference if D ≫ 1
The cost function is an average of per-example losses:
J(w) = 1/N ∑_{i=1}^{N} L^(i)(w)
Stochastic gradient descent (SGD): update the parameters using the gradient for a single randomly chosen training example, rather than the full average:
w ← w − α ∂L^(i)/∂w
Problems with updating on a single example at a time:
◮ Variance in the estimate may be high
◮ We can't exploit efficient vectorized operations
◮ Compromise: compute the gradients on a randomly chosen medium-sized set of training examples M ⊂ {1, …, N}, called a mini-batch
The mini-batch size |M| is a hyperparameter that needs to be set:
◮ Too large: requires more compute; e.g., it takes more memory to store the batch, and longer to compute each gradient update
◮ Too small: can't exploit vectorization, has high variance
◮ A reasonable value might be |M| = 100.
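A minimal mini-batch SGD sketch for the same model (batch size, learning rate, and epoch count are illustrative):

    import numpy as np

    def sgd(X, t, alpha=0.01, batch_size=100, num_epochs=10, seed=0):
        rng = np.random.default_rng(seed)
        N, D = X.shape
        w, b = np.zeros(D), 0.0
        for _ in range(num_epochs):
            perm = rng.permutation(N)              # reshuffle every epoch
            for start in range(0, N, batch_size):
                idx = perm[start:start + batch_size]
                err = X[idx] @ w + b - t[idx]      # mini-batch residuals
                w = w - alpha * X[idx].T @ err / len(idx)
                b = b - alpha * err.mean()
        return w, b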
◮ Use a large learning rate early in training so you can get close to the optimum
◮ Gradually decay the learning rate to reduce the fluctuations
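One common choice is a 1/t-style schedule (the specific form and constants here are assumptions, not from the lecture):

    def decayed_learning_rate(alpha0, k, t):
        # Large steps early in training, smaller steps as iteration t grows.
        return alpha0 / (1.0 + k * t)

    # e.g. alpha0 = 0.1, k = 0.01, with t the iteration number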
[Figure: a function with several labeled critical points: a global minimum, a local minimum, and a local maximum]
— Bishop, Pattern Recognition and Machine Learning
◮ If the objective is convex, the true solution can be found using gradient descent.
◮ If the objective is non-convex, we can only find an approximate solution (e.g. the best in its neighborhood).
◮ choose a model describing the relationships between variables of interest
◮ define a loss function quantifying how bad the fit to the data is
◮ choose a regularizer to control the model complexity/overfitting
◮ fit/optimize the model (gradient descent, stochastic gradient descent)