
CSC 311: Introduction to Machine Learning, Lecture 2 - Linear Methods



  1. CSC 311: Introduction to Machine Learning
     Lecture 2: Linear Methods for Regression, Optimization
     Roger Grosse, Chris Maddison, Juhan Bae, Silviu Pitis
     University of Toronto, Fall 2020

  2. Announcements
     Homework 1 is posted! Deadline: Sept 30, 23:59.
     Instructor office hours are announced on the course website. (TA office hours TBA)
     No ProctorU!

  3. Overview
     Second learning algorithm of the course: linear regression.
     ◮ Task: predict scalar-valued targets (e.g. stock prices)
     ◮ Architecture: linear function of the inputs
     While KNN was a complete algorithm, linear regression exemplifies a modular approach that will be used throughout this course:
     ◮ choose a model describing the relationships between variables of interest
     ◮ define a loss function quantifying how bad the fit to the data is
     ◮ choose a regularizer saying how much we prefer different candidate models (or explanations of data)
     ◮ fit a model that minimizes the loss function and satisfies the constraint/penalty imposed by the regularizer, possibly using an optimization algorithm
     Mixing and matching these modular components gives us a lot of new ML methods.

  4. Supervised Learning Setup
     In supervised learning:
     There is input x ∈ X, typically a vector of features (or covariates)
     There is target t ∈ T (also called response, outcome, output, class)
     Objective is to learn a function f : X → T such that t ≈ y = f(x), based on some data D = {(x^(i), t^(i)) for i = 1, 2, ..., N}.

  5. Linear Regression - Model
     Model: In linear regression, we use a linear function of the features x = (x_1, ..., x_D) ∈ R^D to make predictions y of the target value t ∈ R:
         y = f(x) = Σ_j w_j x_j + b
     ◮ y is the prediction
     ◮ w are the weights
     ◮ b is the bias (or intercept)
     w and b together are the parameters.
     We hope that our prediction is close to the target: y ≈ t.

  6. What is Linear? 1 feature vs D features
     [Figure: a fitted line through the data points; x: features, y: response]
     If we have only 1 feature: y = wx + b where w, x, b ∈ R. y is linear in x.
     If we have D features: y = w⊤x + b where w, x ∈ R^D, b ∈ R. y is linear in x.
     The relation between the prediction y and inputs x is linear in both cases.

  7. Linear Regression
     We have a dataset D = {(x^(i), t^(i)) for i = 1, 2, ..., N} where:
     x^(i) = (x_1^(i), x_2^(i), ..., x_D^(i))⊤ ∈ R^D are the inputs (e.g. age, height)
     t^(i) ∈ R is the target or response (e.g. income)
     We predict t^(i) with a linear function of x^(i):
         t^(i) ≈ y^(i) = w⊤x^(i) + b
     [Figures: data points alone, and a fitted line through the data; x: features, y: response]
     Different (w, b) define different lines. We want the "best" line (w, b). How to quantify "best"?

  8. Linear Regression - Loss Function
     A loss function L(y, t) defines how bad it is if, for some example x, the algorithm predicts y, but the target is actually t.
     Squared error loss function:
         L(y, t) = ½ (y − t)²
     y − t is the residual, and we want to make this small in magnitude.
     The ½ factor is just to make the calculations convenient.
     Cost function: loss function averaged over all training examples:
         J(w, b) = (1/2N) Σ_{i=1}^N (y^(i) − t^(i))²
                 = (1/2N) Σ_{i=1}^N (w⊤x^(i) + b − t^(i))²
     Terminology varies. Some call "cost" the empirical or average loss.
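     To make the cost concrete, here is a minimal NumPy sketch (not from the slides; the function and variable names are illustrative) that evaluates J(w, b) with an explicit loop over the training examples:

         import numpy as np

         def squared_error_loss(y, t):
             # L(y, t) = 1/2 (y - t)^2 for a single prediction
             return 0.5 * (y - t) ** 2

         def cost(w, b, X, t):
             # Average the per-example losses: J(w, b) = (1/2N) sum_i (y^(i) - t^(i))^2
             # X is an N x D array of inputs, t is a length-N array of targets.
             N = X.shape[0]
             total = 0.0
             for i in range(N):
                 y_i = np.dot(w, X[i]) + b      # prediction for example i
                 total += squared_error_loss(y_i, t[i])
             return total / N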

  9. Vectorization
     Notation-wise, J(w, b) = (1/2N) Σ_{i=1}^N (y^(i) − t^(i))² gets messy if we expand y^(i):
         (1/2N) Σ_{i=1}^N ( Σ_{j=1}^D w_j x_j^(i) + b − t^(i) )²
     The code equivalent is to compute the prediction using a for loop (see the sketch below).
     Excessive super/sub scripts are hard to work with, and Python loops are slow, so we vectorize algorithms by expressing them in terms of vectors and matrices:
         w = (w_1, ..., w_D)⊤    x = (x_1, ..., x_D)⊤    y = w⊤x + b
     This is simpler and executes much faster.
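     As a minimal sketch of the two versions (illustrative names, not the slide's original code): the prediction computed with a Python for loop, and its vectorized NumPy equivalent.

         import numpy as np

         # Loop version: y = sum_j w_j x_j + b, accumulated one term at a time.
         def predict_loop(w, x, b):
             y = b
             for j in range(len(w)):
                 y += w[j] * x[j]
             return y

         # Vectorized version: y = w^T x + b in a single call.
         def predict_vectorized(w, x, b):
             return np.dot(w, x) + b

         # Illustrative values: both versions agree.
         w = np.array([0.5, -1.0, 2.0])
         x = np.array([1.0, 2.0, 3.0])
         b = 0.1
         assert np.isclose(predict_loop(w, x, b), predict_vectorized(w, x, b))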

  10. Vectorization
     Why vectorize?
     The equations, and the code, will be simpler and more readable. Gets rid of dummy variables/indices!
     Vectorized code is much faster:
     ◮ Cut down on Python interpreter overhead
     ◮ Use highly optimized linear algebra libraries (hardware support)
     ◮ Matrix multiplication very fast on GPU (Graphics Processing Unit)
     Switching in and out of vectorized form is a skill you gain with practice:
     Some derivations are easier to do element-wise
     Some algorithms are easier to write/understand using for-loops and vectorize later for performance
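     To see the speed difference concretely, here is a rough timing sketch (not from the slides; exact numbers depend on your machine) comparing a pure-Python dot product with NumPy's vectorized call:

         import time
         import numpy as np

         D = 1_000_000
         w = np.random.randn(D)
         x = np.random.randn(D)

         # Pure-Python loop over elements.
         start = time.perf_counter()
         y_loop = 0.0
         for j in range(D):
             y_loop += w[j] * x[j]
         loop_time = time.perf_counter() - start

         # Single vectorized call.
         start = time.perf_counter()
         y_vec = np.dot(w, x)
         vec_time = time.perf_counter() - start

         print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.5f}s")
         assert np.isclose(y_loop, y_vec, atol=1e-6)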

  11. Vectorization
     We can organize all the training examples into a design matrix X with one row per training example, and all the targets into the target vector t.
     Computing the predictions for the whole dataset:
         Xw + b1 = ( w⊤x^(1) + b, ..., w⊤x^(N) + b )⊤ = ( y^(1), ..., y^(N) )⊤ = y
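     A minimal sketch (with illustrative values) of the whole-dataset prediction y = Xw + b1 in NumPy:

         import numpy as np

         # Design matrix: one row per training example (N = 3, D = 2).
         X = np.array([[1.0, 2.0],
                       [3.0, 4.0],
                       [5.0, 6.0]])
         w = np.array([0.5, -1.0])
         b = 0.1

         # Broadcasting adds b to every component, playing the role of b * 1.
         y = X @ w + b        # shape (N,): one prediction per example
         print(y)             # [-1.4, -2.4, -3.4]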

  12. Vectorization
     Computing the squared error cost across the whole dataset:
         y = Xw + b1
         J = (1/2N) ‖y − t‖²
     Sometimes we may use J = ½ ‖y − t‖², without a normalizer. This would correspond to the sum of losses, and not the averaged loss. The minimizer does not depend on N (but optimization might!).
     We can also add a column of 1's to the design matrix, combine the bias and the weights, and conveniently write
         X = [ 1 [x^(1)]⊤ ; 1 [x^(2)]⊤ ; ... ] ∈ R^{N×(D+1)}   and   w = (b, w_1, w_2, ...)⊤ ∈ R^{D+1}
     Then, our predictions reduce to y = Xw.
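     A minimal sketch of the vectorized cost and the column-of-ones trick (illustrative values; names are not from the slides):

         import numpy as np

         def cost_vectorized(w, b, X, t):
             # J(w, b) = 1/(2N) * ||Xw + b1 - t||^2
             y = X @ w + b
             return 0.5 * np.mean((y - t) ** 2)

         X = np.array([[1.0, 2.0],
                       [3.0, 4.0],
                       [5.0, 6.0]])
         t = np.array([0.5, 1.0, 1.5])
         w = np.array([0.5, -1.0])
         b = 0.1

         # Bias trick: prepend a column of ones and fold b into the weight vector.
         X_aug = np.column_stack([np.ones(X.shape[0]), X])   # shape (N, D + 1)
         w_aug = np.concatenate([[b], w])                    # shape (D + 1,)

         # The augmented predictions X_aug w_aug equal Xw + b1.
         assert np.allclose(X_aug @ w_aug, X @ w + b)
         print(cost_vectorized(w, b, X, t))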

  13. Solving the Minimization Problem
     We defined a cost function. This is what we'd like to minimize.
     Two commonly applied mathematical approaches:
     Algebraic, e.g., using inequalities:
     ◮ to show z* minimizes f(z), show that ∀z, f(z) ≥ f(z*)
     ◮ to show that a = b, show that a ≥ b and b ≥ a
     Calculus: the minimum of a smooth function (if it exists) occurs at a critical point, i.e. a point where the derivative is zero.
     ◮ multivariate generalization: set the partial derivatives to zero (or equivalently the gradient).
     Solutions may be direct or iterative:
     Sometimes we can directly find provably optimal parameters (e.g. set the gradient to zero and solve in closed form). We call this a direct solution.
     We may also use optimization techniques that iteratively get us closer to the solution. We will get back to this soon.

  14. Direct Solution I: Linear Algebra
     We seek w to minimize ‖Xw − t‖², or equivalently ‖Xw − t‖.
     range(X) = {Xw | w ∈ R^D} is a D-dimensional subspace of R^N.
     Recall that the closest point y* = Xw* in the subspace range(X) of R^N to an arbitrary point t ∈ R^N is found by orthogonal projection. We have
         (y* − t) ⊥ Xw,  ∀w ∈ R^D
     Why is y* the closest point to t?
     ◮ Consider any z = Xw
     ◮ By the Pythagorean theorem and the trivial inequality (x² ≥ 0):
         ‖z − t‖² = ‖y* − t‖² + ‖y* − z‖² ≥ ‖y* − t‖²

  15. Direct Solution I: Linear Algebra
     From the previous slide, we have (y* − t) ⊥ Xw, ∀w ∈ R^D.
     Equivalently, the columns of the design matrix X are all orthogonal to (y* − t), and we have that:
         X⊤(y* − t) = 0
         X⊤Xw* − X⊤t = 0
         X⊤Xw* = X⊤t
         w* = (X⊤X)^(−1) X⊤t
     While this solution is clean and the derivation easy to remember, like many algebraic solutions, it is somewhat ad hoc. On the other hand, the tools of calculus are broadly applicable to differentiable loss functions...
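     A minimal NumPy sketch of the direct solution w* = (X⊤X)^(−1) X⊤t on synthetic data (all values illustrative), using the column-of-ones trick so the bias is included; solving the normal equations with np.linalg.solve is preferable to explicitly inverting X⊤X:

         import numpy as np

         rng = np.random.default_rng(0)
         N, D = 100, 3
         X = rng.normal(size=(N, D))
         t = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=N)

         # Fold the bias into the weights with a column of ones.
         X_aug = np.column_stack([np.ones(N), X])

         # Solve the normal equations X^T X w* = X^T t.
         w_star = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ t)
         print(w_star)    # close to [0.3, 1.0, -2.0, 0.5]

         # The residual y* - t is orthogonal to every column of X_aug,
         # which is exactly the condition X^T (y* - t) = 0 derived above.
         residual = X_aug @ w_star - t
         print(np.max(np.abs(X_aug.T @ residual)))    # ~0 up to floating point error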

  16. Direct Solution II: Calculus
     Partial derivative: derivative of a multivariate function with respect to one of its arguments.
         ∂f(x_1, x_2)/∂x_1 = lim_{h→0} [ f(x_1 + h, x_2) − f(x_1, x_2) ] / h
     To compute, take the single variable derivative, pretending the other arguments are constant.
     Example: partial derivatives of the prediction y
         ∂y/∂w_j = ∂/∂w_j ( Σ_{j'} w_{j'} x_{j'} + b ) = x_j
         ∂y/∂b = ∂/∂b ( Σ_{j'} w_{j'} x_{j'} + b ) = 1

  17. Direct Solution II: Calculus
     For loss derivatives, apply the chain rule:
         ∂L/∂w_j = (dL/dy)(∂y/∂w_j) = d/dy [ ½(y − t)² ] · x_j = (y − t) x_j
         ∂L/∂b = (dL/dy)(∂y/∂b) = y − t
     For cost derivatives, use linearity and average over data points:
         ∂J/∂w_j = (1/N) Σ_{i=1}^N (y^(i) − t^(i)) x_j^(i)
         ∂J/∂b = (1/N) Σ_{i=1}^N (y^(i) − t^(i))
     The minimum must occur at a point where the partial derivatives are zero:
         ∂J/∂w_j = 0 (∀j),   ∂J/∂b = 0.
     (If ∂J/∂w_j ≠ 0, you could reduce the cost by changing w_j.)
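     A minimal sketch of these gradients in vectorized NumPy, with a finite-difference check against the limit definition of the partial derivative (all names and values are illustrative):

         import numpy as np

         def cost(w, b, X, t):
             # J(w, b) = 1/(2N) * ||Xw + b1 - t||^2
             return 0.5 * np.mean((X @ w + b - t) ** 2)

         def gradients(w, b, X, t):
             # dJ/dw = (1/N) X^T (y - t),  dJ/db = (1/N) sum_i (y^(i) - t^(i))
             residual = X @ w + b - t
             return X.T @ residual / len(t), np.mean(residual)

         rng = np.random.default_rng(1)
         X = rng.normal(size=(50, 4))
         t = rng.normal(size=50)
         w = rng.normal(size=4)
         b = 0.2

         grad_w, grad_b = gradients(w, b, X, t)

         # Finite-difference check of dJ/db.
         h = 1e-6
         numeric_b = (cost(w, b + h, X, t) - cost(w, b, X, t)) / h
         assert np.isclose(grad_b, numeric_b, atol=1e-4)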
