- 5. Summary of linear regression so far
Main points
◮ Model/function/predictor class of linear regressors $x \mapsto w^\top x$.
◮ ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.
◮ ERM solution for least squares: pick $w$ satisfying $A^\top A w = A^\top b$, which need not be unique; one unique choice is the ordinary least squares solution $A^+ b$.
18 / 94
Part 2 of linear regression lecture. . .
Recap on SVD. (A messy slide, I’m sorry.)
Suppose $0 \neq M \in \mathbb{R}^{n \times d}$, thus $r := \mathrm{rank}(M) > 0$.
◮ "Decomposition form" thin SVD: $M = \sum_{i=1}^r s_i u_i v_i^\top$ with $s_1 \geq \cdots \geq s_r > 0$, and $M^+ = \sum_{i=1}^r \frac{1}{s_i} v_i u_i^\top$; in general $M^+ M = \sum_{i=1}^r v_i v_i^\top \neq I$.
◮ "Factorization form" thin SVD: $M = U S V^\top$ with $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{d \times r}$ having orthonormal columns (so $U^\top U = V^\top V = I_r$, but $U U^\top$ and $V V^\top$ are not identity matrices in general), and $S = \mathrm{diag}(s_1, \ldots, s_r) \in \mathbb{R}^{r \times r}$ with $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = V S^{-1} U^\top$, and in general $M^+ M \neq I$ and $M M^+ \neq I$.
◮ Full SVD: $M = U_f S_f V_f^\top$ with $U_f \in \mathbb{R}^{n \times n}$ and $V_f \in \mathbb{R}^{d \times d}$ orthonormal and full rank, so $U_f^\top U_f$ and $V_f^\top V_f$ are identity matrices, and $S_f \in \mathbb{R}^{n \times d}$ is zero everywhere except the first $r$ diagonal entries, which are $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = V_f S_f^+ U_f^\top$, where $S_f^+$ is obtained by transposing $S_f$ and then inverting its nonzero entries, and in general $M^+ M \neq I$ and $M M^+ \neq I$.
Additional property: agreement with the eigendecompositions of $M M^\top$ and $M^\top M$. The "full SVD" adds columns to $U$ and $V$ which hit zeros of $S$ and therefore don't matter (as a sanity check, verify for yourself that all these SVDs are equal).
19 / 94
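To make the recap concrete, here is a small numpy sketch (mine, not part of the slides) that checks the thin-SVD facts above on an arbitrary rank-deficient matrix; the sizes and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 6, 4, 2
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))   # a rank-r matrix with r < d

# Factorization-form thin SVD: keep only the r strictly positive singular values.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

M_plus = Vt.T @ np.diag(1.0 / s) @ U.T                   # pseudoinverse V S^{-1} U^T
print(np.allclose(M_plus, np.linalg.pinv(M)))            # True: matches numpy's pinv

# M^+ M equals sum_i v_i v_i^T (projection onto the row space), not the identity.
print(np.allclose(M_plus @ M, Vt.T @ Vt))                # True
print(np.allclose(M_plus @ M, np.eye(d)))                # False, since r < d
```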
Recap on SVD, zero matrix case
Suppose $0 = M \in \mathbb{R}^{n \times d}$, thus $r := \mathrm{rank}(M) = 0$.
◮ In all types of SVD, $M^+$ is $M^\top$ (another zero matrix).
◮ Technically speaking, $s$ is a singular value of $M$ iff there exist nonzero vectors $(u, v)$ with $Mv = su$ and $M^\top u = sv$; the zero matrix therefore has no singular values (or left/right singular vectors).
◮ "Factorization form" thin SVD becomes a little messy.
20 / 94
- 6. More on the normal equations
Recall our matrix notation

Let labeled examples $((x_i, y_i))_{i=1}^n$ be given.

Define the $n \times d$ matrix $A$ and the $n \times 1$ column vector $b$ by
$$A := \frac{1}{\sqrt{n}} \begin{pmatrix} \leftarrow x_1^\top \rightarrow \\ \vdots \\ \leftarrow x_n^\top \rightarrow \end{pmatrix}, \qquad b := \frac{1}{\sqrt{n}} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.$$

Can write empirical risk as
$$\widehat{R}(w) = \frac{1}{n} \sum_{i=1}^n \big(y_i - x_i^\top w\big)^2 = \|Aw - b\|_2^2.$$

Necessary condition for $w$ to be a minimizer of $\widehat{R}$: $\nabla \widehat{R}(w) = 0$, i.e., $w$ is a critical point of $\widehat{R}$.

This translates to
$$(A^\top A)\, w = A^\top b,$$
a system of linear equations called the normal equations.

We'll now finally show that the normal equations imply optimality.
21 / 94
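For concreteness, a short numpy sketch (my own; the synthetic data is purely illustrative) that forms $A$ and $b$ as above and checks that a solution of the normal equations gives the empirical risk in both forms.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))                    # rows are the x_i
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

A = X / np.sqrt(n)                             # A := (1/sqrt(n)) * stacked x_i^T
b = y / np.sqrt(n)                             # b := (1/sqrt(n)) * (y_1, ..., y_n)

# Solve the normal equations (A^T A) w = A^T b (A^T A is invertible here).
w_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(A.T @ A @ w_hat, A.T @ b))   # True

# Empirical risk: (1/n) * sum_i (y_i - x_i^T w)^2 equals ||A w - b||_2^2.
print(np.allclose(np.mean((y - X @ w_hat) ** 2), np.sum((A @ w_hat - b) ** 2)))  # True
```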
Normal equations imply optimality

Consider $w$ with $A^\top A w = A^\top y$, and any $w'$; then
$$\|Aw' - y\|^2 = \|Aw' - Aw + Aw - y\|^2 = \|Aw' - Aw\|^2 + 2(Aw' - Aw)^\top(Aw - y) + \|Aw - y\|^2.$$
Since
$$(Aw' - Aw)^\top(Aw - y) = (w' - w)^\top(A^\top A w - A^\top y) = 0,$$
we get $\|Aw' - y\|^2 = \|Aw' - Aw\|^2 + \|Aw - y\|^2$. This means $w$ is optimal.

Moreover, writing $A = \sum_{i=1}^r s_i u_i v_i^\top$,
$$\|Aw' - Aw\|^2 = (w' - w)^\top (A^\top A)(w' - w) = (w' - w)^\top \Big(\sum_{i=1}^r s_i^2 v_i v_i^\top\Big)(w' - w),$$
so $w'$ is also optimal iff $w' - w$ is in the right nullspace of $A$.

(We'll revisit all this with convexity later.)
22 / 94
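A numerical illustration (not from the lecture) of both claims: every solution of the normal equations attains the same value of $\|Aw - y\|^2$, and two solutions differ by a vector in the right nullspace of $A$. The matrix below is made rank-deficient on purpose so the solution is not unique.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, r = 20, 5, 3
A = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))    # rank r < d
y = rng.normal(size=n)

w = np.linalg.pinv(A) @ y                                # one solution of the normal equations
_, _, Vt = np.linalg.svd(A)
w_prime = w + 7.0 * Vt[-1]                               # shift by a right-nullspace direction of A

print(np.allclose(A.T @ A @ w_prime, A.T @ y))                                 # True: also a solution
print(np.allclose(np.sum((A @ w - y) ** 2), np.sum((A @ w_prime - y) ** 2)))   # True: same risk
```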
Geometric interpretation of least squares ERM

Let $a_j \in \mathbb{R}^n$ be the $j$-th column of the matrix $A \in \mathbb{R}^{n \times d}$, so
$$A = \begin{pmatrix} \uparrow & & \uparrow \\ a_1 & \cdots & a_d \\ \downarrow & & \downarrow \end{pmatrix}.$$

Minimizing $\|Aw - b\|_2^2$ is the same as finding the vector $\hat{b} \in \mathrm{range}(A)$ closest to $b$.

Solution $\hat{b}$ is the orthogonal projection of $b$ onto $\mathrm{range}(A) = \{Aw : w \in \mathbb{R}^d\}$.

[Figure: $b$ and its orthogonal projection $\hat{b}$ onto the plane spanned by $a_1$ and $a_2$.]

◮ $\hat{b}$ is uniquely determined; indeed, $\hat{b} = AA^+ b = \sum_{i=1}^r u_i u_i^\top b$.
◮ If $r = \mathrm{rank}(A) < d$, then there is more than one way to write $\hat{b}$ as a linear combination of $a_1, \ldots, a_d$. If $\mathrm{rank}(A) < d$, the ERM solution is not unique.

To get $w$ from $\hat{b}$: solve the system of linear equations $Aw = \hat{b}$.
23 / 94
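A small numpy sketch (my own) of the picture above: compute the projection $\hat{b} = AA^+ b$, check that the residual is orthogonal to $\mathrm{range}(A)$, and recover a $w$ with $Aw = \hat{b}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 10, 4
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

b_hat = A @ np.linalg.pinv(A) @ b                 # orthogonal projection of b onto range(A)

print(np.allclose(A.T @ (b - b_hat), 0))          # True: residual is orthogonal to every column a_j
w, *_ = np.linalg.lstsq(A, b_hat, rcond=None)     # solve A w = b_hat
print(np.allclose(A @ w, b_hat))                  # True, since b_hat is in range(A)
```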
- 7. Features
Enhancing linear regression models with features

Linear functions alone are restrictive, but become powerful with creative side-information, or features. Idea: predict with $x \mapsto w^\top \phi(x)$, where $\phi$ is a feature mapping.

Examples:
1. Non-linear transformations of existing variables: for $x \in \mathbb{R}$, $\phi(x) = \ln(1 + x)$.
2. Logical formula of binary variables: for $x = (x_1, \ldots, x_d) \in \{0,1\}^d$, $\phi(x) = (x_1 \wedge x_5 \wedge \neg x_{10}) \vee (\neg x_2 \wedge x_7)$.
3. Trigonometric expansion: for $x \in \mathbb{R}$, $\phi(x) = (1, \sin(x), \cos(x), \sin(2x), \cos(2x), \ldots)$.
4. Polynomial expansion: for $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, $\phi(x) = (1, x_1, \ldots, x_d, x_1^2, \ldots, x_d^2, x_1 x_2, \ldots, x_1 x_d, \ldots, x_{d-1} x_d)$.
24 / 94
Example: Taking advantage of linearity
Suppose you are trying to predict some health outcome.
◮ A physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature: $\phi(x) = (x_{\mathrm{temp}} - 98.6)^2$.
◮ What if you didn't know about this magic constant 98.6?
◮ Instead, use $\phi(x) = (1, x_{\mathrm{temp}}, x_{\mathrm{temp}}^2)$. We can learn coefficients $w$ such that $w^\top \phi(x) = (x_{\mathrm{temp}} - 98.6)^2$, or any other quadratic polynomial in $x_{\mathrm{temp}}$ (which may be better!).
25 / 94
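A quick numpy check (my own toy example, with noiseless data) that least squares over the feature map $\phi(x) = (1, x, x^2)$ recovers the coefficients of $(x - 98.6)^2 = 9721.96 - 197.2\,x + x^2$.

```python
import numpy as np

x = np.linspace(95.0, 104.0, 30)                     # hypothetical temperature readings
y = (x - 98.6) ** 2                                  # target given by the physician's feature

Phi = np.column_stack([np.ones_like(x), x, x ** 2])  # phi(x) = (1, x, x^2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)                                             # approximately [9721.96, -197.2, 1.0]
```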
Quadratic expansion

Quadratic function $f : \mathbb{R} \to \mathbb{R}$: $f(x) = ax^2 + bx + c$, $x \in \mathbb{R}$, for $a, b, c \in \mathbb{R}$.

This can be written as a linear function of $\phi(x)$, where $\phi(x) := (1, x, x^2)$, since $f(x) = w^\top \phi(x)$ where $w = (c, b, a)$.

For a multivariate quadratic function $f : \mathbb{R}^d \to \mathbb{R}$, use
$$\phi(x) := (\underbrace{1, x_1, \ldots, x_d}_{\text{linear terms}},\ \underbrace{x_1^2, \ldots, x_d^2}_{\text{squared terms}},\ \underbrace{x_1 x_2, \ldots, x_1 x_d, \ldots, x_{d-1} x_d}_{\text{cross terms}}).$$
26 / 94
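A sketch (not from the slides) of the multivariate quadratic feature map; the ordering of the cross terms is one arbitrary but fixed choice.

```python
import numpy as np
from itertools import combinations

def quadratic_features(x):
    """phi(x) = (1, x_1..x_d, x_1^2..x_d^2, x_i*x_j for i < j)."""
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], x, x ** 2, cross])

print(quadratic_features([2.0, 3.0]))   # [1. 2. 3. 4. 9. 6.]
```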
Affine expansion and "Old Faithful"

Woodward needed an affine expansion for the "Old Faithful" data: $\phi(x) := (1, x)$.

[Figure: time until next eruption vs. duration of last eruption, with a fitted affine function.]

An affine function $f_{a,b} : \mathbb{R} \to \mathbb{R}$, $f_{a,b}(x) = a + bx$ for $a, b \in \mathbb{R}$, is a linear function $f_w$ of $\phi(x)$ for $w = (a, b)$. (This easily generalizes to multivariate affine functions.)
27 / 94
Final remarks on features
◮ "Feature engineering" can drastically change the power of a model.
◮ Some people consider it messy, unprincipled, pure "trial-and-error".
◮ Deep learning is sometimes touted as removing some of this, but it doesn't do so completely (e.g., it took a lot of work to come up with the "convolutional neural network"; side question: who came up with that?).
28 / 94
- 8. Statistical view of least squares; maximum likelihood
Maximum likelihood estimation (MLE) refresher
Parametric statistical model: $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, a collection of probability distributions for the observed data.
◮ $\Theta$: parameter space.
◮ $\theta \in \Theta$: a particular parameter (or parameter vector).
◮ $P_\theta$: a particular probability distribution for observed data.

Likelihood of $\theta \in \Theta$ given observed data $x$: for discrete $X \sim P_\theta$ with probability mass function $p_\theta$, $L(\theta) := p_\theta(x)$; for continuous $X \sim P_\theta$ with probability density function $f_\theta$, $L(\theta) := f_\theta(x)$.

Maximum likelihood estimator (MLE): let $\hat{\theta}$ be the $\theta \in \Theta$ of highest likelihood given the observed data.
29 / 94
Distributions over labeled examples
$\mathcal{X}$: space of possible side-information (feature space). $\mathcal{Y}$: space of possible outcomes (label space or output space).

The distribution $P$ of a random pair $(X, Y)$ taking values in $\mathcal{X} \times \mathcal{Y}$ can be thought of in two parts:
1. Marginal distribution $P_X$ of $X$: $P_X$ is a probability distribution on $\mathcal{X}$.
2. Conditional distribution $P_{Y|X=x}$ of $Y$ given $X = x$ for each $x \in \mathcal{X}$: $P_{Y|X=x}$ is a probability distribution on $\mathcal{Y}$.

This lecture: $\mathcal{Y} = \mathbb{R}$ (regression problems).
30 / 94
Optimal predictor
What function $f : \mathcal{X} \to \mathbb{R}$ has the smallest (squared loss) risk $R(f) := \mathbb{E}[(f(X) - Y)^2]$? Note: earlier we discussed empirical risk.
◮ Conditional on $X = x$, the minimizer of the conditional risk $\hat{y} \mapsto \mathbb{E}[(\hat{y} - Y)^2 \mid X = x]$ is the conditional mean $\mathbb{E}[Y \mid X = x]$.
◮ Therefore, the function $f^\star : \mathcal{X} \to \mathbb{R}$ with $f^\star(x) = \mathbb{E}[Y \mid X = x]$ has the smallest risk.
◮ $f^\star$ is called the regression function or conditional mean function.
31 / 94
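A tiny Monte Carlo check (mine) of the first bullet: among constant predictions $\hat{y}$, the conditional mean minimizes the conditional risk. The toy conditional distribution here is $Y \mid X = x \sim \mathcal{N}(2x, 1)$ with $x = 1.5$, so $\mathbb{E}[Y \mid X = x] = 3$.

```python
import numpy as np

rng = np.random.default_rng(4)
x = 1.5
y = 2 * x + rng.normal(size=100_000)                    # draws of Y given X = x

cs = np.linspace(0.0, 6.0, 601)                         # candidate constant predictions
risks = np.array([np.mean((c - y) ** 2) for c in cs])   # Monte Carlo conditional risks
print(cs[np.argmin(risks)], y.mean())                   # both approximately 3.0
```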
Linear regression models
When side-information is encoded as vectors of real numbers $x = (x_1, \ldots, x_d)$ (called features or variables), it is common to use a linear regression model, such as the following:
$$Y \mid X = x \sim \mathcal{N}(x^\top w, \sigma^2), \qquad x \in \mathbb{R}^d.$$
◮ Parameters: $w = (w_1, \ldots, w_d) \in \mathbb{R}^d$, $\sigma^2 > 0$.
◮ $X = (X_1, \ldots, X_d)$, a random vector (i.e., a vector of random variables).
◮ Conditional distribution of $Y$ given $X$ is normal.
◮ Marginal distribution of $X$ not specified.

In this model, the regression function $f^\star$ is a linear function $f_w : \mathbb{R}^d \to \mathbb{R}$,
$$f_w(x) = x^\top w = \sum_{i=1}^d x_i w_i, \qquad x \in \mathbb{R}^d.$$
(We'll often refer to $f_w$ just by $w$.)

[Figure: a one-dimensional sample from such a model, with the regression function $f^\star$.]

32 / 94
Maximum likelihood estimation for linear regression
Linear regression model with Gaussian noise: $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y)$ are iid, with
$$Y \mid X = x \sim \mathcal{N}(x^\top w, \sigma^2), \qquad x \in \mathbb{R}^d.$$
(Traditional to study linear regression in the context of this model.)

Log-likelihood of $(w, \sigma^2)$, given data $(X_i, Y_i) = (x_i, y_i)$ for $i = 1, \ldots, n$:
$$\sum_{i=1}^n \left( -\frac{1}{2\sigma^2}\big(x_i^\top w - y_i\big)^2 + \frac{1}{2} \ln \frac{1}{2\pi\sigma^2} \right) + \big\{\text{terms not involving } (w, \sigma^2)\big\}.$$

The $w$ that maximizes the log-likelihood is also the $w$ that minimizes
$$\frac{1}{n} \sum_{i=1}^n \big(x_i^\top w - y_i\big)^2.$$

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model...
33 / 94
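A hedged numpy/scipy sketch (my own, with synthetic data) of the equivalence above: numerically maximizing the Gaussian log-likelihood over $(w, \sigma^2)$ returns the same $w$ as least squares.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.3 * rng.normal(size=n)

def neg_log_likelihood(params):
    w, log_sigma = params[:d], params[d]             # parameterize sigma > 0 via its log
    sigma2 = np.exp(2.0 * log_sigma)
    resid = X @ w - y
    return np.sum(resid ** 2 / (2.0 * sigma2) + 0.5 * np.log(2.0 * np.pi * sigma2))

w_mle = minimize(neg_log_likelihood, x0=np.zeros(d + 1)).x[:d]
w_lsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_mle, w_lsq, atol=1e-3))          # True: MLE for w equals least squares
```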
Empirical distribution and empirical risk
The empirical distribution $P_n$ on $(x_1, y_1), \ldots, (x_n, y_n)$ has probability mass function $p_n$ given by
$$p_n((x, y)) := \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{(x, y) = (x_i, y_i)\}, \qquad (x, y) \in \mathbb{R}^d \times \mathbb{R}.$$

Plug-in principle: the goal is to find a function $f$ that minimizes the (squared loss) risk $R(f) = \mathbb{E}[(f(X) - Y)^2]$. But we don't know the distribution $P$ of $(X, Y)$.

Replace $P$ with $P_n$ → empirical (squared loss) risk $\widehat{R}(f)$:
$$\widehat{R}(f) := \frac{1}{n} \sum_{i=1}^n \big(f(x_i) - y_i\big)^2.$$

("Plug-in principle" is used throughout statistics in this same way.)
34 / 94
Empirical risk minimization
Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss: ERM returns
$$\hat{w} \in \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \widehat{R}(w),$$
which coincides with the MLE under the basic linear regression model.

In general:
◮ MLE makes sense in the context of the statistical model for which it is derived.
◮ ERM makes sense in the context of the general iid model for supervised learning.

Further remarks:
◮ In MLE, we assume a model, and we not only maximize likelihood, but can try to argue that we "recover" a "true" parameter.
◮ In ERM, by default there is no assumption of a "true" parameter to recover. Useful examples: medical testing, gene expression, ...
35 / 94
Old faithful data under this least squares statistical model
Recall our data, consisting of historical records of eruptions:

[Figure: timeline of eruption start times $a_i$ and end times $b_i$; each $Y_i$ is the gap between the end of one eruption and the start of the next, up to the current time $t$.]

Statistical model (not just iid!): $Y_1, \ldots, Y_n, Y \sim_{\text{iid}} \mathcal{N}(\mu, \sigma^2)$.
◮ Data: $Y_i := a_i - b_{i-1}$, $i = 1, \ldots, n$.
(Admittedly not a great model, since durations are non-negative.)

Task: at a later time $t$ (when an eruption ends), predict the time of the next eruption, $t + Y$.

For the linear regression model, we'll assume $Y \mid X = x \sim \mathcal{N}(x^\top w, \sigma^2)$, $x \in \mathbb{R}^d$. (This extends the model above if we add the "1" feature.)
36 / 94
- 9. Regularization and ridge regression
Inductive bias
Suppose the ERM solution is not unique. What should we do?

One possible answer: pick the $w$ of shortest length.
◮ Fact: the shortest solution $\hat{w}$ to $(A^\top A) w = A^\top b$ is always unique.
◮ Obtain $\hat{w}$ via $\hat{w} = A^+ b$, where $A^+$ is the (Moore-Penrose) pseudoinverse of $A$.

Why should this be a good idea?
◮ The data does not give a reason to choose a shorter $w$ over a longer $w$.
◮ The preference for shorter $w$ is an inductive bias: it will work well for some problems (e.g., when the "true" $w^\star$ is short), not for others.

All learning algorithms encode some kind of inductive bias.
37 / 94
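An illustrative numpy sketch (not from the slides): when $(A^\top A)w = A^\top b$ has many solutions, $A^+ b$ is the one of minimum Euclidean norm. The matrix is deliberately wide ($n < d$) so the ERM solution cannot be unique.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 5, 8                                          # n < d: the ERM solution is not unique
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w_min = np.linalg.pinv(A) @ b                        # minimum-norm solution A^+ b
_, _, Vt = np.linalg.svd(A)
w_other = w_min + 3.0 * Vt[-1]                       # another ERM solution (null-space shift)

print(np.allclose(A @ w_min, A @ w_other))                  # True: identical predictions and risk
print(np.linalg.norm(w_min) < np.linalg.norm(w_other))      # True: the pinv solution is shortest
```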
Example
ERM with scaled trigonometric feature expansion:
$$\phi(x) = \Big(1, \sin(x), \cos(x), \tfrac{1}{2}\sin(2x), \tfrac{1}{2}\cos(2x), \tfrac{1}{3}\sin(3x), \tfrac{1}{3}\cos(3x), \ldots\Big).$$

[Figure: training data, shown with an arbitrary ERM fit and with the least $\ell_2$ norm ERM fit.]

It is not a given that the least norm ERM is better than the other ERM!
38 / 94
Regularized ERM
Combine the two concerns: for a given $\lambda \geq 0$, find the minimizer of
$$\widehat{R}(w) + \lambda \|w\|_2^2$$
over $w \in \mathbb{R}^d$.

Fact: if $\lambda > 0$, then the solution is always unique (even if $n < d$)!
◮ This is called ridge regression. ($\lambda = 0$ is ERM / ordinary least squares.) Explicit solution: $(A^\top A + \lambda I)^{-1} A^\top b$.
◮ The parameter $\lambda$ controls how much attention is paid to the regularizer $\|w\|_2^2$ relative to the data-fitting term $\widehat{R}(w)$.
◮ Choose $\lambda$ using cross-validation.

Note: in deep networks, this regularization is called "weight decay". (Why?)
Note: another popular regularizer for linear regression is $\ell_1$.
39 / 94
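A minimal numpy sketch (mine, not the lecture's code) of the closed form above: the regularized solution is unique even with $n < d$, and as $\lambda \to 0$ it approaches the minimum-norm ERM solution $A^+ b$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, lam = 10, 20, 0.1                       # n < d: unregularized ERM would not be unique
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
A, b = X / np.sqrt(n), y / np.sqrt(n)         # same scaling as in the slides

w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)   # (A^T A + lam I)^{-1} A^T b
print(w_ridge.shape)                          # (20,): a unique solution despite n < d

# As lam -> 0, ridge approaches the minimum-norm least squares solution A^+ b.
w_tiny = np.linalg.solve(A.T @ A + 1e-8 * np.eye(d), A.T @ b)
print(np.allclose(w_tiny, np.linalg.pinv(A) @ b, atol=1e-4))    # True
```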