  1. 5. Summary of linear regression so far

  2. Main points
     ◮ Model/function/predictor class of linear regressors $x \mapsto w^T x$.
     ◮ ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.
     ◮ ERM solution for least squares: pick $w$ satisfying $A^T A w = A^T b$, which is not unique; one unique choice is the ordinary least squares solution $A^+ b$.
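
The ordinary least squares choice $A^+ b$ mentioned above can be computed directly. Here is a minimal NumPy sketch (not from the lecture; the synthetic data and variable names are my own) checking that the pseudoinverse and a standard least-squares solver agree:

```python
import numpy as np

# Synthetic data: n examples in d dimensions (illustrative only).
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Scale by 1/sqrt(n) as in the slides, so ||Aw - b||^2 is the empirical risk.
A = X / np.sqrt(n)
b = y / np.sqrt(n)

# Ordinary least squares solution A^+ b (the minimum-norm minimizer).
w_ols = np.linalg.pinv(A) @ b

# np.linalg.lstsq also returns the minimum-norm least squares solution.
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(w_ols, w_lstsq))  # True
```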

  3. Part 2 of linear regression lecture...

  4. Recap on SVD. (A messy slide, I'm sorry.) Suppose $0 \neq M \in \mathbb{R}^{n \times d}$, so $r := \mathrm{rank}(M) > 0$.
     ◮ "Decomposition form" thin SVD: $M = \sum_{i=1}^r s_i u_i v_i^T$ with $s_1 \geq \cdots \geq s_r > 0$, and $M^+ = \sum_{i=1}^r \frac{1}{s_i} v_i u_i^T$; in general $M^+ M = \sum_{i=1}^r v_i v_i^T \neq I$.
     ◮ "Factorization form" thin SVD: $M = U S V^T$ with $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{d \times r}$ having orthonormal columns (so $U^T U = V^T V = I_r$, but $U U^T$ and $V V^T$ are not identity matrices in general), and $S = \mathrm{diag}(s_1, \ldots, s_r) \in \mathbb{R}^{r \times r}$ with $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = V S^{-1} U^T$, and in general $M^+ M \neq M M^+ \neq I$.
     ◮ Full SVD: $M = U_f S_f V_f^T$ with $U_f \in \mathbb{R}^{n \times n}$ and $V_f \in \mathbb{R}^{d \times d}$ orthonormal and full rank, so $U_f^T U_f$ and $V_f^T V_f$ are identity matrices, and $S_f \in \mathbb{R}^{n \times d}$ is zero everywhere except its first $r$ diagonal entries, which are $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = V_f S_f^+ U_f^T$, where $S_f^+$ is obtained by transposing $S_f$ and then inverting its nonzero entries, and in general $M^+ M \neq M M^+ \neq I$. Additional property: agreement with the eigendecompositions of $M M^T$ and $M^T M$.
     The "full SVD" adds columns to $U$ and $V$ which hit zeros of $S$ and therefore don't matter (as a sanity check, verify for yourself that all these SVDs are equal).
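
For concreteness, here is a short NumPy sketch (my own illustration, not part of the slides) computing the thin and full SVD of a small rank-deficient matrix and checking the pseudoinverse formula $M^+ = V S^{-1} U^T$:

```python
import numpy as np

# A small rank-deficient matrix (illustrative only): n = 4, d = 3, rank 2.
rng = np.random.default_rng(1)
M = rng.normal(size=(4, 2)) @ rng.normal(size=(2, 3))
r = np.linalg.matrix_rank(M)

# Full SVD: U_f is 4x4, Vt_f is 3x3; s_full holds min(n, d) = 3 values.
U_f, s_full, Vt_f = np.linalg.svd(M, full_matrices=True)

# Thin ("factorization form") SVD, keeping only the r positive singular values.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

# Pseudoinverse M^+ = V S^{-1} U^T; it matches numpy's pinv.
M_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
print(np.allclose(M_pinv, np.linalg.pinv(M)))  # True

# M^+ M projects onto the row space; it is not the identity since r < d.
print(np.allclose(M_pinv @ M, np.eye(3)))      # False
```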

  5. Recap on SVD, zero matrix case. Suppose $0 = M \in \mathbb{R}^{n \times d}$, so $r := \mathrm{rank}(M) = 0$.
     ◮ In all types of SVD, $M^+$ is $M^T$ (another zero matrix).
     ◮ Technically speaking, $s$ is a singular value of $M$ iff there exist nonzero vectors $(u, v)$ with $M v = s u$ and $M^T u = s v$; the zero matrix therefore has no singular values (or left/right singular vectors).
     ◮ The "factorization form" thin SVD becomes a little messy.
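
A tiny sanity check of this edge case (my own illustration):

```python
import numpy as np

# Edge case from the slide: the pseudoinverse of the zero matrix is its
# transpose (the d x n zero matrix).
Z = np.zeros((4, 3))
print(np.allclose(np.linalg.pinv(Z), Z.T))   # True

# numpy still reports min(n, d) "singular values", all equal to zero,
# which the slide notes are not singular values in the strict sense.
print(np.linalg.svd(Z, compute_uv=False))    # [0. 0. 0.]
```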

  6. 6. More on the normal equations

  7. Recall our matrix notation. Let labeled examples $((x_i, y_i))_{i=1}^n$ be given. Define the $n \times d$ matrix $A$ and the $n \times 1$ column vector $b$ by
     $$A := \frac{1}{\sqrt{n}} \begin{bmatrix} \leftarrow & x_1^T & \rightarrow \\ & \vdots & \\ \leftarrow & x_n^T & \rightarrow \end{bmatrix}, \qquad b := \frac{1}{\sqrt{n}} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}.$$
     We can write the empirical risk as
     $$\widehat{R}(w) = \frac{1}{n} \sum_{i=1}^n \big( y_i - x_i^T w \big)^2 = \| A w - b \|_2^2.$$
     A necessary condition for $w$ to be a minimizer of $\widehat{R}$ is $\nabla \widehat{R}(w) = 0$, i.e., $w$ is a critical point of $\widehat{R}$. This translates to
     $$(A^T A)\, w = A^T b,$$
     a system of linear equations called the normal equations. We'll now finally show that the normal equations imply optimality.
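
To make the matrix notation concrete, here is a small self-contained NumPy sketch (synthetic data and names of my choosing) that builds $A$ and $b$ with the $1/\sqrt{n}$ scaling, checks that $\|Aw - b\|_2^2$ equals the averaged squared error, and solves the normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 4
X = rng.normal(size=(n, d))                   # rows are the examples x_i^T
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# The 1/sqrt(n) scaling folds the 1/n average into the squared norm.
A = X / np.sqrt(n)
b = y / np.sqrt(n)

w = rng.normal(size=d)                        # an arbitrary candidate predictor
risk_sum = np.mean((y - X @ w) ** 2)          # (1/n) sum_i (y_i - x_i^T w)^2
risk_mat = np.linalg.norm(A @ w - b) ** 2     # ||Aw - b||_2^2
print(np.allclose(risk_sum, risk_mat))        # True

# Solve the normal equations (A^T A) w = A^T b; here A^T A is invertible
# since n > d and the Gaussian features are in general position.
w_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(A.T @ (A @ w_hat), A.T @ b))  # True
```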

  8. Normal equations imply optimality. Consider $w$ with $A^T A w = A^T b$, and any $w'$; then
     $$\| A w' - b \|^2 = \| A w' - A w + A w - b \|^2 = \| A w' - A w \|^2 + 2 (A w' - A w)^T (A w - b) + \| A w - b \|^2.$$
     Since $(A w' - A w)^T (A w - b) = (w' - w)^T (A^T A w - A^T b) = 0$, this gives
     $$\| A w' - b \|^2 = \| A w' - A w \|^2 + \| A w - b \|^2 \geq \| A w - b \|^2,$$
     so $w$ is optimal. Moreover, writing $A = \sum_{i=1}^r s_i u_i v_i^T$,
     $$\| A w' - A w \|^2 = (w' - w)^T (A^T A) (w' - w) = (w' - w)^T \Big( \sum_{i=1}^r s_i^2 v_i v_i^T \Big) (w' - w),$$
     so $w'$ is optimal iff $w' - w$ is in the right nullspace of $A$. (We'll revisit all this with convexity later.)
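
A quick numerical check of this characterization (a sketch with synthetic data, not from the slides): if $w$ solves the normal equations and $z$ lies in the nullspace of $A$, then $w + z$ achieves the same empirical risk, while perturbing along a right singular vector increases it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, r = 20, 5, 3                            # rank-deficient design: rank 3 < d = 5
A = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))
b = rng.normal(size=n)

# A solution of the normal equations (the minimum-norm one).
w = np.linalg.pinv(A) @ b

# The right nullspace of A is spanned by the rows of Vt beyond the rank.
_, s, Vt = np.linalg.svd(A)
z = Vt[r:, :].T @ rng.normal(size=d - r)      # a vector with A z ≈ 0

risk = lambda v: np.linalg.norm(A @ v - b) ** 2
print(np.isclose(risk(w), risk(w + z)))       # True: same empirical risk
print(risk(w + Vt[0]) > risk(w))              # True: moving along v_1 increases it
```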

  9. Geometric interpretation of least squares ERM. Let $a_j \in \mathbb{R}^n$ be the $j$-th column of the matrix $A \in \mathbb{R}^{n \times d}$, so
     $$A = \begin{bmatrix} \uparrow & & \uparrow \\ a_1 & \cdots & a_d \\ \downarrow & & \downarrow \end{bmatrix}.$$
     Minimizing $\| A w - b \|_2^2$ is the same as finding the vector $\hat{b} \in \mathrm{range}(A)$ closest to $b$. The solution $\hat{b}$ is the orthogonal projection of $b$ onto $\mathrm{range}(A) = \{ A w : w \in \mathbb{R}^d \}$.
     ◮ $\hat{b}$ is uniquely determined; indeed, $\hat{b} = A A^+ b = \sum_{i=1}^r u_i u_i^T b$.
     ◮ If $r = \mathrm{rank}(A) < d$, then there is more than one way to write $\hat{b}$ as a linear combination of $a_1, \ldots, a_d$.
     If $\mathrm{rank}(A) < d$, then the ERM solution is not unique.
     [Figure: $b$ and its orthogonal projection $\hat{b}$ onto the plane spanned by the columns $a_1$ and $a_2$.]
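
As a final sketch (synthetic rank-deficient data; the setup is my own, not the lecture's), the projection $\hat{b} = A A^+ b$ is unique even though the minimizer $w$ is not when $\mathrm{rank}(A) < d$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, r = 30, 6, 4                            # rank(A) = 4 < d = 6
A = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))
b = rng.normal(size=n)

# Two different ERM solutions: the minimum-norm one, and one shifted
# along the nullspace of A.
w1 = np.linalg.pinv(A) @ b
_, s, Vt = np.linalg.svd(A)
w2 = w1 + Vt[-1]                              # Vt[-1] lies in the nullspace of A

# Both produce the same projection b_hat = A A^+ b (the same fitted values),
# even though the weight vectors differ.
b_hat = A @ (np.linalg.pinv(A) @ b)
print(np.allclose(A @ w1, b_hat), np.allclose(A @ w2, b_hat))  # True True
print(np.allclose(w1, w2))                                     # False
```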
