Linear Models
DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html
Carlos Fernandez-Granda

Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
Regression
The aim is to learn a function h that relates
◮ a response or dependent variable y
◮ to several observed variables x1, x2, . . . , xp, known as covariates, features or independent variables
The response is assumed to be of the form
y = h(x) + z
where x ∈ R^p contains the features and z is noise
Linear regression
The regression function h is assumed to be linear:
y^(i) = x^(i)T β* + z^(i),   1 ≤ i ≤ n
Our aim is to estimate β* ∈ R^p from the data
Linear regression
In matrix form,
\[
\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}
=
\begin{bmatrix}
x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_p \\
x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_p \\
\vdots & \vdots & \ddots & \vdots \\
x^{(n)}_1 & x^{(n)}_2 & \cdots & x^{(n)}_p
\end{bmatrix}
\begin{bmatrix} \beta^*_1 \\ \beta^*_2 \\ \vdots \\ \beta^*_p \end{bmatrix}
+
\begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(n)} \end{bmatrix}
\]
Equivalently,
y = X β* + z
Linear model for GDP
State           GDP (millions)   Population    Unemployment rate
North Dakota        52 089          757 952    2.4
Alabama            204 861        4 863 300    3.8
Mississippi        107 680        2 988 726    5.2
Arkansas           120 689        2 988 248    3.5
Kansas             153 258        2 907 289    3.8
Georgia            525 360       10 310 371    4.5
Iowa               178 766        3 134 693    3.2
West Virginia       73 374        1 831 102    5.1
Kentucky           197 043        4 436 974    5.2
Tennessee              ???        6 651 194    3.0
Centering
\[
y_{\text{cent}} =
\begin{bmatrix} -127\,147 \\ 25\,625 \\ -71\,556 \\ -58\,547 \\ -25\,978 \\ 346\,124 \\ -470 \\ -105\,862 \\ 17\,807 \end{bmatrix}
\qquad
X_{\text{cent}} =
\begin{bmatrix}
-3\,044\,121 & -1.7 \\
 1\,061\,227 & -0.3 \\
  -813\,346 &  1.1 \\
  -813\,825 & -0.6 \\
  -894\,784 & -0.3 \\
 6\,508\,298 &  0.4 \\
  -667\,379 & -0.9 \\
-1\,970\,971 &  1.0 \\
   634\,901 &  1.1
\end{bmatrix}
\]
av(y) = 179 236,   av(X) = (3 802 073, 4.1)
Normalizing
\[
y_{\text{norm}} =
\begin{bmatrix} -0.321 \\ 0.065 \\ -0.180 \\ -0.148 \\ -0.065 \\ 0.872 \\ -0.001 \\ -0.267 \\ 0.045 \end{bmatrix}
\qquad
X_{\text{norm}} =
\begin{bmatrix}
-0.394 & -0.600 \\
 0.137 & -0.099 \\
-0.105 &  0.401 \\
-0.105 & -0.207 \\
-0.116 & -0.099 \\
 0.843 &  0.151 \\
-0.086 & -0.314 \\
-0.255 &  0.366 \\
 0.082 &  0.401
\end{bmatrix}
\]
std(y) = 396 701,   std(X) = (7 720 656, 2.80)
Linear model for GDP

Aim: find β ∈ R² such that ynorm ≈ Xnorm β
The estimate for the GDP of Tennessee will be
\[
y_{\text{Ten}} = \text{av}(y) + \text{std}(y)\, \langle x^{\text{Ten}}_{\text{norm}}, \beta \rangle
\]
where x^Ten_norm is the feature vector of Tennessee, centered using av(X) and normalized using std(X)
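As a concrete illustration of this pipeline, here is a minimal NumPy sketch (not from the slides) that centers and normalizes the table above, fits β by least squares, and predicts the held-out state; the variable names are my own.

```python
import numpy as np

# GDP (millions), then (population, unemployment rate) for the 9 training states
gdp = np.array([52089, 204861, 107680, 120689, 153258,
                525360, 178766, 73374, 197043], dtype=float)
X = np.array([[757952, 2.4], [4863300, 3.8], [2988726, 5.2],
              [2988248, 3.5], [2907289, 3.8], [10310371, 4.5],
              [3134693, 3.2], [1831102, 5.1], [4436974, 5.2]])

# Center and normalize (divide by the l2 norm of each centered column)
av_y, av_X = gdp.mean(), X.mean(axis=0)
y_cent, X_cent = gdp - av_y, X - av_X
std_y, std_X = np.linalg.norm(y_cent), np.linalg.norm(X_cent, axis=0)
y_norm, X_norm = y_cent / std_y, X_cent / std_X

# Least-squares fit of y_norm ~ X_norm @ beta
beta, *_ = np.linalg.lstsq(X_norm, y_norm, rcond=None)

# Predict Tennessee from its (population, unemployment) features
x_ten = (np.array([6651194, 3.0]) - av_X) / std_X
print(av_y + std_y * x_ten @ beta)   # should be close to the slide's estimate
```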
Temperature predictor
A friend tells you: I found a cool way to predict the average daily temperature in New York: It’s just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it’s perfect!
System of equations
A is n × p and full rank; consider the system A b = c
◮ If n < p the system is underdetermined: infinitely many solutions for any c (overfitting)
◮ If n = p the system is determined: a unique solution for any c (overfitting)
◮ If n > p the system is overdetermined: a solution exists only if c ∈ col(A) (if there is noise, there are no solutions)
Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
Least squares
For fixed β we can evaluate the error using
\[
\sum_{i=1}^{n} \left( y^{(i)} - x^{(i)\,T} \beta \right)^2 = \| y - X\beta \|_2^2
\]
The least-squares estimate βLS minimizes this cost function:
\[
\beta_{\text{LS}} := \arg\min_{\beta} \| y - X\beta \|_2 = \left( X^T X \right)^{-1} X^T y
\quad \text{if } X \text{ is full rank and } n \ge p
\]
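A quick numerical sketch of the closed form (with my own example data): computing βLS both via the normal equations and via NumPy's least-squares solver, which is the numerically preferred route.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.standard_normal((n, p))              # full-rank feature matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Closed form (X^T X)^{-1} X^T y -- fine here, ill-advised when X is ill conditioned
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

# Preferred: solve the least-squares problem directly
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal_eq, beta_lstsq))   # True
```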
Least-squares fit
[Figure: two-dimensional data points and the least-squares fit line.]
Least-squares solution

Let X = U S V^T, where U ∈ R^{n×p} has orthonormal columns. Decompose
\[
y = U U^T y + \left( I - U U^T \right) y
\]
By the Pythagorean theorem,
\[
\| y - X\beta \|_2^2 = \left\| \left( I - U U^T \right) y \right\|_2^2 + \left\| U U^T y - X\beta \right\|_2^2
\]
so
\[
\arg\min_{\beta} \| y - X\beta \|_2^2
= \arg\min_{\beta} \left\| U U^T y - X\beta \right\|_2^2
= \arg\min_{\beta} \left\| U U^T y - U S V^T \beta \right\|_2^2
= \arg\min_{\beta} \left\| U^T y - S V^T \beta \right\|_2^2
\]
and therefore
\[
\beta_{\text{LS}} = V S^{-1} U^T y = \left( X^T X \right)^{-1} X^T y
\]
Linear model for GDP
The least-squares estimate is
βLS = (1.019, −0.111)
◮ GDP roughly proportional to the population
◮ Unemployment has a negative (linear) effect
Linear model for GDP
State           GDP        Estimate
North Dakota     52 089     46 241
Alabama         204 861    239 165
Mississippi     107 680    119 005
Arkansas        120 689    145 712
Kansas          153 258    136 756
Georgia         525 360    513 343
Iowa            178 766    158 097
West Virginia    73 374     59 969
Kentucky        197 043    194 829
Tennessee       328 770    345 352
Maximum temperatures in Oxford, UK
[Figure: monthly maximum temperature (Celsius), 1860–2000.]
Maximum temperatures in Oxford, UK
[Figure: monthly maximum temperature (Celsius), 1900–1905.]
Linear model
\[
y_t \approx \beta_0 + \beta_1 \cos\left( \frac{2\pi t}{12} \right) + \beta_2 \sin\left( \frac{2\pi t}{12} \right) + \beta_3\, t
\]
where 1 ≤ t ≤ n is the time in months (n = 12 · 150)
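A minimal sketch (my own, not from the slides) of how this design matrix can be assembled and fitted; the synthetic temperature series below is a stand-in for the Oxford data.

```python
import numpy as np

n = 12 * 150                      # monthly samples over 150 years
t = np.arange(1, n + 1, dtype=float)

# Columns: intercept, yearly cosine, yearly sine, linear trend
X = np.column_stack([np.ones(n),
                     np.cos(2 * np.pi * t / 12),
                     np.sin(2 * np.pi * t / 12),
                     t])

# Stand-in series: seasonal oscillation + slow warming trend + noise
rng = np.random.default_rng(0)
y = 14 + 8 * np.cos(2 * np.pi * t / 12) + 0.75 * t / 1200 + rng.standard_normal(n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)      # beta[3] * 1200 is the fitted warming per century in this example
```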
Model fitted by least squares
[Figure: temperature data (Celsius) and fitted model, 1860–2000.]
Model fitted by least squares
[Figure: temperature data and fitted model, 1900–1905.]
Model fitted by least squares
[Figure: temperature data and fitted model, 1960–1965.]
Trend: increase of 0.75 °C / 100 years (1.35 °F)
[Figure: temperature data and linear trend, 1860–2000.]
Model for minimum temperatures
[Figure: temperature data (Celsius) and fitted model, 1860–2000.]
Model for minimum temperatures
[Figure: temperature data and fitted model, 1900–1905.]
Model for minimum temperatures
[Figure: temperature data and fitted model, 1960–1965.]
Trend: increase of 0.88 °C / 100 years (1.58 °F)
[Figure: temperature data and linear trend, 1860–2000.]
Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
Geometric interpretation
◮ Any vector Xβ is in the span of the columns of X
◮ The least-squares estimate is the closest vector to y that can be represented in this way
◮ This is the projection of y onto the column space of X:
\[
X \beta_{\text{LS}} = U S V^T V S^{-1} U^T y = U U^T y
\]
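A short numerical check (my own) that the least-squares fit coincides with the orthogonal projection UU^T y:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # reduced SVD: U is 30 x 4
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(X @ beta_ls, U @ U.T @ y))       # True: fit = projection onto col(X)
```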
Geometric interpretation
Face denoising
We denoise by projecting onto:
◮ S1: the span of the 9 images from the same subject
◮ S2: the span of the 360 images in the training set
Test error:
‖x − P_S1 y‖₂ / ‖x‖₂ = 0.114,   ‖x − P_S2 y‖₂ / ‖x‖₂ = 0.078
S1
S1 := span{ · · · } (the 9 training images of the subject)

Denoising via projection onto S1
[Figure: the signal x, the noise z, and the data y, each decomposed into its projection onto S1 and onto S1⊥; the coefficients shown are 0.993 and 0.114 for the signal and 0.007 and 0.150 for the noise, and the projection of y onto S1 is the estimate.]
S2
S2 := span{ · · · } (the 360 images in the training set)

Denoising via projection onto S2
[Figure: the signal x, the noise z, and the data y, each decomposed into its projection onto S2 and onto S2⊥; the coefficients shown are 0.998 and 0.063 for the signal and 0.043 and 0.144 for the noise, and the projection of y onto S2 is the estimate.]
PS1 y and PS2 y
[Figure: the original image x next to the two estimates P_S1 y and P_S2 y.]
Lessons of Face Denoising

What does our intuition learned from face denoising tell us about linear regression?
◮ More features = larger column space
◮ Larger column space = captures more of the true image
◮ Larger column space = captures more of the noise
◮ Balance between underfitting and overfitting
Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
Motivation
Model the data y1, . . . , yn as realizations of a set of random variables y1, . . . , yn
The joint pdf depends on a vector of parameters β:
f_β(y1, . . . , yn) := f_{y1,...,yn}(y1, . . . , yn)
is the probability density of y1, . . . , yn at the observed data
Idea: choose β such that this density is as high as possible
Likelihood
The likelihood is equal to the joint pdf
L_{y1,...,yn}(β) := f_β(y1, . . . , yn)
interpreted as a function of the parameters
The log-likelihood function is the log of the likelihood, log L_{y1,...,yn}(β)
Maximum-likelihood estimator
The likelihood quantifies how likely the data are according to the model
Maximum-likelihood (ML) estimator:
\[
\beta_{\text{ML}}(y_1, \ldots, y_n) := \arg\max_{\beta} L_{y_1,\ldots,y_n}(\beta) = \arg\max_{\beta} \log L_{y_1,\ldots,y_n}(\beta)
\]
Maximizing the log-likelihood is equivalent, and often more convenient
Probabilistic interpretation
We model the noise as an iid Gaussian random vector z whose entries have zero mean and variance σ²
The data are a realization of the random vector
y := X β + z
y is Gaussian with mean X β and covariance matrix σ² I
Likelihood
The joint pdf of y is
\[
f_{y}(a) := \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} \left( a[i] - (X\beta)[i] \right)^2 \right)
= \frac{1}{(2\pi)^{n/2}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \| a - X\beta \|_2^2 \right)
\]
The likelihood is (setting σ = 1 for simplicity)
\[
L_{y}(\beta) = \frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{1}{2} \| y - X\beta \|_2^2 \right)
\]
Maximum-likelihood estimate
The maximum-likelihood estimate is
\[
\beta_{\text{ML}} = \arg\max_{\beta} L_{y}(\beta) = \arg\max_{\beta} \log L_{y}(\beta)
= \arg\min_{\beta} \| y - X\beta \|_2^2 = \beta_{\text{LS}}
\]
Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
Estimation error

If the data are generated according to the linear model y := X β* + z, then
\[
\beta_{\text{LS}} - \beta^* = \left( X^T X \right)^{-1} X^T \left( X\beta^* + z \right) - \beta^* = \left( X^T X \right)^{-1} X^T z
\]
as long as X is full rank
LS estimator is unbiased

Assume the noise z is random and has zero mean; then
\[
\operatorname{E}\left( \beta_{\text{LS}} - \beta^* \right) = \left( X^T X \right)^{-1} X^T \operatorname{E}(z) = 0
\]
The estimate is unbiased: its mean equals β*
Least-squares error
If the data are generated according to the linear model y := X β* + z, then
\[
\frac{\|z\|_2}{\sigma_1} \le \left\| \beta_{\text{LS}} - \beta^* \right\|_2 \le \frac{\|z\|_2}{\sigma_p}
\]
where σ1 and σp are the largest and smallest singular values of X
Least-squares error: Proof
The error is given by βLS − β* = (X^T X)^{-1} X^T z. How can we bound ‖(X^T X)^{-1} X^T z‖₂?
Singular values
The singular values of a matrix A ∈ R^{n×p} of rank p satisfy
\[
\sigma_1 = \max_{\{\|x\|_2 = 1 \,\mid\, x \in \mathbb{R}^p\}} \|Ax\|_2, \qquad
\sigma_p = \min_{\{\|x\|_2 = 1 \,\mid\, x \in \mathbb{R}^p\}} \|Ax\|_2
\]
Least-squares error
\[
\beta_{\text{LS}} - \beta^* = V S^{-1} U^T z
\]
The smallest and largest singular values of V S^{-1} U^T are 1/σ1 and 1/σp, so
\[
\frac{\|z\|_2}{\sigma_1} \le \left\| V S^{-1} U^T z \right\|_2 \le \frac{\|z\|_2}{\sigma_p}
\]
Experiment
Xtrain, Xtest, ztrain and β* are sampled iid from a standard Gaussian
The data have 50 features (p = 50)
ytrain = Xtrain β* + ztrain
ytest = Xtest β*   (no test noise)
We use ytrain and Xtrain to compute βLS, and evaluate
\[
\text{error}_{\text{train}} = \frac{\| X_{\text{train}} \beta_{\text{LS}} - y_{\text{train}} \|_2}{\| y_{\text{train}} \|_2},
\qquad
\text{error}_{\text{test}} = \frac{\| X_{\text{test}} \beta_{\text{LS}} - y_{\text{test}} \|_2}{\| y_{\text{test}} \|_2}
\]
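A sketch of this experiment in NumPy (my own reimplementation of what the slide describes; the grid of n values and the test-set size are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50
beta_star = rng.standard_normal(p)

for n in [50, 100, 200, 500]:
    X_train = rng.standard_normal((n, p))
    z_train = rng.standard_normal(n)
    y_train = X_train @ beta_star + z_train

    X_test = rng.standard_normal((1000, p))
    y_test = X_test @ beta_star            # no test noise, as in the slides

    beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)
    err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)
    print(n, round(err_train, 3), round(err_test, 3))
```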
Experiment
[Figure: relative training and test error (l2 norm) as a function of n, from 50 to 500, together with the training noise level.]
Experiment Questions

1. Can we approximate the relative noise level ‖z‖₂/‖y‖₂?
   ‖β*‖₂ ≈ √50, ‖Xtrain β*‖₂ ≈ √(50 n), ‖ztrain‖₂ ≈ √n, so the relative noise level is about 1/√51 ≈ 0.140
2. Why does the training error start at 0?
   X is square and invertible (n = p = 50)
3. Why does the relative training error converge to the noise level?
   ‖Xtrain βLS − ytrain‖₂ = ‖Xtrain (βLS − β*) − ztrain‖₂ and βLS → β*
4. Why does the relative test error converge to zero?
   We assumed no test noise, and βLS → β*
Non-asymptotic bound
Let y := X β* + z, where the entries of X and z are iid standard Gaussians
The least-squares estimate satisfies
\[
\sqrt{\frac{(1-\epsilon)\,p}{(1+\epsilon)\,n}} \le \left\| \beta_{\text{LS}} - \beta^* \right\|_2 \le \sqrt{\frac{(1+\epsilon)\,p}{(1-\epsilon)\,n}}
\]
with probability at least 1 − 1/p − 2 exp(−p ε²/8), as long as n ≥ 64 p log(12/ε)/ε²
Proof
\[
\frac{\left\| U^T z \right\|_2}{\sigma_1} \le \left\| V S^{-1} U^T z \right\|_2 \le \frac{\left\| U^T z \right\|_2}{\sigma_p}
\]
Projection onto a fixed subspace

Let S be a k-dimensional subspace of R^n and z ∈ R^n a vector of iid standard Gaussian noise. For any ε > 0,
\[
\operatorname{P}\left( k\,(1-\epsilon) < \| P_{S}\, z \|_2^2 < k\,(1+\epsilon) \right) \ge 1 - 2\exp\left( -\frac{k\epsilon^2}{8} \right)
\]
Consequence: with probability 1 − 2 exp(−p ε²/8),
\[
(1-\epsilon)\, p \le \left\| U^T z \right\|_2^2 \le (1+\epsilon)\, p
\]
Singular values of a Gaussian matrix
Let A be an n × k matrix with iid standard Gaussian entries, with n > k. For any fixed ε > 0, the singular values of A satisfy
\[
\sqrt{(1-\epsilon)\,n} \le \sigma_k \le \sigma_1 \le \sqrt{(1+\epsilon)\,n}
\]
with probability at least 1 − 1/k, as long as n > 64 k log(12/ε)/ε²
Proof
With probability 1 − 1/p,
\[
\sqrt{(1-\epsilon)\,n} \le \sigma_p \le \sigma_1 \le \sqrt{(1+\epsilon)\,n}
\]
as long as n ≥ 64 p log(12/ε)/ε²
Experiment: ‖β*‖₂² ≈ p

Plot of ‖β* − βLS‖₂ / ‖β*‖₂
[Figure: relative coefficient error (l2 norm) as a function of n, from 50 to 20 000, for p = 50, 100, 200, compared with a 1/√n reference curve.]
Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
Condition number
The condition number of A ∈ R^{n×p}, n ≥ p, is the ratio σ1/σp of its largest and smallest singular values
A matrix is ill conditioned if its condition number is large (it is almost rank deficient)
Noise amplification
Let y := X β* + z, where z is iid standard Gaussian. With probability at least 1 − 2 exp(−ε²/8),
\[
\left\| \beta_{\text{LS}} - \beta^* \right\|_2 \ge \frac{\sqrt{1-\epsilon}}{\sigma_p}
\]
where σp is the smallest singular value of X
Proof

\[
\left\| \beta_{\text{LS}} - \beta^* \right\|_2^2
= \left\| V S^{-1} U^T z \right\|_2^2
= \left\| S^{-1} U^T z \right\|_2^2 \quad (V \text{ is orthogonal})
= \sum_{i=1}^{p} \frac{\left( u_i^T z \right)^2}{\sigma_i^2}
\ge \frac{\left( u_p^T z \right)^2}{\sigma_p^2}
\]
Projection onto a fixed subspace

Let S be a k-dimensional subspace of R^n and z ∈ R^n a vector of iid standard Gaussian noise. For any ε > 0,
\[
\operatorname{P}\left( k\,(1-\epsilon) < \| P_{S}\, z \|_2^2 < k\,(1+\epsilon) \right) \ge 1 - 2\exp\left( -\frac{k\epsilon^2}{8} \right)
\]
Consequence: with probability 1 − 2 exp(−ε²/8),
\[
\left( u_p^T z \right)^2 \ge (1-\epsilon)
\]
Example
Let y := X β* + z, where
\[
X := \begin{bmatrix} 0.212 & -0.099 \\ 0.605 & -0.298 \\ -0.213 & 0.113 \\ 0.589 & -0.285 \\ 0.016 & 0.006 \\ 0.059 & 0.032 \end{bmatrix}, \quad
\beta^* := \begin{bmatrix} 0.471 \\ -1.191 \end{bmatrix}, \quad
z := \begin{bmatrix} 0.066 \\ -0.077 \\ -0.010 \\ -0.033 \\ 0.010 \\ 0.028 \end{bmatrix}, \quad \|z\|_2 = 0.11
\]
Example
Condition number = 100
\[
X = U S V^T =
\begin{bmatrix} -0.234 & 0.427 \\ -0.674 & -0.202 \\ 0.241 & 0.744 \\ -0.654 & 0.350 \\ 0.017 & -0.189 \\ 0.067 & 0.257 \end{bmatrix}
\begin{bmatrix} 1.00 & \\ & 0.01 \end{bmatrix}
\begin{bmatrix} -0.898 & 0.440 \\ 0.440 & 0.898 \end{bmatrix}
\]
Example

\[
\beta_{\text{LS}} - \beta^* = V S^{-1} U^T z
= V \begin{bmatrix} 1.00 & \\ & 100.00 \end{bmatrix} U^T z
= V \begin{bmatrix} 0.058 \\ 3.004 \end{bmatrix}
= \begin{bmatrix} 1.270 \\ 2.723 \end{bmatrix}
\]
so that
\[
\frac{\left\| \beta_{\text{LS}} - \beta^* \right\|_2}{\|z\|_2} = 27.00
\]
Multicollinearity

The feature matrix is ill conditioned if any subset of columns is close to being linearly dependent (there is a vector almost in the null space). This occurs if features are highly correlated.
For any X ∈ R^{n×p} with normalized columns, if two columns Xi and Xj, i ≠ j, satisfy ⟨Xi, Xj⟩ ≥ 1 − ε², then the smallest singular value satisfies σp ≤ ε.
Proof idea: consider ‖X (ei − ej)‖₂².
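A small NumPy illustration (my own) of how two nearly identical columns drive the condition number up and amplify the noise in the least-squares coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)        # almost the same feature
X = np.column_stack([x1, x2])
X /= np.linalg.norm(X, axis=0)                  # normalize the columns

print(np.linalg.cond(X))                        # large condition number

beta_star = np.array([1.0, 1.0])
z = 0.01 * rng.standard_normal(n)
beta_ls, *_ = np.linalg.lstsq(X, X @ beta_star + z, rcond=None)
print(beta_ls)     # the tiny noise can be amplified into large coefficient errors
```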
Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
Motivation
Avoid noise amplification due to multicollinearity
Problem: noise amplification blows up the coefficients
Solution: penalize large-norm solutions when fitting the model
Adding a penalty term promoting a particular structure is called regularization
Ridge regression

For a fixed regularization parameter λ > 0,
\[
\beta_{\text{ridge}} := \arg\min_{\beta} \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2
= \left( X^T X + \lambda I \right)^{-1} X^T y
\]
◮ λI increases the singular values of X^T X
◮ When λ → 0, βridge → βLS
◮ When λ → ∞, βridge → 0
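A minimal NumPy sketch (mine, not from the slides) of the closed form, checking the two limits mentioned above:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([1.0, -1.0, 2.0, 0.0, 0.5]) + 0.1 * rng.standard_normal(50)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(ridge(X, y, 1e-10), beta_ls))     # lambda -> 0 recovers least squares
print(np.linalg.norm(ridge(X, y, 1e6)))             # lambda -> infinity shrinks toward 0
```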
Proof

βridge is the solution to a modified least-squares problem (the data vector is padded with p zeros):
\[
\beta_{\text{ridge}} = \arg\min_{\beta} \left\| \begin{bmatrix} y \\ 0 \end{bmatrix} - \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix} \beta \right\|_2^2
= \left( \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix}^T \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix} \right)^{-1} \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix}^T \begin{bmatrix} y \\ 0 \end{bmatrix}
= \left( X^T X + \lambda I \right)^{-1} X^T y
\]
Modified projection

\[
y_{\text{ridge}} := X \beta_{\text{ridge}}
= X \left( X^T X + \lambda I \right)^{-1} X^T y
= U S V^T \left( V S^2 V^T + \lambda V V^T \right)^{-1} V S U^T y
= U S V^T V \left( S^2 + \lambda I \right)^{-1} V^T V S U^T y
= U S \left( S^2 + \lambda I \right)^{-1} S U^T y
= \sum_{i=1}^{p} \frac{\sigma_i^2}{\sigma_i^2 + \lambda} \langle y, u_i \rangle\, u_i
\]
The component of the data in the direction of ui is shrunk by σi² / (σi² + λ)
Modified projection: Relation to PCA

The component of the data in the direction of ui is shrunk by σi² / (σi² + λ)
Instead of orthogonally projecting onto the column space of X as in standard regression, we shrink and project
Which directions are shrunk the most? The directions in the data with smallest variance
In PCA, we delete the directions with smallest variance (i.e., shrink them to zero)
We can think of ridge regression as a continuous variant of performing regression on principal components
Ridge-regression estimate
If y := X β* + z,
\[
\beta_{\text{ridge}} =
V \begin{bmatrix} \frac{\sigma_1^2}{\sigma_1^2+\lambda} & & \\ & \ddots & \\ & & \frac{\sigma_p^2}{\sigma_p^2+\lambda} \end{bmatrix} V^T \beta^*
+
V \begin{bmatrix} \frac{\sigma_1}{\sigma_1^2+\lambda} & & \\ & \ddots & \\ & & \frac{\sigma_p}{\sigma_p^2+\lambda} \end{bmatrix} U^T z
\]
where X = U S V^T and σ1, . . . , σp are the singular values. For comparison,
\[
\beta_{\text{LS}} = \beta^* + V S^{-1} U^T z
\]
Bias-variance tradeoff
The error βridge − β* can be divided into two terms: a bias (which depends on β*) and a variance term (which depends on z). The bias equals
\[
\operatorname{E}\left( \beta_{\text{ridge}} - \beta^* \right) =
- V \begin{bmatrix} \frac{\lambda}{\sigma_1^2+\lambda} & & \\ & \ddots & \\ & & \frac{\lambda}{\sigma_p^2+\lambda} \end{bmatrix} V^T \beta^*
\]
Larger λ increases the bias, but dampens the noise (decreases the variance)
Example

Let y := X β* + z, where
\[
X := \begin{bmatrix} 0.212 & -0.099 \\ 0.605 & -0.298 \\ -0.213 & 0.113 \\ 0.589 & -0.285 \\ 0.016 & 0.006 \\ 0.059 & 0.032 \end{bmatrix}, \quad
\beta^* := \begin{bmatrix} 0.471 \\ -1.191 \end{bmatrix}, \quad
z := \begin{bmatrix} 0.066 \\ -0.077 \\ -0.010 \\ -0.033 \\ 0.010 \\ 0.028 \end{bmatrix}, \quad \|z\|_2 = 0.11
\]
Example

With σ1 = 1 and σ2 = 0.01,
\[
\beta_{\text{ridge}} - \beta^* =
- V \begin{bmatrix} \frac{\lambda}{1+\lambda} & \\ & \frac{\lambda}{0.01^2+\lambda} \end{bmatrix} V^T \beta^*
+ V \begin{bmatrix} \frac{1}{1+\lambda} & \\ & \frac{0.01}{0.01^2+\lambda} \end{bmatrix} U^T z
\]
Setting λ = 0.01,
\[
\beta_{\text{ridge}} - \beta^* =
- V \begin{bmatrix} 0.01 & \\ & 0.99 \end{bmatrix} V^T \beta^*
+ V \begin{bmatrix} 0.99 & \\ & 0.99 \end{bmatrix} U^T z
= \begin{bmatrix} 0.329 \\ 0.823 \end{bmatrix}
\]
Example
Least-squares relative error: ‖βLS − β*‖₂ / ‖z‖₂ = 27.00
Ridge-regression relative error: ‖βridge − β*‖₂ / ‖z‖₂ = 7.96
Example
[Figure: ridge coefficients and coefficient error as a function of the regularization parameter (10⁻⁷ to 10³), with the least-squares fit shown for reference.]
Maximum-a-posteriori estimator
Is there a probabilistic interpretation of ridge regression?
Bayesian viewpoint: β is modeled as random, not deterministic
The maximum-a-posteriori (MAP) estimator of β given y is
\[
\beta_{\text{MAP}}(y) := \arg\max_{\beta} f_{\beta \mid y}(\beta \mid y),
\]
where f_{β|y} is the conditional pdf of β given y
Maximum-a-posteriori estimator
Let y ∈ R^n be a realization of y := X β + z, where β and z are iid Gaussian with mean zero and variances σ1² and σ2², respectively
If X ∈ R^{n×p} is known, then
\[
\beta_{\text{MAP}} = \arg\min_{\beta} \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2,
\qquad \lambda := \sigma_2^2 / \sigma_1^2
\]
What does it mean if σ1² is tiny or large? How about σ2²?
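Filling in the short derivation behind this equivalence (not spelled out on the slide): by Bayes' rule, the log posterior is, up to additive constants,
\[
\log f_{\beta \mid y}(\beta \mid y)
= \log f_{y \mid \beta}(y \mid \beta) + \log f_{\beta}(\beta) - \log f_{y}(y)
= -\frac{1}{2\sigma_2^2}\, \| y - X\beta \|_2^2 - \frac{1}{2\sigma_1^2}\, \| \beta \|_2^2 + \text{const},
\]
so maximizing it is the same as minimizing ‖y − Xβ‖₂² + (σ2²/σ1²) ‖β‖₂², which is the ridge objective with λ = σ2²/σ1².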
Problem
How do we calibrate the regularization parameter λ?
◮ We cannot use the coefficient error (we don't know the true value!)
◮ We cannot minimize the error over the training data (why?)
Solution: check the fit on new data
Cross validation
Given a set of examples (y(1), x(1)), (y(2), x(2)), . . . , (y(n), x(n)):
1. Partition the data into a training set Xtrain ∈ R^{ntrain×p}, ytrain ∈ R^{ntrain} and a validation set Xval ∈ R^{nval×p}, yval ∈ R^{nval}
2. Fit the model using the training set for every λ in a set Λ,
\[
\beta_{\text{ridge}}(\lambda) := \arg\min_{\beta} \| y_{\text{train}} - X_{\text{train}} \beta \|_2^2 + \lambda \| \beta \|_2^2,
\]
and evaluate the fitting error on the validation set,
\[
\text{err}(\lambda) := \| y_{\text{val}} - X_{\text{val}}\, \beta_{\text{ridge}}(\lambda) \|_2^2
\]
3. Choose the value of λ that minimizes the validation-set error,
\[
\lambda_{\text{cv}} := \arg\min_{\lambda \in \Lambda} \text{err}(\lambda)
\]
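A sketch of this procedure (my own, with made-up data and an arbitrary grid Λ):

```python
import numpy as np

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
n, p = 60, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Split into training and validation sets
X_train, y_train = X[:30], y[:30]
X_val, y_val = X[30:], y[30:]

# Fit for every lambda in the grid and keep the one with the smallest validation error
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
errs = [np.linalg.norm(y_val - X_val @ ridge(X_train, y_train, lam)) ** 2
        for lam in grid]
lam_cv = grid[int(np.argmin(errs))]
print(lam_cv)
```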
Prediction of house prices
Aim: predict the price of a house from
1. Area of the living room
2. Condition (integer between 1 and 5)
3. Grade (integer between 7 and 12)
4. Area of the house without the basement
5. Area of the basement
6. The year it was built
7. Latitude
8. Longitude
9. Average area of the living room of houses within 15 blocks
Prediction of house prices
Training data: 15 houses. Validation data: 15 houses. Test data: 15 houses.
Condition number of the training-data feature matrix: 9.94
We evaluate the relative fit ‖y − X βridge‖₂ / ‖y‖₂
Prediction of house prices
[Figure: coefficients and l2-norm cost on the training and validation sets as a function of the regularization parameter (10⁻³ to 10³).]
Prediction of house prices
Best λ: 0.27
Validation-set error: 0.672 (least squares: 0.906)
Test-set error: 0.799 (least squares: 1.186)
Training
[Figure: true vs. estimated price (dollars) on the training set, for least squares and ridge regression.]
Validation
[Figure: true vs. estimated price (dollars) on the validation set, for least squares and ridge regression.]
Test
[Figure: true vs. estimated price (dollars) on the test set, for least squares and ridge regression.]
Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
The classification problem
Goal: assign examples to one of several predefined categories
We have n examples of labels and corresponding features
(y(1), x(1)), (y(2), x(2)), . . . , (y(n), x(n))
Here we consider only two categories: the labels are 0 or 1
Logistic function
A smoothed version of the step function:
\[
g(t) := \frac{1}{1 + \exp(-t)}
\]
[Figure: plot of the logistic function for t between −8 and 8.]
Logistic regression
Generalized linear model: a linear model composed with an entrywise link function,
\[
y^{(i)} \approx g\left( \beta_0 + \langle x^{(i)}, \beta \rangle \right).
\]
Maximum likelihood
If y(1), . . . , y(n) are independent samples of Bernoulli random variables with parameter
\[
p_{y^{(i)}}(1) := g\left( \langle x^{(i)}, \beta \rangle \right),
\]
where x(1), . . . , x(n) ∈ R^p are known, then the ML estimate of β given y(1), . . . , y(n) is
\[
\beta_{\text{ML}} := \arg\max_{\beta} \sum_{i=1}^{n} y^{(i)} \log g\left( \langle x^{(i)}, \beta \rangle \right)
+ \left( 1 - y^{(i)} \right) \log\left( 1 - g\left( \langle x^{(i)}, \beta \rangle \right) \right)
\]
Maximum likelihood

\[
L(\beta) := p_{y^{(1)},\ldots,y^{(n)}}\left( y^{(1)}, \ldots, y^{(n)} \right)
= \prod_{i=1}^{n} g\left( \langle x^{(i)}, \beta \rangle \right)^{y^{(i)}}
\left( 1 - g\left( \langle x^{(i)}, \beta \rangle \right) \right)^{1-y^{(i)}}
\]
Logistic-regression estimator
\[
\beta_{\text{LR}} := \arg\max_{\beta} \sum_{i=1}^{n} y^{(i)} \log g\left( \langle x^{(i)}, \beta \rangle \right)
+ \left( 1 - y^{(i)} \right) \log\left( 1 - g\left( \langle x^{(i)}, \beta \rangle \right) \right)
\]
For a new x, the logistic-regression prediction is
\[
y_{\text{LR}} := \begin{cases} 1 & \text{if } g\left( \langle x, \beta_{\text{LR}} \rangle \right) \ge 1/2 \\ 0 & \text{otherwise} \end{cases}
\]
g(⟨x, βLR⟩) can be interpreted as the probability that the label is 1
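A compact sketch (my own) that maximizes this log-likelihood by gradient ascent, omitting the intercept for brevity; the data, step size, and iteration count are arbitrary choices.

```python
import numpy as np

def g(t):
    return 1.0 / (1.0 + np.exp(-t))        # logistic function

rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0])
y = (rng.random(n) < g(X @ beta_true)).astype(float)   # Bernoulli labels

# Gradient ascent on the log-likelihood: the gradient is X^T (y - g(X beta))
beta = np.zeros(p)
for _ in range(2000):
    beta += 0.01 * X.T @ (y - g(X @ beta))

print(beta)                                  # roughly recovers beta_true
predictions = (g(X @ beta) >= 0.5).astype(float)
print((predictions == y).mean())             # training accuracy
```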
Iris data set
Aim: classify flowers using sepal width and length
Two species, 5 examples each:
◮ Iris setosa (label 0): sepal lengths 5.4, 4.3, 4.8, 5.1 and 5.7, and sepal widths 3.7, 3, 3.1, 3.8 and 3.8
◮ Iris versicolor (label 1): sepal lengths 6.5, 5.7, 7, 6.3 and 6.1, and sepal widths 2.8, 2.8, 3.2, 2.3 and 2.8
Two new examples: (5.1, 3.5) and (5, 2)
Iris data set
After centering and normalizing,
βLR = (32.1, −29.6) and β0 = 2.06

i                      1       2       3       4       5       6       7       8       9      10
x(i)[1]             −0.12   −0.56   −0.36   −0.24    0.00    0.33    0.00    0.53    0.25    0.17
x(i)[2]              0.38   −0.09   −0.02    0.45    0.45   −0.22   −0.22    0.05   −0.55   −0.22
⟨x(i), βLR⟩ + β0    −12.9   −13.5    −8.9   −18.8   −11.0    19.1     8.7    17.7    26.3    13.9
g(⟨x(i), βLR⟩ + β0)  0.00    0.00    0.00    0.00    0.00    1.00    1.00    1.00    1.00    1.00
Iris data set
[Figure: logistic-regression output over the (sepal length, sepal width) plane, with the setosa and versicolor examples and the two new unlabeled points.]
Iris data set
[Figure: logistic-regression output over the (sepal width, petal length) plane for virginica and versicolor examples.]
Digit classification
MNIST data. Aim: distinguish one digit from another
◮ xi is an image of a 6 or a 9
◮ yi = 1 if image i is a 6, and yi = 0 if it is a 9
2000 training examples and 2000 test examples, each half 6s and half 9s
Training error rate: 0.0, test error rate: 0.006
Digit classification: β
[Figure: the coefficient vector β displayed as an image, with values ranging roughly from −0.4 to 0.4.]
Digit classification: True Positives
β^T x: 20.878, 18.217, 16.408   Probability of 6: 1.00, 1.00, 1.00
[Images of the corresponding digits.]

Digit classification: True Negatives
β^T x: −14.71, −15.829, −17.02   Probability of 6: 0.00, 0.00, 0.00
[Images of the corresponding digits.]

Digit classification: False Positives
β^T x: 7.612, 0.4341, 7.822   Probability of 6: 0.9995, 0.606, 0.9996
[Images of the corresponding digits.]

Digit classification: False Negatives
β^T x: −5.984, −2.384, −1.164   Probability of 6: 0.0025, 0.084, 0.238
[Images of the corresponding digits.]
Digit Classification
This is a toy problem: distinguishing one digit from another is very easy; it is much harder to classify any given digit. We used it to give insight into how logistic regression works. It turns out that, on this simplified problem, a very simple choice of β gives good results. Can you guess it?
Digit Classification
β = the average of the 6's minus the average of the 9's
Training error: 0.005, test error: 0.0035
[Figure: the difference of class averages displayed as an image.]
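A sketch of this baseline (mine; the MNIST loading is left out, so the image arrays below are random stand-ins for the real 6s and 9s, and the threshold would need calibrating on real data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for flattened 28x28 MNIST images; replace with the real 6s and 9s
images_6 = rng.standard_normal((1000, 784)) + 0.5
images_9 = rng.standard_normal((1000, 784)) - 0.5

# The guessed coefficient vector: average 6 minus average 9
beta = images_6.mean(axis=0) - images_9.mean(axis=0)

def predict_is_6(x, threshold=0.0):
    # Classify by correlating the image with beta
    return x @ beta > threshold

test = np.vstack([images_6[:10], images_9[:10]])
labels = np.array([1] * 10 + [0] * 10)
print((predict_is_6(test) == labels).mean())    # accuracy on the stand-in data
```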