COMS 4721: Machine Learning for Data Science Lecture 2, 1/19/2017
- Prof. John Paisley
Department of Electrical Engineering & Data Science Institute Columbia University
LINEAR REGRESSION
EXAMPLE: OLD FAITHFUL
[Scatter plot: Current Eruption Time (min) vs. Waiting Time (min)]
Can we meaningfully predict the time between eruptions only using the duration of the last eruption?
(wait time) ≈ w0 + (last duration) × w1
◮ w0 and w1 are to be learned.
◮ This is an example of linear regression.
w1 is the slope; w0 is called the intercept (also bias, shift, or offset).
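As a concrete sketch (not part of the slides), the fit can be computed with numpy's least squares routine; the numbers below are made-up stand-ins for the Old Faithful measurements.

```python
import numpy as np

# Hypothetical (duration, wait) pairs standing in for the Old Faithful data.
duration = np.array([3.6, 1.8, 3.3, 2.3, 4.5, 2.9])    # last eruption time (min)
wait = np.array([79.0, 54.0, 74.0, 62.0, 85.0, 68.0])  # waiting time (min)

# Design matrix with a column of 1s so that w[0] is the intercept w0.
X = np.column_stack([np.ones_like(duration), duration])

# Least squares fit of (wait time) ≈ w0 + (last duration) * w1.
w0, w1 = np.linalg.lstsq(X, wait, rcond=None)[0]
print(f"wait ≈ {w0:.1f} + {w1:.1f} * duration")
```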
(output) ≈ w0 + (input 1) × w1 + (input 2) × w2
With two inputs the intuition is the same:
y = w0 + x1w1 + x2w2
[Figure: a plane fit to data in (x1, x2, y) space]
Input: x ∈ Rd (i.e., measurements, covariates, features, independent variables)
Output: y ∈ R (i.e., response, dependent variable)
Find a function f : Rd → R such that y ≈ f(x; w) for the data pair (x, y). f(x; w) is called a regression function. Its free parameters are w.
A regression method is called linear if the prediction f is a linear function of the unknown parameters w.
The linear regression model we focus on now has the form
yi ≈ f(xi; w) = w0 + ∑_{j=1}^d xij wj.
We have the set of training data (x1, y1) . . . (xn, yn). We want to use this data to learn a w such that yi ≈ f(xi; w). But we first need an objective function to tell us what a “good” value of w is.
The least squares objective tells us to pick the w that minimizes the sum of squared errors:
wLS = arg min_w ∑_{i=1}^n (yi − f(xi; w))^2 ≡ arg min_w L.
Vertical length is error. The objective function L is the sum of all the squared lengths. Find weights (w1, w2) plus an intercept w0; (w0, w1, w2) defines this plane.
Input: (education, seniority) ∈ R2. Output: (income) ∈ R
Model: (income) ≈ w0 + (education)w1 + (seniority)w2
Question: Both w1, w2 > 0. What does this tell us?
Answer: As education and/or seniority goes up, income tends to go up. (Caveat: This is a statement about correlation, not causation.)
We have data pairs (xi, yi) of measurements xi ∈ Rd and a response yi ∈ R. We believe there is a linear relationship between xi and yi,
yi = w0 + ∑_{j=1}^d xij wj + εi,
and we want to minimize the objective function
L = ∑_{i=1}^n εi^2 = ∑_{i=1}^n (yi − w0 − ∑_{j=1}^d xij wj)^2
with respect to (w0, w1, . . . , wd). Can math notation make this easier to look at/work with?
We think of data with d dimensions as a column vector:
xi = [xi1, xi2, . . . , xid]^T   (e.g., [age, height, . . . , income]^T)
A set of n vectors can be stacked into a matrix, one vector per row:
X = [x11 . . . x1d; x21 . . . x2d; . . . ; xn1 . . . xnd]   (row i is xi^T)
Assumptions for now:
◮ All features are treated as continuous-valued (x ∈ Rd)
◮ We have more observations than dimensions (d < n)
Usually, for linear regression (and classification) we include an intercept term w0 that doesn’t interact with any element in the vector x ∈ Rd. It will be convenient to attach a 1 to the first dimension of each vector xi (which we indicate by xi ∈ Rd+1) and in the first column of the matrix X:
xi = [1, xi1, xi2, . . . , xid]^T,   X = [1 x11 . . . x1d; 1 x21 . . . x2d; . . . ; 1 xn1 . . . xnd]   (row i is [1, xi^T]).
We also now view w = [w0, w1, . . . , wd]^T as w ∈ Rd+1.
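A minimal numpy sketch of this augmentation, assuming the raw measurements are already stacked as an n × d array:

```python
import numpy as np

def add_intercept(X_raw):
    """Attach a leading column of 1s so that w[0] plays the role of the intercept w0."""
    return np.column_stack([np.ones(X_raw.shape[0]), X_raw])

# Example: n = 4 observations of d = 2 features.
X_raw = np.array([[2.0, 5.0],
                  [1.0, 3.0],
                  [4.0, 8.0],
                  [3.0, 6.0]])
X = add_intercept(X_raw)   # shape (4, 3); each row is [1, xi1, xi2]
print(X)
```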
Original least squares objective function: L = ∑_{i=1}^n (yi − w0 − ∑_{j=1}^d xij wj)^2
Using vectors, this can now be written: L = ∑_{i=1}^n (yi − xi^T w)^2
We can find w by setting the gradient to zero:
∇w L = 0 ⇒ ∑_{i=1}^n ∇w (yi^2 − 2 w^T xi yi + w^T xi xi^T w) = 0.
Solving gives
−∑_{i=1}^n 2 yi xi + (∑_{i=1}^n 2 xi xi^T) w = 0 ⇒ wLS = (∑_{i=1}^n xi xi^T)^{-1} ∑_{i=1}^n yi xi.
Least squares in matrix form is even cleaner. Start by organizing the yi in a column vector, y = [y1, . . . , yn]^T. Then
L = ∑_{i=1}^n (yi − xi^T w)^2 = ‖y − Xw‖^2 = (y − Xw)^T (y − Xw).
If we take the gradient with respect to w, we find that
∇w L = 2 X^T X w − 2 X^T y = 0 ⇒ wLS = (X^T X)^{-1} X^T y.
The sum form and the matrix form of the solution agree. Writing X^T = [x1 x2 · · · xn] (the xi as columns),
X^T y = [x1 x2 · · · xn] [y1, y2, . . . , yn]^T = y1 x1 + y2 x2 + · · · + yn xn = ∑_{i=1}^n yi xi
X^T X = [x1 x2 · · · xn] X = x1 x1^T + x2 x2^T + · · · + xn xn^T = ∑_{i=1}^n xi xi^T
Therefore
wLS = (∑_{i=1}^n xi xi^T)^{-1} ∑_{i=1}^n yi xi ⇒ wLS = (X^T X)^{-1} X^T y.
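A quick numerical check of this equivalence (a sketch on random data, not from the lecture); both forms produce the same wLS:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # intercept column included
y = rng.normal(size=n)

# Matrix form: wLS = (X^T X)^{-1} X^T y (solve() avoids forming an explicit inverse).
w_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# Sum form: wLS = (sum_i xi xi^T)^{-1} (sum_i yi xi).
A = sum(np.outer(xi, xi) for xi in X)       # sum_i xi xi^T
b = sum(yi * xi for xi, yi in zip(X, y))    # sum_i yi xi
w_sum = np.linalg.solve(A, b)

print(np.allclose(w_matrix, w_sum))   # True
```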
We use wLS to make predictions. Given xnew, the least squares prediction for ynew is
ynew ≈ xnew^T wLS.
Calculating wLS = (X^T X)^{-1} X^T y assumes (X^T X)^{-1} exists.
When doesn’t it exist? Answer: When X^T X is not a full-rank matrix.
When is X^T X full rank? Answer: When the n × (d + 1) matrix X has at least d + 1 linearly independent rows. This means that any point in Rd+1 can be reached by a weighted combination of d + 1 rows of X.
Obviously if n < d + 1, we can’t do least squares. If (X^T X)^{-1} doesn’t exist, there are an infinite number of possible solutions.
Takeaway: We want n ≫ d (i.e., X is “tall and skinny”).
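A small sketch (not from the slides) showing prediction with wLS and a rank check on an intentionally duplicated column; the data are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = rng.normal(size=n)
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction for a new input (with the leading 1 attached).
x_new = np.array([1.0, 0.5, -1.2])
y_new = x_new @ w_ls

# If one column of X exactly repeats another, X^T X is no longer full rank.
X_bad = np.column_stack([X, X[:, 1]])
print(np.linalg.matrix_rank(X_bad.T @ X_bad), X_bad.shape[1])  # rank 3 < 4 columns
```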
[Plots of the same (x, y) data with two fits:]
1st order: y = w0 + w1x
3rd order: y = w0 + w1x + w2x^2 + w3x^3
A regression method is called linear if the prediction f is a linear function of the unknown parameters w.
◮ Therefore, a function such as y = w0 + w1x + w2x^2 is linear in w. The LS solution is the same, only the preprocessing is different.
◮ E.g., let (x1, y1), . . . , (xn, yn) be the data, x ∈ R, y ∈ R. For a pth-order polynomial approximation, construct the matrix
X = [1 x1 x1^2 . . . x1^p; 1 x2 x2^2 . . . x2^p; . . . ; 1 xn xn^2 . . . xn^p]
◮ Then solve exactly as before: wLS = (X^T X)^{-1} X^T y.
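A sketch of this preprocessing for scalar inputs (the data below are synthetic); the least squares step itself is unchanged:

```python
import numpy as np

def poly_design(x, p):
    """Rows are [1, xi, xi^2, ..., xi^p] for scalar inputs xi."""
    return np.vander(x, N=p + 1, increasing=True)

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 3.0, size=30)
y = 1.0 + 2.0 * x - 0.5 * x**3 + rng.normal(scale=0.5, size=30)   # toy data

X = poly_design(x, p=3)                     # shape (30, 4)
w_ls = np.linalg.solve(X.T @ X, X.T @ y)    # solve exactly as before
print(w_ls)
```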
The width of X grows as (order) × (dimensions) + 1.
2nd order: yi = w0 + w1 xi1 + w2 xi2 + w3 xi1^2 + w4 xi2^2
3rd order: yi = w0 + w1 xi1 + w2 xi2 + w3 xi1^2 + w4 xi2^2 + w5 xi1^3 + w6 xi2^3
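As a sketch, the 2nd-order expansion for two-dimensional inputs might look like this (column order follows the equation above; the data are made up):

```python
import numpy as np

def second_order_design(X_raw):
    """Rows are [1, xi1, xi2, xi1^2, xi2^2] for inputs xi in R^2."""
    x1, x2 = X_raw[:, 0], X_raw[:, 1]
    return np.column_stack([np.ones(len(X_raw)), x1, x2, x1**2, x2**2])

X_raw = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.3], [1.2, 2.2], [0.8, 1.0]])
print(second_order_design(X_raw).shape)   # (5, 5): width = (order) * (dimensions) + 1
```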
[Plots: (a) 1st-order fit, (b) 3rd-order fit]
More generally, for xi ∈ Rd+1 least squares linear regression can be performed on functions f(xi; w) of the form
yi ≈ f(xi; w) = ∑_{s=1}^S gs(xi) ws.
For example,
gs(xi) = xij^2,   gs(xi) = log xij,   gs(xi) = I(xij < a),   gs(xi) = I(xij < xij′).
As long as the function is linear in w1, . . . , wS, we can construct the matrix X by putting the transformed xi on row i, and solve wLS = (X^T X)^{-1} X^T y. One caveat is that, as the number of functions increases, we need more data to avoid overfitting.
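A sketch with a few hand-picked gs functions; the particular transforms, the threshold a, and the positivity assumption that keeps the log defined are all illustrative choices, not from the lecture:

```python
import numpy as np

def feature_map(x, a=2.0):
    """Apply example g_s functions to one input x in R^2 (entries must be positive for the log)."""
    x1, x2 = x
    return np.array([1.0,               # intercept
                     x1, x2,            # raw features
                     x1 ** 2,           # g(x) = x1^2
                     np.log(x2),        # g(x) = log x2
                     float(x1 < a),     # g(x) = I(x1 < a)
                     float(x1 < x2)])   # g(x) = I(x1 < x2)

rng = np.random.default_rng(3)
X_raw = rng.uniform(0.5, 5.0, size=(40, 2))
y = rng.normal(size=40)

X = np.vstack([feature_map(row) for row in X_raw])   # transformed xi on row i
w_ls = np.linalg.solve(X.T @ X, X.T @ y)             # one weight per feature function
print(w_ls.shape)
```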
Thinking geometrically about least squares regression helps a lot.
◮ We want to minimize ‖y − Xw‖^2. Think of the vector y as a point in Rn. We want to find w in order to get the product Xw close to y.
◮ If Xj is the jth column of X, then Xw = ∑_{j=1}^{d+1} wj Xj.
◮ That is, we weight the columns in X by values in w to approximate y.
◮ The LS solution returns w such that Xw is as close to y as possible in the Euclidean sense (i.e., intuitive “direct-line” distance).
arg min_w ‖y − Xw‖^2 ⇒ wLS = (X^T X)^{-1} X^T y.
The columns of X define a (d + 1)-dimensional subspace in the higher-dimensional Rn. The closest point in that subspace is the projection of y onto the column space of X.
Right: y ∈ R3 and data xi ∈ R, with X1 = [1, 1, 1]^T and X2 = [x1, x2, x3]^T. The approximation is ŷ = X wLS = X (X^T X)^{-1} X^T y.
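A tiny sketch of this picture with made-up numbers for the n = 3 case on the right: ŷ lies in the span of the two columns, and the residual y − ŷ is orthogonal to both.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])           # scalar data x1, x2, x3
y = np.array([2.0, 3.0, 7.0])

X = np.column_stack([np.ones(3), x])    # columns X1 = [1,1,1]^T and X2 = [x1,x2,x3]^T
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ w_ls                        # y_hat = X (X^T X)^{-1} X^T y, the projection of y

print(np.allclose(X.T @ (y - y_hat), 0.0))   # residual is orthogonal to the columns -> True
```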
[Figure: the plane y = w0 + x1w1 + x2w2 in (x1, x2, y) space]
(a) yi ≈ w0 + xi^T w for i = 1, . . . , n
(b) y ≈ Xw
There are some key differences between (a) and (b) worth highlighting as you try to develop the corresponding intuitions.
(a) Can be shown for all n, but only for xi ∈ R2 (not counting the added 1).
(b) This corresponds to n = 3 and one-dimensional data: X = [1 x1; 1 x2; 1 x3].