
COMS 4721: Machine Learning for Data Science, Lecture 2, 1/19/2017



1. COMS 4721: Machine Learning for Data Science, Lecture 2, 1/19/2017. Prof. John Paisley, Department of Electrical Engineering & Data Science Institute, Columbia University.

2. LINEAR REGRESSION

3. EXAMPLE: OLD FAITHFUL

4. EXAMPLE: OLD FAITHFUL. [Scatter plot: Waiting Time (min) vs. Current Eruption Time (min) for the Old Faithful geyser.] Can we meaningfully predict the time between eruptions only using the duration of the last eruption?

5. EXAMPLE: OLD FAITHFUL. [Same scatter plot as the previous slide.] Can we meaningfully predict the time between eruptions only using the duration of the last eruption?

6. EXAMPLE: OLD FAITHFUL. [Scatter plot of Waiting Time (min) vs. Current Eruption Time (min) with a fitted line.] One model for this: (wait time) $\approx w_0 +$ (last duration) $\times\, w_1$. ◮ $w_0$ and $w_1$ are to be learned. ◮ This is an example of linear regression. Refresher: $w_1$ is the slope; $w_0$ is called the intercept, bias, shift, or offset.
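To make the slide's one-input model concrete, here is a minimal sketch in Python/NumPy. The (duration, waiting) values below are made up for illustration and merely stand in for the Old Faithful data; `np.polyfit` is just one convenient way to obtain the least squares line.

```python
import numpy as np

# Hypothetical (duration, waiting-time) pairs standing in for the
# Old Faithful data; these numbers are invented for illustration.
duration = np.array([1.8, 2.3, 3.5, 4.1, 4.6])       # last eruption (min)
waiting  = np.array([54.0, 58.0, 74.0, 81.0, 86.0])  # time to next (min)

# Fit (wait time) ~ w0 + (last duration) * w1 by least squares.
# polyfit returns coefficients highest degree first: [w1, w0].
w1, w0 = np.polyfit(duration, waiting, deg=1)

predicted = w0 + 4.0 * w1   # predicted wait after a 4-minute eruption
print(f"w0 = {w0:.2f}, w1 = {w1:.2f}, prediction = {predicted:.1f} min")
```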

7. HIGHER DIMENSIONS. Two inputs: (output) $\approx w_0 +$ (input 1) $\times\, w_1 +$ (input 2) $\times\, w_2$. With two inputs the intuition is the same: $y = w_0 + x_1 w_1 + x_2 w_2$. [3-D plot of the regression plane over axes $x_1$ and $x_2$, with vertical axis $y$.]

8. REGRESSION: PROBLEM DEFINITION. Data. Input: $x \in \mathbb{R}^d$ (i.e., measurements, covariates, features, independent variables). Output: $y \in \mathbb{R}$ (i.e., response, dependent variable). Goal: find a function $f : \mathbb{R}^d \to \mathbb{R}$ such that $y \approx f(x; w)$ for the data pair $(x, y)$. $f(x; w)$ is called a regression function; its free parameters are $w$. Definition of linear regression: a regression method is called linear if the prediction $f$ is a linear function of the unknown parameters $w$.

9. LEAST SQUARES LINEAR REGRESSION MODEL. Model: the linear regression model we focus on now has the form $y_i \approx f(x_i; w) = w_0 + \sum_{j=1}^{d} x_{ij} w_j$. Model learning: we have the set of training data $(x_1, y_1), \ldots, (x_n, y_n)$. We want to use this data to learn a $w$ such that $y_i \approx f(x_i; w)$. But we first need an objective function to tell us what a "good" value of $w$ is. Least squares: the least squares objective tells us to pick the $w$ that minimizes the sum of squared errors, $w_{LS} = \arg\min_w \sum_{i=1}^{n} (y_i - f(x_i; w))^2 \equiv \arg\min_w \mathcal{L}$.
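The least squares objective transcribes directly into code. A minimal sketch, assuming NumPy and the hypothetical function name below (not from the slides):

```python
import numpy as np

def sum_squared_errors(X, y, w0, w):
    """L = sum_i (y_i - f(x_i; w))^2 with f(x_i; w) = w0 + x_i . w.

    X : (n, d) array of inputs, y : (n,) responses,
    w0 : scalar intercept, w : (d,) weight vector.
    """
    residuals = y - (w0 + X @ w)
    return float(np.sum(residuals ** 2))
```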

10. LEAST SQUARES IN PICTURES. [3-D plot: data points above and below a fitted plane, with vertical segments connecting each point to the plane.] Observations: each vertical length is an error. The objective function $\mathcal{L}$ is the sum of all the squared lengths. Find weights $(w_1, w_2)$ plus an offset $w_0$ to minimize $\mathcal{L}$; $(w_0, w_1, w_2)$ defines this plane.
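Reading the picture literally: each point's error is its signed vertical distance to the plane, and $\mathcal{L}$ sums the squares. A toy sketch with entirely hypothetical numbers:

```python
import numpy as np

w0, w1, w2 = 1.0, 0.5, -0.2            # a hypothetical plane
X = np.array([[1.0, 2.0],
              [0.5, 1.5],
              [2.0, 0.0]])             # (x1, x2) for three points
y = np.array([2.1, 0.9, 1.8])

# Signed vertical lengths from each point to the plane.
vertical_lengths = y - (w0 + X @ np.array([w1, w2]))
L = np.sum(vertical_lengths ** 2)      # the objective from the slide
print(L)
```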

11. EXAMPLE: EDUCATION, SENIORITY AND INCOME. A 2-dimensional problem. Input: (education, seniority) $\in \mathbb{R}^2$. Output: (income) $\in \mathbb{R}$. Model: (income) $\approx w_0 +$ (education)$\,w_1 +$ (seniority)$\,w_2$. Question: both $w_1, w_2 > 0$; what does this tell us? Answer: as education and/or seniority goes up, income tends to go up. (Caveat: this is a statement about correlation, not causation.)

12. LEAST SQUARES LINEAR REGRESSION MODEL. Thus far: we have data pairs $(x_i, y_i)$ of measurements $x_i \in \mathbb{R}^d$ and a response $y_i \in \mathbb{R}$. We believe there is a linear relationship between $x_i$ and $y_i$, $y_i = w_0 + \sum_{j=1}^{d} x_{ij} w_j + \epsilon_i$, and we want to minimize the objective function $\mathcal{L} = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \big(y_i - w_0 - \sum_{j=1}^{d} x_{ij} w_j\big)^2$ with respect to $(w_0, w_1, \ldots, w_d)$. Can math notation make this easier to look at/work with?

13. NOTATION: VECTORS AND MATRICES. We think of data with $d$ dimensions as a column vector, $x_i = [x_{i1}, x_{i2}, \ldots, x_{id}]^T$ (e.g., age, height, ..., income). A set of $n$ vectors can be stacked into a matrix: $X = \begin{bmatrix} -\, x_1^T \,- \\ -\, x_2^T \,- \\ \vdots \\ -\, x_n^T \,- \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{1d} \\ x_{21} & \cdots & x_{2d} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nd} \end{bmatrix}$. Assumptions for now: ◮ All features are treated as continuous-valued ($x \in \mathbb{R}^d$). ◮ We have more observations than dimensions ($d < n$).
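In code, stacking each $x_i^T$ as a row of $X$ looks like the following sketch (NumPy, with made-up feature values; the toy shape is chosen so the slide's $d < n$ assumption holds):

```python
import numpy as np

# Four hypothetical observations, each a d = 3 feature vector
# (e.g., age, height, income as in the slide's example).
x1 = np.array([23.0, 1.70, 41000.0])
x2 = np.array([35.0, 1.62, 52000.0])
x3 = np.array([51.0, 1.80, 67000.0])
x4 = np.array([44.0, 1.75, 60000.0])

# Each x_i^T becomes one row, so X has shape (n, d) = (4, 3): d < n.
X = np.vstack([x1, x2, x3, x4])
```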

14. NOTATION: REGRESSION (AND CLASSIFICATION). Usually, for linear regression (and classification) we include an intercept term $w_0$ that doesn't interact with any element in the vector $x \in \mathbb{R}^d$. It will be convenient to attach a 1 to the first dimension of each vector $x_i$ (which we indicate by $x_i \in \mathbb{R}^{d+1}$) and in the first column of the matrix $X$: $x_i = [1, x_{i1}, x_{i2}, \ldots, x_{id}]^T$, $X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1d} \\ 1 & x_{21} & \cdots & x_{2d} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nd} \end{bmatrix}$. We also now view $w = [w_0, w_1, \ldots, w_d]^T$ as $w \in \mathbb{R}^{d+1}$.
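This intercept trick is one line in NumPy. A sketch with a hypothetical helper name:

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones to X so that w[0] acts as the offset w_0.

    X : (n, d) array -> (n, d + 1) array whose first column is all ones.
    """
    return np.hstack([np.ones((X.shape[0], 1)), X])
```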

15. LEAST SQUARES IN VECTOR FORM. Original least squares objective function: $\mathcal{L} = \sum_{i=1}^{n} \big(y_i - w_0 - \sum_{j=1}^{d} x_{ij} w_j\big)^2$. Using vectors, this can now be written: $\mathcal{L} = \sum_{i=1}^{n} (y_i - x_i^T w)^2$. Least squares solution (vector version): we can find $w$ by setting $\nabla_w \mathcal{L} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \nabla_w \big(y_i^2 - 2 w^T x_i y_i + w^T x_i x_i^T w\big) = 0$. Solving gives $-2 \sum_{i=1}^{n} y_i x_i + 2 \Big(\sum_{i=1}^{n} x_i x_i^T\Big) w = 0 \;\Rightarrow\; w_{LS} = \Big(\sum_{i=1}^{n} x_i x_i^T\Big)^{-1} \Big(\sum_{i=1}^{n} y_i x_i\Big)$.
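In matrix form this closed-form solution is $w_{LS} = (X^T X)^{-1} X^T y$, since $\sum_i x_i x_i^T = X^T X$ and $\sum_i y_i x_i = X^T y$ when the rows of $X$ are the $x_i^T$. A minimal sketch, assuming $X$ already carries the leading column of ones and that $X^T X$ is invertible (which requires $X$ to have full column rank):

```python
import numpy as np

def least_squares(X, y):
    """w_LS = (X^T X)^{-1} X^T y, the vector-form solution from the slide."""
    # Solve the normal equations directly; np.linalg.solve is more
    # numerically stable than forming an explicit matrix inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)
```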
