Applied Machine Learning
Linear Regression
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
1
Learning objectives: the linear model; evaluation criteria; how to find the optimal parameters.
2
Effect of income inequality on health and social problems
source: http://chrisauld.com/2012/10/07/what-do-we-know-about-the-effect-of-income-inequality-on-health/
History: the method of least squares was invented by Legendre and Gauss (1800s). Gauss, at age 24, used it to predict the future location of Ceres (the largest asteroid in the asteroid belt).
3.1
Vectors are assumed to be column vectors: $x = [x_1, x_2, \ldots, x_D]^\top$.
We assume N instances in the dataset $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$.
Each instance has D features, indexed by d; for example, $x_d^{(n)} \in \mathbb{R}$ is feature d of instance n.
4.1
$X = \begin{bmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(N)\top} \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_D^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_D^{(N)} \end{bmatrix}$
Each row of the design matrix X is a datapoint; each column is a feature.
4.2
Example: microarray data. X contains gene expression levels, with one row per patient (n) and one column per gene (d); the label y for each patient can be {cancer / no cancer}.
4.3
Linear model (scalar output for now; we will generalize y to a vector later):
$\hat{y} = f_w(x) = w_0 + w_1 x_1 + \cdots + w_D x_D$
$w_1, \ldots, w_D$ are the model parameters or weights; $w_0$ is the bias or intercept.
Absorbing the bias by prepending a constant feature, $x = [1, x_1, \ldots, x_D]^\top$, gives $\hat{y} = w^\top x$:

yh_n = np.dot(w, x)  # prediction for one instance (x includes the constant 1)
5
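A small illustration of the bias trick (a sketch with made-up numbers; X_raw and the weight values are hypothetical):

import numpy as np

# hypothetical toy data: N = 4 instances, D = 2 features
X_raw = np.array([[2., 3.], [1., 0.], [4., 1.], [0., 5.]])
X = np.column_stack([np.ones(len(X_raw)), X_raw])  # prepend x_0 = 1 for the bias
w = np.array([0.5, 1.0, -2.0])                     # [w_0, w_1, w_2]
yh = X @ w                                         # predictions for all N instances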
Minimize a measure of difference between $\hat{y}^{(n)} = f_w(x^{(n)})$ and $y^{(n)}$.
Squared error loss (a.k.a. L2 loss), for a single instance (a function of the labels):
$\frac{1}{2}\left(\hat{y}^{(n)} - y^{(n)}\right)^2$ (the factor $\frac{1}{2}$ is for future convenience)
For the whole dataset, the sum of squared errors cost function:
$J(w) = \frac{1}{2}\sum_{n=1}^{N}\left(y^{(n)} - w^\top x^{(n)}\right)^2$
loss (per instance) versus cost (whole dataset)
6.1
[Figure: datapoints $(x^{(1)}, y^{(1)}), \ldots, (x^{(4)}, y^{(4)})$ with the fitted line $\hat{y} = w_0^* + w_1^* x$; the vertical distance of $(x^{(3)}, y^{(3)})$ from the line is its residual.]
$w^* = \arg\min_w \frac{1}{2}\sum_n \left(y^{(n)} - w^\top x^{(n)}\right)^2$
6.2
With two features ($x \in \mathbb{R}^2$) the optimal fit $\hat{y} = w_0^* + w_1^* x_1 + w_2^* x_2$ is a plane, again with $w^* = \arg\min_w \frac{1}{2}\sum_n (y^{(n)} - w^\top x^{(n)})^2$.
6.3
Vectorizing: $w^\top x^{(n)}$ is $(1 \times D)(D \times 1) \in \mathbb{R}$, and stacking all instances, $Xw$ is $(N \times D)(D \times 1) = N \times 1$, so
$J(w) = \frac{1}{2}\lVert y - Xw \rVert_2^2 = \frac{1}{2}(y - Xw)^\top (y - Xw)$
i.e., half the squared L2 norm of the residual vector.

yh = np.dot(X, w)
cost = np.sum((yh - y)**2)/2.  # or: cost = np.mean((yh - y)**2)/2.
7
Weight space versus data space (image: Grosse, Farahmand, Carrasquilla).
The objective is a smooth function of w: find the minimum by setting the partial derivatives to zero.
8.1
Assume first that both x and y are scalars:
$J(w) = \frac{1}{2}\sum_n \left(w x^{(n)} - y^{(n)}\right)^2$
$\frac{dJ}{dw} = \sum_n x^{(n)}\left(w x^{(n)} - y^{(n)}\right) = 0 \;\Rightarrow\; w = \frac{\sum_n x^{(n)} y^{(n)}}{\sum_n x^{(n)2}}$
This is the global minimum because the cost is smooth and convex (more on convexity later).
8.2
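A minimal numpy sketch of this scalar closed form (the synthetic data and true slope 3.0 are my own choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + rng.normal(0, 1, size=100)  # noisy line through the origin

w = np.sum(x * y) / np.sum(x ** 2)        # the closed form above
print(w)                                  # close to 3.0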
For a multivariate function $J(w_1, \ldots, w_D)$ we use partial derivatives instead of the derivative:
$\frac{\partial}{\partial w_1} J(w_1, w_2) \triangleq \lim_{\epsilon \to 0} \frac{J(w_1 + \epsilon, w_2) - J(w_1, w_2)}{\epsilon}$
(a partial derivative is the derivative when the other variables are held fixed)
Critical point: all partial derivatives are zero.
Gradient: the vector of all partial derivatives, $\nabla J = \left[\frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_D}\right]^\top$.
8.3
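The limit definition suggests a numerical sanity check (a sketch; J and num_grad are hypothetical helper names):

import numpy as np

def J(w, X, y):                     # the sum-of-squared-errors cost
    return 0.5 * np.sum((X @ w - y) ** 2)

def num_grad(w, X, y, eps=1e-6):    # finite-difference partial derivatives
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (J(w + e, X, y) - J(w, X, y)) / eps
    return g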
Setting the partial derivatives $\frac{\partial}{\partial w_i} J(w)$ to zero (the cost is a smooth and convex function of w):
$\frac{\partial}{\partial w_i} J(w) = \frac{\partial}{\partial w_i} \frac{1}{2}\sum_n \left(y^{(n)} - w^\top x^{(n)}\right)^2$
Using the chain rule, $\frac{\partial J}{\partial w_i} = \frac{dJ}{df_w}\frac{\partial f_w}{\partial w_i}$, we get
$\sum_n \left(w^\top x^{(n)} - y^{(n)}\right) x_i^{(n)} = 0$
8.4
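Stacking all D such equations gives the gradient in one vectorized line (a sketch with arbitrary toy values; it should match num_grad above):

import numpy as np

X = np.random.normal(size=(5, 3))  # toy design matrix
y = np.random.normal(size=5)
w = np.zeros(3)
grad = X.T @ (X @ w - y)           # entry d equals sum_n (w^T x^(n) - y^(n)) x_d^(n)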
$\sum_n \left(w^\top x^{(n)} - y^{(n)}\right) x_d^{(n)} = 0$ for every d: a system of D linear equations.
Matrix form (using the design matrix): $X^\top (y - Xw) = 0$, with $X^\top$ of size $D \times N$ and $(y - Xw)$ of size $N \times 1$; each row enforces one of the D equations.
These are the normal equations: for the optimal w, the residual vector is normal (orthogonal) to the column space of the design matrix.
8.5
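This orthogonality can be checked numerically (a sketch; the toy data is my own):

import numpy as np

X = np.random.normal(size=(20, 3))
y = np.random.normal(size=20)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(X.T @ (y - X @ w_star))      # ~ zeros: residual is normal to every column of X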
We can get a closed-form solution!
$X^\top (y - Xw) = 0 \;\Rightarrow\; X^\top X w = X^\top y \;\Rightarrow\; w^* = (X^\top X)^{-1} X^\top y$
with dimensions $D \times D$, $D \times N$, and $N \times 1$ for $(X^\top X)^{-1}$, $X^\top$, and $y$.
$(X^\top X)^{-1} X^\top$ is the pseudo-inverse of X, and $X (X^\top X)^{-1} X^\top$ is the projection matrix onto the column space of X.

w = np.linalg.lstsq(X, y)[0]
8.6
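A quick sketch checking the closed form against the library solver (synthetic data and true weights are my own choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1., -2., .5]) + rng.normal(0, .1, size=50)

w_closed = np.linalg.solve(X.T @ X, X.T @ y)    # solve the normal equations
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # library least squares
print(np.allclose(w_closed, w_lstsq))           # True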
Complexity of $w^* = (X^\top X)^{-1} X^\top y$:
$X^\top y$ is $O(ND)$: D elements, each using N operations.
$X^\top X$ is $O(D^2 N)$: D x D elements, each requiring N multiplications.
Matrix inversion is $O(D^3)$.
Total complexity for N > D: $O(ND^2 + D^3)$.
In practice we don't directly use matrix inversion (it is numerically unstable).
9
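A rough illustration of the $O(ND^2)$ term (a sketch; the sizes are arbitrary and timings are machine-dependent):

import time
import numpy as np

N = 10_000
for D in (100, 200, 400):                # doubling D should roughly 4x the time
    X = np.random.normal(size=(N, D))
    t0 = time.perf_counter()
    G = X.T @ X                          # the O(N D^2) step
    print(D, time.perf_counter() - t0)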
Multiple targets: instead of $y \in \mathbb{R}^{N}$ we have $Y \in \mathbb{R}^{N \times D'}$, so the model is $\hat{Y} = XW$ with $(N \times D)(D \times D') = N \times D'$.
This is a different weight vector for each target: each column of Y is associated with a column of W.
$W^* = (X^\top X)^{-1} X^\top Y$, with dimensions $D \times D$, $D \times N$, and $N \times D'$.

w = np.linalg.lstsq(X, Y)[0]
10
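A shape check for the multi-target case (the sizes, including D' = 2, are arbitrary):

import numpy as np

N, D, Dp = 50, 3, 2                      # Dp targets per instance
X = np.random.normal(size=(N, D))
Y = np.random.normal(size=(N, Dp))
W = np.linalg.lstsq(X, Y, rcond=None)[0]
print(W.shape)                           # (3, 2): one weight column per column of Y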
So far we learned a linear function $f_w(x) = \sum_d w_d x_d$. Nothing changes if we use nonlinear bases:
$f_w(x) = \sum_d w_d \phi_d(x)$
where each $\phi_d$ is a (nonlinear) feature. The solution simply becomes $w^* = (\Phi^\top \Phi)^{-1} \Phi^\top y$, replacing X with
$\Phi = \begin{bmatrix} \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_D(x^{(1)}) \\ \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_D(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x^{(N)}) & \phi_2(x^{(N)}) & \cdots & \phi_D(x^{(N)}) \end{bmatrix}$
11.1
Some choices of bases:
polynomial bases $\phi_k(x) = x^k$
Gaussian bases $\phi_k(x) = \exp\left(-\frac{(x - \mu_k)^2}{s^2}\right)$
sigmoid bases $\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$
11.2
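For instance, a degree-3 polynomial fit follows the same recipe (a sketch; the toy data and degree are my own choices):

import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(0, .1, 100)       # toy 1D data

Phi = np.stack([x**k for k in range(4)], axis=1)   # bases 1, x, x^2, x^3
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
yh = Phi @ w                                       # fitted curve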
Fitting with Gaussian bases $\phi_k(x) = \exp\left(-\frac{(x - \mu_k)^2}{s^2}\right)$:

import numpy as np
import matplotlib.pyplot as plt

# x: N   y: N   (the 1D training data)
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)
mu = np.linspace(0, 10, 10)          # 10 Gaussian bases
Phi = phi(x[:, None], mu[None, :])   # N x 10
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

[Plot: the noisy data, the curve before adding noise, and the fitted curve.]
11.3
Likewise for sigmoid bases $\phi_k(x) = \frac{1}{1 + e^{-\frac{x - \mu_k}{s}}}$:

# x: N   y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: 1/(1 + np.exp(-(x - mu)))
mu = np.linspace(0, 10, 10)          # 10 sigmoid bases
Phi = phi(x[:, None], mu[None, :])   # N x 10
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')
11.4
Practical issues with $W^* = (X^\top X)^{-1} X^\top Y$:
What if we have a large dataset ($N > 100{,}000{,}000$)? Use stochastic gradient descent (sketched below).
What if $X^\top X$ is not invertible? Then the columns of X (features) are not linearly independent (either redundant features or D > N), and W* is not unique. Make it unique by removing redundant features or by regularization (later!). Decomposition-based methods (not discussed) still work:

w = np.linalg.lstsq(X, Y)[0]

or use gradient descent (later!).
12
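A minimal stochastic gradient descent sketch for this cost (the function name, step size, and step count are illustrative choices; the step size may need tuning for real data):

import numpy as np

def sgd_linreg(X, y, lr=0.01, steps=100_000, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(steps):
        n = rng.integers(N)                  # sample one instance at random
        w -= lr * (X[n] @ w - y[n]) * X[n]   # gradient of 1/2 (w^T x - y)^2
    return w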
Recall that the total cost of the closed-form solution is $O(ND^2 + D^3)$.
13