

SLIDE 1

Linear Classifiers and Regressors

“Borrowed” with permission from Andrew Moore (CMU)

SLIDE 2

Single-Parameter Linear Regression

SLIDE 3

Regression vs Classification

  • Regressor:  Input Attributes → Prediction of real-valued output
  • Density Estimator:  Input Attributes → Probability
  • Classifier:  Input Attributes → Prediction of categorical output

SLIDE 4

Linear Regression

Linear regression assumes that the expected value of the output y given an input x, E[y|x], is linear in x. Simplest case: Out(x) = w×x for some unknown w. Challenge: given a dataset, how do we estimate w?

DATASET:

  inputs     outputs
  x1 = 1     y1 = 1
  x2 = 3     y2 = 2.2
  x3 = 2     y3 = 2
  x4 = 1.5   y4 = 1.9
  x5 = 4     y5 = 3.1

SLIDE 5

1-parameter linear regression

Assume the data is generated by yi = w×xi + noisei where…

  • the noise signals are independent
  • the noise has a normal distribution with mean 0 and unknown variance σ²

Then P(y|w,x) has a normal distribution with

  • mean w×x
  • variance σ²
SLIDE 6

Bayesian Linear Regression

P(y|w,x) = Normal(mean w×x ; var σ²)

The datapoints (x1, y1), (x2, y2), …, (xn, yn) are EVIDENCE about w. We want to infer w from the data:

  P(w | x1, x2, …, xn, y1, y2, …, yn)

  • ?? Use BAYES rule to work out a posterior distribution for w given the data ??
  • Or Maximum Likelihood Estimation?
SLIDE 7

Maximum likelihood estimation of w

Question: “For what value of w is this data most likely to have happened?”

What value of w maximizes

  P(y1, y2, …, yn | x1, x2, …, xn, w) = ∏i=1…n P(yi | w, xi) ?

SLIDE 8

  w* = argmax_w ∏i=1…n P(yi | w, xi)

     = argmax_w ∏i=1…n exp( −½ ((yi − w×xi)/σ)² )

     = argmax_w ∑i=1…n −½ ((yi − w×xi)/σ)²

     = argmin_w ∑i=1…n (yi − w×xi)²

SLIDE 9

Linear Regression

Maximum likelihood w minimizes

E(w) = ∑i (yi − w×xi)² = ∑i yi² − 2w ∑i xiyi + w² ∑i xi²

…the sum-of-squares of the residuals ⇒ we need to minimize a quadratic function of w.

[Plot: E(w) versus w, a parabola with a single minimum]

SLIDE 10

Linear Regression

Sum-of-squares minimized when

  w = ∑i xiyi / ∑i xi²

The maximum likelihood model is Out(x) = w×x, which we can use for prediction.

Note: Bayesian stats would provide a probability distribution of w, and predictions would then give a probability distribution of the expected output. It is often useful to know your confidence; max likelihood also provides a kind of confidence!

[Plot: p(w) versus w]
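As a quick numeric sketch (plain Python with NumPy; the variable names are ours), here is the maximum likelihood w = ∑xiyi/∑xi² computed for the five-point dataset from Slide 4:

    import numpy as np

    # Dataset from Slide 4
    x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
    y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

    # Maximum likelihood w for the model Out(x) = w*x:
    # w = sum(x_i * y_i) / sum(x_i^2)
    w = np.sum(x * y) / np.sum(x ** 2)
    print(w)               # approx 0.83

    # Use the maximum likelihood model for prediction
    print(w * 2.5)         # Out(2.5)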

SLIDE 11

Multi-variate Linear Regression

SLIDE 12

Multivariate Regression

What if the inputs are vectors? The dataset has the form:

  x1  y1
  x2  y2
  x3  y3
   :   :
  xR  yR

[Plot: the input is 2-d (x1, x2); the output value is the "height"]

SLIDE 13

Multivariate Regression

R datapoints; each input has m components. As matrices:

      ⎡ x11 x12 … x1m ⎤        ⎡ y1 ⎤
  X = ⎢ x21 x22 … x2m ⎥    Y = ⎢ y2 ⎥
      ⎢  :   :  …  :  ⎥        ⎢  : ⎥
      ⎣ xR1 xR2 … xRm ⎦        ⎣ yR ⎦

The linear regression model assumes ∃ a vector w s.t.

  Out(x) = wᵀx = w1 x[1] + w2 x[2] + … + wm x[m]

  • Max. likelihood w = (XᵀX)⁻¹(XᵀY)

IMPORTANT EXERCISE: PROVE IT!!!!!
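A minimal sketch of that closed form in NumPy (the four-point dataset is made up for illustration; solving the linear system is preferred over forming the inverse explicitly):

    import numpy as np

    # R = 4 datapoints, m = 2 input components (illustrative values)
    X = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.0],
                  [4.0, 2.5]])
    Y = np.array([5.0, 3.5, 6.0, 9.5])

    # Max. likelihood w = (X^T X)^(-1) (X^T Y)
    w = np.linalg.solve(X.T @ X, X.T @ Y)

    # Predict: Out(x) = w^T x
    x_new = np.array([2.0, 1.0])
    print(w, w @ x_new)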

SLIDE 14

Multivariate Regression (con’t)

The max. likelihood w is w = (XᵀX)⁻¹(XᵀY)

  • XᵀX is an m×m matrix: its i,j'th element is ∑k=1…R xki xkj
  • XᵀY is an m-element vector: its i'th element is ∑k=1…R xki yk

SLIDE 15

Constant Term in Linear Regression

SLIDE 16

What about a constant term?

What if the linear data does not go through the origin (0,0,…,0)?

Statisticians and Neural Net folks all agree on a simple, obvious hack.

Can you guess??

SLIDE 17

The constant term

  • Trick: create a fake input "X0" that always takes the value 1

Before (X1 X2 Y):        After (X0 X1 X2 Y):
  2  4  16                 1  2  4  16
  3  4  17                 1  3  4  17
  5  5  20                 1  5  5  20

Before: Y = w1X1 + w2X2 …is a poor model
After:  Y = w0X0 + w1X1 + w2X2 = w0 + w1X1 + w2X2 …is a good model!

Here, you should be able to see the MLE w0, w1, w2 by inspection.
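A minimal sketch of the trick in NumPy: prepend a column of 1's, then solve as before. On this table the fit is exact, so the solver recovers the by-inspection answer (w0 = 10, w1 = 1, w2 = 1):

    import numpy as np

    X = np.array([[2.0, 4.0],
                  [3.0, 4.0],
                  [5.0, 5.0]])
    Y = np.array([16.0, 17.0, 20.0])

    # The trick: a fake input X0 that always takes value 1
    X_aug = np.column_stack([np.ones(len(X)), X])

    # Least squares fit (exact here)
    w, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    print(w)   # [10. 1. 1.]  ->  Y = 10 + X1 + X2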

SLIDE 18

Linear Regression with varying noise

Heteroscedasticity...

SLIDE 19

Regression with varying noise

  • Suppose you know the variance of the noise that was added to each datapoint:

   xi    yi    σi²
   ½     ½     4
   1     1     1
   2     1     ¼
   2     3     4
   3     2     ¼

[Plot: the datapoints with their differing noise levels σi]

Assume  yi ~ N(w×xi, σi²)

What's the MLE estimate of w?

SLIDE 20

MLE estimation with varying noise

  argmax_w log p(y1, …, yR | x1, …, xR, σ1², …, σR², w)

= argmin_w ∑i=1…R (yi − w×xi)² / σi²

  (assuming i.i.d. data, then plugging in the equation for a Gaussian and simplifying)

Setting dLL/dw equal to zero:

  ∑i=1…R (xi/σi²)(yi − w×xi) = 0

Trivial algebra then gives:

  w = ( ∑i=1…R xiyi/σi² ) / ( ∑i=1…R xi²/σi² )
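A quick numeric sketch of that closed form, using the table from Slide 19:

    import numpy as np

    x  = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
    y  = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
    s2 = np.array([4.0, 1.0, 0.25, 4.0, 0.25])   # sigma_i^2 for each point

    # MLE with varying noise: w = sum(x*y/s2) / sum(x^2/s2)
    w = np.sum(x * y / s2) / np.sum(x ** 2 / s2)
    print(w)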

SLIDE 21

This is Weighted Regression

  • How do we minimize the weighted sum of squares?

  argmin_w ∑i=1…R (yi − w×xi)² / σi²

…where the weight for the i'th datapoint is 1/σi².

[Plot: the datapoints with their differing noise levels σi]

SLIDE 22

Weighted Multivariate Regression

The max. likelihood w is w = (WXᵀWX)⁻¹(WXᵀWY)

  • (WXᵀWX) is an m×m matrix: its i,j'th element is ∑k=1…R xki xkj / σk²
  • (WXᵀWY) is an m-element vector: its i'th element is ∑k=1…R xki yk / σk²
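A minimal NumPy sketch (illustrative data): scaling each row of X and Y by 1/σk reproduces exactly the element sums above.

    import numpy as np

    X  = np.array([[1.0, 2.0],
                   [2.0, 1.0],
                   [3.0, 0.5]])
    Y  = np.array([4.0, 5.0, 6.5])
    s2 = np.array([1.0, 0.25, 4.0])      # per-datapoint noise variances

    # Row k of WX is x_k / sigma_k, so (WX)^T(WX) has i,j'th element
    # sum_k x_ki x_kj / sigma_k^2, matching the slide.
    WX = X / np.sqrt(s2)[:, None]
    WY = Y / np.sqrt(s2)
    w = np.linalg.solve(WX.T @ WX, WX.T @ WY)
    print(w)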

SLIDE 23

Non-linear Regression

(Digression…)

SLIDE 24

Non-linear Regression

Suppose y is related to a function of x such that the predicted values depend non-linearly on w:

   xi    yi
   ½     ½
   1     2.5
   2     3
   3     2
   3     3

[Plot: the five datapoints]

Assume  yi ~ N( √(w + xi), σ² )

What's the MLE estimate of w?

SLIDE 25

Non-linear MLE estimation

  argmax_w log p(y1, …, yR | x1, …, xR, σ, w)

= argmin_w ∑i=1…R ( yi − √(w + xi) )²

  (assuming i.i.d. data, then plugging in the equation for a Gaussian and simplifying)

Setting dLL/dw equal to zero gives w such that:

  ∑i=1…R ( yi − √(w + xi) ) / √(w + xi) = 0

We're down the algebraic toilet. So guess what we do?

Common (but not the only) approach: numerical solutions.

  • Line Search
  • Simulated Annealing
  • Gradient Descent
  • Conjugate Gradient
  • Levenberg-Marquardt
  • Newton's Method

Also, special-purpose statistical-optimization-specific tricks such as E.M. (see the Gaussian Mixtures lecture for an introduction).

SLIDE 26

GRADIENT DESCENT

Goal: find a local minimum of f: ℜ → ℜ

Approach:

1. Start with some value for w
2. GRADIENT DESCENT:  w ← w − η ∂f(w)/∂w
3. Iterate … until bored …

η = LEARNING RATE = a small positive number, e.g. η = 0.05 ("Good default value for anything!")

QUESTION: Justify the Gradient Descent Rule
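A minimal sketch of the rule on a toy function (the function, iteration count, and stopping choice are ours):

    # Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3
    def f_prime(w):
        return 2.0 * (w - 3.0)        # df/dw

    w = 0.0                           # 1. start with some value for w
    eta = 0.05                        # the learning rate
    for _ in range(500):              # 3. iterate ... until bored ...
        w = w - eta * f_prime(w)      # 2. the gradient descent rule
    print(w)                          # close to 3.0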

SLIDE 27

Gradient Descent in “m” Dimensions

Given f(w): ℜᵐ → ℜ, the gradient

  ∇w f(w) = ( ∂f/∂w1, …, ∂f/∂wm )ᵀ

points in the direction of steepest ascent.

GRADIENT DESCENT RULE:  w ← w − η ∇w f(w)

Equivalently:

  wj ← wj − η ∂f(w)/∂wj   …where wj is the j'th weight

"just like a linear feedback system"

SLIDE 28

Linear Perceptron

SLIDE 29

Linear Perceptrons

Multivariate linear models:  Out(x) = wᵀx

"Training" ≡ minimizing the sum-of-squared residuals…

  E = ∑k ( Out(xk) − yk )² = ∑k ( wᵀxk − yk )²

…by gradient descent → the perceptron training rule

SLIDE 30

Linear Perceptron Training Rule

  E = ∑k=1…R (wᵀxk − yk)²

Gradient descent: to minimize E, update w …

  wj ← wj − η ∂E/∂wj

  • So what's ∂E/∂wj ?

  ∂E/∂wj = ∂/∂wj ∑k=1…R (wᵀxk − yk)²

         = ∑k 2(wᵀxk − yk) ∂/∂wj (wᵀxk − yk)

         = −2 ∑k δk ∂/∂wj wᵀxk      …where δk = yk − wᵀxk

         = −2 ∑k δk ∂/∂wj ∑i=1…m wi xki

         = −2 ∑k=1…R δk xkj

SLIDE 31

Linear Perceptron Training Rule

  E = ∑k=1…R (wᵀxk − yk)²

Gradient descent: to minimize E, update w …

  wj ← wj − η ∂E/∂wj

…where…

  ∂E/∂wj = −2 ∑k=1…R δk xkj

…so the update is:

  wj ← wj + 2η ∑k=1…R δk xkj

We frequently neglect the 2 (meaning we halve the learning rate).

SLIDE 32

The “Batch” perceptron algorithm

1) Randomly initialize weights w1, w2, …, wm
2) Get your dataset (append 1's to the inputs to avoid going through the origin)
3) for i = 1 to R:  δi := yi − wᵀxi
4) for j = 1 to m:  wj ← wj + η ∑i=1…R δi xij
5) if ∑ δi² stops improving then stop; else loop back to 3.
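A minimal NumPy sketch of the batch algorithm (dataset, learning rate, and stopping tolerance are our own choices):

    import numpy as np

    # Toy dataset; step 2: append 1's to the inputs
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 1.5]])
    Y = np.array([5.2, 4.1, 9.0, 7.6])
    X = np.column_stack([np.ones(len(X)), X])

    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])    # step 1: random initial weights
    eta = 0.01
    prev_err = np.inf
    while True:
        delta = Y - X @ w              # step 3: delta_i = y_i - w^T x_i
        w = w + eta * X.T @ delta      # step 4: w_j += eta * sum_i delta_i x_ij
        err = np.sum(delta ** 2)
        if err >= prev_err - 1e-12:    # step 5: stop when no longer improving
            break
        prev_err = err
    print(w)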

SLIDE 33

  δ ← y − wᵀx
  wj ← wj + η δ xj

A RULE KNOWN BY MANY NAMES:

  • The LMS rule
  • The delta rule
  • The Widrow-Hoff rule
  • Classical conditioning
  • The adaline rule

SLIDE 34

If data is voluminous and arrives fast

If input-output pairs (x, y) come in very quickly, don't bother remembering old ones; just keep using new ones.

  • Observe (x, y)
  • δ ← y − wᵀx
  • ∀j: wj ← wj + η δ xj
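A minimal sketch of the streaming version, assuming an iterable of (x, y) pairs (the generator below is fabricated for illustration):

    import numpy as np

    def online_lms(stream, m, eta=0.05):
        """Never stores old pairs; just keeps updating on new ones."""
        w = np.zeros(m)
        for x, y in stream:              # observe (x, y)
            delta = y - w @ x            # delta <- y - w^T x
            w = w + eta * delta * x      # for all j: w_j += eta * delta * x_j
        return w

    # Example: a noisy stream generated from true weights (2, -1)
    rng = np.random.default_rng(0)
    stream = [(x, x @ np.array([2.0, -1.0]) + 0.01 * rng.normal())
              for x in rng.normal(size=(1000, 2))]
    print(online_lms(stream, m=2))       # close to [2, -1]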

SLIDE 35

Gradient Descent vs Matrix Inversion for Linear Perceptrons

GD Advantages (MI disadvantages):

  • Biologically plausible
  • With very very many attributes, each iteration costs only O(mR). If fewer than m iterations are needed we've beaten Matrix Inversion
  • More easily parallelizable (or implementable in wetware)?

GD Disadvantages (MI advantages):

  • It's moronic
  • It's essentially a slow implementation of a way to build the XᵀX matrix and then solve a set of linear equations
  • If m is small it's especially outrageous. If m is large then the direct matrix inversion method gets fiddly but not impossible if you want to be efficient.
  • Hard to choose a good learning rate
  • Matrix inversion takes predictable time. You can't be sure when gradient descent will stop.

SLIDE 36

Gradient Descent vs Matrix Inversion for Linear Perceptrons

GD Advantages (MI disadvantages):

  • Biologically plausible
  • With very very many attributes, each iteration costs only O(mR). If fewer than m iterations are needed, faster than Matrix Inversion
  • More easily parallelizable (or implementable in wetware)?

GD Disadvantages (MI advantages):

  • It's moronic
  • It's essentially a slow implementation of a way to build the XᵀX matrix, then solve a set of linear equations
  • If m is small it's especially outrageous. If m is large then the direct matrix inversion method gets fiddly but not impossible if you want to be efficient.
  • Hard to choose a good learning rate
  • Matrix inversion takes predictable time. You can't be sure when gradient descent will stop.

But we'll soon see that GD has an important extra trick up its sleeve…

SLIDE 37

Linear Perceptron …for Classification

SLIDE 38

Regression vs Classification

  • Regressor:  Input Attributes → Prediction of real-valued output
  • Density Estimator:  Input Attributes → Probability
  • Classifier:  Input Attributes → Prediction of categorical output

SLIDE 39

Perceptrons for Classification

What if all outputs are 0's or 1's?

We can do a linear fit. Our prediction is

  0 if Out(x) ≤ ½
  1 if Out(x) > ½

[Plot: Blue = Out(x); Green = Classification]

WHAT'S THE BIG PROBLEM WITH THIS???

SLIDE 40

Classification with Perceptrons I

Don't minimize ∑i (wᵀxi − yi)² !

Instead, minimize the number of misclassifications:

  ∑i ( Round(wᵀxi) − yi )

…where Round(x) = −1 if x < 0, and 1 if x ≥ 0.

[Assume outputs are +1 & −1, not +1 & 0]

New gradient descent rule:

  • if (xi, yi) is correctly classified: don't change w
  • if wrongly predicted as 1:  w ← w − xi
  • if wrongly predicted as −1: w ← w + xi

NOTE: CUTE & NON-OBVIOUS WHY THIS WORKS!!
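A minimal sketch of that rule (NumPy; the toy dataset, bias input, and pass count are our own choices; labels are +1/−1 as the slide assumes):

    import numpy as np

    # Linearly separable toy data, with a bias input of 1 appended
    X = np.array([[ 1.0,  2.0, 1.0],
                  [ 2.0,  3.0, 1.0],
                  [-1.0, -1.5, 1.0],
                  [-2.0, -1.0, 1.0]])
    y = np.array([1, 1, -1, -1])

    w = np.zeros(3)
    for _ in range(100):                      # a fixed number of passes
        for xi, yi in zip(X, y):
            pred = 1 if w @ xi >= 0 else -1   # Round(w^T x)
            if pred == yi:
                continue                      # correctly classified: no change
            elif pred == 1:                   # wrongly predicted as +1
                w = w - xi
            else:                             # wrongly predicted as -1
                w = w + xi
    print(w)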

SLIDE 41

Classification with Perceptrons II: Sigmoid Functions

[Plots: the least squares fit is useless here; another fit classifies better, but it's not the least squares fit!]

SOLUTION: Instead of Out(x) = wᵀx we'll use Out(x) = g(wᵀx), where g: ℜ → (0,1) is a squashing function.

SLIDE 42

The Sigmoid

  g(h) = 1 / (1 + exp(−h))

Rotating the curve 180° centered on (0, ½) produces the same curve, i.e. g(h) = 1 − g(−h). Can you prove this?
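Not a proof, but a quick numeric spot check of the identity:

    import numpy as np

    def g(h):
        return 1.0 / (1.0 + np.exp(-h))      # the sigmoid

    h = np.linspace(-5, 5, 11)
    print(np.allclose(g(h), 1.0 - g(-h)))    # True: g(h) = 1 - g(-h)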

SLIDE 43

The Sigmoid

Choose w to minimize

  ∑i=1…R ( yi − Out(xi) )² = ∑i=1…R ( yi − g(wᵀxi) )²

…where g(h) = 1 / (1 + exp(−h)).

SLIDE 44

Linear Perceptron Classification Regions

[Diagram: points labeled 0 and 1 in the (X1, X2) plane]

Use the model:  Out(x) = g( wᵀ(1, x) ) = g( w0 + w1x1 + w2x2 )

In the diagram… which region is classified +1, and which 0??

SLIDE 45

Gradient descent with sigmoid on a perceptron

First note that g′(x) = g(x)(1 − g(x)).

Proof: g(x) = 1/(1 + e⁻ˣ), so

  g′(x) = e⁻ˣ / (1 + e⁻ˣ)²
        = (1/(1 + e⁻ˣ)) × (e⁻ˣ/(1 + e⁻ˣ))
        = g(x) (1 − 1/(1 + e⁻ˣ))
        = g(x)(1 − g(x))

Since Out(x) = g(∑k wk xk):

  E = ∑i ( yi − g(∑k wk xik) )²

  ∂E/∂wj = ∑i 2( yi − g(∑k wk xik) ) × ( −g′(∑k wk xik) ) xij
         = −2 ∑i δi g′(neti) xij
         = −2 ∑i δi gi(1 − gi) xij

…where neti = ∑k wk xik, gi = g(neti), and δi = yi − Out(xi) = yi − gi.

The sigmoid perceptron update rule:

  wj ← wj + η ∑i=1…R δi gi (1 − gi) xij

…where gi = g(∑j=1…m wj xij) and δi = yi − gi
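A minimal NumPy sketch of the update rule as a training loop (the data, learning rate, and iteration count are our own choices):

    import numpy as np

    def g(h):
        return 1.0 / (1.0 + np.exp(-h))

    # Toy 0/1 classification data with a bias input of 1 appended
    X = np.array([[ 1.0,  2.0, 1.0],
                  [ 2.0,  3.0, 1.0],
                  [-1.0, -1.5, 1.0],
                  [-2.0, -1.0, 1.0]])
    y = np.array([1.0, 1.0, 0.0, 0.0])

    w = np.zeros(3)
    eta = 0.5
    for _ in range(2000):
        g_i = g(X @ w)                  # g_i = g(sum_j w_j x_ij)
        delta = y - g_i                 # delta_i = y_i - g_i
        # w_j <- w_j + eta * sum_i delta_i * g_i * (1 - g_i) * x_ij
        w = w + eta * X.T @ (delta * g_i * (1.0 - g_i))
    print(np.round(g(X @ w)))           # predicted classes, expected [1. 1. 0. 0.]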

SLIDE 46

Other Things about Perceptrons

  • Invented and popularized by Rosenblatt (1962)
  • Even with the sigmoid nonlinearity, correct convergence is guaranteed!
  • Stable behavior for overconstrained and underconstrained problems

SLIDE 47

Perceptrons and Boolean Functions

If inputs are all 0’s and 1’s and outputs are all 0’s and 1’s…

  • Can learn the function x1 ∧ x2
  • Can learn the function x1 ∨ x2
  • Can learn any conjunction of literals, e.g. x1 ∧ ~x2 ∧ ~x3 ∧ x4 ∧ x5

QUESTION: WHY?

[Diagrams: the X1 ∧ X2 and X1 ∨ X2 functions as labeled points in the (X1, X2) plane]

SLIDE 48

Perceptrons and Boolean Functions

  • Can learn any disjunction of literals, e.g. x1 ∨ ~x2 ∨ ~x3 ∨ x4 ∨ x5
  • Can learn the majority function: f(x1, x2, …, xn) = 1 if n/2 or more of the xi's are 1, and 0 if fewer than n/2 of the xi's are 1
  • What about the exclusive-or function? f(x1, x2) = x1 ⊕ x2 = (x1 ∧ ~x2) ∨ (~x1 ∧ x2)