Linear Classifiers and Regressors


  1. Linear Classifiers and Regressors “Borrowed” with permission from Andrew Moore (CMU)

  2. Single-Parameter Linear Regression

  3. Regression vs Classification
     • Classifier: input = attributes; output = a prediction of a categorical value
     • Regressor: input = attributes; output = a prediction of a real-valued output
     • Density Estimator: input = attributes; output = a probability

  4. Linear Regression
     DATASET (inputs → outputs):
       x_1 = 1     y_1 = 1
       x_2 = 3     y_2 = 2.2
       x_3 = 2     y_3 = 2
       x_4 = 1.5   y_4 = 1.9
       x_5 = 4     y_5 = 3.1
     Linear regression assumes that the expected value of the output y given an input x, E[y|x], is linear in x.
     Simplest case: Out(x) = w·x for some unknown w.
     Challenge: given the dataset, how do we estimate w?

  5. 1-parameter linear regression
     Assume the data is generated by y_i = w·x_i + noise_i, where
     • the noise terms are independent
     • the noise is normally distributed with mean 0 and unknown variance σ²
     Then P(y | w, x) is a normal distribution with
     • mean w·x
     • variance σ²
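
     As a concrete illustration of this generative assumption, here is a minimal NumPy sketch; the slope w_true, noise level, and sample size are arbitrary choices for the example, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

w_true = 2.0   # hypothetical "unknown" slope
sigma = 0.5    # hypothetical noise standard deviation
n = 50

x = rng.uniform(0, 4, size=n)         # inputs
noise = rng.normal(0, sigma, size=n)  # independent N(0, sigma^2) noise
y = w_true * x + noise                # y_i = w * x_i + noise_i
```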

  6. Bayesian Linear Regression
     P(y | w, x) = Normal(mean w·x, variance σ²)
     The datapoints (x_1, y_1), (x_2, y_2), …, (x_n, y_n) are EVIDENCE about w.
     We want to infer w from the data: P(w | x_1, …, x_n, y_1, …, y_n).
     • Use Bayes' rule to work out a posterior distribution for w given the data?
     • Or use Maximum Likelihood Estimation?

  7. Maximum likelihood estimation of w
     Question: for what value of w is this data most likely to have happened?
     ⇔ What value of w maximizes
       P(y_1, y_2, …, y_n | x_1, x_2, …, x_n, w) = ∏_{i=1}^n P(y_i | w, x_i) ?

  8. w* = argmax_w ∏_{i=1}^n P(y_i | w, x_i)
        = argmax_w ∏_{i=1}^n exp( −½ ((y_i − w·x_i) / σ)² )
        = argmax_w Σ_{i=1}^n −½ ((y_i − w·x_i) / σ)²
        = argmin_w Σ_{i=1}^n (y_i − w·x_i)²
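
     A quick numerical check of this equivalence, as a minimal NumPy sketch using the small dataset from the "Linear Regression" slide: over a grid of candidate slopes, the w that maximizes the Gaussian log-likelihood is exactly the w that minimizes the sum of squared residuals.

```python
import numpy as np

# Dataset from the "Linear Regression" slide
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

sigma = 1.0                       # any fixed value; it does not affect the argmax
ws = np.linspace(0.0, 2.0, 2001)  # grid of candidate slopes

# Gaussian log-likelihood (up to an additive constant) and sum of squares for each w
loglik = np.array([-0.5 * np.sum(((y - w * x) / sigma) ** 2) for w in ws])
sse = np.array([np.sum((y - w * x) ** 2) for w in ws])

# Same maximizer / minimizer
assert np.argmax(loglik) == np.argmin(sse)
print(ws[np.argmax(loglik)])
```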

  9. Linear Regression
     The maximum likelihood w minimizes E(w), the sum of squares of the residuals:
       E(w) = Σ_i (y_i − w·x_i)² = Σ_i y_i² − 2w Σ_i x_i y_i + w² Σ_i x_i²
     ⇒ We need to minimize a quadratic function of w.

  10. Linear Regression
     The sum of squares is minimized when
       w = ( Σ_i x_i y_i ) / ( Σ_i x_i² )
     The maximum likelihood model is Out(x) = w·x, which can be used for prediction.
     Note: Bayesian statistics would instead provide a probability distribution over w, and predictions would then give a probability distribution over the expected output. It is often useful to know your confidence; maximum likelihood also provides a kind of confidence.
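
     A minimal NumPy sketch of this closed-form estimate, again using the dataset from the "Linear Regression" slide:

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Closed-form 1-parameter MLE: w = sum_i(x_i * y_i) / sum_i(x_i^2)
w = np.sum(x * y) / np.sum(x ** 2)

def out(x_new):
    """Maximum likelihood model Out(x) = w * x, usable for prediction."""
    return w * x_new

print(w, out(2.5))
```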

  11. Multi-variate Linear Regression

  12. Multivariate Regression
     What if the inputs are vectors?
     Example: each input is 2-d (components x_1 and x_2) and the output value is a “height” above that plane.
     The dataset has the form
       (x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_R, y_R)
     where each x_k is an input vector and each y_k is a real-valued output.

  13. Multivariate Regression
     R datapoints; each input has m components. As matrices:

       X = ⎡ x_11  x_12  …  x_1m ⎤        Y = ⎡ y_1 ⎤
           ⎢ x_21  x_22  …  x_2m ⎥            ⎢ y_2 ⎥
           ⎢  ⋮                ⋮ ⎥            ⎢  ⋮  ⎥
           ⎣ x_R1  x_R2  …  x_Rm ⎦            ⎣ y_R ⎦

     The linear regression model assumes there exists a vector w such that
       Out(x) = wᵀx = w_1 x[1] + w_2 x[2] + … + w_m x[m]
     The maximum likelihood w is w = (XᵀX)⁻¹(XᵀY).
     IMPORTANT EXERCISE: PROVE IT!!!!!

  14. Multivariate Regression (con’t)
     The maximum likelihood w is w = (XᵀX)⁻¹(XᵀY), where
     • XᵀX is an m × m matrix whose (i, j)'th element is Σ_{k=1}^R x_ki x_kj
     • XᵀY is an m-element vector whose i'th element is Σ_{k=1}^R x_ki y_k
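
     A minimal NumPy sketch of the multivariate estimate on synthetic data (the dataset and dimensions here are invented for illustration). np.linalg.solve is used on the normal equations rather than forming the inverse explicitly, which computes the same w = (XᵀX)⁻¹(XᵀY) but is numerically better behaved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: R = 100 datapoints, each input has m = 3 components
R, m = 100, 3
w_true = np.array([1.0, -2.0, 0.5])          # weights used to generate the data
X = rng.normal(size=(R, m))                  # R x m input matrix
Y = X @ w_true + rng.normal(0, 0.1, size=R)  # outputs with Gaussian noise

# Maximum likelihood w = (X^T X)^(-1) (X^T Y)
w = np.linalg.solve(X.T @ X, X.T @ Y)
print(w)  # should be close to w_true
```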

  15. Constant Term in Linear Regression

  16. What about a constant term?
     What if the linear data does not go through the origin (0, 0, …, 0)?
     Statisticians and neural-net folks all agree on a simple, obvious hack. Can you guess it?

  17. The constant term
     • Trick: create a fake input “X0” that always takes the value 1.

       Before:            After:
         X1  X2   Y         X0  X1  X2   Y
          2   4  16          1   2   4  16
          3   4  17          1   3   4  17
          5   5  20          1   5   5  20

     Before: Y = w1·X1 + w2·X2 … is a poor model.
     After:  Y = w0·X0 + w1·X1 + w2·X2 = w0 + w1·X1 + w2·X2 … is a good model!
     Here you should be able to see the MLE w0, w1, w2 by inspection.
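
     A minimal NumPy sketch of the trick, using the three-row table from this slide: append a column of ones as X0 and reuse exactly the same closed-form solution.

```python
import numpy as np

# Table from the slide (X1, X2 -> Y)
X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

# Add the fake input X0 = 1 as an extra leading column
X_aug = np.column_stack([np.ones(len(Y)), X])

# Same closed form as before: w = (X^T X)^(-1) (X^T Y)
w0, w1, w2 = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y)

print(w0, w1, w2)            # for this table the fit is exact: w0 = 10, w1 = 1, w2 = 1
print(X_aug @ [w0, w1, w2])  # reproduces Y
```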

  18. Heteroscedasticity: Linear Regression with varying noise

  19. Regression with varying noise
     • Suppose you know the variance of the noise that was added to each datapoint:

       x_i    y_i    σ_i²
       ½      ½      4
       1      1      1
       2      1      1/4
       2      3      4
       3      2      1/4

     Assume y_i ~ N(w·x_i, σ_i²). What is the MLE estimate of w?

  20. MLE estimation with varying noise
       argmax_w  log p(y_1, …, y_R | x_1, …, x_R, σ_1², …, σ_R², w)
     = argmin_w  Σ_{i=1}^R (y_i − w·x_i)² / σ_i²
       (assuming i.i.d. data, plugging in the Gaussian density, and simplifying)
     Setting dLL/dw equal to zero gives the w such that
       Σ_{i=1}^R x_i (y_i − w·x_i) / σ_i² = 0
     and trivial algebra then gives
       w = ( Σ_{i=1}^R x_i y_i / σ_i² ) / ( Σ_{i=1}^R x_i² / σ_i² )
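
     A minimal NumPy sketch of this weighted estimate, using the table from the previous slide:

```python
import numpy as np

# Table from the "Regression with varying noise" slide
x   = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
y   = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
var = np.array([4.0, 1.0, 0.25, 4.0, 0.25])  # known sigma_i^2 for each datapoint

# w = sum_i(x_i * y_i / sigma_i^2) / sum_i(x_i^2 / sigma_i^2)
w = np.sum(x * y / var) / np.sum(x ** 2 / var)
print(w)
```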

  21. This is Weighted Regression
     • We are minimizing a weighted sum of squares:
       argmin_w Σ_{i=1}^R (y_i − w·x_i)² / σ_i²
     where the weight for the i'th datapoint is 1/σ_i².

  22. Weighted Multivariate Regression
     The maximum likelihood w is w = (XᵀWX)⁻¹(XᵀWY), where W = diag(1/σ_1², …, 1/σ_R²):
     • XᵀWX is an m × m matrix whose (i, j)'th element is Σ_{k=1}^R x_ki x_kj / σ_k²
     • XᵀWY is an m-element vector whose i'th element is Σ_{k=1}^R x_ki y_k / σ_k²
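
     A minimal NumPy sketch of the weighted multivariate estimate on synthetic heteroscedastic data (the data and dimensions are invented for illustration; W is the diagonal matrix of weights 1/σ_i²):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: R = 200 datapoints, m = 2 inputs, known per-point noise variances
R, m = 200, 2
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(R, m))
var = rng.uniform(0.1, 4.0, size=R)           # sigma_i^2 for each datapoint
Y = X @ w_true + rng.normal(0, np.sqrt(var))  # heteroscedastic Gaussian noise

W = np.diag(1.0 / var)                        # weight matrix

# Maximum likelihood w = (X^T W X)^(-1) (X^T W Y)
w = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
print(w)  # should be close to w_true
```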

  23. Non-linear Regression (Digression…)

  24. Non-linear Regression
     Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w:

       x_i    y_i
       ½      ½
       1      2.5
       2      3
       3      2
       3      3

     Assume y_i ~ N(√(w + x_i), σ²). What is the MLE estimate of w?

  25. Non-linear MLE estimation
       argmax_w  log p(y_1, …, y_R | x_1, …, x_R, σ, w)
     = argmin_w  Σ_{i=1}^R (y_i − √(w + x_i))²
       (assuming i.i.d. data, plugging in the Gaussian density, and simplifying)
     Setting dLL/dw equal to zero gives the w such that
       Σ_{i=1}^R (y_i − √(w + x_i)) / √(w + x_i) = 0
     We're down the algebraic toilet, so guess what we do?
     Common (but not the only) approach: numerical solutions
     • Line Search
     • Simulated Annealing
     • Gradient Descent
     • Conjugate Gradient
     • Levenberg-Marquardt
     • Newton's Method
     There are also special-purpose statistical-optimization tricks such as E.M. (see the Gaussian Mixtures lecture for an introduction).
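
     To see why there is no tidy closed form but the problem is still easy numerically, here is a minimal NumPy sketch that simply scans the sum-of-squares objective over a grid of w values, using the table from the previous slide:

```python
import numpy as np

# Table from the "Non-linear Regression" slide
x = np.array([0.5, 1.0, 2.0, 3.0, 3.0])
y = np.array([0.5, 2.5, 3.0, 2.0, 3.0])

def sse(w):
    """Sum of squared residuals for the model y_i ~ N(sqrt(w + x_i), sigma^2)."""
    return np.sum((y - np.sqrt(w + x)) ** 2)

# Brute-force scan: the objective is cheap to evaluate even though the
# stationarity equation has no closed-form solution for w.
ws = np.linspace(-0.49, 10.0, 10000)  # need w + x_i > 0, i.e. w > -0.5 here
values = np.array([sse(w) for w in ws])
print(ws[np.argmin(values)])
```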

  26. GRADIENT DESCENT
     Goal: find a local minimum of f: ℜ → ℜ.
     Approach:
       1. Start with some value for w.
       2. GRADIENT DESCENT: w ← w − η ∂f/∂w
       3. Iterate … until bored.
     η = LEARNING RATE = a small positive number, e.g. η = 0.05 (a good default value for anything!).
     QUESTION: justify the gradient descent rule.
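
     As a concrete instance of the rule, here is a minimal NumPy sketch that applies it to the 1-D non-linear regression objective from the digression above; the starting point and number of iterations are illustrative choices.

```python
import numpy as np

# Table from the "Non-linear Regression" slide; model y_i ~ N(sqrt(w + x_i), sigma^2)
x = np.array([0.5, 1.0, 2.0, 3.0, 3.0])
y = np.array([0.5, 2.5, 3.0, 2.0, 3.0])

def df_dw(w):
    """Derivative of f(w) = sum_i (y_i - sqrt(w + x_i))^2 with respect to w."""
    return -np.sum((y - np.sqrt(w + x)) / np.sqrt(w + x))

eta = 0.05   # learning rate (the slide's "good default value")
w = 1.0      # 1. start with some value for w
for _ in range(1000):
    w = w - eta * df_dw(w)  # 2. gradient descent step  3. iterate ... until bored

print(w)  # a local minimum of the sum-of-squares objective
```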

  27. Gradient Descent in “m” Dimensions
     Given f(w): ℜ^m → ℜ, the gradient
       ∇f(w) = ( ∂f(w)/∂w_1, …, ∂f(w)/∂w_m )ᵀ
     points in the direction of steepest ascent, and |∇f(w)| is the gradient in that direction.
     GRADIENT DESCENT RULE: w ← w − η ∇f(w)
     Equivalently, w_j ← w_j − η ∂f(w)/∂w_j, where w_j is the j'th weight ("just like a linear feedback system").

  28. Linear Perceptron

  29. Linear Perceptrons
     Multivariate linear models: Out(x) = wᵀx
     “Training” ≡ minimizing the sum of squared residuals
       E = Σ_k ( Out(x_k) − y_k )² = Σ_k ( wᵀx_k − y_k )²
     by gradient descent → the perceptron training rule.

  30. Linear Perceptron Training Rule
       E = Σ_{k=1}^R (y_k − wᵀx_k)²
     so
       ∂E/∂w_j = Σ_{k=1}^R ∂/∂w_j (y_k − wᵀx_k)²
               = Σ_{k=1}^R 2 (y_k − wᵀx_k) ∂/∂w_j (y_k − wᵀx_k)
               = −2 Σ_{k=1}^R δ_k ∂/∂w_j (wᵀx_k)                   where δ_k = y_k − wᵀx_k
               = −2 Σ_{k=1}^R δ_k ∂/∂w_j ( Σ_{i=1}^m w_i x_ki )    … so what is ∂/∂w_j ( Σ_{i=1}^m w_i x_ki )?
               = −2 Σ_{k=1}^R δ_k x_kj
     Gradient descent: to minimize E, update
       w_j ← w_j − η ∂E/∂w_j
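
     A minimal NumPy sketch of the resulting batch update rule, w_j ← w_j − η ∂E/∂w_j = w_j + 2η Σ_k δ_k x_kj, on synthetic data (the dataset, learning rate, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: R = 100 datapoints with m = 3 inputs
R, m = 100, 3
w_true = np.array([0.5, -1.0, 2.0])
X = rng.normal(size=(R, m))
Y = X @ w_true + rng.normal(0, 0.1, size=R)

eta = 0.001            # learning rate
w = np.zeros(m)        # start with some value for w
for _ in range(5000):  # iterate ... until bored
    delta = Y - X @ w        # delta_k = y_k - w^T x_k
    grad = -2 * X.T @ delta  # dE/dw_j = -2 * sum_k delta_k * x_kj
    w = w - eta * grad       # w_j <- w_j - eta * dE/dw_j

print(w)  # converges toward the least-squares solution (close to w_true)
```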
