  1. MLE & Regression Ken Kreutz-Delgado (Nuno Vasconcelos) UCSD – ECE 175A – Winter 2012

  2. Statistical Learning
  • Goal: Given a relationship between a feature vector x and a vector y, and iid sample data $(x_i, y_i)$, find an approximating function $\hat{f}(x) \approx y$.
    [Diagram: x → f(·) → ŷ = f̂(x)]
  • This is called training or learning.
  • Two major types of learning:
    – Unsupervised Classification (aka Clustering) or Regression ("blind" curve fitting): only X is known.
    – Supervised Classification or Regression: both X and the target value Y are known during training; only X is known at test time.

  3. Supervised Learning & Regression
  • X can be anything, but the type of Y dictates the type of supervised learning problem:
    – Y in {0, 1} is referred to as detection
    – Y in {0, ..., M-1} is referred to as (M-ary) classification
    – Y continuous is referred to as regression
  • We have been dealing mostly with classification; now we will emphasize regression.
  • The regression problem provides a relatively easy setting in which to explain non-trivial MLE problems.

  4. The Standard Regression Model
  • The regression problem is usually modeled as follows:
    – There are two random vectors: the independent (regressor) variable X and the dependent (regressed) variable Y.
    – An iid dataset of training examples D = {(x_1, y_1), ..., (x_n, y_n)}.
    – An additive-noise parametric model of the form
        $Y = f(X;\theta) + E$
      where $\theta \in \mathbb{R}^p$ is a deterministic parameter vector and E is an iid additive random vector that accounts for noise and model error.
  • Two fundamental types of regression problems:
    – Linear regression, where f(·) is linear in θ
    – Nonlinear regression, otherwise
    – What matters is linearity in the parameter θ, not in the data X!
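As a concrete illustration (not part of the original slides), the sketch below simulates data from the additive-noise model $Y = f(X;\theta) + E$ for a hypothetical line-fitting choice of f; the parameter values, noise level, and the function name `f` are assumptions made only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: f(x; theta) = theta_1 * x + theta_0 (linear in theta)
theta_true = np.array([1.5, -0.5])            # [theta_1, theta_0], chosen arbitrarily

def f(x, theta):
    """Deterministic regression function f(x; theta)."""
    return theta[0] * x + theta[1]

n = 100
x = rng.uniform(-2.0, 2.0, size=n)            # regressor X
e = rng.normal(0.0, 0.3, size=n)              # additive noise E, independent of X
y = f(x, theta_true) + e                      # dependent variable Y = f(X; theta) + E
```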

  5. Example Regression Models
  • Linear Regression:
    – Line fitting: $f(x;\theta) = \theta_1 x + \theta_0$
    – Polynomial fitting: $f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i$
    – Truncated Fourier series: $f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(ix)$
  • Nonlinear Regression:
    – Neural networks: $f(x;\theta) = \dfrac{1}{1 + e^{-\theta_1 x - \theta_0}}$
    – Sinusoidal decompositions: $f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(\omega_i x)$, with the frequencies $\omega_i$ also unknown
    – Etc.
  • We often assume that E is additive white Gaussian noise (AWGN).
  • We always assume that E and X are independent.
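To make the point that "linear regression" means linear in θ, not in x, here is a small sketch of my own: both the polynomial and truncated-Fourier models can be written as $\theta^T \phi(x)$ for a fixed feature map φ. The helper name `design_matrix` and its arguments are hypothetical.

```python
import numpy as np

def design_matrix(x, k, kind="poly"):
    """Feature map phi(x): each column is one basis function evaluated at x."""
    x = np.asarray(x, dtype=float)
    if kind == "poly":                        # f(x; theta) = sum_i theta_i x^i
        return np.stack([x**i for i in range(k + 1)], axis=1)
    elif kind == "fourier":                   # f(x; theta) = sum_i theta_i cos(i x)
        return np.stack([np.cos(i * x) for i in range(k + 1)], axis=1)
    raise ValueError(kind)

x = np.linspace(0, 1, 5)
theta = np.array([0.2, -1.0, 0.5])
# Both models are f(x; theta) = Phi @ theta, i.e. linear in the parameters theta
print(design_matrix(x, 2, "poly") @ theta)
print(design_matrix(x, 2, "fourier") @ theta)
```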

  6. Probabilistic Model of Y Conditioned on X
  • A realization of X = x, E = e, Y = y satisfies
      $y = f(x;\theta) + e$
    – x is (almost) always known; the goal is to predict y given x.
    – Thus, for each x, f(x;θ) is treated like a constant.
    – The realization E = e is added to f(x;θ) to form Y = y.
    – Hence, Y is conditionally distributed as E, but with a constant added.
    – This only changes the mean of the distribution of E, $P_E(\varepsilon;\theta)$, yielding
        $P_{Y|X}(y \mid x;\theta) = P_E\big(y - f(x;\theta);\theta\big)$
    – The conditional probability model for Y | X is determined by the distribution of the noise, $P_E(\varepsilon;\theta)$! Note that the noise pdf $P_E(\varepsilon;\theta)$ might itself depend on the unknown parameter vector θ.

  7. The (Conditional) Likelihood Function
  • Consider a collection of iid training points D = {(x_1, y_1), ..., (x_n, y_n)}. If we define X = {x_1, ..., x_n} and Y = {y_1, ..., y_n}, we have D = (X, Y).
  • Conditioned on X, the likelihood of θ given D is
      $P_{Y|X}(Y \mid X;\theta) = \prod_{i=1}^{n} P_{Y|X}(y_i \mid x_i;\theta) = \prod_{i=1}^{n} P_E\big(y_i - f(x_i;\theta);\theta\big)$
  • This is also the X-conditional likelihood of θ given Y.
  • Note: we have used the facts that the y_i are conditionally iid and that each y_i depends only on x_i (both facts being a consequence of our modeling assumptions).
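A minimal sketch (my own, assuming the AWGN noise model introduced on the later slides) of the X-conditional log-likelihood $\sum_i \log P_E(y_i - f(x_i;\theta))$; the function and variable names are hypothetical.

```python
import numpy as np

def log_likelihood(theta, x, y, f, sigma):
    """X-conditional log-likelihood of theta under AWGN with std sigma:
    sum_i log P_E(y_i - f(x_i; theta))."""
    resid = y - f(x, theta)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

f = lambda x, th: th[0] * x + th[1]            # hypothetical line model
x = np.array([0.0, 1.0, 2.0]); y = np.array([-0.4, 1.1, 2.4])
print(log_likelihood(np.array([1.4, -0.5]), x, y, f, sigma=0.3))
```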

  8. Maximum Likelihood Estimation
  • Given a collection of iid training points D = {(x_1, y_1), ..., (x_n, y_n)}, the natural procedure to estimate the parameter θ is ML estimation:
      $\hat{\theta}_{ML} = \arg\max_\theta \prod_i P_{Y|X}(y_i \mid x_i;\theta) = \arg\max_\theta \prod_i P_E\big(y_i - f(x_i;\theta);\theta\big)$
  • Equivalently,
      $\hat{\theta}_{ML} = \arg\max_\theta \sum_i \log P_{Y|X}(y_i \mid x_i;\theta) = \arg\max_\theta \sum_i \log P_E\big(y_i - f(x_i;\theta);\theta\big)$
  • Note that the noise pdf $P_E(\varepsilon;\theta)$ can also possibly depend on θ.
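In general this maximization has no closed form, so one reasonable approach, an assumption on my part rather than anything prescribed by the slides, is to minimize the negative log-likelihood numerically, e.g. with scipy.optimize.minimize; the logistic-curve model and all numbers below are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def f(x, theta):                               # hypothetical nonlinear model (logistic curve)
    return 1.0 / (1.0 + np.exp(-(theta[0] * x + theta[1])))

# Simulated data with Gaussian noise (sigma assumed known here)
theta_true, sigma = np.array([2.0, -1.0]), 0.05
x = rng.uniform(-3, 3, 200)
y = f(x, theta_true) + rng.normal(0, sigma, 200)

def neg_log_lik(theta):
    resid = y - f(x, theta)
    return np.sum(resid**2) / (2 * sigma**2)   # terms constant in theta dropped

theta_ml = minimize(neg_log_lik, x0=np.zeros(2)).x
print(theta_ml)                                # should be close to theta_true
```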

  9. AWGN MLE
  • One frequently used model is the scalar AWGN case, where the noise is zero-mean with variance $\sigma^2$:
      $P_E(e) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\dfrac{e^2}{2\sigma^2}\right)$
  • In this case the conditional pdf for Y | X is a Gaussian of mean f(x;θ) and variance $\sigma^2$:
      $P_{Y|X}(y \mid x;\theta) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\dfrac{\big(y - f(x;\theta)\big)^2}{2\sigma^2}\right)$
  • If the variance $\sigma^2$ is unknown, it is included in θ.
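As a quick sanity check (my addition), the conditional density above can be evaluated either from the formula or with scipy.stats.norm centered at f(x;θ); the two should agree.

```python
import numpy as np
from scipy.stats import norm

f_val, sigma, y = 1.2, 0.5, 1.0                # f(x; theta), noise std, observed y

manual = np.exp(-(y - f_val)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
library = norm(loc=f_val, scale=sigma).pdf(y)  # Gaussian with mean f(x; theta)
print(manual, library)                         # identical up to floating-point error
```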

  10. AWGN MLE
  • Assume the variance $\sigma^2$ is known. Then the MLE is:
      $\hat{\theta}_{ML} = \arg\max_\theta \sum_i \log P_E\big(y_i - f(x_i;\theta)\big)$
      $\qquad\;\;\, = \arg\min_\theta \sum_i \left[ \tfrac{1}{2}\log(2\pi\sigma^2) + \dfrac{\big(y_i - f(x_i;\theta)\big)^2}{2\sigma^2} \right]$
      $\qquad\;\;\, = \arg\min_\theta \sum_i \big(y_i - f(x_i;\theta)\big)^2$
  • Since this minimizes the squared Euclidean distance of the estimation error (or prediction error), it is also known as least squares curve fitting.
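For models that are linear in θ, the least-squares problem has a closed-form solution; below is a sketch (my own, assuming the AWGN model above and a hypothetical quadratic example) using numpy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(2)

# Data from a quadratic with AWGN: y = 0.5 x^2 - x + 2 + e
x = rng.uniform(-3, 3, 150)
y = 0.5 * x**2 - 1.0 * x + 2.0 + rng.normal(0, 0.4, 150)

Phi = np.stack([np.ones_like(x), x, x**2], axis=1)   # columns: 1, x, x^2
theta_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # argmin_theta ||Phi theta - y||^2
print(theta_ls)                                      # approximately [2.0, -1.0, 0.5]
```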

  11. MLE & Optimal Regression
  • The above development can be framed in our initial formulation of optimizing the loss of the learning system:
    [Diagram: x → f(·) → ŷ = f(x, θ), with loss L(y, ŷ)]
  • For a regression problem this still applies; the interpretation of f(·) as a predictor becomes even more intuitive.
  • Solving by ML is equivalent to picking a loss identical to the negative of the log of the noise probability density.

  12. Loss for Scalar Noise with Known PDF
  • With $\varepsilon = y - f(x;\theta)$, each additive-error pdf induces a corresponding loss:
    – Gaussian (AWGN case): $P_E(e) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-e^2/(2\sigma^2)}$  →  L2 distance: $L\big(f(x;\theta), y\big) = \big(y - f(x;\theta)\big)^2$
    – Laplacian: $P_E(e) = \dfrac{1}{2\sigma}\, e^{-|e|/\sigma}$  →  L1 distance: $L\big(f(x;\theta), y\big) = \big|y - f(x;\theta)\big|$
    – Rayleigh: $P_E(e) = \dfrac{e}{\sigma^2}\, e^{-e^2/(2\sigma^2)}$  →  "Rayleigh distance": $L\big(f(x;\theta), y\big) = \dfrac{\big(y - f(x;\theta)\big)^2}{2\sigma^2} - \log\big(y - f(x;\theta)\big)$
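A small sketch (my addition) writing the three losses as Python functions, so the correspondence "loss = negative log of the noise pdf, up to additive constants" can be checked numerically; the value of σ is arbitrary.

```python
import numpy as np

sigma = 1.0

def l2_loss(r):        # Gaussian noise  -> squared error
    return r**2

def l1_loss(r):        # Laplacian noise -> absolute error
    return np.abs(r)

def rayleigh_loss(r):  # Rayleigh noise  -> r^2 / (2 sigma^2) - log r   (r > 0)
    return r**2 / (2 * sigma**2) - np.log(r)

r = np.array([0.5, 1.0, 2.0])                  # residuals y - f(x; theta)
print(l2_loss(r), l1_loss(r), rayleigh_loss(r))
```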

  13. Maximum Likelihood Estimation
  • How do we find the optimal parameters?
  • Recall that to obtain the MLE we need to solve
      $\theta^* = \arg\max_\theta P_{\mathcal{D}}(\mathcal{D};\theta)$
  • The unique local solutions are the parameter values $\hat{\theta}$ such that
      $\nabla_\theta P_{\mathcal{D}}(\mathcal{D};\theta)\big|_{\hat{\theta}} = 0 \quad\text{and}\quad \gamma^T \Big[\nabla_\theta^2 P_{\mathcal{D}}(\mathcal{D};\theta)\big|_{\hat{\theta}}\Big]\, \gamma < 0, \;\; \forall \gamma \neq 0$
  • Note that you always have to check the second-order Hessian condition!

  14. Maximum Likelihood Estimation
  Recall some important results.
  • FACT: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be (strictly) positive definite:
    i) $x^T A x > 0, \;\; \forall x \neq 0$
    ii) All eigenvalues of A are real and satisfy $\lambda_i > 0$
    iii) All upper-left submatrices $A_k$ have strictly positive determinant (strictly positive leading principal minors)
    iv) There exists a matrix R with independent rows such that $A = R R^T$. Equivalently, there exists a matrix Q with independent columns such that $A = Q^T Q$.
  • Definition of the upper-left submatrices:
      $A_1 = a_{1,1}, \qquad A_2 = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}, \qquad A_3 = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix}$
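These equivalent conditions are easy to check numerically; the sketch below (my own) tests a symmetric matrix via its eigenvalues and its leading principal minors, and builds a positive definite example via condition (iv).

```python
import numpy as np

def is_positive_definite(A, tol=1e-12):
    """Check a real symmetric matrix A using conditions (ii) and (iii)."""
    A = np.asarray(A, dtype=float)
    eig_ok = np.all(np.linalg.eigvalsh(A) > tol)                 # all eigenvalues > 0
    minors_ok = all(np.linalg.det(A[:k, :k]) > tol               # leading principal minors > 0
                    for k in range(1, A.shape[0] + 1))
    return eig_ok and minors_ok

R = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]])                 # independent rows
print(is_positive_definite(R @ R.T))                             # True, by condition (iv)
print(is_positive_definite(np.array([[1.0, 2.0], [2.0, 1.0]])))  # False (eigenvalues 3, -1)
```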

  15. Vector Derivatives
  • To compute the gradient and Hessian it is useful to rely on vector derivatives (defined as row operators).
  • Some important identities that we will use:
      $\nabla_\theta (A\theta) = A$
      $\nabla_\theta (\theta^T A \theta) = \theta^T (A + A^T)$
      $\nabla_\theta \|A\theta - b\|^2 = 2(A\theta - b)^T A$
  • To find the equivalent column-gradient identities, simply transpose the above.
  • There are various lists of the most popular formulas; just Google "vector derivatives" or "matrix derivatives".
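A finite-difference check (my addition) of the third identity, $\nabla_\theta \|A\theta - b\|^2 = 2(A\theta - b)^T A$, under the row-operator convention stated above; the matrices are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
A, b, theta = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=3)

g = lambda th: np.sum((A @ th - b)**2)         # g(theta) = ||A theta - b||^2

analytic = 2 * (A @ theta - b) @ A             # 2 (A theta - b)^T A   (row vector)
eps = 1e-6
numeric = np.array([(g(theta + eps * np.eye(3)[i]) - g(theta - eps * np.eye(3)[i])) / (2 * eps)
                    for i in range(3)])        # central finite differences
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```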

  16. MLE for Known Additive Noise PDF
  • For regression this becomes
      $\hat{\theta}_{ML} = \arg\max_\theta \sum_i \log P_{Y|X}(y_i \mid x_i;\theta) = \arg\min_\theta \sum_i L(y_i, x_i;\theta)$
  • The unique locally optimal solutions $\hat{\theta}$ are given by
      $\nabla_\theta \sum_i L(y_i, x_i;\theta)\Big|_{\hat{\theta}} = 0 \quad\text{and}\quad \gamma^T \Big[\nabla_\theta^2 \sum_i L(y_i, x_i;\theta)\Big|_{\hat{\theta}}\Big]\, \gamma > 0, \;\; \forall \gamma \neq 0$

  17. MLE and Regression
  • Noting that the vector derivative and Hessian are linear operators (because derivatives are linear operators), these conditions can be written as
      $\sum_i \nabla_\theta L(y_i, x_i;\theta)\Big|_{\hat{\theta}} = 0$
    and
      $\gamma^T \Big[\sum_i \nabla_\theta^2 L(y_i, x_i;\theta)\Big|_{\hat{\theta}}\Big]\, \gamma > 0, \;\; \forall \gamma \neq 0$
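For the AWGN/least-squares case both conditions can be verified directly: at the least-squares solution the summed gradient is numerically zero, and the summed Hessian is $2\Phi^T\Phi$, which is positive definite whenever the design matrix Φ has independent columns. The sketch below (my own illustration, with hypothetical data) performs this check.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 100)

Phi = np.stack([np.ones_like(x), x], axis=1)        # linear-in-theta model: f(x; theta) = Phi theta
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Sum over i of the per-point L2 loss gradients and Hessians:
grad = np.sum(-2 * (y - Phi @ theta_hat)[:, None] * Phi, axis=0)   # ~0 at the least-squares solution
hess = 2 * Phi.T @ Phi                                              # positive definite if Phi has full column rank
print(np.allclose(grad, 0, atol=1e-8), np.all(np.linalg.eigvalsh(hess) > 0))
```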
