
SLIDE 1

MLE & Regression

Ken Kreutz-Delgado (Nuno Vasconcelos)

UCSD – ECE 175A – Winter 2012

SLIDE 2

Statistical Learning

Goal: Given a relationship between a feature vector x and a vector y, and iid sample data (xi, yi), find an approximating function f(x) ≈ y. This is called training or learning. Two major types of learning:

  • Unsupervised Classification (aka Clustering) or Regression ("blind" curve fitting): only X is known.

  • Supervised Classification or Regression: both X and the target value Y are known during training; only X is known at test time.

[Diagram: x → f(·) → ŷ = f(x) ≈ y]

SLIDE 3

Supervised Learning & Regression

  • X can be anything, but the type of Y dictates the type of supervised learning problem:

– Y ∈ {0, 1} is referred to as detection
– Y ∈ {0, ..., M−1} is referred to as (M-ary) classification
– Y continuous is referred to as regression

  • We have been dealing mostly with classification; now we will emphasize regression.

  • The regression problem provides a relatively easy setting in which to explain non-trivial MLE problems.

SLIDE 4

The Standard Regression Model

  • The regression problem is usually modeled as follows:

– There are two random vectors: the independent (regressor) variable X and the dependent (regressed) variable Y.
– An iid dataset of training examples D = {(x1,y1), ..., (xn,yn)}.
– An additive-noise parametric model of the form

$$Y = f(X; \theta) + E$$

where $\theta \in \mathbb{R}^p$ is a deterministic parameter vector and E is an iid additive random vector that accounts for noise and model error.

  • Two fundamental types of regression problems:

– Linear regression, where f(·) is linear in θ
– Nonlinear regression, otherwise
– What matters is linearity in the parameter θ, not in the data X!
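A minimal numerical sketch may help make this model concrete. The choice of a line for f and of Gaussian noise for E are illustrative assumptions (the slides keep f and the noise density generic), and all names below are hypothetical:

```python
# Sketch of the additive-noise regression model Y = f(X; theta) + E.
import numpy as np

rng = np.random.default_rng(0)

def f(x, theta):
    """Illustrative parametric model: a line, f(x; theta) = theta[0] + theta[1]*x."""
    return theta[0] + theta[1] * x

theta_true = np.array([1.0, 2.0])   # deterministic (unknown) parameter vector
n = 100
x = rng.uniform(-1.0, 1.0, size=n)  # iid draws of the regressor X
e = rng.normal(0.0, 0.3, size=n)    # iid additive noise E, independent of X
y = f(x, theta_true) + e            # realizations of the dependent variable Y
```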

SLIDE 5

Example Regression Models

  • Linear Regression (linear in the parameters θ):

– Line fitting: $f(x;\theta) = \theta_0 + \theta_1 x$
– Polynomial fitting: $f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i$
– Truncated Fourier series: $f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(ix)$

  • Nonlinear Regression (the parameters enter nonlinearly):

– Neural networks, e.g. a sigmoid unit $f(x;\theta) = \dfrac{1}{1 + e^{\theta_1 x}}$
– Sinusoidal decompositions with unknown frequencies inside the cosine, e.g. $f(x;\theta) = \sum_{i=1}^{k} \theta_i \cos(\theta_{k+i}\, x)$
– Etc.

  • We often assume that E is additive white Gaussian noise (AWGN)
  • We always assume that E and X are independent
SLIDE 6

Probabilistic Model of Y Conditioned on X

  • Consider a realization X = x, E = e, Y = y:

– x is (almost) always known; the goal is to predict y given x
– Thus, for each x, f(x;θ) is treated like a constant
– The realization E = e is added to f(x;θ) to form

$$y = f(x;\theta) + e$$

– Hence, Y is conditionally distributed as E but with a constant added. This only changes the mean of the noise distribution $P_E(\varepsilon;\theta)$, yielding

$$P_{Y|X}(y\,|\,x;\theta) = P_E\big(y - f(x;\theta);\theta\big)$$

– The conditional probability model for Y|X is thus determined by the distribution of the noise, $P_E(\varepsilon;\theta)$! Note that the noise pdf might itself depend on the unknown parameter vector θ.

SLIDE 7

The (Conditional) Likelihood Function

  • Consider a collection of iid training points D = {(x1,y1), ..., (xn,yn)}. If we define X = {x1, ..., xn} and Y = {y1, ..., yn}, we have D = (X, Y).

  • Conditioned on X, the likelihood of θ given D is

$$P_{Y|X}(y\,|\,x;\theta) = \prod_{i=1}^{n} P_{Y|X}(y_i\,|\,x_i;\theta) = \prod_{i=1}^{n} P_E\big(y_i - f(x_i;\theta);\theta\big)$$

  • This is also the X-conditional likelihood of θ given Y.

  • Note: we have used the facts that the yi are conditionally iid and that each yi depends only on xi (both facts being a consequence of our modeling assumptions).

SLIDE 8

Maximum Likelihood Estimation

  • This suggests the following procedure:

– Given a collection of iid training points D = {(x1,y1), ..., (xn,yn)}, the natural procedure for estimating the parameter θ is ML estimation:

$$\hat{\theta}_{ML} = \arg\max_\theta \prod_i P_{Y|X}(y_i\,|\,x_i;\theta) = \arg\max_\theta \prod_i P_E\big(y_i - f(x_i;\theta);\theta\big)$$

– Equivalently,

$$\hat{\theta}_{ML} = \arg\max_\theta \sum_i \log P_{Y|X}(y_i\,|\,x_i;\theta) = \arg\max_\theta \sum_i \log P_E\big(y_i - f(x_i;\theta);\theta\big)$$

– Note that the noise pdf $P_E(\varepsilon;\theta)$ can itself depend on θ.
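The sum-of-logs form lends itself directly to numerical optimization. Below is a hedged sketch that minimizes the negative log-likelihood with scipy.optimize.minimize, assuming the illustrative line model and Gaussian noise from the earlier sketch; the data generation is repeated so the block runs on its own:

```python
# Sketch: MLE by numerically maximizing sum_i log P_E(y_i - f(x_i; theta)).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)                    # regressors
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 100)    # line plus AWGN (assumed model)

sigma = 0.3                                    # noise scale, assumed known here

def neg_log_likelihood(theta):
    residuals = y - (theta[0] + theta[1] * x)  # y_i - f(x_i; theta)
    return -np.sum(norm.logpdf(residuals, scale=sigma))

theta_ml = minimize(neg_log_likelihood, x0=np.zeros(2)).x
print(theta_ml)  # close to the generating parameters (1, 2)
```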

SLIDE 9

AWGN MLE

  • One frequently used model is the scalar AWGN case, where the noise is zero-mean Gaussian with variance σ²:

$$P_E(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left(-\frac{\varepsilon^2}{2\sigma^2}\right)$$

  • In this case the conditional pdf for Y|X is a Gaussian of mean f(x;θ) and variance σ²:

$$P_{Y|X}(y\,|\,x;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left(-\frac{\big(y - f(x;\theta)\big)^2}{2\sigma^2}\right)$$

  • If the variance σ² is unknown, it is included in θ.

SLIDE 10

AWGN MLE

  • Assume the variance σ² is known. Then the MLE is:

$$\hat{\theta}_{ML} = \arg\max_\theta \sum_i \log P_E\big(y_i - f(x_i;\theta)\big) = \arg\min_\theta \left\{ \frac{1}{2\sigma^2}\sum_i \big(y_i - f(x_i;\theta)\big)^2 + \frac{n}{2}\log(2\pi\sigma^2) \right\} = \arg\min_\theta \sum_i \big(y_i - f(x_i;\theta)\big)^2$$

– Since this minimizes the squared Euclidean distance of the estimation error (or prediction error), it is also known as least squares curve fitting.
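A quick numerical check of this equivalence: minimizing the sum of squared errors directly should match a standard least squares routine such as np.polyfit. The data-generating line and noise level are the same illustrative assumptions as before:

```python
# Sketch: under known-variance AWGN, the MLE reduces to plain least squares.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 100)

sse = lambda theta: np.sum((y - theta[0] - theta[1] * x) ** 2)
theta_sse = minimize(sse, x0=np.zeros(2)).x      # direct squared-error minimization

theta_ls = np.polyfit(x, y, deg=1)[::-1]         # [intercept, slope]
print(np.allclose(theta_sse, theta_ls, atol=1e-4))  # True
```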

SLIDE 11

MLE & Optimal Regression

  • The above development can be framed in our initial formulation of optimizing the loss of the learning system.

  • For a regression problem this still applies; the interpretation of f(·) as a predictor becomes even more intuitive.

  • Solving by ML is equivalent to picking a loss identical to the negative of the log of the noise probability density.

[Diagram: x → f(·) → ŷ = f(x), with the loss L(y, ŷ) comparing the prediction ŷ to the target y.]

SLIDE 12

Loss for Scalar Noise with Known PDF

  • Additive error pdf and the loss it induces, with ε = y − f(x;θ):

– Gaussian (AWGN case), giving the L2 distance:

$$P_E(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\varepsilon^2/2\sigma^2} \quad\Longrightarrow\quad L\big(f(x;\theta), y\big) = \big(y - f(x;\theta)\big)^2$$

– Laplacian, giving the L1 distance:

$$P_E(\varepsilon) = \frac{1}{2\sigma}\, e^{-|\varepsilon|/\sigma} \quad\Longrightarrow\quad L\big(f(x;\theta), y\big) = \big|y - f(x;\theta)\big|$$

– Rayleigh, giving (up to constants) the "Rayleigh distance":

$$P_E(\varepsilon) = \frac{\varepsilon}{\sigma^2}\, e^{-\varepsilon^2/2\sigma^2} \quad\Longrightarrow\quad L\big(f(x;\theta), y\big) = \big(y - f(x;\theta)\big)^2 - \log\big(y - f(x;\theta)\big)$$
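A sketch of this correspondence: fitting the same line under the L2 loss (Gaussian assumption) and the L1 loss (Laplacian assumption). With a few gross outliers injected, the L1 fit typically degrades far less; the model and data here are illustrative assumptions, not part of the slides:

```python
# Sketch: the assumed noise density dictates the loss; L1 is more robust.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 100)
y[:3] += 10.0                                   # a few gross outliers

def fit(loss):
    obj = lambda theta: np.sum(loss(y - theta[0] - theta[1] * x))
    # Nelder-Mead handles the non-smooth L1 objective
    return minimize(obj, x0=np.zeros(2), method="Nelder-Mead").x

theta_l2 = fit(lambda r: r ** 2)    # Gaussian noise assumption -> L2 loss
theta_l1 = fit(np.abs)              # Laplacian noise assumption -> L1 loss
print(theta_l2, theta_l1)           # the L1 fit resists the outliers better
```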

SLIDE 13

Maximum Likelihood Estimation

  • How do we find the optimal parameters?

  • Recall that to obtain the MLE we need to solve

$$\theta^* = \arg\max_\theta P_D(D;\theta)$$

  • The locally optimal solutions are the parameter values $\hat{\theta}$ such that

$$\nabla_\theta P_D(D;\hat{\theta}) = 0$$

  • Note that you always have to check the second-order (Hessian) condition!

$$\nabla_\theta \big(\nabla_\theta P_D(D;\hat{\theta})\big)^T \prec 0$$

SLIDE 14

Maximum Likelihood Estimation

Recall some important results:

  • FACT: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be (strictly) positive definite:

i) $x^T A x > 0$ for all $x \neq 0$
ii) All eigenvalues of A are real and satisfy $\lambda_i > 0$
iii) All upper-left submatrices $A_k$ have strictly positive determinant (strictly positive leading principal minors)
iv) There exists a matrix R with independent rows such that $A = R R^T$. Equivalently, there exists a matrix Q with independent columns such that $A = Q^T Q$

  • Definition of upper-left submatrices:

$$A_1 = \begin{bmatrix} a_{1,1} \end{bmatrix}, \qquad A_2 = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}, \qquad A_3 = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix}$$
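These tests are easy to run numerically. A sketch of three of them on an arbitrary symmetric matrix (the Cholesky factorization implements criterion iv, since it exists exactly when A is positive definite):

```python
# Sketch: equivalent positive-definiteness tests from the slide.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# ii) all eigenvalues strictly positive (eigvalsh assumes A is symmetric)
eig_test = np.all(np.linalg.eigvalsh(A) > 0)

# iii) all leading principal minors strictly positive
minor_test = all(np.linalg.det(A[:k, :k]) > 0 for k in range(1, A.shape[0] + 1))

# iv) a factorization A = R R^T exists only for positive definite A
try:
    np.linalg.cholesky(A)
    chol_test = True
except np.linalg.LinAlgError:
    chol_test = False

print(eig_test, minor_test, chol_test)  # all three agree
```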

SLIDE 15

Vector Derivatives

  • To compute the gradient and Hessian it is useful to rely on vector derivatives (defined as row operators).

  • Some important identities that we will use:

$$\nabla_\theta \,(A\theta) = A$$

$$\nabla_\theta \,(\theta^T A \theta) = \theta^T (A + A^T)$$

$$\nabla_\theta \, \lVert b - A\theta \rVert^2 = -2\,(b - A\theta)^T A$$

  • To find the equivalent Cartesian (column) gradient identities, merely transpose the above.

  • There are various lists of the most popular formulas; just Google "vector derivatives" or "matrix derivatives".
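The third identity is the one we will use most, so a quick finite-difference check may be reassuring; the matrix sizes below are arbitrary:

```python
# Sketch: numerically verify grad ||b - A theta||^2 = -2 (b - A theta)^T A.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
theta = rng.normal(size=3)

L = lambda t: np.sum((b - A @ t) ** 2)

analytic = -2.0 * (b - A @ theta) @ A        # row gradient from the identity
h = 1e-6
numeric = np.array([(L(theta + h * e) - L(theta - h * e)) / (2 * h)
                    for e in np.eye(3)])     # central differences
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```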

SLIDE 16

MLE for Known Additive Noise PDF

  • For regression this becomes

$$\hat{\theta}_{ML} = \arg\max_\theta \sum_i \log P_{Y|X}(y_i\,|\,x_i;\theta) = \arg\min_\theta \sum_i L(y_i, x_i; \theta)$$

  • The locally optimal solutions $\hat{\theta}$ are given by

$$\nabla_\theta \sum_i L(y_i, x_i; \hat{\theta}) = 0 \qquad \text{and} \qquad \nabla_\theta \Big(\nabla_\theta \sum_i L(y_i, x_i; \hat{\theta})\Big)^T \succ 0$$

SLIDE 17

MLE and Regression

  • Noting that the vector derivative and Hessian are linear operators (because derivatives are linear operators), these conditions can be written as

$$\sum_i \nabla_\theta L(y_i, x_i; \hat{\theta}) = 0$$

and

$$\sum_i \nabla_\theta \big(\nabla_\theta L(y_i, x_i; \hat{\theta})\big)^T \succ 0$$

SLIDE 18

Example

  • Consider the problem of 2-D line fitting:

– The model is

$$y = f(x;\theta) + \varepsilon = \theta_0 + \theta_1 x + \varepsilon$$

where ε is scalar AWGN of known variance.
– The (effective) loss function is

$$L = \sum_i \big(y_i - \theta_0 - \theta_1 x_i\big)^2$$

– We are looking for the line that makes the square of these vertical distances as small as possible in an averaged sense.
– Our first step is to compute the zeros of the gradient; this amounts to solving a system of linear equations.

SLIDE 19

Example

  • This can be written in matrix form: the stationarity conditions

$$\frac{\partial L}{\partial \theta_0} = -2\sum_i \big(y_i - \theta_0 - \theta_1 x_i\big) = 0, \qquad \frac{\partial L}{\partial \theta_1} = -2\sum_i \big(y_i - \theta_0 - \theta_1 x_i\big)\, x_i = 0$$

become

$$\begin{bmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} = \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{bmatrix}$$

SLIDE 20

Example

  • Defining the sample-averaged quantities

$$\bar{x} = \frac{1}{n}\sum_i x_i, \qquad \bar{y} = \frac{1}{n}\sum_i y_i, \qquad \overline{x^2} = \frac{1}{n}\sum_i x_i^2, \qquad \overline{xy} = \frac{1}{n}\sum_i x_i y_i,$$

we get

$$\begin{bmatrix} 1 & \bar{x} \\ \bar{x} & \overline{x^2} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} = \begin{bmatrix} \bar{y} \\ \overline{xy} \end{bmatrix}$$

SLIDE 21

Example

  • The solution is

$$\begin{bmatrix} \hat{\theta}_0 \\ \hat{\theta}_1 \end{bmatrix}_{ML} = \begin{bmatrix} 1 & \bar{x} \\ \bar{x} & \overline{x^2} \end{bmatrix}^{-1} \begin{bmatrix} \bar{y} \\ \overline{xy} \end{bmatrix} = \frac{1}{\overline{x^2} - \bar{x}^2} \begin{bmatrix} \overline{x^2} & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix} \begin{bmatrix} \bar{y} \\ \overline{xy} \end{bmatrix} = \frac{1}{\overline{x^2} - \bar{x}^2} \begin{bmatrix} \overline{x^2}\,\bar{y} - \bar{x}\,\overline{xy} \\ \overline{xy} - \bar{x}\,\bar{y} \end{bmatrix}$$

SLIDE 22

Example

  • Or, in a form that may be more familiar:

$$\hat{\theta}_1 = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_j y_j}{n\sum_i x_i^2 - \big(\sum_i x_i\big)^2} = \frac{\widehat{\mathrm{cov}}(x,y)}{\widehat{\mathrm{var}}(x)}$$

$$\hat{\theta}_0 = \frac{1}{n}\sum_i y_i - \hat{\theta}_1 \frac{1}{n}\sum_i x_i = \bar{y} - \hat{\theta}_1 \bar{x}$$

  • We also need to check that we have a minimum, via the Hessian:

$$\nabla_\theta \big(\nabla_\theta L\big)^T = 2 \begin{bmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix} = 2n \begin{bmatrix} 1 & \bar{x} \\ \bar{x} & \overline{x^2} \end{bmatrix}$$
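A sketch of this closed-form fit on synthetic data (the generating line and noise level are illustrative assumptions):

```python
# Sketch: closed-form line fit, theta1 = cov(x,y)/var(x), theta0 = ybar - theta1*xbar.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 100)

x_bar, y_bar = x.mean(), y.mean()
xy_bar, x2_bar = (x * y).mean(), (x * x).mean()

theta1_hat = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)  # cov(x,y)/var(x)
theta0_hat = y_bar - theta1_hat * x_bar
print(theta0_hat, theta1_hat)  # close to the generating values (1, 2)
```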
 

SLIDE 23

Example

  • The Hessian

$$\nabla_\theta \big(\nabla_\theta L(\hat{\theta})\big)^T = 2n \begin{bmatrix} 1 & \bar{x} \\ \bar{x} & \overline{x^2} \end{bmatrix}$$

has to be positive definite.

  • Recall that one of the criteria is for the leading principal minors to be strictly positive.

  • Check:

– $1 > 0$
– $\overline{x^2} - \bar{x}^2 = \widehat{\mathrm{var}}(x) = \text{sample variance of } x > 0$

SLIDE 24

Least Squares in General

  • What if I have other models? Can we solve this more generally?

– Note that we can write the line model

$$f(x;\theta) = \theta_0 + \theta_1 x$$

as

$$f(x;\theta) = \gamma(x)^T \theta, \qquad \gamma(x) = \begin{bmatrix} 1 \\ x \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$$

  • This can be generalized to any model if we exploit the assumption of linearity in the (k+1)-vector θ to form

$$f(x;\theta) = \gamma(x)^T \theta, \qquad \gamma(x) = \begin{bmatrix} \gamma_0(x) \\ \vdots \\ \gamma_k(x) \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \vdots \\ \theta_k \end{bmatrix}$$

SLIDE 25

Examples

  • The elements of γ(x) can be arbitrary non-linear functions of x:

– Line fitting: $f(x;\theta) = \theta_0 + \theta_1 x$, with $\gamma(x) = [1 \;\; x]^T$
– Polynomial fitting: $f(x;\theta) = \sum_{i=0}^{k} \theta_i x^i$, with $\gamma(x) = [1 \;\; x \;\; \cdots \;\; x^k]^T$
– Truncated Fourier series: $f(x;\theta) = \sum_{i=0}^{k} \theta_i \cos(ix)$, with $\gamma(x) = [1 \;\; \cos(x) \;\; \cdots \;\; \cos(kx)]^T$

SLIDE 26

Least Squares Parameter Estimation

  • For scalar iid AWGN of known variance, we have the (unweighted) least squares loss function

$$L = \sum_i \big(y_i - \theta_0 - \theta_1 x_i\big)^2$$

which we can write as

$$L = \sum_i \big(y_i - \gamma(x_i)^T \theta\big)^2$$

  • or

$$L = \big\lVert y - \Gamma(x)\,\theta \big\rVert^2$$

  • where

$$\Gamma(x) = \begin{bmatrix} \gamma(x_1)^T \\ \vdots \\ \gamma(x_n)^T \end{bmatrix} \in \mathbb{R}^{n \times (k+1)}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$$

SLIDE 27

Examples

  • The most important component is the matrix Γ(x):

– Line fitting:

$$\Gamma(x) = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$

– Polynomial fitting:

$$\Gamma(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ \vdots & & & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix}$$

– Truncated Fourier series:

$$\Gamma(x) = \begin{bmatrix} 1 & \cos(x_1) & \cdots & \cos(k x_1) \\ \vdots & & & \vdots \\ 1 & \cos(x_n) & \cdots & \cos(k x_n) \end{bmatrix}$$
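A sketch of how these three Γ(x) matrices might be assembled with numpy; the function names are hypothetical:

```python
# Sketch: building Gamma(x), whose i-th row is gamma(x_i)^T.
import numpy as np

def gamma_line(x):
    return np.column_stack([np.ones_like(x), x])

def gamma_poly(x, k):
    return np.column_stack([x ** i for i in range(k + 1)])

def gamma_fourier(x, k):
    # cos(0 * x) = 1 reproduces the constant first column
    return np.column_stack([np.cos(i * x) for i in range(k + 1)])

x = np.linspace(0.0, 1.0, 5)
print(gamma_poly(x, k=2).shape)   # (5, 3): one row gamma(x_i)^T per sample
```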

SLIDE 28

(Unweighted) Least Squares

  • The loss function is

$$L(\theta) = \big\lVert y - \Gamma(x)\,\theta \big\rVert^2$$

  • One way to minimize this is to find a stationary value θ such that

$$\nabla_\theta L(\theta) = -2\big(y - \Gamma(x)\theta\big)^T \Gamma(x) = 0$$

  • or

$$\Gamma(x)^T \Gamma(x)\, \theta = \Gamma(x)^T y$$

from which, assuming that $\Gamma(x)^T\Gamma(x)$ is invertible,

$$\hat{\theta}_{LS} = \big(\Gamma(x)^T \Gamma(x)\big)^{-1} \Gamma(x)^T y$$
SLIDE 29

(Unweighted) Least Squares

  • We next check the Hessian:

$$\nabla_\theta \big(\nabla_\theta L\big)^T = \nabla_\theta \Big(-2\,\Gamma(x)^T \big(y - \Gamma(x)\theta\big)\Big) = 2\,\Gamma(x)^T \Gamma(x)$$

  • This is positive definite iff the columns of Γ(x) are linearly independent, which turns out to be

– the condition for $\Gamma(x)^T\Gamma(x)$ to be invertible,
– which is the condition needed to solve for the stationary point.

  • Note that we usually design Γ(x), so we can always make this true.

  • Usually we only have to make sure all the xi are different.

SLIDE 30

(Unweighted) Least Squares

  • In summary:

– A problem of the type

$$\min_\theta L(\theta) = \big\lVert y - \Gamma(x)\,\theta \big\rVert^2$$

has the least squares solution

$$\hat{\theta}_{LS} = \big(\Gamma(x)^T\Gamma(x)\big)^{-1}\Gamma(x)^T y$$

iff Γ(x) has full column rank.
– The matrix

$$\Gamma(x)^{+} = \big(\Gamma(x)^T\Gamma(x)\big)^{-1}\Gamma(x)^T$$

is called the (Moore-Penrose) pseudo-inverse of Γ(x).
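A sketch comparing three equivalent ways of computing θ̂_LS: solving the normal equations, applying the explicit pseudo-inverse, and calling np.linalg.lstsq (the numerically preferred route, since it avoids forming Γ(x)ᵀΓ(x) explicitly). The line design matrix and data are illustrative:

```python
# Sketch: least squares via normal equations, pseudo-inverse, and lstsq.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 100)

Gamma = np.column_stack([np.ones_like(x), x])          # full-column-rank design

theta_normal = np.linalg.solve(Gamma.T @ Gamma, Gamma.T @ y)
theta_pinv = np.linalg.pinv(Gamma) @ y                 # Moore-Penrose pseudo-inverse
theta_lstsq, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
print(np.allclose(theta_normal, theta_pinv), np.allclose(theta_pinv, theta_lstsq))
```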

SLIDE 31

(Unweighted) Least Squares

  • Here is a way of thinking about this:

– We have an inconsistent system of equations

$$y = \Gamma(x)\,\theta$$

This cannot be solved exactly because, although Γ(x) has full (column) rank, it is "tall" (has more rows than columns) and thus is not invertible. E.g., consider the line:

$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$$

– To make a consistent system, we multiply both sides by Γ(x)ᵀ:

$$\Gamma(x)^T y = \Gamma(x)^T \Gamma(x)\, \theta$$

SLIDE 32

(Unweighted) Least Squares

  • This is now a solvable system. E.g., for the line,

$$\begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$$

whose solution is given by the pseudo-inverse:

$$\hat{\theta}_{LS} = \big(\Gamma(x)^T\Gamma(x)\big)^{-1}\Gamma(x)^T y = \Gamma(x)^{+}\, y$$

  • We have just seen that this is the best approximate solution to the original problem in the (unweighted) least squares sense:

$$\hat{\theta}_{LS} = \arg\min_\theta \big\lVert y - \Gamma(x)\,\theta \big\rVert^2$$

SLIDE 33

(Unweighted) Least Squares

  • In principle, assuming that the matrix Γ(x) has full column rank, the least squares solution is straightforward to compute.

  • For example, let's redo the line example:

$$\frac{1}{n}\,\Gamma(x)^T\Gamma(x) = \frac{1}{n} \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} = \begin{bmatrix} 1 & \bar{x} \\ \bar{x} & \overline{x^2} \end{bmatrix}$$

$$\frac{1}{n}\,\Gamma(x)^T y = \frac{1}{n} \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \bar{y} \\ \overline{xy} \end{bmatrix}$$

SLIDE 34

(Unweighted) Least Squares

  • So that

$$\hat{\theta}_{LS} = \big(\Gamma(x)^T\Gamma(x)\big)^{-1}\Gamma(x)^T y$$

  • leads to

$$\hat{\theta}_{LS} = \begin{bmatrix} 1 & \bar{x} \\ \bar{x} & \overline{x^2} \end{bmatrix}^{-1} \begin{bmatrix} \bar{y} \\ \overline{xy} \end{bmatrix} = \frac{1}{\overline{x^2} - \bar{x}^2} \begin{bmatrix} \overline{x^2}\,\bar{y} - \bar{x}\,\overline{xy} \\ \overline{xy} - \bar{x}\,\bar{y} \end{bmatrix}$$

which is the solution that we had obtained before, but now with less work. Of course, we know from ECE 174 that there is a deep geometric formalism at play here.
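As a sanity check, a sketch verifying that the pseudo-inverse solution for the line reproduces the sample-average formula derived earlier (data and model are the same illustrative assumptions):

```python
# Sketch: pseudo-inverse fit vs. the closed-form sample-average solution.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 100)

theta_pinv = np.linalg.pinv(np.column_stack([np.ones_like(x), x])) @ y

x_bar, y_bar = x.mean(), y.mean()
xy_bar, x2_bar = (x * y).mean(), (x * x).mean()
det = x2_bar - x_bar ** 2
theta_closed = np.array([(x2_bar * y_bar - x_bar * xy_bar) / det,
                         (xy_bar - x_bar * y_bar) / det])
print(np.allclose(theta_pinv, theta_closed))  # True
```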

SLIDE 35

Relationship to Probabilistic Model

  • The (unweighted) least squares solution:

– Is the maximum likelihood estimate of the function f(x;θ) for the scalar model

$$y = f(x;\theta) + \varepsilon$$

– where ε is scalar iid zero-mean Gaussian error (AWGN) of known variance:

$$P_E(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left(-\frac{\varepsilon^2}{2\sigma^2}\right)$$

– The method is general: other models f(x;θ) will lead to other least squares problems.
– If the variance is unknown, we don't have a pure LS problem.
– If we have a vector model, in general we have weighted LS.
– If the error is not Gaussian, the problem is not least squares.

SLIDE 36

END