

SLIDE 1

PATTERN RECOGNITION AND MACHINE LEARNING

Slide Set 2: Estimation Theory
October 2019
Heikki Huttunen, heikki.huttunen@tuni.fi

Signal Processing, Tampere University

SLIDE 2

Classical Estimation and Detection Theory

  • Before the machine learning part, we will take a look at classical estimation theory.
  • Estimation theory has many connections to the foundations of modern machine learning.
  • Outline of the next few hours:

1 Estimation theory:
  • Fundamentals
  • Maximum likelihood
  • Examples

2 Detection theory:
  • Fundamentals
  • Error metrics
  • Examples


SLIDE 3

Introduction – Estimation

  • Our goal is to estimate the values of a group of parameters from data.
  • Examples: radar, sonar, speech, image analysis, biomedicine, communications, control, seismology, etc.
  • Parameter estimation: Given an N-point data set x = {x[0], x[1], . . . , x[N − 1]} which depends on the unknown parameter θ ∈ R, we wish to design an estimator g(·) for θ:
$$\hat{\theta} = g(x[0], x[1], \ldots, x[N-1]).$$

  • The fundamental questions are:

1 What is the model for our data?
2 How to determine its parameters?

SLIDE 4

Introductory Example – Straight line

  • Suppose we have the illustrated time series and would like to approximate the relationship of the two coordinates.
  • The relationship looks linear, so we could assume the following model:
$$y[n] = a\,x[n] + b + w[n],$$
with a ∈ R and b ∈ R unknown and w[n] ∼ N(0, σ²).
  • N(0, σ²) is the normal distribution with mean 0 and variance σ².

[Figure: scatter plot of the example data (x vs. y)]


SLIDE 5

Introductory Example – Straight line

  • Each pair of a and b represents one line.
  • Which line of the three would best describe the data set? Or some other line?

[Figure: the data with three candidate lines; Model candidate 1 (a = 0.07, b = 0.49), Model candidate 2 (a = 0.06, b = 0.33), Model candidate 3 (a = 0.08, b = 0.51)]


SLIDE 6

Introductory Example – Straight line

  • It can be shown that the best solution (in the maximum likelihood sense; to be defined later) is given by
$$\hat{a} = -\frac{6}{N(N+1)} \sum_{n=0}^{N-1} y(n) + \frac{12}{N(N^2-1)} \sum_{n=0}^{N-1} x(n)\,y(n)$$
$$\hat{b} = \frac{2(2N-1)}{N(N+1)} \sum_{n=0}^{N-1} y(n) - \frac{6}{N(N+1)} \sum_{n=0}^{N-1} x(n)\,y(n).$$
  • Or, as we will later learn, in an easy matrix form:
$$\hat{\theta} = \begin{bmatrix} \hat{a} \\ \hat{b} \end{bmatrix} = (X^T X)^{-1} X^T y$$
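A minimal NumPy sketch of this fit (not from the slides; the data here is a synthetic stand-in with a ≈ 0.07 and b ≈ 0.5):

import numpy as np

# Synthetic data resembling the example: y = 0.07 x + 0.5 plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 50)
y = 0.07 * x + 0.5 + rng.normal(0.0, 0.1, size=x.size)

# Least squares with the design matrix X = [x 1].
X = np.column_stack([x, np.ones_like(x)])
a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print("a = %.4f, b = %.4f" % (a_hat, b_hat))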


SLIDE 7

Introductory Example – Straight line

  • In this case, $\hat{a} = 0.07401$ and $\hat{b} = 0.49319$, which produces the line shown on the right.
  • The line also minimizes the squared distances (green dashed lines) between the model (blue line) and the data (red circles).

[Figure: best fit y = 0.0740x + 0.4932; sum of squares = 3.62]


SLIDE 8

Introductory Example 2 – Sinusoid

  • Consider transmitting the sinusoid below.

[Figure: the transmitted sinusoid]


SLIDE 9

Introductory Example 2 – Sinusoid

  • When the data is received, it is corrupted by noise, and the received samples look like below.

[Figure: the received noisy samples]

  • Can we recover the parameters of the sinusoid?


SLIDE 10

Introductory Example 2 – Sinusoid

  • In this case, the problem is to find good values for A, f0 and φ in the following model:
$$x[n] = A\cos(2\pi f_0 n + \phi) + w[n], \quad \text{with } w[n] \sim \mathcal{N}(0, \sigma^2).$$


SLIDE 11

Introductory Example 2 – Sinusoid

  • It can be shown that the maximum likelihood estimates (MLE) of the parameters A, f0 and φ are given by
$$\hat{f}_0 = \text{the value of } f \text{ that maximizes } \left|\sum_{n=0}^{N-1} x(n)\,e^{-2\pi i f n}\right|,$$
$$\hat{A} = \frac{2}{N} \left|\sum_{n=0}^{N-1} x(n)\,e^{-2\pi i \hat{f}_0 n}\right|,$$
$$\hat{\phi} = \arctan\left(\frac{-\sum_{n=0}^{N-1} x(n)\sin(2\pi \hat{f}_0 n)}{\sum_{n=0}^{N-1} x(n)\cos(2\pi \hat{f}_0 n)}\right).$$
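A minimal simulation sketch of these estimators (not from the slides; the noise level, frequency-grid resolution and true parameter values are illustrative assumptions):

import numpy as np

# Generate a noisy sinusoid; the true parameters mimic the example on the next slide.
rng = np.random.default_rng(1)
N = 160
n = np.arange(N)
f0, A, phi = 0.068, 0.667, 0.609
x = A * np.cos(2 * np.pi * f0 * n + phi) + rng.normal(0.0, 1.0, N)

# MLE of f0: maximize |sum_n x(n) exp(-2*pi*i*f*n)| over a dense grid of f.
freqs = np.linspace(0.001, 0.5, 5000)
spectrum = np.abs(np.exp(-2j * np.pi * np.outer(freqs, n)) @ x)
f0_hat = freqs[np.argmax(spectrum)]

# Plug f0_hat into the amplitude and phase formulas above.
A_hat = (2.0 / N) * np.abs(np.sum(x * np.exp(-2j * np.pi * f0_hat * n)))
phi_hat = np.arctan2(-np.sum(x * np.sin(2 * np.pi * f0_hat * n)),
                     np.sum(x * np.cos(2 * np.pi * f0_hat * n)))
print("f0 = %.3f, A = %.3f, phi = %.3f" % (f0_hat, A_hat, phi_hat))

Here np.arctan2 is used instead of plain arctan so that the phase estimate lands in the correct quadrant.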


SLIDE 12

Introductory Example 2 – Sinusoid

  • It turns out that the sinusoidal parameter estimation is very successful:

[Figure: $\hat{f}_0$ = 0.068 (0.068); $\hat{A}$ = 0.692 (0.667); $\hat{\phi}$ = 0.238 (0.609)]

  • The blue curve is the original sinusoid, and the red curve is the one estimated from the green circles.
  • The estimates are shown in the figure (true values in parentheses).


SLIDE 13

Introductory Example 2 – Sinusoid

  • However, the results are different for each realization of the noise w[n].

[Figure: four runs on different noise realizations:
$\hat{f}_0$ = 0.068, $\hat{A}$ = 0.652, $\hat{\phi}$ = −0.023;
$\hat{f}_0$ = 0.066, $\hat{A}$ = 0.660, $\hat{\phi}$ = 0.851;
$\hat{f}_0$ = 0.067, $\hat{A}$ = 0.786, $\hat{\phi}$ = 0.814;
$\hat{f}_0$ = 0.459, $\hat{A}$ = 0.618, $\hat{\phi}$ = 0.299]

SLIDE 14

Introductory Example 2 – Sinusoid

  • Thus, we're not very interested in an individual case, but rather in the distributions of the estimates:
  • What are the expectations $E[\hat{f}_0]$, $E[\hat{\phi}]$ and $E[\hat{A}]$?
  • What are their respective variances?
  • Could there be a better formula that would yield smaller variance?
  • If yes, how to discover the better estimators?


SLIDE 15

LEAST SQUARES


SLIDE 16

Least Squares

  • The general solution for the linear case is easy to remember.
  • Consider the line fitting case: y[n] = ax[n] + b + w[n].
  • This can be written in matrix form as follows:
$$\underbrace{\begin{bmatrix} y[0] \\ y[1] \\ \vdots \\ y[N-1] \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} x[0] & 1 \\ x[1] & 1 \\ \vdots & \vdots \\ x[N-1] & 1 \end{bmatrix}}_{X} \underbrace{\begin{bmatrix} a \\ b \end{bmatrix}}_{\theta} + w.$$


SLIDE 17

Least Squares

  • Now the model is written compactly as
$$y = X\theta + w.$$
  • Solution: the value of θ minimizing the error $w^T w$ is given by
$$\hat{\theta}_{LS} = \left(X^T X\right)^{-1} X^T y.$$
  • See sample Python code here: https://github.com/mahehu/SGN-41007/blob/master/code/Least_Squares.ipynb
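As a quick sanity check (this derivation is not spelled out on the slide), the formula follows by setting the gradient of the squared error to zero:

$$w^T w = (y - X\theta)^T (y - X\theta)$$
$$\frac{\partial}{\partial \theta}\,(y - X\theta)^T (y - X\theta) = -2X^T y + 2X^T X\theta = 0 \quad\Rightarrow\quad X^T X\,\hat{\theta} = X^T y.$$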


SLIDE 18

LS Example with two variables

  • Consider an example where microscope images have an uneven illumination.
  • The illumination pattern appears to have the highest brightness at the center.
  • Let's try to fit a paraboloid to remove the effect:
$$z(x, y) = c_1 x^2 + c_2 y^2 + c_3 xy + c_4 x + c_5 y + c_6,$$
with z(x, y) the brightness at pixel (x, y).


SLIDE 19

LS Example with two variables

  • In matrix form, the model looks like the following:
$$\underbrace{\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{bmatrix}}_{z} = \underbrace{\begin{bmatrix} x_1^2 & y_1^2 & x_1 y_1 & x_1 & y_1 & 1 \\ x_2^2 & y_2^2 & x_2 y_2 & x_2 & y_2 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_N^2 & y_N^2 & x_N y_N & x_N & y_N & 1 \end{bmatrix}}_{H} \cdot \underbrace{\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \\ c_6 \end{bmatrix}}_{c} + \epsilon, \tag{1}$$
with $z_k$ the grayscale value at the k'th pixel $(x_k, y_k)$.


SLIDE 20

LS Example with two variables

  • As a result we get the LS fit by
$$\hat{c} = \left(H^T H\right)^{-1} H^T z,$$
$$\hat{c} = (-0.000080, -0.000288, 0.000123, 0.022064, 0.284020, 106.538687).$$
  • Or, in other words,
$$z(x, y) = -0.000080x^2 - 0.000288y^2 + 0.000123xy + 0.022064x + 0.284020y + 106.538687.$$
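A minimal sketch of the fit (not from the slides; img is a hypothetical 2-D grayscale image array, replaced here by random stand-in data):

import numpy as np

# img: a 2-D grayscale image (stand-in data for illustration).
img = np.random.default_rng(2).normal(100.0, 5.0, (256, 256))

# Pixel coordinate grids, flattened to vectors.
ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
x = xs.ravel().astype(float)
y = ys.ravel().astype(float)
z = img.ravel()

# Build H exactly as in Eq. (1): columns x^2, y^2, xy, x, y, 1.
H = np.column_stack([x**2, y**2, x * y, x, y, np.ones_like(x)])
c_hat = np.linalg.lstsq(H, z, rcond=None)[0]

# Remove the fitted illumination component by subtraction (cf. next slide).
illumination = (H @ c_hat).reshape(img.shape)
corrected = img - illumination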


SLIDE 21

LS Example with two variables

  • Finally, we remove the illumination component by subtraction.


SLIDE 22

MAXIMUM LIKELIHOOD


SLIDE 23

Maximum Likelihood Estimation

  • Maximum likelihood estimation (MLE) is the most popular estimation approach due to its applicability in complicated estimation problems.
  • Maximization of likelihood also appears often as the optimality criterion in machine learning.
  • The method was proposed by Fisher in 1922, though he published the basic principle already in 1912 as a third-year undergraduate.
  • The basic principle is simple: find the parameter θ that is the most probable to have generated the data x.
  • The ML estimator may or may not be optimal in the minimum variance sense. It is not necessarily unbiased, either.


SLIDE 24

The Likelihood Function

  • Consider again the problem of estimating the mean level A of noisy data.
  • Assume that the data originates from the following model:
$$x[n] = A + w[n],$$
where w[n] ∼ N(0, σ²): a constant plus Gaussian random noise with zero mean and variance σ².

[Figure: noisy constant-level signal]


SLIDE 25

The Likelihood Function

  • The key to maximum likelihood is the likelihood function.
  • If the probability density function (PDF) of the data is viewed as a function of the unknown parameter (with fixed data), it is called the likelihood function.
  • Often the likelihood function has an exponential form. Then it is usual to take the natural logarithm to get rid of the exponential. Note that the location of the maximum of the new log-likelihood function does not change.

[Figure: PDF of A assuming x[0] = 3, with the likelihood and log-likelihood curves]


SLIDE 26

MLE Example

  • In our case of estimating the mean of a signal
$$x[n] = A + w[n], \quad n = 0, 1, \ldots, N-1,$$
the likelihood function is easy to construct.
  • The noise samples w[n] ∼ N(0, σ²) are assumed independent, so the PDF of the whole batch of samples x = (x[0], . . . , x[N − 1]) is obtained by the product rule:
$$p(x; A) = \prod_{n=0}^{N-1} p(x[n]; A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)^2\right)$$
  • When we have observed the data x, we can turn the problem around and consider what is the most likely parameter A to have generated the data.


SLIDE 27

MLE Example

  • Some authors emphasize this by turning the order around, p(A; x), or by giving the function a different name, such as L(A; x) or ℓ(A; x).
  • So, consider p(x; A) as a function of A and try to maximize it.


SLIDE 28

MLE Example

  • The picture below shows the likelihood function and the log-likelihood function for one possible realization of data.
  • The data consists of 50 points, with true A = 5.
  • The likelihood function gives the probability of observing these particular points with different values of A.

[Figure: likelihood of A (max at 5.17) and the corresponding log-likelihood for the observed samples]


SLIDE 29

MLE Example

  • Instead of finding the maximum from the plot, we wish to have a closed form solution.
  • A closed form solution is faster, more elegant, more accurate and numerically more stable.
  • Just for the sake of an example, below is the code for the stupid version. (The helpers gaussian and gaussian_log were not defined on the slide; standard Gaussian PDF implementations and stand-in x0 data are filled in here.)

import numpy as np

# Helpers not shown on the slide: the Gaussian PDF and its logarithm.
def gaussian(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def gaussian_log(x, mu, sigma):
    return -(x - mu)**2 / (2 * sigma**2) - 0.5 * np.log(2 * np.pi * sigma**2)

# The samples are in an array called x0 (stand-in data: 50 points, true A = 5).
x0 = np.random.default_rng(0).normal(5.0, 1.0, 50)

x = np.linspace(2, 8, 200)   # candidate values of A
likelihood = []
log_likelihood = []
for A in x:
    likelihood.append(gaussian(x0, A, 1).prod())
    log_likelihood.append(gaussian_log(x0, A, 1).sum())
print("Max likelihood is at %.2f" % (x[np.argmax(log_likelihood)]))


SLIDE 30

MLE Example

  • Maximization of p(x; A) directly is nontrivial. Therefore, we take the logarithm and maximize it instead:
$$p(x; A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)^2\right)$$
$$\ln p(x; A) = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)^2$$
  • The maximum is found via differentiation:
$$\frac{\partial \ln p(x; A)}{\partial A} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)$$


SLIDE 31

MLE Example

  • Setting this equal to zero gives
$$\frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n] - A) = 0$$
$$\sum_{n=0}^{N-1}(x[n] - A) = 0$$
$$\sum_{n=0}^{N-1} x[n] - NA = 0$$
$$\sum_{n=0}^{N-1} x[n] = NA$$
$$\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$$
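A quick numeric check (a sketch assuming the same stand-in x0 data as in the brute-force code above): the closed form should agree with the grid search up to the grid resolution.

import numpy as np

x0 = np.random.default_rng(0).normal(5.0, 1.0, 50)  # same stand-in data as before
print("Closed-form MLE (sample mean): %.2f" % np.mean(x0))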

SLIDE 32

Conclusion

  • What did we actually do?
  • We proved that the sample mean is the maximum likelihood estimator of the distribution mean.
  • But I could have guessed this result from the beginning. What's the point?
  • We can do the same thing for cases where you cannot guess.


SLIDE 33

Example: Sinusoidal Parameter Estimation

  • Consider the model
$$x[n] = A\cos(2\pi f_0 n + \phi) + w[n]$$
with w[n] ∼ N(0, σ²). It is possible to find the MLE for all three parameters: $\theta = [A, f_0, \phi]^T$.
  • The PDF is given as
$$p(x; \theta) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\Big(\underbrace{x[n] - A\cos(2\pi f_0 n + \phi)}_{w[n]}\Big)^2\right)$$


SLIDE 34

Example: Sinusoidal Parameter Estimation

  • Instead of proceeding directly through the log-likelihood function, we note that the above function is maximized when
$$J(A, f_0, \phi) = \sum_{n=0}^{N-1} \left(x[n] - A\cos(2\pi f_0 n + \phi)\right)^2$$
is minimized.
  • The minimum of this function can be found, although it is a nontrivial task (about 10 slides).
  • We skip the derivation; for details, see S. M. Kay, "Fundamentals of Statistical Signal Processing: Estimation Theory," 1993.


SLIDE 35

Sinusoidal Parameter Estimation

  • The MLE of the frequency f0 is obtained by maximizing the periodogram:
$$\hat{f}_0 = \arg\max_f \left|\sum_{n=0}^{N-1} x[n]\exp(-j2\pi f n)\right|$$
  • Once $\hat{f}_0$ is available, proceed by calculating the other parameters:
$$\hat{A} = \frac{2}{N}\left|\sum_{n=0}^{N-1} x[n]\exp(-j2\pi \hat{f}_0 n)\right|$$
$$\hat{\phi} = \arctan\left(\frac{-\sum_{n=0}^{N-1} x[n]\sin(2\pi \hat{f}_0 n)}{\sum_{n=0}^{N-1} x[n]\cos(2\pi \hat{f}_0 n)}\right)$$


SLIDE 36

Sinusoidal Parameter Estimation—Experiments

  • Four example runs of the estimation algorithm are illustrated in the figures.
  • The algorithm was also tested on 10000 realizations of a sinusoid with fixed θ and N = 160, σ² = 1.2.
  • Note that the estimator is not unbiased. A Monte Carlo sketch of this experiment follows below.

[Figure: histograms of 10000 estimates of f0 (true = 0.07), A (true = 0.67), and φ (true = 0.61)]
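A minimal Monte Carlo sketch (not from the slides; it reuses the grid-search frequency estimator from the earlier sinusoid sketch, with a coarser grid and fewer realizations to keep the runtime short):

import numpy as np

rng = np.random.default_rng(3)
N = 160
n = np.arange(N)
f0, A, phi = 0.07, 0.67, 0.61

# Precompute the frequency-grid transform matrix once.
freqs = np.linspace(0.001, 0.5, 1000)
E = np.exp(-2j * np.pi * np.outer(freqs, n))

f0_hats = []
for _ in range(1000):   # 1000 realizations (the slides use 10000)
    x = A * np.cos(2 * np.pi * f0 * n + phi) + rng.normal(0.0, np.sqrt(1.2), N)
    f0_hats.append(freqs[np.argmax(np.abs(E @ x))])

f0_hats = np.array(f0_hats)
print("E[f0_hat] = %.4f, std = %.4f" % (f0_hats.mean(), f0_hats.std()))

The mean and standard deviation of f0_hats summarize the histogram on the slide; occasional outliers (like the f0 = 0.459 run on slide 13) come from noise peaks in the periodogram.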


SLIDE 37

Estimation Theory—Summary

  • We have seen a brief overview of estimation theory, with particular focus on maximum likelihood.
  • If your problem is simple enough to be modeled by an equation, estimation theory is the answer.
  • Estimating the frequency of a sinusoid is completely solved by the classical theory.
  • Estimating the age of a person in a picture cannot possibly be modeled this simply, and the classical theory has no answer.
  • Model-based estimation is the best answer when a model exists.
  • Machine learning can be understood as a data-driven approach.
