SLIDE 1

Frequentist Statistics

DS GA 1002 Probability and Statistics for Data Science

http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall17

Carlos Fernandez-Granda

SLIDE 2

Estimation under probabilistic assumptions

Assumption: Data are generated by sampling from a probabilistic model
Aim: Analyze statistical techniques and derive guarantees
Frequentist framework: The distribution generating the data is fixed

SLIDE 3

Independent identically-distributed sampling
Mean-square error
Consistency
Confidence intervals
Nonparametric model estimation
Parametric estimation

SLIDE 4

Independent identically-distributed sampling

Assumption: Data are iid samples
Holds for controlled experiments (randomized trials to test drugs)
Often a good approximation (polling)

SLIDE 5

Independent identically-distributed sampling

$X^{(1)}, X^{(2)}, X^{(3)}, X^{(4)}, \dots, X^{(n)}$

SLIDE 6

Sampling from a population

Population of $m$ individuals
We are interested in a feature associated to each person (cholesterol level, salary, who they are voting for, ...)
The feature has $k$ possible values $\{z_1, z_2, \dots, z_k\}$
$m_j$ = number of people for whom the feature equals $z_j$

SLIDE 7

Sampling from a population

Data: values of the feature for a subset of individuals
If individuals are chosen uniformly at random with replacement,

$p_{X^{(i)}}(z_j) = \mathrm{P}\left(\text{the feature of the } i\text{th chosen person equals } z_j\right) = \frac{m_j}{m}, \quad 1 \le j \le k$

The sequence is iid
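As a quick illustration (not part of the original slides), the sketch below simulates sampling with replacement from a small synthetic population and checks that the empirical frequencies of the feature values approach $m_j / m$; the population values and sample size are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population: m = 1000 individuals, feature takes k = 3 values
population = np.array(["A"] * 500 + ["B"] * 300 + ["C"] * 200)

# iid sampling: choose individuals uniformly at random *with replacement*
n = 10_000
sample = rng.choice(population, size=n, replace=True)

for z in ["A", "B", "C"]:
    true_p = np.mean(population == z)    # m_j / m
    empirical_p = np.mean(sample == z)   # frequency in the iid sample
    print(f"value {z}: true pmf {true_p:.3f}, empirical {empirical_p:.3f}")
```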

SLIDE 8

Independent identically-distributed sampling
Mean-square error
Consistency
Confidence intervals
Nonparametric model estimation
Parametric estimation

SLIDE 9

Estimator

A deterministic function of the data $x_1, x_2, \dots, x_n$:

$y := h(x_1, x_2, \dots, x_n)$

Aim: estimate a quantity $\gamma$ related to the underlying distribution

SLIDE 10

Estimator

If the data are samples from a probabilistic model, then $y$ is a realization of the random variable

$Y^{(n)} := h\left(X^{(1)}, X^{(2)}, \dots, X^{(n)}\right)$

SLIDE 11

Mean square error

The mean square error (MSE) of an estimator $Y$ that approximates a quantity $\gamma$ is

$\operatorname{MSE}(Y) := \mathrm{E}\left[(Y - \gamma)^2\right]$
SLIDE 12-15

Bias-variance decomposition

$\operatorname{MSE}(Y) = \mathrm{E}\left[(Y - \gamma)^2\right]$

$= \mathrm{E}\left[(Y - \mathrm{E}(Y) + \mathrm{E}(Y) - \gamma)^2\right]$

$= \mathrm{E}\left[(Y - \mathrm{E}(Y))^2\right] + (\mathrm{E}(Y) - \gamma)^2 + 2\,(\mathrm{E}(Y) - \gamma)\,\underbrace{\mathrm{E}(Y - \mathrm{E}(Y))}_{\mathrm{E}(Y) - \mathrm{E}(Y) = 0}$

$= \underbrace{\mathrm{E}\left[(Y - \mathrm{E}(Y))^2\right]}_{\text{variance}} + \underbrace{(\mathrm{E}(Y) - \gamma)^2}_{\text{bias}^2}$
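A small simulation (added here as a sketch, not from the slides) confirms the decomposition numerically for a deliberately biased estimator of the mean, the shrunken average $0.9\,Y_n$; the distribution and constants are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 2.0          # quantity to estimate: the mean of the distribution
n, trials = 50, 200_000

# A deliberately biased estimator: shrink the empirical mean toward zero
samples = rng.normal(loc=gamma, scale=3.0, size=(trials, n))
Y = 0.9 * samples.mean(axis=1)

mse = np.mean((Y - gamma) ** 2)
variance = np.var(Y)
bias_sq = (np.mean(Y) - gamma) ** 2

print(f"MSE               {mse:.5f}")
print(f"variance + bias^2 {variance + bias_sq:.5f}")   # matches up to Monte Carlo error
```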
SLIDE 16

Unbiased estimator

An estimator $Y$ that approximates $\gamma$ is unbiased if and only if $\mathrm{E}(Y) = \gamma$

SLIDE 17-20

Empirical mean is unbiased

The empirical mean of an iid sequence X with mean $\mu$,

$Y^{(n)} := \frac{1}{n} \sum_{i=1}^{n} X^{(i)},$

is unbiased:

$\mathrm{E}\left(Y^{(n)}\right) = \mathrm{E}\left(\frac{1}{n} \sum_{i=1}^{n} X^{(i)}\right) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{E}\left(X^{(i)}\right) = \mu$

The empirical variance is also unbiased
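The claim can be checked by simulation. The sketch below (not in the original deck) averages the empirical mean and the empirical variance over many independent data sets; it assumes the empirical variance meant here is the usual one with the $1/(n-1)$ normalization.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2 = 5.0, 4.0      # true mean and variance (arbitrary for the example)
n, trials = 20, 100_000

data = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))

emp_means = data.mean(axis=1)
emp_vars = data.var(axis=1, ddof=1)    # empirical variance with 1/(n-1)

print(f"average empirical mean     {emp_means.mean():.3f}  (true mean {mu})")
print(f"average empirical variance {emp_vars.mean():.3f}  (true variance {sigma2})")
```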

SLIDE 21

Independent identically-distributed sampling
Mean-square error
Consistency
Confidence intervals
Nonparametric model estimation
Parametric estimation

SLIDE 22

Consistency

An estimator $Y^{(n)} := h\left(X^{(1)}, X^{(2)}, \dots, X^{(n)}\right)$ that approximates $\gamma$ is consistent if it converges to $\gamma$ as $n \to \infty$ in mean square, with probability one, or in probability

SLIDE 23

Consistency

The empirical mean of an iid sequence X with mean $\mu$,

$Y^{(n)} := \frac{1}{n} \sum_{i=1}^{n} X^{(i)},$

is consistent by the law of large numbers if the variance is bounded

SLIDE 24

Estimating the average height

Population of 25,000 people
Goal: estimate the average height from iid samples X
The average of the population is the mean of the iid sequence:

$\mathrm{E}\left(X^{(i)}\right) := \sum_{j=1}^{m} \mathrm{P}(\text{person } j \text{ is chosen}) \cdot h_j = \frac{1}{m} \sum_{j=1}^{m} h_j = \operatorname{av}(h_1, \dots, h_m),$

where $h_j$ is the height of person $j$

SLIDE 25

Estimating the average height

[Figure: histogram of the heights in the population, height (inches)]

SLIDE 26

Estimating the average height

[Figure: empirical mean vs. true mean of the height (inches) as the number of samples n grows from 10^0 to 10^3]

SLIDE 27

Empirical median is consistent

The empirical median of an iid sequence X is consistent even if the mean is not well defined or the variance is unbounded

SLIDE 28-30

Cauchy iid sequence: empirical mean

[Figures: moving average of a Cauchy iid sequence vs. the median of the iid sequence, for the first 50, 500, and 5000 samples; the moving average does not settle down]

SLIDE 31-33

Cauchy iid sequence: empirical median

[Figures: moving median of the same Cauchy iid sequence vs. the median of the iid sequence, for the first 50, 500, and 5000 samples; the moving median converges]
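The sketch below (an added illustration, not from the slides) reproduces this qualitative behavior: the running mean of Cauchy samples keeps jumping around, while the running median settles near the median of the distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_cauchy(5000)   # iid Cauchy samples: no mean, unbounded variance

running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
running_median = np.array([np.median(x[: i + 1]) for i in range(len(x))])

for i in [49, 499, 4999]:
    print(f"n = {i + 1:>4}: running mean {running_mean[i]:8.2f}, "
          f"running median {running_median[i]:6.3f}")   # median of the Cauchy is 0
```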

SLIDE 34

Consistency

The empirical variance is consistent if the fourth moment is bounded
The covariance matrix converges under similar conditions

SLIDE 35-38

PCA: n = 5, 20, 100

[Figures: true covariance vs. empirical covariance (principal directions) estimated from n = 5, 20, and 100 samples]

SLIDE 39

Independent identically-distributed sampling
Mean-square error
Consistency
Confidence intervals
Nonparametric model estimation
Parametric estimation

SLIDE 40

Confidence intervals

Aim: quantify the accuracy of an estimator for a fixed number of data points

A $1 - \alpha$ confidence interval $\mathcal{I}$ for a quantity $\gamma$ satisfies

$\mathrm{P}(\gamma \in \mathcal{I}) \ge 1 - \alpha, \quad \text{where } 0 < \alpha < 1$

SLIDE 41

Confidence interval for the mean of an iid sequence

Let X be an iid sequence with mean $\mu$ and variance $\sigma^2 \le b^2$ for some $b > 0$. For any $0 < \alpha < 1$,

$\mathcal{I}_n := \left[ Y_n - \frac{b}{\sqrt{\alpha n}},\; Y_n + \frac{b}{\sqrt{\alpha n}} \right], \qquad Y_n := \operatorname{av}\left(X^{(1)}, X^{(2)}, \dots, X^{(n)}\right),$

is a $1 - \alpha$ confidence interval for $\mu$

SLIDE 42-45

Proof

By Chebyshev's inequality,

$\mathrm{P}\left( \mu \in \left[ Y_n - \frac{b}{\sqrt{\alpha n}},\; Y_n + \frac{b}{\sqrt{\alpha n}} \right] \right) = 1 - \mathrm{P}\left( |Y_n - \mu| > \frac{b}{\sqrt{\alpha n}} \right) \ge 1 - \frac{\alpha n \operatorname{Var}(Y_n)}{b^2} = 1 - \frac{\alpha \sigma^2}{b^2} \ge 1 - \alpha$

SLIDE 46

Bears in Yosemite

Aim: estimate the average weight of bears in Yosemite
A scientist captures 300 bears; their average weight is Y := 200 lbs
We need a bound on the variance. The maximum weight is 880 lbs, so for a randomly selected bear X,

$\sigma^2 = \mathrm{E}\left(X^2\right) - \mathrm{E}^2(X) \le \mathrm{E}\left(X^2\right) \le 880^2 \quad \text{because } X \le 880 =: b$

SLIDE 47

Bears in Yosemite

$\left[ Y - \frac{b}{\sqrt{\alpha n}},\; Y + \frac{b}{\sqrt{\alpha n}} \right] = [-27.2,\, 427.2]$

is a 95% confidence interval for the average weight of the whole population
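As an added sketch, the interval can be computed directly from the bound above; the inputs (n = 300, Y = 200 lbs, b = 880 lbs, α = 0.05) are the ones given on the slide.

```python
import numpy as np

n, Y, b, alpha = 300, 200.0, 880.0, 0.05

# Chebyshev-based 1 - alpha confidence interval: Y ± b / sqrt(alpha * n)
half_width = b / np.sqrt(alpha * n)
print(f"[{Y - half_width:.1f}, {Y + half_width:.1f}]")   # ≈ [-27.2, 427.2]
```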

SLIDE 48

Central limit theorem with empirical standard deviation

Let X be an iid discrete sequence with mean $\mu$ such that its variance and fourth moment $\mathrm{E}\left(\left(X^{(i)}\right)^4\right)$ are bounded. The sequence

$\frac{\sqrt{n}\left( \operatorname{av}\left(X^{(1)}, \dots, X^{(n)}\right) - \mu \right)}{\operatorname{std}\left(X^{(1)}, \dots, X^{(n)}\right)}$

converges in distribution to a standard Gaussian random variable
SLIDE 49

Q function

For $x > 0$,

$Q(x) := \int_{x}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{u^2}{2} \right) du$

If $U$ is a standard Gaussian random variable and $y < 0$, then $\mathrm{P}(U < y) = Q(-y)$

SLIDE 50

Approximate confidence interval for the mean

Let X be an iid discrete sequence with mean $\mu$ such that its variance and fourth moment $\mathrm{E}\left(\left(X^{(i)}\right)^4\right)$ are bounded. For any $0 < \alpha < 1$,

$\mathcal{I}_n := \left[ Y_n - \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right),\; Y_n + \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right) \right],$

$Y_n := \operatorname{av}\left(X^{(1)}, X^{(2)}, \dots, X^{(n)}\right), \qquad S_n := \operatorname{std}\left(X^{(1)}, X^{(2)}, \dots, X^{(n)}\right),$

is an approximate $1 - \alpha$ confidence interval for $\mu$, i.e.

$\mathrm{P}(\mu \in \mathcal{I}_n) \approx 1 - \alpha$

SLIDE 51-54

Approximate confidence interval for the mean

$\mathrm{P}(\mu \in \mathcal{I}_n) = 1 - \mathrm{P}\left( Y_n > \mu + \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right) \right) - \mathrm{P}\left( Y_n < \mu - \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right) \right)$

$= 1 - \mathrm{P}\left( \frac{\sqrt{n}\,(Y_n - \mu)}{S_n} > Q^{-1}\left(\frac{\alpha}{2}\right) \right) - \mathrm{P}\left( \frac{\sqrt{n}\,(Y_n - \mu)}{S_n} < -Q^{-1}\left(\frac{\alpha}{2}\right) \right)$

$\approx 1 - 2\,Q\left( Q^{-1}\left(\frac{\alpha}{2}\right) \right)$ (by the central limit theorem with empirical standard deviation)

$= 1 - \alpha$
SLIDE 55

Bears in Yosemite

The empirical standard deviation is $S_n = 100$ lbs. Given that $Q(1.95) \approx 0.025$,

$\left[ Y - \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right),\; Y + \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right) \right] \approx [188.8,\, 211.3]$

is an approximate 95% confidence interval
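A sketch of the same computation in Python, using the standard normal inverse cdf from the standard library to evaluate $Q^{-1}(\alpha/2)$; the inputs match the slide (n = 300, Y = 200 lbs, empirical standard deviation 100 lbs).

```python
from math import sqrt
from statistics import NormalDist

n, Y, S, alpha = 300, 200.0, 100.0, 0.05

# Q^{-1}(alpha/2) is the (1 - alpha/2) quantile of the standard Gaussian (≈ 1.96)
q = NormalDist().inv_cdf(1 - alpha / 2)

half_width = S / sqrt(n) * q
# prints ≈ [188.7, 211.3]; the slide rounds the lower endpoint to 188.8
print(f"[{Y - half_width:.1f}, {Y + half_width:.1f}]")
```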

SLIDE 56

Interpreting confidence intervals

A tempting reading: "the average weight is between 188.8 and 211.3 lbs with probability 0.95"

In the frequentist framework this statement is not quite right: the population average is a fixed number, so it is either inside the computed interval or it is not

SLIDE 57

Interpreting confidence intervals

If we repeat the process of sampling the population and computing the confidence interval, then the true value will lie in the interval 95% of the time

SLIDE 58

Estimating the average height

We compute 40 confidence intervals of the form

$\mathcal{I}_n := \left[ Y_n - \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right),\; Y_n + \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right) \right],$

$Y_n := \operatorname{av}\left(X^{(1)}, X^{(2)}, \dots, X^{(n)}\right), \qquad S_n := \operatorname{std}\left(X^{(1)}, X^{(2)}, \dots, X^{(n)}\right),$

for $1 - \alpha = 0.95$ and different values of $n$
SLIDE 59-61

Estimating the average height: n = 50, 200, 1000

[Figures: the 40 confidence intervals for each value of n, plotted together with the true mean]

SLIDE 62

Independent identically-distributed sampling
Mean-square error
Consistency
Confidence intervals
Nonparametric model estimation
Parametric estimation

SLIDE 63

Nonparametric methods

Aim: estimate the distribution underlying the data
Very challenging: many (infinitely many!) different distributions could have generated the measurements

SLIDE 64

Empirical cdf

The empirical cdf corresponding to the data $x_1, \dots, x_n$ is

$\widehat{F}_n(x) := \frac{1}{n} \sum_{i=1}^{n} 1_{x_i \le x}, \quad x \in \mathbb{R}$

If the data are iid with cdf $F_X$, then $\widehat{F}_n(x)$ is an unbiased and consistent estimator of $F_X(x)$
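A minimal sketch of the empirical cdf (added here, not from the slides), evaluated on a grid by counting the fraction of samples below each point; the data are simulated stand-in heights.

```python
import numpy as np

def empirical_cdf(data, grid):
    """Fraction of samples x_i with x_i <= x, for each x in the grid."""
    data = np.sort(np.asarray(data))
    # searchsorted with side="right" counts how many samples are <= x
    return np.searchsorted(data, grid, side="right") / len(data)

rng = np.random.default_rng(4)
heights = rng.normal(67, 3, size=200)   # stand-in data (inches)
grid = np.linspace(58, 76, 7)
print(np.round(empirical_cdf(heights, grid), 2))
```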

SLIDE 65-69

Empirical cdf is unbiased

$\mathrm{E}\left( \widehat{F}_n(x) \right) = \mathrm{E}\left( \frac{1}{n} \sum_{i=1}^{n} 1_{X^{(i)} \le x} \right) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{E}\left( 1_{X^{(i)} \le x} \right) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{P}\left( X^{(i)} \le x \right) = F_X(x)$
SLIDE 70-79

Empirical cdf is consistent

The mean square of the empirical cdf is

$\mathrm{E}\left( \widehat{F}_n^2(x) \right) = \mathrm{E}\left( \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} 1_{X^{(i)} \le x}\, 1_{X^{(j)} \le x} \right)$

$= \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{E}\left( 1_{X^{(i)} \le x} \right) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1,\, j \ne i}^{n} \mathrm{E}\left( 1_{X^{(i)} \le x}\, 1_{X^{(j)} \le x} \right)$

$= \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{P}\left( X^{(i)} \le x \right) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1,\, j \ne i}^{n} \mathrm{P}\left( X^{(i)} \le x \right) \mathrm{P}\left( X^{(j)} \le x \right)$ (by independence)

$= \frac{F_X(x)}{n} + \frac{n-1}{n}\, F_X^2(x) = \frac{F_X(x)\,(1 - F_X(x))}{n} + F_X^2(x)$

The variance is consequently equal to

$\operatorname{Var}\left( \widehat{F}_n(x) \right) = \mathrm{E}\left( \widehat{F}_n^2(x) \right) - \mathrm{E}^2\left( \widehat{F}_n(x) \right) = \frac{F_X(x)\,(1 - F_X(x))}{n},$

so, since the estimator is unbiased, the empirical cdf converges to $F_X(x)$ in mean square:

$\lim_{n \to \infty} \mathrm{E}\left( \left( F_X(x) - \widehat{F}_n(x) \right)^2 \right) = \lim_{n \to \infty} \operatorname{Var}\left( \widehat{F}_n(x) \right) = 0$
SLIDE 80-82

Example: Heights, n = 10, 100, 1000

[Figures: true cdf vs. empirical cdf of the heights (inches) for n = 10, 100, and 1000]

SLIDE 83

Estimating the pdf at x

Idea: use a weighted average of the points close to x
Problem: how to weight the different samples?

SLIDE 84

Kernel density estimation

Weight the samples using a kernel centered at x. Desirable properties:

◮ Maximum at 0
◮ Decaying to zero away from 0 (closer samples are more informative)
◮ Nonnegative and normalized:

$k(x) \ge 0$ for all $x \in \mathbb{R}, \qquad \int_{\mathbb{R}} k(x)\, dx = 1$

SLIDE 85

Kernel density estimation

The kernel density estimator with bandwidth $h$ of the pdf of $x_1, \dots, x_n$ at $x \in \mathbb{R}$ is

$\widehat{f}_{h,n}(x) := \frac{1}{n h} \sum_{i=1}^{n} k\left( \frac{x - x_i}{h} \right)$
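A short sketch of the estimator written directly from the formula above, using a Gaussian kernel (an assumption; the slides do not fix a particular kernel) and simulated mixture data.

```python
import numpy as np

def kde(x, samples, h):
    """Kernel density estimate at points x, Gaussian kernel, bandwidth h."""
    x = np.atleast_1d(x)
    # k(u) = exp(-u^2 / 2) / sqrt(2*pi); average k((x - x_i) / h) / h over the samples
    u = (x[:, None] - samples[None, :]) / h
    return np.exp(-u**2 / 2).sum(axis=1) / (np.sqrt(2 * np.pi) * h * len(samples))

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 800)])  # toy mixture
print(np.round(kde(np.array([-2.0, 0.0, 3.0]), data, h=0.5), 3))
```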

SLIDE 86

Bandwidth

Governs how the samples are weighted

Large:
◮ the average is over more distant samples
◮ robust, but smooths out local details

Small:
◮ the average is only over close samples
◮ reflects local structure, but potentially unstable

SLIDE 87-92

Kernel density estimation of a Gaussian mixture

[Figures: true distribution, data, and kernel density estimates with bandwidth h = 0.1 for n = 3, 10^2, 10^4 and with h = 0.5 for n = 5, 10^2, 10^4]

SLIDE 93-94

Example: Abalone weights

[Figures: kernel density estimates of the abalone weight (grams) with bandwidths 0.05, 0.25, and 0.5, compared with the true pdf]

SLIDE 95

Independent identically-distributed sampling
Mean-square error
Consistency
Confidence intervals
Nonparametric model estimation
Parametric estimation

SLIDE 96

Parametric models

Assumption: Data are sampled from a known distribution with a small number of unknown parameters
Justification: theoretical (central limit theorem), empirical, ...
Frequentist viewpoint: the parameters are deterministic

SLIDE 97

Method of moments

Fit the parameters so that they are consistent with the empirical moments

For an exponential with parameter $\lambda$ and mean $\mu$, $\mu = \frac{1}{\lambda}$, so the method-of-moments estimate of $\lambda$ is

$\lambda_{\mathrm{MM}} := \frac{1}{\operatorname{av}(x_1, \dots, x_n)}$
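A sketch of the method-of-moments fit for the exponential model; the data below are simulated, since the real interarrival times from the next slide are not included here.

```python
import numpy as np

rng = np.random.default_rng(6)
true_lambda = 0.8
interarrival = rng.exponential(scale=1 / true_lambda, size=1000)  # simulated data

# Method of moments: match the first moment, mu = 1 / lambda
lambda_mm = 1 / interarrival.mean()
print(f"method-of-moments estimate: {lambda_mm:.3f}  (true value {true_lambda})")
```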

SLIDE 98

Fitting an exponential

[Figure: exponential distribution fitted to real interarrival-time data (s)]

SLIDE 99

Fitting a Gaussian

[Figure: Gaussian distribution fitted to real height data (inches)]

SLIDE 100

Maximum likelihood

Model the data $x_1, \dots, x_n$ as realizations of a set of discrete random variables $X_1, \dots, X_n$ whose joint pmf depends on a vector of parameters $\theta$

$p_{\theta}(x_1, \dots, x_n) := p_{X_1, \dots, X_n}(x_1, \dots, x_n)$

is the probability that $X_1, \dots, X_n$ equal the observed data

Idea: choose $\theta$ so that this probability is as high as possible

SLIDE 101

Likelihood

The likelihood is defined as

$\mathcal{L}_{x_1, \dots, x_n}(\theta) := p_{\theta}(x_1, \dots, x_n)$

if the distribution is discrete with pmf $p_{\theta}$, and

$\mathcal{L}_{x_1, \dots, x_n}(\theta) := f_{\theta}(x_1, \dots, x_n)$

if the distribution is continuous with pdf $f_{\theta}$

The log-likelihood function is the logarithm of the likelihood, $\log \mathcal{L}_{x_1, \dots, x_n}(\theta)$
SLIDE 102

Maximum-likelihood estimator

The likelihood quantifies how likely the data are according to the model

Maximum-likelihood (ML) estimator:

$\theta_{\mathrm{ML}}(x_1, \dots, x_n) := \arg\max_{\theta} \mathcal{L}_{x_1, \dots, x_n}(\theta) = \arg\max_{\theta} \log \mathcal{L}_{x_1, \dots, x_n}(\theta)$

Maximizing the log-likelihood is equivalent, and often more convenient
SLIDE 103-107

ML estimator of a Bernoulli distribution

The data $x_1, \dots, x_n$ are iid samples from a Bernoulli with parameter $\theta$. The likelihood function is

$\mathcal{L}_{x_1, \dots, x_n}(\theta) = p_{\theta}(x_1, \dots, x_n) = \prod_{i=1}^{n} \left( 1_{x_i = 1}\, \theta + 1_{x_i = 0}\, (1 - \theta) \right) = \theta^{n_1} (1 - \theta)^{n_0},$

where $n_1$ and $n_0$ are the numbers of ones and zeros in the data. The log-likelihood function is

$\log \mathcal{L}_{x_1, \dots, x_n}(\theta) = n_1 \log \theta + n_0 \log (1 - \theta)$

The ML estimator is

$\theta_{\mathrm{ML}} = \frac{n_1}{n_0 + n_1}$
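A sketch (added for illustration) checking that the closed-form estimate $n_1 / (n_0 + n_1)$ agrees with a brute-force maximization of the log-likelihood over a grid of θ values; the grid search is only for verification.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.binomial(1, 0.3, size=500)        # iid Bernoulli(0.3) samples
n1, n0 = x.sum(), len(x) - x.sum()

theta_ml = n1 / (n0 + n1)                 # closed-form ML estimate

# Brute-force check: maximize n1*log(theta) + n0*log(1 - theta) over a grid
grid = np.linspace(0.001, 0.999, 999)
loglik = n1 * np.log(grid) + n0 * np.log(1 - grid)
print(f"closed form {theta_ml:.3f}, grid search {grid[np.argmax(loglik)]:.3f}")
```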

SLIDE 108-111

ML estimator of a Gaussian distribution

The data $x_1, \dots, x_n$ are iid samples from a Gaussian with mean $\mu$ and standard deviation $\sigma$. The likelihood function is

$\mathcal{L}_{x_1, \dots, x_n}(\mu, \sigma) = f_{\mu, \sigma}(x_1, \dots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{- \frac{(x_i - \mu)^2}{2 \sigma^2}}$

The log-likelihood function is

$\log \mathcal{L}_{x_1, \dots, x_n}(\mu, \sigma) = - \frac{n \log (2\pi)}{2} - n \log \sigma - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2 \sigma^2}$

The ML estimator is

$\mu_{\mathrm{ML}} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \sigma^2_{\mathrm{ML}} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{\mathrm{ML}})^2$
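A sketch of the Gaussian ML fit on simulated data; note that $\sigma^2_{\mathrm{ML}}$ uses the $1/n$ normalization (ddof=0), unlike the unbiased empirical variance used earlier.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(loc=4.0, scale=5.0, size=2000)   # simulated data, mu = 4, sigma = 5

mu_ml = x.mean()                 # ML estimate of the mean
sigma2_ml = x.var(ddof=0)        # ML estimate of the variance (1/n normalization)

print(f"mu_ML = {mu_ml:.3f}, sigma_ML = {np.sqrt(sigma2_ml):.3f}")
```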

SLIDE 112-117

ML estimator of a Gaussian distribution

[Figures: for three simulated data sets, the estimated and true Gaussian pdfs, and the log-likelihood as a function of µ and σ with the estimated and true parameters marked]

SLIDE 118

Log-likelihood function of a Gaussian mixture

X is a Gaussian mixture:

$X := \begin{cases} G_1 & \text{with probability } \frac{1}{5}, \\ G_2 & \text{with probability } \frac{4}{5}, \end{cases}$

where $G_1$ is a Gaussian random variable with mean $-\mu$ and variance $\sigma^2$, and $G_2$ is Gaussian with mean $\mu$ and variance $\sigma^2$

The data $x_1, \dots, x_n$ are iid samples from X

SLIDE 119-121

Log-likelihood function of a Gaussian mixture

The likelihood function is

$\mathcal{L}_{x_1, \dots, x_n}(\mu, \sigma) = f_{\mu, \sigma}(x_1, \dots, x_n) = \prod_{i=1}^{n} \left( \frac{1}{5\sqrt{2\pi}\,\sigma}\, e^{- \frac{(x_i + \mu)^2}{2 \sigma^2}} + \frac{4}{5\sqrt{2\pi}\,\sigma}\, e^{- \frac{(x_i - \mu)^2}{2 \sigma^2}} \right)$

The log-likelihood function is

$\log \mathcal{L}_{x_1, \dots, x_n}(\mu, \sigma) = \sum_{i=1}^{n} \log \left( \frac{1}{5\sqrt{2\pi}\,\sigma}\, e^{- \frac{(x_i + \mu)^2}{2 \sigma^2}} + \frac{4}{5\sqrt{2\pi}\,\sigma}\, e^{- \frac{(x_i - \mu)^2}{2 \sigma^2}} \right)$

SLIDE 122-123

Log-likelihood function of a Gaussian mixture

[Figures: the log-likelihood as a function of µ and σ has a global maximum and a local maximum; the corresponding estimates are compared with the true distribution and the data]

SLIDE 124

Quadratic discriminant analysis

Training data: $a_1, \dots, a_n$ and $b_1, \dots, b_n$, each with $d$ features
Aim: classify new instances

SLIDE 125-126

Quadratic discriminant analysis

1. Fit a multidimensional Gaussian distribution to each class:

$\{\mu_a, \Sigma_a\} := \arg\max_{\mu, \Sigma} \mathcal{L}_{a_1, \dots, a_n}(\mu, \Sigma), \qquad \{\mu_b, \Sigma_b\} := \arg\max_{\mu, \Sigma} \mathcal{L}_{b_1, \dots, b_n}(\mu, \Sigma)$

2. For each new example x, if $f_{\mu_a, \Sigma_a}(x) > f_{\mu_b, \Sigma_b}(x)$ then x is assigned to class 1, otherwise to class 2
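A minimal sketch of the two steps on simulated 2-d data (the Gaussian parameters below are invented for the example): fit a Gaussian to each class by maximum likelihood, then classify a point by comparing the two fitted densities.

```python
import numpy as np

def fit_gaussian(data):
    """ML fit of a multivariate Gaussian: sample mean and (1/n) covariance."""
    mu = data.mean(axis=0)
    centered = data - mu
    Sigma = centered.T @ centered / len(data)
    return mu, Sigma

def log_density(x, mu, Sigma):
    """Log of the multivariate Gaussian pdf at x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(9)
a = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=200)    # class 1
b = rng.multivariate_normal([3, 3], [[2, -0.8], [-0.8, 1]], size=200)  # class 2

mu_a, Sigma_a = fit_gaussian(a)
mu_b, Sigma_b = fit_gaussian(b)

x_new = np.array([2.0, 1.0])
label = 1 if log_density(x_new, mu_a, Sigma_a) > log_density(x_new, mu_b, Sigma_b) else 2
print(f"x_new assigned to class {label}")
```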

SLIDE 127-128

Quadratic discriminant analysis

[Figures: quadratic discriminant analysis applied to two-class training data]