Frequentist Statistics
DS-GA 1002: Probability and Statistics for Data Science
http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall17
Carlos Fernandez-Granda

Estimation under probabilistic assumptions

Assumption: Data are generated by sampling from a probabilistic model
Aim: Analyze statistical techniques and derive guarantees
Frequentist framework: The distribution generating the data is fixed
- Independent identically-distributed sampling
- Mean-square error
- Consistency
- Confidence intervals
- Nonparametric model estimation
- Parametric estimation
Independent identically-distributed sampling

Assumption: Data are iid samples X(1), X(2), X(3), ..., X(n)
This holds for controlled experiments (e.g., randomized trials to test drugs) and is often a good approximation (e.g., polling)
Sampling from a population

Population of m individuals
We are interested in a feature associated with each person (cholesterol level, salary, who they are voting for, ...)
The feature has k possible values {z_1, z_2, ..., z_k}
m_j := number of people for whom the feature equals z_j

Data: values of the feature for a subset of individuals
If the individuals are chosen uniformly at random with replacement, then

    p_{X(i)}(z_j) = P(the feature of the ith chosen person equals z_j) = m_j / m,   1 ≤ j ≤ k,

so the sequence X(1), X(2), ..., X(n) is iid
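A quick sanity check (the population below, with group sizes 200/300/500, is invented for illustration): drawing uniformly with replacement gives each value an empirical frequency close to m_j / m.

```python
import random
from collections import Counter

# Hypothetical population of m = 1000 individuals whose feature takes
# one of three values; m_j / m is 0.2, 0.3, 0.5 respectively
population = ["z1"] * 200 + ["z2"] * 300 + ["z3"] * 500

random.seed(0)
# Each draw is uniform over the population, with replacement, so the draws are iid
samples = [random.choice(population) for _ in range(100_000)]

counts = Counter(samples)
# Empirical frequencies should approach m_j / m
freq = {z: counts[z] / len(samples) for z in ("z1", "z2", "z3")}
```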
Estimator

An estimator is a deterministic function of the data x_1, x_2, ..., x_n:

    y := h(x_1, x_2, ..., x_n)

Aim: estimate a quantity γ related to the underlying distribution
If the data are samples from a probabilistic model, then y is a realization of the random variable

    Y(n) := h(X(1), X(2), ..., X(n))
Mean square error

The mean square error (MSE) of an estimator Y that approximates a quantity γ is

    MSE(Y) := E[(Y − γ)²]
Bias-variance decomposition

MSE(Y) = E[(Y − γ)²]
       = E[(Y − E(Y) + E(Y) − γ)²]
       = E[(Y − E(Y))²] + (E(Y) − γ)² + 2 (E(Y) − γ) E(Y − E(Y))
       = E[(Y − E(Y))²] + (E(Y) − γ)²

The cross term vanishes because E(Y − E(Y)) = E(Y) − E(Y) = 0; the first remaining term is the variance of Y and the second is the squared bias
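The decomposition can be checked numerically. The sketch below (the estimator, a deliberately shrunk empirical mean of Uniform(0, 1) samples, is an arbitrary choice for illustration) verifies that the empirical MSE splits exactly into variance plus squared bias.

```python
import random

random.seed(1)

# Estimate gamma = mean of Uniform(0, 1) (true value 0.5) with a deliberately
# biased estimator: the empirical mean shrunk by a factor 0.9
gamma, n, trials = 0.5, 20, 50_000
estimates = []
for _ in range(trials):
    sample = [random.random() for _ in range(n)]
    estimates.append(0.9 * sum(sample) / n)

mean_y = sum(estimates) / trials
mse = sum((y - gamma) ** 2 for y in estimates) / trials
variance = sum((y - mean_y) ** 2 for y in estimates) / trials
bias_sq = (mean_y - gamma) ** 2
# The decomposition: mse = variance + bias_sq (exactly, for the empirical quantities)
```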
Unbiased estimator

An estimator Y that approximates γ is unbiased if and only if E(Y) = γ
Empirical mean is unbiased

The empirical mean of an iid sequence X with mean µ,

    Y(n) := (1/n) Σ_{i=1}^{n} X(i),

is unbiased:

    E[Y(n)] = E[ (1/n) Σ_{i=1}^{n} X(i) ]
            = (1/n) Σ_{i=1}^{n} E[X(i)]
            = µ

The empirical variance is also unbiased
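A short simulation illustrates unbiasedness (the exponential distribution and its mean µ = 2 are arbitrary choices): averaging the empirical mean over many repetitions recovers µ.

```python
import random

random.seed(2)

# iid samples from an Exponential with mean mu = 2
mu, n, trials = 2.0, 10, 50_000
means = []
for _ in range(trials):
    sample = [random.expovariate(1 / mu) for _ in range(n)]
    means.append(sum(sample) / n)

# Averaging the estimator over many repetitions approximates E[Y(n)]
expected_value = sum(means) / trials
```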
Consistency

An estimator Y(n) := h(X(1), X(2), ..., X(n)) that approximates γ is consistent if it converges to γ as n → ∞ (in mean square, with probability one, or in probability)
Consistency

The empirical mean of an iid sequence X with mean µ,

    Y(n) := (1/n) Σ_{i=1}^{n} X(i),

is consistent by the law of large numbers, provided the variance is bounded
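A minimal law-of-large-numbers check, assuming for illustration Uniform(0, 1) samples: the empirical mean gets closer to µ = 0.5 as n grows.

```python
import random

random.seed(3)

# Empirical mean of iid Uniform(0, 1) samples (mu = 0.5, bounded variance)
# for increasing n; the error shrinks as the law of large numbers predicts
errors = []
for n in (10, 1_000, 100_000):
    sample_mean = sum(random.random() for _ in range(n)) / n
    errors.append(abs(sample_mean - 0.5))
```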
Estimating the average height

Population of m = 25,000 people
Goal: estimate the average height from iid samples
The average over the population is the mean of the iid sequence:

    E[X(i)] := Σ_{j=1}^{m} P(person j is chosen) · (height of person j)
             = (1/m) Σ_{j=1}^{m} h_j
             = av(h_1, ..., h_m)
[Figures: histogram of the heights (inches) in the population; empirical mean of the height samples vs. n (log scale), converging to the true mean]
Empirical median is consistent
The empirical median of an iid sequence X is consistent even if the mean is not well defined or the variance is unbounded
[Figures: for a Cauchy iid sequence, the moving average over the first i samples (i up to 50, 500, and 5000) vs. the median of the sequence, and the corresponding moving median; the moving average keeps jumping, while the moving median settles near the median]
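The contrast can be reproduced with a quick simulation (inverse-cdf sampling, tan(π(U − 1/2)) for U uniform, is one standard way to generate standard Cauchy samples): the empirical median lands near the true median 0, while the empirical mean is unreliable.

```python
import math
import random
import statistics

random.seed(4)

# Standard Cauchy samples via inverse-cdf sampling: tan(pi * (U - 1/2)), U uniform
n = 100_000
samples = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

# The empirical mean is erratic (the Cauchy mean is not well defined) ...
empirical_mean = sum(samples) / n
# ... but the empirical median consistently estimates the median, 0
empirical_median = statistics.median(samples)
```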
Consistency

The empirical variance is consistent if the fourth moment is bounded
The covariance matrix converges under similar conditions
[Figures: PCA directions computed from the true covariance vs. the empirical covariance for n = 5, 20, and 100; the empirical estimate approaches the truth as n grows]
Confidence intervals

Aim: quantify the accuracy of an estimator for a fixed amount of data
A 1 − α confidence interval I for a quantity γ satisfies

    P(γ ∈ I) ≥ 1 − α,   where 0 < α < 1
Confidence interval for the mean of an iid sequence

Let X be an iid sequence with mean µ and variance σ² ≤ b² for some b > 0. For any 0 < α < 1,

    I_n := [ Y_n − b/√(αn), Y_n + b/√(αn) ],   Y_n := av(X(1), X(2), ..., X(n)),

is a 1 − α confidence interval for µ
Proof

P( µ ∈ [ Y_n − b/√(αn), Y_n + b/√(αn) ] ) = 1 − P( |Y_n − µ| > b/√(αn) )
                                          ≥ 1 − αn Var(Y_n)/b²   (by Chebyshev's inequality)
                                          = 1 − α σ²/b²          (since Var(Y_n) = σ²/n)
                                          ≥ 1 − α                (since σ² ≤ b²)
Bears in Yosemite

Aim: estimate the average weight of bears in Yosemite
A scientist captures n = 300 bears; their average weight is Y := 200 lbs
We need a bound on the variance. The maximum weight is 880 lbs, so for a randomly selected bear X,

    σ² = E[X²] − E²(X) ≤ E[X²] ≤ 880²   because X ≤ 880 =: b

With α = 0.05,

    [ Y − b/√(αn), Y + b/√(αn) ] = [−27.2, 427.2]

is a 95% confidence interval for the average weight of the whole population
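The Chebyshev-based interval is easy to compute; a sketch reproducing the numbers above:

```python
import math

def chebyshev_interval(sample_mean, b, n, alpha):
    """1 - alpha confidence interval for the mean when the variance is at most b**2."""
    half_width = b / math.sqrt(alpha * n)
    return (sample_mean - half_width, sample_mean + half_width)

# Bears example: n = 300, Y = 200 lbs, b = 880 lbs, alpha = 0.05
low, high = chebyshev_interval(200, 880, 300, 0.05)
```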
Central limit theorem with empirical standard deviation

Let X be an iid discrete sequence with mean µ whose variance and fourth moment E[X(i)⁴] are bounded. The sequence

    √n ( av(X(1), ..., X(n)) − µ ) / std(X(1), ..., X(n))

converges in distribution to a standard Gaussian random variable
Q function

For x > 0,

    Q(x) := ∫_{x}^{∞} (1/√(2π)) exp(−u²/2) du

If U is a standard Gaussian random variable and y < 0, then P(U < y) = Q(−y)
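In Python, Q can be written in terms of the complementary error function, since Q(x) = erfc(x/√2)/2:

```python
import math

def Q(x):
    """Gaussian tail probability P(U > x) for a standard Gaussian U."""
    return 0.5 * math.erfc(x / math.sqrt(2))
```

For example, Q(0) = 1/2 and Q(1.96) ≈ 0.025, the value used for 95% confidence intervals.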
Approximate confidence interval for the mean

Let X be an iid discrete sequence with mean µ whose variance and fourth moment E[X(i)⁴] are bounded. For any 0 < α < 1,

    I_n := [ Y_n − (S_n/√n) Q⁻¹(α/2), Y_n + (S_n/√n) Q⁻¹(α/2) ],

    Y_n := av(X(1), X(2), ..., X(n)),   S_n := std(X(1), X(2), ..., X(n)),

is an approximate 1 − α confidence interval for µ, i.e.

    P(µ ∈ I_n) ≈ 1 − α
Approximate confidence interval for the mean

P(µ ∈ I_n) = 1 − P( Y_n > µ + (S_n/√n) Q⁻¹(α/2) ) − P( Y_n < µ − (S_n/√n) Q⁻¹(α/2) )
           = 1 − P( √n (Y_n − µ)/S_n > Q⁻¹(α/2) ) − P( √n (Y_n − µ)/S_n < −Q⁻¹(α/2) )
           ≈ 1 − 2 Q( Q⁻¹(α/2) )   (by the central limit theorem with empirical standard deviation)
           = 1 − α
Bears in Yosemite

The empirical standard deviation is S := 100 lbs. Given that Q(1.96) ≈ 0.025,

    [ Y − (S/√n) Q⁻¹(α/2), Y + (S/√n) Q⁻¹(α/2) ] ≈ [188.8, 211.3]

is an approximate 95% confidence interval
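A sketch of the CLT-based interval, using `statistics.NormalDist` to evaluate Q⁻¹ as the corresponding Gaussian quantile, reproduces the bears interval up to rounding:

```python
import math
from statistics import NormalDist

def clt_interval(sample_mean, sample_std, n, alpha):
    """Approximate 1 - alpha confidence interval for the mean based on the CLT.

    Q^{-1}(alpha / 2) is the point where the Gaussian tail probability equals
    alpha / 2, i.e. the (1 - alpha / 2) quantile of the standard Gaussian.
    """
    q_inv = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = sample_std / math.sqrt(n) * q_inv
    return (sample_mean - half_width, sample_mean + half_width)

# Bears example: Y = 200 lbs, S = 100 lbs, n = 300, alpha = 0.05
low, high = clt_interval(200, 100, 300, 0.05)
```

This interval is far narrower than the Chebyshev-based one, because it exploits the approximate Gaussianity of the empirical mean instead of a worst-case variance bound.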
Interpreting confidence intervals

Tempting (but incorrect) interpretation: "the average weight is between 188.8 and 211.3 lbs with probability 0.95." In the frequentist framework the average weight is deterministic, so it either lies in the interval or it does not.

Correct interpretation: if we repeat the process of sampling the population and computing the confidence interval, then the true value will lie in the interval 95% of the time
Estimating the average height

We compute 40 confidence intervals of the form

    I_n := [ Y_n − (S_n/√n) Q⁻¹(α/2), Y_n + (S_n/√n) Q⁻¹(α/2) ],

    Y_n := av(X(1), X(2), ..., X(n)),   S_n := std(X(1), X(2), ..., X(n)),

for 1 − α = 0.95 and different values of n
[Figures: the 40 confidence intervals plotted against the true mean for n = 50, 200, and 1000; roughly 95% of them cover it, and they narrow as n grows]
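The frequentist interpretation can be verified by simulation, assuming for illustration that the samples are exponential with known mean µ: the fraction of computed intervals that cover µ should be close to 1 − α.

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(5)

mu, n, alpha, repeats = 5.0, 200, 0.05, 1_000
q_inv = NormalDist().inv_cdf(1 - alpha / 2)  # Q^{-1}(alpha / 2)

covered = 0
for _ in range(repeats):
    sample = [random.expovariate(1 / mu) for _ in range(n)]
    y, s = mean(sample), stdev(sample)
    half_width = s / math.sqrt(n) * q_inv
    if y - half_width <= mu <= y + half_width:
        covered += 1

# Fraction of intervals that contain the true mean; should be close to 1 - alpha
coverage = covered / repeats
```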
Nonparametric methods

Aim: estimate the distribution underlying the data
This is very challenging: infinitely many different distributions could have generated the measurements
Empirical cdf

The empirical cdf corresponding to data x_1, ..., x_n is

    F_n(x) := (1/n) Σ_{i=1}^{n} 1{x_i ≤ x},   x ∈ R

If the data are iid with cdf F_X, then F_n(x) is an unbiased and consistent estimator of F_X(x)
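A direct implementation of the definition, on a tiny made-up dataset:

```python
def empirical_cdf(data, x):
    """F_n(x): fraction of the data points that are less than or equal to x."""
    return sum(1 for xi in data if xi <= x) / len(data)

# Made-up height measurements (inches)
data = [66.1, 63.5, 70.2, 68.0, 71.4]
```

For instance, F_n(68.0) = 3/5 here, since three of the five points are at most 68.0.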
Empirical cdf is unbiased

E[F_n(x)] = E[ (1/n) Σ_{i=1}^{n} 1{X(i) ≤ x} ]
          = (1/n) Σ_{i=1}^{n} E[ 1{X(i) ≤ x} ]
          = (1/n) Σ_{i=1}^{n} P( X(i) ≤ x )
          = F_X(x)
Empirical cdf is consistent

The mean square of the empirical cdf is

E[F_n(x)²] = E[ (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} 1{X(i) ≤ x} 1{X(j) ≤ x} ]
           = (1/n²) Σ_{i=1}^{n} E[ 1{X(i) ≤ x} ] + (1/n²) Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} E[ 1{X(i) ≤ x} 1{X(j) ≤ x} ]
           = (1/n²) Σ_{i=1}^{n} P( X(i) ≤ x ) + (1/n²) Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} P( X(i) ≤ x ) P( X(j) ≤ x )   (by independence)
           = F_X(x)/n + ((n − 1)/n) F_X(x)²
           = F_X(x)(1 − F_X(x))/n + F_X(x)²
Empirical cdf is consistent

The variance is consequently

    Var(F_n(x)) = E[F_n(x)²] − E²[F_n(x)] = F_X(x)(1 − F_X(x))/n
Empirical cdf is consistent

Since the estimator is unbiased, its mean square error equals its variance:

    lim_{n→∞} E[ (F_X(x) − F_n(x))² ] = lim_{n→∞} Var(F_n(x)) = 0,

so F_n(x) converges to F_X(x) in mean square
[Figures: true cdf vs. empirical cdf of the heights (inches) for n = 10, 100, and 1000; the empirical cdf approaches the true cdf as n grows]
Estimating the pdf at x

Idea: use a weighted average of the points close to x
Problem: how should the different samples be weighted?
Kernel density estimation

Weight the samples using a kernel k centered at x. Desirable properties:
◮ Maximum at 0
◮ Decreasing away from 0 (closer samples are more informative)
◮ Nonnegative and normalized:

    k(x) ≥ 0 for all x ∈ R,   ∫_R k(x) dx = 1
Kernel density estimation

The kernel density estimator with bandwidth h of the pdf of x_1, ..., x_n at x ∈ R is

    f_{h,n}(x) := (1/(n h)) Σ_{i=1}^{n} k( (x − x_i)/h )
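A minimal kernel density estimator with a Gaussian kernel (the data points below are made up). Since the kernel is normalized, the resulting estimate is itself a valid pdf, which a crude numerical integral confirms:

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian pdf: nonnegative, peaked at 0, integrates to 1."""
    return math.exp(-u ** 2 / 2) / math.sqrt(2 * math.pi)

def kde(data, x, h):
    """Kernel density estimate f_{h,n}(x) with bandwidth h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)

# Made-up data points
data = [1.2, 1.9, 2.1, 2.4, 5.0]

# Riemann sum of the estimate over a grid covering all the mass
grid = [i * 0.01 for i in range(-500, 1500)]
integral = sum(kde(data, x, h=0.5) for x in grid) * 0.01
```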
Bandwidth

The bandwidth governs how the samples are weighted
Large bandwidth:
◮ The average is over more distant samples
◮ Robust, but smooths out local details
Small bandwidth:
◮ The average is only over close samples
◮ Reflects local structure, but potentially unstable
[Figures: kernel density estimates of a Gaussian mixture vs. the true pdf and the data, for n = 3, 10², 10⁴ with h = 0.1 and for n = 5, 10², 10⁴ with h = 0.5]
[Figures: kernel density estimates of abalone weights (grams) with bandwidths 0.05, 0.25, and 0.5, compared to the true pdf]
Parametric models

Assumption: Data are sampled from a known distribution with a small number of unknown parameters
Justification: theoretical (central limit theorem), empirical, ...
Frequentist viewpoint: the parameters are deterministic
Method of moments

Fit the parameters so that they are consistent with the empirical moments
For an exponential with parameter λ, the mean is µ = 1/λ, so the method-of-moments estimate of λ is

    λ_MM := 1 / av(x_1, ..., x_n)
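A sketch of the method-of-moments estimator for the exponential, on simulated data with rate λ = 0.5 (an arbitrary choice for illustration):

```python
import random

random.seed(6)

# Simulate interarrival times from an Exponential with rate lam = 0.5 (mean 2)
lam, n = 0.5, 50_000
data = [random.expovariate(lam) for _ in range(n)]

# Method of moments: match the first empirical moment, using mu = 1 / lam
lam_mm = 1 / (sum(data) / n)
```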
[Figures: exponential distribution fitted to real interarrival times (s); Gaussian distribution fitted to real heights (inches)]
Maximum likelihood

Model the data x_1, ..., x_n as realizations of a set of discrete random variables X_1, ..., X_n whose joint pmf depends on a vector of parameters θ:

    p_θ(x_1, ..., x_n) := p_{X_1,...,X_n}(x_1, ..., x_n)

is the probability that X_1, ..., X_n equal the observed data
Idea: choose θ such that this probability is as high as possible
Likelihood

The likelihood is defined as

    L_{x_1,...,x_n}(θ) := p_θ(x_1, ..., x_n)

if the distribution is discrete with pmf p_θ, and as

    L_{x_1,...,x_n}(θ) := f_θ(x_1, ..., x_n)

if the distribution is continuous with pdf f_θ
The log-likelihood function is the logarithm of the likelihood, log L_{x_1,...,x_n}(θ)
Maximum-likelihood estimator

The likelihood quantifies how likely the data are according to the model
The maximum-likelihood (ML) estimator is

    θ_ML(x_1, ..., x_n) := arg max_θ L_{x_1,...,x_n}(θ) = arg max_θ log L_{x_1,...,x_n}(θ)

Maximizing the log-likelihood is equivalent, and often more convenient
ML estimator of a Bernoulli distribution

The data x_1, ..., x_n are iid samples from a Bernoulli with parameter θ. Writing n_1 for the number of ones and n_0 for the number of zeros, the likelihood function is

    L_{x_1,...,x_n}(θ) = p_θ(x_1, ..., x_n)
                       = Π_{i=1}^{n} ( 1{x_i = 1} θ + 1{x_i = 0} (1 − θ) )
                       = θ^{n_1} (1 − θ)^{n_0}

The log-likelihood function is

    log L_{x_1,...,x_n}(θ) = n_1 log θ + n_0 log(1 − θ)

Setting its derivative to zero yields the ML estimator

    θ_ML = n_1 / (n_0 + n_1)
ML estimator of a Gaussian distribution

The data x_1, ..., x_n are iid samples from a Gaussian with mean µ and standard deviation σ. The likelihood function is

    L_{x_1,...,x_n}(µ, σ) = f_{µ,σ}(x_1, ..., x_n) = Π_{i=1}^{n} (1/(√(2π) σ)) e^{−(x_i − µ)²/(2σ²)}

The log-likelihood function is

    log L_{x_1,...,x_n}(µ, σ) = −(n/2) log(2π) − n log σ − Σ_{i=1}^{n} (x_i − µ)²/(2σ²)
ML estimator of a Gaussian distribution

The ML estimators are

    µ_ML = (1/n) Σ_{i=1}^{n} x_i,   σ²_ML = (1/n) Σ_{i=1}^{n} (x_i − µ_ML)²
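A sketch of the closed-form estimates on a toy dataset, checking that they do outperform nearby parameter values on the log-likelihood:

```python
import math

data = [4.8, 5.1, 6.3, 3.9, 5.5, 4.4]  # toy samples
n = len(data)

# Closed-form ML estimates for a Gaussian
mu_ml = sum(data) / n
sigma2_ml = sum((x - mu_ml) ** 2 for x in data) / n  # divides by n, not n - 1
sigma_ml = math.sqrt(sigma2_ml)

def log_likelihood(mu, sigma):
    return (-n * math.log(2 * math.pi) / 2 - n * math.log(sigma)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))
```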
[Figures: for three datasets, the estimated Gaussian vs. the true distribution and the data, and the log-likelihood surface over (µ, σ) with the estimated and true parameters marked]
Log-likelihood function of a Gaussian mixture

X is a Gaussian mixture:

    X := G_1 with probability 1/5,   G_2 with probability 4/5,

where G_1 is a Gaussian random variable with mean −µ and variance σ², and G_2 is a Gaussian random variable with mean µ and variance σ²
The data x_1, ..., x_n are iid samples from X
Log-likelihood function of a Gaussian mixture

The likelihood function is

    L_{x_1,...,x_n}(µ, σ) = f_{µ,σ}(x_1, ..., x_n)
                          = Π_{i=1}^{n} [ (1/(5√(2π) σ)) e^{−(x_i + µ)²/(2σ²)} + (4/(5√(2π) σ)) e^{−(x_i − µ)²/(2σ²)} ]

The log-likelihood function is

    log L_{x_1,...,x_n}(µ, σ) = Σ_{i=1}^{n} log[ (1/(5√(2π) σ)) e^{−(x_i + µ)²/(2σ²)} + (4/(5√(2π) σ)) e^{−(x_i − µ)²/(2σ²)} ]
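The shape of this surface can be probed numerically. Assuming for illustration µ = 3 and σ = 1, the log-likelihood at the sign-flipped parameters (which swap the two mixture components) is lower than at the true ones, reflecting the local-vs-global maximum issue:

```python
import math
import random

random.seed(7)

# Sample from the mixture: N(-mu, sigma^2) w.p. 1/5, N(mu, sigma^2) w.p. 4/5
true_mu, true_sigma, n = 3.0, 1.0, 2_000
data = [random.gauss(-true_mu if random.random() < 0.2 else true_mu, true_sigma)
        for _ in range(n)]

def log_likelihood(mu, sigma):
    c = 1 / (math.sqrt(2 * math.pi) * sigma)
    total = 0.0
    for x in data:
        total += math.log(c / 5 * math.exp(-(x + mu) ** 2 / (2 * sigma ** 2))
                          + 4 * c / 5 * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)))
    return total
```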
[Figures: log-likelihood surface over (µ, σ) with a global and a local maximum marked alongside the true parameters; the estimated distributions at the two maxima vs. the true distribution and the data]
Quadratic discriminant analysis

Training data: a_1, ..., a_n and b_1, ..., b_n, each with d features
Aim: classify new instances
Quadratic discriminant analysis

1. Fit a multidimensional Gaussian distribution to each class:

    {µ_a, Σ_a} := arg max_{µ,Σ} L_{a_1,...,a_n}(µ, Σ)
    {µ_b, Σ_b} := arg max_{µ,Σ} L_{b_1,...,b_n}(µ, Σ)

2. For each new example, evaluate both fitted Gaussian pdfs and assign the example to the class under which it is more likely
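A one-dimensional sketch of the two steps (the slides use d-dimensional Gaussians; here d = 1 so the covariance reduces to a single variance, and the training data are invented):

```python
import math

def fit_gaussian(samples):
    """Step 1: ML estimates of a one-dimensional Gaussian."""
    n = len(samples)
    mu = sum(samples) / n
    sigma2 = sum((x - mu) ** 2 for x in samples) / n
    return mu, sigma2

def gaussian_log_pdf(x, mu, sigma2):
    return -0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)

def qda_classify(x, params_a, params_b):
    """Step 2: assign x to the class whose fitted Gaussian makes it more likely."""
    return "a" if gaussian_log_pdf(x, *params_a) > gaussian_log_pdf(x, *params_b) else "b"

# Hypothetical training data for the two classes
a_train = [1.0, 1.2, 0.8, 1.1, 0.9]
b_train = [4.0, 4.3, 3.8, 4.1, 3.7]
params_a, params_b = fit_gaussian(a_train), fit_gaussian(b_train)
```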