SLIDE 1
Maximum likelihood and EM algorithm (after the Chapter 8)

Pasha Zusmanovich, deCODE Statistics Colloquium March 30, 2007

SLIDE 2
What is likelihood and what is it good for?

Likelihood is just a conditional probability.

Formal definition

Given random events A and B, the likelihood function of A relative to B is the map {set of states of B} → [0, 1], x → Pr(A | B = x). Nothing fancy so far. Consider an example.

SLIDE 3

What is likelihood and what is it good for?

Example: alleles and genotypes

Frequencies of alleles: a: θ, A: 1 − θ ⇒ frequencies of genotypes: aa: θ², aA: 2θ(1 − θ), AA: (1 − θ)². Observed numbers: n_aa, n_aA, n_AA.

The probability that the numbers of genotypes would be exactly (n_aa, n_aA, n_AA) is

f(θ) = ((n_aa + n_aA + n_AA)! / (n_aa! n_aA! n_AA!)) · θ^(2n_aa) · (2θ(1 − θ))^(n_aA) · (1 − θ)^(2n_AA).

f is a likelihood function: {probability of alleles} → {conditional probability of genotypes given the probability of alleles}. This is a model with parameter θ. Question: Which parameter makes the model the “best”? Answer ...

SLIDE 4

What is likelihood and what is it good for?

Example: alleles and genotypes (continued)

Question: Which parameter makes the model the “best”? Answer: the one which makes the observed data most likely, i.e. which maximizes

f(θ) = ((n_aa + n_aA + n_AA)! / (n_aa! n_aA! n_AA!)) · θ^(2n_aa) · (2θ(1 − θ))^(n_aA) · (1 − θ)^(2n_AA)

over θ ∈ [0, 1].

Solution: θ̂ = (2n_aa + n_aA) / (2(n_aa + n_aA + n_AA)). But this is exactly the Hardy–Weinberg equilibrium!
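
This closed form is easy to check numerically. A minimal sketch in Python (not from the slides; the genotype counts 38, 95, 53 are borrowed from the blood-group slide later in the talk, purely for illustration):

```python
import numpy as np

# Illustrative genotype counts (borrowed from the blood-group slide below).
n_aa, n_aA, n_AA = 38, 95, 53
n = n_aa + n_aA + n_AA

# Closed-form MLE from the slide.
theta_hat = (2 * n_aa + n_aA) / (2 * n)

# Numerical check: maximize the multinomial log-likelihood over a fine grid
# (the multinomial coefficient is a constant in theta and can be dropped).
theta = np.linspace(1e-6, 1 - 1e-6, 100_001)
loglik = (2 * n_aa * np.log(theta)
          + n_aA * np.log(2 * theta * (1 - theta))
          + 2 * n_AA * np.log(1 - theta))

print(theta_hat, theta[np.argmax(loglik)])   # both come out near 0.4597
```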

SLIDE 5

What is likelihood and what is it good for?

Another example: linear regression

Fitting a line to a set of points in the plane {(x_1, y_1), . . . , (x_n, y_n)}, assuming the observations are independent and the errors are normally distributed. The model is: Y = β_1 X + β_0 + ε, ε ∼ N(0, σ²). What is the “probability” of observing the data under the given model?

P(Y lies in a δ-neighbourhood of y_i | X = x_i) ≈ density(Y)|_{X=x_i, Y=y_i} · 2δ, so “probability” is replaced by density. If X is fixed, Y − β_1 X − β_0 ∼ N(0, σ²) ⇒ Y ∼ N(β_1 X + β_0, σ²).

SLIDE 6

What is likelihood and what is it good for?

Another example: linear regression (continued)

Maximizing

density(Y)|_{X=x_i, Y=y_i} = ∏_{i=1}^{n} (1 / (√(2π) σ)) · exp(−(β_1 x_i + β_0 − y_i)² / (2σ²))
                           = (1 / (√(2π) σ))^n · exp(−(1 / (2σ²)) ∑_{i=1}^{n} (β_1 x_i + β_0 − y_i)²)

is equivalent to minimizing ∑_{i=1}^{n} (β_1 x_i + β_0 − y_i)². But this is exactly the least squares!
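
A minimal numerical sketch of this equivalence (not from the slides; the data are simulated for illustration): the ordinary least-squares fit and a brute-force maximization of the Gaussian likelihood land on the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data for illustration: a noisy line y = 2x + 1.
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# Least squares, via the normal equations.
X = np.column_stack([x, np.ones_like(x)])
beta1_ls, beta0_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Maximum likelihood: for fixed sigma, maximizing the Gaussian likelihood is
# the same as minimizing the residual sum of squares, so a grid search for
# the smallest RSS over (beta1, beta0) is a grid search for the MLE.
b1 = np.linspace(1.5, 2.5, 201)
b0 = np.linspace(-0.5, 2.5, 301)
B1, B0 = np.meshgrid(b1, b0, indexing="ij")
rss = ((B1[..., None] * x + B0[..., None] - y) ** 2).sum(axis=-1)
i, j = np.unravel_index(np.argmin(rss), rss.shape)

print(beta1_ls, beta0_ls)   # least-squares estimates
print(b1[i], b0[j])         # maximum-likelihood estimates (agree up to the grid step)
```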

SLIDE 7

What is likelihood and what is it good for?

Refined formal definition

Assuming a random variable X has a density function f(x, θ) parametrized by θ, the likelihood function is: θ → f(x, θ).

“Conceptual” definition

Likelihood is the probability of the observed data under the given model. Thus, the maximum likelihood corresponds to the model (in the given parametrized class of models) which makes the observed data “most likely”. One usually maximizes log f(x, θ) instead of f(x, θ) (the log-likelihood function). OK, since log is monotonic. But ...

SLIDE 8

Why logarithm?

◮ Turns multiplicative things to additive.

In most cases in practice, the likelihood function is a product of several functions. E.g., if X_1, . . . , X_n are independent random variables, then their likelihood function is f(x_1, . . . , x_n, θ) = f(x_1, θ) · · · f(x_n, θ), so the logarithm turns the multiplicative expression into an additive one, which is easier to deal with. (And the logarithm is the only “good” function taking multiplication to addition.)

◮ Diminishes the “long tail”.

A random variable with values in R_+ (say, the results of a measurement) tends to have a distribution skewed to the right, because there is a lower limit but no upper limit. Passing to the log diminishes this skewness.
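
A side benefit of the first point, illustrated with a minimal sketch (not from the slides): the raw product of many densities underflows in floating point, while the sum of their logarithms remains perfectly usable.

```python
import numpy as np

rng = np.random.default_rng(1)

# 10,000 illustrative i.i.d. observations from N(0, 1).
x = rng.normal(size=10_000)

# Per-observation densities under the model N(0, 1).
dens = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

likelihood = np.prod(dens)             # underflows to 0.0 in double precision
log_likelihood = np.sum(np.log(dens))  # finite (about -14,000 here)

print(likelihood, log_likelihood)
```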

SLIDE 9

What is likelihood and what is it good for?

Maximum likelihood behaves nicely asymptotically

Taylor series: ℓ(θ) = ℓ(θ̂) + ½ (θ − θ̂)² ℓ′′(θ̂) + . . .; i(θ) = E(−ℓ′′(θ)) – the Fisher information. θ̂ ∼ N(θ_0, i(θ_0)⁻¹) as the number of samples → ∞. This can be used to assess the precision of θ̂.
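
A minimal sketch of how this is used in practice (not from the slides), continuing the genotype example with the same illustrative counts 38, 95, 53: the observed information at θ̂, obtained here by a finite difference, gives an approximate standard error for θ̂.

```python
import numpy as np

n_aa, n_aA, n_AA = 38, 95, 53       # illustrative counts, as above
n = n_aa + n_aA + n_AA

def loglik(theta):
    # Multinomial log-likelihood of the genotype counts (constant dropped).
    return (2 * n_aa * np.log(theta)
            + n_aA * np.log(2 * theta * (1 - theta))
            + 2 * n_AA * np.log(1 - theta))

theta_hat = (2 * n_aa + n_aA) / (2 * n)

# Observed information: minus the second derivative of the log-likelihood at
# theta_hat, approximated by a central finite difference.
h = 1e-4
info = -(loglik(theta_hat + h) - 2 * loglik(theta_hat) + loglik(theta_hat - h)) / h ** 2

se = 1 / np.sqrt(info)              # approximate standard error of theta_hat
print(theta_hat, se)                # about 0.46 +/- 0.026
```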

SLIDE 10

What is likelihood and what is it good for?

Connection with some fancy areas of Mathematics

Back to the alleles-and-genotypes example: a model with inbreeding coefficient λ:

frequencies of alleles: a: θ, A: 1 − θ
frequencies of genotypes: aa: θ² + θ(1 − θ)λ, aA: 2θ(1 − θ)(1 − λ), AA: (1 − θ)² + θ(1 − θ)λ
numbers: 38, 95, 53

(some real blood-group data from the UK, 1947)

The scoring equations are equivalent to:

372θ³λ² − 744θ³λ − 558θ²λ² + 372θ³ + 1131θ²λ + 186θλ² − 573θ² − 668θλ + 201θ + 148λ = 0;
186θ²λ² − 372θ²λ − 186θλ² + 186θ² + 387θλ − 201θ − 148λ + 53 = 0.

Statistics + Algebraic Geometry = Algebraic Statistics.

SLIDE 11

What is likelihood and what is it good for?

Advantages (to summarize)

◮ Agrees with intuition.
◮ Confirmed by other methods.
◮ “Nice” asymptotic behavior.
◮ Very good practical results.
◮ Universal.
◮ Connection with other areas of Mathematics.

Disadvantages

◮ No “theoretical” justification.
◮ Could be bad for small samples.
◮ No way to compare “disjoint” models.
◮ The “Bayesian” issue ...

SLIDE 12

What is likelihood and what is it good for?

“Bayesian” issue:

Pr(data | model) = Pr(model | data) · Pr(data) / Pr(model).

Philosophical mumbo-jumbo:

◮ M. Forster and E. Sober, Why likelihood?, in: The Nature of Scientific Evidence (ed. M. Taper and S. Lele), Univ. of Chicago Press, 2004, 153–165; http://philosophy.wisc.edu/forster/Likelihood/default.htm
◮ B. Fitelson, Likelihoodism, Bayesianism, and relational confirmation, Synthese, to appear; http://fitelson.org/research.htm

SLIDE 13

EM algorithm

Finding the maximum of a likelihood function can be difficult.

Example: alleles and phenotypes

Assume A is dominant, and we observe only phenotypes:

frequencies of alleles: a: θ, A: 1 − θ
frequencies of genotypes: aa: θ², aA: 2θ(1 − θ), AA: (1 − θ)²
numbers of phenotypes: a: 38, A: 148

The scoring equation amounts to 38/θ² − 148/(1 − θ²) = 0, i.e. it is biquadratic. Suppose we don’t know how / don’t want to solve it. What to do? Introduce back the missing numbers n_aA and n_AA (hidden parameters) and iterate.
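
For reference, this particular scoring equation does have an easy closed-form root, which is handy as a target for the iteration on the next slide (a one-line check, not from the slides):

```python
from math import sqrt

# 38/theta**2 - 148/(1 - theta**2) = 0  rearranges to  186 * theta**2 = 38.
theta_direct = sqrt(38 / 186)
print(theta_direct)   # about 0.452 -- the value the EM iteration should approach
```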

SLIDE 14

EM algorithm

Example: alleles and phenotypes (continued)

E Step 1: initial genotype numbers: n_aA = n_AA = 148/2 = 74.00
M Step 2: find the MLE for those numbers: θ = (2 · 38 + 74.00)/(2 · 186) = 0.40
E Step 3: for θ = 0.40, find the genotype frequencies, for aA: 2 · 0.40 · (1 − 0.40) = 0.48 and for AA: (1 − 0.40)² = 0.36, and, for them, the genotype numbers: n_aA = 186 · 0.48 = 89.28, n_AA = 148 − 89.28 = 58.72
M Step 4: find the MLE for those numbers: θ = (2 · 38 + 89.28)/(2 · 186) = 0.44
E Step 5: for θ = 0.44, find the genotype frequencies, for aA: 2 · 0.44 · (1 − 0.44) = 0.49 and for AA: (1 − 0.44)² = 0.31, and the genotype numbers: n_aA = 186 · 0.49 = 91.14, n_AA = 148 − 91.14 = 56.86
M Step 6: find the MLE for those numbers: θ = (2 · 38 + 91.14)/(2 · 186) = 0.44
Stop!
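
A minimal Python sketch of this EM iteration (not from the slides), following the same E and M updates as above but without the intermediate rounding; it converges to θ ≈ 0.452, the root of the scoring equation from the previous slide.

```python
n_a, n_A = 38, 148            # observed phenotype counts: aa vs. (aA or AA)
n = n_a + n_A

n_aA = n_A / 2.0              # E step 1: initial guess for the hidden aA count
theta = 0.0

for _ in range(100):
    # M step: closed-form MLE of theta for the "completed" genotype counts.
    new_theta = (2 * n_a + n_aA) / (2 * n)
    if abs(new_theta - theta) < 1e-10:
        break
    theta = new_theta
    # E step (as on the slide): expected aA count under the current theta;
    # n_AA = n_A - n_aA would follow but is not needed for the next M step.
    n_aA = n * 2 * theta * (1 - theta)

print(theta)                  # about 0.452
```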

SLIDE 15

EM algorithm

Advantages

◮ Reduces an MLE problem to another, more manageable (MLE) problem.
◮ Agrees with results obtained by other means.
◮ Works in practice.

Disadvantages

◮ No theoretical justification.

SLIDE 16

Maximum likelihood and EM algorithm at deCODE

Association studies

nemo by Daníel Gudbjartsson. Typical input data: a list of affected and unaffected individuals, a list of markers (e.g. SNPs), and a list of genotypes (per marker and per individual).

Haplotype inference from genotypes

Maximum parsimony vs. maximum likelihood. Example (0, 1 – homozygote, 2 – heterozygote):

genotypes: 2120, 2102, 1221 ⇐ parsimonious solution: 0100 + 1110, 0100 + 1101, 1011 + 1101
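
A small combinatorial sketch of the phasing problem (not from the slides, and not the nemo implementation): enumerating the haplotype pairs compatible with each genotype string. The parsimonious solution quoted above is the choice that keeps the number of distinct haplotypes small (four across the three genotypes).

```python
from itertools import product

def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a 0/1/2 genotype string
    (0/1 = homozygote for that allele, 2 = heterozygote, as on the slide)."""
    het = [i for i, g in enumerate(genotype) if g == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for pos, b in zip(het, bits):
            h1[pos] = b
            h2[pos] = "1" if b == "0" else "0"
        pairs.add(frozenset(("".join(h1), "".join(h2))))
    return pairs

for g in ["2120", "2102", "1221"]:
    print(g, sorted(tuple(sorted(p)) for p in compatible_pairs(g)))
```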

SLIDE 17

That’s all.

Slides at http://justpasha.org/tmp/presentation.pdf.