Two Statistical Paradigms: Bayesian versus Frequentist — Steven Janke



SLIDE 1

Two Statistical Paradigms

Bayesian versus Frequentist
Steven Janke
April 2012

SLIDE 2

Probability versus Statistics

Probability Problem

Given some symmetry, what is the probability that an event occurs? (Example: Probability of 3 heads when a fair coin is flipped 5 times.)

Statistical Problem

Given some data, what is the underlying symmetry? (Example: Four heads resulted from tossing a coin 5 times. Is it a fair coin?)
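A minimal sketch in Python contrasting the two directions (the coin counts follow the slide; the biased value p = 0.8 is an arbitrary illustration):

```python
from math import comb

# Probability problem: given symmetry (a fair coin), compute the chance
# of exactly 3 heads in 5 flips.
print(comb(5, 3) * 0.5**5)  # 0.3125

# Statistical problem: given the data (4 heads in 5 tosses), compare how
# likely that outcome is under a fair coin versus a hypothetical biased one.
for p in (0.5, 0.8):
    print(p, comb(5, 4) * p**4 * (1 - p))
```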

SLIDE 3

Statistics Problems

All statistical problems begin with a model for the problem. The actual question then comes in different flavors:

- Point estimation (Find the mean of the population.)
- Interval estimation (Confidence interval for the mean.)
- Hypothesis testing (Is the mean greater than 10?)
- Model fit (Could one random variable be linearly related to another?)

SLIDE 4

Estimators and Loss functions

Definition (Estimator)

An estimator $\hat{\theta}$ is a random variable that is a function of the data and reasonably approximates $\theta$. For example, $\hat{\theta} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$.

Loss Functions

Squared Error: $L(\theta - \hat{\theta}) = (\theta - \hat{\theta})^2$
Absolute Error: $L(\theta - \hat{\theta}) = |\theta - \hat{\theta}|$
Linex: $L(\theta - \hat{\theta}) = \exp(c(\theta - \hat{\theta})) - c(\theta - \hat{\theta}) - 1$
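The three loss functions are easy to compare numerically. A small sketch (the Linex constant c = 0.5 is an arbitrary choice, not from the slides):

```python
import numpy as np

def squared_error(d):   # d = theta - theta_hat
    return d**2

def absolute_error(d):
    return np.abs(d)

def linex(d, c=0.5):
    # Asymmetric: for c > 0, errors where c*d is large and positive are
    # penalized roughly exponentially, the opposite sign only linearly.
    return np.exp(c * d) - c * d - 1

d = np.linspace(-2, 2, 5)
print(squared_error(d))
print(absolute_error(d))
print(linex(d))
```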

SLIDE 5

Comparing Estimators

Definition (Risk)

An estimator's risk is $R(\theta, \hat{\theta}) = E(L(\theta, \hat{\theta}))$. For two estimators $\hat{\theta}_1$ and $\hat{\theta}_2$, if $R(\theta, \hat{\theta}_1) < R(\theta, \hat{\theta}_2)$ for all $\theta \in \Theta$, then $\hat{\theta}_2$ is called inadmissible.

How do you find "good" estimators? What does "good" mean? How do you definitively rank estimators?

SLIDE 6

Frequentist Approach

Philosophical principles:

- Probability is relative frequency over repeatable trials. (Law of Large Numbers, Central Limit Theorem)
- $\theta$ is a fixed quantity, but the data can vary.

Definition (Likelihood Function)

Suppose the population density is $f_\theta$ and the data $(x_1, x_2, \dots, x_n)$ are sampled independently. Then the likelihood function is $L(\theta) = \prod_{i=1}^{n} f_\theta(x_i)$.

The value of $\theta$ (as a function of the data) that gives the maximum likelihood is a good candidate for an estimator $\hat{\theta}$.

SLIDE 7

Maximum Likelihood for Normal and Binomial

Population is Normal($\mu, \sigma^2$): $L(\mu) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} e^{-(x_i - \mu)^2 / (2\sigma^2)}$. Maximum likelihood gives $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

Population is Binomial($n, p$): flip a coin $n$ times with $P[\text{head}] = p$. Then $L(p) = p^k (1-p)^{n-k}$, where $k$ is the number of heads. Maximum likelihood gives $\hat{p} = k/n$.

Both estimators are unbiased: $E\hat{p} = np/n = p$ and $E\hat{\mu} = \sum \mu / n = \mu$. Among all unbiased estimators, these estimators minimize the squared error loss.
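A quick numerical check of both closed forms, assuming scipy is available (the data, seed, and search bounds are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)

# Normal mean (sigma known): maximizing L(mu) is equivalent to
# minimizing the sum of squares, so the maximizer should be x-bar.
res = minimize_scalar(lambda mu: np.sum((x - mu) ** 2),
                      bounds=(0.0, 10.0), method="bounded")
print(res.x, x.mean())

# Binomial: k heads in n flips; the maximizer of p^k (1-p)^(n-k)
# should be k/n.
k, n = 4, 5
res = minimize_scalar(lambda p: -(k * np.log(p) + (n - k) * np.log1p(-p)),
                      bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, k / n)
```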

SLIDE 8

Frequentist Conclusions

Definition (Confidence Interval)

$\bar{X}$ is a random variable with mean $\mu$ and distribution $N(\mu, \sigma^2/n)$.

The random interval $(\bar{X} - 1.96\,\sigma/\sqrt{n},\; \bar{X} + 1.96\,\sigma/\sqrt{n})$ is a confidence interval. The probability that it covers $\mu$ is 0.95.

>>> The probability that $\mu$ is in the interval is not 0.95!
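The frequentist reading of "covers" can be checked by simulation: the interval is random while $\mu$ stays fixed. A sketch with arbitrary $\mu$, $\sigma$, and $n$:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, trials = 10.0, 3.0, 25, 100_000

# Simulate many samples; each yields a different random interval.
xbar = rng.normal(mu, sigma / np.sqrt(n), size=trials)
half = 1.96 * sigma / np.sqrt(n)
covered = (xbar - half < mu) & (mu < xbar + half)
print(covered.mean())  # ~0.95: the fraction of random intervals covering mu
```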

Definition (Hypothesis Testing)

Null hypothesis: $\mu = 10$. Calculate $z^* = \frac{\bar{X} - 10}{\sigma/\sqrt{n}}$.

The p-value is $P[Z \le -|z^*|] + P[Z \ge |z^*|]$.

>>> The p-value is not the probability that the null hypothesis is true.

SLIDE 9

Space of Estimators

Methods: Moments, Linear, Invariant

Characteristics: Unbiased, Sufficient, Consistent, Minimum Variance, Admissible

SLIDE 10

Frequentist Justification: Three Theorems

Theorem (Rao-Blackwell)

A statistic $T$ such that the conditional distribution of the data given $T$ does not depend on $\theta$ (the parameter) is called sufficient. Conditioning an unbiased estimator on a sufficient statistic gives an unbiased estimator with variance no larger.

Theorem (Lehmann-Scheffe)

If $T$ is a complete sufficient statistic for $\theta$ and $\hat{\theta} = h(T)$ is an unbiased estimator of $\theta$, then $\hat{\theta}$ has the smallest possible variance among unbiased estimators of $\theta$. (UMVUE of $\theta$.)

Theorem (Cramer-Rao)

Let a population have a "regular" distribution with density $f_\theta(x)$ and let $\hat{\theta}$ be an unbiased estimator of $\theta$. Then $\mathrm{Var}(\hat{\theta}) \ge \left(n E\left[\left(\frac{\partial}{\partial\theta} \ln f_\theta(X)\right)^2\right]\right)^{-1}$.
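For the normal mean, the score is $(x-\mu)/\sigma^2$, so $nE[(\frac{\partial}{\partial\mu}\ln f_\mu(X))^2] = n/\sigma^2$ and the bound is $\sigma^2/n$, which $\bar{X}$ attains. A simulation sketch with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, trials = 0.0, 2.0, 20, 200_000

xbars = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
print(xbars.var())    # empirical variance of x-bar
print(sigma**2 / n)   # Cramer-Rao bound: (n * (1/sigma^2))^-1
```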

SLIDE 11

Problems Choosing Estimators

How does the Frequentist choose an estimator?

Example (Variance)

Maximum likelihood estimator: $\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$

Unbiased estimator: $\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$

Minimum mean squared error: $\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n+1}$
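A simulation sketch comparing the three divisors under squared error loss ($\sigma^2 = 4$ and $n = 10$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, n, trials = 4.0, 10, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for divisor, label in [(n, "MLE (n)"), (n - 1, "unbiased (n-1)"),
                       (n + 1, "min MSE (n+1)")]:
    mse = ((ss / divisor - sigma2) ** 2).mean()
    print(label, round(mse, 4))
# The n+1 divisor yields the smallest mean squared error of the three.
```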

Example (Mean)

Estimate the IQ of an individual whose test score was 130. Now suppose the scoring machine reported all scores below 100 as 100. Even though the observed score of 130 is unaffected, the sampling distribution has changed, so the frequentist must change the estimate of IQ.

SLIDE 12

What is the p-value?

The p-value depends on events that didn’t happen.

Example (Flipping Coins)

Flipping a fair coin $n = 17$ times gives $K = 13$ heads: the p-value is $P[K \le 4 \text{ or } K \ge 13] = 0.049$.
If the experimenter instead stopped once there were at least 4 heads and 4 tails, the same data give a p-value of 0.021.
If the experimenter stops at $n = 17$ but continues to $n = 44$ when the data are inconclusive, then the p-value is $> 0.049$.
The p-value was already suspect, since it is NOT the probability that $H_0$ is true.
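The fixed-design p-value is a short computation; the 0.021 figure depends on the sequential stopping rule and is not recomputed here:

```python
from math import comb

# Fixed-design, two-sided p-value for 13 heads in n = 17 flips of a fair coin.
n = 17
pmf = [comb(n, k) * 0.5**n for k in range(n + 1)]
p_value = sum(pmf[k] for k in range(n + 1) if k <= 4 or k >= 13)
print(round(p_value, 3))  # 0.049
# Under the "stop at 4 heads and 4 tails" design the same flips give 0.021:
# the p-value changes even though the observed data do not.
```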

SLIDE 13

Several Normal means at once

Example (Estimating Several Normal Means)

Suppose we estimate $k$ normal means $\mu_1, \mu_2, \dots, \mu_k$ from populations with common variance. The frequentist estimate is $\bar{x} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k)$, using Euclidean distance as the loss function. Stein (1956) showed that for $k \ge 3$, $\bar{x}$ is not even admissible! A better estimator is $\hat{\mu}(\bar{x}) = \left(1 - \frac{k-2}{\|\bar{x}\|^2}\right)\bar{x}$. The better estimator can be thought of as a Bayesian estimator with the correct prior.
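A simulation sketch of the Stein effect, assuming one $N(\mu_i, 1)$ observation per mean (the true means are drawn arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)
k, trials = 10, 50_000
mu = rng.normal(0.0, 1.0, size=k)            # arbitrary true means

x = rng.normal(mu, 1.0, size=(trials, k))    # one observation per mean
shrink = 1 - (k - 2) / (x**2).sum(axis=1, keepdims=True)
js = shrink * x                              # James-Stein shrinkage

print(((x - mu) ** 2).sum(axis=1).mean())    # risk of x itself (~k)
print(((js - mu) ** 2).sum(axis=1).mean())   # smaller for k >= 3
```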

SLIDE 14

Bayes Theorem

Richard Price communicated Rev. Thomas Bayes's paper to the Royal Society (1763). Given data, Bayes found the probability that $p$ for a binomial lies in a given interval. One of the many lemmas is now known as "Bayes Theorem":

$P[A|B] = \frac{P[AB]}{P[B]} = \frac{P[B|A]\,P[A]}{P[B]} = \frac{P[B|A]\,P[A]}{P[B|A]\,P[A] + P[B|A^c]\,P[A^c]}$

In the modern treatment, conditional probability is defined after defining conditional expectation as an integral over a sub-sigma-field of sets. The conditional density is then

$f(x|y) = \frac{f(x,y)}{f_Y(y)} = \frac{f(y|x)\,f_X(x)}{f_Y(y)}$
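A numerical sketch of the discrete form of the theorem, with made-up probabilities (a rare condition and an imperfect test):

```python
# Bayes' theorem with two events: A = "has condition", B = "test positive".
p_A = 0.01                      # prior P[A] (hypothetical)
p_B_given_A = 0.95              # sensitivity P[B|A] (hypothetical)
p_B_given_Ac = 0.05             # false-positive rate P[B|A^c] (hypothetical)

p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)   # total probability
p_A_given_B = p_B_given_A * p_A / p_B                # Bayes' theorem
print(round(p_A_given_B, 3))    # ~0.161: the data update the 1% prior
```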

SLIDE 15

Bayesian Approach

- Assume that the parameter under investigation is a random variable and that the data are fixed.
- Decide on a prior distribution for the parameter. (Bayes considered $p$ to be uniformly distributed from 0 to 1.)
- Calculate the distribution of the parameter given the data (the posterior distribution).
- To minimize squared error loss, take the expected value of the posterior distribution as the estimator.

SLIDE 16

Bayesian Approach for the Binomial

Assume a binomial model with parameters $n, p$ (flipping a coin). Let $K$ be the number of heads (the data). For a prior distribution on $p$, take the Beta distribution:

$f_p(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1-p)^{\beta-1}$ for $0 < p < 1$.

$f(p|k) = \frac{f(k|p)\,f_p(p)}{f_K(k)} = C \cdot p^{\alpha+k-1}(1-p)^{\beta+n-k-1}$

Hence, $E[p|k] = \frac{\alpha+k}{\alpha+\beta+n}$.

With the uniform prior ($\alpha = 1$, $\beta = 1$), the estimator is $\frac{k+1}{n+2} = \left(\frac{n}{n+2}\right)\frac{k}{n} + \left(\frac{2}{n+2}\right)\frac{1}{2}$.
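A sketch of the Beta-binomial update, assuming scipy; it reproduces the closed form for the uniform prior with the earlier data (4 heads in 5 tosses):

```python
from scipy.stats import beta

alpha, beta_prm, n, k = 1.0, 1.0, 5, 4       # Beta(1,1) = uniform prior

post = beta(alpha + k, beta_prm + n - k)     # posterior: Beta(alpha+k, beta+n-k)
print(post.mean())                           # (alpha+k)/(alpha+beta+n)
print((k + 1) / (n + 2))                     # same value, 5/7 ~ 0.714
```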

SLIDE 17

Bayesian Approach for the Normal

If $\mu$ is the mean of a normal population, a reasonable prior distribution for $\mu$ could be another normal distribution, $N(\mu_0, \sigma_0^2)$.

$f(\mu|\bar{x}) = \frac{\sqrt{n}}{2\pi\sigma\sigma_0\, g(\bar{x})}\, e^{-\frac{n(\bar{x}-\mu)^2}{2\sigma^2} - \frac{(\mu-\mu_0)^2}{2\sigma_0^2}} = \frac{\sqrt{n}\, e^{-c}}{2\pi\sigma\sigma_0}\, e^{-\frac{(\mu-\mu_1)^2}{2\sigma_1^2}}$

Hence, $E[\mu|\bar{x}] = \bar{x}\left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right) + \mu_0\left(\frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\right)$.
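The posterior mean is a precision-weighted compromise between $\bar{x}$ and $\mu_0$. A sketch with arbitrary numbers:

```python
def normal_posterior_mean(xbar, n, sigma2, mu0, sigma0_2):
    """Posterior mean of mu given x-bar, for a N(mu0, sigma0^2) prior."""
    w = n * sigma0_2 / (n * sigma0_2 + sigma2)   # weight on the data
    return w * xbar + (1 - w) * mu0

# A vague prior (large sigma0^2) recovers x-bar; a tight prior pulls
# the estimate toward mu0.
print(normal_posterior_mean(12.0, 25, sigma2=9.0, mu0=10.0, sigma0_2=100.0))
print(normal_posterior_mean(12.0, 25, sigma2=9.0, mu0=10.0, sigma0_2=0.01))
```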

SLIDE 18

Bayesian Approach under other Loss functions

Note: If we use absolute error loss, then the Bayes estimator is the median of the posterior distribution.

If we use Linex loss, then the Bayes estimator is $-\frac{1}{c}\ln E[e^{-c\theta}]$ (expectation under the posterior).

SLIDE 19

Calculating Mean Squared Error

If the loss function is squared error, then the mean squared error (MSE) measures expected loss.

$MSE = E(\hat{\theta} - \theta)^2 = (E(\hat{\theta}) - \theta)^2 + E(\hat{\theta} - E(\hat{\theta}))^2$ (squared bias + variance)

For $\hat{\theta} = \frac{k}{n}$: $MSE = 0^2 + \frac{p(1-p)}{n}$

For $\hat{\theta} = \frac{k+1}{n+2}$: $MSE = \left(\frac{1-2p}{n+2}\right)^2 + \frac{n\,p(1-p)}{(n+2)^2}$
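Evaluating both expressions on a grid ($n = 20$ is an arbitrary choice) shows the trade-off pictured on the next slide:

```python
import numpy as np

n = 20
p = np.linspace(0.0, 1.0, 5)

mse_freq = p * (1 - p) / n                          # for k/n
mse_bayes = ((1 - 2 * p) / (n + 2)) ** 2 \
            + n * p * (1 - p) / (n + 2) ** 2        # for (k+1)/(n+2)
print(np.round(mse_freq, 5))
print(np.round(mse_bayes, 5))
# The Bayes estimator wins near p = 1/2, the frequentist near the endpoints.
```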

SLIDE 20

Comparing the MSE

Figure: Comparing MSE for Freq and Bayes estimators

SLIDE 21

Bayesian Confidence (Credible) Interval

$\Theta$ and $X$ are the random variables representing the parameter of interest and the data. The functions $u(x)$ and $v(x)$ are arbitrary and $f$ is the posterior distribution. Then

$P[u(x) < \Theta < v(x) \mid X = x] = \int_{u(x)}^{v(x)} f(\theta \mid x)\, d\theta$

Definition (Credible Interval)

If $u(x)$ and $v(x)$ are picked so the probability is 0.95, the interval $[u(x), v(x)]$ is a 95% credible interval for $\Theta$.

For the normal distribution, $\frac{\bar{x}\,n\sigma_0^2 + \mu_0\sigma^2}{n\sigma_0^2 + \sigma^2} \pm 1.96\sqrt{\frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}}$ is a 95% credible interval for $\mu$.
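A sketch of the normal credible interval under the same conjugate prior as before (all numbers are arbitrary):

```python
import numpy as np

def credible_interval(xbar, n, sigma2, mu0, sigma0_2, z=1.96):
    """95% credible interval for mu under a N(mu0, sigma0^2) prior."""
    denom = n * sigma0_2 + sigma2
    center = (xbar * n * sigma0_2 + mu0 * sigma2) / denom
    half = z * np.sqrt(sigma0_2 * sigma2 / denom)
    return center - half, center + half

print(credible_interval(12.0, 25, sigma2=9.0, mu0=10.0, sigma0_2=4.0))
```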

SLIDE 22

Comparing Estimates

Frequentist and Bayesian estimates for $p$ in the binomial:

                 (k, n)    Estimate   95% Conf. Int.    Width
Frequentist      (2, 3)    0.667      (0.125, 0.982)    0.857
                 (30, 45)  0.667      (0.509, 0.796)    0.287
                 (60, 90)  0.667      (0.559, 0.760)    0.201
Bayes Beta(1,1)  (2, 3)    0.600      (0.235, 0.964)    0.729
                 (30, 45)  0.660      (0.527, 0.793)    0.266
                 (60, 90)  0.663      (0.567, 0.759)    0.191
Bayes Beta(4,4)  (2, 3)    0.545      (0.270, 0.820)    0.550
                 (30, 45)  0.642      (0.514, 0.729)    0.254
                 (60, 90)  0.653      (0.560, 0.747)    0.187
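A sketch that recomputes the table, assuming scipy. The slides do not state which interval constructions were used, so the code below uses a Wald interval and equal-tailed posterior intervals; the posterior means match the table exactly, while interval endpoints may differ slightly:

```python
from scipy.stats import beta

def freq_interval(k, n, z=1.96):
    # Wald interval; the slide's exact construction isn't stated.
    p = k / n
    half = z * (p * (1 - p) / n) ** 0.5
    return round(p, 3), (round(p - half, 3), round(p + half, 3))

def bayes_interval(k, n, a, b):
    # Equal-tailed interval from the Beta(a+k, b+n-k) posterior.
    post = beta(a + k, b + n - k)
    lo, hi = post.interval(0.95)
    return round(post.mean(), 3), (round(lo, 3), round(hi, 3))

for k, n in [(2, 3), (30, 45), (60, 90)]:
    print((k, n), freq_interval(k, n),
          bayes_interval(k, n, 1, 1), bayes_interval(k, n, 4, 4))
```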

SLIDE 23

Bayesian Hypothesis testing

For the coin example, $H_0: p = \frac{1}{2}$ and $H_1: p \neq \frac{1}{2}$. Suppose $K = 13$ heads in $n = 17$ flips.

Objective stance: $P[H_0 \text{ is true}] = \frac{1}{2}$.

Subjective input: the largest plausible value of $p$, say $p_0$, with $p$ taken uniform on $(1-p_0, p_0)$ under $H_1$. Then

$P[H_0 \mid K = k] = \frac{P[K=k \mid H_0]\,P[H_0]}{P[K=k \mid H_0]\,P[H_0] + P[K=k \mid H_1]\,P[H_1]} = \frac{2^{-17} \cdot 0.5}{2^{-17} \cdot 0.5 + \frac{0.5}{2p_0-1}\int_{1-p_0}^{p_0} p^{13}(1-p)^{4}\,dp}$

(The binomial coefficient cancels from numerator and denominator.) The probability depends on $p_0$, but it ranges only from about 0.21 to 0.5, giving at most mild evidence against $H_0$.

Selecting a uniform distribution for $p$ between $1-p_0$ and $p_0$ is not very restrictive: symmetric distributions give the same range.

More generally, Bayesians calculate the odds ratio between the hypotheses.
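A sketch of the posterior probability of $H_0$ as a function of $p_0$, assuming scipy (the grid of $p_0$ values is arbitrary):

```python
from scipy.integrate import quad

def post_prob_H0(p0, k=13, n=17):
    """P[H0 | K=k] with P[H0]=1/2 and p ~ Uniform(1-p0, p0) under H1."""
    m0 = 0.5**n                                   # binomial coefficient cancels
    m1, _ = quad(lambda p: p**k * (1 - p) ** (n - k), 1 - p0, p0)
    m1 /= (2 * p0 - 1)                            # uniform density on (1-p0, p0)
    return m0 / (m0 + m1)

for p0 in (0.6, 0.7, 0.8, 0.9, 0.999):
    print(p0, round(post_prob_H0(p0), 3))
# Values stay roughly between 0.21 and 0.5: only mild evidence against H0,
# even though the p-value of 0.049 looked "significant".
```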

SLIDE 24

Noninformative Priors

If objectivity is an issue, pick priors which affect the posterior distribution the least.

- Improper priors: select a uniform distribution on the whole real line. Formally, this gives a proper posterior for the normal and identifies $\bar{X}$ as the estimator.
- Reference priors (maximize "distance" from the posterior), Jeffreys priors (scale invariant), maximum entropy priors.
- Conjugate priors: convenient, easily interpreted.

SLIDE 25

Subjective Probability Axioms

Definition (Weak Order)

If $A$ and $B$ are events, then $A \preceq B$ means $B$ is at least as likely as $A$. (Informally, we are willing to place a bet that the odds of $B$ are at least as large as the odds for $A$.) The relation $\preceq$ is referred to as a weak order.

Axioms:

- For any pair of events $A$ and $B$, either $A \preceq B$ or $B \preceq A$.
- For disjoint sets, $A_1 \preceq B_1$ and $A_2 \preceq B_2$ implies $A_1 \cup A_2 \preceq B_1 \cup B_2$.
- For any event $A$, $\emptyset \preceq A$.
- If $\{A_i\}$ is a decreasing sequence of events and $B \preceq A_i$ for each $i$, then $B \preceq \bigcap_{i=1}^{\infty} A_i$.
- For the uniform distribution on $[0,1]$, any event is comparable to any subinterval $I$ (either $A \preceq I$ or $I \preceq A$).

Conclusion: the relation $\preceq$ is in agreement with a probability measure.

SLIDE 26

Theoretical Bayesian Results

Theorem (Bayes Estimators)

Under squared error loss, if $\hat{\theta}$ is an unbiased estimator of $\theta$, then it is not a Bayes estimator under any proper prior.

Theorem (Complete Class Theorem)

Under some weak assumptions, every “good” estimator is a “Bayes” estimator with respect to some prior distribution.

SLIDE 27

When is the Bayesian estimator better?

Definition (Bayes Risk)

There are two sources of randomness: the estimator and the parameter. The expected loss is $E_{F_\theta}[L(\hat{\theta}, \theta)]$. Let $G_0$ be the true distribution of $\theta$. Then the Bayes risk is

$r(G_0, \hat{\theta}) = E_{G_0} E_{F_\theta}[L(\hat{\theta}, \theta)]$

To compare a Bayesian estimator with prior $G$ (call it $\hat{\theta}_G$) to a frequentist estimator $\hat{\theta}$, compare the Bayes risks. When is $r(G_0, \hat{\theta}_G) \le r(G_0, \hat{\theta})$?

SLIDE 28

Threshold Theorem

Theorem (Samaniego, 2010)

If the Bayes estimator $\hat{\theta}_G$ under squared error loss has the form $\hat{\theta}_G = (1 - \eta)E_G\theta + \eta\hat{\theta}$, where $\hat{\theta}$ is a sufficient, unbiased estimator of $\theta$ and $\eta \in [0, 1)$, then $r(G_0, \hat{\theta}_G) \le r(G_0, \hat{\theta})$ if and only if

$\mathrm{Var}_{G_0}(\theta) + (E_G\theta - E_{G_0}\theta)^2 \le \frac{1+\eta}{1-\eta}\, r(G_0, \hat{\theta})$

SLIDE 29

Proof of Threshold Theorem

Proof:

$E_{F_\theta}(\hat{\theta}_G - \theta)^2 = E_{F_\theta}\left[\eta(\hat{\theta} - \theta) + (1 - \eta)(E_G\theta - \theta)\right]^2 \quad (3)$
$= \eta^2 E_{F_\theta}(\hat{\theta} - \theta)^2 + (1 - \eta)^2 (E_G\theta - \theta)^2 \quad (4)$

(The cross term vanishes because $\hat{\theta}$ is unbiased.) Taking the expectation of both sides with respect to $G_0$ gives:

$r(G_0, \hat{\theta}_G) = \eta^2 r(G_0, \hat{\theta}) + (1 - \eta)^2 E_{G_0}(\theta - E_G\theta)^2 \quad (5)$
$= \eta^2 r(G_0, \hat{\theta}) + (1 - \eta)^2 \left(\mathrm{Var}_{G_0}(\theta) + (E_G\theta - E_{G_0}\theta)^2\right) \quad (6)$

So $r(G_0, \hat{\theta}_G) \le r(G_0, \hat{\theta})$ if and only if $\mathrm{Var}_{G_0}(\theta) + (E_G\theta - E_{G_0}\theta)^2 \le \frac{1+\eta}{1-\eta}\, r(G_0, \hat{\theta})$. Q.E.D.

SLIDE 30

Example of Threshold Theorem

Example (Binomial Model)

Problem: estimate the proportion of "long" words starting the pages of Of Human Bondage.

The Bayes approach is to use $\frac{k+\alpha}{n+\alpha+\beta}$.

One could just specify the prior mean $\frac{\alpha}{\alpha+\beta} = p^*$ and an $\eta \in [0, 1)$ that expresses confidence in $\frac{k}{n}$.

The true prior was a degenerate distribution with mean 0.3008, and $r(G_0, \hat{p}) = 0.02103$.

By the Threshold Theorem, the Bayes estimator is superior if $(p^* - 0.3008)^2 < \frac{1+\eta}{1-\eta}(0.02103)$.

Superior pairs $(p^*, \eta)$ account for 55% of the square's area. Empirical result: 88 out of 99 Bayesian estimators were superior.
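A grid check of the 55% figure, assuming the "square" is the unit square of $(p^*, \eta)$ pairs:

```python
import numpy as np

# Grid over the unit square of (p*, eta) pairs from the slide's setup.
p_star, eta = np.meshgrid(np.linspace(0, 1, 1001), np.linspace(0, 0.999, 1001))

r = 0.02103                     # Bayes risk of the frequentist estimator p-hat
superior = (p_star - 0.3008) ** 2 < (1 + eta) / (1 - eta) * r
print(superior.mean())          # fraction of the square where Bayes wins, ~0.55
```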

SLIDE 31

Lingering Questions

What does "objective" mean? Should subjective probability add to the inference process? What is a good estimator?

A frequentist uses careful arguments and events that did not happen to answer the wrong question, while a Bayesian answers the right question by making assumptions that nobody can fully embrace.
