Two Statistical Paradigms: Bayesian versus Frequentist — Steven Janke



SLIDE 1

Two Statistical Paradigms

Bayesian versus Frequentist
Steven Janke
April 2012

SLIDE 2

Probability versus Statistics

Probability Problem

Given some symmetry, what is the probability that an event occurs? (Example: Probability of 3 heads when a fair coin is flipped 5 times.)

Statistical Problem

Given some data, what is the underlying symmetry? (Example: Four heads resulted from tossing a coin 5 times. Is it a fair coin?)
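A minimal sketch in Python contrasting the two directions (the coin counts follow the slide; the biased value p = 0.8 is an arbitrary illustration):

```python
from math import comb

# Probability problem: given symmetry (a fair coin), compute the chance
# of exactly 3 heads in 5 flips.
print(comb(5, 3) * 0.5**5)  # 0.3125

# Statistical problem: given the data (4 heads in 5 tosses), compare how
# likely that outcome is under a fair coin versus a hypothetical biased one.
for p in (0.5, 0.8):
    print(p, comb(5, 4) * p**4 * (1 - p))
```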

SLIDE 3

Statistics Problems

All statistical problems begin with a model for the problem. The actual question then comes in different flavors:

- Point estimation (Find the mean of the population.)
- Interval estimation (Confidence interval for the mean.)
- Hypothesis testing (Is the mean greater than 10?)
- Model fit (Could one random variable be linearly related to another?)

SLIDE 4

Estimators and Loss functions

Definition (Estimator)

An estimator $\hat{\theta}$ is a random variable that is a function of the data and reasonably approximates $\theta$. For example, $\hat{\theta} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$.

Loss Functions

Squared Error: $L(\theta - \hat{\theta}) = (\theta - \hat{\theta})^2$
Absolute Error: $L(\theta - \hat{\theta}) = |\theta - \hat{\theta}|$
Linex: $L(\theta - \hat{\theta}) = \exp(c(\theta - \hat{\theta})) - c(\theta - \hat{\theta}) - 1$
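The three loss functions are easy to compare numerically. A small sketch (the Linex constant c = 0.5 is an arbitrary choice, not from the slides):

```python
import numpy as np

def squared_error(d):   # d = theta - theta_hat
    return d**2

def absolute_error(d):
    return np.abs(d)

def linex(d, c=0.5):
    # Asymmetric: for c > 0, errors where c*d is large and positive are
    # penalized roughly exponentially, the opposite sign only linearly.
    return np.exp(c * d) - c * d - 1

d = np.linspace(-2, 2, 5)
print(squared_error(d))
print(absolute_error(d))
print(linex(d))
```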

SLIDE 5

Comparing Estimators

Definition (Risk)

An estimator's risk is $R(\theta, \hat{\theta}) = E(L(\theta, \hat{\theta}))$. For two estimators $\hat{\theta}_1$ and $\hat{\theta}_2$, if $R(\theta, \hat{\theta}_1) < R(\theta, \hat{\theta}_2)$ for all $\theta \in \Theta$, then $\hat{\theta}_2$ is called inadmissible.

How do you find "good" estimators? What does "good" mean? How do you definitively rank estimators?

SLIDE 6

Frequentist Approach

Philosophical principles:

- Probability is relative frequency over repeatable trials. (Law of Large Numbers, Central Limit Theorem)
- $\theta$ is a fixed quantity, but the data can vary.

Definition (Likelihood Function)

Suppose the population density is $f_\theta$ and the data $(x_1, x_2, \dots, x_n)$ are sampled independently. Then the likelihood function is $L(\theta) = \prod_{i=1}^{n} f_\theta(x_i)$.

The value of $\theta$ (as a function of the data) that gives the maximum likelihood is a good candidate for an estimator $\hat{\theta}$.

SLIDE 7

Maximum Likelihood for Normal and Binomial

Population is Normal($\mu, \sigma^2$): $L(\mu) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} e^{-(x_i - \mu)^2 / (2\sigma^2)}$. Maximum likelihood gives $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

Population is Binomial($n, p$): flip a coin $n$ times with $P[\text{head}] = p$. Then $L(p) = p^k (1-p)^{n-k}$, where $k$ is the number of heads. Maximum likelihood gives $\hat{p} = k/n$.

Both estimators are unbiased: $E\hat{p} = np/n = p$ and $E\hat{\mu} = \sum \mu / n = \mu$. Among all unbiased estimators, these estimators minimize the squared error loss.
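A quick numerical check of both closed forms, assuming scipy is available (the data, seed, and search bounds are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)

# Normal mean (sigma known): maximizing L(mu) is equivalent to
# minimizing the sum of squares, so the maximizer should be x-bar.
res = minimize_scalar(lambda mu: np.sum((x - mu) ** 2),
                      bounds=(0.0, 10.0), method="bounded")
print(res.x, x.mean())

# Binomial: k heads in n flips; the maximizer of p^k (1-p)^(n-k)
# should be k/n.
k, n = 4, 5
res = minimize_scalar(lambda p: -(k * np.log(p) + (n - k) * np.log1p(-p)),
                      bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, k / n)
```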

SLIDE 8

Frequentist Conclusions

Definition (Confidence Interval)

$\bar{X}$ is a random variable with mean $\mu$ and distribution $N(\mu, \sigma^2/n)$.

The random interval $(\bar{X} - 1.96\,\sigma/\sqrt{n},\; \bar{X} + 1.96\,\sigma/\sqrt{n})$ is a confidence interval. The probability that it covers $\mu$ is 0.95.

>>> The probability that $\mu$ is in the interval is not 0.95!
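The frequentist reading of "covers" can be checked by simulation: the interval is random while $\mu$ stays fixed. A sketch with arbitrary $\mu$, $\sigma$, and $n$:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, trials = 10.0, 3.0, 25, 100_000

# Simulate many samples; each yields a different random interval.
xbar = rng.normal(mu, sigma / np.sqrt(n), size=trials)
half = 1.96 * sigma / np.sqrt(n)
covered = (xbar - half < mu) & (mu < xbar + half)
print(covered.mean())  # ~0.95: the fraction of random intervals covering mu
```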

Definition (Hypothesis Testing)

Null hypothesis: $\mu = 10$. Calculate $z^* = \frac{\bar{X} - 10}{\sigma/\sqrt{n}}$.

The p-value is $P[Z \le -|z^*|] + P[Z \ge |z^*|]$.

>>> The p-value is not the probability that the null hypothesis is true.

SLIDE 9

Space of Estimators

Methods: Moments, Linear, Invariant

Characteristics: Unbiased, Sufficient, Consistent, Minimum Variance, Admissible

SLIDE 10

Frequentist Justification: Three Theorems

Theorem (Rao-Blackwell)

A statistic $T$ such that the conditional distribution of the data given $T$ does not depend on $\theta$ (the parameter) is called sufficient. Conditioning an unbiased estimator on a sufficient statistic gives an unbiased estimator with variance no larger.

Theorem (Lehmann-Scheffe)

If $T$ is a complete sufficient statistic for $\theta$ and $\hat{\theta} = h(T)$ is an unbiased estimator of $\theta$, then $\hat{\theta}$ has the smallest possible variance among unbiased estimators of $\theta$. (UMVUE of $\theta$.)

Theorem (Cramer-Rao)

Let a population have a "regular" distribution with density $f_\theta(x)$ and let $\hat{\theta}$ be an unbiased estimator of $\theta$. Then $\mathrm{Var}(\hat{\theta}) \ge \left(n E\left[\left(\frac{\partial}{\partial\theta} \ln f_\theta(X)\right)^2\right]\right)^{-1}$.
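For the normal mean, the score is $(x-\mu)/\sigma^2$, so $nE[(\frac{\partial}{\partial\mu}\ln f_\mu(X))^2] = n/\sigma^2$ and the bound is $\sigma^2/n$, which $\bar{X}$ attains. A simulation sketch with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, trials = 0.0, 2.0, 20, 200_000

xbars = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
print(xbars.var())    # empirical variance of x-bar
print(sigma**2 / n)   # Cramer-Rao bound: (n * (1/sigma^2))^-1
```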

SLIDE 11

Problems Choosing Estimators

How does the Frequentist choose an estimator?

Example (Variance)

Maximum likelihood estimator: $\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$

Unbiased estimator: $\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$

Minimum mean squared error: $\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n+1}$
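A simulation sketch comparing the three divisors under squared error loss ($\sigma^2 = 4$ and $n = 10$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, n, trials = 4.0, 10, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for divisor, label in [(n, "MLE (n)"), (n - 1, "unbiased (n-1)"),
                       (n + 1, "min MSE (n+1)")]:
    mse = ((ss / divisor - sigma2) ** 2).mean()
    print(label, round(mse, 4))
# The n+1 divisor yields the smallest mean squared error of the three.
```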

Example (Mean)

Estimate the IQ of an individual whose test score was 130. Now suppose the scoring machine reported all scores below 100 as 100. Even though the observed score of 130 is unaffected, the sampling distribution has changed, so the frequentist must change the estimate of IQ.

SLIDE 12

What is the p-value?

The p-value depends on events that didn’t happen.

Example (Flipping Coins)

Flipping a fair coin $n = 17$ times gives $K = 13$ heads: the p-value is $P[K \le 4 \text{ or } K \ge 13] = 0.049$.
If the experimenter instead stopped once there were at least 4 heads and 4 tails, the same data give a p-value of 0.021.
If the experimenter stops at $n = 17$ but continues to $n = 44$ when the data are inconclusive, then the p-value is $> 0.049$.
The p-value was already suspect, since it is NOT the probability that $H_0$ is true.
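The fixed-design p-value is a short computation; the 0.021 figure depends on the sequential stopping rule and is not recomputed here:

```python
from math import comb

# Fixed-design, two-sided p-value for 13 heads in n = 17 flips of a fair coin.
n = 17
pmf = [comb(n, k) * 0.5**n for k in range(n + 1)]
p_value = sum(pmf[k] for k in range(n + 1) if k <= 4 or k >= 13)
print(round(p_value, 3))  # 0.049
# Under the "stop at 4 heads and 4 tails" design the same flips give 0.021:
# the p-value changes even though the observed data do not.
```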

SLIDE 13

Several Normal means at once

Example (Estimating Several Normal Means)

Suppose we estimate $k$ normal means $\mu_1, \mu_2, \dots, \mu_k$ from populations with common variance. The frequentist estimate is $\bar{x} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k)$, using Euclidean distance as the loss function. Stein (1956) showed that for $k \ge 3$, $\bar{x}$ is not even admissible! A better estimator is $\hat{\mu}(\bar{x}) = \left(1 - \frac{k-2}{\|\bar{x}\|^2}\right)\bar{x}$. The better estimator can be thought of as a Bayesian estimator with the correct prior.
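A simulation sketch of the Stein effect, assuming one $N(\mu_i, 1)$ observation per mean (the true means are drawn arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)
k, trials = 10, 50_000
mu = rng.normal(0.0, 1.0, size=k)            # arbitrary true means

x = rng.normal(mu, 1.0, size=(trials, k))    # one observation per mean
shrink = 1 - (k - 2) / (x**2).sum(axis=1, keepdims=True)
js = shrink * x                              # James-Stein shrinkage

print(((x - mu) ** 2).sum(axis=1).mean())    # risk of x itself (~k)
print(((js - mu) ** 2).sum(axis=1).mean())   # smaller for k >= 3
```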

SLIDE 14

Bayes Theorem

Richard Price communicated Rev. Thomas Bayes's paper to the Royal Society (1763). Given data, Bayes found the probability that $p$ for a binomial lies in a given interval. One of the many lemmas is now known as "Bayes Theorem":

$P[A|B] = \frac{P[AB]}{P[B]} = \frac{P[B|A]\,P[A]}{P[B]} = \frac{P[B|A]\,P[A]}{P[B|A]\,P[A] + P[B|A^c]\,P[A^c]}$

In the modern treatment, conditional probability is defined after defining conditional expectation as an integral over a sub-sigma-field of sets. The conditional density is then

$f(x|y) = \frac{f(x,y)}{f_Y(y)} = \frac{f(y|x)\,f_X(x)}{f_Y(y)}$
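A numerical sketch of the discrete form of the theorem, with made-up probabilities (a rare condition and an imperfect test):

```python
# Bayes' theorem with two events: A = "has condition", B = "test positive".
p_A = 0.01                      # prior P[A] (hypothetical)
p_B_given_A = 0.95              # sensitivity P[B|A] (hypothetical)
p_B_given_Ac = 0.05             # false-positive rate P[B|A^c] (hypothetical)

p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)   # total probability
p_A_given_B = p_B_given_A * p_A / p_B                # Bayes' theorem
print(round(p_A_given_B, 3))    # ~0.161: the data update the 1% prior
```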

SLIDE 15

Bayesian Approach

- Assume that the parameter under investigation is a random variable and that the data are fixed.
- Decide on a prior distribution for the parameter. (Bayes considered $p$ to be uniformly distributed from 0 to 1.)
- Calculate the distribution of the parameter given the data (the posterior distribution).
- To minimize squared error loss, take the expected value of the posterior distribution as the estimator.

SLIDE 16

Bayesian Approach for the Binomial

Assume a binomial model with parameters $n, p$ (flipping a coin). Let $K$ be the number of heads (the data). For a prior distribution on $p$, take the Beta distribution:

$f_p(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1-p)^{\beta-1}$ for $0 < p < 1$.

$f(p|k) = \frac{f(k|p)\,f_p(p)}{f_K(k)} = C \cdot p^{\alpha+k-1}(1-p)^{\beta+n-k-1}$

Hence, $E[p|k] = \frac{\alpha+k}{\alpha+\beta+n}$.

With the uniform prior ($\alpha = 1$, $\beta = 1$), the estimator is $\frac{k+1}{n+2} = \left(\frac{n}{n+2}\right)\frac{k}{n} + \left(\frac{2}{n+2}\right)\frac{1}{2}$.
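A sketch of the Beta-binomial update, assuming scipy; it reproduces the closed form for the uniform prior with the earlier data (4 heads in 5 tosses):

```python
from scipy.stats import beta

alpha, beta_prm, n, k = 1.0, 1.0, 5, 4       # Beta(1,1) = uniform prior

post = beta(alpha + k, beta_prm + n - k)     # posterior: Beta(alpha+k, beta+n-k)
print(post.mean())                           # (alpha+k)/(alpha+beta+n)
print((k + 1) / (n + 2))                     # same value, 5/7 ~ 0.714
```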

SLIDE 17

Bayesian Approach for the Normal

If $\mu$ is the mean of a normal population, a reasonable prior distribution for $\mu$ could be another normal distribution, $N(\mu_0, \sigma_0^2)$.

$f(\mu|\bar{x}) = \frac{\sqrt{n}}{2\pi\sigma\sigma_0\, g(\bar{x})}\, e^{-\frac{n(\bar{x}-\mu)^2}{2\sigma^2} - \frac{(\mu-\mu_0)^2}{2\sigma_0^2}} = \frac{\sqrt{n}\, e^{-c}}{2\pi\sigma\sigma_0}\, e^{-\frac{(\mu-\mu_1)^2}{2\sigma_1^2}}$

Hence, $E[\mu|\bar{x}] = \bar{x}\left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right) + \mu_0\left(\frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\right)$.
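The posterior mean is a precision-weighted compromise between $\bar{x}$ and $\mu_0$. A sketch with arbitrary numbers:

```python
def normal_posterior_mean(xbar, n, sigma2, mu0, sigma0_2):
    """Posterior mean of mu given x-bar, for a N(mu0, sigma0^2) prior."""
    w = n * sigma0_2 / (n * sigma0_2 + sigma2)   # weight on the data
    return w * xbar + (1 - w) * mu0

# A vague prior (large sigma0^2) recovers x-bar; a tight prior pulls
# the estimate toward mu0.
print(normal_posterior_mean(12.0, 25, sigma2=9.0, mu0=10.0, sigma0_2=100.0))
print(normal_posterior_mean(12.0, 25, sigma2=9.0, mu0=10.0, sigma0_2=0.01))
```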

SLIDE 18

Bayesian Approach under other Loss functions

Note: If we use absolute error loss, then the Bayes estimator is the median of the posterior distribution.

If we use Linex loss, then the Bayes estimator is $-\frac{1}{c}\ln E[e^{-c\theta}]$ (expectation under the posterior).

SLIDE 19

Calculating Mean Squared Error

If the loss function is squared error, then the mean squared error (MSE) measures expected loss.

$MSE = E(\hat{\theta} - \theta)^2 = (E(\hat{\theta}) - \theta)^2 + E(\hat{\theta} - E(\hat{\theta}))^2$ (squared bias + variance)

For $\hat{\theta} = \frac{k}{n}$: $MSE = 0^2 + \frac{p(1-p)}{n}$

For $\hat{\theta} = \frac{k+1}{n+2}$: $MSE = \left(\frac{1-2p}{n+2}\right)^2 + \frac{n\,p(1-p)}{(n+2)^2}$
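Evaluating both expressions on a grid ($n = 20$ is an arbitrary choice) shows the trade-off pictured on the next slide:

```python
import numpy as np

n = 20
p = np.linspace(0.0, 1.0, 5)

mse_freq = p * (1 - p) / n                          # for k/n
mse_bayes = ((1 - 2 * p) / (n + 2)) ** 2 \
            + n * p * (1 - p) / (n + 2) ** 2        # for (k+1)/(n+2)
print(np.round(mse_freq, 5))
print(np.round(mse_bayes, 5))
# The Bayes estimator wins near p = 1/2, the frequentist near the endpoints.
```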

SLIDE 20

Comparing the MSE

Figure: Comparing MSE for Freq and Bayes estimators

SLIDE 21

Bayesian Confidence (Credible) Interval

$\Theta$ and $X$ are the random variables representing the parameter of interest and the data. The functions $u(x)$ and $v(x)$ are arbitrary and $f$ is the posterior distribution. Then

$P[u(x) < \Theta < v(x) \mid X = x] = \int_{u(x)}^{v(x)} f(\theta \mid x)\, d\theta$

Definition (Credible Interval)

If $u(x)$ and $v(x)$ are picked so the probability is 0.95, the interval $[u(x), v(x)]$ is a 95% credible interval for $\Theta$.

For the normal distribution, $\frac{\bar{x}\,n\sigma_0^2 + \mu_0\sigma^2}{n\sigma_0^2 + \sigma^2} \pm 1.96\sqrt{\frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}}$ is a 95% credible interval for $\mu$.
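A sketch of the normal credible interval under the same conjugate prior as before (all numbers are arbitrary):

```python
import numpy as np

def credible_interval(xbar, n, sigma2, mu0, sigma0_2, z=1.96):
    """95% credible interval for mu under a N(mu0, sigma0^2) prior."""
    denom = n * sigma0_2 + sigma2
    center = (xbar * n * sigma0_2 + mu0 * sigma2) / denom
    half = z * np.sqrt(sigma0_2 * sigma2 / denom)
    return center - half, center + half

print(credible_interval(12.0, 25, sigma2=9.0, mu0=10.0, sigma0_2=4.0))
```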

SLIDE 22

Comparing Estimates

Frequentist and Bayesian estimates for $p$ in the binomial:

                 (k, n)    Estimate   95% Conf. Int.    Width
Frequentist      (2, 3)    0.667      (0.125, 0.982)    0.857
                 (30, 45)  0.667      (0.509, 0.796)    0.287
                 (60, 90)  0.667      (0.559, 0.760)    0.201
Bayes Beta(1,1)  (2, 3)    0.600      (0.235, 0.964)    0.729
                 (30, 45)  0.660      (0.527, 0.793)    0.266
                 (60, 90)  0.663      (0.567, 0.759)    0.191
Bayes Beta(4,4)  (2, 3)    0.545      (0.270, 0.820)    0.550
                 (30, 45)  0.642      (0.514, 0.729)    0.254
                 (60, 90)  0.653      (0.560, 0.747)    0.187
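A sketch that recomputes the table, assuming scipy. The slides do not state which interval constructions were used, so the code below uses a Wald interval and equal-tailed posterior intervals; the posterior means match the table exactly, while interval endpoints may differ slightly:

```python
from scipy.stats import beta

def freq_interval(k, n, z=1.96):
    # Wald interval; the slide's exact construction isn't stated.
    p = k / n
    half = z * (p * (1 - p) / n) ** 0.5
    return round(p, 3), (round(p - half, 3), round(p + half, 3))

def bayes_interval(k, n, a, b):
    # Equal-tailed interval from the Beta(a+k, b+n-k) posterior.
    post = beta(a + k, b + n - k)
    lo, hi = post.interval(0.95)
    return round(post.mean(), 3), (round(lo, 3), round(hi, 3))

for k, n in [(2, 3), (30, 45), (60, 90)]:
    print((k, n), freq_interval(k, n),
          bayes_interval(k, n, 1, 1), bayes_interval(k, n, 4, 4))
```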

SLIDE 23

Bayesian Hypothesis testing

For the coin example, $H_0: p = \frac{1}{2}$ and $H_1: p \neq \frac{1}{2}$. Suppose $K = 13$ heads in $n = 17$ flips.

Objective stance: $P[H_0 \text{ is true}] = \frac{1}{2}$.

Subjective input: the largest plausible value of $p$, say $p_0$, with $p$ taken uniform on $(1-p_0, p_0)$ under $H_1$. Then

$P[H_0 \mid K = k] = \frac{P[K=k \mid H_0]\,P[H_0]}{P[K=k \mid H_0]\,P[H_0] + P[K=k \mid H_1]\,P[H_1]} = \frac{2^{-17} \cdot 0.5}{2^{-17} \cdot 0.5 + \frac{0.5}{2p_0-1}\int_{1-p_0}^{p_0} p^{13}(1-p)^{4}\,dp}$

(The binomial coefficient cancels from numerator and denominator.) The probability depends on $p_0$, but it ranges only from about 0.21 to 0.5, giving at most mild evidence against $H_0$.

Selecting a uniform distribution for $p$ between $1-p_0$ and $p_0$ is not very restrictive: symmetric distributions give the same range.

More generally, Bayesians calculate the odds ratio between the hypotheses.
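A sketch of the posterior probability of $H_0$ as a function of $p_0$, assuming scipy (the grid of $p_0$ values is arbitrary):

```python
from scipy.integrate import quad

def post_prob_H0(p0, k=13, n=17):
    """P[H0 | K=k] with P[H0]=1/2 and p ~ Uniform(1-p0, p0) under H1."""
    m0 = 0.5**n                                   # binomial coefficient cancels
    m1, _ = quad(lambda p: p**k * (1 - p) ** (n - k), 1 - p0, p0)
    m1 /= (2 * p0 - 1)                            # uniform density on (1-p0, p0)
    return m0 / (m0 + m1)

for p0 in (0.6, 0.7, 0.8, 0.9, 0.999):
    print(p0, round(post_prob_H0(p0), 3))
# Values stay roughly between 0.21 and 0.5: only mild evidence against H0,
# even though the p-value of 0.049 looked "significant".
```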

SLIDE 24

Noninformative Priors

If objectivity is an issue, pick priors which affect the posterior distribution the least.

- Improper priors: select a uniform distribution on the whole real line. Formally, this gives a proper posterior for the normal and identifies $\bar{X}$ as the estimator.
- Reference priors (maximize "distance" from the posterior), Jeffreys priors (scale invariant), maximum entropy priors.
- Conjugate priors: convenient, easily interpreted.

SLIDE 25

Subjective Probability Axioms

Definition (Weak Order)

If $A$ and $B$ are events, then $A \preceq B$ means $B$ is at least as likely as $A$. (Informally, we are willing to place a bet that the odds of $B$ are at least as large as the odds for $A$.) The relation $\preceq$ is referred to as a weak order.

Axioms:

- For any pair of events $A$ and $B$, either $A \preceq B$ or $B \preceq A$.
- For disjoint sets, $A_1 \preceq B_1$ and $A_2 \preceq B_2$ implies $A_1 \cup A_2 \preceq B_1 \cup B_2$.
- For any event $A$, $\emptyset \preceq A$.
- If $\{A_i\}$ is a decreasing sequence of events and $B \preceq A_i$ for each $i$, then $B \preceq \bigcap_{i=1}^{\infty} A_i$.
- For the uniform distribution on $[0,1]$, any event is comparable to any subinterval $I$ (either $A \preceq I$ or $I \preceq A$).

Conclusion: the relation $\preceq$ is in agreement with a probability measure.

SLIDE 26

Theoretical Bayesian Results

Theorem (Bayes Estimators)

Under squared error loss, if $\hat{\theta}$ is an unbiased estimator of $\theta$, then it is not a Bayes estimator under any proper prior.

Theorem (Complete Class Theorem)

Under some weak assumptions, every “good” estimator is a “Bayes” estimator with respect to some prior distribution.

SLIDE 27

When is the Bayesian estimator better?

Definition (Bayes Risk)

There are two sources of randomness: the estimator and the parameter. The expected loss is $E_{F_\theta}[L(\hat{\theta}, \theta)]$. Let $G_0$ be the true distribution of $\theta$. Then the Bayes risk is

$r(G_0, \hat{\theta}) = E_{G_0} E_{F_\theta}[L(\hat{\theta}, \theta)]$

To compare a Bayesian estimator with prior $G$ (call it $\hat{\theta}_G$) to a frequentist estimator $\hat{\theta}$, compare the Bayes risks. When is $r(G_0, \hat{\theta}_G) \le r(G_0, \hat{\theta})$?

SLIDE 28

Threshold Theorem

Theorem (Samaniego, 2010)

If the Bayes estimator $\hat{\theta}_G$ under squared error loss has the form $\hat{\theta}_G = (1 - \eta)E_G\theta + \eta\hat{\theta}$, where $\hat{\theta}$ is a sufficient, unbiased estimator of $\theta$ and $\eta \in [0, 1)$, then $r(G_0, \hat{\theta}_G) \le r(G_0, \hat{\theta})$ if and only if

$\mathrm{Var}_{G_0}(\theta) + (E_G\theta - E_{G_0}\theta)^2 \le \frac{1+\eta}{1-\eta}\, r(G_0, \hat{\theta})$

SLIDE 29

Proof of Threshold Theorem

Proof:

$E_{F_\theta}(\hat{\theta}_G - \theta)^2 = E_{F_\theta}\left[\eta(\hat{\theta} - \theta) + (1 - \eta)(E_G\theta - \theta)\right]^2 \quad (3)$
$= \eta^2 E_{F_\theta}(\hat{\theta} - \theta)^2 + (1 - \eta)^2 (E_G\theta - \theta)^2 \quad (4)$

(The cross term vanishes because $\hat{\theta}$ is unbiased.) Taking the expectation of both sides with respect to $G_0$ gives:

$r(G_0, \hat{\theta}_G) = \eta^2 r(G_0, \hat{\theta}) + (1 - \eta)^2 E_{G_0}(\theta - E_G\theta)^2 \quad (5)$
$= \eta^2 r(G_0, \hat{\theta}) + (1 - \eta)^2 \left(\mathrm{Var}_{G_0}(\theta) + (E_G\theta - E_{G_0}\theta)^2\right) \quad (6)$

So $r(G_0, \hat{\theta}_G) \le r(G_0, \hat{\theta})$ if and only if $\mathrm{Var}_{G_0}(\theta) + (E_G\theta - E_{G_0}\theta)^2 \le \frac{1+\eta}{1-\eta}\, r(G_0, \hat{\theta})$. Q.E.D.

SLIDE 30

Example of Threshold Theorem

Example (Binomial Model)

Problem: estimate the proportion of "long" words starting the pages of Of Human Bondage.

The Bayes approach is to use $\frac{k+\alpha}{n+\alpha+\beta}$.

One could just specify the prior mean $\frac{\alpha}{\alpha+\beta} = p^*$ and an $\eta \in [0, 1)$ that expresses confidence in $\frac{k}{n}$.

The true prior was a degenerate distribution with mean 0.3008, and $r(G_0, \hat{p}) = 0.02103$.

By the Threshold Theorem, the Bayes estimator is superior if $(p^* - 0.3008)^2 < \frac{1+\eta}{1-\eta}(0.02103)$.

Superior pairs $(p^*, \eta)$ account for 55% of the square's area. Empirical result: 88 out of 99 Bayesian estimators were superior.
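A grid check of the 55% figure, assuming the "square" is the unit square of $(p^*, \eta)$ pairs:

```python
import numpy as np

# Grid over the unit square of (p*, eta) pairs from the slide's setup.
p_star, eta = np.meshgrid(np.linspace(0, 1, 1001), np.linspace(0, 0.999, 1001))

r = 0.02103                     # Bayes risk of the frequentist estimator p-hat
superior = (p_star - 0.3008) ** 2 < (1 + eta) / (1 - eta) * r
print(superior.mean())          # fraction of the square where Bayes wins, ~0.55
```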

SLIDE 31

Lingering Questions

What does "objective" mean? Should subjective probability add to the inference process? What is a good estimator?

A frequentist uses careful arguments and events that did not happen to answer the wrong question, while a Bayesian answers the right question by making assumptions that nobody can fully embrace.
