The Gaussian parameterized by mean and SD (position / width) - PDF document

1 Mathematical Tools for Neural and Cognitive Science Fall semester, 2018 Probability & Statistics: Estimation, inference, model-fitting 2 Estimation of model parameters (outline) • How do I compute an estimate?   (mathematics vs. numerical optimization) • How “good” are my estimates?   (classical stats vs. simulation vs. resampling) • How well does my model explain the data?   Future data (prediction/generalization)?   (classical stats vs. resampling) • How do I compare two (or more) models?   (classical stats vs. resampling) 3 The sample average Mea N x ) = 1 X a ( ~ x n N n =1 Inf • Most common common form of estimator • Value of a converges to true mean E(x), for all reasonable distributions • Variance of a converges to zero, as • Distribution p(a) converges to a Gaussian   (the “Central Limit Theorem”)

4 The Gaussian • parameterized by mean and SD (position / width) • product of two Gaussians is Gaussian! [easy] • sum of Gaussian RVs is Gaussian! [moderate] • central limit theorem: sum of many RVs is Gaussian! [hard] 5 Central limit for a uniform distribution... 10k samples, uniform density (sigma=1) 10 4 samples of uniform dist (u+u)/sqrt(2) 450 250 400 200 350 300 150 250 200 100 150 100 50 50 0 0 − 4 − 3 − 2 − 1 0 1 2 3 4 − 4 − 3 − 2 − 1 0 1 2 3 4 (u+u+u+u)/sqrt(4) 10 u’s divided by sqrt(10) 500 600 450 500 400 350 400 300 250 300 200 200 150 100 100 50 0 0 − 4 − 3 − 2 − 1 0 1 2 3 4 − 4 − 3 − 2 − 1 0 1 2 3 4 6 Central limit for a binary distribution... one coin avg of 16 coins 6000 2000 5000 1500 4000 3000 1000 2000 500 1000 0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 avg of 4 coins avg of 256 coins avg of 64 coins 4000 2500 2000 2000 3000 1500 1500 2000 1000 1000 1000 500 500 0 0 0 0 0.2 0.4 0.6 0.8 1 0 0 0.2 0.2 0.4 0.4 0.6 0.6 0.8 0.8 1 1

7 true density 700 samples Measurement (sampling) Inference true mean: [0 0.8] sample mean: [-0.05 0.83] true cov: [1.0 -0.25 sample cov: [0.95 -0.23 -0.25 0.3] -0.23 0.29] 8 Point Estimates • Estimator: Any function of the data, intended to provide an estimate of the true value of a parameter • Statistically-motivated estimators: - Maximum likelihood (ML): - Max a posteriori (MAP): - Bayes estimator: ⇣ ⌘ x ( ~ x ) | ~ ˆ d ) = arg min L ( x − ˆ x E d ˆ - Bayes least squares:   (special case) 9 Estimator quality: Bias & Variance • Mean squared error = bias^2 + variance • Bias is difficult to assess (requires knowing the “true” value). Variance is easier. • Classical statistics generally aims for an unbiased estimator, with minimal variance (“MVUE”). • The MLE is asymptotically unbiased (under fairly general conditions), but this is only useful if - the likelihood model is correct - the optimum can be computed - you have lots of data • More general view: estimation is about trading off bias and variance, through model selection, “regularization”, or Bayesian priors…

  10 ML Estimates - discrete ⎛ ⎞ ( ) = ( ) m • Binomial:   m − n 1 − p head p n head | m , p head ⎟ p head n ⎜ ⎝ n ⎠ p head = n ˆ m ) = λ k e − k ( • Poisson: p k | λ k ! ˆ λ = k 11 ML Estimates - continuous x 1 , x 2 , ! x N The N independent samples are N ∑ ML estimates are x i µ = ˆ i = 1 N N ( ) ∑ 2 x i − x σ 2 = biased! ˆ i = 1 N 12 Example: Estimate the bias of a coin

13 14 Bayes’ Rule and Estimation Posterior Likelihood Prior p (parameter value |data) = p (data | parameter value) p (parameter value) p (data) Nuisance normalizing term 15 Likelihood: 1 head Likelihood: 1 tail

16 Posteriors, p(H,T|x), assuming prior p(x)=1 More tails T=0 1 2 3 More heads H=0 1 2 3 17 example infer whether a coin is fair by flipping it repeatedly here, x is the probability of heads (50% is fair) y 1. ..n are the outcomes of flips Consider three different priors: suspect fair suspect biased no idea prior fair prior biased prior uncertain 18 X likelihood (heads) = posterior

previous posteriors 19 X likelihood (heads) = new posterior previous posteriors 20 X likelihood (tails) = new posterior 21 Posteriors after observing 75 heads, 25 tails à prior differences are ultimately overwhelmed by data

22 Confidence intervals PDFs 2H / 1T 10H / 5T 20H / 10T CDFs, and 95% confidence intervals .975 .025 .19 .93 .49 .80 Classical “frequentist” statistical tests 23 Statistical Rethinking, Richard McElreath 24 Classical/frequentist approach - z • H 1 : NZT improves IQ • Null: H 0 : it does nothing • In the general population, IQ is known to be distributed normally with • µ = 100 • σ = 15 • We give the drug to 30 people and test their IQ.

25 The z -test • µ = 100 (Population mean) • σ = 15 (Population standard deviation) • N = 30 (Sample contains scores from 30 participants) • x = 108.3 (Sample mean) • z = ( x – µ )/SE = (108.3-100)/SE (Standardized score) • SE = σ / √ N = 15/ √ 30 = 2.74 • Error bar/CI: ±2 SE • z = 8.3/2.74 = 3.03 • p = 0.0012 • Significant? • One- vs. two-tailed test 26 What if the measured effect of NZT had been half that? • µ = 100 (Population mean) • σ = 15 (Population standard deviation) • N = 30 (Sample contains scores from 30 participants) • x = 104.2 (Sample mean) • z = ( x – µ )/SE = (104.2-100)/SE • SE = σ / √ N = 15/ √ 30 = 2.74 • z = 4.2/2.74 = 1.53 • p = 0.061 • Significant? 27 Significance levels • Are denoted by the Greek letter α . • In principle, we can pick anything that we consider unlikely. • In practice, the consensus is that a level of 0.05 or 1 in 20 is considered as unlikely enough to reject H 0 and accept the alternative. • A level of 0.01 or 1 in 100 is considered “highly significant” or really unlikely.

28 Does NZT improve IQ scores or not? Reality Yes No Type I error Correct α -error Yes Significant? False alarm Type II error No β -error Correct Miss 29 Test statistic • We calculate how far the observed value of the sample average is away from its expected value. • In units of standard error. • In this case, the test statistic is z = x − µ = x − µ SE σ / N • Compare to a distribution, in this case z or N (0,1) 30 Common misconceptions Is “Statistically significant” a synonym for: • Substantial • Important • Big • Real Does statistical significance gives the • probability that the null hypothesis is true • probability that the null hypothesis is false • probability that the alternative hypothesis is true • probability that the alternative hypothesis is false Meaning of p -value. Meaning of CI.

        31 Student’s t -test • σ not assumed known • Use   N ( ) ∑ 2 x i − x s 2 = i = 1 N − 1 E ( s 2 ) = σ 2 • Why N -1? s is unbiased (unlike ML version), i.e.,   t = x − µ 0 • Test statistic is   s / N • Compare to t distribution for CIs and NHST • “Degrees of freedom” reduced by 1 to N -1 32 The t distribution approaches the normal distribution for large N Probability x (z or t) 33 The z -test for binomial data • Is the coin fair? • Lean on central limit theorem • Sample is n heads out of m tosses p = n / m ˆ • Sample mean: • H 0 : p = 0.5 • Binomial variability (one toss): σ = pq , where q = 1 − p p − p 0 ˆ • Test statistic:   z = p 0 q 0 / m • Compare to z (standard normal) • For CI, use ± z α /2 p ˆ ˆ q / m

34 Many varieties of frequentist univariate tests • goodness of fit χ 2 • test of independence χ 2 • test a variance using χ 2 • F to compare variances (as a ratio) • Nonparametric tests (e.g., sign, rank-order, etc.) 35 Bootstrapping • “The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps”   [Adventures of Baron von Munchausen, by Rudolph Erich Raspe] • A ( re)sampling method for computing estimator distribution (incl. stdev error bars or confidence intervals) • Idea: instead of running experiment multiple times, resample (with replacement) from the existing data. Compute an estimate from each of these “bootstrapped” data sets. 36 [New York Times, 27 Jan 1987] Histogram of bootstrap estimates: 1400 Boostrapped Original 1200 95% conf 1000 800 600 400 200 0 0.2 0.4 0.6 0.8 1 => with 95% confidence, [Efron & Tibshirani ’98]

⃗ ⃗ 37 [Efron & Tibshirani ’98] 38 probabilistic data model Measurement p θ ( x ) { x n } Inference 39 Point Estimates • Estimator: Any function of the data, intended to provide an estimate of the true value of a parameter • The most common estimator is the sample average, used to estimate the true mean of a distribution. • Statistically-motivated estimators: - Maximum likelihood (ML): - Max a posteriori (MAP): - Bayes estimator: ⇣ ⌘ x ( ~ x ) | ~ ˆ d ) = arg min x E L ( x − ˆ d ˆ

40 41 Signal Detection Theory P(x|N) P(x|S) x “S” “N” For equal, unimodal, symmetric distributions, ML decision rule is a threshold function. 42 Signal Detection Theory: Potential outcomes P(x|N) P(x|S) Doctor responds Doctor responds “no” “yes” x Tumor miss hit present P(x|N) P(x|S) Tumor correct false absent reject alarm x threshold

The Gaussian parameterized by mean and SD (position / width) - PDF document

1 Mathematical Tools for Neural and Cognitive Science Fall semester, 2018 Probability & Statistics: Estimation, inference, model-fitting 2 Estimation of model parameters (outline) How do I compute an estimate? (mathematics vs.

Gaussian Filter The Gaussian filter 1 2 1 A Gaussian kernel gives less 1 2 4 2 weight to

Lecture 3 Capacity of Multiuser Gaussian Channels The Gaussian uplink: 6.1 The fading

Faster Gaussian Lattice Sampling using Information Leakage Gaussian Sampling Our Work Lazy

Non-Gaussian likelihoods for Gaussian Processes Alan Saul Outline Motivation Non-Gaussian

CS70: Jean Walrand: Lecture 36. Gaussian and CLT CS70: Jean Walrand: Lecture 36. Gaussian and

Gaussian Process Lei Tang Arizona State University Jul. 31th, 2007 Lei Tang (ASU) Gaussian

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee

MAP for Gaussian mean and variance Conjugate priors Mean: Gaussian prior Variance:

Gaussian Processes Dan Cervone NYU CDS November 10, 2015 Dan Cervone (NYU CDS) Gaussian

CMPUT 466 Introduction to Gaussian Processes Dan Lizotte The Plan Introduction to Gaussian

State Space Gaussian Processes with Non-Gaussian Likelihoods Hannes Nickisch 1 Arno Solin 2

Gaussian Processes Seung-Hoon Na Chonbuk National University Gaussian Process Regression

The Gaussian Distribution Chris Williams School of Informatics, University of Edinburgh October

Chapter 9 Gaussian Channel Peng-Hua Wang Graduate Inst. of Comm. Engineering National Taipei

Gaussian tutorial -Infrared spectra calculation In this tutorial Gaussian 03 program was used

Determining the PSF over the Full FoV of LSST using Anisotropic Gaussian Processes

Review of basic frequentist concepts Shravan Vasishth March 10, 2020 1 Foundations 1.1 Random

Testing Specification testing Michel Bierlaire Introduction to choice models Differences from

Null Hypothesis Significance Testing Signifcance Level, Power, t -Tests 18.05 Spring 2014 Jeremy

Gov 2000: 6. Hypothesis Testing Matthew Blackwell October 11, 2016 1 / 55 1. Hypothesis

The Power and Limits of Statistics DPRRGSP 2018-11-29 @ReinhardFurrer Applied Statistics

Significance Testing Evaluation, session 6 CS6200: Information Retrieval Statistical

Error Exponents for Composite Hypothesis Testing of Markov Forest Distributions Vincent Tan,

Lecture 8: Information Theory and Statistics I-Hsiang Wang Department of Electrical Engineering