Mathematical Tools for Neural and Cognitive Science
Fall semester, 2016

Section 4: Statistics and Inference

Probability: an abstract mathematical framework for describing random quantities (e.g. measurements)
Statistics: the use of probability to summarize and draw inferences from data (e.g. measurements)
Statistics as a form of summary
[Figure: a sequence of binary measurements 0, 1, 0, 0, 0, 1, 0, 1, ... summarized by a distribution P(x)]
The purpose of statistics is to replace a quantity of data by relatively few quantities which shall ... contain as much as possible, ideally the whole, of the relevant information contained in the original data.
- R.A. Fisher, 1934
Statistics for Data Summary...
- Sample average (minimizes mean squared error)
- Sample median (minimizes mean absolute deviation); a numerical check of these two claims follows below
- Least-squares regression: summarizes relationships between controlled and measured quantities
- TLS regression: summarizes relationships between measured quantities

[Efron & Tibshirani, An Introduction to the Bootstrap]
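A quick numerical check of the first two summary statistics above (a minimal sketch; the data vector and the grid of candidate values are made up for illustration):

```python
import numpy as np

data = np.array([1.2, 0.4, 3.1, 0.9, 2.2, 7.5, 1.1])   # hypothetical measurements

# Evaluate both error measures over a fine grid of candidate summary values
c = np.linspace(data.min(), data.max(), 10001)
mse = ((data[:, None] - c) ** 2).mean(axis=0)   # mean squared error of each candidate
mad = np.abs(data[:, None] - c).mean(axis=0)    # mean absolute deviation of each candidate

print(c[np.argmin(mse)], data.mean())       # minimizer of MSE ~ the sample average
print(c[np.argmin(mad)], np.median(data))   # minimizer of MAD ~ the sample median
```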
Scientific process
[Diagram: the scientific cycle: create/modify hypothesis/model → generate predictions, design experiment → observe/measure → summarize, and compare with expectations → back to hypothesis/model]
Probability basics
- discrete probability distributions
- continuous probability densities
- cumulative distributions
- translation and scaling of distributions (adding or multiplying by a constant)
- monotonic nonlinear transformations
- drawing samples from a distribution via inverse cumulative mapping (see the sketch after this list)
- example densities/distributions [on board]
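A minimal sketch of sampling via the inverse cumulative mapping mentioned in the list above, assuming an exponential density (chosen because its CDF inverts in closed form; the rate parameter is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: exponential density p(x) = lam * exp(-lam * x), with CDF F(x) = 1 - exp(-lam * x)
lam = 2.0
u = rng.uniform(size=100_000)          # samples uniform on [0, 1)
x = -np.log(1.0 - u) / lam             # apply the inverse CDF: x = F^{-1}(u)

print(x.mean(), 1.0 / lam)                         # sample mean vs. true mean 1/lam
print(np.mean(x <= 0.5), 1 - np.exp(-lam * 0.5))   # empirical vs. true CDF at x = 0.5
```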
Example distributions

[Figures: roll of a fair die, a not-quite-fair coin, sum of two rolled fair dice, clicks of a Geiger counter in a fixed time interval (and the time between clicks), horizontal velocity of gas molecules exiting a fan]
- Joint distributions
- Marginals (integrating)
- Conditionals (slicing)
- Bayes’ Rule (inverting)
- Statistical independence
Multi-dimensional random variables
p(x, y)
Joint distribution
p(x) = ∫ p(x, y) dy
Marginal distribution
Generalized marginal distribution (using vector notation): for a unit vector û,

p(z) = ∫_{x·û = z} p(x) dx
Conditional distribution

p(x | y=68) = p(x, y=68) / ∫ p(x, y=68) dx = p(x, y=68) / p(y=68)

slice the joint distribution, then normalize (by the marginal)
More generally: p(x|y) = p(x, y) / p(y)
Bayes’ Rule
p(x|y) = p(y|x) p(x)/p(y)
(a direct consequence of the definition of conditional probability)
Conditional vs. marginal

[Figure: a conditional P(x | Y=120) compared with the marginal P(x)]

In general, these differ. When are they the same? In particular, when are all conditionals equal to the marginal?
p(y|x) = p(y, x)/p(x) = p(y), ∀x
Statistical independence
Variables x and y are statistically independent if (and only if): p(x, y) = p(x) p(y). Independence implies that all conditionals are equal to the corresponding marginal (the equation above).
Uncorrelated doesn’t mean independent...

Statistical independence is a stronger assumption than uncorrelatedness:
⇒ All independent variables are uncorrelated
⇒ Not all uncorrelated variables are independent (see the numerical sketch below)

[Figure: example scatterplots, each labeled with its correlation r]
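A small numerical illustration of the last point: a sketch in which x is uniform on [-1, 1] and y = x², so the pair is uncorrelated yet completely dependent (the choice of distribution is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200_000)
y = x ** 2                      # y is a deterministic function of x: maximally dependent

r = np.corrcoef(x, y)[0, 1]
print(r)                        # correlation ~ 0, since E[x y] = E[x^3] = 0 by symmetry
# ...yet p(y|x) is a point mass, nothing like the marginal p(y): not independent.
```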
Expected value
E(x) = ∫ x p(x) dx                                     [the mean, µ]
E(x²) = ∫ x² p(x) dx                                   [the “second moment”]
E[(x - µ)²] = ∫ (x - µ)² p(x) dx = ∫ x² p(x) dx - µ²   [the variance, σ²]
E(f(x)) = ∫ f(x) p(x) dx                               [note: an inner product, and thus linear!]
- One-D: mean and variance summarize centroid/width
- translation and rescaling of random variables
- nonlinear transformations: “warping”
- Multi-D: vector mean and covariance matrix, elliptical geometry
- Mean/variance of a weighted sum of random variables
- The sample average
- ... converges to the true mean (except for bizarre distributions)
- ... with variance σ²/N (a numerical check follows below)
- ... the most common choice for an estimate
- Correlation
Mean and (co)variance
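A sketch checking the claim above that the sample average of N i.i.d. draws has variance σ²/N (the Gaussian source distribution, N, and the number of trials are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, N, trials = 4.0, 25, 20_000

# Each row is one experiment of N draws; the sample average is taken across columns
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
avgs = samples.mean(axis=1)

print(avgs.var(), sigma2 / N)   # empirical variance of the sample average vs. sigma^2 / N
```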
The Central Limit Theorem
Distribution of a sum of independent R.V.’s: the return of convolution
[on board]
[Figures: central limit for a uniform distribution: 10,000 samples of a uniform density (sigma = 1), and histograms of (u+u)/sqrt(2), (u+u+u+u)/sqrt(4), and 10 u's divided by sqrt(10)]
[Figures: central limit for a binary distribution: one coin, avg of 4 coins, avg of 16 coins, avg of 64 coins, avg of 256 coins]
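A numerical sketch of the binary-coin demonstration above: as the number of averaged coins grows, the standardized fourth moment of the average approaches 3, the Gaussian value (the particular sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
for n in (1, 4, 16, 64, 256):
    flips = rng.integers(0, 2, size=(100_000, n))   # 100k experiments of n fair coin flips
    avg = flips.mean(axis=1)                        # average of n coins
    z = (avg - avg.mean()) / avg.std()              # standardize each histogram
    print(n, np.mean(z ** 4))                       # 4th moment approaches 3 (Gaussian value)
```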
The Gaussian
- parameterized by mean and stdev (position / width)
- joint density of two indep Gaussian RVs is circular! [easy]
- product of two Gaussians is Gaussian! [easy]
- conditionals of a Gaussian are Gaussian! [easy]
- sum of Gaussian RVs is Gaussian! [moderate]
- marginals of a Gaussian are Gaussian! [moderate]
- central limit theorem: sum of many RVs is Gaussian! [hard]
- most random (max entropy) density with this variance! [moderate]
Gaussian densities

[Figure: a two-dimensional Gaussian density with mean [0.2, 0.8] and cov [1.0 -0.3; -0.3 0.4]]
Product of Gaussians is Gaussian

p(x|y) ∝ p(y|x) p(x) ∝ exp[-(1/2)(1/σ_n²)(x - y)²] · exp[-(1/2)(1/σ_x²)(x - µ_x)²]
                     = exp{-(1/2)[(1/σ_n² + 1/σ_x²) x² - 2 (y/σ_n² + µ_x/σ_x²) x + ...]}

Completing the square shows that this posterior is also Gaussian, with mean (y/σ_n² + µ_x/σ_x²) / (1/σ_n² + 1/σ_x²) and variance 1/(1/σ_n² + 1/σ_x²): an average of y and µ_x, weighted by the inverse variances!
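A numerical sketch of the completing-the-square result, evaluating the product of the two Gaussians on a grid (the values of y, µ_x, σ_n, and σ_x are hypothetical):

```python
import numpy as np

# Hypothetical likelihood p(y|x) = N(y; x, sigma_n^2) and prior p(x) = N(x; mu_x, sigma_x^2)
y, sigma_n = 3.0, 1.0
mu_x, sigma_x = 0.0, 2.0

x = np.linspace(-10, 10, 200_001)
dx = x[1] - x[0]
post = np.exp(-0.5 * (x - y) ** 2 / sigma_n ** 2) * np.exp(-0.5 * (x - mu_x) ** 2 / sigma_x ** 2)
post /= post.sum() * dx                          # normalize the product on the grid

# Completing the square predicts an inverse-variance-weighted mean and combined variance
w_n, w_x = 1 / sigma_n ** 2, 1 / sigma_x ** 2
mu_post = (w_n * y + w_x * mu_x) / (w_n + w_x)
var_post = 1 / (w_n + w_x)

print((x * post).sum() * dx, mu_post)                    # numerical mean vs. formula
print(((x - mu_post) ** 2 * post).sum() * dx, var_post)  # numerical variance vs. formula
```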
Multivariate case: x ~ N(µ, C); let P = C⁻¹ (known as the “precision” matrix).

Marginal: p(x₁) = ∫ p(x) dx₂ is Gaussian, with mean and covariance given by the corresponding block of µ and C.
Conditional: p(x₁ | x₂) is Gaussian, with covariance given by inverting the corresponding block of the precision matrix P.

Generalized marginals of a Gaussian: the projection onto any unit vector û, z = ûᵀ x, is Gaussian, with µ_z = ûᵀ µ and σ_z² = ûᵀ C û.
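A sketch verifying the generalized-marginal formulas by projecting samples onto a unit vector û (the mean and covariance are borrowed from the Gaussian-densities figure above; û is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.2, 0.8])
C = np.array([[1.0, -0.3],
              [-0.3, 0.4]])

x = rng.multivariate_normal(mu, C, size=200_000)   # samples of the 2-D Gaussian
u = np.array([1.0, 2.0])
u = u / np.linalg.norm(u)                           # unit vector defining the projection

z = x @ u                                           # z = u^T x for every sample
print(z.mean(), u @ mu)                             # mu_z = u^T mu
print(z.var(), u @ C @ u)                           # sigma_z^2 = u^T C u
```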
[Figure: 700 samples drawn from a two-dimensional Gaussian. True mean: [0 0.8], true cov: [1.0 -0.25; -0.25 0.3]; sample mean: [-0.05 0.83], sample cov: [0.95 -0.23; -0.23 0.29]. Measurement (sampling) goes from the true density to data; inference goes from the data back toward the true density.]
Point Estimates

- Estimator: any function of the data, intended to represent the best approximation of the true value of a parameter
- Most common estimator is the sample average
- Statistically-motivated examples:
  - Maximum likelihood (ML): arg max_x p(y | x)
  - Max a posteriori (MAP): arg max_x p(x | y)
  - Min Mean Squared Error (MMSE): E(x | y) = ∫ x p(x | y) dx
p(x|y) ∝ p(x) p(y|x)
- why must both prior and likelihood be taken into account?
- why doesn’t data dominate?
- when would it? when would prior dominate?
- what if prior and likelihood are incompatible?
[Figure: grid of posteriors p(x | H, T) for H = 0...3 heads and T = 0...3 tails, assuming the flat prior p(x) = 1; the likelihoods for 1 head and 1 tail are shown along the margins, and the grid axes run toward "more heads" and "more tails"]
Example: infer whether a coin is fair by flipping it repeatedly. Here, x is the probability of heads (50% is fair), and y_1 ... y_n are the outcomes of the flips. Consider three different priors: suspect fair, suspect biased, no idea.

Each prior (fair / biased / uncertain) × likelihood (heads) = posterior; then each previous posterior × likelihood of the next flip (heads or tails) = new posterior.

Posteriors after observing 75 heads, 25 tails → prior differences are ultimately overwhelmed by the data (a numerical sketch follows below).
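A numerical sketch of this coin example on a grid over x (the three priors below are illustrative stand-ins for "suspect fair", "suspect biased", and "no idea"):

```python
import numpy as np

x = np.linspace(0, 1, 1001)      # x = probability of heads
dx = x[1] - x[0]

def normalize(p):
    return p / (p.sum() * dx)

# Illustrative stand-ins for the three priors
priors = {
    "suspect fair":   normalize(np.exp(-0.5 * (x - 0.5) ** 2 / 0.1 ** 2)),  # peaked at 0.5
    "suspect biased": normalize(x ** 8 + (1 - x) ** 8),                     # mass near 0 and 1
    "no idea":        normalize(np.ones_like(x)),                           # flat
}

heads, tails = 75, 25
likelihood = x ** heads * (1 - x) ** tails    # p(data | x) for 75 heads, 25 tails

for name, prior in priors.items():
    posterior = normalize(prior * likelihood)
    # All three posteriors now peak in roughly the same place (between ~0.7 and ~0.77),
    # close to the observed fraction of heads: prior differences are largely overwhelmed.
    print(name, x[np.argmax(posterior)])
```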
Confidence

[Figures: posterior PDFs and CDFs for 2H/1T, 10H/5T, and 20H/10T, with quantiles (e.g. .025 and .975) marked on the CDFs]
Bias & Variance

- MSE = bias² + variance (a numerical check follows after this list)
- Bias is difficult to assess (since it requires knowing the “true” value). But variance is easier.
- Classical statistics generally aims for an unbiased estimator, with minimal variance
- The MLE is asymptotically unbiased (under fairly general conditions), but this is only useful if
  - the likelihood model is correct
  - the optimum can be computed
  - you have enough data
- More general/modern view: estimation is about trading off bias and variance, through model selection, “regularization”, or Bayesian priors.
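A sketch of the MSE = bias² + variance decomposition, comparing the biased (1/N) and unbiased (1/(N-1)) variance estimators (Gaussian data and the particular N are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
true_var, N, trials = 1.0, 10, 200_000

data = rng.normal(0.0, 1.0, size=(trials, N))
for ddof, name in [(0, "biased (1/N)"), (1, "unbiased (1/(N-1))")]:
    est = data.var(axis=1, ddof=ddof)        # one variance estimate per experiment
    bias = est.mean() - true_var
    mse = np.mean((est - true_var) ** 2)
    # The two printed numbers match: MSE decomposes into bias^2 plus estimator variance
    print(name, bias ** 2 + est.var(), mse)
```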
Optimization...
- Quadratic: closed-form, and unique
- Convex: iterative descent, unique
- Smooth (C2): iterative descent, (possibly) nonunique
- Everything else: heuristics, exhaustive search, (pain & suffering)
Bootstrapping
- “The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps” [Adventures of Baron von Munchausen, by Rudolph Erich Raspe]
- A resampling method for computing the distribution of an estimator (incl. stdev or error bars)
- Idea: instead of running the experiment multiple times, resample from the existing data (with replacement). Compute estimates from these “bootstrap” data sets.
[Efron & Tibshirani ’98] [New York Times, 27 Jan 1987]

[Figure: histogram of bootstrap estimates, with the original estimate and the 95% confidence interval marked ⇒ with 95% confidence, the estimated quantity lies within the marked interval]
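A minimal bootstrap sketch: resample the data with replacement, recompute the estimate each time, and read a standard error and 95% interval off the bootstrap distribution (the data set and the statistic, a correlation, are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical paired data (e.g. two measured quantities)
x = rng.normal(size=80)
y = 0.5 * x + rng.normal(scale=1.0, size=80)
n = len(x)

boot = np.empty(10_000)
for b in range(boot.size):
    idx = rng.integers(0, n, size=n)             # resample indices WITH replacement
    boot[b] = np.corrcoef(x[idx], y[idx])[0, 1]  # recompute the statistic on the resample

lo, hi = np.percentile(boot, [2.5, 97.5])        # percentile bootstrap 95% interval
print(np.corrcoef(x, y)[0, 1], (lo, hi), boot.std())
```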
Cross-validation
[Figure: MSE vs. polynomial degree (log scale), showing fit error, cross-validation (x-val) error, the true error, and the true degree]
(1) Randomly partition your data into a “training” set and a “test” set.
(2) Fit the model to the training set. Measure error on the test set.
(3) Repeat (many times).
A resampling method for determining predictive power of a model. Widely used to identify/avoid over-fitting.
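A sketch of steps (1)-(3) for choosing a polynomial degree (the synthetic data, its true degree of 3, and the single random split are illustrative; in practice the split is repeated many times):

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data: cubic polynomial plus noise (true degree = 3)
x = rng.uniform(-1, 1, size=60)
y = 1.0 - 2.0 * x + 0.5 * x ** 2 + 3.0 * x ** 3 + rng.normal(scale=0.3, size=x.size)

perm = rng.permutation(x.size)              # (1) random partition into train / test
train, test = perm[:40], perm[40:]

for degree in range(1, 10):
    coef = np.polyfit(x[train], y[train], degree)      # (2) fit model to training set
    fit_err = np.mean((np.polyval(coef, x[train]) - y[train]) ** 2)
    xval_err = np.mean((np.polyval(coef, x[test]) - y[test]) ** 2)
    # (3) fit error keeps shrinking with degree; x-val error bottoms out near the true degree
    print(degree, fit_err, xval_err)
```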
Ridge regression
(a.k.a. Tikhonov regularization, or linear regularization)

Ordinary least squares regression (writing β for the regression coefficients): arg min_β ||y - Xβ||²
“Regularized” least squares regression: arg min_β ||y - Xβ||² + λ||β||²

Note: the regularized objective is the negative log posterior, assuming a Gaussian likelihood & prior. Choose λ by cross-validation.
[Figure: 7th-order polynomial regression: data, the least-squares fit, and the ridge fit]
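A sketch of ridge regression in closed form, alongside ordinary least squares (the polynomial basis, noise level, and λ are illustrative; as noted above, λ would normally be chosen by cross-validation):

```python
import numpy as np

rng = np.random.default_rng(8)

# Noisy samples of a smooth function, fit with a 7th-order polynomial basis
x = rng.uniform(0, 1, size=15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
X = np.vander(x, 8)                                   # columns: x^7, x^6, ..., x, 1

lam = 1e-3
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]        # ordinary least squares
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)  # ridge (closed form)

print(np.linalg.norm(beta_ls), np.linalg.norm(beta_ridge))   # ridge shrinks the coefficients
```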
L1 regularization
(a.k.a. LASSO: least absolute shrinkage and selection operator)

arg min_β ||y - Xβ||² + λ Σ_k |β_k|

Using an absolute-value (L1 norm) regularization term promotes selection of regressors: coefficients of weak regressors are driven exactly to zero.
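A small sketch of why the L1 penalty selects regressors: for a single coefficient with a unit-norm regressor, the L1 solution is a soft-thresholded version of the least-squares estimate and is exactly zero for weak regressors, whereas the L2 (ridge) solution only shrinks toward zero (this scalar special case is for intuition, not a general LASSO solver):

```python
import numpy as np

def l1_scalar(xty, lam):
    # argmin_b (b^2 - 2*b*xty + lam*|b|): soft-thresholding, exactly zero when |xty| <= lam/2
    return np.sign(xty) * np.maximum(np.abs(xty) - lam / 2, 0.0)

def l2_scalar(xty, lam):
    # argmin_b (b^2 - 2*b*xty + lam*b^2): shrinks toward zero but never reaches it
    return xty / (1 + lam)

for xty in (2.0, 0.4, 0.1):                 # xty = x^T y for a unit-norm regressor x
    print(xty, l1_scalar(xty, lam=1.0), l2_scalar(xty, lam=1.0))
```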