Section 4: Statistics and Inference
  1. Mathematical Tools for Neural and Cognitive Science, Fall semester 2016. Section 4: Statistics and Inference.
     Probability: an abstract mathematical framework for describing random quantities (e.g., measurements).
     Statistics: the use of probability to summarize, analyze, and interpret data. Fundamental to all experimental science.

  2. Statistics as a form of summary. Given data such as 0, 1, 0, 0, 0, 1, 0, 1, ...:
     "The purpose of statistics is to replace a quantity of data by relatively few quantities which shall ... contain as much as possible, ideally the whole, of the relevant information contained in the original data." - R.A. Fisher, 1934
     Statistics for data summary:
     • Sample average (minimizes mean squared error)
     • Sample median (minimizes mean absolute deviation)
     • Least-squares regression: summarizes relationships between controlled and measured quantities
     • TLS regression: summarizes relationships between measured quantities

  3. - Efron & Tibshirani, Introduction to the Bootstrap Scientific process Observe / Measure Generate predictions, Summarize, and Design experiment compare with expectations Create/modify Hypothesis/model

  4. Probability basics:
     • discrete probability distributions
     • continuous probability densities
     • cumulative distributions
     • translation and scaling of distributions (adding or multiplying by a constant)
     • monotonic nonlinear transformations
     • drawing samples from a distribution via inverse cumulative mapping (sketched below)
     • example densities/distributions [on board]
     [Figure: example distributions. Panels show a not-quite-fair coin, the roll of a fair die, the sum of two rolled fair dice, clicks of a Geiger counter in a fixed time interval, the time between clicks, and the horizontal velocity of gas molecules exiting a fan.]
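A minimal sketch of inverse cumulative mapping, using numpy. The exponential density is chosen only because its CDF inverts in closed form; the rate `lam` is an arbitrary example value.

```python
import numpy as np

# If u ~ Uniform(0, 1) and F is the CDF of the target distribution, then
# F^{-1}(u) is a sample from that distribution. Illustrated with an
# exponential density p(x) = lam * exp(-lam * x), whose CDF
# F(x) = 1 - exp(-lam * x) inverts in closed form.
rng = np.random.default_rng(0)
lam = 2.0                               # example rate parameter
u = rng.uniform(size=100_000)           # uniform samples in [0, 1)
x = -np.log(1.0 - u) / lam              # F^{-1}(u): exponential samples

print(np.mean(x), 1.0 / lam)            # sample mean vs. true mean 1/lam
```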

  5. Multi-dimensional random variables:
     • Joint distributions
     • Marginals (integrating)
     • Conditionals (slicing)
     • Bayes' Rule (inverting)
     • Statistical independence
     [Figure: a joint distribution p(x, y).]

  6. Marginal distribution of a joint density p(x, y):
     p(x) = \int p(x, y) \, dy
     Generalized marginal distribution, in vector notation: for a unit vector \hat{u},
     p(z) = \int_{\vec{x} \cdot \hat{u} = z} p(\vec{x}) \, d\vec{x}
     (integrate the joint density over the hyperplane \vec{x} \cdot \hat{u} = z; a discrete analogue is sketched below)
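A discrete sketch of marginalization: summing a joint probability table over one variable. The table values are made up for illustration.

```python
import numpy as np

# Discrete analogue of p(x) = \int p(x, y) dy: sum the joint table over y.
pxy = np.array([[0.10, 0.20, 0.05],
                [0.15, 0.30, 0.20]])    # rows index x, columns index y
px = pxy.sum(axis=1)                    # marginal over y
py = pxy.sum(axis=0)                    # marginal over x
print(px, py, px.sum())                 # each marginal sums to 1
```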

  7. Conditional distribution p(x | y = 68):
     p(x | y = 68) = p(x, y = 68) \Big/ \int p(x, y = 68) \, dx = p(x, y = 68) / p(y = 68)
     More generally: p(x | y) = p(x, y) / p(y).
     Recipe: slice the joint distribution, then normalize (by the marginal); a sketch follows.
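The same "slice, then normalize" recipe on a discrete joint table; this continues the made-up table from the previous sketch.

```python
import numpy as np

# p(x | y = j) = p(x, y = j) / p(y = j): slice the joint at y = j, normalize.
pxy = np.array([[0.10, 0.20, 0.05],
                [0.15, 0.30, 0.20]])    # rows index x, columns index y
j = 1                                   # condition on the event y = column 1
joint_slice = pxy[:, j]                 # slice the joint distribution
px_given_y = joint_slice / joint_slice.sum()   # normalize by marginal p(y=j)
print(px_given_y, px_given_y.sum())     # a proper distribution over x
```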

  8. Bayes' Rule: p(x | y) = p(y | x) p(x) / p(y), a direct consequence of the definition of conditional probability.
     Conditional vs. marginal: p(x | y = 120) vs. p(x). In general, these differ. When are they the same? In particular, when are all conditionals equal to the marginal?

  9. Statistical independence. Variables x and y are statistically independent if (and only if):
     p(x, y) = p(x) p(y)
     Independence implies that every conditional is equal to the corresponding marginal:
     p(y | x) = p(y, x) / p(x) = p(y), \forall x
     Uncorrelated doesn't mean independent: statistical independence is a stronger assumption than uncorrelatedness. All independent variables are uncorrelated, but not all uncorrelated variables are independent (a counterexample is sketched below).
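A standard counterexample, checked by simulation: with x symmetric about zero, y = x² is completely determined by x yet uncorrelated with it.

```python
import numpy as np

# cov(x, x^2) = E[x^3] = 0 for a symmetric zero-mean x, so the correlation
# vanishes even though y is a deterministic function of x.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x**2
r = np.corrcoef(x, y)[0, 1]
print(r)                                # ~0, despite total dependence
```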

 10. Expected value:
     E(x) = \int x \, p(x) \, dx  [the mean, \mu]
     E(x^2) = \int x^2 \, p(x) \, dx  [the "second moment"]
     E\big((x - \mu)^2\big) = \int (x - \mu)^2 \, p(x) \, dx = \int x^2 \, p(x) \, dx - \mu^2  [the variance, \sigma^2]
     Note: E(f(x)) = \int f(x) \, p(x) \, dx is an inner product, and thus linear!
     Mean and (co)variance:
     • One-D: mean and variance summarize centroid/width
     • translation and rescaling of random variables
     • nonlinear transformations: "warping"
     • Multi-D: vector mean and covariance matrix, elliptical geometry
     • Mean/variance of a weighted sum of random variables
     • The sample average...
       - ... converges to the true mean (except for bizarre distributions)
       - ... with variance \sigma^2 / N (checked by simulation below)
       - ... the most common choice for an estimate
     • Correlation
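A quick numerical check that the sample average of N i.i.d. draws has variance σ²/N; the sample size and trial count are arbitrary.

```python
import numpy as np

# Average N unit-variance Gaussian draws, many times over, and compare the
# empirical variance of the averages with the prediction sigma^2 / N.
rng = np.random.default_rng(1)
N, trials = 100, 20_000
samples = rng.standard_normal((trials, N))    # sigma = 1
avgs = samples.mean(axis=1)                   # one sample average per trial
print(avgs.var(), 1.0 / N)                    # empirical vs. predicted
```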

 11. Distribution of a sum of independent R.V.s: the return of convolution. The Central Limit Theorem [on board].
     Central limit for a uniform distribution (simulated below):
     [Figure: histograms of 10^4 samples of a uniform density (sigma = 1), of (u+u)/sqrt(2), of (u+u+u+u)/sqrt(4), and of ten u's divided by sqrt(10); the histograms approach a Gaussian shape.]
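A sketch of the slide's demonstration: sum n unit-variance uniform variables and divide by √n, so the variance stays fixed while the shape becomes Gaussian. Excess kurtosis (0 for a Gaussian) is used as a crude Gaussianity check.

```python
import numpy as np

# Uniform(-sqrt(3), sqrt(3)) has variance 1, so sums scaled by sqrt(n) keep
# unit variance; excess kurtosis drifts from -1.2 (uniform) toward 0 (Gaussian).
rng = np.random.default_rng(2)
for n in (1, 2, 4, 10):
    u = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(100_000, n))
    s = u.sum(axis=1) / np.sqrt(n)            # (u + ... + u) / sqrt(n)
    kurt = ((s - s.mean())**4).mean() / s.var()**2 - 3
    print(n, s.var(), kurt)                   # variance ~1, kurtosis -> 0
```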

 12. Central limit for a binary distribution:
     [Figure: histograms of one coin, and of averages of 4, 16, 64, and 256 coins; the averages concentrate around 0.5 and approach a Gaussian shape.]
     The Gaussian:
     • parameterized by mean and stdev (position / width)
     • joint density of two indep Gaussian RVs is circular! [easy]
     • product of two Gaussians is Gaussian! [easy]
     • conditionals of a Gaussian are Gaussian! [easy]
     • sum of Gaussian RVs is Gaussian! [moderate]
     • marginals of a Gaussian are Gaussian! [moderate]
     • central limit theorem: sum of many RVs is Gaussian! [hard]
     • most random (max entropy) density with this variance! [moderate]

 13. Gaussian densities.
     [Figure: a two-dimensional Gaussian density with mean [0.2, 0.8] and covariance [1.0 -0.3; -0.3 0.4].]
     Product of Gaussians is Gaussian: completing the square shows that the posterior is also Gaussian, with mean given by an average weighted by the inverse variances (details on the next slide).

 14. Product of Gaussians is Gaussian:
     p(x | y) \propto p(y | x) \, p(x) \propto e^{-\frac{1}{2\sigma_n^2}(x - y)^2} \, e^{-\frac{1}{2\sigma_x^2}(x - \mu_x)^2} = e^{-\frac{1}{2}\left[\left(\frac{1}{\sigma_n^2} + \frac{1}{\sigma_x^2}\right) x^2 - 2\left(\frac{y}{\sigma_n^2} + \frac{\mu_x}{\sigma_x^2}\right) x + \ldots\right]}
     Completing the square shows that this posterior is also Gaussian, with mean given by an average of y and \mu_x, weighted by the inverse variances! (A numerical check follows.)
     For the multivariate case, let P = C^{-1} (known as the "precision" matrix), with \vec{x} \sim N(\vec{\mu}, C). The conditional is Gaussian, and the marginal p(x_1) = \int p(\vec{x}) \, dx_2 is Gaussian as well.
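A numerical sketch of the completed square: precisions (inverse variances) add, and the posterior mean is the precision-weighted average of the observation and the prior mean. The numbers are arbitrary examples.

```python
import numpy as np

# Gaussian likelihood (mean y, variance sn2) times Gaussian prior
# (mean mu_x, variance sx2) gives a Gaussian posterior.
y, sn2 = 3.0, 0.5        # observation and noise variance (examples)
mu_x, sx2 = 0.0, 2.0     # prior mean and variance (examples)

post_var = 1.0 / (1.0 / sn2 + 1.0 / sx2)        # precisions add
post_mean = post_var * (y / sn2 + mu_x / sx2)   # precision-weighted average
print(post_mean, post_var)                      # mean pulled toward the data
```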

 15. Generalized marginals of a Gaussian. Project onto a unit vector \hat{u}: z = \hat{u}^T \vec{x}. Then p(z) is Gaussian, with:
     \mu_z = \hat{u}^T \vec{\mu}, \qquad \sigma_z^2 = \hat{u}^T C \hat{u}
     (verified by simulation below)
     Measurement (sampling) and inference:
     [Figure: the true density vs. 700 samples. True mean: [0, 0.8]; sample mean: [-0.05, 0.83]. True cov: [1.0 -0.25; -0.25 0.3]; sample cov: [0.95 -0.23; -0.23 0.29].]
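A simulation check of the projection formulas, reusing the slide's mean and covariance; the projection direction is an arbitrary unit vector.

```python
import numpy as np

# Draw from N(mu, C), project onto a unit vector u, and compare the sample
# mean and variance of z = u^T x with u^T mu and u^T C u.
rng = np.random.default_rng(3)
mu = np.array([0.0, 0.8])
C = np.array([[1.0, -0.25],
              [-0.25, 0.3]])
x = rng.multivariate_normal(mu, C, size=100_000)
u = np.array([1.0, 1.0]) / np.sqrt(2)     # arbitrary unit vector
z = x @ u
print(z.mean(), u @ mu)                   # sample vs. u^T mu
print(z.var(), u @ C @ u)                 # sample vs. u^T C u
```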

 16. Point Estimates:
     • Estimator: any function of the data, intended to represent the best approximation of the true value of a parameter
     • The most common estimator is the sample average
     • Statistically motivated examples (computed below):
       - Maximum likelihood (ML)
       - Maximum a posteriori (MAP)
       - Minimum mean squared error (MMSE)
       with p(x | y) \propto p(x) \, p(y | x)
     • Why must both prior and likelihood be taken into account?
     • Why doesn't the data dominate? When would it? When would the prior dominate?
     • What if prior and likelihood are incompatible?
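A sketch of the three estimates on a discretized posterior. The prior shape and likelihood (7 heads, 3 tails of a coin with heads probability x) are stand-ins chosen for illustration.

```python
import numpy as np

# Build p(x | y) ∝ p(x) p(y | x) on a grid, then read off ML, MAP, MMSE.
x = np.linspace(0, 1, 1001)
prior = np.exp(-0.5 * ((x - 0.5) / 0.15)**2)   # assumed bell-shaped prior
lik = x**7 * (1 - x)**3                        # e.g. 7 heads, 3 tails
post = prior * lik
post /= post.sum()                             # normalize on the grid

ml = x[np.argmax(lik)]                         # maximum likelihood (0.7 here)
map_ = x[np.argmax(post)]                      # maximum a posteriori
mmse = np.sum(x * post)                        # posterior mean
print(ml, map_, mmse)                          # MAP/MMSE sit between ML and prior
```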

 17. [Figure: likelihood functions for one observed head and for one observed tail, and a grid of posteriors p(x | H, T), assuming a flat prior p(x) = 1, for H = 0, 1, 2, 3 heads (more heads across) and T = 0, 1, 2, 3 tails (more tails down).]

 18. Example: infer whether a coin is fair by flipping it repeatedly. Here x is the probability of heads (x = 0.5 is fair), and y_1, ..., y_n are the outcomes of the flips. Consider three different priors: "suspect fair", "suspect biased", and "no idea" (flat).
     In each case: prior × likelihood (heads) = posterior. (A grid-based sketch follows.)
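A grid-based sketch of the example. The three prior shapes are illustrative guesses at the slide's "suspect fair" / "suspect biased" / "no idea" priors, not the exact curves used there.

```python
import numpy as np

# Multiply each prior over x (the heads probability) by the likelihood of
# observing one head, p(heads | x) = x, then renormalize.
x = np.linspace(0, 1, 501)
priors = {
    "suspect fair":   np.exp(-0.5 * ((x - 0.5) / 0.05)**2),  # peaked at 0.5
    "suspect biased": np.exp(-0.5 * ((x - 1.0) / 0.2)**2),   # mass near 1
    "no idea":        np.ones_like(x),                       # flat
}
likelihood = x                              # one observed head
for name, prior in priors.items():
    post = prior * likelihood
    post /= np.trapz(post, x)               # normalize to a density
    print(name, x[np.argmax(post)])         # posterior mode under each prior
```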

 19. Updating is sequential: previous posterior × likelihood (heads) = new posterior; previous posterior × likelihood (tails) = new posterior.

 20. Posteriors after observing 75 heads and 25 tails: the three posteriors nearly coincide, so prior differences are ultimately overwhelmed by data.
     Confidence:
     [Figure: posterior PDFs and CDFs after 2H/1T, 10H/5T, and 20H/10T. Reading the CDFs at .025 and .975 gives central 95% intervals, e.g. (.19, .93) after 2H/1T and (.49, .80) after 20H/10T; a sketch of this computation follows.]
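A sketch of the interval computation, assuming a flat prior so that H heads and T tails yield a Beta(H+1, T+1) posterior; scipy supplies the Beta quantile function.

```python
from scipy.stats import beta

# Central 95% credible interval from the posterior CDF, for the slide's counts.
for H, T in [(2, 1), (10, 5), (20, 10)]:
    lo, hi = beta.ppf([0.025, 0.975], H + 1, T + 1)
    print(f"{H}H/{T}T: 95% interval ({lo:.2f}, {hi:.2f})")
    # e.g. 2H/1T gives roughly (0.19, 0.93); the interval narrows with data
```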

  21. Bias & Variance • MSE = bias^2 + variance • Bias is difficult to assess (since requires knowing the “true” value). But variance is easier. • Classical statistics generally aims for an unbiased estimator, with minimal variance • The MLE is asymptotically unbiased (under fairly general conditions), but this is only useful if - the likelihood model is correct - the optimum can be computed - you have enough data • More general/modern view: estimation is about trading off bias and variance, through model selection, “regularization”, or Bayesian priors. Optimization... Heuristics, exhaustive search, (pain & suffering) Smooth (C 2 ) Iterative descent, Convex (possibly) nonunique Quadratic Iterative descent, unique Closed-form, and unique statAnMod - 9/12/07 - E.P. Simoncelli
