II.2 Statistical Inference: Sampling and Estimation


1. II.2 Statistical Inference: Sampling and Estimation

A statistical model M is a set of distributions (or regression functions), e.g., all uni-modal, smooth distributions. M is called a parametric model if it can be completely described by a finite number of parameters, e.g., the family of Normal distributions for the two parameters μ, σ:

$f_X(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ for $\mu \in \mathbb{R},\ \sigma > 0$
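To make the parametric-model idea concrete, here is a minimal Python sketch (all values chosen purely for illustration): the entire Normal family is captured by just the two numbers μ and σ.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of the Normal(mu, sigma^2) family -- a parametric model
    completely described by the two parameters (mu, sigma)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Two members of the same parametric family:
print(normal_pdf(0.0, mu=0.0, sigma=1.0))   # ~0.3989: standard Normal at x = 0
print(normal_pdf(0.0, mu=1.0, sigma=2.0))   # density of N(1, 4) at x = 0
```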

2. Statistical Inference

Given a parametric model M and a sample X_1, ..., X_n, how do we infer (learn) the parameters of M? For multivariate models with observed variable X and "outcome (response)" variable Y, this is called prediction or regression; for a discrete outcome variable it is also called classification. r(x) = E[Y | X=x] is called the regression function.

3. Idea of Sampling

Distribution X (e.g., a population, the objects of interest) → samples X_1, …, X_n drawn from X (e.g., people, objects). Statistical inference asks: what can we say about X based on X_1, …, X_n?

| Distribution parameter | Sample parameter     |
|------------------------|----------------------|
| mean μ_X               | sample mean X̄        |
| variance σ_X²          | sample variance S_X² |
| size N                 | sample size n        |

Example: Suppose we want to estimate the average salary of employees in German companies (see the simulation sketch below).
→ Sample 1: Suppose we look at n = 200 top-paid CEOs of major banks.
→ Sample 2: Suppose we look at n = 100 employees across all kinds of companies.
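The salary example can be made tangible with a small simulation. The population below is hypothetical (log-normal salaries, parameters invented), but it shows why Sample 1 yields a badly biased estimate while Sample 2 lands near the truth:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of 1,000,000 salaries (log-normal: heavy right tail).
population = rng.lognormal(mean=10.5, sigma=0.5, size=1_000_000)

# Sample 1: the 200 highest-paid individuals -- a biased, non-representative sample.
biased = np.sort(population)[-200:]
# Sample 2: 100 individuals drawn uniformly at random -- a representative sample.
representative = rng.choice(population, size=100, replace=False)

print(f"population mean:    {population.mean():12.0f}")
print(f"biased sample mean: {biased.mean():12.0f}")          # far too high
print(f"random sample mean: {representative.mean():12.0f}")  # close to the truth
```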

4. Basic Types of Statistical Inference

Given a set of iid. samples X_1, ..., X_n ~ X of an unknown distribution X, e.g., n single-coin-toss experiments X_1, ..., X_n ~ X: Bernoulli(p):

• Parameter estimation, e.g.: What is the parameter p of X: Bernoulli(p)? What is E[X], the cdf F_X of X, the pdf f_X of X, etc.?
• Confidence intervals, e.g.: Give me all values C = (a, b) such that P(p ∈ C) ≥ 0.95, where a and b are derived from the samples X_1, ..., X_n.
• Hypothesis testing, e.g.: H_0: p = 1/2 vs. H_1: p ≠ 1/2.

5. Statistical Estimators

A point estimator for a parameter θ of a prob. distribution X is a random variable $\hat\theta_n$ derived from an iid. sample X_1, ..., X_n.

Examples:
Sample mean: $\bar X := \frac{1}{n}\sum_{i=1}^{n} X_i$
Sample variance: $S_X^2 := \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2$

An estimator $\hat\theta_n$ for parameter θ is unbiased if $E[\hat\theta_n] = \theta$; otherwise the estimator has bias $E[\hat\theta_n] - \theta$.

An estimator $\hat\theta_n$ on a sample of size n is consistent if $\lim_{n \to \infty} P[\,|\hat\theta_n - \theta| \le \epsilon\,] = 1$ for any ε > 0.

Sample mean and sample variance are unbiased and consistent estimators of μ_X and σ_X².
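As a minimal sketch (synthetic data, parameters chosen arbitrarily), both estimators are one-liners in numpy; note that ddof=1 selects the 1/(n−1) normalization that makes the sample variance unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=1000)  # mu_X = 5, sigma_X^2 = 4

mean_hat = sample.mean()        # sample mean: unbiased for mu_X
var_hat = sample.var(ddof=1)    # sample variance with 1/(n-1): unbiased for sigma_X^2
var_biased = sample.var(ddof=0) # the 1/n version is biased (slightly too small)

print(mean_hat, var_hat, var_biased)
```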

6. Estimator Error

Let $\hat\theta_n$ be an estimator for parameter θ over iid. samples X_1, ..., X_n. The distribution of $\hat\theta_n$ is called the sampling distribution.

The standard error for $\hat\theta_n$ is: $se(\hat\theta_n) = \sqrt{Var[\hat\theta_n]}$

The mean squared error (MSE) for $\hat\theta_n$ is:
$MSE(\hat\theta_n) = E[(\hat\theta_n - \theta)^2] = bias(\hat\theta_n)^2 + Var[\hat\theta_n]$

Theorem: If bias → 0 and se → 0, then the estimator is consistent.

The estimator $\hat\theta_n$ is asymptotically Normal if $(\hat\theta_n - \theta)/se(\hat\theta_n)$ converges in distribution to the standard Normal N(0, 1).
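Sampling distribution, standard error, and MSE can all be approximated by simulation. A sketch for the sample mean of N(5, 4) data (all numbers illustrative); the last two printed values confirm MSE ≈ bias² + variance:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 5.0, 50, 10_000

# Approximate the sampling distribution of the sample mean by repeated sampling:
estimates = np.array([rng.normal(theta, 2.0, size=n).mean() for _ in range(trials)])

bias = estimates.mean() - theta
se = estimates.std(ddof=1)               # standard error = std of sampling distribution
mse = np.mean((estimates - theta) ** 2)  # mean squared error

print(f"bias={bias:.4f}  se={se:.4f}  mse={mse:.4f}  bias^2+var={bias**2 + se**2:.4f}")
```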

7. Types of Estimation

• Nonparametric estimation: makes no assumptions about the model M or the parameters θ of the underlying distribution X.
→ "Plug-in estimators" (e.g., histograms) to approximate X.
• Parametric estimation (inference): requires assumptions about the model M and the parameters θ of the underlying distribution X, and uses analytical or numerical methods for estimating θ:
→ Method-of-moments estimator
→ Maximum-likelihood estimator and Expectation Maximization (EM)

8. Nonparametric Estimation

The empirical distribution function $\hat F_n$ is the cdf that puts probability mass 1/n at each data point X_i:

$\hat F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x)$  with  $I(X_i \le x) = 1$ if $X_i \le x$, and 0 otherwise.

A statistical functional ("statistic") T(F) is any function over F, e.g., mean, variance, skewness, median, quantiles, correlation.

The plug-in estimator of θ = T(F) is: $\hat\theta_n = T(\hat F_n)$

→ Simply use $\hat F_n$ instead of F to calculate the statistic T of interest.
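A sketch of the empirical cdf and a plug-in estimator in Python (the exponential test data is an arbitrary choice):

```python
import numpy as np

def ecdf(sample):
    """Empirical cdf: F_hat(x) = (1/n) * #{i : X_i <= x}."""
    xs = np.sort(sample)
    return lambda x: np.searchsorted(xs, x, side="right") / len(xs)

rng = np.random.default_rng(2)
sample = rng.exponential(scale=3.0, size=500)

F_hat = ecdf(sample)
print(F_hat(3.0))         # estimate of P(X <= 3); true value is 1 - e^{-1} ~ 0.632
print(np.median(sample))  # plug-in estimator of the median functional T(F)
```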

9. Histograms as Density Estimators

Instead of the full empirical distribution, often compact data synopses may be used, such as histograms, where X_1, ..., X_n are grouped into m cells (buckets) c_1, ..., c_m with bucket boundaries lb(c_i) and ub(c_i) s.t. lb(c_1) = −∞, ub(c_m) = ∞, ub(c_i) = lb(c_{i+1}) for 1 ≤ i < m, and

freq_f(c_i) = $\hat f_n(x) = \frac{1}{n}\sum_{v=1}^{n} I(lb(c_i) < X_v \le ub(c_i))$
freq_F(c_i) = $\hat F_n(x) = \frac{1}{n}\sum_{v=1}^{n} I(X_v \le ub(c_i))$

[Figure: example histogram density estimate $\hat f_X(x)$ for n = 20 data points X_1 = 1, X_2 = 1, X_3 = 2, X_4 = 2, X_5 = 2, X_6 = 3, …, X_20 = 7, with mean 3.65; the bucket frequencies for x = 1, …, 7 are 2/20, 3/20, 5/20, 4/20, 3/20, 2/20, 1/20.]

Histograms provide a (discontinuous) density estimator.
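A minimal histogram density estimator with numpy (bucket count and test data chosen arbitrarily); density=True normalizes the bucket frequencies so the estimate integrates to 1:

```python
import numpy as np

rng = np.random.default_rng(3)
sample = rng.normal(0.0, 1.0, size=1000)

# m equi-width buckets; density=True makes the histogram integrate to 1.
freq, edges = np.histogram(sample, bins=10, density=True)

def f_hat(x):
    """Piecewise-constant (discontinuous) density estimate from the histogram."""
    i = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(freq) - 1)
    return freq[i]

print(f_hat(0.0))   # should be near the true N(0, 1) density ~0.399 at x = 0
```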

10. Parametric Inference (1): Method of Moments

Suppose parameter θ = (θ_1, …, θ_k) has k components.

j-th moment: $\alpha_j(\theta) = E_\theta[X^j] = \int x^j f_X(x; \theta)\, dx$

j-th sample moment: $\hat\alpha_j = \frac{1}{n}\sum_{i=1}^{n} X_i^j$ for 1 ≤ j ≤ k

Estimate parameter θ by the method-of-moments estimator $\hat\theta_n$ s.t.

$\alpha_1(\hat\theta_n) = \hat\alpha_1$
and $\alpha_2(\hat\theta_n) = \hat\alpha_2$
…
and $\alpha_k(\hat\theta_n) = \hat\alpha_k$ (for the first k moments)

→ Solve this equation system with k equations and k unknowns.

Method-of-moments estimators are usually consistent and asymptotically Normal, but may be biased.
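For instance, for the Normal family with θ = (μ, σ²) the first two moment equations are α_1 = μ and α_2 = μ² + σ², which can be solved directly. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.normal(loc=2.0, scale=3.0, size=5000)   # theta = (mu, sigma^2) = (2, 9)

# First two sample moments:
m1 = np.mean(sample)        # alpha_1 = E[X]   = mu
m2 = np.mean(sample ** 2)   # alpha_2 = E[X^2] = mu^2 + sigma^2

# Solve alpha_1(theta) = m1 and alpha_2(theta) = m2 for (mu, sigma^2):
mu_hat = m1
sigma2_hat = m2 - m1 ** 2   # note: this is the biased 1/n variance estimator

print(mu_hat, sigma2_hat)
```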

11. Parametric Inference (2): Maximum Likelihood Estimators (MLE)

Let X_1, ..., X_n be iid. with pdf f(x; θ). Estimate the parameter θ of a postulated distribution f(x; θ) such that the likelihood that the sample values x_1, ..., x_n are generated by this distribution is maximized.

Maximum likelihood estimation:
Maximize L(x_1, ..., x_n; θ) ≈ P[x_1, ..., x_n originate from f(x; θ)],
usually formulated as $L_n(\theta) = \prod_i f(X_i; \theta)$,
or (alternatively) maximize $l_n(\theta) = \log L_n(\theta)$.

The value $\hat\theta_n$ that maximizes $L_n(\theta)$ is the MLE of θ.

If analytically intractable → use numerical iteration methods.
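When no closed form is available, the log-likelihood can be maximized numerically. A crude sketch using a grid search as a stand-in for a proper iterative optimizer, on exponentially distributed data (where the analytical MLE 1/x̄ is known, so the result can be checked):

```python
import numpy as np

rng = np.random.default_rng(5)
sample = rng.exponential(scale=2.0, size=1000)   # Exp(lambda) with rate lambda = 0.5

def log_likelihood(lam, xs):
    """l_n(lambda) = sum_i log f(X_i; lambda) for f(x; lambda) = lambda * e^(-lambda*x)."""
    return len(xs) * np.log(lam) - lam * np.sum(xs)

# Crude numerical maximization over a grid:
grid = np.linspace(0.01, 5.0, 5000)
lls = np.array([log_likelihood(lam, sample) for lam in grid])
lam_hat = grid[np.argmax(lls)]

print(lam_hat)   # close to 1/mean(sample), the analytical MLE, here ~0.5
```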

12. Simple Example for Maximum Likelihood Estimator

Given:
• Coin-toss experiment (Bernoulli distribution) with unknown parameter p for seeing heads, 1 − p for tails
• Sample (data): h times heads within n coin tosses

Want: maximum likelihood estimate of p

Likelihood, with h = ∑_i X_i:
$L(h, n, p) = \prod_{i=1}^{n} f(X_i; p) = \prod_{i=1}^{n} p^{X_i} (1-p)^{1-X_i} = p^h (1-p)^{n-h}$

Maximize the log-likelihood function:
$\log L(h, n, p) = h \log p + (n-h) \log(1-p)$
$\frac{\partial \ln L}{\partial p} = \frac{h}{p} - \frac{n-h}{1-p} = 0 \;\Rightarrow\; \hat p = \frac{h}{n}$
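A quick numerical check of the derivation (hypothetical counts h = 37, n = 100): maximizing the log-likelihood over a grid recovers h/n:

```python
import numpy as np

h, n = 37, 100   # hypothetical data: 37 heads in 100 tosses

def log_L(p):
    # log L(h, n, p) = h*log(p) + (n-h)*log(1-p)
    return h * np.log(p) + (n - h) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 999)
p_hat_numeric = grid[np.argmax(log_L(grid))]

print(p_hat_numeric, h / n)   # both ~0.37: the MLE is h/n
```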

13. MLE for Parameters of Normal Distributions

$L(x_1, \ldots, x_n; \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$

$\frac{\partial \ln L}{\partial \mu} = \frac{1}{2\sigma^2} \sum_{i=1}^{n} 2(x_i - \mu) = 0$

$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0$

$\Rightarrow \hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\mu)^2$
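In code both MLEs are one-liners (synthetic data for illustration); note the 1/n normalization (ddof=0) for $\hat\sigma^2$, in contrast to the unbiased 1/(n−1) sample variance from slide 5:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(loc=10.0, scale=2.0, size=2000)   # true mu = 10, sigma^2 = 4

mu_hat = x.mean()            # MLE: (1/n) * sum(x_i)
sigma2_hat = x.var(ddof=0)   # MLE: (1/n) * sum((x_i - mu_hat)^2) -- 1/n, not 1/(n-1)

print(mu_hat, sigma2_hat)
```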

14. MLE Properties

Maximum-likelihood estimators are consistent, asymptotically Normal, and asymptotically optimal (i.e., efficient) in the following sense:

Consider two estimators U and T which are asymptotically Normal. Let u² and t² denote the variances of the two Normal distributions to which U and T converge in distribution. The asymptotic relative efficiency of U to T is ARE(U, T) := t²/u².

Theorem: For an MLE $\hat\theta_n$ and any other estimator $\theta_n$, the following inequality holds: $ARE(\theta_n, \hat\theta_n) \le 1$.

That is, among all (asymptotically Normal) estimators, the MLE has the smallest asymptotic variance.
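A simulation illustrating the theorem (synthetic N(0, 1) data): for Normal data the sample mean is the MLE of μ, and the sample median is an alternative estimator; the measured ARE of median to mean should come out near the known asymptotic value 2/π ≈ 0.64, i.e., below 1:

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials = 100, 20_000

# Sampling distributions of two estimators of mu for N(0, 1) data:
samples = rng.normal(0.0, 1.0, size=(trials, n))
mean_var = samples.mean(axis=1).var()           # variance of the MLE (sample mean)
median_var = np.median(samples, axis=1).var()   # variance of the sample median

# ARE(median, mean) = mean_var / median_var; asymptotically 2/pi ~ 0.637 < 1.
print(mean_var / median_var)
```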

15. Bayesian Viewpoint of Parameter Estimation

• Assume a prior distribution g(θ) of parameter θ.
• Choose a statistical model (generative model) f(x | θ) that reflects our beliefs about RV X.
• Given RVs X_1, ..., X_n for the observed data, the posterior distribution is h(θ | x_1, ..., x_n).

For X_1 = x_1, ..., X_n = x_n the likelihood is $L(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$, and the posterior is

$h(\theta \mid x_1, \ldots, x_n) = \frac{L(x_1, \ldots, x_n; \theta)\, g(\theta)}{\int L(x_1, \ldots, x_n; \theta')\, g(\theta')\, d\theta'}$

which implies $h(\theta \mid x_1, \ldots, x_n) \propto L(x_1, \ldots, x_n; \theta)\, g(\theta)$ (posterior is proportional to likelihood times prior).

MAP estimator (maximum a posteriori): compute the θ that maximizes h(θ | x_1, …, x_n), given a prior for θ.
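A sketch for the coin-toss model with a Beta(a, b) prior on p (the values a = b = 2 are an arbitrary choice): a grid search over the log-posterior matches the known closed-form MAP estimate (h + a − 1)/(n + a + b − 2) for this conjugate model:

```python
import numpy as np

# Coin-toss data: h heads in n tosses; Beta(a, b) prior on p.
h, n = 37, 100
a, b = 2.0, 2.0

def log_posterior(p):
    # log h(p | data) up to an additive constant: log L + log g
    return (h * np.log(p) + (n - h) * np.log(1 - p)            # log-likelihood
            + (a - 1) * np.log(p) + (b - 1) * np.log(1 - p))   # log Beta(a, b) prior

grid = np.linspace(0.001, 0.999, 999)
p_map = grid[np.argmax(log_posterior(grid))]

# Closed form for the Beta-Bernoulli model: (h + a - 1) / (n + a + b - 2)
print(p_map, (h + a - 1) / (n + a + b - 2))
```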

16. Analytically Intractable: MLE for Parameters of a Multivariate Normal Mixture

Consider samples from a k-mixture of m-dimensional Normal distributions with the density (e.g., height and weight of males and females):

$f(\vec x;\, \varphi_1, \ldots, \varphi_k,\, \vec\mu_1, \ldots, \vec\mu_k,\, \Sigma_1, \ldots, \Sigma_k) = \sum_{j=1}^{k} \varphi_j\, n(\vec x; \vec\mu_j, \Sigma_j)$  with  $\sum_{j=1}^{k} \varphi_j = 1$

where $n(\vec x; \vec\mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^m\, |\Sigma_j|}}\, e^{-\frac{1}{2} (\vec x - \vec\mu_j)^T \Sigma_j^{-1} (\vec x - \vec\mu_j)}$

with expectation values $\vec\mu_j$ and invertible, positive definite, symmetric m×m covariance matrices $\Sigma_j$.

Maximize the log-likelihood function:

$\log L(x_1, \ldots, x_n; \theta) := \log \prod_{i=1}^{n} P[x_i \mid \theta] = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \varphi_j\, n(x_i; \vec\mu_j, \Sigma_j)$
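A minimal sketch of this log-likelihood using scipy.stats.multivariate_normal; the 2-component, 2-dimensional data and all component parameters below are hypothetical stand-ins for the height/weight example. This is the objective that EM iteratively increases, since no closed-form maximizer exists:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, weights, means, covs):
    """log L = sum_i log sum_j phi_j * n(x_i; mu_j, Sigma_j)."""
    # densities[i, j] = n(x_i; mu_j, Sigma_j)
    densities = np.column_stack([
        multivariate_normal.pdf(X, mean=mu, cov=S) for mu, S in zip(means, covs)
    ])
    return np.sum(np.log(densities @ np.asarray(weights)))

# Hypothetical 2-component, 2-dimensional sample (say, height/weight clusters):
rng = np.random.default_rng(8)
X = np.vstack([rng.multivariate_normal([170, 65], np.eye(2) * 20, size=100),
               rng.multivariate_normal([185, 85], np.eye(2) * 25, size=100)])

ll = mixture_log_likelihood(
    X, weights=[0.5, 0.5],
    means=[np.array([170, 65]), np.array([185, 85])],
    covs=[np.eye(2) * 20, np.eye(2) * 25])
print(ll)
```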
