 
              Chapter II: Basics from Linear Algebra, Probability Theory, and Statistics Information Retrieval & Data Mining Universität des Saarlandes, Saarbrücken Wintersemester 2013/14
Chapter II
II.1 Linear Algebra: Vectors, Matrices, Eigenvalues, Eigenvectors, Singular Value Decomposition
II.2 Probability Theory: Events, Probabilities, Random Variables, Distributions, Bounds, Limit Theorems
II.3 Statistical Inference: Parameter Estimation, Confidence Intervals, Hypothesis Testing
IR&DM ’13/’14
II.3 Statistical Inference
1. Parameter Estimation
2. Confidence Intervals
3. Hypothesis Testing
Based on LW Chapters 6, 7, 9, 10
Statistical Model
• A statistical model $M$ is a set of distributions (or regression functions), e.g., all unimodal smooth distributions
• $M$ is called a parametric model if it can be completely described by a finite number of parameters, e.g., the family of Normal distributions with the two parameters $\mu$ and $\sigma$:
$$M = \left\{ f_X(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \;\middle|\; \mu \in \mathbb{R},\ \sigma > 0 \right\}$$
Statistical Inference
• Given a parametric model $M$ and a sample $X_1, \ldots, X_m$, how do we infer (learn) the parameters of $M$?
• For multivariate models with observed variable $X$ and response variable $Y$, this is called prediction or regression; for a discrete outcome variable it is also called classification
Idea of Sampling
[Figure: a distribution $X$ (population of interest) from which samples $X_1, \ldots, X_m$ (e.g., people) are drawn]
• Statistical inference: what can we say about $X$ based on $X_1, \ldots, X_m$?
• Example: Suppose we want to estimate the average salary of employees in German companies
  • Sample 1: Suppose we look at n = 200 top-paid CEOs of major banks
  • Sample 2: Suppose we look at n = 1,000 employees across all sectors
Basic Types of Statistical Inference
• Given independent and identically distributed (iid.) samples $X_1, \ldots, X_n \sim X$ of an unknown distribution $X$
  • e.g.: n single-coin-toss experiments $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$
• Parameter estimation
  • e.g.: what is the parameter $p$ of Bernoulli($p$)? what is $E[X]$, the cdf $F_X$ of $X$, the pdf $f_X$ of $X$, etc.?
• Confidence intervals
  • e.g.: find an interval $C = [a, b]$ such that $P[p \in C] \ge 0.95$, with interval boundaries $a$ and $b$ derived from samples $X_1, \ldots, X_n$
• Hypothesis testing
  • e.g.: $H_0: p = 1/2$ (i.e., coin is fair) vs. $H_1: p \ne 1/2$
1. Parameter Estimation
• A point estimator for a parameter $\theta$ of a probability distribution $X$ is a random variable $\hat{\theta}_n$ derived from an iid. sample $X_1, \ldots, X_n$
• Examples:
  • Sample mean $\bar{X} := \frac{1}{n} \sum_{i=1}^{n} X_i$
  • Sample variance $S_X^2 := \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$
• An estimator $\hat{\theta}_n$ for parameter $\theta$ is unbiased if $E[\hat{\theta}_n] = \theta$; otherwise the estimator has bias $E[\hat{\theta}_n] - \theta$
• An estimator on sample size n is consistent if $\lim_{n \to \infty} P[\,|\hat{\theta}_n - \theta| < \epsilon\,] = 1$ for any $\epsilon > 0$
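The difference between a biased and an unbiased estimator can be checked by simulation. The following is a minimal sketch, assuming standard Normal data with true variance 1; the helper `biased_variance` (the same sum divided by n instead of n − 1) and the simulation settings are illustrative choices, not part of the slides.

```python
import random


def sample_mean(xs):
    # Sample mean: (1/n) * sum of X_i.
    return sum(xs) / len(xs)


def sample_variance(xs):
    # Sample variance S^2 with the 1/(n-1) factor (unbiased).
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def biased_variance(xs):
    # Hypothetical comparison estimator: same sum, but divided by n.
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


# Assumption: draw many small samples from N(0, 1), whose true variance is 1,
# and average each estimator over all samples.
rng = random.Random(0)
n, trials = 5, 20000
unbiased_avg = 0.0
biased_avg = 0.0
for _ in range(trials):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    unbiased_avg += sample_variance(xs) / trials
    biased_avg += biased_variance(xs) / trials

print("average of S^2 (1/(n-1)):", unbiased_avg)
print("average of 1/n variant:  ", biased_avg)
```

Under these assumptions the first average should land close to the true variance 1, while the second should land close to (n − 1)/n = 0.8, which illustrates a bias of about −0.2.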
Estimation Error
• Let $\hat{\theta}_n$ be an estimator for parameter $\theta$ over iid. samples $X_1, \ldots, X_n$
• The distribution of $\hat{\theta}_n$ is called the sampling distribution
• The standard error for $\hat{\theta}_n$ is: $se(\hat{\theta}_n) = \sqrt{Var(\hat{\theta}_n)}$
• The mean squared error (MSE) for $\hat{\theta}_n$ is:
$$MSE(\hat{\theta}_n) = E[(\hat{\theta}_n - \theta)^2] = bias^2(\hat{\theta}_n) + Var(\hat{\theta}_n)$$
• The estimator $\hat{\theta}_n$ is asymptotically Normal if $(\hat{\theta}_n - \theta)/se$ converges in distribution to N(0,1)
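The decomposition of the MSE into squared bias and variance can be illustrated numerically. This is a sketch under stated assumptions: the estimator is the 1/n variance estimator, the data are N(0, 2²) so that the true parameter is θ = 4, and the sampling distribution is approximated by repeated simulation.

```python
import random
import statistics

rng = random.Random(1)
theta = 4.0           # true variance of N(0, 2^2); an assumption of this sketch
n, trials = 10, 5000  # illustrative sample size and number of repetitions

# Approximate the sampling distribution of the estimator by simulation.
estimates = []
for _ in range(trials):
    xs = [rng.gauss(0.0, 2.0) for _ in range(n)]
    estimates.append(statistics.pvariance(xs))  # 1/n variance estimator

mean_est = statistics.fmean(estimates)
bias = mean_est - theta                 # empirical bias
var = statistics.pvariance(estimates)   # empirical variance of the estimator
se = var ** 0.5                         # empirical standard error
mse = statistics.fmean([(e - theta) ** 2 for e in estimates])

print("bias:", bias, "se:", se)
print("mse:", mse, "bias^2 + var:", bias ** 2 + var)
```

The two printed quantities on the last line agree up to floating-point error, because the identity also holds for empirical moments; the bias should come out near −θ/n = −0.4.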
Types of Estimation
• Non-Parametric Estimation
  • no assumptions about the model $M$ nor the parameters $\theta$ of the underlying distribution $X$
  • e.g.: “plug-in estimators” (e.g., histograms) to approximate $X$
• Parametric Estimation
  • requires assumptions about the model $M$ and the parameters $\theta$ of the underlying distribution $X$
  • analytical or numerical methods for estimating $\theta$
    • Method of Moments
    • Maximum Likelihood
    • Expectation Maximization (EM)
Empirical Distribution Function
• The empirical distribution function $\hat{F}_n$ is the cdf that puts probability mass 1/n at each data point $X_i$:
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x)$$
with indicator function
$$I(X_i \le x) = \begin{cases} 1 & \text{if } X_i \le x \\ 0 & \text{if } X_i > x \end{cases}$$
• A statistical functional (“statistic”) $T(F)$ is any function of $F$, e.g., mean, variance, skewness, median, quantiles, correlation
• The plug-in estimator of $\theta = T(F)$ is $\hat{\theta}_n = T(\hat{F}_n)$
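The empirical distribution function and two plug-in estimators can be sketched in a few lines. The function names and the five-point sample below are illustrative assumptions, not taken from the slides.

```python
def make_ecdf(sample):
    """Return the empirical distribution function F_hat_n of `sample`."""
    n = len(sample)

    def F_hat(x):
        # Fraction of data points X_i with X_i <= x (mass 1/n per point).
        return sum(1 for xi in sample if xi <= x) / n

    return F_hat


def plug_in_mean(sample):
    # T(F) = mean of F, evaluated at F_hat_n: every point has weight 1/n.
    return sum(sample) / len(sample)


def plug_in_variance(sample):
    # T(F) = variance of F, evaluated at F_hat_n (note the 1/n factor).
    m = plug_in_mean(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)


data = [2, 1, 3, 3, 5]  # illustrative sample, not from the slides
F_hat = make_ecdf(data)
ecdf_values = [F_hat(x) for x in (0, 1, 3, 5)]
print(ecdf_values)             # step function values at 0, 1, 3, 5
print(plug_in_mean(data), plug_in_variance(data))
```

Note that the plug-in variance divides by n, so it differs from the sample variance $S_X^2$ by the factor (n − 1)/n.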
Histograms as Density Estimators
• Instead of the full empirical distribution, compact synopses can often be used, such as histograms where $X_1, \ldots, X_n$ are grouped into m cells (buckets) $c_1, \ldots, c_m$ with bucket boundaries $lb(c_i)$ and $ub(c_i)$ such that
  $lb(c_1) = -\infty$, $ub(c_m) = \infty$, $ub(c_{i-1}) = lb(c_i)$ for $1 < i \le m$, and
$$freq_f(c_i) = \hat{f}_n(x) = \frac{1}{n} \sum_{j=1}^{n} I(lb(c_i) < X_j \le ub(c_i))$$
$$freq_F(c_i) = \hat{F}_n(x) = \frac{1}{n} \sum_{j=1}^{n} I(X_j \le ub(c_i))$$
• Example:
  $X_1 = X_2 = 1$, $X_3 = X_4 = X_5 = 2$, $X_6 = \ldots = X_{10} = 3$, $X_{11} = \ldots = X_{14} = 4$, $X_{15} = \ldots = X_{17} = 5$, $X_{18} = X_{19} = 6$, $X_{20} = 7$

  x             1     2     3     4     5     6     7
  \hat{f}_X(x)  2/20  3/20  5/20  4/20  3/20  2/20  1/20

$$\hat{\mu}_n = 1 \times \frac{2}{20} + 2 \times \frac{3}{20} + \ldots + 7 \times \frac{1}{20} = 3.65$$
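The histogram example can be reproduced directly from the bucket definitions. This is a minimal sketch, assuming one bucket per observed value and using the value itself as the representative of each bucket when computing the mean; the variable names are illustrative.

```python
import math

# Data from the slide example: 20 observations with values 1..7.
data = [1] * 2 + [2] * 3 + [3] * 5 + [4] * 4 + [5] * 3 + [6] * 2 + [7] * 1

# Assumption: one bucket per value, bucket i covers (lb(c_i), ub(c_i)],
# with lb(c_1) = -inf and ub(c_m) = +inf.
upper_bounds = [1, 2, 3, 4, 5, 6, math.inf]
lower_bounds = [-math.inf] + upper_bounds[:-1]

n = len(data)
# freq_f(c_i): relative frequency of points falling into bucket c_i.
freq_f = [
    sum(1 for x in data if lb < x <= ub) / n
    for lb, ub in zip(lower_bounds, upper_bounds)
]
# freq_F(c_i): cumulative relative frequency up to ub(c_i).
freq_F = [sum(1 for x in data if x <= ub) / n for ub in upper_bounds]

# Mean estimated from the histogram, using each bucket's value.
values = [1, 2, 3, 4, 5, 6, 7]
mu_hat = sum(v * f for v, f in zip(values, freq_f))

print(freq_f)
print(freq_F)
print(mu_hat)  # approximately 3.65, matching the slide
```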
Method of Moments
• Suppose parameter $\theta = (\theta_1, \ldots, \theta_k)$ has k components
• Compute the j-th moment for $1 \le j \le k$:
$$\alpha_j = \alpha_j(\theta) = E_\theta[X^j] = \int_{-\infty}^{+\infty} x^j f_X(x)\, dx$$
• Compute the j-th sample moment for $1 \le j \le k$:
$$\hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j$$
• The method-of-moments estimate of $\theta$ is obtained by solving a system of k equations in k unknowns:
$$\alpha_1(\hat{\theta}_n) = \hat{\alpha}_1, \quad \ldots, \quad \alpha_k(\hat{\theta}_n) = \hat{\alpha}_k$$
Method of Moments (Example)
• Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, \sigma^2)$:
$$\alpha_1 = E_\theta[X] = \mu$$
$$\alpha_2 = E_\theta[X^2] = Var(X) + (E[X])^2 = \sigma^2 + \mu^2$$
• By solving the system of 2 equations in 2 unknowns
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
$$\hat{\sigma}^2 + \hat{\mu}^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2$$
we obtain as solutions
$$\hat{\mu} = \bar{X}_n, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$$
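The two-equation system for the Normal case can be solved in code exactly as in the derivation. The following sketch uses a hypothetical helper name and a four-point sample chosen only for illustration.

```python
def normal_method_of_moments(xs):
    """Method-of-moments estimates (mu_hat, sigma2_hat) for Normal data."""
    n = len(xs)
    alpha1_hat = sum(xs) / n                  # first sample moment
    alpha2_hat = sum(x ** 2 for x in xs) / n  # second sample moment
    # Solve alpha_1 = mu and alpha_2 = sigma^2 + mu^2 for mu and sigma^2.
    mu_hat = alpha1_hat
    sigma2_hat = alpha2_hat - alpha1_hat ** 2
    return mu_hat, sigma2_hat


data = [1.0, 2.0, 3.0, 4.0]  # illustrative sample, not from the slides
mu_hat, sigma2_hat = normal_method_of_moments(data)

# Cross-check against the closed form (1/n) * sum (X_i - mean)^2.
closed_form = sum((x - mu_hat) ** 2 for x in data) / len(data)
print(mu_hat, sigma2_hat, closed_form)
```

For long samples with a large mean, the difference of moments can lose precision, so the closed form at the end is the numerically safer way to compute the same estimate.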
Maximum Likelihood
• Let $X_1, \ldots, X_n$ be iid. with pdf $f(x; \theta)$
• Estimate parameter $\theta$ of a postulated distribution $f(x; \theta)$ such that the likelihood that the sample values $x_1, \ldots, x_n$ are generated by the distribution is maximized
• Maximize $L(x_1, \ldots, x_n, \theta) \approx P[x_1, \ldots, x_n \text{ originate from } f(x; \theta)]$
• Usually formulated as:
$$\arg\max_\theta L_n[\theta] \quad \text{with} \quad L_n[\theta] = \prod_{i=1}^{n} f(X_i; \theta)$$
• The value $\hat{\theta}$ that maximizes $L_n[\theta]$ is called the maximum-likelihood estimate (MLE) of $\theta$
• If analytically intractable, the MLE can be determined using numerical iteration methods
Maximum Likelihood (Example)
• Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ (corresponding to n coin tosses)
• Assume that we observed h times head and (n − h) times tail
• Maximum-likelihood estimation of parameter p:
$$L[h, n, p] = \prod_{i=1}^{n} f(X_i; p) = \prod_{i=1}^{n} p^{X_i} (1 - p)^{1 - X_i} = p^h (1 - p)^{n - h}$$
• Maximize the log-likelihood function:
$$\log L[h, n, p] = h \log(p) + (n - h) \log(1 - p)$$
$$\frac{\partial \log L}{\partial p} = \frac{h}{p} - \frac{n - h}{1 - p} = 0 \;\Rightarrow\; p = \frac{h}{n}$$
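The closed-form result p = h/n can be compared with a direct numerical maximization of the log-likelihood. This is a sketch assuming h = 7 heads in n = 10 tosses and a simple grid search; both the counts and the grid resolution are illustrative.

```python
import math


def bernoulli_log_likelihood(p, h, n):
    # log L[h, n, p] = h * log(p) + (n - h) * log(1 - p)
    return h * math.log(p) + (n - h) * math.log(1 - p)


h, n = 7, 10  # illustrative counts: 7 heads in 10 tosses

# Closed-form MLE from setting the derivative to zero.
p_closed_form = h / n

# Numerical alternative: evaluate the log-likelihood on a grid over (0, 1)
# and keep the maximizer (a crude stand-in for iterative methods).
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, h, n))

print(p_closed_form, p_grid)
```

Both values coincide here because the true maximizer lies on the grid; in general the grid search is only accurate up to the grid spacing.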
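Maximum Likelihood for Normal Distributions
$$L(x_1, \ldots, x_n, \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$$
Setting the partial derivatives of the log-likelihood to zero:
$$\frac{\partial \log L}{\partial \mu} = \frac{1}{2\sigma^2} \sum_{i=1}^{n} 2 (x_i - \mu) = 0$$
$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0$$
$$\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$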
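As a sanity check, the closed-form Normal MLE can be plugged back into the partial derivatives of the log-likelihood, which should then vanish. The sketch below assumes a small illustrative sample; the function names are hypothetical.

```python
import math


def normal_mle(xs):
    """Closed-form MLE (mu_hat, sigma2_hat) for Normal data."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat


def normal_log_likelihood(xs, mu, sigma2):
    # log L = -(n/2) * log(2*pi*sigma^2) - sum (x_i - mu)^2 / (2*sigma^2)
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))


def score(xs, mu, sigma2):
    # Partial derivatives of log L with respect to mu and sigma^2.
    n = len(xs)
    d_mu = sum(x - mu for x in xs) / sigma2
    d_sigma2 = (-n / (2 * sigma2)
                + sum((x - mu) ** 2 for x in xs) / (2 * sigma2 ** 2))
    return d_mu, d_sigma2


data = [2.0, 4.0, 4.0, 5.0, 7.0]  # illustrative sample, not from the slides
mu_hat, sigma2_hat = normal_mle(data)
d_mu, d_sigma2 = score(data, mu_hat, sigma2_hat)

print(mu_hat, sigma2_hat)
print(d_mu, d_sigma2)  # both approximately zero at the MLE
```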
2. Confidence Intervals
• Determine an interval estimator $T$ for parameter $\theta$ such that
$$P[T - a \le \theta \le T + a] = 1 - \alpha$$
  $T \pm a$ is the confidence interval and $1 - \alpha$ the confidence level
• For the distribution of a random variable $X$, a value $x_\gamma$ ($0 < \gamma < 1$) with $P[X \le x_\gamma] \ge \gamma$ and $P[X \ge x_\gamma] \ge 1 - \gamma$ is called a $\gamma$-quantile
  • the 0.5-quantile is known as the median
  • for the standard Normal distribution N(0,1) the $\gamma$-quantile is denoted $\Phi_\gamma$
• For a given $a$, find the value $z$ of N(0,1) that gives the confidence level $1 - \alpha$ of the interval $[T - a, T + a]$; for a given $\alpha$, find the corresponding quantile of N(0,1) that gives the interval half-width $a$
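Confidence Intervals for Expectations (I)
• Let $X_1, \ldots, X_n$ be a sample from a distribution $X$ with unknown expectation $\mu$ and known variance $\sigma^2$
• For sufficiently large n, the sample mean $\bar{X}$ is approximately $N(\mu, \sigma^2/n)$ distributed and
$$\begin{aligned}
P\left[-z \le \frac{(\bar{X} - \mu)\sqrt{n}}{\sigma} \le z\right] &= \Phi(z) - \Phi(-z) = \Phi(z) - (1 - \Phi(z)) = 2\Phi(z) - 1 \\
&= P\left[\bar{X} - \frac{z\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{z\sigma}{\sqrt{n}}\right]
\end{aligned}$$
$$\Rightarrow\; P\left[\bar{X} - \frac{\Phi_{1-\alpha/2}\,\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{\Phi_{1-\alpha/2}\,\sigma}{\sqrt{n}}\right] = 1 - \alpha$$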
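The interval with known variance can be computed with the standard library, and its coverage can be checked by simulation. This is a sketch under stated assumptions: the function name is hypothetical, the quantile $\Phi_{1-\alpha/2}$ is obtained from `statistics.NormalDist().inv_cdf`, and the simulation draws Normal data with µ = 10 and σ = 2.

```python
import math
import random
from statistics import NormalDist, fmean


def mean_confidence_interval(xs, sigma, alpha=0.05):
    """(1 - alpha) confidence interval for the mean, with known sigma."""
    n = len(xs)
    x_bar = fmean(xs)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # quantile Phi_{1 - alpha/2}
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Coverage check (illustrative settings): how often does the interval
# contain the true mean when the experiment is repeated many times?
rng = random.Random(42)
mu, sigma, n, trials = 10.0, 2.0, 30, 2000
hits = 0
for _ in range(trials):
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    lo, hi = mean_confidence_interval(xs, sigma)
    hits += lo <= mu <= hi
coverage = hits / trials

print("empirical coverage:", coverage)  # expected to be near 0.95
```

For α = 0.05 the quantile is roughly 1.96, so with σ = 2 and n = 25 the half-width would be about 1.96 × 2 / 5 ≈ 0.784.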