SLIDE 1

Chapter II: Basics from Linear Algebra, Probability Theory, and Statistics

Information Retrieval & Data Mining
Universität des Saarlandes, Saarbrücken
Wintersemester 2013/14

SLIDE 2

Chapter II

II.1 Linear Algebra
     Vectors, Matrices, Eigenvalues, Eigenvectors, Singular Value Decomposition

II.2 Probability Theory
     Events, Probabilities, Random Variables, Distributions, Bounds, Limit Theorems

II.3 Statistical Inference
     Parameter Estimation, Confidence Intervals, Hypothesis Testing
SLIDE 3

II.3 Statistical Inference

1. Parameter Estimation
2. Confidence Intervals
3. Hypothesis Testing

Based on LW Chapters 6, 7, 9, 10

SLIDE 4

Statistical Model

  • A statistical model M is a set of distributions (or regression functions), e.g., all unimodal smooth distributions

  • M is called a parametric model if it can be completely described by a finite number of parameters, e.g., the family of Normal distributions with parameters µ and σ:

$$\mathcal{M} = \left\{ f_X(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \;\middle|\; \mu \in \mathbb{R},\ \sigma > 0 \right\}$$

SLIDE 5

Statistical Inference

  • Given a parametric model M and a sample X1, …, Xm, how do we infer (learn) the parameters of M?

  • For multivariate models with observed variable X and response variable Y, this is called prediction or regression; for a discrete outcome variable it is also called classification

SLIDE 6

Idea of Sampling

  • Example: Suppose we want to estimate the average salary of employees in German companies

  • Sample 1: Suppose we look at n = 200 top-paid CEOs of major banks (a heavily biased sample)
  • Sample 2: Suppose we look at n = 1,000 employees across all sectors (a more representative sample)

[Diagram: distribution X (population of interest) → samples X1, …, Xm (e.g., people) → statistical inference: what can we say about X based on X1, …, Xm?]

SLIDE 7

Basic Types of Statistical Inference

  • Given independent and identically distributed (iid.) samples X1, …, Xn ~ X of an unknown distribution X
  • e.g.: n single-coin-toss experiments X1, …, Xn ~ Bernoulli(p)

  • Parameter estimation
  • e.g.: what is the parameter p of Bernoulli(p)? what is E[X], the cdf FX of X, the pdf fX of X, etc.?

  • Confidence intervals
  • e.g.: give me all values C = [a, b] such that P[p ∈ C] ≥ 0.95, with interval boundaries a and b derived from samples X1, …, Xn

  • Hypothesis testing
  • e.g.: H0 : p = 1/2 (i.e., the coin is fair) vs. H1 : p ≠ 1/2

SLIDE 8

  • 1. Parameter Estimation
  • A point estimator for a parameter θ of a probability distribution X is a random variable $\hat{\theta}_n$ derived from an iid. sample X1, …, Xn

  • Examples:
  • Sample mean
$$\bar{X} := \frac{1}{n} \sum_{i=1}^{n} X_i$$
  • Sample variance
$$S_X^2 := \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

  • An estimator $\hat{\theta}_n$ for parameter θ is unbiased if $E[\hat{\theta}_n] = \theta$; otherwise the estimator has bias $E[\hat{\theta}_n] - \theta$
  • An estimator on sample size n is consistent if $\lim_{n \to \infty} P[|\hat{\theta}_n - \theta| < \epsilon] = 1$ for any ε > 0
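As a quick illustration (not from the original slides), here is a minimal Python sketch computing both estimators on a synthetic sample; the Normal(10, 2²) choice is purely illustrative:

```python
import random

random.seed(42)
# illustrative iid. sample X_1, ..., X_n ~ Normal(mu=10, sigma=2)
sample = [random.gauss(10.0, 2.0) for _ in range(1000)]

n = len(sample)
x_bar = sum(sample) / n                               # sample mean
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)  # unbiased sample variance
print(f"sample mean {x_bar:.3f}, sample variance {s2:.3f}")
```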

SLIDE 9

Estimation Error

  • Let $\hat{\theta}_n$ be an estimator for parameter θ over iid. samples X1, …, Xn
  • The distribution of $\hat{\theta}_n$ is called the sampling distribution
  • The standard error for $\hat{\theta}_n$ is
$$se(\hat{\theta}_n) = \sqrt{Var(\hat{\theta}_n)}$$
  • The mean squared error (MSE) for $\hat{\theta}_n$ is
$$MSE(\hat{\theta}_n) = E[(\hat{\theta}_n - \theta)^2] = bias^2(\hat{\theta}_n) + Var(\hat{\theta}_n)$$
  • The estimator is asymptotically Normal if $(\hat{\theta}_n - \theta)/se(\hat{\theta}_n)$ converges in distribution to N(0,1)

SLIDE 10

Types of Estimation

  • Non-Parametric Estimation
  • no assumptions about the model M nor the parameters θ of the underlying distribution X
  • e.g.: "plug-in estimators" (e.g., histograms) to approximate X

  • Parametric Estimation
  • requires assumptions about the model M and the parameters θ of the underlying distribution X
  • analytical or numerical methods for estimating θ:
  • Method of Moments
  • Maximum Likelihood
  • Expectation Maximization (EM)

SLIDE 11

Empirical Distribution Function

  • The empirical distribution function $\hat{F}_n$ is the cdf that puts probability mass 1/n at each data point Xi:
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x)$$
with indicator function
$$I(X_i \le x) = \begin{cases} 1 & : X_i \le x \\ 0 & : X_i > x \end{cases}$$

  • A statistical functional ("statistic") T(F) is any function over F, e.g., mean, variance, skewness, median, quantiles, correlation

  • The plug-in estimator of θ = T(F) is $\hat{\theta}_n = T(\hat{F}_n)$
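A minimal sketch of the empirical distribution function and a plug-in estimator (here the median as T(F̂n)); the function names are illustrative, not from the slides:

```python
def ecdf(sample):
    """Return F-hat_n, the cdf putting mass 1/n on each data point."""
    n = len(sample)
    return lambda x: sum(1 for xi in sample if xi <= x) / n

def plugin_median(sample):
    """Plug-in estimate of the median: smallest x with F-hat_n(x) >= 0.5."""
    f = ecdf(sample)
    return min(x for x in sample if f(x) >= 0.5)

data = [3.1, 1.4, 2.7, 5.0, 4.2]
print(ecdf(data)(3.0))      # 0.4, since two of five points are <= 3.0
print(plugin_median(data))  # 3.1
```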

SLIDE 12

Histograms as Density Estimators

  • Instead of the full empirical distribution, compact synopses can often be used, such as histograms, where X1, …, Xn are grouped into m cells (buckets) c1, …, cm with bucket boundaries lb(ci) and ub(ci) such that

$$lb(c_1) = -\infty, \quad ub(c_m) = \infty, \quad ub(c_{i-1}) = lb(c_i) \text{ for } 1 < i \le m$$

$$freq_f(c_i) = \hat{f}_n(x) = \frac{1}{n} \sum_{j=1}^{n} I(lb(c_i) < X_j \le ub(c_i))$$

$$freq_F(c_i) = \hat{F}_n(x) = \frac{1}{n} \sum_{j=1}^{n} I(X_j \le ub(c_i))$$

  • Example:
X1 = X2 = 1
X3 = X4 = X5 = 2
X6 = … = X10 = 3
X11 = … = X14 = 4
X15 = … = X17 = 5
X18 = X19 = 6
X20 = 7

x      :   1     2     3     4     5     6     7
f̂n(x) :  2/20  3/20  5/20  4/20  3/20  2/20  1/20

$$\hat{\mu}_n = 1 \times \tfrac{2}{20} + 2 \times \tfrac{3}{20} + \ldots + 7 \times \tfrac{1}{20} = 3.65$$
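The example can be recomputed with a short sketch (assuming unit-width cells, one per distinct value):

```python
from collections import Counter

# the 20 observations from the example above
sample = [1]*2 + [2]*3 + [3]*5 + [4]*4 + [5]*3 + [6]*2 + [7]*1
n = len(sample)

f_hat = {x: c / n for x, c in sorted(Counter(sample).items())}  # cell frequencies
print(f_hat)  # {1: 0.1, 2: 0.15, 3: 0.25, 4: 0.2, 5: 0.15, 6: 0.1, 7: 0.05}

mu_hat = sum(x * p for x, p in f_hat.items())  # histogram-based mean estimate
print(mu_hat)  # 3.65
```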

SLIDE 13

Method of Moments

  • Suppose parameter θ = (θ1, …, θk) has k components
  • Compute the j-th moment for 1 ≤ j ≤ k:
$$\alpha_j = \alpha_j(\theta) = E_\theta[X^j] = \int_{-\infty}^{+\infty} x^j f_X(x)\, dx$$

  • Compute the j-th sample moment for 1 ≤ j ≤ k:
$$\hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j$$

  • The method-of-moments estimate of θ is obtained by solving the system of k equations in k unknowns
$$\alpha_1(\hat{\theta}_n) = \hat{\alpha}_1, \quad \ldots, \quad \alpha_k(\hat{\theta}_n) = \hat{\alpha}_k$$

SLIDE 14

Method of Moments (Example)

  • Let X1, …, Xn ~ Normal(µ, σ²). The first two moments are
$$\alpha_1 = E_\theta[X] = \mu$$
$$\alpha_2 = E_\theta[X^2] = Var(X) + (E[X])^2 = \sigma^2 + \mu^2$$

  • By solving the system of 2 equations in 2 unknowns
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad \hat{\sigma}^2 + \hat{\mu}^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2$$
we obtain as solutions
$$\hat{\mu} = \bar{X}_n, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$$
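A minimal sketch of these method-of-moments estimates on synthetic data; the true parameters µ = 3, σ = 1.5 are illustrative assumptions:

```python
import random

random.seed(0)
xs = [random.gauss(3.0, 1.5) for _ in range(10_000)]
n = len(xs)

a1 = sum(xs) / n                 # first sample moment
a2 = sum(x * x for x in xs) / n  # second sample moment

mu_hat = a1                      # mu-hat = alpha-hat_1
sigma2_hat = a2 - a1 * a1        # sigma-hat^2 = alpha-hat_2 - alpha-hat_1^2
print(mu_hat, sigma2_hat)        # close to 3.0 and 2.25
```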

SLIDE 15

Maximum Likelihood

  • Let X1, …, Xn be iid. with pdf f(x; θ)
  • Estimate the parameter θ of a postulated distribution f(x; θ) such that the likelihood that the sample values x1, …, xn were generated by the distribution is maximized

  • Maximize L(x1, …, xn, θ) ≈ P[x1, …, xn originate from f(x; θ)]
  • Usually formulated as:
$$\arg\max_{\theta}\; L_n[\theta] = \prod_{i=1}^{n} f(X_i; \theta)$$
  • The value $\hat{\theta}$ that maximizes Ln[θ] is called the maximum-likelihood estimate (MLE) of θ

  • If analytically intractable, the MLE can be determined using numerical iteration methods

SLIDE 16

Maximum Likelihood (Example)

  • Let X1, …, Xn ~ Bernoulli(p) (corresponding to n coin tosses)
  • Assume that we observed heads h times and tails (n − h) times
  • Maximum-likelihood estimation of parameter p:
$$L[h, n, p] = \prod_{i=1}^{n} f(X_i; p) = \prod_{i=1}^{n} p^{X_i} (1-p)^{1-X_i} = p^h\, (1-p)^{(n-h)}$$

  • Maximize the log-likelihood function
$$\log L[h, n, p] = h \log(p) + (n-h) \log(1-p)$$
$$\frac{\partial \log L}{\partial p} = \frac{h}{p} - \frac{n-h}{1-p} = 0 \;\Rightarrow\; \hat{p} = \frac{h}{n}$$
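A minimal sketch checking the closed-form MLE p̂ = h/n numerically against the log-likelihood on a grid; the counts h = 58, n = 100 are illustrative:

```python
import math

h, n = 58, 100  # illustrative: 58 heads in 100 tosses

def log_likelihood(p):
    return h * math.log(p) + (n - h) * math.log(1 - p)

# grid search confirms the analytical maximizer h/n
p_grid = [i / 1000 for i in range(1, 1000)]
p_numeric = max(p_grid, key=log_likelihood)
print(h / n, p_numeric)  # both 0.58
```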

SLIDE 17

Maximum Likelihood for Normal Distributions

$$L(x_1, \ldots, x_n, \mu, \sigma^2) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^n \prod_{i=1}^{n} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$$

Setting the partial derivatives of the log-likelihood to zero:

$$\frac{\partial \log L}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0$$

$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0$$

$$\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$

SLIDE 18

  • 2. Confidence Intervals
  • Determine an interval estimator T for parameter θ such that
$$P[T - a \le \theta \le T + a] = 1 - \alpha$$
T ± a is the confidence interval and 1 − α the confidence level

  • For the distribution of a random variable X, a value xγ (0 < γ < 1) with P[X ≤ xγ] ≥ γ and P[X ≥ xγ] ≥ 1 − γ is called a γ-quantile
  • the 0.5-quantile is known as the median
  • for the standard Normal distribution N(0,1), the γ-quantile is denoted Φγ

  • For a given a or α, find a value z of N(0,1) that determines the [T − a, T + a] confidence interval or the corresponding γ-quantile for 1 − α

SLIDE 19

Confidence Intervals for Expectations (I)

  • Let X1, …, Xn be a sample from a distribution X with unknown expectation µ and known variance σ²

  • For sufficiently large n, the sample mean $\bar{X}$ is N(µ, σ²/n) distributed and

$$P\left[-z \le \frac{(\bar{X}-\mu)\sqrt{n}}{\sigma} \le z\right] = \Phi(z) - \Phi(-z) = \Phi(z) - (1 - \Phi(z)) = 2\,\Phi(z) - 1 = P\left[\bar{X} - \frac{z\,\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{z\,\sigma}{\sqrt{n}}\right]$$

$$\Rightarrow\; P\left[\bar{X} - \frac{\Phi_{1-\alpha/2}\,\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{\Phi_{1-\alpha/2}\,\sigma}{\sqrt{n}}\right] = 1 - \alpha$$

SLIDE 20

Confidence Intervals for Expectations (I) (cont’d)

  • For a confidence interval $[\bar{X} - a, \bar{X} + a]$, compute $z = \frac{a\sqrt{n}}{\sigma}$ and look up Φ(z) to determine 1 − α = 2 Φ(z) − 1

  • For a confidence level 1 − α, set $z = \Phi_{1-\alpha/2}$ (i.e., z is the (1 − α/2)-quantile of N(0,1)); then $a = \frac{z\,\sigma}{\sqrt{n}}$ determines the confidence interval

$$P\left[\bar{X} - \frac{\Phi_{1-\alpha/2}\,\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{\Phi_{1-\alpha/2}\,\sigma}{\sqrt{n}}\right] = 1 - \alpha$$

SLIDE 21

Confidence Intervals for Expectations (I) (Example)

  • Based on a random sample of n = 100 queries, we observe an average response time of $\bar{X} = 64$. We further know that the standard deviation is σ = 4

  • Q: What is the confidence of the interval 64 ± 0.5?
$$a = 0.5, \quad z = \frac{0.5\,\sqrt{100}}{4} = 1.25, \quad \Phi(1.25) = 0.89435$$
$$1 - \tfrac{\alpha}{2} = 0.89435 \;\Rightarrow\; 1 - \alpha = 0.7887$$
A: 78.87%

  • Q: What is the 99% confidence interval?
$$1 - \alpha = 0.99, \quad \alpha = 0.01, \quad a = \frac{\Phi_{0.995} \times 4}{\sqrt{100}} \approx \frac{2.5758 \times 4}{10} \approx 1.032$$
A: 64 ± 1.032
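Both answers can be reproduced with a short sketch using only the standard Normal cdf (via math.erf); the 0.995-quantile 2.5758 is a standard table value:

```python
import math

def phi(z):
    """Cdf of the standard Normal distribution N(0, 1)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, x_bar, sigma = 100, 64.0, 4.0

# Q1: confidence level of the interval 64 +/- 0.5
a = 0.5
z = a * math.sqrt(n) / sigma  # 1.25
print(2 * phi(z) - 1)         # ~0.7887

# Q2: half-width of the 99% confidence interval (z = Phi_{0.995} = 2.5758)
print(2.5758 * sigma / math.sqrt(n))  # ~1.03
```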

SLIDE 22

Confidence Intervals for Expectations (II)

  • Let X1, …, Xn be an iid. sample from a distribution X with unknown expectation µ, unknown variance σ², but known sample variance S²

  • For sufficiently large n, the random variable
$$T = \frac{(\bar{X} - \mu)\sqrt{n}}{S}$$
has a Student's t distribution with (n − 1) degrees of freedom:
$$f_{T,n}(t) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)\sqrt{n\,\pi}} \left(1 + \frac{t^2}{n}\right)^{-\frac{n+1}{2}}$$
with the Gamma function
$$\Gamma(x) = \int_0^{\infty} e^{-t}\, t^{x-1}\, dt \quad \text{for } x > 0$$

SLIDE 23

Confidence Intervals for Expectations (II) (cont’d)

  • For a confidence interval $[\bar{X} - a, \bar{X} + a]$, compute $t = \frac{a\sqrt{n}}{S}$ and look up $f_{T,n-1}(t)$ to determine 1 − α

  • For a confidence level 1 − α, set $t = t_{n-1,1-\alpha/2}$ (i.e., t is the (1 − α/2)-quantile of the t distribution with n − 1 degrees of freedom); then $a = \frac{t\,S}{\sqrt{n}}$ determines the confidence interval

$$P\left[\bar{X} - t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}}\right] = 1 - \alpha$$
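A minimal sketch of the t-based interval; it assumes SciPy is available for the t quantile, and the sample values are illustrative:

```python
import math
from scipy.stats import t

def t_confidence_interval(sample, level=0.95):
    n = len(sample)
    x_bar = sum(sample) / n
    s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)  # sample variance S^2
    q = t.ppf(1 - (1 - level) / 2, df=n - 1)              # t_{n-1, 1-alpha/2}
    a = q * math.sqrt(s2 / n)
    return x_bar - a, x_bar + a

print(t_confidence_interval([12.1, 11.4, 13.0, 12.7, 11.9]))
```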

SLIDE 24

  • 3. Hypothesis Testing
  • Suppose we throw a coin n times and want to know whether the coin is fair, i.e., P(H) = P(T)

  • Let X1, …, Xn ~ Bernoulli(p) be the iid. coin flips, so that the coin is fair if p = 0.5

  • Let the null hypothesis H0 be "the coin is fair"
  • The alternative hypothesis H1 is then "the coin is not fair"
  • Intuitively, if $|\bar{X} - 0.5|$ is large, we should reject H0

SLIDE 25

Hypothesis Testing Terminology

  • θ = θ0 is called a simple hypothesis
  • θ > θ0 or θ < θ0 is called a compound hypothesis
  • H0 : θ = θ0 vs. H1 : θ ≠ θ0 is called a two-sided test
  • H0 : θ ≤ θ0 vs. H1 : θ > θ0 and H0 : θ ≥ θ0 vs. H1 : θ < θ0 are called one-sided tests

  • Rejection region R: if X ∈ R, reject H0; otherwise retain H0
  • The rejection region is typically defined using a test statistic T and a critical value c:
$$R = \{ X : T(X) > c \}$$

SLIDE 26

p-Values

  • The p-value is the probability of observing values of the test statistic at least as extreme as the one observed, assuming that H0 holds
  • It is not the probability that H0 holds
  • The smaller the p-value, the stronger the evidence against H0, i.e., if we observe a small enough p-value, we can reject H0

  • How small the p-value needs to be depends on the application
  • Typical p-value scale:
  • < 0.01: very strong evidence against H0
  • 0.01 – 0.05: strong evidence against H0
  • 0.05 – 0.10: weak evidence against H0
  • > 0.1: little or no evidence against H0

SLIDE 27

Types of Errors & Statistical Significance

  • Hypothesis tests are often performed at a level of significance α
  • means that H0 is rejected if the p-value is less than α
  • reported as "the result is statistically significant at the α level"
  • specifying p-values is more informative

  • Don't confuse statistical significance with practical significance, e.g.:
"blue hyperlinks increase click rate by 0.0001% over black ones"
"fuel consumption is reduced by 0.0001 l/km by the new part"

             Retain H0        Reject H0
H0 true      OK               Type I Error
H1 true      Type II Error    OK

SLIDE 28

The Wald Test

  • Two-sided test for H0 : θ = θ0 vs. H1 : θ ≠ θ0
  • Test statistic
$$W = \frac{\hat{\theta} - \theta_0}{\hat{se}}$$
with sample estimate $\hat{\theta}$ and $\hat{se} = se(\hat{\theta}) = \sqrt{Var(\hat{\theta})}$

  • Under H0, W converges in distribution to N(0, 1)
  • If w is the observed value of the Wald statistic, the p-value is 2 Φ(−|w|)

SLIDE 29

The Wald Test (Example)

  • We can use the Wald test to test if our coin is fair
  • Suppose the observed sample mean is 0.6 and the observed standard error is 0.049

  • We obtain as test statistic value w = (0.6 − 0.5) / 0.049 ≈ 2.04
  • The p-value is therefore 2 Φ(−|2.04|) = 2 × (1 − 0.97882) ≈ 0.042 (i.e., a fair coin would lead to such an extreme value w only with probability 0.042), which gives us strong evidence to reject the null hypothesis H0
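A minimal sketch of the two-sided Wald test with the example's numbers:

```python
import math

def phi(z):
    """Cdf of the standard Normal distribution N(0, 1)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def wald_test(theta_hat, theta_0, se):
    w = (theta_hat - theta_0) / se  # Wald statistic
    return w, 2 * phi(-abs(w))      # two-sided p-value

w, p = wald_test(0.6, 0.5, 0.049)
print(w, p)  # ~2.04, ~0.041 (0.042 with table-rounded Phi values)
```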

SLIDE 30

Pearson's χ² Test for Multinomial Data

  • Let (X1, …, Xk) ~ Multinomial(n, p); the MLE of p is (X1/n, X2/n, …, Xk/n)

  • Let p0 = (p01, p02, …, p0k) and suppose we want to test H0 : p = p0 vs. H1 : p ≠ p0

  • Pearson's χ² statistic is
$$T = \sum_{j=1}^{k} \frac{(X_j - n\,p_{0j})^2}{n\,p_{0j}} = \sum_{j=1}^{k} \frac{(X_j - E_j)^2}{E_j}$$
with expected value Ej = E[Xj] = n p0j of Xj under H0

  • The p-value is $P(\chi^2_{k-1} > t)$, where t is the observed value of the test statistic and there are (k − 1) degrees of freedom

SLIDE 31

Pearson's χ² Test for Multinomial Data (Example)

  • We can use Pearson's χ² test to test whether a die is fair
  • Suppose after 1,000 throws of the die, we observed
① × 173, ② × 167, ③ × 167, ④ × 176, ⑤ × 167, ⑥ × 150
⇒ p̂ = (0.173, 0.167, 0.167, 0.176, 0.167, 0.150) (based on the MLE)

  • p0 = (0.167, 0.167, 0.167, 0.167, 0.167, 0.167)
  • T = 2.43 ⇒ the p-value is 0.80, giving us no evidence to reject H0
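The die example in a short sketch; it assumes SciPy, whose chisquare defaults to uniform expected counts, matching p0 here:

```python
from scipy.stats import chisquare

observed = [173, 167, 167, 176, 167, 150]  # counts per face after 1,000 throws
t_stat, p_value = chisquare(observed)      # expected: 1000/6 per face under H0
print(t_stat, p_value)                     # ~2.43, ~0.79
```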
SLIDE 32

Pearson's χ² Test of Independence

  • Pearson's χ² test can also be used to test whether two random variables X and Y are independent

  • Let X1, …, Xn and Y1, …, Yn be the two samples
  • Divide the outcomes into r (for X) and c (for Y) disjoint intervals
  • Populate an r-by-c table O with frequencies, so that Olk tells how many (Xi, Yi) pairs fall into the l-th and k-th interval, respectively

  • Assuming independence (H0), the expected value of Olk is
$$E_{lk} = \frac{\left(\sum_{i=1}^{c} O_{li}\right) \left(\sum_{j=1}^{r} O_{jk}\right)}{\sum_{j=1}^{r} \sum_{i=1}^{c} O_{ji}}$$
i.e., row total × column total divided by the total number of observations

SLIDE 33

Pearson's χ² Test of Independence (cont'd)

  • The value of the test statistic is
$$\chi^2 = \sum_{l=1}^{r} \sum_{k=1}^{c} \frac{(O_{lk} - E_{lk})^2}{E_{lk}}$$
  • There are (r − 1)(c − 1) degrees of freedom
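A minimal sketch of the independence test on a contingency table; it assumes SciPy, and the 2-by-2 counts are purely illustrative (correction=False yields the plain statistic above, without Yates continuity correction):

```python
from scipy.stats import chi2_contingency

O = [[30, 10],  # rows: intervals of X, columns: intervals of Y
     [20, 40]]
chi2, p_value, dof, expected = chi2_contingency(O, correction=False)
print(chi2, p_value, dof)  # dof = (r-1)(c-1) = 1
print(expected)            # E_lk = row total * column total / n
```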

SLIDE 34

Summary of II.3

  • Statistical inference draws conclusions about a population based on a sample
  • Empirical distribution function and histograms as non-parametric estimation methods
  • Method of moments and maximum likelihood as parametric estimation methods
  • Confidence intervals
  • Wald test and Pearson's χ² test for hypothesis testing

SLIDE 35

Normal Distribution Table


SLIDE 36

χ² Distribution Table


SLIDE 37

Student’s t Distribution Table
