Chapter 3: Basics from Probability Theory and Statistics


SLIDE 1

IRDM WS 2015

Chapter 3: Basics from Probability Theory and Statistics

3.1 Probability Theory

Events, Probabilities, Bayes' Theorem, Random Variables, Distributions, Moments, Tail Bounds, Central Limit Theorem, Entropy Measures

3.2 Statistical Inference

Sampling, Parameter Estimation, Maximum Likelihood, Confidence Intervals, Hypothesis Testing, p-Values, Chi-Square Test, Linear and Logistic Regression

(mostly following L. Wasserman, Chapters 6, 9, 10, 13)

SLIDE 2

3.2 Statistical Inference

A statistical model is a set of distributions (or regression functions), e.g., all unimodal, smooth distributions.

A parametric model is a set that is completely described by a finite number of parameters (e.g., the family of Normal distributions).

Statistical inference: given a sample X1, ..., Xn, how do we infer the distribution or its parameters within a given model?

For multivariate models with one specific "outcome (response)" variable Y, this is called prediction or regression; for a discrete outcome variable it is also called classification. r(x) = E[Y | X=x] is called the regression function.

Example for classification: biomedical markers → cancer or not

Example for regression: business indicators → stock price

SLIDE 3

Sampling Illustrated

Distribution X (population of interest)

Samples X1, X2, …, Xn

Statistical inference: what can we say about X based on X1, X2, …, Xn?

Example: estimate the average salary in Germany.

  • Approach 1: ask your 10 neighbors
  • Approach 2: ask 100 random people you spot on the Internet
  • Approach 3: ask all 1000 living Germans in Wikipedia
  • Approach 4: ask 1000 random people from all age groups, jobs, …

SLIDE 4

Basic Types of Statistical Inference

Given: independent and identically distributed (iid) samples X1, X2, …, Xn from (unknown) distribution X

  • Parameter estimation:
    What is the parameter p of a Bernoulli coin?
    What are the values of μ and σ of a Normal distribution?
    What are the mixture weights and rates (e.g. π1, π2, λ1, λ2) of a Poisson mixture?
  • Confidence intervals:
    What is the interval [mean ± tolerance] s.t. the expectation of my observations or measurements falls into the interval with high confidence?
  • Hypothesis testing:
    H0: p = 1/2 (fair coin) vs. H1: p ≠ 1/2
    H0: p1 = p2 (methods have same precision) vs. H1: p1 ≠ p2
  • Regression (for parameter fitting)

SLIDE 5

3.2.1 Statistical Parameter Estimation

A point estimator for a parameter θ of a prob. distribution is a random variable derived from a random sample X1, ..., Xn.

Examples:

Sample mean: $\bar{X} := \frac{1}{n} \sum_{i=1}^{n} X_i$

Sample variance: $S^2 := \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$

An estimator T for parameter θ is unbiased if $E[T] = \theta$; otherwise the estimator has bias $E[T] - \theta$.

An estimator on a sample of size n is consistent if $\lim_{n \to \infty} P[\,|T - \theta| \le \varepsilon\,] = 1$ for each $\varepsilon > 0$.

Sample mean and sample variance are unbiased, consistent estimators with minimal variance.
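The two example estimators can be sketched in a few lines of Python. This is only an illustration: the function names and the data values are made up for the example and are not part of the lecture.

```python
def sample_mean(xs):
    """Sample mean: (1/n) * sum of X_i."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Sample variance with the 1/(n-1) normalization (unbiased)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


# Hypothetical data, chosen so the arithmetic is easy to check by hand.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
x_bar = sample_mean(data)    # 40 / 8
s2 = sample_variance(data)   # 32 / 7
print(x_bar, s2)
```

Dividing by n-1 rather than n is what makes the variance estimator unbiased; dividing by n gives the (biased) plug-in and maximum-likelihood variant that appears later in the section.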

SLIDE 6

Estimation Error

Let $\hat{\theta}_n = T(X_1, \ldots, X_n)$ be an estimator for parameter θ over sample X1, ..., Xn. The distribution of $\hat{\theta}_n$ is called the sampling distribution.

The standard error for $\hat{\theta}_n$ is:

$$se(\hat{\theta}_n) = \sqrt{Var[\hat{\theta}_n]}$$

The mean squared error (MSE) for $\hat{\theta}_n$ is:

$$MSE(\hat{\theta}_n) = E[(\hat{\theta}_n - \theta)^2] = bias^2(\hat{\theta}_n) + Var[\hat{\theta}_n]$$

If bias → 0 and se → 0 then the estimator is consistent.

The estimator is asymptotically Normal if $(\hat{\theta}_n - \theta) / se$ converges in distribution to standard Normal N(0,1).
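The decomposition of the MSE into squared bias and variance follows by adding and subtracting $E[\hat{\theta}_n]$. A short derivation, written with the abbreviation $m := E[\hat{\theta}_n]$ (a symbol introduced here only for brevity):

```latex
\begin{align*}
MSE(\hat{\theta}_n)
  &= E[(\hat{\theta}_n - \theta)^2]
   = E[((\hat{\theta}_n - m) + (m - \theta))^2] \\
  &= E[(\hat{\theta}_n - m)^2]
     + 2\,(m - \theta)\,E[\hat{\theta}_n - m]
     + (m - \theta)^2 \\
  &= Var[\hat{\theta}_n] + 0 + bias^2(\hat{\theta}_n)
\end{align*}
```

The cross term vanishes because $E[\hat{\theta}_n - m] = 0$, and $m - \theta$ is exactly the bias.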

SLIDE 7

Nonparametric Estimation

The empirical distribution function $\hat{F}_n$ is the cdf that puts prob. mass 1/n at each data point Xi:

$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x)$$

where indicator function $I(X_i \le x)$ is 1 if $X_i \le x$ and 0 otherwise.

A statistical functional T(F) is any function of F, e.g., mean, variance, skewness, median, quantiles, correlation.

The plug-in estimator of θ = T(F) is:

$$\hat{\theta}_n = T(\hat{F}_n)$$
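A minimal Python sketch of the empirical distribution function and two plug-in estimators (variance and a quantile). The helper names and the sample are hypothetical; the quantile helper returns the smallest data point x with $\hat{F}_n(x) \ge \gamma$, which is one common convention.

```python
def ecdf(sample):
    """Return the empirical cdf F_hat_n as a function of x."""
    xs = list(sample)
    n = len(xs)

    def f_hat(x):
        # Fraction of data points X_i with X_i <= x.
        return sum(1 for v in xs if v <= x) / n

    return f_hat


def plug_in_variance(sample):
    """Variance of the empirical distribution (1/n normalization)."""
    n = len(sample)
    m = sum(sample) / n
    return sum((x - m) ** 2 for x in sample) / n


def plug_in_quantile(sample, gamma):
    """Smallest data point x with F_hat_n(x) >= gamma."""
    f_hat = ecdf(sample)
    for x in sorted(sample):
        if f_hat(x) >= gamma:
            return x
    return max(sample)


data = [3, 1, 4, 1, 5]          # hypothetical sample
F = ecdf(data)
median = plug_in_quantile(data, 0.5)
print(F(1), F(3.5), median, plug_in_variance(data))
```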

SLIDE 8

Nonparametric Estimation: Histograms

Instead of the full empirical distribution, often compact data synopses may be used, such as histograms where X1, ..., Xn are grouped into m cells (buckets or bins) c1, ..., cm with bucket boundaries lb(ci) and ub(ci) s.t. lb(c1) = −∞, ub(cm) = ∞, ub(ci) = lb(ci+1) for 1 ≤ i < m, and

$$freq(c_i) = \frac{1}{n} \sum_{j=1}^{n} I(lb(c_i) < X_j \le ub(c_i))$$

Histograms provide a (discontinuous) density estimator.

Example:

X1 = X2 = 1
X3 = X4 = X5 = 2
X6 = … = X10 = 3
X11 = … = X14 = 4
X15 = … = X17 = 5
X18 = X19 = 6
X20 = 7
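The bucket frequencies for the 20 example values can be computed as follows. The bucket boundaries (2, 4, 6) are an assumption made for this sketch, since the slide's own bucketing is not recoverable from the text; buckets are taken as half-open intervals (lb, ub].

```python
from bisect import bisect_left


def histogram(sample, inner_bounds):
    """Relative frequencies for buckets (-inf, b1], (b1, b2], ..., (bm, inf).

    inner_bounds must be sorted ascending; m inner bounds give m+1 buckets.
    """
    counts = [0] * (len(inner_bounds) + 1)
    for x in sample:
        # bisect_left returns the index of the first bound >= x,
        # i.e. the bucket whose upper bound is the first one not below x.
        counts[bisect_left(inner_bounds, x)] += 1
    n = len(sample)
    return counts, [c / n for c in counts]


# The 20 example values from the slide.
data = [1] * 2 + [2] * 3 + [3] * 5 + [4] * 4 + [5] * 3 + [6] * 2 + [7]
counts, freqs = histogram(data, [2, 4, 6])   # assumed boundaries
print(counts, freqs)
```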

SLIDE 9

Different Kinds of Histograms

[Figures: histogram with equidistant buckets; histogram with non-equidistant buckets. Sources: en.wikipedia.org, de.wikipedia.org]

SLIDE 10

Method of Moments

  • Suppose parameter θ = (θ1, …, θk) has k components
  • Compute j-th moment for 1 ≤ j ≤ k: $\alpha_j(\theta) = E_\theta[X^j]$
  • Compute j-th sample moment for 1 ≤ j ≤ k: $\hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j$
  • Method-of-moments estimate $\hat{\theta}$ of θ is obtained by solving a system of k equations in k unknowns: $\alpha_j(\hat{\theta}) = \hat{\alpha}_j$ for 1 ≤ j ≤ k

Method-of-moments estimators are usually consistent and asymptotically Normal, but may be biased.

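SLIDE 11

Example: Method of Moments

Let X1, …, Xn ~ Normal(μ, σ²)

$$\alpha_1 = E_\theta[X] = \mu$$

$$\alpha_2 = E_\theta[X^2] = Var[X] + E[X]^2 = \sigma^2 + \mu^2$$

Solve the equation system:

$$\hat{\mu} = \alpha_1 = \hat{\alpha}_1 = \frac{1}{n} \sum_{i=1}^{n} X_i$$

$$\hat{\sigma}^2 + \hat{\mu}^2 = \alpha_2 = \hat{\alpha}_2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2$$

Solution:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}$$

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$$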
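The same computation as a small Python sketch: solve the two moment equations for μ and σ². The function name and the data are illustrative assumptions.

```python
def normal_method_of_moments(xs):
    """Method-of-moments estimates (mu_hat, sigma2_hat) for Normal(mu, sigma^2)."""
    n = len(xs)
    a1 = sum(xs) / n                    # first sample moment
    a2 = sum(x * x for x in xs) / n     # second sample moment
    mu_hat = a1                         # from alpha_1 = mu
    sigma2_hat = a2 - a1 ** 2           # from alpha_2 = sigma^2 + mu^2
    return mu_hat, sigma2_hat


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical sample
mu_hat, sigma2_hat = normal_method_of_moments(data)
print(mu_hat, sigma2_hat)
```

Note that $\hat{\sigma}^2$ uses the factor 1/n, so it is slightly smaller than the unbiased sample variance S² with factor 1/(n-1); this is an instance of a biased method-of-moments estimator.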

SLIDE 12

Parametric Inference: Maximum Likelihood Estimators (MLE)

Estimate parameter θ of a postulated distribution f(θ,x) such that the probability that the data of the sample are generated by this distribution is maximized.

→ Maximum likelihood estimation: Maximize L(x1,...,xn, θ) = P[x1, ..., xn originate from f(θ,x)]

often written as

$$\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta, x_1, \ldots, x_n) = \arg\max_{\theta} \prod_{i=1}^{n} f(x_i, \theta)$$

or maximize log L

if analytically intractable → use numerical iteration methods

SLIDE 13

MLE Properties

Maximum Likelihood Estimators are consistent, asymptotically Normal, and asymptotically optimal in the following sense:

Consider two estimators U and T which are asymptotically Normal. Let $\sigma_U^2$ and $\sigma_T^2$ denote the variances of the two Normal distributions to which U and T converge in distribution. The asymptotic relative efficiency of U to T is $ARE(U,T) = \sigma_T^2 / \sigma_U^2$.

Theorem: For an MLE $\hat{\theta}_n$ and any other estimator $\tilde{\theta}_n$ the following inequality holds:

$$ARE(\tilde{\theta}_n, \hat{\theta}_n) \le 1$$

SLIDE 14

Simple Example for Maximum Likelihood Estimator

given:

  • coin with Bernoulli distribution with unknown parameter p for head, 1-p for tail
  • sample (data): k times head with n coin tosses

needed: maximum likelihood estimation of p

Let L(k, n, p) = P[sample is generated from distr. with param. p]

$$L(k, n, p) = \binom{n}{k} p^k (1-p)^{n-k}$$

Maximize log-likelihood function log L(k, n, p):

$$\log L = \log \binom{n}{k} + k \log p + (n-k) \log (1-p)$$

$$\frac{\partial \log L}{\partial p} = \frac{k}{p} - \frac{n-k}{1-p} = 0 \quad \Rightarrow \quad \hat{p} = \frac{k}{n}$$
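As a sanity check, the closed-form estimate k/n can be compared with a brute-force maximization of the log-likelihood over a grid of candidate values of p. The numbers k = 15 and n = 20 and the grid resolution are assumptions for this sketch.

```python
from math import comb, log


def log_likelihood(p, k, n):
    """log L(k, n, p) for k heads in n tosses, 0 < p < 1."""
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)


def bernoulli_mle(k, n):
    """Closed-form MLE from setting the derivative of log L to zero."""
    return k / n


k, n = 15, 20                                  # hypothetical sample
p_hat = bernoulli_mle(k, n)
grid = [i / 1000 for i in range(1, 1000)]      # candidate values 0.001 .. 0.999
p_grid = max(grid, key=lambda p: log_likelihood(p, k, n))
print(p_hat, p_grid)
```

The grid search is only there to illustrate that the analytic solution really is the maximizer; it also hints at what "numerical iteration methods" do when no closed form exists.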

SLIDE 15

Advanced Example for Maximum Likelihood Estimator

given:

  • Poisson distribution with parameter λ (expectation)
  • sample (data): numbers x1, ..., xn ∈ N0

needed: maximum likelihood estimation of λ

Let r be the largest among these numbers, and let f0, ..., fr be the absolute frequencies of numbers 0, ..., r.

$$L(x_1, \ldots, x_n, \lambda) = \prod_{i=0}^{r} \left( e^{-\lambda} \frac{\lambda^i}{i!} \right)^{f_i}$$

$$\frac{\partial \ln L}{\partial \lambda} = \sum_{i=0}^{r} f_i \left( \frac{i}{\lambda} - 1 \right) = 0$$

$$\hat{\lambda} = \frac{\sum_{i=0}^{r} i \, f_i}{\sum_{i=0}^{r} f_i} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}$$
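A short Python sketch of this estimator, working from the absolute frequencies f_i as in the derivation. The sample values are invented for illustration; the `score` function evaluates the derivative of ln L so that one can check it vanishes at the estimate.

```python
from collections import Counter


def poisson_mle(xs):
    """MLE of lambda: sum_i i*f_i / sum_i f_i, i.e. the sample mean."""
    freq = Counter(xs)                       # absolute frequencies f_i
    return sum(i * f for i, f in freq.items()) / sum(freq.values())


def score(lam, xs):
    """Derivative of ln L with respect to lambda: sum_i f_i * (i/lambda - 1)."""
    freq = Counter(xs)
    return sum(f * (i / lam - 1) for i, f in freq.items())


sample = [0, 1, 1, 2, 2, 2, 3, 5]            # hypothetical counts
lam_hat = poisson_mle(sample)
print(lam_hat, score(lam_hat, sample))
```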

SLIDE 16

Sophisticated Example for Maximum Likelihood Estimator

given:

  • discrete uniform distribution over [1,θ] ⊆ N0 and density f(x) = 1/θ
  • sample (data): numbers x1, ..., xn ∈ N0

MLE for θ is max{x1, ..., xn} (see Wasserman p. 124): the likelihood is $\theta^{-n}$ for θ ≥ max{x1, ..., xn} and 0 otherwise, so it is maximized by the smallest admissible θ.

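SLIDE 17

MLE for Parameters of Normal Distributions

$$L(x_1, \ldots, x_n, \mu, \sigma^2) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{n} \prod_{i=1}^{n} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$$

$$\frac{\partial \ln(L)}{\partial \mu} = \frac{1}{2\sigma^2} \sum_{i=1}^{n} 2 (x_i - \mu) = 0$$

$$\frac{\partial \ln(L)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0$$

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$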

SLIDE 18

Analytically Non-tractable MLE for parameters of Multivariate Normal Mixture

consider samples from a mixture of multivariate Normal distributions with the density (e.g. height and weight of males and females):

$$f(\vec{x}, \pi_1, \ldots, \pi_k, \vec{\mu}_1, \ldots, \vec{\mu}_k, \Sigma_1, \ldots, \Sigma_k) = \sum_{j=1}^{k} \pi_j \, n(\vec{x}, \vec{\mu}_j, \Sigma_j) = \sum_{j=1}^{k} \pi_j \frac{1}{\sqrt{(2\pi)^m |\Sigma_j|}} \, e^{-\frac{1}{2} (\vec{x} - \vec{\mu}_j)^T \Sigma_j^{-1} (\vec{x} - \vec{\mu}_j)}$$

with mixture weights $\pi_j$ (subscripted, as opposed to the constant π in $(2\pi)^m$), expectation values $\vec{\mu}_j$ and invertible, positive definite, symmetric m×m covariance matrices $\Sigma_j$

maximize log-likelihood function:

$$\log L(x_1, \ldots, x_n, \theta) := \log \prod_{i=1}^{n} P[x_i \mid \theta] = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j \, n(\vec{x}_i, \vec{\mu}_j, \Sigma_j)$$

SLIDE 19

Expectation-Maximization Method (EM)

When L(θ, X1, ..., Xn) is analytically intractable then

  • introduce latent (non-observable) random variable(s) Z such that the joint distribution J(X1, ..., Xn, Z, θ) of "complete" data is tractable
  • iteratively compute:
  • Expectation (E Step): compute the expected complete data likelihood $E_Z[\log J(X_1, \ldots, X_n, Z \mid \theta)]$, where the expectation over Z is taken under the previous estimate $\theta^{(t)}$ of θ
  • Maximization (M Step): estimate $\theta^{(t+1)}$ as the θ that maximizes this expected complete data likelihood

details depend on distribution at hand (often mixture models)

convergence to a local maximum is guaranteed, but the problem is non-convex → numerical methods
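To make the E and M steps concrete, here is a minimal sketch of EM for a mixture of univariate Normal distributions. This simplifies the multivariate mixture of the previous slide to one dimension; the function names, the starting values, the fixed iteration count and the small variance floor are all assumptions of this sketch, not part of the lecture. The latent variable Z is the (unobserved) component that generated each point.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mu, var):
    """Density of Normal(mu, var) at x."""
    return exp(-((x - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def em_gaussian_mixture(xs, weights, means, variances, iterations=50):
    """EM for a 1-D Normal mixture; returns (weights, means, variances)."""
    weights, means, variances = list(weights), list(means), list(variances)
    k, n = len(weights), len(xs)
    for _ in range(iterations):
        # E step: responsibility resp[i][j] = P[Z_i = j | x_i, current theta].
        resp = []
        for x in xs:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M step: re-estimate parameters as responsibility-weighted averages.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)   # floor to avoid degenerate components
    return weights, means, variances


# Hypothetical data: two well-separated groups around 1 and 5.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
w, m, v = em_gaussian_mixture(xs, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
print(w, m, v)
```

Because the likelihood is non-convex, the result depends on the starting values; in practice one would typically run EM from several initializations and stop when the log-likelihood no longer improves rather than after a fixed number of iterations.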

SLIDE 20

Bayesian Viewpoint of Parameter Estimation

  • assume prior distribution g(θ) of parameter θ
  • choose statistical model (generative model) f(x | θ) that reflects our beliefs about RV X
  • given RVs X1, ..., Xn for observed data, the posterior distribution is h(θ | x1, ..., xn)

for X1=x1, ..., Xn=xn the likelihood is

$$L(x_1 \ldots x_n, \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = \prod_{i=1}^{n} \frac{h(\theta \mid x_i) \sum_{\theta'} f(x_i \mid \theta') \, g(\theta')}{g(\theta)}$$

which implies

$$h(\theta \mid x_1 \ldots x_n) \propto L(x_1 \ldots x_n, \theta) \, g(\theta)$$

(posterior is proportional to likelihood times prior)

MAP estimator (maximum a posteriori): compute θ that maximizes h(θ | x1, …, xn) given a prior for θ

SLIDE 21

3.2.2 Confidence Intervals

Estimator T for an interval for parameter θ such that

$$P[T - a \le \theta \le T + a] = 1 - \alpha$$

[T-a, T+a] is the confidence interval and 1-α is the confidence level.

For the distribution of random variable X a value $x_\gamma$ (0 < γ < 1) with

$$P[X \le x_\gamma] \ge \gamma \;\wedge\; P[X \ge x_\gamma] \ge 1 - \gamma$$

is called a γ quantile; the 0.5 quantile is called the median. For the normal distribution N(0,1) the γ quantile is denoted $\Phi_\gamma$.

[Figure: N(0,1) density with shaded area Φ(a) = γ to the left of a = $\Phi_\gamma$]

SLIDE 22

Confidence Intervals for Expectations (1)

Let x1, ..., xn be a sample from a distribution with unknown expectation μ and known variance σ². For sufficiently large n the sample mean $\bar{X}$ is N(μ, σ²/n) distributed and $(\bar{X} - \mu)\sqrt{n} / \sigma$ is N(0,1) distributed:

$$P\left[-z \le \frac{(\bar{X} - \mu)\sqrt{n}}{\sigma} \le z\right] = \Phi(z) - \Phi(-z) = \Phi(z) - (1 - \Phi(z)) = 2\Phi(z) - 1$$

$$= P\left[\bar{X} - \frac{z\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{z\sigma}{\sqrt{n}}\right]$$

$$P\left[\bar{X} - \frac{\Phi_{1-\alpha/2}\,\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{\Phi_{1-\alpha/2}\,\sigma}{\sqrt{n}}\right] = 1 - \alpha$$

For required confidence interval $[\bar{X} - a, \bar{X} + a]$ set $z := a\sqrt{n} / \sigma$, then look up Φ(z) to determine 1−α/2

or for confidence level 1−α set z := (1−α/2) quantile of N(0,1), then set $a := z\sigma / \sqrt{n}$ to determine interval

SLIDE 23

Normal Distribution Table

SLIDE 24

Confidence Intervals for Expectations (2)

Let x1, ..., xn be a sample from a distribution with unknown expectation μ and unknown variance σ² and sample variance S². For sufficiently large n the random variable

$$T := \frac{(\bar{X} - \mu)\sqrt{n}}{S}$$

has a t distribution (Student distribution) with n-1 degrees of freedom. The density of the t distribution with n degrees of freedom is:

$$f_{T,n}(t) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)\sqrt{n\pi}} \left(1 + \frac{t^2}{n}\right)^{-\frac{n+1}{2}}$$

with the Gamma function:

$$\Gamma(x) = \int_0^\infty e^{-t} \, t^{x-1} \, dt \quad \text{for } x > 0$$

(with the properties Γ(1) = 1 and Γ(x+1) = x Γ(x))

$$P\left[\bar{X} - \frac{t_{n-1,1-\alpha/2}\, S}{\sqrt{n}} \le \mu \le \bar{X} + \frac{t_{n-1,1-\alpha/2}\, S}{\sqrt{n}}\right] = 1 - \alpha$$

SLIDE 25

Student's t Distribution Table

William Gosset (1876-1937)

A. Student: The Probable Error of a Mean, Biometrika 6(1), 1908

SLIDE 26

Example: Confidence Interval for Expectation

X: time for student to solve exercise

n=16 samples, $\bar{X}$ = 2.5, S² = 0.25

A) Assume σ² is known: σ² = 0.25
A1) Estimate μ ± 0.2
A2) Estimate μ with 1−α = 0.9 confidence

B) Assume σ² is unknown
B1) Estimate μ ± 0.2
B2) Estimate μ with 1−α = 0.9 confidence

Recall for known variance:

$$P\left[-z \le \frac{(\bar{X} - \mu)\sqrt{n}}{\sigma} \le z\right] = 2\Phi(z) - 1 = P\left[\bar{X} - \frac{z\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{z\sigma}{\sqrt{n}}\right]$$

for interval $[\bar{X} - a, \bar{X} + a]$: set $z := a\sqrt{n} / \sigma$, then look up Φ(z) to determine 1−α/2

for confidence 1−α: set z := (1−α/2) quantile of N(0,1), then set $a := z\sigma / \sqrt{n}$ to determine interval
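Part A of the example (known variance) can be worked through with Python's standard library, which provides the N(0,1) cdf and quantile function. The helper names are assumptions of this sketch. Part B would need quantiles of the t distribution with n-1 = 15 degrees of freedom, which the standard library does not provide, so it is left to a table lookup or a statistics package.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()   # N(0,1)


def confidence_for_halfwidth(a, n, sigma):
    """Confidence level 1-alpha of the interval [mean - a, mean + a]."""
    z = a * sqrt(n) / sigma
    return 2 * STD_NORMAL.cdf(z) - 1


def halfwidth_for_confidence(level, n, sigma):
    """Half-width a such that [mean - a, mean + a] has the given confidence."""
    alpha = 1 - level
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)    # (1 - alpha/2) quantile of N(0,1)
    return z * sigma / sqrt(n)


n, x_bar, sigma = 16, 2.5, sqrt(0.25)

# A1) interval 2.5 +/- 0.2: z = 0.2 * 4 / 0.5 = 1.6
conf_a1 = confidence_for_halfwidth(0.2, n, sigma)

# A2) confidence 0.9: z is the 0.95 quantile of N(0,1)
a_a2 = halfwidth_for_confidence(0.9, n, sigma)
interval_a2 = (x_bar - a_a2, x_bar + a_a2)
print(conf_a1, a_a2, interval_a2)
```

Under these assumptions, A1 gives a confidence level of roughly 0.89, and A2 gives a half-width of roughly 0.21 around the sample mean 2.5.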

SLIDE 27

3.2.3 Hypothesis Testing

Hypothesis testing:

  • aims to falsify some hypothesis by lack of statistical evidence
  • design of test RV (test statistic) and its (approx. / limit) distribution

Example:

  • Toss a coin n times and judge if the coin is fair: X1, …, Xn ~ Bernoulli(p), coin is fair if p = 0.5
  • Let the null hypothesis H0 be "the coin is fair"
  • The alternative hypothesis H1 is then "the coin is not fair"
  • Intuitively, if $|\bar{X} - 0.5|$ is large, we should reject H0

H0 is default, interest is in H1: aim to reject H0 (e.g. suspecting that the coin is unfair)

SLIDE 28

Hypothesis Testing Terminology (1)

A hypothesis test determines a probability 1-α (test level α, significance level) that a sample X1, ..., Xn from some unknown probability distribution has a certain property.

Examples:

1) The sample originates from a normal distribution.
2) Under the assumption of a normal distribution the sample originates from a N(μ, σ²) distribution.
3) Two random variables are independent.
4) Two random variables are identically distributed.
5) Parameter λ of a Poisson distribution from which the sample stems has value 5.

General form: null hypothesis H0 vs. alternative hypothesis H1

needs test variable (test statistic) X (derived from X1, ..., Xn, H0, H1) and test region R with X ∈ R for rejecting H0 and X ∉ R for retaining H0

H0 is default, interest is in H1

             Retain H0        Reject H0
H0 true      ✓                type I error
H1 true      type II error    ✓

SLIDE 29

Hypothesis Testing Terminology (2)

  • θ = θ0 is called a simple hypothesis
  • θ > θ0 or θ < θ0 is called a composite hypothesis
  • H0 : θ = θ0 vs. H1 : θ ≠ θ0 is called a two-sided test
  • H0 : θ ≤ θ0 vs. H1 : θ > θ0 and H0 : θ ≥ θ0 vs. H1 : θ < θ0 are called one-sided tests
  • Rejection region R : if X ∈ R, reject H0, otherwise retain H0
  • The rejection region is typically defined using a test statistic T and a critical value c: $R = \{x : T(x) > c\}$

SLIDE 30

p-Value

Suppose that for every level α ∈ (0,1) there is a test with rejection region $R_\alpha$. Then the p-value is the smallest level at which we can reject H0:

$$p\text{-value} = \inf\{\alpha \mid T(X_1, \ldots, X_n) \in R_\alpha\}$$

small p-value means strong evidence against H0

p-value: prob. of a test statistic (sample) at least as extreme as the observed data under H0

Caution: p-value ≠ P[H0|data]

typical interpretation of p-values:

  • < 0.01: very strong evidence against H0
  • 0.01 to 0.05: strong evidence against H0
  • 0.05 to 0.10: weak evidence against H0
  • > 0.1: little or no evidence against H0

SLIDE 31

Hypothesis Testing Example

Null hypothesis for n coin tosses: coin is fair or has head probability p = p0; alternative hypothesis: p ≠ p0

Test variable: X, the #heads, is N(pn, p(1-p)n) distributed (by the Central Limit Theorem), thus

$$Z := \frac{(X/n - p)\sqrt{n}}{\sqrt{p(1-p)}}$$

is N(0, 1) distributed

Rejection of null hypothesis at test level α (e.g. 0.05) if

$$Z > \Phi_{1-\alpha/2} \quad \text{or} \quad Z < \Phi_{\alpha/2}$$

[Figure: N(0,1) density with the rejection region in both tails, below $\Phi_{\alpha/2}$ and above $\Phi_{1-\alpha/2}$]
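A sketch of this two-sided test in Python, using the standard library's N(0,1) quantiles. The function name and the example of 60 heads in 100 tosses are assumptions; note that Z is computed with the hypothesized p0, since the test statistic's distribution is derived under H0.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()   # N(0,1)


def coin_z_test(heads, n, p0=0.5, alpha=0.05):
    """Two-sided test of H0: p = p0. Returns (z, p_value, reject_h0)."""
    z = (heads / n - p0) * sqrt(n) / sqrt(p0 * (1 - p0))
    critical = STD_NORMAL.inv_cdf(1 - alpha / 2)     # Phi_{1 - alpha/2}
    p_value = 2 * STD_NORMAL.cdf(-abs(z))            # both tails
    return z, p_value, abs(z) > critical


# Hypothetical data: 60 heads in 100 tosses, tested at level 0.05.
z, p_value, reject = coin_z_test(60, 100)
print(z, p_value, reject)
```

By symmetry of N(0,1), the condition "Z above the upper quantile or below the lower quantile" is the same as |Z| exceeding the upper quantile, which is what the code checks.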

SLIDE 32

Wald Test

for testing H0: θ = θ0 vs. H1: θ ≠ θ0 use the test variable

$$W = \frac{\hat{\theta} - \theta_0}{se(\hat{\theta})}$$

with sample estimate $\hat{\theta}$ and standard error $se(\hat{\theta}) = \sqrt{Var[\hat{\theta}]}$

W converges in distribution to N(0,1) → reject H0 at level α when $W > \Phi_{1-\alpha/2}$ or $W < \Phi_{\alpha/2}$

generalization (for unknown variance): t-test (based on Student's t distribution)

the p-value for the Wald test is 2Φ(−|w|) where w is the value of the test variable W

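SLIDE 33

Example: Wald Test

n=20 coin tosses X1, …, Xn with 15 times heads

H0: p=0.5 (coin is fair) vs. H1: p≠0.5

sample mean: $\hat{p}$ = 0.75, variance $Var[\hat{p}] = n\hat{p}(1-\hat{p}) / n^2 = 3/320$ (not the variance, but the sample variance)

Test statistic, rounding 3/320 ≈ 1/100:

$$W = \frac{\hat{p} - p_0}{se(\hat{p})} \approx \frac{0.25}{\sqrt{1/100}} \approx 2.5$$

Test level α=0.1: $W > \Phi_{1-\alpha/2} = \Phi_{0.95}$ or $W < \Phi_{\alpha/2} = \Phi_{0.05}$
Test: 2.5 > 1.65 → reject H0

Test level α=0.01: $W > \Phi_{1-\alpha/2} = \Phi_{0.995}$ or $W < \Phi_{\alpha/2} = \Phi_{0.005}$
Test: 2.5 < 2.58 → retain H0

p-value in between

(Without the rounding, W ≈ 2.582, which marginally exceeds $\Phi_{0.995}$ ≈ 2.576, so the conclusion at α=0.01 depends on the rounding.)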
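The example can be recomputed without rounding in a few lines of Python. The function name is an assumption of this sketch. With the exact standard error $\sqrt{3/320}$ the statistic comes out near 2.58 rather than 2.5, i.e. essentially at the critical value for α = 0.01, so the decision at that level is sensitive to how the numbers are rounded.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()   # N(0,1)


def wald_test(theta_hat, theta0, se):
    """Return (W, p_value) for H0: theta = theta0, with p-value 2*Phi(-|w|)."""
    w = (theta_hat - theta0) / se
    return w, 2 * STD_NORMAL.cdf(-abs(w))


n, heads = 20, 15
p_hat = heads / n                              # 0.75
se_hat = sqrt(p_hat * (1 - p_hat) / n)         # sqrt(3/320), estimated from the sample
w, p_value = wald_test(p_hat, 0.5, se_hat)

crit_010 = STD_NORMAL.inv_cdf(1 - 0.10 / 2)    # about 1.645
crit_001 = STD_NORMAL.inv_cdf(1 - 0.01 / 2)    # about 2.576
print(w, p_value, crit_010, crit_001)
```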

SLIDE 34

t-Test

Given: n samples for θ with sample mean $\hat{\theta}$ and sample standard deviation $S(\hat{\theta})$

for testing H0: θ = θ0 vs. H1: θ ≠ θ0 use the test variable

$$T = \frac{\hat{\theta} - \theta_0}{se(\hat{\theta})}$$

with sample estimate $\hat{\theta}$ and standard error $se(\hat{\theta}) = \sqrt{S^2(\hat{\theta})}$

T converges in distribution to a t-distribution with n-1 degrees of freedom → reject H0 at level α when $T > t_{n-1,1-\alpha/2}$ or $T < t_{n-1,\alpha/2}$

Extensions for

  • two-sample tests: comparing two independent samples
  • paired two-sample tests: for testing differences (ordering) of RVs

t-test is the most widely used test for statistical significance of experimental data

SLIDE 35

Paired t-Test Tools

https://www.usablestats.com/calcs/2samplet

or use software like Matlab, R, etc.

SLIDE 36

Chi-Square Distribution

Let X1, ..., Xn be independent, N(0,1) distributed random variables. Then the random variable

$$\chi_n^2 := X_1^2 + \ldots + X_n^2$$

is chi-square distributed with n degrees of freedom:

$$f_{\chi_n^2}(x) = \frac{x^{\frac{n}{2}-1} \, e^{-\frac{x}{2}}}{2^{\frac{n}{2}} \, \Gamma\left(\frac{n}{2}\right)} \quad \text{for } x > 0, \quad 0 \text{ otherwise}$$

Let n be a natural number, let X be N(0,1) distributed and Y χ² distributed with n degrees of freedom, with X and Y independent. Then the random variable

$$T_n := \frac{X \sqrt{n}}{\sqrt{Y}}$$

is t distributed with n degrees of freedom.

SLIDE 37

Chi-Square Goodness-of-Fit-Test

Given: n sample values X1, ..., Xn of random variable X with absolute frequencies H1, ..., Hk for k value classes vi (e.g. value intervals) of random variable X

Null hypothesis: the values Xi are f distributed (e.g. uniformly distributed), where f has expectation μ and variance σ²

Approach:

 

k i i i k

n v E H Y

1

/ )) ( ( : 

$$Z_k := \sum_{i=1}^{k} \frac{(H_i - E(v_i))^2}{E(v_i)}$$

with E(vi) := n P[X is in class vi according to f]

Both statistics are approximately χ² distributed with k-1 degrees of freedom.

Rejection of null hypothesis at test level α (e.g. 0.05) if

$$Z_k > \chi^2_{k-1,\,1-\alpha}$$

SLIDE 38

Chi-Square Independence Test

Given: n samples of two random variables X, Y or, equivalently, a two-dimensional random variable with absolute frequencies H11, ..., Hrc for r × c value classes, where X has r and Y has c distinct classes. (This is called a contingency table.)

Null hypothesis: X and Y are independent; then the expectations for the absolute frequencies of the value classes would be

$$E_{ij} = \frac{R_i \, C_j}{n} \quad \text{with} \quad R_i := \sum_{j=1}^{c} H_{ij} \quad \text{and} \quad C_j := \sum_{i=1}^{r} H_{ij}$$

Approach:

$$Z := \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(H_{ij} - E_{ij})^2}{E_{ij}}$$

is approximately χ² distributed with (r-1)(c-1) degrees of freedom

Rejection of null hypothesis at test level α (e.g. 0.05) if

$$Z > \chi^2_{(r-1)(c-1),\,1-\alpha}$$

SLIDE 39

Example: Chi-Square Independence Test

women and men seem to prefer different study subjects → we compiled enrollment data in a contingency table

Hypothesis H0: Gender and Subject are independent

Subject    Male    Female    Total
CS           80        20      100
Math         40        20       60
Bioinf       20        20       40
Total       140        60      200

Test statistic

$$Z = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(H_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2((r-1)(c-1)) \sim \chi^2(2)$$

$$Z = \frac{10^2}{70} + \frac{(-10)^2}{30} + \frac{(-2)^2}{42} + \frac{2^2}{18} + \frac{(-8)^2}{28} + \frac{8^2}{12} \approx 12.7$$

Test level 1−α = 0.95 → $\chi^2_{2,0.95} \approx 5.99$ → reject H0
SLIDE 40

Chi-Square Distribution Table

SLIDE 41

Chi-Square Distribution Table

SLIDE 42

3.2.4 Regression for Parameter Fitting

Linear Regression

Estimate r(x) = E[Y | X1=x1 ∧ ... ∧ Xm=xm] using a linear model

$$Y = r(x) + \varepsilon = \beta_0 + \sum_{i=1}^{m} \beta_i x_i + \varepsilon$$

with error ε with E[ε]=0

given n sample points $(x_1^{(i)}, \ldots, x_m^{(i)}, y^{(i)})$, i=1..n, the least-squares estimator (LSE) minimizes the quadratic error:

$$E(\beta_0, \ldots, \beta_m) := \sum_{i=1..n} \left( \left( \sum_{k=0..m} \beta_k x_k^{(i)} \right) - y^{(i)} \right)^2 \quad \text{(with } x_0^{(i)} = 1\text{)}$$

Solve linear equation system:

$$\frac{\partial E}{\partial \beta_k} = 0 \quad \text{for } k = 0, \ldots, m$$

equivalent to MLE (under Normally distributed errors):

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

with $Y = (y^{(1)} \ldots y^{(n)})^T$ and

$$X = \begin{pmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \ldots & x_m^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \ldots & x_m^{(2)} \\ \vdots & & & & \vdots \\ 1 & x_1^{(n)} & x_2^{(n)} & \ldots & x_m^{(n)} \end{pmatrix}$$
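For a single predictor (m = 1) the normal equations $(X^T X)\beta = X^T Y$ form a 2×2 system that can be solved by hand. The sketch below does exactly that in plain Python; the function name and the data points are illustrative assumptions, and for larger m one would normally use a linear-algebra library instead.

```python
def fit_line(xs, ys):
    """Least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Normal equations for X = [1, x]:
    #   [ n   sx  ] [beta0]   [ sy  ]
    #   [ sx  sxx ] [beta1] = [ sxy ]
    det = n * sxx - sx * sx          # zero only if all x values coincide
    beta0 = (sxx * sy - sx * sxy) / det
    beta1 = (n * sxy - sx * sy) / det
    return beta0, beta1


def quadratic_error(beta0, beta1, xs, ys):
    """The quadratic error E(beta0, beta1) minimized by the LSE."""
    return sum((beta0 + beta1 * x - y) ** 2 for x, y in zip(xs, ys))


xs = [0.0, 1.0, 2.0, 3.0]            # hypothetical sample points
ys = [1.1, 2.9, 5.2, 6.8]
b0, b1 = fit_line(xs, ys)
print(b0, b1, quadratic_error(b0, b1, xs, ys))
```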

SLIDE 43

Logistic Regression

Estimate r(x) = E[Y | X=x] for Bernoulli Y using a logistic model

$$Y = r(x) + \varepsilon = \frac{e^{\beta_0 + \sum_{i=1}^{m} \beta_i x_i}}{1 + e^{\beta_0 + \sum_{i=1}^{m} \beta_i x_i}} + \varepsilon$$

with error ε with E[ε]=0

(log-linear: the log-odds log(r(x) / (1 − r(x))) are linear in x)

solution for MLE for βi values based on numerical gradient-descent methods
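A minimal sketch of such a numerical solution for one predictor: plain gradient ascent on the Bernoulli log-likelihood (equivalently, gradient descent on its negative). The function names, learning rate, step count and data are assumptions chosen for illustration; the data are deliberately not perfectly separable, since otherwise the MLE does not exist as a finite value.

```python
from math import exp


def sigmoid(t):
    """Logistic function e^t / (1 + e^t), written as 1 / (1 + e^-t)."""
    return 1.0 / (1.0 + exp(-t))


def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Approximate MLE (beta0, beta1) for P[Y=1 | x] = sigmoid(beta0 + beta1*x)."""
    beta0, beta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            residual = y - sigmoid(beta0 + beta1 * x)   # d logL / d(linear term)
            grad0 += residual
            grad1 += residual * x
        # Step in the direction of increasing log-likelihood.
        beta0 += learning_rate * grad0 / n
        beta1 += learning_rate * grad1 / n
    return beta0, beta1


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical predictor values
ys = [0, 0, 1, 0, 1, 1]               # hypothetical Bernoulli outcomes
b0, b1 = fit_logistic(xs, ys)
print(b0, b1, sigmoid(b0 + b1 * 0.0), sigmoid(b0 + b1 * 5.0))
```

The gradient has the simple form "observed minus predicted, times the input", which is why no closed-form solution is needed to run the iteration.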

SLIDE 44

Summary of Section 3.2

  • Samples and Estimators are RVs
  • Estimators should be unbiased
  • MLE is canonical estimator for parameters
  • Confidence intervals based on Normal and t distributions
  • Hypothesis testing: reject or retain H0 at level α
  • p-value: smallest level α for rejecting H0
  • Wald test and t-test for (in)equality of parameters
  • Chi-Square test for independence or goodness-of-fit
  • Linear regression for predicting continuous variables

SLIDE 45

Additional Literature for Section 3.2

  • A. Allen: Probability, Statistics, and Queueing Theory With Computer Science Applications, Wiley 1978
  • G. Casella, R. Berger: Statistical Inference, Duxbury 2002
  • M. Greiner, G. Tinhofer: Stochastik für Studienanfänger der Informatik, Carl Hanser Verlag, 1996
  • G. Hübner: Stochastik: Eine Anwendungsorientierte Einführung für Informatiker, Ingenieure und Mathematiker, Vieweg & Teubner 2009