SLIDE 1

Statistics and learning

Regression

Emmanuel Rachelson and Matthieu Vignes

ISAE SupAero

Wednesday 6th November 2013

SLIDE 2

The regression model

◮ expresses a random variable $Y$ as a function of random variables $X \in \mathbb{R}^p$ according to $Y = f(X; \beta) + \varepsilon$, where the functional $f$ depends on unknown parameters $\beta_1, \dots, \beta_k$ and the residual (or error) $\varepsilon$ is an unobservable random variable which accounts for random fluctuations between the model and $Y$.

◮ Goal: from $n$ experimental observations $(x_i, y_i)$, we aim at
  ◮ estimating the unknown $(\beta_l)_{l=1,\dots,k}$,
  ◮ evaluating the fitness of the model,
  ◮ if the fit is acceptable, performing tests on the parameters and using the model for predictions.
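A minimal simulation sketch of this model (a hedged illustration, not from the slides: $f$ is taken linear in a single regressor, and the parameter values and noise level are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: f(X; beta) linear in one regressor, Gaussian noise.
beta = np.array([2.0, 0.5])            # assumed "true" (beta_0, beta_1)
x = rng.uniform(0.0, 10.0, size=50)    # observed explanatory variable
eps = rng.normal(0.0, 1.0, size=50)    # unobservable residuals
y = beta[0] + beta[1] * x + eps        # realisations of Y = f(X; beta) + eps
print(y[:5])
```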

SLIDE 3

Simple linear regression

◮ A single explanatory variable $X$ and an affine relationship to the dependent variable $Y$: $E[Y \mid X = x] = \beta_0 + \beta_1 x$, or $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\beta_1$ is the slope of the adjusted regression line and $\beta_0$ is the intercept.

◮ The residuals $\varepsilon_i$ are assumed to be centred (R1), to have equal variances $\sigma^2$ (R2) and to be uncorrelated: $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0,\ \forall i \neq j$ (R3).

◮ Hence: $E[Y_i] = \beta_0 + \beta_1 x_i$, $\mathrm{Var}(Y_i) = \sigma^2$ and $\mathrm{Cov}(Y_i, Y_j) = 0,\ \forall i \neq j$.

◮ Fitting (or adjusting) the model = estimating $\beta_0$, $\beta_1$ and $\sigma$ from the $n$-sample $(x_i, y_i)$.

SLIDE 4

Least square estimate

◮ Seeking values of $\beta_0$ and $\beta_1$ minimising the sum of quadratic errors:
$$(\hat\beta_0, \hat\beta_1) = \operatorname*{argmin}_{(\beta_0, \beta_1) \in \mathbb{R}^2} \sum_i \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2.$$
Note that $Y$ and $X$ do not play a symmetric role!

◮ In matrix notation (useful later): $Y = X B + \varepsilon$, with $Y = {}^\top(Y_1 \dots Y_n)$, $B = {}^\top(\beta_0, \beta_1)$, $\varepsilon = {}^\top(\varepsilon_1 \dots \varepsilon_n)$ and
$$X = {}^\top\!\begin{pmatrix} 1 & \cdots & 1 \\ X_1 & \cdots & X_n \end{pmatrix}.$$
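A short numpy sketch of the least square fit in the matrix form above (the synthetic data and true parameters are assumptions for illustration; `np.linalg.lstsq` solves the stated argmin problem):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=50)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=50)

# Design matrix with a column of ones (intercept) and the observations x_i.
X = np.column_stack([np.ones_like(x), x])

# Least squares: argmin over (beta_0, beta_1) of ||y - X beta||^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("beta0_hat, beta1_hat:", beta_hat)
```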

SLIDE 5

Estimator properties

◮ Useful notations: $\bar{x} = \frac{1}{n}\sum_i x_i$, $\bar{y}$, $s_x^2$, $s_y^2$ and $s_{xy} = \frac{1}{n-1}\sum_i (x_i - \bar{x})(y_i - \bar{y})$.

◮ Linear correlation coefficient: $r_{xy} = \dfrac{s_{xy}}{s_x s_y}$.

Theorem
1. The Least Square estimators are $\hat\beta_1 = s_{xy}/s_x^2$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$.
2. These estimators are unbiased and efficient.
3. $s^2 = \frac{1}{n-2}\sum_i \left[ y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right]^2$ is an unbiased estimator of $\sigma^2$. It is however not efficient.
4. $\mathrm{Var}(\hat\beta_1) = \dfrac{\sigma^2}{(n-1)s_x^2}$ and $\mathrm{Var}(\hat\beta_0) = \bar{x}^2\,\mathrm{Var}(\hat\beta_1) + \sigma^2/n$.
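A sketch checking the closed-form estimators of the theorem on simulated data (the data and true parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

xbar, ybar = x.mean(), y.mean()
s2x = np.sum((x - xbar) ** 2) / (n - 1)          # sample variance of x
sxy = np.sum((x - xbar) * (y - ybar)) / (n - 1)  # sample covariance

beta1_hat = sxy / s2x                 # slope estimator  s_xy / s_x^2
beta0_hat = ybar - beta1_hat * xbar   # intercept estimator

resid = y - (beta0_hat + beta1_hat * x)
s2 = np.sum(resid ** 2) / (n - 2)     # unbiased estimator of sigma^2

var_beta1 = s2 / ((n - 1) * s2x)              # plug-in estimate of Var(beta1_hat)
var_beta0 = xbar ** 2 * var_beta1 + s2 / n    # plug-in estimate of Var(beta0_hat)
print(beta0_hat, beta1_hat, s2, var_beta0, var_beta1)
```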

SLIDE 6

Simple Gaussian linear model

◮ In addition to R1 (centred noise), R2 (equal variance noise) and R3 (uncorrelated noise), we assume (R3') $\forall i \neq j$, $\varepsilon_i$ and $\varepsilon_j$ independent, and (R4) $\forall i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, or equivalently $y_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2)$.

◮ Theorem: under (R1, R2, R3' and R4), the Least Square estimators coincide with the MLE.

Theorem (Distribution of estimators)
1. $\hat\beta_0 \sim \mathcal{N}(\beta_0, \sigma^2_{\hat\beta_0})$ and $\hat\beta_1 \sim \mathcal{N}(\beta_1, \sigma^2_{\hat\beta_1})$, with $\sigma^2_{\hat\beta_0} = \sigma^2\left( \dfrac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2} + \dfrac{1}{n} \right)$ and $\sigma^2_{\hat\beta_1} = \dfrac{\sigma^2}{\sum_i (x_i - \bar{x})^2}$.
2. $(n-2)s^2/\sigma^2 \sim \chi^2_{n-2}$.
3. $\hat\beta_0$ and $\hat\beta_1$ are independent of the $\hat\varepsilon_i$.
4. Estimators of $\sigma^2_{\hat\beta_0}$ and $\sigma^2_{\hat\beta_1}$ are obtained from 1. by replacing $\sigma^2$ with $s^2$.
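A quick Monte Carlo sanity check of these distributional results (illustrative only; the design, $\sigma$ and the number of replications are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 30, 1.0
x = rng.uniform(0.0, 10.0, size=n)        # fixed design across replications
Sxx = np.sum((x - x.mean()) ** 2)

b1_hats, s2s = [], []
for _ in range(10000):
    y = 2.0 + 0.5 * x + rng.normal(0.0, sigma, size=n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    b1_hats.append(b1)
    s2s.append(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

# Var(beta1_hat) should be close to sigma^2 / Sxx, and (n-2) s^2 / sigma^2
# should have mean n-2 (the chi-square_{n-2} mean).
print("empirical Var(beta1_hat):", np.var(b1_hats), "theory:", sigma**2 / Sxx)
print("mean of (n-2) s^2 / sigma^2:", (n - 2) * np.mean(s2s) / sigma**2, "vs", n - 2)
```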

SLIDE 7

Tests, ANOVA and determination coefficient

◮ The previous theorem allows us to build confidence intervals for $\beta_0$ and $\beta_1$.

◮ $SST/n = SSR/n + SSE/n$, with $SST = \sum_i (y_i - \bar{y})^2$ (total sum of squares), $SSR = \sum_i (\hat{y}_i - \bar{y})^2$ (regression sum of squares) and $SSE = \sum_i (y_i - \hat{y}_i)^2$ (sum of squared errors).

◮ Definition: determination coefficient
$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = 1 - \frac{\text{residual variance}}{\text{total variance}}.$$

→ Always use scatterplots to assess linear model adequacy: the slide illustrates this with several very different scatterplots that all share the same $R^2 = 0.667$.
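A short numerical illustration of the $SST = SSR + SSE$ decomposition and of $R^2$ (synthetic data assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 10.0, size=50)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=50)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)       # total sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
SSE = np.sum((y - y_hat) ** 2)          # sum of squared errors

print("SST == SSR + SSE ?", np.isclose(SST, SSR + SSE))
print("R^2 =", SSR / SST, "= 1 - SSE/SST =", 1 - SSE / SST)
```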

SLIDE 8

Prediction

◮ Given a new $x^*$, what is the prediction $\hat{y}(x^*)$?

◮ It is simply $\hat{y}(x^*) = \hat\beta_0 + \hat\beta_1 x^*$. But what is its precision?

◮ Its confidence interval is $\left[ \hat\beta_0 + \hat\beta_1 x^* \pm t_{n-2;1-\alpha/2}\, s^* \right]$, where $s^* = s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$.

◮ Predictions are valid in the range of the $(x_i)$'s.

◮ The precision varies according to the value $x^*$ at which you want to predict.
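A sketch of the prediction interval formula above, using `scipy.stats.t.ppf` for the Student quantile (the data, $x^*$ and $\alpha$ are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 50
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

x_star, alpha = 5.0, 0.05
y_pred = b0 + b1 * x_star
s_star = s * np.sqrt(1 + 1 / n + (x_star - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
t_q = stats.t.ppf(1 - alpha / 2, df=n - 2)
print("prediction:", y_pred, "interval:", (y_pred - t_q * s_star, y_pred + t_q * s_star))
```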

SLIDE 9

Multiple linear regression

◮ Natural extension when several variables $(X^j)_{j=1,\dots,p}$ are used to explain $Y$.

◮ The model simply reads $Y = \beta_0 + \sum_{j=1}^p \beta_j X^j + \varepsilon$, or, in matrix notation with the obvious generalisation, $Y = X\beta + \varepsilon$.

◮ $x = (x_i^j)_{i,j}$ is the observed design matrix.

◮ Identifiability of $\beta$ is equivalent to the linear independence of the columns of $x$, i.e. $\mathrm{Rank}(X) = p + 1$. This is equivalent to ${}^\top\!X X$ being invertible.

◮ Parameter estimation: $\operatorname*{argmin}_\beta \sum_{i=1}^n \left( y_i - \sum_{j=1}^p \beta_j x_i^j - \beta_0 \right)^2 \;\Leftrightarrow\; \operatorname*{argmin}_\beta \sum_i \hat\varepsilon_i^{\,2} \;\Leftrightarrow\; \operatorname*{argmin}_\beta \lVert Y - X\beta \rVert_2^2$.

◮ Theorem: the Least Square estimator of $\beta$ is $\hat\beta = ({}^\top\!X X)^{-1}\, {}^\top\!X\, Y$.
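A numpy sketch of the estimator $\hat\beta = ({}^\top\!X X)^{-1}\,{}^\top\!X\,Y$ via the normal equations (the design, true coefficients and noise level are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 3
x = rng.normal(size=(n, p))               # observed regressors
X = np.column_stack([np.ones(n), x])      # add intercept column; Rank(X) = p+1 assumed
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(0.0, 0.5, size=n)

# Least square estimator: solve (X^T X) beta = X^T y rather than inverting explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```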

SLIDE 10

Properties of the least square estimate

Theorem
The estimator $\hat\beta$ previously defined is such that:
1. $\hat\beta \sim \mathcal{N}\!\left(\beta, \sigma^2 ({}^\top\!X X)^{-1}\right)$ and
2. $\hat\beta$ is efficient: among all unbiased estimators, it has the smallest variance.

◮ We have little control over $\sigma^2$, so the structure of ${}^\top\!X X$ dictates the quality of the estimator $\hat\beta$: this is the subject of optimal experimental design.

Theorem
Let $\hat{Y} = X\hat\beta$ denote the predicted values. Then $\hat{Y} = H Y$, with $H = X ({}^\top\!X X)^{-1}\, {}^\top\!X$, and $\hat\varepsilon = Y - \hat{Y} = (\mathrm{Id} - H) Y$. Note that $H$ is the orthogonal projection onto $\mathrm{Vect}(X) \subset \mathbb{R}^n$. We have:
1. $\mathrm{Cov}(\hat{Y}) = \sigma^2 H$,
2. $\mathrm{Cov}(\hat\varepsilon) = \sigma^2 (\mathrm{Id} - H)$ and
3. $\hat\sigma^2 = \dfrac{\lVert \hat\varepsilon \rVert^2}{n - p - 1}$.
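A sketch of the hat matrix and of the residual-based variance estimator from the theorem (synthetic design assumed; the projection properties are checked numerically):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0.0, 0.5, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X^T X)^{-1} X^T
y_hat = H @ y                           # fitted values  Y_hat = H Y
resid = (np.eye(n) - H) @ y             # residuals (Id - H) Y

# H is the orthogonal projection onto the column space of X: H^2 = H and H symmetric.
print(np.allclose(H @ H, H), np.allclose(H, H.T))
print("sigma^2 estimate:", np.sum(resid ** 2) / (n - p - 1))
```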

SLIDE 11

Practical uses

◮ Confidence interval for $\beta_j$: $\left[ \hat\beta_j \pm t_{n-p-1;1-\alpha/2}\, \sigma_{\hat\beta_j} \right]$, with $t_{n-p-1;1-\alpha/2}$ a Student quantile and $\sigma_{\hat\beta_j}$ the square root of the $j$-th diagonal element of $\mathrm{Cov}(\hat\beta)$.

◮ Tests on $\beta_j$: the random variable $\dfrac{\hat\beta_j - \beta_j}{\sigma_{\hat\beta_j}}$ has a Student distribution.

◮ Confidence region for $\beta = (\beta_0, \dots, \beta_p)$:
$$R_{1-\alpha}(\beta) = \left\{ z \in \mathbb{R}^{p+1} \;\middle|\; {}^\top(z - \hat\beta)\; {}^\top\!X X\; (z - \hat\beta) \le (p+1)\, s^2 f_{k;n-p-1;1-\alpha} \right\}.$$
It is an ellipsoid centred on $\hat\beta$ whose volume, shape and orientation depend upon ${}^\top\!X X$.

◮ Confidence interval for predictions at $x^*$: $\left[ \hat{y}^* \pm t_{n-p-1;1-\alpha/2}\, s \left( 1 + {}^\top\!x^* ({}^\top\!X X)^{-1} x^* \right)^{1/2} \right]$.
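A sketch of the confidence intervals and Student tests for the $\beta_j$ (synthetic data assumed; standard errors are taken from the diagonal of $s^2({}^\top\!X X)^{-1}$ as on the slide):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p, alpha = 100, 3, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0.0, 0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
s2 = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)
se = np.sqrt(s2 * np.diag(XtX_inv))            # standard errors of the beta_hat_j

t_q = stats.t.ppf(1 - alpha / 2, df=n - p - 1)
ci = np.column_stack([beta_hat - t_q * se, beta_hat + t_q * se])  # CIs for beta_j
t_stat = beta_hat / se                          # Student test of H0: beta_j = 0
p_val = 2 * stats.t.sf(np.abs(t_stat), df=n - p - 1)
print(ci, t_stat, p_val, sep="\n")
```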

SLIDE 12

Usual diagnosis

◮ Residual plot: variance homogeneity (weights can be used if it fails), model validation...

◮ QQ-plots: to detect outliers...

◮ Model selection: $R^2$ is only suitable for comparing models with the same number of regressors. Adjusted coefficient: $R^2_{\mathrm{adj}} = \dfrac{(n-1)R^2 - (p-1)}{n-p}$. Maximising $R^2_{\mathrm{adj}}$ is equivalent to minimising the mean quadratic error.

◮ Test by ANOVA: $F = \dfrac{SSR/p}{SSE/(n-p-1)}$ has a Fisher distribution with $(p,\, n-p-1)$ degrees of freedom. Since testing $(H_0)$: $\beta_1 = \dots = \beta_p = 0$ has little interest (it is rejected as soon as one of the variables is linked to $Y$), one can test $(H_0')$: $\beta_{i_1} = \dots = \beta_{i_q} = 0$, with $q < p$; then $\dfrac{(SSR - SSR_q)/q}{SSE/(n-p-1)}$ has a Fisher distribution with $(q,\, n-p-1)$ degrees of freedom.

◮ Application: variable selection for model interpretation: backward (remove variables one by one, least significant first, with a t-test), forward (include variables one by one, most significant first, with an F-test), stepwise (a variant of forward).
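A sketch of the adjusted $R^2$ and of the global ANOVA F-test (synthetic data assumed; the adjusted $R^2$ is written here with the $n-p-1$ degrees-of-freedom convention used elsewhere in the deck, which differs slightly from the slide's $(n-p)$ form):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0.0, 0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)
SSR = SST - SSE

R2 = 1 - SSE / SST
R2_adj = 1 - (SSE / (n - p - 1)) / (SST / (n - 1))   # adjusted R^2, p regressors + intercept
F = (SSR / p) / (SSE / (n - p - 1))                  # global F statistic
p_value = stats.f.sf(F, p, n - p - 1)                # Fisher(p, n-p-1) tail probability
print(R2, R2_adj, F, p_value)
```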

SLIDE 13

Collinearity and model selection

◮ Detecting collinearity between the regressors: inverting ${}^\top\!X X$ when $\det({}^\top\!X X) \approx 0$ is difficult; moreover, the resulting estimator has a huge variance!

◮ To detect collinearity, compute $VIF(x^j) = \dfrac{1}{1 - R_j^2}$, with $R_j^2$ the determination coefficient of $x^j$ regressed against $x \setminus \{x^j\}$. Perfect orthogonality gives $VIF(x^j) = 1$, and the stronger the collinearity, the larger the value of $VIF(x^j)$.

◮ Ridge regression introduces a bias but reduces the variance (it keeps all variables). Lasso regression does the same but also performs a selection on variables. Issue here: a penalty term to tune...
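A sketch of the VIF computation by regressing each variable on the others (synthetic, deliberately collinear data assumed):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # deliberately collinear with x1
x3 = rng.normal(size=n)
x = np.column_stack([x1, x2, x3])

def vif(x, j):
    """VIF(x_j) = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other columns."""
    others = np.delete(x, j, axis=1)
    A = np.column_stack([np.ones(len(x)), others])
    coef, *_ = np.linalg.lstsq(A, x[:, j], rcond=None)
    resid = x[:, j] - A @ coef
    r2 = 1 - resid @ resid / np.sum((x[:, j] - x[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

print([round(vif(x, j), 1) for j in range(x.shape[1])])  # large VIF for x1 and x2
```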

SLIDE 14

Last generalisations

Multiple outputs, curvilinear and non-linear regressions

◮ Multiple output regression: $Y = X B + E$, with $Y \in \mathcal{M}(n, K)$ and $X \in \mathcal{M}(n, p)$, so $RSS(B) = \mathrm{Tr}\!\left[ {}^\top(Y - XB)(Y - XB) \right]$ (column-wise), or $\sum_i {}^\top(y_i - x_{i,.}B)\, \Sigma_\varepsilon^{-1} (y_i - x_{i,.}B)$, with $\Sigma_\varepsilon = \mathrm{Cov}(\varepsilon)$ (correlated errors).

◮ Curvilinear models are of the form $Y = \beta_0 + \sum_j \beta_j x^j + \sum_{k,l} \beta_{k,l} x^k x^l + \varepsilon$.

◮ Non-linear (parametric) regression has the form $Y = f(x; \theta) + \varepsilon$. Examples include exponential or logistic models.
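A sketch of a non-linear parametric fit $Y = f(x; \theta) + \varepsilon$ with an exponential model, using `scipy.optimize.curve_fit` (the model choice, data and starting values are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(11)

def f(x, a, b):
    """Assumed exponential model f(x; theta) = a * exp(b * x), theta = (a, b)."""
    return a * np.exp(b * x)

x = np.linspace(0.0, 2.0, 60)
y = f(x, 2.0, 1.3) + rng.normal(0.0, 0.2, size=x.size)

# Non-linear least squares fit of theta.
theta_hat, cov = curve_fit(f, x, y, p0=[1.0, 1.0])
print("theta_hat:", theta_hat)
```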

SLIDE 15

Today's session is over

Next time: a practical R session to be studied by you!