Statistics and learning
Regression
Emmanuel Rachelson and Matthieu Vignes
ISAE SupAero
Wednesday 6th November 2013
◮ The regression model expresses a random variable Y as a function of
  explanatory random variables X1, . . . , Xk, unknown parameters (βl)l=1...k
  and a noise term.
◮ Goal: from n experimental observations (xi, yi), we aim at
  ◮ estimating the unknown (βl)l=1...k,
  ◮ evaluating the fitness of the model,
  ◮ if the fit is acceptable, tests on the parameters can be performed and the
    model used for prediction.
◮ A single explanatory variable X and an affine relationship to the
  output: Yi = β0 + β1 xi + ǫi.
◮ Residuals ǫi are assumed to be centred (R1), to have equal variances σ²
  (R2) and to be uncorrelated (R3).
◮ Hence: E[Yi] = β0 + β1 xi, Var(Yi) = σ² and Cov(Yi, Yj) = 0 for i ≠ j.
◮ Fitting (or adjusting) the model = estimate β0, β1 and σ from the
  observations (xi, yi).
◮ Seeking values for β0 and β1 minimising the sum of quadratic errors:
  (β̂0, β̂1) = argmin(β0,β1) Σ_i (yi − β0 − β1 xi)² (numerical sketch below).
◮ In matrix notation (useful later): Y = X.B + ǫ, with Y = (Y1, . . . , Yn)⊤,
  X the n × 2 matrix whose i-th row is (1, xi), B = (β0, β1)⊤ and
  ǫ = (ǫ1, . . . , ǫn)⊤.
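A minimal numpy sketch (not part of the original slides; the synthetic data and all names are illustrative) fitting this model both with the closed-form solution of the criterion above and through the matrix form Y = X.B + ǫ:

```python
# Least-squares fit of y = b0 + b1*x + eps on synthetic data (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)                  # observed explanatory variable
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=n)  # observations with Gaussian noise

# Minimise sum_i (y_i - b0 - b1*x_i)^2: closed-form solution
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

# Same thing in matrix form Y = X.B + eps, with B = (b0, b1)
X = np.column_stack([np.ones(n), x])
B_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b0_hat, b1_hat)   # close to (2.0, 0.5)
print(B_hat)            # identical up to numerical precision
```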
◮ Useful notations: x̄ = (1/n) Σ_i xi, ȳ, s²x, s²y and
  sxy = 1/(n−1) Σ_i (xi − x̄)(yi − ȳ).
◮ Linear correlation coefficient: rxy = sxy / (sx sy).
◮ Least-squares estimators: β̂1 = sxy / s²x and β̂0 = ȳ − β̂1 x̄; σ² is estimated
  by s² = 1/(n−2) Σ_i ǫ̂i².
◮ Var(β̂0) = σ² (1/n + x̄² / ((n−1) s²x)) and Var(β̂1) = σ² / ((n−1) s²x)
  (see the numerical sketch below).
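The quantities of this slide, computed with numpy on illustrative synthetic data (a sketch, not the course's own code):

```python
# Sample statistics, correlation, residual variance and estimator variances.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=n)

x_bar, y_bar = x.mean(), y.mean()
s2_x = np.sum((x - x_bar) ** 2) / (n - 1)           # sample variance of x
s_xy = np.sum((x - x_bar) * (y - y_bar)) / (n - 1)  # sample covariance
r_xy = s_xy / np.sqrt(s2_x * np.var(y, ddof=1))     # linear correlation coefficient

b1_hat = s_xy / s2_x
b0_hat = y_bar - b1_hat * x_bar
resid = y - (b0_hat + b1_hat * x)
s2 = np.sum(resid ** 2) / (n - 2)                   # unbiased estimate of sigma^2

var_b1 = s2 / ((n - 1) * s2_x)                          # estimated Var(b1_hat)
var_b0 = s2 * (1 / n + x_bar ** 2 / ((n - 1) * s2_x))   # estimated Var(b0_hat)
print(r_xy, s2, np.sqrt(var_b0), np.sqrt(var_b1))
```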
◮ In addition to R1 (centred noise), R2 (equal-variance noise) and R3
  (uncorrelated noise), we now assume R3' (independent noise) and R4
  (Gaussian noise).
◮ Theorem: under (R1, R2, R3' and R4), Least Square estimators =
  Maximum Likelihood estimators, and
  1. β̂0 ~ N(β0, σ²_β̂0) and β̂1 ~ N(β1, σ²_β̂1), with
     σ²_β̂0 = σ² Σ_i xi² / (n Σ_i (xi − x̄)²) and σ²_β̂1 = σ² / Σ_i (xi − x̄)²;
  2. (n−2) s² / σ² follows a χ² distribution with n−2 degrees of freedom;
  3. estimators of σ²_β̂0 and σ²_β̂1 are given in 1. by replacing σ² by s².
◮ The previous theorem allows us to build CIs for β0 and β1 (sketch below).
◮ Variance decomposition: SST/n = SSR/n + SSE/n, with
  SST = Σ_i (yi − ȳ)², SSR = Σ_i (ŷi − ȳ)² and SSE = Σ_i (yi − ŷi)².
◮ Definition: determination coefficient
  R² = Σ_i (ŷi − ȳ)² / Σ_i (yi − ȳ)² = SSR/SST = 1 − SSE/SST
     = 1 − Residual variance / Total variance.
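A sketch of the confidence intervals and of the variance decomposition, again on illustrative synthetic data; the Student quantile comes from scipy.stats:

```python
# Student-based CIs for b0, b1 and the SST = SSR + SSE decomposition giving R^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
s2 = np.sum((y - y_hat) ** 2) / (n - 2)

# Standard errors of the estimators (sigma^2 replaced by s^2)
se_b1 = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))
se_b0 = np.sqrt(s2 * (1 / n + x.mean() ** 2 / np.sum((x - x.mean()) ** 2)))
t = stats.t.ppf(0.975, df=n - 2)                  # Student quantile, alpha = 5%
print("CI(b0):", (b0 - t * se_b0, b0 + t * se_b0))
print("CI(b1):", (b1 - t * se_b1, b1 + t * se_b1))

# Variance decomposition and determination coefficient
SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)
print(np.isclose(SST, SSR + SSE))                 # True
print("R^2 =", SSR / SST)
```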
◮ Given a new x*, what is the prediction ỹ* and how precise is it?
◮ It's simply ỹ* = β̂0 + β̂1 x*.
◮ Its CI is ỹ* ± t_{n−2;1−α/2} s √(1/n + (x* − x̄)² / Σ_i (xi − x̄)²)
  (numerical sketch below).
◮ Predictions are valid in the range of the (xi)'s.
◮ The precision varies according to the x* value you want to predict: the
  interval widens as x* moves away from x̄.
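A sketch of the prediction and its interval at a new point x* (synthetic data; the helper predict_with_ci is an illustrative name, not from the slides):

```python
# Point prediction at x_star and the interval of the slide.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=n)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s2 = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)
t = stats.t.ppf(0.975, df=n - 2)

def predict_with_ci(x_star):
    """Point prediction and 95% interval; precision degrades away from x.mean()."""
    y_tilde = b0 + b1 * x_star
    half_width = t * np.sqrt(s2 * (1 / n + (x_star - x.mean()) ** 2 / Sxx))
    return y_tilde, (y_tilde - half_width, y_tilde + half_width)

print(predict_with_ci(5.0))   # near the centre of the x's: narrow interval
print(predict_with_ci(9.5))   # near the edge of the observed range: wider interval
```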
◮ Natural extension when several (Xj)j=1...p are used to explain Y.
◮ The model simply writes: Y = β0 + Σ_{j=1}^p βj Xj + ǫ. In matrix
  notation: Y = X B + ǫ.
◮ x = (x_i^j)_{i,j} is the observed design matrix.
◮ Identifiability of β is equivalent to the linear independence of the
  columns of x.
◮ Parameter estimation: argmin_β Σ_{i=1}^n (yi − Σ_{j=1}^p βj x_i^j − β0)².
◮ Theorem: the Least Square Estimator of β is β̂ = (⊤X X)⁻¹ ⊤X Y
  (see the sketch below).
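A sketch of the least-squares estimator in the multiple case (the synthetic design matrix and coefficients are illustrative):

```python
# Multiple-regression least-squares estimator (X^T X)^-1 X^T y on synthetic data.
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
x = rng.normal(size=(n, p))                     # observed design (without intercept)
beta_true = np.array([1.0, -2.0, 0.5, 3.0])     # (beta_0, beta_1, ..., beta_p)
X = np.column_stack([np.ones(n), x])            # add the intercept column
y = X @ beta_true + rng.normal(0, 0.5, size=n)

# Solving the normal equations matches the formula of the theorem;
# np.linalg.lstsq is numerically safer in practice.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                 # close to beta_true
```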
◮ Properties: E[β̂] = β and Var(β̂) = σ² (⊤X X)⁻¹; we have little control on
  σ², so the structure of ⊤X X dictates the quality of the estimation.
◮ σ² is estimated by s² = Σ_i ǫ̂i² / (n − p − 1).
◮ CI for βj: [β̂j ± t_{n−p−1;1−α/2} σ̂_β̂j], with t_{n−p−1;1−α/2} a Student
  quantile and σ̂_β̂j the square root of the j-th diagonal element of
  s² (⊤X X)⁻¹.
◮ Tests on βj: the rv (β̂j − βj) / σ̂_β̂j follows a Student distribution with
  n − p − 1 degrees of freedom (sketch below).
◮ Confidence regions for β = (β0 . . . βp) and CIs for predictions on y* are
  built along the same lines.
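A sketch of the coefficient-wise CIs and Student tests, built from the diagonal of s² (⊤X X)⁻¹ (synthetic data, illustrative names):

```python
# Standard errors, CIs and t-statistics for each beta_j.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, -2.0, 0.0, 3.0])      # beta_2 = 0: should not look "significant"
y = X @ beta_true + rng.normal(0, 0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = np.sum(resid ** 2) / (n - p - 1)
se = np.sqrt(s2 * np.diag(XtX_inv))              # estimated sigma_{beta_j hat}

t_quant = stats.t.ppf(0.975, df=n - p - 1)
ci = np.column_stack([beta_hat - t_quant * se, beta_hat + t_quant * se])
t_stat = beta_hat / se                           # test of H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stat), df=n - p - 1)
print(ci)
print(p_values)                                  # large p-value for the coefficient set to 0
```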
◮ Residual plot: check variance homogeneity (weights can be used if not).
◮ QQ-plots: to detect outliers . . .
◮ Model selection: R² only compares models with the same number of
  regressors. Adjusted coefficient: R²_adj = ((n−1)R² − (p−1)) / (n−p);
  maximising R²_adj amounts to minimising the residual variance estimate.
◮ Test by ANOVA: F = (SSR/p) / (SSE/(n−p−1)) has a Fisher distribution
  with (p, n−p−1) degrees of freedom under H0: β1 = . . . = βp = 0
  (see the sketch below).
◮ Application: variable selection for model interpretation: backward
  elimination, . . .
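A sketch of R²_adj and of the global F-test on synthetic data (illustrative only; note that conventions on whether p counts the intercept vary between references):

```python
# Adjusted R^2 (slide's formula) and the global F-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.8, 0.0, 0.0, -1.2]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat
SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)
SSR = SST - SSE

R2 = SSR / SST
R2_adj = ((n - 1) * R2 - (p - 1)) / (n - p)      # formula of the slide
F = (SSR / p) / (SSE / (n - p - 1))              # global F statistic
p_value = stats.f.sf(F, p, n - p - 1)            # Fisher(p, n-p-1) under H0
print(R2, R2_adj, F, p_value)
```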
◮ Detecting collinearity between the explanatory variables: inverting ⊤X X is
  numerically unstable if some columns of X are (nearly) collinear.
◮ To detect collinearity, compute VIF(xj) = 1 / (1 − R²_j), with R²_j the
  determination coefficient of the regression of xj on the other explanatory
  variables.
◮ Ridge regression, β̂λ = (⊤X X + λI)⁻¹ ⊤X Y, introduces a bias but reduces
  the variance (and keeps all variables in the model). See the sketch below.
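A sketch of the VIF diagnostic and of a ridge estimate on deliberately collinear synthetic data (the penalty λ = 1 is arbitrary; in practice the intercept is usually left unpenalised and the variables standardised):

```python
# Variance inflation factors and a ridge estimate on collinear data.
import numpy as np

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)              # nearly collinear with x1
x3 = rng.normal(size=n)
x = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(size=n)

def vif(x, j):
    """VIF(x_j) = 1 / (1 - R^2_j), R^2_j from regressing x_j on the other columns."""
    others = np.delete(x, j, axis=1)
    A = np.column_stack([np.ones(len(x)), others])
    coef = np.linalg.lstsq(A, x[:, j], rcond=None)[0]
    resid = x[:, j] - A @ coef
    r2_j = 1 - np.sum(resid ** 2) / np.sum((x[:, j] - x[:, j].mean()) ** 2)
    return 1 / (1 - r2_j)

print([vif(x, j) for j in range(x.shape[1])])    # large values for x1 and x2

# Ridge: (X^T X + lambda I)^-1 X^T y -- biased but with smaller variance
X = np.column_stack([np.ones(n), x])
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(beta_ridge)
```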
◮ Multiple-output regression: Y = X B + E, with Y ∈ M(n, K). Estimation
  minimises Σ_i ⊤(yi − xi,.B) Σǫ⁻¹ (yi − xi,.B), with Σǫ = Cov(ǫ).
◮ Curvilinear models are of the form Y = β0 + Σ_j βj fj(X) + ǫ for known
  functions fj (e.g. polynomials): still linear in the parameters.
◮ Non-linear (parametric) regression has the form Y = f(x; θ) + ǫ
  (sketch below).
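A sketch contrasting a curvilinear fit (linear in its parameters, solved by ordinary least squares) with a non-linear parametric fit obtained with scipy.optimize.curve_fit (synthetic data, illustrative model):

```python
# Curvilinear (polynomial) fit vs non-linear parametric fit Y = f(x; theta) + eps.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(8)
n = 100
x = np.linspace(0, 3, n)
y = 2.0 * np.exp(-1.5 * x) + rng.normal(0, 0.05, size=n)

# Curvilinear model y = b0 + b1*x + b2*x^2: still ordinary least squares
X = np.column_stack([np.ones(n), x, x ** 2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Non-linear parametric model f(x; theta) = theta_0 * exp(theta_1 * x),
# fitted by iterative least squares
def f(x, a, b):
    return a * np.exp(b * x)

theta_hat, _ = curve_fit(f, x, y, p0=(1.0, -1.0))
print(beta_hat)
print(theta_hat)          # close to (2.0, -1.5)
```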