SLIDE 1

Arthur CHARPENTIER, Advanced Econometrics Graduate Course, Winter 2018, Université de Rennes 1

Advanced Econometrics #1 : Nonlinear Transformations*

  • A. Charpentier (Université de Rennes 1)

Université de Rennes 1 Graduate Course, 2018.

@freakonometrics

SLIDE 2

Econometrics and 'Regression'?

Galton (1870, Hereditary Genius; 1886, Regression towards mediocrity in hereditary stature) and Pearson & Lee (1896, On Telegony in Man; 1903, On the Laws of Inheritance in Man) studied the genetic transmission of characteristics, e.g. height. On average the child of tall parents is taller than other children, but shorter than his parents.

"I have called this peculiarity by the name of regression", Francis Galton, 1886.

SLIDE 3

Econometrics and 'Regression'?

> library(HistData)
> attach(Galton)
> Galton$count <- 1
> df <- aggregate(Galton, by=list(parent, child), FUN=sum)[,c(1,2,5)]
> plot(df[,1:2], cex=sqrt(df[,3]/3))
> abline(a=0, b=1, lty=2)
> abline(lm(child~parent, data=Galton))
> coefficients(lm(child~parent, data=Galton))[2]
   parent
0.6462906

[Figure: height of the child vs. height of the mid-parent, with the regression line (slope ≈ 0.65) and the first diagonal]

It is more an autoregression issue here: if Y_t = φ Y_{t−1} + ε_t, then cor[Y_t, Y_{t+h}] = φ^h → 0 as h → ∞.

SLIDE 4

Econometrics and 'Regression'?

Regression is a correlation problem. Overall, children are not smaller than their parents.

[Figure: joint and marginal distributions of child and mid-parent heights]

SLIDE 5

Overview

  • Linear Regression Model: y_i = β0 + x_i^T β + ε_i = β0 + β1 x_{1,i} + β2 x_{2,i} + ε_i
  • Nonlinear Transformations: smoothing techniques,
    h(y_i) = β0 + β1 x_{1,i} + β2 x_{2,i} + ε_i or y_i = β0 + β1 x_{1,i} + h(x_{2,i}) + ε_i
  • Asymptotics vs. Finite Distance: bootstrap techniques
  • Penalization: Parsimony, Complexity and Overfit
  • From least squares to other regressions: quantiles, expectiles, etc.

SLIDE 6

References

Motivation: Kopczuk, W. Tax bases, tax rates and the elasticity of reported income. JPE.

References:
  • Eubank, R.L. (1999) Nonparametric Regression and Spline Smoothing, CRC Press.
  • Fan, J. & Gijbels, I. (1996) Local Polynomial Modelling and Its Applications, CRC Press.
  • Hastie, T.J. & Tibshirani, R.J. (1990) Generalized Additive Models, CRC Press.
  • Wand, M.P. & Jones, M.C. (1994) Kernel Smoothing, CRC Press.

SLIDE 7

Deterministic or Parametric Transformations

Consider the child mortality rate (y) as a function of GDP per capita (x).

[Figure: infant mortality rate vs. GDP per capita, one point per country]

SLIDE 8

Deterministic or Parametric Transformations

Logarithmic transformation: log(y) as a function of log(x).

[Figure: infant mortality rate vs. GDP per capita, both on log scales, one point per country]

SLIDE 9

Deterministic or Parametric Transformations

Reverse transformation (back to the original scale).

[Figure: infant mortality rate vs. GDP per capita, original scale, with the back-transformed fit]

SLIDE 10

Box-Cox transformation

See Box & Cox (1964) An Analysis of Transformations:

h(y, λ) = (y^λ − 1)/λ if λ ≠ 0, and h(y, λ) = log(y) if λ = 0

or, with an additional shift parameter µ,

h(y, λ, µ) = ([y + µ]^λ − 1)/λ if λ ≠ 0, and h(y, λ, µ) = log(y + µ) if λ = 0.
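In R, the transform itself is a one-liner; a minimal sketch (the helper name boxcox_h is ours, not from any package):

# A minimal sketch of the Box-Cox transform, directly from the formulas above
boxcox_h <- function(y, lambda, mu = 0) {
  z <- y + mu
  if (lambda == 0) log(z) else (z^lambda - 1) / lambda
}
boxcox_h(cars$dist, lambda = 0.5)  # close to a square-root transform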

[Figure: the Box-Cox transform h(y, λ) for several values of λ]

SLIDE 11

Profile Likelihood

In a statistical context, suppose that the unknown parameter can be partitioned θ = (λ, β), where λ is the parameter of interest and β is a nuisance parameter. Consider {y_1, · · · , y_n}, a sample from distribution F_θ, so that the log-likelihood is

log L(θ) = Σ_{i=1}^n log f_θ(y_i)

The MLE is defined as θ̂^MLE = argmax_θ {log L(θ)}. Rewrite the log-likelihood as log L(θ) = log L_λ(β). Define

β̂_λ^pMLE = argmax_β {log L_λ(β)} and then λ̂^pMLE = argmax_λ {log L_λ(β̂_λ^pMLE)}.

Observe that

√n (λ̂^pMLE − λ) →_L N(0, [I_{λ,λ} − I_{λ,β} I_{β,β}^{−1} I_{β,λ}]^{−1})

SLIDE 12

Profile Likelihood and Likelihood Ratio Test

The (profile) likelihood ratio test is based on

2 ( max_{λ,β} log L(λ, β) − max_β log L(λ_0, β) )

If (λ_0, β_0) is the true value, this difference can be written

2 ( max_{λ,β} log L(λ, β) − log L(λ_0, β_0) ) − 2 ( max_β log L(λ_0, β) − log L(λ_0, β_0) )

Using a Taylor expansion,

∂log L(λ, β)/∂λ |_{(λ_0, β̂_{λ_0})} ∼ ∂log L(λ, β)/∂λ |_{(λ_0, β_0)} − I_{λ_0β_0} I_{β_0β_0}^{−1} ∂log L(λ_0, β)/∂β |_{(λ_0, β_0)}

Thus,

(1/√n) ∂log L(λ, β)/∂λ |_{(λ_0, β̂_{λ_0})} →_L N(0, I_{λ_0λ_0} − I_{λ_0β_0} I_{β_0β_0}^{−1} I_{β_0λ_0})

and

2 ( log L(λ̂, β̂) − log L(λ_0, β̂_{λ_0}) ) →_L χ²(dim(λ)).

SLIDE 13

Profile Likelihood and Likelihood Ratio Test

Consider some lognormal sample, and fit a Gamma distribution,

f(x; α, β) = x^{α−1} β^α e^{−βx} / Γ(α), with x > 0 and θ = (α, β).

> x <- exp(rnorm(100))

Maximum likelihood, θ̂ = argmax{log L(θ)}:

> library(MASS)
> (F <- fitdistr(x, "gamma"))
     shape        rate
  1.4214497   0.8619969
 (0.1822570) (0.1320717)
> F$estimate[1] + c(-1, 1) * 1.96 * F$sd[1]
[1] 1.064226 1.778673

SLIDE 14

Profile Likelihood and Likelihood Ratio Test

See also

> log_lik <- function(theta){
+   a <- theta[1]
+   b <- theta[2]
+   logL <- sum(log(dgamma(x, a, b)))
+   return(-logL)
+ }
> optim(c(1,1), log_lik)
$par
[1] 1.4214116 0.8620311

We can also use the profile likelihood,

α̂ = argmax_α { max_β log L(α, β) } = argmax_α { log L(α, β̂_α) }

SLIDE 15

Profile Likelihood and Likelihood Ratio Test

> prof_log_lik <- function(a){
+   b <- (optim(1, function(z) -sum(log(dgamma(x, a, z)))))$par
+   return(-sum(log(dgamma(x, a, b))))
+ }
> vx <- seq(.5, 3, length=101)
> vl <- -Vectorize(prof_log_lik)(vx)
> plot(vx, vl, type="l")
> optim(1, prof_log_lik)
$par
[1] 1.421094

We can use the likelihood ratio test,

2 ( log L_p(α̂) − log L_p(α) ) ∼ χ²(1)

SLIDE 16

Profile Likelihood and Likelihood Ratio Test

The implied 95% confidence interval is

> (b1 <- uniroot(function(z) Vectorize(prof_log_lik)(z) + borne, c(.5, 1.5))$root)
[1] 1.095726
> (b2 <- uniroot(function(z) Vectorize(prof_log_lik)(z) + borne, c(1.25, 2.5))$root)
[1] 1.811809
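The cutoff borne is not defined on this slide; a plausible definition, consistent with the χ²(1) threshold above (an assumption on our part), is

# Hypothetical definition of 'borne' (not shown on the slide): the roots of
# prof_log_lik(z) = -borne are then the endpoints of the 95% profile-likelihood CI
borne <- -(optim(1, prof_log_lik)$value + qchisq(0.95, 1)/2)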

SLIDE 17

Box-Cox

> boxcox(lm(dist~speed, data=cars))

Here h⋆ ∼ 0.5. Heuristically, y_i^{1/2} ∼ β0 + β1 x_i + ε_i — so why not consider a quadratic regression...?

[Figure: profile log-likelihood as a function of λ, with the 95% interval around λ ≈ 0.5]

[Figure: cars data, braking distance vs. speed, with the fitted transformation]

SLIDE 18

Uncertainty: Parameters vs. Prediction

Uncertainty on the regression parameters (β0, β1): from the output of the regression we can derive confidence intervals for β0 and β1, usually

β_k ∈ [ β̂_k ± u_{1−α/2} se[β̂_k] ]

[Figure: cars data (vehicle speed vs. braking distance), with regression lines drawn from the confidence region of (β0, β1)]

SLIDE 19

Uncertainty: Parameters vs. Prediction

Uncertainty on a prediction, ŷ = m̂(x). Usually

m(x) ∈ [ m̂(x) ± u_{1−α/2} se[m̂(x)] ]

hence, for a linear model,

x^T β̂ ± u_{1−α/2} σ̂ √( x^T [X^T X]^{−1} x )

i.e. (with one covariate)

se²[m̂(x)] = Var[β̂0 + β̂1 x] = se²[β̂0] + 2 cov[β̂0, β̂1] x + se²[β̂1] x²

[Figure: cars data with the pointwise confidence band around the regression line]

> predict(lm(dist~speed, data=cars), newdata=data.frame(speed=x), interval="confidence")

SLIDE 20

Least Squares and Expected Value (Orthogonal Projection Theorem)

Let y ∈ R^n,

ȳ = argmin_{m∈R} { Σ_{i=1}^n (1/n) [y_i − m]² }

It is the empirical version of

E[Y] = argmin_{m∈R} { ∫ [y − m]² dF(y) } = argmin_{m∈R} { E[ (Y − m)² ] }

where Y is an ℓ¹ random variable. Thus,

argmin_{m(·): R^k→R} { Σ_{i=1}^n (1/n) [y_i − m(x_i)]² }

is the empirical version of E[Y | X = x].

SLIDE 21

The Histogram and the Regressogram

Connections between the estimation of f(y) and of E[Y | X = x]. Assume that y_i ∈ [a_1, a_{k+1}), divided into k classes [a_j, a_{j+1}). The histogram is

f̂_a(y) = Σ_{j=1}^k [ 1(y ∈ [a_j, a_{j+1})) / (a_{j+1} − a_j) ] · (1/n) Σ_{i=1}^n 1(y_i ∈ [a_j, a_{j+1}))

Assume that a_{j+1} − a_j = h_n, with h_n → 0 and nh_n → ∞ as n → ∞; then E[ (f̂_a(y) − f(y))² ] ∼ O(n^{−2/3}) (for an optimal choice of h_n).

> hist(height)

[Figure: histogram of height]

SLIDE 22

The Histogram and the Regressogram

Then a moving histogram was considered,

f̂(y) = (1/(2nh_n)) Σ_{i=1}^n 1(y_i ∈ [y ± h_n)) = (1/(nh_n)) Σ_{i=1}^n k( (y_i − y)/h_n )

with k(x) = (1/2)·1(x ∈ [−1, 1)), which is a (flat) kernel estimator.

> density(height, kernel = "rectangular")

[Figure: moving-histogram (rectangular-kernel) density estimates of height]

SLIDE 23

The Histogram and the Regressogram

From Tukey (1961) Curves as parameters, and touch estimation, the regressogram is defined as

m̂_a(x) = Σ_{i=1}^n 1(x_i ∈ [a_j, a_{j+1})) y_i / Σ_{i=1}^n 1(x_i ∈ [a_j, a_{j+1}))

and the moving regressogram is

m̂(x) = Σ_{i=1}^n 1(x_i ∈ [x ± h_n]) y_i / Σ_{i=1}^n 1(x_i ∈ [x ± h_n])
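A minimal sketch of the moving regressogram on the cars data (the bandwidth h = 2 is an arbitrary choice):

# Moving regressogram: average the y's whose x lies in the window [x - h, x + h]
m_moving <- function(x, h = 2) mean(cars$dist[abs(cars$speed - x) <= h])
xs <- seq(5, 25, by = 0.25)
plot(cars)
lines(xs, sapply(xs, m_moving), type = "s")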

[Figure: regressogram and moving regressogram on the cars data]

SLIDE 24

Nadaraya-Watson and Kernels

Background: Kernel Density Estimator. Consider a sample {y_1, · · · , y_n}; F_n is the empirical cumulative distribution function,

F_n(y) = (1/n) Σ_{i=1}^n 1(y_i ≤ y)

The empirical measure P_n consists in weights 1/n on each observation. Idea: add (a little) continuous noise to smooth F_n. Let Y_n denote a random variable with distribution F_n and define Ỹ = Y_n + hU, where U ⊥⊥ Y_n, with cdf K. The cumulative distribution function of Ỹ is F̃,

F̃(y) = P[Ỹ ≤ y] = E[ 1(Ỹ ≤ y) ] = E[ E[ 1(Ỹ ≤ y) | Y_n ] ]

F̃(y) = E[ 1( U ≤ (y − Y_n)/h ) | Y_n ] = Σ_{i=1}^n (1/n) K( (y − y_i)/h )
SLIDE 25

Nadaraya-Watson and Kernels

If we differentiate,

f̃(y) = (1/(nh)) Σ_{i=1}^n k( (y − y_i)/h ) = (1/n) Σ_{i=1}^n k_h(y − y_i), with k_h(u) = (1/h) k(u/h)

f̃ is the kernel density estimator of f, with kernel k and bandwidth h.

  • Rectangular: k(u) = (1/2) 1(|u| ≤ 1)
  • Epanechnikov: k(u) = (3/4) (1 − u²) 1(|u| ≤ 1)
  • Gaussian: k(u) = (1/√(2π)) e^{−u²/2}

> density(height, kernel = "epanechnikov")

[Figure: the three kernel shapes on [−2, 2]]

[Figure: Epanechnikov kernel density estimate of height]

SLIDE 26

Kernels and Statistical Properties

Consider here an i.i.d. sample {Y_1, · · · , Y_n} with density f. Given y, observe that

E[f̃(y)] = ∫ (1/h) k( (y − t)/h ) f(t) dt = ∫ k(u) f(y − hu) du.

Use a Taylor expansion around h = 0, f(y − hu) ∼ f(y) − f′(y) hu + (1/2) f″(y) h²u²:

E[f̃(y)] = ∫ f(y) k(u) du − ∫ f′(y) hu k(u) du + ∫ (1/2) f″(y) h²u² k(u) du = f(y) + 0 + h² (f″(y)/2) ∫ k(u) u² du + o(h²)

Thus, if f is twice continuously differentiable with bounded second derivative, and if ∫ k(u) du = 1, ∫ u k(u) du = 0 and ∫ u² k(u) du < ∞, then

E[f̃(y)] = f(y) + h² (f″(y)/2) ∫ k(u) u² du + o(h²)

SLIDE 27

Kernels and Statistical Properties

For the heuristics on that bias, consider a flat kernel, and set

f_h(y) = [ F(y + h) − F(y − h) ] / (2h)

then the natural estimate is

f̂_h(y) = [ F̂(y + h) − F̂(y − h) ] / (2h) = (1/(2nh)) Σ_{i=1}^n 1(y_i ∈ [y ± h]) = (1/(2nh)) Σ_{i=1}^n Z_i

where the Z_i's are i.i.d. Bernoulli B(p_y) variables with p_y = P[Y_i ∈ [y ± h]] = 2h·f_h(y). Thus E[f̂_h(y)] = f_h(y), while f_h(y) ∼ f(y) + (h²/6) f″(y) as h ∼ 0.

SLIDE 28

Kernels and Statistical Properties

Similarly, as h → 0 and nh → ∞,

Var[f̃(y)] = (1/n) ( E[ k_h(y − Y)² ] − ( E[ k_h(y − Y) ] )² )

Var[f̃(y)] = (f(y)/(nh)) ∫ k(u)² du + o( 1/(nh) )

Hence
  • if h → 0 the bias goes to 0
  • if nh → ∞ the variance goes to 0

SLIDE 29

Kernels and Statistical Properties

Extension in higher dimension:

f̃(y) = (1/(n |H|^{1/2})) Σ_{i=1}^n k( H^{−1/2}(y − y_i) )

f̃(y) = (1/(n h^d |Σ|^{1/2})) Σ_{i=1}^n k( Σ^{−1/2}(y − y_i)/h )

[Figure: bivariate kernel density estimate of (height, weight), with contour levels]

SLIDE 30

Kernels and Convolution

Given f and g, set

(f ⋆ g)(x) = ∫_R f(x − y) g(y) dy

Then f̃_h = (f̂ ⋆ k_h), where f̂(y) = dF̂(y)/dy = (1/n) Σ_{i=1}^n δ_{y_i}(y). Hence, f̃ is the distribution of Y + ε, where Y is uniform over {y_1, · · · , y_n}, ε ∼ k_h, and Y and ε are independent.
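A quick numerical check of that representation (a sketch; three arbitrary points and h = 0.5):

# Sampling Y + h*U, with Y uniform on {y_i} and U standard Gaussian, reproduces
# the Gaussian kernel density estimate with bandwidth h
y <- c(1, 2, 4); h <- 0.5
sim <- sample(y, 1e5, replace = TRUE) + h * rnorm(1e5)
plot(density(y, bw = h))
lines(density(sim), lty = 2)  # the two curves should overlap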

SLIDE 31

Nadaraya-Watson and Kernels

Here E[Y | X = x] = m(x). Write m as a function of densities:

m(x) = ∫ y f(y|x) dy = ∫ y f(y, x) dy / ∫ f(y, x) dy

Consider some bivariate kernel k, such that ∫ t k(t, u) dt = 0, and set κ(u) = ∫ k(t, u) dt.

The numerator can be estimated using

∫ y f̃(y, x) dy = (1/(nh²)) Σ_{i=1}^n ∫ y k( (y − y_i)/h, (x − x_i)/h ) dy
             = (1/(nh)) Σ_{i=1}^n ∫ y_i k( t, (x − x_i)/h ) dt
             = (1/(nh)) Σ_{i=1}^n y_i κ( (x − x_i)/h )

SLIDE 32

Nadaraya-Watson and Kernels

and for the denominator,

∫ f̃(y, x) dy = (1/(nh²)) Σ_{i=1}^n ∫ k( (y − y_i)/h, (x − x_i)/h ) dy = (1/(nh)) Σ_{i=1}^n κ( (x − x_i)/h )

Therefore, plugging these two expressions into the ratio yields

m̃(x) = Σ_{i=1}^n y_i κ_h(x − x_i) / Σ_{i=1}^n κ_h(x − x_i)

Observe that this regression estimator is a weighted average (see the linear predictor section),

m̃(x) = Σ_{i=1}^n ω_i(x) y_i with ω_i(x) = κ_h(x − x_i) / Σ_{j=1}^n κ_h(x − x_j)
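A minimal sketch of the Nadaraya-Watson estimator on the cars data (Gaussian kernel; the bandwidth h = 2 is arbitrary; base R's ksmooth does the same, with a different bandwidth convention):

# Nadaraya-Watson: kernel-weighted average of the responses
m_nw <- function(x, h = 2) {
  w <- dnorm((x - cars$speed) / h)
  sum(w * cars$dist) / sum(w)
}
xs <- seq(5, 25, by = 0.25)
plot(cars)
lines(xs, sapply(xs, m_nw))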

[Figure: Nadaraya-Watson fit on the cars data]

SLIDE 33

Nadaraya-Watson and Kernels

One can prove that the kernel regression bias is given by

E[m̃(x)] ∼ m(x) + h² ( (C1/2) m″(x) + C2 m′(x) f′(x)/f(x) )

while Var[m̃(x)] ∼ (C3/(nh)) σ²(x)/f(x). In this univariate case, one can easily get the kernel estimator of derivatives. Actually, m̃ is a function of the bandwidth h. Note: this can be extended to multivariate x.

SLIDE 34

Nadaraya-Watson and Kernels in Higher Dimension

Here

m̂_H(x) = Σ_{i=1}^n y_i k_H(x_i − x) / Σ_{i=1}^n k_H(x_i − x)

for some symmetric positive definite bandwidth matrix H, and k_H(x) = det[H]^{−1} k(H^{−1}x). Then

E[m̂_H(x)] ∼ m(x) + (C1/2) trace( H^T m″(x) H ) + C2 m′(x)^T H H^T ∇f(x) / f(x)

while

Var[m̂_H(x)] ∼ C3 / (n det(H)) · σ(x)/f(x)

Hence, if H = hI, h⋆ ∼ C n^{−1/(4+dim(x))}.

SLIDE 35

From kernels to k-nearest neighbours

An alternative is to consider

m̃_k(x) = (1/n) Σ_{i=1}^n ω_{i,k}(x) y_i where ω_{i,k}(x) = n/k if i ∈ I_x^k (and 0 otherwise)

with I_x^k = {i : x_i one of the k nearest observations to x}.

Lai (1977) Large sample properties of K-nearest neighbor procedures: if k → ∞ and k/n → 0 as n → ∞, then

E[m̃_k(x)] ∼ m(x) + (1/(24 f(x)³)) (m″f + 2m′f′)(x) (k/n)² while Var[m̃_k(x)] ∼ σ²(x)/k
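A minimal k-NN regression sketch on the cars data (k = 20 as above):

# k-NN regression: average the dist values of the k speeds closest to x
m_knn <- function(x, k = 20) {
  idx <- order(abs(cars$speed - x))[1:k]
  mean(cars$dist[idx])
}
xs <- seq(5, 25, by = 0.25)
plot(cars)
lines(xs, sapply(xs, m_knn), type = "s")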

[Figure: k-nearest-neighbour fit on the cars data]

SLIDE 36

From kernels to k-nearest neighbours

Remark: Brent & John (1985) Finding the median requires 2n comparisons considered some median-smoothing algorithm, where we consider the median over the k nearest neighbours (see section #4).

SLIDE 37

k-Nearest Neighbors and Curse of Dimensionality

The higher the dimension, the larger the distance to the closest neighbour,

min_{i∈{1,···,n}} { d(a, x_i) }, x_i ∈ R^d.

[Figure: distribution of the distance to the closest neighbour, in dimensions 1 to 5, for n = 10 and n = 100]

SLIDE 38

Bandwidth selection: MISE for Density

MSE[f̃(y)] = bias[f̃(y)]² + Var[f̃(y)]

MSE[f̃(y)] = (f(y)/(nh)) ∫ k(u)² du + h⁴ ( (f″(y)/2) ∫ k(u) u² du )² + o( h⁴ + 1/(nh) )

Bandwidth choice is based on minimization of the asymptotic integrated MSE (over y),

MISE(f̃) = ∫ MSE[f̃(y)] dy ∼ (1/(nh)) ∫ k(u)² du + h⁴ ∫ ( (f″(y)/2) ∫ k(u) u² du )² dy

SLIDE 39

Bandwidth selection: MISE for Density

Thus, the first-order condition yields

− C1/(nh²) + h³ C2 ∫ f″(y)² dy = 0

with C1 = ∫ k²(u) du and C2 = ( ∫ k(u) u² du )², and

h⋆ = n^{−1/5} ( C1 / ( C2 ∫ f″(y)² dy ) )^{1/5}

h⋆ = 1.06 n^{−1/5} √Var[Y], from Silverman (1986) Density Estimation,

> bw.nrd0(cars$speed)
[1] 2.150016
> bw.nrd(cars$speed)
[1] 2.532241

with the Scott correction, see Scott (1992) Multivariate Density Estimation.

SLIDE 40

Bandwidth selection: MISE for Regression Model

One can prove that

MISE[m̂_h] ∼ (h⁴/4) ( ∫ x² k(x) dx )² ∫ ( m″(x) + 2 m′(x) f′(x)/f(x) )² dx   [bias²]
           + (σ²/(nh)) ∫ k²(x) dx · ∫ dx/f(x)   [variance]

as n → ∞ and nh → ∞. The bias is sensitive to the position of the x_i's.

h⋆ = n^{−1/5} [ C1 ∫ dx/f(x) / ( C2 ∫ ( m″(x) + 2 m′(x) f′(x)/f(x) )² dx ) ]^{1/5}

Problem: this depends on the unknown f(x) and m(x).

SLIDE 41

Bandwidth Selection: Cross Validation

Let R(h) = E[ (Y − m̂_h(X))² ]. Natural idea:

R̂(h) = (1/n) Σ_{i=1}^n ( y_i − m̂_h(x_i) )²

Instead, use leave-one-out cross validation,

R̂(h) = (1/n) Σ_{i=1}^n ( y_i − m̂_h^{(i)}(x_i) )²

where m̂_h^{(i)} is the estimator obtained by omitting the i-th pair (y_i, x_i), or k-fold cross validation,

R̂(h) = (1/n) Σ_{j=1}^k Σ_{i∈I_j} ( y_i − m̂_h^{(j)}(x_i) )²

where m̂_h^{(j)} is the estimator obtained by omitting the pairs (y_i, x_i) with i ∈ I_j.
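A minimal leave-one-out sketch for the Nadaraya-Watson bandwidth on the cars data (Gaussian kernel; the grid of h values is arbitrary):

# Leave-one-out CV criterion for the Nadaraya-Watson estimator
loocv <- function(h) {
  pred <- sapply(seq_len(nrow(cars)), function(i) {
    w <- dnorm((cars$speed[i] - cars$speed[-i]) / h)
    sum(w * cars$dist[-i]) / sum(w)
  })
  mean((cars$dist - pred)^2)
}
hs <- seq(0.5, 10, by = 0.1)
hs[which.min(sapply(hs, loocv))]  # h* minimizing the criterion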

[Figure: undersmoothed and oversmoothed kernel fits on the cars data]

SLIDE 42

Bandwidth Selection: Cross Validation

Then find (numerically) h⋆ = argmin R̂(h). In the context of density estimation, see Chiu (1991) Bandwidth Selection for Kernel Density Estimation.

[Figure: cross-validation criterion as a function of the bandwidth]

Usual bias-variance tradeoff, or Goldilocks principle: h should be neither too small, nor too large
  • undersmoothed: bias too large, variance too small
  • oversmoothed: variance too large, bias too small

SLIDE 43

Local Linear Regression

Consider m̂(x) defined as m̂(x) = β̂0, where (β̂0, β̂) is the solution of

min_{(β0,β)} Σ_{i=1}^n ω_i^{(x)} ( y_i − [β0 + (x − x_i)^T β] )²

where ω_i^{(x)} = k_h(x − x_i), i.e. we seek the constant term in a weighted least squares regression of the y_i's on the (x − x_i)'s. If X_x is the matrix [1 (x − X)^T], and if W_x is the matrix diag[k_h(x − x_1), · · · , k_h(x − x_n)], then

m̂(x) = 1^T ( X_x^T W_x X_x )^{−1} X_x^T W_x y

This estimator is also a linear predictor:

m̂(x) = Σ_{i=1}^n a_i(x) y_i
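A minimal local-linear sketch on the cars data: the intercept of a kernel-weighted regression centred at x0 gives m̂(x0) (Gaussian weights; h = 2 is arbitrary):

# Local linear regression at x0: weighted LS of dist on (speed - x0); keep the intercept
m_ll <- function(x0, h = 2) {
  w <- dnorm((cars$speed - x0) / h)
  coef(lm(dist ~ I(speed - x0), data = cars, weights = w))[1]
}
xs <- seq(5, 25, by = 0.25)
plot(cars)
lines(xs, sapply(xs, m_ll))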

SLIDE 44

where

a_i(x) = (1/n) k_h(x − x_i) [ 1 − s1(x)^T s2(x)^{−1} (x − x_i)/h ]

with

s1(x) = (1/n) Σ_{i=1}^n k_h(x − x_i) (x − x_i)/h and s2(x) = (1/n) Σ_{i=1}^n k_h(x − x_i) [(x − x_i)/h] [(x − x_i)/h]^T

Note that the Nadaraya-Watson estimator was simply the solution of

min_{β0} Σ_{i=1}^n ω_i^{(x)} ( y_i − β0 )² where ω_i^{(x)} = k_h(x − x_i)

E[m̂(x)] ∼ m(x) + (h²/2) m″(x) µ2 where µ2 = ∫ k(u) u² du

Var[m̂(x)] ∼ (1/(nh)) ν σ²_x / f(x)
SLIDE 45

where ν = ∫ k(u)² du.

Thus, the kernel regression MSE is

(h⁴/4) ( m″(x) + 2 m′(x) f′(x)/f(x) )² µ2² + (1/(nh)) ν σ²_x / f(x)

[Figure: local linear fits on the cars data (vehicle speed vs. braking distance)]

SLIDE 46

> REG <- loess(dist ~ speed, cars, span = 0.75, degree = 1)
> predict(REG, data.frame(speed = seq(5, 25, 0.25)), se = TRUE)

[Figure: loess fits on the cars data, with pointwise standard errors]

SLIDE 47

Local polynomials

One might assume that, locally, m(u) ∼ µ_x(u) as u ∼ x, with

µ_x(u) = β0(x) + β1(x) [u − x] + β2(x) [u − x]²/2 + β3(x) [u − x]³/6 + · · ·

and we estimate β(x) by minimizing

Σ_{i=1}^n ω_i^{(x)} ( y_i − µ_x(x_i) )²

If X_x is the design matrix with rows ( 1, x_i − x, [x_i − x]²/2, [x_i − x]³/6, · · · ), then

β̂(x) = ( X_x^T W_x X_x )^{−1} X_x^T W_x y

(weighted least squares estimators).

> library(locfit)
> locfit(dist~speed, data=cars)

SLIDE 48

Series Regression

Recall that E[Y | X = x] = m(x). Why not approximate m by a linear combination of approximating functions h1(x), · · · , hk(x)? Set h(x) = (h1(x), · · · , hk(x)), and consider the regression of the y_i's on the h(x_i)'s,

y_i = h(x_i)^T β + ε_i

Then β̂ = (H^T H)^{−1} H^T y

[Figure: series regression fits on the cars data]

SLIDE 49

Series Regression: polynomials

Even if m(x) = E(Y | X = x) is not a polynomial function, a polynomial can still be a good approximation. From the Stone-Weierstrass theorem, if m(·) is continuous on some interval, then there is a uniform approximation of m(·) by polynomial functions.

> reg <- lm(y~x, data=db)

[Figure: linear fit on simulated data]

SLIDE 50

Series Regression: polynomials

Assume that m(x) = E(Y | X = x) = Σ_{i=0}^k α_i x^i, where the parameters α0, · · · , αk will be estimated (but not k).

> reg <- lm(y~poly(x,5), data=db)
> reg <- lm(y~poly(x,25), data=db)

[Figure: polynomial fits of degree 5 and degree 25]

SLIDE 51

Series Regression: (Linear) Splines

Consider m+1 knots on X, min{x_i} ≤ t0 ≤ t1 ≤ · · · ≤ tm ≤ max{x_i}; then define the linear (degree = 1) spline positive-part functions

b_{j,1}(x) = (x − t_j)_+ = x − t_j if x > t_j, and 0 otherwise

For linear splines, consider

y_i = β0 + β1 x_i + β2 (x_i − s)_+ + ε_i

> positive_part <- function(x) ifelse(x>0, x, 0)
> reg <- lm(Y~X+positive_part(X-s), data=db)

[Figure: linear spline fit with one knot]

SLIDE 52

Series Regression: (Linear) Splines

For linear splines, consider

y_i = β0 + β1 x_i + β2 (x_i − s1)_+ + β3 (x_i − s2)_+ + ε_i

> reg <- lm(Y~X+positive_part(X-s1)+positive_part(X-s2), data=db)
> library(splines)

(the bs() function used below comes from the splines package). A spline is a function defined by piecewise polynomials; b-splines are defined recursively.

[Figure: linear spline fit with two knots; cars data]

SLIDE 53

b-Splines (in Practice)

> reg1 <- lm(dist~speed+positive_part(speed-15), data=cars)
> reg2 <- lm(dist~bs(speed, df=2, degree=1), data=cars)

Consider m+1 knots on [0, 1], 0 ≤ t0 ≤ t1 ≤ · · · ≤ tm ≤ 1; then define recursively the b-splines as

b_{j,0}(t) = 1 if t_j ≤ t < t_{j+1}, and 0 otherwise

b_{j,n}(t) = (t − t_j)/(t_{j+n} − t_j) · b_{j,n−1}(t) + (t_{j+n+1} − t)/(t_{j+n+1} − t_{j+1}) · b_{j+1,n−1}(t)

[Figure: the two spline fits on the cars data]
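A direct (deliberately naive) implementation of that recursion, as a sketch:

# Cox-de Boor recursion for b-splines, exactly as written above
b_spline <- function(t, j, n, knots) {
  if (n == 0) return(as.numeric(knots[j] <= t & t < knots[j + 1]))
  w1 <- if (knots[j + n] > knots[j])
    (t - knots[j]) / (knots[j + n] - knots[j]) else 0
  w2 <- if (knots[j + n + 1] > knots[j + 1])
    (knots[j + n + 1] - t) / (knots[j + n + 1] - knots[j + 1]) else 0
  w1 * b_spline(t, j, n - 1, knots) + w2 * b_spline(t, j + 1, n - 1, knots)
}
ts <- seq(0, 1, by = 0.01)
plot(ts, b_spline(ts, 1, 2, knots = seq(0, 1, by = 0.2)), type = "l")  # one quadratic b-spline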

SLIDE 54

b-Splines (in Practice)

> summary(reg1)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    -7.6519    10.6254  -0.720    0.475
speed           3.0186     0.8627   3.499    0.001 **
(speed-15)+     1.7562     1.4551   1.207    0.233

> summary(reg2)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)     4.423      7.343   0.602   0.5493
bs(speed)1     33.205      9.489   3.499   0.0012 **
bs(speed)2     80.954      8.788   9.211  4.2e-12 ***

[Figure: the two spline fits on the cars data, with the basis functions]

SLIDE 55

b and p-Splines

Note that those spline functions define an orthonormal basis. O'Sullivan (1986) A statistical perspective on ill-posed inverse problems suggested a penalty on the second derivative of the fitted curve (see #3):

m̂(x) = argmin { Σ_{i=1}^n ( y_i − b(x_i)^T β )² + λ ∫_R ( b″(t)^T β )² dt }

[Figure: penalized spline fits on the cars data, with the basis functions]

SLIDE 56

Adding Constraints: Convex Regression

Assume that y_i = m(x_i) + ε_i where m: R^d → R is some convex function.

m is convex if and only if ∀x1, x2 ∈ R^d, ∀t ∈ [0, 1],

m(t x1 + [1 − t] x2) ≤ t m(x1) + [1 − t] m(x2)

Proposition (Hildreth (1954) Point Estimates of Ordinates of Concave Functions). Let

m⋆ = argmin_{m convex} Σ_{i=1}^n ( y_i − m(x_i) )²

Then θ⋆ = (m⋆(x1), · · · , m⋆(xn)) is unique.

Let y = θ + ε; then

θ⋆ = argmin_{θ∈K} Σ_{i=1}^n ( y_i − θ_i )²

where K = {θ ∈ R^n : ∃m convex, m(x_i) = θ_i}. I.e. θ⋆ is the projection of y onto the (closed) convex cone K. The projection theorem gives existence and uniqueness.

SLIDE 57

Adding Constraints: Convex Regression

In dimension 1: y_i = m(x_i) + ε_i. Assume that observations are ordered, x1 < x2 < · · · < xn. Here

K = { θ ∈ R^n : (θ2 − θ1)/(x2 − x1) ≤ (θ3 − θ2)/(x3 − x2) ≤ · · · ≤ (θn − θn−1)/(xn − xn−1) }

Hence, a quadratic program with n − 2 linear constraints; m⋆ is a piecewise linear function (interpolation of consecutive pairs (x_i, θ⋆_i)).

If m is differentiable, m is convex if m(x) + ∇m(x)·[y − x] ≤ m(y).

[Figure: convex regression on the cars data]
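That quadratic program can be solved directly; a sketch with the quadprog package (duplicated speeds are aggregated to their mean dist, an assumption made here so that the x_i's are distinct):

# Convex regression in dimension 1 as a quadratic program
library(quadprog)
x <- sort(unique(cars$speed))
y <- as.numeric(tapply(cars$dist, cars$speed, mean))
n <- length(x)
# one column per constraint: slope on (j, j+1) <= slope on (j+1, j+2)
A <- matrix(0, n, n - 2)
for (j in 1:(n - 2)) {
  A[j,     j] <-  1 / (x[j + 1] - x[j])
  A[j + 1, j] <- -1 / (x[j + 1] - x[j]) - 1 / (x[j + 2] - x[j + 1])
  A[j + 2, j] <-  1 / (x[j + 2] - x[j + 1])
}
# minimize ||y - theta||^2 subject to t(A) %*% theta >= 0
theta <- solve.QP(Dmat = diag(n), dvec = y, Amat = A, bvec = rep(0, n - 2))$solution
plot(x, y)
lines(x, theta)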

SLIDE 58

Adding Constraints: Convex Regression

More generally: if m is convex, then there exists ξ_x ∈ R^n such that

m(x) + ξ_x · [y − x] ≤ m(y)

ξ_x is a subgradient of m at x, and then

∂m(x) = { ξ : m(x) + ξ · [y − x] ≤ m(y), ∀y ∈ R^n }

Hence, θ⋆ is the solution of

argmin ‖y − θ‖² subject to θ_i + ξ_i [x_j − x_i] ≤ θ_j, ∀i, j, with ξ1, · · · , ξn ∈ R^n.

SLIDE 59

Spatial Smoothing

One can also consider some spatial smoothing, if we want to predict E[Y | X = x] for some coordinate x.

> library(rgeos)
> library(rgdal)
> library(maptools)
> library(cartography)
> download.file("http://bit.ly/2G3KIUG", "zonier.RData")
> load("zonier.RData")
> cols <- rev(carto.pal(pal1="red.pal", n1=10, pal2="green.pal", n2=10))
> download.file("http://bit.ly/2GSvzGW", "FRA_adm0.rds")
> download.file("http://bit.ly/2FUZ0Lz", "FRA_adm2.rds")
> FR <- readRDS("FRA_adm2.rds")
> donnees_carte <- data.frame(FR@data)

SLIDE 60

Spatial Smoothing

> FR0 <- readRDS("FRA_adm0.rds")
> plot(FR0)
> bk <- seq(-5, 4.5, length=21)
> cuty <- cut(simbase$Y, breaks=bk, labels=1:20)
> points(simbase$long, simbase$lat, col=cols[cuty], pch=19, cex=.5)

One can consider a choropleth map (spatial version of the histogram).

SLIDE 61

Spatial Smoothing

> A <- aggregate(x = simbase$Y, by=list(simbase$dpt), mean)
> names(A) <- c("dpt","y")
> d <- donnees_carte$CCA_2
> d[d=="2A"] <- "201"
> d[d=="2B"] <- "202"
> donnees_carte$dpt <- as.numeric(as.character(d))
> donnees_carte <- merge(donnees_carte, A, all.x=TRUE)
> donnees_carte <- donnees_carte[order(donnees_carte$OBJECTID),]
> bk <- seq(-2.75, 2.75, length=21)
> donnees_carte$cuty <- cut(donnees_carte$y, breaks=bk, labels=1:20)
> plot(FR, col=cols[donnees_carte$cuty], xlim=c(-5.2,12))

SLIDE 62

Spatial Smoothing

Instead of a "continuous" gradient of colors, one can consider only 4 colors (4 levels) for the prediction.

> bk <- seq(-2.75, 2.75, length=5)
> donnees_carte$cuty <- cut(donnees_carte$y, breaks=bk, labels=1:4)
> plot(FR, col=cols[c(3,8,12,17)][donnees_carte$cuty], xlim=c(-5.2,12))

SLIDE 63

Spatial Smoothing

> P1 <- FR0@polygons[[1]]@Polygons[[355]]@coords
> P2 <- FR0@polygons[[1]]@Polygons[[27]]@coords
> plot(FR0, border=NA)
> polygon(P1)
> polygon(P2)
> grille <- expand.grid(seq(min(simbase$long), max(simbase$long), length=101),
+                       seq(min(simbase$lat), max(simbase$lat), length=101))
> paslong <- (max(simbase$long) - min(simbase$long))/100
> paslat  <- (max(simbase$lat) - min(simbase$lat))/100

SLIDE 64

Spatial Smoothing

We need to create a grid (i.e. values of X) on which we approximate E[Y | X = x].

> f <- function(i){
+   (point.in.polygon(grille[i,1]+paslong/2, grille[i,2]+paslat/2, P1[,1], P1[,2]) > 0) +
+   (point.in.polygon(grille[i,1]+paslong/2, grille[i,2]+paslat/2, P2[,1], P2[,2]) > 0)
+ }
> indic <- unlist(lapply(1:nrow(grille), f))
> grille <- grille[which(indic==1),]
> points(grille[,1]+paslong/2, grille[,2]+paslat/2)

SLIDE 65

Spatial Smoothing

Consider here some k-NN, with k = 20:

> library(geosphere)
> knn <- function(i, k=20){
+   d <- distHaversine(grille[i,1:2], simbase[,c("long","lat")], r=6378.137)
+   r <- rank(d)
+   ind <- which(r<=k)
+   mean(simbase[ind,"Y"])
+ }
> grille$y <- Vectorize(knn)(1:nrow(grille))
> bk <- seq(-2.75, 2.75, length=21)
> grille$cuty <- cut(grille$y, breaks=bk, labels=1:20)
> points(grille[,1]+paslong/2, grille[,2]+paslat/2, col=cols[grille$cuty], pch=19)

SLIDE 66

Spatial Smoothing

Again, instead of a "continuous" gradient, we can use 4 levels:

> bk <- seq(-2.75, 2.75, length=5)
> grille$cuty <- cut(grille$y, breaks=bk, labels=1:4)
> plot(FR0, border=NA)
> polygon(P1)
> polygon(P2)
> points(grille[,1]+paslong/2, grille[,2]+paslat/2, col=cols[c(3,8,12,17)][grille$cuty], pch=19)

SLIDE 67

Testing (Non-)Linearities

In the linear model,

ŷ = X β̂ = X [X^T X]^{−1} X^T y = H y

H_{i,i} is the leverage of the i-th element of this hat matrix. Write

ŷ_i = Σ_{j=1}^n [ x_i^T [X^T X]^{−1} X^T ]_j y_j = Σ_{j=1}^n [H(x_i)]_j y_j where H(x) = x^T [X^T X]^{−1} X^T

The prediction is

m̂(x) = E(Y | X = x) = Σ_{j=1}^n [H(x)]_j y_j

SLIDE 68

Testing (Non-)Linearities

More generally, a predictor m̂ is said to be linear if, for all x, there is S(·): R^n → R^n such that

m̂(x) = Σ_{j=1}^n S(x)_j y_j

Conversely, given y1, · · · , yn, there is an n × n matrix S such that ŷ = S y. For the linear model, S = H, and trace(H) = dim(β): the degrees of freedom.

H_{i,i} / (1 − H_{i,i}) is related to Cook's distance, from Cook (1977) Detection of Influential Observations in Linear Regression.

SLIDE 69

Testing (Non-)Linearities

For a kernel regression model, with kernel k and bandwidth h,

S_{i,j}^{(k,h)} = k_h(x_i − x_j) / Σ_{ℓ=1}^n k_h(x_ℓ − x_j)

where k_h(·) = k(·/h), while

S^{(k,h)}(x)_j = k_h(x − x_j) / Σ_{ℓ=1}^n k_h(x − x_ℓ)

For a k-nearest neighbour, S_{i,j}^{(k)} = (1/k) 1(j ∈ I_{x_i}) where I_{x_i} are the k nearest observations to x_i, while S^{(k)}(x)_j = (1/k) 1(j ∈ I_x).

SLIDE 70

Testing (Non-)Linearities

Observe that trace(S) is usually seen as a degree of smoothness. Do we have to smooth? Isn't the linear model sufficient? Define

T = ‖S y − H y‖ / trace( [S − H]^T [S − H] )

If the model is linear, then T has a Fisher distribution.

Remark: in the case of a linear predictor, with smoothing matrix S_h,

R̂(h) = (1/n) Σ_{i=1}^n ( y_i − m̂_h^{(−i)}(x_i) )² = (1/n) Σ_{i=1}^n ( [y_i − m̂_h(x_i)] / [1 − [S_h]_{i,i}] )²

so we do not need to estimate n models. One can also minimize

GCV(h) = n² / (n − trace(S))² · (1/n) Σ_{i=1}^n ( y_i − m̂_h(x_i) )² ∼ Mallows' Cp

SLIDE 71

Confidence Intervals

If ŷ = m̂_h(x) = S_h(x) y, let

σ̂² = (1/n) Σ_{i=1}^n ( y_i − m̂_h(x_i) )²

and a confidence interval is, at x,

m̂_h(x) ± t_{1−α/2} σ̂ √( S_h(x) S_h(x)^T )

[Figure: pointwise confidence interval around a smooth fit; cars data (vehicle speed vs. braking distance)]

SLIDE 72

Confidence Bands

[Figure: confidence bands around the smooth fit of dist vs. speed, two panels]

SLIDE 73

Confidence Bands

Also called variability bands for functions in Härdle (1990) Applied Nonparametric Regression. From Collomb (1979) Conditions nécessaires et suffisantes de convergence uniforme d'un estimateur de la régression, with kernel (Nadaraya-Watson) regression,

sup_x { |m(x) − m̂_h(x)| } ∼ C1 h² + C2 √( log n / (nh) )

sup_x { |m(x) − m̂_h(x)| } ∼ C1 h² + C2 √( log n / (n h^{dim(x)}) )

SLIDE 74

Confidence Bands

So far, we have mainly discussed pointwise convergence, with

√(nh) ( m̂_h(x) − m(x) ) →_L N(µ_x, σ²_x).

This asymptotic normality can be used to derive (pointwise) confidence intervals,

P( IC⁻(x) ≤ m(x) ≤ IC⁺(x) ) = 1 − α, ∀x ∈ X.

But we can also seek uniform convergence properties: we want to derive functions IC± such that

P( IC⁻(x) ≤ m(x) ≤ IC⁺(x), ∀x ∈ X ) = 1 − α.

SLIDE 75

Confidence Bands

  • Bonferroni's correction

Use a standard Gaussian (pointwise) confidence interval,

IC±⋆(x) = m̂(x) ± t_{1−α/2} σ̂ / √(nh)

and also take into account the regularity of m: set

V(η) = (1/2) [ (2η + 1)/n + 1/n ] ‖m′‖_{∞,x}, for some 0 < η < 1

where ‖m′‖_{∞,x} is the sup-norm of m′ on a neighborhood of x. Then consider IC±(x) = IC±⋆(x) ± V(η).

SLIDE 76

Confidence Bands

  • Use of Gaussian processes

Observe that √(nh) ( m̂_h(x) − m(x) ) →_D G_x for some Gaussian process (G_x). Confidence bands are derived from the quantiles of sup{G_x, x ∈ X}. If we use kernel k for smoothing, Johnston (1982) Probabilities of Maximal Deviations for Nonparametric Regression Function Estimates proved that

G_x = ∫ k(x − t) dW_t, for some standard Wiener process (W_t),

is then a Gaussian process with variance ∫ k(x) k(t − x) dt. And

IC±(x) = m̂(x) ± ( √(2 log(1/h)) + d_n ) σ̂ / √(nh)

with

d_n = √(2 log h^{−1}) + (1/√(2 log h^{−1})) log( 3/(4π²) )

where q_α solves exp(−2 exp(−q_α)) = 1 − α.

SLIDE 77

Confidence Bands

  • Bootstrap (see #2)

Finally, McDonald (1986) Smoothing with Split Linear Fits suggested a bootstrap algorithm to approximate the distribution of Z_n = sup{ |m̂(x) − m(x)|, x ∈ X }.

SLIDE 78

Confidence Bands

Depending on the smoothing parameter h, we get different corrections.

[Figure: confidence bands for one choice of h]

SLIDE 79

Confidence Bands

Depending on the smoothing parameter h, we get different corrections.

[Figure: confidence bands for another choice of h]

SLIDE 80

Boosting to Capture NonLinear Effects

We want to solve

m⋆ = argmin E[ (Y − m(X))² ]

The heuristics is simple: we consider an iterative process where we keep modeling the errors. Fit a model for y, h1(·), from y and X, and compute the error ε1 = y − h1(X). Fit a model for ε1, h2(·), from ε1 and X, and compute the error ε2 = ε1 − h2(X), etc. Then set

m_k(·) = h1(·) [∼ y] + h2(·) [∼ ε1] + h3(·) [∼ ε2] + · · · + h_k(·) [∼ ε_{k−1}]

Hence, we consider an iterative procedure, m_k(·) = m_{k−1}(·) + h_k(·).

SLIDE 81

Boosting

h(x) = y − m_k(x), which can be interpreted as a residual. Note that this residual is the gradient of (1/2)[y − m_k(x)]².

A gradient descent is based on the Taylor expansion

f(x_k) ∼ f(x_{k−1}) + (x_k − x_{k−1}) ∇f(x_{k−1})

But here, it is different. We claim we can write

f_k(x) ∼ f_{k−1}(x) + (f_k − f_{k−1}) ?

where ? is interpreted as a 'gradient'.

SLIDE 82

Boosting

Construct iteratively

m_k(·) = m_{k−1}(·) + argmin_{h∈H} Σ_{i=1}^n ( y_i − [m_{k−1}(x_i) + h(x_i)] )²

m_k(·) = m_{k−1}(·) + argmin_{h∈H} Σ_{i=1}^n ( [y_i − m_{k−1}(x_i)] − h(x_i) )²

where h ∈ H means that we seek in a class of weak learner functions.

If the learners are too strong, the first loop leads to some fixed point, and there is no learning procedure; see the linear regression y = x^T β + ε: since ε̂ ⊥ x, we cannot learn from the residuals. In order to make sure that we learn weakly, we can use some shrinkage parameter ν (or a collection of parameters ν_j), as in the sketch below.
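A minimal L2-boosting sketch, with stumps (depth-1 rpart trees) as weak learners; K = 100 and ν = 0.1 are arbitrary choices:

# L2-boosting: repeatedly fit a weak learner to the residuals, add a shrunken update
library(rpart)
boost_fit <- function(x, y, K = 100, nu = 0.1) {
  pred <- rep(mean(y), length(y))
  for (k in 1:K) {
    eps  <- y - pred                       # current residuals
    h_k  <- rpart(eps ~ x, maxdepth = 1)   # stump fitted on the residuals
    pred <- pred + nu * predict(h_k)       # epsilon_k = epsilon_{k-1} - nu * h_k(x)
  }
  pred
}
plot(cars$speed, cars$dist)
ord <- order(cars$speed)
lines(cars$speed[ord], boost_fit(cars$speed, cars$dist)[ord])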

SLIDE 83

Boosting with Piecewise Linear Spline & Stump Functions

Instead of ε_k = ε_{k−1} − h_k(x), set ε_k = ε_{k−1} − ν·h_k(x). Remark: stumps are related to regression trees (see the 2015 course).

[Figure: boosting iterations with piecewise-linear splines and with stumps]

SLIDE 84

Ruptures

One can use the Chow test to test for a rupture. Note that it is simply a Fisher test with two parts,

β = β1 for i = 1, · · · , i0 and β = β2 for i = i0 + 1, · · · , n

and we test H0: β1 = β2 against H1: β1 ≠ β2.

i0 is a point between k and n − k (we need enough observations). Chow (1960) Tests of Equality Between Sets of Coefficients in Two Linear Regressions suggested

F_{i0} = ( η̂^T η̂ − ε̂^T ε̂ ) / ( ε̂^T ε̂ / (n − 2k) )

where ε̂_i = y_i − x_i^T β̂, and η̂_i = y_i − x_i^T β̂1 for i = k, · · · , i0 and η̂_i = y_i − x_i^T β̂2 for i = i0 + 1, · · · , n − k.
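A hand-rolled version on the cars data, as a sketch (a textbook Chow statistic with k = 2 parameters per regime, without the trimming used above; i0 = 25 is an arbitrary candidate break):

# Chow-type F statistic at a candidate break i0 (cars is already ordered by speed)
chow_F <- function(i0) {
  n <- nrow(cars)
  k <- 2                                             # intercept + slope
  e <- c(residuals(lm(dist ~ speed, cars[1:i0, ])),  # unrestricted: two regimes
         residuals(lm(dist ~ speed, cars[(i0 + 1):n, ])))
  u <- residuals(lm(dist ~ speed, cars))             # restricted: a single regression
  ((sum(u^2) - sum(e^2)) / k) / (sum(e^2) / (n - 2 * k))
}
chow_F(25)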

SLIDE 85

Ruptures

> Fstats(dist ~ speed, data=cars, from=7/50)

[Figure: cars data, and the F statistics along the index]

SLIDE 86

Ruptures

> Fstats(dist ~ speed, data=cars, from=2/50)

[Figure: cars data, and the F statistics along the index, with less trimming]

SLIDE 87

Ruptures

If i0 is unknown, use CUSUM-type tests, see Ploberger & Krämer (1992) The CUSUM Test with OLS Residuals. For all t ∈ [0, 1], set

W_t = (1/(σ̂ √n)) Σ_{i=1}^{⌊nt⌋} ε̂_i.

If α is the confidence level, bounds are generally ±α, even if theoretical bounds should be ±α √(t(1 − t)).

> cusum <- efp(dist ~ speed, type = "OLS-CUSUM", data=cars)
> plot(cusum, ylim=c(-2,2))
> plot(cusum, alpha = 0.05, alt.boundary = TRUE, ylim=c(-2,2))

SLIDE 88

Ruptures

[Figure: OLS-based CUSUM test, and OLS-based CUSUM test with alternative boundaries]

SLIDE 89

From a Rupture to a Discontinuity

See Imbens & Lemieux (2008) Regression Discontinuity Designs.

SLIDE 90

From a Rupture to a Discontinuity

Consider the dataset from Lee (2008) Randomized experiments from non-random selection in U.S. House elections.

> library(RDDtools)
> data(Lee2008)

We want to test if there is a discontinuity at 0,
  • with parametric tools
  • with nonparametric tools

[Figure: Lee (2008) data, y against x, around the cutoff x = 0]

SLIDE 91

Testing for a rupture

Use some 4th-order polynomial, on each part:

> idx1 <- (Lee2008$x > 0)
> reg1 <- lm(y~poly(x,4), data=Lee2008[idx1,])
> idx2 <- (Lee2008$x < 0)
> reg2 <- lm(y~poly(x,4), data=Lee2008[idx2,])
> s1 <- predict(reg1, newdata=data.frame(x=0))
> s2 <- predict(reg2, newdata=data.frame(x=0))
> abs(s1-s2)
         1
0.07659014

[Figure: the two polynomial fits on each side of the cutoff]

SLIDE 92

Testing for a rupture

> reg_para <- RDDreg_lm(RDDdata(y = Lee2008$y, x = Lee2008$x, cutpoint = 0), order = 4)
> reg_para
### RDD regression: parametric ###
   Polynomial order: 4
   Slopes: separate
   Number of obs: 6558 (left: 2740, right: 3818)

   Coefficient:
  Estimate Std. Error t value  Pr(>|t|)
D 0.076590   0.013239  5.7851 7.582e-09 ***

[Figure: parametric RDD fit on the Lee (2008) data]

SLIDE 93

Testing for a rupture

Or use a simple local regression, see Imbens & Kalyanaraman (2012).

> reg1 <- ksmooth(Lee2008$x[idx1], Lee2008$y[idx1], kernel = "normal", bandwidth = 0.1)
> reg2 <- ksmooth(Lee2008$x[idx2], Lee2008$y[idx2], kernel = "normal", bandwidth = 0.1)
> s1 <- reg1$y[1]
> s2 <- reg2$y[length(reg2$y)]
> abs(s1-s2)
[1] 0.09883813

[Figure: local regression fits on each side of the cutoff]

SLIDE 94

Testing for a rupture

> reg_nonpara <- RDDreg_np(RDDobject = Lee2008_rdd, bw = .1)
> print(reg_nonpara)
### RDD regression: nonparametric local linear ###
   Bandwidth: 0.1
   Number of obs: 1209 (left: 577, right: 632)

   Coefficient:
  Estimate Std. Error z value  Pr(>|z|)
D 0.059397   0.014119   4.207 2.588e-05 ***

(Lee2008_rdd is presumably the RDDdata object built on the earlier slide.)

[Figure: nonparametric RDD fit on the Lee (2008) data, and sensitivity of the estimate to the bandwidth]