Great plot. Now need to find the theory that explains it (Deville) - PowerPoint PPT Presentation

  1. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Advanced Econometrics #3: Model & Variable Selection. A. Charpentier (Université de Rennes 1), Graduate Course, 2018. @freakonometrics

  2. "Great plot. Now need to find the theory that explains it" Deville (2017), http://twitter.com

  3. Preliminary Results: Numerical Optimization. Problem: x⋆ ∈ argmin{f(x); x ∈ ℝ^d}. Gradient descent: x_{k+1} = x_k − η∇f(x_k), starting from some x_0. Problem: x⋆ ∈ argmin{f(x); x ∈ X ⊂ ℝ^d}. Projected descent: x_{k+1} = Π_X(x_k − η∇f(x_k)), starting from some x_0. A constrained problem is said to be convex if it is min{f(x)} with f convex, subject to g_i(x) = 0 for all i = 1, ..., n with g_i linear, and h_i(x) ≤ 0 for all i = 1, ..., m with h_i convex. Lagrangian: L(x, λ, µ) = f(x) + Σ_{i=1}^n λ_i g_i(x) + Σ_{i=1}^m µ_i h_i(x), where x are primal variables and (λ, µ) are dual variables. Remark: L is an affine function in (λ, µ).
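The two descent updates can be sketched in a few lines of Python. The quadratic objective and the box constraint below are illustrative choices, not from the slides:

```python
# Gradient descent and projected gradient descent on a simple convex
# objective f(x) = (x1 - 2)^2 + (x2 - 2)^2 (an illustrative example).

def grad_descent(grad, x0, eta=0.1, iters=200):
    """Plain descent: x_{k+1} = x_k - eta * grad f(x_k)."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

def projected_descent(grad, proj, x0, eta=0.1, iters=200):
    """Projected descent: x_{k+1} = Pi_X(x_k - eta * grad f(x_k))."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = proj([xi - eta * gi for xi, gi in zip(x, g)])
    return x

grad_f = lambda x: [2 * (x[0] - 2), 2 * (x[1] - 2)]
proj_box = lambda x: [min(max(xi, 0.0), 1.0) for xi in x]  # Pi onto [0,1]^2

print(grad_descent(grad_f, [0.0, 0.0]))                 # converges to (2, 2)
print(projected_descent(grad_f, proj_box, [0.0, 0.0]))  # converges to (1, 1)
```

The unconstrained iterates reach the minimizer (2, 2); with the box constraint the projection clamps the iterates to the nearest feasible point (1, 1).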

  4. Preliminary Results: Numerical Optimization. Karush–Kuhn–Tucker conditions: a convex problem has a solution x⋆ if and only if there are (λ⋆, µ⋆) such that the following conditions hold:
  • stationarity: ∇_x L(x, λ, µ) = 0 at (x⋆, λ⋆, µ⋆)
  • primal admissibility: g_i(x⋆) = 0 and h_i(x⋆) ≤ 0, for all i
  • dual admissibility: µ⋆ ≥ 0
  Let L denote the associated dual function, L(λ, µ) = min_x {L(x, λ, µ)}. L is a concave function in (λ, µ), and the dual problem is max_{λ,µ} {L(λ, µ)}.
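The stationarity and primal-admissibility conditions can be checked numerically on a toy convex problem. The objective, the single linear equality constraint, and the candidate multiplier below are illustrative assumptions, not from the slides:

```python
# KKT check on: minimize (x1 - 1)^2 + (x2 - 2)^2 subject to x1 + x2 - 1 = 0.
# Solving the stationarity system by hand gives x* = (0, 1), lambda* = 2.
# (No inequality constraints here, so dual admissibility is vacuous.)
grad_f = lambda x: [2 * (x[0] - 1), 2 * (x[1] - 2)]  # gradient of the objective
grad_g = [1.0, 1.0]                                   # gradient of g(x) = x1 + x2 - 1

x_star, lam = [0.0, 1.0], 2.0
stationarity = [gf + lam * gg for gf, gg in zip(grad_f(x_star), grad_g)]
primal = x_star[0] + x_star[1] - 1

print(stationarity, primal)   # [0.0, 0.0] 0.0  -> both KKT conditions hold
```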

  5. References. Motivation: Banerjee, A., Chandrasekhar, A.G., Duflo, E. & Jackson, M.O. (2016). Gossip: Identifying Central Individuals in a Social Network. References: Belloni, A. & Chernozhukov, V. (2009). Least Squares After Model Selection in High-Dimensional Sparse Models. Hastie, T., Tibshirani, R. & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.

  6. Preamble. Assume that y = m(x) + ε, where ε is some idiosyncratic, unpredictable noise. The error E[(y − m̂(x))²] is the sum of three terms:
  • variance of the estimator: E[(m̂(x) − E[m̂(x)])²]
  • bias² of the estimator: (m(x) − E[m̂(x)])²
  • variance of the noise: E[(y − m(x))²]
  (the latter exists, even with a 'perfect' model).
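A Monte Carlo sketch of this three-term decomposition; the target value, noise level, sample size, and the deliberate bias below are made-up numbers:

```python
# Monte Carlo check of E[(y - m_hat)^2] = Var(m_hat) + bias^2 + Var(noise),
# at a fixed point x0 where m(x0) = 1, using a deliberately biased estimator
# (the sample mean plus 0.2). All constants are illustrative.
import random
random.seed(1)

m_x0, sd, n, bias = 1.0, 0.5, 20, 0.2
R = 20000

errs = []
for _ in range(R):
    train = [m_x0 + random.gauss(0, sd) for _ in range(n)]
    m_hat = sum(train) / n + bias          # biased estimator of m(x0)
    y_new = m_x0 + random.gauss(0, sd)     # a fresh observation
    errs.append((y_new - m_hat) ** 2)

mc = sum(errs) / R
theory = sd**2 / n + bias**2 + sd**2       # variance + bias^2 + noise
print(round(mc, 3), round(theory, 3))      # the two values agree closely
```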

  7. Preamble. Consider a parametric model, with true (unknown) parameter θ; then
  mse(θ̂) = E[(θ̂ − θ)²] = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ)²
  where the first term is the variance and the second the bias². Let θ̃ denote an unbiased estimator of θ. Then the shrunken estimator
  θ̂ = θ̃ · θ²/(θ² + mse(θ̃)) = θ̃ − θ̃ · mse(θ̃)/(θ² + mse(θ̃))
  (the second factor acting as a penalty) satisfies mse(θ̂) ≤ mse(θ̃).
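A simulation illustrating the shrinkage result; the values of θ and of mse(θ̃) are arbitrary:

```python
# Shrinking an unbiased estimator trades a little bias for less variance:
# mse(theta_hat) = theta^2 * v / (theta^2 + v) <= v = mse(theta_tilde).
# theta = 1 and v = 0.5 are made-up values for illustration.
import random
random.seed(2)

theta, v = 1.0, 0.5
c = theta**2 / (theta**2 + v)       # shrinkage factor; 1 - c is the penalty
R = 50000
se_unb = se_shr = 0.0
for _ in range(R):
    t = random.gauss(theta, v ** 0.5)   # unbiased estimator, variance v
    se_unb += (t - theta) ** 2
    se_shr += (c * t - theta) ** 2
mse_unb, mse_shr = se_unb / R, se_shr / R
print(round(mse_unb, 3), round(mse_shr, 3))  # shrunken MSE is smaller
```

Here the theoretical values are mse(θ̃) = 0.5 and mse(θ̂) = 0.5/1.5 = 1/3.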

  8. Bayes vs. Frequentist, inference on heads/tails. Consider some Bernoulli sample x = {x_1, x_2, ..., x_n}, where x_i ∈ {0, 1}. The X_i's are i.i.d. B(p) variables, with f_X(x) = p^x [1 − p]^{1−x}, x ∈ {0, 1}. Standard frequentist approach:
  p̂ = (1/n) Σ_{i=1}^n x_i = argmax_p { Π_{i=1}^n f_X(x_i) } = argmax_p L(p; x)
  From the central limit theorem,
  √n (p̂ − p)/√(p(1 − p)) → N(0, 1) in distribution as n → ∞,
  and we can derive an approximated 95% confidence interval
  p̂ ± (1.96/√n) √(p̂(1 − p̂))
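With the insurance numbers appearing elsewhere in the deck (159 losses out of 1,047 contracts), the MLE and its CLT-based interval can be computed directly:

```python
# MLE and approximate 95% confidence interval for a Bernoulli proportion,
# using x = 159 "successes" out of n = 1047 (the deck's example numbers).
from math import sqrt

n, x = 1047, 159
p_hat = x / n                                 # MLE: the sample mean
half = 1.96 * sqrt(p_hat * (1 - p_hat) / n)   # CLT-based half-width
ci = (p_hat - half, p_hat + half)
print(round(p_hat, 4), tuple(round(b, 4) for b in ci))
```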

  9. Bayes vs. Frequentist, inference on heads/tails. Example: out of 1,047 contracts, 159 claimed a loss. [Figure: probability of the number of insured claiming a loss, comparing the (true) binomial distribution with its Poisson and Gaussian approximations.]
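The figure's comparison can be reproduced numerically by evaluating each distribution at the observed count (working in log space to avoid overflow):

```python
# At k = 159 out of n = 1047: exact binomial probability vs. its
# Poisson and Gaussian approximations, computed via log-gamma.
from math import exp, lgamma, log, pi, sqrt

n, k = 1047, 159
p = k / n
lam = n * p                          # Poisson mean
mu, sd = n * p, sqrt(n * p * (1 - p))

log_binom = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
             + k * log(p) + (n - k) * log(1 - p))
binom = exp(log_binom)
poisson = exp(k * log(lam) - lam - lgamma(k + 1))
gauss = exp(-((k - mu) ** 2) / (2 * sd**2)) / (sd * sqrt(2 * pi))
print(binom, poisson, gauss)   # Gaussian tracks the binomial closely here
```

As the figure suggests, at this sample size the Gaussian approximation is very close to the binomial, while the Poisson one is slightly off (it overstates the variance when p is not small).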

  10. Bayes's theorem. Consider some hypothesis H and some evidence E; then
  P_E(H) = P(H | E) = P(H ∩ E)/P(E) = P(H) · P(E | H)/P(E)
  Bayes' rule contrasts the prior probability P(H) with the posterior probability after receiving evidence E, P_E(H) = P(H | E). In Bayesian (parametric) statistics, H = {θ ∈ Θ} and E = {X = x}, and Bayes' theorem reads
  π(θ | x) = π(θ) · f(x | θ)/f(x) = π(θ) · f(x | θ)/∫ f(x | θ) π(θ) dθ ∝ π(θ) · f(x | θ)
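A minimal discrete instance of the theorem in the heads/tails setting; the two candidate values of p and the uniform prior are made up for illustration:

```python
# Bayes' rule over two hypotheses about a coin's bias p, updated after
# observing one head. Candidate values and prior are illustrative.
priors = {0.25: 0.5, 0.5: 0.5}                       # P(H)
likelihood = lambda p, x: p if x == 1 else 1 - p     # P(E | H)

x = 1                                                # evidence: one head
evidence = sum(likelihood(p, pr_p) if False else likelihood(p, x) * pr
               for p, pr in priors.items())          # P(E), by total probability
posterior = {p: likelihood(p, x) * pr / evidence for p, pr in priors.items()}
print(posterior)   # posterior mass shifts toward p = 0.5
```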

  11. Small Data and Black Swans. Consider the sample x = {0, 0, 0, 0, 0}. Here the likelihood is
  f(x_i | θ) = θ^{x_i} [1 − θ]^{1−x_i},  f(x | θ) = θ^{x⊤1} [1 − θ]^{n−x⊤1}
  and we need an a priori distribution π(·), e.g. a beta distribution
  π(θ) = θ^{α−1} [1 − θ]^{β−1}/B(α, β)
  which yields the posterior
  π(θ | x) = θ^{α+x⊤1−1} [1 − θ]^{β+n−x⊤1−1}/B(α + x⊤1, β + n − x⊤1)
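This conjugate update can be computed directly; a uniform Beta(1, 1) prior is assumed here:

```python
# Beta-Bernoulli update for the "black swan" sample x = {0,0,0,0,0}:
# Beta(alpha, beta) prior -> Beta(alpha + s, beta + n - s) posterior,
# where s is the number of ones. Prior alpha = beta = 1 is an assumption.
alpha, beta = 1.0, 1.0
x = [0, 0, 0, 0, 0]
s, n = sum(x), len(x)

a_post, b_post = alpha + s, beta + n - s   # Beta(1, 6) here
post_mean = a_post / (a_post + b_post)     # posterior mean of theta
print(a_post, b_post, post_mean)
```

Unlike the MLE p̂ = 0, the posterior mean (here 1/7) stays strictly positive: five zeros are not proof that the event is impossible.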

  12. On Bayesian Philosophy, Confidence vs. Credibility. For frequentists, a probability is a measure of the frequency of repeated events → parameters are fixed (but unknown), and data are random. For Bayesians, a probability is a measure of the degree of certainty about values → parameters are random and data are fixed. "Bayesians: Given our observed data, there is a 95% probability that the true value of θ falls within the credible region. vs. Frequentists: There is a 95% probability that when I compute a confidence interval from data of this sort, the true value of θ will fall within it." in Vanderplas (2014). Example: see Jaynes (1976), e.g. the truncated exponential.

  13. On Bayesian Philosophy, Confidence vs. Credibility. Example: what is a 95% confidence interval of a proportion? Here x = 159 and n = 1047.
  1. draw sets (x̃_1, ..., x̃_n)_k with X_i ∼ B(x/n)
  2. compute a confidence interval for each set of values
  3. determine the fraction of these confidence intervals that contain x
  → the parameter is fixed, and we guarantee that 95% of the confidence intervals will contain it. [Figure: simulated confidence intervals for counts in the 140-200 range.]

  14. On Bayesian Philosophy, Confidence vs. Credibility. Example: what is a 95% credible region of a proportion? Here x = 159 and n = 1047.
  1. draw random parameters p_k from the posterior distribution π(· | x)
  2. sample sets (x̃_1, ..., x̃_n)_k with X_{i,k} ∼ B(p_k)
  3. compute the mean x̄_k for each set of values
  4. look at the proportion of those x̄_k that are within the credible region [Π⁻¹(.025 | x); Π⁻¹(.975 | x)]
  → the credible region is fixed, and we guarantee that 95% of possible values of x̄ will fall within it.
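A sketch of this experiment using posterior-predictive draws. A uniform Beta(1, 1) prior is assumed, and empirical 2.5%/97.5% quantiles of the simulated counts stand in for the quantile function Π⁻¹:

```python
# Bayesian side of the experiment: draw p_k from the Beta posterior, then a
# fresh count x_k ~ Binomial(n, p_k); the central 95% of these counts
# approximates the posterior-predictive credible region.
import random
random.seed(4)

n, x = 1047, 159
R = 2000
draws = []
for _ in range(R):
    p_k = random.betavariate(1 + x, 1 + n - x)           # posterior draw
    x_k = sum(random.random() < p_k for _ in range(n))   # predictive count
    draws.append(x_k)
draws.sort()
lo, hi = draws[int(0.025 * R)], draws[int(0.975 * R)]
print(lo, hi)   # an interval around the observed count 159
```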

  15. Occam's Razor. The "law of parsimony": "pluralitas non est ponenda sine necessitate" (plurality should not be posited without necessity). Penalize models that are too complex.

  16. James & Stein Estimator. Let X ∼ N(µ, σ²I). We want to estimate µ. The maximum likelihood estimator is
  µ̂_mle = X̄_n ∼ N(µ, (σ²/n) I)
  From James & Stein (1961), Estimation with Quadratic Loss,
  µ̂_JS = (1 − (d − 2)σ²/‖y‖²) y
  where ‖·‖ is the Euclidean norm. One can prove that if d ≥ 3,
  E[‖µ̂_JS − µ‖²] < E[‖µ̂_mle − µ‖²]
  Samworth (2015), Stein's Paradox: "one should use the price of tea in China to obtain a better estimate of the chance of rain in Melbourne".
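A Monte Carlo check of the dominance claim. Here µ, σ, and d are made-up values, and a single observation y = x plays the role of the estimator being shrunk:

```python
# James-Stein vs. MLE risk for d >= 3: shrink a single observation
# y ~ N(mu, sigma^2 I_d) toward the origin. All constants are illustrative.
import random
random.seed(5)

d, sigma = 10, 1.0
mu = [1.0] * d
R = 5000
risk_mle = risk_js = 0.0
for _ in range(R):
    y = [m + random.gauss(0, sigma) for m in mu]
    norm2 = sum(v * v for v in y)
    shrink = 1 - (d - 2) * sigma**2 / norm2       # James-Stein factor
    js = [shrink * v for v in y]
    risk_mle += sum((a - b) ** 2 for a, b in zip(y, mu))
    risk_js += sum((a - b) ** 2 for a, b in zip(js, mu))
print(risk_mle / R, risk_js / R)   # JS risk is strictly smaller
```

The MLE risk should be near dσ² = 10, while the James-Stein risk falls visibly below it, exactly the d ≥ 3 paradox the slide describes.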

  17. James & Stein Estimator. Heuristics: accept a biased estimator in order to decrease the variance. See Efron (2010), Large-Scale Inference.
