Habilitationsvortrag: Machine learning, shrinkage estimation, and - PowerPoint PPT Presentation

Habilitation lecture (Habilitationsvortrag): Machine learning, shrinkage estimation, and economic theory
Maximilian Kasy
May 25, 2018

Introduction
Recent years saw a boom of “machine learning” methods.
Impressive advances in domains such as image recognition, speech recognition, playing chess, playing Go, self-driving cars ...
Questions:
  Why and how do these methods work?
  Which machine learning methods are useful for what kind of empirical research in economics?
  Can we combine these methods with insights from economic theory?
This talk is based on Abadie and Kasy (2018) and Fessler and Kasy (2018).

Machine learning successes [figures omitted]

Outline
1. Brief summaries
   1. The risk of machine learning (Abadie and Kasy 2018)
   2. How to use economic theory to improve estimators (Fessler and Kasy 2018)
2. For both papers:
   1. Some math
   2. Empirical applications
3. Conclusion

The risk of machine learning (Abadie and Kasy 2018)
Many applied settings: estimation of a large number of parameters.
  Teacher effects, worker and firm effects, judge effects ...
  Estimation of treatment effects for many subgroups
  Prediction with many covariates
Two key ingredients to avoid over-fitting, used in all of machine learning:
  Regularized estimation (shrinkage)
  Data-driven choices of regularization parameters (tuning)
Questions in practice:
  1. What kind of regularization should we choose? What features of the data generating process matter for this choice?
  2. When do cross-validation or SURE work for tuning?
We compare risk functions to answer these questions. (Not average (Bayes) risk or worst case risk!)

Recommendations for empirical researchers
1. Use regularization / shrinkage when you have many parameters of interest, and high variance (overfitting) is a concern.
2. Pick a regularization method appropriate for your application:
   1. Ridge: Smoothly distributed true effects, no special role of zero
   2. Pre-testing: Many zeros, non-zeros well separated
   3. Lasso: Robust choice, especially for series regression / prediction
3. Use CV or SURE in high dimensional settings, when number of observations ≫ number of parameters.

How to use economic theory to improve estimators (Fessler and Kasy 2018)
Most regularization methods shrink toward 0, or some other arbitrary point.
What if we instead shrink toward parameter values consistent with the predictions of economic theory?
Most economic theories are only approximately correct. Therefore:
  Testing them always rejects for large samples.
  Imposing them leads to inconsistent estimators.
  But shrinking toward them leads to uniformly better estimates.
Shrinking to theory is an alternative to the standard paradigm of testing theories, and maintaining them while they are not rejected.

General construction of estimators shrinking to theory:
  Parametric empirical Bayes approach.
  Assume true parameters are theory-consistent parameters plus some random effects.
  Variance of random effects can be estimated, and determines the degree of shrinkage toward theory.
We apply this to:
  1. Consumer demand, shrunk toward negative semi-definite compensated demand elasticities.
  2. Effect of labor supply on wage inequality, shrunk toward CES production function model.
  3. Decision probabilities, shrunk toward Stochastic Axiom of Revealed Preference.
  4. Expected asset returns, shrunk toward Capital Asset Pricing Model.

The risk of machine learning (Abadie and Kasy 2018)
Roadmap:
1. Stylized setting: Estimation of many means
2. A useful family of examples: Spike and normal DGP
   Comparing mean squared error as a function of parameters
3. Empirical applications
   Neighborhood effects (Chetty and Hendren, 2015)
   Arms trading event study (DellaVigna and La Ferrara, 2010)
   Nonparametric Mincer equation (Belloni and Chernozhukov, 2011)
4. Uniform loss consistency of tuning methods

Stylized setting: Estimation of many means
Observe $n$ random variables $X_1, \ldots, X_n$ with means $\mu_1, \ldots, \mu_n$.
Many applications: $X_i$ equal to OLS estimated coefficients.
Componentwise estimators: $\hat\mu_i = m(X_i, \lambda)$, where $m: \mathbb{R} \times [0, \infty] \mapsto \mathbb{R}$ and $\lambda$ may depend on $(X_1, \ldots, X_n)$.
Examples: Ridge, Lasso, Pretest.
[Figure: the functions $m$ for ridge, pretest, and lasso, plotted against $X$ on $[-8, 8]$.]
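The slide names the three estimators without stating their formulas. A minimal sketch, assuming the standard textbook forms (proportional shrinkage for ridge, soft thresholding for lasso, hard thresholding for pre-test); the function names are illustrative and not taken from the talk:

```python
import numpy as np

def ridge(x, lam):
    """Ridge: proportional shrinkage toward zero, m(x, lam) = x / (1 + lam)."""
    return np.asarray(x, dtype=float) / (1.0 + lam)

def lasso(x, lam):
    """Lasso: soft thresholding, m(x, lam) = sign(x) * max(|x| - lam, 0)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def pretest(x, lam):
    """Pre-test: hard thresholding, m(x, lam) = x if |x| > lam, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > lam, x, 0.0)

# Apply each componentwise estimator to the same observations.
x = np.array([-3.0, -0.5, 0.0, 1.0, 4.0])
for name, m in [("ridge", ridge), ("lasso", lasso), ("pretest", pretest)]:
    print(name, m(x, 1.0))
```

Under these forms, ridge moves every observation toward zero by the same factor, lasso subtracts a fixed amount and sets small values to zero, and pre-test keeps large values unchanged while zeroing the rest.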

Loss and risk
Compound squared error loss: $L(\hat\mu, \mu) = \frac{1}{n} \sum_i (\hat\mu_i - \mu_i)^2$
Empirical Bayes risk: $\mu_1, \ldots, \mu_n$ as random effects, $(X_i, \mu_i) \sim \pi$,
$$\bar R(m(\cdot, \lambda), \pi) = E_\pi[(m(X_i, \lambda) - \mu_i)^2].$$
Conditional expectation: $\bar m^*_\pi(x) = E_\pi[\mu \mid X = x]$
Theorem: The empirical Bayes risk of $m(\cdot, \lambda)$ can be written as
$$\bar R = \text{const.} + E_\pi\left[(m(X, \lambda) - \bar m^*_\pi(X))^2\right].$$
⇒ Performance of estimator $m(\cdot, \lambda)$ depends on how closely it approximates $\bar m^*_\pi(\cdot)$.
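The decomposition in the theorem can be sketched with the law of iterated expectations. The steps below assume a fixed (non-random) $\lambda$ and finite second moments; they are a reconstruction of the standard argument, not a transcription of the authors' proof:

```latex
\begin{align*}
\bar R(m(\cdot,\lambda),\pi)
 &= E_\pi\big[(m(X,\lambda) - \bar m^*_\pi(X) + \bar m^*_\pi(X) - \mu)^2\big] \\
 &= E_\pi\big[(m(X,\lambda) - \bar m^*_\pi(X))^2\big]
  + E_\pi\big[(\bar m^*_\pi(X) - \mu)^2\big] \\
 &\quad + 2\, E_\pi\big[(m(X,\lambda) - \bar m^*_\pi(X))\,(\bar m^*_\pi(X) - \mu)\big], \\
E_\pi\big[\bar m^*_\pi(X) - \mu \,\big|\, X\big]
 &= \bar m^*_\pi(X) - E_\pi[\mu \mid X] = 0
 \quad\Rightarrow\quad \text{the cross term vanishes}, \\
\bar R(m(\cdot,\lambda),\pi)
 &= \underbrace{E_\pi\big[(\bar m^*_\pi(X) - \mu)^2\big]}_{\text{const.}}
  + E_\pi\big[(m(X,\lambda) - \bar m^*_\pi(X))^2\big].
\end{align*}
```

The constant is the risk of the conditional expectation itself, which does not depend on $m$ or $\lambda$.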

A useful family of examples: Spike and normal DGP
Assume $X_i \sim N(\mu_i, 1)$.
Distribution of $\mu_i$ across $i$:
  Fraction $p$: $\mu_i = 0$
  Fraction $1 - p$: $\mu_i \sim N(\mu_0, \sigma_0^2)$
Covers many interesting settings:
  $p = 0$: smooth distribution of true parameters
  $p \gg 0$, $\mu_0$ or $\sigma_0^2$ large: sparsity, non-zeros well separated
Consider ridge, lasso, pre-test, optimal shrinkage function.
Assume $\lambda$ is chosen optimally (will return to that).
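The comparison described on this slide can be approximated by simulation. The sketch below draws from the spike and normal DGP and, for each estimator, picks the $\lambda$ that minimizes the realized compound loss on a grid (oracle tuning). The estimator formulas are the standard ones and are an assumption here, as are all function names, the grid, and the sample size:

```python
import numpy as np

def simulate_spike_and_normal(n, p, mu0, sigma0, rng):
    """mu_i = 0 with probability p, else N(mu0, sigma0^2); X_i ~ N(mu_i, 1)."""
    is_zero = rng.random(n) < p
    mu = np.where(is_zero, 0.0, rng.normal(mu0, sigma0, size=n))
    x = mu + rng.normal(size=n)
    return mu, x

def ridge(x, lam):
    return x / (1.0 + lam)

def lasso(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def pretest(x, lam):
    return np.where(np.abs(x) > lam, x, 0.0)

def best_on_grid(estimator, mu, x, lambdas):
    """Return (lambda, loss) minimizing the compound squared error loss on the grid."""
    losses = [np.mean((estimator(x, lam) - mu) ** 2) for lam in lambdas]
    k = int(np.argmin(losses))
    return lambdas[k], losses[k]

rng = np.random.default_rng(0)
lambdas = np.linspace(0.0, 5.0, 101)
# p = 0: smooth distribution of true parameters centered at zero.
mu, x = simulate_spike_and_normal(n=200_000, p=0.0, mu0=0.0, sigma0=1.0, rng=rng)
results = {
    name: best_on_grid(m, mu, x, lambdas)
    for name, m in [("ridge", ridge), ("lasso", lasso), ("pretest", pretest)]
}
for name, (lam, loss) in results.items():
    print(f"{name}: lambda = {lam:.2f}, loss = {loss:.3f}")
```

Changing `p`, `mu0`, and `sigma0` (for example a large `p` with a large `mu0`) moves the simulation toward the sparse, well-separated case the slide contrasts with the smooth one.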

Best estimator
[Figure: four panels, $p = 0.00$, $p = 0.25$, $p = 0.50$, $p = 0.75$, each with $\mu_0$ on the horizontal axis and $\sigma_0$ on the vertical axis (both from 0 to 5). ◦ is ridge, x is lasso, · is pretest.]

Applications
Neighborhood effects: The effect of location during childhood on adult income (Chetty and Hendren, 2015)
Arms trading event study: Changes in the stock prices of arms manufacturers following changes in the intensity of conflicts in countries under arms trade embargoes (DellaVigna and La Ferrara, 2010)
Nonparametric Mincer equation: A nonparametric regression equation of log wages on education and potential experience (Belloni and Chernozhukov, 2011)

Estimated Risk
Stein's unbiased risk estimate $\hat R$ at the optimized tuning parameter $\hat\lambda^*$ for each application and estimator considered.

                        n                      Ridge   Lasso   Pre-test
location effects        595  $\hat R$          0.29    0.32    0.41
                             $\hat\lambda^*$   2.44    1.34    5.00
arms trade              214  $\hat R$          0.50    0.06    -0.02
                             $\hat\lambda^*$   0.98    1.50    2.38
returns to education    65   $\hat R$          1.00    0.93    0.84
                             $\hat\lambda^*$   0.01    0.59    1.14

Some theory: Estimating $\lambda$
Can we consistently estimate the optimal $\lambda^*$, and do almost as well as if we knew it?
Answer: Yes, for large $n$, suitably bounded moments.
We show this for two methods:
  1. Stein's Unbiased Risk Estimate (SURE) (requires normality)
  2. Cross-validation (CV) (requires panel data)
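To illustrate why cross-validation needs panel data, here is a rough sketch that assumes each unit $i$ is observed $k$ times with independent noise. Each observation is held out in turn and predicted from the shrunken mean of the remaining $k - 1$. The ridge form, the function names, and the simulated data are assumptions made for illustration, not the authors' implementation:

```python
import numpy as np

def ridge(x, lam):
    return x / (1.0 + lam)

def cv_criterion(panel, lam, estimator):
    """Leave-one-out CV: predict each held-out column from the mean of the others."""
    n, k = panel.shape
    total = panel.sum(axis=1)
    loss = 0.0
    for j in range(k):
        held_out = panel[:, j]
        mean_rest = (total - held_out) / (k - 1)
        loss += np.mean((estimator(mean_rest, lam) - held_out) ** 2)
    return loss / k

# Simulated panel: mu_i ~ N(0, 1), k noisy observations per unit with unit variance.
rng = np.random.default_rng(1)
n, k = 100_000, 4
mu = rng.normal(size=n)
panel = mu[:, None] + rng.normal(size=(n, k))

lambdas = np.linspace(0.0, 2.0, 201)
cv_values = np.array([cv_criterion(panel, lam, ridge) for lam in lambdas])
lam_cv = lambdas[int(np.argmin(cv_values))]
print("CV choice of lambda:", lam_cv)
```

Because the held-out observation is an unbiased but noisy measurement of $\mu_i$, the criterion differs from the loss only by a term that does not depend on $\lambda$. Note that the selected $\lambda$ is tuned for an estimator based on $k - 1$ observations per unit rather than $k$.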

Uniform loss consistency
Shorthand notation for loss: $L_n(\lambda) = \frac{1}{n} \sum_i (m(X_i, \lambda) - \mu_i)^2$
Definition: Uniform loss consistency of $m(\cdot, \hat\lambda)$ for $m(\cdot, \bar\lambda^*)$:
$$\sup_\pi P_\pi\left( \left| L_n(\hat\lambda) - L_n(\bar\lambda^*) \right| > \varepsilon \right) \to 0$$
as $n \to \infty$ for all $\varepsilon > 0$, where $P_i \overset{iid}{\sim} \pi$.

Minimizing estimated risk
Estimate $\lambda^*$ by minimizing estimated risk:
$$\hat\lambda^* = \operatorname*{argmin}_\lambda \hat R(\lambda)$$
Different estimators $\hat R(\lambda)$ of risk: CV, SURE
Theorem: Regularization using SURE or CV is uniformly loss consistent as $n \to \infty$ in the random effects setting under some regularity conditions.
Contrast with Leeb and Pötscher (2006)! (fixed dimension of parameter vector)
Key ingredient: uniform laws of large numbers to get convergence of $L_n(\lambda)$, $\hat R(\lambda)$.
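As an illustration of tuning by minimizing estimated risk, the sketch below evaluates Stein's unbiased risk estimate on a grid for ridge and lasso, assuming $X_i \sim N(\mu_i, 1)$ and the standard shrinkage formulas. Pre-test is left out because hard thresholding is discontinuous, so the plain derivative-based formula used here does not apply to it. Function names and simulated data are hypothetical:

```python
import numpy as np

def ridge(x, lam):
    return x / (1.0 + lam)

def sure_ridge(x, lam):
    """SURE with unit noise variance: mean squared residual + 2 * mean derivative - 1."""
    return np.mean((ridge(x, lam) - x) ** 2) + 2.0 / (1.0 + lam) - 1.0

def sure_lasso(x, lam):
    """Soft thresholding: residual^2 = min(x^2, lam^2), derivative = 1(|x| > lam)."""
    return np.mean(np.minimum(x ** 2, lam ** 2)) + 2.0 * np.mean(np.abs(x) > lam) - 1.0

def minimize_on_grid(criterion, x, lambdas):
    """Return the grid point minimizing the risk estimate, and the minimized value."""
    values = np.array([criterion(x, lam) for lam in lambdas])
    k = int(np.argmin(values))
    return lambdas[k], values[k]

# Simulated data with mu_i ~ N(0, 1); the true means are used only for checking.
rng = np.random.default_rng(0)
mu = rng.normal(size=100_000)
x = mu + rng.normal(size=mu.size)

lambdas = np.linspace(0.0, 5.0, 101)
lam_ridge, risk_ridge = minimize_on_grid(sure_ridge, x, lambdas)
lam_lasso, risk_lasso = minimize_on_grid(sure_lasso, x, lambdas)
loss_ridge = np.mean((ridge(x, lam_ridge) - mu) ** 2)
print(lam_ridge, risk_ridge, loss_ridge)
print(lam_lasso, risk_lasso)
```

The risk estimate uses only the observed `x`, so the comparison with `loss_ridge` (which uses the simulated true means) gives a rough check that the estimated and realized losses are close in a large sample.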

How to use economic theory to improve estimators (Fessler and Kasy 2018)
Goal: constructing estimators shrinking to theory.
Preliminary unrestricted estimator: $\hat\beta \mid \beta \sim N(\beta, V)$
Restrictions implied by theoretical model: $\beta^0 \in B^0 = \{ b : R_1 \cdot b = 0,\ R_2 \cdot b \le 0 \}$.
Empirical Bayes (random coefficient) construction: $\beta = \beta^0 + \zeta$, $\zeta \sim N(0, \tau^2 \cdot I)$, $\beta^0 \in B^0$.

Solving for the empirical Bayes estimator
Marginal distribution of $\hat\beta$ given $\beta^0, \tau^2$:
$$\hat\beta \mid \beta^0, \tau^2 \sim N(\beta^0, \tau^2 \cdot I + V)$$
Maximum likelihood estimation of $\beta^0, \tau^2$ (tuning):
$$(\hat\beta^0, \hat\tau^2) = \operatorname*{argmin}_{b^0 \in B^0,\ t^2 \ge 0} \ \log\det\left(t^2 \cdot I + \hat V\right) + (\hat\beta - b^0)' \cdot \left(t^2 \cdot I + \hat V\right)^{-1} \cdot (\hat\beta - b^0).$$
“Bayes” estimation of $\beta$ (shrinkage):
$$\hat\beta^{EB} = \hat\beta^0 + \left(I + \frac{1}{\hat\tau^2} \hat V\right)^{-1} \cdot (\hat\beta - \hat\beta^0).$$
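The two steps above (tuning by maximum likelihood, then shrinkage) can be sketched for a simplified case. This sketch keeps only the equality restrictions $R_1 \cdot b = 0$, drops the inequality restrictions $R_2 \cdot b \le 0$, treats $\hat V$ as given and invertible, assumes the restrictions leave a non-trivial subspace, and searches over $t^2$ on a grid. All names are hypothetical; it is not the authors' code:

```python
import numpy as np

def null_space_basis(R, tol=1e-10):
    """Orthonormal basis of {b : R b = 0}, from the SVD of R."""
    _, s, vt = np.linalg.svd(np.atleast_2d(R))
    rank = int(np.sum(s > tol))
    return vt[rank:].T

def shrink_to_theory(beta_hat, V_hat, R1, t2_grid):
    """Empirical Bayes shrinkage toward {b : R1 b = 0} (equality restrictions only)."""
    k = len(beta_hat)
    N = null_space_basis(R1)
    best = None
    for t2 in t2_grid:
        Sigma_inv = np.linalg.inv(t2 * np.eye(k) + V_hat)
        # For fixed t2, the best b0 in the subspace is a GLS projection of beta_hat.
        z = np.linalg.solve(N.T @ Sigma_inv @ N, N.T @ Sigma_inv @ beta_hat)
        b0 = N @ z
        resid = beta_hat - b0
        _, logdet = np.linalg.slogdet(t2 * np.eye(k) + V_hat)
        objective = logdet + resid @ Sigma_inv @ resid
        if best is None or objective < best[0]:
            best = (objective, t2, b0)
    _, tau2_hat, beta0_hat = best
    # tau2 * (tau2 I + V)^{-1} equals (I + V / tau2)^{-1} and is also defined at tau2 = 0.
    weight = tau2_hat * np.linalg.inv(tau2_hat * np.eye(k) + V_hat)
    beta_eb = beta0_hat + weight @ (beta_hat - beta0_hat)
    return beta_eb, beta0_hat, tau2_hat

# Illustrative "theory": the two coefficients are equal, i.e. b_1 - b_2 = 0.
R1 = np.array([[1.0, -1.0]])
V_hat = 0.5 * np.eye(2)
t2_grid = np.linspace(0.0, 5.0, 501)

close = shrink_to_theory(np.array([1.0, 1.2]), V_hat, R1, t2_grid)
far = shrink_to_theory(np.array([0.0, 4.0]), V_hat, R1, t2_grid)
print(close)
print(far)
```

When the unrestricted estimate is close to the restricted set relative to its sampling variance, the estimated $\hat\tau^2$ is small and the estimate is pulled strongly toward theory; when it is far away, $\hat\tau^2$ is large and the unrestricted estimate is mostly retained.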

Application 1: Consumer demand
Consumer choice and the restrictions on compensated demand implied by utility maximization.
High dimensional parameters if we want to estimate demand elasticities at many different price and income levels.
Theory we are shrinking to: Negative semi-definiteness of compensated quantile demand elasticities, which holds under arbitrary preference heterogeneity by Dette et al. (2016).
Application as in Blundell et al. (2017): Price and income elasticity of gasoline demand, 2001 National Household Travel Survey (NHTS).

Unrestricted demand estimation
[Figure: four panels plotted against log price (0.2 to 0.35): log demand, income elasticity of demand, price elasticity of demand, compensated price elasticity of demand.]
