
Orthogonal Machine Learning: Power and Limitations (Lester Mackey)



1. Orthogonal Machine Learning: Power and Limitations
Lester Mackey*, joint work with Vasilis Syrgkanis* and Ilias Zadik†
*Microsoft Research New England, †Massachusetts Institute of Technology
October 30, 2018

2. A Conversation with Vasilis
Vasilis: Lester, I love Double Machine Learning!
Me: What?
Vasilis: It's a tool for accurately estimating treatment effects in the presence of many potential confounders.
Me: I have no idea what you're talking about.
Vasilis: Let me give you an example...

3. Example: Estimating Price Elasticity of Demand
Goal: Estimate elasticity, the effect a change in price has on demand
- Set prices of goods and services [Chernozhukov, Goldman, Semenova, and Taddy, 2017b]
- Predict the impact of a tobacco tax on smoking [Wilkins, Yurekli, and Hu, 2004]
Model:
  Y = θ_0 T + ε
where Y is log demand, T is log price, θ_0 is the elasticity, and ε is noise.

4. Example: Estimating Price Elasticity of Demand
Goal: Estimate elasticity, the effect a change in price has on demand
- Set prices of goods and services [Chernozhukov, Goldman, Semenova, and Taddy, 2017b]
- Predict the impact of a tobacco tax on smoking [Wilkins, Yurekli, and Hu, 2004]
Model:
  Y = θ_0 T + ε
where Y is log demand, T is log price, θ_0 is the elasticity, and ε is noise.
Conclusion: Increasing price increases demand!
Problem: Demand increases in winter, and price anticipates demand.
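To make the confounding story concrete, here is a minimal simulation (not from the slides; all numbers and functional forms are invented): when sellers raise prices in the high-demand season, a naive regression of log demand on log price can return a positive slope even though the true elasticity is negative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
season = rng.binomial(1, 0.5, size=n)              # 1 = winter, the high-demand season
log_price = 0.8 * season + rng.normal(0, 0.1, n)   # sellers raise prices when demand is high
log_demand = -1.5 * log_price + 2.0 * season + rng.normal(0, 0.1, n)  # true elasticity -1.5

# Naive regression of log demand on log price ignores the seasonal confounder
naive_slope = np.polyfit(log_price, log_demand, 1)[0]
print(f"naive elasticity estimate: {naive_slope:+.2f} (true elasticity: -1.50)")
```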

5. Example: Estimating Price Elasticity of Demand
Goal: Estimate elasticity, the effect a change in price has on demand
- Set prices of goods and services [Chernozhukov, Goldman, Semenova, and Taddy, 2017b]
- Predict the impact of a tobacco tax on smoking [Wilkins, Yurekli, and Hu, 2004]
Model:
  Y = θ_0 T + β_0 X + ε
where Y is log demand, T is log price, θ_0 is the elasticity, X is a season indicator, and ε is noise.
Problem: What if there are 100s or 1000s of potential confounders?
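Continuing the same invented simulation, including the season indicator as a control recovers the true elasticity; a short sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
season = rng.binomial(1, 0.5, size=n)
log_price = 0.8 * season + rng.normal(0, 0.1, n)
log_demand = -1.5 * log_price + 2.0 * season + rng.normal(0, 0.1, n)

# OLS of log demand on log price *and* the season indicator (plus an intercept)
design = np.column_stack([log_price, season, np.ones(n)])
coef, *_ = np.linalg.lstsq(design, log_demand, rcond=None)
print(f"elasticity with the season control: {coef[0]:+.2f} (true elasticity: -1.50)")
```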

6. Example: Estimating Price Elasticity of Demand
Goal: Estimate elasticity, the effect a change in price has on demand
Problem: What if there are 100s or 1000s of potential confounders?
- Time of day, day of week, month, purchase and browsing history, other product prices, demographics, the weather, ...
One option: Estimate the effect of all potential confounders really well
  Y = θ_0 T + f_0(X) + ε
where Y is log demand, T is log price, θ_0 is the elasticity, f_0(X) is the effect of the potential confounders, and ε is noise.
- If the nuisance function f_0 is estimable at an O(n^{-1/2}) rate, then so is θ_0
Problem: Accurate nuisance estimates are often unachievable when f_0 is nonparametric, or linear and high-dimensional.
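One reading of "estimate the effect of all potential confounders really well" is a direct plug-in fit of θ_0 and f_0 together, e.g. a single Lasso of Y on (T, X) when f_0 is assumed sparse and linear. A rough sketch of that naive approach (scikit-learn is assumed available; the data-generating process is invented), illustrating how regularization bias leaks into the elasticity estimate:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 500, 200                                  # many potential confounders
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0                                   # f_0(X) = X @ beta, sparse and linear
T = X @ beta / 2 + rng.normal(size=n)            # price depends on the confounders
Y = 1.5 * T + X @ beta + rng.normal(size=n)      # true elasticity theta_0 = 1.5

# Naive plug-in: fit theta_0 and f_0 jointly with one regularized regression and read
# off the coefficient on T; regularization bias contaminates that coefficient.
fit = Lasso(alpha=0.1).fit(np.column_stack([T, X]), Y)
print(f"plug-in elasticity estimate: {fit.coef_[0]:.2f} (true value: 1.50)")
```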

7. Example: Estimating Price Elasticity of Demand
Problem: What if there are 100s or 1000s of potential confounders?
Double Machine Learning [Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey, 2017a]
  Y = θ_0 T + f_0(X) + ε
where Y is log demand, T is log price, θ_0 is the elasticity, f_0(X) is the effect of the potential confounders, and ε is noise.
- Estimate the nuisance f_0 somewhat poorly: o(n^{-1/4}) suffices
- Employ a Neyman orthogonal estimator of θ_0 that is robust to first-order errors in the nuisance estimates; this yields a √n-consistent estimate of θ_0
Questions: Why o(n^{-1/4})? Can we relax this? When? How?
This talk:
- A framework for k-th order orthogonal estimation in which o(n^{-1/(2k+2)}) nuisance consistency implies √n-consistency for θ_0
- An existence characterization and explicit construction of 2nd-order orthogonality in a popular causal inference model
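For concreteness, here is a minimal sketch of one common instantiation of Double Machine Learning for this model: the partialling-out recipe, which uses cross-fitted ML estimates of E[T | X] and E[Y | X] as nuisances rather than f_0 directly, and then runs a residual-on-residual regression. The random-forest learners and the simulated data are illustrative choices, not the slides' own:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p = 2000, 20
X = rng.normal(size=(n, p))
T = np.sin(X[:, 0]) + X[:, 1] + rng.normal(size=n)                 # price driven by confounders
Y = 1.5 * T + 2 * np.cos(X[:, 0]) + X[:, 2] + rng.normal(size=n)   # true theta_0 = 1.5

# Stage 1: cross-fitted machine-learning estimates of E[T | X] and E[Y | X]
t_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), X, T, cv=5)
y_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), X, Y, cv=5)

# Stage 2: Neyman-orthogonal residual-on-residual regression for theta_0
t_res, y_res = T - t_hat, Y - y_hat
theta_hat = (t_res @ y_res) / (t_res @ t_res)
print(f"DML elasticity estimate: {theta_hat:.2f} (true value: 1.50)")
```

The residualization is what the orthogonality slides below formalize: small errors in the two nuisance fits perturb this estimate only at second order.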

8. Estimation with Nuisance
Goal: Estimate target parameters θ_0 ∈ Θ ⊆ R^d (e.g., elasticities) in the presence of unknown nuisance functions h_0 ∈ H
Given: Independent replicates (Z_t)_{t=1}^{2n} of a data vector Z = (T, Y, X)
Example (Partially Linear Regression (PLR))
- T ∈ R represents a treatment or policy applied (e.g., log price)
- Y ∈ R represents an outcome of interest (e.g., log demand)
- X ∈ R^p is a vector of associated covariates (e.g., seasonality)
These observations satisfy
  Y = θ_0 T + f_0(X) + ε,  E[ε | X, T] = 0 a.s.
  T = g_0(X) + η,          E[η | X] = 0 a.s., Var(η) > 0
for noise η and ε, target parameter θ_0, and nuisance h_0 = (f_0, g_0).
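A minimal simulator for this PLR model (the specific choices of f_0, g_0, and the noise laws are invented for illustration):

```python
import numpy as np

def simulate_plr(n, p, theta0=1.5, seed=0):
    """Draw n replicates (T, Y, X) from a partially linear regression model."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    g0 = np.sin(X[:, 0]) + 0.5 * X[:, 1]       # g_0(X) = E[T | X]
    f0 = 2 * np.cos(X[:, 0]) + X[:, 2] ** 2    # f_0(X), confounder effect on Y
    eta = rng.normal(size=n)                   # E[eta | X] = 0, Var(eta) > 0
    eps = rng.normal(size=n)                   # E[eps | X, T] = 0
    T = g0 + eta
    Y = theta0 * T + f0 + eps
    return T, Y, X

T, Y, X = simulate_plr(n=4000, p=10)
```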

9. Two-stage Z-estimation with Sample Splitting
Goal: Estimate target parameters θ_0 ∈ Θ ⊆ R^d (e.g., elasticities) in the presence of unknown nuisance functions h_0 ∈ H
Given:
- Independent replicates (Z_t)_{t=1}^{2n} of a data vector Z = (T, Y, X)
- Moment functions m that identify the target parameters θ_0:
  E[m(Z, θ_0, h_0(X)) | X] = 0 a.s. and E[m(Z, θ, h_0(X))] ≠ 0 if θ ≠ θ_0
  PLR model example: m(Z, θ, h_0(X)) = (Y − θT − f_0(X)) T
Two-stage Z-estimation with sample splitting (sketched below):
1. Fit an estimate ĥ ∈ H of h_0 using (Z_t)_{t=n+1}^{2n} (e.g., via nonparametric or high-dimensional regression)
2. θ̂^SS solves (1/n) Σ_{t=1}^{n} m(Z_t, θ, ĥ(X_t)) = 0
Con: Splitting is statistically inefficient, with a possible detriment to the first stage
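A schematic sketch of the two-stage procedure for a scalar target; the nuisance fitter, the moment function, and the root-finding bracket are placeholders to be supplied by the user (scipy is an assumed dependency):

```python
import numpy as np
from scipy.optimize import brentq

def two_stage_sample_splitting(Z, fit_nuisance, moment, bracket=(-10.0, 10.0)):
    """Two-stage Z-estimation with sample splitting (scalar theta).

    Z            : list of 2n data vectors Z_t
    fit_nuisance : maps a list of data vectors to an estimate h_hat, where
                   h_hat(Z_t) returns the estimated nuisance values at X_t
    moment       : m(Z_t, theta, nuisance_values) -> float
    """
    n = len(Z) // 2
    h_hat = fit_nuisance(Z[n:])                # stage 1: fit the nuisance on the second half
    def avg_moment(theta):                     # stage 2: empirical moment on the first half
        return float(np.mean([moment(z, theta, h_hat(z)) for z in Z[:n]]))
    return brentq(avg_moment, *bracket)        # solve the moment equation for theta
```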

10. Two-stage Z-estimation with Cross-Fitting
Goal: Estimate target parameters θ_0 ∈ Θ ⊆ R^d (e.g., elasticities) in the presence of unknown nuisance functions h_0 ∈ H
Given:
- Independent replicates (Z_t)_{t=1}^{2n} of a data vector Z = (T, Y, X)
- Moment functions m that identify the target parameters θ_0:
  E[m(Z, θ_0, h_0(X)) | X] = 0 a.s. and E[m(Z, θ, h_0(X))] ≠ 0 if θ ≠ θ_0
  PLR model example: m(Z, θ, h_0(X)) = (Y − θT − f_0(X)) T
Two-stage Z-estimation with cross-fitting [Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey, 2017a] (sketched below):
1. Split the data indices into K batches I_1, ..., I_K
2. For k ∈ {1, ..., K}, fit an estimate ĥ_k ∈ H of h_0 excluding I_k
3. θ̂^CF solves (1/(2n)) Σ_{k=1}^{K} Σ_{t∈I_k} m(Z_t, θ, ĥ_k(X_t)) = 0
Pro: Repairs the sample-splitting deficiencies
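The analogous schematic for cross-fitting; every observation enters the second-stage moment equation with a nuisance estimate that was fit without it (again a sketch with user-supplied placeholders):

```python
import numpy as np
from scipy.optimize import brentq

def two_stage_cross_fitting(Z, fit_nuisance, moment, K=5, bracket=(-10.0, 10.0)):
    """Two-stage Z-estimation with K-fold cross-fitting (scalar theta)."""
    Z = list(Z)
    idx = np.arange(len(Z))
    folds = np.array_split(idx, K)                        # batches I_1, ..., I_K
    # Fit one nuisance estimate per batch, using all data *excluding* that batch
    h_hats = [fit_nuisance([Z[t] for t in np.setdiff1d(idx, I_k)]) for I_k in folds]
    # Pooled empirical moment equation over all batches
    def avg_moment(theta):
        vals = [moment(Z[t], theta, h_hats[k](Z[t]))
                for k, I_k in enumerate(folds) for t in I_k]
        return float(np.mean(vals))
    return brentq(avg_moment, *bracket)
```

With K = 2 this amounts to running the sample-splitting recipe twice with the roles of the two halves swapped and pooling the moment equations.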

11. Goal: √n-Asymptotic Normality
Two-stage Z-estimators:
- θ̂^SS solves (1/n) Σ_{t=1}^{n} m(Z_t, θ, ĥ(X_t)) = 0
- θ̂^CF solves (1/(2n)) Σ_{k=1}^{K} Σ_{t∈I_k} m(Z_t, θ, ĥ_k(X_t)) = 0
Goal: Establish conditions under which θ̂^SS and θ̂^CF enjoy √n-asymptotic normality (√n-a.n.), that is,
  √n (θ̂^SS − θ_0) →_d N(0, Σ) and √(2n) (θ̂^CF − θ_0) →_d N(0, Σ)
This yields:
- Asymptotically valid confidence intervals for θ_0 based on Gaussian or Student's t quantiles
- Asymptotically valid association tests, like the Wald test
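A small sketch of the practical payoff: given a √n-asymptotically normal estimate and a consistent variance estimate Σ̂, Wald-style confidence intervals and tests follow directly (the numerical values below are placeholders):

```python
import numpy as np
from scipy.stats import norm, chi2

theta_hat, sigma_hat, n = 1.47, 4.2, 2000      # hypothetical estimate, variance, sample size
z = norm.ppf(0.975)                            # two-sided 95% Gaussian quantile
half_width = z * np.sqrt(sigma_hat / n)
print(f"95% CI for theta_0: [{theta_hat - half_width:.3f}, {theta_hat + half_width:.3f}]")

wald = n * theta_hat**2 / sigma_hat            # Wald statistic for H0: theta_0 = 0
print(f"Wald test p-value: {chi2.sf(wald, df=1):.3g}")
```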

12. First-order Orthogonality
Definition (First-order Orthogonal Moments [Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey, 2017a])
Moments m are first-order orthogonal w.r.t. the nuisance h_0(X) if
  E[ ∇_γ m(Z, θ_0, γ) |_{γ = h_0(X)} | X ] = 0.
- The principle dates back to the early work of Neyman [1979]
- Grants first-order insensitivity to errors in the nuisance estimates
- Annihilates the first-order term in a Taylor expansion around the nuisance
- Recall: m is 0-th order orthogonal, since E[m(Z, θ_0, h_0(X)) | X] = 0
First-order orthogonality is
- Not satisfied by m(Z, θ, h(X)) = (Y − θT − f(X)) T
- Satisfied by m(Z, θ, h(X)) = (Y − θT − f(X)) (T − g(X))
Main result of Chernozhukov et al. [2017a]: under 1st-order orthogonality, θ̂^SS and θ̂^CF are √n-a.n. when ‖ĥ_i − h_{0,i}‖ = o_p(n^{-1/4}) for all i
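A quick numerical illustration of this insensitivity, using the two PLR moments above (the simulated model and the form of the nuisance perturbation are invented): plugging in nuisances that are off by δ shifts the estimate by roughly O(δ) under the non-orthogonal moment but only O(δ²) under the orthogonal one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 500_000, 1.5
X = rng.normal(size=n)
g0, f0 = np.sin(X), np.cos(X)                        # true nuisances (invented forms)
T = g0 + rng.normal(size=n)
Y = theta0 * T + f0 + rng.normal(size=n)

for delta in (0.2, 0.1, 0.05):
    f_hat, g_hat = f0 + delta * X, g0 + delta * X    # nuisance estimates off by O(delta)
    # Non-orthogonal moment (Y - theta*T - f(X)) T: error shrinks like delta
    theta_plain = ((Y - f_hat) @ T) / (T @ T)
    # Orthogonal moment (Y - theta*T - f(X)) (T - g(X)): error shrinks like delta^2
    theta_orth = ((Y - f_hat) @ (T - g_hat)) / (T @ (T - g_hat))
    print(f"delta={delta:.2f}  plain error={theta_plain - theta0:+.4f}  "
          f"orthogonal error={theta_orth - theta0:+.4f}")
```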

13. Higher-order Orthogonality
Definition (k-Orthogonal Moments)
Moments m are k-orthogonal if, for all α ∈ N^ℓ with ‖α‖_1 ≤ k,
  E[ D^α m(Z, θ_0, γ) |_{γ = h_0(X)} | X ] = 0,
where D^α m(Z, θ, γ) = ∇^{α_1}_{γ_1} ∇^{α_2}_{γ_2} ··· ∇^{α_ℓ}_{γ_ℓ} m(Z, θ, γ) and the γ_i are the coordinates of the ℓ nuisance functions.
- Grants k-th order insensitivity to errors in the nuisance estimates
- Annihilates the terms of order ≤ k in a Taylor expansion around the nuisance
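As a concrete instance of the definition, take the orthogonal PLR moment m = (Y − θT − f)(T − g) and differentiate in the nuisance coordinates (f, g): both first-order partials have conditional mean zero at the truth, but the mixed second-order partial is identically 1, so this moment is 1-orthogonal and not 2-orthogonal. A small symbolic sketch (sympy is an assumed dependency):

```python
import sympy as sp

Y, T, theta, f, g = sp.symbols("Y T theta f g")
m = (Y - theta * T - f) * (T - g)     # orthogonal moment for the PLR model

print(sp.diff(m, f))      # -(T - g): conditional mean 0 at g = g_0(X)
print(sp.diff(m, g))      # -(Y - theta*T - f): conditional mean 0 at theta_0, f_0(X)
print(sp.diff(m, f, g))   # 1: E[1 | X] = 1 != 0, so m is not 2-orthogonal
```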

14. Asymptotic Normality from k-Orthogonality
Theorem ([Mackey, Syrgkanis, and Zadik, 2018])
Under k-orthogonality and standard identifiability and regularity assumptions, ‖ĥ_i − h_{0,i}‖ = o_p(n^{-1/(2k+2)}) for all i suffices for √n-a.n. of θ̂^SS and θ̂^CF, with Σ = J^{-1} V J^{-1} for J = E[∇_θ m(Z, θ_0, h_0(X))] and V = Cov(m(Z, θ_0, h_0(X))).
- It actually suffices for products of nuisance errors to decay, n^{1/2} · (E[ Π_{i=1}^{ℓ} |ĥ_i(X) − h_{0,i}(X)|^{2α_i} | ĥ ])^{1/2} → 0 for every ‖α‖_1 = k + 1: if one nuisance is estimated more accurately, another can be estimated more crudely
- We prove similar results for non-uniform orthogonality
- The o_p(n^{-1/(2k+2)}) rate holds the promise of coping with more complex or higher-dimensional nuisance functions
Question: How do we construct k-orthogonal moments in practice?
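A sketch of the plug-in sandwich variance for the scalar PLR case with the orthogonal moment m = (Y − θT − f(X))(T − g(X)), where J = E[∇_θ m] = −E[T(T − g_0(X))]; the arguments below (fitted nuisances and θ̂) are hypothetical inputs produced by whichever estimator is used:

```python
import numpy as np

def sandwich_variance(Y, T, f_hat, g_hat, theta_hat):
    """Plug-in Sigma = J^{-1} V J^{-1} for the scalar orthogonal PLR moment."""
    m = (Y - theta_hat * T - f_hat) * (T - g_hat)   # moment evaluated at the fitted values
    J = -np.mean(T * (T - g_hat))                   # estimate of E[d m / d theta]
    V = np.var(m)                                   # estimate of Cov(m)
    return V / J**2

# hypothetical use: se = np.sqrt(sandwich_variance(Y, T, f_hat, g_hat, theta_hat) / n)
# followed by a 95% CI of theta_hat +/- 1.96 * se
```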
