
High Dimensional M-Estimation & Inference from Observational Data with Incomplete Responses: A Semi-Parametric Doubly Robust Framework
Abhishek Chakrabortty, Department of Statistics, University of Pennsylvania
Group Meeting, April 24, 2019


The Two Standard (Fundamental) Assumptions
- Ignorability assumption: T ⊥⊥ Y | X. A.k.a. 'missing at random' (MAR) in the missing data literature; a.k.a. 'no unmeasured confounding' (NUC) in causal inference.
- Special case: T ⊥⊥ (Y, X). A.k.a. missing completely at random (MCAR) in the missing data literature, and complete randomization (e.g. randomized trials) in the causal inference (CI) literature.
- Positivity assumption (a.k.a. 'sufficient overlap' in the CI literature): let π(X) := P(T = 1 | X) be the propensity score (PS), and let π0 := P(T = 1). Then π(·) is uniformly bounded away from 0: 1 ≥ π(x) ≥ δπ > 0 for all x ∈ X, for some constant δπ > 0.
Abhishek Chakrabortty, High-Dim. M-Estimation with Missing Responses: A Semi-Parametric Framework, 6/50
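A minimal sketch of how the positivity assumption is checked in practice (not from the talk; the helper name and the floor value delta_pi are illustrative choices): given estimated propensity scores, flag observations whose score falls below the assumed lower bound δπ.

```python
import numpy as np

# Hypothetical helper (illustrative, not from the slides): flag observations
# whose estimated propensity score falls below a user-chosen floor delta_pi,
# i.e. where the positivity assumption 1 >= pi(x) >= delta_pi > 0 looks
# doubtful in the sample.
def positivity_violations(pi_hat, delta_pi=0.05):
    pi_hat = np.asarray(pi_hat, dtype=float)
    return np.flatnonzero(pi_hat < delta_pi)

pi_hat = np.array([0.40, 0.01, 0.75, 0.03, 0.60])
bad = positivity_violations(pi_hat, delta_pi=0.05)
print(bad.tolist())  # [1, 3]: those units fall below the assumed floor
```

In applications one would compute pi_hat from a fitted propensity model and either report or trim such observations.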

Relevance in Biomedical Studies: EHR Data
- Rich resources of data for discovery research; fast growing literature.
- Detailed clinical and phenotypic data collected electronically for large patient cohorts, as part of routine health care delivery.
- Structured data: ICD codes, medications, lab tests, demographics etc.
- Unstructured text data (extracted from clinician notes via NLP): signs and symptoms, family history, social history, radiology reports etc.

EHR Data: The Promises and the Challenges
- Information on a variety of phenotypes (unlike usual cohort studies). Opens up unique opportunities for novel integrative analyses.
- EHR + bio-repositories → genome-phenome association networks, PheWAS studies and genomic risk prediction of diseases.
- The key challenges and bottlenecks for EHR-driven research: logistic difficulty in obtaining validated phenotype (Y) information. Often time/labor/cost intensive (and the ICD codes are imprecise).

EHR Data and Incompleteness: Various Examples
Some examples of missing Y in EHRs and the reason for missingness:
1. Y = some (binary) disease phenotype (e.g. Rheumatoid Arthritis). Requires manual chart review by physicians (logistic constraints).
2. Y = some biomarker (e.g. anti-CCP, an important RA biomarker). Requires lab tests (cost constraints). Similarly, any Y requiring genomic measurements may also have cost/logistics constraints.
- Verified phenotypes/treatment responses/biomarkers/genomic variables (Y) available only for a subset. Clinical features (X) available for all.
- Further issues: selection bias/treatment by indication/preferential labeling (e.g. sicker patients get labeled/treated/tested more often).
- Causal inference problems (treatment effects estimation): EHRs also facilitate comparative effectiveness research on a large scale. Many treatments/medications (and responses) being observed; all other clinical features (X) serve as potential confounders.

Another Example: eQTL Studies (Integrative Genomics)
- Association studies for gene expression (Y) vs. genetic variants (X).
- Popular tools in integrative genomics (genetic association studies + gene expression profiling) for understanding gene regulatory networks.
- Missing data issue: gene expression data are often missing (loss of power), while genetic variants data are often available for a much larger group.
- Causal inference: estimate the causal effect of any one variant (the 'treatment') on Y while all other variants are potential confounders.

High Dimensional M-Estimation: The Parameter(s) of Interest
Goal for M-estimation: estimation and inference, based on Dn, of θ0 ∈ R^d (possibly high dimensional), defined as the risk minimizer:
θ0 ≡ θ0(P) := argmin_{θ ∈ R^d} R(θ), where R(θ) := E{L(Y, X, θ)},
and L(·) ∈ R+ is any 'loss' function that is convex and differentiable in θ.
- Existence of θ0 is implicitly assumed (guaranteed for most usual problems). d can diverge with n (including d ≫ n). Also, θ0(P) is 'model free' (no restrictions on P); in particular, no model assumptions on Y | X.
- The key challenges: the missingness via T (if not accounted for, the estimator will be inconsistent!) and the high dimensional setting. Suitable methods are needed, involving estimation of nuisance functions and careful analyses (due to error terms with complex dependencies).
- Special (but low-d) case: θ0 = E(Y) and L(Y, X, θ) = (Y − θ)². Leads to the average treatment effect (ATE) estimation problem in CI.
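The special case above can be checked numerically in one line of reasoning: with squared loss, the empirical risk is an exact quadratic in θ, so its minimizer is the sample mean. A tiny sketch (illustrative names; grid minimization stands in for a proper optimizer):

```python
import numpy as np

# Special case noted above: with L(Y, X, theta) = (Y - theta)^2 the
# (model-free) risk minimizer is theta_0 = E(Y). On data, minimizing
# the empirical risk over a grid recovers the sample mean.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=2000)

grid = np.linspace(0.0, 4.0, 401)                       # candidate theta values
emp_risk = ((y[:, None] - grid[None, :]) ** 2).mean(axis=0)
theta_hat = grid[emp_risk.argmin()]

print(abs(theta_hat - y.mean()) < 1e-2)  # argmin matches the sample mean
```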

M-Estimation and Missing Data/Causal Inference Problems: A Review
- The framework includes a broad class of M/Z-estimation problems.
- M-estimation for fully observed data: well studied, with a rich literature. Classical settings: Van der Vaart (2000); high dimensional settings: Negahban et al. (2012), Loh and Wainwright (2012, 2015) etc.
- Missing data/causal inference problems: semi-parametric inference. Classical settings: vast literature (typically for mean estimation). Tsiatis (2007); Bang and Robins (2005); Robins et al. (1994) etc.
- High dimensional settings (but low dimensional parameters): a lot of attention in recent times on mean (or ATE) estimation. Belloni et al. (2014, 2017); Farrell (2015); Chernozhukov et al. (2018).
- Much less attention when the parameter itself is high dimensional.
- This work contributes to both literatures above: M-estimation + missing data + high dimensional setting and parameter. (Also has applications in heterogeneous treatment effects estimation in CI.)

HD M-Estimation: A Few (Classes of) Applications
1. All standard high dimensional (HD) regression problems with: (a) missing outcomes and (b) potentially misspecified (working) models.
- E.g. squared loss: L(Y, X, θ) := (Y − X'θ)² → linear regression; logistic loss: L(Y, X, θ) := log{1 + exp(X'θ)} − Y(X'θ) → logistic regression (for binary Y); exponential loss (Poisson regression), and so on.
- Note: throughout, regardless of any motivating 'working model' being true or not, the definition of θ0 is completely 'model free'.
2. Series estimation problems (model free) with missing Y and HD basis functions (instead of X in Example 1 above), e.g. spline bases.
- Use the same choices of L(·) as in Example 1 above, with X replaced by any set of d (possibly HD) basis functions Ψ(X) := {ψj(X)}_{j=1}^d.
- E.g. polynomial bases: Ψ(X) := {1, x_j^k : 1 ≤ j ≤ p, 1 ≤ k ≤ d0}. (d0 = 1 → linear bases as in Example 1; d0 = 3 → cubic splines.)
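The two losses in Example 1 can be written down directly, and both have gradients of the separable form h(X){Y − g(X, θ)} used later in the talk. A sketch (illustrative names; gradients verified against numerical differentiation):

```python
import numpy as np

# The squared and logistic losses from Example 1, plus a check that both
# gradients have the separable form h(X) * (Y - g(X, theta)):
#   squared:  grad = -2x (y - x'theta)        -> h(x) = -2x, g = x'theta
#   logistic: grad = -x (y - expit(x'theta))  -> h(x) = -x,  g = expit(x'theta)
def sq_loss(y, x, theta):
    return (y - x @ theta) ** 2

def logistic_loss(y, x, theta):
    u = x @ theta
    return np.log1p(np.exp(u)) - y * u

def num_grad(f, theta, eps=1e-6):
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta); e[j] = eps
        g[j] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

expit = lambda u: 1.0 / (1.0 + np.exp(-u))
x = np.array([1.0, -0.5, 2.0]); theta = np.array([0.3, 0.1, -0.2]); y = 1.0

g_sq = num_grad(lambda t: sq_loss(y, x, t), theta)
print(np.allclose(g_sq, -2 * x * (y - x @ theta), atol=1e-5))

g_lg = num_grad(lambda t: logistic_loss(y, x, t), theta)
print(np.allclose(g_lg, -x * (y - expit(x @ theta)), atol=1e-5))
```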

Another Application: HD Single Index Models (SIMs)
- Signal recovery in high dimensional single index models (SIMs) with an elliptically symmetric design distribution (e.g. X is Gaussian).
- Let Y = f(β0'X, ε) with f: R² → Y unknown (i.e. β0 identifiable only up to scalar multiples) and ε ⊥⊥ X (i.e., Y ⊥⊥ X | β0'X).
- Consider any of the regression problems introduced in Example 1. Let θ0 := argmin_{θ ∈ R^p} E{L(Y, X'θ)} for any convex loss function L(·): R² → R (convex in the second argument). Then θ0 ∝ β0! A remarkable result due to Li and Duan (1989).
- Classic example of a misspecified parametric model defining θ0, yet θ0 directly relates to an actual (interpretable) semi-parametric model! The proportionality result also preserves any sparsity assumptions.
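The Li and Duan (1989) proportionality can be seen in a quick simulation (illustrative, not from the talk; a special case with additive noise and a tanh link): fit ordinary least squares to data from a nonlinear single index model and check that the fitted direction aligns with β0.

```python
import numpy as np

# Numerical illustration of the Li & Duan (1989) result: with Gaussian X
# and Y = f(beta' X) + noise, the least-squares minimizer is proportional
# to beta even though the linear working model is misspecified.
rng = np.random.default_rng(1)
n, p = 50000, 4
beta = np.array([1.0, -2.0, 0.0, 0.5])
X = rng.standard_normal((n, p))
Y = np.tanh(X @ beta) + 0.1 * rng.standard_normal(n)   # nonlinear link f

theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Directions should align: theta_hat ~ c * beta for some scalar c > 0.
cos = theta_hat @ beta / (np.linalg.norm(theta_hat) * np.linalg.norm(beta))
print(cos > 0.99)
```

Note the zero coordinate of beta stays (near) zero in theta_hat, which is the sparsity-preservation point made above.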

Applications in Causal Inference (Treatment Effects Estimation)
Applications of all these problems in causal inference (estimation of treatment effects, with useful applications in precision medicine):
1. Linear heterogeneous treatment effects estimation: application of the linear regression example (twice). Write {Y(0), Y(1)} linearly as:
Y(j) = X'β(j) + ε(j), E(ε(j) X) = 0 ∀ j = 0, 1, so that
Y(1) − Y(0) = X'β* + ε*, β* := β(1) − β(0), ε* := ε(1) − ε(0).
- β* denotes the (model-free) linear projection of Y(1) − Y(0) | X. Of interest in HD settings when E{Y(1) − Y(0) | X} is difficult to model (Chernozhukov et al., 2017; Chernozhukov and Semenova, 2017).
2. Average conditional treatment effects (ACTE) estimation via series estimators: application of the series estimation example (twice).
3. Causal inference via SIMs (signal recovery, ACTE estimation and ATE estimation): application of the SIM example (twice).
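Item 1 ("the linear regression example, twice") can be sketched in its simplest form (illustrative, not from the talk; fully observed outcomes and a randomized treatment are assumed so that each arm can be fit directly):

```python
import numpy as np

# Sketch of linear heterogeneous treatment effect estimation: fit the
# linear-projection coefficients beta(1) and beta(0) separately on treated
# and control units, then take beta_star = beta(1) - beta(0).
rng = np.random.default_rng(5)
n, p = 50000, 3
X = rng.standard_normal((n, p))
T = rng.random(n) < 0.5                                  # randomized treatment
b0 = np.array([1.0, 0.0, -1.0]); b1 = np.array([2.0, 0.5, -1.0])
Y = np.where(T, X @ b1, X @ b0) + rng.standard_normal(n)

beta1, *_ = np.linalg.lstsq(X[T], Y[T], rcond=None)      # treated-arm fit
beta0, *_ = np.linalg.lstsq(X[~T], Y[~T], rcond=None)    # control-arm fit
beta_star = beta1 - beta0
print(np.abs(beta_star - (b1 - b0)).max() < 0.05)
```

With observational data each arm's fit would instead use the DDR machinery developed in this talk.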

Before Getting Started: A Few Facts and Considerations
- Some notation: m(X) := E(Y | X) and φ(X, θ) := E{L(Y, X, θ) | X}.
- It is generally necessary to 'account' for the missingness in Y. The 'complete case' estimator of θ0 will, in general, be inconsistent!
- That estimator may be consistent only if: (1) ∇φ(X, θ0) = 0 a.s. for every X (for regression problems, this indicates the 'correct model' case), and/or (2) T ⊥⊥ (Y, X) (i.e. the MCAR case).
- Illustration of (1) for squared loss: ∇φ(X, θ0) = E{X(Y − X'θ0) | X} = 0. Hence, E(Y | X) = X'θ0 (i.e. a 'linear model' holds for Y | X).
- With θ0 (and X) high dimensional (compared to n), we need some further structural constraints on θ0 to estimate it using Dn. We assume that θ0 is s-sparse: ||θ0||0 := s and s ≤ min(n, d).
- Note: the sparsity requirement has an attractive (and fairly intuitive) geometric justification for all the examples given here.
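The complete-case inconsistency above is easy to reproduce numerically (illustrative setup, not from the talk): take a misspecified linear working model, E(Y | X) = X², whose linear projection coefficient is θ0 = E(X³) = 0, and let the labeling probability depend on X (MAR but not MCAR). The complete-case fit is then badly biased while the full-data fit is not.

```python
import numpy as np

# Toy demonstration: under MAR (pi depends on X) and a misspecified
# working model, the 'complete case' least-squares estimator of the
# linear projection theta_0 = E[XY]/E[X^2] = E[X^3] = 0 is inconsistent.
rng = np.random.default_rng(2)
n = 200000
X = rng.standard_normal(n)
Y = X ** 2 + 0.1 * rng.standard_normal(n)     # E(Y|X) = X^2: model misspecified

pi = 1.0 / (1.0 + np.exp(-2.0 * X))           # MAR: labeling depends on X only
T = rng.random(n) < pi                        # Y observed only when T = 1

theta_full = (X * Y).mean() / (X * X).mean()            # approx. theta_0 = 0
theta_cc = (X[T] * Y[T]).mean() / (X[T] * X[T]).mean()  # complete-case fit

print(abs(theta_full) < 0.05, theta_cc > 0.3)           # biased complete case
```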

Estimation of θ0: Getting Identifiable Representation(s) of R(θ)
Under the MAR assumption, R(θ) := E{L(Y, X, θ)} ≡ E_X{φ(X, θ)} admits the following debiased and doubly robust (DDR) representation:
R(θ) = E_X{φ(X, θ)} + E[(T / π(X)){L(Y, X, θ) − φ(X, θ)}].  (1)
- Purely non-parametric identification based on the observable Z and the nuisance functions π(X) and φ(X, θ) (unknown but estimable).
- The 2nd term is simply 0; it can be seen as a 'debiasing' term (of sorts). It plays a crucial role in analyzing the empirical version of (1), ensuring first order insensitivity to any estimation errors of π(·) and φ(·).
- Double robustness (DR) aspect: replace {φ(X, θ), π(X)} by any {φ*(X, θ), π*(X)} and (1) continues to hold as long as one, but not necessarily both, of φ*(·) = φ(·) or π*(·) = π(·) holds.
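The double robustness of (1) can be verified in a small simulation (illustrative setup, not from the talk): take squared loss at θ = 0, with Y = 2X + ε, so that φ(X, θ) = E(Y² | X) = 4X² + 1 and R(θ) = 5. Plugging a wrong guess for exactly one of the two nuisance functions still recovers R(θ).

```python
import numpy as np

# Numerical check of the DR property of representation (1): the empirical
# DDR risk stays (approximately) unbiased when ONE nuisance is wrong.
rng = np.random.default_rng(3)
n = 500000
X = rng.standard_normal(n)
Y = 2 * X + rng.standard_normal(n)
pi = 1 / (1 + np.exp(-X))                 # true propensity, MAR
T = rng.random(n) < pi                    # Y used only where T = 1

L = Y ** 2                                # loss at theta = 0
phi = 4 * X ** 2 + 1                      # true phi(X, theta)
R_true = 5.0                              # E(Y^2) = 4 E(X^2) + 1

# (a) correct phi, wrong pi*: the debiasing term still has mean ~ 0.
wrong_pi = np.full(n, 0.5)
R_a = (phi + T / wrong_pi * (L - phi)).mean()

# (b) correct pi, wrong phi*: the inverse-weighted term corrects phi*.
wrong_phi = np.zeros(n)
R_b = (wrong_phi + T / pi * (L - wrong_phi)).mean()

print(abs(R_a - R_true) < 0.15, abs(R_b - R_true) < 0.15)
```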

The DDR Estimator of θ0
Given any estimators {π̂(·), φ̂(·)} of the nuisance functions {π(·), φ(·)}, we define our L1-penalized DDR estimator θ̂_DDR of θ0 as:
θ̂_DDR ≡ θ̂_DDR(λn) := argmin_{θ ∈ R^d} { L̂n^DDR(θ) + λn ||θ||1 }, where
L̂n^DDR(θ) := (1/n) Σ_{i=1}^n [ φ̂(Xi, θ) + (Ti / π̂(Xi)) { L(Yi, Xi, θ) − φ̂(Xi, θ) } ],
λn ≥ 0 is the tuning parameter and {π̂(·), φ̂(·)} are arbitrary except for satisfying two basic conditions regarding their construction:
- π̂(·) obtained from the data Tn := {Ti, Xi}_{i=1}^n only; {φ̂(Xi, θ)}_{i=1}^n obtained in a 'cross-fitted' manner (via sample splitting).
- Assume (temporarily) that {π̂(·), φ̂(·)} are both 'correct'. DR properties (consistency) of θ̂_DDR under their misspecifications are discussed later.

Simplifying Assumptions and User-Friendly Implementation Algorithm
For simplicity, assume that the gradient ∇L(Y, X, θ) of L(·) satisfies a 'separable form' as follows: for some h(X) ∈ R^d and g(X, θ) ∈ R,
∇L(Y, X, θ) = h(X){Y − g(X, θ)}, and hence, ∇φ̂(X, θ) = h(X){m̂(X) − g(X, θ)},
where m̂(X) denotes the corresponding (cross-fitted) estimator of m(X).
- This simplifying assumption holds for all the examples given before.
- The assumed form ⇒ we only need to obtain m̂(Xi), not φ̂(Xi, θ).
Implementation algorithm. θ̂_DDR can be obtained simply as:
θ̂_DDR ≡ θ̂_DDR(λn) := argmin_{θ ∈ R^d} { (1/n) Σ_{i=1}^n L(Ỹi, Xi, θ) + λn ||θ||1 },
where Ỹi := m̂(Xi) + (Ti / π̂(Xi)){Yi − m̂(Xi)}, ∀ i, is a 'pseudo' outcome.
- Can use 'glmnet' in R. Pretend to have 'full' data: {Ỹi, Xi}_{i=1}^n.
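The implementation algorithm above can be sketched end to end (illustrative, not the talk's code): build the pseudo outcomes Ỹi and run an ordinary L1-penalized regression on {Ỹi, Xi} as if the data were fully observed. The tiny coordinate-descent lasso below stands in for glmnet; the nuisance estimates are deliberately crude placeholders (in practice they would be cross-fitted fits, not oracle quantities).

```python
import numpy as np

# Minimal L1-penalized least squares via coordinate descent (a stand-in
# for glmnet): minimizes (1/2n)||y - X theta||^2 + lam ||theta||_1.
def lasso_cd(X, y, lam, iters=200):
    n, d = X.shape
    theta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(d):
            r_j = y - X @ theta + X[:, j] * theta[j]       # partial residual
            z = X[:, j] @ r_j
            theta[j] = np.sign(z) * max(abs(z) - n * lam, 0.0) / col_sq[j]
    return theta

rng = np.random.default_rng(4)
n, d = 2000, 10
X = rng.standard_normal((n, d))
theta0 = np.zeros(d); theta0[:2] = [1.5, -1.0]             # sparse truth
Y = X @ theta0 + rng.standard_normal(n)
pi = 1 / (1 + np.exp(-X[:, 0]))                            # MAR labeling
T = rng.random(n) < pi                                     # Y used only if T = 1

# Crude (deliberately misspecified) m_hat; the correct pi keeps the
# pseudo outcomes unbiased for m(X), per the double robustness.
m_hat = 0.5 * (X @ theta0)
Y_tilde = m_hat + T / pi * (Y - m_hat)                     # pseudo outcomes

theta_hat = lasso_cd(X, Y_tilde, lam=0.05)
print(np.abs(theta_hat - theta0).max() < 0.3)
```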

Properties of θ̂_DDR: Deterministic Deviation Bounds

Assume L(·) is convex and differentiable in θ, and that L̂_n^DDR(θ) satisfies the Restricted Strong Convexity (RSC) condition (Negahban et al., 2012) at θ = θ_0. Then, for any choice of λ_n ≥ 2‖∇L̂_n^DDR(θ_0)‖_∞,

‖θ̂_DDR(λ_n) − θ_0‖_2 ≲ λ_n √s, and ‖θ̂_DDR(λ_n) − θ_0‖_1 ≲ λ_n s,

where s := ‖θ_0‖_0. This is a deterministic deviation bound: it holds for any choices of {π̂(·), m̂(·)} and for any realization of D_n.

The RSC (or 'cone') condition for L̂_n^DDR(θ) is exactly the same as the usual RSC condition required under fully observed data, and the validity of that condition is well studied.

Key quantity of interest: the random lower bound ‖∇L̂_n^DDR(θ_0)‖_∞ for λ_n. Probabilistic bounds on it are needed to determine the convergence rate of θ̂_DDR.


The Main Goal from Hereon: Probabilistic Bounds for ‖∇L̂_n^DDR(θ_0)‖_∞

Bounds on ‖∇L̂_n^DDR(θ_0)‖_∞ determine the rate of choice of λ_n and hence the convergence rate of θ̂_DDR (via the deviation bound).

Probabilistic bounds for ‖∇L̂_n^DDR(θ_0)‖_∞: the basic decomposition

‖∇L̂_n^DDR(θ_0)‖_∞ ≤ ‖T_{0,n}‖_∞ + ‖T_{π,n}‖_∞ + ‖T_{m,n}‖_∞ + ‖R_{π,m,n}‖_∞,

where T_{0,n} is the 'main' term (a centered i.i.d. average), T_{π,n} is the 'π-error' term involving π̂(·) − π(·), T_{m,n} is the 'm-error' term involving m̂(·) − m(·), and R_{π,m,n} is the '(π, m)-error' term (usually of lower order) involving the product of π̂(·) − π(·) and m̂(·) − m(·).

Each term is controlled separately. The analyses are all non-asymptotic and nuanced, especially in order to get sharp rates for T_{π,n} and T_{m,n}.

We show: ‖∇L̂_n^DDR(θ_0)‖_∞ ≲ √((log d)/n) with high probability, and hence ‖θ̂_DDR − θ_0‖_2 ≲ √(s (log d)/n). So the estimator is rate optimal.
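The exact definitions of the four terms are in the paper; the split below is one exact algebraic decomposition consistent with the slide's description (an assumption on our part), checked numerically for the squared loss with h(X) = Ψ(X) = X and deliberately perturbed oracle nuisances. All simulated quantities are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5

# Squared-loss setting: h(X) = Psi(X) = X, g(X, theta) = X'theta (illustrative).
X = rng.normal(size=(n, d))
theta0 = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
m = X @ theta0                                  # true m(X) = E(Y | X)
Y = m + rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-0.3 * X[:, 0] - 0.2))
T = rng.binomial(1, pi)

# Perturbed nuisances, standing in for estimated pi_hat, m_hat.
pi_hat = np.clip(pi + 0.05 * rng.normal(size=n), 0.05, 0.95)
m_hat = m + 0.1 * rng.normal(size=n)

# Full DDR gradient at theta0, via the pseudo outcome.
Y_tilde = m_hat + (T / pi_hat) * (Y - m_hat)
grad = (X * (Y_tilde - X @ theta0)[:, None]).mean(axis=0)

# One exact algebraic split matching the slide's description of the four terms:
Y_star = m + (T / pi) * (Y - m)                 # oracle pseudo outcome
T0 = (X * (Y_star - X @ theta0)[:, None]).mean(axis=0)                 # main iid term
Tpi = (X * (T * (1 / pi_hat - 1 / pi) * (Y - m))[:, None]).mean(axis=0)  # pi-error
Tm = (X * ((1 - T / pi) * (m_hat - m))[:, None]).mean(axis=0)            # m-error
R = (X * (T * (1 / pi - 1 / pi_hat) * (m_hat - m))[:, None]).mean(axis=0)  # product term
```

The four terms sum exactly to the full gradient, since Ŷ − Y* = (1 − T/π)(m̂ − m) + T(1/π̂ − 1/π)(Y − m) + T(1/π − 1/π̂)(m̂ − m).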


Convergence Rates and Bounds for ‖∇L̂_n^DDR(θ_0)‖_∞ (and θ̂_DDR)

Basic (high-level) consistency conditions on {π̂(·), m̂(·)}. Let {π̂(·), m̂(·)} be any general and 'correct' estimators of {π(·), m(·)}, and assume they satisfy the following pointwise convergence rates:

|π̂(x) − π(x)| ≲_P δ_{n,π} and |m̂(x) − m(x)| ≲_P ξ_{n,m} ∀ x ∈ X,   (2)

for some sequences δ_{n,π}, ξ_{n,m} ≥ 0 such that (δ_{n,π} + ξ_{n,m}) √log(nd) = o(1) and the product δ_{n,π} ξ_{n,m} (log n) = o(√((log d)/n)).

Under condition (2), along with some additional 'suitable' tail assumptions (sub-Gaussian tails etc.), we have, with high probability:

‖T_{0,n}‖_∞ ≲ √((log d)/n), ‖T_{π,n}‖_∞ ≲ δ_{n,π} √log(nd) √((log d)/n), and

‖T_{m,n}‖_∞ ≲ ξ_{n,m} √log(nd) √((log d)/n), ‖R_{π,m,n}‖_∞ ≲ δ_{n,π} ξ_{n,m} (log n).

Hence, ‖∇L̂_n^DDR(θ_0)‖_∞ ≲ √((log d)/n) {1 + o(1)} with high probability.


HD Inference for θ̂_DDR: Desparsification and Asymptotic Linear Expansion

Consider θ̂_DDR for the squared loss: L(Y, X, θ) := {Y − Ψ(X)′θ}², where Ψ(X) ∈ R^d denotes any HD vector of basis functions of X. Define Σ := E{Ψ(X)Ψ(X)′} and Ω := Σ^{−1}, and let Ω̂ be any reasonable estimator of Ω (with Ω assumed sparse if required). We then define the desparsified DDR estimator θ̃_DDR as follows:

θ̃_DDR := θ̂_DDR + Ω̂ (1/n) ∑_{i=1}^n {Ŷ_i − Ψ(X_i)′θ̂_DDR} Ψ(X_i),

where the second term is the desparsification/debiasing term and Ŷ_i := m̂(X_i) + {T_i/π̂(X_i)}{Y_i − m̂(X_i)} are the pseudo outcomes.

The debiasing is similar (in spirit) to van de Geer et al. (2014), except it is the 'right' one for this problem (using pseudo outcomes in the full data).
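The debiasing step can be sketched as follows; a toy Python example with Ψ(X) = X, oracle nuisances as stand-ins for {π̂(·), m̂(·)}, and (since d ≪ n here) a plain sample-precision inverse standing in for a node-wise-Lasso Ω̂. All numbers are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, d = 800, 10

# Squared loss with Psi(X) = X; theta0 sparse (hypothetical example).
X = rng.normal(size=(n, d))
theta0 = np.zeros(d)
theta0[:2] = [1.5, -1.0]
m = X @ theta0
Y = m + rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-0.4 * X[:, 1]))
T = rng.binomial(1, pi)

# Pseudo outcomes with oracle nuisances; Y enters only where T_i = 1.
Y_tilde = m + (T / pi) * (Y - m)

# Initial L1-penalized DDR estimator.
theta_ddr = Lasso(alpha=0.05).fit(X, Y_tilde).coef_

# Debiasing: theta_tilde = theta_ddr + Omega_hat * (1/n) sum Psi(X_i) * residual_i.
Sigma_hat = X.T @ X / n
Omega_hat = np.linalg.inv(Sigma_hat)     # stand-in for a sparse Omega estimator
resid = Y_tilde - X @ theta_ddr
theta_despars = theta_ddr + Omega_hat @ (X.T @ resid / n)
```

The correction term removes the Lasso shrinkage bias coordinate-wise, at the price of no longer being sparse.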


The Desparsified DDR Estimator: Asymptotic Linear Expansion

Assume: the basic convergence conditions (2) for {π̂(·), m̂(·)}; ΩX is sub-Gaussian; and ‖Ω̂ − Ω‖_1 = O_P(a_n), ‖I − Ω̂Σ̂‖_max = O_P(b_n), with a_n √log d = o(1) and b_n s √log d = o(1), where s := ‖θ_0‖_0. Then θ̃_DDR satisfies the asymptotic linear expansion (ALE):

θ̃_DDR − θ_0 = (1/n) ∑_{i=1}^n Ω ψ_0(Z_i) + Δ_n, where ‖Δ_n‖_∞ = o_P(n^{−1/2})

and ψ_0(Z) := [{T/π(X)}{Y − m(X)} + {m(X) − Ψ(X)′θ_0}] Ψ(X), with E{ψ_0(Z)} = 0. The ALE facilitates inference (e.g. confidence intervals) for any low-dimensional component of θ_0 via Gaussian approximation.

Further, the ALE is also 'optimal'. The function Ψ_eff(Z) := Ω ψ_0(Z) is the 'efficient' influence function for θ_0 (Robins et al., 1994). Thus, in classical settings, θ̃_DDR achieves the semi-parametric efficiency bound.


The Desparsified Estimator: Asymptotic Normality and Some Final Remarks

Coordinate-wise asymptotic normality of θ̃_DDR: ∀ 1 ≤ j ≤ d,

√n (θ̃_DDR − θ_0)_j →_d N(0, σ²_{0,j}), where σ²_{0,j} := Var{Ω′_{j·} ψ_0(Z)}.

Further, max_{1≤j≤d} |σ̂_{0,j} − σ_{0,j}| = o_P(1), where σ̂_{0,j} is the plug-in estimator obtained by plugging Ω̂, π̂(·) and m̂(·) into Var{Ω′_{j·} ψ_0(Z)}.

One can choose Ω̂ to be any standard (sparse) precision matrix estimator, e.g. the node-wise Lasso estimator. Under suitable conditions, a_n = s_Ω √((log d)/n) and b_n = √((log d)/n), with s_Ω := max_{1≤j≤d} ‖Ω_{j·}‖_0.

The error Δ_n can be decomposed as Δ_n = Δ_{n,1} + Δ_{n,2} + Δ_{n,3}, where Δ_{n,1} := (Ω̂ − Ω) (1/n) ∑_{i=1}^n ψ_0(Z_i), Δ_{n,2} := (I_d − Ω̂Σ̂)(θ̂_DDR − θ_0) and Δ_{n,3} := Ω̂(T_{π,n} + T_{m,n} + R_{π,m,n}), with ‖Δ_{n,3}‖_∞ ≲_P n^{−1/2} and

‖Δ_{n,1}‖_∞ ≲ a_n √((log d)/n), ‖Δ_{n,2}‖_∞ ≲ b_n s √((log d)/n).
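The coordinate-wise confidence intervals this slide licenses can be sketched as below: plug-in influence-function variances σ̂_{0,j}, then θ̃_j ± 1.96 σ̂_{0,j}/√n. As before, the simulation, the oracle nuisances and the plain matrix inverse (in place of a node-wise Lasso Ω̂) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n, d = 800, 10
X = rng.normal(size=(n, d))
theta0 = np.zeros(d)
theta0[:2] = [1.5, -1.0]
m = X @ theta0
Y = m + rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-0.4 * X[:, 1]))
T = rng.binomial(1, pi)

# DDR Lasso on pseudo outcomes, then the debiasing step.
Y_tilde = m + (T / pi) * (Y - m)            # Y enters only where T_i = 1
theta_ddr = Lasso(alpha=0.05).fit(X, Y_tilde).coef_
Omega_hat = np.linalg.inv(X.T @ X / n)      # stand-in for node-wise Lasso
theta_despars = theta_ddr + Omega_hat @ (X.T @ (Y_tilde - X @ theta_ddr) / n)

# Plug-in variance: sigma_{0,j}^2 = Var{Omega_j. psi_0(Z)}, with
# psi_0(Z) = [(T/pi)(Y - m) + (m - Psi'theta)] Psi(X), theta = theta_despars.
resid = (T / pi) * (Y - m) + (m - X @ theta_despars)
infl = (X * resid[:, None]) @ Omega_hat.T   # row i: Omega_hat psi_0(Z_i)
sigma_hat = infl.std(axis=0)

# 95% normal-approximation CI for each coordinate of theta0.
lo = theta_despars - 1.96 * sigma_hat / np.sqrt(n)
hi = theta_despars + 1.96 * sigma_hat / np.sqrt(n)
```

Each interval has the nominal asymptotic coverage for its coordinate of θ_0 under the slide's conditions.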


The DR Aspect: General Convergence Rates (under Misspecification)

Finally, let {π̂(·), m̂(·)} → {π*(·), m*(·)}, with either π*(·) = π(·) or m*(·) = m(·), but not necessarily both. Assume the same pointwise convergence conditions and rates (δ_{n,π}, ξ_{n,m}) for {π̂(·), m̂(·)} as in (2), but now with {π(·), m(·)} therein replaced by {π*(·), m*(·)}. Under some 'suitable' assumptions, we have, with high probability:

‖T_{0,n}‖_∞ + ‖T_{π,n}‖_∞ + ‖T_{m,n}‖_∞ ≲ {1 + 1((π*, m*) ≠ (π, m))} √((log d)/n)

and ‖R_{π,m,n}‖_∞ ≲ δ_{n,π} 1(m* ≠ m) + ξ_{n,m} 1(π* ≠ π) + δ_{n,π} ξ_{n,m} (log n).

The 2nd and/or 3rd terms now also contribute to the rate √((log d)/n). The 4th term is o(1) but no longer ignorable (and may be slower).

Regardless, this establishes general convergence rates and the DR property of θ̂_DDR under possible misspecification of {π̂(·), m̂(·)}. For the 4th term, sharper rates need a case-by-case analysis.
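A toy numerical check of the double-robustness property: misspecify one nuisance grossly while keeping the other correct, and the estimator remains consistent either way. Oracle components stand in for fitted nuisances, and all simulated values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(8)
n, d = 2000, 10
X = rng.normal(size=(n, d))
theta0 = np.zeros(d)
theta0[:2] = [1.0, -1.0]
m = X @ theta0                               # true outcome regression
Y = m + rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-0.5 * X[:, 1]))    # true propensity score
T = rng.binomial(1, pi)

def ddr_fit(pi_hat, m_hat, lam=0.05):
    # Lasso on pseudo outcomes; Y contributes only through terms multiplied by T.
    Y_tilde = m_hat + (T / pi_hat) * (Y - m_hat)
    return Lasso(alpha=lam).fit(X, Y_tilde).coef_

# Scenario A: pi_hat correct, m_hat grossly misspecified (m* = 0 != m).
theta_A = ddr_fit(pi, np.zeros(n))

# Scenario B: m_hat correct, pi_hat misspecified (pi* = 0.5 constant != pi).
theta_B = ddr_fit(np.full(n, 0.5), m)
```

In Scenario A, E(Ŷ | X) = m(X) because E(T/π | X) = 1; in Scenario B, E{(T/π*)(Y − m) | X} = 0 under MAR. Either way the pseudo outcome remains correctly centered, which is the DR mechanism at work.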


Choices of the Nuisance Component Estimators π̂(·) and m̂(·)

Note: our theory holds generally for any choices of π̂(·) and m̂(·) under mild conditions (provided they are both 'correct' estimators). Under misspecification, consistency and general (non-sharp) rates are also established; sharp rates need case-by-case analyses. Even for the mean (or ATE) estimation problem, this can be quite tricky in HD settings; see Smucler et al. (2019) for a detailed analysis.

Below we provide only some choices of π̂(·) and m̂(·) that may be used to implement our theory and methods for θ̂_DDR. In general, one can use any reasonable method (including black-box ML methods).

Choices of π̂(·) and m̂(·): we consider estimators from two families. Parametric and 'extended' parametric families (series estimators). Semi-parametric single-index families.


Choices of π̂(·): 'Extended' Parametric Families (Series Estimators)

If π(·) is known, we set π̂(·) := π(·). Otherwise, we estimate π(·) via two (classes of) choices of π̂(·) (each assumed to be 'correct').

'Extended' parametric family: π(x) = g{α′Ψ(x)}, where g(·) ∈ [0, 1] is a known function [e.g. g_expit(u) := exp(u)/{1 + exp(u)}], Ψ(X) := {ψ_k(X)}_{k=1}^K is any set of K basis functions (possibly with K ≫ n), and α ∈ R^K is an unknown (sparse) parameter vector.

Example: Ψ(X) may correspond to the polynomial bases of X up to any fixed degree k. Note: the special case of linear bases (k = 1) includes all standard parametric regression models. Further, the case of π(·) = constant (but unknown), i.e. MCAR, is also included.

Estimator: we set π̂(X) = g{α̂′Ψ(X)}, where α̂ denotes any suitable (possibly penalized) estimator of α based on T_n := {T_i, X_i}_{i=1}^n.

Example of α̂: when g(·) = g_expit(·), α̂ may be obtained from a standard L_1-penalized logistic regression of {T_i vs. Ψ(X_i)}_{i=1}^n.
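The expit-link example above can be sketched with scikit-learn's L1-penalized logistic regression. Linear bases Ψ(X), the sparse α, and the penalty level C are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
n, K = 1000, 15
Psi = rng.normal(size=(n, K))                 # basis functions Psi(X); linear here

# Sparse true coefficient alpha; PS in the extended parametric family with expit g.
alpha = np.zeros(K)
alpha[:2] = [1.0, -0.8]
pi = 1.0 / (1.0 + np.exp(-(Psi @ alpha + 0.3)))
T = rng.binomial(1, pi)

# L1-penalized logistic regression of T on Psi(X); liblinear supports penalty='l1'.
clf = LogisticRegression(penalty='l1', solver='liblinear', C=1.0).fit(Psi, T)
pi_hat = clf.predict_proba(Psi)[:, 1]         # fitted propensity scores g{alpha_hat'Psi}
```

In a real analysis, C (equivalently the penalty λ) would be chosen by cross-validation, e.g. via LogisticRegressionCV.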


Choices of π̂(·): Semi-Parametric Single-Index Families

Semi-parametric single-index family: π(X) = g(α′X), where g(·) ∈ (0, 1) is unknown and α ∈ R^p is a (sparse) unknown parameter (identifiable only up to scalar multiples, hence set ‖α‖_2 = 1 wlog).

Given an estimator α̂ of α, we estimate π(X) ≡ E(T | α′X) as:

π̂(x) ≡ π̂(α̂, x) := [ (1/(nh)) ∑_{i=1}^n T_i K{α̂′(X_i − x)/h} ] / [ (1/(nh)) ∑_{i=1}^n K{α̂′(X_i − x)/h} ],

where K(·) denotes any standard (2nd-order) kernel function and h = h_n > 0 denotes a bandwidth sequence with h = o(1).

Obtaining α̂: in general, any approach (if available) from the (high-dimensional) single-index model literature can be used. But if X is elliptically symmetric, then α̂ may be obtained simply via a standard L_1-penalized logistic regression of {T_i vs. X_i}_{i=1}^n.
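The kernel-smoothing formula above is a Nadaraya-Watson estimator along the estimated index. A minimal sketch with a Gaussian kernel, the oracle index direction standing in for α̂, and an illustrative bandwidth of order n^{−1/5}:

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 2000, 5
X = rng.normal(size=(n, p))
alpha = np.array([0.8, 0.6, 0.0, 0.0, 0.0])   # ||alpha||_2 = 1
idx = X @ alpha
g_true = 0.1 + 0.8 / (1.0 + np.exp(-idx))     # unknown link, bounded in (0.1, 0.9)
T = rng.binomial(1, g_true)

def pi_hat(x, alpha_hat, h):
    # Nadaraya-Watson estimate of E(T | alpha'X) at x, Gaussian (2nd-order) kernel.
    z = (X @ alpha_hat - x @ alpha_hat) / h
    w = np.exp(-0.5 * z ** 2)
    return np.sum(w * T) / np.sum(w)

h = n ** (-1 / 5)                             # illustrative bandwidth order
# Oracle alpha stands in for any single-index estimate alpha_hat.
est = np.array([pi_hat(X[i], alpha, h) for i in range(100)])
```

The common 1/(nh) factors cancel in the ratio, so only the kernel weights matter; a real implementation would also cross-fit and tune h.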


Choices of m̂(·): 'Extended' Parametric Families (Series Estimators)

'Extended' parametric family: m(x) = g{γ′Ψ(x)}, where g(·) is a known 'link' function [e.g. 'canonical' links: identity, expit or exp], Ψ(X) := {ψ_k(X)}_{k=1}^K is any set of K basis functions (possibly with K ≫ n), and γ ∈ R^K is an unknown (sparse) parameter vector.

Example: Ψ(X) may correspond to the polynomial bases of X up to any fixed degree k. Note: the special case of linear bases (k = 1) includes all standard parametric regression models.

Estimator: we set m̂(X) = g{γ̂′Ψ(X)}, where γ̂ denotes any suitable (possibly penalized) estimator of γ based on the data subset of 'complete cases': D_n^(c) := {(Y_i, X_i) | T_i = 1}_{i=1}^n.

Example of γ̂: when g(·) is any 'canonical' link function, γ̂ may simply be obtained from the respective usual L_1-penalized 'canonical'-link regression (e.g. linear, logistic or Poisson) of {Y_i vs. X_i | T_i = 1}_{i=1}^n from the complete-case data D_n^(c).
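The identity-link case of the complete-case fit can be sketched as below: fit an L1-penalized linear regression on the cases with T_i = 1 only, then predict m̂(X_i) for every i, including those with missing Y_i. The simulated design and penalty are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(11)
n, K = 1000, 30
Psi = rng.normal(size=(n, K))                 # basis functions Psi(X); identity link
gamma = np.zeros(K)
gamma[:3] = [1.0, -0.5, 0.5]
m = Psi @ gamma
Y = m + rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-0.5 * Psi[:, 0]))
T = rng.binomial(1, pi)                       # MAR: T depends on covariates only

# Fit on the complete cases D_n^(c) = {(Y_i, X_i) : T_i = 1} ...
cc = T == 1
fit = Lasso(alpha=0.05).fit(Psi[cc], Y[cc])

# ... then evaluate m_hat(X_i) for all i (needed for the pseudo outcomes).
m_hat = fit.predict(Psi)
```

Under MAR, E(Y | X, T = 1) = E(Y | X) = m(X), which is why the complete-case regression is consistent for m(·) here.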

Choices of m̂(·): Semi-Parametric Single-Index Families

Semi-parametric single-index family: m(X) = g(γ′X), where g(·) is an unknown 'link' and γ ∈ R^p is a (sparse) unknown parameter (identifiable only up to scalar multiples, hence set ‖γ‖_2 = 1 wlog).
