High Dimensional M-Estimation & Inference from Observational Data with Incomplete Responses

A Semi-Parametric Doubly Robust Framework

Abhishek Chakrabortty¹

Department of Statistics University of Pennsylvania

Group Meeting

April 24, 2019

¹Joint work with Jiarui Lu, T. Tony Cai and Hongzhe Li.


Big Data Era: The Challenges of Incomplete Information

- Current era of 'big data' and data science ⇒ rapid influx of large and high-dimensional data (easily available and computationally tractable).
- Rich information on multitudes of variables in the same place ⇒ many interesting scientific questions, and also unique statistical challenges!
- One frequently encountered challenge: incompleteness of the data and, in particular, (partial) missingness of the response of interest.
- Reasons could be 'circumstantial' (e.g. practical constraints such as logistics, time, or cost), or 'by design' (e.g. due to the 'treatment' assignment/non-assignment mechanism): the response corresponding to a 'treatment' of interest cannot be observed for a person who is not 'treated' (and vice versa).
- Another complication in both cases: the observational nature of the data. The missingness mechanism could be informative (not randomized)!



Challenges of Incompleteness Contd. and Relevance in Modern Studies

- Observational data ⇒ typically an informative missingness (or treatment assignment) mechanism, which could depend on the person's covariates.
- Often termed selection bias, treatment by indication, or confounding (in causal inference) in observational studies. It has to be factored in!
- Need to account for the missingness in a proper, principled way under minimal conditions to ensure valid, unbiased (and robust) inference.
- Relevance: these issues occur in virtually any modern-day large-scale observational study arising in various scientific disciplines, including:
  - Biomedical studies (e.g. electronic health records (EHR) data);
  - Integrative genomics (e.g. gene expression data and eQTL studies);
  - Also econometrics (policy evaluation), computer science, finance etc.



The Basic Framework and Set-Up

- Variables of interest: outcome Y ∈ Y ⊆ R and covariates X ∈ X ⊆ Rp (possibly high dimensional compared to the sample size). The supports Y and X of Y and X need not be continuous.
- Main issue: Y may not always be observed. Let T ∈ {0, 1} denote the indicator of the true Y being observed. The (partly) unobserved random vector (T, Y, X) is assumed to be jointly defined on a common probability space with measure P(·).
- Observable data: Dn := {Zi := (Ti, TiYi, Xi) : i = 1, . . . , n}, i.i.d. copies of Z := (T, TY, X), whose distribution is defined via P(·).
- High dimensional setting: p can diverge with n (including p ≫ n).
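As a toy illustration (all distributional choices here are arbitrary, not from the slides), the observable data Dn = {(Ti, TiYi, Xi)} can be simulated under MAR and positivity as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 5

# Covariates X and a response Y depending on X (any joint law would do).
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -0.5, 0.0, 0.0, 0.25]) + rng.normal(size=n)

# MAR missingness: T depends on X only (here via a logistic propensity),
# clipped so that pi(x) >= 0.1, i.e. the positivity assumption holds.
pi = np.clip(1 / (1 + np.exp(-(0.5 + X[:, 0]))), 0.1, 1.0)
T = rng.binomial(1, pi)

# Observable data Z_i = (T_i, T_i * Y_i, X_i): Y is seen only when T = 1.
TY = T * Y
```

Only (T, TY, X) would be available to the analyst; the full Y is retained here purely to verify the masking.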



Applicability of the Framework

- Generally applicable to any missing data setting with missing outcomes Y and (possibly) high dimensional covariates X.
- Causal inference problems (via the 'potential outcomes' framework). Here, X is often called 'confounders' (for observational studies) or 'adjustment' variables/features (for randomized trials).
- Usual set-up: binary 'treatment' (a.k.a. exposure/intervention) assignment T ∈ {0, 1}, and potential outcomes {Y(0), Y(1)}. Observed outcome: Y := Y(0)1(T = 0) + Y(1)1(T = 1), i.e. depending on T, we observe only one of {Y(0), Y(1)}.
- For each j ∈ {0, 1}, this set-up is included via the 'map': (T, Y, X) ← (Tj, Y(j), X) with Tj := 1(T = j) ∀ j ∈ {0, 1}.
- The case of any multi-category treatment is similarly included.



The Two Standard (Fundamental) Assumptions

1. Ignorability assumption: T ⊥⊥ Y | X. A.k.a. 'missing at random' (MAR) in the missing data literature; a.k.a. 'no unmeasured confounding' (NUC) in causal inference. Special case: T ⊥⊥ (Y, X), a.k.a. missing completely at random (MCAR) in the missing data literature, and complete randomization (e.g. randomized trials) in the causal inference (CI) literature.

2. Positivity assumption (a.k.a. 'sufficient overlap' in the CI literature): let π(X) := P(T = 1 | X) be the propensity score (PS), and let π0 := P(T = 1). Then π(·) is uniformly bounded away from 0: 1 ≥ π(x) ≥ δπ ∀ x ∈ X, for some constant δπ > 0.



Relevance in Biomedical Studies: EHR Data

- Rich resources of data for discovery research; fast growing literature.
- Detailed clinical and phenotypic data collected electronically for large patient cohorts, as part of routine health care delivery.
- Structured data: ICD codes, medications, lab tests, demographics etc.
- Unstructured text data (extracted from clinician notes via NLP): signs and symptoms, family history, social history, radiology reports etc.



EHR Data: The Promises and the Challenges

- Information on a variety of phenotypes (unlike usual cohort studies). Opens up unique opportunities for novel integrative analyses.
- EHR + bio-repositories ⇒ genome–phenome association networks, PheWAS studies and genomic risk prediction of diseases.
- The key challenges and bottlenecks for EHR-driven research: logistic difficulty in obtaining validated phenotype (Y) information. Often time/labor/cost intensive (and the ICD codes are imprecise).



EHR Data and Incompleteness: Various Examples

Some examples of missing Y in EHRs and the reasons for missingness:

1. Y = some (binary) disease phenotype (e.g. rheumatoid arthritis). Requires manual chart review by physicians (logistic constraints).

2. Y = some biomarker (e.g. anti-CCP, an important RA biomarker). Requires lab tests (cost constraints). Similarly, any Y requiring genomic measurements may also have cost/logistics constraints.

- Verified phenotypes/treatment responses/biomarkers/genomic variables (Y) are available only for a subset; clinical features (X) are available for all.
- Further issues: selection bias/treatment by indication/preferential labeling (e.g. sicker patients get labeled/treated/tested more often).
- Causal inference problems (treatment effects estimation): EHRs also facilitate comparative effectiveness research on a large scale, with many treatments/medications (and responses) being observed. All other clinical features (X) serve as potential confounders.


Another Example: eQTL Studies (Integrative Genomics)

- Association studies for gene expression (Y) vs. genetic variants (X).
- Popular tools in integrative genomics (genetic association studies + gene expression profiling) for understanding gene regulatory networks.
- Missing data issue: gene expression data are often missing (loss of power), while genetic variants data are often available for a much larger group.
- Causal inference: estimate the causal effect of any one variant (the 'treatment') on Y, while all other variants are potential confounders.



High Dimensional M-Estimation: The Parameter(s) of Interest

- Goal for M-estimation: estimation and inference, based on Dn, of θ0 ∈ Rd (possibly high dimensional), defined as the risk minimizer:

  θ0 ≡ θ0(P) := argmin_{θ ∈ Rd} R(θ), where R(θ) := E{L(Y, X, θ)},

  and L(·) ∈ R+ is any 'loss' function that is convex and differentiable in θ.
- Existence of θ0 is implicitly assumed (guaranteed for most usual problems). d can diverge with n (including d ≫ n). Also, θ0(P) is 'model free' (no restrictions on P); in particular, no model assumptions on Y | X.
- The key challenges: the missingness via T (if not accounted for, the estimator will be inconsistent!) and the high dimensional setting. Suitable methods are needed, involving estimation of nuisance functions and careful analyses (due to error terms with complex dependencies).
- Special (but low dimensional) case: θ0 = E(Y) with L(Y, X, θ) = (Y − θ)². Leads to the average treatment effect (ATE) estimation problem in CI.
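For the fully observed squared-loss case, the risk minimizer is characterized by the first-order condition E{X(Y − X′θ0)} = 0; a minimal numerical sketch of the empirical analogue (all data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5000, 3
X = rng.normal(size=(n, d))
theta_true = np.array([2.0, 0.0, -1.0])
Y = X @ theta_true + rng.normal(size=n)

# Empirical risk minimizer for the squared loss L(Y, X, θ) = (Y - X'θ)^2
# on fully observed data: the least-squares solution (normal equations).
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# First-order condition of the convex empirical risk: mean gradient ≈ 0.
grad = -2 * X.T @ (Y - X @ theta_hat) / n
```

Here a linear model happens to hold, so theta_hat also recovers theta_true; in general θ0 is simply the best (model-free) risk minimizer.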



M-Estimation and Missing Data/Causal Inference Problems: A Review

- The framework includes a broad class of M/Z-estimation problems.
- M-estimation for fully observed data: well studied, with a rich literature. Classical settings: Van der Vaart (2000); high dimensional settings: Negahban et al. (2012), Loh and Wainwright (2012, 2015) etc.
- Missing data/causal inference problems: semi-parametric inference. Classical settings: vast literature (typically for mean estimation). Tsiatis (2007); Bang and Robins (2005); Robins et al. (1994) etc.
- High dimensional settings (but low dimensional parameters): a lot of attention in recent times on mean (or ATE) estimation. Belloni et al. (2014, 2017); Farrell (2015); Chernozhukov et al. (2018).
- Much less attention when the parameter itself is high dimensional.
- This work contributes to both literatures above: M-estimation + missing data + high dimensional setting and parameter. (It also has applications in heterogeneous treatment effects estimation in CI.)



HD M-Estimation: A Few (Class of) Applications

1. All standard high dimensional (HD) regression problems with: (a) missing outcomes and (b) potentially misspecified (working) models. E.g. squared loss: L(Y, X, θ) := (Y − X′θ)² ⇒ linear regression; logistic loss: L(Y, X, θ) := log{1 + exp(X′θ)} − Y(X′θ) ⇒ logistic regression (for binary Y); exponential loss ⇒ Poisson regression; and so on. Note: throughout, regardless of any motivating 'working model' being true or not, the definition of θ0 is completely 'model free'.

2. Series estimation problems (model free) with missing Y and HD basis functions (instead of X in Example 1 above), e.g. spline bases. Use the same choices of L(·) as in Example 1, with X replaced by any set of d (possibly HD) basis functions Ψ(X) := {ψj(X) : 1 ≤ j ≤ d}. E.g. polynomial bases: Ψ(X) := {1, xj^k : 1 ≤ j ≤ p, 1 ≤ k ≤ d0} (d0 = 1 ⇒ linear bases as in Example 1; d0 = 3 ⇒ cubic splines).
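A sketch of such a polynomial basis map (the helper `poly_basis` is a hypothetical name, not from the slides):

```python
import numpy as np

def poly_basis(X, d0):
    """Map covariates X (n x p) to Psi(X) = {1, x_j^k : j <= p, k <= d0}."""
    n, p = X.shape
    feats = [np.ones((n, 1))]          # the constant basis function '1'
    for k in range(1, d0 + 1):
        feats.append(X ** k)           # x_j^k for every covariate column j
    return np.hstack(feats)            # shape: n x (1 + p * d0)

X = np.arange(6.0).reshape(3, 2)       # tiny example design, p = 2
Psi = poly_basis(X, d0=3)              # cubic polynomial basis, d = 7
```

M-estimation then proceeds exactly as in Example 1, with Psi in place of X.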



Another Application: HD Single Index Models (SIMs)

- Signal recovery in high dimensional single index models (SIMs) with an elliptically symmetric design distribution (e.g. X Gaussian). Let Y = f(β0′X, ε) with f : R² → Y unknown (so β0 is identifiable only up to scalar multiples) and ε ⊥⊥ X (i.e., Y ⊥⊥ X | β0′X).
- Consider any of the regression problems introduced in Example 1. Let θ0 := argmin_{θ ∈ Rp} E{L(Y, X′θ)} for any loss function L(·, ·) : R² → R that is convex in its second argument. Then θ0 ∝ β0! A remarkable result due to Li and Duan (1989).
- A classic example of a misspecified parametric model defining θ0, where θ0 nevertheless relates directly to an actual (interpretable) semi-parametric model! The proportionality result also preserves any sparsity assumptions.
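The Li and Duan (1989) proportionality can be checked numerically; in this toy simulation (Gaussian design, an arbitrary tanh link, all values made up), the squared-loss minimizer aligns with β0 up to scale:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50000, 4
beta0 = np.array([1.0, -2.0, 0.0, 0.5])

X = rng.normal(size=(n, p))             # elliptically symmetric design
eps = rng.normal(size=n)
Y = np.tanh(X @ beta0) + 0.1 * eps      # SIM: Y = f(beta0'X, eps)

# Squared-loss minimizer (OLS) under the misspecified linear working model;
# Li-Duan: its population limit theta0 is proportional to beta0.
theta = np.linalg.solve(X.T @ X, X.T @ Y)

# Cosine similarity between theta and beta0 (should be ~1 up to sign/scale).
cos = theta @ beta0 / (np.linalg.norm(theta) * np.linalg.norm(beta0))
```

Note the zero coordinate of beta0 is preserved (in the limit) in theta, which is what makes sparsity-based estimation of β0 meaningful.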



Applications in Causal Inference (Treatment Effects Estimation)

Applications of all these problems in causal inference (estimation of treatment effects, with useful applications in precision medicine):

1. Linear heterogeneous treatment effects estimation: an application of the linear regression example (twice). Write {Y(0), Y(1)} linearly as: Y(j) = X′β(j) + ε(j), E{ε(j)X} = 0 ∀ j = 0, 1, so that Y(1) − Y(0) = X′β* + ε*, with β* := β(1) − β(0) and ε* := ε(1) − ε(0). β* denotes the (model free) linear projection of Y(1) − Y(0) | X; of interest in HD settings when E{Y(1) − Y(0) | X} is difficult to model (Chernozhukov et al., 2017; Chernozhukov and Semenova, 2017).

2. Average conditional treatment effects (ACTE) estimation via series estimators: an application of the series estimation example (twice).

3. Causal inference via SIMs (signal recovery, ACTE estimation and ATE estimation): an application of the SIM example (twice).
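As an oracle-data sketch (a toy simulation in which both potential outcomes are visible, which never happens in practice), β* is simply the difference of the two arm-wise least-squares projections:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20000, 3
X = rng.normal(size=(n, p))
beta_0arm = np.array([1.0, 0.0, -1.0])   # projection coefficients for Y(0)
beta_1arm = np.array([2.0, 0.5, -1.0])   # projection coefficients for Y(1)

Y0 = X @ beta_0arm + rng.normal(size=n)  # potential outcome Y(0)
Y1 = X @ beta_1arm + rng.normal(size=n)  # potential outcome Y(1)

# Arm-wise least-squares projections; beta* = beta(1) - beta(0).
ols = lambda A, b: np.linalg.solve(A.T @ A, A.T @ b)
beta_star = ols(X, Y1) - ols(X, Y0)
```

In the actual framework each arm's projection must instead be estimated from the incomplete data (applying the missing-response machinery once per arm, via Tj = 1(T = j)).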



Before Getting Started: A Few Facts and Considerations

Some notations: m(X) := E(Y |X) and φ(X, θ) := E{L(Y , X, θ)|X}. It is generally necessary to ‘account’ for the missingness in Y . The ‘complete case’ estimator of θ0 in general will be inconsistent! That estimator may be consistent only if: (1) ∇φ(X, θ0) = 0 a.s. for every X (for regression problems, this indicates the ‘correct model’ case), and/or (2) T ⊥ ⊥ (Y , X) (i.e. the MCAR case). Illustration of (1) for sq. loss: ∇φ(X, θ0) = E{X(Y − X′θ0)|X} = 0. Hence, E(Y |X) = X′θ0 (i.e. a ‘linear model’ holds for Y |X). With θ0 (and X) being high dimensional (compared to n), we need some further structural constraints on θ0 to estimate it using Dn. We assume that θ0 is s-sparse: θ00 := s and s ≤ min(n, d). Note: the sparsity requirement has attractive (and fairly intuitive) geometric justification for all the examples we have given here.

Abhishek Chakrabortty High-Dim. M-Estimation with Missing Responses: A Semi-Parametric Framework 16/50

slide-51
SLIDES 51–54

Estimation of θ0: Getting Identifiable Representation(s) of R(θ)

Under the MAR assumption, R(θ) := E{L(Y, X, θ)} ≡ E_X{φ(X, θ)} admits the following debiased and doubly robust (DDR) representation:

R(θ) = E_X{φ(X, θ)} + E[ {T/π(X)} {L(Y, X, θ) − φ(X, θ)} ].   (1)

This is a purely non-parametric identification based on the observable Z and the nuisance functions π(X) and φ(X, θ) (unknown but estimable).

The 2nd term is simply 0; it can be seen as a 'debiasing' term (of sorts). It plays a crucial role in analyzing the empirical version of (1), ensuring first order insensitivity to any estimation errors of π(·) and φ(·).

Double robustness (DR) aspect: replace {φ(X, θ), π(X)} by any {φ∗(X, θ), π∗(X)} and (1) continues to hold as long as one, but not necessarily both, of φ∗(·) = φ(·) or π∗(·) = π(·) holds.
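Why the 2nd term in (1) is 0: under MAR, T ⊥⊥ Y | X, so E(T | Y, X) = E(T | X) = π(X), and conditioning on X gives the one-line check

```latex
\mathbb{E}\!\left[\frac{T}{\pi(X)}\,\{L(Y,X,\theta)-\phi(X,\theta)\}\;\Big|\;X\right]
= \frac{\mathbb{E}(T\mid X)}{\pi(X)}\;\mathbb{E}\{L(Y,X,\theta)-\phi(X,\theta)\mid X\}
= 1\cdot\{\phi(X,\theta)-\phi(X,\theta)\} = 0.
```

Taking a further expectation over X shows the debiasing term has mean zero, which is exactly what makes (1) a valid representation of R(θ).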

slide-55
SLIDES 55–56

The DDR Estimator of θ0

Given any estimators {π̂(·), φ̂(·)} of the nuisance functions {π(·), φ(·)}, we define our L1-penalized DDR estimator θ̂_DDR of θ0 as:

θ̂_DDR ≡ θ̂_DDR(λn) := argmin_{θ ∈ R^d} { L̂_n^DDR(θ) + λn ‖θ‖1 }, where

L̂_n^DDR(θ) := (1/n) Σ_{i=1}^n [ φ̂(Xi, θ) + {Ti/π̂(Xi)} {L(Yi, Xi, θ) − φ̂(Xi, θ)} ],

λn ≥ 0 is the tuning parameter, and {π̂(·), φ̂(·)} are arbitrary except for satisfying two basic conditions regarding their construction: π̂(·) is obtained from the data Tn := {Ti, Xi}_{i=1}^n only, and {φ̂(Xi, θ)}_{i=1}^n are obtained in a 'cross-fitted' manner (via sample splitting).

Assume (temporarily) that {π̂(·), φ̂(·)} are both 'correct'. The DR properties (consistency) of θ̂_DDR under their misspecifications are discussed later.

slide-57
SLIDES 57–59

Simplifying Assumptions and User-Friendly Implementation Algorithm

For simplicity, assume that the gradient ∇L(Y, X, θ) of L(·) satisfies a 'separable form' as follows: for some h(X) ∈ R^d and g(X, θ) ∈ R,

∇L(Y, X, θ) = h(X){Y − g(X, θ)}, and hence ∇φ̂(X, θ) = h(X){m̂(X) − g(X, θ)},

where m̂(X) denotes the corresponding (cross-fitted) estimator of m(X). This simplifying assumption holds for all the examples given before. The assumed form ⇒ we only need to obtain m̂(Xi), not φ̂(Xi, θ).

Implementation algorithm. θ̂_DDR can be obtained simply as:

θ̂_DDR ≡ θ̂_DDR(λn) := argmin_{θ ∈ R^d} { (1/n) Σ_{i=1}^n L(Ỹi, Xi, θ) + λn ‖θ‖1 },

where Ỹi := m̂(Xi) + {Ti/π̂(Xi)}{Yi − m̂(Xi)}, ∀ i, is a 'pseudo' outcome. Can use 'glmnet' in R: pretend to have a 'full' data set {Ỹi, Xi}_{i=1}^n.
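The implementation algorithm above admits a short sketch. Below is a minimal, illustrative Python version (scikit-learn's Lasso standing in for R's glmnet): the nuisance estimates π̂ and m̂ are crude stand-ins (a constant propensity and a complete-case least-squares fit, with the cross-fitting of the slides omitted for brevity), the data are synthetic, and all names and tuning values are assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 500, 20
X = rng.normal(size=(n, d))
theta0 = np.zeros(d)
theta0[:3] = [2.0, -1.0, 0.5]                 # sparse truth
Y = X @ theta0 + rng.normal(size=n)
T = rng.binomial(1, 0.7, size=n)              # MCAR missingness, for illustration

# Crude nuisance stand-ins (any 'correct' cross-fitted estimators would do):
pi_hat = np.full(n, T.mean())                                      # propensity pi^(X_i)
m_hat = X @ np.linalg.lstsq(X[T == 1], Y[T == 1], rcond=None)[0]   # outcome model m^(X_i)

# Pseudo outcomes: Y~_i := m^(X_i) + {T_i/pi^(X_i)}{Y_i - m^(X_i)}.
# Missing responses (T_i = 0) never enter: their residual term is zeroed out.
Y_tilde = m_hat + (T / pi_hat) * np.where(T == 1, Y - m_hat, 0.0)

# L1-penalized fit on the 'full' pseudo data {Y~_i, X_i}. The penalty level is
# illustrative; note sklearn's alpha parameterizes (1/2n)||r||^2 + alpha*||theta||_1.
lam = 2 * np.sqrt(np.log(d) / n)
theta_ddr = Lasso(alpha=lam, fit_intercept=False).fit(X, Y_tilde).coef_
```

In practice π̂ and m̂ would come from the nuisance families discussed later, and λn would be chosen e.g. by cross-validation.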

slide-60
SLIDES 60–63

Properties of θ̂_DDR: Deterministic Deviation Bounds

Assume L(·) is convex and differentiable in θ, and that L̂_n^DDR(θ) satisfies the Restricted Strong Convexity (RSC) condition (Negahban et al., 2012) at θ = θ0. Then, for any choice of λn ≥ 2 ‖∇L̂_n^DDR(θ0)‖∞,

‖θ̂_DDR(λn) − θ0‖2 ≲ λn √s, and ‖θ̂_DDR(λn) − θ0‖1 ≲ λn s,

where s := ‖θ0‖0. This is a deterministic deviation bound: it holds for any choices of {π̂(·), m̂(·)} and for any realization of Dn.

The RSC (or 'cone') condition for L̂_n^DDR(θ) is exactly the same as the usual RSC condition required under fully observed data! The validity of the fully observed data RSC condition is well studied.

Key quantity of interest: the random lower bound ‖∇L̂_n^DDR(θ0)‖∞ for λn. We need probabilistic bounds on it to determine the convergence rate of θ̂_DDR.

slide-64
SLIDES 64–66

The Main Goal from Hereon: Probabilistic Bounds for ‖∇L̂_n^DDR(θ0)‖∞

Bounds on ‖∇L̂_n^DDR(θ0)‖∞ determine the rate of choice of λn and hence the convergence rate of θ̂_DDR (via the deviation bound).

Probabilistic bounds for ‖∇L̂_n^DDR(θ0)‖∞: the basic decomposition

‖∇L̂_n^DDR(θ0)‖∞ ≤ ‖T0,n‖∞ + ‖Tπ,n‖∞ + ‖Tm,n‖∞ + ‖Rπ,m,n‖∞,

where T0,n is the 'main' term (a centered iid average), Tπ,n is the 'π-error' term involving π̂(·) − π(·), Tm,n is the 'm-error' term involving m̂(·) − m(·), and Rπ,m,n is the '(π, m)-error' term (usually of lower order) involving the product of π̂(·) − π(·) and m̂(·) − m(·). We control each term separately. The analyses are all non-asymptotic and nuanced, especially in order to get sharp rates for Tπ,n and Tm,n.

We show: ‖∇L̂_n^DDR(θ0)‖∞ ≲ √((log d)/n) with high probability, and hence ‖θ̂_DDR − θ0‖2 ≲ √(s (log d)/n), so θ̂_DDR is rate optimal.

slide-67
SLIDES 67–71

Convergence Rates and Bounds for ‖∇L̂_n^DDR(θ0)‖∞ (and θ̂_DDR)

Basic (high level) consistency conditions on {π̂(·), m̂(·)}. Let {π̂(·), m̂(·)} be any general and 'correct' estimators of {π(·), m(·)}, and assume they satisfy the following pointwise convergence rates:

|π̂(x) − π(x)| ≲_P δn,π and |m̂(x) − m(x)| ≲_P ξn,m ∀ x ∈ X,   (2)

for some sequences δn,π, ξn,m ≥ 0 such that (δn,π + ξn,m) √log(nd) = o(1) and the product δn,π ξn,m (log n) = o(√((log d)/n)).

Under condition (2), along with some further 'suitable' tail assumptions (sub-Gaussian tails etc.), we have, with high probability:

‖T0,n‖∞ ≲ √((log d)/n),  ‖Tπ,n‖∞ ≲ √((log d)/n) δn,π √log(nd),
‖Tm,n‖∞ ≲ √((log d)/n) ξn,m √log(nd),  and  ‖Rπ,m,n‖∞ ≲ δn,π ξn,m (log n).

Hence, ‖∇L̂_n^DDR(θ0)‖∞ ≲ √((log d)/n) {1 + o(1)} with high probability.

slide-72
SLIDES 72–74

HD Inference for θ̂_DDR: Desparsification and Asymptotic Linear Expansion

Consider θ̂_DDR for the squared loss: L(Y, X, θ) := {Y − Ψ(X)′θ}², where Ψ(X) ∈ R^d denotes any HD vector of basis functions of X. Define Σ := E{Ψ(X)Ψ(X)′} and Ω := Σ⁻¹, and let Ω̂ be any reasonable estimator of Ω (assumed sparse if required). We then define the desparsified DDR estimator θ̃_DDR as follows:

θ̃_DDR := θ̂_DDR + Ω̂ (1/n) Σ_{i=1}^n {Ỹi − Ψ(Xi)′θ̂_DDR} Ψ(Xi)  [the desparsification/debiasing term],

where Ỹi := m̂(Xi) + {Ti/π̂(Xi)}{Yi − m̂(Xi)} are the pseudo outcomes.

The debiasing is similar (in spirit) to van de Geer et al. (2014), except it is the 'right' one for this problem (using pseudo outcomes in the full data).

slide-75
SLIDES 75–77

The Desparsified DDR Estimator: Asymptotic Linear Expansion

Assume: the basic convergence conditions (2) for {π̂(·), m̂(·)}, that ΩX is sub-Gaussian, and that ‖Ω̂ − Ω‖1 = O_P(an) and ‖I − Ω̂Σ̂‖max = O_P(bn), with an √(log d) = o(1) and bn s √(log d) = o(1), where s := ‖θ0‖0. Then θ̃_DDR satisfies the asymptotic linear expansion (ALE):

(θ̃_DDR − θ0) = (1/n) Σ_{i=1}^n Ω ψ0(Zi) + ∆n, where ‖∆n‖∞ = o_P(n^{−1/2}),

and ψ0(Z) := [{m(X) − Ψ(X)′θ0} + {T/π(X)}{Y − m(X)}] Ψ(X), with E{ψ0(Z)} = 0. The ALE facilitates inference (e.g. confidence intervals etc.) for any low-dimensional component of θ0 via Gaussian approximation.

Further, the ALE is also 'optimal': the function Ωψ0(Z) =: Ψeff(Z) is the 'efficient' influence function for θ0 (Robins et al., 1994). Thus, in classical settings, θ̃_DDR achieves the semi-parametric efficiency bound.

slide-78
SLIDES 78–79

The Desparsified Estimator: Asymptotic Normality and Some Final Remarks

Coordinate-wise asymptotic normality of θ̃_DDR: ∀ 1 ≤ j ≤ d,

√n (θ̃_DDR − θ0)_j →d N(0, σ²_{0,j}), where σ²_{0,j} := Var{Ω′_{j·} ψ0(Z)}.

Further, max_{1≤j≤d} |σ̂_{0,j} − σ_{0,j}| = o_P(1), where σ̂_{0,j} is the plug-in estimator obtained by plugging Ω̂, π̂(·) and m̂(·) into Var{Ω′_{j·} ψ0(Z)}.

One can choose Ω̂ to be any standard (sparse) precision matrix estimator, e.g. the node-wise Lasso estimator. Here, an = sΩ √((log d)/n) and bn = √((log d)/n) under suitable conditions, with sΩ := max_{1≤j≤d} ‖Ω_{j·}‖0.

The error ∆n can be decomposed as ∆n = ∆n,1 + ∆n,2 + ∆n,3, where ∆n,1 := (1/n)(Ω̂ − Ω) Σ_{i=1}^n ψ0(Zi), ∆n,2 := (I_d − Ω̂Σ̂)(θ̂_DDR − θ0) and ∆n,3 := Ω̂(Tπ,n + Tm,n + Rπ,m,n), with ‖∆n,3‖∞ ≲_P n^{−1/2}, ‖∆n,1‖∞ ≲ an √((log d)/n) and ‖∆n,2‖∞ ≲ bn s √((log d)/n).
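The desparsification step and a plug-in confidence interval can be sketched end-to-end. A minimal, illustrative Python version under the squared loss with Ψ(X) = X: here Ω̂ is simply the inverse sample covariance (adequate since d < n in this toy setup; a node-wise Lasso would be used in truly high dimensions), the nuisance estimates are crude stand-ins, and all names and tuning values are assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, d = 500, 20
X = rng.normal(size=(n, d))                       # here Psi(X) = X for simplicity
theta0 = np.zeros(d)
theta0[:3] = [2.0, -1.0, 0.5]
Y = X @ theta0 + rng.normal(size=n)
T = rng.binomial(1, 0.7, size=n)                  # MCAR missingness, for illustration
pi_hat = np.full(n, T.mean())
m_hat = X @ np.linalg.lstsq(X[T == 1], Y[T == 1], rcond=None)[0]
Y_tilde = m_hat + (T / pi_hat) * np.where(T == 1, Y - m_hat, 0.0)
theta_ddr = Lasso(alpha=0.16, fit_intercept=False).fit(X, Y_tilde).coef_

# Desparsified estimator: theta~ := theta^ + Omega^ (1/n) sum_i {Y~_i - X_i'theta^} X_i.
Sigma_hat = X.T @ X / n
Omega_hat = np.linalg.inv(Sigma_hat)              # fine for d < n; node-wise Lasso otherwise
resid = Y_tilde - X @ theta_ddr
theta_tilde = theta_ddr + Omega_hat @ (X.T @ resid) / n

# Plug-in variance and 95% CI for coordinate j, using the estimated influence
# values Omega^_{j.} psi^_i with psi^_i := (Y~_i - X_i'theta^) X_i:
j = 0
u = (X * resid[:, None]) @ Omega_hat[j]
sigma_j = np.sqrt(np.mean(u ** 2))
ci = (theta_tilde[j] - 1.96 * sigma_j / np.sqrt(n),
      theta_tilde[j] + 1.96 * sigma_j / np.sqrt(n))
```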

slide-80
SLIDES 80–84

The DR Aspect: General Convergence Rates (under Misspecification)

Finally, let {π̂(·), m̂(·)} → {π∗(·), m∗(·)}, with either π∗(·) = π(·) or m∗(·) = m(·) but not necessarily both. Assume the same pointwise convergence conditions and rates (δn,π, ξn,m) for {π̂(·), m̂(·)} as in (2), but now with {π(·), m(·)} therein replaced by {π∗(·), m∗(·)}. Under some 'suitable' assumptions, we have, with high probability:

‖T0,n‖∞ + ‖Tπ,n‖∞ + ‖Tm,n‖∞ ≲ √((log d)/n) {1 + 1[(π∗, m∗) ≠ (π, m)]}, and

‖Rπ,m,n‖∞ ≲ {δn,π 1(m∗ ≠ m) + ξn,m 1(π∗ ≠ π) + δn,π ξn,m} (log n).

The 2nd and/or 3rd terms now also contribute to the rate √((log d)/n). The 4th term is o(1) but no longer ignorable (and may be slower).

Regardless, this establishes general convergence rates and the DR property of θ̂_DDR under possible misspecification of {π̂(·), m̂(·)}. For the 4th term, sharper rates need a case-by-case analysis.

slide-85
SLIDES 85–87

Choices of the Nuisance Component Estimators π̂(·) and m̂(·)

Note: our theory holds generally for any choices of π̂(·) and m̂(·) under mild conditions (provided they are both 'correct' estimators). Under misspecifications, consistency and general (non-sharp) rates are also established; sharp rates need case-by-case analyses. Even for the mean (or ATE) estimation problem, this can be quite tricky in HD settings; see Smucler et al. (2019) for a detailed analysis.

Below we provide some choices of π̂(·) and m̂(·) that may be used to implement our theory and methods for θ̂_DDR. In general, one can use any reasonable method (including black-box ML methods).

Choices of π̂(·) and m̂(·): we consider estimators from two families: parametric and 'extended' parametric families (series estimators), and semi-parametric single index families.

slide-88
SLIDES 88–92

Choices of π̂(·): 'Extended' Parametric Families (Series Estimators)

If π(·) is known, we set π̂(·) := π(·). Otherwise, we estimate π(·) via two (classes of) choices of π̂(·) (each assumed to be 'correct').

'Extended' parametric family: π(x) = g{α′Ψ(x)}, where g(·) ∈ [0, 1] is a known function [e.g. gexpit(u) := exp(u)/{1 + exp(u)}], Ψ(X) := {ψk(X)}_{k=1}^K is any set of K basis functions (possibly with K ≫ n), and α ∈ R^K is an unknown (sparse) parameter vector.

Example: Ψ(X) may correspond to the polynomial bases of X up to any fixed degree k. Note: the special case of linear bases (k = 1) includes all standard parametric regression models. Further, the case of π(·) = constant (but unknown), i.e. MCAR, is also included.

Estimator: we set π̂(x) = g{α̂′Ψ(x)}, where α̂ denotes any suitable (possibly penalized) estimator of α based on Tn := {Ti, Xi}_{i=1}^n.

Example of α̂: when g(·) = gexpit(·), α̂ may be obtained from a standard L1-penalized logistic regression of {Ti vs. Ψ(Xi)}_{i=1}^n.
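A minimal sketch of this last example, assuming g = gexpit and degree-2 polynomial bases: scikit-learn's L1-penalized LogisticRegression stands in for 'any suitable penalized estimator' of α, and the synthetic data, basis choice and penalty level are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
n, p = 400, 5
X = rng.normal(size=(n, p))

# True propensity pi(x) = gexpit{alpha' Psi(x)}, sparse alpha on degree-2 bases:
lin = 0.8 * X[:, 0] - 0.5 * X[:, 1] ** 2
pi_true = 1.0 / (1.0 + np.exp(-lin))
T = rng.binomial(1, pi_true)

# Psi(X): polynomial basis functions of X up to degree k = 2.
Psi = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# alpha^ via L1-penalized logistic regression of {T_i vs. Psi(X_i)};
# pi^(x) := gexpit{alpha^' Psi(x)} is read off via predict_proba.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(Psi, T)
pi_hat = clf.predict_proba(Psi)[:, 1]
```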

slide-93
SLIDE 93

Choices of π̂(·): Semi-Parametric Single Index Families

Semi-parametric single index family: π(X) = g(α′X), where g(·) ∈ (0, 1) is unknown and α ∈ R^p is a (sparse) unknown parameter (identifiable only up to scalar multiples, hence set ‖α‖2 = 1 wlog).

Given an estimator α̂ of α, we estimate π(X) ≡ E(T | α′X) as:

π̂(x) ≡ π̂(α̂, x) := [(1/nh) Σi Ti K{α̂′(Xi − x)/h}] / [(1/nh) Σi K{α̂′(Xi − x)/h}],

where K(·) denotes any standard (2nd order) kernel function and h = h_n > 0 denotes the bandwidth sequence with h = o(1).

Obtaining α̂: in general, any approach (if available) from the (high dimensional) single index model literature can be used. But if X is elliptically symmetric, then α̂ may be obtained as simply as a standard L1-penalized logistic regression of {Ti vs. Xi}, i = 1, …, n.
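The display above is just a Nadaraya–Watson regression of T on the scalar index α̂′X. A minimal numpy sketch, assuming a Gaussian (2nd order) kernel and, purely for illustration, using the true sparse index as a stand-in for α̂:

```python
import numpy as np

def nw_propensity(alpha_hat, X, T, x, h):
    """Nadaraya-Watson estimate of pi(x) = E(T | index) at a point x,
    smoothing T over the estimated single index alpha_hat' X."""
    u = (X @ alpha_hat - x @ alpha_hat) / h   # scaled index distances
    K = np.exp(-0.5 * u ** 2)                 # Gaussian (2nd order) kernel
    return (T * K).sum() / K.sum()

rng = np.random.default_rng(1)
n, p = 2000, 5
alpha = np.zeros(p); alpha[0] = 1.0           # ||alpha||_2 = 1, sparse index
X = rng.standard_normal((n, p))
pi_true = 1.0 / (1.0 + np.exp(-(X @ alpha)))  # an unknown-to-us monotone link
T = rng.binomial(1, pi_true)

x0 = np.zeros(p)                              # at x = 0 the truth is pi(x0) = 0.5
h = n ** (-1 / 5)                             # a standard bandwidth order
p0 = nw_propensity(alpha, X, T, x0, h)        # close to 0.5
```

In practice α̂ would come from a penalized single-index fit (or, under elliptical symmetry, an L1-penalized logistic regression), and h would be tuned rather than fixed at this rate.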

slide-96
SLIDE 96

Choices of m̂(·): ‘Extended’ Parametric Families (Series Estimators)

‘Extended’ parametric family: m(x) = g{γ′Ψ(x)}, where g(·) is a known ‘link’ function [e.g. ‘canonical’ links: identity, expit or exp], Ψ(X) := {ψ1(X), …, ψK(X)} is any set of K basis functions (with K ≫ n possibly), and γ ∈ R^K is an unknown (sparse) parameter vector.

Example: Ψ(X) may correspond to the polynomial bases of X up to any fixed degree k. Note: the special case of linear bases (k = 1) includes all standard parametric regression models.

Estimator: we set m̂(X) = g{γ̂′Ψ(X)}, where γ̂ denotes any suitable estimator (possibly penalized) of γ based on the data subset of ‘complete cases’: Dn^(c) := {(Yi, Xi) : Ti = 1, 1 ≤ i ≤ n}.

Example of γ̂: when g(·) is any ‘canonical’ link function, γ̂ may simply be obtained from the usual L1-penalized ‘canonical’ link based regression (e.g. linear, logistic or poisson) of {Yi vs. Ψ(Xi) : Ti = 1} from the ‘complete case’ data Dn^(c).
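For the identity link, the complete-case fit of γ̂ above is an L1-penalized least-squares problem on the Ti = 1 rows. A compact numpy sketch via cyclic coordinate descent (the penalty level and the MCAR labeling below are illustrative assumptions, not the talk's settings):

```python
import numpy as np

def lasso_cd(Z, y, lam, n_iter=200):
    """L1-penalized least squares via cyclic coordinate descent:
    minimizes (1/2n)||y - Z g||^2 + lam * ||g||_1."""
    n, K = Z.shape
    g = np.zeros(K)
    col_norm = (Z ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for k in range(K):
            r_k = y - Z @ g + Z[:, k] * g[k]       # residual with k-th term added back
            rho = Z[:, k] @ r_k / n
            g[k] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_norm[k]
    return g

rng = np.random.default_rng(2)
n, p = 1000, 50
X = rng.standard_normal((n, p))
gamma = np.zeros(p); gamma[:3] = [1.0, -1.0, 0.5]  # sparse truth
Y = X @ gamma + rng.standard_normal(n)
T = rng.binomial(1, 0.6, n)                        # MCAR labels, for illustration

cc = T == 1                                        # complete cases D_n^(c)
gamma_hat = lasso_cd(X[cc], Y[cc], lam=0.05)
m_hat = X @ gamma_hat                              # fitted m(X) = gamma_hat' X
```

For a non-identity canonical link, the squared-error objective would be replaced by the corresponding penalized GLM deviance, fit on the same complete-case rows.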

slide-100
SLIDE 100

Choices of m̂(·): Semi-Parametric Single Index Families

Semi-parametric single index family: m(X) = g(γ′X), where g(·) is an unknown ‘link’ and γ ∈ R^p is a (sparse) unknown parameter (identifiable only up to scalar multiples, hence set ‖γ‖2 = 1 wlog).

Given an estimator γ̂ of γ, we estimate m(X) ≡ E(Y | γ′X, T = 1) as:

m̂(x) ≡ m̂(γ̂, x) := [(1/nh) Σi Ti Yi K{γ̂′(Xi − x)/h}] / [(1/nh) Σi Ti K{γ̂′(Xi − x)/h}],

where K(·) denotes any standard (2nd order) kernel function, and h = h_n > 0 denotes the bandwidth sequence with h = o(1).

Obtaining γ̂: in general, any approach (if available) from the HD SIM literature can be used on the complete case data subset Dn^(c). If X is elliptically symmetric and Y = f(γ′X; ε) with f unknown and ε ⊥⊥ (T, X), then γ̂ may be obtained as the L1-penalized IPW estimator θ̂_IPW for any ‘canonical’ link based regression problem.

slide-103
SLIDE 103

Convergence Rates Regarding the Choices of π̂(·)

For either choice of π̂(·), assume that the ingredient estimator α̂ satisfies: ‖α̂ − α‖1 ≲P a_n for some a_n = o(1). Then, under suitable smoothness and tail assumptions, with high probability (w.h.p.),

|π̂(x) − π(x)| ≲ a_n = o(1), for any fixed x ∈ X (for method 1).

For method 2 (SIM), assume that h = o(1), log(np)/(nh) = o(1) and (a_n/h)√log p = o(1). Then, under suitable smoothness and tail assumptions, we have: w.h.p., for any fixed x ∈ X,

|π̂(x) − π(x)| ≲ h² + 1/√(nh) + a_n + √{log(np)/(nh)} + a_n²/h² = o(1).

Usually, we expect the L1 error rate of α̂ to be a_n = s_α √{(log d*)/n}, where s_α := ‖α‖0 and d* = K or p (depending on the method).

slide-107
SLIDE 107

Convergence Rates Regarding the Choices of m̂(·)

For either choice of m̂(·), assume that the ingredient estimator γ̂ satisfies: ‖γ̂ − γ‖1 ≲P b_n for some b_n = o(1). Then, under suitable smoothness and tail assumptions, we have: with high probability,

|m̂(x) − m(x)| ≲ b_n = o(1), for any fixed x ∈ X (for method 1).

For method 2 (SIM), assume that h = o(1), log(np)/(nh) = o(1) and (b_n/h)√log p = o(1). Then, under suitable smoothness and tail assumptions, we have: w.h.p., for any fixed x ∈ X,

|m̂(x) − m(x)| ≲ h² + 1/√(nh) + b_n + √{log(np)/(nh)} + b_n²/h² = o(1).

We typically expect the L1 error rate of γ̂ to be b_n = s_γ √{(log d*)/n}, where s_γ := ‖γ‖0 and d* = K or p (depending on the method).

slide-111
SLIDE 111

Simulation Studies: The Setup

Basic parameters: n = 1000, p = 50 or 500 and X ∼ N(0, Σp). Three data generating processes (DGPs) for Y | X and T | X, as follows:

1. “Linear-Linear” DGP: Y = γ0 + γ′X + ε, ε | X ∼ N(0, 1); logit{π(X)} ≡ logit{E(T | X)} = α0 + α′X.

2. “Quad-Quad” DGP: Y = γ0 + γ′X + Σ_{j=1}^p γ*_j X_j² + ε, ε | X ∼ N(0, 1); logit{π(X)} ≡ logit{E(T | X)} = α0 + α′X + Σ_{j=1}^p α*_j X_j².

3. “SIM-SIM” DGP: Y = γ0 + γ′X + cY (γ′X)² + ε, ε | X ∼ N(0, 1); logit{π(X)} ≡ logit{E(T | X)} = α0 + α′X + cT (α′X)².
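The three DGPs can be simulated directly. A numpy sketch at p = 50, using the coefficient values stated on the parameter-choice slide (the function name and seed are just for illustration):

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def simulate(dgp, n=1000, p=50, seed=0):
    """Draw (Y, T, X) under the talk's three DGPs with the stated p = 50 parameters."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                     # Sigma_p = I_p
    a = np.zeros(p); a[:5] = np.array([1, -1, 0.5, -0.5, 0.5]) / np.sqrt(5)
    g = np.zeros(p); g[:10] = [1, 1, 1, -1, -1, 0.5, 0.5, -0.5, -0.5, -0.5]
    a_q = np.zeros(p); a_q[:2] = [0.25, -0.25]          # alpha* (quadratic part)
    g_q = np.zeros(p); g_q[:5] = [1, -1, 0.5, 0.5, -0.5]  # gamma* (quadratic part)
    g0, a0, cY, cT = 1.0, 0.5, 0.3, 0.2
    eps = rng.standard_normal(n)
    if dgp == "linear":                                 # Linear-Linear
        Y = g0 + X @ g + eps
        pi = expit(a0 + X @ a)
    elif dgp == "quad":                                 # Quad-Quad
        Y = g0 + X @ g + (X ** 2) @ g_q + eps
        pi = expit(a0 + X @ a + (X ** 2) @ a_q)
    else:                                               # SIM-SIM
        Y = g0 + X @ g + cY * (X @ g) ** 2 + eps
        pi = expit(a0 + X @ a + cT * (X @ a) ** 2)
    T = rng.binomial(1, pi)
    return Y, T, X

Y, T, X = simulate("quad")
```

Under these settings roughly half to two-thirds of the responses are labeled (T = 1), so the complete-case subset is substantial but far from the full sample.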

slide-115
SLIDE 115

Simulation Settings: Choice of Parameters

Choices of the parameters:

1. Covariance matrix Σp (for today): Σp = Ip (identity matrix).

2. We set cT = 0.2, cY = 0.3 and γ0 = 1, α0 = 0.5.

3. When p = 50, α = (1/√5)(1, −1, 0.5, −0.5, 0.5, 0, …, 0)′ with ‖α‖0 = 5, γ = (1, 1, 1, −1, −1, 0.5, 0.5, −0.5, −0.5, −0.5, 0, …, 0)′ with ‖γ‖0 = 10, α* = (0.25, −0.25, 0, …, 0)′ and γ* = (1, −1, 0.5, 0.5, −0.5, 0, …, 0)′.

4. When p = 500, ‖α‖0 = 10 and α consists of three 1s, two −1s, two 0.5s and three −0.5s, normalized by 1/√10, while ‖γ‖0 = 15 and γ consists of three 1s, two −1s, five 0.5s, five −0.5s, two 0.25s and three −0.25s. Further, we set α* = (0.25, 0.25, −0.25, −0.25, 0, …, 0)′ and γ* = (1, −1, 0.5, 0.5, −0.5, 0, …, 0)′.

K = 2 fold cross-fitting used; all simulation settings replicated 500 times. Ω̂ obtained as Σ̂−1 for p = 50 and using the nodewise Lasso for p = 500.

slide-118
SLIDE 118

Simulation Settings: Estimators Implemented

Obtain the DDR estimator θ̂_DDR for linear regression: θ0 = Σ−1 E(XY).

Two choices of the working nuisance models for π(X) to obtain π̂(X):
1. Linear: L1-penalized logistic-linear regression.
2. Quad: L1-penalized logistic-linear regression with quadratic terms.

Three choices of the working nuisance models for m(X) to obtain m̂(X):
1. Linear: L1-penalized linear regression.
2. Quad: L1-penalized linear regression with quadratic terms.
3. SIM: Single index model (with index parameter estimated via IPW Lasso).

Estimators used for comparison:
1. θ̂_orac (Oracle): obtained assuming both π(·) and m(·) are known.
2. θ̂_full (Super oracle): obtained assuming the full dataset is observed.

Criteria: L2 errors for estimation and coverage probability for inference.
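These slides do not write out the DDR construction itself. For θ0 = Σ−1E(XY), one natural doubly robust version replaces each Yi by the classical AIPW pseudo-outcome m̂(Xi) + Ti{Yi − m̂(Xi)}/π̂(Xi) and then solves the least-squares normal equations. The sketch below is an assumption about that form (it omits the debiasing/penalization that the DDR estimator would add in high dimensions) and uses oracle nuisances purely for illustration:

```python
import numpy as np

def ddr_linear(X, Y, T, pi_hat, m_hat):
    """Doubly robust estimate of theta_0 = Sigma^{-1} E(X Y) with missing Y:
    AIPW pseudo-outcome Y* = m_hat + T (Y - m_hat) / pi_hat, then least squares.
    (A sketch of the pseudo-outcome idea only, not the talk's exact estimator.)"""
    Y_star = m_hat + T * (np.nan_to_num(Y) - m_hat) / pi_hat  # NaNs only where T = 0
    n = len(Y_star)
    Sigma_hat = X.T @ X / n
    return np.linalg.solve(Sigma_hat, X.T @ Y_star / n)

rng = np.random.default_rng(3)
n, p = 5000, 10
X = rng.standard_normal((n, p))
theta = np.zeros(p); theta[0] = 1.0
Y_full = X @ theta + rng.standard_normal(n)
pi = 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0])))   # MAR labeling mechanism
T = rng.binomial(1, pi)
Y = np.where(T == 1, Y_full, np.nan)          # responses missing when T = 0

m_hat = X @ theta                             # oracle outcome model, for illustration
theta_hat = ddr_linear(X, Y, T, pi, m_hat)    # close to theta
```

The double robustness shows up in the pseudo-outcome: its conditional mean equals E(Y | X) if either π̂ or m̂ is correct, which is what the large-sample tables on the following slides probe.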

slide-122
SLIDE 122

Simulation Results: L2 Error Comparison (p = 50) - I

p = 50, DGP: Linear-Linear.


slide-123
SLIDE 123

Simulation Results: L2 Error Comparison (p = 50) - II

p = 50, DGP: Quad-Quad.


slide-124
SLIDE 124

Simulation Results: L2 Error Comparison (p = 50) - III

p = 50, DGP: SIM-SIM.


slide-125
SLIDE 125

Simulation Results: L2 Error Comparison (p = 500) - I

p = 500, DGP: Linear-Linear.


slide-126
SLIDE 126

Simulation Results: L2 Error Comparison (p = 500) - II

p = 500, DGP: Quad-Quad.


slide-127
SLIDE 127

Simulation Results: L2 Error Comparison (p = 500) - III

p = 500, DGP: SIM-SIM.


slide-128
SLIDE 128

Simulation Results: Coverage Probabilities for HD Inference - I

Coverage probability (covg. prob.) of the DDR estimator. DGP: Linear-Linear.

1. When p = 50:

Average Covg. Prob. (zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.94 (0.01)   0.94 (0.01)   0.95 (0.01)
π̂: quad      0.94 (0.01)   0.95 (0.01)   0.95 (0.01)

Average Covg. Prob. (non-zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.94 (0.01)   0.94 (0.01)   0.93 (0.01)
π̂: quad      0.94 (0.01)   0.94 (0.01)   0.94 (0.01)

2. When p = 500:

Average Covg. Prob. (zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.94 (0.01)   0.94 (0.01)   0.94 (0.01)
π̂: quad      0.94 (0.01)   0.94 (0.01)   0.94 (0.01)

Average Covg. Prob. (non-zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.92 (0.01)   0.91 (0.02)   0.92 (0.01)
π̂: quad      0.91 (0.02)   0.91 (0.02)   0.92 (0.01)

slide-130
SLIDE 130

Simulation Results: Coverage Probabilities for HD Inference - II

Coverage probability (covg. prob.) of the DDR estimator. DGP: Quad-Quad.

1. When p = 50:

Average Covg. Prob. (zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.94 (0.01)   0.94 (0.01)   0.95 (0.01)
π̂: quad      0.95 (0.01)   0.94 (0.01)   0.95 (0.01)

Average Covg. Prob. (non-zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.88 (0.16)   0.94 (0.01)   0.88 (0.16)
π̂: quad      0.89 (0.12)   0.94 (0.01)   0.89 (0.12)

2. When p = 500:

Average Covg. Prob. (zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.95 (0.01)   0.94 (0.01)   0.95 (0.01)
π̂: quad      0.95 (0.01)   0.94 (0.01)   0.95 (0.01)

Average Covg. Prob. (non-zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.91 (0.03)   0.92 (0.01)   0.91 (0.05)
π̂: quad      0.91 (0.03)   0.92 (0.01)   0.91 (0.04)

slide-132
SLIDE 132

Simulation Results: Coverage Probabilities for HD Inference - III

Coverage probability (covg. prob.) of the DDR estimator. DGP: SIM-SIM.

1. When p = 50:

Average Covg. Prob. (zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.94 (0.01)   0.95 (0.01)   0.95 (0.01)
π̂: quad      0.94 (0.01)   0.95 (0.01)   0.95 (0.01)

Average Covg. Prob. (non-zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.94 (0.01)   0.94 (0.01)   0.94 (0.01)
π̂: quad      0.94 (0.01)   0.94 (0.01)   0.94 (0.01)

2. When p = 500:

Average Covg. Prob. (zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.94 (0.01)   0.95 (0.01)   0.95 (0.01)
π̂: quad      0.94 (0.01)   0.95 (0.01)   0.95 (0.01)

Average Covg. Prob. (non-zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.87 (0.05)   0.88 (0.04)   0.93 (0.02)
π̂: quad      0.87 (0.05)   0.87 (0.05)   0.93 (0.02)

slide-134
SLIDE 134

Investigating Double Robustness via Large Sample Results (p = 50)

Consider n = 50000 and p = 50. In addition, also consider the complete case estimator θ̂_cc (obtained by using only the data with Ti = 1). DGP: Quad-Quad (p = 50).

L2 Error Comparison:

model                   θ̂_DDR           θ̂_orac          θ̂_full         θ̂_cc
m̂: linear, π̂: logit   0.460 (0.026)   0.072 (0.011)   0.069 (0.01)   0.528 (0.021)
m̂: linear, π̂: quad    0.204 (0.137)   0.072 (0.011)   0.069 (0.01)   0.528 (0.021)
m̂: quad, π̂: logit     0.071 (0.010)   0.072 (0.011)   0.069 (0.01)   0.528 (0.021)
m̂: quad, π̂: quad      0.072 (0.011)   0.072 (0.011)   0.069 (0.01)   0.528 (0.021)
m̂: SIM, π̂: logit      0.323 (0.019)   0.072 (0.011)   0.069 (0.01)   0.528 (0.021)
m̂: SIM, π̂: quad       0.172 (0.078)   0.072 (0.011)   0.069 (0.01)   0.528 (0.021)

Inference:

Average Covg. Prob. (zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.94 (0.03)   0.94 (0.03)   0.94 (0.03)
π̂: quad      0.96 (0.02)   0.94 (0.03)   0.95 (0.02)

Average Covg. Prob. (non-zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.68 (0.39)   0.93 (0.03)   0.80 (0.19)
π̂: quad      0.96 (0.02)   0.94 (0.02)   0.95 (0.02)

slide-136
SLIDE 136

Investigating Double Robustness via Large Sample Results (p = 500)

Consider n = 50000 and p = 500. In addition, also consider the complete case estimator θ̂_cc (obtained by using only the data with Ti = 1). DGP: Quad-Quad (p = 500).

L2 Error Comparison:

model                   θ̂_DDR           θ̂_orac          θ̂_full          θ̂_cc
m̂: linear, π̂: logit   0.297 (0.017)   0.178 (0.009)   0.173 (0.007)   0.325 (0.018)
m̂: linear, π̂: quad    0.282 (0.113)   0.178 (0.009)   0.173 (0.007)   0.325 (0.018)
m̂: quad, π̂: logit     0.177 (0.008)   0.178 (0.009)   0.173 (0.007)   0.325 (0.018)
m̂: quad, π̂: quad      0.180 (0.010)   0.178 (0.009)   0.173 (0.007)   0.325 (0.018)
m̂: SIM, π̂: logit      0.407 (0.022)   0.178 (0.009)   0.173 (0.007)   0.325 (0.018)
m̂: SIM, π̂: quad       0.294 (0.045)   0.178 (0.009)   0.173 (0.007)   0.325 (0.018)

Inference:

Average Covg. Prob. (zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.95 (0.02)   0.95 (0.02)   0.95 (0.02)
π̂: quad      0.95 (0.02)   0.95 (0.02)   0.95 (0.02)

Average Covg. Prob. (non-zero coeffs.):
              m̂: linear     m̂: quad       m̂: SIM
π̂: logit     0.78 (0.32)   0.94 (0.02)   0.75 (0.38)
π̂: quad      0.94 (0.04)   0.94 (0.02)   0.88 (0.12)

slide-138
SLIDE 138

References I

Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.

Belloni, A., Chernozhukov, V., Fernández-Val, I., and Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1):233–298.

Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.

Chernozhukov, V., Demirer, M., Duflo, E., and Fernandez-Val, I. (2017). Generic machine learning inference on heterogenous treatment effects in randomized experiments. ArXiv preprint arXiv:1712.04802.

Chernozhukov, V. and Semenova, V. (2017). Simultaneous inference for best linear predictor of the conditional average treatment effect and other structural functions. ArXiv preprint arXiv:1702.06240v2.

Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23.

Li, K.-C. and Duan, N. (1989). Regression analysis under link violation. The Annals of Statistics, 17(3):1009–1052.

slide-139
SLIDE 139

References II

Loh, P.-L. and Wainwright, M. J. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637.

Loh, P.-L. and Wainwright, M. J. (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559–616.

Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557.

Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866.

Smucler, E., Rotnitzky, A., and Robins, J. M. (2019). A unifying approach for doubly-robust l1 regularized estimation of causal contrasts. ArXiv preprint arXiv:1904.03737v1.

Tsiatis, A. (2007). Semiparametric Theory and Missing Data. Springer Science & Business Media.

van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202.

Van der Vaart, A. W. (2000). Asymptotic Statistics, volume 3. Cambridge University Press.

slide-140
SLIDE 140

Thank You!
