Treatment choice with many covariate values, Aleksey Tetenov

Treatment choice with many covariate values Aleksey Tetenov (University of Bristol) Cemmap masterclass Statistical decision theory for treatment choice and prediction May 30-31, 2017

Stoye (2009), Proposition 4: If the covariate X is continuously distributed, minimax regret is constant and does not decrease with sample size.
This result changes if
1. Stronger assumptions are imposed on how the average treatment response \(E[Y_t \mid X]\) varies with X (Stoye, 2012)
2. The set of feasible treatment rules is restricted
Source: Kitagawa and Tetenov (2017), “Who Should be Treated? Empirical Welfare Maximization Methods for Treatment Choice”, Cemmap working paper CWP24/17

Regret is evaluated relative to the best implementable treatment rule.
Stoye (2009, Proposition 4) assumes that any treatment allocation is feasible, including arbitrarily complex treatment rules. This is an unreasonable benchmark for public policies.
Constraints frequently restrict the complexity and other characteristics of feasible treatment rules.
◮ Treatment rules are often publicly communicated to individuals and need to be understandable and transparent
◮ Monotonicity of treatment rules in some covariates is desirable (e.g., cannot treat the rich but not the poor)
◮ Some treatments may be capacity-constrained
◮ Other aggregate constraints (e.g., aggregate proportion treated cannot vary with race)

Setup
A Randomized Controlled Trial (RCT) sample
◮ \(X_i \in \mathcal{X}\) - pre-treatment observed covariates
◮ \(D_i \in \{0, 1\}\) - randomized treatment
◮ \(Y_i \in \mathbb{R}\) - treatment outcome
◮ \(Y_{0,i}, Y_{1,i}\) - potential outcomes
◮ \(e(x) \in [\kappa, 1 - \kappa]\) - the probability of being randomized to treatment 1 in the experiment
We consider a restricted set of treatment rules \(\mathcal{G}\). Each \(G \in \mathcal{G}\) specifies which subset of the population will be treated (after analyzing the experimental data)
◮ \(X \in G\) will be assigned to treatment 1
◮ \(X \notin G\) will be assigned to treatment 0 (excludes randomized/fractional treatment rules)
\(\hat{G} \in \mathcal{G}\) - treatment rule as a function of the sample

Empirical Welfare Maximization
◮ Estimate the policy directly by maximizing empirical welfare: the EWM treatment rule is
\[ \hat{G}_{EWM} \equiv \arg\max_{G \in \mathcal{G}} W_n(G) \]
◮ The sample analogue
\[ W_n(G) \equiv \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{Y_i D_i}{e(X_i)} \cdot 1\{X_i \in G\} + \frac{Y_i (1 - D_i)}{1 - e(X_i)} \cdot 1\{X_i \notin G\} \right] \]
consistently estimates the population welfare of policy G,
\[ W(G) = E\left[ Y_1 \cdot 1\{X \in G\} + Y_0 \cdot 1\{X \notin G\} \right]. \]
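The sample-analogue welfare W_n(G) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `empirical_welfare` and `ewm_over_candidates`, the simulated data, and the three candidate rules are hypothetical and do not come from the slides. The search runs over a small finite set of rules, whereas the class of rules in the slides is generally infinite.

```python
import numpy as np

def empirical_welfare(y, d, e, in_g):
    """Sample-analogue welfare W_n(G) with inverse-propensity weights."""
    y, d, e = (np.asarray(a, dtype=float) for a in (y, d, e))
    in_g = np.asarray(in_g, dtype=bool)
    treated_term = (y * d / e) * in_g                  # Y_i D_i / e(X_i) * 1{X_i in G}
    control_term = (y * (1 - d) / (1 - e)) * (~in_g)   # Y_i (1 - D_i) / (1 - e(X_i)) * 1{X_i not in G}
    return float(np.mean(treated_term + control_term))

def ewm_over_candidates(y, d, e, x, rules):
    """Return the name of the candidate rule with the highest W_n, plus all scores."""
    scores = {name: empirical_welfare(y, d, e, rule(x)) for name, rule in rules.items()}
    return max(scores, key=scores.get), scores

# Simulated RCT (illustrative assumption): tau(x) = 2 * (x - 0.5), propensity 1/2.
rng = np.random.default_rng(0)
n = 4000
x = rng.uniform(0.0, 1.0, n)
d = rng.binomial(1, 0.5, n)
y = rng.normal(0.0, 1.0, n) + d * 2.0 * (x - 0.5)
e = np.full(n, 0.5)

rules = {
    "treat none": lambda x: np.zeros_like(x, dtype=bool),
    "treat all": lambda x: np.ones_like(x, dtype=bool),
    "treat x > 0.5": lambda x: x > 0.5,
}
best, scores = ewm_over_candidates(y, d, e, x, rules)
print(best, scores)
```

Each observation contributes only through the arm it was actually assigned to, reweighted by the inverse of the probability of that assignment, which is why no outcome model is needed.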

Empirical Illustration
◮ National Job Training Partnership Act (JTPA) Study (Bloom et al., 1997)
◮ Sample: 11,204 adult applicants
◮ Propensity score = 2/3 (probability of treatment)
◮ Outcome \(Y = D(Y_1 - \text{cost}) + (1 - D)Y_0\), under two cost specifications:
  ◮ No cost: total individual earnings in the 30-month period following treatment assignment
  ◮ With cost: total earnings minus $774 (average cost of each treatment assignment, taking into account variable take-up)
◮ Covariates X: years of education, pre-program earnings
◮ Average treatment effect: $1,157
◮ 95% CI: ($513, $1,801)

Parametric plug-in treatment rule: estimate \(E(Y_1 \mid X)\) and \(E(Y_0 \mid X)\) by OLS. Assign treatment 1 if \(X'\beta_1 > X'\beta_0\).
No cost: treat everyone, est. gain $1,157
With $774 cost: treat 96%, est. gain $466 (per population member)
[Figure: OLS plug-in rule, no cost, and OLS plug-in rule, $774 cost per assignee, shown with the population density. Axes: years of education (6 to 18) and pre-program annual earnings ($0 to $20K).]
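A minimal sketch of a plug-in rule of this kind, assuming separate linear regressions with an intercept in each experimental arm. The function name `ols_plugin_rule` and the toy data are hypothetical. The cost is subtracted from the predicted gain, which gives the same decisions as subtracting it from treated outcomes before fitting.

```python
import numpy as np

def ols_plugin_rule(x, y, d, cost=0.0):
    """Fit OLS separately in each arm; treat where the predicted gain exceeds the cost."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=bool)
    design = np.column_stack([np.ones(len(y)), x])        # intercept plus covariates
    beta1, *_ = np.linalg.lstsq(design[d], y[d], rcond=None)    # treated arm
    beta0, *_ = np.linalg.lstsq(design[~d], y[~d], rcond=None)  # control arm
    predicted_gain = design @ (beta1 - beta0) - cost
    return predicted_gain > 0, beta0, beta1

# Noise-free toy data (assumption): E[Y1|x] = 2x - 0.5 and E[Y0|x] = 1 + x.
x_demo = np.arange(8.0)
d_demo = np.tile([1, 0], 4)
y_demo = np.where(d_demo == 1, 2.0 * x_demo - 0.5, 1.0 + x_demo)

treat_free, b0, b1 = ols_plugin_rule(x_demo, y_demo, d_demo)
treat_costly, _, _ = ols_plugin_rule(x_demo, y_demo, d_demo, cost=2.0)
print(treat_free, treat_costly)
```

The rule is only as good as the two fitted regressions: if the linear specification is wrong, the implied treatment region can be far from the best rule in any class.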

EWM linear rule: maximizes the sample analog of welfare among linear decision rules \(\hat{G} = 1\{X'\beta \ge 0\}\)
No cost: treat 90%, est. gain $1,408. 95% CI: ($592, $2,225)
$774 cost: treat 90%, est. gain $712. 95% CI: (-$107, $1,532)
[Figure: EWM linear rule, no cost or $774 cost per assignee, shown with the population density. Axes: years of education (6 to 18) and pre-program annual earnings ($0 to $20K).]

EWM quadrant rule: select best min or max threshold for each covariate, \(\hat{G} = 1\{x_1 > (<)\, t_1,\ x_2 > (<)\, t_2\}\)
No cost: treat 93%, est. gain $1,277. 95% CI: ($519, $2,034)
$774 cost: treat 83%, est. gain $687. 95% CI: (-$71, $1,445)
[Figure: EWM quadrant rule, no cost, and EWM quadrant rule, $774 cost per assignee, shown with the population density. Axes: years of education (6 to 18) and pre-program annual earnings ($0 to $20K).]
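For a class as small as the quadrant rules, the empirical welfare maximizer can be found by exhaustive search. The sketch below is an assumption about how such a search could be organized, not the procedure used for the reported numbers; the function name `ewm_quadrant` and the toy data are hypothetical. It uses the fact that W_n(G) differs from W_n of the empty set by the sample mean of an inverse-propensity-weighted gain score over the treated region.

```python
import itertools
import numpy as np

def ewm_quadrant(x1, x2, y, d, e):
    """Exhaustive search for the quadrant rule with the largest empirical welfare gain."""
    x1, x2, y, d, e = (np.asarray(a, dtype=float) for a in (x1, x2, y, d, e))
    # Mean of score * 1{X in G} equals W_n(G) - W_n(empty set).
    score = y * d / e - y * (1 - d) / (1 - e)
    best = {"gain": -np.inf, "rule": None}
    for s1, s2 in itertools.product((1, -1), repeat=2):    # +1 means "x > t", -1 means "x < t"
        # Observed values plus an infinite threshold that leaves the covariate unrestricted.
        grid1 = np.append(np.unique(x1), -s1 * np.inf)
        grid2 = np.append(np.unique(x2), -s2 * np.inf)
        for t1 in grid1:
            in1 = s1 * x1 > s1 * t1
            for t2 in grid2:
                member = in1 & (s2 * x2 > s2 * t2)
                gain = float(np.mean(score * member))
                if gain > best["gain"]:
                    rule = (">" if s1 == 1 else "<", float(t1), ">" if s2 == 1 else "<", float(t2))
                    best = {"gain": gain, "rule": rule}
    return best

# Toy data (assumption): 3 x 3 covariate grid, one treated and one control unit per cell.
cells = np.array(list(itertools.product(range(3), range(3))), dtype=float)
x1_demo = np.repeat(cells[:, 0], 2)
x2_demo = np.repeat(cells[:, 1], 2)
d_demo = np.tile([1, 0], 9)
helps = (x1_demo >= 1) & (x2_demo <= 1)                     # treatment helps only in this region
y_demo = np.where(d_demo == 1, np.where(helps, 1.0, -1.0), 0.0)
result = ewm_quadrant(x1_demo, x2_demo, y_demo, d_demo, np.full(18, 0.5))
print(result)
```

The search cost grows with the product of the numbers of distinct covariate values, so this brute-force form is practical only for small samples or coarsely measured covariates.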

Non-parametric plug-in rule: bivariate kernel regression of \(Y_1 \mid X\) and \(Y_0 \mid X\) (rule-of-thumb (ROT) bandwidth).
No cost: treat 82%, est. gain $1,867
$774 cost: treat 69%, est. gain $1,257
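A rough sketch of a kernel plug-in rule follows. The slide does not say which kernel or which rule-of-thumb bandwidth was used, so the Nadaraya-Watson estimator, the product Gaussian kernel, and the bandwidth formula below are all assumptions, as are the function names `nw_regression` and `kernel_plugin_rule`.

```python
import itertools
import numpy as np

def nw_regression(x_train, y_train, x_eval, bandwidth):
    """Nadaraya-Watson regression with a product Gaussian kernel."""
    diff = (x_eval[:, None, :] - x_train[None, :, :]) / bandwidth   # shape (m, n, k)
    weights = np.exp(-0.5 * np.sum(diff ** 2, axis=2))              # shape (m, n)
    # Note: evaluation points very far from all training points can give zero total weight.
    return (weights @ y_train) / weights.sum(axis=1)

def kernel_plugin_rule(x, y, d, cost=0.0):
    """Treat where the kernel estimate of E[Y1|x] - E[Y0|x] exceeds the cost."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=bool)
    n, k = x.shape
    # Assumed rule-of-thumb bandwidth: per-coordinate standard deviation times n^(-1/(k+4)).
    bandwidth = x.std(axis=0, ddof=1) * n ** (-1.0 / (k + 4))
    m1 = nw_regression(x[d], y[d], x, bandwidth)      # treated arm
    m0 = nw_regression(x[~d], y[~d], x, bandwidth)    # control arm
    return (m1 - m0 - cost) > 0

# Toy check (assumption): constant outcomes 3 (treated) and 1 (control) on a 4 x 4 grid.
grid = np.array(list(itertools.product(range(4), range(4))), dtype=float)
d_demo = np.tile([1, 0], 8)
y_demo = np.where(d_demo == 1, 3.0, 1.0)
treat_free = kernel_plugin_rule(grid, y_demo, d_demo)
treat_costly = kernel_plugin_rule(grid, y_demo, d_demo, cost=5.0)
print(treat_free.mean(), treat_costly.mean())
```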

Welfare Criterion
Object of interest: policy with the highest utilitarian (additive) welfare
Outcome variable Y should reflect social preferences, so it may need to
◮ give different weight to different individuals
◮ non-linearly transform outcomes
◮ aggregate multi-dimensional outcomes
◮ subtract treatment costs from outcomes

◮ The utilitarian welfare of treatment rule G is
\[ W(G) \equiv E\left[ Y_1 \cdot 1\{X \in G\} + Y_0 \cdot 1\{X \notin G\} \right] = E[Y_0] + E\left[ \tau(X) \cdot 1\{X \in G\} \right], \]
where \(\tau(X) \equiv E(Y_1 - Y_0 \mid X)\) is the conditional treatment effect
◮ We can equivalently work with the welfare gain of treating subset G relative to treating no one,
\[ V(G) \equiv W(G) - W(\emptyset) = E\left[ \tau(X) \cdot 1\{X \in G\} \right] \]
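The decomposition of W(G) uses only the identity 1{X ∉ G} = 1 - 1{X ∈ G} and the law of iterated expectations. Written out step by step:

```latex
\begin{align*}
W(G) &= E\left[ Y_1 \cdot 1\{X \in G\} + Y_0 \cdot \left(1 - 1\{X \in G\}\right) \right] \\
     &= E[Y_0] + E\left[ (Y_1 - Y_0) \cdot 1\{X \in G\} \right] \\
     &= E[Y_0] + E\left[ E(Y_1 - Y_0 \mid X) \cdot 1\{X \in G\} \right] \\
     &= E[Y_0] + E\left[ \tau(X) \cdot 1\{X \in G\} \right].
\end{align*}
```

Setting G to the empty set gives W(∅) = E[Y_0], so subtracting it leaves V(G) = E[τ(X) · 1{X ∈ G}].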

First Best treatment rule (with no constraints on G)
\[ G^*_{FB} \equiv \{x : \tau(x) \ge 0\} \in \arg\max_{G \in \mathcal{B}(\mathcal{X})} W(G) = \arg\max_{G \in \mathcal{B}(\mathcal{X})} V(G) \]
Second Best treatment rule maximizing welfare in a constrained class \(\mathcal{G}\)
\[ G^* \in \arg\max_{G \in \mathcal{G}} W(G) = \arg\max_{G \in \mathcal{G}} V(G) \]
The maximized feasible welfare
\[ W^*_{\mathcal{G}} \equiv \sup_{G \in \mathcal{G}} W(G) \le W(G^*_{FB}) \]
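The gap between first-best and second-best welfare can be seen in a small discrete example. Everything in the sketch below is hypothetical: the five covariate values, their probabilities, the treatment effects, and the constrained class of threshold rules {x ≥ t} are chosen only for illustration.

```python
import numpy as np

# Hypothetical population: X is uniform on {0, 1, 2, 3, 4} with known tau(x).
x_vals = np.arange(5)
prob = np.full(5, 0.2)
tau = np.array([1.0, -2.0, 1.0, 1.0, -1.0])

def welfare_gain(member):
    """V(G) = E[tau(X) * 1{X in G}] for a boolean membership vector over x_vals."""
    return float(np.sum(prob * tau * member))

# First best: treat exactly where tau(x) >= 0, with no constraint on the rule.
first_best = tau >= 0
v_first_best = welfare_gain(first_best)

# Second best: best rule within the constrained class of thresholds {x >= t}.
thresholds = np.arange(6)                     # t = 5 treats no one
gains = [welfare_gain(x_vals >= t) for t in thresholds]
t_star = int(thresholds[int(np.argmax(gains))])
v_second_best = max(gains)

print(v_first_best, t_star, v_second_best)
```

Because τ changes sign more than once, no threshold rule can reproduce the first-best set, so the second-best gain is strictly smaller. Regret in the slides is measured against the second-best value.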

Assumptions: Distribution of \((Y_0, Y_1, D, X)\) is \(P \in \mathcal{P}\).
The only assumption on the distribution of treatment response:
◮ Bounded Outcomes: \(Y_1, Y_0 \in \left[ -\frac{M}{2}, \frac{M}{2} \right]\), \(M < \infty\), implying \(|\tau(x)| \le M\), \(\forall x\).
Restriction on experimental design (point-identifies \(\tau(x)\)):
◮ Strict Overlap: There exists \(\kappa > 0\) s.t. \(e(x) \in [\kappa, 1 - \kappa]\), \(\forall x\).
Restriction on \(\mathcal{G}\):
◮ Complexity of Decision Sets: \(\mathcal{G}\) is a countable VC-class of subsets with finite VC-dimension v, where v is the maximal number of points in \(\mathcal{X}\) that can be shattered by \(\mathcal{G}\).

Examples of VC-classes \(\mathcal{G}\)
Linear eligibility score:
\[ \mathcal{G} = \left\{ \{x : x'\beta \ge 0\} : \beta \in \mathbb{R}^{d_x} \right\} \]
has \(v = d_x + 1\).
Generalized eligibility score:
\[ \mathcal{G} = \left\{ \left\{ x : \sum_{k=1}^{m} a_k f_k(x) \ge g(x) \right\} : (a_1, \ldots, a_m) \in \mathbb{R}^m \right\} \]
has \(v \le m + 1\).
Multiple index rules:
\[ \mathcal{G} = \left\{ \{x : (f_1(x_1) \le c_1) \cap \cdots \cap (f_K(x_K) \le c_K)\} : (c_1, \ldots, c_K) \in \mathbb{R}^K \right\} \]
has \(v \le K + 1\).

Upper bound on maximum regret of EWM
Theorem 2.1: Let \(\mathcal{P}\) be a class of DGPs satisfying assumptions Bounded Outcomes and Strict Overlap. Let \(\mathcal{G}\) be a VC-class of treatment choice rules. Then
\[ \sup_{P \in \mathcal{P}} E_{P^n}\left[ W^*_{\mathcal{G}} - W(\hat{G}_{EWM}) \right] \le C_1 \frac{M}{\kappa} \sqrt{\frac{v}{n}}, \]
where \(C_1\) is a universal constant.
Remarks on rate bounds:
◮ This rate bound is valid whether \(G^*_{FB} \in \mathcal{G}\) or not.
◮ Parametric plug-in with misspecified regressions does not have such second-best optimality.

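Proof sketch
For any \(\tilde{G} \in \mathcal{G}\), since \(W_n(\hat{G}_{EWM}) \ge W_n(\tilde{G})\),
\[
\begin{aligned}
W(\tilde{G}) - W(\hat{G}_{EWM}) &\le W_n(\hat{G}_{EWM}) - W_n(\tilde{G}) + W(\tilde{G}) - W(\hat{G}_{EWM}) \\
&\le \left| W_n(\hat{G}_{EWM}) - W(\hat{G}_{EWM}) \right| + \left| W_n(\tilde{G}) - W(\tilde{G}) \right| \\
&\le 2 \sup_{G \in \mathcal{G}} \left| W_n(G) - W(G) \right|.
\end{aligned}
\]
So,
\[ W^*_{\mathcal{G}} - W(\hat{G}_{EWM}) \le 2 \sup_{G \in \mathcal{G}} \left| W_n(G) - W(G) \right| \]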

Proof sketch
\(W_n(G) = E_n(f(\cdot; G))\) and \(W(G) = E(f(\cdot; G))\), where
\[ f(\cdot; G) = \frac{Y_i D_i}{e(X_i)} 1\{X_i \in G\} + \frac{Y_i (1 - D_i)}{1 - e(X_i)} 1\{X_i \notin G\} \]
Lemma A.1: If \(\mathcal{G}\) is a VC-class of sets with VC-dimension v and \(g(\cdot), h(\cdot)\) are two given real-valued functions of observations, then the functions
\[ \left\{ f(\cdot; G) = g(\cdot) \cdot 1\{x \in G\} + h(\cdot) \cdot 1\{x \notin G\},\ G \in \mathcal{G} \right\} \]
form a VC-subgraph class with VC-dimension \(\le v\).
Using this lemma, we can apply a well-known maximal inequality for centered empirical processes to
\[ \sup_{G \in \mathcal{G}} \left| W_n(G) - W(G) \right| = \sup_{G \in \mathcal{G}} \left| E_n(f) - E(f) \right| \]

Lower bound on minimax regret
Theorem 2.2: Let \(\mathcal{P}\) be a class of DGPs satisfying Bounded Outcomes and Strict Overlap. Let \(\mathcal{G}\) be a VC-class of treatment choice rules. Then, for any treatment choice rule \(\hat{G}\),
\[ \sup_{P \in \mathcal{P}} E_{P^n}\left[ W^*_{\mathcal{G}} - W(\hat{G}) \right] \ge \frac{M}{2} e^{-2\sqrt{2}} \sqrt{\frac{v}{n}} \quad \text{for all } n \ge 16v. \]
Remarks on rate bounds:
◮ Both the upper and the lower bound are finite-sample bounds (but not sharp).
◮ \(\hat{G}_{EWM}\) is minimax rate optimal: no \(\hat{G}\) has maximum regret converging to zero at a faster rate uniformly over \(\mathcal{P}\).
◮ EWM is minimax rate optimal even when v grows with n.
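To get a feel for the magnitudes, the lower bound can be evaluated numerically and compared with the rate of the upper bound. This is a sketch under stated assumptions: the function names are hypothetical, the parameter values in the loop are arbitrary, and because the constant C_1 of Theorem 2.1 is not given, only the upper bound divided by C_1 is computed.

```python
import math

def minimax_lower_bound(M, v, n):
    """Lower bound of Theorem 2.2: (M / 2) * exp(-2 * sqrt(2)) * sqrt(v / n), for n >= 16 v."""
    if n < 16 * v:
        raise ValueError("the bound is stated only for n >= 16 * v")
    return 0.5 * M * math.exp(-2.0 * math.sqrt(2.0)) * math.sqrt(v / n)

def upper_bound_over_c1(M, kappa, v, n):
    """Upper bound of Theorem 2.1 divided by the unspecified universal constant C_1."""
    return (M / kappa) * math.sqrt(v / n)

# Arbitrary illustrative values: outcome range M = 1, overlap kappa = 1/3, VC-dimension v = 3.
for n in (1000, 4000, 16000):
    print(n, minimax_lower_bound(1.0, 3, n), upper_bound_over_c1(1.0, 1.0 / 3.0, 3, n))
```

Both quantities shrink at the same rate: quadrupling the sample size halves each of them, which is the sense in which EWM is minimax rate optimal.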

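Proof sketch
For the lower bound, we adapt the argument in Lugosi (2002). For any data-dependent treatment rule \(G_n\),
\[
\begin{aligned}
\sup_{P \in \mathcal{P}} E_{P^n}\left[ W^*_{\mathcal{G}} - W(G_n) \right]
&\ge \sup_{P \in \mathcal{P}^*} E_{P^n}\left[ W^*_{\mathcal{G}} - W(G_n) \right] \\
&\ge \int_{\mathcal{P}^*} E_{P^n}\left[ W^*_{\mathcal{G}} - W(G_n) \right] d\mu(P) \\
&\ge \int_{\mathcal{P}^*} E_{P^n}\left[ W^*_{\mathcal{G}} - W(\hat{G}_{bayes}) \right] d\mu(P),
\end{aligned}
\]
where \(\mathcal{P}^* \subset \mathcal{P}\) is a class of DGPs that has a discrete support of X with v points and \(\tau(x) = \gamma\) or \(-\gamma\).
For uniform prior \(\mu\), the Bayes risk can be analytically computed as a function of \(\gamma\). Setting \(\gamma = \sqrt{v/n}\) gives the lower bound.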
