Policy Evaluation with Latent Confounders via Optimal Balance

  1. Policy Evaluation with Latent Confounders via Optimal Balance
     Andrew Bennett (Cornell University, awb222@cornell.edu)
     Nathan Kallus (Cornell University, kallus@cornell.edu)
     Authors listed in alphabetical order.

  2. Policy Learning Problem
     Given observational data on individuals described by covariates (X), interventions performed on those individuals (T), and resultant outcomes (Y), we wish to estimate the utility of policies that assign treatment to individuals based on their covariates. This is a challenging problem when the relationship between T and Y in the logged data is confounded, even after controlling for X.
     Examples:
     Drug assignment policy: X is the patient information available to doctors, T is the drug assigned, Y is the medical outcome, and confounding arises from factors not fully captured by X (e.g. socioeconomic status) that influenced drug assignment in the observational data.
     Personalized education: X contains individual student statistics, T is an educational intervention, Y is a measure of post-intervention student outcomes, and confounding arises because X poorly captures the criteria used by decision makers in the observational data (e.g. X contains a standardized test score, but decisions were made based on actual student ability).

  3. Setup - Latent Confounder Framework
     Logged data model:
       Latent confounders: Z ∈ 𝒵 ⊆ R^p
       Observed proxies: X ∈ 𝒳 ⊆ R^q
       Treatment: T ∈ {1, ..., m}
       Potential outcomes: Y(t) ∈ R
     Assumption (Z are true confounders): for every t ∈ {1, ..., m}, the variables X, T, Y(t) are mutually independent, conditioned on Z.
     [Slide shows the causal diagram over X, Z, T, Y.]
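
As a concrete illustration of this data model, the following sketch simulates a toy instance of the latent confounder framework (all distributions and coefficients here are invented for illustration, not taken from the slides): Z drives the proxy X, the logged treatment T, and both potential outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Latent confounder Z (scalar here, i.e. p = 1).
Z = rng.normal(size=n)
# Observed proxy X: a noisy view of Z (q = 1).
X = Z + rng.normal(scale=0.5, size=n)
# Logging policy depends on Z only: binary treatment, m = 2.
e1 = 1.0 / (1.0 + np.exp(-Z))                  # e_1(Z) = P(T = 1 | Z)
T = (rng.random(n) < e1).astype(int)
# Potential outcomes depend on Z, not on X directly.
Y1 = Z + 1.0 + rng.normal(scale=0.1, size=n)   # Y(1)
Y0 = Z + rng.normal(scale=0.1, size=n)         # Y(0)
Y = np.where(T == 1, Y1, Y0)                   # observed outcome Y = Y(T)
```

By construction X, T, and Y(t) are mutually independent given Z, matching the assumption above; marginally, however, T and Y(t) are dependent through Z.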

  4. Setup - Logging and Behavior Policies
     Evaluation policy: π_t(x) denotes the probability that the evaluation policy assigns treatment T = t given observed proxies X = x.
     Logging policy: e_t(z) denotes the probability that the logging policy assigns treatment T = t given latent confounders Z = z; η_t(x) denotes the probability that the logging policy assigns treatment T = t given observed proxies X = x.

  5. Setup - Policy Evaluation Goal
     Definition (Policy Value): τ^π = E[Σ_{t=1}^m π_t(X) Y(t)].
     Goal: estimate the policy value τ^π given iid logged data of the form ((X_1, T_1, Y_1), ..., (X_n, T_n, Y_n)). We want to find an estimator τ̂^π that minimizes the MSE E[(τ̂^π − τ^π)²].
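
In a synthetic model where both potential outcomes are known, τ^π can be approximated directly by Monte Carlo, which is useful as a ground truth when testing estimators (the model below is illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

Z = rng.normal(size=n)
X = Z + rng.normal(scale=0.5, size=n)
Y1, Y0 = Z + 1.0, Z                   # noiseless potential outcomes for clarity

# Evaluation policy: treat exactly when the proxy is positive.
pi1 = (X > 0).astype(float)           # pi_1(X)
pi0 = 1.0 - pi1                       # pi_0(X)

# tau^pi = E[ pi_1(X) Y(1) + pi_0(X) Y(0) ]
tau_pi = np.mean(pi1 * Y1 + pi0 * Y0)
```

In this model E[Z] = 0 and P(X > 0) = 1/2 by symmetry, so the true value is τ^π = E[Z] + P(X > 0) = 0.5.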

  6. Setup - Latent Confounder Model
     We denote by ϕ(z; x, t) the conditional density of Z given X = x, T = t.
     Assumption (Latent Confounder Model): we assume that we have an identified model for ϕ(z; x, t), and that we can calculate conditional densities and sample Z values using this model.

  7. Setup - Observed Proxies
     We do not assume ignorability given X. This means standard approaches based on inverse propensity scores are bound to fail. Instead, the proxies X can be used (along with T) to calculate the posterior of the true confounders Z, which can then be used for evaluation.

  8. Setup - Additional Assumptions
     Assumption (Weak Overlap): E[e_t^{-2}(Z)] < ∞ for every t.
     Assumption (Bounded Variance): the conditional variance of the potential outcomes given X, T is bounded: V[Y(t) | X, T] ≤ σ².

  9. Setup - Mean Value Functions
     Define the following mean value functions:
       μ_t(z) = E[Y(t) | Z = z]
       ν_t(x, t′) = E[Y(t) | X = x, T = t′] = E[μ_t(Z) | X = x, T = t′]
       ρ_t(x) = E[Y(t) | X = x] = E[μ_t(Z) | X = x]
     Note that we can equivalently rewrite the policy value as:
       τ^π = E[Σ_{t=1}^m π_t(X) Y(t)]
           = E[Σ_{t=1}^m π_t(X) μ_t(Z)]
           = E[Σ_{t=1}^m π_t(X) ν_t(X, T)]
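
The equivalence between the outcome form and the μ_t form of the policy value can be checked numerically in a toy model where μ_t is known in closed form (an illustrative model, not from the slides; here μ_1(z) = z + 1 and μ_0(z) = z by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

Z = rng.normal(size=n)
X = Z + rng.normal(scale=0.5, size=n)
Y1 = Z + 1.0 + rng.normal(scale=0.3, size=n)   # so mu_1(z) = z + 1
Y0 = Z + rng.normal(scale=0.3, size=n)         # so mu_0(z) = z

pi1 = (X > 0).astype(float)
pi0 = 1.0 - pi1

# tau^pi written against outcomes, and against mu_t(Z):
tau_y  = np.mean(pi1 * Y1 + pi0 * Y0)
tau_mu = np.mean(pi1 * (Z + 1.0) + pi0 * Z)
```

Both averages estimate the same τ^π; their gap is pure outcome noise and shrinks at rate O_p(1/√n).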

  10. Past Work - Standard Estimator Types
     Weighted, direct, and doubly robust estimators:
       τ̂^π_W = (1/n) Σ_{i=1}^n W_i Y_i
       τ̂^π_ρ = (1/n) Σ_{i=1}^n Σ_{t=1}^m π_t(X_i) ρ̂_t(X_i)
       τ̂^π_{W,ρ} = (1/n) Σ_{i=1}^n Σ_{t=1}^m π_t(X_i) ρ̂_t(X_i) + (1/n) Σ_{i=1}^n W_i (Y_i − ρ̂_{T_i}(X_i))
     Note that ρ̂_t is not straightforward to estimate via regression, since ρ_t(x) = E[Y(t) | X = x] ≠ E[Y | X = x, T = t].
     The correct IPW weights W_i = π_{T_i}(X_i) / e_{T_i}(Z_i) are infeasible since Z_i is not observed, and the naively misspecified IPW weights W_i = π_{T_i}(X_i) / η_{T_i}(X_i) lead to biased evaluation.
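
A minimal sketch of the three estimator shapes, assuming weights W and regression estimates ρ̂ are supplied from elsewhere (the function names are mine, not the authors'):

```python
import numpy as np

def weighted_estimator(W, Y):
    # tau_hat^pi_W = (1/n) sum_i W_i Y_i
    return np.mean(W * Y)

def direct_estimator(pi, rho_hat):
    # pi and rho_hat have shape (n, m):
    # tau_hat^pi_rho = (1/n) sum_i sum_t pi_t(X_i) rho_hat_t(X_i)
    return np.mean(np.sum(pi * rho_hat, axis=1))

def doubly_robust_estimator(W, Y, T, pi, rho_hat):
    # Direct term plus a weighted correction on the residuals Y_i - rho_hat_{T_i}(X_i).
    n = len(Y)
    direct = np.mean(np.sum(pi * rho_hat, axis=1))
    correction = np.mean(W * (Y - rho_hat[np.arange(n), T]))
    return direct + correction
```

When the weights are zero the doubly robust estimator reduces to the direct estimator, and when ρ̂ is zero it reduces to the weighted estimator.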

  11. Past Work - Optimal Balancing
     Optimal balancing (Kallus 2018) seeks weights W_i for τ̂^π_W that minimize an estimate of the worst-case MSE of policy evaluation, given a class of functions for the unknown mean value function.
     Define CMSE(W, μ) to be the conditional mean squared error, given the logged data, of τ̂^π_W as an estimate of the sample average policy effect (SAPE), if the mean value function were given by μ.
     Choose weights W* for evaluation according to the rule:
       W* = argmin_{W ∈ 𝒲} sup_{μ ∈ ℱ} CMSE(W, μ)
     This permits a simple QP algorithm when ℱ is a class of RKHS functions.

  12. Generalized IPS Weights I
     Suppose we want to define weights W(X, T) IPS-style such that the weighted estimator is unbiased term-by-term; this requires solving:
       E[W(X, T) δ_{T,t} Y(t)] = E[π_t(X) Y(t)]
     One can easily verify that if we assume ignorability given X, this equation is solved by the standard IPS weights W(X, T) = π_T(X) / η_T(X).
     Theorem (Generalized IPS Weights): if W(x, t) satisfies the above equation, then for each t ∈ {1, ..., m}
       W(x, t) = π_t(x) (Σ_{t′=1}^m η_{t′}(x) ν_t(x, t′) + Ω_t(x)) / (η_t(x) ν_t(x, t)),
     for some Ω_t(x) such that E[Ω_t(X)] = 0 for all t.
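
For a single x, the theorem's weight formula can be written as a one-liner (a sketch; the argument names are mine). In the ignorable special case where ν_t(x, ·) is constant in its second argument and Ω_t = 0, it collapses to the standard IPS weight π_t(x)/η_t(x), since Σ_{t′} η_{t′}(x) = 1:

```python
import numpy as np

def generalized_ips_weight(pi_t, eta, nu_t_row, t, Omega_t=0.0):
    """W(x, t) = pi_t(x) * (sum_t' eta_t'(x) nu_t(x, t') + Omega_t(x))
                         / (eta_t(x) nu_t(x, t)).

    eta: length-m vector of eta_t'(x); nu_t_row: length-m vector of nu_t(x, t').
    """
    return pi_t * (np.dot(eta, nu_t_row) + Omega_t) / (eta[t] * nu_t_row[t])
```

The denominator ν_t(x, t) is exactly the term the next slide flags as a source of variance.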

  13. Generalized IPS Weights II
     Calculating these generalized IPS weights is not straightforward, since it involves the counterfactual estimation of ν_t(x, t′) for t ≠ t′ (which requires knowledge of Z). In addition, we would expect high variance from error in estimating ν_t, due to its position in the denominator.
     However, the fact that such weights exist supports the idea of using an optimal-balancing-style approach and choosing weights that balance a flexible class of possible mean outcome functions.

  14. Adversarial Objective Motivation
     Define the following, where the dependence on μ is embedded implicitly inside ν_t:
       f_{it} = W_i δ_{T_i,t} − π_t(X_i)
       J(W, μ) = ((1/n) Σ_{i=1}^n Σ_{t=1}^m f_{it} ν_t(X_i, T_i))² + (2σ²/n²) ‖W‖₂²
     Theorem (CMSE Upper Bound): E[(τ̂^π_W − τ^π)² | X_{1:n}, T_{1:n}] ≤ 2 J(W, μ) + O_p(1/n).
     Lemma (CMSE Convergence Implies Consistency): if E[(τ̂^π_W − τ^π)² | X_{1:n}, T_{1:n}] = O_p(1/n), then τ̂^π_W = τ^π + O_p(1/√n).
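
The objective J is cheap to evaluate; a direct translation (a sketch, with ν supplied as a precomputed matrix of values ν_t(X_i, T_i)):

```python
import numpy as np

def J(W, nu_obs, pi, T, sigma2):
    """J(W, mu) for one candidate mu.

    W: (n,) weights; nu_obs[i, t] = nu_t(X_i, T_i) under that mu;
    pi: (n, m) evaluation-policy probabilities; T: (n,) logged treatments.
    """
    n, m = pi.shape
    delta = np.zeros((n, m))
    delta[np.arange(n), T] = 1.0              # delta_{T_i, t}
    f = W[:, None] * delta - pi               # f_{it} = W_i delta_{T_i,t} - pi_t(X_i)
    bias_sq = (np.sum(f * nu_obs) / n) ** 2   # squared conditional-bias term
    variance = 2.0 * sigma2 / n**2 * np.sum(W**2)
    return bias_sq + variance
```

When W_i δ_{T_i,t} exactly matches π_t(X_i) for every i, t, the first term vanishes and only the variance penalty on ‖W‖₂² remains.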

  15. Balancing Objective
     Our optimal balancing objective is to choose weights W* for evaluation according to the following optimization problem:
       W* = argmin_{W ∈ 𝒲} sup_{μ ∈ ℱ} J(W, μ)
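
For a finite set of candidate μ (standing in for the sup over ℱ), the min-max can be sketched with a generic optimizer; this is only an illustration, not the QP the authors use for RKHS classes (SciPy assumed available):

```python
import numpy as np
from scipy.optimize import minimize

def J(W, nu_obs, pi, T, sigma2):
    # Objective from slide 14, restated so this sketch is self-contained.
    n, m = pi.shape
    delta = np.zeros((n, m))
    delta[np.arange(n), T] = 1.0
    f = W[:, None] * delta - pi
    return (np.sum(f * nu_obs) / n) ** 2 + 2.0 * sigma2 / n**2 * np.sum(W**2)

def balance_weights(nu_candidates, pi, T, sigma2):
    # min over W of the max over a finite candidate set for mu.
    n = len(T)
    worst_case = lambda W: max(J(W, nu, pi, T, sigma2) for nu in nu_candidates)
    res = minimize(worst_case, x0=np.ones(n), method="Nelder-Mead",
                   options={"maxiter": 5000})
    return res.x
```

Nelder-Mead is used because the pointwise max is non-smooth; the returned weights are guaranteed to be no worse than the all-ones starting point under the worst-case objective.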

  16. Feasibility of Balancing Objective I
     Minimizing J(W, μ) over some class of μ ∈ ℱ corresponds to implicitly balancing some class of functions ν indexed by μ, since:
       J(W, μ) = ((1/n) Σ_{i=1}^n W_i ν_{T_i}(X_i, T_i) − (1/n) Σ_{i=1}^n Σ_{t=1}^m π_t(X_i) ν_t(X_i, T_i))² + (2σ²/n²) ‖W‖₂²
     Note that such balancing would be impossible over a generic flexible class of functions ν that ignores Z, due to the ν_t(x, t′) terms for t ≠ t′.

  17. Feasibility of Balancing Objective II
     The following lemma suggests that this fundamental counterfactual issue may not be a problem, given the implicit constraint imposed by indexing with μ and our overlap assumption:
     Lemma (Mean Value Function Overlap): assuming ‖μ_t‖_∞ ≤ b, under our weak overlap assumption, for all x ∈ 𝒳 and t, t′, t′′ ∈ {1, ..., m} we have
       |ν_t(x, t′′)| ≤ (η_{t′}(x) / η_{t′′}(x)) √(8 b E[e_t^{-2}(Z) | X = x, T = t′] |ν_t(x, t′)|).

  18. Assumptions for Consistent Evaluation I
     Define ℱ_t = {μ_t : ∃(μ′_1, ..., μ′_m) ∈ ℱ with μ′_t = μ_t}; then we make the following assumptions:
     Assumption (Normed): for each t ∈ {1, ..., m} there exists a norm ‖·‖_t on span(ℱ_t), and there exists a norm ‖·‖ on span(ℱ) which is defined, given some R^m norm, as ‖μ‖ = ‖(‖μ_1‖_1, ..., ‖μ_m‖_m)‖.
     Assumption (Absolutely Star Shaped): for every μ ∈ ℱ and |λ| ≤ 1, we have λμ ∈ ℱ.
     Assumption (Convex Compact): ℱ is convex and compact.

  19. Assumptions for Consistent Evaluation II
     Assumption (Square Integrable): for each t ∈ {1, ..., m} the space ℱ_t is a subset of L²(𝒵), and its norm dominates the L² norm (i.e., inf_{μ_t ∈ ℱ_t} ‖μ_t‖_t / ‖μ_t‖_{L²} > 0).
     Assumption (Nondegeneracy): define B(γ) = {μ ∈ span(ℱ) : ‖μ‖ ≤ γ}. Then B(γ) ⊆ ℱ for some γ > 0.
     Assumption (Boundedness): sup_{μ ∈ ℱ} ‖μ‖_∞ < ∞.

  20. Assumptions for Consistent Evaluation III
     Definition (Rademacher Complexity): R_n(ℱ) = E[sup_{f ∈ ℱ} (1/n) Σ_{i=1}^n ε_i f(Z_i)], where the ε_i are iid Rademacher random variables.
     Assumption (Complexity): for each t ∈ {1, ..., m} we have R_n(ℱ_t) = o(1).
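
For a finite function class, R_n(ℱ) can be estimated by Monte Carlo over the Rademacher signs (a sketch; the class is represented by its value matrix on the sample):

```python
import numpy as np

def empirical_rademacher(F_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_n(F) for a finite class.

    F_values: (k, n) array whose row j holds (f_j(Z_1), ..., f_j(Z_n)).
    """
    rng = np.random.default_rng(seed)
    k, n = F_values.shape
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))     # iid Rademacher signs
    # sup over the class of (1/n) sum_i eps_i f(Z_i), averaged over sign draws
    return np.mean(np.max(eps @ F_values.T / n, axis=1))
```

The class containing only the zero function has complexity exactly zero, while any class containing a function and its negation has strictly positive complexity.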
