

SLIDE 1

Causal Inference and Stable Learning

Peng Cui, Tsinghua University
Tong Zhang, Hong Kong University of Science and Technology

SLIDE 2

ML techniques are impacting our life

  • A day in our life with ML techniques

(Timeline figure: 8:00 am, 8:30 am, 10:00 am, 4:00 pm, 6:00 pm, 8:00 pm.)

SLIDE 3

Now we are stepping into risk-sensitive areas

Shifting from performance-driven to risk-sensitive applications.

SLIDE 4

Problems of today's ML - Explainability

Most machine learning models are black-box models. In risk-sensitive areas with a human in the loop (health, military, finance, industry), unexplainable models are hard to trust.

SLIDE 5

Problems of today's ML - Stability

Most ML methods are developed under the I.I.D. hypothesis: training and testing data are assumed to come from the same distribution.

SLIDE 6

Problems of today's ML - Stability

(Figure: the same model answers "Yes", "Maybe", "No" on test images as they drift away from the training distribution.)

SLIDE 7

Problems of today's ML - Stability

  • Cancer survival rate prediction
  • Training data: City Hospital. The predictive model learns "higher income, higher survival rate."
  • Testing data: University Hospital, where the survival rate is not so correlated with income.

SLIDE 8

A plausible reason: Correlation

Correlation is the very basis of machine learning.

SLIDE 9

Correlation is not explainable

SLIDE 10

Correlation is 'unstable'

SLIDE 11

It's not the fault of correlation, but the way we use it

  • Three sources of correlation:
  • Causation: the causal mechanism (T → Y); stable and explainable (e.g. income → accepting a financial product offer)
  • Confounding: a common cause X drives both T and Y (T ← X → Y); ignoring X yields a spurious correlation (e.g. ice cream sales and summer)
  • Sample selection bias: conditioning on a selection variable S (T → S ← Y) yields a spurious correlation (e.g. dog and grass under sample selection)

SLIDE 12

A Practical Definition of Causality

Definition: T causes Y if and only if changing T leads to a change in Y, while keeping everything else constant. The causal effect is defined as the magnitude by which Y is changed by a unit change in T. This is called the "interventionist" interpretation of causality.

http://plato.stanford.edu/entries/causation-mani/

(Causal graph over X, T, and Y.)

SLIDE 13

The benefits of bringing causality into learning

Causal framework (T: grass, X: dog nose, Y: label):
Grass → Label: strong correlation, weak causation.
Dog nose → Label: strong correlation, strong causation.

More explainable and more stable.

SLIDE 14

The gap between causality and learning

  • How to evaluate the outcome? Wild environments:
  • High-dimensional
  • Highly noisy
  • Little prior knowledge (model specification, confounding structures)
  • Targeting problems:
  • Understanding vs. prediction
  • Depth vs. scale and performance

How to bridge the gap between causality and (stable) learning?

SLIDE 15

Outline

  • Correlation vs. Causality
  • Causal Inference
  • Stable Learning
  • NICO: An Image Dataset for Stable Learning
  • Conclusions

SLIDE 16

Paradigms - Structural Causal Model

A graphical model to describe the causal mechanisms of a system (a causal graph over variables such as T, Y, U, Z, W).

  • Causal identification with the back-door criterion
  • Causal estimation with do-calculus

How to discover the causal structure?

SLIDE 17

Paradigms - Structural Causal Model

  • Causal discovery
  • Constraint-based: conditional independence tests
  • Functional causal model based

A generative model with strong expressive power, but it induces high complexity.

SLIDE 18

Paradigms - Potential Outcome Framework

  • A simpler setting
  • Suppose the confounders of T are known a priori
  • The computational complexity is affordable
  • Under stronger assumptions
  • E.g. all confounders need to be observed

More like a discriminative way to estimate the treatment's partial effect on the outcome.

SLIDE 19

Causal Effect Estimation

  • Treatment variable: T = 1 or T = 0
  • Treated group (T = 1) and control group (T = 0)
  • Potential outcomes: Y(T = 1) and Y(T = 0)
  • Average treatment effect (ATE):

ATE = E[Y(T = 1) − Y(T = 0)]
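To make the estimand concrete, here is a minimal sketch (my own illustration with synthetic data, not from the tutorial): under randomized assignment of T, the ATE is estimated by a simple difference of group means.

```python
# Minimal sketch (illustrative, synthetic data): under a randomized
# experiment, ATE = E[Y(T=1) - Y(T=0)] reduces to a difference of means.
import numpy as np

rng = np.random.default_rng(0)
t = rng.integers(0, 2, size=1000)        # randomized treatment assignment
y = 2.0 * t + rng.normal(size=1000)      # synthetic outcome with true ATE = 2

ate_hat = y[t == 1].mean() - y[t == 0].mean()
print(f"estimated ATE: {ate_hat:.2f}")   # close to 2 thanks to randomization
```

The next slides show why this simple comparison breaks down without randomization.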

SLIDE 20

Counterfactual Problem

  • Two key points for causal effect estimation:
  • Changing T
  • Keeping everything else constant
  • For each person, we observe only one potential outcome: either Y(T=1) or Y(T=0)
  • Across the two groups (T=1 and T=0), everything else is not constant

Person | T | Y(T=1) | Y(T=0)
P1 | 1 | 0.4 | ?
P2 | 0 | ? | 0.6
P3 | 1 | 0.3 | ?
P4 | 0 | ? | 0.1
P5 | 1 | 0.5 | ?
P6 | 0 | ? | 0.5
P7 | 0 | ? | 0.1

SLIDE 21

Ideal Solution: Counterfactual World

  • Reason about a world that does not exist
  • Everything in the counterfactual world is the same as the real world, except the treatment

Compare Y(T = 1) in the real world against Y(T = 0) in the counterfactual world.

SLIDE 22

Randomized Experiments are the "Gold Standard"

  • Drawbacks of randomized experiments:
  • Cost
  • Unethical
  • Unrealistic

SLIDE 23

Causal Inference with Observational Data

  • Counterfactual problem: can we estimate the ATE by directly comparing the average outcome Ȳ(T = 1) against Ȳ(T = 0) between the treated and control groups?
  • Yes with randomized experiments (X are the same)
  • No with observational data (X might be different)

SLIDE 24

Confounding Effect

(Example: estimating the effect of smoking on weight; age confounds both.)

Balancing the confounders' distribution.

SLIDE 25

Methods for Causal Inference

  • Matching
  • Propensity Score
  • Directly Confounder Balancing

SLIDE 26

Matching

(Figure: matched pairs across the T = 0 and T = 1 groups.)

SLIDE 27

Matching

SLIDE 28

Matching

  • Identify pairs of treated (T=1) and control (T=0) units whose confounders X are similar or even identical to each other: Distance(X_i, X_j) ≤ ε
  • Paired units guarantee that everything else (the confounders) is approximately constant
  • Small ε: less bias, but higher variance
  • Fits low-dimensional settings
  • But in high-dimensional settings, there will be few exact matches (see the sketch below)
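A minimal matching sketch, assuming a numeric covariate matrix X, a binary treatment vector t, and outcomes y (the names and caliper logic are my own, not the tutorial's code):

```python
# Sketch of nearest-neighbor matching with a caliper eps (assumed setup):
# pair each treated unit with its closest control unit in covariate space
# and average the within-pair outcome differences (an ATT-style estimate).
import numpy as np

def matching_att(X, t, y, eps=0.1):
    treated = np.where(t == 1)[0]
    control = np.where(t == 0)[0]
    effects = []
    for i in treated:
        d = np.linalg.norm(X[control] - X[i], axis=1)  # Distance(X_i, X_j)
        if d.min() <= eps:                             # small eps: less bias, higher variance
            j = control[np.argmin(d)]
            effects.append(y[i] - y[j])
    return np.mean(effects) if effects else float("nan")
```

In high dimensions d.min() rarely falls below any reasonable eps, which is exactly the failure mode the slide describes.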

SLIDE 29

Methods for Causal Inference

  • Matching
  • Propensity Score
  • Directly Confounder Balancing

SLIDE 30

Propensity Score Based Methods

  • The propensity score e(X) is the probability that a unit gets treated: e(X) = P(T = 1 | X)
  • Rosenbaum and Rubin showed that the propensity score is sufficient to control or summarize the information of the confounders: T ⫫ X | e(X), and T ⫫ (Y(1), Y(0)) | e(X)
  • Propensity scores cannot be observed and need to be estimated

SLIDE 31

Propensity Score Matching

  • Estimating the propensity score: ê(X) = P̂(T = 1 | X)
  • Supervised learning: predict the known label T from the observed covariates X
  • Conventionally, logistic regression is used
  • Matching pairs by the distance between propensity scores: Distance(X_i, X_j) = |ê(X_i) − ê(X_j)| ≤ ε
  • The high-dimensional challenge moves from matching to propensity score estimation (a sketch follows)

  • P. C. Austin. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3):399–424, 2011.
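A sketch of this recipe under assumed names (X, t, y; not the authors' code): estimate ê(X) with logistic regression, then nearest-neighbor match on the one-dimensional score within a caliper.

```python
# Propensity score matching sketch (assumed setup): matching happens on the
# scalar e(X) = P(T=1|X) instead of the full covariate vector X.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ps_matching_att(X, t, y, eps=0.05):
    e = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    treated, control = np.where(t == 1)[0], np.where(t == 0)[0]
    effects = []
    for i in treated:
        d = np.abs(e[control] - e[i])                  # |ê(X_i) - ê(X_j)|
        if d.min() <= eps:
            effects.append(y[i] - y[control[np.argmin(d)]])
    return np.mean(effects) if effects else float("nan")
```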
SLIDE 32

Inverse of Propensity Weighting (IPW)

  • Why weight with the inverse of the propensity score? The propensity score e(X) = P(T = 1 | X) induces a distribution bias on the confounders X:

Unit | e(X) | 1 − e(X) | #units | #units (T=1) | #units (T=0)
A | 0.7 | 0.3 | 10 | 7 | 3
B | 0.6 | 0.4 | 50 | 30 | 20
C | 0.2 | 0.8 | 40 | 8 | 32

Reweighting by the inverse of the propensity score, w_i = T_i / e_i + (1 − T_i) / (1 − e_i), gives:

Unit | #units (T=1) | #units (T=0)
A | 10 | 10
B | 50 | 50
C | 40 | 40

The confounders are the same!

  • P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
SLIDE 33

Inverse of Propensity Weighting (IPW)

  • Estimating the ATE by IPW [1], with sample weights w_i = T_i / e_i + (1 − T_i) / (1 − e_i)
  • Interpretation: IPW creates a pseudo-population in which the confounders are the same between the treated and control groups
  • But it requires a correctly specified model for the propensity score
  • High variance when e is close to 0 or 1

  • [1] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
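A minimal IPW sketch of the weighting above (assumed names; the clipping guard is a common practice I added, not part of the slide):

```python
# IPW ATE sketch: weight each unit by w_i = T_i/e_i + (1 - T_i)/(1 - e_i)
# and take the weighted difference of outcome means.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, t, y, clip=0.01):
    e = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    e = np.clip(e, clip, 1 - clip)   # guard: variance explodes as e -> 0 or 1
    return np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))
```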
SLIDE 34

Non-parametric Solution

  • The model specification problem is inevitable
  • Can we directly learn sample weights that balance the confounders' distribution between treated and control groups?

SLIDE 35

Methods for Causal Inference

  • Matching
  • Propensity Score
  • Directly Confounder Balancing

SLIDE 36

Directly Confounder Balancing

  • Motivation: the collection of all the moments of a set of variables uniquely determines their distribution.
  • Method: learn sample weights that directly balance the confounders' moments (for the ATT problem): the weighted first moments of X in the control group should match the first moments of X in the treated group (a sketch follows).

With moments, the sample weights can be learned without any model specification.

  • J. Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.
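A small sketch of the idea, assuming we balance only first moments for the ATT and add an entropy-like penalty to keep the weights diffuse (the objective and names are my reading of the slide, not the paper's code):

```python
# Direct confounder balancing sketch: learn control-group weights w >= 0,
# sum(w) = 1, so the weighted control moments match the treated moments.
import numpy as np
from scipy.optimize import minimize

def balancing_weights(X, t, lam=1e-3):
    Xc = X[t == 0]
    target = X[t == 1].mean(axis=0)          # first moments of the treated group

    def loss(v):
        w = np.exp(v) / np.exp(v).sum()      # softmax keeps w positive, normalized
        balance = np.sum((Xc.T @ w - target) ** 2)
        entropy_pen = np.sum(w * np.log(w + 1e-12))
        return balance + lam * entropy_pen

    v = minimize(loss, np.zeros(len(Xc)), method="L-BFGS-B").x
    return np.exp(v) / np.exp(v).sum()
```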
SLIDE 37

Entropy Balancing

  • Directly balance confounders with sample weights W
  • Minimize the entropy of the sample weights W

Either the confounders are known a priori, or all variables are regarded as confounders; all confounders are balanced equally.

  • S. Athey, et al. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B, 2018, 80(4): 597–623.

SLIDE 38

Differentiated Confounder Balancing

  • Idea: different confounders introduce different confounding bias
  • Simultaneously learn confounder weights γ and sample weights W
  • The confounder weights determine which variables are confounders and their contribution to the confounding bias
  • The sample weights are designed for confounder balancing

  • Kun Kuang, Peng Cui, et al. Estimating Treatment Effect in the Wild via Differentiated Confounder Balancing. KDD 2017, 265–274.

SLIDE 39

Differentiated Confounder Balancing

  • General relationship among X, T, and Y: the confounding bias is weighted by the confounder weights. If a variable's confounder weight β_j = 0, then X_j is not a confounder and there is no need to balance it; different confounders carry different confounding weights.

  • Kun Kuang, Peng Cui, et al. Estimating Treatment Effect in the Wild via Differentiated Confounder Balancing. KDD 2017, 265–274.

SLIDE 40

Differentiated Confounder Balancing

  • Idea: simultaneously learn confounder weights γ and sample weights W
  • The confounder weights determine which variables are confounders and their contribution to the confounding bias
  • The sample weights are designed for confounder balancing
  • The entropy balancing (ENT) algorithm is a special case of the DCB algorithm, obtained by setting the confounder weights to the unit vector

  • Kun Kuang, Peng Cui, et al. Estimating Treatment Effect in the Wild via Differentiated Confounder Balancing. KDD 2017, 265–274.

SLIDE 41

Experiments

(Results on the LaLonde benchmark dataset.)

Kun Kuang, Peng Cui, et al. Estimating Treatment Effect in the Wild via Differentiated Confounder Balancing. KDD 2017, 265–274.

SLIDE 42

Assumptions of Causal Inference

  • A1: Stable Unit Treatment Value (SUTVA): the effect of treatment on a unit is independent of the treatment assignment of other units: P(Y_i | T_i, T_j, X_i) = P(Y_i | T_i, X_i)
  • A2: Unconfoundedness: the distribution of treatment is independent of the potential outcomes given the observed variables: T ⫫ (Y(0), Y(1)) | X (no unmeasured confounders)
  • A3: Overlap: each unit has a nonzero probability of receiving either treatment status given the observed variables: 0 < P(T = 1 | X = x) < 1

SLIDE 43

Sectional Summary

  • Progress has been made in drawing causality from big data:
  • From single to group
  • From binary to continuous
  • Weaker assumptions

Ready for learning?

SLIDE 44

Outline

  • Correlation vs. Causality
  • Causal Inference
  • Stable Learning
  • NICO: An Image Dataset for Stable Learning
  • Future Directions and Conclusions

SLIDE 45

Stability and Prediction

(Figure: traditional learning and stable learning connect the true model, the learning process, and prediction performance differently.)

Bin Yu (2016), Three Principles of Data Science: predictability, computability, stability.

SLIDE 46

Stable Learning

Train a model on Distribution 1; test it on Distributions 1, 2, 3, …, n, obtaining Accuracy 1, 2, 3, …, n.

  • I.I.D. learning: training and testing on the same distribution
  • Transfer learning: testing on a known, different distribution
  • Stable learning: good average accuracy and small VAR(Acc) across all testing distributions (a sketch of this evaluation follows)
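As a concrete reading of this diagram (a sketch with assumed names, not tutorial code), stable learning evaluates one trained model across many test environments and cares about both the mean and the variance of accuracy:

```python
# Sketch: report Average(Acc) and VAR(Acc) of a fixed model over a list of
# test environments drawn from different distributions.
import numpy as np

def stability_report(model, envs):
    """envs: iterable of (X_test, y_test) pairs from different distributions."""
    accs = np.array([np.mean(model.predict(X) == y) for X, y in envs])
    return accs.mean(), accs.var()   # stable learning: high mean, low variance
```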

SLIDE 47

Stability and Robustness

  • Robustness
  • More about prediction performance under data perturbations
  • Prediction performance-driven
  • Stability
  • More about recovering the true model
  • Lays more emphasis on bias
  • Sufficient for robustness

Stable learning is an (intrinsic?) way to realize robust prediction.

SLIDE 48

Stability

  • Statistical stability holds if statistical conclusions are robust to appropriate perturbations of the data.
  • Prediction stability
  • Estimation stability
SLIDE 49

Prediction Stability

  • Lasso
  • Prediction stability by cross-validation (CV):
  • n data units are randomly partitioned into V blocks, each block with d = [n/V] units
  • Leave one block out: train on (n − d) units, validate on d units
  • CV does not provide a good interpretable model because Lasso+CV is unstable

SLIDE 50

Estimation Stability

  • Estimation stability: measure the variance of the estimated regression function m across data perturbations, relative to the mean regression function.

ES+CV is better than Lasso+CV.

SLIDE 51

Domain Generalization / Invariant Learning

  • Given data from different observed environments, the task is to predict Y given X such that the prediction works well (is "robust") for "all possible" (including unseen) environments.

SLIDE 52

Domain Generalization

  • Assumption: the conditional probability P(Y|X) is stable or invariant across different environments.
  • Idea: take knowledge acquired from a number of related domains and apply it to previously unseen domains.
  • Theorem: under reasonable technical assumptions, a generalization bound across unseen domains holds with high probability.

Muandet K, Balduzzi D, Schölkopf B. Domain generalization via invariant feature representation. ICML 2013.

SLIDE 53

Invariant Prediction

  • Invariant assumption: there exists a subset S ⊆ X that is causal for the prediction of Y, and the conditional distribution P(Y|S) is stable across all environments.
  • Idea: linking to causality
  • Structural Causal Model (Pearl 2009): the parent variables of Y in the SCM satisfy the invariant assumption
  • The causal variables lead to invariance w.r.t. "all" possible environments

Peters, J., Bühlmann, P., & Meinshausen, N. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.
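A rough sketch of the invariant-prediction search (my simplification, not the authors' algorithm: ICP uses exact residual invariance tests; here a Levene test on residual groups stands in for them):

```python
# Invariant prediction sketch: a feature subset S is accepted if the residuals
# of the regression Y ~ X_S look identically distributed across environments.
from itertools import combinations
import numpy as np
from scipy import stats

def invariant_subsets(X, y, env, alpha=0.05):
    accepted = []
    for k in range(1, X.shape[1] + 1):
        for S in combinations(range(X.shape[1]), k):
            cols = list(S)
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            r = y - X[:, cols] @ beta
            groups = [r[env == e] for e in np.unique(env)]
            _, p = stats.levene(*groups)      # stand-in invariance test
            if p > alpha:
                accepted.append(S)            # plausibly causal subset
    return accepted
```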

SLIDE 54

From Variable Selection to Sample Reweighting

Typical causal framework (X, T, Y): sample reweighting can make a variable independent of the other variables.

Directly confounder balancing: given a feature T, assign different weights to samples so that the samples with T and the samples without T have similar distributions in X; then calculate the difference of the Y distribution between the treated and control groups (the correlation between T and Y).

SLIDE 55

Global Balancing: Decorrelating Variables

Typical causal framework (X, T, Y): the partial effect can be regarded as a causal effect, and predicting with causal variables is stable across different environments.

Global balancing: given ANY feature T, assign different weights to samples so that the samples with T and the samples without T have similar distributions in X; then calculate the difference of the Y distribution between the treated and control groups (the correlation between T and Y). A sketch of this balancing loss follows.

Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Li, Bo Li. Stable Prediction across Unknown Environments. KDD, 2018.
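A sketch of the global balancing loss as described above (my reading of the slide with assumed names, not the paper's code): every binary feature is treated as a treatment in turn, and the weighted distributions of the remaining features are pushed together.

```python
# Global balancing sketch: sum, over all features j, the squared difference
# between weighted means of the other features in the I_j=1 and I_j=0 groups.
import numpy as np

def global_balancing_loss(W, X_bin):
    """W: nonnegative sample weights (length n); X_bin: n x d binary matrix."""
    loss = 0.0
    for j in range(X_bin.shape[1]):
        I = X_bin[:, j]
        rest = np.delete(X_bin, j, axis=1)
        m1 = rest.T @ (W * I) / max((W * I).sum(), 1e-12)              # "treated" mean
        m0 = rest.T @ (W * (1 - I)) / max((W * (1 - I)).sum(), 1e-12)  # "control" mean
        loss += np.sum((m1 - m0) ** 2)
    return loss
```

In the paper this term is, roughly, minimized over W jointly with a prediction loss; here it is shown in isolation.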

SLIDE 56

Theoretical Guarantee

(The formal theorem and its conditions are stated in the paper below.)

Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Li, Bo Li. Stable Prediction across Unknown Environments. KDD, 2018.

SLIDE 57

Causal Regularizer

For each feature j: set feature j as the treatment variable, take all features excluding j as its confounders, and use the indicator of the treatment status together with learned sample weights to balance them.

Zheyan Shen, Peng Cui, Kun Kuang, Bo Li. Causally Regularized Learning with Agnostic Data Selection Bias. ACM MM, 2018.

SLIDE 58

Causally Regularized Logistic Regression

Objective: a sample-reweighted logistic loss plus a causal-contribution (balancing) regularizer.

Zheyan Shen, Peng Cui, Kun Kuang, Bo Li. Causally Regularized Learning with Agnostic Data Selection Bias. ACM MM, 2018.

SLIDE 59

From Shallow to Deep - DGBR

DGBR (Deep Global Balancing Regression) extends global balancing from shallow to deep models.

Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Li, Bo Li. Stable Prediction across Unknown Environments. KDD, 2018.

SLIDE 60

Experiment 1 - Non-I.I.D. Image Classification

  • Source: YFCC100M
  • Type: high-resolution, multi-tag images
  • Scale: 10 categories, each with nearly 1000 images
  • Method: select 5 context tags that frequently co-occur with the major tag (category label)

SLIDE 61

Experimental Results - Insights

SLIDE 62

Experimental Results - Insights

SLIDE 63

Experiment 2 - Online Advertising

  • Environment generation: separate the whole dataset into 4 environments by user age: Age ∈ [20,30), Age ∈ [30,40), Age ∈ [40,50), and Age ∈ [50,100).

SLIDE 64

From Causal Problem to Learning Problem

  • Previous logic: Sample Reweighting → Independent Variables → Causal Variables → Stable Prediction
  • More direct logic: Sample Reweighting → Independent Variables → Stable Prediction

SLIDE 65

Thinking from the Learning End

(Figure: densities P_train(x) and P_test(x); the model's error is small where the training density covers the test density and large where it does not.)

Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)

SLIDE 66

Stable Learning of Linear Models

  • Consider linear regression with a misspecification bias term b(x), bounded by |b(x)| ≤ ε for all x.
  • By accurately estimating β with the property that b(x) is uniformly small for all x, we can achieve stable learning.
  • However, the estimation error caused by the misspecification term scales inversely with λ_min, the smallest eigenvalue of the centered covariance matrix: it goes to infinity when perfect collinearity exists!

Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)

SLIDE 67

Toy Example

  • Assume the design matrix X consists of two variables X_1, X_2, generated from a multivariate normal distribution.
  • By changing the correlation ρ, we can simulate different extents of collinearity.
  • To induce bias related to collinearity, we generate the bias term b(X) = Xv, where v is the eigenvector of the centered covariance matrix corresponding to its smallest eigenvalue λ_min.
  • The bias term is sensitive to collinearity.

Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)

SLIDE 68

Simulation Results

(Figure: as collinearity increases, the estimation error (bias) grows and the variance across different distributions grows.)

Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)

SLIDE 69

Reducing Collinearity by Sample Reweighting

Idea: learn a new set of sample weights w(x) that decorrelate the input variables and increase the smallest eigenvalue.

  • Weighted least squares estimation on the reweighted sample, which is equivalent to ordinary least squares under the reweighted distribution (a sketch follows). So, how do we find an "oracle" distribution with the desired property?

Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
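Given learned weights w(x), the reweighted estimation itself is just weighted least squares; a minimal sketch with assumed names:

```python
# Weighted least squares sketch: minimize sum_i w_i (y_i - x_i^T beta)^2 by
# scaling rows with sqrt(w_i) and solving ordinary least squares.
import numpy as np

def weighted_least_squares(X, y, w):
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    return beta
```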

SLIDE 70

Sample Reweighted Decorrelation Operator (cont.)

Decorrelation: resample entries column by column, with indices i, j, k, r, s, t drawn from 1 … n at random.

  • By treating the different columns independently while performing random resampling, we obtain a column-decorrelated design matrix with the same marginals as before.
  • Then we can use density ratio estimation to get w(x) (a sketch follows).

Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
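A compact sketch of this two-step recipe as I read it (assumed names, not the paper's code): build a column-wise resampled copy of X, then estimate the density ratio with a probabilistic classifier.

```python
# Decorrelation-by-resampling sketch: permuting each column independently
# keeps marginals but breaks cross-column dependence; a classifier between
# the original and resampled samples yields density-ratio weights w(x).
import numpy as np
from sklearn.linear_model import LogisticRegression

def decorrelation_weights(X, seed=0):
    rng = np.random.default_rng(seed)
    X_dec = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    Z = np.vstack([X, X_dec])
    lbl = np.r_[np.zeros(len(X)), np.ones(len(X_dec))]   # 1 = decorrelated copy
    clf = LogisticRegression(max_iter=1000).fit(Z, lbl)
    p = clf.predict_proba(X)[:, 1]
    return p / (1 - p)        # w(x) ~ p_decorrelated(x) / p_original(x)
```

These weights then feed the weighted least squares step sketched under SLIDE 69.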

SLIDE 71

Experimental Results

  • Simulation study

Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)

SLIDE 72

Experimental Results

  • Regression
  • Classification

Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)

SLIDE 73

Disentangled Representation Learning

From decorrelating input variables to learning disentangled representations.

  • Learning multiple levels of abstraction
  • The big payoff of deep learning is to allow learning higher levels of abstraction
  • Higher-level abstractions disentangle the factors of variation, which allows much easier generalization and transfer

Yoshua Bengio, From Deep Learning of Disentangled Representations to Higher-level Cognition. YouTube, retrieved 22 February 2019.

SLIDE 74

Disentanglement for Causality

  • Causal / mechanism independence
  • Independently Controllable Factors (Thomas, Bengio et al., 2017): jointly optimize a policy and a representation feature so that each policy selectively changes the factor of variation its feature corresponds to (minimizing a selectivity loss)
  • Requires subtle design of the policy set to guarantee causality

SLIDE 75

Sectional Summary

  • Causal inference provides valuable insights for stable learning
  • A complete causal structure describes the data generation process, necessarily leading to stable prediction
  • Stable learning can also help to advance causal inference
  • Performance-driven and practical applications

Benchmarks are important!

SLIDE 76

Outline

  • Correlation vs. Causality
  • Causal Inference
  • Stable Learning
  • NICO: An Image Dataset for Stable Learning
  • Future Directions and Conclusions

SLIDE 77

Non-I.I.D. Image Classification

  • Non-I.I.D. image classification: the joint distribution differs between training and testing, P(X_train, Y_train) ≠ P(X_test, Y_test), with the training distribution known and the testing distribution unknown
  • Two tasks:
  • Targeted Non-I.I.D. image classification: prior knowledge of the testing data is available (e.g. transfer learning, domain adaptation)
  • General Non-I.I.D. image classification: the testing distribution is unknown, no prior; more practical and realistic

SLIDE 78

Existence of Non-I.I.D.ness

  • One metric (NI) for Non-I.I.D.ness: the (normalized) distribution shift between the training and testing data of each class
  • Existence of Non-I.I.D.ness on a dataset consisting of 10 subclasses from ImageNet:
  • For each class, split training data and testing data
  • Train a CNN for prediction
  • A strong correlation between NI and prediction error is ubiquitous (a hedged sketch of such an index follows)
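The exact NI formula is given in the NICO paper; as a hedged approximation (assumed names and normalization), one can measure the shift between first moments of feature representations:

```python
# Rough NI-style index sketch: distance between mean feature vectors of
# training and testing data, normalized by the feature scale. This is an
# approximation in the spirit of the slide, not the paper's exact definition.
import numpy as np

def ni_index(feat_train, feat_test, eps=1e-12):
    mu_tr = feat_train.mean(axis=0)
    mu_te = feat_test.mean(axis=0)
    sigma = feat_train.std(axis=0).mean() + eps   # normalization term
    return np.linalg.norm(mu_tr - mu_te) / sigma
```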

SLIDE 79

Related Datasets

  • ImageNet, PASCAL VOC, MSCOCO (average NI: 2.7)
  • NI is ubiquitous but small on these datasets
  • NI is uncontrollable on them, not friendly for the Non-I.I.D. setting

A dataset designed for Non-I.I.D. image classification is needed.

SLIDE 80

NICO - Non-I.I.D. Image Dataset with Contexts

  • NICO labels:
  • Object label: e.g. dog
  • Contextual labels (contexts): the background or scene of an object, e.g. grass/water
  • Structure of NICO: 2 superclasses (Animal, Vehicle); 10 classes per superclass (e.g. dog, train); 10 contexts per class (e.g. grass, on bridge); diverse, meaningful, and overlapping

SLIDE 81

NICO - Non-I.I.D. Image Dataset with Contexts

  • Data size of each class in NICO:
  • Sample size: thousands of images for each class
  • Each superclass: about 10,000 images
  • Sufficient for training some basic neural networks (CNNs)
  • Samples come with contexts in NICO

SLIDE 82

Controlling NI on the NICO Dataset

  • Minimum bias (comparing with ImageNet)
  • Proportional bias (controllable): the number of samples in each context
  • Compositional bias (controllable): the number of contexts that are observed

SLIDE 83

Minimum Bias

  • In this setting, random sampling leads to the minimum distribution shift between training and testing distributions, which simulates a nearly i.i.d. scenario.
  • 8000 samples for training and 2000 samples for testing in each superclass (ConvNet).

Superclass | Average NI | Testing Accuracy
Animal | 3.85 | 49.6%
Vehicle | 3.20 | 63.0%

Images in NICO carry rich contextual information, making classification more challenging. The average NI on ImageNet is 2.7, so NICO is more non-i.i.d. and more challenging.

SLIDE 84

Proportional Bias

  • Given a class, when sampling positive samples we use all contexts for both training and testing, but the percentage of each context differs between the training and testing data.
  • Setting: one dominant context in training (e.g. 55%, with 5% for each of the other nine), while testing keeps a 1:1 ratio across contexts.
  • (Figure: NI rises from about 4.0 to about 4.5 as the dominant ratio in the training data grows from 1:1 to 6:1.)
  • We can control NI by varying the dominant ratio.

SLIDE 85

Compositional Bias

  • Given a class, the observed contexts differ between training and testing data.
  • Moderate setting (training and testing contexts overlap): NI grows as the number of contexts observed in training drops from 7 to 3 (figure: NI up to about 4.34).
  • Radical setting (no overlap, plus a dominant ratio): NI grows with the dominant ratio in the training data from 1:1 to 5:1, with testing kept at 1:1 (figure: NI up to about 4.44).

SLIDE 86

NICO - Non-I.I.D. Image Dataset with Contexts

  • Large and controllable NI: NICO supports settings ranging from small NI (nearly i.i.d.) to large NI.

SLIDE 87

NICO - Non-I.I.D. Image Dataset with Contexts

  • The dataset can be downloaded from (temporary address): https://www.dropbox.com/sh/8mouawi5guaupyb/AAD4fdySrA6fn3PgSmhKwFgva?dl=0
  • Please refer to the following paper for details: Yue He, Zheyan Shen, Peng Cui. NICO: A Dataset Towards Non-I.I.D. Image Classification. https://arxiv.org/pdf/1906.02899.pdf

SLIDE 88

Outline

  • Correlation vs. Causality
  • Causal Inference
  • Stable Learning
  • NICO: An Image Dataset for Stable Learning
  • Conclusions

SLIDE 89

Conclusions

  • Predictive modeling is not only about accuracy.
  • Stability is critical for us to trust a predictive model.
  • Causality has been demonstrated to be useful for stable prediction.
  • How to marry causality with predictive modeling effectively and efficiently is still an open problem.

SLIDE 90

Conclusions

From debiasing to prediction: causal inference (propensity score, direct confounder balancing) feeds into stable learning (global balancing, linear stable learning, disentangled learning).

SLIDE 91

Reference

  • Shen Z, Cui P, Kuang K, et al. Causally regularized learning with agnostic data selection bias. ACM Multimedia, 2018: 411-419.
  • Kuang K, Cui P, Athey S, et al. Stable prediction across unknown environments. KDD, 2018: 1617-1626.
  • Kuang K, Cui P, Li B, et al. Estimating treatment effect in the wild via differentiated confounder balancing. KDD, 2017: 265-274.
  • Kuang K, Cui P, Li B, et al. Treatment effect estimation with data-driven variable decomposition. AAAI, 2017.
  • Kuang K, Jiang M, Cui P, et al. Steering social media promotions with effective strategies. ICDM, 2016: 985-990.

SLIDE 92

Reference

  • Pearl J. Causality. Cambridge University Press, 2009.
  • Austin P C. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 2011, 46(3): 399-424.
  • Johansson F, Shalit U, Sontag D. Learning representations for counterfactual inference. ICML, 2016: 3020-3029.
  • Shalit U, Johansson F D, Sontag D. Estimating individual treatment effect: generalization bounds and algorithms. ICML, 2017: 3076-3085.
  • Johansson F D, Kallus N, Shalit U, et al. Learning weighted representations for generalization across designs. arXiv preprint arXiv:1802.08598, 2018.
  • Louizos C, Shalit U, Mooij J M, et al. Causal effect inference with deep latent-variable models. NeurIPS, 2017: 6446-6456.
  • Thomas V, Bengio E, Fedus W, et al. Disentangling the independently controllable factors of variation by interacting with the world. arXiv preprint arXiv:1802.09484, 2018.
  • Bengio Y, Deleu T, Rahaman N, et al. A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912, 2019.

SLIDE 93

Reference

  • Yu B. Stability. Bernoulli, 2013, 19(4): 1484-1500.
  • Szegedy C, Zaremba W, Sutskever I, et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Volpi R, Namkoong H, Sener O, et al. Generalizing to unseen domains via adversarial data augmentation. NeurIPS, 2018: 5334-5344.
  • Ye N, Zhu Z. Bayesian adversarial learning. NeurIPS, 2018: 6892-6901.
  • Muandet K, Balduzzi D, Schölkopf B. Domain generalization via invariant feature representation. ICML, 2013: 10-18.
  • Peters J, Bühlmann P, Meinshausen N. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016, 78(5): 947-1012.
  • Rojas-Carulla M, Schölkopf B, Turner R, et al. Invariant models for causal transfer learning. The Journal of Machine Learning Research, 2018, 19(1): 1309-1342.
  • Rothenhäusler D, Meinshausen N, Bühlmann P, et al. Anchor regression: heterogeneous data meets causality. arXiv preprint arXiv:1801.06229, 2018.

SLIDE 94

Acknowledgement

Kun Kuang (Tsinghua U), Zheyan Shen (Tsinghua U), Hao Zou (Tsinghua U), Yue He (Tsinghua U), Susan Athey (Stanford U), Bo Li (Tsinghua U)

SLIDE 95

Thanks!

Peng Cui
cuip@tsinghua.edu.cn
http://pengcui.thumedialab.com