1. Causality – in a wide sense, Lecture IV. Peter Bühlmann, Seminar for Statistics, ETH Zürich.

2. Recap from yesterday: data from different known observed environments or experimental conditions or perturbations or sub-populations $e \in \mathcal{E}$:
$$ (X^e, Y^e) \sim F^e, \quad e \in \mathcal{E}, $$
with response variables $Y^e$ and predictor variables $X^e$. Consider "many possible" but mostly non-observed environments/perturbations $\mathcal{F} \supset \underbrace{\mathcal{E}}_{\text{observed}}$.
A pragmatic prediction problem: predict $Y$ given $X$ such that the prediction works well (is "robust") for "many possible" environments $e \in \mathcal{F}$, based on data from much fewer environments from $\mathcal{E}$.

3. The causal parameter optimizes a worst-case risk:
$$ \beta^{\mathrm{causal}} \in \operatorname*{argmin}_{\beta} \max_{e \in \mathcal{F}} \mathbb{E}\big[ (Y^e - (X^e)^T \beta)^2 \big] $$
if $\mathcal{F} = \{$ arbitrarily strong perturbations not acting directly on $Y \}$.
Agenda for today: consider other classes $\mathcal{F}$ ... and give up on causality.

4. Anchor regression: a way to formalize the extrapolation from $\mathcal{E}$ to $\mathcal{F}$ (Rothenhäusler, Meinshausen, PB & Peters, 2018).
The environments from before, denoted as $e$: they are now outcomes of a variable $A$ (the "anchor").
[Graph: anchor $A$ and hidden $H$; predictors $X$ and response $Y$; the edge $X \to Y$ carries the unknown parameter $\beta^0$ (marked "?").]

5. Anchor regression and causal regularization (Rothenhäusler, Meinshausen, PB & Peters, 2018).
The environments from before, denoted as $e$: they are now outcomes of a variable $A$ (the "anchor").
[Graph: anchor $A$, hidden $H$, predictors $X$, response $Y$; edge $X \to Y$ with parameter $\beta^0$.]
$$ Y \leftarrow X \beta^0 + \varepsilon_Y + H \delta, \qquad X \leftarrow A \alpha^0 + \varepsilon_X + H \gamma . $$
Instrumental variables regression model (cf. Angrist, Imbens, Lemieux, Newey, Rosenbaum, Rubin, ...).

6. Anchor regression and causal regularization (Rothenhäusler, Meinshausen, PB & Peters, 2018).
The environments from before, denoted as $e$: they are now outcomes of a variable $A$ (the "anchor"). $A$ is an "anchor": a source node!
[Graph: anchor $A$, hidden $H$, predictors $X$, response $Y$; edge $X \to Y$ with parameter $\beta^0$.]
❀ Anchor regression:
$$ \begin{pmatrix} X \\ Y \\ H \end{pmatrix} \leftarrow B \begin{pmatrix} X \\ Y \\ H \end{pmatrix} + \varepsilon + M A . $$

7. Anchor regression and causal regularization (Rothenhäusler, Meinshausen, PB & Peters, 2018).
The environments from before, denoted as $e$: they are now outcomes of a variable $A$ (the "anchor"). $A$ is an "anchor": a source node! The model also allows for feedback loops.
[Graph: anchor $A$, hidden $H$, predictors $X$, response $Y$; edge $X \to Y$ with parameter $\beta^0$.]
❀ Anchor regression:
$$ \begin{pmatrix} X \\ Y \\ H \end{pmatrix} \leftarrow B \begin{pmatrix} X \\ Y \\ H \end{pmatrix} + \varepsilon + M A . $$

8. Allow that $A$ acts on $Y$ and $H$ ❀ there is a fundamental identifiability problem: we cannot identify $\beta^0$. This is the price for more realistic assumptions than in the IV model.

9. ... but "causal regularization" offers something: find a parameter vector $\beta$ such that the residuals $(Y - X\beta)$ stabilize, i.e. have the same distribution across perturbations of $A$ = environments/sub-populations. We want to encourage orthogonality of the residuals with $A$, something like
$$ \tilde{\beta} = \operatorname*{argmin}_{\beta}\ \|Y - X\beta\|_2^2 / n + \xi\, \|A^T (Y - X\beta) / n\|_2^2 . $$

10.
$$ \tilde{\beta} = \operatorname*{argmin}_{\beta}\ \|Y - X\beta\|_2^2 / n + \xi\, \|A^T (Y - X\beta) / n\|_2^2 $$
Causal regularization:
$$ \hat{\beta} = \operatorname*{argmin}_{\beta}\ \|(I - \Pi_A)(Y - X\beta)\|_2^2 / n + \gamma\, \|\Pi_A (Y - X\beta)\|_2^2 / n , $$
where $\Pi_A = A (A^T A)^{-1} A^T$ (projection onto the column space of $A$).
◮ for $\gamma = 1$: least squares
◮ for $\gamma = 0$: adjusting for heterogeneity due to $A$
◮ for $0 \le \gamma < \infty$: general causal regularization
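For intuition on the three bullet points above, here is a short closed-form derivation of the unpenalized criterion (standard linear algebra, not spelled out on the slide):

```latex
% Because (I - \Pi_A) and \Pi_A are orthogonal projections, the cross terms
% vanish and the criterion is a quadratic form in the residual Y - X\beta:
\[
  \|(I-\Pi_A)(Y-X\beta)\|_2^2 + \gamma\,\|\Pi_A(Y-X\beta)\|_2^2
  = (Y-X\beta)^\top D_\gamma\,(Y-X\beta),
  \qquad D_\gamma := (I-\Pi_A) + \gamma\,\Pi_A .
\]
% Setting the gradient with respect to \beta to zero gives a weighted
% least-squares solution:
\[
  \hat\beta = \bigl(X^\top D_\gamma X\bigr)^{-1} X^\top D_\gamma Y .
\]
% \gamma = 1: D_1 = I, i.e. ordinary least squares;
% \gamma = 0: D_0 = I - \Pi_A, i.e. least squares after partialling out A.
```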

11.
$$ \tilde{\beta} = \operatorname*{argmin}_{\beta}\ \|Y - X\beta\|_2^2 / n + \xi\, \|A^T (Y - X\beta) / n\|_2^2 $$
Causal regularization:
$$ \hat{\beta} = \operatorname*{argmin}_{\beta}\ \|(I - \Pi_A)(Y - X\beta)\|_2^2 / n + \gamma\, \|\Pi_A (Y - X\beta)\|_2^2 / n + \lambda \|\beta\|_1 , $$
where $\Pi_A = A (A^T A)^{-1} A^T$ (projection onto the column space of $A$).
◮ for $\gamma = 1$: least squares + $\ell_1$-penalty
◮ for $\gamma = 0$: adjusting for heterogeneity due to $A$ + $\ell_1$-penalty
◮ for $0 \le \gamma < \infty$: general causal regularization + $\ell_1$-penalty
A convex optimization problem.

12. It's simply a linear transformation: consider
$$ W_\gamma = I - (1 - \sqrt{\gamma})\, \Pi_A, \qquad \tilde{X} = W_\gamma X, \qquad \tilde{Y} = W_\gamma Y . $$
Then ($\ell_1$-regularized) anchor regression is (Lasso-penalized) least squares of $\tilde{Y}$ versus $\tilde{X}$ ❀ super-easy (but one has to choose a tuning parameter $\gamma$).
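A minimal implementation sketch in Python/NumPy (function and variable names are my own, and the use of scikit-learn is an assumption, not part of the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def anchor_transform(Z, A, gamma):
    """Apply W_gamma = I - (1 - sqrt(gamma)) * Pi_A to the rows of Z,
    where Pi_A is the projection onto the column space of A."""
    proj = A @ np.linalg.lstsq(A, Z, rcond=None)[0]   # Pi_A Z
    return Z - (1.0 - np.sqrt(gamma)) * proj

def anchor_regression(X, Y, A, gamma, lam=None):
    """Anchor regression as (Lasso-penalized) least squares of
    W_gamma Y on W_gamma X; gamma = 1 recovers OLS / a plain Lasso."""
    X_tilde = anchor_transform(X, A, gamma)
    Y_tilde = anchor_transform(Y, A, gamma)
    model = (LinearRegression(fit_intercept=False) if lam is None
             else Lasso(alpha=lam, fit_intercept=False))
    model.fit(X_tilde, Y_tilde)
    return model.coef_
```

The tuning parameter gamma still has to be chosen, e.g. according to how strong a perturbation one wants to protect against.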

13. ... there is a fundamental identifiability problem ... but causal regularization solves
$$ \operatorname*{argmin}_{\beta} \max_{e \in \mathcal{F}} \mathbb{E}\big[ |Y^e - X^e \beta|^2 \big] $$
for a certain class of shift perturbations $\mathcal{F}$.
Recap: the causal parameter solves $\operatorname*{argmin}_{\beta} \max_{e \in \mathcal{F}} \mathbb{E}[ |Y^e - X^e \beta|^2 ]$ for $\mathcal{F}$ = "essentially all" perturbations.

14. Model for $\mathcal{F}$: shift perturbations.
Model for the observed heterogeneous data ("corresponding to $\mathcal{E}$"):
$$ \begin{pmatrix} X \\ Y \\ H \end{pmatrix} = B \begin{pmatrix} X \\ Y \\ H \end{pmatrix} + \varepsilon + M A . $$
Model for the unobserved perturbations $\mathcal{F}$ (in test data): shift vectors $v$ acting on (components of) $X, Y, H$:
$$ \begin{pmatrix} X^v \\ Y^v \\ H^v \end{pmatrix} = B \begin{pmatrix} X^v \\ Y^v \\ H^v \end{pmatrix} + \varepsilon + v , $$
where $v \in C_\gamma \subset \operatorname{span}(M)$, with $\gamma$ measuring the size of $v$, i.e.
$$ C_\gamma = \{ v ;\ v = M u \text{ for some } u \text{ with } \mathbb{E}[u u^T] \preceq \gamma\, \mathbb{E}[A A^T] \} . $$
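A toy illustration of the perturbation class (my own construction; the matrices and numbers below are made up for illustration): sample a shift vector $v = M u$ whose size is controlled through $\mathbb{E}[u u^T] \preceq \gamma\, \mathbb{E}[A A^T]$.

```python
import numpy as np

rng = np.random.default_rng(0)
q, d = 2, 3                      # q anchor variables, d = dim of stacked (X, Y, H)
M = rng.normal(size=(d, q))      # how the anchors enter the system
Sigma_A = np.array([[1.0, 0.3],  # E[A A^T], known/estimated from training data
                    [0.3, 1.0]])

gamma = 4.0
# draw u with E[u u^T] = gamma * Sigma_A (the boundary of the constraint)
L = np.linalg.cholesky(gamma * Sigma_A)
u = L @ rng.normal(size=q)
v = M @ u                        # an admissible shift vector in C_gamma
print("shift vector v acting on (X, Y, H):", v)
```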

15. A fundamental duality theorem (Rothenhäusler, Meinshausen, PB & Peters, 2018).
Let $P_A$ be the population projection onto $A$: $P_A \bullet = \mathbb{E}[\, \bullet \mid A \,]$. For any $\beta$,
$$ \max_{v \in C_\gamma} \mathbb{E}\big[ |Y^v - X^v \beta|^2 \big] = \mathbb{E}\Big[ \big( (\mathrm{Id} - P_A)(Y - X\beta) \big)^2 \Big] + \gamma\, \mathbb{E}\Big[ \big( P_A (Y - X\beta) \big)^2 \Big] $$
$$ \approx \|(I - \Pi_A)(Y - X\beta)\|_2^2 / n + \gamma\, \|\Pi_A (Y - X\beta)\|_2^2 / n \quad \text{(the objective function on data)} . $$
Worst-case shift interventions ←→ regularization! (in the population case)

16. For any $\beta$:
$$ \underbrace{\max_{v \in C_\gamma} \mathbb{E}\big[ |Y^v - X^v \beta|^2 \big]}_{\text{worst-case test error}} = \underbrace{\mathbb{E}\Big[ \big( (\mathrm{Id} - P_A)(Y - X\beta) \big)^2 \Big] + \gamma\, \mathbb{E}\Big[ \big( P_A (Y - X\beta) \big)^2 \Big]}_{\text{criterion on the training population}} $$

17.
$$ \operatorname*{argmin}_{\beta}\ \underbrace{\max_{v \in C_\gamma} \mathbb{E}\big[ |Y^v - X^v \beta|^2 \big]}_{\text{worst-case test error}} = \operatorname*{argmin}_{\beta}\ \underbrace{\mathbb{E}\Big[ \big( (\mathrm{Id} - P_A)(Y - X\beta) \big)^2 \Big] + \gamma\, \mathbb{E}\Big[ \big( P_A (Y - X\beta) \big)^2 \Big]}_{\text{criterion on the training population}} $$
and "therefore" also a finite-sample guarantee: the estimator
$$ \hat{\beta} = \operatorname*{argmin}_{\beta}\ \|(I - \Pi_A)(Y - X\beta)\|_2^2 / n + \gamma\, \|\Pi_A (Y - X\beta)\|_2^2 / n \ (+\ \lambda \|\beta\|_1) $$
leads to predictive stability (i.e. it optimizes a worst-case risk).
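A self-contained toy check of this predictive-stability claim (all coefficients, the value of gamma, and the simulation design are my own illustration, not from the slides): fit ordinary least squares (gamma = 1) and anchor regression with a larger gamma on confounded training data, then evaluate both under a strong shift on X.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, shift=0.0):
    A = rng.normal(size=(n, 1))                           # anchor
    H = rng.normal(size=(n, 1))                           # hidden confounder
    X = 1.0 * A + 0.8 * H + rng.normal(size=(n, 1)) + shift
    Y = 0.5 * X + 0.8 * H + rng.normal(size=(n, 1))       # causal beta0 = 0.5
    return A, X, Y

def fit_anchor(X, Y, A, gamma):
    """Unpenalized anchor regression via the W_gamma transformation."""
    proj = lambda Z: A @ np.linalg.lstsq(A, Z, rcond=None)[0]
    W = lambda Z: Z - (1.0 - np.sqrt(gamma)) * proj(Z)
    return np.linalg.lstsq(W(X), W(Y), rcond=None)[0]

A_tr, X_tr, Y_tr = simulate(50_000)
beta_ols = fit_anchor(X_tr, Y_tr, A_tr, gamma=1.0)        # gamma = 1: least squares
beta_anchor = fit_anchor(X_tr, Y_tr, A_tr, gamma=25.0)    # larger gamma: more protection

_, X_te, Y_te = simulate(50_000, shift=5.0)               # strong shift intervention on X
for name, b in [("OLS (gamma=1)", beta_ols), ("anchor (gamma=25)", beta_anchor)]:
    mse = float(np.mean((Y_te - X_te @ b) ** 2))
    print(f"{name:>18s}: test MSE under shift = {mse:.2f}")
```

In this toy model the OLS coefficient is pulled away from beta0 by the hidden confounder, so it suffers noticeably more under the shift, while the anchor estimate with larger gamma remains more stable.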

18. Fundamental duality in the anchor regression model:
$$ \max_{v \in C_\gamma} \mathbb{E}\big[ |Y^v - X^v \beta|^2 \big] = \mathbb{E}\Big[ \big( (\mathrm{Id} - P_A)(Y - X\beta) \big)^2 \Big] + \gamma\, \mathbb{E}\Big[ \big( P_A (Y - X\beta) \big)^2 \Big] $$
❀ robustness ←→ causal regularization:
causality (e.g. Judea Pearl) ←→ adversarial robustness in machine learning, generative networks (e.g. Ian Goodfellow).

19. Robustness ←→ causal regularization. The languages are rather different:
Causality:
◮ causal graphs
◮ Markov properties on graphs
◮ perturbation models
◮ identifiability of systems
◮ transferability of systems
◮ ...
Robustness:
◮ metric for robustness (Wasserstein, f-divergence)
◮ minimax optimality
◮ inner and outer optimization
◮ regularization
◮ ...
Mathematics allows us to classify equivalences and differences ❀ this can be exploited for better methods and algorithms, taking "the good" from both worlds!

20. Indeed: causal regularization is nowadays used (still a "side-branch") in robust deep learning; Bottou et al. (2013), ..., Heinze-Deml & Meinshausen (2017), ... And indeed, we can improve prediction.

21. Stickmen classification (Heinze-Deml & Meinshausen, 2017).
Classification into {child, adult} based on stickmen images; 5-layer CNN, training data ($n = 20'000$).

                              5-layer CNN    5-layer CNN with some causal regularization
  training set                     4%            4%
  test set 1                       3%            4%
  test set 2 (domain shift)       41%            9%

In the training set and in test set 1, children show stronger movement than adults; in the test set 2 data, adults show stronger movement: the spurious correlation between age and movement is reversed!

22. Connection to distributionally robust optimization (Ben-Tal, El Ghaoui & Nemirovski, 2009; Sinha, Namkoong & Duchi, 2017):
$$ \operatorname*{argmin}_{\beta} \max_{P \in \mathcal{P}} \mathbb{E}_P\big[ (Y - X\beta)^2 \big] , $$
where the perturbations are within a class of distributions $\mathcal{P} = \{ P ;\ d(P, \underbrace{P^0}_{\text{emp. distrib.}}) \le \rho \}$. The "model" is the metric $d(\cdot, \cdot)$ and it is simply postulated, often as the Wasserstein distance.
[Figure: perturbations from distributional robustness — a ball of radius $\rho$ around the empirical distribution in the metric $d(\cdot, \cdot)$.]

23. Our anchor regression approach:
$$ b^\gamma = \operatorname*{argmin}_{\beta} \max_{v \in C_\gamma} \mathbb{E}\big[ |Y^v - X^v \beta|^2 \big] . $$
The perturbations are assumed to come from a causal-type model; the class of perturbations is learned from data.

24. Anchor regression versus robust optimization.
[Figure: anchor regression — perturbations learned from data and amplified; robust optimization — a pre-specified radius.]
Anchor regression: the class of perturbations is an amplification of the observed and learned heterogeneity from $\mathcal{E}$.

25. Science aims for causal understanding ... but this may be a bit ambitious: in the absence of randomized studies, causal inference necessarily requires (often untestable) additional assumptions. In the anchor regression model we cannot find/identify the causal ("systems") parameter $\beta^0$.
[Graph: anchor $A$, hidden $H$, predictors $X$, response $Y$; edge $X \to Y$ with parameter $\beta^0$.]

26. The parameter $b^{\to\infty}$: "diluted causality".
$$ b^\gamma = \operatorname*{argmin}_{\beta}\ \mathbb{E}\Big[ \big( (\mathrm{Id} - P_A)(Y - X\beta) \big)^2 \Big] + \gamma\, \mathbb{E}\Big[ \big( P_A (Y - X\beta) \big)^2 \Big] , \qquad b^{\to\infty} = \lim_{\gamma \to \infty} b^\gamma . $$
By the fundamental duality, this leads to "invariance": $b^{\to\infty}$ is the parameter which optimizes the worst-case prediction risk over shift interventions of arbitrary strength. It is generally not the causal parameter, but because of the shift invariance we name it "diluted causal". Note: causal = invariance w.r.t. very many perturbations.
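One way to read the limit (my own paraphrase, under the assumption that the limit is well defined): as $\gamma \to \infty$ the penalized criterion drives the part of the residual explained by $A$ down to its minimum, which is exactly the "invariance" of residuals across the anchor-driven environments.

```latex
% Sketch (my paraphrase, not on the slide): as gamma -> infinity, the limit
% b^{-> infinity} minimizes the A-explained part of the residual,
\[
  \mathbb{E}\!\left[\bigl(P_A\,(Y - X\,b^{\to\infty})\bigr)^2\right]
  \;=\; \min_{\beta}\ \mathbb{E}\!\left[\bigl(P_A\,(Y - X\beta)\bigr)^2\right] .
\]
% In particular, if some \beta achieves E[Y - X\beta | A] = 0, then also
% E[Y - X b^{-> infinity} | A] = 0: the residuals carry no signal from A,
% which is the "invariance" across anchor-driven environments referred to above.
```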
