Causality – in a wide sense, Lecture III
Peter Bühlmann
Seminar for Statistics, ETH Zürich

Recap from yesterday:
◮ causality is giving a prediction for an intervention/manipulation
Predicting a potential outcome: manipulate x = −8
[figure: scatter plot of y versus x, with the potential outcome Y at the manipulated value x = −8]

It's an ambitious problem!
◮ observational data plus interventional data is much more informative than observational data alone
◮ the do-intervention model is simple and easy to understand, but geared towards interventions on single variables
Invariant Causal Prediction

Invariance Assumption (w.r.t. E): there exists S* ⊆ {1, . . . , d} such that
L(Y^e | X^e_{S*}) is invariant across e ∈ E

in the linear model setting: there exists a vector γ* with supp(γ*) = S* = {j; γ*_j ≠ 0} such that
∀ e ∈ E: Y^e = X^e γ* + ε^e,   ε^e ⊥ X^e_{S*}
with ε^e ∼ F_ε the same for all e, and X^e having an arbitrary distribution, different across e
Invariance Assumption (w.r.t. F): there exists S* ⊆ {1, . . . , d} such that
L(Y^e | X^e_{S*}) is invariant across e ∈ F

in the linear model setting: there exists a vector γ* with supp(γ*) = S* = {j; γ*_j ≠ 0} such that
∀ e ∈ F: Y^e = X^e γ* + ε^e,   ε^e ⊥ X^e_{S*}
with ε^e ∼ F_ε the same for all e, and X^e having an arbitrary distribution, different across e
if e ∈ F
◮ does not act directly on Y
◮ does not change the relation between X and Y
then: S_causal = pa(Y) satisfies the Invariance Assumption w.r.t. F
causal structure/variables ⟹ invariance

The search for invariance and causality (Peters, PB & Meinshausen, 2016):
causal structure/variables ⟸ invariance
[figure: causal graph over Y and covariates X2, X3, X5, X7, X8, X10, X11]

one can perform a statistical test of whether a subset S of covariates satisfies the invariance assumption:
H_{0,S}(E): L(Y^e | X^e_S) is invariant across e ∈ E
in a linear model ❀ Chow test (1960)
❀ sets S1, . . . , Sk which are statistically compatible with the invariance assumption H_{0,S}(E)
making it identifiable:
Ŝ(E) = ⋂ {S; H_{0,S}(E) not rejected}

Theorem (Peters, PB and Meinshausen, 2016): assume a structural equation model with
◮ a linear model for Y versus X, Gaussian errors
◮ e ∈ E does not act directly on Y and does not change the relation between X and Y
Then: P[Ŝ(E) ⊆ S_causal = pa(Y)] ≥ 1 − α
a confidence guarantee against false positive causal selection
ICP = Invariant Causal Prediction
Proof: note that the causal set S_causal leads to invariance; hence
P[Ŝ(E) ⊆ S_causal] ≥ P[H_{0,S_causal}(E) not rejected] ≥ 1 − α  □
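To make the test-and-intersect recipe concrete: a minimal Python sketch of ICP for two environments, using the Chow test for invariance of a linear regression. This is a simplification for illustration (exhaustive subset search, two environments only); the actual method of Peters et al. (2016) is implemented in the R package InvariantCausalPrediction.

```python
import itertools
import numpy as np
from scipy import stats

def chow_pvalue(X1, y1, X2, y2):
    """Chow (1960) test: are the linear regressions of y on X equal
    in the two environments?"""
    def rss(X, y):
        Xi = np.column_stack([np.ones(len(y)), X])          # add intercept
        coef, *_ = np.linalg.lstsq(Xi, y, rcond=None)
        r = y - Xi @ coef
        return r @ r, Xi.shape[1]
    rss_pool, k = rss(np.vstack([X1, X2]), np.concatenate([y1, y2]))
    rss1, _ = rss(X1, y1)
    rss2, _ = rss(X2, y2)
    n = len(y1) + len(y2)
    f = ((rss_pool - rss1 - rss2) / k) / ((rss1 + rss2) / (n - 2 * k))
    return stats.f.sf(f, k, n - 2 * k)

def icp(X1, y1, X2, y2, alpha=0.05):
    """S_hat(E) = intersection of all subsets S with H_{0,S}(E) not rejected
    (exhaustive all-subsets search, feasible only for small d)."""
    d = X1.shape[1]
    accepted = [set(S)
                for m in range(d + 1)
                for S in itertools.combinations(range(d), m)
                if chow_pvalue(X1[:, list(S)], y1, X2[:, list(S)], y2) > alpha]
    return set.intersection(*accepted) if accepted else set()
```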
Kemmeren et al. (2014): genome-wide mRNA expressions in yeast, d = 6170 genes
◮ n_obs = 160 “observational” samples of wild-types
◮ n_int = 1479 “interventional” samples, each corresponding to a single gene-deletion strain
for our method: we use |E| = 2 environments (observational and interventional data)

response of interest: Y = expression of the first gene, “covariates” X = expressions of all other genes; then Y = expression of the second gene, X = expressions of all other genes; and so on
goal: infer/predict the effects of unseen/new single gene deletions on all other genes

training-test data splitting: ❀ can validate the predicted effects of these interventions
multiplicity adjustment: since ICP is used 6170 times (once for every response variable), we use coverage 1 − α/6170 with α = 0.05
Results for inferring causal variables on a single training-test split: 8 genes are “significant” (at level α = 0.05) causal variables (each of the 8 genes “causes” one other gene)
not many findings... but we use a stringent criterion with Bonferroni-corrected level α/6170 = 0.05/6170 to control the familywise error rate
validation: thanks to the intervention experiments (in the test data) we can validate the method(s); we only consider true Strong Intervention Effects (SIEs)
SIE = the observed response value associated to an intervention is in the 1%- or 99%-tail of the observational data
6 out of the 8 “significant” genes are true SIEs!
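As a sketch, the SIE criterion for a single intervention (my direct transcription of the definition above):

```python
import numpy as np

def is_strong_intervention_effect(y_int, y_obs):
    """SIE: the response value observed under the intervention lies in the
    1%- or 99%-tail of the observational data for that gene."""
    lo, hi = np.quantile(y_obs, [0.01, 0.99])
    return y_int < lo or y_int > hi
```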
[figure: number of strong intervention effects versus number of intervention predictions, comparing PERFECT, INVARIANT, HIDDEN-INVARIANT, PC, RFCI, REGRESSION (CV-Lasso), GES and GIES, and RANDOM (99% prediction interval)]
I: invariant prediction method; H: invariant prediction with some hidden variables
Well... it's an ambitious problem: manipulate x = −8
[figure: scatter plot of y versus x, again with the potential outcome Y at the manipulated value x = −8]
Causal Dantzig (Rothenhäusler, PB & Meinshausen, 2019)

ICP (Invariant Causal Prediction)
◮ requires an all-subsets search
◮ does not allow for hidden confounding variables
◮ is rather general in terms of interventions/perturbations
we develop a methodology and algorithm which
◮ is computationally efficient (convex optimization)
◮ allows for hidden confounding
◮ is more restrictive w.r.t. interventions/perturbations
❀ Causal Dantzig estimator/algorithm
instead of invariance of conditional distributions, require
Assumption: inner product invariance under β*:
E[X^e_j (Y^e − X^e β*)] = E[X^{e'}_j (Y^{e'} − X^{e'} β*)]   ∀ e, e' ∈ E, ∀ j
Theorem: consider X ← BX + ε^0 ❀ Y = X_{p+1} = X^T β_causal + ε_Y.
Inner product invariance holds under the causal coefficient vector β_causal if
◮ the interventions/environments do not act directly on Y
◮ the interventions are additive noise interventions:
ε^e = ε^0 + δ^e,   E[ε^0] = 0,   Cov(ε^0, δ^e) = 0,   δ^e_Y ≡ 0
and the theorem extends to SEMs with measurement errors
ε^0 and δ^e can have dependent components ❀ hidden variables are covered
“reason”: [figure: hidden H pointing into both X and Y]
Y ← Xβ + Hδ + ε_Y = Xβ + η_Y
X ← Hγ + ε_X = η_X
the η error terms are now dependent!
Causal Dantzig without regularization (for low-dimensional settings): consider two environments e = 1 and e = 2
differences of Gram matrices:
Ẑ = n_1^{−1} (X^1)^T Y^1 − n_2^{−1} (X^2)^T Y^2
Ĝ = n_1^{−1} (X^1)^T X^1 − n_2^{−1} (X^2)^T X^2
under inner product invariance with β*: E[Ẑ − Ĝβ*] = 0
❀ β̂ = argmin_β ‖Ẑ − Ĝβ‖_∞
asymptotic Gaussian distribution with an explicit, estimable covariance matrix Γ
if β_causal is non-identifiable: the covariance matrix Γ is singular in certain directions ❀ infinite marginal confidence intervals for the non-identifiable coefficients β_causal,k
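A minimal sketch of the unregularized causal Dantzig for two environments; in the exactly identified case the argmin of ‖Ẑ − Ĝβ‖_∞ is the solution of Ĝβ = Ẑ, so a linear solve suffices (my simplification; the asymptotic covariance and confidence intervals are in Rothenhäusler et al., 2019):

```python
import numpy as np

def causal_dantzig(X1, y1, X2, y2):
    """Unregularized causal Dantzig from differences of Gram matrices
    between the environments e = 1 and e = 2."""
    n1, n2 = len(y1), len(y2)
    Z_hat = X1.T @ y1 / n1 - X2.T @ y2 / n2
    G_hat = X1.T @ X1 / n1 - X2.T @ X2 / n2
    # if G_hat is invertible, argmin_b ||Z_hat - G_hat b||_inf solves G_hat b = Z_hat
    return np.linalg.solve(G_hat, Z_hat)
```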
Regularized Causal Dantzig:
β̂ = argmin_β ‖β‖_1 such that ‖Ẑ − Ĝβ‖_∞ ≤ λ (sketched as a linear program below)
in analogy to the classical Dantzig selector (Candès & Tao, 2007), which uses Z̃ = n^{−1} X^T Y, G̃ = n^{−1} X^T X
using the machinery of high-dimensional statistics and assuming identifiability (e.g. δ^{e'}_j ≠ 0 for all components j, except δ^{e'}_Y = 0) ...
‖β̂ − β_causal‖_q ≤ O(s^{1/q} √(log(p)/min(n_1, n_2)))  for q ≥ 1
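The ℓ1-minimization under the sup-norm constraint is a linear program; a sketch via scipy, using the standard split β = u − v with u, v ≥ 0 (my reformulation, not the authors' code):

```python
import numpy as np
from scipy.optimize import linprog

def regularized_causal_dantzig(Z_hat, G_hat, lam):
    """argmin ||b||_1  subject to  ||Z_hat - G_hat b||_inf <= lam."""
    p = len(Z_hat)
    c = np.ones(2 * p)                                    # sum(u) + sum(v) = ||b||_1
    A_ub = np.block([[G_hat, -G_hat], [-G_hat, G_hat]])   # +/- G_hat b <= lam +/- Z_hat
    b_ub = np.concatenate([lam + Z_hat, lam - Z_hat])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]                          # b = u - v
```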
various options to deal with more than two environments: e.g. all pairs and aggregation
an application:
◮ p = 11 abundances of chemical reagents
◮ 8 different environments (not “well-defined” interventions; one of them observational, 7 with different reagents added)
◮ each environment contains n_e ≈ 700–1000 samples
goal: recover the network of causal relations (linear SEM)

[figure: network over Raf, Mek, PLCg, PIP2, PIP3, Erk, Akt, PKA, PKC, p38, JNK]

approach: “pairwise” invariant causal prediction (one variable is the response Y, the other 10 are the covariates X; do this 11 times, with every variable once the response)

[figure: estimated network over the 11 variables]

blue edges: found only by the invariant causal prediction approach (ICP)
red: found only by ICP allowing for hidden variables and feedback
purple: found by ICP both with and without hidden variables
solid: relations that have been reported in the literature
broken: new findings not reported in the literature

❀ reasonable consensus with existing results, but no real ground truth available; serves as an illustration that we can work with “vaguely defined interventions”
the causal parameter optimizes a worst-case risk:
argmin_β max_{e∈F} E[(Y^e − (X^e)^T β)^2] ∋ β_causal
if F = {arbitrarily strong perturbations not acting directly on Y}
agenda for today: consider other classes F ... and give up on causality
Anchor regression: a way to formalize the extrapolation from E to F (Rothenhäusler, Meinshausen, PB & Peters, 2018)

the environments from before, denoted as e, are now outcomes of a variable A
[figure: graph with A → X, X → Y with coefficient β0, and a hidden variable H pointing into X and Y]

Y ← X β0 + ε_Y + H δ
X ← A α0 + ε_X + H γ

this is the instrumental variables regression model (cf. Angrist, Imbens, Lemieux, Newey, Rosenbaum, Rubin, ...)

A is an “anchor”
❀ anchor regression, allowing also for feedback loops and for A acting on Y and H:
(X, Y, H)^T ← B (X, Y, H)^T + ε + MA
❀ there is a fundamental identifiability problem: one cannot identify β0
this is the price for more realistic assumptions than in the IV model
... but “causal regularization” offers something: find a parameter vector β such that the residuals (Y − Xβ) stabilize, i.e. have the same distribution across perturbations of A = environments/sub-populations
we want to encourage orthogonality of the residuals with A, something like
β̃ = argmin_β ‖Y − Xβ‖_2^2/n + ξ ‖A^T(Y − Xβ)/n‖_2^2

causal regularization:
β̂ = argmin_β ‖(I − Π_A)(Y − Xβ)‖_2^2/n + γ ‖Π_A(Y − Xβ)‖_2^2/n
Π_A = A(A^T A)^{−1} A^T (projection onto the column space of A)
◮ for γ = 1: least squares
◮ for γ = 0: adjusting for heterogeneity due to A
◮ for 0 ≤ γ < ∞: general causal regularization
in high dimensions, add an ℓ1-penalty:
β̂ = argmin_β ‖(I − Π_A)(Y − Xβ)‖_2^2/n + γ ‖Π_A(Y − Xβ)‖_2^2/n + λ‖β‖_1
◮ for γ = 1: least squares + ℓ1-penalty
◮ for γ = 0: adjusting for heterogeneity due to A + ℓ1-penalty
◮ for 0 ≤ γ < ∞: general causal regularization + ℓ1-penalty
It's simply a linear transformation: consider W_γ = I − (1 − √γ) Π_A, X̃ = W_γ X, Ỹ = W_γ Y
then (ℓ1-regularized) anchor regression is (Lasso-penalized) least squares of Ỹ versus X̃
❀ super-easy (but one has to choose a tuning parameter γ)
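A minimal sketch of anchor regression via this transformation, assuming centered numpy arrays X, Y, A (scikit-learn's Lasso handles the ℓ1-penalized variant):

```python
import numpy as np
from sklearn.linear_model import Lasso

def anchor_regression(X, Y, A, gamma, lam=0.0):
    """Anchor regression: transform with W_gamma = I - (1 - sqrt(gamma)) Pi_A,
    then run (Lasso-penalized) least squares on the transformed data."""
    Pi_A = A @ np.linalg.pinv(A)                 # projection onto column space of A
    W = np.eye(len(Y)) - (1.0 - np.sqrt(gamma)) * Pi_A
    X_t, Y_t = W @ X, W @ Y
    if lam > 0:
        return Lasso(alpha=lam, fit_intercept=False).fit(X_t, Y_t).coef_
    coef, *_ = np.linalg.lstsq(X_t, Y_t, rcond=None)
    return coef
```

Note that γ = 1 gives W = I and recovers ordinary least squares, while γ = 0 first regresses out A.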
... there is a fundamental identifiability problem ... but causal regularization solves
argmin_β max_{e∈F} E[|Y^e − X^e β|^2]
for a certain class of shift perturbations F
recap: the causal parameter solves argmin_β max_{e∈F} E[|Y^e − X^e β|^2] for F = “essentially all” perturbations
Model for F: shift perturbations
model for the observed heterogeneous data (“corresponding to E”):
(X, Y, H)^T = B (X, Y, H)^T + ε + MA
model for the unobserved perturbations F (in test data): shift vectors v acting on (components of) X, Y, H:
(X^v, Y^v, H^v)^T = B (X^v, Y^v, H^v)^T + ε + v
v ∈ C_γ ⊂ span(M), with γ measuring the size of v,
i.e. v ∈ C_γ = {v; v = Mu for some u with E[uu^T] ⪯ γ E[AA^T]}
A fundamental duality theorem (Rothenhäusler, Meinshausen, PB & Peters, 2018)
P_A denotes the population projection onto A: P_A • = E[• | A]
For any β:
max_{v∈C_γ} E[|Y^v − X^v β|^2] = E[|(Id − P_A)(Y − Xβ)|^2] + γ E[|P_A(Y − Xβ)|^2]
≈ ‖(I − Π_A)(Y − Xβ)‖_2^2/n + γ ‖Π_A(Y − Xβ)‖_2^2/n
worst-case shift interventions ←→ regularization!
in the population case, for any β the worst-case test error equals
max_{v∈C_γ} E[|Y^v − X^v β|^2] = E[|(Id − P_A)(Y − Xβ)|^2] + γ E[|P_A(Y − Xβ)|^2]
and hence
argmin_β max_{v∈C_γ} E[|Y^v − X^v β|^2] = argmin_β E[|(Id − P_A)(Y − Xβ)|^2] + γ E[|P_A(Y − Xβ)|^2]
and “therefore” also a finite-sample guarantee:
β̂ = argmin_β ‖(I − Π_A)(Y − Xβ)‖_2^2/n + γ ‖Π_A(Y − Xβ)‖_2^2/n (+ λ‖β‖_1)
leads to predictive stability (i.e. optimizing a worst-case risk)
fundamental duality in the anchor regression model:
max_{v∈C_γ} E[|Y^v − X^v β|^2] = E[|(Id − P_A)(Y − Xβ)|^2] + γ E[|P_A(Y − Xβ)|^2]
❀ robustness ←→ causal regularization

adversarial robustness (machine learning, generative networks; e.g. Ian Goodfellow) ←→ causality (e.g. Judea Pearl)
robustness ←→ causal regularization: the languages are rather different
robustness: ◮ metrics (Wasserstein, f-divergence) ◮ minimax optimality ◮ inner and outer optimization problems ◮ regularization ◮ ...
causality: ◮ causal graphs ◮ Markov properties on graphs ◮ perturbation models ◮ identifiability of systems ◮ transferability of systems ◮ ...
mathematics allows us to classify the equivalences and differences
❀ these can be exploited for better methods and algorithms, taking “the good” from both worlds!
indeed: causal regularization is nowadays used (still as a “side-branch”) in robust deep learning
Bottou et al. (2013), ..., Heinze-Deml & Meinshausen (2017), ...
Stickmen classification (Heinze-Deml & Meinshausen, 2017): classification into {child, adult} based on stickmen images; 5-layer CNN, training data n = 20'000

                             5-layer CNN    5-layer CNN with some causal regularization
training set                 4%             4%
test set 1                   3%             4%
test set 2 (domain shift)    41%            9%

in the training set and test set 1, children show stronger movement than adults; in the test set 2 data, adults show stronger movement
❀ the spurious correlation between age and movement is reversed!
Connection to distributionally robust optimization (Ben-Tal, El Ghaoui & Nemirovski, 2009; Sinha, Namkoong & Duchi, 2017):
argmin_β max_{P∈P} E_P[(Y − Xβ)^2]
the perturbations are within a class of distributions P = {P; d(P, P_0) ≤ ρ}
the “model” is the metric d(·, ·), and it is simply postulated
[figure: perturbations from distributional robustness as a ball with metric d(·, ·) and radius ρ]
anchor regression:
b_γ = argmin_β max_{v∈C_γ} E[|Y^v − X^v β|^2]
the perturbations are assumed to come from a causal-type model; the class of perturbations is learned from data
[figure: perturbation classes, learned from data and amplified (anchor regression) versus a pre-specified radius (robust optimization)]
anchor regression: the class of perturbations is an amplification of the heterogeneity learned from the data
... but this may be a bit ambitious... in the absence of randomized studies, causal inference necessarily requires (often untestable) additional assumptions
in the anchor regression model: we cannot find/identify the causal (“systems”) parameter β0
[figure: graph with A → X, X → Y with coefficient β0, and hidden H pointing into X and Y]
The parameter b_→∞: “diluted causality”
b_γ = argmin_β E[|(Id − P_A)(Y − Xβ)|^2] + γ E[|P_A(Y − Xβ)|^2]
b_→∞ = lim_{γ→∞} b_γ
by the fundamental duality: it leads to “invariance”, the parameter which optimizes the worst-case prediction risk over shift interventions of arbitrary strength
it is generally not the causal parameter, but because of shift invariance we name it “diluted causal”
note: causal = invariance w.r.t. very many perturbations
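Numerically, b_→∞ can be approximated by taking γ large in the anchor regression sketch from earlier (illustration only, assuming data arrays X, Y, A as there):

```python
import numpy as np

# the "diluted causal" parameter as the large-gamma limit of anchor regression,
# reusing the anchor_regression helper sketched earlier
for gamma in [1.0, 10.0, 100.0, 1e4]:
    print(f"gamma = {gamma:g}:", np.round(anchor_regression(X, Y, A, gamma)[:5], 3))
```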
notions of association:
[figure: nested sets — marginal correlation ⊇ regression ⊇ invariance ⊇ causal*]
under faithfulness conditions, the figure is valid (causal* are the causal variables as in e.g. large parts of Dawid, Pearl, Robins, Rubin, ...)
Stabilizing: John W. Tukey (1915–2000)

Tukey (1954): “One of the major arguments for regression instead of correlation is potential stability. We are very sure that the correlation cannot remain the same over a wide range of situations, but it is possible that the regression coefficient might. ... We are seeking stability of our coefficients so that we can hope to give them theoretical significance.”
Ruedi Aebersold, ETH Zürich; Niklas Pfister, ETH Zürich
which of the 3934 other proteins are “diluted causal” for cholesterol?
experiments with mice: 2 environments with fat/low-fat diet
high-dimensional regression, total sample size n = 270
Y = cholesterol pathway activity, X = 3934 protein expressions

x-axis: importance w.r.t. regression but non-invariant; y-axis: importance w.r.t. invariance
[figure: selection probability NSBI(Y) versus selection probability SBI(Y), highlighting genes such as Acsl3, Acss2, Cyp51, Dhcr7, Fdft1, Fdps, Hsd17b7, Idi1, Nsdhl, Pmvk, Rdh11, Sc4mol, Sqle and others]
beyond cholesterol: with transcriptomics and proteomics
not all of the predictive variables from regression lead to invariance!
“validation” in terms of
◮ finding known pathways (here for the Ribosome pathway)
◮ reported results in the literature
[figure: pAUC and relative pAUC for the Ribosome pathway (diet, mRNA), comparing corr, corr (env), IV (Lasso), Lasso, Ridge, SRpred, SR]
❀ invariance-type modeling improves over regression!
The replicability crisis ... scholars have found that the results of many scientific studies are difficult or impossible to replicate (Wikipedia)
Replicability on new and different data
◮ a regression parameter b is estimated on one (possibly heterogeneous) dataset with distributions P^e, e ∈ E
◮ can we see replication for b on another, different dataset with distribution P^{e'}, e' ∉ E?
this is a question of “zero order” replicability: it is a first step before talking about efficient inference (in an i.i.d. or stationary setting)
it’s not about accurate p-values, selective inference, etc.
The projectability condition: I = {β; E[Y − Xβ | A] ≡ 0} ≠ ∅
it holds iff rank(Cov(A, X)) = rank([Cov(A, X) | Cov(A, Y)])
example: Cov(A, X) has full rank and dim(A) ≤ dim(X), the “under- or just-identified case” in the IV literature
checkable in practice!
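A sketch of this rank check from sample covariances (numerical rank via the default tolerance of matrix_rank, which is an assumption on my part):

```python
import numpy as np

def projectability_holds(A, X, Y):
    """Check rank(Cov(A, X)) == rank([Cov(A, X) | Cov(A, Y)])."""
    n = len(Y)
    Ac, Xc, Yc = A - A.mean(0), X - X.mean(0), Y - Y.mean()
    cov_AX = Ac.T @ Xc / n                       # dim(A) x dim(X)
    cov_AY = Ac.T @ Yc[:, None] / n              # dim(A) x 1
    return (np.linalg.matrix_rank(cov_AX)
            == np.linalg.matrix_rank(np.hstack([cov_AX, cov_AY])))
```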
the “diluted causal” parameter b_→∞ is replicable: assume
◮ the new dataset arises from shift perturbations v ∈ span(M) (as before)
◮ the projectability condition holds
consider b_→∞ estimated from the first dataset, and b'_→∞ estimated from the second (new) dataset
Then: b_→∞ is replicable, i.e., b_→∞ = b'_→∞
Replicability for b_→∞ in GTEx data across tissues
◮ 13 tissues
◮ gene expression measurements for 12'948 genes, sample sizes between 300 and 700
◮ Y = expression of a target gene, X = expressions of all other genes, A = 65 PEER factors (potential confounders)
estimation and findings on one tissue ❀ are they replicable on the other tissues?
[figure: number of replicable features on a different tissue versus K, for the method pairs anchor regression − anchor regression, lasso − anchor regression, lasso − lasso]
x-axis: “model size” K; y-axis: how many of the top K ranked associations (found by a method on a tissue t) are among the top K on a tissue t' ≠ t
summed over the 12 different tissues t' ≠ t, averaged over all 13 tissues t, and averaged over 1000 random choices of a gene as the response
additional information in the anchor regression path!
anchor stability: b_0 = b_→∞ (= b_γ for all γ ≥ 0): checkable!
assume: ◮ anchor stability ◮ the projectability condition
❀ the least squares parameter b_1 is replicable!
we can safely use the “classical” least squares principle and methods (Lasso/ℓ1-norm regularization, de-biased Lasso, etc.) for transferability to some class of new data-generating distributions P^{e'}, e' ∉ E
Replicability for the least squares parameter in GTEx data across tissues, using anchor stability (denoted here as “anchor regression”)
[figure: number of replicable features on a different tissue versus K, for the method pairs anchor regression − anchor regression, lasso − anchor regression, lasso − lasso]
x-axis: “model size” K; y-axis: how many of the top K ranked associations (found by a method on a tissue t) are among the top K on a tissue t' ≠ t
summed over the 12 different tissues t' ≠ t, averaged over all 13 tissues t, and averaged over 1000 random choices of a gene as the response
◮ finding more promising proteins and genes: based on high-throughput proteomics ◮ replicable findings across tissues: based on high-throughput transcriptomics ◮ prediction of gene knock-downs: based on transcriptomics (Meinshausen, Hauser, Mooij, Peters, Versteeg, and PB, 2016) ◮ large-scale kinetic systems (not shown): based on metabolomics (Pfister, Bauer and Peters, 2019)
hidden confounding can lead to spurious associations: number of Nobel prizes vs. chocolate consumption
does smoking cause lung cancer?
[figure: X = smoking, Y = lung cancer, H = “genetic factors” (unobserved)]
Genes mirror geography within Europe (Novembre et al., 2008): confounding effects are found on the first principal components
also for “non-causal” questions: we want to adjust for unobserved confounding when interpreting regression coefficients, correlations, undirected graphical models, ...
..., Leek and Storey (2007); Gagnon-Bartsch and Speed (2012); Wang, Zhao, Hastie and Owen (2017); Wang and Blei (2018); ...
in particular: we want to “robustify” the Lasso against hidden confounding variables
Linear model setting: response Y, covariates X
aim: estimate the regression parameter of Y versus X in the presence of hidden confounding
◮ want to be robust: we might not completely address the unobserved confounding problem in a particular application, but we are “essentially always” better than doing nothing against it!
◮ the procedure should be simple, with almost zero effort to be used! ❀ it's just a linear transformation of the data!
◮ some mathematical guarantees
The setting and a first formula
[figure: hidden H pointing into X and Y; X → Y with coefficient β]
Y = Xβ + Hδ + η,   X = HΓ + E
goal: infer β from observations (X_1, Y_1), . . . , (X_n, Y_n)
the population least squares principle leads to the parameter
β* = argmin_u E[(Y − X^T u)^2],   β* = β + b
with ‖b‖_2 small: a small “bias”/“perturbation” if the confounder has dense effects!
the hidden confounding model Y = Xβ + Hδ + η, X = HΓ + E can be written as
Y = Xβ* + ε,   β* = β + b
with ε uncorrelated with X, E[ε] = 0, and ‖b‖_2 small
hidden confounding is a perturbation of sparsity:
[figure: the confounding model with H, X, Y and coefficient β, versus the reduced model X → Y with coefficient β + b]
Y = Xβ + Hδ + η,   X = HΓ + E
⟺   Y = X(β + b) + ε,   b = Σ^{−1} Γ^T δ (“dense”)
Σ = Σ_E + Γ^T Γ,   σ_ε^2 = σ_η^2 + δ^T(I − Γ Σ^{−1} Γ^T)δ
and thus ❀ consider the more general model (simulated in the sketch below)
Y = X(β + b) + ε,   β “sparse”, b “dense”
goal: recover β
the Lava method (Chernozhukov, Hansen & Liao, 2017) considers this model/problem
◮ with no connection to hidden confounding
◮ we improve the results and provide a “somewhat simpler” methodology
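A small simulation of this model (dimensions, seed and noise levels chosen arbitrarily), illustrating that least squares recovers the dense perturbation β + b rather than the sparse β:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 1000, 50, 3                        # samples, covariates, hidden confounders
beta = np.zeros(p); beta[:5] = 1.0           # sparse causal parameter
Gamma = rng.normal(size=(q, p))              # dense confounder loadings on X
delta = rng.normal(size=q)                   # confounder effect on Y

H = rng.normal(size=(n, q))
X = H @ Gamma + rng.normal(size=(n, p))                # X = H Gamma + E
Y = X @ beta + H @ delta + rng.normal(size=n)          # Y = X beta + H delta + eta

Sigma = np.eye(p) + Gamma.T @ Gamma                    # Sigma_E = I here
b = np.linalg.solve(Sigma, Gamma.T @ delta)            # dense perturbation

beta_star, *_ = np.linalg.lstsq(X, Y, rcond=None)      # approximates beta + b
print(np.round(beta_star[:8], 2))
print(np.round((beta + b)[:8], 2))
```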
well known among practitioners:
◮ adjust for a few first PCA components of X
motivation: the low-rank structure is generated from a few unobserved confounders
other approaches:
◮ latent variable models and EM-type or MCMC algorithms (Wang and Blei, 2018): need precise knowledge of the hidden confounding structure, cumbersome for fitting to data
◮ undirected graphical model search with penalization encouraging sparsity plus low rank (Chandrasekaran et al., 2012): two tuning parameters to choose, not so straightforward
..., Leek and Storey (2007); Gagnon-Bartsch and Speed (2012); Wang, Zhao, Hastie and Owen (2017); ... ❀ different variants of such adjustments
motivation: when using the Lasso for the non-sparse problem with β* = β + b, a bias term ‖Xb‖_2^2/n enters the bound for ‖Xβ̂ − Xβ*‖_2^2/n + ‖β̂ − β*‖_1
strategy: use a linear transformation F: R^n → R^n,
Ỹ = FY,   X̃ = FX,   ε̃ = Fε,   Ỹ = X̃β* + ε̃
and use the Lasso for Ỹ versus X̃, such that
◮ ‖X̃b‖_2^2/n is small
◮ ‖X̃β‖ stays “large”
◮ ε̃ remains “of order O(1)”
Spectral transformations, which transform the singular values of X, will achieve
◮ ‖X̃b‖_2^2/n small
◮ ‖X̃β‖ “large”
◮ ε̃ remaining “of order O(1)”
consider the SVD of X: X = UDV^T, U ∈ R^{n×n}, V ∈ R^{p×n}, U^T U = V^T V = I,
D = diag(d_1, . . . , d_n), d_1 ≥ d_2 ≥ . . . ≥ d_n ≥ 0
map d_i to d̃_i: the spectral transformation is defined as
F = U diag(d̃_1/d_1, . . . , d̃_n/d_n) U^T   ❀ X̃ = U D̃ V^T
Examples of spectral transformations:
◮ PCA adjustment: equivalent to d̃_1 = . . . = d̃_r = 0
◮ Lava: argmin_{β,b} ‖Y − X(β + b)‖_2^2/n + λ_1‖β‖_1 + λ_2‖b‖_2^2 can be represented as a spectral transform plus Lasso
◮ d̃_i ≡ 1 ❀ if d_n is small, the errors are inflated...!
Trim transform (Ćevid, PB & Meinshausen, 2018):
d̃_i = min(d_i, τ) with τ = d_{⌊n/2⌋}, the median singular value
[figure: singular values of X̃ for the Trim transform versus the Lasso (= no transformation)]
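A minimal sketch of the Trim transform, reusing the simulated X, Y from the sketch above; the final Lasso penalty level is arbitrary here:

```python
import numpy as np
from sklearn.linear_model import Lasso

def trim_transform(X, Y):
    """Spectral deconfounding: cap the singular values of X at tau,
    d_i -> min(d_i, tau), with tau the median singular value."""
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    scale = np.minimum(d, np.median(d)) / d        # d_tilde_i / d_i
    # F = U diag(d_tilde/d) U^T on col(U), identity on its complement
    F = np.eye(X.shape[0]) - U @ np.diag(1.0 - scale) @ U.T
    return F @ X, F @ Y

X_t, Y_t = trim_transform(X, Y)                    # X, Y from the simulation sketch
beta_hat = Lasso(alpha=0.05, fit_intercept=False).fit(X_t, Y_t).coef_
print(np.round(beta_hat[:8], 2))                   # compare with the sparse beta
```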
Heuristics in the hidden confounding model:
◮ b points towards singular vectors with large singular values ❀ it suffices to shrink only the large singular values to make the “bias” ‖X̃b‖_2^2/n small
◮ β typically does not point towards singular vectors with large singular values, since β is sparse and V is dense (unless there is a tailored dependence between β and the structure of X) ❀ the “signal” ‖X̃β‖_2^2/n does not change too much when shrinking only the large singular values
Some (subtle) theory: consider the confounding model Y = Xβ + Hδ + η, X = HΓ + E
Theorem (Ćevid, PB & Meinshausen, 2018). Assume:
◮ Γ spreads to O(p) components of X (components of Γ and δ are i.i.d. sub-Gaussian r.v.s, but then thought of as fixed)
◮ the condition number of Σ_E is O(1)
◮ dim(H) = q < s log(p), with s = |supp(β)| (the sparsity)
Then, when using the Lasso on X̃ and Ỹ:
‖β̂ − β‖_1 = O_P( s √(log(p)/n) / λ_min(Σ) )

limitation: when the hidden confounders only spread to/affect m components of X,
‖β̂ − β‖_1 ≤ O_P( s √(log(p)/n) / λ_min(Σ) + √s ‖δ‖_2 / √m )
❀ when only a few components of X are affected by hidden confounding variables, this and other techniques for adjustment must fail without further information (that is, without going to different settings)
[figure: ‖β̂ − β‖_1 versus the number of confounders; left panel: the confounding model]
black: Lasso, blue: Trim transform, red: Lava, PCA adjustment
[figure: ‖β̂ − β‖_1 versus σ; left panel: the confounding model]
black: Lasso, blue: Trim transform, red: Lava, PCA adjustment
[figure: ‖β̂ − β‖_1 versus the number of factors (“confounders”), but with b = 0 (no confounding)]
black: Lasso, blue: Trim transform, red: Lava, PCA adjustment
using the Trim transform does not hurt: plain Lasso is not better
◮ much improvement in the presence of confounders
◮ (essentially) no loss in cases with no confounding!
Example from genomics (GTEx data): a (small) aspect of the GTEx data
p = 14713 protein-coding gene expressions
n = 491 human tissue samples (same tissue)
q = 65 different covariates which are proxies for hidden confounding variables
❀ we can check the robustness/stability of the Trim transform in comparison to adjusting for the proxies of the hidden confounders
[figure: singular values of X, original versus adjusted for the 65 proxies of confounders]
❀ some evidence for factors, potentially being confounders
robustness/stability of the selected variables: do we see similar selected variables for the original and the proxy-adjusted dataset?
◮ the expression of one randomly chosen gene is the response Y; all other gene expressions are the covariates X
◮ use a variable selection method Ŝ = supp(β̂): Ŝ^(1) based on the original dataset, Ŝ^(2) based on the dataset adjusted with proxies
◮ compute the Jaccard distance d(Ŝ^(1), Ŝ^(2)) = 1 − |Ŝ^(1) ∩ Ŝ^(2)| / |Ŝ^(1) ∪ Ŝ^(2)| (sketched below)
◮ repeat over 500 randomly chosen genes
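The distance itself, as a sketch:

```python
def jaccard_distance(S1, S2):
    """d(S1, S2) = 1 - |S1 ∩ S2| / |S1 ∪ S2| for two sets of selected variables."""
    S1, S2 = set(S1), set(S2)
    union = S1 | S2
    return (1.0 - len(S1 & S2) / len(union)) if union else 0.0
```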
[figure: Jaccard distance d(supp(β̂_original), supp(β̂_adjusted)) versus model size, between the original and the adjusted data, averaged over 500 randomly chosen responses; adjusted for 5 proxy-confounders]
black: Lasso, blue: Trim transform, red: Lava
Trim transform (and Lava): more stable w.r.t. confounding
[figure: the same Jaccard distance versus model size; adjusted for 15 proxy-confounders]
black: Lasso, blue: Trim transform, red: Lava
Trim transform (and Lava): more stable w.r.t. confounding
[figure: the same Jaccard distance versus model size; adjusted for 65 proxy-confounders]
black: Lasso, blue: Trim transform, red: Lava
Trim transform (and Lava): more stable w.r.t. confounding
when “being able to do approximate deconfounding” ❀ more stability under perturbations of the hidden confounders
[figure: the confounding model with perturbations acting on the hidden H, once without and once with proxies of H]
for replicability (reproducibility): we want to be robust against heterogeneities or perturbations (of the hidden confounders)
❀ see the results for the GTEx data
spectral deconfounding, especially the Trim transform:
◮ is extremely easy to use: a linear transformation of X and Y (no tuning parameter with the default choice)
◮ leads to robustness of the Lasso against hidden confounding and increases the “degree of replicability”
◮ comes with (essentially) no harm if there is no confounding and a standard linear model is correct
❀ perhaps always to be used when aiming to interpret
◮ causality can be framed as worst-case risk optimization!
◮ causality can be inferred from invariance and a “stability” argument
◮ ICP (Invariant Causal Prediction) is a conceptual approach and method; Causal Dantzig is more powerful and “makes more statistical sense”, at the price of restricting the interventions
◮ causality and distributional robustness are related to each other: causal regularization is a technique which enables a spectrum between invariance/“diluted causality” and least squares (adjusted for anchor variables)
◮ there is much open space for improving distributional robustness (and hence performance) and interpretability beyond regression/classification association (invariance/“diluted causality” being one first example)
there are large on-going “dynamics” in data science, machine learning, “AI”, ... in the topic area of this course, but also in other fields:
Tukey, Fienberg, Cox, Wahba, Efron, Donoho
statistics will remain important
I really enjoy(ed) being here!
◮ Bühlmann, P. (2018). Invariance, causality and robustness. To appear in Statistical Science. Preprint arXiv:1812.08233.
◮ Ćevid, D., Bühlmann, P. and Meinshausen, N. (2018). Spectral deconfounding and perturbed sparse linear models. Preprint arXiv:1811.05352.
◮ Meinshausen, N., Hauser, A., Mooij, J.M., Peters, J., Versteeg, P. and Bühlmann, P. (2016). Methods for causal inference from gene perturbation experiments and validation. Proceedings of the National Academy of Sciences 113, 7361-7368.
◮ Peters, J., Bühlmann, P. and Meinshausen, N. (2016). Causal inference using invariant prediction: identification and confidence intervals (with discussion). Journal of the Royal Statistical Society, Series B 78, 947-1012.
◮ Pfister, N., Bühlmann, P. and Peters, J. (2018). Invariant causal prediction for sequential data. Journal of the American Statistical Association, published online, DOI 10.1080/01621459.2018.1491403.
◮ Rothenhäusler, D., Bühlmann, P. and Meinshausen, N. (2019). Causal Dantzig: fast inference in linear structural equation models with hidden variables under additive interventions. Annals of Statistics 47, 1688-1722.
◮ Rothenhäusler, D., Meinshausen, N., Bühlmann, P. and Peters, J. (2018). Anchor regression: heterogeneous data meets causality. Preprint arXiv:1801.06229.