Causal analysis within the framework of structural autoregressive models
Alessio Moneta, Scuola Superiore Sant'Anna, a.moneta@santannapisa.it
Universitat Rovira i Virgili, 13 February 2018
Outline
1. Why VAR models?
2. From VAR to SVAR models.
3. Problem of identification: two solutions based on causal search methods:
◮ Graphical Causal Models
◮ Independent Component Analysis
Why VAR models?
The Vector Autoregressive Model
Given a vector y_t of k variables:
$$y_t = \mu_t + D_1 y_{t-1} + D_2 y_{t-2} + \dots + D_p y_{t-p} + u_t$$
where
◮ D_i (i = 1, …, p) are (k × k) coefficient matrices;
◮ u_t is a (k × 1) vector of error terms (residuals), which are white noise with E(u_t u_t') = Σ_u;
◮ µ_t is a (k × 1) vector of constants (possibly including a deterministic trend).
⊲ Cfr. Sims, C. (1980) “Macroeconomics and Reality” Econometrica.
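A minimal sketch of how such a reduced-form VAR can be estimated in practice, here with Python's statsmodels; the simulated two-variable data and the lag order p = 2 are illustrative assumptions, not part of the slides:

```python
# Sketch: estimating a reduced-form VAR y_t = mu + D_1 y_{t-1} + D_2 y_{t-2} + u_t.
# The data are simulated placeholders standing in for real macro series.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
T = 200
data = pd.DataFrame(rng.standard_normal((T, 2)), columns=["m", "y"])

res = VAR(data).fit(2)      # p = 2 lags; estimates D_1, D_2 and mu
print(res.coefs.shape)      # (p, k, k): the matrices D_1, ..., D_p
u_hat = res.resid           # estimated residuals u_t
Sigma_u = res.sigma_u       # estimated covariance Sigma_u of u_t
```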
Structural equations and causality
⊲ Cowles Commission Approach: dominant approach in macro-econometrics 1940s-1960s
◮ Haavelmo's "The Probability Approach in Econometrics" (1944)
◮ Economic theory dictates the causal structure
◮ If the structure is adequate: error terms conform to standard probabilistic properties (independence and normality)
◮ Measuring the strengths of causal linkages.
Example of structural model
⊲ Example (Hoover 2006):
$$m = \alpha y + \varepsilon_m \qquad (1)$$
$$y = \beta m + \varepsilon_y \qquad (2)$$
where m ≡ money, y ≡ GDP (both in logs).
⊲ The statistical properties of ε_m and ε_y tell us whether the model is well specified.
⊲ Can we get α, β, ε_m, ε_y from the data?
⊲ Cfr. Hoover, K.D. (2006), “Economic Theory and Causal Inference”, in Handbook of the Philosophy of Economics.
Example of structural model
⊲ The problem of identification is solved by introducing two exogenous variables:
$$m = \alpha y + \delta r + \varepsilon_m \qquad (3)$$
$$y = \beta m + \gamma p + \varepsilon_y \qquad (4)$$
where r ≡ interest rate, p ≡ price level (both in logs).
⊲ Can we get α, β, δ, γ, ε_m, ε_y from the data?
⊲ Yes: the model is identified under the assumption that p is not a direct cause of m and that r is not a direct cause of y.
⊲ Omitting the error terms: r → m ⇄ y ← p
Reduced form model
⊲ After substituting each equation into the other and simplifying:
$$m = \frac{\alpha\gamma}{1-\alpha\beta}\,p + \frac{\delta}{1-\alpha\beta}\,r + \frac{1}{1-\alpha\beta}\,\varepsilon_m + \frac{\alpha}{1-\alpha\beta}\,\varepsilon_y \qquad (5)$$
$$y = \frac{\gamma}{1-\alpha\beta}\,p + \frac{\beta\delta}{1-\alpha\beta}\,r + \frac{\beta}{1-\alpha\beta}\,\varepsilon_m + \frac{1}{1-\alpha\beta}\,\varepsilon_y \qquad (6)$$
⊲ This system of reduced-form equations can now be estimated, since on the r.h.s. there are only exogenous variables:
$$m = a\,p + b\,r + u_m \qquad (7)$$
$$y = c\,p + d\,r + u_y \qquad (8)$$
⊲ From the OLS estimates â, b̂, ĉ, d̂, û_m, û_y, using the respective mappings between reduced-form and structural coefficients, the structural parameters α, β, γ, δ and the structural errors ε_m, ε_y can be recovered (indirect least squares).
Instrumental variables
⊲ Notice that the structural-form coefficients (α, β in the example) could equivalently be obtained by instrumental variables estimation.
⊲ Following the previous example:
◮ p: instrumental variable for eq. (1) (m = αy + ε_m)
◮ r: instrumental variable for eq. (2) (y = βm + ε_y)
⊲ Two-stage least squares estimation for eq. (1):
1. OLS regression of y on p: obtain ŷ
2. OLS regression of m on ŷ: obtain α̂_IV
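The following sketch illustrates the two-stage procedure on data simulated from the identified system (3)-(4); all parameter values are arbitrary assumptions chosen for the demonstration:

```python
# Sketch: 2SLS estimation of alpha in m = alpha*y + ..., with p as instrument.
import numpy as np

rng = np.random.default_rng(1)
T = 50_000
alpha, beta, delta, gamma = 0.5, 0.3, 1.0, 1.0
p = rng.standard_normal(T)                    # exogenous price level
r = rng.standard_normal(T)                    # exogenous interest rate
eps_m, eps_y = rng.standard_normal(T), rng.standard_normal(T)
# Reduced form (5)-(6): solve the simultaneous system for m and y.
det = 1.0 - alpha * beta
m = (alpha * gamma * p + delta * r + eps_m + alpha * eps_y) / det
y = (gamma * p + beta * delta * r + beta * eps_m + eps_y) / det

# Stage 1: OLS regression of y on p gives fitted values y_hat.
y_hat = p * (p @ y) / (p @ p)
# Stage 2: OLS regression of m on y_hat gives the IV estimate of alpha.
alpha_iv = (y_hat @ m) / (y_hat @ y_hat)
print(alpha_iv)                               # close to alpha = 0.5
```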
Summary on Structural Equation Modeling
⊲ SEM: provide quantitative assessment of cause-effect relationships.
⊲ Interpretation of causality:
◮ counterfactual
◮ manipulability
⊲ Probabilistic methods are used to measure causality and partially to test it, but not for the sake of causal discovery.
⊲ Dependence on a priori economic theory:
◮ necessity of identifying restrictions.
⊲ Empiricist query: where does the economic theory come from?
The crisis of the Cowles Commission approach
⊲ Up to the 1970s: consensus on the Cowles Commission approach.
⊲ Important economic events in the 1970s: oil crisis (1973), stagflation, increasing skepticism towards so-called "Keynesian Macroeconomics".
⊲ Two major critiques:
◮ Lucas Critique (1976) on economic policy evaluation: structural equations are ineffective for policy evaluation, because they are unstable under intervention.
◮ Sims's (1980) article "Macroeconomics and Reality": the restrictions used in the Cowles Commission approach are "incredible", i.e. empirically not validated.
⊲ Cfr. Favero, C.A. (2001) Applied Macroeconometrics, OUP.
The crisis of the Cowles Commission approach
Reactions to the criticisms:
⊲ Change the theory, maintaining the structural, theory-driven approach:
◮ cfr. New Classical Macroeconomics (Lucas, Sargent) and rational expectations models
◮ even more extreme theory-driven approach: calibration
⊲ Adopt a more data-driven approach: time-series econometric models, more intensive use of statistical methods:
◮ Granger Causality (1969)
◮ Vector Autoregressive Models (VAR) (Sims 1980)
Granger causality
Consider two time series X_t and Y_t. The idea of Granger's (1969, 1980) causality:
◮ {Y_t} causes {X_t} if Y_t helps to predict X_{t+h}.
Definition (G-causality, Granger 1980: 330):
◮ Y_t is said to cause X_{t+1} if
$$P(X_{t+1} \in A \mid \Omega_t) \neq P(X_{t+1} \in A \mid \Omega_t \setminus Y_t),$$
for some set A, where Ω_t is the set of "relevant information" available at time t (Ω_t∖Y_t := the relevant information except Y_t).
This is a merely probabilistic notion of causality. It is important to appraise the usefulness of G-causality, but also its differences from structural causality.
Testing G-non-causality
Granger non-causality (GNC) tests:
◮ Regression of X_{t+1} on Ω_t (including X_t, Y_t and their lagged values) + the same regression excluding Y_t (and its lagged values).
◮ Check the error variances of the two regressions (with and without Y_t); and/or
◮ Test whether the Y_t coefficients in the regression of X_{t+1} on Ω_t are zero (F or t test).
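As a sketch, such GNC tests are available off the shelf, e.g. in statsmodels; the bivariate series and the lag order below are illustrative assumptions:

```python
# Sketch: Granger non-causality test. By statsmodels' convention the test
# asks whether the series in the SECOND column G-causes the FIRST one.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(2)
T = 300
y = rng.standard_normal(T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + 0.4 * y[t - 1] + rng.standard_normal()

results = grangercausalitytests(np.column_stack([x, y]), maxlag=2)
# Reports F and chi-squared tests of the null "y does not G-cause x".
```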
Structural vs. G-causality (Example from Hoover 2001: 151)
Suppose the unobserved structural model (DGP) is:
$$Y_t = \theta X_t + \beta_{11} Y_{t-1} + \beta_{12} X_{t-1} + \epsilon_{1t} \qquad (9)$$
$$X_t = \gamma Y_t + \beta_{21} Y_{t-1} + \beta_{22} X_{t-1} + \epsilon_{2t} \qquad (10)$$
Reduced form model:
$$Y_t = a_{11} Y_{t-1} + a_{12} X_{t-1} + \nu_{1t} \qquad (11)$$
$$X_t = a_{21} Y_{t-1} + a_{22} X_{t-1} + \nu_{2t} \qquad (12)$$
(the coefficients in (11)-(12) are functions of the coefficients in (9)-(10))
GNC tests:
◮ if a_12 = 0, then X does not G-cause Y
◮ if a_21 = 0, then Y does not G-cause X.
In particular we have:
$$a_{12} = \frac{\beta_{12} + \theta\beta_{22}}{1 - \theta\gamma}, \qquad a_{21} = \frac{\gamma\beta_{11} + \beta_{21}}{1 - \theta\gamma}$$
Note: if θ = 0 and β_12 = 0, then a_12 = 0 (lack of both structural and G-causality from X to Y). But a_12 = 0 does not necessarily imply that θ = 0 and β_12 = 0. Thus GNC is not sufficient for structural non-causality.
Granger causality and VARs
◮ Example of a VAR model with four variables:
$$\begin{pmatrix} y_t \\ m_t \\ p_t \\ r_t \end{pmatrix} = \mu + A_1 \begin{pmatrix} y_{t-1} \\ m_{t-1} \\ p_{t-1} \\ r_{t-1} \end{pmatrix} + \dots + A_p \begin{pmatrix} y_{t-p} \\ m_{t-p} \\ p_{t-p} \\ r_{t-p} \end{pmatrix} + u_t \qquad (13)$$
GNC: "p_t does not G-cause m_t" is tested by checking whether the (2, 3) elements of the matrices A_1, …, A_p are all zero.
Note that the matrices A_1, …, A_p are of dimension (4 × 4).
From VAR to SVAR models
Vector Autoregressive (VAR) Model
Reduced form: given a vector y_t of k variables and (k × k) matrices D_i:
$$y_t = D_1 y_{t-1} + D_2 y_{t-2} + \dots + D_p y_{t-p} + u_t$$
(NB: we omit here for convenience constants or deterministic trends).
Structural form: given u_t = Aε_t, where ε_t is a vector of orthogonal (or independent?) shocks:
$$A^{-1} y_t = A^{-1}(D_1 y_{t-1} + D_2 y_{t-2} + \dots + D_p y_{t-p} + u_t)$$
$$W y_t = C_1 y_{t-1} + C_2 y_{t-2} + \dots + C_p y_{t-p} + \varepsilon_t$$
$$y_t = B y_t + C_1 y_{t-1} + C_2 y_{t-2} + \dots + C_p y_{t-p} + \varepsilon_t$$
where W = A^{-1}, C_i = A^{-1} D_i, B = I − W.
Structural VAR
Alternative SVAR formulations...
$$W y_t = C_1 y_{t-1} + \dots + C_p y_{t-p} + \varepsilon_t \qquad (1)$$
$$y_t = D_1 y_{t-1} + \dots + D_p y_{t-p} + A\varepsilon_t \qquad (2)$$
$$G y_t = H_1 y_{t-1} + \dots + H_p y_{t-p} + F\varepsilon_t \qquad (3)$$
...corresponding to different causal structures.
[Figure: three alternative contemporaneous causal structures among y_{1t}, y_{2t}, y_{3t}, each variable driven by its own shock ε_{1t}, ε_{2t}, ε_{3t}.]
In this example: y_t = (y_{1t}, y_{2t}, y_{3t})', ε_t = (ε_{1t}, ε_{2t}, ε_{3t})'.
Structural VAR (cont’d)
$$W y_t = C_1 y_{t-1} + \dots + C_p y_{t-p} + \varepsilon_t \qquad (1)$$
$$y_t = D_1 y_{t-1} + \dots + D_p y_{t-p} + A\varepsilon_t \qquad (2)$$
$$G y_t = H_1 y_{t-1} + \dots + H_p y_{t-p} + F\varepsilon_t \qquad (3)$$
Formulation (2) is more general than (1) and (3). From (2), setting A = W^{-1} and C_i = W D_i, one gets (1). From (3), setting A = EF with E = G^{-1} and H_i = G D_i, one gets (2). In fact one can see (2) as a middle formulation between the reduced-form and the structural model.
Impulse Response Functions
Wold decomposition (inverting the autoregressive part):
$$y_t = (I - D_1 L - \dots - D_p L^p)^{-1} u_t = \sum_{j=0}^{\infty} \Phi_j u_{t-j}$$
where Φ_0 = I and Φ_i = Σ_{j=1}^{i} D_j Φ_{i−j} for i = 1, 2, … (with D_j = 0 for j > p).
$$y_t = \sum_{j=0}^{\infty} \Phi_j u_{t-j} = \sum_{j=0}^{\infty} \Phi_j W^{-1} W u_{t-j} = \sum_{j=0}^{\infty} \Phi_j A \varepsilon_{t-j} = \sum_{j=0}^{\infty} \Psi_j \varepsilon_{t-j}$$
The elements of Ψ_j are the impulse response functions:
$$\frac{\partial y_{t+j}}{\partial \varepsilon_t} = \Psi_j$$
They depend only on the mixing matrix A and the reduced-form matrices D_i (i = 1, …, p).
Note on the Wold decomposition: the Wold decomposition y_t = (I − D_1 L − … − D_p L^p)^{-1} u_t is possible only under stability, that is, if
$$\det D(z) = \det(I - D_1 z - \dots - D_p z^p) \neq 0 \quad \text{for all } z \in \mathbb{C} \text{ with } |z| \leq 1.$$
But in general (even with non-stationary variables), the forecast error associated with an h-step forecast is:
$$y_{t+h} - y_{t+h|t} = u_{t+h} + \Phi_1 u_{t+h-1} + \dots + \Phi_{h-1} u_{t+1}.$$
Thus we have:
$$\frac{\partial y_{t+j}}{\partial u_t} = \Phi_j; \qquad \frac{\partial y_{t+j}}{\partial \varepsilon_t} = \Phi_j A = \Psi_j \quad \text{(IRF)}$$
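A small sketch of the recursion for Φ_j and of Ψ_j = Φ_j A in NumPy; the VAR matrices and the mixing matrix A below are placeholders:

```python
# Sketch: reduced-form IRFs Phi_j and structural IRFs Psi_j = Phi_j A.
import numpy as np

def irfs(D, A, horizon):
    """D: list of (k, k) VAR matrices D_1..D_p; A: (k, k) mixing matrix."""
    k, p = D[0].shape[0], len(D)
    Phi = [np.eye(k)]                         # Phi_0 = I
    for i in range(1, horizon + 1):
        # Phi_i = sum_{j=1}^{min(i,p)} D_j Phi_{i-j}
        Phi.append(sum(D[j - 1] @ Phi[i - j]
                       for j in range(1, min(i, p) + 1)))
    return [Ph @ A for Ph in Phi]             # Psi_0, ..., Psi_horizon

D = [np.array([[0.5, 0.1], [0.0, 0.4]])]      # p = 1, placeholder values
A = np.array([[1.0, 0.0], [0.5, 1.0]])        # placeholder mixing matrix
Psi = irfs(D, A, horizon=4)
print(Psi[2])   # response of y_{t+2} to a unit structural shock at t
```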
Methods of identification
Identification of a SVAR model reduces to the problem of finding the "right" mixture of u_t: W u_t = ε_t, or u_t = A ε_t. Possible methods:
1. Cholesky decomposition of the u_t covariance matrix (cfr. Sims 1980).
2. A priori zero restrictions on W (cfr. Bernanke 1986; Blanchard and Watson 1986).
3. Long-run restrictions, exploiting cointegration implications of the long-run impact matrix (cfr. Blanchard and Quah 1989; King et al. 1991).
4. Sign restrictions (cfr. Uhlig 2005).
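As a sketch of method 1, the Cholesky factor of Σ_u gives one admissible mixing matrix; the covariance matrix below is a made-up example, and the variable ordering is the identifying assumption:

```python
# Sketch: Cholesky identification. A lower-triangular A with A A' = Sigma_u
# makes eps_t = A^{-1} u_t orthogonal with unit variance.
import numpy as np

Sigma_u = np.array([[1.0, 0.3],
                    [0.3, 0.5]])          # reduced-form residual covariance
A = np.linalg.cholesky(Sigma_u)           # lower triangular
W = np.linalg.inv(A)                      # W u_t = eps_t
print(W @ Sigma_u @ W.T)                  # ~ identity: orthonormal shocks
```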
Methods of identification (cont'd)
Possible methods (cont'd):
5. Graphical causal models (cfr. Bessler 2002; Demiralp and Hoover 2003; Moneta 2008).
6. Independent component analysis (cfr. Moneta et al. 2013; Gourieroux et al. 2016; Lanne et al. 2016; Herwartz and Plödt 2016).
Graphical Models for SVAR Identification
⊲ Idea: apply graphical causal search to u_t = (u_{1t}, …, u_{kt})'.
⊲ The causal structure among u_{1t}, …, u_{kt} will deliver information on the matrix W.
⊲ Notice that W determines the causal relationships among y_{1t}, …, y_{kt}.
Graphical Causal Models
References
Spirtes, P., C. Glymour, R. Scheines (2000), Causation, Prediction, and Search, 2nd edition, MIT Press.
Pearl, J. (2000), Causality: Models, Reasoning, and Inference, Cambridge University Press.
Example of a graph
[Figure: a directed graph with edges V1 → V2, V3 → V2, V2 → V4, V4 → V5, and an edge between V1 and V3.]
◮ Directed paths: ⟨V1, V2, V4, V5⟩; ⟨V3, V2, V4, V5⟩; ⟨V2, V4, V5⟩, etc.
◮ Undirected paths: ⟨V1, V3, V2, V4, V5⟩; ⟨V1, V2, V3⟩, etc.
◮ Undirected cyclic path: ⟨V1, V2, V3, V1⟩
◮ No directed cyclic paths.
More terminology
Collider: vertex V such that A → V ← B
Unshielded collider: vertex V such that A → V ← B and A and B are not adjacent (≡ connected by an edge) in the graph
Complete graph: graph in which every pair of vertices is adjacent
Directed Acyclic Graph (DAG): directed graph that contains no directed cyclic paths
Directed Cyclic Graph (DCG): directed graph that contains directed cyclic paths
Graphs and probabilistic dependence
◮ Observed data are generated by a chance set-up (DGP)
◮ Unobserved causal relationships are represented by a causal graph
◮ Causal relationships constrain the probability distribution (conditional independence relations)
◮ We can envisage algorithms that determine the set of causal structures that are consistent with the conditional independence relations.
Conditional Independence
If X, Y, Z are random variables, we say that X is conditionally independent of Y given Z, and write X ⊥⊥ Y|Z, if
◮ for discrete variables: P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)
◮ for continuous variables: f_{XY|Z}(x, y|z) = f_{X|Z}(x|z) f_{Y|Z}(y|z)
We can also write (simplifying the notation): X ⊥⊥ Y|Z ⟺ f(x, y, z) f(z) = f(x, z) f(y, z)
Conditional independence
⊲ Some equivalences:
◮ X ⊥⊥ Y|Z ⟺ f(x, y|z) = f(x|z) f(y|z)
◮ X ⊥⊥ Y|Z ⟺ f(x, y, z) f(z) = f(x, z) f(y, z)
◮ X ⊥⊥ Y|Z ⟺ f(x|y, z) = f(x|z)
◮ X ⊥⊥ Y|Z ⟺ f(x, z|y) = f(x|z) f(z|y)
◮ X ⊥⊥ Y|Z ⟺ f(x, y, z) = f(x|z) f(y, z)
Note: f(x, y|z) = f(x, y, z)/f(z)
Conditional independence
⊲ It also holds:
◮ X ⊥⊥ Y|Z ⟺ Y ⊥⊥ X|Z (symmetry)
◮ If Z is empty (trivial): X ⊥⊥ Y, i.e. X is independent of Y.
⊲ Other properties (partial list):
◮ X ⊥⊥ {Y, W}|Z ⟹ X ⊥⊥ Y|Z (decomposition)
◮ X ⊥⊥ {Y, W}|Z ⟹ X ⊥⊥ Y|{Z, W} (weak union)
See Pearl (2000: 11)
Interpretations of C.I.
⊲ Useful interpretations of C.I. X ⊥⊥ Y|Z:
◮ once we know Z, learning the value of Y does not provide additional information about X;
◮ once we observe realizations of Z, observing realizations of Y is irrelevant for predicting the realizations of X.
Independence and uncorrelatedness
Importance of distinguishing between (conditional) independence and (conditional or partial) correlation.
◮ Recall:
⊲ Correlation coefficient (Pearson): ρ_XY := σ_XY / (σ_X σ_Y)
⊲ Linear regression coefficient: r_XY := σ_XY / σ²_Y = ρ_XY σ_X / σ_Y
⊲ This suggests that correlation is a measure of linear dependence.
⊲ Notice: σ_XY = σ_YX and ρ_XY = ρ_YX, but r_XY ≠ r_YX.
Independence and uncorrelatedness
Partial correlation between X and Y given Z:
$$\rho_{XY.Z} = \frac{\rho_{XY} - \rho_{XZ}\rho_{YZ}}{\sqrt{1 - \rho^2_{XZ}}\,\sqrt{1 - \rho^2_{YZ}}}$$
Conditional independence X ⊥⊥ Y|Z: f_{XY|Z}(x, y|z) = f_{X|Z}(x|z) f_{Y|Z}(y|z).
It holds:
◮ X ⊥⊥ Y ⟹ ρ_XY = 0
◮ X ⊥⊥ Y|Z ⟹ ρ_XY.Z = 0
and (of course):
◮ ρ_XY ≠ 0 ⟹ X ⊥̸⊥ Y
◮ ρ_XY.Z ≠ 0 ⟹ X ⊥̸⊥ Y|Z
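A quick numerical sketch of these implications, using the partial-correlation formula above on simulated data (the coefficients are arbitrary):

```python
# Sketch: X and Y are dependent only through Z, so rho_XY is large
# while rho_XY.Z is close to zero.
import numpy as np

def partial_corr(x, y, z):
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = np.random.default_rng(3)
z = rng.standard_normal(5000)
x = z + 0.5 * rng.standard_normal(5000)
y = z + 0.5 * rng.standard_normal(5000)
print(np.corrcoef(x, y)[0, 1])   # large: X and Y are correlated
print(partial_corr(x, y, z))     # ~ 0: X _||_ Y | Z in this design
```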
Independence and uncorrelatedness
⊲ In general:
◮ ρ_XY = 0 ⇏ X ⊥⊥ Y
◮ ρ_XY.Z = 0 ⇏ X ⊥⊥ Y|Z
⊲ However, if the joint distribution F(X, Y, Z) is normal:
◮ ρ_XY = 0 ⟹ X ⊥⊥ Y
◮ ρ_XY.Z = 0 ⟹ X ⊥⊥ Y|Z
Note (1)
Given random variables X and Y with finite moments E(X^k) < ∞ and E(Y^m) < ∞, it turns out that
X ⊥⊥ Y iff E(X^k Y^m) = E(X^k) E(Y^m) for all k, m = 1, 2, …
X and Y are (k, m)-order dependent iff E(X^k Y^m) ≠ E(X^k) E(Y^m) for some k, m.
(1,1)-order (linear) dependence: E(XY) ≠ E(X) E(Y).
(1,1)-order independence: E(XY) = E(X) E(Y) ⟺ E{[X − E(X)][Y − E(Y)]} = 0 ⟺ σ_XY = 0 ⟺ ρ_XY = 0.
Orthogonality: E(XY) = 0.
Note (2)
1. If X and Y are uncorrelated (ρ_XY = 0), this is equivalent to saying that their mean deviations are orthogonal (if X and Y are "centered", subtracting their means, they become orthogonal).
2. If X and Y are orthogonal, then ρ_XY = 0 only if E(X) = 0 or E(Y) = 0.
In summary
independence ⟹ (k, m)-order independence for all k, m
independence ⟹ non-correlation ⟺ orthogonality of the mean-subtracted variables
non-correlation ⇏ independence (there could be non-linear dependencies!)
(cfr. Spanos 1999: 272-279)
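A one-line numerical sketch of the last point: with a symmetric X, the square Y = X² is uncorrelated with X yet strongly (2,1)-order dependent on it:

```python
# Sketch: zero correlation without independence (nonlinear dependence).
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(100_000)
y = x**2
print(np.corrcoef(x, y)[0, 1])     # ~ 0: uncorrelated (E[X^3] = 0)
print(np.corrcoef(x**2, y)[0, 1])  # = 1: clearly dependent
```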
The Markov Condition
⊲ Key assumption connecting causal graphs (as DAGs) to probability distributions: the Causal Markov Condition.
⊲ It says that a particular causal structure will generate a particular set of conditional independence relations:
In any probability distribution D generated by a causal graph G, each variable X (node of G) is probabilistically independent of the set Y consisting of all variables that are not effects of X, conditional on the direct causes of X. (cfr. Scheines 2005)
That is (V is the set of nodes of G):
∀X ∈ V:  X ⊥⊥ Non-effects(X) | Direct-causes(X),
or equivalently:
∀X ∈ V:  X ⊥⊥ V∖(Descendants(X) ∪ Parents(X)) | Parents(X).
Markov Condition (example)
[Figure: DAG with edges V1 → V2, V3 → V2, V2 → V4, V4 → V5, and an edge between V1 and V3.]
◮ The DAG above and the probability distribution P(V1, …, V5) satisfy CMC iff: (1) V4 ⊥⊥ {V1, V3} | V2 and (2) V5 ⊥⊥ {V1, V2, V3} | V4.
◮ Notice that many other c.i. relations follow from (1) and (2) by applying symmetry, decomposition, weak union, etc. For example:
◮ {V1, V3} ⊥⊥ V4 | V2; V1 ⊥⊥ V4 | V2; V3 ⊥⊥ V4 | V2; V1 ⊥⊥ V4 | {V2, V3}; etc.
◮ {V1, V2, V3} ⊥⊥ V5 | V4; V5 ⊥⊥ {V1, V2} | V4; etc.
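A simulation sketch in the spirit of the example: a simple linear-Gaussian model with V1 → V2 ← V3 and V2 → V4 → V5 (a sub-structure of the graph above, arbitrary coefficients), checking the CMC-implied relation V4 ⊥⊥ V1 | V2 via a partial correlation:

```python
# Sketch: CMC in action on the DAG V1 -> V2 <- V3, V2 -> V4 -> V5.
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
v1 = rng.standard_normal(n)
v3 = rng.standard_normal(n)
v2 = 0.8 * v1 + 0.6 * v3 + rng.standard_normal(n)
v4 = 0.7 * v2 + rng.standard_normal(n)
v5 = 0.9 * v4 + rng.standard_normal(n)

def pcorr(x, y, z):
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

print(np.corrcoef(v4, v1)[0, 1])  # nonzero: V4 and V1 are dependent
print(pcorr(v4, v1, v2))          # ~ 0: V4 _||_ V1 | V2, as CMC implies
```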
Causal search
⊲ Observational data.
⊲ Constraint-based causal search:
◮ Statistical tests: find out the conditional (in)dependencies among the variables;
◮ Constrain the causal structure: use these tests to disprove causal relationships until one is left with a set of admissible causal relationships.
Assumptions for constraint-based causal search
⊲ Two fundamental conditions:
⊲ Causal Markov condition:
◮ Markov condition with an explicit causal interpretation: any variable in the causal graph is conditionally independent of its non-effects (i.e. non-descendants), given its direct causes (i.e. parents).
⊲ Faithfulness condition:
◮ Let G be a causal graph and P a probability distribution associated with the vertices of G. ⟨G, P⟩ satisfies the Faithfulness Condition iff every conditional independence relation true in P is entailed by the Causal Markov Condition applied to G.
⊲ Both conditions together: X ⊥⊥ Y|Z ⟺ X is d-separated from Y by Z.
Causal search
⊲ Given a set of n variables X_1, …, X_n, for which we observe T realizations X (T × n), suppose we know all the C.I. relations.
⊲ Using the two conditions (CMC and FC), Spirtes et al. (2000) developed algorithms to infer from the data the set of causal graphs which are compatible with the C.I. relations.
⊲ Different algorithms for different settings. The simplest setting:
◮ causal sufficiency (no latent variables)
◮ acyclicity (no feedback loops), i.e. a DAG structure
Causal Markov Condition: meaning
⊲ CMC tells us that a causal structure generates some C.I. relations.
⊲ These C.I. relations are generated by the screening-off property of structures of the type A → B → C and A ← B → C (in both, A ⊥⊥ C | B).
⊲ Equivalently, CMC tells us that if some conditional dependence is present, then some causal relation occurs (cfr. the principle of the common cause).
Faithfulness Condition: meaning
⊲ FC tells us that all the C.I. relations are generated by the causal structure.
⊲ FC permits the inference from C.I. tests to causal relations.
⊲ Equivalently: causal relations entail dependencies.
⊲ Principle of the common effect (unshielded collider): from X → Z ← Y it follows: X ⊥⊥ Y, but X ⊥̸⊥ Y|Z.
⊲ FC is connected with stability (Pearl 2000: 63).
Faithfulness Condition: violations
⊲ Violations of FC:
◮ Situations in which dependencies are exactly compensated along different paths.
[Figure: X affects Z both directly and through Y; when the path effects have opposite signs and exactly cancel, X and Z are independent despite being causally connected.]
Causal search algorithm
◮ Assumptions: causal sufficiency, acyclicity, CMC, FC (+ statistical assumptions for testing C.I.)
◮ Input: conditional independence tests
◮ Output: set of DAGs
Cfr. the PC algorithm in Spirtes, Glymour and Scheines (2000: 84-85).
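A compact sketch of the algorithm's first step (the skeleton search) for Gaussian data, with partial correlations computed from the precision matrix; the fixed threshold is an ad-hoc stand-in for a proper Fisher-z test:

```python
# Sketch: PC skeleton search. Remove the edge (i, j) whenever some
# conditioning set Z makes the sample partial correlation vanish.
import itertools
import numpy as np

def partial_corr(X, i, j, Z):
    # Partial correlation of columns i, j given Z, via the precision matrix.
    P = np.linalg.inv(np.cov(X[:, [i, j] + Z].T))
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

def pc_skeleton(X, threshold=0.05):
    k = X.shape[1]
    adj = {(i, j) for i in range(k) for j in range(i + 1, k)}
    for size in range(k - 1):                 # |Z| = 0, 1, 2, ...
        for (i, j) in sorted(adj):
            others = [v for v in range(k) if v not in (i, j)]
            for Z in itertools.combinations(others, size):
                if abs(partial_corr(X, i, j, list(Z))) < threshold:
                    adj.discard((i, j))       # judged independent: drop edge
                    break
    return adj

rng = np.random.default_rng(8)
x0 = rng.standard_normal(5000)
x1 = x0 + 0.5 * rng.standard_normal(5000)   # chain x0 -> x1 -> x2
x2 = x1 + 0.5 * rng.standard_normal(5000)
print(pc_skeleton(np.column_stack([x0, x1, x2])))  # {(0, 1), (1, 2)}
```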
Causal search algorithm
⊲ Start: complete undirected graph G among V_1, …, V_k.
◮ Example. [Figures: the true unobserved structure, a DAG over A, B, C, D; and the starting point of the algorithm, the complete undirected graph over A, B, C, D.]
Causal search algorithm
⊲ First step: recursively eliminate edges using C.I. tests.
Select ⟨V_j, V_m⟩ s.t. V_j and V_m are adjacent in G. Select a conditioning set Z of variables s.t. V_j ∉ Z and V_m ∉ Z. Delete the edge between V_j and V_m if V_j ⊥⊥ V_m | Z. Start with Z = ∅ and repeat the procedure recursively for conditioning sets of increasing size, i = 1, …, n.
◮ Example, i = 1 (Z = ∅): we have A ⊥⊥ B and B ⊥⊥ D, so the corresponding edges are deleted. [Figures: the true unobserved structure and the intermediate output, the complete undirected graph without the edges A—B and B—D.]
Causal search algorithm
⊲ First step (continued).
◮ Example, i = 2 (|Z| = 1): we have A ⊥⊥ C|D and A ⊥⊥ B|D, so the edge A—C is also deleted. [Figures: the true unobserved structure and the intermediate output, an undirected graph with the remaining edges A—D, D—C and B—C.]
Causal search algorithm
⊲ Second step: identify unshielded colliders.
Select ⟨V_i, V_j, V_k⟩ s.t. V_i and V_j are adjacent in G, V_j and V_k are adjacent in G, but V_i and V_k are not adjacent in G. Orient V_i — V_j — V_k as V_i → V_j ← V_k if V_i ⊥̸⊥ V_k | V_j.
◮ Example: we have B ⊥̸⊥ D|C and A ⊥⊥ C|D, so B — C — D is oriented as B → C ← D. [Figures: the true unobserved structure and the intermediate output, with B → C, D → C and the undirected edge A — D.]
Causal search algorithm
⊲ Third step: identify chains.
Select ⟨V_i, V_j, V_k⟩ s.t. V_i → V_j, V_j — V_k, and V_i and V_k are not adjacent. Orient V_j — V_k as V_j → V_k.
◮ Example: the third step is not applicable in this example.
Causal search algorithm
⊲ Fourth step: avoid cycles.
If there is a directed path from V_i to V_k, and an edge between V_i and V_k, then orient V_i — V_k as V_i → V_k.
◮ Example: the fourth step is not applicable in this example.
Causal search algorithm
◮ Final output of the example: two observationally equivalent DAGs.
(Notice that the number of possible DAGs among 4 variables is 543.)
[Figures: the true unobserved structure and the two DAGs in the final output, which share the edges B → C and D → C and differ only in the direction of the edge between A and D.]
In this example the algorithm is able to identify all the edges and to recover all the causal directions, except for the direction of the edge between A and D.
Testing conditional independence
Numerous tests of C.I. exist. Which one should be applied depends on the nature of the data and the properties of the statistical model. The search algorithm can incorporate any of them. In a parametric Gaussian setting, testing C.I. is straightforward: it reduces to testing zero partial correlations.
⊲ Spirtes et al. (2000: 94) suggest using Fisher's z to test ρ_XY.C = 0:
$$z(\rho_{XY.C}, n) = \frac{1}{2}\sqrt{n - |C| - 3}\; \log\frac{|1 + \rho_{XY.C}|}{|1 - \rho_{XY.C}|},$$
where |C| is the number of variables in C and n is the length of the sample.
⊲ If X, Y, C are jointly normal, then under the null hypothesis of zero partial correlation: z(ρ̂_XY.C, n) ~ N(0, 1).
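A direct sketch of this test, implementing the statistic and its null distribution as given above:

```python
# Sketch: Fisher's z test for the null rho_{XY.C} = 0.
import numpy as np
from scipy.stats import norm

def fisher_z_test(rho, n, card_C):
    z = 0.5 * np.sqrt(n - card_C - 3) * np.log((1 + rho) / (1 - rho))
    pval = 2 * (1 - norm.cdf(abs(z)))   # z ~ N(0, 1) under the null
    return z, pval

print(fisher_z_test(rho=0.05, n=500, card_C=1))  # small rho: not rejected
```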
Testing conditional independence
⊲ In a nonparametric setting, testing C.I. is much more difficult, but still theoretically possible.
⊲ Recall the definition of C.I.: f(x, y|z) = f(x|z) f(y|z) ⟺ f(x, y, z) f(z) = f(x, z) f(y, z).
⊲ It is then possible to obtain kernel density estimates f̂(x, y, z), f̂(z), f̂(x, z), f̂(y, z) and then check whether the distance d(f̂(x, y, z) f̂(z), f̂(x, z) f̂(y, z)) is close enough to zero.
⊲ Major difficulty: the curse of dimensionality in nonparametric estimation.
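A rough sketch of the kernel idea with SciPy's Gaussian KDE; the distance measure (a mean absolute deviation over the sample points) and the default bandwidths are ad-hoc assumptions, not a calibrated test:

```python
# Sketch: compare f(x,y,z) f(z) with f(x,z) f(y,z) under kernel estimates.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(6)
z = rng.standard_normal(500)
x = z + 0.3 * rng.standard_normal(500)
y = z + 0.3 * rng.standard_normal(500)

f_xyz = gaussian_kde(np.vstack([x, y, z]))
f_xz = gaussian_kde(np.vstack([x, z]))
f_yz = gaussian_kde(np.vstack([y, z]))
f_z = gaussian_kde(z)

pts = np.vstack([x, y, z])
lhs = f_xyz(pts) * f_z(z)
rhs = f_xz(pts[[0, 2]]) * f_yz(pts[[1, 2]])
print(np.mean(np.abs(lhs - rhs)))   # small here: X _||_ Y | Z by design
```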
Extensions
⊲ Structures in which latent variables are allowed:
◮ FCI algorithm (Spirtes et al. 2000: 145)
◮ Use X ◦→ Y to denote a possible common latent variable between X and Y
◮ CMC and FC are still used for causal search
⊲ Structures in which feedbacks (loops) are allowed:
◮ CCD algorithm (Richardson and Spirtes 1999)
◮ Directed Cyclic Graphs instead of DAGs
◮ CMC does not hold anymore, but the d-separation criterion is still valid to detect C.I. (global directed Markov condition)
Graphical Models for SVAR Identification
⊲ Graphical causal search applied to u_t = (u_{1t}, …, u_{kt})':
⊲ Test conditional independence among u_{1t}, …, u_{kt}:
◮ in case of Gaussianity: Fisher's z or Wald tests on zero partial correlations (see Moneta et al. 2011);
◮ alternative semi-parametric or nonparametric methods in case of non-Gaussianity (see Moneta et al. 2011).
⊲ Apply a search algorithm (e.g. the PC algorithm):
◮ Build a complete undirected graph among u_{1t}, …, u_{kt};
◮ Recursively eliminate edges using C.I. tests among u_{1t}, …, u_{kt};
◮ Identify unshielded colliders;
◮ Identify chains;
◮ Avoid cycles.
Example
◮ Moneta (2008): King-Stock-Plosser-Watson (1991) data set: y_t = (C, I, M, Y, R, ∆P)'
◮ Taking into account non-stationarity / cointegration
◮ Get the matrix of residuals û_t
Results from King et al. (1991) data set:
[Figure: the graph over R, I, Y, M, ∆P, C obtained from the C.I. tests on the VAR residuals.]
The configurations R → I ← Y and R → I ← C are excluded.
Results from King et al. (1991) data set:
Example of one member of the set of possible causal graphs:
[Figure: one admissible causal graph, including the directed edges R → I → Y and M → ∆P.]
Impulse response functions
[Figure: estimated impulse response functions of Y over 27 lags. Panels: responses of Y to Y, of Y to M, of Y to I, and of Y to C.]
Impulse response functions
[Figure: estimated impulse response functions of Y over 27 lags. Panels: responses of Y to ∆P and of Y to R.]
Independent Component Analysis
References
Hyvärinen, A., J. Karhunen, E. Oja (2001), Independent Component Analysis, Wiley.
Shimizu, S., P.O. Hoyer, A. Hyvärinen, A. Kerminen (2006), "A linear non-Gaussian acyclic model for causal discovery", Journal of Machine Learning Research.
ICA-based search
⊲ Assumptions of the method based on Independent Component Analysis:
◮ the structural shocks ε_t are non-normal;
◮ the elements of ε_t: ε_{1t}, …, ε_{kt} are mutually independent;
◮ linearity;
◮ some general assumptions on the causal structure (e.g. acyclicity + causal sufficiency), which are, however, not strictly necessary.
◮ Note: the Faithfulness condition is not needed here.
ICA-based search
⊲ Search methods: LiNGAM (Shimizu et al. 2006) applied to VAR models (cfr. also Hyvärinen et al. 2008 and Moneta et al. 2013), and LiNG (Lacerda et al. 2008).
1. Estimate the VAR model y_t = D_1 y_{t-1} + … + D_p y_{t-p} + u_t. Check whether the residuals are non-Gaussian. Denote by Û the (K × T) matrix of estimated residuals.
2. Use FastICA (Hyvärinen et al. 2001) to obtain Û = P Ê, where P is (K × K) and Ê is (K × T), such that the rows of Ê are the independent components of Û.
Note: FastICA finds the transformation of the data that maximizes (approximate) negentropy, which is equivalent to minimizing mutual information.
Note: the order and scaling of the independent components (the output of the FastICA algorithm) is left undetermined.
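A sketch of step 2 with scikit-learn's FastICA on simulated non-Gaussian residuals; the mixing matrix and the Laplace shocks are illustrative assumptions:

```python
# Sketch: FastICA recovers the independent components of the residuals
# up to permutation and scaling.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(7)
T, K = 1000, 3
E = rng.laplace(size=(T, K))             # non-Gaussian structural shocks
P_true = np.array([[1.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.2, 0.3, 1.0]])
U_hat = E @ P_true.T                     # residuals u_t = P eps_t

ica = FastICA(n_components=K, whiten="unit-variance", random_state=0)
E_hat = ica.fit_transform(U_hat)         # estimated independent components
P_hat = ica.mixing_                      # estimate of P (up to perm./scale)
```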
Note: Negentropy
Differential entropy:
$$H(u) = -\int f(u) \log f(u)\, du$$
is a measure of "unpredictability": among random variables with a given covariance matrix, Gaussian variables have the largest differential entropy.
Negentropy:
$$J(u) = H(u_{gauss}) - H(u),$$
where u_gauss is a Gaussian random variable with the same covariance matrix as u.
Note: Mutual Information
Kullback-Leibler divergence between f(x) and g(x):
$$\int_{-\infty}^{+\infty} f(x) \log \frac{f(x)}{g(x)}\, dx$$
Mutual information between u_1, …, u_k:
$$I(u_1, \dots, u_k) = \sum_{i=1}^{k} H(u_i) - H(u)$$
N.B. it is equivalent to the KL divergence between the joint density f(u) and the product of the marginal densities f(u_1) f(u_2) ⋯ f(u_k).
ICA-based search
⊲ Acyclic case: LiNGAM.
3. Let W̃_0 = P^{-1}. Find the permutation of the rows of W̃_0 which yields a matrix without any zeros on the main diagonal; the permutation sought is the one that minimizes Σ_i 1/|W̃_{0,ii}|.
4. Divide each row of the permuted matrix by its diagonal element, to obtain a matrix Ŵ_0 with all ones on the diagonal.
5. Let B̃ = I − Ŵ_0.
6. Find the permutation matrix Z which makes Z B̃ Z' as close as possible to strictly lower triangular (minimize the sum of squares of the permuted upper-triangular elements). Set the upper-triangular elements to zero, and permute back to obtain B̂, which now contains the acyclic contemporaneous structure.
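Steps 3-5 can be sketched compactly: the row permutation that minimizes Σ_i 1/|W̃_{0,ii}| is a linear assignment problem (Hungarian algorithm). The demo matrix below mirrors the worked example that follows, assuming α = 0.5, β = 0.3, γ = 0.8:

```python
# Sketch: LiNGAM steps 3-5 (row permutation, normalization, B = I - W0_hat).
import numpy as np
from scipy.optimize import linear_sum_assignment

def lingam_steps_3_to_5(W0_raw):
    # Step 3: choose the row permutation minimizing sum_i 1/|W0_ii|.
    cost = 1.0 / np.maximum(np.abs(W0_raw), 1e-12)  # guard exact zeros
    rows, cols = linear_sum_assignment(cost)
    W0 = W0_raw[rows[np.argsort(cols)], :]          # permuted rows
    # Step 4: divide each row by its diagonal element (unit diagonal).
    W0_hat = W0 / np.diag(W0)[:, None]
    # Step 5: contemporaneous effects.
    return np.eye(W0.shape[0]) - W0_hat

W0_raw = np.array([[1.0, 2.0, 0.6],      # scaled/permuted rows, as from ICA
                   [0.0, 0.0, 4.0],
                   [3.0, 0.0, 2.4]])
print(lingam_steps_3_to_5(W0_raw))       # recovers B with -0.5, -0.3, -0.8
```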
Example (steps 2-6). Suppose the actual structure is:
$$\begin{pmatrix} 1 & 0 & \gamma \\ \alpha & 1 & \beta \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} e_1 \\ e_2 \\ e_3 \end{pmatrix},
\quad\text{equivalently}\quad
\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} 0 & 0 & -\gamma \\ -\alpha & 0 & -\beta \\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ e_3 \end{pmatrix}$$
Causal order: u_3 → u_1 → u_2 (with u_3 → u_2 as well).
Suppose from step 2 one finds (rows arbitrarily scaled and permuted by ICA):
$$\begin{pmatrix} 2\alpha & 2 & 2\beta \\ 0 & 0 & 4 \\ 3 & 0 & 3\gamma \end{pmatrix}$$
From step 3 (rows permuted to obtain a zeroless diagonal):
$$\begin{pmatrix} 3 & 0 & 3\gamma \\ 2\alpha & 2 & 2\beta \\ 0 & 0 & 4 \end{pmatrix}$$
From steps 4-5 (divide each row by its diagonal element and take B̃ = I − Ŵ_0):
$$\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} 0 & 0 & -\gamma \\ -\alpha & 0 & -\beta \\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ e_3 \end{pmatrix}$$
Step 6 (permute the variables to the causal order, making B̂ strictly lower triangular):
$$\begin{pmatrix} u_3 \\ u_1 \\ u_2 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 \\ -\gamma & 0 & 0 \\ -\beta & -\alpha & 0 \end{pmatrix}\begin{pmatrix} u_3 \\ u_1 \\ u_2 \end{pmatrix} + \begin{pmatrix} e_3 \\ e_1 \\ e_2 \end{pmatrix}$$
ICA-based search
⊲ Cyclic case: LiNG.
3. Let W̃_0 = P^{-1}. Test the hypothesis of non-zero coefficients by bootstrap sampling, and prune out the statistically insignificant elements of W̃_0. Find all the row permutations of W̃_0 which yield a matrix with a zeroless diagonal.
Note: there might be several such candidates, depending on how sparse the pruned W̃_0 is.
4. Divide each row of the permuted matrix by its diagonal element, to obtain a matrix Ŵ_0 with all ones on the diagonal.
5. Let B̃ = I − Ŵ_0.
⊲ Final step (both for LiNG and LiNGAM):
◮ Calculate the lagged causal effects Ĉ_i = Ŵ_0 D̂_i (i = 1, …, p), following the mapping C_i = W D_i introduced above.