SLIDE 1

Probabilistic Graphical Models for Cellular Pathways

Florian Markowetz
florian.markowetz@molgen.mpg.de
Max Planck Institute for Molecular Genetics, Computational Diagnostics Group
Berlin, Germany

IPM workshop, Tehran, 2005 April

SLIDE 2
  • Cellular networks

Figure from http://array.mbb.yale.edu/yeast/transcription/

Florian Markowetz, Probabilistic Graphical Models for Cellular Pathways, 2005 April

SLIDE 3
  • Modelling networks

High-throughput assays can probe cells at a genome-wide scale. Very prominent: microarrays, which measure mRNA transcript quantities. We need to use probabilistic models, which account for

  • measurement noise,
  • variability in the biological system, and
  • aspects of the system not captured by the model.

SLIDE 4
  • Clustering by coexpression

[Figure: heatmap of clustered expression data; the axis labels are sample IDs and Affymetrix probe-set IDs.]

Assumption: coexpression ∼ coregulation.

If genes show the same expression profiles, they follow the same regulatory regimes [7, 25].

SLIDE 5
  • Correlation graphs

An expression profile is a random vector $X = (X_1, \dots, X_p)$.

Correlation graph: Depict genes as vertices of a graph and draw an edge (i, j) iff the correlation coefficient $\rho_{ij} \neq 0$.

Advantage: This representation of the marginal dependence structure is easy to interpret and can be accurately estimated even if p ≫ N.

Application: Stuart et al. [28] build a graph from coexpression across multiple organisms.
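
The correlation graph can be sketched in a few lines. This is only an illustration: the cutoff `threshold` is a hypothetical stand-in for a proper test of $\rho_{ij} \neq 0$, and the function names and toy profiles are assumptions, not part of the talk.

```python
import math

def pearson(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_graph(profiles, threshold=0.8):
    """Return edges (i, j), i < j, whose absolute correlation exceeds the cutoff.

    profiles: one list of N measurements per gene.
    threshold: hypothetical cutoff standing in for a test of rho_ij != 0.
    """
    p = len(profiles)
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(pearson(profiles[i], profiles[j])) > threshold]

# Toy data: gene 1 tracks gene 0, gene 2 is unrelated to both.
genes = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.9],
    [3.0, 1.0, 4.0, 1.0, 3.0],
]
edges = correlation_graph(genes)
```

Only pairwise quantities are needed, which is why this graph stays estimable when p ≫ N.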

SLIDE 6
  • Problems of correlation based approaches

We cannot distinguish direct from indirect dependencies! Three reasons why X, Y, and Z can be highly correlated:

[Figure: three alternative graphs over X, Y, and Z; the third also contains a hidden variable H.]

As a cure: search for correlations that cannot be explained by other variables.

SLIDE 7
  • Overview
  • 1. Gaussian graphical models
      • conditional independence
      • partial correlations
  • 2. Bayesian networks
      • d-separation
      • PC algorithm
      • equivalence of networks
  • 3. Bayesian structure learning
      • marginal likelihood
      • search strategies

SLIDE 8
  • Part I. Gaussian graphical models

SLIDE 9
  • Conditional independence

Let X, Y, Z be random variables with joint distribution P. X is conditionally independent of Y given Z, written $X \perp\!\!\!\perp Y \mid Z$, if and only if

$$P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z) \cdot P(Y = y \mid Z = z),$$

or equivalently

$$P(X = x \mid Y = y, Z = z) = P(X = x \mid Z = z).$$

SLIDE 10
  • Conditional independence: interpretation

Interpret random variables as abstract pieces of knowledge obtained from, say, reading books [16]. Then $X \perp\!\!\!\perp Y \mid Z$ means:

Knowing Z, reading Y is irrelevant for reading X.

If I already know Z, then Y offers me no new information to understand X.

SLIDE 11
  • Conditional independence in Gaussian models
  • Consider a random vector $X = (X_1, \dots, X_p)$.
  • Assume that $X \sim N(\mu, \Sigma)$, where Σ is regular (invertible).
  • Let $K = \Sigma^{-1}$ be the concentration matrix of the distribution (also known as the precision matrix).

Then it holds for $i, j \in \{1, \dots, p\}$ with $i \neq j$ that

$$X_i \perp\!\!\!\perp X_j \mid X_{\text{rest}} \iff k_{ij} = 0,$$

where $\text{rest} = \{1, \dots, p\} \setminus \{i, j\}$ [16].

SLIDE 12
  • Gaussian Graphical models (GGM)

Given a random vector $X = (X_1, \dots, X_p)$. A Gaussian graphical model [16, 6] is an undirected graph on vertex set V, with |V| = p. To each vertex $i \in V$ corresponds a random variable $X_i \in X$.

Draw an edge between vertices i and j if and only if $k_{ij} \neq 0$.

Note: In correlation graphs we modeled via Σ; in GGMs we use $K = \Sigma^{-1}$.
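
A minimal sketch of reading a GGM off a known covariance matrix. The toy matrix, the tolerance `tol`, and the helper names are assumptions for illustration; with real data, K has to be estimated and entries are tested rather than compared to zero. The partial correlation uses the standard identity $\rho_{ij \mid \text{rest}} = -k_{ij} / \sqrt{k_{ii} k_{jj}}$.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def partial_correlation(K, i, j):
    """Partial correlation of X_i and X_j given all other variables."""
    return -K[i][j] / math.sqrt(K[i][i] * K[j][j])

def ggm_edges(sigma, tol=1e-9):
    """Edges (i, j), i < j, with a nonzero entry k_ij in K = sigma^{-1}."""
    K = invert(sigma)
    p = len(K)
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(K[i][j]) > tol]

# Toy covariance of a chain X1 - X2 - X3: X1 and X3 are correlated (0.25),
# but only through X2.
sigma = [[1.0, 0.5, 0.25],
         [0.5, 1.0, 0.5],
         [0.25, 0.5, 1.0]]
K = invert(sigma)
edges = ggm_edges(sigma)
```

In this example a correlation graph would connect all three variables, while the GGM drops the indirect edge between the first and the third.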

SLIDE 13
  • Example of a GGM

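[Figure: undirected graph on vertices 1 to 4 with edges 1-2, 1-3, and 3-4.]

Missing edges indicate independencies $X_i \perp\!\!\!\perp X_j \mid X_{\text{rest}}$:

  • $X_1 \perp\!\!\!\perp X_4 \mid \{X_2, X_3\}$
  • $X_2 \perp\!\!\!\perp X_3 \mid \{X_1, X_4\}$
  • $X_2 \perp\!\!\!\perp X_4 \mid \{X_1, X_3\}$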

SLIDE 14
  • Estimation from data

Likelihood:

$$n(x; K) = (2\pi)^{-p/2} \, |K|^{1/2} \exp\left(-\tfrac{1}{2}\, x^T K x\right)$$

  • Test the null hypothesis $k_{ij} = 0$ versus the alternative $k_{ij} \neq 0$.
  • The null hypothesis constrains the precision matrix K,
  • the alternative leaves K unconstrained.

The likelihood ratio test statistic is asymptotically $\chi^2$ distributed [16].
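
As an illustration of such a test, the sketch below assumes the standard single-edge exclusion deviance against the saturated model, $-N \log(1 - \hat\rho^2_{ij \mid \text{rest}})$, referred to a $\chi^2$ distribution with one degree of freedom. The slide does not spell out the statistic, so the formula, the function name, and the example numbers are assumptions.

```python
import math

def edge_exclusion_test(partial_corr, n_samples):
    """Likelihood ratio test of k_ij = 0 for one edge.

    partial_corr: estimated partial correlation of X_i and X_j given the rest.
    n_samples: number of observations N.
    Returns the deviance and its asymptotic chi-square (1 df) p-value.
    """
    deviance = -n_samples * math.log(1.0 - partial_corr ** 2)
    # Survival function of chi-square with 1 df: P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(deviance / 2.0))
    return deviance, p_value

dev_weak, p_weak = edge_exclusion_test(0.05, 50)
dev_strong, p_strong = edge_exclusion_test(0.6, 50)
```

With N = 50, a partial correlation of 0.05 gives no evidence against the null hypothesis, whereas 0.6 rejects it clearly.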

SLIDE 15
  • What if p ≫ N?

Full conditional relationships can only be estimated accurately if the number of samples N is relatively large compared to the number of variables p. Thus, if p ≫ N, you can

  • either improve your estimators of partial correlations (e.g., Schäfer and Strimmer [23] use the Moore-Penrose pseudoinverse and bootstrap aggregation (bagging) to stabilize the estimator),
  • or resort to a simpler model.

SLIDE 16
  • Sparse graphical Gaussian modeling

Do not condition on the complete rest as in GGMs. Instead, explore the dependency of two variables conditioned on a third [30, 31, 17, 5].

Draw an edge between vertices i and j ($i \neq j$) if and only if the correlation coefficient $\rho_{ij} \neq 0$ and no third variable can explain the correlation:

$$X_i \not\perp\!\!\!\perp X_j \mid X_k \quad \text{for all } k \in \text{rest},$$

where again $\text{rest} = \{1, \dots, p\} \setminus \{i, j\}$.
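
For Gaussian data, the first-order condition can be checked with the usual formula $\rho_{ij \mid k} = (\rho_{ij} - \rho_{ik}\rho_{jk}) / \sqrt{(1 - \rho_{ik}^2)(1 - \rho_{jk}^2)}$. The sketch below assumes a known correlation matrix and uses a hypothetical cutoff `threshold` in place of a statistical test; both are illustrative choices, not the procedure of the cited papers.

```python
import math

def first_order_partial(r_ij, r_ik, r_jk):
    """Partial correlation of X_i and X_j given a single variable X_k."""
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

def first_order_edges(R, threshold=0.1):
    """Edges (i, j) that survive marginally and given every single third variable.

    R: correlation matrix as a list of lists.
    threshold: hypothetical cutoff standing in for a test of "not equal to zero".
    """
    p = len(R)
    edges = []
    for i in range(p):
        for j in range(i + 1, p):
            if abs(R[i][j]) <= threshold:
                continue  # marginally uncorrelated: no edge
            rest = [k for k in range(p) if k not in (i, j)]
            if all(abs(first_order_partial(R[i][j], R[i][k], R[j][k])) > threshold
                   for k in rest):
                edges.append((i, j))
    return edges

# Chain X1 - X2 - X3: the correlation between X1 and X3 is explained by X2.
R = [[1.0, 0.5, 0.25],
     [0.5, 1.0, 0.5],
     [0.25, 0.5, 1.0]]
edges = first_order_edges(R)
```

Each check involves only three variables at a time, so no p × p matrix has to be inverted.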

SLIDE 18
  • Summary of part I

We have seen methods to build graphs from

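  • 1. marginal independencies: $X_i \perp\!\!\!\perp X_j$,
  • 2. full conditional independence: $X_i \perp\!\!\!\perp X_j \mid X_{\{1, \dots, p\} \setminus \{i, j\}}$,
  • 3. first-order independencies: $X_i \perp\!\!\!\perp X_j \mid X_k \;\; \forall k \in \{1, \dots, p\} \setminus \{i, j\}$.

Where does this lead us?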

SLIDE 20
  • Include all higher order dependencies

Draw an edge between vertices i and j if

$$X_i \not\perp\!\!\!\perp X_j \mid X_S \quad \text{for all } S \subseteq \{1, \dots, p\} \setminus \{i, j\}.$$

This includes testing marginal, first-order and full conditional independencies.

In the next part we will see:

  • It will be possible to direct some of the edges.
  • The resulting probabilistic model is a Bayesian network.
  • Causation instead of just correlation [21, 26].

SLIDE 21
  • Part II. Bayesian networks

SLIDE 25
  • Factorization of joint distribution

Given a random vector $X = (X_1, \dots, X_p)$, we can always decompose

$$p(x) = p(x_1, \dots, x_p) = p(x_1, \dots, x_{p-1}) \, p(x_p \mid x_1, \dots, x_{p-1}) = p(x_1) \prod_{v=2}^{p} p(x_v \mid x_1, \dots, x_{v-1}).$$

Example: $p(x_1, x_2, x_3) = p(x_1) \, p(x_2 \mid x_1) \, p(x_3 \mid x_1, x_2)$

[Figure: directed graph on nodes 1, 2, 3.]

⇒ completely connected directed acyclic graph

SLIDE 26
  • Bayesian network

A Bayesian network for a random vector X consists of

  • 1. a network structure
      • directed acyclic graph (DAG) on vertex set V,
      • node v corresponds to variable $X_v$,
  • 2. a set of probability distributions
      • locally: conditional distribution of a gene given its parents,
      • such that globally

$$p(x) = \prod_{v \in V} p(x_v \mid x_{\mathrm{pa}(v)}, \theta_v)$$

SLIDE 30
  • Questions
  • 1. What do the local probability distributions look like?

→ Conditional Gaussian networks

  • 2. How is conditional independence defined for directed models?

→ Global Directed Markov Property

  • 3. How can we learn a Bayesian network structure from data?

→ Constraint-based algorithm (and a Bayesian approach in Part III)

  • 4. Are there natural limits in structure learning?

→ Equivalence of network structures

SLIDE 33
  • Children depend on parents

The DAG defines families. Relationships are further characterized by local probability distributions:

[Figure: DAG X → Z ← Y with states X ∈ {0, 1}, Y ∈ {0, 1}, Z ∈ {0, 1, 2}.]

p(x) = (0.6 0.4)
p(y) = (0.2 0.8)

p(z | x, y) =
  (0.8 0.1 0.1) if (X, Y) = (0, 0)
  (0.1 0.8 0.1) if (X, Y) = (0, 1)
  (0.1 0.8 0.1) if (X, Y) = (1, 0)
  (0.1 0.1 0.8) if (X, Y) = (1, 1)
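
The local distributions above determine the full joint distribution via p(x, y, z) = p(x) p(y) p(z | x, y). A small sketch using the numbers from the slide; the variable and function names are chosen here for illustration.

```python
from itertools import product

# Local probability distributions of the family X -> Z <- Y.
p_x = [0.6, 0.4]
p_y = [0.2, 0.8]
p_z_given_xy = {
    (0, 0): [0.8, 0.1, 0.1],
    (0, 1): [0.1, 0.8, 0.1],
    (1, 0): [0.1, 0.8, 0.1],
    (1, 1): [0.1, 0.1, 0.8],
}

def joint(x, y, z):
    """Joint probability from the DAG factorization p(x) p(y) p(z | x, y)."""
    return p_x[x] * p_y[y] * p_z_given_xy[(x, y)][z]

# The joint sums to one, and marginals follow by summing out variables.
total = sum(joint(x, y, z) for x, y, z in product(range(2), range(2), range(3)))
p_z = [sum(joint(x, y, z) for x in range(2) for y in range(2)) for z in range(3)]
```

Here `p_z` evaluates to approximately (0.184, 0.492, 0.324).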

SLIDE 34
  • Local probability distributions I

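Discrete node with discrete parents

$$X_v \mid x_{\mathrm{pa}(v)}, \theta_v \sim \mathrm{Multin}(1, \theta_{v \mid x_{\mathrm{pa}(v)}})$$

Parametrization: $\theta_v = \{\theta_{v \mid x_{\mathrm{pa}(v)}}\}$ is a set of probability vectors, one for each configuration $x_{\mathrm{pa}(v)}$ of parents of v.

Density [12]:

$$p(x_v \mid x_{\mathrm{pa}(v)}, \theta_v) = \prod_{x'_v} \theta_{x'_v \mid x_{\mathrm{pa}(v)}}^{\mathbf{1}(x'_v = x_v)}$$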

SLIDE 35
  • Local probability distributions II

Continuous node with continuous parents

$$X_v \mid x_{\mathrm{pa}(v)}, \theta_v \sim N(\mu_v, \sigma_v^2), \quad \text{where } \mu_v = \beta_v^{(0)} + \sum_{i \in \mathrm{pa}(v)} \beta_v^{(i)} x_i.$$

Parametrization: $\theta_v = (\beta_v, \sigma_v^2)$ contains a vector of regression coefficients and a variance for node v.

Density:

$$p(x_v \mid x_{\mathrm{pa}(v)}, \theta_v) = \frac{1}{\sqrt{2\pi}\,\sigma_v} \exp\left(-\frac{(x_v - \mu_v)^2}{2\sigma_v^2}\right)$$

SLIDE 36
  • Local probability distributions III

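Continuous node with mixed parents

Calling continuous variables Y and discrete variables I [16], we can write

$$Y_v \mid i_{\mathrm{pa}(v)}, y_{\mathrm{pa}(v)}, \theta_v \sim N(\mu_{v \mid i_{\mathrm{pa}(v)}}, \sigma^2_{v \mid i_{\mathrm{pa}(v)}}), \quad \text{where } \mu_{v \mid i_{\mathrm{pa}(v)}} = \beta^{(0)}_{i_{\mathrm{pa}(v)}} + \sum_{i \in \mathrm{pa}(v)} \beta^{(i)}_{i_{\mathrm{pa}(v)}} x_i,$$

and the sum runs over the continuous parents of v.

Parametrization: $\theta_v = (\beta_{v \mid i_{\mathrm{pa}(v)}}, \sigma^2_{v \mid i_{\mathrm{pa}(v)}})$ contains a vector of regression coefficients and a variance for node v, which depend on the state of the discrete parents [1].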

SLIDE 37
  • Conditional Gaussian networks

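We can combine the different local probability distributions (LPDs) in the framework of conditional Gaussian (CG) networks:

[Figure: example network with three discrete (d) and two continuous (c) nodes.]

The random vector X has a discrete part I and a continuous part Y, and the distribution decomposes as

$$p(x) = p(i, y) = p(i) \, p(y \mid i).$$

These are the general parametric networks used in statistics [16].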

SLIDE 39
  • Conditional Independence I

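[Figure: X → Y → Z]

Chain/linear: $X \perp\!\!\!\perp Z \mid Y$ and $X \not\perp\!\!\!\perp Z \mid \emptyset$

$$p(x, z \mid y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x) \, p(y \mid x) \, p(z \mid y)}{p(y)} = p(x \mid y) \, p(z \mid y)$$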

SLIDE 41
  • Conditional Independence II

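[Figure: X ← Y → Z]

Fork/diverging: $X \perp\!\!\!\perp Z \mid Y$ and $X \not\perp\!\!\!\perp Z \mid \emptyset$

$$p(x, z \mid y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x \mid y) \, p(y) \, p(z \mid y)}{p(y)} = p(x \mid y) \, p(z \mid y)$$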

SLIDE 43
  • Conditional Independence III

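[Figure: X → Y ← Z]

Collider/converging: $X \perp\!\!\!\perp Z \mid \emptyset$ and $X \not\perp\!\!\!\perp Z \mid Y$

$$p(x, y, z) = p(x) \, p(y \mid x, z) \, p(z) = p(x) \, p(z) \, \frac{p(x, y, z)}{p(x, z)},$$

hence $p(x, z) = p(x) \, p(z)$.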
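
The collider case can be checked numerically. The sketch below uses a made-up binary network X → Y ← Z (the probabilities are illustrative assumptions, not from the talk) and shows that X and Z are marginally independent but become dependent once Y is observed.

```python
from itertools import product

# Hypothetical local distributions for the collider X -> Y <- Z.
p_x = [0.6, 0.4]
p_z = [0.2, 0.8]

def p_y1(x, z):
    """P(Y = 1 | x, z): Y is likely to be on if at least one parent is on."""
    return 0.9 if (x or z) else 0.1

def joint(x, y, z):
    py1 = p_y1(x, z)
    return p_x[x] * p_z[z] * (py1 if y == 1 else 1.0 - py1)

def prob(**fixed):
    """Probability of the event that fixes the given variables, e.g. prob(x=1, y=1)."""
    return sum(joint(x, y, z)
               for x, y, z in product((0, 1), repeat=3)
               if all({"x": x, "y": y, "z": z}[name] == value
                      for name, value in fixed.items()))

# Marginal independence: p(x, z) equals p(x) p(z) for all states.
marginal_gap = max(abs(prob(x=x, z=z) - prob(x=x) * prob(z=z))
                   for x in (0, 1) for z in (0, 1))

# Conditional dependence: given Y = 1, learning Z = 0 changes the belief in X = 1.
p_x1_given_y1 = prob(x=1, y=1) / prob(y=1)
p_x1_given_y1_z0 = prob(x=1, y=1, z=0) / prob(y=1, z=0)
```

In this toy network, P(X = 1 | Y = 1) is about 0.45, while P(X = 1 | Y = 1, Z = 0) is about 0.86: if Z cannot account for Y being on, X has to.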

SLIDE 44
  • PC algorithm, part 1

How to find the skeleton of a Bayesian network [26, 21]

Form the complete undirected graph on node set {1, . . . , p}. For each pair of variables $X_i$ and $X_j$:

  • 1. Remove the edge i ∼ j iff there exists a subset $S \subseteq \{1, \dots, p\} \setminus \{i, j\}$ such that $X_i \perp\!\!\!\perp X_j \mid X_S$.
  • 2. Start with S = ∅, then continue for increasing |S|.
  • 3. This includes testing marginal, first-order and full conditional independencies.
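
The skeleton search above can be sketched as follows. The independence oracle `indep` is an assumption: in practice it would be a statistical test on data, here it is hard-coded for a toy collider. The sketch searches all subsets of the remaining variables exactly as stated on the slide; efficient PC implementations restrict S to current neighbors of i or j.

```python
from itertools import combinations

def pc_skeleton(p, indep):
    """Skeleton search with a conditional independence oracle.

    p: number of variables, labelled 0 .. p-1.
    indep(i, j, S): assumed oracle returning True iff X_i and X_j are
        independent given the variables in the set S.
    Returns the remaining undirected edges and the separating sets found.
    """
    edges = {frozenset(pair) for pair in combinations(range(p), 2)}
    sepsets = {}
    for i, j in combinations(range(p), 2):
        rest = [k for k in range(p) if k not in (i, j)]
        # Start with S = {} and continue for increasing |S|.
        for size in range(len(rest) + 1):
            hit = next((set(s) for s in combinations(rest, size)
                        if indep(i, j, set(s))), None)
            if hit is not None:
                edges.discard(frozenset((i, j)))
                sepsets[(i, j)] = hit
                break
    return edges, sepsets

def oracle(i, j, S):
    """Toy oracle for the collider 0 -> 2 <- 1: 0 and 1 are independent
    unless we condition on their common child 2."""
    return {i, j} == {0, 1} and 2 not in S

skeleton, sepsets = pc_skeleton(3, oracle)
```

The separating sets are recorded because the orientation step needs them.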

SLIDE 45
  • PC algorithm, part 2

How to direct the edges [26, 21]

Once we have the skeleton, we can start putting directions on the edges.

First, identify v-structures: orient X - Y - Z (with X and Z not adjacent) into X → Y ← Z whenever $X \not\perp\!\!\!\perp Z \mid Y$.

Second, direct as many edges as possible while respecting acyclicity and the independence constraints from step 1.
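
A sketch of the first orientation step. It assumes the common implementation of the criterion: for non-adjacent X and Z with a shared neighbor Y, the triple is oriented as a collider when Y is not in the separating set that removed the edge X - Z. The input format (edge set plus separating sets) and the toy inputs are assumptions for illustration.

```python
from itertools import combinations

def orient_v_structures(p, skeleton, sepsets):
    """Return directed edges (a, b), meaning a -> b, for every detected collider.

    skeleton: set of frozensets {i, j}, the undirected edges.
    sepsets: dict mapping each non-adjacent pair (i, j), i < j, to the set
        that rendered X_i and X_j independent.
    """
    directed = set()
    for x, z in combinations(range(p), 2):
        if frozenset((x, z)) in skeleton:
            continue  # v-structures require X and Z to be non-adjacent
        for y in range(p):
            if y in (x, z):
                continue
            shared = (frozenset((x, y)) in skeleton
                      and frozenset((y, z)) in skeleton)
            if shared and y not in sepsets[(x, z)]:
                directed.add((x, y))
                directed.add((z, y))
    return directed

# Collider 0 -> 2 <- 1: the pair (0, 1) was separated by the empty set.
collider = orient_v_structures(
    3, {frozenset({0, 2}), frozenset({1, 2})}, {(0, 1): set()})

# Chain 0 - 1 - 2: the pair (0, 2) was separated by {1}, so nothing is oriented.
chain = orient_v_structures(
    3, {frozenset({0, 1}), frozenset({1, 2})}, {(0, 2): {1}})
```

For the chain, both edges stay undirected after this step, which is the equivalence issue discussed next.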

SLIDE 46
  • Equivalence of Networks

Two structures are equivalent if both represent the same set of independence assertions.

[Figure: six network structures over X, Y, and Z.]
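
Equivalence can be checked without enumerating independence statements, using the known graphical criterion that two DAGs on the same nodes are equivalent exactly when they share the same skeleton and the same v-structures. The sketch below represents a DAG as a set of (parent, child) pairs; this representation and the function names are illustrative assumptions.

```python
def skeleton(dag):
    """Undirected version of the DAG: one frozenset per edge."""
    return {frozenset(edge) for edge in dag}

def v_structures(dag):
    """All triples (a, c, b) with a -> c <- b and a, b not adjacent."""
    skel = skeleton(dag)
    found = set()
    for a, c in dag:
        for b, c2 in dag:
            if c2 == c and a < b and frozenset((a, b)) not in skel:
                found.add((a, c, b))
    return found

def markov_equivalent(dag1, dag2):
    """True iff both DAGs (on the same node set) encode the same independencies."""
    return (skeleton(dag1) == skeleton(dag2)
            and v_structures(dag1) == v_structures(dag2))

chain = {("X", "Y"), ("Y", "Z")}           # X -> Y -> Z
reverse_chain = {("Z", "Y"), ("Y", "X")}   # X <- Y <- Z
fork = {("Y", "X"), ("Y", "Z")}            # X <- Y -> Z
collider = {("X", "Y"), ("Z", "Y")}        # X -> Y <- Z

same_class = markov_equivalent(chain, fork) and markov_equivalent(chain, reverse_chain)
collider_differs = not markov_equivalent(chain, collider)
```

The two chains and the fork cannot be told apart from observational data alone, while the collider forms its own class.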

SLIDE 47
  • Part III. Bayesian structure learning

SLIDE 48
  • Situation

Model: We assume that the dependency structure of a random vector X follows an unknown DAG D. The distribution p(x) is conditional Gaussian and factors according to D.

Data: We observe independent and identically distributed data $d = \{x_1, \dots, x_N\}$. Each observation is a realization of X.

Goal: Estimate D from d.

SLIDE 49
  • Being Bayesian about structure learning
  • 1. Score model: devise a scoring function that evaluates each network with respect to the training data.
  • 2. Search for best model: search for the optimal network according to this score.
  • 3. Assess model uncertainty: use MCMC or bootstrap.

SLIDE 52
  • Scoring metric for networks

The posterior distribution of structure and parameters given data is

$$p(D, \theta \mid d) \propto p(d \mid D, \theta) \cdot p(\theta \mid D) \cdot p(D).$$

Integrating out nuisance parameters yields

$$p(D \mid d) \propto p(D) \cdot \int p(d \mid D, \theta) \, p(\theta \mid D) \, d\theta.$$

The right-hand side will be our score for network fitness. It consists of a structure prior p(D) and a marginal likelihood p(d | D).

SLIDE 53
  • A local view of marginal likelihood

We zoom in on one discrete family of nodes with a fixed configuration of parents. Assuming parameter independence [13], we will solve the integral

$$p(\text{batch} \mid D) = \int p(\text{batch} \mid D, \theta) \, p(\theta \mid D) \, d\theta,$$

where “batch” means the part of the data d corresponding to this one family. To solve it analytically, we need priors that fit the likelihood.

SLIDE 54
  • Conjugate priors

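Discrete part: multinomial likelihood with Dirichlet prior:

$$p(\text{batch} \mid D, \theta) = \prod_k \theta_k^{n_k}, \qquad p(\theta \mid D) = \frac{\Gamma(\alpha_+)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1},$$

where $\alpha_+ = \sum_k \alpha_k$.

Mixed part: Gaussian likelihood with Normal-inverse-$\chi^2$ prior. The data likelihood is multivariate Normal, the vector of regression coefficients β has a Normal prior, and the variance $\sigma^2$ has an inverse-$\chi^2$ prior [1, 18].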

SLIDE 55
  • Marginal likelihood of discrete family

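$$p(\text{batch} \mid D) = \int p(\text{batch} \mid D, \theta) \, p(\theta \mid D) \, d\theta = \frac{\Gamma(\alpha_+)}{\prod_k \Gamma(\alpha_k)} \int_\Theta \prod_k \theta_k^{n_k + \alpha_k - 1} \, d\theta = \frac{\Gamma(\alpha_+)}{\prod_k \Gamma(\alpha_k)} \cdot \frac{\prod_k \Gamma(\alpha_k + n_k)}{\Gamma(\alpha_+ + n_+)}$$

with counts $n_k$ and Dirichlet parameters $\alpha_k$, where $\alpha_+ = \sum_k \alpha_k$ and $n_+ = \sum_k n_k$.

For the marginal likelihood of the complete network, you have to multiply terms like this for all nodes and all configurations of discrete parents [3, 13].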
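
The closed form is best evaluated on the log scale to avoid overflow of the Gamma function. A minimal sketch for one family and one parent configuration; the function name and the example counts are illustrative assumptions.

```python
import math

def log_marginal_likelihood(counts, alphas):
    """Log of the Dirichlet-multinomial marginal likelihood of one batch.

    counts: observed counts n_k for each state of the node.
    alphas: Dirichlet parameters alpha_k of the prior.
    """
    alpha_plus = sum(alphas)
    n_plus = sum(counts)
    return (math.lgamma(alpha_plus) - math.lgamma(alpha_plus + n_plus)
            + sum(math.lgamma(a + n) - math.lgamma(a)
                  for a, n in zip(alphas, counts)))

# Binary node, uniform prior (alpha = 1, 1), two observations of state 0
# and one of state 1: the integral of theta^2 (1 - theta) is 1/12.
ml = math.exp(log_marginal_likelihood([2, 1], [1.0, 1.0]))
```

For a whole network, the log terms of all nodes and parent configurations are added, which corresponds to the product described above.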

SLIDE 56
  • Where are we?

For discrete networks, we have learned how to compute the marginal likelihood p(d | D). This is the right part of the score:

$$p(D \mid d) \propto p(D) \cdot \int p(d \mid D, \theta) \, p(\theta \mid D) \, d\theta.$$

To complete the score, we need a structure prior p(D). After that, we have to come up with a smart strategy to find high-scoring network structures.

SLIDE 57
  • Search for high scores

Exhaustive search: infeasible for more than 5 nodes! [22]

If the topological order of nodes is known: start with an empty network and iteratively add parents [3].

Hill climbing (with random restarts):

  • Start at a randomly chosen network D.
  • Score all neighbors (single edge deletions, insertions, inversions).
  • Repeat for the highest-scoring neighbor.
  • Runs into the next local maximum.

Simulated annealing: choose a suboptimal neighbor with decreasing probability.
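
To see why exhaustive search breaks down, one can count the labeled DAGs on n nodes with Robinson's recurrence [22], $a_n = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} 2^{k(n-k)} a_{n-k}$ with $a_0 = 1$. The function name below is chosen for illustration.

```python
from math import comb

def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    a = [1]  # a_0 = 1: the empty graph
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]

counts = [count_dags(n) for n in range(1, 7)]
# counts == [1, 3, 25, 543, 29281, 3781503]
```

Five nodes already give 29281 structures and six give more than 3.7 million, so heuristic search strategies are needed for realistic numbers of genes.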

SLIDE 58
  • On true models

A quote from Edwards [6]:

“Any method (or statistician) that takes a complex multivariate dataset and, from it, claims to identify one true model, is both naive and misleading.”

What we have found is just a simple model consistent with the data: nothing more, nothing less.

SLIDE 59
  • Assessing uncertainty

Predicting the best network tells us nothing about the robustness of the solution.

MCMC: Use Markov chain Monte Carlo to sample from the posterior distribution [14, 10].

Bootstrap: Computationally efficient approach to address confidence in network features [9, 11].

Bias-corrected bootstrap: Graphical models learned from bootstrap samples are biased towards overly complex models. Steck and Jaakkola [27] suggest a bootstrap procedure corrected for this bias.

SLIDE 63
  • A caveat [8]

If the expression of gene A is regulated by proteins B and C, then A’s expression level is a function of the joint activity levels of B and C. We treat the expression of A as a stochastic function of its regulators.

Problem 1: In most current biological data sets, however, we do not have access to measurements of protein activity levels.

Resort: Use expression levels of genes as a proxy for the activity level of the proteins they encode.

Problem 2: There are numerous examples where an activation or silencing of a regulator is carried out by posttranscriptional protein modifications.

SLIDE 64
  • Books on Graphical models
  • 1. Lauritzen: Graphical Models [16]
  • 2. Edwards: Introduction to Graphical Modelling [6]
  • 3. Pearl: Probabilistic Reasoning in Intelligent Systems [20]
  • 4. Cowell et al.: Probabilistic Networks and Expert Systems [4]
  • 5. Jordan: Learning in Graphical Models [15]

SLIDE 65
  • Software on Graphical models
  • 1. BNT [19] http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
  • 2. MGraph [29] http://folk.uio.no/junbaiw/mgraph/mgraph.html
  • 3. PNL https://sourceforge.net/projects/openpnl/
  • 4. GeneTS [23] http://www.stat.uni-muenchen.de/~strimmer/genets/
  • 5. DEAL [2] http://www.math.aau.dk/~dethlef/novo/deal/
  • 6. MIM [6] http://www.hypergraph.dk/
  • 7. TETRAD [26] http://www.phil.cmu.edu/projects/tetrad/

Much more at http://www.cs.ubc.ca/~murphyk/Software/BNT/bnsoft.html.

SLIDE 67
  • Summary
  • 1. Increasing order of resolution: clustering, graphical Gaussian models, Bayesian networks;
  • 2. Central concept: conditional independence;
  • 3. Learning structure: constraint-based approach and Bayesian scoring.

Thank you! Questions?

SLIDE 68
  • References

[1] Susanne Gammelgaard Bøttcher. Learning Bayesian Networks with Mixed Variables. PhD thesis, Aalborg University, Denmark, 2004.

[2] Susanne Gammelgaard Bøttcher and Claus Dethlefsen. deal: A package for learning Bayesian networks. Journal of Statistical Software, 8(20), 2003.

[3] Gregory F. Cooper and Edward Herskovits. A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning, 9:309–347, 1992.

[4] R.G. Cowell, A.P. Dawid, S.L. Lauritzen, and D.J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer-Verlag, New York, 1999.

[5] Alberto de la Fuente, Nan Bing, Ina Hoeschele, and Pedro Mendes. Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics, 20(18):3565–3574, 2004.

[6] David Edwards. Introduction to Graphical Modelling. Springer, 2000.

[7] MB Eisen, PT Spellman, PO Brown, and D Botstein. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A, 95(25):14863–8, Dec 1998.

[8] Nir Friedman. Inferring Cellular Networks Using Probabilistic Graphical Models. Science, 303(5659):799–805, 2004.

[9] Nir Friedman, Moises Goldszmidt, and Abraham Wyner. Data analysis with Bayesian networks: A bootstrap approach. In Uncertainty in Artificial Intelligence: Proceedings of the Fifteenth Conference (UAI-1999), pages 196–205, San Francisco, CA, 1999. Morgan Kaufmann Publishers.

[10] Nir Friedman and Daphne Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50:95–126, 2003.

[11] Nir Friedman, Michal Linial, Iftach Nachman, and Dana Pe’er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3):601–620, August 2000.

[12] A. Gelman, J. B. Carlin, H.S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall-CRC, 1996.

[13] David Heckerman, Dan Geiger, and David Maxwell Chickering. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning, 20(3):197–243, Sep. 1995.

[14] Dirk Husmeier. Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19(17):2271–2282, 2003.

[15] Michael I. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge, MA, 1999.

[16] Steffen L. Lauritzen. Graphical Models. Clarendon Press, Oxford, 1996.

[17] Paul M Magwene and Junhyong Kim. Estimating genomic coexpression networks using first-order conditional independence. Genome Biol, 5(12):R100, 2004.

[18] Florian Markowetz, Steffen Grossmann, and Rainer Spang. Probabilistic soft interventions in conditional Gaussian networks. In Robert Cowell and Zoubin Ghahramani, editors, Proc. Tenth International Workshop on Artificial Intelligence and Statistics, Jan 2005.

[19] Kevin Murphy. The Bayes Net Toolbox for Matlab. Computing Science and Statistics, 33, 2001.

[20] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[21] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, Cambridge, 2000.

[22] Robert W. Robinson. Counting labeled acyclic digraphs. In F. Harary, editor, New Directions in the Theory of Graphs, pages 239–273. Academic Press, New York, 1973.

[23] Juliane Schäfer and Korbinian Strimmer. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21(6):754–64, Mar 2005.

[24] Peter W. F. Smith and Joe Whittaker. Edge exclusion tests for graphical Gaussian models. In Michael Jordan, editor, Learning in Graphical Models, pages 555–574. MIT Press, 1999.

[25] PT Spellman, G Sherlock, MQ Zhang, VR Iyer, K Anders, MB Eisen, PO Brown, D Botstein, and B Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell, 9(12):3273–97, Dec 1998.

[26] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, second edition, 2000.

[27] Harald Steck and Tommi S. Jaakkola. Bias-corrected bootstrap and model uncertainty. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

[28] Joshua M Stuart, Eran Segal, Daphne Koller, and Stuart K Kim. A gene-coexpression network for global discovery of conserved genetic modules. Science, 302(5643):249–55, Oct 2003.

[29] Junbai Wang, Ola Myklebost, and Eivind Hovig. Mgraph: graphical models for microarray data analysis. Bioinformatics, 19(17):2210–2211, 2003.

[30] Anja Wille and Peter Bühlmann. Tri-graph: a novel graphical model with application to genetic regulatory networks. Technical report, Seminar for Statistics, ETH Zürich, 2004.

[31] Anja Wille, Philip Zimmermann, Eva Vranová, Andreas Fürholz, Oliver Laule, Stefan Bleuler, Lars Hennig, Amela Prelic, Peter von Rohr, Lothar Thiele, Eckart Zitzler, Wilhelm Gruissem, and Peter Bühlmann. Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol, 5(11):R92, 2004.