Introduction to Microarray Data Analysis and Gene Networks, lecture 8
Alvis Brazma, European Bioinformatics Institute
Lecture 8
- Gene networks – part 2
– Network topology (part 2)
– Network logics
– Network dynamics
Gene Networks - four levels of hierarchical description
- Parts list – genes, transcription factors,
promoters, binding sites, …
- Topology – a graph describing the connections
between the parts
- Control logics – how combinations of
regulatory signals interact (e.g., promoter logics)
- Dynamics – how does it all work in real time
The arcs can have different meanings
G1 → G2
- The product of gene G1 is a transcription factor that binds to the promoter of gene G2 (as observed in a ChIP-chip experiment) – physical interaction network (direct network)
G1 → G2
- The disruption of gene G1 changes the expression level of gene G2 – data interpretation network (indirect network)
How the two networks compare
- How much do the two networks have in common?
- We can look at the intersection of the networks and ask whether the common parts are supported by our existing knowledge
- Are the target sets of the transcription factors present in both networks similar?
- Are the network topology properties (e.g., connectivity) similar?
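A couple of simple notions
- Any gene (node in the graph) with outgoing edges is called a source gene
- Any gene with incoming edges is a target gene
- Target set – the set of target genes of a given source gene
[Figure: a source node with edges to its target nodes, which together form its target set]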
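The notions above can be sketched in a few lines of Python. This is only an illustration: the edge list and the gene names G1 to G5 are hypothetical, not data from the lecture.

```python
from collections import defaultdict

# Hypothetical directed edges (source gene, target gene); not lecture data.
edges = [("G1", "G2"), ("G1", "G3"), ("G4", "G5"), ("G4", "G1")]

def target_sets(edge_list):
    """Map each source gene to the set of its target genes."""
    targets = defaultdict(set)
    for source, target in edge_list:
        targets[source].add(target)
    return dict(targets)

tsets = target_sets(edges)
source_genes = set(tsets)                    # genes with outgoing edges
target_genes = {t for _, t in edges}         # genes with incoming edges
```

A gene such as G1 here can be both a source gene and a target gene.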
A problem:
- Both networks depend on the chosen significance threshold, i.e., what level of microarray signal to use to draw an edge in the network
The size of the networks for different significance thresholds
network | source genes | target genes | genes | edges | edges with same cellular role* | edges per source gene
mutant network (γ=3.0) | 226 | 3920 | 3959 | 10356 | 1507 (14.6%) | 45.6
mutant network (γ=2.5) | 236 | 4778 | 4798 | 17436 | 2425 (13.9%) | 73.8
mutant network (γ=2.0) | 250 | 5396 | 5654 | 32017 | 4096 (12.8%) | 135.7
ChIP network (p<0.001) | 169 | 2845 | 2930 | 6170 | 857 (13.9%) | 36.5
ChIP network (p<0.01) | 202 | 4939 | 4980 | 18842 | 3694 (19.6%) | 93.3
* edges where source gene and target gene have the same cellular role annotation in YPD (http://www.proteome.com)
Intersection of the networks – many connections are consistent with our a priori knowledge
[Figure 6: graph of the intersection of the two networks. Genes shown: YGR086C CCW6 SIC1 YLR194C CHS1 ARO1 CPA2 ARG10 MET22 STE12 FUS1 KAR4 STE2 GPA1 SST2 YAP1 GSH1 YLR460C SWI5 ARG5 ECM40 LEU4 GCN4 HOM3 CLB2 MBP1 SCW10 CIS3 MNN1 SWI4 GIC2 SWI6 YKL185W YPL158C YLR049C PST1 YHR149C YBR070C MNN5 SGA1 PCL1 PCL2 YER079W YHR150W YDR528W YLR297W YER128W SWE1 YPR157W YER078C PRY2 PLB3 SVS1 ABF1 RNR1 HCM1 MCD1 YLR103C DUN1 SMC3 RFA2 MUT5 SPT21 YLR104W YJR030C PDS1 YNL313C YOX1 UFE1 YDR115W CDC21 RAD27 PDS5 IRR1 DIN7 ERP3 YJL073W GIN4 YPL267W]
How do the ChIP-chip and disruption networks relate?
[Figure: Venn diagrams over the set of all genes, showing the transcription factors and the disrupted genes. First diagram: the regulation set of t and the effectual set of h. Second diagram: the regulation set of g and the effectual set of g]
How to estimate whether the overlap is larger than expected at random?
[Figure: the set G containing the sets R and E and their intersection R∩E]
We assume that the elements of the set E are marked, and pick a set of size |R| at random from G. Then the size x=|R∩E| of the intersection follows the hypergeometric distribution. The probability of observing an intersection of size k or larger can be computed according to the formula:
\[ P(x \ge k) = 1 - \sum_{i=0}^{k-1} \frac{\binom{|E|}{i}\binom{|G|-|E|}{|R|-i}}{\binom{|G|}{|R|}} \]
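A minimal sketch of this tail probability, using only the Python standard library. The function name and the argument names (set sizes rather than sets) are illustrative choices, not from the lecture.

```python
from math import comb

def overlap_p_value(G, R, E, k):
    """P(x >= k): draw R genes at random from G genes, E of which are marked.

    G, R, E are the set sizes |G|, |R|, |E|; x is the number of marked genes drawn.
    """
    total = comb(G, R)
    # Sum the probabilities of intersections of size 0 .. k-1 (capped at R).
    below = sum(comb(E, i) * comb(G - E, R - i) for i in range(min(k, R + 1)))
    return 1 - below / total
```

For example, `overlap_p_value(6000, 100, 150, 10)` would give the chance that two sets of 100 and 150 genes out of 6000 share at least 10 genes when one of them is picked at random (the numbers are made up for illustration).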
[Figure: Venn diagram over all genes showing the transcription factors and the disrupted genes, with the regulation set and the effectual set of g; the labels shown are 146, 213, 23 and (9)]
Of the 23 transcription factors studied in both networks, only 9 have target sets that overlap more than expected by chance
- Is it as bad as it may look?
– We would expect many indirect connections in the disruption network that are not present in the ChIP network – is this the case?
Direct vs. indirect interactions
[Figure: three genes X, Y and Z connected by two direct interactions and one indirect interaction]
[Figure: network fragment. Genes shown: GLN3 YHM1 ARO3 ARO1 ARG4 YJL200C CPA2 GCN4 LYS2 YAP1 YOL158C YAP6 FET4 ROX1 RNR1 GDH3 YBL029W ECM33 SWI5 SLY1 YDR451C UTR2 YER189W YER190W PMA1 YGL114W YHL029C YIL158W YJL051W CIS3 SUR7 CDC5 CLN1 SRL1 YOR248W YOR315W CLB2 NCE102 SWI6 YBR070C GIC2 YNL058C SVS1 NDD1 SOK2 SWI4 MBP1 HIS4 ADE3 ADE13 ADE17 ADE4 BAS1 RTG1]
- Is it as bad as it may look?
– We would expect many indirect connections in the disruption network that are not present in the ChIP network – is this the case? There is anecdotal evidence that it is
– What about the connections present in the ChIP network but not in the disruption network? They can be explained by nonfunctional relationships in the ChIP network and by combinatorial regulatory effects
Conclusions
- We want to think that the networks share enough for both to be meaningful, but at the same time there is apparently a lot of noise in at least one of them
How the two networks compare
- How much do the networks have in common?
- We can look at the intersection of the networks and ask whether the common parts are supported by our existing knowledge
- Are the target sets of the transcription factors present in both networks similar?
- Are the network topology properties (e.g., connectivity) similar, and what are they?
Degree of a node in a graph
[Figure: a graph whose central node has degree = 7 (indegree = 3, outdegree = 4)]
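As a small illustration of the three degree notions, the sketch below counts them from an edge list. The node names and edges are hypothetical, chosen so that node "c" matches the example of 3 incoming and 4 outgoing edges.

```python
from collections import Counter

# Hypothetical edges (source, target): node "c" has 3 incoming and 4 outgoing edges.
edges = [("a", "c"), ("b", "c"), ("d", "c"),
         ("c", "e"), ("c", "f"), ("c", "g"), ("c", "h")]

outdegree = Counter(source for source, _ in edges)   # edges leaving each node
indegree = Counter(target for _, target in edges)    # edges entering each node

def degree(node):
    """Total degree = indegree + outdegree."""
    return indegree[node] + outdegree[node]
```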
Most genes have only a few incoming / outgoing edges, but some have high numbers (>500)
Important genes and genes with complex regulation
Genes with highest in- and out-degree
γ | outdegree: cellular role | m | n | indegree: cellular role | m | n
2.0 | Carbohydrate metabolism | 363 | 4 | Amino-acid metabolism | 9 | 194
2.0 | RNA turnover | 353 | 4 | Nucleotide metabolism | 6 | 82
2.0 | Meiosis | 244 | 3 | Energy generation | 5 | 242
2.0 | Cell stress | 207 | 9 | Small molecule transport | 5 | 343
2.0 | Protein translocation | 197 | 3 | Other metabolism | 5 | 148
2.8 | RNA turnover | 110 | 4 | Amino-acid metabolism | 4 | 167
2.8 | Cell stress | 62 | 8 | Nucleotide metabolism | 3 | 67
2.8 | Meiosis | 54 | 3 | Energy generation | 2 | 184
2.8 | Protein synthesis | 53 | 7 | Differentiation | 2 | 43
2.8 | Cell wall maintenance | 47 | 6 | Small molecule transport | 2 | 286
3.6 | RNA turnover | 48 | 4 | Small molecule transport | 2 | 230
3.6 | RNA processing/modification | 41 | 4 | Other metabolism | 2 | 96
3.6 | Cell stress | 27 | 8 | Nucleotide metabolism | 2 | 58
3.6 | Small molecule transport | 19 | 8 | Mating response | 2 | 57
3.6 | Cell wall maintenance | 19 | 6 | Amino-acid metabolism | 2 | 133
Cellular role table showing the top 5 groups with the highest median degrees for the networks with γ=2.0, 2.8 and 3.6, with a minimum group size of 3 for the outdegree and 40 for the indegree (m – median degree, n – number of genes per group)
Node degree distributions for both networks – roughly follow a power law
[Figure: two log-log plots, one for the ChIP network and one for the mutant network; x-axis: connections per gene, y-axis: proportion of genes]
Yeast network topology properties
- Power-law property – on a log-log scale the relationship is approximately linear
- Whether this really holds is still open to debate – what is clear, however, is that most genes have relatively few connections and a few have many
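One crude way to look at the "approximately linear on a logarithmic scale" property is to fit a straight line to the log-log degree histogram. The sketch below does this with ordinary least squares; the function name is illustrative, and a good straight-line fit is only a rough indication, not a proof of a power law.

```python
from collections import Counter
from math import log10

def loglog_slope(degrees):
    """Least-squares slope of log10(proportion of genes) against log10(degree).

    Needs at least two distinct positive degrees; zero degrees are ignored.
    """
    counts = Counter(d for d in degrees if d > 0)
    n = sum(counts.values())
    xs = [log10(d) for d in counts]
    ys = [log10(c / n) for c in counts.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```

A strongly negative slope means that genes with many connections are rare, which is the part of the claim that is not in dispute.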
Why is topology important?
- It reduces the hypothesis space when analysing the next layers of model complexity – instead of the default assumption that all genes depend on all others, topology tells us which genes are independent
- What is the complexity of gene regulation?
– Given a transcription factor T, how many genes does T regulate?
– Given a gene A, how many transcription factors regulate A?
- Are networks modular?
What does ‘modular’ mean?
- Is there only one connected component or several?
- In scale free graphs there are hub nodes (nodes with high degree keeping everything together), and there is a theory that networks fall into pieces (modules) if the hub nodes are removed – is this the case for real networks?
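Looking for modules
network | component | full | removed 1% | removed 5% | removed 10%
ChIP network | largest | 2403 | 2201 | 1859 | 1721
ChIP network | second | 11 | 11 | 11 | 11
ChIP network | total number | 3 | 3 | 4 | 6
in-silico network | largest | 5583 | 5416 | 4988 | 4259
in-silico network | second | - | - | - | -
in-silico network | total number | 1 | 1 | 1 | 1
mutant network | largest | 4095 | 3209 | 2301 | 1815
mutant network | second | 2 | 2 | 3 | 3
mutant network | total number | 2 | 2 | 3 | 8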
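The kind of experiment summarised in this table can be sketched as follows: remove a fraction of the highest-degree nodes and list the sizes of the remaining connected components. This is an illustrative reimplementation under assumptions (edge direction is ignored, nodes left without any edge are not counted), not the procedure actually used for the table.

```python
from collections import defaultdict

def top_hubs(edges, fraction):
    """Return the given fraction of nodes with the highest total degree."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    ranked = sorted(degree, key=degree.get, reverse=True)
    return set(ranked[:int(len(ranked) * fraction)])

def component_sizes(edges, removed=frozenset()):
    """Sizes of connected components (direction ignored), largest first."""
    adj = defaultdict(set)
    for a, b in edges:
        if a in removed or b in removed:
            continue                      # drop edges touching removed nodes
        adj[a].add(b)
        adj[b].add(a)
    seen, sizes = set(), []
    for start in list(adj):
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:                      # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        sizes.append(size)
    return sorted(sizes, reverse=True)
```

Calling `component_sizes(edges, top_hubs(edges, 0.05))` on an edge list would correspond to the "removed 5%" column.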
Network modularity
- On the static topology level there are no obvious modules in the yeast transcription regulation network
- This does not mean, however, that there are no modules
– There is evidence for modules
– More subtle methods may be needed to find them
What have we learned on the topology level?
- Comparison of different networks shows
that we have some idea of what the true topology is like, but it is far from perfect
- The network topology is roughly scale free
- There are no obvious modules in these networks on the topology level – one giant component
Gene Networks - four levels of hierarchical description
- Parts list – genes, transcription factors,
promoters, binding sites, …
- Topology – a graph describing the
connections between the parts
- Control logics – how combinations of
regulatory signals interact (e.g., promoter logics)
- Dynamics – how does it all work in real time
More complex interactions
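[Figure: network fragment with genes G1, G2 and G3]
Logics
[Figure: logic circuit with inputs A, B, C, an & gate and a ¬ gate, and output D]
D = A & B & ¬C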
Control functions
- Discrete vs. continuous
– Discrete (Boolean): D = A & B & ¬C
– Continuous (linear): D = w1A + w2B + w3C
Decision trees
[Figure: decision tree. A>5? yes: B>2? (yes: D=0, no: D=1); no: C>3? (yes: E=0, no: E=1)]
if A>5 and B<=2 then D=1
if A>5 and B>2 then D=0
if A<=5 and C<=3 then E=1
if A<=5 and C>3 then E=0
- Decision tree for CLN2 gene in yeast
[Figure: decision tree for CLN2 with split thresholds 1.1, 0.81 and 0.8]
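The three kinds of control function named on these slides (Boolean, linear, decision tree) can be written out as small Python functions. The function names are illustrative and the default weights of the linear function are placeholders, since the slides give no values for w1, w2, w3.

```python
def boolean_control(a, b, c):
    """Discrete control: D = A & B & not C, on Boolean inputs."""
    return a and b and not c

def linear_control(a, b, c, w1=1.0, w2=1.0, w3=-1.0):
    """Continuous control: D = w1*A + w2*B + w3*C (weights are placeholders)."""
    return w1 * a + w2 * b + w3 * c

def decision_tree(a, b, c):
    """The A/B/C tree from the slide: returns (gene, value)."""
    if a > 5:
        return ("D", 0 if b > 2 else 1)
    return ("E", 0 if c > 3 else 1)
```

The decision tree reads thresholds on continuous expression levels but returns a discrete outcome, which is why it sits between the Boolean and the linear description.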
Logics – high throughput data is only now beginning to have an impact
- Predicting gene expression from combinations of expression levels of other genes (Soinov et al., 2003)
– Limited to about 20 genes
– For instance, by choosing genes well known to be involved in yeast cell cycle regulation it is possible to derive decision trees describing the combinatorial regulatory effects for these genes
– At least some of the conclusions are supported by a priori knowledge
What is known about the regulatory logics from classical low throughput approaches?
Yuh, C.H., Bolouri, H. and Davidson, E.H. (1998) Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science 279, 1896-902
- Boolean, linear and decision tree concepts are all used – 12 input variables
Probabilistic approaches
Canalizing Boolean functions
- There is one input and one value of that input that determines the output regardless of the values of the other inputs
– F = x V y – canalizing – x=1 -> F=1
– F = x & y – canalizing – x=0 -> F=0
– F = x ⊕ y – not canalizing – no value of any single input determines the value of F
- For Boolean functions of many inputs only a small fraction of the possible functions are canalizing
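The definition can be checked by brute force for small functions: for every input and every value of that input, see whether the output is the same over all settings of the other inputs. The helper names below are illustrative; note that under this definition constant functions also count as canalizing.

```python
from itertools import product

def canalizing_pairs(f, n):
    """All (input index, input value, forced output) triples for f on n Boolean inputs."""
    found = []
    for i, v in product(range(n), (0, 1)):
        # Outputs of f over all input vectors in which input i is fixed to v.
        outputs = {f(*x) for x in product((0, 1), repeat=n) if x[i] == v}
        if len(outputs) == 1:
            found.append((i, v, outputs.pop()))
    return found

def is_canalizing(f, n):
    return bool(canalizing_pairs(f, n))

# The three examples from the slide.
print(canalizing_pairs(lambda x, y: x | y, 2))   # OR
print(canalizing_pairs(lambda x, y: x & y, 2))   # AND
print(is_canalizing(lambda x, y: x ^ y, 2))      # XOR
```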
Gene Networks - four levels of hierarchical description
- Parts list – genes, transcription factors,
promoters, binding sites, …
- Topology – a graph depicting the connections of the parts
- Control logics – how combinations of
regulatory signals interact (e.g., promoter logics)
- Dynamics – how does it all work in real time
Classification of dynamic network models
- Continuous versus discrete state (e.g., Boolean)
- Deterministic versus probabilistic state transitions (e.g., differential equations versus Bayesian models)
- Ignoring spatial effects versus modelling spatial effects
Differential equation based models
The basic assumption – the rate of change in the abundance of a gene product at a particular time is determined by the abundances of the gene products at that time
Difference equation model:
g1(t+∆t) − g1(t) = (w11 g1(t) + ... + w1n gn(t)) ∆t
...
gn(t+∆t) − gn(t) = (wn1 g1(t) + ... + wnn gn(t)) ∆t
where:
gi(t) – the abundance of the product of gene i at time t
wij – the weight of the contribution of gene j to the expression of gene i
The main problem – we don’t know the constants wij
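Iterating the difference equations is straightforward once the weights are given. The sketch below does so in plain Python; the two-gene weight matrix, the initial abundances and the step size are invented for illustration, since in practice the constants wij are exactly what is unknown.

```python
def simulate(w, g0, dt, steps):
    """Iterate g_i(t+dt) = g_i(t) + dt * sum_j w[i][j] * g_j(t)."""
    g = list(g0)
    trajectory = [g]
    for _ in range(steps):
        g = [gi + dt * sum(wij * gj for wij, gj in zip(row, g))
             for gi, row in zip(g, w)]
        trajectory.append(g)
    return trajectory

# Hypothetical 2-gene system: gene 0 decays, gene 1 is activated by gene 0.
w = [[-1.0, 0.0],
     [0.5, -0.2]]
traj = simulate(w, g0=[1.0, 0.0], dt=0.1, steps=50)
```

Each entry of `traj` is the vector of abundances at one time step, starting with the initial state.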
Differential equation model for Drosophila embryo development
von Dassow, G., Meir, E., Munro, E.M. and Odell, G.M. (2000) The segment polarity network is a robust developmental module. Nature 406, 188-92.
Synchronous Boolean networks – the assumptions
- Each gene in the system (cell) can be in one of two states –
- ‘expressed’ – 1,
- ‘not expressed’ – 0
- All genes switch state simultaneously, in a synchronous manner
- The next state of each gene is determined from the previous states of all genes by the Boolean functions describing the network
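[Figure: wiring diagram of a three-gene network X, Y, Z with an & gate]
Y = X & Z, X = Y, Z = ¬X
State transitions (each state written as XYZ)
t | t+1
000 | 001
001 | 001
010 | 101
011 | 101
100 | 000
101 | 010
110 | 100
111 | 110
State space
[Figure: state space graph over the states 000, 001, 010, 011, 111, 101, 110, 100]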
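The state transitions of the three-gene example (Y=X&Z, X=Y, Z=¬X) can be generated mechanically. The sketch below applies all three rules at once to the previous state, which is what "synchronous" means here; the function and variable names are illustrative.

```python
from itertools import product

def step(state):
    """One synchronous update of (X, Y, Z): X = Y, Y = X & Z, Z = not X."""
    x, y, z = state
    return (y, x & z, 1 - x)

# Full transition table over all 2^3 states.
transitions = {s: step(s) for s in product((0, 1), repeat=3)}

for s, nxt in transitions.items():
    print("".join(map(str, s)), "->", "".join(map(str, nxt)))
```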
Reverse engineering:
- Given the state space transitions:
000 001 010 011 111 101 110 100
- Reconstruct the network:
Reverse engineering problem
- On the one hand the problem is trivial – the state space immediately gives a transition table, which is an equivalent representation of the wiring diagram
- However, building the smallest wiring diagram from the table is NP-hard, i.e., all known exact algorithms take exponential time in the worst case
- For small networks (3 genes as above) this is not a problem, but for thousands of genes it is not feasible
Exponential algorithm
- Assume that all genes depend on all, i.e., in the wiring diagram connect each gene to all genes
- The Boolean function for each gene is the disjunction of all state vectors in the table for which that gene is switched on
- This gives a hugely long Boolean function for each gene (up to n·2^n literals for a network of n genes)
- Minimising this Boolean function to the smallest equivalent one is a classic NP-hard problem
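The first, unminimised step of this approach can be sketched directly: for each gene, write one conjunction per current state after which the gene is on, and join them with "V". The code below does this for the three-gene example network; the function name and the textual output format are illustrative choices.

```python
from itertools import product

def dnf_from_table(table, gene_index, names):
    """Disjunction of one conjunction per state whose successor sets the gene to 1."""
    terms = []
    for state, nxt in sorted(table.items()):
        if nxt[gene_index] == 1:
            literals = [n if bit else "¬" + n for n, bit in zip(names, state)]
            terms.append("(" + " & ".join(literals) + ")")
    return " V ".join(terms) if terms else "0"

# Transition table of the three-gene example: X = Y, Y = X & Z, Z = ¬X.
table = {(x, y, z): (y, x & z, 1 - x) for x, y, z in product((0, 1), repeat=3)}

for i, name in enumerate("XYZ"):
    print(name, "=", dnf_from_table(table, i, "XYZ"))
```

Even here the formula for X has four terms of three literals each, although X simply copies Y; the minimisation step that would recover that is the hard part.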
Solution
- Instead of the minimal possible network, simply look for a ‘small’ network
- Somogyi et al. – an algorithm using mutual information – it is not clear how good this heuristic is
Attractors in the state space
000 001 010 011 111 101 110 100
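Attractors of a small synchronous Boolean network can be found by following every start state until a state repeats. The sketch below does this for the three-gene example (X=Y, Y=X&Z, Z=¬X); the function names are illustrative, and the exhaustive search is only practical for very small networks.

```python
from itertools import product

def step(state):
    """Synchronous update of (X, Y, Z): X = Y, Y = X & Z, Z = not X."""
    x, y, z = state
    return (y, x & z, 1 - x)

def attractors(step_fn, n):
    """Follow every start state until it repeats; collect the cycles reached."""
    found = set()
    for start in product((0, 1), repeat=n):
        path, state = [], start
        while state not in path:
            path.append(state)
            state = step_fn(state)
        # The repeated state marks where the cycle begins on this path.
        found.add(frozenset(path[path.index(state):]))
    return found

for cycle in attractors(step, 3):
    print(sorted(cycle))
```

A cycle of length one is a fixed point; longer cycles are states the network keeps revisiting in turn.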
Kauffman’s simulations
- Randomly constructed Boolean networks such that
– the number of inputs of each ‘gene’ is small
– the control functions are canalizing
have the property that
– their state space contains a relatively small number of attractors
– they spend most of the time in attractor states
Kauffman’s hypothesis
- Gene networks are predominantly controlled by canalizing functions
- Attractors are cell types
- He estimated that, under certain conditions on network connectivity and assuming 100000 genes, there should be a few hundred different cell types
A hybrid model – the finite state linear model
[Figure sequence: a gene i with binding sites B1, B2, B3 (each in state 0 or 1), a control function Fi built from & and ¬ gates, rates ri = (-1.5, 0.5), and the concentration ci plotted against time t]
[Figure sequence: concentration of repressor plotted against time, with the levels assorep and dissorep and the times t1 and t2 marked; a binding site brep with state Srep, a ¬ gate, and the rates r+ and r-]
Simulations on a FSLM
Lac operon in E. coli bacteria
- E. coli has two modes, glucose utilisation and lactose utilisation; the mode is regulated by the presence or absence of lactose
[Figure: the lac operon. lacI with its promoter produces the repressor; lacZ ... with its promoter, operator and activator; glucose and lactose as inputs; lactose is converted by galactosidase into glucose + galactose]
[Figure: FSLM representation of the lac operon, with the elements repressor, activator, galactosidase, glucose and galactose connected through & gates (see table)]
FSLM representation
Finite state model for Lac-Operon network
[Figure: finite state linear model for the lambda phage lytic/lysogenic mode switch. Elements shown include cro, cI, cII, cIII, N, Q, O, P, xis, int and Struc, the promoters PL, PR, PR’, PE, PM, PcI and Pint, and the operator sites OR1, OR2, OR3, each with a small set of discrete states (e.g. 0,1 or 0,1,2,3); some elements are marked as not implemented in the simulator]
Finite state linear model for lambda phage lytic/lysogenic mode switch
Network dynamics – the state of the art
- Most of the current dynamic models include fewer than 10 genes, and the knowledge used in the modelling mostly comes from traditional biology studies
- There are as yet no convincing examples where high throughput technologies have had a substantial impact on network modelling on the dynamics level
Conclusions – what have we learned on each level so far?
- Parts list – we are dealing with thousands to tens of thousands of elements in these networks, and hundreds to thousands of regulating elements
- Topology – maybe tens of thousands of connections; it seems to be scale free, with no obvious modules
- Control logics – a gene can be controlled by dozens of transcription factors in a rather complex way
- Dynamics – we are not yet able to model the dynamics of genome scale transcription regulation networks
What do I hope you have learned in this course
- Some feel for what real microarray data are like, and some idea of the basic methods (if you didn’t know this before)
- How to use ArrayExpress and Expression Profiler if you need them
- A flavour of what our current knowledge is