Introduction to Microarray Data Analysis and Gene Networks lecture - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Introduction to Microarray Data Analysis and Gene Networks lecture 8

Alvis Brazma European Bioinformatics Institute

slide-2
SLIDE 2

Lecture 8

  • Gene networks – part 2
      – Network topology (part 2)
      – Network logics
      – Network dynamics

slide-3
SLIDE 3

Gene Networks - four levels of hierarchical description

  • Parts list – genes, transcription factors, promoters, binding sites, …
  • Topology – a graph describing the connections between the parts
  • Control logics – how combinations of regulatory signals interact (e.g., promoter logics)
  • Dynamics – how does it all work in real time
slide-5
SLIDE 5

The arcs can have different meanings

G1 → G2

  • The product of gene G1 is a transcription factor, which binds to the promoter of gene G2 (as seen in a ChIP-chip experiment) – physical interaction network (direct network)

G1 → G2

  • The disruption of gene G1 changes the expression level of gene G2 – data interpretation network (indirect network)

slide-6
SLIDE 6

How do the two networks compare?

  • How much do the two networks have in common?
  • We can look at the intersection of the networks: are the common parts supported by evidence in our existing knowledge?
  • Are the target sets of the transcription factors present in both networks similar?
  • Are the network topology (e.g., connectivity) properties similar?


slide-8
SLIDE 8

A couple of simple notions

  • Any gene (node in the graph) with outgoing edges is called a source gene
  • Any gene with incoming edges is a target gene
  • Target set – the set of all target genes of a given source gene

(Figure labels: source node, target node, target set)
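The notions above can be sketched in a few lines of Python. This is only an illustration: the edge list is a hypothetical toy network, and the name `target_sets` is not from the lecture.

```python
from collections import defaultdict

def target_sets(edges):
    """Map each source gene to the set of its target genes."""
    targets = defaultdict(set)
    for source, target in edges:
        targets[source].add(target)
    return dict(targets)

# Hypothetical toy network; each pair is an arc source -> target.
edges = [("G1", "G2"), ("G1", "G3"), ("G2", "G3")]

tsets = target_sets(edges)
source_genes = set(tsets)               # genes with outgoing edges
target_genes = {t for _, t in edges}    # genes with incoming edges
```

Here G2 is both a source gene and a target gene, which is allowed by the definitions.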

slide-9
SLIDE 9

A problem:

  • Both networks depend on the chosen significance threshold, i.e., on what level of microarray signal is required to draw an edge in the network

slide-10
SLIDE 10

The size of the networks for different significance thresholds

                          ChIP network   ChIP network   mutant network   mutant network   mutant network
                          (p<0.01)       (p<0.001)      (γ=2.0)          (γ=2.5)          (γ=3.0)
source genes              202            169            250              236              226
target genes              4939           2845           5396             4778             3920
genes                     4980           2930           5654             4798             3959
edges                     18842          6170           32017            17436            10356
edges with same role (*)  3694 (19.6%)   857 (13.9%)    4096 (12.8%)     2425 (13.9%)     1507 (14.6%)
edges per source gene     93.3           36.5           135.7            73.8             45.6

(*) edges where source gene and target gene have the same cellular role annotation in YPD (http://www.proteome.com )

slide-11
SLIDE 11
slide-12
SLIDE 12

How do the two networks compare?

  • How much do the networks have in common?
  • We can look at the intersection of the networks: are the common parts supported by evidence in our existing knowledge?
  • Are the target sets of the transcription factors present in both networks similar?
  • Are the network topology (e.g., connectivity) properties similar?

slide-13
SLIDE 13

Intersection of the networks – many connections are consistent with our a priori knowledge

(Figure: intersection network; node labels) YGR086C CCW6 SIC1 YLR194C CHS1 ARO1 CPA2 ARG10 MET22 STE12 FUS1 KAR4 STE2 GPA1 SST2 YAP1 GSH1 YLR460C SWI5 ARG5 ECM40 LEU4 GCN4 HOM3 CLB2 MBP1 SCW10 CIS3 MNN1 SWI4 GIC2 SWI6 YKL185W YPL158C YLR049C PST1 YHR149C YBR070C MNN5 SGA1 PCL1 PCL2 YER079W YHR150W YDR528W YLR297W YER128W SWE1 YPR157W YER078C PRY2 PLB3 SVS1 ABF1 RNR1 HCM1 MCD1 YLR103C DUN1 SMC3 RFA2 MUT5 SPT21 YLR104W YJR030C PDS1 YNL313C YOX1 UFE1 YDR115W CDC21 RAD27 PDS5 IRR1 DIN7 ERP3 YJL073W GIN4 YPL267W

slide-14
SLIDE 14

Figure 6 (network figure with the same node labels as the intersection network on slide 13)

slide-15
SLIDE 15

How do the two networks compare?

  • How much do the networks have in common?
  • We can look at the intersection of the networks: are the common parts supported by evidence in our existing knowledge?
  • Are the target sets of the transcription factors present in both networks similar?
  • Are the network topology (e.g., connectivity) properties similar?

slide-16
SLIDE 16

How do the ChIP-chip and disruption networks relate?

(Figure labels: All genes; Transcription factors; Disrupted genes; t; h; Regulation set of t; Effectual set of h)
slide-17
SLIDE 17

How do the ChIP-chip and disruption networks relate?

(Figure labels: All genes; Transcription factors; Disrupted genes; Regulation set of g; Effectual set of g)

slide-18
SLIDE 18

How to estimate whether the overlap is larger than expected by chance?

(Figure: Venn diagram of the sets R and E within a larger set G, with intersection R∩E)

We assume that the elements of the set E are marked, and pick a set of size |R| at random from G. Then the size x=|R∩E| of the intersection is distributed according to the hypergeometric distribution. The probability of observing an intersection of size k or larger can be computed according to the formula:

$$P(x \ge k) = 1 - \sum_{i=0}^{k-1} \frac{\binom{|E|}{i}\binom{|G|-|E|}{|R|-i}}{\binom{|G|}{|R|}}$$
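The hypergeometric tail probability described on this slide can be computed directly. The sketch below assumes that the arguments are the set sizes |G|, |R| and |E|; the function name `overlap_p_value` is illustrative and not from the lecture.

```python
from math import comb

def overlap_p_value(G, R, E, k):
    """P(x >= k) for the overlap x of a random set of size R with the
    E marked elements, both drawn from G elements (hypergeometric tail)."""
    total = comb(G, R)
    # Sum the probabilities of overlaps 0 .. k-1; i cannot exceed R.
    below = sum(comb(E, i) * comb(G - E, R - i) for i in range(min(k, R + 1)))
    return 1 - below / total
```

For example, with 10 genes, a regulation set of size 3 and an effectual set of size 4, `overlap_p_value(10, 3, 4, 1)` is the probability that the two sets share at least one gene. For very small tail probabilities the subtraction from 1 loses precision, so a real analysis might sum the upper tail instead.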

slide-19
SLIDE 19

How do the ChIP-chip and disruption networks relate?

(Figure labels: All genes; Transcription factors; Disrupted genes; Regulation set of g; Effectual set of g; counts 146, 213 and 23, with (9) marked inside the 23)

Of the 23 transcription factors studied in both networks, only 9 have target sets that overlap more than expected by chance ☹

slide-20
SLIDE 20

Of the 23 transcription factors studied in both networks, only 9 have target sets that overlap more than expected by chance

  • Is it as bad as it may look?
      – We would expect many indirect connections in the disruption network that are not present in the ChIP network – is this the case?

slide-21
SLIDE 21

Direct vs. indirect interactions

(Figure: three genes X, Y and Z joined by two arcs labelled "Direct" and one arc labelled "Indirect")
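One way to make the distinction concrete is to compute which gene pairs are connected only through intermediate genes. The sketch below assumes a chain X → Y → Z of direct arcs, which is an assumption about the figure, and the function name `indirect_edges` is illustrative.

```python
def indirect_edges(direct):
    """Pairs (a, c) such that c is reachable from a along direct arcs,
    but there is no direct arc a -> c."""
    succ = {}
    for a, b in direct:
        succ.setdefault(a, set()).add(b)
    indirect = set()
    for start in succ:
        seen, stack = set(), list(succ[start])
        while stack:                      # depth-first search from start
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(succ.get(node, ()))
        indirect |= {(start, n) for n in seen
                     if n not in succ[start] and n != start}
    return indirect
```

With the assumed chain, `indirect_edges({("X", "Y"), ("Y", "Z")})` returns the single pair X → Z. A disruption experiment on X could show an effect on Z even though a ChIP experiment would not show X binding at Z.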

slide-22
SLIDE 22

(Figure: network; node labels) GLN3 YHM1 ARO3 ARO1 ARG4 YJL200C CPA2 GCN4 LYS2 YAP1 YOL158C YAP6 FET4 ROX1 RNR1 GDH3 YBL029W ECM33 SWI5 SLY1 YDR451C UTR2 YER189W YER190W PMA1 YGL114W YHL029C YIL158W YJL051W CIS3 SUR7 CDC5 CLN1 SRL1 YOR248W YOR315W CLB2 NCE102 SWI6 YBR070C GIC2 YNL058C SVS1 NDD1 SOK2 SWI4 MBP1 HIS4 ADE3 ADE13 ADE17 ADE4 BAS1 RTG1

slide-23
SLIDE 23

Of the 23 transcription factors studied in both networks, only 9 have target sets that overlap more than expected by chance

  • Is it as bad as it may look?
      – We would expect many indirect connections in the disruption network that are not present in the ChIP network – is this the case? There is anecdotal evidence that it is
      – What about the connections present in the ChIP network, but not in the disruption network? These can be explained by nonfunctional relationships in the ChIP network and by combinatorial regulatory effects

slide-24
SLIDE 24

Conclusions

  • We would like to think that the networks share enough for both to be meaningful, but at the same time there is apparently a lot of noise in at least one of them

slide-25
SLIDE 25

How do the two networks compare?

  • How much do the networks have in common?
  • We can look at the intersection of the networks: are the common parts supported by evidence in our existing knowledge?
  • Are the target sets of the transcription factors present in both networks similar?
  • Are the network topology (e.g., connectivity) properties similar – and what are they?

slide-26
SLIDE 26

Degree of a node in a graph

The central node has degree = 7, indegree = 3, outdegree = 4
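As a small illustration, the three degree values can be counted from an edge list. The toy graph below is hypothetical and only mimics the central node on the slide (three incoming and four outgoing arcs).

```python
from collections import Counter

def degrees(edges):
    """Return {node: (indegree, outdegree, degree)} for a directed edge list."""
    outdeg = Counter(s for s, _ in edges)
    indeg = Counter(t for _, t in edges)
    nodes = set(outdeg) | set(indeg)
    return {n: (indeg[n], outdeg[n], indeg[n] + outdeg[n]) for n in nodes}

# Hypothetical graph: node "c" has 3 incoming and 4 outgoing arcs.
edges = [("a", "c"), ("b", "c"), ("d", "c"),
         ("c", "e"), ("c", "f"), ("c", "g"), ("c", "h")]
central = degrees(edges)["c"]
```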

slide-27
SLIDE 27

Most genes have only a few incoming / outgoing edges, but some have high numbers (>500)

Important genes and genes with complex regulation

(Figure panels: Indegree, Outdegree)

slide-28
SLIDE 28

Genes with highest in- and out-degree

γ     outdegree                       m     n     indegree                    m    n
2.0   Carbohydrate metabolism         363   4     Amino-acid metabolism       9    194
      RNA turnover                    353   4     Nucleotide metabolism       6    82
      Meiosis                         244   3     Energy generation           5    242
      Cell stress                     207   9     Small molecule transport    5    343
      Protein translocation           197   3     Other metabolism            5    148
2.8   RNA turnover                    110   4     Amino-acid metabolism       4    167
      Cell stress                     62    8     Nucleotide metabolism       3    67
      Meiosis                         54    3     Energy generation           2    184
      Protein synthesis               53    7     Differentiation             2    43
      Cell wall maintenance           47    6     Small molecule transport    2    286
3.6   RNA turnover                    48    4     Small molecule transport    2    230
      RNA processing/modification     41    4     Other metabolism            2    96
      Cell stress                     27    8     Nucleotide metabolism       2    58
      Small molecule transport        19    8     Mating response             2    57
      Cell wall maintenance           19    6     Amino-acid metabolism       2    133

Cellular role table showing the top 5 groups with the highest median degrees for the networks with γ=2.0, 2.8 and 3.6, with a minimum group size of 3 for the outdegree and 40 for the indegree (m – median degree, n – number of genes per group)

slide-29
SLIDE 29

Node degree distributions for both networks – roughly follow a power law

(Figure: two log-log plots, one for the ChIP network and one for the mutant network; x-axis: connections per gene, y-axis: proportion of genes)

slide-30
SLIDE 30

Yeast network topology properties

(Figure: two log-log plots, one for the ChIP network and one for the mutant network; x-axis: connections per gene, y-axis: proportion of genes)

  • Power-law property – on a logarithmic scale the relationship is approximately linear
  • Whether this is so is still open to debate – what is clear, however, is that most genes have relatively few connections, while a few have many
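A crude way to look at the "approximately linear on a logarithmic scale" property is to fit a straight line to the log-log degree distribution. The sketch below is an assumption about how one might do this, not the method used for the plots; a least-squares fit on log-log data is only a rough diagnostic and does not by itself establish a power law. It needs at least two distinct degree values.

```python
from collections import Counter
from math import log10

def loglog_slope(degree_list):
    """Least-squares slope of log10(proportion of genes) against
    log10(connections per gene); genes with degree 0 are ignored."""
    counts = Counter(d for d in degree_list if d > 0)
    n = sum(counts.values())
    xs = [log10(d) for d in counts]
    ys = [log10(c / n) for c in counts.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```

A slope of about -2, for instance, would mean that genes with ten times as many connections are roughly a hundred times rarer.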

slide-31
SLIDE 31

Why is topology important?

  • It reduces the hypothesis space when analysing the next layers of model complexity – instead of the default assumption that all genes depend on all others, topology tells us which genes are independent
  • What is the complexity of gene regulation?
      – Given a transcription factor T, how many genes does T regulate?
      – Given a gene A, how many transcription factors regulate A?
  • Are networks modular?
slide-32
SLIDE 32

What does ‘modular’ mean?

  • Is there only one connected component, or several?
  • In scale-free graphs there are hub nodes (nodes with high degree keeping everything together), and there is a theory that networks fall into pieces (modules) if the hub nodes are removed – is this the case for real networks?

slide-33
SLIDE 33
slide-34
SLIDE 34
slide-35
SLIDE 35
slide-36
SLIDE 36
slide-37
SLIDE 37

Looking for modules

                                   full    removed 1%   removed 5%   removed 10%
ChIP network        largest        2403    2201         1859         1721
                    second         11      11           11           11
                    total number   3       3            4            6
in-silico network   largest        5583    5416         4988         4259
                    second         –       –            –            –
                    total number   1       1            1            1
mutant network      largest        4095    3209         2301         1815
                    second         2       2            3            3
                    total number   2       2            3            8
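The experiment summarised in the table can be sketched as follows. Several details are assumptions not stated on the slide: hubs are ranked by undirected degree, edge direction is ignored when finding connected components, and the function name is illustrative.

```python
def components_after_hub_removal(edges, fraction):
    """Sizes of connected components (largest first) after removing the
    given fraction of highest-degree nodes; edge direction is ignored."""
    neigh = {}
    for a, b in edges:
        neigh.setdefault(a, set()).add(b)
        neigh.setdefault(b, set()).add(a)
    ranked = sorted(neigh, key=lambda n: len(neigh[n]), reverse=True)
    removed = set(ranked[:int(len(ranked) * fraction)])
    sizes, seen = [], set(removed)      # removed hubs are never visited
    for start in neigh:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                    # depth-first search of one component
            node = stack.pop()
            size += 1
            for nb in neigh[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        sizes.append(size)
    return sorted(sizes, reverse=True)
```

The first two entries of the returned list correspond to the "largest" and "second" rows of the table, and its length to the "total number" row.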

slide-38
SLIDE 38

Network modularity

  • On the static topology level there are no obvious modules in the yeast transcription regulation network
  • This does not mean, however, that there are no modules
      – There is evidence for modules
      – More subtle methods may be needed to find them

slide-39
SLIDE 39

What have we learned on the topology level?

  • Comparison of different networks shows that we have some idea of what the true topology is like, but it is far from perfect
  • The network topology is roughly scale free
  • There are no obvious modules in these networks on the topology level – one giant component

slide-40
SLIDE 40

Gene Networks - four levels of hierarchical description

  • Parts list – genes, transcription factors, promoters, binding sites, …
  • Topology – a graph describing the connections between the parts
  • Control logics – how combinations of regulatory signals interact (e.g., promoter logics)
  • Dynamics – how does it all work in real time
slide-42
SLIDE 42

More complex interactions

G1 G2 G3

slide-43
SLIDE 43

Logics

D = A & B & ¬C

(Figure: gate diagrams with inputs A, B and C, an AND gate (&), a negation (¬) and output D)

slide-44
SLIDE 44

Control functions

  • Discrete vs. continuous

Discrete: D = A & B & ¬C
Continuous: D = w1A + w2B + w3C
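The two kinds of control function can be written side by side. This is a minimal sketch; the function names are illustrative, and the weights w1, w2, w3 are free parameters whose values are not given in the lecture.

```python
def discrete_control(a: bool, b: bool, c: bool) -> bool:
    """Boolean control function D = A & B & ¬C."""
    return a and b and not c

def continuous_control(a: float, b: float, c: float,
                       w1: float, w2: float, w3: float) -> float:
    """Linear control function D = w1*A + w2*B + w3*C."""
    return w1 * a + w2 * b + w3 * c
```

In the continuous form a repressing input such as C would typically be given a negative weight, which plays the role of the negation in the Boolean form.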

slide-45
SLIDE 45

Decision trees

(Figure: decision tree with root test A>5; its "yes" branch tests B>2 and its "no" branch tests C>3; leaves D=1, D=0, E=1, E=0)

Equivalent rules:
  if A>5 and B<=2 then D=1
  if A>5 and B>2 then D=0
  if A<=5 and C<=3 then E=1
  if A<=5 and C>3 then E=0
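The four rules on this slide translate directly into nested conditions. The sketch below is one possible encoding; returning a (variable, value) pair is a choice made here for illustration.

```python
def decide(a, b, c):
    """Follow the tree: test A>5 first, then B>2 or C>3.
    Returns which variable is set and its value."""
    if a > 5:
        return ("D", 0 if b > 2 else 1)
    return ("E", 0 if c > 3 else 1)
```

Each path from the root to a leaf corresponds to exactly one of the rules, so the tree and the rule list are two views of the same function.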

slide-46
SLIDE 46
  • Decision tree for CLN2 gene in yeast

(Figure: decision tree with branch thresholds 1.1, 0.81 and 0.8)

slide-47
SLIDE 47

Logics – high throughput data is only now beginning to have impact

  • Predicting gene expression from combinations of expression levels of other genes (Soinov et al, 2003)
      – Limited to about 20 genes
      – For instance, by choosing genes well known to be involved in yeast cell cycle regulation it is possible to derive decision trees describing the combinatorial regulatory effects for these genes
      – At least some of the conclusions are supported by a priori knowledge

slide-48
SLIDE 48

Yuh, C.H., Bolouri, H. and Davidson, E.H. (1998) Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science 279, 1896-902

What is known about the regulatory logics from classical low throughput approaches?

Boolean, linear and decision tree concepts are all used – 12 input variables

slide-49
SLIDE 49

Probabilistic approaches

slide-50
SLIDE 50

Canalizing Boolean functions

  • There is one input and one value for that input that determines the output regardless of the values of the other inputs
      – F = x ∨ y – canalizing: x=1 -> F=1
      – F = x & y – canalizing: x=0 -> F=0
      – F = x ⊕ y – not canalizing: no single value of any single input determines the value of F
  • For Boolean functions of many inputs only a small fraction of the possible functions are canalizing
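The definition can be checked by brute force over the truth table. The sketch below assumes the function is given as a Python callable on 0/1 inputs; the name `canalizing_inputs` is illustrative.

```python
from itertools import product

def canalizing_inputs(f, n):
    """Return (input index, input value, forced output) triples for every
    input value that fixes the output of the n-input Boolean function f."""
    found = []
    for i in range(n):
        for v in (0, 1):
            outs = {f(*bits) for bits in product((0, 1), repeat=n)
                    if bits[i] == v}
            if len(outs) == 1:          # output is the same whatever the rest
                found.append((i, v, outs.pop()))
    return found
```

A function is canalizing when this list is non-empty. The check enumerates all 2^n input vectors, so it is only practical for functions with few inputs.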

slide-51
SLIDE 51

Gene Networks - four levels of hierarchical description

  • Parts list – genes, transcription factors, promoters, binding sites, …
  • Topology – a graph depicting the connections of the parts
  • Control logics – how combinations of regulatory signals interact (e.g., promoter logics)
  • Dynamics – how does it all work in real time
slide-52
SLIDE 52

Classification of dynamic network models

  • Continuous versus discrete state (e.g., Boolean)
  • Deterministic versus probabilistic state transitions (e.g., differential equations versus Bayesian models)
  • Ignoring spatial effects versus modelling spatial effects

slide-53
SLIDE 53

Differential equation based models

The basic assumption – the rate of change in gene product abundance at a particular time is determined by the abundances of the gene products at that time

gi(t) – the abundance of the product of gene i at time t
wij – the weight of the contribution of gene j to the expression of gene i

slide-54
SLIDE 54

Differential equation based models

Difference equation model:

g1(t+∆t) − g1(t) = (w11 g1(t) + ... + w1n gn(t)) ∆t
...
gn(t+∆t) − gn(t) = (wn1 g1(t) + ... + wnn gn(t)) ∆t

where:

gi(t) – the abundance of the product of gene i at time t
wij – the weight of the contribution of gene j to the expression of gene i

The main problem – we don’t know the constants wij
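The difference equations above can be iterated directly once the weights are chosen. In the sketch below the weight matrix `w`, the initial abundances and the step size are placeholders that a user would have to supply; as the slide notes, the real constants wij are not known.

```python
def step(g, w, dt):
    """One update g(t) -> g(t + dt): g_i += dt * sum_j w[i][j] * g_j."""
    n = len(g)
    return [g[i] + dt * sum(w[i][j] * g[j] for j in range(n))
            for i in range(n)]

def simulate(g0, w, dt, steps):
    """Return the list of states g(0), g(dt), ..., g(steps * dt)."""
    trajectory = [list(g0)]
    for _ in range(steps):
        trajectory.append(step(trajectory[-1], w, dt))
    return trajectory
```

For instance, with two genes where gene 1 activates gene 2 (`w = [[0, 0], [1, 0]]`), the abundance of gene 2 grows linearly while gene 1 stays constant.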

slide-55
SLIDE 55

von Dassow, G., Meir, E., Munro, E.M. and Odell, G.M. (2000) The segment polarity network is a robust developmental module. Nature 406, 188-92.

Differential equation model for drosophila embryo development

slide-56
SLIDE 56

Synchronous Boolean networks – the assumptions

  • Each gene in the system (cell) can be in one of two states:
      – ‘expressed’ – 1
      – ‘not expressed’ – 0
  • All genes switch from state to state simultaneously, in a synchronous manner
  • The next state of each gene is determined by the previous states of all genes, through the Boolean functions describing the network

slide-57
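SLIDE 57

Network: Y=X&Z, X=Y, Z=¬X

(Figure: wiring diagram of the genes X, Y and Z with an AND gate (&))

State transitions

        t           t+1
      X Y Z        X Y Z
      0 0 0        0 0 1
      0 0 1        0 0 1
      0 1 0        1 0 1
      0 1 1        1 0 1
      1 0 0        0 0 0
      1 0 1        0 1 0
      1 1 0        1 0 0
      1 1 1        1 1 0

State space: 000 001 010 011 111 101 110 100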
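The transition table of this three-gene network can be generated mechanically from the rules Y=X&Z, X=Y, Z=¬X. The sketch below represents a state as a tuple (X, Y, Z) of 0/1 values; that encoding is a choice made here for illustration.

```python
from itertools import product

def next_state(state):
    """Synchronous update: all three genes switch at the same time."""
    x, y, z = state
    return (y, x & z, 1 - x)   # new X = Y, new Y = X & Z, new Z = ¬X

# Full transition table over all 2**3 states.
transitions = {s: next_state(s) for s in product((0, 1), repeat=3)}
```

Each of the eight states has exactly one successor, which is what makes the model deterministic.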

slide-58
SLIDE 58

Reverse engineering:

  • Given the state space transitions:

000 001 010 011 111 101 110 100

  • Reconstruct the network:
slide-59
SLIDE 59

Reverse engineering problem

  • On one hand the problem is trivial – the state space immediately gives one a transition table, which is a representation equivalent to the wiring diagram
  • However, the problem of building the smallest wiring diagram from the table is NP-hard, i.e., no polynomial-time algorithm is known, and known exact methods take exponential time in the worst case
  • For small networks (3 genes as above) this is not a problem, but for thousands of genes this approach is not feasible

slide-60
SLIDE 60

Exponential algorithm

  • Assume that all genes depend on all, i.e., in the wiring diagram connect each gene to all others
  • The Boolean function for each gene is the disjunction of all state vectors in the table after which that gene is in state 1
  • This gives a hugely long Boolean function for each gene (i.e., of size up to n·2^n for a network of n genes)
  • The minimisation of this Boolean function to the smallest equivalent one is a classic NP-hard problem
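The first steps of this construction can be illustrated on the three-gene example network Y=X&Z, X=Y, Z=¬X. The sketch below builds the unminimised disjunction for one gene from the transition table; it deliberately does not attempt the minimisation step, and the function name `dnf_for_gene` is illustrative.

```python
from itertools import product

# Transition table of the example network; states are (X, Y, Z) tuples.
transitions = {(x, y, z): (y, x & z, 1 - x)
               for x, y, z in product((0, 1), repeat=3)}

def dnf_for_gene(transitions, gene, names):
    """Disjunction of all current-state vectors after which `gene`
    (an index into the state tuple) is in state 1."""
    terms = []
    for state, nxt in sorted(transitions.items()):
        if nxt[gene]:
            lits = [n if bit else "¬" + n for n, bit in zip(names, state)]
            terms.append("(" + " & ".join(lits) + ")")
    return " ∨ ".join(terms) if terms else "0"

y_formula = dnf_for_gene(transitions, 1, "XYZ")
```

For gene Y this yields two terms that differ only in the value of Y, which a minimiser would merge into X & Z. With n genes there can be up to 2^n such terms of n literals each.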

slide-61
SLIDE 61

Solution

  • Instead of the minimal possible network, simply look for a ‘small’ network
  • Somogyi et al – an algorithm using mutual information – it is not clear how good this heuristic is

slide-62
SLIDE 62

Attractors in the state space

000 001 010 011 111 101 110 100
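Attractors can be found by following every state until the trajectory starts to repeat. The sketch below assumes that the state space shown is that of the three-gene example network Y=X&Z, X=Y, Z=¬X; the function name `attractors` is illustrative.

```python
from itertools import product

# Transition table of the example network; states are (X, Y, Z) tuples.
transitions = {(x, y, z): (y, x & z, 1 - x)
               for x, y, z in product((0, 1), repeat=3)}

def attractors(transitions):
    """Return the set of attractors, each as a frozenset of the states
    on the cycle (a fixed point is a cycle of length one)."""
    found = set()
    for start in transitions:
        path, state = [], start
        while state not in path:        # walk until a state repeats
            path.append(state)
            state = transitions[state]
        found.add(frozenset(path[path.index(state):]))
    return found

result = attractors(transitions)
```

Under that assumption the network has two attractors: the fixed point 001 and the two-state cycle between 010 and 101. All other states are transient and lead into one of them.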

slide-63
SLIDE 63

Canalizing Boolean functions

  • There is one input and one value for that input that determines the output regardless of the values of the other inputs
  • F = x ∨ y – canalizing: x=1 -> F=1
  • F = x & y – canalizing: x=0 -> F=0
  • F = x ⊕ y – not canalizing: no single value of any single input determines the value of F
  • For Boolean functions of many inputs only a small fraction of the possible functions are canalizing

slide-64
SLIDE 64

Kauffman’s simulations

  • Randomly constructed Boolean networks such that
      – the number of inputs of each ‘gene’ is small
      – the control functions are canalizing
  • have the property that
      – their state space consists of a relatively small number of attractors
      – they spend most of the time in attractor states

slide-65
SLIDE 65

Attractors in the state space

000 001 010 011 111 101 110 100

slide-66
SLIDE 66

Kauffman’s hypothesis

  • Gene networks are predominantly controlled by canalizing functions
  • Attractors are cell types
  • He estimated that, under certain conditions on network connectivity and assuming 100000 genes, there should be a few hundred different cell types

slide-67
SLIDE 67

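A hybrid model – the finite state linear model

(Figure: binding sites B1, B2 and B3 feed a control function Fi built from & and ¬; ri=(-1.5, 0.5))

slide-68
SLIDE 68

(Figure, continued: binding site states B1=1, B2=1, B3=0; ri=(-1.5, 0.5))

slide-69
SLIDE 69

(Figure, continued: B1=1, B2=1, B3=0, and a further value 1 shown in the diagram; ri=(-1.5, 0.5))

slide-70
SLIDE 70

(Figure, continued: B1=1, B2=1, B3=0, value 1; plot with axes ci and t)

slide-71
SLIDE 71

(Figure, continued: B1=1, B2=1, B3=1; plot with axes ci and t)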

slide-72
SLIDE 72

(Figure: three panels plotting the concentration of repressor against time, with the levels assorep and dissorep and the time points t1 and t2 marked; below each panel, the binding site brep with state Srep, a negation (¬), value 1, and the rate r+, r- and r+ respectively)

slide-73
SLIDE 73

Simulations on a FSLM

slide-74
SLIDE 74

Lac operon in E. coli bacteria

  • E. coli has two modes, glucose utilisation and lactose utilisation, and the choice between them is regulated by the presence or absence of lactose

slide-75
SLIDE 75

Finite state model for Lac-Operon network

(Figure: the lac operon, with labels lacZ, Promoter, Operator, Repressor, lacI, Activator, Glucose, Lactose, Galactose and Galactosidase, next to its FSLM representation with nodes repressor, activator, galactosidase, glucose and galactose)

slide-76
SLIDE 76

Finite state linear model for lambda phage lytic/lysogenic mode switch

(Figure: network diagram with labels including cro, cI, cII, cIII, N, Q, O, P, xis, int, Struc, the promoters PL, PL1, PL2, PR, PR1, PR2, PR’, PM, PE, PcI and Pint, and the operator sites OR1, OR2, OR3; some parts are marked "not implemented in the simulator")

slide-77
SLIDE 77

Network dynamics – the state of the art

  • Most of the current dynamic models include fewer than 10 genes, and the knowledge used in the modelling mostly comes from traditional biology studies
  • There are no convincing examples yet where high throughput technologies have had a substantial impact on network modelling on the dynamics level
slide-78
SLIDE 78

Conclusions – what have we learned on each level so far?

  • Parts list – we are dealing with thousands to tens of thousands of elements in these networks, and hundreds to thousands of regulating elements
  • Topology – there may be tens of thousands of connections; it seems to be scale free, with no obvious modules
  • Control logics – a gene can be controlled by dozens of transcription factors in a rather complex way
  • Dynamics – we are not yet able to model the dynamics of genome scale transcription regulation networks

slide-79
SLIDE 79

What do I hope you have learned in this course?

  • Some feel for what real microarray data are like, and some idea of the basic methods (if you didn’t know this before)
  • How to use ArrayExpress and Expression Profiler if you need this
  • A flavour of our current knowledge of how genes are regulated, and of how little we know