1
Gene regulation, protein networks and disease – a computational perspective
Ron Shamir, School of Computer Science, Tel Aviv University
CPM Helsinki July 3 2012
DEGAS
2
[Figure: transcription factors (TF) bound to binding sites (BS) in the promoter of a gene, drawn 5’ to 3’]
4
0.2 0.7 0.8 0.1 A 0.6 0.4 0.1 0.5 0.1 C 0.1 0.4 0.1 0.5 G 0.3 0.1 0.1 0.9 T
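A matrix of this kind can be used to score candidate binding sites. The sketch below is illustrative only: the matrix values are hypothetical (they are not the numbers on the slide), a uniform background of 0.25 per nucleotide is assumed, and the names `pwm_log_odds` and `best_hit` are made up for this example.

```python
import math

# Hypothetical 4-position PWM (NOT the slide's values).
# Each column holds nucleotide probabilities that sum to 1.
PWM = {
    "A": [0.7, 0.1, 0.1, 0.2],
    "C": [0.1, 0.1, 0.6, 0.1],
    "G": [0.1, 0.7, 0.2, 0.1],
    "T": [0.1, 0.1, 0.1, 0.6],
}
BACKGROUND = 0.25  # assumed uniform background frequency


def pwm_log_odds(site, pwm=PWM, bg=BACKGROUND):
    """Log-odds score (base 2) of a site whose length equals the PWM width."""
    width = len(pwm["A"])
    if len(site) != width:
        raise ValueError("site length must equal the PWM width")
    return sum(math.log2(pwm[base][i] / bg) for i, base in enumerate(site))


def best_hit(sequence, pwm=PWM):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(pwm["A"])
    windows = range(len(sequence) - width + 1)
    return max((pwm_log_odds(sequence[i:i + width], pwm), i) for i in windows)
```

For instance, `best_hit("TTAGCTTT")` slides the matrix along the sequence and returns the score and offset of the best-matching window.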
5
Genome Research 08
Cluster I Cluster II Cluster III
Gene expression microarrays
Clustering
Location analysis (ChIP-chip, …)
Functional group (e.g., GO term)
Promoter sequences
Motif discovery
6
represented motifs in given sets of
spatial features given only the genomic sequences.
A general pipeline architecture for enumerating motifs.
Different statistical scoring schemes of motifs for different motif discovery tasks.
7
Pipeline of refinement phases of increased complexity
Phases: Preprocess, Mismatch, Merge, PWM Optimization
Motif model: k-mer, list, PWM
Cutoff = 0.005
8
Input
Scores: hypergeometric, binned, binomial
B, T, b, t
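As a rough illustration of the hypergeometric score, the sketch below computes an enrichment p-value from four counts. The reading of the symbols is an assumption, not something the slide states: B is taken as the total number of sequences, T as the size of the target set, b as the number of sequences containing the motif, and t as the number of target sequences containing it. The function name is hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(B, T, b, t):
    """P(X >= t) for X ~ Hypergeometric.

    Assumed meaning of the symbols (not confirmed by the slide):
      B: total number of sequences (background, including targets)
      T: number of target sequences
      b: number of sequences (out of B) that contain the motif
      t: number of target sequences that contain the motif
    """
    if not (0 <= T <= B and 0 <= b <= B and 0 <= t <= min(b, T)):
        raise ValueError("inconsistent counts")
    total = comb(B, T)  # ways to choose the target set
    # Sum the upper tail: i motif-containing sequences among the T targets.
    tail = sum(comb(b, i) * comb(B - b, T - i) for i in range(t, min(b, T) + 1))
    return tail / total
```

A small p-value means that the motif occurs in the target set more often than expected if the targets were drawn at random from all sequences.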
9
[Figure: sequences binned by length (0.4-0.7 kbp, 0.7-1 kbp) and GC-content (20-40%, 40-60%); the four bins are labeled with counts B1-B4, T1-T4, b1-b4]
42 target sets of 26 TFs, 8 miRNAs from 29 studies (expression, ChIP-chip, ...) in human, mouse, fly, worm
All motifs are experimentally verified
Average target set size: 400 genes (383 Kbp)
10
11
12
Promoter sequences
Output
Motif(s)
Gene expression microarrays
Location analysis (ChIP-chip, …)
Functional group (e.g., GO term)
Co-regulated gene set
13
Localization w.r.t. the TSS
Strand-bias
Chromosomal preference
TSS
5’
14
Input: All worm promoters (~18,000)
Score: chromosomal preference
Results: Novel motif on chrom IV
15
16
17
NAR 10
Cluster I Cluster II Cluster III
Gene expression microarrays
Clustering
Promoter sequences Expression data Output
Motif(s)
Co-regulated gene set
19
Discretization of expression patterns
e1 = Up (U): ≥ 1.0
e2 = Same (S): (-1.0, 1.0)
e3 = Down (D): ≤ -1.0
[Figure: expression pattern of gene g over conditions c1, c2, …, cm (e.g., 1.5) and the corresponding discrete expression pattern (DEP) over U, S, D]
Condition frequency matrix (CFM)
      c1     c2    …    cm
U     0.05   0.1   …    0.78
S     0.9    0.2   …    0.14
D     0.05   0.7   …    0.08
Condition weight matrix (CWM): $W = \{w_{ij}\}$, $w_{ij} = \log(f_{ij} / r_{ij})$
($F = \{f_{ij}\}$ is the CFM, $R = \{r_{ij}\}$ is the BG CFM)
⇒ Log-likelihood ratio (LLR) score
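The discretization, the CFM, the CWM and the LLR score can be sketched in a few lines. This is a minimal illustration, not the Allegro implementation: the thresholds ±1.0 follow the slide, while the log-ratio form of the weights, the pseudocount (added to avoid taking the log of zero) and all function names are assumptions made for this example.

```python
import math

STATES = ("U", "S", "D")


def discretize(values, threshold=1.0):
    """Map expression values to U (>= threshold), D (<= -threshold), else S."""
    return ["U" if v >= threshold else "D" if v <= -threshold else "S"
            for v in values]


def condition_frequency_matrix(deps, pseudocount=1.0):
    """Per-condition frequency of each state over a list of equal-length DEPs.

    The pseudocount is an assumption of this sketch, not taken from the slide.
    """
    n, m = len(deps), len(deps[0])
    denom = n + pseudocount * len(STATES)
    return {s: [(sum(dep[j] == s for dep in deps) + pseudocount) / denom
                for j in range(m)]
            for s in STATES}


def condition_weight_matrix(cfm, bg_cfm):
    """Assumed weights: log ratio of target frequency to background frequency."""
    return {s: [math.log(f / r) for f, r in zip(cfm[s], bg_cfm[s])]
            for s in STATES}


def llr_score(dep, cwm):
    """Log-likelihood ratio of one DEP: sum of its weights over the conditions."""
    return sum(cwm[state][j] for j, state in enumerate(dep))
```

A gene whose DEP resembles the patterns of the target set more than those of the background receives a higher LLR score.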
20
21
Allegro can discover multiple motifs with diverse expression patterns,
even if the response is in a small fraction of the conditions
Extant two-step techniques recovered only 4 of the above motifs:
K-means/CLICK + Amadeus/Weeder: RRPE, PAC, MBF, STRE
Iclust + FIRE: RRPE, PAC, Rap1, STRE
~6,000 genes, 133 conditions [O’Rourke et al. ’04]
22
~14,000 genes, 124 conditions (various types of
Biases in length / GC-content of 3’ UTRs, e.g.:
100 highly-expressed genes in…    3’ UTR length    GC
Embryoid bodies                    584              47%
Undifferentiated ESCs              774              44%
ESC-derived fibroblasts            1240             39%
Fetal NSCs                         1422             43%
(ESCs = embryonic stem cells, NSCs = neural stem cells)
Extant methods / Allegro with HG score: report
23
Most highly expressed miRNAs in human/mouse ESCs
Current knowledge
Abundant & functional in neural cell lineage
Expressed specifically in neural lineage; active role in neurogenesis
miRNA expression, targets expression
miRNA expression from [Laurent ’08]
24
Yaron Orenstein, Chaim Linhart, Yonit Halperin, Igor Ulitsky
25
Better PWM inference: new scores, algs
Richer models for in vivo / in vitro data – really
How to evaluate model quality: match to
Integration of motif finding & expression
Principled means to find motif pairs
26
27
09
PLoS One 10
27
28
29
samples genes sick healthy ?
30
31
[Figure: expression of genes A-E in Control 1-4 and Case 1-3; the resulting binary matrix of dysregulated genes (A-E) per case (Case 1-3); the same data drawn as a graph connecting Case 1-3 to genes A-E]
32
[Figure: Case 1-3 connected to genes A-E]
– Bipartite graph: genes, cases
– Edge (gene g, case c) if g is dysregulated in c
– A network over the genes
[Figure: the same graph with k = 2, l = 1; annotations ≥ k and ≤ l]
33
k    l     G         Problem
1          Clique    Set cover
k          Clique    Set k-cover
1    >0    Clique    Partial set cover
1    Any             Connected set cover (Shuai & Hu 06)
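To make the roles of k, l and the gene network G concrete, the sketch below checks whether a candidate gene set is feasible. The feasibility condition is an assumption inferred from the special cases in the table: the chosen genes must be connected in G, and in all but at most l cases at least k of them must be dysregulated. It is a checker only, not the DEGAS search algorithm, and the example network and dysregulation sets are hypothetical.

```python
from collections import deque

# Hypothetical toy data (not the values from the slide's figure).
EXAMPLE_NETWORK = {           # undirected gene network as adjacency sets
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
    "D": {"C", "E"}, "E": {"D"},
}
EXAMPLE_DYSREGULATED = {      # case -> genes dysregulated in that case
    "Case 1": {"A", "B"},
    "Case 2": {"A", "B", "C"},
    "Case 3": {"C"},
}


def is_connected(genes, network):
    """True if the chosen genes induce a connected subgraph (BFS)."""
    genes = set(genes)
    if not genes:
        return False
    start = next(iter(genes))
    seen, queue = {start}, deque([start])
    while queue:
        gene = queue.popleft()
        for neighbor in network.get(gene, ()):
            if neighbor in genes and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == genes


def uncovered_cases(genes, dysregulated, k):
    """Cases in which fewer than k of the chosen genes are dysregulated."""
    genes = set(genes)
    return [case for case, dys in dysregulated.items() if len(genes & dys) < k]


def is_feasible(genes, network, dysregulated, k, l):
    """Assumed condition: connected, and at most l cases are left uncovered."""
    return (is_connected(genes, network)
            and len(uncovered_cases(genes, dysregulated, k)) <= l)
```

With the toy data, `is_feasible({"A", "B"}, EXAMPLE_NETWORK, EXAMPLE_DYSREGULATED, k=2, l=1)` holds because only Case 3 is left uncovered, while the same set is infeasible for l = 0.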
DEGAS
34
– HD modifiers
– HD relevant genes
– Calcium signaling
Huntingtin
35
prognosis
– Van’t Veer et al. Nature 2002
– Van de Vijver et al. NEJM 2002
– Wang et al. Lancet 2005
– Minn et al. Nature 2005
– Sotiriou et al. PNAS 2003
– Pawitan et al. Breast Cancer Research 2005
prognosis and one associated with good prognosis
36
p<0.005
cell-cycle associated genes (p = 2·10^-26) & YY1 targets (p = 2.42·10^-16)
genes localized to the nucleus
37
– Only DP2 network is strongly enriched with stem cell genes
– DP2 enriched with cytoplasmic genes
discovered breast cancer risk factor
(k=50; p < 0.005)
38
39
Igor Ulitsky, Whitehead Inst
Dick Karp, Berkeley
Akshay Krishnamurthy, CMU
40
41
Support: Israel Academy of Sciences, Wolfson Foundation, Edmond J. Safra Foundation, US-Israel BSF, German-Israeli Fund, EU 6th and 7th Frameworks, Intel, IBM, I-CORE Gene regulation & disease.
Postdocs available