Gene Regulation Bioinformatics
Wyeth Wasserman
Centre for Molecular Medicine and Therapeutics Department of Medical Genetics Children’s & Women’s Hospitals
University of British Columbia
CMMT
– Bioinformatics for detection of transcription factor binding sites
– Given binding models for relevant TFs, predict regulatory sequences
– Genetic variation within regulatory regions
– Given a set of co-regulated genes, predict binding sites for contributing TFs
– Given a newly discovered binding profile, predict genes in a regulon
TATA URE
A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

Set of binding sites
AAGTTAATGA
CAGTTAATAA
GAGTTAAACA
CAGTTAATTA
GAGTTAATAA
CAGTTATTCA
GAGTTAATAA
CAGTTAATCA
AGATTAAAGA
AAGTTAACGA
AGGTTAACGA
ATGTTGATGA
AAGTTAATGA
AAGTTAACGA
AAATTAATGA
GAGTTAATGA
AAGTTAATCA
AAGTTGATGA
AAATTAATGA
ATGTTAATGA
AAGTAAATGA
AAGTTAATGA
AAGTTAATGA
AAATTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
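As a rough illustration of how a set of aligned sites like the one above can become a binding model, the following Python sketch counts bases per column, converts the counts to log-odds weights, and scans a sequence for its best-scoring window. The pseudocount of 1, the uniform 0.25 background, the restriction to the first five sites, the single-strand scan, and all function names are assumptions made for this example, not details taken from the slides.

```python
from math import log2

# First five sites from the "Set of binding sites" slide; a real profile
# would be built from the full set.
SITES = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA", "GAGTTAATAA"]
BASES = "ACGT"


def count_matrix(sites):
    """Count each base at each column of equal-length aligned sites."""
    width = len(sites[0])
    counts = {b: [0] * width for b in BASES}
    for site in sites:
        for i, base in enumerate(site.upper()):
            counts[base][i] += 1
    return counts


def log_odds_matrix(counts, n_sites, pseudocount=1.0, background=0.25):
    """Convert counts to log2(p_observed / p_background) weights.

    The pseudocount and uniform background are assumptions of this sketch.
    """
    return {
        b: [
            log2(((c + pseudocount * background) / (n_sites + pseudocount)) / background)
            for c in counts[b]
        ]
        for b in BASES
    }


def best_hit(pwm, sequence):
    """Return (start, score) of the highest-scoring window on one strand."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width].upper()
        hits.append((sum(pwm[b][i] for i, b in enumerate(window)), start))
    score, start = max(hits)
    return start, score


pwm = log_odds_matrix(count_matrix(SITES), len(SITES))
start, score = best_hit(pwm, "ACGCATAAGTTAATGAATAACAGAT")
print(start, round(score, 2))
```

The scanned sequence is the 25-mer that appears later in the deck; the best window should coincide with the embedded AAGTTAATGA site.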
SIDENOTE: Global Progressive Alignments (ORCA Algorithm)
BLAST) and running global method on banded sub-segments
[Figure: % identity (0% to 100%) within a 200 bp window, plotted against window start position in the human sequence (0 to 7000)]
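The percent-identity plot can be approximated with a sliding window over a pairwise alignment. The Python sketch below assumes the two sequences are already globally aligned (equal length, gaps written as "-"), counts gap columns as mismatches, and indexes windows by alignment column rather than by human coordinate; the function name and these conventions are illustrative assumptions rather than the method used for the slide.

```python
def window_identity(aligned_a, aligned_b, window=200, step=1):
    """Percent identity in each window of a pairwise alignment.

    Returns a list of (window_start, percent_identity) tuples, where
    window_start is a 0-based alignment column.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for start in range(0, len(aligned_a) - window + 1, step):
        pairs = zip(aligned_a[start:start + window], aligned_b[start:start + window])
        # A column counts as identical only if both residues match and are not gaps.
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        profile.append((start, 100.0 * matches / window))
    return profile


# Toy alignment with a short window so the result is easy to check by eye.
print(window_identity("ACGTACGT", "ACGTTCGT", window=4, step=4))
```

Regions whose identity stays above a chosen cutoff would then be the candidates retained by phylogenetic footprinting.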
SELECTIVITY SENSITIVITY
NEW: Ortholog Sequence Retrieval Service
[Figure: 500 bp versus 20 000 bp; repeated label: GO]
(10 second slide for 3 months of work)
[Figure: score (0.2 to 1) plotted along the sequence (positions 100 to 5840); series: Wildtype, Other]
(w/ J. Lagergren, Royal Technical University of Sweden)
genome sequence
TSS AaGT
GENE        DISEASE/CONDITION (associated)     REFERENCE
UGT1A1      Gilbert’s Syndrome – jaundice      PJ Bosma, et al., 1995
UCP3        Elevated Body Mass                 S Otabe et al., 2000
TNFalpha    Malaria Susceptibility             JC Knight et al., 1999
Resistin    Elevated Body Mass                 JC Engert et al., 2002
IL4Ralpha   Reduced soluble IL4R               H Hackstein et al., 2001
ABCA1       Coronary artery disease            KY Zwarts et al., 2002
Ob          Leptin levels                      J Hager et al., 1998
PEPCK       Obesity
PR          Endometrial cancer                 I DeVivo et al., 2002
LDLR        Familial hypercholesterolemia      Koivisto et al., 1994
Allele 1: ACGCATAAGTTAATGAATAACAGAT
Allele 2: ACGCATAAGTTAACGAATAACAGAT

1234567890123456789012345
ACGCATAAGTTAAtGAATAACAGAT
.............c...........

[Figure: Differences in scores for the overlapping windows (1 to 11)]
[Figure: score (0.2 to 1) plotted along the sequence (positions 100 to 5840); series: Wildtype, Mutant]
[Figure: Maximum Differential for any potential SNP]
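One way to read the "differences in scores" idea is to score every window that overlaps the variant position for both alleles and keep the largest change. The Python sketch below does this for the 25-mer shown earlier in the deck (T versus C at position 14). The weight matrix is built from only the first five sites of the "Set of binding sites" slide, with a pseudocount of 1 and a uniform background; those choices and all function names are assumptions of the example, and the resulting numbers are not the ones plotted on the slides.

```python
from math import log2

# Illustrative profile: first five sites from the "Set of binding sites" slide.
SITES = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA", "GAGTTAATAA"]


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds weight matrix; pseudocount and background are assumptions."""
    n, width = len(sites), len(sites[0])
    pwm = {}
    for base in "ACGT":
        pwm[base] = []
        for i in range(width):
            count = sum(1 for s in sites if s[i] == base)
            freq = (count + pseudocount * background) / (n + pseudocount)
            pwm[base].append(log2(freq / background))
    return pwm


def score(pwm, window):
    """Sum of column weights for one window of the profile's width."""
    return sum(pwm[base][i] for i, base in enumerate(window))


def score_differences(pwm, wildtype, mutant, snp_index):
    """Wildtype minus mutant score for every window covering snp_index (0-based)."""
    width = len(pwm["A"])
    first = max(0, snp_index - width + 1)
    last = min(snp_index, len(wildtype) - width)
    return {
        start: score(pwm, wildtype[start:start + width])
        - score(pwm, mutant[start:start + width])
        for start in range(first, last + 1)
    }


WILDTYPE = "ACGCATAAGTTAATGAATAACAGAT"
MUTANT = "ACGCATAAGTTAACGAATAACAGAT"
SNP_INDEX = 13  # position 14 in the slide's 1-based numbering

diffs = score_differences(build_pwm(SITES), WILDTYPE, MUTANT, SNP_INDEX)
max_differential = max(abs(d) for d in diffs.values())
print(len(diffs), round(max_differential, 2))
```

Repeating the same calculation for every possible substitution at every position would give a per-position "maximum differential" profile of the kind the slide title suggests.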
Set of co-expressed genes
=> Automated sequence retrieval from EnsEMBL
=> Phylogenetic Footprinting
=> Detection of transcription factor binding sites
=> Statistical significance of binding sites
=> Putative mediating transcription factors
++++ p<1e-30, +++ p<1e-10, ++ p<1e-05, + p<1e-02
+ Ahr-ARNT + GATA-2 + GATA-2 + Sox-5 + SPI-B + + MZF_1-4 + Sox-5 + Yin-Yang + Elk-1 + HNF-3beta * + MZF_5-13 ++ Brachyury + S8 + Thing1-E47 +++ Irf-1 + RORalpha-1 + Gklf + +++ SPI-1 ++ FREAC-4 + Tal1beta- E47S + ++++ c-FOS ++ E4BP4 ++ Pax-2 + ++++ SPI-B +++ FREAC-3 ++ c-MYB-1 + ++++ Irf-2 +++ GATA-2 ++ TEF-1 * + ++++ p50 * +++ FREAC-7 + ++++ FREAC-7 ++ ++++ p65 * + ++++ Gfi + ++++ Myf * + ++++ c-REL * + ++++ COUP-TF + ++++ SRF * ++ ++++ NF-κB * + ++++ HNF-1 * ++ ++++ Mef2 * Fisher p-value z-score p-value Fisher P-value z-score p-value Fisher p-value z-score p-value
(61)
MICROARRAY APPLICATION:
Genes Significantly Down-regulated After Treatment with Inhibitor

TF          Class       z-score p-value   Fisher p-value
NF-kappaB   Rel/NFkB    ++++              ++
p65         Rel/NFkB    ++++              ++
c-Rel       Rel/NFkB    ++++              +
p50         Rel/NFkB    ++++              +
Pbx         Homeo       ++++              +
Sox-5       HMG         +++               +
SPI-B       ETS         +++               +
HFH-2       Forkhead    +++               +
FREAC-4     Forkhead    +++               +
Max         bHLH-ZIP    ++                +
SRY         HMG         +                 +
++++ p<1e-30, +++ p<1e-10, ++ p<1e-05, + p<1e-02
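The two significance columns can be illustrated with a small Python sketch. It assumes the Fisher p-value compares the number of genes with at least one predicted site in the co-expressed set against a background set (one-sided exact test), and that the z-score compares the number of site occurrences in the co-expressed sequences with a binomial expectation derived from the background rate. Both readings, the function names, and the example numbers are assumptions; the slides do not spell out the exact definitions.

```python
from math import comb, sqrt


def fisher_one_sided(hits_fg, n_fg, hits_bg, n_bg):
    """P(X >= hits_fg) under the hypergeometric null (one-sided Fisher exact test).

    hits_fg / n_fg: genes with a site / total genes in the co-expressed set.
    hits_bg / n_bg: the same counts for the background set.
    """
    total = n_fg + n_bg
    total_hits = hits_fg + hits_bg
    upper = min(n_fg, total_hits)
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, n_fg - k)
        for k in range(hits_fg, upper + 1)
    )
    return tail / comb(total, n_fg)


def site_z_score(sites_fg, length_fg, sites_bg, length_bg):
    """z-score of the foreground site count against a binomial expectation."""
    rate = sites_bg / length_bg          # background sites per position
    expected = rate * length_fg
    std_dev = sqrt(length_fg * rate * (1.0 - rate))
    return (sites_fg - expected) / std_dev


# Hypothetical counts, for illustration only.
print(fisher_one_sided(hits_fg=3, n_fg=3, hits_bg=0, n_bg=3))
print(round(site_z_score(sites_fg=20, length_fg=1000, sites_bg=100, length_bg=10000), 2))
```

The first measure asks whether more genes than expected carry a site at all; the second asks whether sites occur more often than expected, which is why a factor can rank differently in the two columns.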
rank
LEU3  STE12 RLM1  MIG1  OAF1  GAL4  XBP1  CBF1  RPN4  PDR3  ADR1  REB1  ABF1  RAP1  GCN4  PHO4
39.0  17.0  21.0  17.8  na    7.2   0.7   na    1.5   na    1.1   1.0   0.9   1.1   1.1   0.8
7     10    7     16    5     6     28    20    24    10    20    24    27    17    12    18
[Figure: comparison rank of correct pattern (axis 5 to 40); rank of found pattern in verified promoters versus rank of found pattern in randomly selected promoters]
[Figure, panels a and b: average comparison rank (random promoters) and average comparison rank (verified promoters) versus number of promoters (sequence depth)]
Sequence depth dependency of MAP scores
[Figure: pattern similarity (10 to 18) versus sequence length (100 to 600); True Mef2 Binding Sites]
[Figure: frequency versus score]
– Gibbs sampling, LRA, neural networks, SVMs, etc
– Phylogenetic Footprinting
– Regulatory Modules
– Familial Binding Profiles
profiles for critical transcription factors
Wynand Alkema Dave Arenillas Jochen Brumm Alice Choi Shannan Ho Sui Danielle Kemmer Jonathan Lim Raf Podowski Dora Pak Albin Sandelin Chris Walsh
Malin Andersson (KTH) Öjvind Johansson (UCSD) Stuart Lithwick (U.Toronto)
Support: CIHR, CGDN, CFI, Merck-Frosst, BC Children’s Hospital Foundation, Pharmacia, EC–Marie Curie, KI-Funder
Collaborators: Boris Lenhard (K.I.), Chip Lawrence (Wadsworth), William Thompson (Wadsworth), Jens Lagergren (KTH), Christer Höög (K.I.), Brenda Gallie (OCI), Jacob Odeberg (KTH), Niclas Jareborg (AZ), William Hayes (AZ), James Mortimer (MF)
Group Alumni: Elena Herzog, Annette Höglund, William Krivan, Luis Mendoza
clpP
taccgctattgaggta
taccccgatcggggta
tacccattaaggagta
taactctaaagtggta
tacctcaatagcggta
taccccgatcggggta
tactccttaatgggta
taccactttagagtta

TACCNCN(A/T)(A/T)NGNGGTA
TACCNRWAAYGBGGTA

A [0 8 1 0 1 1 1 5 3 5 0 2 1 0 0 8]
C [0 0 7 7 4 7 0 0 0 2 0 1 0 0 0 0]
G [0 0 0 0 1 0 2 0 0 0 7 4 7 7 0 0]
T [8 0 0 1 2 0 5 3 5 1 1 1 0 1 8 0]
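The step from the eight aligned clpP sites to the count matrix is a plain column-wise tally, as this short Python sketch shows. The function name and the equal-length check are illustrative choices; with the eight sites as listed, the tally reproduces the A, C, G and T rows given on the slide.

```python
# The eight aligned sites listed on the clpP slide.
CLPP_SITES = [
    "taccgctattgaggta",
    "taccccgatcggggta",
    "tacccattaaggagta",
    "taactctaaagtggta",
    "tacctcaatagcggta",
    "taccccgatcggggta",
    "tactccttaatgggta",
    "taccactttagagtta",
]


def count_matrix(sites):
    """Tally A, C, G and T at each column of equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    counts = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for column, base in enumerate(site.upper()):
            counts[base][column] += 1
    return counts


matrix = count_matrix(CLPP_SITES)
for base in "ACGT":
    print(base, matrix[base])
```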
Pattern detection
1818 sets of orthologs from S. aureus
=> Gibbs sampling
1430 patterns
=> Compare to random sequences
318 significant patterns
=> Remove redundancies: cluster with MatrixAligner (Sandelin et al 2003)
154 unique patterns in S. aureus

[Figure: frequency (0.00 to 0.25) versus MAP value (1 to 5) for patterns from real and random sequence sets]
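The "compare to random sequences" step can be sketched as an empirical significance test: a pattern's MAP value from the real ortholog sets is ranked against MAP values obtained by running the same sampler on random sequence sets. The Python example below is only an illustration under that assumption; the add-one p-value estimator, the 0.05 cutoff, the function names and all scores are hypothetical, and the slides do not state which criterion was actually used.

```python
def empirical_p_value(observed, random_scores):
    """Fraction of random-set scores at least as large as the observed score.

    Uses the add-one estimator so the p-value is never exactly zero.
    """
    at_least = sum(1 for s in random_scores if s >= observed)
    return (at_least + 1) / (len(random_scores) + 1)


def significant_patterns(real_scores, random_scores, alpha=0.05):
    """Names of patterns whose empirical p-value is at or below alpha."""
    return [
        name
        for name, value in real_scores.items()
        if empirical_p_value(value, random_scores) <= alpha
    ]


# Hypothetical MAP values, for illustration only.
random_maps = [1.0 + 0.1 * i for i in range(40)]
real_maps = {"pattern_1": 5.5, "pattern_2": 2.0}
print(significant_patterns(real_maps, random_maps))
```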
[Figure: axes are site score threshold (p-value), 0.00 to 0.12, and fraction of total ORFs in regulon, 0.00 to 0.10]
175 members in E. coli => Site searches produce too many false positive hits
Regulon Conservation Filtering (RECF)
[Figure: putative binding sites upstream of geneA to geneG compared across genomes; per-gene values 1, 1, 0.66, 0.33, 1; legend: putative binding site, regulon, regulog]
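A minimal Python sketch of the filtering idea follows. It assumes that each genome contributes a predicted regulon (a set of ortholog names with a putative binding site), that a gene's conservation value is the fraction of genomes in which it appears, and that the regulog keeps genes at or above a chosen fraction. This is one reading of the 1, 0.66 and 0.33 values in the figure; the genome names, gene sets, threshold and function names are hypothetical.

```python
def conservation_scores(regulons):
    """Fraction of genomes in which each gene has a predicted site.

    regulons maps a genome name to the set of ortholog names predicted
    to be in the regulon for that genome.
    """
    n_genomes = len(regulons)
    all_genes = set().union(*regulons.values())
    return {
        gene: sum(gene in members for members in regulons.values()) / n_genomes
        for gene in all_genes
    }


def regulog(regulons, min_fraction):
    """Genes whose predicted site is conserved in at least min_fraction of genomes."""
    scores = conservation_scores(regulons)
    return sorted(gene for gene, frac in scores.items() if frac >= min_fraction)


# Hypothetical predictions for three genomes.
predicted = {
    "genome1": {"geneA", "geneB", "geneC"},
    "genome2": {"geneA", "geneC", "geneD"},
    "genome3": {"geneA", "geneC"},
}
print(regulog(predicted, min_fraction=2 / 3))
```

Genes predicted in only one genome (geneB and geneD here) are the likely false positives that the conservation filter is meant to discard.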
RECF Efficiency

                 REGULON        REGULOG
TF     #known    Total   Pos    Total   Pos    Efficiency
pdhR   4         102     1      4       1      25.5
ilvY   2         182     2      12      2      15.2
       5         137     3      11      3      12.5
torR   4         77      4      7       4      11
metR   4         218     3      21      3      10.4
. . .
AVG.   9.8       174     7.2    20      3.8    4.2

Efficiency = (Specificity_REGULOG x Sensitivity_REGULOG) / (Specificity_REGULON x Sensitivity_REGULON)
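The efficiency ratio can be checked against the table with a few lines of Python. The sketch assumes that "specificity" here means the fraction of predicted members that are known positives (Pos / Total) and that "sensitivity" means the fraction of known members recovered (Pos / #known). These definitions are an interpretation, not stated on the slide, but they reproduce the per-factor efficiencies listed for pdhR, ilvY, torR and metR.

```python
def efficiency(known, regulon_total, regulon_pos, regulog_total, regulog_pos):
    """Ratio of (specificity x sensitivity) for the regulog versus the regulon.

    Assumed definitions: specificity = pos / total, sensitivity = pos / known.
    """
    def spec_times_sens(total, pos):
        specificity = pos / total
        sensitivity = pos / known
        return specificity * sensitivity

    return spec_times_sens(regulog_total, regulog_pos) / spec_times_sens(
        regulon_total, regulon_pos
    )


# Rows from the table: TF, #known, regulon total/pos, regulog total/pos.
rows = {
    "pdhR": (4, 102, 1, 4, 1),
    "ilvY": (2, 182, 2, 12, 2),
    "torR": (4, 77, 4, 7, 4),
    "metR": (4, 218, 3, 21, 3),
}
for name, values in rows.items():
    print(name, round(efficiency(*values), 1))
```

When the number of positives is the same before and after filtering, the ratio reduces to regulon total divided by regulog total, so the gain comes entirely from discarding unconserved predictions.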
RCS   Consensus         Members (leftmost members are the members with the highest confidence)
1.00  AACACAATATATAGTG  nrdD,SA2409,nrdI,nrdE,cspC,mtlF
1.00  TGTTAGAAAATCTAAC  glnR,nrgA,glnA
1.00  AGGTGCTAAATCCTGC  SA0011
0.89  GCCAGCGTAGGGAAGT  SA0928,SA0929,thiD,thiE,SA1897,gapR,thiM
0.88  ACAGGTCATAAGGGTC  SA0929,SA1897,SA0928,thiD,polC,thiE,thiM
0.87  AAGGGTGGAACCACGA  thrS,leuS,alaS,cysE,cysS,SA0489,SA0490,SA0491,pheS,pheT,SA1931,aspS,hisS,ileS,tyrS,trpG,valS,serS,SA0331,SA2101,SA1486,SA1289,SA1290,SA1291,SA2205,SA1392,truncated(radC),SA1578,murE,SA2102,SA1562,SA1199,trpD,trpC,trpF,trpB,trpA
0.86  TGTGAA?T?TTTCAC?  narG,narI,SA2183,narH,pflB,SA2174,lctE,SA1455,narK,SA0293,msmX,adhE,rpsU,fbaA
0.83  AAAAGAGTGCTAACA?  crtM,groES,hrcA,SA1747,SA1582,SA1581,SA2305,SA1748,grpE
0.83  TTGAAAATGATTATCA  SA0307,SA0116,SA0689,SA0117,SA0690,SA0331,SA0977,SA0978,SA1329,SA2162,ahpF,SA1979,SA0688,feoB,SA2338,sirA,SA2079,katA,SA0757,ahpC,fhuA,fhuB,fhuG,SA0335,SA2101,SA0160,SA0170,hemX,sirB,hemL,hemB,hemD,hemC,dapD,hemA,SA2102,SA0588,SA0589,SA0115,dps,fer,SA0774,SA1678
0.82  ?A?AAAAGTTATCCAC  SA0339,orfX,dnaA,dnaN,SA1419,SA1420,SA1421,SA1422,SA1423,aroE,SA1425,SA1426,SA0248
Known in
Known in
Unknown