Discovery and Analysis of Regulatory Regions in the Human Genome
Wyeth Wasserman
Centre for Molecular Medicine and Therapeutics Children’s and Women’s Hospital University of British Columbia
Acknowledgements
Wasserman Group – CMMT: Dave Arenillas, Jochen Brumm, Danielle Kemmer, Jonathan Lim
Wasserman Group – Karolinska: Albin Sandelin, Raf Podowski, Wynand Alkema
Collaborating Trainees: Malin Andersson (KTH), Öjvind Johansson (UCSD), Stuart Lithwick (U. Toronto)
Support: CIHR, CGDN, Merck-Frosst, BC Children’s Hospital Foundation, Pharmacia, EC–Marie Curie, KI-Funder
Collaborators: Chip Lawrence (Wadsworth), William Thompson (Wadsworth), Jens Lagergren (SBC/KTH), Christer Höög (K.I.), Brenda Gallie (OCI), Jacob Odeberg (KTH), Niclas Jareborg (AZ), William Hayes (AZ), Boris Lenhard (K.I.)
Group Alumni: Elena Herzog, Annette Höglund, William Krivan, Luis Mendoza
CMMT
– Bioinformatics for detection of transcription factor binding sites
– Pattern recognition for discovery of novel regulatory mechanisms
– Given binding models for relevant TFs, identify potential regulatory sequences
– Analyze potentially important genetic variation within predicted regulatory regions
– Given a set of co-regulated genes, predict important classes of TFBS
– Given a newly discovered binding profile, predict candidate regulon members
TATA URE
Position   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A         14  16   4   0   1  19  20   1   4  13   4   4  13  12   3
C          3   0   0   0   0   0   0   0   7   3   1   0   3   1  12
G          4   3  17   0   0   2   0   0   9   1   3   0   5   2   2
T          0   2   0  21  20   0   1  20   1   4  13  17   0   6   4
Set of binding sites
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
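A count matrix like the 15-position one above is commonly turned into a log-odds weight matrix and slid along a sequence to score candidate sites. The sketch below illustrates that general idea only; the pseudocount, the uniform background of 0.25, the function names, and the test sequence with its made-up flanks are all assumptions, not details taken from the talk.

```python
import math

# Count matrix copied from the slide: one list per base, one entry per position.
COUNTS = {
    "A": [14, 16, 4, 0, 1, 19, 20, 1, 4, 13, 4, 4, 13, 12, 3],
    "C": [3, 0, 0, 0, 0, 0, 0, 0, 7, 3, 1, 0, 3, 1, 12],
    "G": [4, 3, 17, 0, 0, 2, 0, 0, 9, 1, 3, 0, 5, 2, 2],
    "T": [0, 2, 0, 21, 20, 0, 1, 20, 1, 4, 13, 17, 0, 6, 4],
}


def to_log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2(frequency / background), one dict per position."""
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT")
        column = {}
        for b in "ACGT":
            # Assumed smoothing: the pseudocount is shared equally by the bases.
            freq = (counts[b][i] + pseudocount * background) / (total + pseudocount)
            column[b] = math.log2(freq / background)
        pwm.append(column)
    return pwm


def consensus(counts):
    """Most frequent base at each position."""
    width = len(counts["A"])
    return "".join(max("ACGT", key=lambda b: counts[b][i]) for i in range(width))


def scan(pwm, sequence):
    """Score every window; returns (1-based start, score) pairs."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        hits.append((start + 1, sum(col[b] for col, b in zip(pwm, window))))
    return hits


pwm = to_log_odds(COUNTS)
best_site = consensus(COUNTS)
test_sequence = "CCCC" + best_site + "GGGG"  # hypothetical flanking bases
best_start, best_score = max(scan(pwm, test_sequence), key=lambda hit: hit[1])
print(best_site, best_start, round(best_score, 2))
```

In practice a score threshold would be applied to the output of `scan` rather than taking only the single best window.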
SIDENOTE: Global Progressive Alignments (ORCA Algorithm)
BLAST) and running global method on banded sub-segments
[Figure: % identity in a 200 bp window, plotted against window start position in the human sequence]
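The plot of % identity against window start position can be reproduced in outline from any pairwise alignment. The following is a minimal sketch, assuming the two sequences are already aligned with `-` as the gap character and that windows are counted in reference (human) coordinates; the function name and the `step` parameter are hypothetical, and this is not a description of how ORCA itself works.

```python
def windowed_identity(aligned_ref, aligned_other, window=200, step=50):
    """Fraction of identical bases per window of reference positions.

    Returns (1-based reference start, identity) pairs. Alignment columns
    where the reference has a gap are skipped, so windows are measured in
    reference coordinates; a gap in the other sequence counts as a mismatch.
    """
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (r, o)
        for r, o in zip(aligned_ref.upper(), aligned_other.upper())
        if r != "-"
    ]
    profile = []
    for start in range(0, len(columns) - window + 1, step):
        chunk = columns[start:start + window]
        matches = sum(1 for r, o in chunk if r == o)
        profile.append((start + 1, matches / window))
    return profile


# Toy alignment with a tiny window so the result is easy to follow.
toy_profile = windowed_identity("ACGTACGT", "ACGTAC-T", window=4, step=4)
print(toy_profile)
```

With real sequences the defaults correspond to the 200 bp window used in the figure.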
[Figure: selectivity and sensitivity]
Now driven by the ORCA Aligner
[Figure: two panels labelled 500 bp and 20 000 bp, each marked with repeated GO labels]
years…Berman; Markstein; Frith; Noble; Wagner;…
– Most difficulty comes from local direct repeats
(10 second slide for 3 months of work)
[Figure: Wildtype and Mutant series plotted against sequence position]
(with Jens Lagergren, KTH Royal Institute of Technology, Sweden)
genome sequence
TSS AaGT
REFERENCE                   DISEASE/CONDITION (associated)   GENE
Koivisto et al., 1994       Familial hypercholesterolemia    LDLR
I DeVivo et al., 2002       Endometrial cancer               PR
                            Obesity                          PEPCK
J Hager et al., 1998        Leptin levels                    Ob
KY Zwarts et al., 2002      Coronary artery disease          ABCA1
H Hackstein et al., 2001    Reduced soluble IL4R             IL4Ralpha
JC Engert et al., 2002      Elevated Body Mass               Resistin
JC Knight et al., 1999      Malaria Susceptibility           TNFalpha
S Otabe et al., 2000        Elevated Body Mass               UCP3
PJ Bosma et al., 1995       Gilbert’s Syndrome – jaundice    UDP-GT1
ACGCATAAGTTAATGAATAACAGAT  (x5)
ACGCATAAGTTAACGAATAACAGAT  (x4)

Identify TFs with altered binding predictions overlapping variation
1234567890123456789012345
ACGCATAAGTTAATGAATAACAGAT
.............C...........
[Figure: differences in scores]
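One way to look for altered binding predictions is to score every window that overlaps the variant position with both alleles and compare. The sketch below does this for the T to C change at position 14 of the sequence above, using a weight matrix built from a small subset of the binding sites listed earlier in the talk. The choice of subset, the pseudocount, the uniform background, and all function names are assumptions made for illustration.

```python
import math

# Subset of the binding sites shown earlier (illustrative choice).
SITES = [
    "AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA",
    "CAGTTAATTA", "AAGTTAACGA", "ATGTTGATGA",
]


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds matrix (one dict per position) from aligned sites."""
    n = len(sites)
    pwm = []
    for i in range(len(sites[0])):
        column = {}
        for base in "ACGT":
            count = sum(1 for s in sites if s[i] == base)
            freq = (count + pseudocount * background) / (n + pseudocount)
            column[base] = math.log2(freq / background)
        pwm.append(column)
    return pwm


def score(pwm, window):
    return sum(col[b] for col, b in zip(pwm, window))


def allele_score_differences(pwm, ref_seq, pos, alt_base):
    """Reference-minus-variant score for each window covering `pos` (1-based)."""
    width = len(pwm)
    alt_seq = ref_seq[:pos - 1] + alt_base + ref_seq[pos:]
    first = max(0, pos - width)                  # leftmost covering window
    last = min(pos - 1, len(ref_seq) - width)    # rightmost covering window
    diffs = []
    for start in range(first, last + 1):
        ref_score = score(pwm, ref_seq[start:start + width])
        alt_score = score(pwm, alt_seq[start:start + width])
        diffs.append((start + 1, ref_score - alt_score))
    return diffs


pwm = build_pwm(SITES)
diffs = allele_score_differences(pwm, "ACGCATAAGTTAATGAATAACAGAT", 14, "C")
for start, delta in diffs:
    print(start, round(delta, 2))
```

A positive difference means the reference allele scores higher for that window; the window starting at position 7 is the one that aligns with the AAGTTAATGA site.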
[Figure: maximum differential for any potential SNP]
[Figure: liver module model score, Wildtype versus Mutant]
Comparison rank of correct pattern (column labels for the two value series were not recovered)
TF      series 1   series 2
LEU3    39.0       7
STE12   17.0       10
RLM1    21.0       7
MIG1    17.8       16
OAF1    na         5
GAL4    7.2        6
XBP1    0.7        28
CBF1    na         20
RPN4    1.5        24
PDR3    na         10
ADR1    1.1        20
REB1    1.0        24
ABF1    0.9        27
RAP1    1.1        17
GCN4    1.1        12
PHO4    0.8        18
[Figure: (a) rank of found pattern in verified promoters; (b) rank of found pattern in randomly selected promoters]

Sequence depth dependency of MAP scores
[Figure: average comparison rank for random promoters and for verified promoters, against number of promoters (sequence depth)]
[Figure: pattern similarity against sequence length]
True Mef2 Binding Sites
[Figure: frequency against score]
– Gibbs sampling, LRA, neural networks, SVMs, etc
– Phylogenetic Footprinting
– Regulatory Modules – Familial Binding Profiles
profiles for critical transcription factors
clpP
taccgctattgaggta
taccccgatcggggta
tacccattaaggagta
taactctaaagtggta
tacctcaatagcggta
taccccgatcggggta
tactccttaatgggta
taccactttagagtta

TACCNCN(A/T)(A/T)NGNGGTA
TACCNRWAAYGBGGTA

A [0 8 1 0 1 1 1 5 3 5 0 2 1 0 0 8]
C [0 0 7 7 4 7 0 0 0 2 0 1 0 0 0 0]
G [0 0 0 0 1 0 2 0 0 0 7 4 7 7 0 0]
T [8 0 0 1 2 0 5 3 5 1 1 1 0 1 8 0]
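The count matrix above follows directly from the eight aligned sites: each entry is the number of sites carrying that base at that position. A short sketch of the tally is given here; the function name is hypothetical, and the consensus strings are not derived because the rule used to produce them is not stated.

```python
# The eight aligned 16-mers shown above.
SITES = [
    "taccgctattgaggta", "taccccgatcggggta", "tacccattaaggagta",
    "taactctaaagtggta", "tacctcaatagcggta", "taccccgatcggggta",
    "tactccttaatgggta", "taccactttagagtta",
]


def count_matrix(sites):
    """Tally base counts per position for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    counts = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site.upper()):
            counts[base][i] += 1
    return counts


counts = count_matrix(SITES)
for base in "ACGT":
    print(base, counts[base])
```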
Pattern detection
1818 sets of orthologs from S. aureus
=> Gibbs sampling: 1430 patterns
=> Compare to random sequences: 318 significant patterns
=> Remove redundancies, cluster with MatrixAligner (Sandelin et al 2003): 154 unique patterns in S. aureus
[Figure: frequency distribution of MAP values for real and random sequences]
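The "compare to random sequences" step can be illustrated with an empirical p-value: a pattern's MAP value is compared with the MAP values obtained from random sequence sets. The sketch below is one plausible way to do this and is not claimed to be the procedure used in the talk; the cutoff `alpha`, the add-one correction, the pattern names, and the toy scores are all assumptions.

```python
def empirical_p_value(score, random_scores):
    """Fraction of random scores at least as large as `score`.

    One is added to numerator and denominator so the estimate is never
    exactly zero (a common convention, assumed here).
    """
    exceed = sum(1 for r in random_scores if r >= score)
    return (exceed + 1) / (len(random_scores) + 1)


def significant_patterns(real_scores, random_scores, alpha=0.05):
    """Names of patterns whose empirical p-value is at most alpha."""
    return [
        name
        for name, score in sorted(real_scores.items())
        if empirical_p_value(score, random_scores) <= alpha
    ]


# Toy data: MAP values from random sets and two hypothetical patterns.
random_maps = list(range(1, 20))              # 19 made-up random MAP values
real_maps = {"pattern_a": 25, "pattern_b": 3}
kept = significant_patterns(real_maps, random_maps)
print(kept)
```

With only 19 random values the smallest attainable p-value is 1/20, so a realistic run would use many more random sets.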
[Figure: fraction of total ORFs in regulon against site score threshold (p-value)]
175 members in E. coli => Site searches produce too many false positive hits
Regulon Conservation Filtering (RECF)
[Figure: predicted regulon members (geneA to geneG) compared across genomes; legend entries: regulon, regulog, putative binding site; values shown: 1, 1, 0.66, 0.33, 1]
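One plausible reading of the values in the RECF figure (1, 0.66, 0.33) is the fraction of genomes in which a gene's ortholog has a putative binding site. The sketch below filters predicted regulon members on that fraction. This interpretation, the threshold, the function names, and the toy data are all assumptions and are not taken from the talk.

```python
def conservation_scores(hits_by_genome):
    """Fraction of genomes in which each ortholog group has a putative site.

    `hits_by_genome` maps a genome name to the set of ortholog-group names
    with a predicted binding site in that genome.
    """
    n_genomes = len(hits_by_genome)
    genes = set().union(*hits_by_genome.values())
    return {
        gene: sum(gene in hits for hits in hits_by_genome.values()) / n_genomes
        for gene in sorted(genes)
    }


def conserved_members(hits_by_genome, min_fraction=0.6):
    """Genes whose site prediction is conserved in enough genomes."""
    scores = conservation_scores(hits_by_genome)
    return [gene for gene, fraction in scores.items() if fraction >= min_fraction]


# Hypothetical site predictions in three genomes.
predictions = {
    "genome1": {"geneA", "geneB", "geneC"},
    "genome2": {"geneA", "geneC"},
    "genome3": {"geneA", "geneD"},
}
scores = conservation_scores(predictions)
filtered = conserved_members(predictions)
print(scores, filtered)
```

Raising `min_fraction` trades sensitivity for fewer false positive members.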
TF      #known   REGULON Total   REGULON Pos   REGULOG Total   REGULOG Pos   Efficiency
pdhR    4        102             1             4               1             25.5
ilvY    2        182             2             12              2             15.2
        5        137             3             11              3             12.5
torR    4        77              4             7               4             11
metR    4        218             3             21              3             10.4
...
AVG.    9.8      174             7.2           20              3.8           4.2

\[ \text{Efficiency}_{\text{RECF}} = \frac{\text{Specificity}_{\text{REGULOG}} \times \text{Sensitivity}_{\text{REGULOG}}}{\text{Specificity}_{\text{REGULON}} \times \text{Sensitivity}_{\text{REGULON}}} \]
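The efficiency values in the table can be checked against the ratio of specificity times sensitivity for the regulog and the regulon. The sketch below assumes specificity means positives divided by total predictions and sensitivity means positives divided by known sites; these definitions are not spelled out in the slides, but they reproduce the tabulated efficiencies for the four named TFs. Function and variable names are hypothetical.

```python
# TF: (known, regulon total, regulon pos, regulog total, regulog pos)
ROWS = {
    "pdhR": (4, 102, 1, 4, 1),
    "ilvY": (2, 182, 2, 12, 2),
    "torR": (4, 77, 4, 7, 4),
    "metR": (4, 218, 3, 21, 3),
}


def spec_times_sens(known, total, positives):
    """Assumed definitions: specificity = pos/total, sensitivity = pos/known."""
    specificity = positives / total
    sensitivity = positives / known
    return specificity * sensitivity


def recf_efficiency(known, regulon_total, regulon_pos, regulog_total, regulog_pos):
    """Regulog (specificity x sensitivity) relative to the unfiltered regulon."""
    regulog = spec_times_sens(known, regulog_total, regulog_pos)
    regulon = spec_times_sens(known, regulon_total, regulon_pos)
    return regulog / regulon


efficiencies = {tf: round(recf_efficiency(*row), 1) for tf, row in ROWS.items()}
print(efficiencies)
```

When the number of positives is the same before and after filtering, the ratio reduces to regulon total divided by regulog total.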
RCS   Consensus          Members (leftmost members are the members with the highest confidence)
1.00  AACACAATATATAGTG   nrdD,SA2409,nrdI,nrdE,cspC,mtlF
1.00  TGTTAGAAAATCTAAC   glnR,nrgA,glnA
1.00  AGGTGCTAAATCCTGC   SA0011
0.89  GCCAGCGTAGGGAAGT   SA0928,SA0929,thiD,thiE,SA1897,gapR,thiM
0.88  ACAGGTCATAAGGGTC   SA0929,SA1897,SA0928,thiD,polC,thiE,thiM
0.87  AAGGGTGGAACCACGA   thrS,leuS,alaS,cysE,cysS,SA0489,SA0490,SA0491,pheS,pheT,SA1931,aspS,hisS,ileS,tyrS,trpG,valS,serS,SA0331,SA2101,SA1486,SA1289,SA1290,SA1291,SA2205,SA1392,truncated(radC),SA1578,murE,SA2102,SA1562,SA1199,trpD,trpC,trpF,trpB,trpA
0.86  TGTGAA?T?TTTCAC?   narG,narI,SA2183,narH,pflB,SA2174,lctE,SA1455,narK,SA0293,msmX,adhE,rpsU,fbaA
0.83  AAAAGAGTGCTAACA?   crtM,groES,hrcA,SA1747,SA1582,SA1581,SA2305,SA1748,grpE
0.83  TTGAAAATGATTATCA   SA0307,SA0116,SA0689,SA0117,SA0690,SA0331,SA0977,SA0978,SA1329,SA2162,ahpF,SA1979,SA0688,feoB,SA2338,sirA,SA2079,katA,SA0757,ahpC,fhuA,fhuB,fhuG,SA0335,SA2101,SA0160,SA0170,hemX,sirB,hemL,hemB,hemD,hemC,dapD,hemA,SA2102,SA0588,SA0589,SA0115,dps,fer,SA0774,SA1678
0.82  ?A?AAAAGTTATCCAC   SA0339,orfX,dnaA,dnaN,SA1419,SA1420,SA1421,SA1422,SA1423,aroE,SA1425,SA1426,SA0248
Known in
Known in
Unknown