Bioinformatics for the Identification of Sequences Regulating Gene Transcription
Wyeth W. Wasserman
University of British Columbia
www.cisreg.ca
Acknowledgements
Wasserman Group: Dave Arenillas, Jochen Brumm, Alice Chou, Debra Fulton, Shannan Ho Sui, Carol Huang, Danielle Kemmer (KI), Byron Kuo, Jonathan Lim, Raf Podowski (KI), Dora Pak, Chris Walsh, Dimas Yusuf
Collaborating Trainees: Malin Andersson (KTH), Öjvind Johansson (KTH), Stuart Lithwick (U.Toronto)
Collaborators: Jenny Bryan (UBC), Brenda Gallie (OCI), Jens Lagergren (KTH), Chip Lawrence (Brown), Boris Lenhard (K.I.), James Mortimer (MF), Jacob Odeberg (KTH)
Group Alumni: Wynand Alkema, Elena Herzog, Annette Höglund, William Krivan, Luis Mendoza, Albin Sandelin
Support: CIHR, CGDN, MSFHR, CFI, Merck-Frosst, BC Children’s Hospital Foundation
CMMT
TATA URE
A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4
Set of binding sites
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
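A set of aligned binding sites is usually summarized by counting each base in each column. The sketch below does this for the first four sites listed on the slide; the function name `count_matrix` and the choice of four sites are illustrative assumptions, not part of the original deck.

```python
ALPHABET = "ACGT"

def count_matrix(sites):
    """Return {base: [count per column]} for equal-length, aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = {base: [0] * width for base in ALPHABET}
    for site in sites:
        for column, base in enumerate(site.upper()):
            matrix[base][column] += 1
    return matrix

# First four sites from the slide, used only as a small illustration.
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA"]
pfm = count_matrix(sites)
for base in ALPHABET:
    print(base, pfm[base])
```

Passing the full list of sites instead of the first four would give the count matrix for the whole set.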
Scanning a sequence against a PWM
A [-0.2284  0.4368 -1.5    -1.5    -1.5     0.4368 -1.5    -1.5    -0.2284  0.4368 ]
C [-0.2284 -0.2284 -1.5    -1.5     1.5128 -1.5    -0.2284 -1.5    -0.2284 -1.5    ]
G [ 1.2348  1.2348  2.1222  2.1222  0.4368  1.2348  1.5128  1.7457  1.7457 -1.5    ]
T [ 0.4368 -0.2284 -1.5    -1.5    -0.2284  0.4368  0.4368  0.4368 -1.5     1.7457 ]
ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC
Abs_score = 13.4 (sum of column scores)
Sp1
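Scanning slides the matrix along the sequence and sums one column score per position, which is the "sum of column scores" named on the slide. The following is a minimal sketch using the Sp1 matrix and the sequence shown above. It scans the forward strand only, the function names are illustrative, and it is not claimed to reproduce the Abs_score value quoted on the slide.

```python
# Sp1 position weight matrix as printed on the slide (10 columns).
PWM = {
    "A": [-0.2284, 0.4368, -1.5, -1.5, -1.5, 0.4368, -1.5, -1.5, -0.2284, 0.4368],
    "C": [-0.2284, -0.2284, -1.5, -1.5, 1.5128, -1.5, -0.2284, -1.5, -0.2284, -1.5],
    "G": [1.2348, 1.2348, 2.1222, 2.1222, 0.4368, 1.2348, 1.5128, 1.7457, 1.7457, -1.5],
    "T": [0.4368, -0.2284, -1.5, -1.5, -0.2284, 0.4368, 0.4368, 0.4368, -1.5, 1.7457],
}
WIDTH = len(PWM["A"])

def window_score(window):
    """Absolute score of one window: the sum of its column scores."""
    return sum(PWM[base][i] for i, base in enumerate(window))

def scan(sequence):
    """Return (start, score) for every window on the forward strand."""
    return [
        (start, window_score(sequence[start:start + WIDTH]))
        for start in range(len(sequence) - WIDTH + 1)
    ]

sequence = "ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC"
hits = scan(sequence)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, sequence[best_start:best_start + WIDTH], round(best_score, 2))
```

A fuller scanner would also score the reverse complement of each window, since a factor can bind either strand.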
Calculating the relative score
A [-0.2284  0.4368 -1.5    -1.5    -1.5     0.4368 -1.5    -1.5    -0.2284  0.4368 ]
C [-0.2284 -0.2284 -1.5    -1.5     1.5128 -1.5    -0.2284 -1.5    -0.2284 -1.5    ]
G [ 1.2348  1.2348  2.1222  2.1222  0.4368  1.2348  1.5128  1.7457  1.7457 -1.5    ]
T [ 0.4368 -0.2284 -1.5    -1.5    -0.2284  0.4368  0.4368  0.4368 -1.5     1.7457 ]
Max_score = 15.2 (sum of highest column scores)
Min_score = -10.3 (sum of lowest column scores)

$$\text{Rel\_score} = \frac{\text{Abs\_score} - \text{Min\_score}}{\text{Max\_score} - \text{Min\_score}} \cdot 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \cdot 100\% = 93\%$$
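The relative score rescales the absolute score to the range spanned by the matrix minimum and maximum. A short sketch using the three values quoted on the slide (Abs_score 13.4, Max_score 15.2, Min_score -10.3); the function name is an assumption made for illustration.

```python
def relative_score(abs_score, min_score, max_score):
    """Rescale an absolute PWM score to a 0-100% range."""
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")
    return 100.0 * (abs_score - min_score) / (max_score - min_score)

rel = relative_score(13.4, -10.3, 15.2)
print(f"{rel:.0f}%")  # prints 93%
```

Because every matrix has its own minimum and maximum, a single relative threshold such as 75% can be applied across profiles of different widths.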
Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75%
Ouch.
Human Mouse Actin, alpha cardiac
SELECTIVITY SENSITIVITY
Now Featuring: Ortholog Sequence Retrieval Service
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
TSS AaGT
GENE        DISEASE/CONDITION (associated)     REFERENCE
UGT1A1      Gilbert’s Syndrome - jaundice      PJ Bosma, et al., 1995
UCP3        Elevated Body Mass                 S Otabe et al., 2000
TNFalpha    Malaria Susceptibility             JC Knight et al., 1999
Resistin    Elevated Body Mass                 JC Engert et al., 2002
IL4Ralpha   Reduced soluble IL4R               H Hackstein et al., 2001
ABCA1       Coronary artery disease            KY Zwarts et al., 2002
Ob          Leptin levels                      J Hager et al., 1998
PEPCK       Obesity
PR          Endometrial cancer                 I DeVivo et al., 2002
LDLR        Familial hypercholesterolemia      Koivisto et al., 1994
1234567890123456789012345
ACGCATAAGTTAAtGAATAACAGAT
.............c...........
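One way to gauge whether a regulatory variant such as the t-to-c change at position 14 matters is to score both alleles against a binding profile and compare. The sketch below is an assumption-laden illustration: it builds a log-odds matrix from eight of the binding sites listed earlier in the deck, with a uniform background of 0.25 and a pseudocount of 0.25 per base (both chosen here, not taken from the slides), and reports the best-scoring window for each allele.

```python
import math

# Eight of the binding sites listed earlier in the deck (illustrative subset).
SITES = [
    "AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA",
    "AAGTTAACGA", "AGGTTAACGA", "ATGTTGATGA", "AAGTTAATGA",
]
BACKGROUND = 0.25   # assumed uniform base composition
PSEUDOCOUNT = 0.25  # assumed pseudocount per base

def log_odds_matrix(sites):
    """Build a log2-odds matrix from aligned sites."""
    n, width = len(sites), len(sites[0])
    matrix = {base: [0.0] * width for base in "ACGT"}
    for col in range(width):
        for base in "ACGT":
            count = sum(1 for site in sites if site[col] == base)
            prob = (count + PSEUDOCOUNT) / (n + 4 * PSEUDOCOUNT)
            matrix[base][col] = math.log2(prob / BACKGROUND)
    return matrix

def best_window(sequence, matrix):
    """Return (score, start) of the highest-scoring forward-strand window."""
    width = len(matrix["A"])
    sequence = sequence.upper()
    return max(
        (sum(matrix[b][i] for i, b in enumerate(sequence[s:s + width])), s)
        for s in range(len(sequence) - width + 1)
    )

reference = "ACGCATAAGTTAAtGAATAACAGAT"
variant = reference[:13] + "c" + reference[14:]  # position 14 (1-based): t -> c
pwm = log_odds_matrix(SITES)
ref_score, ref_pos = best_window(reference, pwm)
var_score, var_pos = best_window(variant, pwm)
print(f"reference {ref_score:.2f} at {ref_pos}, variant {var_score:.2f} at {var_pos}")
```

A drop in score for the variant allele only suggests weaker predicted binding; it does not by itself establish a functional effect.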
(Brief, as this is well described in the literature)
[Plot: legend entries Series1, Series2, Wildtype, Other; axis labels not recoverable]
(w/ J. Lagergren, Royal Technical University of Sweden)
genome sequence
Set of co-expressed genes
Automated sequence retrieval from EnsEMBL
Phylogenetic Footprinting
Detection of transcription factor binding sites
Statistical significance of binding sites
Putative mediating transcription factors
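The "statistical significance" step can be illustrated with two common over-representation measures, matching the Z-score and Fisher columns reported in the result tables of this deck. The deck does not spell out how those columns are computed, so the formulation below is an assumption: a one-tailed Fisher exact probability on the number of genes with a site, and a Z-score from the normal approximation to the binomial on site counts. All function and parameter names are illustrative.

```python
from math import comb, sqrt

def fisher_right_tail(hits_in_set, set_size, hits_in_background, background_size):
    """One-tailed Fisher exact probability (upper hypergeometric tail).

    Probability of seeing at least `hits_in_set` genes with a site among the
    co-expressed set, given the totals in the set and the background.
    """
    total = set_size + background_size
    total_hits = hits_in_set + hits_in_background
    upper = min(set_size, total_hits)
    numerator = sum(
        comb(total_hits, k) * comb(total - total_hits, set_size - k)
        for k in range(hits_in_set, upper + 1)
    )
    return numerator / comb(total, set_size)

def z_score(observed_sites, expected_rate, n_positions):
    """Z-score of an observed site count under a binomial model."""
    expected = expected_rate * n_positions
    std_dev = sqrt(n_positions * expected_rate * (1 - expected_rate))
    return (observed_sites - expected) / std_dev

# Toy numbers, not taken from the slides.
p_value = fisher_right_tail(3, 3, 0, 3)
z = z_score(30, 0.2, 100)
print(p_value, z)
```

The two measures answer different questions: the Fisher probability counts genes with at least one site, while the Z-score counts sites, so a profile can rank differently under each.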
TFs with experimentally-verified sites in the reference sets.
analyzed)
TF          Rank  Z-score  Fisher      TF          Rank  Z-score  Fisher
SRF         1     21.41    1.18e-02    HNF-1       1     38.21    8.83e-08
MEF2        2     18.12    8.05e-04    HLF         2     11.00    9.50e-03
c-MYB_1     3     14.41    1.25e-03    Sox-5       3     9.822    1.22e-01
Myf         4     13.54    3.83e-03    FREAC-4     4     7.101    1.60e-01
TEF-1       5     11.22    2.87e-03    HNF-3beta   5     4.494    4.66e-02
deltaEF1    6     10.88    1.09e-02    SOX17       6     4.229    4.20e-01
S8          7     5.874    2.93e-01    Yin-Yang    7     4.070    1.16e-01
Irf-1       8     5.245    2.63e-01    S8          8     3.821    1.61e-02
Thing1-E47  9     4.485    4.97e-02    Irf-1       9     3.477    1.69e-01
HNF-1       10    3.353    2.93e-01    COUP-TF     10    3.286    2.97e-01
TF          Class         Rank  Z-score  Fisher     (unlabeled)
p65         REL           1     36.57    5.66e-12   62
NF-kappaB   REL           2     32.58    5.82e-11   61
c-REL       REL           3     26.02    8.59e-08   63
Irf-2       TRP-CLUSTER   4     20.39    5.74e-04   6
SPI-B       ETS           5     16.59    1.23e-03   135
Irf-1       TRP-CLUSTER   6     15.4     9.55e-04   23
Sox-5       HMG           7     15.38    2.56e-02   126
p50         REL           8     14.72    2.23e-03   19
Nkx         HOMEO         9     13.66    2.29e-03   111
Bsap        PAIRED        10    13.2     9.92e-02   1
FREAC-4     FORKHEAD      11    12.05    1.66e-03   92
n-MYC       bHLH-ZIP      25    6.695    1.84e-03   102
ARNT        bHLH          26    6.695    1.84e-03   102
HNF-3beta   FORKHEAD      29    5.948    3.32e-03   47
SOX17       HMG           31    5.406    8.60e-03   79
TF          Class             Rank  Z-score  Fisher     (unlabeled)
Myc-Max     bHLH-ZIP          1     21.68    5.35e-03   7
Staf        ZN-FINGER, C2H2   2     20.17    1.70e-02   2
Max         bHLH-ZIP          3     18.32    2.16e-02   12
SAP-1       ETS               4     13.23    1.61e-04   13
USF         bHLH-ZIP          5     11.90    1.84e-01   16
SP1         ZN-FINGER, C2H2   6     11.68    4.40e-02   12
n-MYC       bHLH-ZIP          7     11.11    1.55e-01   20
ARNT        bHLH              8     11.11    1.55e-01   20
Elk-1       ETS               9     10.92    3.88e-03   19
Ahr-ARNT    bHLH              10    10.17    1.11e-01   25
TF                Class              Rank  Z-score  Fisher     (unlabeled)
c-FOS             bZIP               1     17.53    2.60e-05   45
RREB-1            ZN-FINGER, C2H2    2     8.899    1.41e-01   1
PPARgamma-RXRal   NUCLEAR RECEPTOR   3     3.991    2.98e-01   1
CREB              bZIP               4     3.626    1.25e-01   10
E2F               Unknown            5     2.965    7.67e-02   15
NF-kappaB         REL                6     2.915    1.04e-01   17
SRF               MADS               7     2.707    2.24e-01   2
MEF2              MADS               8     2.634    1.32e-01   13
c-REL             REL                9     2.467    5.79e-02   22
Staf              ZN-FINGER, C2H2    10    2.385    3.74e-01   1
Ahr-ARNT          bHLH               15    1.716    2.57e-03   63
deltaEF1          ZN-FINGER, C2H2    23    0.271    5.39e-03   75
Elk-1             ETS                21    0.7875   8.12e-03   37
MZF_1-4           ZN-FINGER, C2H2    27             5.41e-03   73
n-MYC             bHLH-ZIP           30             8.20e-03   51
ARNT              bHLH               31             8.20e-03   51
INPUT A LIST OF CO-EXPRESSED GENES
SELECT YOUR TFBS PROFILES
SELECT:
tgacttcc tgatctct agacctca tgacctct

$$p_{i,j} = \frac{c_{i,j} + b_j}{N + B}$$

c_{i,j}: count of symbol j in column i
N: number of sites
b_j: Pseudocount for symbol j
B: Sum of all pseudocounts in column
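Pseudocounts keep a base that was never observed in a column from receiving a probability of zero. The sketch below applies a pseudocount-corrected frequency to the four sites shown on this slide: each count is increased by the pseudocount for that symbol, and the column total is increased by the sum of all pseudocounts. The pseudocount value of 0.25 per base is an assumption made for illustration.

```python
SITES = ["tgacttcc", "tgatctct", "agacctca", "tgacctct"]
ALPHABET = "acgt"

def corrected_frequencies(sites, pseudocounts):
    """Per-column symbol frequencies with pseudocounts added.

    frequency = (count of symbol + pseudocount for symbol)
                / (number of sites + sum of all pseudocounts)
    """
    n_sites = len(sites)
    total_pseudo = sum(pseudocounts.values())
    columns = []
    for col in range(len(sites[0])):
        observed = [site[col] for site in sites]
        columns.append({
            symbol: (observed.count(symbol) + pseudocounts[symbol])
                    / (n_sites + total_pseudo)
            for symbol in ALPHABET
        })
    return columns

pseudo = {symbol: 0.25 for symbol in ALPHABET}  # assumed value
freqs = corrected_frequencies(SITES, pseudo)
print(freqs[0])
```

In the first column, "t" is seen in three of four sites and "c" in none, yet "c" still receives a small non-zero frequency, so its log-odds score stays finite.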
[Plot: PATTERN SIMILARITY against SEQUENCE LENGTH; reference patterns: True Mef2 Binding Sites]
Information content distributions of TFBS are distinctly non-random
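Information content is commonly measured per column as 2 bits minus the Shannon entropy of the base frequencies, assuming a uniform background. The sketch below computes it for single columns and sums over a matrix; the uniform-background assumption and the function names are choices made for this illustration.

```python
from math import log2

def column_information(probabilities):
    """Information content (bits) of one column with a uniform DNA background."""
    entropy = -sum(p * log2(p) for p in probabilities if p > 0)
    return 2.0 - entropy

def matrix_information(columns):
    """Total information content: the sum over all columns."""
    return sum(column_information(column) for column in columns)

# Toy columns in A, C, G, T order (not taken from the slides).
conserved = [1.0, 0.0, 0.0, 0.0]     # only one base observed
two_way = [0.5, 0.5, 0.0, 0.0]       # two equally likely bases
uninformative = [0.25, 0.25, 0.25, 0.25]
print(matrix_information([conserved, two_way, uninformative]))  # prints 3.0
```

Plotting the per-column values along a profile gives the information content distribution referred to on the slide.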
[Plot: Frequency against Score]
clpP TACCNCN(A/T)(A/T)NGNGGTA TACCNRWAAYGBGGTA
taccgctattgaggta taccccgatcggggta tacccattaaggagta taactctaaagtggta tacctcaatagcggta taccccgatcggggta tactccttaatgggta taccactttagagtta
A [0 8 1 0 1 1 1 5 3 5 0 2 1 0 0 8]
C [0 0 7 7 4 7 0 0 0 2 0 1 0 0 0 0]
G [0 0 0 0 1 0 2 0 0 0 7 4 7 7 0 0]
T [8 0 0 1 2 0 5 3 5 1 1 1 0 1 8 0]
Pattern detection
1818 sets of orthologs from S. aureus
-> Gibbs sampling
1430 patterns
-> Compare to random sequences
318 significant patterns
-> Cluster with MatrixAligner (Sandelin et al 2003); remove redundancies
154 unique patterns in S. aureus
[Plot: Fraction of total ORFs in regulon against site score threshold (p-value)]
175 members in E. coli => Site searches produce too many false positive hits
[Figure: putative binding sites upstream of geneA to geneG compared across genomes; legend: regulon, regulog; per-gene values shown: 1, 1, 0.66, 0.33, 1]
Regulon Conservation Filtering (RECF)
= putative binding site
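The values 1, 0.66 and 0.33 in the figure are consistent with scoring each gene by the fraction of compared genomes in which its ortholog also carries a putative binding site. That reading is an assumption, as are the genome names and gene sets in this sketch, which only illustrates the idea of filtering a predicted regulon by conservation.

```python
def conservation_scores(sites_by_genome):
    """Fraction of genomes in which each gene has a putative binding site.

    `sites_by_genome` maps a genome name to the set of genes (by ortholog
    name) that have a putative site in that genome.
    """
    genomes = list(sites_by_genome.values())
    all_genes = set().union(*genomes)
    return {
        gene: sum(gene in genes for genes in genomes) / len(genomes)
        for gene in sorted(all_genes)
    }

def conserved_members(sites_by_genome, min_score):
    """Keep genes whose conservation score reaches `min_score`."""
    scores = conservation_scores(sites_by_genome)
    return [gene for gene, score in scores.items() if score >= min_score]

# Hypothetical predictions for three genomes (not the data behind the figure).
predictions = {
    "genome1": {"geneA", "geneB", "geneC"},
    "genome2": {"geneA", "geneB"},
    "genome3": {"geneA"},
}
scores = conservation_scores(predictions)
print(scores)
print(conserved_members(predictions, 0.6))
```

Raising `min_score` trades sensitivity for specificity: fewer predicted members survive, but those that do are supported in more genomes.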
TF      # known   REGULON          REGULOG          Efficiency
                  Total    Pos     Total    Pos
pdhR    4         102      1       4        1       25.5
ilvY    2         182      2       12       2       15.2
        5         137      3       11       3       12.5
torR    4         77       4       7        4       11
metR    4         218      3       21       3       10.4
. . .
AVG.    9.8       174      7.2     20       3.8     4.2

$$\text{Efficiency}_{\text{RECF}} = \frac{\text{Specificity}_{\text{REGULOG}} \times \text{Sensitivity}_{\text{REGULOG}}}{\text{Specificity}_{\text{REGULON}} \times \text{Sensitivity}_{\text{REGULON}}}$$
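The efficiency figures in the table can be reproduced from its counts under two assumptions that the slide does not state explicitly: specificity is taken as Pos / Total (the fraction of predicted members that are known targets) and sensitivity as Pos / # known. The column assignment (# known, REGULON Total and Pos, REGULOG Total and Pos) is also a reading of the flattened table. With those assumptions, the ratio of the REGULOG product to the REGULON product matches the reported values.

```python
def specificity(pos, total):
    """Assumed definition: known targets among all predicted members."""
    return pos / total

def sensitivity(pos, known):
    """Assumed definition: known targets recovered among all known targets."""
    return pos / known

def recf_efficiency(known, regulon_total, regulon_pos, regulog_total, regulog_pos):
    """Ratio of (specificity x sensitivity) after filtering vs. before."""
    regulon = specificity(regulon_pos, regulon_total) * sensitivity(regulon_pos, known)
    regulog = specificity(regulog_pos, regulog_total) * sensitivity(regulog_pos, known)
    return regulog / regulon

# Rows as read from the table: (# known, REGULON Total, Pos, REGULOG Total, Pos).
rows = {
    "pdhR": (4, 102, 1, 4, 1),
    "ilvY": (2, 182, 2, 12, 2),
    "torR": (4, 77, 4, 7, 4),
    "metR": (4, 218, 3, 21, 3),
}
for tf, counts in rows.items():
    print(tf, round(recf_efficiency(*counts), 1))
```

When the number of recovered known sites is unchanged by filtering, as in these rows, the efficiency reduces to the ratio of the two Total columns.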
RCS   Consensus         Members (leftmost members are the members with the highest confidence)
1.00  AACACAATATATAGTG  nrdD,SA2409,nrdI,nrdE,cspC,mtlF
1.00  TGTTAGAAAATCTAAC  glnR,nrgA,glnA
1.00  AGGTGCTAAATCCTGC  SA0011
0.89  GCCAGCGTAGGGAAGT  SA0928,SA0929,thiD,thiE,SA1897,gapR,thiM
0.88  ACAGGTCATAAGGGTC  SA0929,SA1897,SA0928,thiD,polC,thiE,thiM
0.87  AAGGGTGGAACCACGA  thrS,leuS,alaS,cysE,cysS,SA0489,SA0490,SA0491,pheS,pheT,SA1931,aspS,hisS,ileS,tyrS,trpG,valS,serS,SA0331,SA2101,SA1486,SA1289,SA1290,SA1291,SA2205,SA1392,truncated(radC),SA1578,murE,SA2102,SA1562,SA1199,trpD,trpC,trpF,trpB,trpA
0.86  TGTGAA?T?TTTCAC?  narG,narI,SA2183,narH,pflB,SA2174,lctE,SA1455,narK,SA0293,msmX,adhE,rpsU,fbaA
0.83  AAAAGAGTGCTAACA?  crtM,groES,hrcA,SA1747,SA1582,SA1581,SA2305,SA1748,grpE
0.83  TTGAAAATGATTATCA  SA0307,SA0116,SA0689,SA0117,SA0690,SA0331,SA0977,SA0978,SA1329,SA2162,ahpF,SA1979,SA0688,feoB,SA2338,sirA,SA2079,katA,SA0757,ahpC,fhuA,fhuB,fhuG,SA0335,SA2101,SA0160,SA0170,hemX,sirB,hemL,hemB,hemD,hemC,dapD,hemA,SA2102,SA0588,SA0589,SA0115,dps,fer,SA0774,SA1678
0.82  ?A?AAAAGTTATCCAC  SA0339,orfX,dnaA,dnaN,SA1419,SA1420,SA1421,SA1422,SA1423,aroE,SA1425,SA1426,SA0248
Known in
Known in
Unknown
– Gibbs sampling, LRA, neural networks, SVMs, etc
– Phylogenetic Footprinting
– Regulatory Modules – Familial Binding Profiles
profiles for critical transcription factors