Gene Regulation Bioinformatics Wyeth Wasserman Centre for Molecular - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Gene Regulation Bioinformatics

Wyeth Wasserman

Centre for Molecular Medicine and Therapeutics Department of Medical Genetics Children’s & Women’s Hospitals

University of British Columbia

slide-2
SLIDE 2

CMMT

Overview

  • Basics of promoter analysis

– Bioinformatics for detection of transcription factor binding sites

  • Discrimination of Regulatory Regions

– Given binding models for relevant TFs, predict regulatory sequences
– Genetic variation within regulatory regions

  • Pattern discovery (as time permits)

– Given a set of co-regulated genes, predict binding sites for contributing TFs
– Given a newly discovered binding profile, predict genes in a regulon

slide-3
SLIDE 3

CMMT

Transcription Simplified

[Figure: transcription schematic with labels URE, URF, TATA, and Pol-II]

slide-4
SLIDE 4

Teaching a computer to find TFBS…

slide-5
SLIDE 5

Representing Binding Sites for a TF

  • A single site

– AAGTTAATGA

  • A set of sites represented as a consensus

– VDRTWRWWSHD (IUPAC degenerate DNA)

  • A matrix describing a set of sites

A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

Set of binding sites

AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
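A minimal sketch of how a set of aligned sites becomes a count matrix, using the first eight sites from the set listed on this slide. The function name `count_matrix` is hypothetical and chosen only for illustration.

```python
# First eight sites from the slide's "Set of binding sites"
SITES = [
    "AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA",
    "GAGTTAATAA", "CAGTTATTCA", "GAGTTAATAA", "CAGTTAATCA",
]

def count_matrix(sites):
    """Return {base: [count at each position]} for equal-length sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for position, base in enumerate(site):
            matrix[base][position] += 1
    return matrix

pfm = count_matrix(SITES)
for base in "ACGT":
    # One row per base, one column per site position
    print(base, " ".join(f"{n:2d}" for n in pfm[base]))
```

Each column of the result sums to the number of sites, which is a quick sanity check on any count matrix.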

slide-6
SLIDE 6

CMMT

TGCTG = 0.9

PFMs to PWMs

One would like to add the following features to the model:

  • 1. Correcting for the base frequencies in DNA
  • 2. Weighting for the confidence (depth) in the pattern
  • 3. Convert to log-scale probability for easy arithmetic

f matrix

A   5  0  1  0  0
C   0  2  2  4  0
G   0  3  1  0  4
T   0  0  1  1  1

w matrix

A   1.6 -1.7 -0.2 -1.7 -1.7
C  -1.7  0.5  0.5  1.3 -1.7
G  -1.7  1.0 -0.2 -1.7  1.3
T  -1.7 -1.7 -0.2 -0.2 -0.2

$w(b,i) = \log\left(\frac{f(b,i) + s(N)}{p(b)}\right)$
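The conversion from the f matrix to the w matrix can be sketched as follows. The slide does not define s(N) or the normalization, so this sketch assumes a total pseudocount of sqrt(N) shared according to the background frequencies, division by N + sqrt(N), a uniform background p(b) = 0.25, and log base 2. Under those assumptions the rounded values agree with the w matrix on the slide, and the site TGCTG scores about 0.9, as quoted on the slide.

```python
import math

# f matrix from the slide (counts per base and position, N = 5 sites)
F = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}

def pfm_to_pwm(f, background=None):
    """Convert counts to log2 odds (assumed sqrt(N) pseudocount)."""
    background = background or {b: 0.25 for b in "ACGT"}
    width = len(f["A"])
    pwm = {b: [] for b in "ACGT"}
    for i in range(width):
        n = sum(f[b][i] for b in "ACGT")
        pseudo = math.sqrt(n)  # assumption: total pseudocount is sqrt(N)
        for b in "ACGT":
            corrected = (f[b][i] + pseudo * background[b]) / (n + pseudo)
            pwm[b].append(math.log2(corrected / background[b]))
    return pwm

def score(pwm, site):
    """Sum the position-specific weights for one candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(site))

W = pfm_to_pwm(F)
tgctg_score = score(W, "TGCTG")
print(round(W["A"][0], 1), round(tgctg_score, 1))
```

Because the weights are logarithms, scoring a site is a sum over positions, which is the "easy arithmetic" the slide refers to.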

slide-7
SLIDE 7

CMMT

Performance of Profiles

  • 95% of predicted sites bound in vitro (Tronche 1997)
  • MyoD binding sites predicted about once every 600 bp (Fickett 1995)
  • The Futility Theorem

– Nearly 100% of predicted transcription factor binding sites have no function in vivo

slide-8
SLIDE 8

CMMT

A 1 kbp promoter screened with collection of TF profiles
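Screening a promoter with a profile amounts to sliding the weight matrix along both strands and reporting windows above a threshold. The sketch below uses the rounded w matrix from slide 6; the sequence, the threshold, and the function names are hypothetical.

```python
# Rounded w matrix from slide 6
W = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def scan(seq, pwm, threshold):
    """Return (start, strand, site, score) for every window at or above threshold."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        for strand, site in (("+", window), ("-", revcomp(window))):
            s = sum(pwm[base][i] for i, base in enumerate(site))
            if s >= threshold:
                hits.append((start, strand, site, round(s, 1)))
    return hits

# Hypothetical 16 bp sequence and threshold, for illustration only
hits = scan("TTAGCCGTTTGCTGAA", W, threshold=4.0)
for hit in hits:
    print(hit)
```

Lowering the threshold quickly increases the number of hits, which is the over-prediction problem described by the Futility Theorem.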

slide-9
SLIDE 9

CMMT

Phylogenetic Footprinting for better specificity

70,000,000 years of evolution reveals most regulatory regions.

slide-10
SLIDE 10

CMMT

SIDENOTE: Global Progressive Alignments (ORCA Algorithm)

  • Global alignment: memory requirement = product of sequence lengths
  • Progressive alignment by banding with local alignments (e.g. BLAST) and running the global method on banded sub-segments
  • Recursion with decreasingly stringent parameters

ORCA

slide-11
SLIDE 11

CMMT

Phylogenetic Footprinting Identifies Functional Segments

[Figure: % identity against start position of 200 bp window (human sequence). Actin gene compared between human and mouse by ORCA.]
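The % identity profile plotted on this slide can be sketched as a sliding window over a pairwise alignment. This assumes the two sequences are already aligned (for example by ORCA) and padded with "-" for gaps; the toy sequences and the 5-column window are hypothetical, and positions are alignment columns rather than human coordinates.

```python
def window_identity(aligned_a, aligned_b, window=200, step=1):
    """Return [(start column, % identity)] for windows over an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for start in range(0, len(aligned_a) - window + 1, step):
        pairs = zip(aligned_a[start:start + window],
                    aligned_b[start:start + window])
        # A column counts as identical only if both bases match and neither is a gap
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        profile.append((start, 100.0 * matches / window))
    return profile

# Toy alignment with a short window, for illustration only
profile = window_identity("ACGTACGT-A", "ACGTTCGTAA", window=5)
for start, identity in profile:
    print(start, identity)
```

Windows whose identity exceeds a chosen cutoff would then be kept as candidate regulatory segments.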

slide-12
SLIDE 12

CMMT

Phylogenetic Footprinting (2)

[Figure: FoxC2. % identity (0% to 100%) against start position of 200bp window (1000 to 7000)]

slide-13
SLIDE 13

CMMT

Recall...

slide-14
SLIDE 14

CMMT

1kbp promoter with phylogenetic footprinting

slide-15
SLIDE 15

CMMT

Choosing the “right” species...

[Figure: three comparisons: HUMAN vs. COW, HUMAN vs. MOUSE, HUMAN vs. CHICKEN]

slide-16
SLIDE 16

CMMT

Performance: Human vs. Mouse

  • Testing set: 40 experimentally defined sites in 15 well-studied genes (replicated with 100+ site set)
  • 75-90% of defined sites detected with conservation filter (SENSITIVITY), while only 11-16% of total predictions retained (SELECTIVITY)

slide-17
SLIDE 17

CMMT

ConSite (www.phylofoot.org)

NEW: Ortholog Sequence Retrieval Service

slide-18
SLIDE 18

CMMT

Emerging Issues

  • Multiple sequence comparisons

– Incorporate phylogenetic trees
– Visualization

  • Analysis of closely related species

– Phylogenetic shadowing

  • Genome rearrangements

– Inversion compatible alignment algorithm

  • Higher order models of TFBS
slide-19
SLIDE 19

CMMT

Improving Pattern Discrimination

TFs do NOT act in isolation

slide-20
SLIDE 20

Layers of Complexity in Metazoan Transcription

slide-21
SLIDE 21

CMMT

Biochemical complexity enables greater complexity in regulation

[Figure: Yeast: ORF A with a 500 bp region and three GO signals. Humans: EXON 1, 2, 3 across 20 000 bp with nine GO signals.]

slide-22
SLIDE 22

CMMT

Detecting Clusters of TF Binding Sites

  • Trained Methods

– Sufficient examples of real clusters to establish weights on the relative importance of each TF

  • Statistical Over-representation

– Binding profiles available for a set of biologically motivated

slide-23
SLIDE 23

CMMT

Training for the detection of liver cis-regulatory modules (CRMs)

slide-24
SLIDE 24

CMMT

Models for Liver TFs…

(10 second slide for 3 months of work)

HNF1 C/EBP HNF3 HNF4

slide-25
SLIDE 25

CMMT

Logistic Regression Analysis

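[Figure: the four inputs are multiplied by weights α1, α2, α3, α4 and summed (Σ) to give the “logit”]

Optimize α vector to maximize the distance between output values for positive and negative training data.

Output value is: $p(x) = \frac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}}$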
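The output value on this slide is the standard logistic function of a weighted sum. In the sketch below the four inputs stand for per-TF scores in a sequence window (HNF1, C/EBP, HNF3, HNF4). The weights, the intercept, and the function name are hypothetical, since the fitted α values are not given on the slide.

```python
import math

# Hypothetical weights for four TF scores; the fitted values are not on the slide
ALPHAS = [0.8, 0.5, 0.6, 0.7]

def crm_probability(tf_scores, alphas, intercept=0.0):
    """Logistic output p(x) = e^logit / (1 + e^logit) for one window."""
    logit = intercept + sum(a * s for a, s in zip(alphas, tf_scores))
    # Two algebraically equal forms, chosen to avoid overflow in exp()
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)

print(crm_probability([0.0, 0.0, 0.0, 0.0], ALPHAS))
print(crm_probability([2.0, 1.0, 0.5, 1.5], ALPHAS, intercept=-3.0))
```

Training then amounts to choosing the α vector so that positive windows score near 1 and negative windows near 0.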

slide-26
SLIDE 26

CMMT

Performance of the Liver Model

  • Performance

– Sensitivity: 60% of known CRMs detected
– Specificity: 1 prediction/35,000bp

  • Limitations

– Applies to genes expressed late in hepatocyte differentiation
– Requires 10-15 genes in positive training set
– This model doesn’t account for multiple sites for the same TF

  • New methods from several groups address this limit
slide-27
SLIDE 27

CMMT

UGT1A1

[Figure: UGT1A1. Liver module model score (-0.2 to 1) against “window” position in sequence (100 to 5840); series: Wildtype, Other]

slide-28
SLIDE 28

CMMT

MSCAN: An untrained method for CRM detection

(w/ J. Lagergren, Royal Technical University of Sweden)

  • MSCAN takes as input a user-defined set of TF profiles
  • Calculates significance for each observed “site” based on local sequence characteristics
  • Calculates cluster significance using a dynamic programming approach
  • Approximately 1 significant liver cluster / 18 000 bp in human genome sequence
  • Filters out statistically significant clusters of sites that contain local repeats
  • Identification of non-random characteristics in DNA

http://mscan.cgb.ki.se

slide-29
SLIDE 29

CMMT

JASPAR (jaspar.cgb.ki.se) OPEN-ACCESS DATABASE OF TF BINDING PROFILES

slide-30
SLIDE 30

CMMT

Making better predictions

  • Profiles make far too many false predictions to have predictive value in isolation
  • Phylogenetic footprinting eliminates ~90% of false predictions
  • Algorithms for detection of clusters of binding sites perform better, especially when possible to create trained discriminant functions

slide-31
SLIDE 31

CMMT

RAVEN Project: Regulatory Analysis of Variation in ENhancers

Genetic variation in TFBS can result in biomedically important phenotypes

slide-32
SLIDE 32

CMMT

Sequence Variation in TFBS

[Figure: promoter schematic with labels TSS, URF, and the site AaGT]

REFERENCE | DISEASE/CONDITION (associated) | GENE
PJ Bosma et al., 1995 | Gilbert’s Syndrome (jaundice) | UGT1A1
S Otabe et al., 2000 | Elevated Body Mass | UCP3
JC Knight et al., 1999 | Malaria Susceptibility | TNFalpha
JC Engert et al., 2002 | Elevated Body Mass | Resistin
H Hackstein et al., 2001 | Reduced soluble IL4R | IL4Ralpha
KY Zwarts et al., 2002 | Coronary artery disease | ABCA1
J Hager et al., 1998 | Leptin levels | Ob
Y. Olswang et al., 2002 | Obesity | PEPCK
I DeVivo et al., 2002 | Endometrial cancer | PR
Koivisto et al., 1994 | Familial hypercholesterolemia | LDLR

slide-33
SLIDE 33

CMMT

Stage 1: Prediction of Regulatory Regions

slide-34
SLIDE 34

CMMT

Stage 1: Identify Putative Regulatory Regions

[Figure: FoxC2. % identity (0% to 100%) against start position of window (1000 to 7000)]

  • Retrieves orthologous human and mouse gene sequences from GeneLynx

  • Aligns sequences with ORCA Aligner
  • Finds most significant non-coding regions
  • Designs primers
slide-35
SLIDE 35

CMMT

Data/Orthology obtained from GeneLynx (www.genelynx.org)

slide-36
SLIDE 36

CMMT

Stage 2: Analysis of Polymorphisms

ACGCATAAGTTAATGAATAACAGAT ACGCATAAGTTAATGAATAACAGAT ACGCATAAGTTAATGAATAACAGAT ACGCATAAGTTAATGAATAACAGAT ACGCATAAGTTAATGAATAACAGAT ACGCATAAGTTAACGAATAACAGAT ACGCATAAGTTAACGAATAACAGAT ACGCATAAGTTAACGAATAACAGAT ACGCATAAGTTAACGAATAACAGAT

slide-37
SLIDE 37

CMMT

Identify variations that generate allele-specific binding site predictions

1234567890123456789012345
ACGCATAAGTTAAtGAATAACAGAT
.............c...........

[Figure: differences in scores (axis -4 to 4) for windows 1 to 11]
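A sketch of the allele comparison for the sequence on this slide: every window that overlaps the variant is scored for both alleles and the difference is reported. The weight matrix is built from eight of the binding sites listed on slide 5 with an assumed pseudocount of 1 and a uniform background; the model actually used for the slide's plot is not stated, and only one strand is scored here.

```python
import math

# Eight sites from slide 5, used here as a stand-in training set
SITES = [
    "AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA",
    "GAGTTAATAA", "CAGTTATTCA", "GAGTTAATAA", "CAGTTAATCA",
]
REF = "ACGCATAAGTTAATGAATAACAGAT"  # allele with t at position 14
ALT = "ACGCATAAGTTAACGAATAACAGAT"  # allele with c at position 14

def build_pwm(sites, pseudo=1.0):
    """Log2-odds weights per column (assumed pseudocount, uniform background)."""
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        total = len(sites) + pseudo
        pwm.append({b: math.log2(((column.count(b) + pseudo / 4) / total) / 0.25)
                    for b in "ACGT"})
    return pwm

def score(pwm, site):
    return sum(col[base] for col, base in zip(pwm, site))

def allele_differences(ref, alt, pwm):
    """(1-based window start, ref score - alt score) for windows over the variant."""
    width = len(pwm)
    pos = next(i for i, (a, b) in enumerate(zip(ref, alt)) if a != b)
    first = max(0, pos - width + 1)
    last = min(pos, len(ref) - width)
    return [(start + 1,
             score(pwm, ref[start:start + width]) - score(pwm, alt[start:start + width]))
            for start in range(first, last + 1)]

PWM = build_pwm(SITES)
diffs = allele_differences(REF, ALT, PWM)
for start, delta in diffs:
    print(start, round(delta, 2))
```

A large positive or negative difference in any window flags a candidate allele-specific binding site.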

slide-38
SLIDE 38

CMMT

RAVEN Implementation Status A first look at the alpha-version of the RAVEN service…

slide-39
SLIDE 39

CMMT

RAVEN screenshots

slide-40
SLIDE 40

CMMT

Stage 3: Prediction of Regulatory “HotSpots”

slide-41
SLIDE 41

CMMT

UGT1A1 (Gilbert’s Syndrome)

[Figure: UGT1A1. Liver module model score (-0.2 to 1) against “window” position in sequence (100 to 5840); series: Wildtype, Mutant]

slide-42
SLIDE 42

CMMT

“HotSpots” in Muscle Regulatory Module (200bp)

[Figure: maximum differential for any potential SNP, on a scale from -0.2 to 0.2]

slide-43
SLIDE 43

CMMT

RAVEN Summary

  • Existing tools can be used to predict allele-specific binding sites
  • Essentially all SNPs will result in alteration of a predicted binding site
  • CRM analysis is likely to be required to produce specific predictions

slide-44
SLIDE 44

CMMT

Linking co-expressed genes from microarrays to candidate transcription factors

slide-45
SLIDE 45

CMMT

Pattern Discovery

slide-46
SLIDE 46

CMMT

POSSUM Project

  • A significant subset of TFs are represented by existing binding profiles
  • Within same structural class, often binding specificity retained (more on this later)
  • Can we link known TFs to a putative regulon by over-representation of predicted binding sites in promoters?

slide-47
SLIDE 47

CMMT

POSSUM Procedure

  • Set of co-expressed genes
  • Automated sequence retrieval from EnsEMBL
  • Phylogenetic Footprinting (ORCA)
  • Detection of transcription factor binding sites
  • Statistical significance of binding sites

– Z score
– Fisher

  • Putative mediating transcription factors
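The two significance measures named in the procedure can be sketched with textbook formulations. The slide does not spell out the exact statistics, so the following are assumptions: a one-sided Fisher exact test on the number of genes with at least one predicted site, and a z-score from the normal approximation to the binomial on site counts per base pair. All input numbers are hypothetical.

```python
import math

def fisher_right_tail(hits_fg, n_fg, hits_bg, n_bg):
    """One-sided Fisher exact p-value: P(X >= hits_fg) under the hypergeometric."""
    total = n_fg + n_bg
    hits = hits_fg + hits_bg
    denom = math.comb(total, n_fg)
    p = 0.0
    for k in range(hits_fg, min(hits, n_fg) + 1):
        p += math.comb(hits, k) * math.comb(total - hits, n_fg - k) / denom
    return p

def z_score(sites_fg, length_fg, sites_bg, length_bg):
    """Observed vs. expected site count, normal approximation to the binomial."""
    rate = sites_bg / length_bg            # background sites per bp
    expected = rate * length_fg
    sd = math.sqrt(length_fg * rate * (1.0 - rate))
    return (sites_fg - expected) / sd

# Hypothetical counts: 12 of 15 co-expressed genes vs. 300 of 5000 background genes
print(fisher_right_tail(12, 15, 300, 5000))
# Hypothetical counts: 20 sites in 1000 bp vs. 100 sites in 10000 bp of background
print(round(z_score(20, 1000, 100, 10000), 2))
```

The gene-level test is insensitive to several sites in one promoter, while the site-level z-score rewards them, which is one reason to report both.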

slide-48
SLIDE 48

CMMT

Reference Co-Regulated Gene Sets

++++ p<1e-30, +++ p<1e-10, ++ p<1e-05, + p<1e-02

+ Ahr-ARNT + GATA-2 + GATA-2 + Sox-5 + SPI-B + + MZF_1-4 + Sox-5 + Yin-Yang + Elk-1 + HNF-3beta * + MZF_5-13 ++ Brachyury + S8 + Thing1-E47 +++ Irf-1 + RORalpha-1 + Gklf + +++ SPI-1 ++ FREAC-4 + Tal1beta- E47S + ++++ c-FOS ++ E4BP4 ++ Pax-2 + ++++ SPI-B +++ FREAC-3 ++ c-MYB-1 + ++++ Irf-2 +++ GATA-2 ++ TEF-1 * + ++++ p50 * +++ FREAC-7 + ++++ FREAC-7 ++ ++++ p65 * + ++++ Gfi + ++++ Myf * + ++++ c-REL * + ++++ COUP-TF + ++++ SRF * ++ ++++ NF-κB * + ++++ HNF-1 * ++ ++++ Mef2 * Fisher p-value z-score p-value Fisher P-value z-score p-value Fisher p-value z-score p-value

  • A. Muscle-specific (15)
  • B. Liver-specific (15)
  • C. Known NF-κB targets (61)
slide-49
SLIDE 49

CMMT

MICROARRAY APPLICATION:

NF-kB Inhibitor-sensitive genes (326)

Genes Significantly Down-regulated After Treatment with Inhibitor

TF | Class | z-score p-value | Fisher p-value
NF-kappaB | Rel/NFkB | ++++ | ++
p65 | Rel/NFkB | ++++ | ++
c-Rel | Rel/NFkB | ++++ | +
p50 | Rel/NFkB | ++++ | +
Pbx | Homeo | ++++ | +
Sox-5 | HMG | +++ | +
SPI-B | ETS | +++ | +
HFH-2 | Forkhead | +++ | +
FREAC-4 | Forkhead | +++ | +
Max | bHLH-ZIP | ++ | +
SRY | HMG | + | +

++++ p<1e-30, +++ p<1e-10, ++ p<1e-05, + p<1e-02

slide-50
SLIDE 50

CMMT

de novo Discovery of TF Binding Sites
slide-51
SLIDE 51

CMMT

Pattern Discovery

slide-52
SLIDE 52

CMMT

de novo Pattern Discovery

  • Exhaustive

– e.g. YMF (Sinha & Tompa)
– Generalization: Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections

  • Monte Carlo/Gibbs Sampling

– e.g. AnnSpec (Workman & Stormo)
– Generalization: Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics
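The exhaustive approach can be illustrated by counting every oligomer in a “+” and a “-” promoter set and ranking by enrichment. This is a deliberately simplified stand-in, not the YMF algorithm: it uses a smoothed ratio of the fraction of sequences containing each oligomer, whereas published tools use proper background models and significance scores. The sequences and the pseudocount are hypothetical.

```python
from collections import Counter

def kmer_presence(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def enriched_kmers(positive, negative, k, pseudo=1.0):
    """Rank k-mers by smoothed (fraction in '+') / (fraction in '-')."""
    pos = kmer_presence(positive, k)
    neg = kmer_presence(negative, k)
    ranked = []
    for kmer, n_pos in pos.items():
        frac_pos = (n_pos + pseudo) / (len(positive) + pseudo)
        frac_neg = (neg.get(kmer, 0) + pseudo) / (len(negative) + pseudo)
        ranked.append((frac_pos / frac_neg, kmer))
    return sorted(ranked, reverse=True)

# Hypothetical toy promoter sets
POSITIVE = ["AATGACTCAA", "CCTGACTCGG", "GTTGACTCTT"]
NEGATIVE = ["AACCGGTTAA", "CCGGAATTCC", "GTGTGTGTGT"]

ranked = enriched_kmers(POSITIVE, NEGATIVE, k=6)
for ratio, kmer in ranked[:3]:
    print(kmer, round(ratio, 2))
```

Gibbs sampling methods address the same question without enumerating oligomers, by iteratively refining a matrix model against a background.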

slide-53
SLIDE 53

CMMT

Regulatory Analysis Methods for Single-celled Organisms The Proving Ground

slide-54
SLIDE 54

CMMT

Yeast Regulatory Sequence Analysis (YRSA) system

slide-55
SLIDE 55

CMMT

Tests of YRSA System

  • PDR3-regulated genes from array study
  • Classic cell-cycle array data re-clustered by Getz et al
  • DNA-damage response partially mediated by MCB

slide-56
SLIDE 56

CMMT

[Figure, panels a and b: Sequence depth dependency of MAP scores. Performance: Hit and Miss. Labels: rank of found pattern in verified promoters (+) and in randomly selected promoters; comparison rank of correct pattern (axis 5 to 40); average comparison rank (verified promoters, random promoters) against number of promoters (sequence depth).]

LEU3 STE12 RLM1 MIG1 OAF1 GAL4 XBP1 CBF1 RPN4 PDR3 ADR1 REB1 ABF1 RAP1 GCN4 PHO4
39.0 17.0 21.0 17.8 na 7.2 0.7 na 1.5 na 1.1 1.0 0.9 1.1 1.1 0.8
7 10 7 16 5 6 28 20 24 10 20 24 27 17 12 18

slide-57
SLIDE 57

CMMT

Applied Pattern Discovery is Acutely Sensitive to Noise

[Figure: pattern similarity vs. true MEF2 profile (10 to 18) against sequence length (100 to 600), for true Mef2 binding sites]

slide-58
SLIDE 58

CMMT

Four Approaches to Improve Sensitivity

  • Better background models
  • Higher-order properties of DNA
  • Phylogenetic Footprinting

– Human:Mouse comparison eliminates ~75% of sequence

  • Regulatory Modules

– Architectural rules

  • Limit the types of binding profiles allowed

– TFBS patterns are NOT random

slide-59
SLIDE 59

CMMT

Pattern discovery methods using biochemical constraints

slide-60
SLIDE 60

CMMT

Some profile constraints have been explored…

  • Segmentation of informative columns
  • Palindromic patterns
slide-61
SLIDE 61

CMMT

Our Hypothesis

  • Point 1: Structurally-related DNA binding domains interact with similar target sequences

– Exceptions exist (e.g. Zn-fingers)

  • Point 2: There are a finite number of binding domains used in human TFs

– Approximately 20-25

  • Idea: We could use the shared binding properties for each family to focus pattern detection methods

– Constrain the range of patterns sought
slide-62
SLIDE 62

CMMT

Comparison of profiles requires alignment and a scoring function

  • Scoring function based on sum of squared differences
  • Align frequency matrices with modified Needleman-Wunsch algorithm
  • Calculate empirical p-values based on simulated set of matrices

[Figure: histogram of frequency against score]
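A sketch of profile comparison with a sum-of-squared-differences column score. To stay short it swaps the modified Needleman-Wunsch alignment named on the slide for a plain ungapped slide of one profile along the other, so it is a simplification rather than the method itself. The `one_hot` helper, the `min_overlap` parameter, and the toy profiles are hypothetical.

```python
def one_hot(seq):
    """Toy profile: each column puts frequency 1.0 on a single base."""
    return [{b: 1.0 if b == base else 0.0 for b in "ACGT"} for base in seq]

def column_distance(col_a, col_b):
    """Sum of squared differences between two frequency columns."""
    return sum((col_a[b] - col_b[b]) ** 2 for b in "ACGT")

def best_ungapped_alignment(profile_a, profile_b, min_overlap=3):
    """Return (mean column distance, offset of profile_b) with the lowest mean."""
    best = None
    lowest = -(len(profile_b) - min_overlap)
    highest = len(profile_a) - min_overlap
    for offset in range(lowest, highest + 1):
        pairs = [(profile_a[i], profile_b[i - offset])
                 for i in range(len(profile_a))
                 if 0 <= i - offset < len(profile_b)]
        if len(pairs) < min_overlap:
            continue
        mean_ssd = sum(column_distance(a, b) for a, b in pairs) / len(pairs)
        if best is None or mean_ssd < best[0]:
            best = (mean_ssd, offset)
    return best

best = best_ungapped_alignment(one_hot("ACGTAC"), one_hot("GTA"))
print(best)
```

An empirical p-value would follow by repeating the comparison against a large set of simulated matrices and asking how often a score this low occurs.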

slide-63
SLIDE 63

CMMT

Intra-family comparisons more similar than inter-family

[Figure: profile COMPARE against TF Database (JASPAR); example: match to bHLH]

  • Jackknife Test: 87% correct
  • Independent Test Set: 93% correct

slide-64
SLIDE 64

CMMT

slide-65
SLIDE 65

CMMT

FBPs enhance sensitivity of pattern detection

slide-66
SLIDE 66
slide-67
SLIDE 67

CMMT

Conclusions

  • Pattern analysis methods have utility
  • Combine knowledge from multiple fields
  • Statistics and AI methods must be imported

– Gibbs sampling, LRA, neural networks, SVMs, etc

  • Evolution drives understanding in biology

– Phylogenetic Footprinting

  • Biochemistry inspires Bioinformatics

– Regulatory Modules
– Familial Binding Profiles

  • Analysis of regulatory sequences is improving
  • Given sets of orthologous genes, one can predict regulatory regions
  • Given sets of co-regulated genes, it is possible to infer the binding profiles for critical transcription factors

slide-68
SLIDE 68

Acknowledgements

Wasserman Group

Wynand Alkema, Dave Arenillas, Jochen Brumm, Alice Choi, Shannan Ho Sui, Danielle Kemmer, Jonathan Lim, Raf Podowski, Dora Pak, Albin Sandelin, Chris Walsh

Collaborating Trainees

Malin Andersson (KTH), Öjvind Johansson (UCSD), Stuart Lithwick (U.Toronto)

Support: CIHR, CGDN, CFI, Merck-Frosst, BC Children’s Hospital Foundation, Pharmacia, EC–Marie Curie, KI-Funder

Collaborators

Boris Lenhard (K.I.), Chip Lawrence (Wadsworth), William Thompson (Wadsworth), Jens Lagergren (KTH), Christer Höög (K.I.), Brenda Gallie (OCI), Jacob Odeberg (KTH), Niclas Jareborg (AZ), William Hayes (AZ), James Mortimer (MF)

Group Alumni

Elena Herzog, Annette Höglund, William Krivan, Luis Mendoza

slide-69
SLIDE 69

CMMT

“Regulog” Analysis Comparative Genomics for Promoters

slide-70
SLIDE 70

CMMT

Approach

  • Define all regulatory sequences in S. aureus.
  • Transcription factor binding sites
  • RNA structures
  • Promoters

=>Phylogenetic footprinting

slide-71
SLIDE 71

Find a conserved pattern

  • E. coli
  • B. subtilis
  • S. aureus

clpP

taccgctattgaggta
taccccgatcggggta
tacccattaaggagta
taactctaaagtggta
tacctcaatagcggta
taccccgatcggggta
tactccttaatgggta
taccactttagagtta

TACCNCN(A/T)(A/T)NGNGGTA
TACCNRWAAYGBGGTA

A [0 8 1 0 1 1 1 5 3 5 0 2 1 0 0 8]
C [0 0 7 7 4 7 0 0 0 2 0 1 0 0 0 0]
G [0 0 0 0 1 0 2 0 0 0 7 4 7 7 0 0]
T [8 0 0 1 2 0 5 3 5 1 1 1 0 1 8 0]

Pattern detection
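A sketch of how a degenerate consensus can be read off the count matrix on this slide. The rule used to produce the slide's consensus strings is not stated. The thresholds below (a single base if it covers at least 85% of a column, a two-base IUPAC code if two bases cover the whole column, otherwise N) are an assumption that happens to give TACCNCNWWNGNGGTA, the first consensus on the slide with W written for (A/T).

```python
# Count matrix from the slide (8 clpP upstream sites, 16 positions)
MATRIX = {
    "A": [0, 8, 1, 0, 1, 1, 1, 5, 3, 5, 0, 2, 1, 0, 0, 8],
    "C": [0, 0, 7, 7, 4, 7, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0],
    "G": [0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 7, 4, 7, 7, 0, 0],
    "T": [8, 0, 0, 1, 2, 0, 5, 3, 5, 1, 1, 1, 0, 1, 8, 0],
}
# IUPAC codes for two-base ambiguities
IUPAC_PAIR = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def consensus(matrix, single_min=0.85, pair_min=1.0):
    """Degenerate consensus under assumed single-base and two-base thresholds."""
    letters = []
    for i in range(len(matrix["A"])):
        counts = sorted(((matrix[b][i], b) for b in "ACGT"), reverse=True)
        total = sum(n for n, _ in counts)
        (n1, b1), (n2, b2) = counts[0], counts[1]
        if n1 / total >= single_min:
            letters.append(b1)
        elif (n1 + n2) / total >= pair_min:
            letters.append(IUPAC_PAIR[frozenset(b1 + b2)])
        else:
            letters.append("N")
    return "".join(letters)

print(consensus(MATRIX))
```

Loosening either threshold yields a more specific consensus at the cost of ignoring observed variation, which is why the matrix is usually the better model to scan with.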

slide-72
SLIDE 72

CMMT

Regulatory sequences in S. aureus

1818 sets of orthologs from S. aureus => Gibbs sampling => 1430 patterns => Compare to random sequences

[Figure: frequency (0.00 to 0.25) against MAP value (1 to 5) for real sequences]

slide-73
SLIDE 73

CMMT

Regulatory sequences in S. aureus

1818 sets of orthologs from S. aureus => Gibbs sampling => 1430 patterns => Compare to random sequences => 318 significant patterns => Remove redundancies

[Figure: frequency (0.00 to 0.25) against MAP value (1 to 5) for real and random sequences]

slide-74
SLIDE 74

CMMT

Regulatory sequences in S. aureus

1818 sets of orthologs from S. aureus => Gibbs sampling => 1430 patterns => Compare to random sequences => 318 significant patterns => Remove redundancies: cluster with MatrixAligner (Sandelin et al 2003) => 154 unique patterns in S. aureus

slide-75
SLIDE 75

CMMT

Approach

  • Define all regulatory sequences in S. aureus.
  • Transcription factor binding sites
  • RNA structures
  • Promoters
  • Define sets of genes that are under control of these regulatory sequences => regulons

– Sequence search
– Regulog filtering

slide-76
SLIDE 76

CMMT

Regulon prediction with site search

[Figure: fraction of total ORFs in regulon (0.00 to 0.10) against site score threshold (p-value, 0.00 to 0.12); annotation: 175 members in E. coli]

=> Site searches produce too many false positive hits

slide-77
SLIDE 77

CMMT

Regulon conservation filter

  • A predicted regulon member is more likely a true positive when its ortholog(s) is regulated by the same regulatory sequence.
  • Such conserved regulons are called regulogs
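The conservation filter can be sketched as a score per ortholog group: the fraction of genomes in which that group carries a predicted site. The genome labels, gene names, and the 0.5 cutoff are hypothetical; the slides do not state the threshold used.

```python
def regulog(predictions, min_fraction=0.5):
    """Keep ortholog groups whose predicted site is conserved across genomes.

    predictions maps a genome name to the set of ortholog-group ids
    that have a predicted binding site in that genome.
    """
    genomes = list(predictions)
    groups = set().union(*predictions.values())
    scores = {
        group: sum(group in predictions[genome] for genome in genomes) / len(genomes)
        for group in groups
    }
    return {group: s for group, s in scores.items() if s >= min_fraction}

# Hypothetical site-search results in three genomes
PREDICTIONS = {
    "genome1": {"geneA", "geneC", "geneD"},
    "genome2": {"geneA", "geneC"},
    "genome3": {"geneA"},
}

kept = regulog(PREDICTIONS)
for gene, conservation in sorted(kept.items()):
    print(gene, round(conservation, 2))
```

Here geneD is dropped because its site is predicted in only one of three genomes, which is how the filter removes likely false positives from a plain site search.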

slide-78
SLIDE 78

Regulogs

[Figure: Regulon Conservation Filtering (RECF). Genes geneA to geneG are shown in genomes A, B, C and D with putative binding sites marked; legend: regulon, regulog; conservation values 1, 1, 0.66, 0.33, 1.]

slide-79
SLIDE 79

CMMT

RECF test: Escherichia coli

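TF | #known | REGULON Total | REGULON Pos | REGULOG Total | REGULOG Pos | Efficiency
pdhR | 4 | 102 | 1 | 4 | 1 | 25.5
ilvY | 2 | 182 | 2 | 12 | 2 | 15.2
oxyR | 5 | 137 | 3 | 11 | 3 | 12.5
torR | 4 | 77 | 4 | 7 | 4 | 11
metR | 4 | 218 | 3 | 21 | 3 | 10.4

$\mathrm{Efficiency}_{\mathrm{RECF}} = \frac{\mathrm{Specificity}_{\mathrm{REGULOG}} \times \mathrm{Sensitivity}_{\mathrm{REGULOG}}}{\mathrm{Specificity}_{\mathrm{REGULON}} \times \mathrm{Sensitivity}_{\mathrm{REGULON}}}$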
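The efficiency values in the table can be reproduced from the counts. The slide does not define the terms, so this sketch assumes specificity = Pos / Total and sensitivity = Pos / #known; under that assumption the computed ratios round to the efficiencies shown for all five TFs.

```python
# (TF, #known, regulon total, regulon pos, regulog total, regulog pos) from the slide
ROWS = [
    ("pdhR", 4, 102, 1, 4, 1),
    ("ilvY", 2, 182, 2, 12, 2),
    ("oxyR", 5, 137, 3, 11, 3),
    ("torR", 4, 77, 4, 7, 4),
    ("metR", 4, 218, 3, 21, 3),
]

def recf_efficiency(known, regulon_total, regulon_pos, regulog_total, regulog_pos):
    """(specificity x sensitivity) of the regulog divided by that of the regulon."""
    def spec_times_sens(total, pos):
        # Assumed definitions: specificity = pos / total, sensitivity = pos / known
        return (pos / total) * (pos / known)
    return (spec_times_sens(regulog_total, regulog_pos)
            / spec_times_sens(regulon_total, regulon_pos))

efficiency = {name: recf_efficiency(*counts) for name, *counts in ROWS}
for name, value in efficiency.items():
    print(name, round(value, 1))
```

In these five rows the filter keeps every known positive, so the efficiency reduces to the ratio of regulon size to regulog size.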

slide-80
SLIDE 80

RECF test: Escherichia coli

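TF | #known | REGULON Total | REGULON Pos | REGULOG Total | REGULOG Pos | Efficiency
pdhR | 4 | 102 | 1 | 4 | 1 | 25.5
ilvY | 2 | 182 | 2 | 12 | 2 | 15.2
oxyR | 5 | 137 | 3 | 11 | 3 | 12.5
torR | 4 | 77 | 4 | 7 | 4 | 11
metR | 4 | 218 | 3 | 21 | 3 | 10.4
. . . | . . . | . . . | . . . | . . . | . . . | . . .
AVG. | 9.8 | 174 | 7.2 | 20 | 3.8 | 4.2

$\mathrm{Efficiency}_{\mathrm{RECF}} = \frac{\mathrm{Specificity}_{\mathrm{REGULOG}} \times \mathrm{Sensitivity}_{\mathrm{REGULOG}}}{\mathrm{Specificity}_{\mathrm{REGULON}} \times \mathrm{Sensitivity}_{\mathrm{REGULON}}}$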

slide-81
SLIDE 81

CMMT

RECF applied to S. aureus

RCS | Consensus | Members (leftmost members are the members with the highest confidence)
1.00 | AACACAATATATAGTG | nrdD,SA2409,nrdI,nrdE,cspC,mtlF
1.00 | TGTTAGAAAATCTAAC | glnR,nrgA,glnA
1.00 | AGGTGCTAAATCCTGC | SA0011
0.89 | GCCAGCGTAGGGAAGT | SA0928,SA0929,thiD,thiE,SA1897,gapR,thiM
0.88 | ACAGGTCATAAGGGTC | SA0929,SA1897,SA0928,thiD,polC,thiE,thiM
0.87 | AAGGGTGGAACCACGA | thrS,leuS,alaS,cysE,cysS,SA0489,SA0490,SA0491,pheS,pheT,SA1931,aspS,hisS,ileS,tyrS,trpG,valS,serS,SA0331,SA2101,SA1486,SA1289,SA1290,SA1291,SA2205,SA1392,truncated(radC),SA1578,murE,SA2102,SA1562,SA1199,trpD,trpC,trpF,trpB,trpA
0.86 | TGTGAA?T?TTTCAC? | narG,narI,SA2183,narH,pflB,SA2174,lctE,SA1455,narK,SA0293,msmX,adhE,rpsU,fbaA
0.83 | AAAAGAGTGCTAACA? | crtM,groES,hrcA,SA1747,SA1582,SA1581,SA2305,SA1748,grpE
0.83 | TTGAAAATGATTATCA | SA0307,SA0116,SA0689,SA0117,SA0690,SA0331,SA0977,SA0978,SA1329,SA2162,ahpF,SA1979,SA0688,feoB,SA2338,sirA,SA2079,katA,SA0757,ahpC,fhuA,fhuB,fhuG,SA0335,SA2101,SA0160,SA0170,hemX,sirB,hemL,hemB,hemD,hemC,dapD,hemA,SA2102,SA0588,SA0589,SA0115,dps,fer,SA0774,SA1678
0.82 | ?A?AAAAGTTATCCAC | SA0339,orfX,dnaA,dnaN,SA1419,SA1420,SA1421,SA1422,SA1423,aroE,SA1425,SA1426,SA0248

slide-82
SLIDE 82

The Fur regulog

[Figure legend: Known in other bacteria; Known in S. aureus; Unknown]

slide-83
SLIDE 83

CMMT

Regulog Conclusions

  • Using only sequence data, reliable predictions for sets of co-regulated genes can be obtained.

– Phylogenetic information is used to obtain a set of putative regulatory sequences
– Phylogenetic information is used to improve predictions of sets of co-regulated genes
– Facilitates targeting of genes for experimental studies
– Portable to other genomes