
SLIDE 1

Bioinformatics for the Identification of Sequences Regulating Gene Transcription

Wyeth W. Wasserman

University of British Columbia

www.cisreg.ca

SLIDE 2

Acknowledgements

Wasserman Group

Dave Arenillas, Jochen Brumm, Alice Chou, Debra Fulton, Shannan Ho Sui, Carol Huang, Danielle Kemmer (KI), Byron Kuo, Jonathan Lim, Raf Podowski (KI), Dora Pak, Chris Walsh, Dimas Yusuf

Collaborating Trainees

Malin Andersson (KTH), Öjvind Johansson (KTH), Stuart Lithwick (U.Toronto)

Support: CIHR, CGDN, MSFHR, CFI, Merck-Frosst, BC Children’s Hospital Foundation

Collaborators

Jenny Bryan (UBC), Brenda Gallie (OCI), Jens Lagergren (KTH), Chip Lawrence (Brown), Boris Lenhard (K.I.), James Mortimer (MF), Jacob Odeberg (KTH)

Group Alumni

Wynand Alkema, Elena Herzog, Annette Höglund, William Krivan, Luis Mendoza, Albin Sandelin

SLIDE 3

CMMT

Overview

DISCRIMINATION: TFBS Prediction with Motif Models

Phylogenetic Footprinting
Combinatorial Interactions
Current Activities

DISCOVERY: Inferring Regulatory Mechanisms for Co-Expressed (Co-Regulated) Genes

Motif Over-representation
Pattern Discovery
Current Activities

SLIDE 4

Transcription Factor Binding Sites

(over-simplified for pedagogical purposes)

[Figure labels: TATA, URE, URF, Pol-II]

SLIDE 5

Teaching a computer to find TFBS…

SLIDE 6

Representing Binding Sites for a TF

A set of sites represented as a consensus
VDRTWRWWSHD (IUPAC degenerate DNA)

A matrix describing a set of sites

A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

A single site
AAGTTAATGA

Set of binding sites
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
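The step from a set of aligned sites to a count matrix can be sketched in a few lines of Python. This is a minimal illustration, not the tool used for the slide: the function name `count_matrix` is hypothetical, and the output is not expected to match the 15-column matrix shown above, which appears to come from a different alignment than the 10 bp sites listed here.

```python
from collections import Counter

# The 28 aligned 10 bp sites listed on the slide.
SITES = """
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
""".split()

def count_matrix(sites, alphabet="ACGT"):
    """Count each base at each position of equally long, pre-aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    columns = [Counter(site[i] for site in sites) for i in range(width)]
    return {base: [column[base] for column in columns] for base in alphabet}

pfm = count_matrix(SITES)
for base, row in pfm.items():
    print(base, " ".join(f"{n:2d}" for n in row))
```

Each column of the result sums to the number of input sites, which is a quick sanity check on any position frequency matrix.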

SLIDE 7

PFMs to PWMs (PSSMs)

f matrix

A  5 0 1 0 0
C  0 2 2 4 0
G  0 3 1 0 4
T  0 0 1 1 1

w matrix

A   1.6 -1.7 -0.2 -1.7 -1.7
C  -1.7  0.5  0.5  1.3 -1.7
G  -1.7  1.0 -0.2 -1.7  1.3
T  -1.7 -1.7 -0.2 -0.2 -0.2

$$ w(b,i) = \log \frac{f(b,i) + s(N)}{p(b)} $$
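The conversion can be checked numerically. The sketch below reproduces the slide's w matrix to one decimal under one particular reading of the formula, which is an assumption inferred from the numbers rather than stated on the slide: the corrected frequency is (count + sqrt(N)/4) / (N + sqrt(N)), the background p(b) is 0.25 for every base, and the logarithm is base 2. The function name `pfm_to_pwm` is hypothetical.

```python
import math

# The f matrix from the slide (N = 5 sites per column).
PFM = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}

def pfm_to_pwm(pfm, background=None):
    """Log-odds weights with a total pseudocount of sqrt(N) per column (assumed)."""
    bases = list(pfm)
    background = background or {b: 1.0 / len(bases) for b in bases}
    width = len(pfm[bases[0]])
    pwm = {b: [] for b in bases}
    for i in range(width):
        n = sum(pfm[b][i] for b in bases)      # sites contributing to column i
        total_pseudo = math.sqrt(n)            # assumed pseudocount mass
        for b in bases:
            corrected = (pfm[b][i] + total_pseudo * background[b]) / (n + total_pseudo)
            pwm[b].append(math.log2(corrected / background[b]))
    return pwm

pwm = pfm_to_pwm(PFM)
for base, row in pwm.items():
    print(base, " ".join(f"{w:5.1f}" for w in row))
```

The pseudocount keeps unobserved bases from scoring minus infinity, which is why a count of 0 maps to -1.7 rather than an undefined value.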

SLIDE 8

Detecting binding sites in a single sequence

Scanning a sequence against a PWM

A [-0.2284  0.4368 -1.5    -1.5    -1.5     0.4368 -1.5    -1.5    -0.2284  0.4368 ]
C [-0.2284 -0.2284 -1.5    -1.5     1.5128 -1.5    -0.2284 -1.5    -0.2284 -1.5    ]
G [ 1.2348  1.2348  2.1222  2.1222  0.4368  1.2348  1.5128  1.7457  1.7457 -1.5    ]
T [ 0.4368 -0.2284 -1.5    -1.5    -0.2284  0.4368  0.4368  0.4368 -1.5     1.7457 ]

ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC

Abs_score = 13.4 (sum of column scores)

Sp1

Calculating the relative score

[Figure: the Sp1 PWM shown twice, highlighting the highest and the lowest score in each column]

Max_score = 15.2 (sum of highest column scores)
Min_score = -10.3 (sum of lowest column scores)

$$ \mathrm{Rel\_score} = \frac{\mathrm{Abs\_score} - \mathrm{Min\_score}}{\mathrm{Max\_score} - \mathrm{Min\_score}} \cdot 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \cdot 100\% = 93\% $$

Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75%

Ouch.
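A minimal sketch of the scan, assuming forward-strand scoring only and using the Sp1 matrix values as they appear in this transcript. Those values may have lost precision in extraction, so the maximum and minimum computed here need not equal the 15.2 and -10.3 quoted on the slide. The function names are hypothetical.

```python
# Sp1 PWM as transcribed from the slide (10 columns).
PWM = {
    "A": [-0.2284, 0.4368, -1.5, -1.5, -1.5, 0.4368, -1.5, -1.5, -0.2284, 0.4368],
    "C": [-0.2284, -0.2284, -1.5, -1.5, 1.5128, -1.5, -0.2284, -1.5, -0.2284, -1.5],
    "G": [1.2348, 1.2348, 2.1222, 2.1222, 0.4368, 1.2348, 1.5128, 1.7457, 1.7457, -1.5],
    "T": [0.4368, -0.2284, -1.5, -1.5, -0.2284, 0.4368, 0.4368, 0.4368, -1.5, 1.7457],
}
WIDTH = 10
MAX_SCORE = sum(max(PWM[b][i] for b in PWM) for i in range(WIDTH))
MIN_SCORE = sum(min(PWM[b][i] for b in PWM) for i in range(WIDTH))

def abs_score(site):
    """Sum of the column scores for one candidate site."""
    return sum(PWM[base][i] for i, base in enumerate(site))

def rel_score(site):
    """Absolute score rescaled to the 0..1 range spanned by the matrix."""
    return (abs_score(site) - MIN_SCORE) / (MAX_SCORE - MIN_SCORE)

def scan(sequence, threshold=0.75):
    """Return (start, site, rel_score) for every window at or above threshold."""
    hits = []
    for start in range(len(sequence) - WIDTH + 1):
        site = sequence[start:start + WIDTH]
        score = rel_score(site)
        if score >= threshold:
            hits.append((start, site, round(score, 3)))
    return hits

SEQ = "ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC"
for hit in scan(SEQ):
    print(hit)
```

Lowering the threshold quickly multiplies the number of reported windows, which is the specificity problem the slide illustrates.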

SLIDE 9

Performance of Profiles

95% of predicted sites bound in vitro (Tronche 1997)
MyoD binding sites predicted about once every 600 bp (Fickett 1995)

The Futility Theorem

– Nearly 100% of predicted transcription factor binding sites have no function in vivo

SLIDE 10

JASPAR AN OPEN-ACCESS DATABASE OF TF BINDING PROFILES

SLIDE 11

DISCRIMINATION Overcoming the Specificity Problems

SLIDE 12

Phylogenetic Footprinting Dramatically Reduces Spurious Hits

[Figure: Human vs. Mouse, Actin, alpha cardiac]

SLIDE 13

Performance: Human vs. Mouse

Testing set: 40 experimentally defined sites in 15 well-studied genes (replicated with 100+ site set)

SENSITIVITY: 75-90% of defined sites detected with conservation filter
SELECTIVITY: only 11-16% of total predictions retained
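One simple way to apply a conservation filter is to keep only predicted sites that fall in well-conserved windows of a pairwise alignment. The sketch below is an illustration under stated assumptions, not the ConSite implementation: the window size, the identity cutoff, and the function names are all hypothetical, and hits are given as (alignment column, label) pairs.

```python
def window_identity(aln_a, aln_b, center, window=50):
    """Fraction of identical, non-gap columns in a window centered on `center`."""
    half = window // 2
    lo, hi = max(0, center - half), min(len(aln_a), center + half)
    matches = sum(
        1 for x, y in zip(aln_a[lo:hi], aln_b[lo:hi]) if x == y and x != "-"
    )
    return matches / (hi - lo)

def conserved_hits(hits, aln_a, aln_b, min_identity=0.7, window=50):
    """Keep hits whose surrounding alignment window is sufficiently conserved."""
    return [
        hit for hit in hits
        if window_identity(aln_a, aln_b, hit[0], window) >= min_identity
    ]

# Toy alignment: the first 20 columns are identical, the last 20 differ.
human = "A" * 20 + "C" * 20
mouse = "A" * 20 + "G" * 20
predicted = [(5, "siteA"), (35, "siteB")]
print(conserved_hits(predicted, human, mouse, window=10))
```

Only the hit in the conserved half survives, which mirrors the trade-off on the slide: most predictions are discarded while sites in conserved regions are retained.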

SLIDE 14

ConSite (www.cisreg.ca)

Now Featuring: Ortholog Sequence Retrieval Service

SLIDE 15

Current Activity: Analysis of Genetic Variation in TFBS

ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT

SLIDE 16

Sequence Variation in TFBS

[Figure labels: TSS, AaGT, URF]

GENE        DISEASE/CONDITION (associated)    REFERENCE
UGT1A1      Gilbert’s Syndrome – jaundice     PJ Bosma, et al., 1995
UCP3        Elevated Body Mass                S Otabe et al., 2000
TNFalpha    Malaria Susceptibility            JC Knight et al., 1999
Resistin    Elevated Body Mass                JC Engert et al., 2002
IL4Ralpha   Reduced soluble IL4R              H Hackstein et al., 2001
ABCA1       Coronary artery disease           KY Zwarts et al., 2002
Ob          Leptin levels                     J Hager et al., 1998
PEPCK       Obesity                           Y. Olswang et al., 2002
PR          Endometrial cancer                I DeVivo et al., 2002
LDLR        Familial hypercholesterolemia     Koivisto et al., 1994

SLIDE 17

Identifying allele-specific binding site predictions

1234567890123456789012345
ACGCATAAGTTAAtGAATAACAGAT
.............c...........

[Plot: Swt-Smt across positions 1 to 11]
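The Swt-Smt comparison can be sketched by scoring every window that overlaps the variant position with both alleles. This is a rough illustration under assumptions that are not from the slide: the PWM is built from the 28 example sites shown earlier in the deck with a pseudocount of 1 per base and a uniform background, only the forward strand is scored, and all function names are hypothetical.

```python
import math
from collections import Counter

SITES = """
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
""".split()
WIDTH = len(SITES[0])

def build_pwm(sites, pseudo=1.0):
    """Per-column log2 odds against a uniform background (assumed model)."""
    total = len(sites) + 4 * pseudo
    pwm = []
    for i in range(WIDTH):
        counts = Counter(site[i] for site in sites)
        pwm.append({b: math.log2(((counts[b] + pseudo) / total) / 0.25) for b in "ACGT"})
    return pwm

def score(pwm, site):
    return sum(column[base] for column, base in zip(pwm, site.upper()))

def allele_differences(pwm, wild, snp_index, alt_base):
    """Return (window start, S_wt - S_mt) for each window covering the variant."""
    mutant = wild[:snp_index] + alt_base + wild[snp_index + 1:]
    first = max(0, snp_index - WIDTH + 1)
    last = min(snp_index, len(wild) - WIDTH)
    return [
        (start,
         score(pwm, wild[start:start + WIDTH]) - score(pwm, mutant[start:start + WIDTH]))
        for start in range(first, last + 1)
    ]

PWM = build_pwm(SITES)
WILD = "ACGCATAAGTTAATGAATAACAGAT"       # the t at position 14 is index 13
diffs = allele_differences(PWM, WILD, snp_index=13, alt_base="C")
for start, delta in diffs:
    print(start + 1, round(delta, 2))    # 1-based window start, S_wt - S_mt
```

A large positive difference at the window containing AAGTTAATGA indicates that the c allele weakens the predicted site.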

SLIDE 18

RAVEN screenshots

SLIDE 19

Recent and Active Projects

JUMBO-JASPAR

– Building a second generation open-access database

NHR-scan

– Identification of binding sites for nuclear hormone receptors

SLIDE 20

Discrimination of Regulatory Modules TFs do NOT act in isolation

SLIDE 21

Layers of Complexity in Metazoan Transcription

SLIDE 22

Detecting Clusters of TF Binding Sites

Trained Methods

– Sufficient examples of real clusters to establish weights on the relative importance of each TF

Statistical Over-Representation of Combinations

– Binding profiles available for a set of biologically motivated TFs

SLIDE 23

Training for the detection of liver cis-regulatory modules (CRMs)

SLIDE 24

Building a predictive model

(Brief, as this is well described in the literature)

HNF1, C/EBP, HNF3, HNF4

At 60% sensitivity, predictions made ~1/30,000 bp

SLIDE 25

UGT1A1

[Plot: Liver Module Model Score (0.2 to 1) vs. “Window” Position in Sequence (100 to 5840); series: Wildtype, Other]

SLIDE 26

MSCAN: An untrained method for CRM detection

(w/ J. Lagergren, Royal Technical University of Sweden)

MSCAN takes as input a user-defined set of TF profiles
Calculates significance for each observed “site” based on local sequence characteristics
Calculates cluster significance using a dynamic programming approach
Approximately 1 significant liver cluster / 18 000 bp in human genome sequence
Filters out statistically significant clusters of sites that contain local repeats
Identification of non-random characteristics in DNA

http://mscan.cgb.ki.se

SLIDE 27

Current Activities on Combinatorial Binding Prediction

Social network analysis to identify a reliable set of genes regulated by a given set of TFs

SLIDE 28

Making better predictions

Profiles make far too many false predictions to have predictive value in isolation
Phylogenetic footprinting eliminates ~90% of false predictions
Algorithms for detection of clusters of binding sites perform better, especially when it is possible to train on known examples for the target context

SLIDE 29

Linking co-expressed genes from microarrays to candidate transcription factors

SLIDE 30

DISCOVERY Inferring regulatory mechanisms for subsets of co-expressed genes

SLIDE 31

Deciphering Regulation of Co-Expressed Genes

SLIDE 32

oPOSSUM Procedure

Set of co-expressed genes
Automated sequence retrieval from EnsEMBL
Phylogenetic Footprinting
Detection of transcription factor binding sites
Statistical significance of binding sites
Putative mediating transcription factors

ORCA

SLIDE 33

Statistical Methods for Identifying Over-represented TFBS

Z scores

– Based on the number of occurrences of the TFBS relative to background
– Normalized for sequence length
– Simple binomial distribution model

Fisher exact probability scores

– Based on the number of genes containing the TFBS relative to background
– Hypergeometric probability distribution
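Both statistics can be sketched with the standard library (Python 3.8 or later for `math.comb`). This is a simplified illustration of the two ideas named above, not the exact oPOSSUM calculation: the argument names are hypothetical, the background set is assumed to be disjoint from the co-expressed set, and no continuity correction is applied to the Z score.

```python
import math

def z_score(fg_hits, fg_length, bg_hits, bg_length):
    """Binomial Z score: observed site count vs. the count expected from the
    background rate, with sequence length as the number of trials."""
    p = bg_hits / bg_length                    # background rate per position
    expected = fg_length * p
    sd = math.sqrt(fg_length * p * (1.0 - p))
    return (fg_hits - expected) / sd

def fisher_one_tailed(fg_with, fg_total, bg_with, bg_total):
    """Hypergeometric probability of seeing at least `fg_with` genes with the
    TFBS among `fg_total` co-expressed genes."""
    population = fg_total + bg_total
    with_site = fg_with + bg_with
    denominator = math.comb(population, fg_total)
    upper = min(fg_total, with_site)
    return sum(
        math.comb(with_site, k) * math.comb(population - with_site, fg_total - k)
        for k in range(fg_with, upper + 1)
    ) / denominator

print(round(z_score(10, 100, 50, 1000), 3))
print(fisher_one_tailed(3, 3, 0, 3))
```

The two measures answer different questions: the Z score rewards many sites overall, even if concentrated in a few genes, while the Fisher score rewards sites spread across many genes.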

SLIDE 34

The oPOSSUM Database

Orthologous genes: 8468
Promoter pairs: 6911
Promoters with TFBS: 6758
Total # of TFBS predictions: 1638293
Overall failure rate: 20.2%

SLIDE 35

Validation using Reference Gene Sets

TFs with experimentally-verified sites in the reference sets.

A. Muscle-specific (23 input; 16 analyzed)

TF           Rank  Z-score  Fisher
SRF          1     21.41    1.18e-02
MEF2         2     18.12    8.05e-04
c-MYB_1      3     14.41    1.25e-03
Myf          4     13.54    3.83e-03
TEF-1        5     11.22    2.87e-03
deltaEF1     6     10.88    1.09e-02
S8           7     5.874    2.93e-01
Irf-1        8     5.245    2.63e-01
Thing1-E47   9     4.485    4.97e-02
HNF-1        10    3.353    2.93e-01

B. Liver-specific (20 input; 12 analyzed)

TF           Rank  Z-score  Fisher
HNF-1        1     38.21    8.83e-08
HLF          2     11.00    9.50e-03
Sox-5        3     9.822    1.22e-01
FREAC-4      4     7.101    1.60e-01
HNF-3beta    5     4.494    4.66e-02
SOX17        6     4.229    4.20e-01
Yin-Yang     7     4.070    1.16e-01
S8           8     3.821    1.61e-02
Irf-1        9     3.477    1.69e-01
COUP-TF      10    3.286    2.97e-01

SLIDE 36

Application to Microarray Data Sets

1. NF-κB inhibition microarray study

SLIDE 37

Genes Significantly Down-regulated by the NF-κB inhibitor (326 input; 179 analyzed)

TF          Class         Rank  Z-score  Fisher    No. Genes
p65         REL           1     36.57    5.66e-12  62
NF-kappaB   REL           2     32.58    5.82e-11  61
c-REL       REL           3     26.02    8.59e-08  63
Irf-2       TRP-CLUSTER   4     20.39    5.74e-04  6
SPI-B       ETS           5     16.59    1.23e-03  135
Irf-1       TRP-CLUSTER   6     15.4     9.55e-04  23
Sox-5       HMG           7     15.38    2.56e-02  126
p50         REL           8     14.72    2.23e-03  19
Nkx         HOMEO         9     13.66    2.29e-03  111
Bsap        PAIRED        10    13.2     9.92e-02  1
FREAC-4     FORKHEAD      11    12.05    1.66e-03  92
n-MYC       bHLH-ZIP      25    6.695    1.84e-03  102
ARNT        bHLH          26    6.695    1.84e-03  102
HNF-3beta   FORKHEAD      29    5.948    3.32e-03  47
SOX17       HMG           31    5.406    8.60e-03  79

SLIDE 38

C-Myc SAGE Data

c-Myc transcription factor dimerizes with the Max protein
Key regulator of cell proliferation, differentiation and apoptosis
Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC cells
They then went on to confirm the induction of 53 genes using microarray analysis and RT-PCR

SLIDE 39

Induced Genes after Ectopic Expression of c-Myc (SAGE) (53 input; 36 analyzed)

TF         Class             Rank  Z-score  Fisher    No. Genes
Myc-Max    bHLH-ZIP          1     21.68    5.35e-03  7
Staf       ZN-FINGER, C2H2   2     20.17    1.70e-02  2
Max        bHLH-ZIP          3     18.32    2.16e-02  12
SAP-1      ETS               4     13.23    1.61e-04  13
USF        bHLH-ZIP          5     11.90    1.84e-01  16
SP1        ZN-FINGER, C2H2   6     11.68    4.40e-02  12
n-MYC      bHLH-ZIP          7     11.11    1.55e-01  20
ARNT       bHLH              8     11.11    1.55e-01  20
Elk-1      ETS               9     10.92    3.88e-03  19
Ahr-ARNT   bHLH              10    10.17    1.11e-01  25

SLIDE 40

C-Fos Microarray Experiment

In a study examining the role of transcriptional repression in oncogenesis, Ordway et al. compared the gene expression profiles of fibroblasts transformed by c-fos to the parental 208F rat fibroblast cell line
We mapped the list of 252 induced Affymetrix Rat Genome U34A GeneChip sequences to 136 human orthologs

SLIDE 41

Induced Genes after Ectopic Expression of c-Fos (Affymetrix) (136 input; 86 analyzed)

TF                Class              Rank  Z-score  Fisher    No. Genes
c-FOS             bZIP               1     17.53    2.60e-05  45
RREB-1            ZN-FINGER, C2H2    2     8.899    1.41e-01  1
PPARgamma-RXRal   NUCLEAR RECEPTOR   3     3.991    2.98e-01  1
CREB              bZIP               4     3.626    1.25e-01  10
E2F               Unknown            5     2.965    7.67e-02  15
NF-kappaB         REL                6     2.915    1.04e-01  17
SRF               MADS               7     2.707    2.24e-01  2
MEF2              MADS               8     2.634    1.32e-01  13
c-REL             REL                9     2.467    5.79e-02  22
Staf              ZN-FINGER, C2H2    10    2.385    3.74e-01  1
Ahr-ARNT          bHLH               15    1.716    2.57e-03  63
deltaEF1          ZN-FINGER, C2H2    23    0.271    5.39e-03  75
Elk-1             ETS                21    0.7875   8.12e-03  37
MZF_1-4           ZN-FINGER, C2H2    27    0.2421   5.41e-03  73
n-MYC             bHLH-ZIP           30    0.8738   8.20e-03  51
ARNT              bHLH               31    0.8738   8.20e-03  51

SLIDE 42

oPOSSUM Server

SLIDE 43

http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum

INPUT A LIST OF CO-EXPRESSED GENES

SLIDE 44

SELECT YOUR TFBS PROFILES

SLIDE 45

SELECT:

1. CONSERVATION
2. PSSM MATCH THRESHOLD
3. PROMOTER REGION
4. STATISTICAL MEASURE

SLIDE 46

SLIDE 47

de novo Discovery of TF Binding Sites

SLIDE 48

Pattern Discovery

SLIDE 49

de novo Pattern Discovery

Exhaustive

– e.g. YMF (Sinha & Tompa)
– Generalization: Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections

Monte Carlo/Gibbs Sampling

– e.g. AnnSpec (Workman & Stormo)
– Generalization: Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics

SLIDE 50

The Gibbs Sampling algorithm

Probabilistic Methods for Pattern Discovery (3)

Two data structures used:
1) Current pattern nucleotide frequencies q_{i,1}, ..., q_{i,4} and corresponding background frequencies p_{i,1}, ..., p_{i,4}
2) Current positions of site startpoints in the N sequences a_1, ..., a_N, i.e. the alignment that contributes to q_{i,j}.

One starting point in each sequence is chosen randomly initially.

tgacttcc
tgatctct
agacctca
tgacctct

SLIDE 51

Iteration step

Probabilistic Methods for Pattern Discovery (4)

A. Remove one sequence z from the set. Update the current pattern according to

$$ q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B} $$

where c_{i,j} is the count of symbol j in column i of the remaining alignment, b_j is the pseudocount for symbol j, and B is the sum of all pseudocounts in the column.

B. ’Score’ the current pattern against each possible occurrence a_k in z. Draw a new a_k with probabilities based on the respective score divided by the background model.

tgacttcc
tgatctct
agacctca
tgacctct
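The two steps can be sketched as a small site sampler. This is a bare-bones illustration of the update and sampling steps described above, not AnnSpec or any published implementation: the pseudocount of 0.5 per base, the uniform background, the fixed iteration count, and the flanking bases placed around the four example sites are all assumptions made for the example.

```python
import random

BASES = "acgt"

def column_freqs(seqs, starts, width, skip, pseudo=0.5):
    """q[i][j] = (c_ij + b_j) / (N - 1 + B), leaving out sequence `skip`."""
    n_used = len(seqs) - 1                      # N - 1
    total_pseudo = pseudo * len(BASES)          # B
    q = []
    for i in range(width):
        counts = {b: 0 for b in BASES}
        for s, (seq, a) in enumerate(zip(seqs, starts)):
            if s != skip:
                counts[seq[a + i]] += 1
        q.append({b: (counts[b] + pseudo) / (n_used + total_pseudo) for b in BASES})
    return q

def gibbs_sample(seqs, width, iterations=200, background=None, rng=None):
    """Return one sampled start position per sequence."""
    rng = rng or random.Random(0)
    background = background or {b: 0.25 for b in BASES}
    # Initial alignment: one random start point in each sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for z, seq in enumerate(seqs):
            # Step A: rebuild the pattern without sequence z.
            q = column_freqs(seqs, starts, width, skip=z)
            # Step B: weight every possible start in z by pattern / background.
            weights = []
            for a in range(len(seq) - width + 1):
                w = 1.0
                for i in range(width):
                    base = seq[a + i]
                    w *= q[i][base] / background[base]
                weights.append(w)
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# The four example sites from the slide, embedded in made-up flanks.
SEQS = ["ccgatgacttccaatg", "atgatctcttgcaagc", "gcaagacctcatcgga", "ttcgtgacctctagca"]
starts = gibbs_sample(SEQS, width=8)
print([seq[a:a + 8] for seq, a in zip(SEQS, starts)])
```

Because each new start is drawn at random rather than chosen greedily, the sampler can escape weak local alignments, at the cost of giving different answers for different random seeds.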

SLIDE 52

Applied Pattern Discovery is Acutely Sensitive to Noise

[Plot: PATTERN SIMILARITY vs. TRUE MEF2 PROFILE (10 to 18) against SEQUENCE LENGTH (100 to 600); True Mef2 Binding Sites]

SLIDE 53

Four Approaches to Improve Sensitivity

Better background models
Higher-order properties of DNA
Phylogenetic Footprinting

– Human:Mouse comparison eliminates ~75% of sequence

Regulatory Modules

– Architectural rules

Limit the types of binding profiles allowed

– TFBS patterns are NOT random

SLIDE 54

Pattern discovery methods using biochemical constraints

SLIDE 55

Enhancing pattern detection sensitivity

TFBSs are not randomly drawn

Information segmentation

– Information content distributions of TFBS are distinctly non-random (Wasserman et al 2000)

Palindromicity, dyads (van Helden et al 2000)
Variable gaps (Hu 2003)

SLIDE 56

Our Hypothesis

Point 1: Structurally-related DNA binding domains interact with similar target sequences

– Exceptions exist (e.g. Zn-fingers)

Point 2: There are a finite number of binding domains used in human TFs

– Approximately 20-25

Idea: We could use the shared binding properties for each family to focus pattern detection methods

– Constrain the range of patterns sought

SLIDE 57

Comparison of profiles requires alignment and a scoring function

Scoring function based on sum of squared differences
Align frequency matrices with modified Needleman-Wunsch algorithm
Calculate empirical p-values based on simulated set of matrices

[Plot: Frequency vs. Score]

SLIDE 58

Intra-family comparisons more similar than inter-family

[Figure: TF Database (JASPAR), COMPARE, Match to bHLH]

Jackknife Test: 87% correct
Independent Test Set: 93% correct

SLIDE 59

SLIDE 60

FBPs enhance sensitivity of pattern detection

SLIDE 61

SLIDE 62

“Regulog” Analysis Comparative Genomics for Promoters

SLIDE 63

Approach

Define all regulatory sequences in S. aureus.
Transcription factor binding sites
RNA structures
Promoters

=>Phylogenetic footprinting

SLIDE 64

Find a conserved pattern

E. coli
B. subtilis
S. aureus

clpP

TACCNCN(A/T)(A/T)NGNGGTA
TACCNRWAAYGBGGTA

taccgctattgaggta
taccccgatcggggta
tacccattaaggagta
taactctaaagtggta
tacctcaatagcggta
taccccgatcggggta
tactccttaatgggta
taccactttagagtta

A [0 8 1 0 1 1 1 5 3 5 0 2 1 0 0 8]
C [0 0 7 7 4 7 0 0 0 2 0 1 0 0 0 0]
G [0 0 0 0 1 0 2 0 0 0 7 4 7 7 0 0]
T [8 0 0 1 2 0 5 3 5 1 1 1 0 1 8 0]

Pattern detection

SLIDE 65

Regulatory sequences in S. aureus

1818 sets of orthologs from S. aureus
=> Gibbs sampling
1430 patterns
=> Compare to random sequences
318 significant patterns
=> Cluster with MatrixAligner (Sandelin et al 2003), remove redundancies
154 unique patterns in S. aureus

SLIDE 66

Approach

Define all regulatory sequences in S. aureus.
Transcription factor binding sites
RNA structures
Promoters
Define sets of genes that are under control of these regulatory sequences => regulons

– Sequence search
– Regulog filtering

SLIDE 67

Regulon prediction with site search

[Plot: Fraction of total ORFs in regulon (0.00 to 0.10) vs. Site score threshold (p-value) (0.00 to 0.12)]

175 members in E. coli => Site searches produce too many false positive hits

SLIDE 68

Regulon conservation filter

A predicted regulon member is more likely a true positive when its ortholog(s) is regulated by the same regulatory sequence.

Such conserved regulons are called regulogs

SLIDE 69

Regulogs

Regulon Conservation Filtering (RECF)

[Figure: geneA to geneG in genomes A to D with putative binding sites marked; per-gene values 1, 1, 0.66, 0.33, 1; legend distinguishes putative binding site, regulon and regulog]
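The filtering idea can be sketched as a per-gene conservation fraction: the share of genomes in which the gene (or its ortholog) has a putative binding site for the same pattern. This is a simplified illustration, not the published RECF procedure. It assumes orthologs share one identifier across genomes, and the names `conservation_fraction` and `regulog`, the genome labels, and the 0.66 cutoff are hypothetical.

```python
def conservation_fraction(gene, regulons):
    """Fraction of genomes whose predicted regulon contains `gene`."""
    present = sum(1 for members in regulons.values() if gene in members)
    return present / len(regulons)

def regulog(regulons, min_fraction=0.66):
    """Genes whose regulon membership is conserved in enough genomes."""
    all_genes = set().union(*regulons.values())
    return {
        gene for gene in all_genes
        if conservation_fraction(gene, regulons) >= min_fraction
    }

# Toy predicted regulons for one pattern in three genomes.
REGULONS = {
    "genome1": {"geneA", "geneB", "geneC"},
    "genome2": {"geneA", "geneC"},
    "genome3": {"geneA", "geneD"},
}
for gene in sorted(set().union(*REGULONS.values())):
    print(gene, round(conservation_fraction(gene, REGULONS), 2))
print(sorted(regulog(REGULONS)))
```

Genes predicted in only one genome drop out, which is how the filter trades a little sensitivity for a large reduction in false positives.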

SLIDE 70

RECF test: Escherichia coli

                   REGULON         REGULOG
TF     # known   Total   Pos     Total   Pos     Efficiency
pdhR   4         102     1       4       1       25.5
ilvY   2         182     2       12      2       15.2
xyR    5         137     3       11      3       12.5
torR   4         77      4       7       4       11
metR   4         218     3       21      3       10.4

$$ \mathrm{Efficiency}_{RECF} = \frac{\mathrm{Specificity}_{REGULOG} \times \mathrm{Sensitivity}_{REGULOG}}{\mathrm{Specificity}_{REGULON} \times \mathrm{Sensitivity}_{REGULON}} $$

SLIDE 71

RECF test: Escherichia coli

                   REGULON         REGULOG
TF     # known   Total   Pos     Total   Pos     Efficiency
pdhR   4         102     1       4       1       25.5
ilvY   2         182     2       12      2       15.2
xyR    5         137     3       11      3       12.5
torR   4         77      4       7       4       11
metR   4         218     3       21      3       10.4
...
AVG.   9.8       174     7.2     20      3.8     4.2

$$ \mathrm{Efficiency}_{RECF} = \frac{\mathrm{Specificity}_{REGULOG} \times \mathrm{Sensitivity}_{REGULOG}}{\mathrm{Specificity}_{REGULON} \times \mathrm{Sensitivity}_{REGULON}} $$
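The efficiency values for the five named TFs can be reproduced from the counts in the table. The sketch below assumes specificity means positives divided by total predictions and sensitivity means positives divided by the number of known members. Those definitions are not spelled out on the slide, but they do recover the listed efficiencies to one decimal.

```python
# TF: (# known, regulon total, regulon pos, regulog total, regulog pos)
ROWS = {
    "pdhR": (4, 102, 1, 4, 1),
    "ilvY": (2, 182, 2, 12, 2),
    "xyR": (5, 137, 3, 11, 3),
    "torR": (4, 77, 4, 7, 4),
    "metR": (4, 218, 3, 21, 3),
}

def efficiency(known, regulon_total, regulon_pos, regulog_total, regulog_pos):
    """(specificity x sensitivity) of the regulog over that of the regulon."""
    def spec_times_sens(total, pos):
        specificity = pos / total      # assumed: positives / predictions
        sensitivity = pos / known      # assumed: positives / known members
        return specificity * sensitivity
    return (spec_times_sens(regulog_total, regulog_pos)
            / spec_times_sens(regulon_total, regulon_pos))

for tf, row in ROWS.items():
    print(tf, round(efficiency(*row), 1))
```

When the regulog keeps every true positive of the regulon, as in these five rows, the ratio reduces to regulon size divided by regulog size.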

SLIDE 72

RECF applied to S. aureus

RCS   Consensus         Members (leftmost members are the members with the highest confidence)

1.00  AACACAATATATAGTG  nrdD,SA2409,nrdI,nrdE,cspC,mtlF
1.00  TGTTAGAAAATCTAAC  glnR,nrgA,glnA
1.00  AGGTGCTAAATCCTGC  SA0011
0.89  GCCAGCGTAGGGAAGT  SA0928,SA0929,thiD,thiE,SA1897,gapR,thiM
0.88  ACAGGTCATAAGGGTC  SA0929,SA1897,SA0928,thiD,polC,thiE,thiM
0.87  AAGGGTGGAACCACGA  thrS,leuS,alaS,cysE,cysS,SA0489,SA0490,SA0491,pheS,pheT,SA1931,aspS,hisS,ileS,tyrS,trpG,valS,serS,SA0331,SA2101,SA1486,SA1289,SA1290,SA1291,SA2205,SA1392,truncated(radC),SA1578,murE,SA2102,SA1562,SA1199,trpD,trpC,trpF,trpB,trpA
0.86  TGTGAA?T?TTTCAC?  narG,narI,SA2183,narH,pflB,SA2174,lctE,SA1455,narK,SA0293,msmX,adhE,rpsU,fbaA
0.83  AAAAGAGTGCTAACA?  crtM,groES,hrcA,SA1747,SA1582,SA1581,SA2305,SA1748,grpE
0.83  TTGAAAATGATTATCA  SA0307,SA0116,SA0689,SA0117,SA0690,SA0331,SA0977,SA0978,SA1329,SA2162,ahpF,SA1979,SA0688,feoB,SA2338,sirA,SA2079,katA,SA0757,ahpC,fhuA,fhuB,fhuG,SA0335,SA2101,SA0160,SA0170,hemX,sirB,hemL,hemB,hemD,hemC,dapD,hemA,SA2102,SA0588,SA0589,SA0115,dps,fer,SA0774,SA1678
0.82  ?A?AAAAGTTATCCAC  SA0339,orfX,dnaA,dnaN,SA1419,SA1420,SA1421,SA1422,SA1423,aroE,SA1425,SA1426,SA0248

SLIDE 73

The Fur regulog

Known in other bacteria
Known in S. aureus
Unknown

SLIDE 74

Conclusions

Pattern analysis methods have utility
Combine knowledge from multiple fields
Statistics and AI methods must be imported

– Gibbs sampling, LRA, neural networks, SVMs, etc

Evolution drives understanding in biology

– Phylogenetic Footprinting

Biochemistry inspires Bioinformatics

– Regulatory Modules
– Familial Binding Profiles

Analysis of regulatory sequences is improving
Given sets of orthologous genes, one can predict regulatory regions
Given sets of co-regulated genes, it is possible to infer the binding profiles for critical transcription factors

SLIDE 75
