Bioinformatics for the Identification of Sequences Regulating Gene Transcription



slide-1
SLIDE 1

Bioinformatics for the Identification of Sequences Regulating Gene Transcription

Wyeth W. Wasserman

University of British Columbia

www.cisreg.ca

slide-2
SLIDE 2

Acknowledgements

Wasserman Group

Dave Arenillas Jochen Brumm Alice Chou Debra Fulton Shannan Ho Sui Carol Huang Danielle Kemmer (KI) Byron Kuo Jonathan Lim Raf Podowski (KI) Dora Pak Chris Walsh Dimas Yusuf

Collaborating Trainees

Malin Andersson (KTH) Öjvind Johansson (KTH) Stuart Lithwick (U.Toronto)

Support: CIHR, CGDN, MSFHR, CFI, Merck-Frosst, BC Children’s Hospital Foundation

Collaborators Jenny Bryan (UBC) Brenda Gallie (OCI) Jens Lagergren (KTH) Chip Lawrence (Brown) Boris Lenhard (K.I.) James Mortimer (MF) Jacob Odeberg (KTH) Group Alumni Wynand Alkema Elena Herzog Annette Höglund William Krivan Luis Mendoza Albin Sandelin

slide-3
SLIDE 3

CMMT

Overview

  • DISCRIMINATION: TFBS Prediction with Motif Models
    – Phylogenetic Footprinting
    – Combinatorial Interactions
    – Current Activities
  • DISCOVERY: Inferring Regulatory Mechanisms for Co-Expressed (Co-Regulated) Genes
    – Motif Over-representation
    – Pattern Discovery
    – Current Activities
slide-4
SLIDE 4

Transcription Factor Binding Sites

(over-simplified for pedagogical purposes)

TATA URE

URF Pol-II

slide-5
SLIDE 5

Teaching a computer to find TFBS…

slide-6
SLIDE 6

Representing Binding Sites for a TF

  • A set of sites represented as a consensus
    VDRTWRWWSHD (IUPAC degenerate DNA)
  • A matrix describing a set of sites

    A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
    C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
    G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
    T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

  • A single site
    AAGTTAATGA

Set of binding sites

    AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
    CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
    AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
    AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
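As a minimal sketch of the consensus representation, the following collapses each column of an aligned site set into the IUPAC degenerate code for the bases observed there. The function name `iupac_consensus` is illustrative, and the rule used (every observed base counts, with no frequency threshold) is an assumption; the slide does not say how its consensus was derived.

```python
# IUPAC degenerate DNA codes, keyed by the set of bases each one stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def iupac_consensus(sites):
    """Return one IUPAC letter per column of equal-length, aligned sites."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("sites must be non-empty and all the same length")
    # zip(*sites) walks the alignment column by column.
    return "".join(IUPAC[frozenset(col)] for col in zip(*sites))


# First four sites from the slide's set, used only as a small illustration.
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA"]
consensus = iupac_consensus(sites)
```

A consensus built this way loses the column frequencies, which is why the matrix representation is the one carried forward in the rest of the talk.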

slide-7
SLIDE 7

PFMs to PWMs (PSSMs)

f matrix

    A   5   0   1   0   0
    C   0   2   2   4   0
    G   0   3   1   0   4
    T   0   0   1   1   1

w matrix

    A   1.6  -1.7  -0.2  -1.7  -1.7
    C  -1.7   0.5   0.5   1.3  -1.7
    G  -1.7   1.0  -0.2  -1.7   1.3
    T  -1.7  -1.7  -0.2  -0.2  -0.2

$$ w(b,i) = \log\left(\frac{f(b,i) + s(N)}{p(b)}\right) $$
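The conversion from the f matrix to the w matrix can be sketched as below. The formula on the slide does not survive extraction in full, so the details here are assumptions: a uniform background p(b) = 0.25, pseudocounts that sum to sqrt(N) per column, normalization of the corrected counts to probabilities, and a base-2 logarithm. They were chosen because, with N = 5, they reproduce the w matrix values shown to one decimal place. The name `pfm_to_pwm` is illustrative.

```python
import math


def pfm_to_pwm(pfm, background=None):
    """Convert base -> list-of-counts into base -> list-of-log2-odds weights."""
    bases = sorted(pfm)
    background = background or {b: 1.0 / len(bases) for b in bases}
    width = len(pfm[bases[0]])
    pwm = {b: [] for b in bases}
    for i in range(width):
        n = sum(pfm[b][i] for b in bases)       # number of sites in the column, N
        total_pseudo = math.sqrt(n)             # assumed: pseudocounts sum to sqrt(N)
        for b in bases:
            s = total_pseudo * background[b]    # assumed per-base pseudocount
            prob = (pfm[b][i] + s) / (n + total_pseudo)
            pwm[b].append(math.log2(prob / background[b]))
    return pwm


# The f matrix from the slide (five sites, five columns).
pfm = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
pwm = pfm_to_pwm(pfm)
rounded = {b: [round(w, 1) for w in pwm[b]] for b in pwm}
```

The pseudocount keeps unobserved bases from scoring minus infinity, which is what makes the -1.7 entries finite.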

slide-8
SLIDE 8

Detecting binding sites in a single sequence

Scanning a sequence against a PWM

Sp1

    A [-0.2284  0.4368 -1.5    -1.5    -1.5     0.4368 -1.5    -1.5    -0.2284  0.4368]
    C [-0.2284 -0.2284 -1.5    -1.5     1.5128 -1.5    -0.2284 -1.5    -0.2284 -1.5   ]
    G [ 1.2348  1.2348  2.1222  2.1222  0.4368  1.2348  1.5128  1.7457  1.7457 -1.5   ]
    T [ 0.4368 -0.2284 -1.5    -1.5    -0.2284  0.4368  0.4368  0.4368 -1.5     1.7457]

ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC

Abs_score = 13.4 (sum of column scores)

Calculating the relative score

Max_score = 15.2 (sum of highest column scores)
Min_score = -10.3 (sum of lowest column scores)

$$ \mathrm{Rel\_score} = \frac{\mathrm{Abs\_score} - \mathrm{Min\_score}}{\mathrm{Max\_score} - \mathrm{Min\_score}} \cdot 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \cdot 100\% = 93\% $$

Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75%

Ouch.
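A minimal sketch of the scan, assuming a sliding window over one strand only and the relative-score definition given on the slide. The names `scan` and `relative_score` are illustrative. Note that the maximum and minimum here are computed from the matrix as transcribed and need not equal the 15.2 and -10.3 quoted on the slide, so the slide's worked example is reproduced separately from its own numbers.

```python
SP1_PWM = {
    "A": [-0.2284, 0.4368, -1.5, -1.5, -1.5, 0.4368, -1.5, -1.5, -0.2284, 0.4368],
    "C": [-0.2284, -0.2284, -1.5, -1.5, 1.5128, -1.5, -0.2284, -1.5, -0.2284, -1.5],
    "G": [1.2348, 1.2348, 2.1222, 2.1222, 0.4368, 1.2348, 1.5128, 1.7457, 1.7457, -1.5],
    "T": [0.4368, -0.2284, -1.5, -1.5, -0.2284, 0.4368, 0.4368, 0.4368, -1.5, 1.7457],
}


def relative_score(abs_score, min_score, max_score):
    """Rescale an absolute PWM score to a 0-100% range."""
    return 100.0 * (abs_score - min_score) / (max_score - min_score)


def scan(sequence, pwm, threshold=75.0):
    """Return (start, window, relative score) for every window at or above threshold."""
    width = len(next(iter(pwm.values())))
    columns = [[pwm[b][i] for b in pwm] for i in range(width)]
    max_score = sum(max(col) for col in columns)   # best base in every column
    min_score = sum(min(col) for col in columns)   # worst base in every column
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        abs_score = sum(pwm[base][i] for i, base in enumerate(window))
        rel = relative_score(abs_score, min_score, max_score)
        if rel >= threshold:
            hits.append((start, window, round(rel, 1)))
    return hits


hits = scan("ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC", SP1_PWM)
slide_example = relative_score(13.4, -10.3, 15.2)   # the slide's own numbers
```

Lowering `threshold` lets more windows through, which is the behaviour the insulin receptor scan illustrates.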

slide-9
SLIDE 9

Performance of Profiles

  • 95% of predicted sites bound in vitro (Tronche 1997)
  • MyoD binding sites predicted about once every 600 bp (Fickett 1995)
  • The Futility Theorem
    – Nearly 100% of predicted transcription factor binding sites have no function in vivo

slide-10
SLIDE 10

JASPAR AN OPEN-ACCESS DATABASE OF TF BINDING PROFILES

slide-11
SLIDE 11

DISCRIMINATION Overcoming the Specificity Problems

slide-12
SLIDE 12

Phylogenetic Footprinting Dramatically Reduces Spurious Hits

Human Mouse Actin, alpha cardiac

slide-13
SLIDE 13

Performance: Human vs. Mouse

  • Testing set: 40 experimentally defined sites in 15 well-studied genes (replicated with 100+ site set)
  • 75-90% of defined sites detected with conservation filter, while only 11-16% of total predictions retained

SELECTIVITY SENSITIVITY

slide-14
SLIDE 14

ConSite (www.cisreg.ca)

Now Featuring: Ortholog Sequence Retrieval Service

slide-15
SLIDE 15

Current Activity: Analysis of Genetic Variation in TFBS

ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT

slide-16
SLIDE 16

Sequence Variation in TFBS

TSS AaGT

URF

GENE        DISEASE/CONDITION (associated)    REFERENCE
UGT1A1      Gilbert’s Syndrome – jaundice     PJ Bosma et al., 1995
UCP3        Elevated Body Mass                S Otabe et al., 2000
TNFalpha    Malaria Susceptibility            JC Knight et al., 1999
Resistin    Elevated Body Mass                JC Engert et al., 2002
IL4Ralpha   Reduced soluble IL4R              H Hackstein et al., 2001
ABCA1       Coronary artery disease           KY Zwarts et al., 2002
Ob          Leptin levels                     J Hager et al., 1998
PEPCK       Obesity                           Y. Olswang et al., 2002
PR          Endometrial cancer                I DeVivo et al., 2002
LDLR        Familial hypercholesterolemia     Koivisto et al., 1994

slide-17
SLIDE 17

Identifying allele-specific binding site predictions

1234567890123456789012345
ACGCATAAGTTAAtGAATAACAGAT
.............c...........

[Plot: Swt-Smt plotted for positions 1 to 11]
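One way to read the Swt-Smt plot is as a per-window difference between the site score of one allele and the score of the other. The sketch below works under that assumption: it builds a log-odds matrix from a handful of the binding sites listed earlier in the talk (a subset, chosen only to keep the example short), then scores every window that covers the variant at position 14 for both alleles. The pseudocount of 0.5, the uniform background, and all function names are illustrative choices.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=0.5):
    """Log2-odds PWM from aligned sites, assuming a uniform background."""
    n = len(sites)
    return [
        {b: math.log2(((col.count(b) + pseudocount) / (n + 4 * pseudocount)) / 0.25)
         for b in BASES}
        for col in zip(*sites)
    ]


def window_score(pwm, window):
    """Sum the column weights selected by each base of the window."""
    return sum(col[base] for col, base in zip(pwm, window))


def allele_deltas(pwm, sequence, var_index, alt_base):
    """Score of the given allele minus the alternate, for each window covering var_index."""
    seq = sequence.upper()
    alt = seq[:var_index] + alt_base.upper() + seq[var_index + 1:]
    width = len(pwm)
    first = max(0, var_index - width + 1)
    last = min(var_index, len(seq) - width)
    return {
        start: window_score(pwm, seq[start:start + width])
        - window_score(pwm, alt[start:start + width])
        for start in range(first, last + 1)
    }


sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "AAGTTAACGA",
         "ATGTTGATGA", "AAATTAATGA", "AAGTTAATCA", "AAGTAAATGA"]
pwm = build_pwm(sites)
# Position 14 on the slide's ruler is index 13 (0-based); the variant is t -> c.
deltas = allele_deltas(pwm, "ACGCATAAGTTAAtGAATAACAGAT", var_index=13, alt_base="c")
```

A large positive difference at a window that also scores well in absolute terms is the kind of candidate such an analysis would flag.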
slide-18
SLIDE 18

RAVEN screenshots

slide-19
SLIDE 19

Recent and Active Projects

  • JUMBO-JASPAR

– Building a second generation open-access database

  • NHR-scan

– Identification of binding sites for nuclear hormone receptors

slide-20
SLIDE 20

Discrimination of Regulatory Modules TFs do NOT act in isolation

slide-21
SLIDE 21

Layers of Complexity in Metazoan Transcription

slide-22
SLIDE 22

Detecting Clusters of TF Binding Sites

  • Trained Methods

– Sufficient examples of real clusters to establish weights on the relative importance of each TF

  • Statistical Over-Representation of Combinations

– Binding profiles available for a set of biologically motivated TFs

slide-23
SLIDE 23

Training for the detection of liver cis-regulatory modules (CRMs)

slide-24
SLIDE 24

Building a predictive model

(Brief, as this is well described in the literature)

HNF1   C/EBP   HNF3   HNF4

At 60% sensitivity, predictions made ~1/30,000 bp

slide-25
SLIDE 25

UGT1A1

[Plot: Liver Module Model Score (y-axis, -0.2 to 1) vs. “Window” Position in Sequence (x-axis, 100 to 5840); series: Wildtype, Other]

slide-26
SLIDE 26

MSCAN: An untrained method for CRM detection

(w/ J. Lagergren, Royal Technical University of Sweden)

  • MSCAN takes as input a user-defined set of TF profiles
  • Calculates significance for each observed “site” based on local sequence characteristics
  • Calculates cluster significance using a dynamic programming approach
  • Approximately 1 significant liver cluster / 18 000 bp in human genome sequence
  • Filters out statistically significant clusters of sites that contain local repeats
  • Identification of non-random characteristics in DNA

http://mscan.cgb.ki.se

slide-27
SLIDE 27

Current Activities on Combinatorial Binding Prediction

  • Social network analysis to identify a reliable set of genes regulated by a given set of TFs

slide-28
SLIDE 28

Making better predictions

  • Profiles make far too many false predictions to have predictive value in isolation
  • Phylogenetic footprinting eliminates ~90% of false predictions
  • Algorithms for detection of clusters of binding sites perform better, especially when it is possible to train on known examples for the target context

slide-29
SLIDE 29

Linking co-expressed genes from microarrays to candidate transcription factors

slide-30
SLIDE 30

DISCOVERY Inferring regulatory mechanisms for subsets of co-expressed genes

slide-31
SLIDE 31

Deciphering Regulation of Co-Expressed Genes

slide-32
SLIDE 32

oPOSSUM Procedure

  • Set of co-expressed genes
  • Automated sequence retrieval from EnsEMBL
  • Phylogenetic Footprinting
  • Detection of transcription factor binding sites
  • Statistical significance of binding sites
  • Putative mediating transcription factors

ORCA

slide-33
SLIDE 33

Statistical Methods for Identifying Over-represented TFBS

  • Z scores
    – Based on the number of occurrences of the TFBS relative to background
    – Normalized for sequence length
    – Simple binomial distribution model
  • Fisher exact probability scores
    – Based on the number of genes containing the TFBS relative to background
    – Hypergeometric probability distribution
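The two measures can be sketched as follows. This is an illustration under stated assumptions rather than the oPOSSUM implementation: the Z score uses a plain normal approximation to the binomial with the background hit rate per position (no continuity correction), and the Fisher score is a one-tailed hypergeometric tail in which the background gene set is taken to include the target genes. Function and parameter names are hypothetical.

```python
import math


def z_score(target_hits, target_length, background_hits, background_length):
    """Normal approximation to a binomial model of site occurrences."""
    p = background_hits / background_length        # background rate per position
    expected = target_length * p                   # normalizes for sequence length
    sd = math.sqrt(target_length * p * (1.0 - p))
    return (target_hits - expected) / sd


def fisher_one_tailed(target_with, target_total, background_with, background_total):
    """P(at least target_with of the target genes carry the site), hypergeometric."""
    denom = math.comb(background_total, target_total)
    upper = min(target_total, background_with)
    tail = sum(
        math.comb(background_with, k)
        * math.comb(background_total - background_with, target_total - k)
        for k in range(target_with, upper + 1)
    )
    return tail / denom


# Toy numbers, invented for illustration only.
z = z_score(target_hits=30, target_length=1000,
            background_hits=100, background_length=10000)
p_value = fisher_one_tailed(target_with=4, target_total=4,
                            background_with=5, background_total=10)
```

The two scores answer different questions: the Z score counts site occurrences, so a few genes with many sites can drive it, while the Fisher score counts genes and ignores how many sites each one has.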

slide-34
SLIDE 34

The oPOSSUM Database

  • Orthologous genes: 8468
  • Promoter pairs: 6911
  • Promoters with TFBS: 6758
  • Total # of TFBS predictions: 1638293
  • Overall failure rate: 20.2%

slide-35
SLIDE 35

Validation using Reference Gene Sets

TFs with experimentally-verified sites in the reference sets.

A. Muscle-specific (23 input; 16 analyzed)

TF            Rank   Z-score   Fisher
SRF           1      21.41     1.18e-02
MEF2          2      18.12     8.05e-04
c-MYB_1       3      14.41     1.25e-03
Myf           4      13.54     3.83e-03
TEF-1         5      11.22     2.87e-03
deltaEF1      6      10.88     1.09e-02
S8            7      5.874     2.93e-01
Irf-1         8      5.245     2.63e-01
Thing1-E47    9      4.485     4.97e-02
HNF-1         10     3.353     2.93e-01

B. Liver-specific (20 input; 12 analyzed)

TF            Rank   Z-score   Fisher
HNF-1         1      38.21     8.83e-08
HLF           2      11.00     9.50e-03
Sox-5         3      9.822     1.22e-01
FREAC-4       4      7.101     1.60e-01
HNF-3beta     5      4.494     4.66e-02
SOX17         6      4.229     4.20e-01
Yin-Yang      7      4.070     1.16e-01
S8            8      3.821     1.61e-02
Irf-1         9      3.477     1.69e-01
COUP-TF       10     3.286     2.97e-01

slide-36
SLIDE 36

Application to Microarray Data Sets

  • 1. NF-κB inhibition microarray study
slide-37
SLIDE 37

Genes Significantly Down-regulated by the NF-κB inhibitor (326 input; 179 analyzed)

TF          Class         Rank   Z-score   Fisher     No. Genes
p65         REL           1      36.57     5.66e-12   62
NF-kappaB   REL           2      32.58     5.82e-11   61
c-REL       REL           3      26.02     8.59e-08   63
Irf-2       TRP-CLUSTER   4      20.39     5.74e-04   6
SPI-B       ETS           5      16.59     1.23e-03   135
Irf-1       TRP-CLUSTER   6      15.4      9.55e-04   23
Sox-5       HMG           7      15.38     2.56e-02   126
p50         REL           8      14.72     2.23e-03   19
Nkx         HOMEO         9      13.66     2.29e-03   111
Bsap        PAIRED        10     13.2      9.92e-02   1
FREAC-4     FORKHEAD      11     12.05     1.66e-03   92
n-MYC       bHLH-ZIP      25     6.695     1.84e-03   102
ARNT        bHLH          26     6.695     1.84e-03   102
HNF-3beta   FORKHEAD      29     5.948     3.32e-03   47
SOX17       HMG           31     5.406     8.60e-03   79

slide-38
SLIDE 38

C-Myc SAGE Data

  • c-Myc transcription factor dimerizes with the Max protein
  • Key regulator of cell proliferation, differentiation and apoptosis
  • Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC cells
  • They then went on to confirm the induction of 53 genes using microarray analysis and RT-PCR

slide-39
SLIDE 39

Induced Genes after Ectopic Expression of c-Myc (SAGE) (53 input; 36 analyzed)

TF         Class             Rank   Z-score   Fisher     No. Genes
Myc-Max    bHLH-ZIP          1      21.68     5.35e-03   7
Staf       ZN-FINGER, C2H2   2      20.17     1.70e-02   2
Max        bHLH-ZIP          3      18.32     2.16e-02   12
SAP-1      ETS               4      13.23     1.61e-04   13
USF        bHLH-ZIP          5      11.90     1.84e-01   16
SP1        ZN-FINGER, C2H2   6      11.68     4.40e-02   12
n-MYC      bHLH-ZIP          7      11.11     1.55e-01   20
ARNT       bHLH              8      11.11     1.55e-01   20
Elk-1      ETS               9      10.92     3.88e-03   19
Ahr-ARNT   bHLH              10     10.17     1.11e-01   25

slide-40
SLIDE 40

C-Fos Microarray Experiment

  • In a study examining the role of transcriptional repression in oncogenesis, Ordway et al. compared the gene expression profiles of fibroblasts transformed by c-fos to the parental 208F rat fibroblast cell line
  • We mapped the list of 252 induced Affymetrix Rat Genome U34A GeneChip sequences to 136 human orthologs

slide-41
SLIDE 41

Induced Genes after Ectopic Expression of c-Fos (Affymetrix) (136 input; 86 analyzed)

TF                Class              Rank   Z-score   Fisher     No. Genes
c-FOS             bZIP               1      17.53     2.60e-05   45
RREB-1            ZN-FINGER, C2H2    2      8.899     1.41e-01   1
PPARgamma-RXRal   NUCLEAR RECEPTOR   3      3.991     2.98e-01   1
CREB              bZIP               4      3.626     1.25e-01   10
E2F               Unknown            5      2.965     7.67e-02   15
NF-kappaB         REL                6      2.915     1.04e-01   17
SRF               MADS               7      2.707     2.24e-01   2
MEF2              MADS               8      2.634     1.32e-01   13
c-REL             REL                9      2.467     5.79e-02   22
Staf              ZN-FINGER, C2H2    10     2.385     3.74e-01   1
Ahr-ARNT          bHLH               15     1.716     2.57e-03   63
deltaEF1          ZN-FINGER, C2H2    23     0.271     5.39e-03   75
Elk-1             ETS                21     0.7875    8.12e-03   37
MZF_1-4           ZN-FINGER, C2H2    27     -0.2421   5.41e-03   73
n-MYC             bHLH-ZIP           30     -0.8738   8.20e-03   51
ARNT              bHLH               31     -0.8738   8.20e-03   51

slide-42
SLIDE 42

oPOSSUM Server
slide-43
SLIDE 43

http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum

INPUT A LIST OF CO-EXPRESSED GENES

slide-44
SLIDE 44

SELECT YOUR TFBS PROFILES

slide-45
SLIDE 45

SELECT:

  • 1. CONSERVATION
  • 2. PSSM MATCH THRESHOLD
  • 3. PROMOTER REGION
  • 4. STATISTICAL MEASURE
slide-46
SLIDE 46

slide-47
SLIDE 47

de novo Discovery of TF Binding Sites
slide-48
SLIDE 48

Pattern Discovery

slide-49
SLIDE 49

de novo Pattern Discovery

  • Exhaustive
    – e.g. YMF (Sinha & Tompa)
    – Generalization: Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections
  • Monte Carlo/Gibbs Sampling
    – e.g. AnnSpec (Workman & Stormo)
    – Generalization: Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics

slide-50
SLIDE 50

The Gibbs Sampling algorithm

Probabilistic Methods for Pattern Discovery (3)

Two data structures used:

1) Current pattern nucleotide frequencies $q_{i,1}, \dots, q_{i,4}$ and corresponding background frequencies $p_{i,1}, \dots, p_{i,4}$
2) Current positions of site startpoints in the N sequences $a_1, \dots, a_N$, i.e. the alignment that contributes to $q_{i,j}$. One starting point in each sequence is chosen randomly initially.

tgacttcc
tgatctct
agacctca
tgacctct

slide-51
SLIDE 51

Iteration step

Probabilistic Methods for Pattern Discovery (4)

A. Remove one sequence z from the set. Update the current pattern according to

$$ q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B} $$

where $c_{i,j}$ is the count of symbol j at pattern position i in the remaining sequences, $b_j$ is the pseudocount for symbol j, and $B$ is the sum of all pseudocounts in the column.

B. ‘Score’ the current pattern against each possible occurrence $a_k$ in z. Draw a new $a_k$ with probabilities based on the respective score divided by the background model.

tgacttcc
tgatctct
agacctca
tgacctct
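The two steps can be sketched as a small site sampler. This is a teaching sketch under assumptions, not the implementation behind the slides: it expects exactly one site per sequence, a fixed pattern width, uppercase A/C/G/T input, a position-independent background estimated from the input sequences, and a fixed number of sweeps rather than a convergence test. The function name and parameters are illustrative.

```python
import random

BASES = "ACGT"


def gibbs_site_sampler(seqs, width, iterations=200, pseudocount=0.5, seed=0):
    """Return one sampled site start per sequence after `iterations` sweeps."""
    rng = random.Random(seed)
    n = len(seqs)
    # Background frequencies p_j, estimated from all input sequences.
    total = sum(len(s) for s in seqs)
    p = {b: (sum(s.count(b) for s in seqs) + 1) / (total + 4) for b in BASES}
    # One starting point a_k per sequence, chosen at random.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    B = 4 * pseudocount  # sum of pseudocounts in a column
    for _ in range(iterations):
        for z in range(n):
            # Step A: rebuild q from every sequence except z.
            q = []
            for i in range(width):
                counts = {b: 0 for b in BASES}
                for k in range(n):
                    if k != z:
                        counts[seqs[k][starts[k] + i]] += 1
                q.append({b: (counts[b] + pseudocount) / (n - 1 + B) for b in BASES})
            # Step B: weight each candidate start in z by pattern/background odds.
            weights = []
            for a in range(len(seqs[z]) - width + 1):
                w = 1.0
                for i in range(width):
                    base = seqs[z][a + i]
                    w *= q[i][base] / p[base]
                weights.append(w)
            # Draw the new start in proportion to its weight.
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Toy input with a shared 8-mer planted in each sequence (invented for illustration).
seqs = ["ACGTTGACCTCTAGGA", "TTGACCTCTCCGATAC", "GGATGACCTCTTTACA", "CATTGACCTCTGGCAA"]
starts = gibbs_site_sampler(seqs, width=8, iterations=50, seed=1)
```

Because each new start is drawn at random rather than taken as the maximum, the sampler can escape weak local alignments, at the cost of results that vary with the seed.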

slide-52
SLIDE 52

Applied Pattern Discovery is Acutely Sensitive to Noise

[Plot: PATTERN SIMILARITY vs. TRUE MEF2 PROFILE (y-axis, 10 to 18) against SEQUENCE LENGTH (x-axis, 100 to 600); data: True Mef2 Binding Sites]

slide-53
SLIDE 53

Four Approaches to Improve Sensitivity

  • Better background models
    – Higher-order properties of DNA
  • Phylogenetic Footprinting
    – Human:Mouse comparison eliminates ~75% of sequence
  • Regulatory Modules
    – Architectural rules
  • Limit the types of binding profiles allowed
    – TFBS patterns are NOT random

slide-54
SLIDE 54

Pattern discovery methods using biochemical constraints

slide-55
SLIDE 55

Information segmentation

Information content distributions of TFBS are distinctly non-random

(Wasserman et al 2000)

Palindromicity, dyads (van Helden et al 2000)

Variable gaps (Hu 2003)

TFBSs are not randomly drawn

Enhancing pattern detection sensitivity

slide-56
SLIDE 56

Our Hypothesis

  • Point 1: Structurally-related DNA binding domains interact with similar target sequences
    – Exceptions exist (e.g. Zn-fingers)
  • Point 2: There are a finite number of binding domains used in human TFs
    – Approximately 20-25
  • Idea: We could use the shared binding properties for each family to focus pattern detection methods
    – Constrain the range of patterns sought
slide-57
SLIDE 57

Comparison of profiles requires alignment and a scoring function

  • Scoring function based on sum of squared differences
  • Align frequency matrices with modified Needleman-Wunsch algorithm
  • Calculate empirical p-values based on simulated set of matrices

[Histogram: Frequency vs. Score]
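To make the scoring function concrete, the sketch below compares two frequency matrices column by column with a sum of squared differences. It deliberately swaps the modified Needleman-Wunsch alignment for a simpler ungapped sliding comparison, and the mean-per-column scoring and the `min_overlap` parameter are assumptions made for the example; empirical p-values are not computed.

```python
def column_ssd(col_a, col_b):
    """Sum of squared differences between two frequency columns (A, C, G, T)."""
    return sum((x - y) ** 2 for x, y in zip(col_a, col_b))


def normalize(counts):
    """Turn a list of count columns into frequency columns."""
    return [[c / sum(col) for c in col] for col in counts]


def best_ungapped_alignment(profile_a, profile_b, min_overlap=4):
    """Slide profile_b along profile_a; return (mean SSD, offset) of the best overlap.

    offset is the column of profile_a that lines up with column 0 of profile_b.
    Returns None if no overlap of at least min_overlap columns exists.
    """
    best = None
    for offset in range(-(len(profile_b) - min_overlap),
                        len(profile_a) - min_overlap + 1):
        pairs = [
            (profile_a[i], profile_b[i - offset])
            for i in range(len(profile_a))
            if 0 <= i - offset < len(profile_b)
        ]
        if len(pairs) < min_overlap:
            continue
        score = sum(column_ssd(a, b) for a, b in pairs) / len(pairs)
        if best is None or score < best[0]:
            best = (score, offset)
    return best


# Toy profiles: columns are [A, C, G, T] counts; profile_b is profile_a minus its first column.
profile_a = normalize([[8, 0, 0, 0], [0, 8, 0, 0], [0, 0, 8, 0], [0, 0, 0, 8], [4, 4, 0, 0]])
profile_b = profile_a[1:]
result = best_ungapped_alignment(profile_a, profile_b)
```

A lower score means more similar profiles; allowing gaps, as the slide's alignment does, would additionally handle profiles with variable spacers.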

slide-58
SLIDE 58

Intra-family comparisons more similar than inter-family

TF Database (JASPAR)
COMPARE
Match to bHLH

Jackknife Test: 87% correct
Independent Test Set: 93% correct

slide-59
SLIDE 59

slide-60
SLIDE 60

FBPs enhance sensitivity of pattern detection

slide-61
SLIDE 61
slide-62
SLIDE 62

“Regulog” Analysis Comparative Genomics for Promoters

slide-63
SLIDE 63

Approach

  • Define all regulatory sequences in S. aureus.
    – Transcription factor binding sites
    – RNA structures
    – Promoters

=> Phylogenetic footprinting

slide-64
SLIDE 64

Find a conserved pattern

  • E. coli
  • B. subtilis
  • S. aureus

clpP

TACCNCN(A/T)(A/T)NGNGGTA
TACCNRWAAYGBGGTA

taccgctattgaggta
taccccgatcggggta
tacccattaaggagta
taactctaaagtggta
tacctcaatagcggta
taccccgatcggggta
tactccttaatgggta
taccactttagagtta

Pattern detection

A [0 8 1 0 1 1 1 5 3 5 0 2 1 0 0 8]
C [0 0 7 7 4 7 0 0 0 2 0 1 0 0 0 0]
G [0 0 0 0 1 0 2 0 0 0 7 4 7 7 0 0]
T [8 0 0 1 2 0 5 3 5 1 1 1 0 1 8 0]
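The count matrix on this slide can be reproduced directly from the eight aligned clpP sites, as a quick check that the matrix is simply a per-column tally. The function name `count_matrix` is illustrative.

```python
def count_matrix(sites):
    """Count A/C/G/T at each column of equal-length aligned sites."""
    sites = [s.upper() for s in sites]
    # zip(*sites) yields one tuple of bases per alignment column.
    return {base: [col.count(base) for col in zip(*sites)] for base in "ACGT"}


clpp_sites = [
    "taccgctattgaggta", "taccccgatcggggta", "tacccattaaggagta", "taactctaaagtggta",
    "tacctcaatagcggta", "taccccgatcggggta", "tactccttaatgggta", "taccactttagagtta",
]
pfm = count_matrix(clpp_sites)
```

Every column sums to eight, the number of sites, which is a useful sanity check before converting counts to weights.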

slide-65
SLIDE 65

Regulatory sequences in S. aureus

1818 sets of orthologs from S. aureus
  → Gibbs sampling
1430 patterns
  → Compare to random sequences
318 significant patterns
  → Cluster with MatrixAligner (Sandelin et al 2003); remove redundancies
154 unique patterns in S. aureus

slide-66
SLIDE 66

Approach

  • Define all regulatory sequences in S. aureus.
    – Transcription factor binding sites
    – RNA structures
    – Promoters
  • Define sets of genes that are under control of these regulatory sequences => regulons
    – Sequence search
    – Regulog filtering

slide-67
SLIDE 67

Regulon prediction with site search

[Plot: Fraction of total ORFs in regulon (0.00 to 0.10) vs. site score threshold (p-value, 0.00 to 0.12)]

175 members in E. coli => Site searches produce too many false positive hits

slide-68
SLIDE 68

Regulon conservation filter

  • A predicted regulon member is more likely a true positive when its ortholog(s) is regulated by the same regulatory sequence.
  • Such conserved regulons are called regulogs
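The filter can be sketched as a per-gene conservation score: the fraction of other genomes in which the gene's ortholog also carries a predicted site for the same regulatory sequence. The data structures (`orthologs`, `predicted_sites`), the gene names, and the 0.5 cut-off are all hypothetical and chosen only for illustration; the slides do not specify how the score is computed.

```python
def conservation_score(gene, orthologs, predicted_sites):
    """Fraction of a gene's orthologs that also have a predicted site."""
    partners = orthologs.get(gene, {})          # species -> ortholog gene name
    if not partners:
        return 0.0
    conserved = sum(
        1 for species, orth in partners.items()
        if orth in predicted_sites.get(species, set())
    )
    return conserved / len(partners)


def regulog(regulon, orthologs, predicted_sites, min_score=0.5):
    """Keep predicted regulon members whose site is conserved in enough genomes."""
    scored = {g: conservation_score(g, orthologs, predicted_sites) for g in regulon}
    return {g: s for g, s in scored.items() if s >= min_score}


# Hypothetical example: three comparison genomes B, C and D.
orthologs = {
    "geneA": {"B": "geneA_B", "C": "geneA_C", "D": "geneA_D"},
    "geneC": {"B": "geneC_B", "C": "geneC_C", "D": "geneC_D"},
}
predicted_sites = {
    "B": {"geneA_B", "geneC_B"},
    "C": {"geneA_C"},
    "D": {"geneA_D"},
}
kept = regulog(["geneA", "geneB", "geneC"], orthologs, predicted_sites)
```

Here geneA is retained because all three orthologs carry a site, geneC scores one in three, and geneB has no orthologs to support it.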

slide-69
SLIDE 69

Regulogs

Regulon Conservation Filtering (RECF)

[Diagram: genes geneA to geneG with putative binding sites across genomes labelled A to D; values 1, 1, 0.66, 0.33, 1 shown per gene; legend: putative binding site, regulon, regulog]

slide-70
SLIDE 70

RECF test: Escherichia coli

                  REGULON         REGULOG
TF      # known   Total   Pos     Total   Pos     Efficiency
pdhR    4         102     1       4       1       25.5
ilvY    2         182     2       12      2       15.2
oxyR    5         137     3       11      3       12.5
torR    4         77      4       7       4       11
metR    4         218     3       21      3       10.4

$$ \mathrm{Efficiency}_{\mathrm{RECF}} = \frac{\mathrm{Specificity}_{\mathrm{REGULOG}} \times \mathrm{Sensitivity}_{\mathrm{REGULOG}}}{\mathrm{Specificity}_{\mathrm{REGULON}} \times \mathrm{Sensitivity}_{\mathrm{REGULON}}} $$
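The efficiency column can be checked with a few lines of arithmetic. The definitions used here are assumptions, since the slide does not spell them out: specificity is taken as positives divided by total predictions, sensitivity as positives divided by the number of known members, and efficiency as the regulog product over the regulon product. With those readings, the values in the table are reproduced to within rounding.

```python
def recf_efficiency(known, regulon_total, regulon_pos, regulog_total, regulog_pos):
    """Ratio of (specificity x sensitivity) after filtering to before filtering."""
    def product(total, pos):
        specificity = pos / total    # assumed: fraction of predictions that are known members
        sensitivity = pos / known    # assumed: fraction of known members recovered
        return specificity * sensitivity
    return product(regulog_total, regulog_pos) / product(regulon_total, regulon_pos)


# (known, regulon total, regulon pos, regulog total, regulog pos) from the table.
rows = {
    "pdhR": (4, 102, 1, 4, 1),
    "ilvY": (2, 182, 2, 12, 2),
    "oxyR": (5, 137, 3, 11, 3),
    "torR": (4, 77, 4, 7, 4),
    "metR": (4, 218, 3, 21, 3),
}
efficiency = {tf: recf_efficiency(*values) for tf, values in rows.items()}
```

When the filter keeps every positive, as in these five rows, the ratio reduces to regulon total divided by regulog total, so the gain comes entirely from discarding false predictions.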

slide-71
SLIDE 71

RECF test: Escherichia coli

                  REGULON         REGULOG
TF      # known   Total   Pos     Total   Pos     Efficiency
pdhR    4         102     1       4       1       25.5
ilvY    2         182     2       12      2       15.2
oxyR    5         137     3       11      3       12.5
torR    4         77      4       7       4       11
metR    4         218     3       21      3       10.4
...     ...       ...     ...     ...     ...     ...
AVG.    9.8       174     7.2     20      3.8     4.2

$$ \mathrm{Efficiency}_{\mathrm{RECF}} = \frac{\mathrm{Specificity}_{\mathrm{REGULOG}} \times \mathrm{Sensitivity}_{\mathrm{REGULOG}}}{\mathrm{Specificity}_{\mathrm{REGULON}} \times \mathrm{Sensitivity}_{\mathrm{REGULON}}} $$

slide-72
SLIDE 72

RECF applied to S. aureus

RCS, Consensus, Members (leftmost members are the members with the highest confidence)

1.00  AACACAATATATAGTG  nrdD,SA2409,nrdI,nrdE,cspC,mtlF
1.00  TGTTAGAAAATCTAAC  glnR,nrgA,glnA
1.00  AGGTGCTAAATCCTGC  SA0011
0.89  GCCAGCGTAGGGAAGT  SA0928,SA0929,thiD,thiE,SA1897,gapR,thiM
0.88  ACAGGTCATAAGGGTC  SA0929,SA1897,SA0928,thiD,polC,thiE,thiM
0.87  AAGGGTGGAACCACGA  thrS,leuS,alaS,cysE,cysS,SA0489,SA0490,SA0491,pheS,pheT,SA1931,aspS,hisS,ileS,tyrS,trpG,valS,serS,SA0331,SA2101,SA1486,SA1289,SA1290,SA1291,SA2205,SA1392,truncated(radC),SA1578,murE,SA2102,SA1562,SA1199,trpD,trpC,trpF,trpB,trpA
0.86  TGTGAA?T?TTTCAC?  narG,narI,SA2183,narH,pflB,SA2174,lctE,SA1455,narK,SA0293,msmX,adhE,rpsU,fbaA
0.83  AAAAGAGTGCTAACA?  crtM,groES,hrcA,SA1747,SA1582,SA1581,SA2305,SA1748,grpE
0.83  TTGAAAATGATTATCA  SA0307,SA0116,SA0689,SA0117,SA0690,SA0331,SA0977,SA0978,SA1329,SA2162,ahpF,SA1979,SA0688,feoB,SA2338,sirA,SA2079,katA,SA0757,ahpC,fhuA,fhuB,fhuG,SA0335,SA2101,SA0160,SA0170,hemX,sirB,hemL,hemB,hemD,hemC,dapD,hemA,SA2102,SA0588,SA0589,SA0115,dps,fer,SA0774,SA1678
0.82  ?A?AAAAGTTATCCAC  SA0339,orfX,dnaA,dnaN,SA1419,SA1420,SA1421,SA1422,SA1423,aroE,SA1425,SA1426,SA0248

slide-73
SLIDE 73

The Fur regulog

Known in other bacteria
Known in S. aureus
Unknown

slide-74
SLIDE 74

Conclusions

  • Pattern analysis methods have utility
  • Combine knowledge from multiple fields
  • Statistics and AI methods must be imported
    – Gibbs sampling, LRA, neural networks, SVMs, etc
  • Evolution drives understanding in biology
    – Phylogenetic Footprinting
  • Biochemistry inspires Bioinformatics
    – Regulatory Modules
    – Familial Binding Profiles
  • Analysis of regulatory sequences is improving
  • Given sets of orthologous genes, one can predict regulatory regions
  • Given sets of co-regulated genes, it is possible to infer the binding profiles for critical transcription factors

slide-75
SLIDE 75

Questions? Comments? Complaints?