SLIDE 1 Introduction to EPD
General goal: to provide the best possible guess about the location of the transcription start sites (TSS) of a gene, based on multiple lines of evidence.
Leading concepts:
- Promoters are defined as transcription initiation regions
- No redundancy
- Data selection according to consistent criteria defined in user manual
- Independent evaluation of published results
- Dynamic entries potentially based on multiple sources
- Definition of promoter sequence through positional pointers
- Cross-referencing between related entries
- Definition of a subset of phylogenetically independent promoters for
comparative sequence analysis
Promoter Definition
Three definitions for E. coli promoters
- 1. Transcription start regions.
- 2. DNA sequences essential for accurate and efficient RNA chain
initiation
- 3. RNA polymerase binding sites.
How the term promoter is used in the literature:
Dickson et al. (1975), Science 187, 27: "... the (lac) promoter can be divided into two functional units, the CAP interaction site and the RNA polymerase interaction site."
Harley & Reynolds (1987), Nucl. Acids Res. 15, 2343: "Promoters are DNA sequences which affect the frequency and location of transcription initiation through interaction with RNA polymerase."
SLIDE 2 EPD: Admission criteria
In order to be included in EPD, a promoter must be:
a) recognized by eukaryotic RNA POL II
b) active in a higher eukaryote (viral promoters ok)
c) experimentally defined, or homologous and sufficiently similar to an experimentally defined promoter
d) biologically functional (no promoters of transcribed pseudo-genes)
e) available in the current sequence database (not a problem anymore)
f) distinct from other promoters in the EPD (no redundancy, one copy for tandemly repeated gene clusters, active retrotransposons, etc.)
Recent developments: acceptance of "low quality" entries based on weak evidence
EPD entries are positional sequence features
Positional sequence features have a center position, but no experimentally defined borders. Examples of positional features:
- promoters (DNA)
- splice-junctions (DNA/RNA)
- polyadenylation sites (RNA)
- translation start sites (RNA)
- catalytic residues (proteins)
- post-translational modification sites
How can sequence analysis software deal with positional features?
- by extracting fixed length sequence segments around positions
- relative 5’ and 3’ border may be specified on the fly
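The extraction step above can be sketched in a few lines. This is a minimal illustration, not EPD or SSA code: the function names, the 1-based position convention, and the use of plain offsets (0 = the functional position itself, negative = upstream) are assumptions made for the example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (N stays N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def extract_segment(seq, pos, rel_5p, rel_3p, strand="+"):
    """Extract a fixed-length segment around a positional feature.

    seq     -- the full sequence
    pos     -- 1-based position of the feature on the forward strand (assumed convention)
    rel_5p  -- 5' border as an offset from pos (e.g. -3); offset 0 is pos itself
    rel_3p  -- 3' border as an offset from pos (e.g. +2), inclusive
    strand  -- "+" or "-"; for "-" the result is reverse-complemented

    Returns None if the requested window runs off either end of the sequence.
    """
    if strand == "+":
        start, end = pos + rel_5p, pos + rel_3p
    else:
        # On the minus strand, "upstream" means higher forward-strand coordinates.
        start, end = pos - rel_3p, pos + (-rel_5p)
    if start < 1 or end > len(seq):
        return None
    segment = seq[start - 1:end]
    return segment if strand == "+" else reverse_complement(segment)


print(extract_segment("AAACCCGGGTTT", 7, -3, 2))        # plus-strand window
print(extract_segment("AAACCCGGGTTT", 7, -3, 2, "-"))   # minus-strand window
```

Because the borders are arguments rather than stored with the feature, the same position list can be reused with different window sizes.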
SLIDE 3
EPD format: Example of an EPD entry
ID   LE_1A12 standard; single; PLN.
XX
AC   EP35029;
XX
DT   ??-JUN-1993 (Rel. 35, created)
DT   07-OCT-2002 (Rel. 72, Last annotation update).
XX
DE   1-aminocyclopropane-1-carboxylic acid synthase 2
OS   Lycopersicon esculentum (tomato).
XX
HG   none.
AP   none.
NP   none.
XX
DR   EMBL; X59139.1; [-2883, 4361].
DR   SWISS-PROT; P18485; 1A12_LYCES.
XX
RN   [1]
RX   MEDLINE; 1762159.
RA   Rottmann W.H., Peter G.F., Oeller P.W., Keller J.A., Shen N.F.,
RA   Nagy B.P., Taylor L.P., Campbell A.D., Theologis A.;
RT   "The 1-aminocyclopropane-1-carboxylate synthase in tomato is
RT   encoded by a multigene family whose transcription is induced
RT   during fruit and floral senescence";
RL   J. Mol. Biol. 222:937-961(1991).
...
EPD format: Example of an EPD entry (continuation)
ME   Nuclease protection with homologous sequence ladder [1].
ME   Primer extension with homologous sequence ladder [1].
XX
SE   acttcagtctttccccttatatatatccctcacattccttaattctcttACACCATAACA
XX
TX   1. Plant promoters
TX   1.1. Chromosomal genes
TX   1.1.4. Enzymes
TX   1.1.4.6. Ethylene synthesis
XX
KW   Fruit ripening, Ethylene biosynthesis, Lyase, Multigene family.
XX
FP   Le ACC synth. ACC2 :+S EM:X59139.1 1+ 2884; 35029.
XX
DO   Experimental evidence: 3h,6h
DO   Expression/Regulation: +fruit ripening;+wounding
RF   JMB222:937
//
Note: The line starting with the code FP defines the position of the TSS:
- EM:X59139.1: sequence identifier
- 1: topology (1 = linear, 0 = circular)
- +: strand (+/−)
- 2884: position within sequence
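As a rough illustration of how the positional pointer in an FP line could be read by software, here is a small parser. The regular expression is inferred only from the single FP line shown in the example entry, not from the EPD user manual, so the pattern and the function name `parse_fp_location` should be treated as assumptions.

```python
import re

# Pattern inferred from the one example line: "<DB>:<seq id> <topology><strand> <position>;"
FP_LOCATION = re.compile(
    r"(?P<db>[A-Z]{2}):(?P<seq_id>\S+)\s+"
    r"(?P<topology>[01])(?P<strand>[+-])\s+"
    r"(?P<position>-?\d+);"
)

TOPOLOGY = {"1": "linear", "0": "circular"}  # mapping as stated on the slide


def parse_fp_location(fp_line):
    """Return the positional pointer of an FP line as a dict, or None if not found."""
    match = FP_LOCATION.search(fp_line)
    if match is None:
        return None
    return {
        "db": match.group("db"),
        "seq_id": match.group("seq_id"),
        "topology": TOPOLOGY[match.group("topology")],
        "strand": match.group("strand"),
        "position": int(match.group("position")),
    }


line = "FP Le ACC synth. ACC2 :+S EM:X59139.1 1+ 2884; 35029."
print(parse_fp_location(line))
```

The parsed sequence identifier, strand, and position are exactly what a segment-extraction routine needs in order to cut the promoter sequence out of the referenced database entry.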
SLIDE 4 Signal Search Analysis Essentials
History: Signal Search Analysis is an ancient method developed by myself in the early eighties in Max Birnstiel's lab in Zurich (first published in 1984).
Purpose: to discover and characterize sequence motifs that occur at constrained distances from physiologically defined sites in nucleic acid sequences.
Recent event: Adaptation of software to new environment, SSA web server, application to promoters and translational start sites.
Note the difference: SSA programs serve to characterize motifs that occur at constrained distances from sites, not motifs that are over-represented within sequence sets. There are hundreds of programs that address the latter problem, but only very few that serve the same purpose as the SSA programs!
Early comparative analysis of E.coli promoter sequences
FIG. 4. Comparison of promoter sequences (see text). b, Homologous sequence probably engaged by RNA polymerase; i, mRNA initiation point (underlined). Hyphens have been omitted. SV40, simian virus 40; w.t., wild type.
Among the promoter sequences, there is a homologous, 7-base sequence lying to the left of the initiation points. I feel that the DNA sequence
5' T-A-T-Pu-A-T-G 3'
3' A-T-A-Py-T-A-C 5'
is implicated in the formation of a tight binary complex with RNA polymerase.
Text and Figures from: Pribnow (1975) Proc. Nat. Acad. Sci. USA 72, 784-788.
SLIDE 5 SSA Signal Search Analysis
Giovanna Ambrosini ISREC Swiss Institute for Experimental Cancer Research
- History: Signal Search Analysis is a method developed by P Bucher in the early eighties
(Bucher, P. and Bryan B., E.N.; Nucleic Acids Res, v.12(1 Pt 1): 287–305)
- Purpose: to discover and characterize sequence motifs that occur at constrained distances
from physiologically defined sites in nucleic acid sequences.
- Signal search analysis programs:
- 1. CPR: generates a “constraint profile” for the neighborhood of a functional site
- 2. SList: generates lists of over- and under-represented motifs in particular regions relative to a
functional site
- 3. OProf: generates a “signal occurrence profile” for a particular motif
- 4. PatOP: optimizes a weight matrix description of a locally over-represented sequence motif
- Recent events: Adaptation of software to new environment, SSA web server, application to
promoters and translational start sites
Signal Search Analysis: Sequence via a functional position set
Input data structure: primary experimental data (Functional Position Set)
- annotated functional positions in DNA sequences stored in a database
Work data: a DNA sequence matrix
- a set of fixed-length sequence segments with an experimentally defined site at a fixed internal position
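The step from a functional position set to a DNA sequence matrix can be sketched as follows. This is an illustrative sketch under simplifying assumptions (plus-strand sites only, 1-based positions, sites too close to a sequence end are skipped); the function names and data layout are hypothetical and do not reproduce the SSA programs.

```python
from collections import Counter


def build_sequence_matrix(sequences, positions, upstream, downstream):
    """Build a list of equal-length segments aligned on the functional site.

    sequences  -- dict mapping sequence id -> sequence string
    positions  -- list of (sequence id, 1-based site position), plus strand assumed
    upstream   -- number of bases kept 5' of the site
    downstream -- number of bases kept 3' of the site

    In every returned row the site sits at index `upstream`.
    """
    rows = []
    for seq_id, pos in positions:
        seq = sequences[seq_id]
        start, end = pos - upstream, pos + downstream
        if start < 1 or end > len(seq):
            continue  # window would run off the sequence; skip this site
        rows.append(seq[start - 1:end].upper())
    return rows


def column_base_counts(matrix):
    """Count bases column by column across the aligned rows."""
    return [Counter(column) for column in zip(*matrix)]


sequences = {"s1": "ACGTACGTAC", "s2": "TTTTACGGGG"}
positions = [("s1", 5), ("s2", 5), ("s1", 1)]
matrix = build_sequence_matrix(sequences, positions, upstream=2, downstream=2)
print(matrix)
print(column_base_counts(matrix)[2])  # counts at the site column
```

Once all rows share the same length and the same internal site index, any per-column or per-window statistic can be computed without referring back to the database.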
SLIDE 6
Generating signal search data from a DNA sequence matrix
Computing a constraint profile from signal search data
SLIDE 7
Generating a constraint profile for plant promoters
Input parameters for the constraint profile for plant promoters
SLIDE 8 Input menu for signal search data
Special collections: each line expands to all combinations of bases at the X positions, e.g.
NNXXNN -> NNAANN NNACNN NNAGNN NNATNN NNCANN NNCCNN NNCGNN NNCTNN NNGANN NNGCNN NNGGNN NNGTNN NNTANN NNTCNN NNTGNN NNTTNN
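The expansion shown above can be reproduced with a short sketch. It assumes, based only on the example, that each X stands for any of A, C, G, T while N is left untouched; the function name `expand_collection` is made up for illustration.

```python
from itertools import product


def expand_collection(pattern, wildcard="X", alphabet="ACGT"):
    """Expand every wildcard position into all base combinations.

    Characters other than the wildcard (e.g. N) are copied unchanged.
    """
    n_wild = pattern.count(wildcard)
    expanded = []
    for combo in product(alphabet, repeat=n_wild):
        bases = iter(combo)
        expanded.append(
            "".join(next(bases) if ch == wildcard else ch for ch in pattern)
        )
    return expanded


patterns = expand_collection("NNXXNN")
print(len(patterns))
print(" ".join(patterns))
```

With two X positions and four bases the result has 4 x 4 = 16 patterns, in the same order as listed on the slide.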
Signal Occurrence Profile.
Is the tri-nucleotide TAT over-represented or under-represented in one
of the windows? For the answer, see the next slide.
SLIDE 9
How to compute the local occurrence frequency?
Note: Each "signal" occurrence (TAT) is counted only once per sequence. Windows containing N's do not add to the sample size. Here, occurrence frequencies are computed for non-overlapping windows. Signal occurrence profiles are usually computed from overlapping windows.
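The counting rules in the note can be written down as a small sketch. It is not the OProf implementation: the function name, the requirement that the signal lie entirely inside the window, and the toy sequences are assumptions made for illustration.

```python
def local_occurrence_frequency(matrix, signal, window, step=None):
    """Fraction of sequences containing `signal` in each window.

    matrix -- list of equal-length aligned sequences
    signal -- exact word to look for, e.g. "TAT"
    window -- window width in bases
    step   -- shift between windows; defaults to `window` (non-overlapping)

    A sequence counts at most once per window, and a window containing N
    is left out of the sample size for that sequence.
    Returns a list of (window start index, frequency or None).
    """
    if not matrix:
        return []
    if step is None:
        step = window
    profile = []
    for start in range(0, len(matrix[0]) - window + 1, step):
        hits = 0
        sample_size = 0
        for row in matrix:
            segment = row[start:start + window].upper()
            if "N" in segment:
                continue  # ambiguous window: does not add to the sample size
            sample_size += 1
            if signal in segment:
                hits += 1  # counted once, however many times it occurs
        profile.append((start, hits / sample_size if sample_size else None))
    return profile


toy = ["TATATATAGGGG", "CCCCTATCCNCC", "GGGGGGGGTATC"]
print(local_occurrence_frequency(toy, "TAT", window=6))          # non-overlapping
print(local_occurrence_frequency(toy, "TAT", window=6, step=3))  # overlapping
```

Passing a `step` smaller than `window` gives the overlapping windows that signal occurrence profiles normally use.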
Making a Signal Occurrence Profile for the eukaryotic TATA-box: Input data and parameters
SLIDE 10
Making a Signal Occurrence Profile for the TATA-box for eukaryotic promoters: Result
Concept of a locally over-represented sequence motif
SLIDE 11 Definition of a Locally Over-represented Sequence Motif
- Components of the formal motif description
1. A weight matrix or consensus sequence defining the motif
2. A cut-off value determining which subsequence constitutes a motif match
3. A preferred region of occurrence defined by 5' and 3' borders relative to a functional site, e.g. a transcription initiation site
A motif which preferentially occurs at a characteristic distance (range) from a certain type of functional position.
Example: the TATA-box is a locally over-represented sequence motif of the -30 region of eukaryotic POL II transcription initiation sites.
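The three components of the formal description (weight matrix, cut-off, preferred region) can be combined in a short sketch. The weight matrix below is a toy that rewards the exact word TATA; it is not the published TATA-box matrix. The function names and the choice to measure the borders at the motif start position are assumptions.

```python
def score_window(weights, segment):
    """Sum the position-specific weights for the bases in `segment`."""
    return sum(column[base] for column, base in zip(weights, segment))


def has_motif_match(seq, site_index, weights, cutoff, border_5p, border_3p):
    """True if a motif match starts inside the preferred region.

    seq        -- sequence containing the functional site
    site_index -- 0-based index of the functional site in seq
    weights    -- list of dicts, one per motif column, mapping base -> weight
    cutoff     -- minimum score that counts as a match
    border_5p, border_3p -- region borders as offsets of the motif start
                            relative to the site (inclusive)
    """
    width = len(weights)
    for offset in range(border_5p, border_3p + 1):
        start = site_index + offset
        if start < 0 or start + width > len(seq):
            continue
        segment = seq[start:start + width].upper()
        if any(base not in "ACGT" for base in segment):
            continue  # skip windows with ambiguous bases
        if score_window(weights, segment) >= cutoff:
            return True
    return False


# Toy matrix: +1 for the consensus base, -1 otherwise (illustrative values only).
toy_weights = [
    {b: (1.0 if b == consensus else -1.0) for b in "ACGT"} for consensus in "TATA"
]
toy_seq = "G" * 20 + "TATA" + "G" * 26 + "A" + "G" * 9  # site at index 50
print(has_motif_match(toy_seq, 50, toy_weights, 4.0, -35, -25))
print(has_motif_match(toy_seq, 50, toy_weights, 4.0, -10, -1))
```

Applying such a test to every row of a sequence matrix gives the fraction of sites that carry the motif in the preferred region, which can then be compared with the fraction observed in other regions.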
The PATOP algorithm optimizes a locally over-represented sequence motif
SLIDE 12
A weight matrix definition for the TATA-box motif
See also. Bucher 1990, J. Mol. Biol. 212, 563-578.
Weight matrix for the Initiator (Cap-signal)
See also. Bucher 1990, J. Mol. Biol. 212, 563-578.
SLIDE 13
Positional distribution of “site-selector” promoter elements
See also. Bucher 1990, J. Mol. Biol. 212, 563-578.
Weight matrix definition for CCAAT box motif
See also. Bucher 1990, J. Mol. Biol. 212, 563-578.
SLIDE 14
A weight matrix definition for the GC-box motif
See also. Bucher 1990, J. Mol. Biol. 212, 563-578.
Positional distributions of the promoter upstream elements
See also. Bucher 1990, J. Mol. Biol. 212, 563-578.
SLIDE 15 Comparative analysis of cancer up- and down- regulated promoters
Motifs considered (name, preferred position):
- Initiator: 25% - 50%
- TATA-box: ~30%
- GC-box: ~50%
- CCAAT-box: ~20%
Positional distribution of Initiator motif in cancer up- and down-regulated promoters
SLIDE 16
Positional distribution of TATA-boxes in cancer up- and down-regulated promoters
Positional distribution of GC-boxes in cancer up- and down-regulated promoters
SLIDE 17
Positional distribution of CCAAT-boxes in cancer up- and down-regulated promoters
Comparative analysis of cancer up- and down-regulated promoters: Summary of results
Signal content
Name         Frequency in cancer-up genes   Frequency in cancer-down genes
Initiator    no change                      no change
TATA-box     up                             down
GC-box       no change                      no change
CCAAT-box    up                             down
Next questions:
Are TATA-box and CCAAT-box binding factors up-regulated in cancer cells? Or do cancer-specific transcription factors (binding to adjacent sites) preferentially interact with TATA-box and CCAAT-box binding factors?
SLIDE 18 Concluding remarks
- Signal search analysis has played an instrumental role in the characterization of eukaryotic
promoter elements
- The method was originally developed for the analysis of eukaryotic promoters but has
a much broader application potential (e.g. Shine-Dalgarno signal analysis)
- The rapidly growing collection of complete genomes and high-throughput methods for genomic
analysis increase the statistical power to discover new motifs or to better characterize already known control signals
- Aligning sequence sets with respect to a well characterized motif might allow the detection
of binding sites of cooperating transcription factors positionally correlated with the known
motif
- Confirm or challenge commonly accepted hypotheses originally derived from small sets