Apr 07, 2023

SLIDE 1

Introduction to EPD

General Goal: To provide the best possible guess about the location of the transcription start sites (TSS) of a gene, based on multiple lines of evidence.

Leading concepts:

Promoters are defined as transcription initiation regions
No redundancy
Data selection according to consistent criteria defined in the user manual
Independent evaluation of published results
Dynamic entries potentially based on multiple sources
Definition of promoter sequence through positional pointers
Cross-referencing between related entries
Definition of a subset of phylogenetically independent promoters for comparative sequence analysis

Promoter Definition

Three definitions for E. coli promoters

1. Transcription start regions.
2. DNA sequences essential for accurate and efficient RNA chain initiation.
3. RNA polymerase binding sites.

How the term promoter is used in the literature:

Dickson et al. (1975), Science 187, 27: "... the (lac) promoter can be divided into two functional units, the CAP interaction site and the RNA polymerase interaction site."

Harley & Reynolds (1987), Nucl. Acids. Res. 15, 2343: "Promoters are DNA sequences which affect the frequency and location of transcription initiation through interaction with RNA polymerase."

SLIDE 2

EPD: Admission criteria

In order to be included in EPD, a promoter must be:

a) recognized by eukaryotic RNA POL II
b) active in a higher eukaryote (viral promoters ok)
c) experimentally defined, or homologous and sufficiently similar to an experimentally defined promoter
d) biologically functional (no promoters of transcribed pseudo-genes)
e) available in the current sequence database (not a problem anymore)
f) distinct from other promoters in the EPD (no redundancy, one copy for tandemly repeated gene clusters, active retrotransposons, etc.)

Recent developments: Acceptance of “low quality” entries based on weak evidence

EPD entries are positional sequence features

Positional sequence features have a center position, but no experimentally defined borders. Examples of positional features:

promoters (DNA)
splice-junctions (DNA/RNA)
polyadenylation sites (RNA)
translation start sites (RNA)
catalytic residues (proteins)
post-translational modification sites

How can sequence analysis software deal with positional features?

by extracting fixed-length sequence segments around positions
the relative 5’ and 3’ borders may be specified on the fly
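As an illustration of extracting a fixed-length segment around a positional feature, here is a minimal Python sketch. The function name `extract_segment`, the 1-based position, and the convention that offset 0 is the feature itself are assumptions made for this example; they are not taken from EPD or the SSA software.

```python
# Hypothetical helper for illustration; not part of EPD or the SSA programs.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_segment(sequence, position, five_prime, three_prime, strand="+"):
    """Return the segment between two borders given relative to a feature.

    position    -- 1-based coordinate of the feature (e.g. a TSS)
    five_prime  -- 5' border relative to the feature (negative = upstream)
    three_prime -- 3' border relative to the feature
    Offset 0 denotes the feature itself (an assumption of this sketch).
    Returns None if the window runs past either end of the sequence.
    """
    if strand == "+":
        start, end = position + five_prime, position + three_prime
    else:
        # On the minus strand, upstream lies at higher coordinates.
        start, end = position - three_prime, position - five_prime
    if start < 1 or end > len(sequence) or start > end:
        return None
    segment = sequence[start - 1:end]
    if strand == "-":
        # Report the segment 5'->3' on the feature's own strand.
        segment = segment.translate(_COMPLEMENT)[::-1]
    return segment
```

Because the borders are arguments rather than stored with the feature, the same position can be reused with different windows, e.g. `extract_segment(seq, tss, -49, 10)` for one analysis and a wider window for another.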

SLIDE 3

EPD format: Example of an EPD entry

ID   LE_1A12 standard; single; PLN.
XX
AC   EP35029;
XX
DT   ??-JUN-1993 (Rel. 35, created)
DT   07-OCT-2002 (Rel. 72, Last annotation update).
XX
DE   1-aminocyclopropane-1-carboxylic acid synthase 2
OS   Lycopersicon esculentum (tomato).
XX
HG   none.
AP   none.
NP   none.
XX
DR   EMBL; X59139.1; [-2883, 4361].
DR   SWISS-PROT; P18485; 1A12_LYCES.
XX
RN   [1]
RX   MEDLINE; 1762159.
RA   Rottmann W.H., Peter G.F., Oeller P.W., Keller J.A., Shen N.F.,
RA   Nagy B.P., Taylor L.P., Campbell A.D., Theologis A.;
RT   "The 1-aminocyclopropane-1-carboxylate synthase in tomato is
RT   encoded by a multigene family whose transcription is induced
RT   during fruit and floral senescence";
RL   J. Mol. Biol. 222:937-961(1991).
...

EPD format: Example of an EPD entry (continuation)

ME   Nuclease protection with homologous sequence ladder [1].
ME   Primer extension with homologous sequence ladder [1].
XX
SE   acttcagtctttccccttatatatatccctcacattccttaattctcttACACCATAACA
XX
TX   1. Plant promoters
TX   1.1. Chromosomal genes
TX   1.1.4. Enzymes
TX   1.1.4.6. Ethylene synthesis
XX
KW   Fruit ripening, Ethylene biosynthesis, Lyase, Multigene family.
XX
FP   Le ACC synth. ACC2 :+S EM:X59139.1 1+ 2884; 35029.
XX
DO   Experimental evidence: 3h,6h
DO   Expression/Regulation: +fruit ripening;+wounding
RF   JMB222:937
//

Note: The line starting with the code FP defines the position of the TSS:

EM:X59139.1   sequence identifier
1             topology (1 = linear, 0 = circular)
+             strand (+/−)
2884          position within sequence
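The positional pointer described in the note can be pulled out of an FP line with a short regular expression. The sketch below is based only on the single example line shown in this entry; the pattern, the function name `parse_fp_pointer`, and the returned field names are assumptions for illustration, and other FP lines may need a more careful parser.

```python
import re

# Pattern inferred from the one example FP line above (an assumption):
# <database>:<accession> <topology><strand> <position>;
FP_POINTER = re.compile(
    r"(?P<database>[A-Z]+):(?P<accession>\S+)\s+"
    r"(?P<topology>[01])(?P<strand>[+-])\s*(?P<position>\d+);"
)


def parse_fp_pointer(line):
    """Extract the TSS pointer from an FP line, or return None if absent."""
    match = FP_POINTER.search(line)
    if match is None:
        return None
    return {
        "database": match.group("database"),
        "accession": match.group("accession"),
        # 1 = linear, 0 = circular, as stated in the note above.
        "topology": "linear" if match.group("topology") == "1" else "circular",
        "strand": match.group("strand"),
        "position": int(match.group("position")),
    }


example = "FP   Le ACC synth. ACC2 :+S EM:X59139.1 1+ 2884; 35029."
pointer = parse_fp_pointer(example)
```

For the example line, `pointer` holds the accession `X59139.1`, a linear topology, the `+` strand and position 2884, which is all that is needed to fetch a sequence window around the TSS.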

SLIDE 4

Signal Search Analysis Essentials

History: Signal Search Analysis is an ancient method developed by myself in the early eighties in Max Birnstiel’s lab in Zurich (first published in 1984).

Purpose: to discover and characterize sequence motifs that occur at constrained distances from physiologically defined sites in nucleic acid sequences.

Recent events: Adaptation of software to new environment, SSA web server, application to promoters and translational start sites.

Note the difference: SSA programs serve to characterize motifs that occur at constrained distances from sites, not motifs that are over-represented within sequence sets. There are hundreds of programs that address the latter problem, but only very few that serve the same purpose as the SSA programs!

Early comparative analysis of E.coli promoter sequences

FIG. 4. Comparison of promoter sequences (see text). b, Homologous sequence probably engaged by RNA polymerase; i, mRNA initiation point (underlined). Hyphens have been omitted. SV40, simian virus 40; w.t., wild type.

"Among the promoter sequences, there is a homologous, 7-base sequence lying to the left of the initiation points. I feel that the DNA sequence

5' T-A-T-Pu-A-T-G 3'
3' A-T-A-Py-T-A-C 5'

is implicated in the formation of a tight binary complex with RNA polymerase."

Text and Figures from: Pribnow (1975) Proc. Nat. Acad. Sci. USA 72, 784-788.

SLIDE 5

SSA Signal Search Analysis

Giovanna Ambrosini ISREC Swiss Institute for Experimental Cancer Research

History: Signal Search Analysis is a method developed by P. Bucher in the early eighties

(Bucher, P. and Bryan B., E.N.; Nucleic Acids Res, v.12(1 Pt 1): 287–305)

Purpose: to discover and characterize sequence motifs that occur at constrained distances from physiologically defined sites in nucleic acid sequences.

Signal search analysis programs:
1. CPR: generates a “constraint profile” for the neighborhood of a functional site
2. SList: generates lists of over- and under-represented motifs in particular regions relative to a functional site
3. OProf: generates a “signal occurrence profile” for a particular motif
4. PatOP: optimizes a weight matrix description of a locally over-represented sequence motif

Recent events: Adaptation of software to new environment, SSA web server, application to promoters and translational start sites

Signal Search Analysis: Sequence via a functional position set

Input data structure: primary experimental data (Functional Position Set)

annotated functional positions in DNA sequences stored in a database

Work data: a DNA sequence matrix

a set of fixed-length sequence segments with an experimentally defined site at a fixed internal position

SLIDE 6

Generating signal search data from a DNA sequence matrix

Computing a constraint profile from signal search data

SLIDE 7

Generating a constraint profile for plant promoters

Input parameters for constraint profile for plant promoters

SLIDE 8

Input menu for signal search data

Special collections: each line expands to all combinations of bases. For instance:

NNXXNN -> NNAANN NNACNN NNAGNN NNATNN NNCANN NNCCNN NNCGNN NNCTNN NNGANN NNGCNN NNGGNN NNGTNN NNTANN NNTCNN NNTGNN NNTTNN
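The expansion shown above can be reproduced with a few lines of Python. This is a sketch under the assumption that `X` marks a position to be filled with every base while `N` is left untouched, as in the example; the function name `expand_collection` is hypothetical and not part of the SSA software.

```python
from itertools import product


def expand_collection(pattern, bases="ACGT", wildcard="X"):
    """Expand every wildcard position to all base combinations.

    Other characters (including N) are copied unchanged.
    """
    slots = pattern.count(wildcard)
    expanded = []
    for combo in product(bases, repeat=slots):
        fill = iter(combo)
        # Replace wildcards from left to right with the bases of this combo.
        expanded.append(
            "".join(next(fill) if char == wildcard else char for char in pattern)
        )
    return expanded


words = expand_collection("NNXXNN")
```

With two wildcard positions and four bases this yields 16 words, in the same order as the list above (NNAANN first, NNTTNN last).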

Signal Occurrence Profile.

Is the tri-nucleotide TAT over-represented or under-represented in one of the windows? For the answer, see the next slide.

SLIDE 9

How to compute the local occurrence frequency?

Note: Each “signal” occurrence (TAT) is counted only once per sequence. Windows containing N’s do not add to the sample size. Here, occurrence frequencies are computed for non-overlapping windows; signal occurrence profiles are usually computed from overlapping windows.
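The counting rules in the note can be sketched as follows. This is an illustrative Python function, not the OProf program: the name `window_frequencies`, the 0-based window starts, and the requirement that the signal lie entirely inside the window are assumptions of the example.

```python
def window_frequencies(sequences, signal, window, step=None):
    """Local occurrence frequency of a signal in windows of aligned sequences.

    Rules taken from the note above:
      - a sequence contributes at most one occurrence per window;
      - windows containing N do not add to the sample size.
    step defaults to the window width (non-overlapping windows); a smaller
    step gives overlapping windows.
    Returns a list of (window_start, hits, sample_size, frequency) tuples.
    """
    step = window if step is None else step
    length = min(len(seq) for seq in sequences)
    profile = []
    for start in range(0, length - window + 1, step):
        hits = sample = 0
        for seq in sequences:
            segment = seq[start:start + window].upper()
            if "N" in segment:
                continue  # window excluded from the sample size
            sample += 1
            if signal in segment:
                hits += 1  # counted once, however often it occurs
        frequency = hits / sample if sample else None
        profile.append((start, hits, sample, frequency))
    return profile


toy = ["TATAAA", "GGGTAT", "NNNTAT"]  # toy data, aligned on a common site
non_overlapping = window_frequencies(toy, "TAT", window=3)
overlapping = window_frequencies(toy, "TAT", window=3, step=1)
```

In the toy data the first window has a sample size of 2 rather than 3, because the third sequence contains N's there.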

Making a Signal Occurrence Profile for the eukaryotic TATA-box: Input data and parameters

SLIDE 10

Making a Signal Occurrence Profile for the TATA-box for Eukaryotic Promoters: Result

Concept of a locally over-represented sequence motif

SLIDE 11

Definition of a Locally Over-represented Sequence Motif

Components of the formal motif description

1. A weight matrix or consensus sequence defining the motif
2. A cut-off value determining which subsequence constitutes a motif match
3. A preferred region of occurrence defined by 5’ and 3’ borders relative to a functional site, e.g. a transcription initiation site

Concept

A motif which preferentially occurs at a characteristic distance (range) from a certain type of functional position.

Example: the TATA-box is a locally over-represented sequence motif of the -30 region of eukaryotic POL II transcription initiation sites.
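The three components of the formal motif description map naturally onto a small data structure. The Python sketch below is an illustration only: the class `LocalMotif`, its field names, the scoring of unlisted bases as 0, and the toy weights are assumptions, and the matrix is not the published TATA-box matrix.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LocalMotif:
    weights: List[Dict[str, float]]  # 1. weight matrix, one dict per column
    cutoff: float                    # 2. minimum score for a motif match
    five_prime: int                  # 3. preferred region: 5' border ...
    three_prime: int                 #    ... and 3' border, relative to site

    def score(self, word):
        """Sum of column weights; bases not listed in a column score 0."""
        return sum(col.get(base, 0.0) for col, base in zip(self.weights, word))

    def matches(self, sequence, site):
        """Relative start positions of matches lying inside the region.

        site is the 0-based index of the functional position in sequence.
        """
        width = len(self.weights)
        hits = []
        # The whole match must fit between the 5' and 3' borders.
        for rel in range(self.five_prime, self.three_prime - width + 2):
            start = site + rel
            if start < 0 or start + width > len(sequence):
                continue
            if self.score(sequence[start:start + width]) >= self.cutoff:
                hits.append(rel)
        return hits


# Toy motif: exact word TATA, expected 15 to 10 bases upstream of the site.
toy_motif = LocalMotif(
    weights=[{"T": 1.0}, {"A": 1.0}, {"T": 1.0}, {"A": 1.0}],
    cutoff=4.0,
    five_prime=-15,
    three_prime=-10,
)
toy_hits = toy_motif.matches("GGTATAAAGGGGGGGGC", site=16)
```

Keeping the region borders in the motif definition is what distinguishes a locally over-represented motif from a plain weight matrix: the same word outside the preferred region is not reported as a match.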

The PATOP algorithm optimizes a locally over-represented sequence motif

SLIDE 12

A weight matrix definition for the TATA-box motif

Weight matrix for the Initiator (Cap-signal)

Positional distribution of “site-selector” promoter elements

Weight matrix definition for CCAAT box motif

A weight matrix definition for the GC-box motif

Positional distributions of the promoter upstream elements

Comparative analysis of cancer up- and down-regulated promoters

Motifs considered:

Name         Preferred position    Approx. frequency
Initiator                          25% - 50%
TATA-box     -30 to -25            ~30%
GC-box       -200 to 0             ~50%
CCAAT-box    -200 to -50           ~20%

Positional distribution of Initiator motif in cancer up- and down-regulated promoters

SLIDE 16

Positional distribution of TATA-boxes in cancer up- and down-regulated promoters

Positional distribution of GC-boxes in cancer up- and down-regulated promoters

SLIDE 17

Positional distribution of CCAAT-boxes in cancer up- and down-regulated promoters

Comparative analysis of cancer up- and down-regulated promoters: Summary of results

Signal content

Name         Frequency in cancer-up genes    Frequency in cancer-down genes
Initiator    no change                       no change
TATA-box     up                              down
GC-box       no change                       no change
CCAAT-box    up                              down

Next questions:

Are TATA-box and CCAAT-box binding factors up-regulated in cancer cells? Or do cancer-specific transcription factors (binding to adjacent sites) preferentially interact with TATA-box and CCAAT-box binding factors?

SLIDE 18

Concluding remarks

Signal search analysis has played an instrumental role in the characterization of eukaryotic promoter elements

The method was originally developed for the analysis of eukaryotic promoters but has a much broader application potential (e.g. Shine-Dalgarno signal analysis)

The rapidly growing collection of complete genomes and high-throughput methods for genomic analysis increase the statistical power to discover new motifs, or to better characterize already known control signals

Aligning sequence sets with respect to a well-characterized motif might allow the detection of binding sites of cooperating transcription factors positionally correlated with the known motif

Confirm or challenge commonly accepted hypotheses originally derived from small sets


See also, for the weight matrix definitions of the TATA-box, Initiator (Cap-signal), CCAAT-box and GC-box motifs and for the positional distributions of the “site-selector” and upstream promoter elements: Bucher (1990), J. Mol. Biol. 212, 563-578.
