Characterization of transcription factor binding sites by high-throughput SELEX

Jan 26, 2024

SLIDE 1

Characterization of transcription factor binding sites by high-throughput SELEX

Overview of the HTPSELEX Database

Transcription Factor Binding Sites: Features and Facts

Degenerate sequence motifs
Typical length: 6-20 bp
Low information content: 8-12 bits (1 site per 250-4000 bp)
Quantitative recognition mechanism: measurable affinity of different sites may vary over three orders of magnitude
Regulatory function often depends on cooperative interactions with neighboring sites
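The link between information content and site frequency quoted above can be checked with a short calculation. This is a minimal sketch that assumes a uniform random background and counts matches on one strand only; the function name is illustrative, not part of the original slides.

```python
def expected_spacing_bp(information_bits: float) -> float:
    """Expected distance in bp between chance matches of a motif.

    Assumes a uniform random sequence and a single strand: a motif
    carrying IC bits matches about once per 2**IC positions.
    """
    return 2.0 ** information_bits


for bits in (8, 12):
    spacing = int(expected_spacing_bp(bits))
    print(f"{bits} bits -> about 1 site per {spacing} bp")
```

Under these assumptions, 8 and 12 bits give 256 and 4096 bp, which matches the rounded range of 250-4000 bp on the slide.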

SLIDE 2

Representation of the binding specificity by a scoring matrix (also referred to as weight matrix)

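[Figure: weight matrix with rows A, C, G, T and columns 1-9; the individual matrix values are not recoverable from the extracted text.]

Strong binding site: C T T T G A T C T
Score: 5 + 5 + 5 + 5 + 5 + 5 + 5 + 3 + 5 = 43

Random sequence: A C G T A C G T A
Score: -10 - 10 - 13 + 5 - 10 - 15 - 13 - 11 - 6 = -83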
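The two worked scores on this slide can be reproduced with a few lines of code. The sketch below uses only the matrix entries that the two examples reveal, assuming the per-position terms are listed in sequence order; entries not shown on the slide are left out rather than invented, so other sequences cannot be scored with it. The names `KNOWN_WEIGHTS` and `score` are illustrative.

```python
# Matrix entries recovered from the slide's two worked examples.
# Each dict is one matrix column (position 1 to 9); missing bases are unknown.
KNOWN_WEIGHTS = [
    {"C": 5, "A": -10},
    {"T": 5, "C": -10},
    {"T": 5, "G": -13},
    {"T": 5},
    {"G": 5, "A": -10},
    {"A": 5, "C": -15},
    {"T": 5, "G": -13},
    {"C": 3, "T": -11},
    {"T": 5, "A": -6},
]


def score(site: str) -> int:
    """Weight matrix score: sum of the column weights of the observed bases."""
    if len(site) != len(KNOWN_WEIGHTS):
        raise ValueError("site length must equal the matrix width")
    # A KeyError here means the slide does not show that matrix entry.
    return sum(column[base] for column, base in zip(KNOWN_WEIGHTS, site))


print(score("CTTTGATCT"))  # strong binding site: 43
print(score("ACGTACGTA"))  # random sequence: -83
```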

SLIDE 3

Physical interpretation of a weight matrix

Weight matrix elements represent relative binding energies between DNA base-pairs and protein surface areas (base-pair acceptor sites). A weight matrix column describes the base preferences of a base-pair acceptor site.

Berg-von Hippel model of protein-DNA interactions

The weight matrix score expresses the binding free energy of the protein-DNA complex in arbitrary units:

$$-\Delta G(x) = S(x) + \text{const.}, \qquad S(x) = \sum_{i=1}^{N} w_i(x_i)$$

It is convenient to express the binding free energy in dimension-free −RT units:

$$\varepsilon_i(b) = -\frac{w_i(b)}{RT}, \qquad E(x) = \sum_{i=1}^{N} \varepsilon_i(x_i)$$

On a relative scale, the binding constant for sequence x is given by:

$$K_{\text{rel}}(x) = e^{-E(x)}$$

For sequences longer than the weight matrix:

$$K_{\text{rel}}(x) = \sum_{i} e^{-E(x_i \ldots x_{i+N-1})}, \qquad K_{\text{rel}}^{\max}(x) = \max_{i} e^{-E(x_i \ldots x_{i+N-1})}$$

(index i runs over all subsequence starting positions on both strands)
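How the relative binding constant of a longer sequence follows from the per-position energies can be illustrated as below. This is a sketch under stated assumptions: the 3-position energy matrix `EPS` is hypothetical (0 RT for the preferred base, 2 RT otherwise), and the function names are not from the slides. It returns both the sum over all windows and the single best window, on both strands.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def energy(site: str, eps) -> float:
    """E(x): sum of the per-position energies of the observed bases (RT units)."""
    return sum(column[base] for column, base in zip(eps, site))


def k_rel(seq: str, eps):
    """Return (sum, max) of exp(-E) over all windows on both strands."""
    width = len(eps)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the weight matrix")
    terms = []
    for strand in (seq, reverse_complement(seq)):
        for start in range(len(strand) - width + 1):
            terms.append(math.exp(-energy(strand[start:start + width], eps)))
    return sum(terms), max(terms)


# Hypothetical energy matrix preferring the site GAT.
EPS = [
    {"A": 2.0, "C": 2.0, "G": 0.0, "T": 2.0},
    {"A": 0.0, "C": 2.0, "G": 2.0, "T": 2.0},
    {"A": 2.0, "C": 2.0, "G": 2.0, "T": 0.0},
]

k_sum, k_max = k_rel("CCGATCC", EPS)
print(f"sum over windows: {k_sum:.3f}, best window: {k_max:.3f}")
```

The sum counts every possible binding frame, while the maximum keeps only the strongest one; the two agree closely when one window dominates.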

SLIDE 4

Berg-von Hippel Theory – Information Content

The energy terms of a weight matrix can be computed from the base frequencies p_i(b) found in in vitro or in vivo selected binding sites:

$$\varepsilon_i(b) = -\frac{1}{\lambda} \ln \frac{p_i(b)}{q(b)}$$

q(b) is the background frequency of base b. λ is an unknown parameter related to the stringency of the binding conditions.

The information content of a binding site has been defined as the relative entropy of the base frequency matrix with respect to the background base frequencies:

$$IC = \sum_{i=1}^{N} \sum_{b=A}^{T} p_i(b) \log_2 \frac{p_i(b)}{q(b)}$$

Paradox: λ depends on selection conditions (e.g. the protein concentration); therefore the base frequencies observed in selected binding sites do not reflect a protein-intrinsic property.
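The role of λ can be made concrete with a small calculation: each energy term is the negated log ratio of observed to background frequency, divided by λ. In the sketch below the frequency column, the uniform background, and the two λ values are all hypothetical; the point is only that the same observed frequencies map to different energies depending on the assumed stringency.

```python
import math


def energies(freqs, background, lam):
    """Energy terms: -(1/lam) * ln(p_i(b) / q(b)) for each position and base.

    Requires non-zero frequencies (use pseudocounts on real data).
    """
    return [
        {base: -math.log(column[base] / background[base]) / lam for base in column}
        for column in freqs
    ]


# Hypothetical single-position frequency column and uniform background.
FREQS = [{"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10}]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

for lam in (1.0, 2.0):
    eps = energies(FREQS, BACKGROUND, lam)
    print(lam, {base: round(value, 3) for base, value in eps[0].items()})
```

Doubling λ halves every energy term, so frequencies alone do not fix the energy scale.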

Weight matrices/profiles from a biochemical viewpoint

A weight matrix expresses the sequence specificity of a DNA-binding protein.
A column describes the base preferences of a surface area of the DNA-binding protein.
Weights of a weight matrix can be interpreted as additive binding energy contributions. No interactions between binding site positions!
According to the Berg-von Hippel theory, negated binding energies are proportional to the logarithms of the base frequencies observed in an in vivo or in vitro selected set of binding sites.
Weight matrices can thus be used to compute relative binding energies or dissociation constants for oligonucleotides of any sequence, which in turn can be experimentally determined by gel shift experiments.
An accurate weight matrix for the binding specificity of a transcription factor is one that accurately predicts binding constants.

SLIDE 5

Experimental techniques for estimating the parameters of a TF specificity matrix

Competitive bandshifts (EMSA) → rel. binding constants of oligonucleotides
Alignment of in vivo sites → base frequency matrix (from 10-100 sequences)
In vitro selection (SELEX) → base frequency matrix (up to 200 sequences)
SAGE/SELEX → base frequency matrix (up to 10’000 binding sequences)
Exhaustive mutagenesis + Krel assay → intrinsic specificity matrix
Protein binding arrays + magic algorithm → intrinsic specificity matrix

Some problems and limitations:
– A base probability matrix is generated by an alignment or probabilistic modeling algorithm → no direct observation
– Krel usually not very precise (within factor of 2)
– Point mutations may create a binding site in another frame

Modeling of a Transcription Factor Binding Site from High Throughput SELEX Data Using a Hidden Markov Modeling Approach

Emmanuelle Roulet, Nicolas Mermod (Center for Biotechnology UNIL-EPFL, Lausanne, Switzerland)
Anamaria A Camargo, Andrew JG Simpson (Ludwig Institute of Cancer Research, Sao Paulo, Brazil)
Philipp Bucher (Swiss Institute for Experimental Cancer Research and Swiss Institute of Bioinformatics, Epalinges s/Lausanne, Switzerland)

Nat. Biotechnol. 20, 831-835 (2002)

SLIDE 6

Motivation and Goals of the Project

Motivation: Accurate and reliable computational tools to predict transcription factor binding sites are still not available.

Potential reasons:
1. Lack of adequate experimental data
2. Lack of adequate computational models
3. Lack of an adequate method to estimate the parameters of a computational model from the experimental data

Goal: To develop a combined computational-experimental protocol to derive an accurate predictive model of the sequence specificity of a DNA-binding protein

Potential benefits:
1. Being able to predict transcription factor binding in genome sequences
2. Insights into molecular mechanisms of sequence-specific protein-DNA interactions
3. Ability to rationally design gene control regions of desired properties for biotechnological applications

SLIDE 7

Our Approach to the Problem of Characterizing the Sequence-Specificity of a DNA Binding Transcription Factor

1. Choice of a quantitative predictive model for representing the binding specificity. Our choice: a profile-HMM
2. Choice of an experimental method to generate data for estimating the model parameters. Our choice: a SELEX experiment
3. Choice of a machine learning algorithm to estimate the model parameters from the data. Our choice: the Baum-Welch HMM training algorithm
4. Validation of the approach and optimization of the experimental parameters by a computer simulation of steps 2 and 3
5. Adjustment of the experimental protocol to produce the necessary data as suggested by the computer simulation
6. Generation of the experimental data
7. Building a binding site model from the data
8. A posteriori validation of the model by cross-validation and comparison with independent experimental results

Study Object: Transcription Factor CTF/NFI

Dimeric DNA-binding protein recognizing a palindromic sequence motif with consensus sequence TTGGC(N5)GCCAA
First isolated as a replication factor of Adenovirus type 2
Later independently isolated as a CCAAT-box binding transcription factor
Can activate transcription of a reporter gene in transfected cells
Recently shown to be implicated in regulatory pathways related to tumor progression and immune response
Biochemical mechanism of gene regulation still elusive

SLIDE 8

Old CTF/NFI Binding Site Profile

Example: TGGGCATATAGCCAC Score: 10-1+10+10+10 +0 +10+10+10+10+9 = 88

SLIDE 9

Random sequence library

5’ –TCCATCTCTTCTGTATGTCGAGATCTA.N(25).TAGATCTCCTAACCGACTCCGTTAATT-3’

Second strand synthesis by PCR

Bgl II Bgl II

5’–TCCATCTCTTCTGTATGTCGAGATCTA.N(25).TAGATCTCCTAACCGACTCCGTTAATT-3’ 3’–AGGTAGAGAAGACATACAGATCTAGAT.N(25).ATCTAGAGGATTGGCTGAGGCAATTAA-5’

Selection of binding sequences (gel shift) Amplification Digestion Bgl II

5’ –GATCTA..N(25)..TA AT..N(25)..TACTAG-3’

Concatemerization and cloning

5’-GATCTA…N(25)…TAGATCTA…N(25)…TAGATCTA…N(25)…TA AT…N(25)…ATCTAGAT…N(25)…ATCTAGAT…N(25)…ATCTAG-3’

site 1 site 2 site 3 HTS sequencing

Primer 1 Primer 2 Selection cycles

Principle of the Baum-Welch hidden Markov model training algorithm

AACAGCGTGCCAACTAGTGATCACA CCACAACFFACGCCCAAATAACCAA GTTAGTGGACCGCTTCCAGCAATCT ATCACGGCACCCCATTTTTCTGTCT TGGTAAATTAATAATAAAACAGTGG GCGCGTGATTTGGCATCGTCCCATA AAGTTGGCTTTTCACCAATAGCGAG ...

Initial model: Training sequences: Trained model:

How does it work?
1. The initial model serves as current model.
2. Training sequences are aligned to the current model.
3. New base and transition frequencies are estimated from the multiple alignment generated by step 2. The new model becomes the current model.
4. Steps 2 and 3 are repeated until convergence is reached.
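The align-then-re-estimate loop described above can be sketched in a much simplified form. The code below is not the Baum-Welch algorithm itself: it assumes an ungapped motif of fixed width, exactly one site per sequence, the forward strand only, and a hard assignment of each sequence to its single best window (Viterbi-style training), whereas Baum-Welch weights all possible alignments by their probability and also re-estimates transition frequencies. The toy sequences, the initial model, and all function names are hypothetical.

```python
import math

BASES = "ACGT"


def best_window(seq, freqs):
    """Step 2: align a sequence to the current model (best-scoring start)."""
    width = len(freqs)

    def log_odds(start):
        return sum(math.log(freqs[i][seq[start + i]] / 0.25) for i in range(width))

    return max(range(len(seq) - width + 1), key=log_odds)


def reestimate(seqs, starts, width, pseudocount=1.0):
    """Step 3: new base frequencies from the aligned windows."""
    freqs = []
    for i in range(width):
        counts = {base: pseudocount for base in BASES}
        for seq, start in zip(seqs, starts):
            counts[seq[start + i]] += 1
        total = sum(counts.values())
        freqs.append({base: counts[base] / total for base in BASES})
    return freqs


def train(seqs, initial_freqs, max_iter=50):
    """Steps 1 and 4: iterate until the alignment no longer changes."""
    freqs, starts = initial_freqs, None
    for _ in range(max_iter):
        new_starts = [best_window(seq, freqs) for seq in seqs]
        if new_starts == starts:
            break
        starts = new_starts
        freqs = reestimate(seqs, starts, len(freqs))
    return freqs, starts


# Hypothetical initial model that weakly prefers the half-site TTGGC.
INITIAL = [
    {base: (0.7 if base == preferred else 0.1) for base in BASES}
    for preferred in "TTGGC"
]
TOY_SEQS = ["ACTTGGCAT", "TTGGCGATA", "GCATTGGCC"]

model, starts = train(TOY_SEQS, INITIAL)
consensus = "".join(max(column, key=column.get) for column in model)
print(starts, consensus)
```

On real SELEX data the soft, probability-weighted version matters, because many selected sequences contain several near-equivalent candidate windows.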

SLIDE 10

SLIDE 11

SLIDE 12

Doing the Experiment

SLIDE 13

Site statistics

Cycle         Sites   Different sites   Diff. sites (err < 0.01/bp)   Diff. sites (err < 0.001/bp)
(unlabeled)    2262    2262              1482                           825
1              1678    1678              1227                           954
2              1572    1572               731                           203
3              8813    8813              7385                          5585
4              1156    1156               552                           309
SUM           15481   15481             11377                          7876

Clone statistics

Cycle         Seq. reads   Clones   Colonies   Clones with detectable inserts
(unlabeled)    468          425      427        295
1              623          364      553        111
2              545          392      447        208
3             2234         1445     1619       1187
4              378          215      318        102

Results – CTF/NFI

New CTF/NFI model

Scoring profile (relative energy units): Hidden Markov Model (frequencies given in %):

SLIDE 14

Predicted and observed evolution of SELEX populations

Theoretically predicted affinity profiles of successive SELEX cycles (Djordjevic & Sengupta 2006)
Weight matrix scores for successive CTF/NFI HTP SELEX populations (Roulet et al. 2002)
[Figure axes: affinity, from low to high]

Major Differences between New and Old CTF/NFI Binding Site Models

The new model contains a sixth half-site position, reducing the major spacer length class to 3. This extends the consensus half-site motif to TTGGCA.
Alternative spacer length classes N4 and N5 (N6 and N7 according to the old numbering system) receive much more severe penalties in the new profile. Based on the estimated frequencies, it is not certain whether these binding modes have occurred at all during SELEX amplification.
The G mismatch at the first position of the half-site weight matrix has a much lower weight in the new model.

SLIDE 15

SLIDE 16

Quality Assessment of the New Model: Comparison of Predicted Binding Scores with in vitro measured Binding Constants

Data from Meisterernst et al. (1988). Nucl. Acids Res. 16, 4419-4435

SLIDE 17

Beyond simple weight matrices: correlated dinucleotide analysis

HTP SELEX Sequencing totals for members of the TCF family

SELEX library   Total number of sites   Total number of unique sites   % error rate <0.01% per bp   % error rate <0.001% per bp

LEF1/TCF-1α
LEF1_2           2125                    2125                           2067                        1893
LEF1_3           7064                    7046                           6169                        6263
LEF1_5           1503                    1471                           1366                        1128
LEF1_6           3072                    2500                           2327                        2144
LEF1_7            397                     379                            359                         328
SUM             14161                   13521                          12288                       11756

LEF1/TCF-1α with β-catenin
LBC_5            1967                    1963                           1833                        1700
LBC_6            6116                    5311                           5129                        4800
SUM              8083                    7274                           6962                        6500

TCF4
TCF4_3          11951                   11951                          11937                       11683

SLIDE 18

Base frequency tables for DNA binding sites of TCF family members derived by HTP SELEX

PSSM of LEF1/TCF-1α SELEX cycle 3

Position   Base   A       C       G       T
1          C      0.093   0.411   0.292   0.203
2          C      0.013   0.851   0.093   0.044
3          T      0.018   0.019   0.003   0.961
4          T      0.002   0.005   0.001   0.991
5          T      0.004   0.003   0.005   0.988
6          G      0.014   0.034   0.936   0.016
7          A      0.968   0.001   0.010   0.020
8          T      0.154   0.004   0.001   0.840
9          C      0.011   0.562   0.422   0.004
10         A      0.042   0.080   0.047   0.831

PSSM of LEF1/TCF-1α SELEX cycle 6

Position   Base   A       C       G       T
1          C      0.033   0.682   0.182   0.103
2          C      0.001   0.989   0.005   0.005
3          T      0.002   0.004   0.001   0.993
4          T      0.001   0.005   0.001   0.993
5          T      0.001   0.004   0.001   0.994
6          G      0.001   0.003   0.995   0.002
7          A      0.994   0.001   0.004   0.001
8          T      0.017   0.003   0.001   0.979
9          C      0.002   0.777   0.220   0.001
10         A      0.003   0.020   0.003   0.973
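One way to compare the two base frequency tables is to compute their information content. The sketch below uses the frequencies as transcribed from the flattened slide tables, so the assignment of values to positions and bases is an assumption. It further assumes a uniform background of 0.25 per base and renormalizes each row, since the rounded values do not sum exactly to 1. The function names are illustrative.

```python
import math

# Rows are positions 1-10; columns are A, C, G, T (transcribed from the slide).
CYCLE3 = [
    (0.093, 0.411, 0.292, 0.203), (0.013, 0.851, 0.093, 0.044),
    (0.018, 0.019, 0.003, 0.961), (0.002, 0.005, 0.001, 0.991),
    (0.004, 0.003, 0.005, 0.988), (0.014, 0.034, 0.936, 0.016),
    (0.968, 0.001, 0.010, 0.020), (0.154, 0.004, 0.001, 0.840),
    (0.011, 0.562, 0.422, 0.004), (0.042, 0.080, 0.047, 0.831),
]
CYCLE6 = [
    (0.033, 0.682, 0.182, 0.103), (0.001, 0.989, 0.005, 0.005),
    (0.002, 0.004, 0.001, 0.993), (0.001, 0.005, 0.001, 0.993),
    (0.001, 0.004, 0.001, 0.994), (0.001, 0.003, 0.995, 0.002),
    (0.994, 0.001, 0.004, 0.001), (0.017, 0.003, 0.001, 0.979),
    (0.002, 0.777, 0.220, 0.001), (0.003, 0.020, 0.003, 0.973),
]


def information_content(rows, background=0.25):
    """Sum over positions and bases of p * log2(p / q), in bits."""
    total = 0.0
    for row in rows:
        row_sum = sum(row)
        for value in row:
            p = value / row_sum  # renormalize rounded frequencies
            if p > 0:
                total += p * math.log2(p / background)
    return total


def consensus(rows):
    """Most frequent base at each position."""
    return "".join("ACGT"[row.index(max(row))] for row in rows)


ic3 = information_content(CYCLE3)
ic6 = information_content(CYCLE6)
print(consensus(CYCLE3), round(ic3, 2))
print(consensus(CYCLE6), round(ic6, 2))
```

With these inputs the cycle 6 table carries more information than the cycle 3 table, as expected for a population that has gone through more rounds of selection.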

SLIDE 19

Sequence Logos for binding sites of TCF family proteins

Lef-1 Lef-1/beta-catenin Tcf-4

Comparison of our TCF4 binding site with motif obtained by affinity measurements

Sequence Logo pasted from Hallikas et al. (2006). Cell 124:21. Motif obtained by competition assays with complete single base-substitution series. Note: at least one significant position is missing because of a priori restriction of motif extension.

SLIDE 20

Overview of HTPSELEX Database

Contents – from raw data to HMMs:

Single-read sequencing chromatograms
Clone sequences (assembled by Phred/Phrap)
Site sequences with estimated sequencing errors
HMMs for binding sites in two formats (decodeanhmm, MAMOT)

Additional features:

Quality-controlled sequence download
Access to selected low-throughput SELEX data
Experimental and computational protocols

SLIDE 21

Example of a HTPSELEX clone entry

ID   LBC_5_00003 standard; DNA; UNC; 1023 BP.
XX
AC   LBC_5_00003
XX
DT   5-Jun-2005
XX
DE   5' Sequence of SELEX/SAGE Clone : LBC_5_00003 of cycle 5
XX
KW   HTP SELEX/SAGE, invitro transcription factor binding sites
XX
OS   unidentified
OC   unidentified
XX
RN   [1]
RA   Emmanuelle Roulet, Stephane Busso, Anamaria A.Camargo, Andrew J.G Simpson,
RA   Nicolas Mermod, and Philipp Bucher.
RT   High-throughput SELEX-SAGE method for quantitative modelling of
RT   transcription-factor binding sites.
RL   Nature Biotechnology 20:831-835(2000)
XX
DR   TRACES;LBC_5_003TF.scf
XX
FH   Key Location/Qualifiers
FH
FT   source 1..1023
FT   /mol_type="unassigned DNA"
FT   /organism="unidentified"
FT   /tissue_type="SELEX"
FT   misc_binding 110..142
FT   /bound_moiety ="LEF1/TCF with beta catenin "
FT   /label="LBC_5_00003_1"
FT   /note="Base quality score is 2.8361e-03"
FT   misc_binding 143..175
FT   /bound_moiety ="LEF1/TCF with beta catenin "
FT   /label="LBC_5_00003_2"
FT   /note="Base quality score is 1.2369e-03"
XX
SQ   Sequence 1023 BP; 230 A; 291 C; 260 G; 242 T; 0 other;
     AAAACCTAAT ATAAGGGGCA GATTAGGGCC CTCTCGATGC TGCTCGAGCG GCCGCCAGTG
     TGATGGATAT CTGCAGAATT CCAGCACACT GGCGGCCGTT ACTAGTGGAT CTATTGGCGG
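The binding site annotations of such an entry can be pulled out with a small parser. The sketch below assumes the entry is stored in a line-oriented flat-file layout with one two-letter line code per line; the `ENTRY` string repeats only the feature lines of the example above, and the function name `binding_sites` is illustrative rather than part of any HTPSELEX tool.

```python
import re

ENTRY = """\
FT   source 1..1023
FT   /mol_type="unassigned DNA"
FT   misc_binding 110..142
FT   /bound_moiety ="LEF1/TCF with beta catenin "
FT   /label="LBC_5_00003_1"
FT   /note="Base quality score is 2.8361e-03"
FT   misc_binding 143..175
FT   /bound_moiety ="LEF1/TCF with beta catenin "
FT   /label="LBC_5_00003_2"
FT   /note="Base quality score is 1.2369e-03"
"""


def binding_sites(entry: str):
    """Collect misc_binding features with their qualifiers from FT lines."""
    sites, current = [], None
    for line in entry.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if body.startswith("/"):
            # Qualifier line: attach it to the current misc_binding feature.
            match = re.match(r'/(\w+)\s*="(.*)"$', body)
            if match and current is not None:
                current[match.group(1)] = match.group(2).strip()
            continue
        # Feature key line: start a new site, or ignore other feature types.
        match = re.match(r"misc_binding\s+(\d+)\.\.(\d+)$", body)
        current = None
        if match:
            current = {"start": int(match.group(1)), "end": int(match.group(2))}
            sites.append(current)
    return sites


for site in binding_sites(ENTRY):
    length = site["end"] - site["start"] + 1
    print(site["label"], site["start"], site["end"], length)
```

Both sites in this entry span 33 bp, which is consistent with a 25 bp random region flanked by the 6 bp and 2 bp constant parts left after Bgl II digestion.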