ISMB 2003: Tutorial AM4 Analysis of regulatory sequences controlling - - PowerPoint PPT Presentation
ISMB 2003: Tutorial AM4 Analysis of regulatory sequences controlling - - PowerPoint PPT Presentation
ISMB 2003: Tutorial AM4 Analysis of regulatory sequences controlling expression of biological networks Wyeth Wasserman, Albin Sandelin, Boris Lenhard Modified Materials, Sample Programs, Links Page All tutorial materials are available at
Modified Materials, Sample Programs, Links Page
All tutorial materials are available at
http://lenhard.cgb.ki.se/ISMB2003/
User: ismb2003 Pasword: am4tutorial
Tutorial AM4 Analysis of regulatory sequences controlling the expression of biological networks
Part I: Discrimination of regulatory regions in a gene or genome Part II: Discovery of novel regulatory controls for co-expressed genes Part III: Programming methods for the analysis of regulatory control sequences
Analysis of regulatory sequences controlling the expression of biological networks
Wyeth Wasserman
Albin Sandelin Boris Lenhard
Part I: Discrimination of regulatory regions in a gene or genome
Tutorial AM4 Analysis of regulatory sequences controlling the expression of biological networks
A.
Introduction to transcription in eukaryotic cells
B.
Modeling binding properties of transcription factors
C.
Resources for analysis of regulatory sequences
1.
Detection of individual TFBS
1.
Enhanced specificity via “Phylogenetic Footprinting”
2.
Clusters of Transcription Factor Binding Sites
3.
Promoter Detection
Transcription Simplified
TATA URE
URF Pol-II Three-step Process:
- 1. TF binds to TFBS (DNA)
- 2. TF catalyzes recruitment of
polII complex
- 3. Transcription initiates
I.1
Complexity in Metazoan Transcription I.2
Diverse and non-uniform use of terms: Partial glossary for tutorial
- Promoter – Sufficient to support the initiation of transcription;
- rientation dependent; includes TSS
- Regulatory Regions
- Proximal – adjacent to promoter
- Distal – some distance away from promoter (vague)
- May be positive (enhancing) or negative (repressing)
- TSS – transcription start site
- TFBS – single transcription factor binding site
- Modules – Sets of TFBS that function together
EXON
TFBS TATA
TSS
TFBS TFBS Promoter Region TFBS TFBS Distal Regulatory Region
I.3
Proximal Regulatory Region
EXON
TFBS TFBS Distal R.R.
Restrictions in Coverage of Tutorial
Focus on Eukaryotic cells.
Most of the principles apply to microbes. A set of relevant links is provided for microbial
resources in the “Resources” handout
Polymerase II driven promoters
Generally protein coding genes
All references are made to activating sequences
Information about repressive regulation is sparse
I.4
Laboratory Discovery of TF Binding Sites I.5A
LUCIFERASE LUCIFERASE LUCIFERASE LUCIFERASE LUCIFERASE LUCIFERASE LUCIFERASE ACTIVITY
Representing Binding Sites for a TF
- A set of sites represented as a consensus
- VDRTWRWWSHDWVWH (IUPAC degenerate DNA)
A 14 16 4 0 1 19 20 1 4 13 4 4 13 12 3 C 3 0 0 0 0 0 0 0 7 3 1 0 3 1 12 G 4 3 17 0 0 2 0 0 9 1 3 0 5 2 2 T 0 2 0 21 20 0 1 20 1 4 13 17 0 6 4
- A matrix describing a a set of sites
- A single site
- AAGTTAATGATTAAC
I.5B
Set of binding sites AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA Set of binding sites AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
TGCTG = 0.9 PFMs to PWMs One would like to add the following features to the model:
- 1. Correcting for the base frequencies in DNA
- 2. Weighting for the confidence (depth) in the pattern
- 3. Convert to log-scale probability for easy arithmetic
A 5 0 1 0 0 C 0 2 2 4 0 G 0 3 1 0 4 T 0 0 1 1 1 A 1.6 -1.7 -0.2 -1.7 -1.7 C -1.7 0.5 0.5 1.3 -1.7 G -1.7 1.0 -0.2 -1.7 1.3 T -1.7 -1.7 -0.2 -0.2 -0.2 f matrix w matrix Log(
)
f(b,i)+ s(n) p(b) I.6
Performance of Profiles
- 95% of predicted sites bound in vitro (Tronche 1997)
- MyoD binding sites predicted about once every 600 bp (Fickett
1995)
- My Futility Theorem
- Nearly 100% of predicted transcription factor binding sites have no
function in vivo
I.7
Predicted TFBS in 1 kbp beta-globin proximal promoter
I.8
Terms
- Specificity – The portion of predictions that are correct
- Sensitivity – The portion of “positives” that are detected
- The detection of TFBS is limited by terrible specificity. Why?
I.9
OnLine resources for the detection of TFBS
- TESS
- TRRD
- MatInspector (“Transfac”)
I.10
Phylogenetic Footprinting
70,000,000 years of evolution reveals most regulatory regions
I.11
Phylogenetic Footprinting to Identify Functional Segments
% Identity
Actin gene compared between human and mouse with DPB.
200 bp Window Start Position (human sequence)
I.12
Phylogenetic Footprinting (2)
- 0.2
0.2 0.4 0.6 0.8 1 1000 2000 3000 4000 5000 6000 7000
FoxC2
100% 80% 60% 40% 20% 0%
I.13
1kbp promoter screen with footprinting I.14
Performance: Human vs. Mouse
Testing set: 40 experimentally defined sites in 15 well
studied genes
85-95% of defined sites detected with conservation
filter, while only 11-16%of total predictions retained I.15
SELECTIVITY SENSITIVITY
Choosing the ”right” species...
(BONUS: What’s the ultimate sin in bioinformatics?)
COW MOUSE CHICKEN
HUMAN HUMAN HUMAN
I.16
Considerations in Phylogenetic Footprinting
- Alignment methods
Pairwise vs. multiple Local vs. global
- Definition of Conservation
Windows vs. significant segments
- Selection of Organisms
Evolutionary distance Retained function
I.17
OnLine Resources for Phylogenetic Footprinting
- Linked to TFBS
- ConSite
- rVISTA
- Alignments
- Blastz
- Lagan
- Avid
- ORCA Aligner/OrthoSeq
- Visualization
- SimPlot
- Vista Browser
- PipMaker
I.18
Regulatory Modules
TFs do NOT act in isolation
I.19
Liver regulatory modules I.20
Models for Liver TFs… HNF1 C/EBP HNF3 HNF4 I.21
Scan the UDPGT1 sequence for modules (Gilbert’s Syndrome)
- 0.2
0.2 0.4 0.6 0.8 1 1 5 1 9 2 1 3 3 1 7 4 2 1 5 2 5 6 2 9 7 3 3 8 3 7 9 4 2 4 6 1 5 2 5 4 3 5 8 4 Series1 Series2
Wildtype Mutant
Liver Module Model Score
“Window” Position in Sequence Liver regulatory module scores overlapping sequence variation I.22
Predicted Muscle Regulatory Module
0.1 0.2 0.3 0.4 0.5 1500 3000 4500 6000 7500 9000
Kcna7 Score
I.23
Untrained Methods
- New generation of tools to identify clusters of TFBS for user-
specified set of TFs
- Identify statistically significant clusters of sites within genomes
- See MSCAN talk on Wednesday
I.24
Considerations in Searching for Clusters of Binding Sites: Key items
- Biological motivation for grouping transcription factors
- Is there sufficient data to train a discrimination function?
- Are there binding profiles for the critical transcription factors?
I.25
OnLine Tools for Detection of Site Clusters
- MSCAN (user defined sets of TFs)
- TransRegio (liver and muscle)
- COMET/CISTER/ClusterBuster
I.26
Promoter Detection
Statistical Properties
- f Sequences
I.27
Promoter Detection
Approaches based on detection of TFBS Approaches based on sequence properties Some considerations regarding current approaches
I.28
Promoters by Detection of Binding Sites
- Early promoter detection tools were based on promoters of
small set of highly expressed genes
“TATA” Box at –30; CATT Box at –90
- Attempted to define the specific position at which RNA
transcripts are initiated
- Benchmarking test in late 1990s
Most promoter prediction tools were slightly better than random
guessing
nothing dramatically better than TATA prediction at -30
I.29
What were we doing wrong?
Grouping diverse promoters into a single mega-class Attempting to pinpoint a specific start position when
biochemical system is ambiguous
Ignoring a common observation in the laboratory-
based literature… I.30
Sequence Properties in Regions containing Promoters
- Long recognized (in labs) that a significant subset of promoters
are situated within or adjacent to regions rich in CG dinucleotides
Without selection CG dinucleotides are modified CpG islands believed to favor “open” chromatin
- A new generation of promoter detection tools (CpG-island
detectors) are based on the detection of C/G-rich regions containing over-represented strings/motifs (generally A/T-rich) identified in training data
I.31
OnLine Tools for Promoter Detection
- EpoNine
- Promoter Inspector
- FirstEF
- Others?
- Defining the likely TSS with NNPP
I.32
Looking back at part 1: Key items
- Profiles provide reasonable estimate of the potential for a TF to
bind to a sequence in vitro (i.e. in the lab)
- In vitro binding is not predictive of in vivo function (i.e. in the cell)
- Prediction of promoters with CpG islands is useful, but detection
- f the other 50% of promoters is poor
- There are two reasonable methods to improve the prediction of
individual TF binding sites
- Phylogenetic Footprinting identifies sites conserved across
evolution, improving specificity by an order of magnitude in the best cases
- Analysis of clusters of TFBS for biologically linked TFs can improve
specificity by two orders of magnitude
I.33
Wyeth Wasserman
Albin Sandelin
Boris Lenhard
Analysis of regulatory sequences controlling the expression of biological networks
Part II: Discovery of novel regulatory controls for co-expressed genes
Tutorial AM4 Analysis of regulatory sequences controlling the expression of biological networks
- A. Introduction: the problem
- B. Selection of promoter sequences
- C. Algorithms for pattern discovery
- D. Enhancing pattern discovery
II.1
Predicting binding sites for factor X in promoter Y Sliding models over sequences (pattern detection)
ATGCTATAGTGTGCACGATCGATGCTAGTGCATCAACAGCTG CAGCTGTGTCGGGAA
Analyzing shared properties in promoter sets Over-representation (pattern discovery)
Approaches in promoter analysis II.2
Definitions
Co-regulation: Genes with similar expression patterns resulting from the influence of one or more common control mechanisms
Given a set of ”co-regulated” genes, define motifs over-represented in the regulatory regions
The problem II.3
Gene Networks
What are gene networks How are gene networks defined
II.4
Definitions of Gene Networks in Genomics Data(1) Why is co-regulation
- ccuring?
Why is it frequent?
Large protein systems
are often activated simultanosly
Evolution works by
modifying existing systems
DNA DUPLICATION G1 G2 NUCLEAR DIVISION
The cell cycle II.5
Definitions of Gene Networks in Genomics Data(2)
- Protein complexes
- Pathways
- Proteins associated with specific
process
- Developmental program
II.6
Systems potentially co-regulated :
Selection of Promoter Sequences for Analysis The good, the bad and the ugly --- benchwork Microarrays Other... II.7 How do we define our set of promoters?
Microarrays Provides ’snapshots’ of mRNA levels in the cell mRNA distribution for certain genes are related Cluster genes that are expressed in similar fashion Concept: co-expressed genes believed to be co-regulated
Selection of Promoter sequences for analysis (1) II.8
Pros: Wealth of data Cons: Often noisy Expensive Secondary effects a problem Coexpression != Co-regulation
Relative concentrations of microarrays in ISMB abstracts
0.2 0.4 0.6 0.8 1 1 .2 2000 2001 2002
Relative concentrations of microarrays in ISMB abstracts
0.2 0.4 0.6 0.8 1 1 .2 2000 2001 2002
Selection of Promoter sequences for analysis (2) II.9
Other approaches: Litterature-based selection Chromatin immuno-precipitation Green Fluorescent Protein based approaches
Selection of Promoter sequences for analysis (3) II.10
Online Resources
Expression databases General : NCBI Gene Expression Omnibus EMBL ArrayExpress Stanford Microarray Database Selection of Promoter sequences for analysis (4) II.11
Functionally related proteins are
- ften co-regulated
By analyzing co-regulated genes, we aim to find shared regulatory signals Selection of co-regulated genes is non-trivial. High throughput methods tend to be too noisy for our needs Filtering selected sequences using complementary data is often benificial Selection of Promoter sequences for analysis: Key items II.12
Methods for Pattern Discovery
Word-based vs matrix-based Exhaustive Probabilistic Enhancements
II.13
Methods for Pattern Discovery
Word-based
TFBS are words Words are easily counted Pros Realistic complexity Based on well-understood statistics Cons TF binding properties are often degenerate
Matrix-based
TF:s do not bind to words Pros Matrix models are more accurate descriptions of binding preferences Cons Unrealistically expensive in computer time (if optimals are sought)
AAGTTAATGATTAAC
Exhaustive Methods for Pattern Discovery
What is an exhaustive method? Types of exhaustive methods
Word based Matrix based
II.14
Exhaustive methods(1)
Computer science: Exhaustive algorithm: All possible solutions are evaluated: often VERY CPU intensive for large indata In this context Count all possible motifs/words. Analyze
- ver-representation
II.15
Exhaustive methods(2)
Word based methods: How likely are X words in a set of sequences, given sequence characteristics?
CCCGCCGGAATGAAATCTGATTGACATTTTCC >EP71002 (+) Ce[IV] msp-56 B; range -100 to -75 TTCAAATTTTAACGCCGGAATAATCTCCTATT >EP63009 (+) Ce Cuticle Col-12; range -100 to -75 TCGCTGTAACCGGAATATTTAGTCAGTTTTTG >EP63010 (+) Ce Cuticle Col-13; range -100 to -75 TATCGTCATTCTCCGCCTCTTTTCTT >EP11013 (+) Ce vitellogenin 2; range -100 to -75 GCTTATCAATGCGCCCGGAATAAAACGCTATA >EP11014 (+) Ce vitellogenin 5; range -100 to -75 CATTGACTTTATCGAATAAATCTGTT >EP11015 (-) Ce vitellogenin 4; range -100 to -75 ATCTATTTACAATGATAAAACTTCAA >EP11016 (+) Ce vitellogenin 6; range -100 to -75 ATGGTCTCTACCGGAAAGCTACTTTCAGAATT >EP11017 (+) Ce calmodulin cal-2; range -100 to -75 TTTCAAATCCGGAATTTCCACCCGGAATTACT >EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75 TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC >EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75 ACTGAACTTGTCTTCAAATTTCAACACCGGAA >EP17012 (+) Ce hsp 16K-1 A; range -100 to -75 TCAATGCCGGAATTCTGAATGTGAGTCGCCCT >EP55011 (-) Ce hsp 16K-1 B; range
II.16
Over-representation How many words of type ’AGGAGTGA’ are found in our sequences?
[ ] ∏
=
=
k j j
a p i in begins w P
1
) (
[ ]
∏
=
+ − =
k j j w
a p k n X E
1
) ( ) 1 (
[ ] [ ]
w w w w
X Var X E X Z − =
How likely is this result?
Exhaustive methods(3) II.17
Background properties
Simple: How likely are single nucleotides? (extended Bernoulli) Complex: Neglect certain words Locations of TFBS Higher-order descriptions of DNA Exhaustive methods(4) II.18
Exhaustive methods(5) Find all words of length 7 in the yeast genome Make a lookup table: AAACCTTT 456 TTTTTTTT 57788 GATAGGCA 589 Etc...
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGG ACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGAC GGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGC CAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGT CTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAA ATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTG GGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAA GTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTT TGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGA AAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCA AAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATC ACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGT CGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAA GGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAG ATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCAT CATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAA ACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTG AAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTG TCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTC TCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGT ATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCT AGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGT CTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCT GCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTG CTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA
II.19
Matrix based methods
cagagcgatAGGTCAacgataatat gcgatagcaAGGTCGccccgtatag aacttggttAGGTCAttagcgagta ggggatgggCCCTCAaatacgcgga aaccggaagGGTTCAacgatctatt
A 3 0 0 0 0 4 C 1 1 1 0 5 0 G 1 4 3 0 0 1 T 0 0 1 5 0 0
= local multiple alignment No current exhaustive methods, due to NP-completeness
Exhaustive methods(6) II.20
Resources Moby Dick (Bussemaker et al) (not online) Dyad analysis (van Helden et al) YMF (Sinha and Tompa)
Exhaustive methods(7) II.21
- Algorithms with high complexity - Large
sequences and/or many possible word lengths not possible
- Often word-based
- TFBS are not words (’fuzzy’ binding)
- Sensitivity susceptible to noisy indata
(e.g. microarrays) Exhaustive methods: Key items II.22
Probabilistic Methods for Pattern Discovery
What is a probabilistic method? The Gibbs sampler algorithm Improving background models
II.23
Probabilistic Methods for Pattern Discovery(1)
Computer science: Probabilistic algorithm: uses randomness Bioinformatics: Probabilistic algorithm often the same as Monte Carlo algorithm: an approximation algorithm that always is fast but does not always give the best solution
II.24
Motivation:
TFBS are not words Efficiency Can be intentionally influenced by biological data
Overview:
Find a local alignment of width x of sites that maximizes information content in reasonable time Usually by Gibbs sampling or EM methods Probabilistic Methods for Pattern Discovery(2) II.25
Two data structures used: 1) Current pattern nucleotide frequencies qi,1,..., qi,4 and corresponding background frequencies pi,1,..., pi,4 2) Current positions of site startpoints in the N sequences a1, ..., aN , i.e. the alignment that contributes to qi,j. One starting point in each sequence is chosen randomly initially.
The Gibbs Sampling algorithm
tgacttcc tgatctct agacctca tgacctct
Probabilistic Methods for Pattern Discovery(3) II.26
Iteration step Remove one sequence z from the
- set. Update the current pattern
according to
tgacttcc tgatctct agacctca tgacctct
B N b c q
j j i j i
+ − + = 1
, ,
Pseudocount for symbol j Sum of all pseudocounts in column
Probabilistic Methods for Pattern Discovery(4) A ’Score’ the current pattern against each possible occurence ak in z. Draw a new ak with probabilities based on respective score divided by the background model B II.27 z
Probabilistic Methods for Pattern Discovery(5)
10 12 14 16 18 100 200 300 400 500 600
SEQUENCE LENGTH PATTERN SIMILARITY
- vs. TRUE MEF2 PROFILE
True Mef2 Binding Sites
Sensitivity weaknesses: ’Pattern drowning’ II.28
Correction for background properties Workman & Stormo (ANN-Spec) – Train on background set as well to find ’commonly occuring’ patterns. Maximization of probabililty of finding pattern in positive sequences and not in background seqsequences In effect: Try to discriminate between ’common’ and ’novel’ patterns Thijs et al, Bailey and Elkan Markov background model describing DNA in m:th order Probabilistic Methods for Pattern Discovery(6) II.29
What is a higher-order background model? Zero-order:
p(A)=0.29, p(C)=0.21, p(G)=0.21, p(T)=0.29
∏
=
=
N i i
nucleotide P seq P
... 1
) ( ) (
First-order:
A
A
T C G A
m:th-order: The chance of drawing base x is dependant on the identity of the previous m bases Probabilistic Methods for Pattern Discovery(7) II.30
Online resources Gibbs Motif Sampler(Lawrence et al) MEME(Bailey and Elkan) AnnSpec(Workman and Stormo) AlignAce(Roth et al) Probabilistic Methods for Pattern Discovery(8) II.31
Gibbs Sampling/EM algorithms
- Complexity is moderate.
Optimality not guaranteed.
- Low sensitivity: patterns
’drown’ in large sequences (~>500 bp)
- Sensitivity susceptible to noisy
input data (e.g. microarrays) Probabilistic Methods for Pattern Discovery: Key items II.32
Enhancing pattern discovery sensitivity
Cross-species comparison Modelling of pattern constraints
information content structural constraints palindromicity
Usage of prior knowledge
II.33
Search only where TFBS are likely: Cross-species comparison Use as filtering Or include orthologous sequences in analysis
Enhancing pattern detection sensitivity (1) II.34
Information segmentation
Information content distributions of TFBS are distinctly non-random
(Wasserman et al 2000) Palindromicity, dyads (van Helden et al 2000) Variable gaps (Hu 2003)
TFBSs are not randomly drawn
Enhancing pattern detection sensitivity (2) II.35
Building in biological knowledge in pattern finding - priors
How do priors work? Essentially by increasing the psudocounts by some fraction submitted in the prior Enhancing pattern detection sensitivity (3) A certain residue is according to our prior knowledge an A in 47/100 cases. New pseudocount for first residue, A: 50/100 x k x#number of sites Example: II.36
’Biasing’ probabilistic pattern finder with prior knowledge - an unexplored area Examples Structural constraints (example: HLH factors have certain shared binding preferences) Information content ’landscape’ Enhancing pattern detection sensitivity (4) II.37
Example of enhanced sensitivity using biologically based priors
use prior no prior background (no sites)
Enhancing pattern detection sensitivity (5) II.38
Difficulties in pattern detection are attributed to: Low signal strength Cross species comparison as filters Complexity of background DNA New >0 order models Simplified models of TFBS properties Dyads, palindromes Segmentation of information content distributions Usage of biological knowledge as priors Enhancing pattern detection sensitivity: Key items II.39
Evaluation of Patterns
How relevant is our new pattern? Algorithms for pattern comparison
II.40
Pattern finders can generally NOT distinguish between patterns and over-represented ’junk’
Q: How do we know if a pattern is true A: In the lab! Q:How can we avoid labwork? A: Compare to already known patterns! Q: How do we know a function of a gene A: In the lab! Q:How can we avoid labwork? A: Compare to already known sequences! The sequence analogy
Evaluation of patterns(1) II.41
Algorithms for pattern comparison
Sandelin & Wasserman Needleman-Wunsch variant Hughes et al Based on protein BLOCKS alignment algorithm (Pietrokowski) Evaluation of patterns(2) II.42
Online Resources
CompareAce (Hughes et al) JASPAR (Sandelin & Wasserman) Integrated systems (yeast) Atlas YRSA Evaluation of patterns(3) II.43
Pattern discovery is based on finding shared TFBS for co-regulated genes Co-regulated sequences can be based on different experiments and different clustering analysis algorithms Algorithms for pattern discovery can be exhaustive probabilistic Enhanced sensitivity can be achieved by Cross-species comparison Information segmentation Biologically based priors New patterns can be compared to verified patterns
Looking back at part II: II.44
Wyeth Wasserman Albin Sandelin
Boris Lenhard
Analysis of regulatory sequences controlling the expression of biological networks
Part III: Programming Resources for the Analysis of Regulatory Sequences
Section 3.1
Orientation
Objectives TFBS –Basic features Prerequisites for this part of the tutorial III.1
3.1 Orientation
Objectives
- Purpose: Introduction to the tools and “discipline” of
practical analysis of regulatory regions and networks
- After mastering the material presented here, you should
be able to script and automate computational methods for the analysis of regulatory sequences.
- More advanced Perl users will be provided with enough
information to contribute extensions and enhancements to the existing computational framework for regulatory sequence analysis.
III.2
3.1 Orientation
Regulatory regions problem space
Sets of binding sites
AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA
Sets of binding sites
AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA
Specificity profiles for binding sites
A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ]
Specificity profiles for binding sites
A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ]
Clusters of binding sites Clusters of binding sites Transcription factors Transcription factor binding sites Regulatory nucleotide sequences Transcription factors Transcription factor binding sites Regulatory nucleotide sequences
TATA URE
URF Pol-II
III.3
3.1 Orientation
TFBS – basic features
- A computational framework for transcription factor binding site
analysis and manipulation
- Implemented in Perl
- Models and manipulates common objects from the regulatory
regions object space
- Patterns representing TF specificity
- Sets of patterns
- Detected binding sites
- Sets of detected binding sites
- Unified DB interface to pattern databases
- Pattern generators
- Enables easy scripting and automation of regulatory region
analysis
III.4
3.1 Orientation
Prerequisites for full benefit from this part of the tutorial
- Several concepts covered in parts I and II
- Profile matrices (PFM, ICM, PWM)
- phylogenetic footprinting
- Basic to intermediate knowledge of Perl programming
- General syntax and semantics
- References and complex data structures
- Basic knowledge of Unix OSs and an access to one is needed
at the moment
III.5
Section 3.2
Perl and Bioperl
Perl and Perl objects Bioperl Selected Bioperl objects III.6
Annotation Sequence features Sequence
3.2 Perl and Bioperl
Bioinformatics problem|object space
LOCUS SARS 1942 bp mRNA linear PRI 04-FEB-2003 DEFINITION Homo sapiens seryl-tRNA synthetase (SARS), mRNA. ACCESSION NM_006513 VERSION NM_006513.2 GI:16306547 KEYWORDS . SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE 1 (bases 1 to 1942) AUTHORS Hartlein,M. and Cusack,S. TITLE Structure, function and evolution of seryl-tRNA synthetases: implications for the evolution of aminoacyl-tRNA synthetases and the genetic code JOURNAL J. Mol. Evol. 40 (5), 519-530 (1995) MEDLINE 95302522 PUBMED 7540217 REFERENCE 2 (bases 1 to 1942) AUTHORS Vincent,C., Tarbouriech,N. and Hartlein,M. TITLE Genomic organization, cDNA sequence, bacterial expression, and purification of human seryl-tRNA synthase JOURNAL Eur. J. Biochem. 250 (1), 77-84 (1997) MEDLINE 98092290 PUBMED 9431993 COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from BC000716.1, BC009390.1 and X91257.1. On Oct 22, 2001 this sequence version replaced gi:5730028. Summary: This gene belongs to the class II amino-acyl tRNA family. The encoded enzyme catalyzes the transfer of L-serine to tRNA (Ser) and is related to bacterial and yeast counterparts. COMPLETENESS: complete on the 3' end. FEATURES Location/Qualifiers source 1..1942 /organism="Homo sapiens" /db_xref="taxon:9606" /chromosome="1" /map="1p13.3-p13.1" gene 1..1942 /gene="SARS" /note="synonyms: SERS, SERRS" /db_xref="LocusID:6301" /db_xref="MIM:607529" misc_feature 79..234 /gene="SARS" /note="Seryl_tRNA_N; Region: Seryl-tRNA synthetase N-terminal domain. This domain is found associated with the Pfam tRNA synthetase class II domain (pfam00587) and represents the N-terminal domain of seryl-tRNA synthetase" /db_xref="CDD:pfam02403" misc_feature 82..1482 /gene="SARS" /note="Region: Seryl-tRNA synthetase [Translation, ribosomal structure and biogenesis]" /db_xref="CDD:COG0172" misc_feature 640..1167 /gene="SARS" /note="ThrS; Region: Threonyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]" /db_xref="CDD:COG0441" misc_feature 649..1152 /gene="SARS" /note="tRNA-synt_2b; Region: tRNA synthetase class II core domain (G, H, P, S and T). Other tRNA synthetase sub-families are too dissimilar to be included. This domain is the core catalytic domain of tRNA synthetases and includes glycyl, histidyl, prolyl, seryl and threonyl tRNA synthetases" /db_xref="CDD:pfam00587" misc_feature 664..1161 /gene="SARS" /note="ProS; Region: Prolyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]" /db_xref="CDD:COG0442" polyA_signal 1831..1836 /gene="SARS" polyA_site 1852 /gene="SARS" /evidence=experimental polyA_site 1895 /gene="SARS" /evidence=experimental BASE COUNT 530 a 460 c 538 g 414 t ORIGIN 1 gcagtgcggc ggtcacaggc tgagtgctgc ggcgcgatcc ttgcttccct gagcgttggc 61 ccgggaggaa agaagatggt gctggatctg gatttgtttc gggtggataa aggaggggac 121 ccagccctca tccgagagac gcaggagaag cgcttcaagg acccgggact agtggaccag 181 ctggtgaagg cagacagcga gtggcgacga tgtagatttc gggcagacaa cttgaacaag 241 ctgaagaacc tatgcagcaa gacaatcgga gagaaaatga agaaaaaaga gccagtggga 301 gatgatgagt ctgtcccaga gaatgtgctg agtttcgatg accttactgc agacgcttta 361 gctaacctga aagtctcaca aatcaaaaaa gtccgactcc tcattgatga agccatcctg 421 aagtgtgacg cggagcggat aaagttggaa gcagagcggt ttgagaacct ccgagagatt 481 gggaaccttc tgcacccttc tgtacccatc agtaacgatg aggatgtgga caacaaagta 541 gagaggattt ggggtgattg tacagtcagg aagaagtact ctcatgtgga cctggtggtg 601 atggtagatg gctttgaagg cgaaaagggg gccgtggtgg ctgggagtcg agggtacttc 661 ttgaaggggg tcctggtgtt cctggaacag gctctcatcc agtatgccct tcgcaccttg 721 ggaagtcggg gctacattcc catttatacc ccctttttca tgaggaagga ggtcatgcag 781 gaggtggcac agctcagcca gtttgatgaa gaactttata aggtgattgg caaaggcagt 841 gaaaagtctg atgacaactc ctatgatgag aagtacctga ttgccacctc agagcagccc 901 attgctgccc tgcaccggga tgagtggctc cggccggagg acctgcccat caagtatgct 961 ggcctgtcta cctgcttccg tcaggaggtg ggctcccatg gccgtgacac ccgtggcatc 1021 ttccgagtcc atcagtttga gaagattgaa cagtttgtgt actcatcacc ccatgacaac 1081 aagtcatggg agatgtttga agagatgatt accaccgcag aggagttcta ccagtccctg 1141 gggattcctt accacattgt gaatattgtc tcaggttctt tgaatcatgc tgccagtaag 1201 aagcttgacc tggaggcctg gtttccgggc tcaggagcct tccgtgagtt ggtctcctgt 1261 tctaattgca cggattacca ggctcgccgg cttcgaatcc gatatgggca aaccaagaag 1321 atgatggaca aggtggagtt tgtccatatg ctcaatgcta ccatgtgcgc cactacccgt 1381 accatctgcg ccatcctgga gaactaccag acagagaagg gcatcactgt gcctgagaaa 1441 ttgaaggagt tcatgccgcc aggactgcaa gaactgatcc cctttgtgaa gcctgcgccc 1501 attgagcagg agccatcaaa gaagcagaag aagcaacatg agggcagcaa aaagaaagca 1561 gcagcaagag acgtcaccct agaaaacagg ctgcagaaca tggaggtcac cgatgcttga 1621 acattcctgc ctccctattt gccaggcttt catttctgtc tgctgagatc tcagagcctg 1681 cccaacagca gggaagccaa gcacccattc atccccctgc ccccatctga ctgcgtagct 1741 gagaggggaa cagtgccatg taccacacag atgttcctgt ctcctcgcat gggcataggg 1801 acccatcatt gatgactgat gaaaccatgt aataaagcat ctctggggag ggcttaggac 1861 tcttcctcag tcttcttccc cgggcttgaa ccccgaaaaa aaaaaaaaaa aaaaaaaaaa 1921 aaaaaaaaaa aaaaaaaaaa aa // TBLASTN 2.2.3 [May-13-2002] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= (958 letters) Database: /opt/cgb/PEGASUS/BLAST/Pegasus_transcript 3265 sequences; 7,560,541 total letters Searching.......done Score E Sequences producing significant alignments: (bits) Value 9550__TRANSCRIPT__2305 KIAA1464 protein [OTHER] 68 8e-12 10825__TRANSCRIPT__1863 RAN binding protein 9 [OTHER] 66 3e-11 9615__TRANSCRIPT__2321 DEAD/H (Asp-Glu-Ala-Asp/His) box polypept... 39 0.005 >9550__TRANSCRIPT__2305 KIAA1464 protein [OTHER] Length = 1488 Score = 67.8 bits (164), Expect = 8e-12 Identities = 41/112 (36%), Positives = 57/112 (50%), Gaps = 9/112 (8%) Frame = +1 Query: 504 GFDLNVFGYCGFDGLITNSTEQSKEYAKPFGRDDVIGCGINFIDGSIFFTKNGIHLGNAF 563 G+D + +GY G DG S+ + Y F DVIGC +N I+G+ F+TKNG LG Sbjct: 1 GWDKHSYGYHGDDGHSFCSSGTGQPYGPTFTTGDVIGCCVNLINGTCFYTKNGHSLGVCI 180 Query: 564 TDLN--------DLEFVPYVALR-PGNSIKTNFGLNEDFVFDIIGYQDKWKS 606 DL P V L+ PG + NFG + F+FDI Y +W++ Sbjct: 181 RDLGGSALWSHFGWNLYPTVGLQTPGEIVDANFG-QQPFLFDIEDYMREWRA 333 Score = 45.4 bits (106), Expect = 4e-05 Identities = 29/118 (24%), Positives = 51/118 (42%) Frame = +1 Query: 708 GSLPNTLNVMINDYLIHEGLVDVAKGFLKDLQKDAVNVNGQHSESKDVIRHNERQIMKEE 767 G L M++ YL+H G A F R E I +E+ Sbjct: 379 GEWQAVLQNMVSSYLVHHGYCATATAF---------------------ARMTETPIQEEQ 495 Query: 768 RMVKIRQELRYLINKGQISKCINYIDNEIPDLLKNNLELVFELKLANYLVMIKKSSSK 825 +K RQ+++ L+ +G++ + I P LL++N L+F LK ++ M+ + S+ Sbjct: 496 ASIKNRQKIQKLVLEGRVGEAIETTQRFYPGLLEHNPNLLFMLKCRQFVEMVNGTDSE 669 >10825__TRANSCRIPT__1863 RAN binding protein 9 [OTHER] Length = 1107 Score = 65.9 bits (159), Expect = 3e-11 Identities = 45/131 (34%), Positives = 65/131 (49%), Gaps = 1/131 (0%) Frame = +1 Query: 485 GVSAMSLNVDGSINKCQKYGFDLNVFGYCGFDGLITNSTEQSKEYAKPFGRDDVIGCGIN 544 G+SA +N++ + G+D + +GY G DG S+ + Y F DVIGC +N Sbjct: 10 GLSAQGVNMN------RLPGWDKHSYGYHGDDGHSFCSSGTGQPYGPTFTTGDVIGCCVN 171 Query: 545 FIDGSIFFTKNGIHLGNAFTDLNDLEFVPYVALR-PGNSIKTNFGLNEDFVFDIIGYQDK 603 I+ + F+TKNG L + L P V L+ PG + NFG FVFDI Y + Sbjct: 172 LINNTCFYTKNGHSLDVKYAILQP-NLYPTVGLQTPGEVVDANFG-QHPFVFDIEDYMRE 345 Query: 604 WKSLAYEHICR 614 W++ I R Sbjct: 346 WRTKIQAQIDR 378 Score = 45.1 bits (105), Expect = 6e-05 Identities = 26/119 (21%), Positives = 52/119 (42%) Frame = +1 Query: 707 DGSLPNTLNVMINDYLIHEGLVDVAKGFLKDLQKDAVNVNGQHSESKDVIRHNERQIMKE 766 +G + M++ YL+H G A+ F R ++ +++E Sbjct: 397 EGEWQTMIQKMVSSYLVHHGYCATAEAF---------------------ARSTDQTVLEE 513 Query: 767 ERMVKIRQELRYLINKGQISKCINYIDNEIPDLLKNNLELVFELKLANYLVMIKKSSSK 825 +K RQ ++ L+ G++ + I P LL+ N L+F LK+ ++ M+ + S+ Sbjct: 514 LASIKNRQRIQKLVLAGRMGEAIETTQQLYPSLLERNPNLLFTLKVRQFIEMVNGTDSE 690 >9615__TRANSCRIPT__2321 DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 1 [SIGNALING] Length = 2706 Score = 38.5 bits (88), Expect = 0.005 Identities = 31/90 (34%), Positives = 43/90 (47%), Gaps = 4/90 (4%) Frame = +1 Query: 510 FGYCGFDGLITNS-TEQSKEYAKPFGRDDVIGCGINFIDGSIFFTKNGIHLGNAF---TD 565 FG+ GF G S +Q Y + F D IGC ++ G + F+KNG LG AF Sbjct: 775 FGF-GFGGTGKKSHNKQFDNYGEEFTMHDTIGCYLDIDKGHVKFSKNGKDLGLAFEIPPH 951 Query: 566 LNDLEFVPYVALRPGNSIKTNFGLNEDFVF 595 + + P L+ +K NFG E+F F Sbjct: 952 MKNQALFPACVLK-NAELKFNFG-EEEFKF 1035 Database: /opt/cgb/PEGASUS/BLAST/Pegasus_transcript Posted date: Nov 13, 2002 1:39 PM Number of letters in database: 7,560,541 Number of sequences in database: 3265 Lambda K H 0.315 0.134 0.381 Gapped Lambda K H 0.267 0.0410 0.140 Matrix: BLOSUM62 Gap Penalties: Existence: 11, Extension: 1 Number of Hits to DB: 5,635,301 Number of Sequences: 3265 Number of extensions: 64670 Number of successful extensions: 222 Number of sequences better than 1.0e-01: 6 Number of HSP's better than 0.1 without gapping: 60 Number of HSP's successfully gapped in prelim test: 11 Number of HSP's that attempted gapping in prelim test: 156 Number of HSP's gapped (non-prelim): 76 length of query: 958 length of database: 2,520,180 effective HSP length: 99 effective length of query: 859 effective length of database: 2,196,945 effective search space: 1887175755 effective search space used: 1887175755 frameshift window, decay const: 50, 0.1 T: 13 A: 40 X1: 16 ( 7.3 bits) X2: 38 (14.6 bits) X3: 64 (24.7 bits) S1: 42 (22.0 bits) S2: 77 (34.3 bits)BLAST report
Hit Hit Hit
HSP HSP HSP HSP HSP
Examples: BLAST and genbank sequence: combination of hierarchical and sequential elements: can be used to represent most complex data entities encountered in bioinformatics
III.7
3.2 Perl and Bioperl
Perl and Perl objects (1/4)
Programmer’s efficiency For most tasks in bioinformatics, our time is much more valuable than CPU time Highly suitable for biological sequence (string) manipulation Built-in regular expressions Abundance of existing, reusable code Bioperl Numerical methods Web programming Objects make programming faster and easier Complexity hiding, Abstraction, Code reuse
III.8
3.2 Perl and Bioperl
Perl and Perl objects (2/4)
use Bio::Seq; # import the module my $seqobj = Bio::Seq->new(-id => “MYSEQ_001”,
- seq => “ACGCTAGGGATGGATAGGGATGGA”);
print $seqobj->seq; # prints “ACGCTAGGGATGGATAGGGATGGA” print $seqobj->id; # prints “MYSEQ_001” my $revseqobj = $seqobj->revcom; # creates a new Bio::Seq object # with reverse complement of the # original sequence print $revseqobj->seq; # prints “TCCATCCCTATCCATCCCTAGCGT”
Creating and using an object – an example of Bio::Seq
III.9
3.2 Perl and Bioperl
Perl and Perl objects (4/4)
Bio::Seq=HASH(0x8330dcc) '_as_feat' => ARRAY(0x831f6f4) empty array '_root_verbose' => 0 'annotation' => Bio::Annotation::Collection=HASH(0x831f850) '_annotation' => HASH(0x831f634) empty hash '_root_verbose' => 0 '_typemap' => Bio::Annotation::TypeManager=HASH(0x831f928) '_root_verbose' => 0 '_type' => HASH(0x831f79c) 'comment' => 'Bio::Annotation::Comment' 'dblink' => 'Bio::Annotation::DBLink' 'reference' => 'Bio::Annotation::Reference' 'primary_seq' => Bio::PrimarySeq=HASH(0x831f6e8) '_root_verbose' => 0 'alphabet' => 'dna' 'display_id' => 'MYSEQ_001' 'seq' => 'ACGCTAGGGATGGATAGGGATGGA'
$seqobj :
internal structure Good news: you don’t have to care about it.
Bio::Seq object
Bio::SeqFeature::Exon Bio::SeqFeature::Exon Bio::SeqFeature::Generic Bio::SeqFeature::CDS Bio::Annotation::Collection Bio::PrimarySeq alphabet “dna” display_id “MYSEQ_001” seq “ACGCTAGGGATGGATAGGGATGGA”
...
empty
$seqobj :
“logical” structure: primary sequence (basic info) annotation collection list of sequence features All the components are accessed via specific methods: $seqobj->each_SeqFeature # returns the list of Bio::SeqFeature objects $seqobj->seq # returns the string “ACGCTAGGGATGGATAGGGATGGA” $seqobj->annotation # returns the Bio::AnnotationCollection object $seqobj->display_id # returns the string “MYSEQ_001”
III.10
3.2 Perl and Bioperl
BioPerl (1/5)
- A library of Perl modules for managing and manipulating life-
science information
- Biological sequences and alignments
- Reading and writing to files
- Retrieval from existing databases
- Running blast searches and the processing of results
- Much, much more....
- Reduces otherwise complex tasks to only a few lines of code
- http://bioperl.org
III.11
3.2 Perl and Bioperl
BioPerl (2/5)
Some of the most frequently used classes of Bioperl objects
Bio::Seq
seq() length() start() end() strand() desc() revcom() trunc() id() accession_number() species() annotation() all_SeqFeatures() add_SeqFeature() ...
Bio::SeqIO
next_seq() write_seq()
Bio::Tools::Run::StandAloneBlast
exists_blast() blastall() blasrpgp() bl2seq()
Bio::DB::EMBL
get_Seq_by_acc() get_Seq_by_id() get_Seq_by_gi get_Stream_by_acc() get_Stream_by_id() get_Stream_by_gi() proxy() ua() ...
Bio::SeqFeature::Generic
location() start() end() length() strand() score() frame() add_tag_value() each_tag_value() ...
III.12
3.2 Perl and Bioperl
BioPerl (3/5)
Examples of bioperl usage
(adapted from Stajich et al., Genome Res. 2002)
use Bio::DB::EMBL; # import the required modules use Bio::SeqIO; my $db = Bio::DB::EMBL->new(); # $db is a Bio::DB::EMBL object my $seqobj = $db->get_Seq_By_acc(“U14680”); # $seqobj is a Bio::Seq object my $seqout = Bio::SeqIO->new(-format=>”genbank”); # $seqout is a new Bio::SeqIO object if (defined $seqobj) { # “if the database retrieval has succeeded” $seqout->write_seq($seqobj); # write sequence data from $seqobj } # to STDOUT(default) in genbank format
To retrieve a sequence from EMBL and print it in GenBank format:
(adapted from Stajich et al., Genome Res. 2002)
III.13
3.2 Perl and Bioperl
BioPerl (3/5)
Examples of bioperl usage Hierarchical structure of a BLAST report is decomposed into a hierarchy of object:
Bio::SearchIO::Result::BLAST
Bio::Search:: Hit::BlastHit Bio::PrimarySeq alphabet “dna” display_id “MYSEQ_001” seq “ACGCTAGGGATGGATAGGGATGGA”
...
Bio::Search:: Hit::BlastHit Bio::Search:: Hit::BlastHit
Bio::SearchIO::Hit::BlastHit
Bio::Search:: HSP
...
Bio::Search:: HSP Bio::Search:: HSP name “M18833” description “Homo sapiens presenilin B” “”
Bio::SearchIO::Hit::HSP
name “M18833” description “MYSEQ_001” seq “ACGCTAGGGATGGATAGGGATGGA”
III.14
use Bio::SearchIO; my $search = Bio::SearchIO->new (-format => ‘blast’, -file => ‘report.bls’); my @HitsToSave = (); my ($cutoff_Evalue, $cutoff_Len) = (1e-3, 120); while (my $result = $search->next_result) { while (my $sbjct = $result->next_hit) { while (my $hsp = $sbjct->next_hsp) { if ($hsp->evalue < $cutoff_Evalue and $hsp->length(‘total’) >= $cutoff_Len) { push @HitsToSave, $sbjct; last; } } } } print “Hits:\n”; foreach my $sbjct (@HitsToSave) { print $sbjct->name, “\n”; }
3.2 Perl and Bioperl
BioPerl (3/5)
Examples of bioperl usage
(adapted from Stajich et al., Genome Res. 2002)
# import the required modules # $search is a Bio::SearchIO::blast object # $seqout is a new Bio::SeqIO object # cutoff criteria for hits to save # retrieve the result of one BLAST run # iterate through hits (“subjects”) # iterate through HSPs of a hit(subject) # apply cutof criteria # # # add those that pass cuttofs to a list # test only the first(=best) HSP # # # # # # loop through saved hits # print hit name (usually sequence ID)
To extract hits with E-value ≤ 10-3 and length ≥ 120 bp from a BLAST report:
(adapted from Stajich et al., Genome Res. 2002)
III.15
3.2 Perl and Bioperl
BioPerl (5/5)
Getting started with Bioperl
Bioperl tutorial
http://bioperl.org/bptutorial
Online documentation
http://doc.bioperl.org
man | perldoc pages
$ man Bio::Tools::Run::StandAloneBlast $ perldoc Bio::Graphics::Panel
Bioperl mailing list
bioperl-l@bioperl.org
III.16
3.2 Perl and Bioperl
Key ideas
- If you are not using bioperl, start using it
– it will increase your productivity and take much tedium out of coding
- When writing new bioinformatics code,
make it
- bject-oriented
- publicly available
- bioperl-compliant
III.17
Section 3.3
Analysis of regulatory regions with TFBS
Regulatory regions problem space Basic features of TFBS modules Relationship to BioPerl Representing patterns: matrix objects Detecting binding sites in a single sequence Phylogenetic footprints Pattern discovery Storing and retrieving patterns
III.18
3.3 Analysis of regulatory regions with TFBS
Regulatory regions problem space
Sets of binding sites
AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA
Sets of binding sites
AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA
Specificity profiles for binding sites
A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ]
Specificity profiles for binding sites
A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ]
Clusters of binding sites Clusters of binding sites Transcription factors Transcription factor binding sites Regulatory nucleotide sequences Transcription factors Transcription factor binding sites Regulatory nucleotide sequences
TATA URE
URF Pol-II
III.19
3.3 Analysis of regulatory regions with TFBS
Basic Features of TFBS modules
- Patterns representing TF specificity
- Sets of patterns
- Detected binding sites
- Sets of detected binding sites
- Unified DB interface to pattern databases
- Pattern generators
http://forkhead.cgb.ki.se/TFBS/
III.20
3.3 Analysis of regulatory regions with TFBS
Patterns
- In TFBS, a pattern is a generalized representation of a particular
sequence motif. It can be e.g. :
- At the moment, TFBS implements only matrix profiles.
- TFBS contains a set of classes for pattern construction/discovery
(TFBS::PatternGen::*)
Matrix profile
A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ] A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ]
Consensus
CBKGGTGACGTM
Regular expression
/C[^A][GT]GGTGACG[AC]/
III.21
3.3 Analysis of regulatory regions with TFBS
Classes and their roles
Class Description TFBS::Matrix:: PFM Position frequency (raw count) matrix; can be transformed to other matrix types TFBS::Matrix::ICM Information content matrix; used for drawing sequence logos TFBS::Matrix::PWM Position weight matrix; used for scanning nucleotide sequences and alignme nts thereof (phylogenetic footprinting) TFBS::Site A sequence feature object for transcription factor binding site TFBS::SiteSet An aggregate of TFBS::Site objects, with methods for manipulation of the set TFBS::SitePair A sequence feature object for pa ir of transcription factor binding sites in orthologous sequences TFBS::SitePairSet An aggregate of TFBS::SitePair objects, with methods for manipulation of the set TFBS::DB::FlatFileDir Read/write interface to a simple, flat file database of matrix obje ct data TFBS::DB::JASPAR2 Read/write interface to a MySQL database of matrix
- bject data
TFBS::DB::TRANSFAC Read -only interface to public TRANSFAC database
- n the WWW.
TFBS::PatternGen::SimplePFM Pattern generator that creates a position weight matrix from a set of short sequences of fixed size TFBS::PatternGen::Gibbs Pattern generator that uses Gibbs
III.22
3.3 Analysis of regulatory regions with TFBS
Relationship to BioPerl
- Tightly integrated with BioPerl
- All objects have Bio::Root::Root as a base class
- TFBS::Site IS A Bio::SeqFeature::Generic
- TFBS::SitePair IS A Bio::SeqFeature::FeaturePair
- All sequences can be passed to methods as Bio::Seq objects, all
alignments as Bio::SimpleAlign objects, and are internally manipulated as such
- Tries to follow Bioperl coding conventions
III.23
Given N sequence fragments of fixed length, one can assemble a position frequency matrix (number
- f times a particular nucleotide appears at a given
position):
A [19 20 1 2 14 3 6 10] C [ 0 17 4 15 14 6] G [ 4 2 1 4 1 6] T [ 1 2 23 4 6 2 3 2] A [19 20 1 2 14 3 6 10] C [ 0 17 4 15 14 6] G [ 4 2 1 4 1 6] T [ 1 2 23 4 6 2 3 2]
AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA
3.3 Analysis of regulatory regions with TFBS
Representing patterns: Matrices (1/4) III.24
TFBS::Matrix::PFM
Position frequency matrix (raw count matrix)
length() name() ID() pdl_matrix() prettyprint() rawprint() revcom() set_matrix()
to_ICM() to_PWM() add() 3.3 Analysis of regulatory regions with TFBS
Representing patterns: Matrices (2/4)
Creating a position frequency matrix (PFM) object “manually
my $pfm = TFBS::Matrix::PFM->new (
- matrix =>
[[qw(0 3 0 2 5 0 0 16 0 0 1 5)], [qw(7 5 3 3 1 0 0 0 16 0 5 6)], [qw(5 4 6 11 7 0 15 0 0 16 0 3)], [qw(4 4 7 0 3 16 1 0 0 0 10 2)] ],
- name=>”CREB” );
The PFM object can be easily converted into other matrix types
my $icm = $pfm->to_ICM; my $pwm = $pfm->to_PWM;
III.25
Drawing a sequence logo (PNG file)
$icm->draw_logo (
- file=>"/w/TFBSdev/CREB.png",
- full_scale => 2.25,
- margin => 50,
- xsize => 500,
- ysize => 250,
- graph_title=> “Sequence logo for the CREB profile",
- x_title =>"Position",
- y_title =>"IC (bits)");
Converting a PFM on ICM and printing it...
my $icm = $pfm->to_ICM(); print $icm->prettyprint; TFBS::Matrix::ICM
Information content matrix
length() name() ID() pdl_matrix() prettyprint() rawprint() revcom()
to_PWM() total_ic() draw_logo()
A [ 0 0.004 0 0.1 0.078 0 0 2 0 0 0.05 0.037 ] C [0.199 0.007 0.093 0.15 0.016 0 0 0 2 0 0.251 0.044 ] G [0.142 0.006 0.186 0.55 0.11 0 1.559 0 0 2 0 0.022 ] T [0.113 0.006 0.216 0 0.047 2 0.104 0 0 0 0.501 0.015 ] A [ 0 0.004 0 0.1 0.078 0 0 2 0 0 0.05 0.037 ] C [0.199 0.007 0.093 0.15 0.016 0 0 0 2 0 0.251 0.044 ] G [0.142 0.006 0.186 0.55 0.11 0 1.559 0 0 2 0 0.022 ] T [0.113 0.006 0.216 0 0.047 2 0.104 0 0 0 0.501 0.015 ]
3.3 Analysis of regulatory regions with TFBS
Representing patterns: Matrices (3/4) III.26
Position weight matrix
TFBS::Matrix::PWM
length() name() ID() pdl_matrix() prettyprint() rawprint() revcom()
search_seq() search_aln()
The actual matrix used for scanning DNA sequences and scoring potential binding sites
3.3 Analysis of regulatory regions with TFBS
Representing patterns: Matrices (4/4)
Converting a PFM into a PWM
my $pwm = $pfm->to_PWM();
A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ] A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ] A [ 0 3 0 2 5 0 0 16 0 0 1 5 ] C [ 7 5 3 3 1 0 0 0 16 0 5 6 ] G [ 5 4 6 11 7 0 15 0 0 16 0 3 ] T [ 4 4 7 0 3 16 1 0 0 0 10 2 ] A [ 0 3 0 2 5 0 0 16 0 0 1 5 ] C [ 7 5 3 3 1 0 0 0 16 0 5 6 ] G [ 5 4 6 11 7 0 15 0 0 16 0 3 ] T [ 4 4 7 0 3 16 1 0 0 0 10 2 ] Converting a PFM into a PWM in a GC-rich background
my $pwm_GC70 = $pfm->to_PWM(-bg_probabilities => [0.15, 0.35, 0.35, 0.15]);
) ( 4 ) , ( log2 b p N N N i b n + +
III.27
3.3 Analysis of regulatory regions with TFBS
Detecting binding sites in a single sequence (1/2)
Scanning a sequence against a PWM
A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ] C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ] G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ] T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]
ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC
Abs_score = 13.4 (sum of column scores)
Sp1
Calculating the relative score
A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ] C [-0.2284 -0.2284 -1.5 -1.5 1.5128 1.5128
- 1.5 -0.2284 -1.5 -0.2284 -1.5 ]
G [ 1.2348 1.2348 1.2348 1.2348 2.1222 2.1222 2.1222 2.1222 0.4368 1.2348 1.2348 1.5128 1.5128 1.7457 1.7457 1.7457 1.7457
- 1.5 ]
T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 1.7457 ] A [-0.2284 0.4368 -1.5 -1.5 -
- 1.5
1.5 0.4368 -
- 1.5
1.5 -
- 1.5
1.5 -0.2284 0.4368 ] C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -
- 1.5
1.5 -0.2284 -1.5 -0.2284 -1.5 ] G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -
- 1.5
1.5 ] T [ 0.4368 0.4368 -
- 0.2284
0.2284
- 1.5
1.5 -
- 1.5
1.5 -0.2284 0.4368 0.4368 0.4368 -
- 1.5
1.5 1.7457 ]
Max_score = 15.2 (sum of highest column scores) Min_score = -10.3 (sum of lowest column scores)
93% = ⋅ − − = ⋅ = 100% 10.3) ( 15.2 (-10.3)
- 13.4
% 100 Min_score
- Max_score
Min_score
- Abs_score
Rel_score
Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75% Ouch.
III.28
TFBS::SiteSet add_site() add_siteset() Iterator() size() GFF() LRA() TFBS::Site
extends Bio::SeqFeature::Generic
pattern() start() end() strand() score() seq() 3.3 Analysis of regulatory regions with TFBS
Detecting binding sites in a single sequence (2/2)
Scanning a sequence with a matrix profile
my $siteset = $pwm->search_seq(
- seqobj => $bio_seq_obj,
- threshold => “75%”);
Iterating through the obtained set of sites
my $it = $siteset->Iterator(
- sort_by => “start”,
- reverse => 1);
while (my $site = $it->next()) { #...do whatever with the site }
Printing the sites in gff format
print $site_set->GFF();
III.29
Low specificity of profiles:
- too many hits
- great majority not biologically
significant A dramatic improvement in the percentage of biologically significant detections Scanning a single sequence Scanning a pair orf orthologous sequences for conserved patterns in conserved sequence regions 3.3 Analysis of regulatory regions with TFBS
Phylogenetic Footprints (1/3) III.30
TFBS::SitePairSet add_site() add_siteset() Iterator() size() GFF() set1() set2() TFBS::SitePair
extends Bio::SeqFeature::FeaturePair
site1() site2() pattern() 3.3 Analysis of regulatory regions with TFBS
Phylogenetic Footprints (2/3)
Scanning an alignment with a matrix profile
my $sitepairset = $pwm->search_aln(
- alignobj => $simplealign_obj,
- threshold => ”80%”,
- window => 50,
- conservation_ cutoff => “70%”);
Iterating through the obtained set of site pairs
my $it = $sitepairset->Iterator(
- sort_by => “start”,
- reverse => 1);
while (my $sitepair = $it->next()) { #... }
III.31
3.3 Analysis of regulatory regions with TFBS
Phylogenetic Footprints (3/3)
Mechanism of an alignment scan by search_aln
- Runs matrix searches for individual sequences;
each detected site is a subsequence with score exceeding a defined TF score threshold
- Aligns the sequences (unless user has provided an
alignment)
- Scans the alignment with a window of given
window size
- Marks as conserved regions exceeding a given
conservation cutoff value
- Of all detected TFs, reports only those that
- appear at identical position in both sequences,
- are located in the region marked as conserved
Additional parameters for scanning pairwise alignments
- WINDOW SIZE – the size of the sliding window for scanning
the allignment to locate conserved regions
- CONSERVATION CUTOFF – minimum percent of
conserved positions in the sliding window to mark the region as conserved
III.32
3.3 Analysis of regulatory regions with TFBS
Pattern Discovery (1/2)
Discovering new paterns
- verrepresented in a set of
biological sequences
AGTCGTACGTGAC ... AGTAGACGTGCCG ... ACGTGAGATACGT ... GAACGGAGTACGT ... TCGTGACGGTGAT ...
Algorithms
- Gibbs sampling (Lawrence et al.)
- Gibbs sampling/neural network
hybrid metod (Workman and Stormo)
- Phylogeny-guided multiple
sequence footprinting (Blanchette and Tompa)
III.33
3.3 Analysis of regulatory regions with TFBS
Pattern Discovery (2/2)
Retrieving a list of detected patterns
(here: position frequency matrices)
my @pfms = $patterngen->all_patterns();
TFBS::PatternGen::SimplePFM
pattern() patternSet() all_patterns
TFBS::PatternGen::Ann-Spec
pattern() patternSet() all_patterns SQI() parameters() seed() training() best_sites() best_energy() best_iter()
TFBS::PatternGen::Gibbs
pattern() patternSet() all_patterns() bg_probabilities()
Initializing the Gibbs pattern discovery program my $patterngen = TFBS::PatternGen::Gibbs->new (
- sequences => \@list_of_seqobjects,
- binary => '/Programs/Gibbs/Gibbs',
- nr_hits => scalar @list_of_seqobjects,
- motif_length => [8..12])
);
III.34
3.3 Analysis of regulatory regions with TFBS
Storing and retrieving patterns
TFBS::DB::JASPAR2
Creating a database object
my $db = TFBS::DB::JASPAR2->new(
- connect=>[“dbi:mysql:JASPAR2”,””,””]);
get_Matrix_by_ID() get_Matrix_by_name() get_MatrixSet() store_Matrix() get_ID_list() get_name_list() get_class_list() get_species_list() ...
TFBS::DB::FlatFileDir
get_Matrix_by_ID() get_Matrix_by_name() get_MatrixSet() store_Matrix() ...
Retrieving a set of all human and mouse PWM matrices:
my $pwm_set1 = $db->get_MatrixSet(-species => [qw(human mouse)],
- class => “forkhead”);
Retrieving a PWM matrix object by name :
my $pwm = $db->get_Matrix_by_name(“CREB”,”PWM”);
Retrieving a set of matrix objects by list on IDs:
my $pwm_set2 = $db->get_MatrixSet (-ID => \@list_of_IDs);
Retrieving a set of all plant PWM matrices with i.c >= 7 :
my $pwm_set3 = $db->get_MatrixSet( -sysgroup => “plants”,
- min_ic => 7);
III.35
3.3 Analysis of regulatory regions with TFBS
Key ideas
- TFBS is a suite of perl modules
designed to accelerate the manipulation
- f entities encountered in transcription
factor site analysis.
- It is tightly integrated with Bioperl
- Currently implements profile matrix
- bjects, search functions, phylogenetic
footprinting, matrix profile persistence and pattern generators.
III.36
Section 3.4
Working Examples
#1 Searching for HNF1 sites in the UDPGT1 gene #2 Detection of clusters of binding sites #3 Pattern discovery with promoters from a yeast gene network #4 Analysis of binding sites in the promoters of all yeast genes The code and data available at http://lenhard.cgb.ki.se/ISMB2003/
III.37
3.4 Working Examples
#1 Searching for HNF1 sites in the UDPGT1 gene (1/3)
- UDPGT1 (aka UGT1)
: multiple alternative transcription start sites
- alternative
transcripts exhibit tissue specificity
- looking for HNF-1
sites (characteristic for liver-specific expression)
- looking at 1000 bp
upstream of TSS
III.38
- Example 1a: Searching for HNF1 sites in the UDPGT1 gene on
a single sequence
- To run it with default files, cd to its directory (Example3/) and
type perl example1a.pl UGT1A*.fa
3.4 Working Examples
#1 Searching for HNF1 sites in the UDPGT1 gene (2/3)
use strict; use TFBS::Matrix::PFM; # Define files my (@input_files) = @ARGV; my $pfm= TFBS::Matrix::PFM->new( -matrixfile=>"HNF1.pfm",
- name=>"HNF-1", -class=>"Homeo" );
my $pwm = $pfm->to_PWM; foreach my $file (sort @input_files) { print ("\nIn $file:\n\n"); my $seqstream = Bio::SeqIO->new(-file => $file); while (my $seq = $seqstream->next_seq) { my $siteset = $pwm->search_seq(-seqstring => $seq->subseq(1,1000),
- threshold => "75%");
print $siteset->GFF; } }
[borisl@lorien Example1]$ perl example1a.pl UGT1A*.fa In UGT1A3.human.-1000.fa: SEQ TFBS TF binding site 844 857 10.298 + 0 sequence gattaatggttaat ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 851 864 19.018 + 0 sequence ggttaataattaac ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 852 865 17.550 - 0 sequence agttaattattaac ; TF "HNF-1" ; class Homeo In UGT1A3.mouse.-1000.fa: SEQ TFBS TF binding site 169 182 6.979 + 0 sequence agtttaaaactacc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 443 456 7.774 - 0 sequence ggttaagagatatc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 870 883 9.317 + 0 sequence ctttaataattaat ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 871 884 10.623 - 0 sequence cattaattattaaa ; TF "HNF-1" ; class Homeo In UGT1A6.human.-1000.fa: SEQ TFBS TF binding site 370 383 7.610 + 0 sequence agcaaatgattgac ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 371 384 9.371 - 0 sequence agtcaatcatttgc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 813 826 9.743 + 0 sequence ggttcatattaacc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 813 826 9.813 - 0 sequence ggttaatatgaacc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 840 853 13.780 + 0 sequence ggttaaatattaat ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 841 854 13.332 - 0 sequence aattaatatttaac ; TF "HNF-1" ; class Homeo In UGT1A6.mouse.-1000.fa: SEQ TFBS TF binding site 117 130 7.756 - 0 sequence agtttagttttaaa ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 203 216 7.944 + 0 sequence ggtggataatttca ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 520 533 8.103 - 0 sequence tcttaatcttcacc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 607 620 7.443 + 0 sequence agttagtgcttccc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 722 735 6.982 - 0 sequence tgctactgtttaaa ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 841 854 13.780 + 0 sequence ggttaaatattaat ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 842 855 13.332 - 0 sequence aattaatatttaac ; TF "HNF-1" ; class Homeo In UGT1A8.human.-1000.fa: SEQ TFBS TF binding site 39 52 9.562 + 0 sequence ggattatatttgac ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 184 197 7.667 + 0 sequence agtacatattttcc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 595 608 7.993 + 0 sequence agttaaaatctaaa ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 846 859 9.964 + 0 sequence agtaaatcattggc ; TF "HNF-1" ; class Homeo
III.39
- Example 1b: Searching for HNF1 sites in the UDPGT1 gene on
a human:mouse alignment
- To run it with default files, cd to its directory (Example3/) and type
perl example1b.pl UGT1A*.aln
3.4 Working Examples
#1 Searching for HNF1 sites in the UDPGT1 gene (3/3)
use strict; use TFBS::Matrix::PFM; use Bio::AlignIO; my (@input_files) = @ARGV; my $pfm= TFBS::Matrix::PFM->new( -matrixfile=>"HNF1.pfm",
- name=>"HNF-1", -class=>"Homeo" );
my $pwm = $pfm->to_PWM; foreach my $file (sort @input_files) { print ("\nIn $file:\n\n"); my $alnstream = Bio::AlignIO->new(-file => $file, -format=>"clustalw"); while (my $aln = $alnstream->next_aln) { my $aln_prom = $aln->slice (1, $aln->column_from_residue_number($aln->get_seq_by_pos(1)->id,1000) ); my $sitepairset = $pwm->search_aln(-alignobj => $aln_prom,
- threshold => "75%",
- window => 50,
- conservation => 70);
print $sitepairset->GFF; } }
[borisl@lorien Example1]$ perl example1b.pl UGT1A*.aln In UGT1A3.aln: In UGT1A6.aln: SEQ TFBS TF binding site 840 853 13.780 + 0 sequence GGTTAAATATTAAT ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 841 854 13.780 + 0 sequence GGTTAAATATTAAT ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 841 854 13.332 - 0 sequence AATTAATATTTAAC ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 842 855 13.332 - 0 sequence AATTAATATTTAAC ; TF "HNF-1" ; class Homeo In UGT1A8.aln: [borisl@lorien Example1]$ gnome-terminal
III.40
3.4 Working Examples
#3 Pattern discovery with promoters from a yeast gene network
- Example 3: We start from a set of co-regulated yeast genes, grab
their promoter regions and run Gibbs sampling on them to obtain position frequency matrices from overrepresented patterns
- To run it with default files, cd to its directory (Example3/)and type
perl example3.pl yeast_promoters.fa
use strict; use TFBS::PatternGen::Gibbs; my $patterngen= TFBS::PatternGen::Gibbs->new(-seq_file=>$ARGV[0],
- nr_hits => 6,
- motif_length => [8,9],
- additional_params => '-x -e');
my $pattern_nr = 0; foreach my $pfm ($patterngen->all_patterns()) { next if $pfm->tag("MAP_score") < 0; # patterns with negative MAP score # are meaningless - a Gibbs quirk $pattern_nr++; # do something with found patterns print "Pattern $pattern_nr:\n"; print $pfm->prettyprint(), "MAP score: ", $pfm->tag("MAP_score"), "\n\n"; # draw logo to a file $pfm->draw_logo(-file => "Logo_$pattern_nr.png"); } [borisl@lorien Example3]$ perl example3.pl yeast_promoters.fa Pattern 1: A [7 7 7 2 7 7 0 7 ] C [0 0 0 0 0 0 0 0 ] G [0 0 0 0 0 0 0 0 ] T [0 0 0 5 0 0 7 0 ] MAP score: 17.36566 Pattern 2: A [0 0 0 0 4 0 0 0 1 ] C [0 6 2 6 2 0 6 0 4 ] G [6 0 0 0 0 0 0 6 0 ] T [0 0 4 0 0 6 0 0 1 ] MAP score: 17.36566 [borisl@lorien Example3]$ emacs out3.txt
III.41
3.4 Working Examples
#2 Detection of clusters of binding sites (1/2)
- Example 2: We define a cluster of binding sites as a set of
several binding sites satisfying a set of quantitative constraints
- simple: number of sites present winthin a given sequence window
- advanced: probabilistic methods
- It is desirable to be able to define the clusters and detect them in
a set of sites detected on a single sequence or an alignment of sequences
Photoreceptors
- 140
- 60
Ret-1 CRX NRL NRL CRX Ret-1
- 200
- 40
Rhodopsin Enhancer Rhodopsin Promoter Arrestin
CRX Ret-3
- 1760
- 1950
AIRS
Rat senescence marker Human glucose-6-phosphatase Human Protein C
Liver Skeletal muscle
AChR-β MCK 5' MLC-1f3f
- Myf/E Myf/E Myf/E
TEF 119 +30 Myf/E Myf/E SRF/CArG Mef-2 Novel Mef-2
- 1256
- 1050
Myf/E Myf/E Myf/E Mef-2 +25,000 +25,200 HNF-4 HNF-1 HNF-4 HNF-3 HNF-3
- 240
- 133
- 135
C/EBP C/EBP HNF-1 C/EBP HNF-1
- 38
HNF-1 HNF-3 C/EBP NF-I
- 80
- 9
NF-I
Photoreceptors
- 140
- 60
Ret-1 CRX NRL
- 140
- 60
Ret-1 CRX NRL NRL CRX Ret-1
- 200
- 40
NRL CRX Ret-1
- 200
- 40
Rhodopsin Enhancer Rhodopsin Promoter Arrestin
CRX Ret-3
- 1760
- 1950
AIRS CRX Ret-3
- 1760
- 1950
AIRS
Rat senescence marker Human glucose-6-phosphatase Human Protein C
Liver Skeletal muscle
AChR-β MCK 5' MLC-1f3f
- Myf/E Myf/E Myf/E
TEF 119 +30 Myf/E Myf/E Myf/E TEF 119 +30 Myf/E Myf/E SRF/CArG Mef-2 Novel Mef-2
- 1256
- 1050
Myf/E Myf/E Myf/E Myf/E SRF/CArG Mef-2 Novel Mef-2
- 1256
- 1050
Myf/E Myf/E Myf/E Mef-2 +25,000 +25,200 Myf/E Myf/E Myf/E Mef-2 +25,000 +25,200 HNF-4 HNF-1 HNF-4 HNF-3 HNF-3
- 240
- 133
HNF-4 HNF-1 HNF-4 HNF-3 HNF-3
- 240
- 133
- 135
C/EBP C/EBP HNF-1 C/EBP HNF-1
- 38
- 135
C/EBP C/EBP HNF-1 C/EBP HNF-1
- 38
HNF-1 HNF-3 C/EBP NF-I
- 80
- 9
NF-I HNF-1 HNF-3 C/EBP NF-I
- 80
- 9
NF-I
III.42
3.4 Working Examples
#2 Detection of clusters of binding sites (2/2)
use strict; use TFBS::SiteSetFilter::RegModule; use TFBS::Matrix::PFM; use Bio::SeqIO; my $seqstream= Bio::SeqIO->new(-file=>"../Example4/Allyeast_promoters.fa",
- format=> 'fasta');
my $pfm1 = TFBS::Matrix::PFM->new( -matrixfile=>"../Example4/PAC.ma",
- name=>"PAC" );
my $pfm2 = TFBS::Matrix::PFM->new( -matrixfile=>"../Example1/HNF1.pfm",
- name=>"HNF-1" );
my $pwm1 = $pfm1->to_PWM(); my $pwm2 = $pfm2->to_PWM(); while (my $seq = $seqstream->next_seq()) { # analyze each sequence print $seq->id.":\n"; my $siteset= $pwm1->search_seq(-seqobj=> $seq, -threshold=> "75%"); $siteset->add_siteset ($pwm2->search_seq(-seqobj=> $seq, -threshold=> "75%")); my $filter = TFBS::SiteSetFilter::RegModule->new ( -module_elements => { "PAC" => { threshold => ”75%", min_nr_sites => 1 }, "HNF-1" => { threshold => "80%", min_nr_sites => 1 } },
- length => 100,
- min_score => 4
); my @modules = $filter->apply($siteset); foreach my $module (@modules) { print $module->GFF."\n"; } }
TFBS::SiteSetFilter::RegModule
new() apply()
YAL035W: YAL036C: YAL037W: YAL038W: SEQ TFBS TF binding site 83 96 6.841 + sequence TGTTAAAATTGATC ; TF "HNF-1" ; c SEQ TFBS TF binding site 84 97 6.960
- sequence GGATCAATTTTAAC ; TF "HNF-1" ;
SEQ TFBS TF binding site 101 114 7.182 + sequence TGTAAATAAACAAT ; TF "HNF-1" ; SEQ TFBS TF binding site 102 115 7.439
- sequence GATTGTTTATTTAC ; TF "HNF-1" ; c
SEQ TFBS TF binding site 150 158 6.040 + sequence CCATGACCC ; TF PAC ; class Unkn SEQ TFBS TF binding site 160 168 6.559
- sequence CGACGATCC ; TF PAC ; class Unkn
YAL039C: YAL040C: YAL041W: YAL042W: YAL043C: YAL043C-A: YAL044C: YAL045C: YAL046C: YAL047C: YAL048C: YAL049C: YAL051W: YAL053W: YAL054C: YAL055W: SEQ TFBS TF binding site 405 413 9.473 + sequence TGATGAGCC ; TF PAC ; class Unkn SEQ TFBS TF binding site 411 419 8.652
- sequence CGAGGAGGC ; TF PAC ; class Unkn
SEQ TFBS TF binding site 450 463 7.491 + sequence TGGTAATTTTTTGA ; TF "HNF-1" ; c SEQ TFBS TF binding site 451 464 9.047 + sequence GGTAATTTTTTGAA ; TF "HNF-1" ; c SEQ TFBS TF binding site 821 834 7.674
- sequence GATTATTCTTTGAA ; TF "HNF-1" ; c
SEQ TFBS TF binding site 862 875 8.210 + sequence GGCAAAAAATTAAC ; TF "HNF-1" ; SEQ TFBS TF binding site 863 876 12.900
- sequence TGTTAATTTTTTGC ; TF "HNF-1" ; c
SEQ TFBS TF binding site 881 894 7.454 + sequence GGTTCATAGTCCCC ; TF "HNF-1" ; SEQ TFBS TF binding site 881 889 6.040
- sequence CTATGAACC ; TF PAC ; class Unkno
YAL056W: SEQ TFBS TF binding site 279 292 8.925
- sequence GATTTATGATTTCA ; TF "HNF-1" ; c
SEQ TFBS TF binding site 321 334 7.606
- sequence GGTTTTCAATTTCC ; TF "HNF-1" ;
SEQ TFBS TF binding site 345 353 7.432 + sequence CGAAGAGCA ; TF PAC ; class Unkn SEQ TFBS TF binding site 362 370 7.432
- sequence CGATGCGCA ; TF PAC ; class Unkn
SEQ TFBS TF binding site 389 397 6.092
- sequence CGAGGATCA ; TF PAC ; class Unkn
SEQ TFBS TF binding site 423 436 8.113
- sequence TGTTGCTTATCAAC ; TF "HNF-1" ;
SEQ TFBS TF binding site 445 458 7.197 + sequence GGTAAAAATGTAAC ; TF "HNF-1" ; SEQ TFBS TF binding site 446 459 8.294
- sequence TGTTACATTTTTAC ; TF "HNF-1" ; c
YAL058C-A: YAL058W: YAL059W: YAL060W:
III.43
3.4 Working Examples
Key ideas
- TFBS modules enable easy scripting of
common types of regulatory regions analysis
- If your particular analysis requires
introduction of a new concept:
- Try modeling it into an object;
- Think it over, clean it up, document it and
share the code with the rest of us!
III.44
Section 3.5
Developers’ Corner
III.45
3.5 Developers’ corner
Getting involved in TFBS development
- Public CVS
- cvs –d :pserver:cvs@eriador.cgb.ki.se:/opt/cvs/TFBS login
Password: cvs
- browse the repository at http://forkhead.cgb.ki.se/TFBS/
- Full access to TFBS CVS repository
- contact Boris.Lenhard@cgb.ki.se
- The future of TFBS
- At least part of the classes will move into BioPerl class hierarchy
III.46
The end of part III
REFLECTIONS
- Part 1
- Futility Theorem – Essentially predictions of individual TFBS have no
relationship to an in vivo function
- Successful bioinformatics methods for site discrimination incorporate
additional information (clusters, conservation)
- Future: Chromatin?
- Part 2
- Pattern discovery methods are severely restricted by the Signal-to-Noise
problem
- Observed patterns must be carefully considered
- Successful methods for pattern discovery incorporate additional information
(conservation, structural constraints on TFs)
- Part 3
- Online tools are available for gene-centric studies, but most
bioinformaticians will benefit from using programming resources such as TFBS
Some Future Directions
- Current restrictions
- Phylogenetic footprinting methods do not take full advantage of
multiple sequence comparisons
- Inadequate reference collections of validated regulatory
modules limits development of module discrimination tools
- Long-term challenges
- Pattern discovery methods are limited by noise – restricts our
ability to analyze putative regulons from microarray studies
- Must address chromatin/structure in the nucleus