ISMB 2003: Tutorial AM4 Analysis of regulatory sequences controlling - - PowerPoint PPT Presentation

ismb 2003 tutorial am4
SMART_READER_LITE
LIVE PREVIEW

ISMB 2003: Tutorial AM4 Analysis of regulatory sequences controlling - - PowerPoint PPT Presentation

ISMB 2003: Tutorial AM4 Analysis of regulatory sequences controlling expression of biological networks Wyeth Wasserman, Albin Sandelin, Boris Lenhard Modified Materials, Sample Programs, Links Page All tutorial materials are available at


slide-1
SLIDE 1

Analysis of regulatory sequences controlling expression of biological networks

ISMB 2003: Tutorial AM4

Wyeth Wasserman, Albin Sandelin, Boris Lenhard

slide-2
SLIDE 2

Modified Materials, Sample Programs, Links Page

All tutorial materials are available at

http://lenhard.cgb.ki.se/ISMB2003/

User: ismb2003 Pasword: am4tutorial

slide-3
SLIDE 3

Tutorial AM4 Analysis of regulatory sequences controlling the expression of biological networks

Part I: Discrimination of regulatory regions in a gene or genome Part II: Discovery of novel regulatory controls for co-expressed genes Part III: Programming methods for the analysis of regulatory control sequences

slide-4
SLIDE 4

Analysis of regulatory sequences controlling the expression of biological networks

Wyeth Wasserman

Albin Sandelin Boris Lenhard

Part I: Discrimination of regulatory regions in a gene or genome

slide-5
SLIDE 5

Tutorial AM4 Analysis of regulatory sequences controlling the expression of biological networks

A.

Introduction to transcription in eukaryotic cells

B.

Modeling binding properties of transcription factors

C.

Resources for analysis of regulatory sequences

1.

Detection of individual TFBS

1.

Enhanced specificity via “Phylogenetic Footprinting”

2.

Clusters of Transcription Factor Binding Sites

3.

Promoter Detection

slide-6
SLIDE 6

Transcription Simplified

TATA URE

URF Pol-II Three-step Process:

  • 1. TF binds to TFBS (DNA)
  • 2. TF catalyzes recruitment of

polII complex

  • 3. Transcription initiates

I.1

slide-7
SLIDE 7

Complexity in Metazoan Transcription I.2

slide-8
SLIDE 8

Diverse and non-uniform use of terms: Partial glossary for tutorial

  • Promoter – Sufficient to support the initiation of transcription;
  • rientation dependent; includes TSS
  • Regulatory Regions
  • Proximal – adjacent to promoter
  • Distal – some distance away from promoter (vague)
  • May be positive (enhancing) or negative (repressing)
  • TSS – transcription start site
  • TFBS – single transcription factor binding site
  • Modules – Sets of TFBS that function together

EXON

TFBS TATA

TSS

TFBS TFBS Promoter Region TFBS TFBS Distal Regulatory Region

I.3

Proximal Regulatory Region

EXON

TFBS TFBS Distal R.R.

slide-9
SLIDE 9

Restrictions in Coverage of Tutorial

Focus on Eukaryotic cells.

Most of the principles apply to microbes. A set of relevant links is provided for microbial

resources in the “Resources” handout

Polymerase II driven promoters

Generally protein coding genes

All references are made to activating sequences

Information about repressive regulation is sparse

I.4

slide-10
SLIDE 10

Laboratory Discovery of TF Binding Sites I.5A

LUCIFERASE LUCIFERASE LUCIFERASE LUCIFERASE LUCIFERASE LUCIFERASE LUCIFERASE ACTIVITY

slide-11
SLIDE 11

Representing Binding Sites for a TF

  • A set of sites represented as a consensus
  • VDRTWRWWSHDWVWH (IUPAC degenerate DNA)

A 14 16 4 0 1 19 20 1 4 13 4 4 13 12 3 C 3 0 0 0 0 0 0 0 7 3 1 0 3 1 12 G 4 3 17 0 0 2 0 0 9 1 3 0 5 2 2 T 0 2 0 21 20 0 1 20 1 4 13 17 0 6 4

  • A matrix describing a a set of sites
  • A single site
  • AAGTTAATGATTAAC

I.5B

Set of binding sites AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA Set of binding sites AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA

slide-12
SLIDE 12

TGCTG = 0.9 PFMs to PWMs One would like to add the following features to the model:

  • 1. Correcting for the base frequencies in DNA
  • 2. Weighting for the confidence (depth) in the pattern
  • 3. Convert to log-scale probability for easy arithmetic

A 5 0 1 0 0 C 0 2 2 4 0 G 0 3 1 0 4 T 0 0 1 1 1 A 1.6 -1.7 -0.2 -1.7 -1.7 C -1.7 0.5 0.5 1.3 -1.7 G -1.7 1.0 -0.2 -1.7 1.3 T -1.7 -1.7 -0.2 -0.2 -0.2 f matrix w matrix Log(

)

f(b,i)+ s(n) p(b) I.6

slide-13
SLIDE 13

Performance of Profiles

  • 95% of predicted sites bound in vitro (Tronche 1997)
  • MyoD binding sites predicted about once every 600 bp (Fickett

1995)

  • My Futility Theorem
  • Nearly 100% of predicted transcription factor binding sites have no

function in vivo

I.7

slide-14
SLIDE 14

Predicted TFBS in 1 kbp beta-globin proximal promoter

I.8

slide-15
SLIDE 15

Terms

  • Specificity – The portion of predictions that are correct
  • Sensitivity – The portion of “positives” that are detected
  • The detection of TFBS is limited by terrible specificity. Why?

I.9

slide-16
SLIDE 16

OnLine resources for the detection of TFBS

  • TESS
  • TRRD
  • MatInspector (“Transfac”)

I.10

slide-17
SLIDE 17

Phylogenetic Footprinting

70,000,000 years of evolution reveals most regulatory regions

I.11

slide-18
SLIDE 18

Phylogenetic Footprinting to Identify Functional Segments

% Identity

Actin gene compared between human and mouse with DPB.

200 bp Window Start Position (human sequence)

I.12

slide-19
SLIDE 19

Phylogenetic Footprinting (2)

  • 0.2

0.2 0.4 0.6 0.8 1 1000 2000 3000 4000 5000 6000 7000

FoxC2

100% 80% 60% 40% 20% 0%

I.13

slide-20
SLIDE 20

1kbp promoter screen with footprinting I.14

slide-21
SLIDE 21

Performance: Human vs. Mouse

Testing set: 40 experimentally defined sites in 15 well

studied genes

85-95% of defined sites detected with conservation

filter, while only 11-16%of total predictions retained I.15

SELECTIVITY SENSITIVITY

slide-22
SLIDE 22

Choosing the ”right” species...

(BONUS: What’s the ultimate sin in bioinformatics?)

COW MOUSE CHICKEN

HUMAN HUMAN HUMAN

I.16

slide-23
SLIDE 23

Considerations in Phylogenetic Footprinting

  • Alignment methods

Pairwise vs. multiple Local vs. global

  • Definition of Conservation

Windows vs. significant segments

  • Selection of Organisms

Evolutionary distance Retained function

I.17

slide-24
SLIDE 24

OnLine Resources for Phylogenetic Footprinting

  • Linked to TFBS
  • ConSite
  • rVISTA
  • Alignments
  • Blastz
  • Lagan
  • Avid
  • ORCA Aligner/OrthoSeq
  • Visualization
  • SimPlot
  • Vista Browser
  • PipMaker

I.18

slide-25
SLIDE 25

Regulatory Modules

TFs do NOT act in isolation

I.19

slide-26
SLIDE 26

Liver regulatory modules I.20

slide-27
SLIDE 27

Models for Liver TFs… HNF1 C/EBP HNF3 HNF4 I.21

slide-28
SLIDE 28

Scan the UDPGT1 sequence for modules (Gilbert’s Syndrome)

  • 0.2

0.2 0.4 0.6 0.8 1 1 5 1 9 2 1 3 3 1 7 4 2 1 5 2 5 6 2 9 7 3 3 8 3 7 9 4 2 4 6 1 5 2 5 4 3 5 8 4 Series1 Series2

Wildtype Mutant

Liver Module Model Score

“Window” Position in Sequence Liver regulatory module scores overlapping sequence variation I.22

slide-29
SLIDE 29

Predicted Muscle Regulatory Module

0.1 0.2 0.3 0.4 0.5 1500 3000 4500 6000 7500 9000

Kcna7 Score

I.23

slide-30
SLIDE 30

Untrained Methods

  • New generation of tools to identify clusters of TFBS for user-

specified set of TFs

  • Identify statistically significant clusters of sites within genomes
  • See MSCAN talk on Wednesday

I.24

slide-31
SLIDE 31

Considerations in Searching for Clusters of Binding Sites: Key items

  • Biological motivation for grouping transcription factors
  • Is there sufficient data to train a discrimination function?
  • Are there binding profiles for the critical transcription factors?

I.25

slide-32
SLIDE 32

OnLine Tools for Detection of Site Clusters

  • MSCAN (user defined sets of TFs)
  • TransRegio (liver and muscle)
  • COMET/CISTER/ClusterBuster

I.26

slide-33
SLIDE 33

Promoter Detection

Statistical Properties

  • f Sequences

I.27

slide-34
SLIDE 34

Promoter Detection

Approaches based on detection of TFBS Approaches based on sequence properties Some considerations regarding current approaches

I.28

slide-35
SLIDE 35

Promoters by Detection of Binding Sites

  • Early promoter detection tools were based on promoters of

small set of highly expressed genes

“TATA” Box at –30; CATT Box at –90

  • Attempted to define the specific position at which RNA

transcripts are initiated

  • Benchmarking test in late 1990s

Most promoter prediction tools were slightly better than random

guessing

nothing dramatically better than TATA prediction at -30

I.29

slide-36
SLIDE 36

What were we doing wrong?

Grouping diverse promoters into a single mega-class Attempting to pinpoint a specific start position when

biochemical system is ambiguous

Ignoring a common observation in the laboratory-

based literature… I.30

slide-37
SLIDE 37

Sequence Properties in Regions containing Promoters

  • Long recognized (in labs) that a significant subset of promoters

are situated within or adjacent to regions rich in CG dinucleotides

Without selection CG dinucleotides are modified CpG islands believed to favor “open” chromatin

  • A new generation of promoter detection tools (CpG-island

detectors) are based on the detection of C/G-rich regions containing over-represented strings/motifs (generally A/T-rich) identified in training data

I.31

slide-38
SLIDE 38

OnLine Tools for Promoter Detection

  • EpoNine
  • Promoter Inspector
  • FirstEF
  • Others?
  • Defining the likely TSS with NNPP

I.32

slide-39
SLIDE 39

Looking back at part 1: Key items

  • Profiles provide reasonable estimate of the potential for a TF to

bind to a sequence in vitro (i.e. in the lab)

  • In vitro binding is not predictive of in vivo function (i.e. in the cell)
  • Prediction of promoters with CpG islands is useful, but detection
  • f the other 50% of promoters is poor
  • There are two reasonable methods to improve the prediction of

individual TF binding sites

  • Phylogenetic Footprinting identifies sites conserved across

evolution, improving specificity by an order of magnitude in the best cases

  • Analysis of clusters of TFBS for biologically linked TFs can improve

specificity by two orders of magnitude

I.33

slide-40
SLIDE 40

Wyeth Wasserman

Albin Sandelin

Boris Lenhard

Analysis of regulatory sequences controlling the expression of biological networks

Part II: Discovery of novel regulatory controls for co-expressed genes

slide-41
SLIDE 41

Tutorial AM4 Analysis of regulatory sequences controlling the expression of biological networks

  • A. Introduction: the problem
  • B. Selection of promoter sequences
  • C. Algorithms for pattern discovery
  • D. Enhancing pattern discovery

II.1

slide-42
SLIDE 42

Predicting binding sites for factor X in promoter Y Sliding models over sequences (pattern detection)

ATGCTATAGTGTGCACGATCGATGCTAGTGCATCAACAGCTG CAGCTGTGTCGGGAA

Analyzing shared properties in promoter sets Over-representation (pattern discovery)

Approaches in promoter analysis II.2

slide-43
SLIDE 43

Definitions

Co-regulation: Genes with similar expression patterns resulting from the influence of one or more common control mechanisms

Given a set of ”co-regulated” genes, define motifs over-represented in the regulatory regions

The problem II.3

slide-44
SLIDE 44

Gene Networks

What are gene networks How are gene networks defined

II.4

slide-45
SLIDE 45

Definitions of Gene Networks in Genomics Data(1) Why is co-regulation

  • ccuring?

Why is it frequent?

Large protein systems

are often activated simultanosly

Evolution works by

modifying existing systems

DNA DUPLICATION G1 G2 NUCLEAR DIVISION

The cell cycle II.5

slide-46
SLIDE 46

Definitions of Gene Networks in Genomics Data(2)

  • Protein complexes
  • Pathways
  • Proteins associated with specific

process

  • Developmental program

II.6

Systems potentially co-regulated :

slide-47
SLIDE 47

Selection of Promoter Sequences for Analysis The good, the bad and the ugly --- benchwork Microarrays Other... II.7 How do we define our set of promoters?

slide-48
SLIDE 48

Microarrays Provides ’snapshots’ of mRNA levels in the cell mRNA distribution for certain genes are related Cluster genes that are expressed in similar fashion Concept: co-expressed genes believed to be co-regulated

Selection of Promoter sequences for analysis (1) II.8

slide-49
SLIDE 49

Pros: Wealth of data Cons: Often noisy Expensive Secondary effects a problem Coexpression != Co-regulation

Relative concentrations of microarrays in ISMB abstracts

0.2 0.4 0.6 0.8 1 1 .2 2000 2001 2002

Relative concentrations of microarrays in ISMB abstracts

0.2 0.4 0.6 0.8 1 1 .2 2000 2001 2002

Selection of Promoter sequences for analysis (2) II.9

slide-50
SLIDE 50

Other approaches: Litterature-based selection Chromatin immuno-precipitation Green Fluorescent Protein based approaches

Selection of Promoter sequences for analysis (3) II.10

slide-51
SLIDE 51

Online Resources

Expression databases General : NCBI Gene Expression Omnibus EMBL ArrayExpress Stanford Microarray Database Selection of Promoter sequences for analysis (4) II.11

slide-52
SLIDE 52

Functionally related proteins are

  • ften co-regulated

By analyzing co-regulated genes, we aim to find shared regulatory signals Selection of co-regulated genes is non-trivial. High throughput methods tend to be too noisy for our needs Filtering selected sequences using complementary data is often benificial Selection of Promoter sequences for analysis: Key items II.12

slide-53
SLIDE 53

Methods for Pattern Discovery

Word-based vs matrix-based Exhaustive Probabilistic Enhancements

II.13

slide-54
SLIDE 54

Methods for Pattern Discovery

Word-based

TFBS are words Words are easily counted Pros Realistic complexity Based on well-understood statistics Cons TF binding properties are often degenerate

Matrix-based

TF:s do not bind to words Pros Matrix models are more accurate descriptions of binding preferences Cons Unrealistically expensive in computer time (if optimals are sought)

AAGTTAATGATTAAC

slide-55
SLIDE 55

Exhaustive Methods for Pattern Discovery

What is an exhaustive method? Types of exhaustive methods

Word based Matrix based

II.14

slide-56
SLIDE 56

Exhaustive methods(1)

Computer science: Exhaustive algorithm: All possible solutions are evaluated: often VERY CPU intensive for large indata In this context Count all possible motifs/words. Analyze

  • ver-representation

II.15

slide-57
SLIDE 57

Exhaustive methods(2)

Word based methods: How likely are X words in a set of sequences, given sequence characteristics?

CCCGCCGGAATGAAATCTGATTGACATTTTCC >EP71002 (+) Ce[IV] msp-56 B; range -100 to -75 TTCAAATTTTAACGCCGGAATAATCTCCTATT >EP63009 (+) Ce Cuticle Col-12; range -100 to -75 TCGCTGTAACCGGAATATTTAGTCAGTTTTTG >EP63010 (+) Ce Cuticle Col-13; range -100 to -75 TATCGTCATTCTCCGCCTCTTTTCTT >EP11013 (+) Ce vitellogenin 2; range -100 to -75 GCTTATCAATGCGCCCGGAATAAAACGCTATA >EP11014 (+) Ce vitellogenin 5; range -100 to -75 CATTGACTTTATCGAATAAATCTGTT >EP11015 (-) Ce vitellogenin 4; range -100 to -75 ATCTATTTACAATGATAAAACTTCAA >EP11016 (+) Ce vitellogenin 6; range -100 to -75 ATGGTCTCTACCGGAAAGCTACTTTCAGAATT >EP11017 (+) Ce calmodulin cal-2; range -100 to -75 TTTCAAATCCGGAATTTCCACCCGGAATTACT >EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75 TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC >EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75 ACTGAACTTGTCTTCAAATTTCAACACCGGAA >EP17012 (+) Ce hsp 16K-1 A; range -100 to -75 TCAATGCCGGAATTCTGAATGTGAGTCGCCCT >EP55011 (-) Ce hsp 16K-1 B; range

II.16

slide-58
SLIDE 58

Over-representation How many words of type ’AGGAGTGA’ are found in our sequences?

[ ] ∏

=

=

k j j

a p i in begins w P

1

) (

[ ]

=

+ − =

k j j w

a p k n X E

1

) ( ) 1 (

[ ] [ ]

w w w w

X Var X E X Z − =

How likely is this result?

Exhaustive methods(3) II.17

slide-59
SLIDE 59

Background properties

Simple: How likely are single nucleotides? (extended Bernoulli) Complex: Neglect certain words Locations of TFBS Higher-order descriptions of DNA Exhaustive methods(4) II.18

slide-60
SLIDE 60

Exhaustive methods(5) Find all words of length 7 in the yeast genome Make a lookup table: AAACCTTT 456 TTTTTTTT 57788 GATAGGCA 589 Etc...

GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGG ACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGAC GGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGC CAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGT CTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAA ATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTG GGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAA GTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTT TGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGA AAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCA AAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATC ACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGT CGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAA GGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAG ATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCAT CATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAA ACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTG AAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTG TCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTC TCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGT ATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCT AGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGT CTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCT GCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTG CTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA

II.19

slide-61
SLIDE 61

Matrix based methods

cagagcgatAGGTCAacgataatat gcgatagcaAGGTCGccccgtatag aacttggttAGGTCAttagcgagta ggggatgggCCCTCAaatacgcgga aaccggaagGGTTCAacgatctatt

A 3 0 0 0 0 4 C 1 1 1 0 5 0 G 1 4 3 0 0 1 T 0 0 1 5 0 0

= local multiple alignment No current exhaustive methods, due to NP-completeness

Exhaustive methods(6) II.20

slide-62
SLIDE 62

Resources Moby Dick (Bussemaker et al) (not online) Dyad analysis (van Helden et al) YMF (Sinha and Tompa)

Exhaustive methods(7) II.21

slide-63
SLIDE 63
  • Algorithms with high complexity - Large

sequences and/or many possible word lengths not possible

  • Often word-based
  • TFBS are not words (’fuzzy’ binding)
  • Sensitivity susceptible to noisy indata

(e.g. microarrays) Exhaustive methods: Key items II.22

slide-64
SLIDE 64

Probabilistic Methods for Pattern Discovery

What is a probabilistic method? The Gibbs sampler algorithm Improving background models

II.23

slide-65
SLIDE 65

Probabilistic Methods for Pattern Discovery(1)

Computer science: Probabilistic algorithm: uses randomness Bioinformatics: Probabilistic algorithm often the same as Monte Carlo algorithm: an approximation algorithm that always is fast but does not always give the best solution

II.24

slide-66
SLIDE 66

Motivation:

TFBS are not words Efficiency Can be intentionally influenced by biological data

Overview:

Find a local alignment of width x of sites that maximizes information content in reasonable time Usually by Gibbs sampling or EM methods Probabilistic Methods for Pattern Discovery(2) II.25

slide-67
SLIDE 67

Two data structures used: 1) Current pattern nucleotide frequencies qi,1,..., qi,4 and corresponding background frequencies pi,1,..., pi,4 2) Current positions of site startpoints in the N sequences a1, ..., aN , i.e. the alignment that contributes to qi,j. One starting point in each sequence is chosen randomly initially.

The Gibbs Sampling algorithm

tgacttcc tgatctct agacctca tgacctct

Probabilistic Methods for Pattern Discovery(3) II.26

slide-68
SLIDE 68

Iteration step Remove one sequence z from the

  • set. Update the current pattern

according to

tgacttcc tgatctct agacctca tgacctct

B N b c q

j j i j i

+ − + = 1

, ,

Pseudocount for symbol j Sum of all pseudocounts in column

Probabilistic Methods for Pattern Discovery(4) A ’Score’ the current pattern against each possible occurence ak in z. Draw a new ak with probabilities based on respective score divided by the background model B II.27 z

slide-69
SLIDE 69

Probabilistic Methods for Pattern Discovery(5)

10 12 14 16 18 100 200 300 400 500 600

SEQUENCE LENGTH PATTERN SIMILARITY

  • vs. TRUE MEF2 PROFILE

True Mef2 Binding Sites

Sensitivity weaknesses: ’Pattern drowning’ II.28

slide-70
SLIDE 70

Correction for background properties Workman & Stormo (ANN-Spec) – Train on background set as well to find ’commonly occuring’ patterns. Maximization of probabililty of finding pattern in positive sequences and not in background seqsequences In effect: Try to discriminate between ’common’ and ’novel’ patterns Thijs et al, Bailey and Elkan Markov background model describing DNA in m:th order Probabilistic Methods for Pattern Discovery(6) II.29

slide-71
SLIDE 71

What is a higher-order background model? Zero-order:

p(A)=0.29, p(C)=0.21, p(G)=0.21, p(T)=0.29

=

=

N i i

nucleotide P seq P

... 1

) ( ) (

First-order:

A

A

T C G A

m:th-order: The chance of drawing base x is dependant on the identity of the previous m bases Probabilistic Methods for Pattern Discovery(7) II.30

slide-72
SLIDE 72

Online resources Gibbs Motif Sampler(Lawrence et al) MEME(Bailey and Elkan) AnnSpec(Workman and Stormo) AlignAce(Roth et al) Probabilistic Methods for Pattern Discovery(8) II.31

slide-73
SLIDE 73

Gibbs Sampling/EM algorithms

  • Complexity is moderate.

Optimality not guaranteed.

  • Low sensitivity: patterns

’drown’ in large sequences (~>500 bp)

  • Sensitivity susceptible to noisy

input data (e.g. microarrays) Probabilistic Methods for Pattern Discovery: Key items II.32

slide-74
SLIDE 74

Enhancing pattern discovery sensitivity

Cross-species comparison Modelling of pattern constraints

information content structural constraints palindromicity

Usage of prior knowledge

II.33

slide-75
SLIDE 75

Search only where TFBS are likely: Cross-species comparison Use as filtering Or include orthologous sequences in analysis

Enhancing pattern detection sensitivity (1) II.34

slide-76
SLIDE 76

Information segmentation

Information content distributions of TFBS are distinctly non-random

(Wasserman et al 2000) Palindromicity, dyads (van Helden et al 2000) Variable gaps (Hu 2003)

TFBSs are not randomly drawn

Enhancing pattern detection sensitivity (2) II.35

slide-77
SLIDE 77

Building in biological knowledge in pattern finding - priors

How do priors work? Essentially by increasing the psudocounts by some fraction submitted in the prior Enhancing pattern detection sensitivity (3) A certain residue is according to our prior knowledge an A in 47/100 cases. New pseudocount for first residue, A: 50/100 x k x#number of sites Example: II.36

slide-78
SLIDE 78

’Biasing’ probabilistic pattern finder with prior knowledge - an unexplored area Examples Structural constraints (example: HLH factors have certain shared binding preferences) Information content ’landscape’ Enhancing pattern detection sensitivity (4) II.37

slide-79
SLIDE 79

Example of enhanced sensitivity using biologically based priors

use prior no prior background (no sites)

Enhancing pattern detection sensitivity (5) II.38

slide-80
SLIDE 80

Difficulties in pattern detection are attributed to: Low signal strength Cross species comparison as filters Complexity of background DNA New >0 order models Simplified models of TFBS properties Dyads, palindromes Segmentation of information content distributions Usage of biological knowledge as priors Enhancing pattern detection sensitivity: Key items II.39

slide-81
SLIDE 81

Evaluation of Patterns

How relevant is our new pattern? Algorithms for pattern comparison

II.40

slide-82
SLIDE 82

Pattern finders can generally NOT distinguish between patterns and over-represented ’junk’

Q: How do we know if a pattern is true A: In the lab! Q:How can we avoid labwork? A: Compare to already known patterns! Q: How do we know a function of a gene A: In the lab! Q:How can we avoid labwork? A: Compare to already known sequences! The sequence analogy

Evaluation of patterns(1) II.41

slide-83
SLIDE 83

Algorithms for pattern comparison

Sandelin & Wasserman Needleman-Wunsch variant Hughes et al Based on protein BLOCKS alignment algorithm (Pietrokowski) Evaluation of patterns(2) II.42

slide-84
SLIDE 84

Online Resources

CompareAce (Hughes et al) JASPAR (Sandelin & Wasserman) Integrated systems (yeast) Atlas YRSA Evaluation of patterns(3) II.43

slide-85
SLIDE 85

Pattern discovery is based on finding shared TFBS for co-regulated genes Co-regulated sequences can be based on different experiments and different clustering analysis algorithms Algorithms for pattern discovery can be exhaustive probabilistic Enhanced sensitivity can be achieved by Cross-species comparison Information segmentation Biologically based priors New patterns can be compared to verified patterns

Looking back at part II: II.44

slide-86
SLIDE 86

Wyeth Wasserman Albin Sandelin

Boris Lenhard

Analysis of regulatory sequences controlling the expression of biological networks

Part III: Programming Resources for the Analysis of Regulatory Sequences

slide-87
SLIDE 87

Section 3.1

Orientation

Objectives TFBS –Basic features Prerequisites for this part of the tutorial III.1

slide-88
SLIDE 88

3.1 Orientation

Objectives

  • Purpose: Introduction to the tools and “discipline” of

practical analysis of regulatory regions and networks

  • After mastering the material presented here, you should

be able to script and automate computational methods for the analysis of regulatory sequences.

  • More advanced Perl users will be provided with enough

information to contribute extensions and enhancements to the existing computational framework for regulatory sequence analysis.

III.2

slide-89
SLIDE 89

3.1 Orientation

Regulatory regions problem space

Sets of binding sites

AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA

Sets of binding sites

AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA

Specificity profiles for binding sites

A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ]

Specificity profiles for binding sites

A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ]

Clusters of binding sites Clusters of binding sites Transcription factors Transcription factor binding sites Regulatory nucleotide sequences Transcription factors Transcription factor binding sites Regulatory nucleotide sequences

TATA URE

URF Pol-II

III.3

slide-90
SLIDE 90

3.1 Orientation

TFBS – basic features

  • A computational framework for transcription factor binding site

analysis and manipulation

  • Implemented in Perl
  • Models and manipulates common objects from the regulatory

regions object space

  • Patterns representing TF specificity
  • Sets of patterns
  • Detected binding sites
  • Sets of detected binding sites
  • Unified DB interface to pattern databases
  • Pattern generators
  • Enables easy scripting and automation of regulatory region

analysis

III.4

slide-91
SLIDE 91

3.1 Orientation

Prerequisites for full benefit from this part of the tutorial

  • Several concepts covered in parts I and II
  • Profile matrices (PFM, ICM, PWM)
  • phylogenetic footprinting
  • Basic to intermediate knowledge of Perl programming
  • General syntax and semantics
  • References and complex data structures
  • Basic knowledge of Unix OSs and an access to one is needed

at the moment

III.5

slide-92
SLIDE 92

Section 3.2

Perl and Bioperl

Perl and Perl objects Bioperl Selected Bioperl objects III.6

slide-93
SLIDE 93

Annotation Sequence features Sequence

3.2 Perl and Bioperl

Bioinformatics problem|object space

LOCUS SARS 1942 bp mRNA linear PRI 04-FEB-2003 DEFINITION Homo sapiens seryl-tRNA synthetase (SARS), mRNA. ACCESSION NM_006513 VERSION NM_006513.2 GI:16306547 KEYWORDS . SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE 1 (bases 1 to 1942) AUTHORS Hartlein,M. and Cusack,S. TITLE Structure, function and evolution of seryl-tRNA synthetases: implications for the evolution of aminoacyl-tRNA synthetases and the genetic code JOURNAL J. Mol. Evol. 40 (5), 519-530 (1995) MEDLINE 95302522 PUBMED 7540217 REFERENCE 2 (bases 1 to 1942) AUTHORS Vincent,C., Tarbouriech,N. and Hartlein,M. TITLE Genomic organization, cDNA sequence, bacterial expression, and purification of human seryl-tRNA synthase JOURNAL Eur. J. Biochem. 250 (1), 77-84 (1997) MEDLINE 98092290 PUBMED 9431993 COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from BC000716.1, BC009390.1 and X91257.1. On Oct 22, 2001 this sequence version replaced gi:5730028. Summary: This gene belongs to the class II amino-acyl tRNA family. The encoded enzyme catalyzes the transfer of L-serine to tRNA (Ser) and is related to bacterial and yeast counterparts. COMPLETENESS: complete on the 3' end. FEATURES Location/Qualifiers source 1..1942 /organism="Homo sapiens" /db_xref="taxon:9606" /chromosome="1" /map="1p13.3-p13.1" gene 1..1942 /gene="SARS" /note="synonyms: SERS, SERRS" /db_xref="LocusID:6301" /db_xref="MIM:607529" misc_feature 79..234 /gene="SARS" /note="Seryl_tRNA_N; Region: Seryl-tRNA synthetase N-terminal domain. This domain is found associated with the Pfam tRNA synthetase class II domain (pfam00587) and represents the N-terminal domain of seryl-tRNA synthetase" /db_xref="CDD:pfam02403" misc_feature 82..1482 /gene="SARS" /note="Region: Seryl-tRNA synthetase [Translation, ribosomal structure and biogenesis]" /db_xref="CDD:COG0172" misc_feature 640..1167 /gene="SARS" /note="ThrS; Region: Threonyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]" /db_xref="CDD:COG0441" misc_feature 649..1152 /gene="SARS" /note="tRNA-synt_2b; Region: tRNA synthetase class II core domain (G, H, P, S and T). Other tRNA synthetase sub-families are too dissimilar to be included. This domain is the core catalytic domain of tRNA synthetases and includes glycyl, histidyl, prolyl, seryl and threonyl tRNA synthetases" /db_xref="CDD:pfam00587" misc_feature 664..1161 /gene="SARS" /note="ProS; Region: Prolyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]" /db_xref="CDD:COG0442" polyA_signal 1831..1836 /gene="SARS" polyA_site 1852 /gene="SARS" /evidence=experimental polyA_site 1895 /gene="SARS" /evidence=experimental BASE COUNT 530 a 460 c 538 g 414 t ORIGIN 1 gcagtgcggc ggtcacaggc tgagtgctgc ggcgcgatcc ttgcttccct gagcgttggc 61 ccgggaggaa agaagatggt gctggatctg gatttgtttc gggtggataa aggaggggac 121 ccagccctca tccgagagac gcaggagaag cgcttcaagg acccgggact agtggaccag 181 ctggtgaagg cagacagcga gtggcgacga tgtagatttc gggcagacaa cttgaacaag 241 ctgaagaacc tatgcagcaa gacaatcgga gagaaaatga agaaaaaaga gccagtggga 301 gatgatgagt ctgtcccaga gaatgtgctg agtttcgatg accttactgc agacgcttta 361 gctaacctga aagtctcaca aatcaaaaaa gtccgactcc tcattgatga agccatcctg 421 aagtgtgacg cggagcggat aaagttggaa gcagagcggt ttgagaacct ccgagagatt 481 gggaaccttc tgcacccttc tgtacccatc agtaacgatg aggatgtgga caacaaagta 541 gagaggattt ggggtgattg tacagtcagg aagaagtact ctcatgtgga cctggtggtg 601 atggtagatg gctttgaagg cgaaaagggg gccgtggtgg ctgggagtcg agggtacttc 661 ttgaaggggg tcctggtgtt cctggaacag gctctcatcc agtatgccct tcgcaccttg 721 ggaagtcggg gctacattcc catttatacc ccctttttca tgaggaagga ggtcatgcag 781 gaggtggcac agctcagcca gtttgatgaa gaactttata aggtgattgg caaaggcagt 841 gaaaagtctg atgacaactc ctatgatgag aagtacctga ttgccacctc agagcagccc 901 attgctgccc tgcaccggga tgagtggctc cggccggagg acctgcccat caagtatgct 961 ggcctgtcta cctgcttccg tcaggaggtg ggctcccatg gccgtgacac ccgtggcatc 1021 ttccgagtcc atcagtttga gaagattgaa cagtttgtgt actcatcacc ccatgacaac 1081 aagtcatggg agatgtttga agagatgatt accaccgcag aggagttcta ccagtccctg 1141 gggattcctt accacattgt gaatattgtc tcaggttctt tgaatcatgc tgccagtaag 1201 aagcttgacc tggaggcctg gtttccgggc tcaggagcct tccgtgagtt ggtctcctgt 1261 tctaattgca cggattacca ggctcgccgg cttcgaatcc gatatgggca aaccaagaag 1321 atgatggaca aggtggagtt tgtccatatg ctcaatgcta ccatgtgcgc cactacccgt 1381 accatctgcg ccatcctgga gaactaccag acagagaagg gcatcactgt gcctgagaaa 1441 ttgaaggagt tcatgccgcc aggactgcaa gaactgatcc cctttgtgaa gcctgcgccc 1501 attgagcagg agccatcaaa gaagcagaag aagcaacatg agggcagcaa aaagaaagca 1561 gcagcaagag acgtcaccct agaaaacagg ctgcagaaca tggaggtcac cgatgcttga 1621 acattcctgc ctccctattt gccaggcttt catttctgtc tgctgagatc tcagagcctg 1681 cccaacagca gggaagccaa gcacccattc atccccctgc ccccatctga ctgcgtagct 1741 gagaggggaa cagtgccatg taccacacag atgttcctgt ctcctcgcat gggcataggg 1801 acccatcatt gatgactgat gaaaccatgt aataaagcat ctctggggag ggcttaggac 1861 tcttcctcag tcttcttccc cgggcttgaa ccccgaaaaa aaaaaaaaaa aaaaaaaaaa 1921 aaaaaaaaaa aaaaaaaaaa aa // TBLASTN 2.2.3 [May-13-2002] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= (958 letters) Database: /opt/cgb/PEGASUS/BLAST/Pegasus_transcript 3265 sequences; 7,560,541 total letters Searching.......done Score E Sequences producing significant alignments: (bits) Value 9550__TRANSCRIPT__2305 KIAA1464 protein [OTHER] 68 8e-12 10825__TRANSCRIPT__1863 RAN binding protein 9 [OTHER] 66 3e-11 9615__TRANSCRIPT__2321 DEAD/H (Asp-Glu-Ala-Asp/His) box polypept... 39 0.005 >9550__TRANSCRIPT__2305 KIAA1464 protein [OTHER] Length = 1488 Score = 67.8 bits (164), Expect = 8e-12 Identities = 41/112 (36%), Positives = 57/112 (50%), Gaps = 9/112 (8%) Frame = +1 Query: 504 GFDLNVFGYCGFDGLITNSTEQSKEYAKPFGRDDVIGCGINFIDGSIFFTKNGIHLGNAF 563 G+D + +GY G DG S+ + Y F DVIGC +N I+G+ F+TKNG LG Sbjct: 1 GWDKHSYGYHGDDGHSFCSSGTGQPYGPTFTTGDVIGCCVNLINGTCFYTKNGHSLGVCI 180 Query: 564 TDLN--------DLEFVPYVALR-PGNSIKTNFGLNEDFVFDIIGYQDKWKS 606 DL P V L+ PG + NFG + F+FDI Y +W++ Sbjct: 181 RDLGGSALWSHFGWNLYPTVGLQTPGEIVDANFG-QQPFLFDIEDYMREWRA 333 Score = 45.4 bits (106), Expect = 4e-05 Identities = 29/118 (24%), Positives = 51/118 (42%) Frame = +1 Query: 708 GSLPNTLNVMINDYLIHEGLVDVAKGFLKDLQKDAVNVNGQHSESKDVIRHNERQIMKEE 767 G L M++ YL+H G A F R E I +E+ Sbjct: 379 GEWQAVLQNMVSSYLVHHGYCATATAF---------------------ARMTETPIQEEQ 495 Query: 768 RMVKIRQELRYLINKGQISKCINYIDNEIPDLLKNNLELVFELKLANYLVMIKKSSSK 825 +K RQ+++ L+ +G++ + I P LL++N L+F LK ++ M+ + S+ Sbjct: 496 ASIKNRQKIQKLVLEGRVGEAIETTQRFYPGLLEHNPNLLFMLKCRQFVEMVNGTDSE 669 >10825__TRANSCRIPT__1863 RAN binding protein 9 [OTHER] Length = 1107 Score = 65.9 bits (159), Expect = 3e-11 Identities = 45/131 (34%), Positives = 65/131 (49%), Gaps = 1/131 (0%) Frame = +1 Query: 485 GVSAMSLNVDGSINKCQKYGFDLNVFGYCGFDGLITNSTEQSKEYAKPFGRDDVIGCGIN 544 G+SA +N++ + G+D + +GY G DG S+ + Y F DVIGC +N Sbjct: 10 GLSAQGVNMN------RLPGWDKHSYGYHGDDGHSFCSSGTGQPYGPTFTTGDVIGCCVN 171 Query: 545 FIDGSIFFTKNGIHLGNAFTDLNDLEFVPYVALR-PGNSIKTNFGLNEDFVFDIIGYQDK 603 I+ + F+TKNG L + L P V L+ PG + NFG FVFDI Y + Sbjct: 172 LINNTCFYTKNGHSLDVKYAILQP-NLYPTVGLQTPGEVVDANFG-QHPFVFDIEDYMRE 345 Query: 604 WKSLAYEHICR 614 W++ I R Sbjct: 346 WRTKIQAQIDR 378 Score = 45.1 bits (105), Expect = 6e-05 Identities = 26/119 (21%), Positives = 52/119 (42%) Frame = +1 Query: 707 DGSLPNTLNVMINDYLIHEGLVDVAKGFLKDLQKDAVNVNGQHSESKDVIRHNERQIMKE 766 +G + M++ YL+H G A+ F R ++ +++E Sbjct: 397 EGEWQTMIQKMVSSYLVHHGYCATAEAF---------------------ARSTDQTVLEE 513 Query: 767 ERMVKIRQELRYLINKGQISKCINYIDNEIPDLLKNNLELVFELKLANYLVMIKKSSSK 825 +K RQ ++ L+ G++ + I P LL+ N L+F LK+ ++ M+ + S+ Sbjct: 514 LASIKNRQRIQKLVLAGRMGEAIETTQQLYPSLLERNPNLLFTLKVRQFIEMVNGTDSE 690 >9615__TRANSCRIPT__2321 DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 1 [SIGNALING] Length = 2706 Score = 38.5 bits (88), Expect = 0.005 Identities = 31/90 (34%), Positives = 43/90 (47%), Gaps = 4/90 (4%) Frame = +1 Query: 510 FGYCGFDGLITNS-TEQSKEYAKPFGRDDVIGCGINFIDGSIFFTKNGIHLGNAF---TD 565 FG+ GF G S +Q Y + F D IGC ++ G + F+KNG LG AF Sbjct: 775 FGF-GFGGTGKKSHNKQFDNYGEEFTMHDTIGCYLDIDKGHVKFSKNGKDLGLAFEIPPH 951 Query: 566 LNDLEFVPYVALRPGNSIKTNFGLNEDFVF 595 + + P L+ +K NFG E+F F Sbjct: 952 MKNQALFPACVLK-NAELKFNFG-EEEFKF 1035 Database: /opt/cgb/PEGASUS/BLAST/Pegasus_transcript Posted date: Nov 13, 2002 1:39 PM Number of letters in database: 7,560,541 Number of sequences in database: 3265 Lambda K H 0.315 0.134 0.381 Gapped Lambda K H 0.267 0.0410 0.140 Matrix: BLOSUM62 Gap Penalties: Existence: 11, Extension: 1 Number of Hits to DB: 5,635,301 Number of Sequences: 3265 Number of extensions: 64670 Number of successful extensions: 222 Number of sequences better than 1.0e-01: 6 Number of HSP's better than 0.1 without gapping: 60 Number of HSP's successfully gapped in prelim test: 11 Number of HSP's that attempted gapping in prelim test: 156 Number of HSP's gapped (non-prelim): 76 length of query: 958 length of database: 2,520,180 effective HSP length: 99 effective length of query: 859 effective length of database: 2,196,945 effective search space: 1887175755 effective search space used: 1887175755 frameshift window, decay const: 50, 0.1 T: 13 A: 40 X1: 16 ( 7.3 bits) X2: 38 (14.6 bits) X3: 64 (24.7 bits) S1: 42 (22.0 bits) S2: 77 (34.3 bits)

BLAST report

Hit Hit Hit

HSP HSP HSP HSP HSP

Examples: BLAST and genbank sequence: combination of hierarchical and sequential elements: can be used to represent most complex data entities encountered in bioinformatics

III.7

slide-94
SLIDE 94

3.2 Perl and Bioperl

Perl and Perl objects (1/4)

Programmer’s efficiency For most tasks in bioinformatics, our time is much more valuable than CPU time Highly suitable for biological sequence (string) manipulation Built-in regular expressions Abundance of existing, reusable code Bioperl Numerical methods Web programming Objects make programming faster and easier Complexity hiding, Abstraction, Code reuse

III.8

slide-95
SLIDE 95

3.2 Perl and Bioperl

Perl and Perl objects (2/4)

use Bio::Seq; # import the module my $seqobj = Bio::Seq->new(-id => “MYSEQ_001”,

  • seq => “ACGCTAGGGATGGATAGGGATGGA”);

print $seqobj->seq; # prints “ACGCTAGGGATGGATAGGGATGGA” print $seqobj->id; # prints “MYSEQ_001” my $revseqobj = $seqobj->revcom; # creates a new Bio::Seq object # with reverse complement of the # original sequence print $revseqobj->seq; # prints “TCCATCCCTATCCATCCCTAGCGT”

Creating and using an object – an example of Bio::Seq

III.9

slide-96
SLIDE 96

3.2 Perl and Bioperl

Perl and Perl objects (4/4)

Bio::Seq=HASH(0x8330dcc) '_as_feat' => ARRAY(0x831f6f4) empty array '_root_verbose' => 0 'annotation' => Bio::Annotation::Collection=HASH(0x831f850) '_annotation' => HASH(0x831f634) empty hash '_root_verbose' => 0 '_typemap' => Bio::Annotation::TypeManager=HASH(0x831f928) '_root_verbose' => 0 '_type' => HASH(0x831f79c) 'comment' => 'Bio::Annotation::Comment' 'dblink' => 'Bio::Annotation::DBLink' 'reference' => 'Bio::Annotation::Reference' 'primary_seq' => Bio::PrimarySeq=HASH(0x831f6e8) '_root_verbose' => 0 'alphabet' => 'dna' 'display_id' => 'MYSEQ_001' 'seq' => 'ACGCTAGGGATGGATAGGGATGGA'

$seqobj :

internal structure Good news: you don’t have to care about it.

Bio::Seq object

Bio::SeqFeature::Exon Bio::SeqFeature::Exon Bio::SeqFeature::Generic Bio::SeqFeature::CDS Bio::Annotation::Collection Bio::PrimarySeq alphabet “dna” display_id “MYSEQ_001” seq “ACGCTAGGGATGGATAGGGATGGA”

...

empty

$seqobj :

“logical” structure: primary sequence (basic info) annotation collection list of sequence features All the components are accessed via specific methods: $seqobj->each_SeqFeature # returns the list of Bio::SeqFeature objects $seqobj->seq # returns the string “ACGCTAGGGATGGATAGGGATGGA” $seqobj->annotation # returns the Bio::AnnotationCollection object $seqobj->display_id # returns the string “MYSEQ_001”

III.10

slide-97
SLIDE 97

3.2 Perl and Bioperl

BioPerl (1/5)

  • A library of Perl modules for managing and manipulating life-

science information

  • Biological sequences and alignments
  • Reading and writing to files
  • Retrieval from existing databases
  • Running blast searches and the processing of results
  • Much, much more....
  • Reduces otherwise complex tasks to only a few lines of code
  • http://bioperl.org

III.11

slide-98
SLIDE 98

3.2 Perl and Bioperl

BioPerl (2/5)

Some of the most frequently used classes of Bioperl objects

Bio::Seq

seq() length() start() end() strand() desc() revcom() trunc() id() accession_number() species() annotation() all_SeqFeatures() add_SeqFeature() ...

Bio::SeqIO

next_seq() write_seq()

Bio::Tools::Run::StandAloneBlast

exists_blast() blastall() blasrpgp() bl2seq()

Bio::DB::EMBL

get_Seq_by_acc() get_Seq_by_id() get_Seq_by_gi get_Stream_by_acc() get_Stream_by_id() get_Stream_by_gi() proxy() ua() ...

Bio::SeqFeature::Generic

location() start() end() length() strand() score() frame() add_tag_value() each_tag_value() ...

III.12

slide-99
SLIDE 99

3.2 Perl and Bioperl

BioPerl (3/5)

Examples of bioperl usage

(adapted from Stajich et al., Genome Res. 2002)

use Bio::DB::EMBL; # import the required modules use Bio::SeqIO; my $db = Bio::DB::EMBL->new(); # $db is a Bio::DB::EMBL object my $seqobj = $db->get_Seq_By_acc(“U14680”); # $seqobj is a Bio::Seq object my $seqout = Bio::SeqIO->new(-format=>”genbank”); # $seqout is a new Bio::SeqIO object if (defined $seqobj) { # “if the database retrieval has succeeded” $seqout->write_seq($seqobj); # write sequence data from $seqobj } # to STDOUT(default) in genbank format

To retrieve a sequence from EMBL and print it in GenBank format:

(adapted from Stajich et al., Genome Res. 2002)

III.13

slide-100
SLIDE 100

3.2 Perl and Bioperl

BioPerl (3/5)

Examples of bioperl usage Hierarchical structure of a BLAST report is decomposed into a hierarchy of object:

Bio::SearchIO::Result::BLAST

Bio::Search:: Hit::BlastHit Bio::PrimarySeq alphabet “dna” display_id “MYSEQ_001” seq “ACGCTAGGGATGGATAGGGATGGA”

...

Bio::Search:: Hit::BlastHit Bio::Search:: Hit::BlastHit

Bio::SearchIO::Hit::BlastHit

Bio::Search:: HSP

...

Bio::Search:: HSP Bio::Search:: HSP name “M18833” description “Homo sapiens presenilin B” “”

Bio::SearchIO::Hit::HSP

name “M18833” description “MYSEQ_001” seq “ACGCTAGGGATGGATAGGGATGGA”

III.14

slide-101
SLIDE 101

use Bio::SearchIO; my $search = Bio::SearchIO->new (-format => ‘blast’, -file => ‘report.bls’); my @HitsToSave = (); my ($cutoff_Evalue, $cutoff_Len) = (1e-3, 120); while (my $result = $search->next_result) { while (my $sbjct = $result->next_hit) { while (my $hsp = $sbjct->next_hsp) { if ($hsp->evalue < $cutoff_Evalue and $hsp->length(‘total’) >= $cutoff_Len) { push @HitsToSave, $sbjct; last; } } } } print “Hits:\n”; foreach my $sbjct (@HitsToSave) { print $sbjct->name, “\n”; }

3.2 Perl and Bioperl

BioPerl (3/5)

Examples of bioperl usage

(adapted from Stajich et al., Genome Res. 2002)

# import the required modules # $search is a Bio::SearchIO::blast object # $seqout is a new Bio::SeqIO object # cutoff criteria for hits to save # retrieve the result of one BLAST run # iterate through hits (“subjects”) # iterate through HSPs of a hit(subject) # apply cutof criteria # # # add those that pass cuttofs to a list # test only the first(=best) HSP # # # # # # loop through saved hits # print hit name (usually sequence ID)

To extract hits with E-value ≤ 10-3 and length ≥ 120 bp from a BLAST report:

(adapted from Stajich et al., Genome Res. 2002)

III.15

slide-102
SLIDE 102

3.2 Perl and Bioperl

BioPerl (5/5)

Getting started with Bioperl

Bioperl tutorial

http://bioperl.org/bptutorial

Online documentation

http://doc.bioperl.org

man | perldoc pages

$ man Bio::Tools::Run::StandAloneBlast $ perldoc Bio::Graphics::Panel

Bioperl mailing list

bioperl-l@bioperl.org

III.16

slide-103
SLIDE 103

3.2 Perl and Bioperl

Key ideas

  • If you are not using bioperl, start using it

– it will increase your productivity and take much tedium out of coding

  • When writing new bioinformatics code,

make it

  • bject-oriented
  • publicly available
  • bioperl-compliant

III.17

slide-104
SLIDE 104

Section 3.3

Analysis of regulatory regions with TFBS

Regulatory regions problem space Basic features of TFBS modules Relationship to BioPerl Representing patterns: matrix objects Detecting binding sites in a single sequence Phylogenetic footprints Pattern discovery Storing and retrieving patterns

III.18

slide-105
SLIDE 105

3.3 Analysis of regulatory regions with TFBS

Regulatory regions problem space

Sets of binding sites

AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA

Sets of binding sites

AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA

Specificity profiles for binding sites

A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ]

Specificity profiles for binding sites

A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ]

Clusters of binding sites Clusters of binding sites Transcription factors Transcription factor binding sites Regulatory nucleotide sequences Transcription factors Transcription factor binding sites Regulatory nucleotide sequences

TATA URE

URF Pol-II

III.19

slide-106
SLIDE 106

3.3 Analysis of regulatory regions with TFBS

Basic Features of TFBS modules

  • Patterns representing TF specificity
  • Sets of patterns
  • Detected binding sites
  • Sets of detected binding sites
  • Unified DB interface to pattern databases
  • Pattern generators

http://forkhead.cgb.ki.se/TFBS/

III.20

slide-107
SLIDE 107

3.3 Analysis of regulatory regions with TFBS

Patterns

  • In TFBS, a pattern is a generalized representation of a particular

sequence motif. It can be e.g. :

  • At the moment, TFBS implements only matrix profiles.
  • TFBS contains a set of classes for pattern construction/discovery

(TFBS::PatternGen::*)

Matrix profile

A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ] A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ]

Consensus

CBKGGTGACGTM

Regular expression

/C[^A][GT]GGTGACG[AC]/

III.21

slide-108
SLIDE 108

3.3 Analysis of regulatory regions with TFBS

Classes and their roles

Class Description TFBS::Matrix:: PFM Position frequency (raw count) matrix; can be transformed to other matrix types TFBS::Matrix::ICM Information content matrix; used for drawing sequence logos TFBS::Matrix::PWM Position weight matrix; used for scanning nucleotide sequences and alignme nts thereof (phylogenetic footprinting) TFBS::Site A sequence feature object for transcription factor binding site TFBS::SiteSet An aggregate of TFBS::Site objects, with methods for manipulation of the set TFBS::SitePair A sequence feature object for pa ir of transcription factor binding sites in orthologous sequences TFBS::SitePairSet An aggregate of TFBS::SitePair objects, with methods for manipulation of the set TFBS::DB::FlatFileDir Read/write interface to a simple, flat file database of matrix obje ct data TFBS::DB::JASPAR2 Read/write interface to a MySQL database of matrix

  • bject data

TFBS::DB::TRANSFAC Read -only interface to public TRANSFAC database

  • n the WWW.

TFBS::PatternGen::SimplePFM Pattern generator that creates a position weight matrix from a set of short sequences of fixed size TFBS::PatternGen::Gibbs Pattern generator that uses Gibbs

III.22

slide-109
SLIDE 109

3.3 Analysis of regulatory regions with TFBS

Relationship to BioPerl

  • Tightly integrated with BioPerl
  • All objects have Bio::Root::Root as a base class
  • TFBS::Site IS A Bio::SeqFeature::Generic
  • TFBS::SitePair IS A Bio::SeqFeature::FeaturePair
  • All sequences can be passed to methods as Bio::Seq objects, all

alignments as Bio::SimpleAlign objects, and are internally manipulated as such

  • Tries to follow Bioperl coding conventions

III.23

slide-110
SLIDE 110

Given N sequence fragments of fixed length, one can assemble a position frequency matrix (number

  • f times a particular nucleotide appears at a given

position):

A [19 20 1 2 14 3 6 10] C [ 0 17 4 15 14 6] G [ 4 2 1 4 1 6] T [ 1 2 23 4 6 2 3 2] A [19 20 1 2 14 3 6 10] C [ 0 17 4 15 14 6] G [ 4 2 1 4 1 6] T [ 1 2 23 4 6 2 3 2]

AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA

3.3 Analysis of regulatory regions with TFBS

Representing patterns: Matrices (1/4) III.24

slide-111
SLIDE 111

TFBS::Matrix::PFM

Position frequency matrix (raw count matrix)

length() name() ID() pdl_matrix() prettyprint() rawprint() revcom() set_matrix()

to_ICM() to_PWM() add() 3.3 Analysis of regulatory regions with TFBS

Representing patterns: Matrices (2/4)

Creating a position frequency matrix (PFM) object “manually

my $pfm = TFBS::Matrix::PFM->new (

  • matrix =>

[[qw(0 3 0 2 5 0 0 16 0 0 1 5)], [qw(7 5 3 3 1 0 0 0 16 0 5 6)], [qw(5 4 6 11 7 0 15 0 0 16 0 3)], [qw(4 4 7 0 3 16 1 0 0 0 10 2)] ],

  • name=>”CREB” );

The PFM object can be easily converted into other matrix types

my $icm = $pfm->to_ICM; my $pwm = $pfm->to_PWM;

III.25

slide-112
SLIDE 112

Drawing a sequence logo (PNG file)

$icm->draw_logo (

  • file=>"/w/TFBSdev/CREB.png",
  • full_scale => 2.25,
  • margin => 50,
  • xsize => 500,
  • ysize => 250,
  • graph_title=> “Sequence logo for the CREB profile",
  • x_title =>"Position",
  • y_title =>"IC (bits)");

Converting a PFM on ICM and printing it...

my $icm = $pfm->to_ICM(); print $icm->prettyprint; TFBS::Matrix::ICM

Information content matrix

length() name() ID() pdl_matrix() prettyprint() rawprint() revcom()

to_PWM() total_ic() draw_logo()

A [ 0 0.004 0 0.1 0.078 0 0 2 0 0 0.05 0.037 ] C [0.199 0.007 0.093 0.15 0.016 0 0 0 2 0 0.251 0.044 ] G [0.142 0.006 0.186 0.55 0.11 0 1.559 0 0 2 0 0.022 ] T [0.113 0.006 0.216 0 0.047 2 0.104 0 0 0 0.501 0.015 ] A [ 0 0.004 0 0.1 0.078 0 0 2 0 0 0.05 0.037 ] C [0.199 0.007 0.093 0.15 0.016 0 0 0 2 0 0.251 0.044 ] G [0.142 0.006 0.186 0.55 0.11 0 1.559 0 0 2 0 0.022 ] T [0.113 0.006 0.216 0 0.047 2 0.104 0 0 0 0.501 0.015 ]

3.3 Analysis of regulatory regions with TFBS

Representing patterns: Matrices (3/4) III.26

slide-113
SLIDE 113

Position weight matrix

TFBS::Matrix::PWM

length() name() ID() pdl_matrix() prettyprint() rawprint() revcom()

search_seq() search_aln()

The actual matrix used for scanning DNA sequences and scoring potential binding sites

3.3 Analysis of regulatory regions with TFBS

Representing patterns: Matrices (4/4)

Converting a PFM into a PWM

my $pwm = $pfm->to_PWM();

A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ] A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ] A [ 0 3 0 2 5 0 0 16 0 0 1 5 ] C [ 7 5 3 3 1 0 0 0 16 0 5 6 ] G [ 5 4 6 11 7 0 15 0 0 16 0 3 ] T [ 4 4 7 0 3 16 1 0 0 0 10 2 ] A [ 0 3 0 2 5 0 0 16 0 0 1 5 ] C [ 7 5 3 3 1 0 0 0 16 0 5 6 ] G [ 5 4 6 11 7 0 15 0 0 16 0 3 ] T [ 4 4 7 0 3 16 1 0 0 0 10 2 ] Converting a PFM into a PWM in a GC-rich background

my $pwm_GC70 = $pfm->to_PWM(-bg_probabilities => [0.15, 0.35, 0.35, 0.15]);

) ( 4 ) , ( log2 b p N N N i b n + +

III.27

slide-114
SLIDE 114

3.3 Analysis of regulatory regions with TFBS

Detecting binding sites in a single sequence (1/2)

Scanning a sequence against a PWM

A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ] C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ] G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ] T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]

ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC

Abs_score = 13.4 (sum of column scores)

Sp1

Calculating the relative score

A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ] C [-0.2284 -0.2284 -1.5 -1.5 1.5128 1.5128

  • 1.5 -0.2284 -1.5 -0.2284 -1.5 ]

G [ 1.2348 1.2348 1.2348 1.2348 2.1222 2.1222 2.1222 2.1222 0.4368 1.2348 1.2348 1.5128 1.5128 1.7457 1.7457 1.7457 1.7457

  • 1.5 ]

T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 1.7457 ] A [-0.2284 0.4368 -1.5 -1.5 -

  • 1.5

1.5 0.4368 -

  • 1.5

1.5 -

  • 1.5

1.5 -0.2284 0.4368 ] C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -

  • 1.5

1.5 -0.2284 -1.5 -0.2284 -1.5 ] G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -

  • 1.5

1.5 ] T [ 0.4368 0.4368 -

  • 0.2284

0.2284

  • 1.5

1.5 -

  • 1.5

1.5 -0.2284 0.4368 0.4368 0.4368 -

  • 1.5

1.5 1.7457 ]

Max_score = 15.2 (sum of highest column scores) Min_score = -10.3 (sum of lowest column scores)

93% = ⋅ − − = ⋅ = 100% 10.3) ( 15.2 (-10.3)

  • 13.4

% 100 Min_score

  • Max_score

Min_score

  • Abs_score

Rel_score

Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75% Ouch.

III.28

slide-115
SLIDE 115

TFBS::SiteSet add_site() add_siteset() Iterator() size() GFF() LRA() TFBS::Site

extends Bio::SeqFeature::Generic

pattern() start() end() strand() score() seq() 3.3 Analysis of regulatory regions with TFBS

Detecting binding sites in a single sequence (2/2)

Scanning a sequence with a matrix profile

my $siteset = $pwm->search_seq(

  • seqobj => $bio_seq_obj,
  • threshold => “75%”);

Iterating through the obtained set of sites

my $it = $siteset->Iterator(

  • sort_by => “start”,
  • reverse => 1);

while (my $site = $it->next()) { #...do whatever with the site }

Printing the sites in gff format

print $site_set->GFF();

III.29

slide-116
SLIDE 116

Low specificity of profiles:

  • too many hits
  • great majority not biologically

significant A dramatic improvement in the percentage of biologically significant detections Scanning a single sequence Scanning a pair orf orthologous sequences for conserved patterns in conserved sequence regions 3.3 Analysis of regulatory regions with TFBS

Phylogenetic Footprints (1/3) III.30

slide-117
SLIDE 117

TFBS::SitePairSet add_site() add_siteset() Iterator() size() GFF() set1() set2() TFBS::SitePair

extends Bio::SeqFeature::FeaturePair

site1() site2() pattern() 3.3 Analysis of regulatory regions with TFBS

Phylogenetic Footprints (2/3)

Scanning an alignment with a matrix profile

my $sitepairset = $pwm->search_aln(

  • alignobj => $simplealign_obj,
  • threshold => ”80%”,
  • window => 50,
  • conservation_ cutoff => “70%”);

Iterating through the obtained set of site pairs

my $it = $sitepairset->Iterator(

  • sort_by => “start”,
  • reverse => 1);

while (my $sitepair = $it->next()) { #... }

III.31

slide-118
SLIDE 118

3.3 Analysis of regulatory regions with TFBS

Phylogenetic Footprints (3/3)

Mechanism of an alignment scan by search_aln

  • Runs matrix searches for individual sequences;

each detected site is a subsequence with score exceeding a defined TF score threshold

  • Aligns the sequences (unless user has provided an

alignment)

  • Scans the alignment with a window of given

window size

  • Marks as conserved regions exceeding a given

conservation cutoff value

  • Of all detected TFs, reports only those that
  • appear at identical position in both sequences,
  • are located in the region marked as conserved

Additional parameters for scanning pairwise alignments

  • WINDOW SIZE – the size of the sliding window for scanning

the allignment to locate conserved regions

  • CONSERVATION CUTOFF – minimum percent of

conserved positions in the sliding window to mark the region as conserved

III.32

slide-119
SLIDE 119

3.3 Analysis of regulatory regions with TFBS

Pattern Discovery (1/2)

Discovering new paterns

  • verrepresented in a set of

biological sequences

AGTCGTACGTGAC ... AGTAGACGTGCCG ... ACGTGAGATACGT ... GAACGGAGTACGT ... TCGTGACGGTGAT ...

Algorithms

  • Gibbs sampling (Lawrence et al.)
  • Gibbs sampling/neural network

hybrid metod (Workman and Stormo)

  • Phylogeny-guided multiple

sequence footprinting (Blanchette and Tompa)

III.33

slide-120
SLIDE 120

3.3 Analysis of regulatory regions with TFBS

Pattern Discovery (2/2)

Retrieving a list of detected patterns

(here: position frequency matrices)

my @pfms = $patterngen->all_patterns();

TFBS::PatternGen::SimplePFM

pattern() patternSet() all_patterns

TFBS::PatternGen::Ann-Spec

pattern() patternSet() all_patterns SQI() parameters() seed() training() best_sites() best_energy() best_iter()

TFBS::PatternGen::Gibbs

pattern() patternSet() all_patterns() bg_probabilities()

Initializing the Gibbs pattern discovery program my $patterngen = TFBS::PatternGen::Gibbs->new (

  • sequences => \@list_of_seqobjects,
  • binary => '/Programs/Gibbs/Gibbs',
  • nr_hits => scalar @list_of_seqobjects,
  • motif_length => [8..12])

);

III.34

slide-121
SLIDE 121

3.3 Analysis of regulatory regions with TFBS

Storing and retrieving patterns

TFBS::DB::JASPAR2

Creating a database object

my $db = TFBS::DB::JASPAR2->new(

  • connect=>[“dbi:mysql:JASPAR2”,””,””]);

get_Matrix_by_ID() get_Matrix_by_name() get_MatrixSet() store_Matrix() get_ID_list() get_name_list() get_class_list() get_species_list() ...

TFBS::DB::FlatFileDir

get_Matrix_by_ID() get_Matrix_by_name() get_MatrixSet() store_Matrix() ...

Retrieving a set of all human and mouse PWM matrices:

my $pwm_set1 = $db->get_MatrixSet(-species => [qw(human mouse)],

  • class => “forkhead”);

Retrieving a PWM matrix object by name :

my $pwm = $db->get_Matrix_by_name(“CREB”,”PWM”);

Retrieving a set of matrix objects by list on IDs:

my $pwm_set2 = $db->get_MatrixSet (-ID => \@list_of_IDs);

Retrieving a set of all plant PWM matrices with i.c >= 7 :

my $pwm_set3 = $db->get_MatrixSet( -sysgroup => “plants”,

  • min_ic => 7);

III.35

slide-122
SLIDE 122

3.3 Analysis of regulatory regions with TFBS

Key ideas

  • TFBS is a suite of perl modules

designed to accelerate the manipulation

  • f entities encountered in transcription

factor site analysis.

  • It is tightly integrated with Bioperl
  • Currently implements profile matrix
  • bjects, search functions, phylogenetic

footprinting, matrix profile persistence and pattern generators.

III.36

slide-123
SLIDE 123

Section 3.4

Working Examples

#1 Searching for HNF1 sites in the UDPGT1 gene #2 Detection of clusters of binding sites #3 Pattern discovery with promoters from a yeast gene network #4 Analysis of binding sites in the promoters of all yeast genes The code and data available at http://lenhard.cgb.ki.se/ISMB2003/

III.37

slide-124
SLIDE 124

3.4 Working Examples

#1 Searching for HNF1 sites in the UDPGT1 gene (1/3)

  • UDPGT1 (aka UGT1)

: multiple alternative transcription start sites

  • alternative

transcripts exhibit tissue specificity

  • looking for HNF-1

sites (characteristic for liver-specific expression)

  • looking at 1000 bp

upstream of TSS

III.38

slide-125
SLIDE 125
  • Example 1a: Searching for HNF1 sites in the UDPGT1 gene on

a single sequence

  • To run it with default files, cd to its directory (Example3/) and

type perl example1a.pl UGT1A*.fa

3.4 Working Examples

#1 Searching for HNF1 sites in the UDPGT1 gene (2/3)

use strict; use TFBS::Matrix::PFM; # Define files my (@input_files) = @ARGV; my $pfm= TFBS::Matrix::PFM->new( -matrixfile=>"HNF1.pfm",

  • name=>"HNF-1", -class=>"Homeo" );

my $pwm = $pfm->to_PWM; foreach my $file (sort @input_files) { print ("\nIn $file:\n\n"); my $seqstream = Bio::SeqIO->new(-file => $file); while (my $seq = $seqstream->next_seq) { my $siteset = $pwm->search_seq(-seqstring => $seq->subseq(1,1000),

  • threshold => "75%");

print $siteset->GFF; } }

[borisl@lorien Example1]$ perl example1a.pl UGT1A*.fa In UGT1A3.human.-1000.fa: SEQ TFBS TF binding site 844 857 10.298 + 0 sequence gattaatggttaat ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 851 864 19.018 + 0 sequence ggttaataattaac ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 852 865 17.550 - 0 sequence agttaattattaac ; TF "HNF-1" ; class Homeo In UGT1A3.mouse.-1000.fa: SEQ TFBS TF binding site 169 182 6.979 + 0 sequence agtttaaaactacc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 443 456 7.774 - 0 sequence ggttaagagatatc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 870 883 9.317 + 0 sequence ctttaataattaat ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 871 884 10.623 - 0 sequence cattaattattaaa ; TF "HNF-1" ; class Homeo In UGT1A6.human.-1000.fa: SEQ TFBS TF binding site 370 383 7.610 + 0 sequence agcaaatgattgac ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 371 384 9.371 - 0 sequence agtcaatcatttgc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 813 826 9.743 + 0 sequence ggttcatattaacc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 813 826 9.813 - 0 sequence ggttaatatgaacc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 840 853 13.780 + 0 sequence ggttaaatattaat ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 841 854 13.332 - 0 sequence aattaatatttaac ; TF "HNF-1" ; class Homeo In UGT1A6.mouse.-1000.fa: SEQ TFBS TF binding site 117 130 7.756 - 0 sequence agtttagttttaaa ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 203 216 7.944 + 0 sequence ggtggataatttca ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 520 533 8.103 - 0 sequence tcttaatcttcacc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 607 620 7.443 + 0 sequence agttagtgcttccc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 722 735 6.982 - 0 sequence tgctactgtttaaa ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 841 854 13.780 + 0 sequence ggttaaatattaat ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 842 855 13.332 - 0 sequence aattaatatttaac ; TF "HNF-1" ; class Homeo In UGT1A8.human.-1000.fa: SEQ TFBS TF binding site 39 52 9.562 + 0 sequence ggattatatttgac ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 184 197 7.667 + 0 sequence agtacatattttcc ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 595 608 7.993 + 0 sequence agttaaaatctaaa ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 846 859 9.964 + 0 sequence agtaaatcattggc ; TF "HNF-1" ; class Homeo

III.39

slide-126
SLIDE 126
  • Example 1b: Searching for HNF1 sites in the UDPGT1 gene on

a human:mouse alignment

  • To run it with default files, cd to its directory (Example3/) and type

perl example1b.pl UGT1A*.aln

3.4 Working Examples

#1 Searching for HNF1 sites in the UDPGT1 gene (3/3)

use strict; use TFBS::Matrix::PFM; use Bio::AlignIO; my (@input_files) = @ARGV; my $pfm= TFBS::Matrix::PFM->new( -matrixfile=>"HNF1.pfm",

  • name=>"HNF-1", -class=>"Homeo" );

my $pwm = $pfm->to_PWM; foreach my $file (sort @input_files) { print ("\nIn $file:\n\n"); my $alnstream = Bio::AlignIO->new(-file => $file, -format=>"clustalw"); while (my $aln = $alnstream->next_aln) { my $aln_prom = $aln->slice (1, $aln->column_from_residue_number($aln->get_seq_by_pos(1)->id,1000) ); my $sitepairset = $pwm->search_aln(-alignobj => $aln_prom,

  • threshold => "75%",
  • window => 50,
  • conservation => 70);

print $sitepairset->GFF; } }

[borisl@lorien Example1]$ perl example1b.pl UGT1A*.aln In UGT1A3.aln: In UGT1A6.aln: SEQ TFBS TF binding site 840 853 13.780 + 0 sequence GGTTAAATATTAAT ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 841 854 13.780 + 0 sequence GGTTAAATATTAAT ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 841 854 13.332 - 0 sequence AATTAATATTTAAC ; TF "HNF-1" ; class Homeo SEQ TFBS TF binding site 842 855 13.332 - 0 sequence AATTAATATTTAAC ; TF "HNF-1" ; class Homeo In UGT1A8.aln: [borisl@lorien Example1]$ gnome-terminal

III.40

slide-127
SLIDE 127

3.4 Working Examples

#3 Pattern discovery with promoters from a yeast gene network

  • Example 3: We start from a set of co-regulated yeast genes, grab

their promoter regions and run Gibbs sampling on them to obtain position frequency matrices from overrepresented patterns

  • To run it with default files, cd to its directory (Example3/)and type

perl example3.pl yeast_promoters.fa

use strict; use TFBS::PatternGen::Gibbs; my $patterngen= TFBS::PatternGen::Gibbs->new(-seq_file=>$ARGV[0],

  • nr_hits => 6,
  • motif_length => [8,9],
  • additional_params => '-x -e');

my $pattern_nr = 0; foreach my $pfm ($patterngen->all_patterns()) { next if $pfm->tag("MAP_score") < 0; # patterns with negative MAP score # are meaningless - a Gibbs quirk $pattern_nr++; # do something with found patterns print "Pattern $pattern_nr:\n"; print $pfm->prettyprint(), "MAP score: ", $pfm->tag("MAP_score"), "\n\n"; # draw logo to a file $pfm->draw_logo(-file => "Logo_$pattern_nr.png"); } [borisl@lorien Example3]$ perl example3.pl yeast_promoters.fa Pattern 1: A [7 7 7 2 7 7 0 7 ] C [0 0 0 0 0 0 0 0 ] G [0 0 0 0 0 0 0 0 ] T [0 0 0 5 0 0 7 0 ] MAP score: 17.36566 Pattern 2: A [0 0 0 0 4 0 0 0 1 ] C [0 6 2 6 2 0 6 0 4 ] G [6 0 0 0 0 0 0 6 0 ] T [0 0 4 0 0 6 0 0 1 ] MAP score: 17.36566 [borisl@lorien Example3]$ emacs out3.txt

III.41

slide-128
SLIDE 128

3.4 Working Examples

#2 Detection of clusters of binding sites (1/2)

  • Example 2: We define a cluster of binding sites as a set of

several binding sites satisfying a set of quantitative constraints

  • simple: number of sites present winthin a given sequence window
  • advanced: probabilistic methods
  • It is desirable to be able to define the clusters and detect them in

a set of sites detected on a single sequence or an alignment of sequences

Photoreceptors

  • 140
  • 60

Ret-1 CRX NRL NRL CRX Ret-1

  • 200
  • 40

Rhodopsin Enhancer Rhodopsin Promoter Arrestin

CRX Ret-3

  • 1760
  • 1950

AIRS

Rat senescence marker Human glucose-6-phosphatase Human Protein C

Liver Skeletal muscle

AChR-β MCK 5' MLC-1f3f

  • Myf/E Myf/E Myf/E

TEF 119 +30 Myf/E Myf/E SRF/CArG Mef-2 Novel Mef-2

  • 1256
  • 1050

Myf/E Myf/E Myf/E Mef-2 +25,000 +25,200 HNF-4 HNF-1 HNF-4 HNF-3 HNF-3

  • 240
  • 133
  • 135

C/EBP C/EBP HNF-1 C/EBP HNF-1

  • 38

HNF-1 HNF-3 C/EBP NF-I

  • 80
  • 9

NF-I

Photoreceptors

  • 140
  • 60

Ret-1 CRX NRL

  • 140
  • 60

Ret-1 CRX NRL NRL CRX Ret-1

  • 200
  • 40

NRL CRX Ret-1

  • 200
  • 40

Rhodopsin Enhancer Rhodopsin Promoter Arrestin

CRX Ret-3

  • 1760
  • 1950

AIRS CRX Ret-3

  • 1760
  • 1950

AIRS

Rat senescence marker Human glucose-6-phosphatase Human Protein C

Liver Skeletal muscle

AChR-β MCK 5' MLC-1f3f

  • Myf/E Myf/E Myf/E

TEF 119 +30 Myf/E Myf/E Myf/E TEF 119 +30 Myf/E Myf/E SRF/CArG Mef-2 Novel Mef-2

  • 1256
  • 1050

Myf/E Myf/E Myf/E Myf/E SRF/CArG Mef-2 Novel Mef-2

  • 1256
  • 1050

Myf/E Myf/E Myf/E Mef-2 +25,000 +25,200 Myf/E Myf/E Myf/E Mef-2 +25,000 +25,200 HNF-4 HNF-1 HNF-4 HNF-3 HNF-3

  • 240
  • 133

HNF-4 HNF-1 HNF-4 HNF-3 HNF-3

  • 240
  • 133
  • 135

C/EBP C/EBP HNF-1 C/EBP HNF-1

  • 38
  • 135

C/EBP C/EBP HNF-1 C/EBP HNF-1

  • 38

HNF-1 HNF-3 C/EBP NF-I

  • 80
  • 9

NF-I HNF-1 HNF-3 C/EBP NF-I

  • 80
  • 9

NF-I

III.42

slide-129
SLIDE 129

3.4 Working Examples

#2 Detection of clusters of binding sites (2/2)

use strict; use TFBS::SiteSetFilter::RegModule; use TFBS::Matrix::PFM; use Bio::SeqIO; my $seqstream= Bio::SeqIO->new(-file=>"../Example4/Allyeast_promoters.fa",

  • format=> 'fasta');

my $pfm1 = TFBS::Matrix::PFM->new( -matrixfile=>"../Example4/PAC.ma",

  • name=>"PAC" );

my $pfm2 = TFBS::Matrix::PFM->new( -matrixfile=>"../Example1/HNF1.pfm",

  • name=>"HNF-1" );

my $pwm1 = $pfm1->to_PWM(); my $pwm2 = $pfm2->to_PWM(); while (my $seq = $seqstream->next_seq()) { # analyze each sequence print $seq->id.":\n"; my $siteset= $pwm1->search_seq(-seqobj=> $seq, -threshold=> "75%"); $siteset->add_siteset ($pwm2->search_seq(-seqobj=> $seq, -threshold=> "75%")); my $filter = TFBS::SiteSetFilter::RegModule->new ( -module_elements => { "PAC" => { threshold => ”75%", min_nr_sites => 1 }, "HNF-1" => { threshold => "80%", min_nr_sites => 1 } },

  • length => 100,
  • min_score => 4

); my @modules = $filter->apply($siteset); foreach my $module (@modules) { print $module->GFF."\n"; } }

TFBS::SiteSetFilter::RegModule

new() apply()

YAL035W: YAL036C: YAL037W: YAL038W: SEQ TFBS TF binding site 83 96 6.841 + sequence TGTTAAAATTGATC ; TF "HNF-1" ; c SEQ TFBS TF binding site 84 97 6.960

  • sequence GGATCAATTTTAAC ; TF "HNF-1" ;

SEQ TFBS TF binding site 101 114 7.182 + sequence TGTAAATAAACAAT ; TF "HNF-1" ; SEQ TFBS TF binding site 102 115 7.439

  • sequence GATTGTTTATTTAC ; TF "HNF-1" ; c

SEQ TFBS TF binding site 150 158 6.040 + sequence CCATGACCC ; TF PAC ; class Unkn SEQ TFBS TF binding site 160 168 6.559

  • sequence CGACGATCC ; TF PAC ; class Unkn

YAL039C: YAL040C: YAL041W: YAL042W: YAL043C: YAL043C-A: YAL044C: YAL045C: YAL046C: YAL047C: YAL048C: YAL049C: YAL051W: YAL053W: YAL054C: YAL055W: SEQ TFBS TF binding site 405 413 9.473 + sequence TGATGAGCC ; TF PAC ; class Unkn SEQ TFBS TF binding site 411 419 8.652

  • sequence CGAGGAGGC ; TF PAC ; class Unkn

SEQ TFBS TF binding site 450 463 7.491 + sequence TGGTAATTTTTTGA ; TF "HNF-1" ; c SEQ TFBS TF binding site 451 464 9.047 + sequence GGTAATTTTTTGAA ; TF "HNF-1" ; c SEQ TFBS TF binding site 821 834 7.674

  • sequence GATTATTCTTTGAA ; TF "HNF-1" ; c

SEQ TFBS TF binding site 862 875 8.210 + sequence GGCAAAAAATTAAC ; TF "HNF-1" ; SEQ TFBS TF binding site 863 876 12.900

  • sequence TGTTAATTTTTTGC ; TF "HNF-1" ; c

SEQ TFBS TF binding site 881 894 7.454 + sequence GGTTCATAGTCCCC ; TF "HNF-1" ; SEQ TFBS TF binding site 881 889 6.040

  • sequence CTATGAACC ; TF PAC ; class Unkno

YAL056W: SEQ TFBS TF binding site 279 292 8.925

  • sequence GATTTATGATTTCA ; TF "HNF-1" ; c

SEQ TFBS TF binding site 321 334 7.606

  • sequence GGTTTTCAATTTCC ; TF "HNF-1" ;

SEQ TFBS TF binding site 345 353 7.432 + sequence CGAAGAGCA ; TF PAC ; class Unkn SEQ TFBS TF binding site 362 370 7.432

  • sequence CGATGCGCA ; TF PAC ; class Unkn

SEQ TFBS TF binding site 389 397 6.092

  • sequence CGAGGATCA ; TF PAC ; class Unkn

SEQ TFBS TF binding site 423 436 8.113

  • sequence TGTTGCTTATCAAC ; TF "HNF-1" ;

SEQ TFBS TF binding site 445 458 7.197 + sequence GGTAAAAATGTAAC ; TF "HNF-1" ; SEQ TFBS TF binding site 446 459 8.294

  • sequence TGTTACATTTTTAC ; TF "HNF-1" ; c

YAL058C-A: YAL058W: YAL059W: YAL060W:

III.43

slide-130
SLIDE 130

3.4 Working Examples

Key ideas

  • TFBS modules enable easy scripting of

common types of regulatory regions analysis

  • If your particular analysis requires

introduction of a new concept:

  • Try modeling it into an object;
  • Think it over, clean it up, document it and

share the code with the rest of us!

III.44

slide-131
SLIDE 131

Section 3.5

Developers’ Corner

III.45

slide-132
SLIDE 132

3.5 Developers’ corner

Getting involved in TFBS development

  • Public CVS
  • cvs –d :pserver:cvs@eriador.cgb.ki.se:/opt/cvs/TFBS login

Password: cvs

  • browse the repository at http://forkhead.cgb.ki.se/TFBS/
  • Full access to TFBS CVS repository
  • contact Boris.Lenhard@cgb.ki.se
  • The future of TFBS
  • At least part of the classes will move into BioPerl class hierarchy

III.46

slide-133
SLIDE 133

The end of part III

slide-134
SLIDE 134

REFLECTIONS

  • Part 1
  • Futility Theorem – Essentially predictions of individual TFBS have no

relationship to an in vivo function

  • Successful bioinformatics methods for site discrimination incorporate

additional information (clusters, conservation)

  • Future: Chromatin?
  • Part 2
  • Pattern discovery methods are severely restricted by the Signal-to-Noise

problem

  • Observed patterns must be carefully considered
  • Successful methods for pattern discovery incorporate additional information

(conservation, structural constraints on TFs)

  • Part 3
  • Online tools are available for gene-centric studies, but most

bioinformaticians will benefit from using programming resources such as TFBS

slide-135
SLIDE 135

Some Future Directions

  • Current restrictions
  • Phylogenetic footprinting methods do not take full advantage of

multiple sequence comparisons

  • Inadequate reference collections of validated regulatory

modules limits development of module discrimination tools

  • Long-term challenges
  • Pattern discovery methods are limited by noise – restricts our

ability to analyze putative regulons from microarray studies

  • Must address chromatin/structure in the nucleus