Gene Regulation Bioinformatics (Wyeth W. Wasserman, University of British Columbia)



SLIDE 1

Gene Regulation Bioinformatics

Wyeth W. Wasserman

University of British Columbia

www.cisreg.ca

SLIDE 2

BIRS 2006 2

The Grand Challenge: Reliably Define Cis-Regulatory Mechanisms of Regulons

(Figure labels: EXPRESSION DATA, SEQUENCE ANALYSIS, CO-EXPRESSED GROUPS)

SLIDE 3

REGULATORY PATHWAY INFERENCE from CO-EXPRESSED GENES

  • What is the appeal?
  • Understand how signals perceived at the cell surface result in downstream changes in cell phenotype
  • TFs occasionally serve as therapeutically relevant targets
    • PPARγ, Estrogen Receptor, Glucocorticoid Receptor
  • Builds on data from powerful profiling technologies
    • Expression profiling; ChIP-chip

SLIDE 4

Bioinformatics and Promoter Analysis

What can we do?

SLIDE 5

Binding Profiles for a TF

Position frequency matrix (counts at each of 15 positions):

A   14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C    3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G    4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T    0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

Set of binding sites:

AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
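
To illustrate how a binding profile relates to a set of sites, here is a minimal Python sketch that tallies a count matrix from aligned sites and converts it to log-odds weights. The function names, the pseudocount of 0.5 and the uniform background of 0.25 are assumptions for illustration and do not come from the slides. The eight example sites are the first eight 10-mers listed on this slide.

```python
import math

BASES = "ACGT"

def count_matrix(sites):
    """Tally base counts per column of equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    counts = {b: [0] * width for b in BASES}
    for site in sites:
        for i, base in enumerate(site.upper()):
            counts[base][i] += 1
    return counts

def log_odds(counts, pseudocount=0.5, background=0.25):
    """Convert counts to log2 odds against a uniform background (assumed)."""
    n_sites = sum(counts[b][0] for b in BASES)
    total = n_sites + 4 * pseudocount
    return {
        b: [math.log2((c + pseudocount) / total / background) for c in counts[b]]
        for b in BASES
    }

def score(weights, window):
    """Sum the per-position weights for one candidate site."""
    return sum(weights[base][i] for i, base in enumerate(window.upper()))

# First eight sites from the slide's list.
SITES = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA",
         "GAGTTAATAA", "CAGTTATTCA", "GAGTTAATAA", "CAGTTAATCA"]

counts = count_matrix(SITES)
weights = log_odds(counts)
```

A window that resembles the sites, such as `score(weights, "CAGTTAATAA")`, scores higher than an unrelated one such as `score(weights, "TTTTTTTTTT")`.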

SLIDE 6

Phylogenetic Footprinting

(Figure: % identity in a 200 bp window vs. window start position (human sequence), for ACTIN; panels labeled SELECTIVITY and SENSITIVITY)
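
The windowed percent-identity plot named on this slide can be sketched as below. This is a minimal sketch, assuming two pre-aligned sequences of equal length with `-` for gaps. The function names and the 70% cutoff are hypothetical, and window starts are reported in alignment columns rather than human sequence coordinates.

```python
def window_identity(aln_a, aln_b, window=200, step=1):
    """Percent identity of two aligned sequences in sliding windows.

    Gap columns ('-') count as mismatches.
    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [
        int(a == b and a != "-")
        for a, b in zip(aln_a.upper(), aln_b.upper())
    ]
    results = []
    for start in range(0, len(matches) - window + 1, step):
        identical = sum(matches[start:start + window])
        results.append((start, 100.0 * identical / window))
    return results

def conserved_windows(profile, cutoff=70.0):
    """Window starts whose identity meets the (assumed) cutoff."""
    return [start for start, pct in profile if pct >= cutoff]

# Toy alignment with a small window so the numbers are easy to follow.
profile = window_identity("ACGTACGT", "ACGTTTTT", window=4, step=4)
kept = conserved_windows(profile)
```

Raising the cutoff keeps fewer windows and lowering it keeps more, which is the selectivity versus sensitivity trade-off the slide labels.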

SLIDE 7

Co-Expressed Controls

Deciphering Regulation of Co-Expressed Genes

SLIDE 8

oPOSSUM Procedure

  • Set of co-expressed or co-precipitated genes
  • Automated sequence retrieval from EnsEMBL
  • Phylogenetic Footprinting
  • Detection of transcription factor binding sites
  • Statistical significance of binding sites
  • Putative mediating transcription factors

(Figure label: ORCA)

SLIDE 9

Empirical Selection of Parameters based on Reference Studies

(Figure: Z-score, axis ticks -20 to 40, plotted against Fisher p-value, 1.0E-09 to 1.0E-01, for the Muscle, Liver and NF-κB reference sets, with the Z-score cutoff and Fisher cutoff marked. Labeled TFs: p65, c-Rel, p50, NF-κB, HNF-1, SRF, TEF-1, MEF2, FREAC-2, Myf, cEBP, SP1, HNF-3β)
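
The figure combines two statistics, a Z-score and a Fisher p-value, each with a cutoff. The slide gives neither the exact formulas nor the cutoff values, so the following is only a sketch under stated assumptions: the Z-score uses a normal approximation to the binomial, and the cutoffs are left as required parameters rather than guessed. All function names are hypothetical.

```python
import math

def z_score(target_hits, target_total, background_hits, background_total):
    """Distance of the target count from the count expected at the
    background rate, in standard deviations (normal approximation)."""
    rate = background_hits / background_total
    expected = target_total * rate
    sd = math.sqrt(target_total * rate * (1.0 - rate))
    return (target_hits - expected) / sd

def passes_cutoffs(z, fisher_p, z_cutoff, fisher_cutoff):
    """A TF profile is reported only if it clears both cutoffs."""
    return z >= z_cutoff and fisher_p <= fisher_cutoff

# Toy numbers: 30 of 100 target sites vs. a 10% background rate.
z = z_score(30, 100, 100, 1000)
```

Requiring both statistics to pass reflects the two cutoff lines drawn on the plot. The values used for `z_cutoff` and `fisher_cutoff` would be chosen from reference studies, as the slide title says.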

SLIDE 10

CRM Models

Trained models take as input a set of TF binding profiles and return significant clusters of TFBS

(Figure: plot with vertical axis ticks from -0.2 to 1 and horizontal axis positions from 100 to 5840)
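
The slide does not describe how the trained models work, so the sketch below swaps in a much simpler technique to show the idea of "clusters of TFBS": it reports windows that contain predicted sites for several distinct factors. The function name, window size, minimum factor count and hit positions are all hypothetical. The factor names are borrowed from the liver set on the previous slide.

```python
def find_clusters(hits, window=200, min_factors=3):
    """Find regions where sites for several distinct factors co-occur.

    hits: list of (position, factor_name) predictions.
    Returns a list of (start, end, factors) tuples; overlapping
    qualifying windows are merged into one cluster.
    """
    hits = sorted(hits)
    clusters = []
    for i, (start, _) in enumerate(hits):
        members = [(p, f) for p, f in hits[i:] if p < start + window]
        factors = {f for _, f in members}
        if len(factors) < min_factors:
            continue
        end = members[-1][0]
        if clusters and start <= clusters[-1][1]:
            # Overlaps the previous cluster: extend it.
            prev_start, prev_end, prev_factors = clusters[-1]
            clusters[-1] = (prev_start, max(prev_end, end), prev_factors | factors)
        else:
            clusters.append((start, end, factors))
    return clusters

# Hypothetical predictions along one sequence.
HITS = [(100, "HNF-1"), (150, "HNF-3β"), (220, "cEBP"),
        (2000, "HNF-1"), (2500, "cEBP")]
clusters = find_clusters(HITS)
```

A trained model would typically weight each factor and score the window rather than simply count factors; this version only shows the windowing step.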

SLIDE 11

oPOSSUM Server

SLIDE 12

WHAT CAN WE DO?

SLIDE 13

Identifying over-represented pairs of TFBSs in co-expressed genes

Calculate a Fisher exact probability that the pair of sites is over-represented

Correct for multiple testing

(Figure labels: d, Target, Background)
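
The Fisher exact step can be sketched as follows. This is a minimal sketch, assuming a one-sided (enrichment) test on a 2x2 table of genes with and without the site pair. The function name is hypothetical, and the slide does not say which tail or implementation was used. The example counts are the CSRE/STRE row for cluster 4 from the yeast table on the next slide.

```python
from math import comb

def fisher_right_tail(target_hits, target_no_hits,
                      background_hits, background_no_hits):
    """One-sided Fisher exact p-value: probability of seeing at least
    `target_hits` hits in the target set, given fixed table margins."""
    target_size = target_hits + target_no_hits
    total_hits = target_hits + background_hits
    total = target_size + background_hits + background_no_hits
    denominator = comb(total, target_size)
    upper = min(target_size, total_hits)
    numerator = 0
    for k in range(target_hits, upper + 1):
        # Ways to pick k hits and (target_size - k) non-hits.
        numerator += comb(total_hits, k) * comb(total - total_hits, target_size - k)
    return numerator / denominator

# Cluster 4, CSRE + STRE: 15 hits / 46 no hits in the target set,
# 362 hits / 6311 no hits in the background.
# The slide reports 8.33E-07 for this row; this sketch treats the two
# sets as disjoint and is not guaranteed to reproduce that value exactly.
p_value = fisher_right_tail(15, 46, 362, 6311)
```

Exact integer arithmetic with `math.comb` avoids overflow in the intermediate binomial coefficients; only the final ratio is converted to a float.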

SLIDE 14

Over-represented Pairs of Sites in Yeast Fermentation Clusters

                             Target           Background
cluster  motif1  motif2    Hits  No hits    Hits  No hits   p-value    Adjusted
4        CSRE    STRE        15       46     362     6311   8.33E-07   6.49E-04
4        CSRE    GCR1        43       18    2881     3792   1.62E-05   1.26E-02
7        STRE    ADR1P       67      262     835     5838   6.38E-05   4.97E-02
7        STRE    PHO2        70      259     881     5792   5.63E-05   4.39E-02
7        STRE    TBP         69      260     868     5805   6.36E-05   4.96E-02
7        STRE    UASPHR      55      274     628     6045   3.77E-05   2.94E-02
7        STRE    GCR1        68      261     813     5860   1.58E-05   1.23E-02
8        STRE    CAR1_r      25      150     372     6301   2.24E-05   1.75E-02
16       PAC     RRPE       188      293    1958     4715   6.54E-06   5.10E-03
16       RRPE    XBP1       424       57    5354     1319   5.11E-06   3.98E-03
16       RRPE    SCB        411       70    5121     1552   2.78E-06   2.17E-03
16       RRPE    PHO2       425       56    5388     1285   9.28E-06   7.24E-03
16       RRPE    ROX1       273      208    3056     3617   2.09E-06   1.63E-03
16       RRPE    TBP        425       56    5362     1311   3.74E-06   2.92E-03
16       RRPE    FKH1       404       77    5097     1576   4.72E-05   3.68E-02
17       LYS14   RRPE        31       23    1857     4816   5.47E-06   4.27E-03
18       PAC     RRPE       152      206    1958     4715   1.98E-07   1.55E-04
18       RAP1    RRPE       204      154    2901     3772   3.91E-07   3.05E-04
18       RRPE    XBP1       326       32    5354     1319   3.08E-08   2.40E-05
18       RRPE    SCB        309       49    5121     1552   6.59E-06   5.14E-03
18       RRPE    PHO2       325       33    5388     1285   2.38E-07   1.86E-04
18       RRPE    TBP        323       35    5362     1311   5.07E-07   3.96E-04
18       RRPE    UASPHR     256      102    4051     2622   2.02E-05   1.57E-02
18       RRPE    FKH1       312       46    5097     1576   4.20E-07   3.28E-04
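
The slides do not state how the Adjusted column was computed. In the rows checked below, the adjusted value is close to the raw p-value multiplied by about 780, which is what a Bonferroni correction over roughly 780 tests would give. The short sketch below checks that reading on four rows; the test count of 780 and the function name are assumptions inferred from the table, not taken from the slides.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# (motif1, motif2, p-value, adjusted) copied from four rows of the table.
ROWS = [
    ("CSRE", "STRE", 8.33e-07, 6.49e-04),
    ("CSRE", "GCR1", 1.62e-05, 1.26e-02),
    ("RRPE", "XBP1", 3.08e-08, 2.40e-05),
    ("PAC", "RRPE", 1.98e-07, 1.55e-04),
]

N_TESTS = 780  # assumed; inferred from the adjusted/raw ratios

# Relative difference between the recomputed and the tabulated values.
differences = [
    abs(bonferroni(p, N_TESTS) - adjusted) / adjusted
    for _, _, p, adjusted in ROWS
]
```

Each relative difference stays under one percent, consistent with rounding of the tabulated values to three significant figures.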

SLIDE 15

What can we do?

  • Predict TFBS
  • Predict CRMs
  • Phylogenetic Footprinting
  • Motif Over-Representation
  • Motif Discovery

SLIDE 16

Gibbs Sampling

(grossly over-simplified)

Sequences:

tgacttcc
tgctacct
agacctca
ctgtagtg
acgcatct
cgatacgc
ttcgctcc

Count matrix by position:

     1  2  3  4  5  6  7  8
A    2  0  2  2  2  1  0  1
C    0  2  3  3  2  1  6  2
G    0  4  1  0  1  0  1  1
T    4  1  1  2  2  5  0  2
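
In the same over-simplified spirit, a Gibbs site sampler can be sketched in a few lines of Python. This is a minimal sketch assuming one site per sequence, a fixed motif width, a uniform background and a pseudocount of 1. All names and parameter values are hypothetical, and the sequences are the seven shown on the slide.

```python
import math
import random

BASES = "ACGT"

def build_profile(sites, pseudocount=1.0):
    """Column-wise base probabilities from the currently chosen sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    return [
        {b: (sum(s[i] == b for s in sites) + pseudocount) / total for b in BASES}
        for i in range(width)
    ]

def window_weight(profile, window, background=0.25):
    """Likelihood ratio of a window under the profile vs. the background."""
    return math.prod(col[base] / background for col, base in zip(profile, window))

def gibbs_sample(seqs, width, iterations=200, seed=0):
    """Return one sampled site start per sequence."""
    rng = random.Random(seed)
    seqs = [s.upper() for s in seqs]
    # Start from a random site in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build the profile from all sequences except the held-out one.
            others = [
                s[p:p + width]
                for j, (s, p) in enumerate(zip(seqs, starts))
                if j != i
            ]
            profile = build_profile(others)
            # Score every window of the held-out sequence, then resample its site.
            positions = list(range(len(seq) - width + 1))
            weights = [window_weight(profile, seq[p:p + width]) for p in positions]
            starts[i] = rng.choices(positions, weights=weights)[0]
    return starts

SEQS = ["tgacttcc", "tgctacct", "agacctca", "ctgtagtg",
        "acgcatct", "cgatacgc", "ttcgctcc"]
starts = gibbs_sample(SEQS, width=4, iterations=20)
```

Because each site is resampled in proportion to its weight rather than set to the best window, the sampler can escape poor starting choices, at the cost of results that vary with the random seed.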

SLIDE 17

There are problems…

Exploring limitations

SLIDE 18

Combinatorial interactions between TFs

SLIDE 19

Why can’t we do better?

  • Predict TFBS

SLIDE 20

Futility Conjecture

Human Cardiac α-Actin gene analyzed with the JASPAR set of profiles

(each vertical line represents a TFBS prediction)

Futility Conjecture: TFBS predictions are almost always wrong

Red boxes are protein-coding exons; TFBS predictions within them are excluded in this analysis

SLIDE 21

Why can’t we do better?

  • Predict TFBS
  • Predict CRMs

SLIDE 22

Cis-regulatory modules (CRMs) for specific expression in hepatocytes

SLIDE 23

Why can’t we do better?

  • Predict TFBS
  • Predict CRMs
  • Phylogenetic Footprinting

SLIDE 24

Regulatory Resolution Varies Widely Between Genes

Gene: NR2E1

SLIDE 25

Why can’t we do better?

  • Predict TFBS
  • Predict CRMs
  • Phylogenetic Footprinting
  • Motif Over-Representation

SLIDE 26

Ets TF Family

Structural classes of TFs often bind identical target sequences – we cannot specify which TF interacts with a motif.

SLIDE 27

Challenges for Motif Over-Representation

  • Methods fail when noise (genes not co-regulated) exceeds 20-50%
  • Most expression profiling experiments are not sufficiently resolved to identify such co-regulated clusters
  • Works well for studies linked to a primary TF response, but fails over long time periods or complex (multi-pathway) responses

SLIDE 28

Why can’t we do better?

  • Predict TFBS
  • Predict CRMs
  • Phylogenetic Footprinting
  • Motif Over-Representation
  • Motif Discovery

SLIDE 29

Applied Pattern Discovery is Acutely Sensitive to Noise

True Mef2 Binding Sites

(Figure: PATTERN SIMILARITY vs. TRUE MEF2 PROFILE, axis ticks 10 to 18, plotted against SEQUENCE LENGTH, axis ticks 100 to 600)

Pink line is negative control with no Mef2 sites included

SLIDE 30

The Signal-to-Noise Battle

  • Background models
  • Phylogenetic footprinting
  • Motif combinations
  • Familial Binding Profiles
  • Concurrent motif discovery and expression clustering

SLIDE 31

Where are we going now?

Snippets of Active Projects

SLIDE 32

An impending transition in promoter analysis…

  • Transitions in promoter analysis algorithms are separated by periods of slow progress
  • Focus on same tired reference collections using progressively more convoluted algorithms
  • Advances can be triggered by new data-producing technologies, but more commonly come from adopting principles well known to laboratory researchers
    • CpG islands; CRMs; phylogenetic footprinting
  • The next transition: Incorporating data from laboratory studies

SLIDE 33

Informed Motif Discovery

Enhance the Signal or Reduce the Noise

SLIDE 34

Informed Initial Choice

SLIDE 35

SLIDE 36

FBPs enhance sensitivity of pattern detection

SLIDE 37

A new direction?

  • Laboratory (WET) data indicating the locations of regulatory regions and/or specific TFBS can constrain the motif discovery process to improve the success rate
  • Extension: We should be able to determine how much WET data is required for successful prediction

SLIDE 38

(Flow diagram: TF binding data and rod-specific genes → METHOD → predicted regulatory regions → METHOD → identification of over-represented patterns corresponding to putative TFBS)

SLIDE 39

Knowledge Directed CRM Discovery

(Flow diagram, labels in reading order)

  • Co-expressed genes
  • Retrieve orthologs
  • Align sequences
  • Phylogenetic footprinting
  • Prior prob of being part of a RR
  • Prior prob of being part of a TFBS
  • Known RR; Known TFBS; Profile for known TF
  • Pattern discovery algorithm: 1) Sample regions 2) Sample sites within regions
  • CRMs, TFBS and profiles
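
One way to read "prior prob of being part of a RR" is as a per-position weight that multiplies the sampler's likelihood when a site is drawn. The sketch below shows that idea only. The function names, the inside and outside prior values, and the product rule for combining prior and likelihood are assumptions; the slide does not specify how the priors enter the algorithm.

```python
import random

def region_prior(length, known_regions, inside=0.9, outside=0.1):
    """Per-position prior: higher inside known regulatory regions.

    known_regions: list of (start, end) half-open intervals.
    """
    prior = [outside] * length
    for start, end in known_regions:
        for i in range(start, min(end, length)):
            prior[i] = inside
    return prior

def sample_start(likelihoods, priors, rng=random):
    """Draw a start position with probability proportional to
    likelihood x prior; positions with prior 0 are never chosen."""
    weights = [l * p for l, p in zip(likelihoods, priors)]
    if sum(weights) <= 0:
        raise ValueError("no position has non-zero weight")
    return rng.choices(range(len(weights)), weights=weights)[0]

# Hard constraint for illustration: only positions 2..4 are allowed.
rng = random.Random(1)
priors = region_prior(10, [(2, 5)], inside=1.0, outside=0.0)
picks = {sample_start([1.0] * 10, priors, rng) for _ in range(200)}
```

Softer priors (for example the 0.9 and 0.1 defaults) let laboratory data guide the search without forbidding sites outside the annotated regions.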

SLIDE 40

ROC curve (exons excluded)

(Figure: sensitivity plotted against 1 - specificity, both axes from 0 to 1, with one curve each for windows = 10, 20, 50, 100, 200 and 300)
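
For reference, the points of an ROC curve such as this one can be computed from scored predictions and known labels as in the sketch below. The function names and the toy data are hypothetical; the slide does not say how its scores were produced or what the window settings control.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per threshold.

    labels: True for known positives, False for known negatives.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

# Toy data in which every positive outscores every negative.
points = roc_points([0.9, 0.8, 0.7, 0.6], [True, True, False, False])
auc = area_under_curve(points)
```

Computing one such curve per parameter setting gives the family of curves shown in the figure.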

SLIDE 41

Software Just Finished

  • Test all forms of prior knowledge
    • CRM Length
    • Locations of Known CRMs
    • Location of Known TFBS
    • PSSMs for Contributing TFs
    • Etc.
  • A limitation: Where to get organized prior data?

SLIDE 42

Open-access regulatory sequence repository – an information mall

Stefan Kirov Elodie Portales-Casamar Jonathan Lim Jay Snoddy

SLIDE 43

PAZAR

Grand Bazaar, Istanbul

SLIDE 44

JASPAR: AN OPEN-ACCESS DATABASE OF TF BINDING PROFILES

SLIDE 45

SLIDE 46

SLIDE 47

SLIDE 48

SLIDE 49

SLIDE 50

Retrieval/Browsing Interface

SLIDE 51

Status

  • PAZAR – Database Implemented
  • API/Perl Modules – Available
  • Streamlined Submission Interface – Available
  • COHO - In Progress
  • Release impending
  • Open-Access/Open-Software: see www.pazar.info for details

SLIDE 52

Putting It All Together

SLIDE 53

SLIDE 54

Final Thoughts

  • The grand challenge remains for the analysis of co-regulated human genes
  • Significant progress in the past five years suggests that we will be able to decipher regulatory mechanisms for targeted experiments
  • Numerous attractive problems remain available for bioinformatics students

SLIDE 55

Thanks!

  • Jay Snoddy
  • Stefan Kirov (BMS)

VANDERBILT

  • CIHR
  • IBM
  • MSFHR
  • MerckFrosst
  • GenomeBC
  • GenomeCanada
  • CFI

$

  • James Mortimer
  • Brian Kennedy
  • Elodie Portales-Casamar
  • David Martin
  • David Arenillas
  • Jochen Brumm
  • Alice Chou
  • Debra Fulton
  • Miroslav Hatas
  • Shannan Ho Sui
  • Andrew Kwon
  • Jonathan Lim
  • Dora Pak
  • Raf Podowski
  • Diane Wu
  • Dimas Yusuf

THE AMAZING PEOPLE WHO DID THE WORK!