SLIDE 1

Brendan Frey

Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Brendan J. Frey University of Toronto

SLIDE 2

Purpose of my talk

To identify aspects of bioinformatics in which attendees of ISIT may be able to make significant contributions

SLIDE 3

Beyond Genomics: Detecting Codes and Signals in the Cellular Transcriptome

Brendan J. Frey University of Toronto

SLIDE 4

The Genome

SLIDE 5

Starting point: Discrete biological sequences

  • Symbols are Bases: G, C, A, T
  • Examples of biological sequences

– Genes
– Peptides
– DNA
– RNA
– Chromosomes
– Viruses
– Proteins
– HIV

RED indicates a definition that you should remember

SLIDE 6

DNA Sequence (GCATTCATGC…)

Sexual cell reproduction

Nucleus

Chromosomes: Inherited DNA sequence

Cell replication

SLIDE 7

The genome

  • Genome: Chromosomal DNA sequence from an organism or species
  • Examples

Genome   Length (bases)
Human    3,000 million (750MB)
Mouse    2,600 million
Fly      100 million
Yeast    13 million
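The 750MB figure for the human genome is consistent with storing each base in 2 bits, since there are four symbols. The slide does not state the encoding, so the 2-bit assumption below is ours; the sketch simply redoes the arithmetic for every genome in the table.

```python
# Assumption (not stated on the slide): each base is stored in 2 bits,
# because four symbols (G, C, A, T) need log2(4) = 2 bits.
BITS_PER_BASE = 2

def genome_size_megabytes(num_bases: int) -> float:
    """Storage for a sequence at 2 bits per base, in megabytes (10**6 bytes)."""
    return num_bases * BITS_PER_BASE / 8 / 1e6

# Genome lengths as listed on the slide.
genome_lengths = {
    "Human": 3_000_000_000,
    "Mouse": 2_600_000_000,
    "Fly": 100_000_000,
    "Yeast": 13_000_000,
}

sizes = {name: genome_size_megabytes(n) for name, n in genome_lengths.items()}
for name, megabytes in sizes.items():
    print(f"{name}: {megabytes:g} MB")
```

Under this assumption the human genome comes out at 750 MB, matching the slide.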

SLIDE 8

Genes

  • A gene is a subsequence of the genome that encodes a functioning bio-molecule
  • The library of known genes

– Comprises only 1% of genome sequence
– Increases in diversity every year
– Is probably far from complete

SLIDE 9

The Transcriptome

SLIDE 10

Genome: The digital backbone of molecular biology

Transcripts: Perform functions encoded in the genome

SLIDE 11

Traditional genes

DNA → Transcript (RNA) → Protein

Transcription: Input: DNA, Output: Transcript
Translation: Input: Transcript, Output: Protein

SLIDE 12

Traditional genes

DNA → Transcript (RNA) → Protein

Transcription: DNA to transcript (RNA)
Translation: Transcript (RNA) to protein

Genome (DNA), Transcriptome (transcripts), Proteome (proteins)

SLIDE 13

Transcription

Upstream region DNA

… …

Exon Intron Regulatory proteins Transcription proteins Transcript (RNA) Gene

SLIDE 14

Transcription

Upstream region DNA

… …

Exon Intron

SLIDE 15

CGTGGATAGTGAT

Regulatory protein Upstream region DNA

… …

Exon

Transcription

  • Code: Set of regulatory codewords
  • Signals: Concentrations of regulatory proteins and the output transcript
  • Codewords in the upstream region bind to corresponding regulatory proteins
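To make the "codeword" picture concrete, here is a minimal sketch that scans an upstream region for a small set of codewords. The upstream sequence is the one shown on the slide; the codewords and protein names are hypothetical, chosen only for illustration, and real binding is not an exact string match.

```python
def find_codewords(upstream: str, codewords: dict) -> list:
    """Return (position, codeword, protein) for every exact occurrence.

    Exact matching is a simplification used only for this sketch.
    """
    hits = []
    for word, protein in codewords.items():
        start = upstream.find(word)
        while start != -1:
            hits.append((start, word, protein))
            start = upstream.find(word, start + 1)  # allow overlapping hits
    return sorted(hits)

upstream = "CGTGGATAGTGAT"  # sequence shown on the slide

# Hypothetical codewords and regulatory proteins (not from the talk).
codewords = {"GATAG": "regulator_A", "GTGAT": "regulator_B"}

hits = find_codewords(upstream, codewords)
for position, word, protein in hits:
    print(f"{word} at position {position} binds {protein}")
```

In the talk's terms, the set of codewords is the code, and the concentrations of the matching proteins and of the output transcript are the signals.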

SLIDE 16

Transcript (RNA) Intron Exon … …

Splicing of transcripts

Regulatory proteins

SLIDE 17

Transcript (RNA) Intron Exon … …

Splicing of transcripts

Regulatory proteins Splicing proteins

SLIDE 18

Transcript (RNA) Intron Exon … …

Splicing of transcripts

  • The intron is spliced out
  • However, splicing may occur quite differently…

SLIDE 19

Splicing of transcripts

Transcript (RNA) Intron Exon … … Regulatory proteins Splicing proteins …

SLIDE 20

Splicing of transcripts

… Regulatory proteins Splicing proteins

SLIDE 21

Splicing of transcripts

… Splic Regulatory proteins

The middle exon is ‘skipped’, leading to a different transcript
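Exon skipping can be sketched as joining exons in order while leaving one out. The exon sequences below are hypothetical placeholders, not sequences from the talk.

```python
def splice(exons, skip=()):
    """Join exons in genomic order, omitting those whose index is in `skip`."""
    return "".join(seq for i, seq in enumerate(exons) if i not in skip)

# Hypothetical exon sequences, for illustration only.
exons = ["AUGGCU", "GAUUAC", "CCGUAA"]

full_transcript = splice(exons)               # all three exons retained
skipped_transcript = splice(exons, skip={1})  # middle exon skipped

print(full_transcript)
print(skipped_transcript)
```

The same stretch of DNA thus yields two different spliced transcripts, depending on which exons are kept.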

SLIDE 22

TTAGAT

Regulatory proteins

Splicing of transcripts

  • Code: Set of regulatory codewords
  • Signals: Concentrations of regulatory proteins and different spliced transcripts
  • Codewords in the introns and exons bind to corresponding regulatory proteins

TGGGGT

SLIDE 23

The modern transcriptome

Genome

Cell nucleus

Transcript (mRNA)

SPLICING TRANSCRIPTION Liver

Transcript (RNA) Protein

TRANSLATION

Non-functional transcripts

SPLICING Brain

mRNA Protein A

SPLICING Liver

mRNA Protein B Non-traditional transcript

TRANSCRIPTION TRANSCRIPTION Brain and Liver

SLIDE 24

The modern transcriptome

Genomic DNA

Cell nucleus

Protein

TRANSLATION

Protein Protein

SPLICING TRANSCRIPTION in Liver TRANSCRIPTION in Brain and Liver SPLICING Brain SPLICING Liver

ncRNA

TRANSCR.

Non-traditional transcript Transcript (RNA) Spliced transcript (mRNA) mRNA mRNA

Brain Liver

Non-functional transcripts

… it turns out to be surprising in many ways

Alternative transcripts

# genes, ½ trans, 60% AS, 18k AS, 20% dis, 10k ncRNA

SLIDE 25

The Resources

SLIDE 26

Your collaborators can do lab work…

  • Sequencing: Snag an actual transcript and figure out its sequence
  • Microarrays: Find out if your predicted transcript fragment is expressed in a tissue sample
  • Mass spectrometry: Find out if a protein is present in a sample

SLIDE 27

Databases

  • Genomes
  • Genome annotations
  • Libraries of observed transcript fragments
  • Microarray datasets containing measured concentrations of transcripts

SLIDE 28

Measuring transcript concentrations using microarrays

  • 1. Fabricate microarray with probes
  • 2. Extract transcripts from cell
  • 3. Add fluorescent tag
  • 4. Hybridize tagged sequence to microarray
  • 5. Excite fluorescent tag with laser and measure intensity

[Figure: a cell, tagged transcript sequences (T C G G T C A C A T) and microarray probes (A G C T A G T G T A T C A A G C G G T G T T G A A; A G C C A G T G T A)]
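The hybridization step relies on base pairing: a tagged sequence sticks to a probe whose bases are complementary (A with T, G with C). The sketch below checks this for sequences read off the slide's figure. Splitting the figure text into 10-base probes is our assumption, and the comparison is position by position, as the figure is drawn; in practice the two strands run in opposite directions, so tools compare against the reverse complement.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(seq: str) -> str:
    """Base-by-base complement of a DNA sequence (no reversal)."""
    return "".join(COMPLEMENT[base] for base in seq)

def hybridizes(tagged: str, probe: str) -> bool:
    """True if every base of `tagged` pairs with the probe base at the same position."""
    return len(tagged) == len(probe) and complement(tagged) == probe

tagged = "TCGGTCACAT"                   # tagged sequence from the figure
probes = ["AGCTAGTGTA", "AGCCAGTGTA"]   # assumed 10-base split of the figure's probe text

matches = [probe for probe in probes if hybridizes(tagged, probe)]
print(matches)
```

Only the probe that pairs at every position is reported; the other differs at a single base.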

SLIDE 29

Inkjet printer technology

Hughes et al, Nature Biotech 2001

Print nucleic acid sequences using inkjet printer

SLIDE 30

Then and now…

  • First microarrays (late 1990s)

– ‘Cancer chips’, ‘gene chips’, …
– 5,000-10,000 probes per slide
– Noisy

  • Current microarrays

– ‘Sub-gene resolution’
– 200,000 probes per slide
– Low noise
– Multi-chip designs are cost effective

SLIDE 31

The Case Study: Discovering protein-making transcripts using factor graphs

BJ Frey, …, TR Hughes Nature Genetics, September 2005

SLIDE 32

Controversy about the gene library

Despite Frey et al’s impressive computational reconstruction of gene structure, we argue that this does not prove the complexity of the transcriptome

– FANTOM/RIKEN Consortium, Science, March 2006

How it all started…

SLIDE 33

Research on the transcriptome

Analysis of genome Detection of transcripts

Our project

2001-2005 1960’s-2000 2001-2006

2003-2005

SLIDE 34

Estimates of number of undiscovered genes

2000 2001 2002 2003 2004 2005

Genome: ~10,000

(IHGSC, Nature)

Genome: ~3000

(IHGSC, Nature)

Kapranov et al, Rinn et al, Shoemaker et al: ~300,000

Bertone et al: ~11,000

(Science)

SLIDE 35

Our microarrays

[Figure: number of probes per 8000 bases and number of known exons per 8000 bases, plotted against coordinates (in bases) in Chromosome 4]

  • Our genome analysis highlighted 1 million possible exons (~180,000 already known)
  • We designed one 60-base probe for each possible exon

SLIDE 36

Our samples (37 tissues)

Twelve pools of mouse mRNA

Pool  Composition (mRNA per array hybridization)
1     Heart (2 µg), Skeletal muscle (2 µg)
2     Liver (2 µg)
3     Whole brain (1.5 µg), Cerebellum (0.48 µg), Olfactory bulb (0.15 µg)
4     Colon (0.96 µg), Intestine (1.04 µg)
5     Testis (3 µg), Epididymis (0.4 µg)
6     Femur (0.9 µg), Knee (0.4 µg), Calvaria (0.06 µg), Teeth+mandible (1.3 µg), Teeth (0.4 µg)
7     15d Embryo (1.3 µg), 12.5d Embryo (12.5 µg), 9.5d Embryo (0.3 µg), 14.5d Embryo head (0.25 µg), ES cells (0.24 µg)
8     Digit (1.3 µg), Tongue (0.6 µg), Trachea (0.15 µg)
9     Pancreas (1 µg), Mammary gland (0.9 µg), Adrenal gland (0.25 µg), Prostate gland (0.25 µg)
10    Salivary gland (1.26 µg), Lymph node (0.74 µg)
11    12.5d Placenta (1.15 µg), 9.5d Placenta (0.5 µg), 15d Placenta (0.35 µg)
12    Lung (1 µg), Kidney (1 µg), Adipose (1 µg), Bladder (0.05 µg)

SLIDE 37

Signal: The data

(small part of the data from Chromosome 4)

Example of a transcript

Code:

A ‘vector repetition code with deletions’

Each column is an expression profile
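One way to read "vector repetition code with deletions" is that a transcript's expression profile (a vector over the twelve pools) is repeated at every exon probe, while probes that are not exons show no copy of it. That reading, the prototype values and the noise level below are all our assumptions; this is a toy simulation, not the model from the talk.

```python
import random

def simulate_probes(prototype, exon_flags, noise_sd=0.05, seed=0):
    """Return one expression profile (column) per probe, in genomic order.

    Exon probes repeat the prototype profile plus noise; non-exon probes
    (the 'deletions' in this sketch) carry noise only.
    """
    rng = random.Random(seed)
    columns = []
    for is_exon in exon_flags:
        clean = prototype if is_exon else [0.0] * len(prototype)
        columns.append([value + rng.gauss(0.0, noise_sd) for value in clean])
    return columns

def distance(a, b):
    """Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical prototype: relative expression in each of the 12 pools.
prototype = [0.9, 0.1, 0.8, 0.2, 0.1, 0.7, 0.3, 0.1, 0.6, 0.2, 0.1, 0.5]
exon_flags = [True, False, True, True, False, True]

columns = simulate_probes(prototype, exon_flags)
for is_exon, column in zip(exon_flags, columns):
    label = "exon" if is_exon else "gap "
    print(label, round(distance(column, prototype), 2))
```

Columns from exon probes sit close to the prototype, and the others do not. Detecting a transcript then amounts to finding runs of probes that share a profile.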

SLIDE 38

The transcript model

Each transcript is modeled using

  • A prototype expression profile
  • # probes before prototype (eg, 1)
  • # probes after prototype (eg, 4)
  • Flag indicating whether each probe corresponds to an exon

e e e e e

SLIDE 39

The factor graph

[Figure: factor graph over probes 1, …, n in genomic order, with nodes r_i, t_i, s_i, e_i and x_i for each probe i]

Labels in the figure:

– Transcription start/stop indicator
– Relative index of prototype
– Exon versus non-exon indicator
– Expression profile (genomic order)
– Probe sensitivity & noise

The prototype for x_i is x_{i+r_i}, r_i ∈ {-W, …, W}. We use W = 100

ONLY 1 FREE PARAMETER: κ, probability of starting a transcript

SLIDE 40

[Figure: factor graph over probes 1, …, n in genomic order, with nodes r_i, t_i, s_i, e_i and x_i for each probe i]

Labels in the figure:

– Transcription start/stop indicator
– Relative index of prototype
– Exon versus non-exon indicator
– Expression profile (genomic order)
– Probe sensitivity & noise

After expression data (x) is observed, the factor graph becomes a tree

SLIDE 41

After expression data (x) is observed, the factor graph becomes a tree

[Figure: the resulting tree over probes 1, …, n, with nodes r_i, t_i, s_i and e_i for each probe i]

Labels in the figure:

– Transcription start/stop indicator
– Relative index of prototype
– Exon versus non-exon indicator
– Probe sensitivity & noise

Computation: The max-product algorithm performs exact inference and learning.
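Max-product is exact on trees, and a chain is the simplest tree. The sketch below runs max-product (in the log domain) on a two-state chain where each probe is either background or inside a transcript. This toy model, its stop probability and its detection probabilities are hypothetical and far simpler than the factor graph in the talk; only κ, the probability of starting a transcript, is borrowed from the slides, and its value here is made up.

```python
import math

def max_product_chain(log_prior, log_trans, log_lik):
    """Most probable state sequence on a chain via max-product (log domain).

    log_prior[s]     : log P(state_1 = s)
    log_trans[s][s2] : log P(state_{i+1} = s2 | state_i = s)
    log_lik[i][s]    : log P(observation_i | state_i = s)
    """
    num_states = len(log_prior)
    score = [log_prior[s] + log_lik[0][s] for s in range(num_states)]
    backpointers = []
    for i in range(1, len(log_lik)):
        new_score, pointers = [], []
        for s2 in range(num_states):
            best = max(range(num_states), key=lambda s: score[s] + log_trans[s][s2])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][s2] + log_lik[i][s2])
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(num_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

kappa = 0.1      # probability of starting a transcript (value is hypothetical)
stop = 0.2       # hypothetical probability of ending a transcript
detect = 0.99    # hypothetical P(signal | in transcript) = P(no signal | background)

log_prior = [math.log(1 - kappa), math.log(kappa)]
log_trans = [[math.log(1 - kappa), math.log(kappa)],
             [math.log(stop), math.log(1 - stop)]]
observations = [0, 0, 1, 1, 1, 0, 0]   # 1 = probe shows signal
log_lik = [[math.log(detect if obs == s else 1 - detect) for s in (0, 1)]
           for obs in observations]

path = max_product_chain(log_prior, log_trans, log_lik)
print(path)
```

With observations this clean, the decoded path marks probes 3 to 5 as inside a transcript. The point of the sketch is only that one forward pass and one backward trace give the exact maximizer.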

SLIDE 42

Summary of results *

  • 10× more sensitive than other transcript-based methods
  • Detected 155,839 exons
  • Predicted ~30,000 new exons
  • Reconciled discrepancies in thousands of known transcripts

* Exon false positive rate: 2.7%

SLIDE 43

Frey et al: ~0

(Nature Genetics)

SURPRISE!

Revisiting Estimates of number of undiscovered genes

2000 2001 2002 2003 2004 2005

Genome: ~10,000

(IHGSC, Nature)

Genome: ~3000

(IHGSC, Nature)

Kapranov et al, Rinn et al, Shoemaker et al: ~300,000

Bertone et al: ~11,000

(Science)

SLIDE 44

2000 2001 2002 2003 2004 2005

Bertone et al: ~11,000

(Science)

Frey et al: ~0

(Nature Genetics)

SURPRISE! Contentious results

FANTOM3: 5,154

(FANTOM Consortium, Science)

SLIDE 45

… [We discovered] new mouse protein-coding transcripts, including 5,154 encoding previously-unidentified proteins …

– FANTOM/RIKEN Consortium, Science, Sep 2005

We wondered: Are these really new genes?

SLIDE 46

… we found that 2917 of the FANTOM proteins are in fact splice isoforms of known transcripts …

– Frey et al, Science, March 2006

… the number of new protein-coding genes found by us has been revised from 5154 to 2222…

– FANTOM/RIKEN Consortium, Science, March 2006

SLIDE 47

Last word…

… the number of completely new protein-coding genes discovered by the FANTOM consortium is at most in the hundreds…

– Frey et al, Science, March 2006

SLIDE 48

The Closing Remarks

SLIDE 49

Open problems

  • Producing genome-wide libraries of functioning transcripts, including

– Alternatively-spliced transcripts
– Transcripts that don’t make proteins

  • Understanding functions of transcripts
  • Developing models of how transcription and alternative splicing are regulated
  • Developing models of gene interactions

– ‘Genetic networks’

SLIDE 50

Should you work in computational biology?

Pluses

  • A major scientific frontier
  • Potential for high impact on society

Minuses

  • Mostly a collection of facts
  • Mechanisms are complex and beyond our control
  • Lacking a mathematical framework
SLIDE 51

Remember, communication theory also once lacked a mathematical framework…

“Ok, Zorg, let’s try using a prefix code”

SLIDE 52

Should you work in computational biology?

Minuses

  • Mostly a collection of facts
  • Mechanisms are complex and beyond our control

Pluses

  • A major scientific frontier
  • Potential for high impact on society
  • Lacking a mathematical framework
SLIDE 53

How do you enter this field?

  • Hire a tutor (ie, student or postdoc)
  • Hire a programmer
  • Get involved in several ‘winner’ projects
  • Be prepared to drop ‘loser’ projects
  • Build mutually-beneficial collaborations
  • How long will it take?
SLIDE 54

For more information…

  • As of Friday July 14, 2006: http://www.psi.toronto.edu/isit2006.html

– These slides
– Pointers to helpful papers, databases, etc

SLIDE 55

Acknowledgements

  • Frey Group

– Quaid D Morris (postdoc)
– Leo Lee (postdoc)
– Yoseph Barash (postdoc)
– Ofer Shai (PhD)
– Inmar Givoni (PhD)
– Jim Huang (PhD)
– Marc Robinson (programmer)

Genomics Collaborators

  • Hughes’ Lab
  • Blencowe’s Lab
  • Emili’s Lab
  • Boone’s Lab

Medical Collaborators: E Sat, J Rossant, BG Bruneau, JE Aubin