Sequencing and decoding genomes C. Victor Jongeneel, PhD Ludwig - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Sequencing and decoding genomes

  • C. Victor Jongeneel, PhD

Ludwig Institute for Cancer Research
Swiss Institute of Bioinformatics
Center for Integrative Genomics, U. of Lausanne

Victor.Jongeneel@licr.org

slide-2
SLIDE 2

The Computational Biology Challenge

"In principle, the string of genetic bits holds long-sought secrets of human development, physiology and medicine. In practice, our ability to transform such information into understanding remains woefully inadequate."

International Human Genome Sequencing Consortium, "Initial sequencing and analysis of the human genome," Nature 409: 860-921 (2001) [Emphasis added]

slide-3
SLIDE 3

Outline of the talk

  • All about genomes
  • Sequencing technologies, old and new
  • From a test tube full of DNA to a genome sequence
  • And now, where are the genes?
  • Presenting the results: genome browsers
slide-4
SLIDE 4

What is a genome?

  • Set of genes transmitted between generations
  • Made of deoxyribonucleic acid (DNA)
  • A long string in a four-letter alphabet (A,C,G,T)
  • Present in every cell of the body
  • Contains the master plan for building an organism
  • Contains all of the instructions to keep a cell alive and to allow it to divide

  • Copies itself at every cell division
slide-5
SLIDE 5

Genome complexity

  • Sizes:
      • viruses: 10^3 to 10^5 nt
      • bacteria: 10^5 to 10^7 nt
      • Baker's yeast: 1.35 x 10^7 nt
      • mammals: 3-4 x 10^9 nt
      • plants: 10^8 to 10^11 nt
  • Numbers of genes:
      • virus: 3 to 100
      • bacteria: 1,000 to 5,000
      • Baker's yeast: ~6,000
      • mammals: 20,000 to 30,000
slide-6
SLIDE 6

Information carried by DNA

[Figure: schematic of a chromosome, with the labels: centromere, telomere, gene, regulatory elements, exons of genes, locus control region, repetitive sequences. NB: this drawing is not to scale.]

slide-7
SLIDE 7

Structure of a typical vertebrate gene

[Figure: diagram of a gene; surviving labels: 5'-UTR, 3'-UTR, Stop, (CpG islands)]

slide-8
SLIDE 8

The human genome

  • Size: 3 x 10^9 nt for a single copy (haploid)
  • Highly repetitive sequences (>1000 copies): 25%
  • Middle repetitive sequences: 25-30%, i.e. >50% repetitive in total
  • Sizes of genes: from 900 to >2,000,000 nt (including introns)
  • Proportion encoding proteins: 5-7%
  • Number of chromosomes: 22 autosomes, 2 sex-linked chromosomes (X and Y)
  • Sizes of chromosomes: 5 x 10^7 to 5 x 10^8 nt
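
To give a feel for the "string in a four-letter alphabet" view, here is a rough storage estimate for one haploid copy. It is only a sketch: it assumes a plain 2-bit encoding per nucleotide with no compression, and the function name is made up for the illustration.

```python
# Back-of-the-envelope storage estimate for a haploid human genome.
# Assumption: 2 bits per nucleotide (4 letters = 2**2), no compression.

GENOME_SIZE_NT = 3 * 10**9  # haploid size quoted on the slide
BITS_PER_NT = 2             # log2(4) for the alphabet A, C, G, T


def genome_bytes(size_nt: int, bits_per_nt: int = BITS_PER_NT) -> int:
    """Return the number of bytes needed to store size_nt nucleotides."""
    total_bits = size_nt * bits_per_nt
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    n_bytes = genome_bytes(GENOME_SIZE_NT)
    print(f"{n_bytes:,} bytes (~{n_bytes / 10**6:.0f} MB)")
```

Under these assumptions the raw sequence fits in about 750 MB; the hard part is interpreting it, not storing it.
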
slide-9
SLIDE 9

Outline of the talk

  • All about genomes
  • Sequencing technologies, old and new
  • From a test tube full of DNA to a genome sequence
  • And now, where are the genes?
  • Presenting the results: genome browsers
slide-10
SLIDE 10

Current sequencing technology

  • Sequencing method developed by Fred Sanger in the 1970s
  • Principle: randomly terminate synthesis of a DNA strand using a nucleotide analogue, then separate the products by electrophoresis in a polyacrylamide gel
  • Many technological improvements:
      • Use of fluorescently labeled nucleotides
      • Use of pre-cast gels in capillaries
      • Automation of sample handling
      • Use of microfluidics technologies
slide-11
SLIDE 11

Sequencing machines

slide-12
SLIDE 12

Sequencing machines – a current model

slide-13
SLIDE 13

Sequencing centre

slide-14
SLIDE 14

Raw data

slide-15
SLIDE 15

Typical output of current sequencing technology

  • One machine handles 500 samples per day
  • Each sample produces 700 nt of raw sequence
  • A typical centre will host a hundred sequencing machines
  • A typical centre can produce 30 million nucleotides' worth of sequence every working day
  • BUT this requires a very large investment and is labor-intensive
  • There are fewer than 100 such centres worldwide
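
The daily figure can be checked with a few lines of arithmetic. This is a sketch using only the numbers on the slide; the function name is illustrative, and reading the gap between raw capacity and the quoted 30 million as unusable samples is an assumption, not something the slide states.

```python
# Raw capacity of a "typical" Sanger sequencing centre, from the slide's numbers.

SAMPLES_PER_MACHINE_PER_DAY = 500
RAW_NT_PER_SAMPLE = 700
MACHINES_PER_CENTRE = 100
QUOTED_NT_PER_DAY = 30_000_000  # figure quoted on the slide


def raw_capacity_nt_per_day(machines: int, samples: int, nt_per_sample: int) -> int:
    """Upper bound on raw nucleotides per day if every sample yields a full read."""
    return machines * samples * nt_per_sample


raw = raw_capacity_nt_per_day(
    MACHINES_PER_CENTRE, SAMPLES_PER_MACHINE_PER_DAY, RAW_NT_PER_SAMPLE
)
# Assumed interpretation: the quoted output is the usable share of raw capacity.
usable_fraction = QUOTED_NT_PER_DAY / raw

if __name__ == "__main__":
    print(f"raw capacity: {raw:,} nt/day, quoted/raw = {usable_fraction:.2f}")
```

The raw upper bound comes out at 35 million nt per day, so the quoted 30 million corresponds to roughly 86% of it.
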
slide-16
SLIDE 16

A real sequencing centre: DOE JGI

Date(s)                       Total Q20* Bases   Total Lanes**   % Passed†   Ave. Read Length‡
2/26/06: ABI3730              42.672 Million     65,952          92.77%      696
12/16/04: MegaBACE4000        3.512 Million      6,528           88.31%      606
2/26/06: MegaBACE4500         15.386 Million     23,424          83.27%      787
Current month (2/06)          2.147 Billion      2,047,968       94.31%      717
Last month (1/06)             2.807 Billion      4,187,132       93.69%      714
FY to Date (10/05-2/26/06)    13.056 Billion     20,517,314      95%         625
Total (3/99-2/26/06)          109.947 Billion    185,133,995     92%         653

slide-17
SLIDE 17

New sequencing technologies

  • Current leaders: 454 Life Sciences, Solexa
  • Main technological advances:
      • Single molecule amplification on solid supports: microbeads in emulsion (454), surfaces with DNA primers (Solexa)
      • Bulk sequencing technologies: pyrosequencing (454), sequencing by synthesis (Solexa)
      • Sophisticated signal acquisition and processing
  • Throughput from a single machine:
      • 200,000 sequences of 100 nt each in 4 hours (454)
      • 2 million sequences of 25 nt each in 8 hours (Solexa)
      • Both technologies provide single machine throughput similar to an entire genome sequencing centre!
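
The comparison with a whole sequencing centre can be made explicit. The sketch below uses the per-run figures from this slide and the 30 million nt per working day quoted earlier for a typical centre; the dictionary layout and function names are illustrative only.

```python
# Per-run yield of the new platforms versus a Sanger centre's daily output.

PLATFORMS = {
    "454": {"reads": 200_000, "read_len": 100, "run_hours": 4},
    "Solexa": {"reads": 2_000_000, "read_len": 25, "run_hours": 8},
}
CENTRE_NT_PER_DAY = 30_000_000  # typical centre, as quoted in the talk


def run_yield_nt(n_reads: int, read_len_nt: int) -> int:
    """Nucleotides produced by one run: number of reads times read length."""
    return n_reads * read_len_nt


def summarize() -> dict:
    """Per-platform yield per run, per hour, and relative to one centre-day."""
    rows = {}
    for name, p in PLATFORMS.items():
        per_run = run_yield_nt(p["reads"], p["read_len"])
        rows[name] = {
            "nt_per_run": per_run,
            "nt_per_hour": per_run / p["run_hours"],
            "fraction_of_centre_day": per_run / CENTRE_NT_PER_DAY,
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize().items():
        print(name, row)
```

With these numbers a single 454 run yields 20 million nt (about two thirds of a centre-day) and a single Solexa run 50 million nt (about 1.7 centre-days). Read length is much shorter, which the calculation does not capture.
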

slide-18
SLIDE 18

454 sequencing

slide-19
SLIDE 19

A 454 sequencing machine (Roche Diagnostics)

slide-20
SLIDE 20

Some applications of “ultra-low cost” sequencing

  • Rapid sequencing of bacterial genomes
  • Sampling of “environmental” genomes
  • Sequencing of individual human genomes as a component of preventative medicine
  • Rapid hypothesis testing for genotype–phenotype associations
  • In vitro and in situ gene-expression profiling at all stages in the development of a multicellular organism
  • Cancer research: for example, determining comprehensive mutation sets for individual clones, carrying out loss-of-heterozygosity analysis, and profiling tumour sub-types for diagnosis and prognosis

slide-21
SLIDE 21

Making sequence information public

  • Sequence databases: EMBL (Europe), GenBank (US), DDBJ (Japan)
  • Repositories of genome and transcriptome data
  • “The EMBL Nucleotide Sequence Database was frozen to make Release 85 on 30-NOV-2005. The release contains 64,739,883 sequence entries comprising 116,106,677,726 nucleotides, of which 12,088,383 entries (59,629,958,692 nucleotides) are WGS (whole genome shotgun) data.”
  • Trace file repositories (raw data from sequencers) at all major sequencing centres

slide-22
SLIDE 22

Growth of the public sequence repository

[Figure: Growth of the EMBL sequence database. x-axis: Date (Feb-82 to Oct-06); y-axis: Nucleotides (log scale, 1.00E+05 to 1.00E+12). Regression to an exponential: R=0.995. Doubling time: ~5 months. Annotation: Human genome.]
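
For readers unfamiliar with how a doubling time is read off such a plot: fit a straight line to log2(nucleotides) against time, and the doubling time is the inverse of the slope. The sketch below does this with an ordinary least-squares fit. The data points are synthetic (generated to double every 12 months) and are not the EMBL figures; the function name is illustrative.

```python
import math


def doubling_time(times, counts):
    """Least-squares fit of log2(count) against time.

    Returns the doubling time in the same units as `times`.
    """
    logs = [math.log2(c) for c in counts]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    slope = sxy / sxx  # doublings per unit of time
    return 1.0 / slope


# Synthetic example (NOT database statistics): size doubles every 12 months.
months = [0, 12, 24, 36, 48]
counts = [1e6 * 2 ** (m / 12) for m in months]

if __name__ == "__main__":
    print(f"estimated doubling time: {doubling_time(months, counts):.1f} months")
```
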

slide-23
SLIDE 23

Exponential growth in computing and sequencing

From Shendure et al, Nature Reviews Genetics 5:335 (2004)

slide-24
SLIDE 24

Outline of the talk

  • All about genomes
  • Sequencing technologies, old and new
  • From a test tube full of DNA to a genome sequence
  • And now, where are the genes?
  • Presenting the results: genome browsers
slide-25
SLIDE 25

Challenges in genome sequencing

  • Using current technology, pieces of the genome have to be individually cloned and amplified (in bacterial vectors) before sequencing
  • Genome sizes in hundreds of millions of nucleotides, sequence reads in hundreds
  • Millions of reads will be required to obtain several-fold coverage
  • These reads will have to be assembled based on overlaps
  • A sizable proportion of many genomes consists of repeated sequences
  • A measured scaffold will need to be built to guide the assembly process
  • This scaffold will be based on a physical map of the genome
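
The "millions of reads" claim follows from simple arithmetic: for coverage c, genome size G and read length L, about c·G/L reads are needed. The sketch below also estimates the fraction of the genome left uncovered using the Poisson (Lander-Waterman) approximation e^(-c); that model is an addition for illustration and is not mentioned in the slides. The genome size, read length and coverage in the example are hypothetical values in the ranges the slide describes.

```python
import math


def reads_needed(genome_nt: int, read_nt: int, coverage: float) -> int:
    """Number of reads of length read_nt giving `coverage`-fold coverage."""
    return math.ceil(coverage * genome_nt / read_nt)


def uncovered_fraction(coverage: float) -> float:
    """Expected fraction of bases hit by no read.

    Assumes reads land uniformly at random (Poisson approximation).
    """
    return math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical project: 300 Mnt genome, 500 nt reads, 8-fold coverage.
    n = reads_needed(300_000_000, 500, 8)
    print(f"{n:,} reads; ~{uncovered_fraction(8):.4%} of bases uncovered")
```

For this hypothetical project the answer is 4.8 million reads. Even at 8-fold coverage the random model leaves a small fraction of bases unsampled, which is one reason finishing and gap closure are needed.
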
slide-26
SLIDE 26

Shotgun sequencing and assembly

Figures from the U. of Maryland Center for Bioinformatics and Computational Biology

slide-27
SLIDE 27

Mis-assembly caused by a repeated sequence

Figures from the U. of Maryland Center for Bioinformatics and Computational Biology

slide-28
SLIDE 28

Scaffolds based on paired reads or BAC maps

Paired reads: the distance separating ends of clones is known (within limits)

BAC map: the genome is divided into pieces of 100-200 kb, which are mapped relative to each other. A minimal set is then sequenced

Figures from the U. of Maryland Center for Bioinformatics and Computational Biology

slide-29
SLIDE 29

Genome assembly software

  • The assembly of larger and larger sequence contigs is a difficult problem
  • Some of the most sophisticated software used in the life sciences addresses this issue
  • Graph theory is used extensively (Hamiltonian or Eulerian paths) in contig assembly
  • Examples:
      • Phrap (Phil Green, U. of Washington)
      • Celera assembler (Gene Myers, UC Berkeley)
      • Arachne (David Jaffe and Eric Lander, MIT)
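
To make "assembled based on overlaps" concrete, here is a toy greedy assembler: it repeatedly merges the pair of sequences with the longest suffix-prefix overlap. This is a deliberately naive sketch, not how Phrap, the Celera assembler or Arachne work. It ignores sequencing errors, reverse-complement reads and repeats, and the reads in the example are invented.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap >= min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    # Invented reads sampled from the 20-nt sequence ATGGCGTGCAATCCGTTAGC.
    reads = ["ATCCGTTAGC", "ATGGCGTGCA", "GTGCAATCCG"]
    print(greedy_assemble(reads))
```

On repeat-free toy data this recovers the original sequence as a single contig. With repeated sequences the greedy choice can join the wrong reads, which is the mis-assembly problem described in the talk.
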
slide-30
SLIDE 30

Anatomy of a genome sequencing project (ca 2005)

[Figure: flowchart of a genome sequencing project, with the labels: Genomic DNA; Plasmid library; Shotgun sequencing; BAC library; End sequencing; Mapping; Shotgun sequencing; Contigs; Super-contigs; Finishing, gap closure; Finished sequence; New sequencing technologies; Sequence assembly software]

slide-31
SLIDE 31

Outline of the talk

  • All about genomes
  • Sequencing technologies, old and new
  • From a test tube full of DNA to a genome sequence
  • And now, where are the genes?
  • Presenting the results: genome browsers
slide-32
SLIDE 32

From sequence to code

  • A finished genome sequence is nothing but a set of long and uninformative strings (chromosomes) of ACGT
  • A major task in any genome sequencing project is the annotation
  • The annotation process tries to assign functions to substrings of the chromosome
  • Annotation is just a representation of current knowledge about how a genome works

slide-33
SLIDE 33

Gene prediction methods

  • Two different approaches:
      • Extrinsic methods:
          • Based on similarity to known nucleotide or protein sequences
          • Based on similarity between genomes (comparative genomics)
      • Intrinsic or ab initio methods:
          • Based on the properties of the sequence (statistical regularities)
          • Based on the presence of known signals
slide-34
SLIDE 34

Gene prediction from sequence similarity

  • Starting from protein sequences: compare translated genomic sequence to protein sequence databases
  • Starting from known transcript sequences: align cDNA or EST sequences with the genomic sequence and mark exon boundaries
  • Starting from hidden Markov models describing protein functional domains: align the domain descriptor with the genome, respecting intron/exon boundary rules
  • Software for detecting sequence similarities:
      • TBLASTN (DNA-protein comparisons) and BLASTN (RNA-DNA comparisons)
      • SIBsim4, BLAT (alignment of cDNA to the genome)
      • GeneWise (alignment of HMM to the genome)
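
Comparing "translated genomic sequence" to proteins means translating the DNA in all six reading frames (three per strand) before searching. The sketch below shows only that translation step, using the standard genetic code; it is not how TBLASTN is implemented, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> dict:
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six strings would then be compared against a protein database; in a real genome, introns break the coding sequence across frames, which is why dedicated alignment tools are used.
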
slide-35
SLIDE 35

Comparative genomics

  • Phylogenetic approach: compare two genomes that have diverged during evolution
  • Ideal divergence: ~300 million years (human vs bird)
  • Sequence conservation => selective pressure => function
  • Allows the detection of any functional element (transcribed or non-transcribed genes, regulatory elements, ...)
  • But...
      • No indication of the function being preserved; in fact, there are abundant non-genic conserved sequences of unknown function
      • Highly dependent on having enough genomes sequenced
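
One simple way to turn "sequence conservation" into something measurable is percent identity in sliding windows along a pairwise genome alignment. The sketch below assumes the two sequences are already aligned (equal length, '-' for gaps); the window size, threshold, function names and example sequences are all arbitrary choices for illustration.

```python
def window_identity(aln_a: str, aln_b: str, window: int, step: int):
    """Return (start, identity) for each window over two aligned sequences.

    Identity is the fraction of columns with the same non-gap character.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, matches / window))
    return results


def conserved_windows(aln_a, aln_b, window=10, step=10, threshold=0.8):
    """Start positions of windows whose identity reaches the threshold."""
    return [
        start
        for start, identity in window_identity(aln_a, aln_b, window, step)
        if identity >= threshold
    ]


if __name__ == "__main__":
    # Invented alignment: first half identical, second half diverged.
    species_a = "ATGGCCAAGTTTGACCTAGG"
    species_b = "ATGGCCAAGTACG-TCAAGC"
    print(window_identity(species_a, species_b, window=10, step=10))
    print(conserved_windows(species_a, species_b))
```

A highly conserved window only suggests selective pressure; as the slide notes, it says nothing about which function is being preserved.
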
slide-36
SLIDE 36

Ab initio gene predictions

  • Rules or signals:
      • Promoters (combinations of short binding motifs for transcription factors)
      • Consensus signal for mRNA polyadenylation (AATAAA)
      • Start and stop codons in coding sequences
      • Splicing signals
  • Statistical models for coding regions:
      • Synonymous codon usage biases introduce bias in hexanucleotide composition
      • Implemented as an inhomogeneous 3-periodic fifth-order hidden Markov model
  • Models for structurally conserved RNA-coding genes
      • Usually represented as stochastic context-free grammars
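
Two of the signals listed above are simple enough to scan for directly: start/stop codons and the AATAAA polyadenylation consensus. The sketch below finds ATG-to-stop open reading frames on the forward strand and the positions of AATAAA. It is a signal scan only, far short of a real gene finder (no splicing, no statistical model, no reverse strand), and the function names and example sequence are invented.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Return (start, end, frame) for forward-strand ORFs.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon (inclusive); `end` is exclusive, coordinates are 0-based.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


def find_polya_signals(seq: str):
    """0-based start positions of the AATAAA consensus (overlaps allowed)."""
    return [m.start() for m in re.finditer("(?=AATAAA)", seq.upper())]


if __name__ == "__main__":
    seq = "CCATGGCCAAGTGACCAATAAAGG"  # invented example
    print(find_orfs(seq))           # [(2, 14, 2)]
    print(find_polya_signals(seq))  # [16]
```

In vertebrate genomes such signals occur by chance far too often to be used alone, which is why they are combined with the statistical coding models listed above.
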
slide-37
SLIDE 37

[Figure: genomic region with tracks labelled "ab initio predictions" and "EST sequences"]

slide-38
SLIDE 38

Functional annotation of genes

  • From experimental evidence
  • From known function of orthologs (genes with identical ancestry) in other species
  • From individual protein domains (e.g. Zn finger domain almost always associated with transcriptional regulators)
  • From domain architecture
  • From co-regulation and presumed participation in a known process

slide-39
SLIDE 39

Outline of the talk

  • All about genomes
  • Sequencing technologies, old and new
  • From a test tube full of DNA to a genome sequence
  • And now, where are the genes?
  • Presenting the results: genome browsers
slide-40
SLIDE 40

Genome browsers

  • Goal: present as much information about a genome as possible
  • Approach: partition annotation into “tracks”, and let the user choose which tracks s/he wants to see, as well as the level of detail in each track
  • Distributed annotation: in the presence of a common coordinate system, annotation tracks can be contributed by any Internet-connected server
  • Three major browsers available:
      • Ensembl (Sanger Inst. and European Bioinfo. Inst.)
      • UCSC Genome Browser (UC Santa Cruz)
      • Genome Map Viewer (Natl. Center for Biotech. Info., NIH)
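
The track idea can be sketched as a data structure: each track is a list of features in a shared coordinate system, and a view is the set of features from the selected tracks that overlap the displayed region. This is a minimal illustration, not the data model of Ensembl, UCSC or NCBI; the class, function, track names and features are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def visible_features(tracks, selected, chrom, start, end):
    """Return {track_name: features overlapping chrom:start-end}.

    Only tracks named in `selected` are included; unknown names give [].
    """
    view = {}
    for track_name in selected:
        view[track_name] = [
            f for f in tracks.get(track_name, [])
            if f.chrom == chrom and f.start < end and f.end > start
        ]
    return view


if __name__ == "__main__":
    # Hypothetical tracks; any server could contribute one, provided it
    # uses the same chromosome names and coordinates.
    tracks = {
        "genes": [Feature("chr1", 1_000, 5_000, "geneA"),
                  Feature("chr1", 9_000, 12_000, "geneB")],
        "ESTs": [Feature("chr1", 1_200, 1_800, "est1"),
                 Feature("chr2", 1_200, 1_800, "est2")],
    }
    print(visible_features(tracks, ["genes", "ESTs"], "chr1", 0, 6_000))
```

The common coordinate system is what makes distributed annotation possible: a track needs no knowledge of other tracks, only of positions on the reference sequence.
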
slide-41
SLIDE 41
slide-42
SLIDE 42
slide-43
SLIDE 43
slide-44
SLIDE 44
slide-45
SLIDE 45

Summary

  • Genome sequencing is a relatively mature technology, generating biologically fascinating data
  • New technological advances are poised to dramatically decrease the cost and time of both sequencing and re-sequencing
  • Our understanding of the information content of large genomes is still very fragmentary
  • The current buzzword is systems biology, i.e. an understanding of genome function at the systems level