[PPT] - Sequencing and decoding genomes C. Victor Jongeneel, PhD Ludwig PowerPoint Presentation

SLIDE 1

Sequencing and decoding genomes

C. Victor Jongeneel, PhD

Ludwig Institute for Cancer Research
Swiss Institute of Bioinformatics
Center for Integrative Genomics, U. of Lausanne

Victor.Jongeneel@licr.org

SLIDE 2

The Computational Biology Challenge

"In principle, the string of genetic bits holds long-sought secrets of human development, physiology and medicine. In practice, our ability to transform such information into understanding remains woefully inadequate."

International Human Genome Sequencing Consortium, "Initial sequencing and analysis of the human genome," Nature 409: 860-921 (2001) [Emphasis added]

SLIDE 3

Outline of the talk

All about genomes
Sequencing technologies, old and new
From a test tube full of DNA to a genome sequence
And now, where are the genes?
Presenting the results: genome browsers

SLIDE 4

What is a genome?

Set of genes transmitted between generations
Made of deoxyribonucleic acid (DNA)
A long string in a four-letter alphabet (A,C,G,T)
Present in every cell of the body
Contains the master plan for building an organism
Contains all of the instructions to keep a cell alive and to allow it to divide

Copies itself at every cell division

SLIDE 5

Genome complexity

Sizes:
viruses: 10^3 to 10^5 nt
bacteria: 10^5 to 10^7 nt
Baker’s yeast: 1.35 x 10^7 nt
mammals: 3-4 x 10^9 nt
plants: 10^8 to 10^11 nt
Numbers of genes:
virus: 3 to 100
bacteria: 1000 to 5000
Baker’s yeast: ~6’000
mammals: 20’000 - 30’000

SLIDE 6

Information carried by DNA

[Figure: schematic of a chromosome. Labels: centromere, telomere, gene, regulatory elements, exons of genes, locus control region, repetitive sequences. NB: this drawing is not to scale.]

SLIDE 7

Structure of a typical vertebrate gene

[Figure labels: 5’-UTR, 3’-UTR, Stop, (CpG islands)]

SLIDE 8

The human genome

Size: 3 x 10^9 nt for a single copy (haploid)
Highly repetitive sequences (>1000 copies): 25%
Middle repetitive sequences: 25-30% (=> >50% repetitive in total)
Sizes of genes: from 900 to >2’000’000 nt (including introns)
Proportion encoding proteins: 5-7%
Number of chromosomes: 22 autosomes, 2 sex-linked chromosomes (X and Y)
Sizes of chromosomes: 5 x 10^7 to 5 x 10^8 nt

SLIDE 9

Outline of the talk

All about genomes
Sequencing technologies, old and new
From a test tube full of DNA to a genome sequence
And now, where are the genes?
Presenting the results: genome browsers

SLIDE 10

Current sequencing technology

Sequencing method developed by Fred Sanger in the 1970s
Principle: randomly terminate synthesis of a DNA strand using a nucleotide analogue, then separate the products by electrophoresis in a polyacrylamide gel

Many technological improvements:
Use of fluorescently labeled nucleotides
Use of pre-cast gels in capillaries
Automation of sample handling
Use of microfluidics technologies

SLIDE 11

Sequencing machines

SLIDE 12

Sequencing machines – a current model

SLIDE 13

Sequencing centre

SLIDE 14

Raw data

SLIDE 15

Typical output of current sequencing technology

One machine handles 500 samples per day
Each sample produces 700 nt of raw sequence
A typical centre will host a hundred sequencing machines
A typical centre can produce 30 million nucleotides' worth of sequence every working day
BUT this requires a very large investment and is labor intensive
There are fewer than 100 such centres worldwide
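The figures on this slide can be checked with a few lines of arithmetic. The sketch below multiplies out the quoted per-machine numbers; reading the gap between the raw total and the quoted 30 million as losses from failed samples and quality trimming is an assumption, not something the slide states.

```python
# Figures quoted on the slide.
samples_per_machine_per_day = 500
raw_nt_per_sample = 700
machines_per_centre = 100

# Raw nucleotides a centre could produce per working day.
raw_nt_per_day = (
    samples_per_machine_per_day * raw_nt_per_sample * machines_per_centre
)

# Output quoted on the slide; the difference from the raw total is
# presumably lost to failed samples and trimming (assumption).
quoted_nt_per_day = 30_000_000
usable_fraction = quoted_nt_per_day / raw_nt_per_day

print(f"Raw output per centre per day: {raw_nt_per_day:,} nt")
print(f"Quoted output as a share of raw: {usable_fraction:.0%}")
```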

SLIDE 16

A real sequencing centre: DOE JGI

Date(s) | Total Q20* Bases | Total Lanes** | % Passed† | Ave. Read Length‡
2/26/06: ABI3730 | 42.672 Million | 65,952 | 92.77% | 696
12/16/04: MegaBACE4000 | 3.512 Million | 6,528 | 88.31% | 606
2/26/06: MegaBACE4500 | 15.386 Million | 23,424 | 83.27% | 787
Current month (2/06) | 2.147 Billion | 2,047,968 | 94.31% | 717
Last month (1/06) | 2.807 Billion | 4,187,132 | 93.69% | 714
FY to Date (10/05-2/26/06) | 13.056 Billion | 20,517,314 | 95% | 625
Total (3/99-2/26/06) | 109.947 Billion | 185,133,995 | 92% | 653

SLIDE 17

New sequencing technologies

Current leaders: 454 Life Sciences, Solexa
Main technological advances:
Single molecule amplification on solid supports: microbeads in emulsion (454), surfaces with DNA primers (Solexa)
Bulk sequencing technologies: pyrosequencing (454), sequencing by synthesis (Solexa)
Sophisticated signal acquisition and processing
Throughput from a single machine:
200,000 sequences of 100 nt each in 4 hours (454)
2 million sequences of 25 nt each in 8 hours (Solexa)
Both technologies provide single machine throughput similar to an entire genome sequencing centre!
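The comparison with a whole sequencing centre follows from the quoted numbers. The sketch below multiplies read count by read length for each platform and sets the result against the 30 million nt per working day quoted earlier for a Sanger centre; the function name and the dictionary layout are illustrative choices only.

```python
def nt_per_run(n_reads, read_length):
    """Total nucleotides produced by one machine run."""
    return n_reads * read_length

# (reads per run, read length in nt, run time in hours), as quoted on the slide.
runs = {
    "454": (200_000, 100, 4),
    "Solexa": (2_000_000, 25, 8),
}

# Output of a typical Sanger sequencing centre per working day (earlier slide).
CENTRE_NT_PER_DAY = 30_000_000

for platform, (n_reads, read_length, hours) in runs.items():
    total = nt_per_run(n_reads, read_length)
    print(
        f"{platform}: {total:,} nt per {hours} h run, "
        f"{total / CENTRE_NT_PER_DAY:.2f} x a centre's daily output"
    )
```

Note that this counts raw nucleotides only; the much shorter reads are harder to assemble, which the arithmetic does not capture.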

SLIDE 18

454 sequencing

SLIDE 19

A 454 sequencing machine (Roche Diagnostics)

SLIDE 20

Some applications of “ultra-low cost” sequencing

Rapid sequencing of bacterial genomes
Sampling of “environmental” genomes
Sequencing of individual human genomes as a component of preventative medicine
Rapid hypothesis testing for genotype–phenotype associations
In vitro and in situ gene-expression profiling at all stages in the development of a multicellular organism
Cancer research: for example, determining comprehensive mutation sets for individual clones, carrying out loss-of-heterozygosity analysis, and profiling tumour sub-types for diagnosis and prognosis

SLIDE 21

Making sequence information public

Sequence databases: EMBL (Europe), GenBank (US), DDBJ (Japan)
Repositories of genome and transcriptome data
“The EMBL Nucleotide Sequence Database was frozen to make Release 85 on 30-NOV-2005. The release contains 64,739,883 sequence entries comprising 116,106,677,726 nucleotides, of which 12,088,383 entries (59,629,958,692 nucleotides) are WGS (whole genome shotgun) data.”
Trace file repositories (raw data from sequencers) at all major sequencing centres

SLIDE 22

Growth of the public sequence repository

[Figure: growth of the EMBL sequence database. Nucleotides (log scale, 1.00E+05 to 1.00E+12) plotted against date (Feb-82 to Oct-06), with the human genome marked.]

Regression to an exponential: R=0.995
Doubling time: ~5 months

SLIDE 23

Exponential growth in computing and sequencing

From Shendure et al, Nature Reviews Genetics 5:335 (2004)

SLIDE 24

Outline of the talk

All about genomes
Sequencing technologies, old and new
From a test tube full of DNA to a genome sequence
And now, where are the genes?
Presenting the results: genome browsers

SLIDE 25

Challenges in genome sequencing

Using current technology, pieces of the genome have to be individually cloned and amplified (in bacterial vectors) before sequencing
Genome sizes run to hundreds of millions of nucleotides, sequence reads to hundreds
Millions of reads will be required to obtain several-fold coverage
These reads will have to be assembled based on overlaps
A sizable proportion of many genomes consists of repeated sequences
A measured scaffold will need to be built to guide the assembly process
This scaffold will be based on a physical map of the genome
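The claim that millions of reads are needed follows from the definition of coverage: c = N x L / G for N reads of length L over a genome of size G. The sketch below solves for N, using the human genome size and Sanger read length quoted on earlier slides; the 8-fold coverage target is an assumed example value, and the uncovered-fraction estimate uses the idealised Lander-Waterman model, which assumes reads land uniformly at random.

```python
import math

def reads_needed(genome_size, read_length, coverage):
    """Number of reads N such that N * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size / read_length)

def uncovered_fraction(coverage):
    """Expected fraction of the genome hit by no read (Lander-Waterman model)."""
    return math.exp(-coverage)

genome_size = 3_000_000_000   # haploid human genome, from the earlier slide
read_length = 700             # raw Sanger read length, from the earlier slide
coverage = 8                  # assumed target, for illustration only

n_reads = reads_needed(genome_size, read_length, coverage)
print(f"Reads needed for {coverage}x coverage: {n_reads:,}")
print(f"Expected uncovered fraction: {uncovered_fraction(coverage):.4%}")
```

Real projects need more reads than this, since not every read passes quality control and repeats leave gaps that uniform sampling does not predict.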

SLIDE 26

Shotgun sequencing and assembly

Figures from the U. of Maryland Center for Bioinformatics and Computational Biology

SLIDE 27

Mis-assembly caused by a repeated sequence

Figures from the U. of Maryland Center for Bioinformatics and Computational Biology

SLIDE 28

Scaffolds based on paired reads or BAC maps

Paired reads: the distance separating the ends of clones is known (within limits)
BAC map: the genome is divided into pieces of 100-200 kb, which are mapped relative to each other. A minimal set is then sequenced

Figures from the U. of Maryland Center for Bioinformatics and Computational Biology

SLIDE 29

Genome assembly software

The assembly of larger and larger sequence contigs is a difficult problem
Some of the most sophisticated software used in the life sciences addresses this issue
Graph theory is used extensively (Hamiltonian or Eulerian paths) in contig assembly

Examples:
Phrap (Phil Green, U. of Washington)
Celera assembler (Gene Myers, UC Berkeley)
Arachne (David Jaffe and Eric Lander, MIT)
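To make the idea of assembly from overlaps concrete, here is a toy greedy assembler. It is a deliberately naive heuristic (repeatedly merge the pair of sequences with the longest suffix-prefix overlap), not the algorithm used by Phrap, the Celera assembler or Arachne; the function names, the `min_len` threshold and the reads are all made up for illustration, and sequencing errors, reverse strands and contained reads are ignored.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]

# Three overlapping toy reads sampled from the sequence ATGGCGTGCAATCC.
reads = ["GCGTGCAA", "ATGGCGTG", "GCAATCC"]
print(greedy_assemble(reads))
```

Because the greedy step only looks at the longest overlap, two copies of a repeat can be merged incorrectly, which is the kind of mis-assembly that paired reads and BAC maps are meant to catch.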

SLIDE 30

Anatomy of a genome sequencing project (ca 2005)

[Flowchart labels: Genomic DNA; Plasmid library; Shotgun sequencing; BAC library; End sequencing; Mapping; Shotgun sequencing; Contigs; Super-contigs; Finished sequence; Finishing, gap closure; New sequencing technologies; Sequence assembly software]

SLIDE 31

Outline of the talk

All about genomes
Sequencing technologies, old and new
From a test tube full of DNA to a genome sequence
And now, where are the genes?
Presenting the results: genome browsers

SLIDE 32

From sequence to code

A finished genome sequence is nothing but a set of long and uninformative strings (chromosomes) of ACGT
A major task in any genome sequencing project is the annotation
The annotation process tries to assign functions to sub-strings of the chromosome
Annotation is just a representation of current knowledge about how a genome works

SLIDE 33

Gene prediction methods

Two different approaches:
Extrinsic methods:
Based on similarity to known nucleotide or protein sequences
Based on similarity between genomes (comparative genomics)
Intrinsic or ab initio methods:
Based on the properties of the sequence (statistical regularities)
Based on the presence of known signals

SLIDE 34

Gene prediction from sequence similarity

Starting from protein sequences: compare translated genomic sequence to protein sequence databases
Starting from known transcript sequences: align cDNA or EST sequences with the genomic sequence and mark exon boundaries
Starting from hidden Markov models describing protein functional domains: align the domain descriptor with the genome, respecting intron/exon boundary rules
Software for detecting sequence similarities:
TBLASTN (DNA-protein comparisons) and BLASTN (RNA-DNA comparisons)

SIBsim4, BLAT (alignment of cDNA to the genome)
Genewise (alignment of HMM to the genome)

SLIDE 35

Comparative genomics

Phylogenetic approach: compare two genomes that have diverged during evolution
Ideal divergence: ~300 million years (human vs bird)
Sequence conservation => selective pressure => function
Allows the detection of any functional element (transcribed or non-transcribed genes, regulatory elements, ...)
But...
No indication of the function being preserved; in fact, there are abundant non-genic conserved sequences of unknown function
Highly dependent on having enough genomes sequenced

SLIDE 36

Ab initio gene predictions

Rules or signals:
Promoters (combinations of short binding motifs for transcription factors)
Consensus signal for mRNA polyadenylation (AATAAA)
Start and stop codons in coding sequences
Splicing signals
Statistical models for coding regions:
Synonymous codon usage biases introduce bias in hexanucleotide composition
Implemented as an inhomogeneous 3-periodic fifth-order hidden Markov model

Models for structurally conserved RNA-coding genes
Usually represented as stochastic context-free grammars
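Two of the signals listed above are simple enough to scan for directly. The sketch below looks for open reading frames (an ATG followed in frame by a stop codon) and for the AATAAA polyadenylation consensus on the forward strand of a made-up sequence. It is only a signal scan under strong simplifying assumptions (no introns, no reverse strand, no statistical model); actual gene finders combine such signals with the Markov models mentioned on the slide. All function names and the `min_codons` threshold are illustrative.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs of forward-strand ORFs, end exclusive.

    An ORF runs from an ATG to the first in-frame stop codon (inclusive)
    and must contain at least min_codons codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

def find_polya_signals(seq):
    """Return start positions of the AATAAA consensus (overlaps allowed)."""
    return [m.start() for m in re.finditer(r"(?=AATAAA)", seq.upper())]

# Made-up example sequence: a short ORF followed by a polyadenylation signal.
toy = "CCATGGCTGCTAAATGACCAATAAAGG"
print(find_orfs(toy))           # ORF coordinates (0-based, end exclusive)
print(find_polya_signals(toy))  # positions of AATAAA
```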

SLIDE 37

[Figure labels: ab initio predictions; EST sequences]

SLIDE 38

Functional annotation of genes

From experimental evidence
From known function of orthologs (genes with identical ancestry) in other species
From individual protein domains (e.g. Zn finger domain almost always associated with transcriptional regulators)
From domain architecture
From co-regulation and presumed participation in known process

SLIDE 39

Outline of the talk

All about genomes
Sequencing technologies, old and new
From a test tube full of DNA to a genome sequence
And now, where are the genes?
Presenting the results: genome browsers

SLIDE 40

Genome browsers

Goal: present as much information about a genome as possible
Approach: partition annotation into “tracks”, and let the user choose which tracks s/he wants to see, as well as the level of detail in each track
Distributed annotation: in the presence of a common coordinate system, annotation tracks can be contributed by any Internet-connected server

Three major browsers available:
Ensembl (Sanger Inst. and European Bioinfo. Inst.)
UCSC Genome Browser (UC Santa Cruz)
Genome Map Viewer (Natl. Center for Biotech. Info., NIH)
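The track idea rests on every annotation sharing one coordinate system, so that features from different sources can be selected by region. The sketch below illustrates this with a hypothetical `Feature` record and a filter function; it is not the data model of Ensembl, the UCSC browser or the NCBI viewer, and the coordinates and names are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    track: str   # which annotation track the feature belongs to
    chrom: str   # chromosome name in the shared coordinate system
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str

def visible_features(features, chrom, start, end, tracks):
    """Features on the chosen tracks that overlap chrom:start-end."""
    return [
        f for f in features
        if f.track in tracks
        and f.chrom == chrom
        and f.start < end
        and f.end > start
    ]

# Invented annotations, as if contributed by different servers.
annotations = [
    Feature("genes", "chr1", 1_000, 5_000, "geneA"),
    Feature("genes", "chr1", 8_000, 9_500, "geneB"),
    Feature("ESTs", "chr1", 1_200, 1_800, "est1"),
    Feature("repeats", "chr1", 4_500, 6_000, "rep1"),
    Feature("genes", "chr2", 1_000, 2_000, "geneC"),
]

# The user views chr1:4000-8500 with only two tracks switched on.
for feature in visible_features(annotations, "chr1", 4_000, 8_500, {"genes", "repeats"}):
    print(feature.track, feature.name, feature.start, feature.end)
```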

SLIDE 41

SLIDE 42

SLIDE 43

SLIDE 44

SLIDE 45

Summary

Genome sequencing is a relatively mature technology, generating biologically fascinating data
New technological advances are poised to dramatically decrease the cost and time of both sequencing and re-sequencing
Our understanding of the information content of large genomes is still very fragmentary

The current buzzword is systems biology, i.e. an