
Data Mining in Bioinformatics
Day 6: Classification in Next Generation Sequencing Data Analysis
Dominik Grimm
February 18 to March 1, 2013
Machine Learning & Computational Biology Research Group
Max Planck Institute Tübingen and Eberhard Karls Universität Tübingen
Dominik Grimm: Data Mining in Bioinformatics, Page 1

Overview
- Genome sequencing: A brief review
- Classical sequencing methods
- Paired-end sequencing
- Next Generation Sequencing (NGS): A brief introduction
- Next Generation Sequencing approaches
- Illumina Genome Analyzer II
- Genome reconstruction
- Detecting structural variations
- Accurate indel prediction using paired-end short reads (Grimm et al., 2013)
- SVM approach to predict indels

Genome sequencing: A brief review
Brief historical review
- The first DNA sequences were obtained in the early 1970s (Min Jou et al., 1971, 1972).
- Laborious techniques were required to retrieve small DNA pieces; e.g. in 1973 the lac operator (24 base pairs (bp)) was sequenced by Walter Gilbert and Allan Maxam (Gilbert and Maxam, 1973).
- In 1977 two rapid sequencing methods were developed almost simultaneously:
  - Maxam-Gilbert sequencing at Harvard University, USA
  - Sanger sequencing by Frederick Sanger at the University of Cambridge, UK
- Nobel Prize in Chemistry for Frederick Sanger, Walter Gilbert and Paul Berg in 1980.

Genome sequencing: A brief review
Maxam-Gilbert sequencing (Maxam and Gilbert, 1977)
[Figure: workflow in three steps]
1. DNA sequences (e.g. ATTCGA) are marked at the 5' end.
2. The DNA sequences get modified at A, T, G or C and are split off from the DNA backbone.
3. Gel electrophoresis (lanes A, T, G, C) is used to reconstruct the sequence.
Rarely used, because:
- Sequences are marked at the 5' and 3' end with radioactive phosphorus (32P) or with non-radioactive biotin or fluorescein.
- It is hard to automate.

Genome sequencing: A brief review
Sanger sequencing (Sanger et al., 1977)
[Figure: a template strand (ATTCGA) with primer is copied by polymerase in four reactions containing ddATP, ddCTP, ddGTP and ddTTP respectively, together with a lot of dATP, dCTP, dGTP and dTTP; the resulting fragments of different lengths are separated in gel lanes A, T, G, C]
Widely used, because:
- Fewer toxic and radioactive substances are needed.
- It is more efficient due to automation.
- The method works for short sequence strands from 100 bp up to 1.5 kbp.

Genome sequencing: A brief review
Shotgun sequencing (Sanger et al., 1980, 1982)
[Figure: workflow in three steps]
1. Start from several copies of the DNA.
2. Use mechanical shear forces to break the DNA into pieces at random positions.
3. Find overlapping pieces and reconstruct the original sequence.
- For the first time it was possible to sequence long sequences, even whole genomes.
- Efficient bioinformatic techniques are essential to reconstruct the original sequence.
- The Institute for Genomic Research (TIGR), led by Craig Venter, proposed a concept of highly parallel sequencing.
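The "find overlapping pieces" step can be illustrated with a toy greedy merge. This is only a sketch under strong assumptions (error-free fragments, no repeats, a single strand); the function names `overlap` and `greedy_assemble` are hypothetical and this is not the method used by any particular assembler.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left; return the remaining contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


# Toy fragments (assumed data) cut from the sequence ATTCGATTGCA.
print(greedy_assemble(["ATTCGA", "CGATTG", "TTGCA"]))  # ['ATTCGATTGCA']
```

Real genomes contain repeats and sequencing errors, which is exactly where such a greedy strategy breaks down and why more careful reconstruction techniques are needed.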

Genome sequencing: A brief review
Shotgun sequencing
- Error-prone due to sequencing errors, repetitive nucleotide patterns or similar reads from distinct genomic positions
  → DNA molecules are copied and sequenced numerous times.
- The fold-coverage c measures the average number of reads covering a given nucleotide:
$$c(R, g) = \frac{1}{|g|} \sum_{i=1}^{|R|} |R_i| \qquad (1)$$
  where g is a DNA sequence of length n and R is a set of DNA reads with an average length m.
- Example: A sequence of length 4000 bp is reconstructed using 20 reads, each with an average length of 600 bp
  → c = (20 · 600) / 4000 = 3, i.e. 3x coverage (3-fold coverage).
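Equation (1) translates directly into a few lines of Python. The following is a minimal sketch; the function name `fold_coverage` and its arguments are chosen here for illustration only.

```python
def fold_coverage(read_lengths, genome_length):
    """Fold-coverage c(R, g): total read length divided by |g|."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(read_lengths) / genome_length


# Example from the slide: 20 reads of 600 bp each on a 4000 bp sequence.
print(fold_coverage([600] * 20, 4000))  # 3.0
```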

Genome sequencing: A brief review
Paired-end sequencing (Edwards and Caskey, 1991)
- In paired-end (or mate pair) sequencing both ends of the same fragment are sequenced.
- The distance between the two reads of a paired-end read is known.
- Reads that are reassembled approximately the known distance apart from each other are called happy; otherwise they are unhappy.
[Figure: a DNA sequence with Read 1, the expected distance between the two reads, and Read 2 placed either at the expected distance (happy) or elsewhere (unhappy)]
- Paired-end reads and their distance information help to reconstruct the original sequence more reliably.
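The happy/unhappy distinction can be sketched as a simple distance check. The slide only says "approximately", so the `tolerance` parameter, the function name `is_happy` and the example positions are assumptions made for illustration.

```python
def is_happy(pos_read1, pos_read2, expected_distance, tolerance):
    """True if the mapped distance is within `tolerance` of the expected one."""
    observed = abs(pos_read2 - pos_read1)
    return abs(observed - expected_distance) <= tolerance


# Hypothetical mapping positions of (read 1, read 2) for two read pairs.
pairs = [(100, 400), (100, 1500)]
labels = [
    "happy" if is_happy(a, b, expected_distance=300, tolerance=50) else "unhappy"
    for a, b in pairs
]
print(labels)  # ['happy', 'unhappy']
```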

Next Generation Sequencing
Overview
- In the last three decades Sanger sequencing was the most widely used and most productive method.
- But classical Sanger sequencing is still expensive: many scientists, huge sequencing centers and a lot of time are required for whole genomes.
  → There is a high demand for low-cost sequencing techniques.

Next Generation Sequencing
Cyclic-array sequencing technologies (Shendure and Ji, 2008)
- 454 pyrosequencing, used in the 454 Genome Sequencer, Roche Applied Science
- SOLiD platform, developed by Applied Biosystems
- Polonator, developed by George M. Church's group at Harvard Medical School
- HeliScope Single Molecule Sequencer by Helicos
- Solexa technology, used in the Illumina Genome Analyzer
→ "Cyclic-array based sequencing can be summarized as the sequencing of a dense array of DNA features by iterative cycles of enzymatic manipulation and imaging-based data collection" (Shendure and Ji, 2008)

Next Generation Sequencing
The Illumina Genome Analyzer II
1. A DNA sequence library has to be prepared
   (a) Create several copies of the DNA strand
   (b) Fragment the strand using nebulization or sonication
   (c) Amplify ends of fragments
   (d) Phosphorylate the 5' end and add an adenosine overhang to the 3' end
   (e) Ligate Illumina adapters
[Figure: 1. DNA end repair, 2. Phosphorylation, 3. Adenosine addition, 4. Adapter ligation]

Next Generation Sequencing
The Illumina Genome Analyzer II
2. Flow cell preparation
[Figure: 5. 3' extension, 6. Denaturation, 7. Bridge formation, 8. 3' amplification, 9. Bridge denaturation, 10. Bridge formation; repeated 36 times]

Next Generation Sequencing
The Illumina Genome Analyzer II
3. Sequencing
[Figure: 11. Sequencing primers are hybridized, 12. Single base extension and laser-based imaging, 13. Fluorescent base cleavage and terminator gets unblocked, 14. Repeated more than 50 times to determine the sequence]
→ Now it is possible to generate gigabases of high-quality reads within one day using only one machine and less than 6 hours of hands-on time.

Next Generation Sequencing
Sequencing costs (Wetterstrand, 2013)
[Figure: sequencing costs]

Genome reconstruction
Reference guided mapping
- Millions of short reads (~30 up to 200 bp) are generated
  → it is a challenge to reconstruct the original sequence using assembly methods.
- New approach: Align short reads against a known genome of the same species (reference genome)
  → also referred to as mapping (tools: SHORE (Ossowski et al., 2008), SSAHA2 (Ning et al., 2001)).
[Figure: paired-end short reads aligned to the reference genome]

Genome reconstruction
Reference guided mapping with SHORE (Ossowski et al., 2008)
- SHORE uses the best-match strategy to map reads
  → best matches are mapped first, and then the number of allowed mismatches and gaps is increased iteratively.
- Reads with 0 mismatches and gaps are mapped first, followed by alignments with Levenshtein Edit Distance (LED) = 1 (Levenshtein, 1965) and Hamming Distance (HD) = 1 (Hamming, 1950), then LED = 2 and HD = 2, LED = 3 and HD = 3, and finally LED = 4 and HD = 4.
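The iterative idea behind a best-match strategy can be sketched as follows. This is not SHORE's implementation: it is a brute-force toy that only counts mismatches (Hamming distance, no gaps), and the function name `map_reads_best_match`, the reference string and the reads are assumptions made for illustration.

```python
def hamming_distance(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s1, s2))


def map_reads_best_match(reference, reads, max_mismatches=4):
    """Map reads with 0 mismatches first, then relax the threshold stepwise."""
    results = []           # tuples of (read, mismatches, positions)
    unmapped = list(reads)
    for k in range(max_mismatches + 1):
        still_unmapped = []
        for read in unmapped:
            hits = [
                pos
                for pos in range(len(reference) - len(read) + 1)
                if hamming_distance(read, reference[pos:pos + len(read)]) <= k
            ]
            if hits:
                results.append((read, k, hits))
            else:
                still_unmapped.append(read)
        unmapped = still_unmapped
    return results, unmapped


mapped, left_over = map_reads_best_match("ACGTACGTTAGC", ["ACGT", "TTAG", "GGGC"])
print(mapped)     # [('ACGT', 0, [0, 4]), ('TTAG', 0, [7]), ('GGGC', 2, [2, 8])]
print(left_over)  # []
```

Because a read is only retried at threshold k after failing at every lower threshold, each reported hit is a best match for that read. A read with several best hits (such as ACGT above) is ambiguous, which is one reason repetitive regions are hard to reconstruct.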

Genome reconstruction
Hamming Distance (HD) (Hamming, 1950)
The HD d_HD measures the number of differing positions in two strings s_1 and s_2 of equal length m:
$$d_{HD}(s_1, s_2) = \sum_{i \,:\, s_{1,i} \neq s_{2,i}} 1, \qquad i = 1, \dots, m \qquad (2)$$
Example: s_1 = "ATCCATGC" and s_2 = "ATGGATAC"
  s_1: "ATCCATGC"
  s_2: "ATGGATAC"
  → d_HD = 3 (the strings differ at positions 3, 4 and 7)
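Equation (2) can be written as a short Python function. This is a minimal sketch; the function name `hamming_distance` is chosen here for illustration.

```python
def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(s1, s2))


# Example from the slide.
print(hamming_distance("ATCCATGC", "ATGGATAC"))  # 3
```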

Genome reconstruction
Levenshtein Edit Distance (LED) (Levenshtein, 1965)
The LED d_LED is the minimum number of edit operations needed to transform a string s_1 of length n into a string s_2 of length m. Allowed edit operations are deletion, insertion or substitution of a single character. It can be computed in O(nm) using dynamic programming.
$$\omega(a_i, b_j) = \begin{cases} 0, & \text{if } a_i = b_j \\ 1, & \text{if } a_i \neq b_j \end{cases}$$
$$d_{LED}(i, j) = \min \begin{cases} d_{LED}(i-1, j-1) + \omega(s_{1,i}, s_{2,j}) & \text{Substitution} \\ d_{LED}(i, j-1) + 1 & \text{Insertion} \\ d_{LED}(i-1, j) + 1 & \text{Deletion} \end{cases} \qquad (3)$$
Example: s_1 = "sole" and s_2 = "solid"
  s_1: "sole"
  s_2: "solid"
  → d_LED = 2, two operations: substitution (e → i) and insertion (d at the end)
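The recurrence in Equation (3) can be filled in as an (n+1) x (m+1) table. The sketch below assumes the usual boundary conditions d_LED(i, 0) = i and d_LED(0, j) = j, which the slide does not state explicitly; the function name `levenshtein_distance` is chosen for illustration.

```python
def levenshtein_distance(s1, s2):
    """Levenshtein edit distance via dynamic programming in O(n*m)."""
    n, m = len(s1), len(s2)
    # d[i][j] = distance between the first i chars of s1 and first j chars of s2
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # assumed boundary: delete all i characters
    for j in range(m + 1):
        d[0][j] = j  # assumed boundary: insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            omega = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + omega,  # substitution (or match)
                d[i][j - 1] + 1,          # insertion
                d[i - 1][j] + 1,          # deletion
            )
    return d[n][m]


# Example from the slide.
print(levenshtein_distance("sole", "solid"))  # 2
```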

Genome reconstruction
Reconstruction is not trivial
[Figure: paired-end reads mapped to the reference genome, with two kinds of left-over reads: pairs where both reads could not be mapped, and pairs where a single read could not be mapped]

