Dominik Grimm: Data Mining in Bioinformatics, Page 1
Data Mining in Bioinformatics
Day 6: Classification in Next Generation Sequencing Data Analysis
Dominik Grimm
February 18 to March 1, 2013
Machine Learning & Computational Biology Research Group
Max Planck Institute Tübingen and Eberhard
Overview
- Genome sequencing: A brief review
  - Classical sequencing methods
  - Paired-end sequencing
- Next Generation Sequencing (NGS): A brief introduction
  - Next Generation Sequencing approaches
  - Illumina Genome Analyzer II
- Genome reconstruction
- Detecting structural variations
- Accurate indel prediction using paired-end short reads (Grimm et al., 2013)
  - SVM approach to predict indels
Genome sequencing: A brief review
Brief historical review
- First DNA sequences were obtained in the early 1970s (Min Jou et al., 1971, 1972)
- Laborious techniques were required to retrieve small DNA pieces, e.g. in 1973 the lac operator (24 base pairs (bp)) was sequenced by Walter Gilbert and Allan Maxam (Gilbert and Maxam, 1973)
- In 1977 two rapid sequencing methods were developed (almost simultaneously):
  - Maxam-Gilbert sequencing at Harvard University, USA
  - Sanger sequencing by Frederick Sanger at the University of Cambridge, UK
- Nobel Prize for Frederick Sanger, Walter Gilbert and Paul Berg in 1980
Maxam-Gilbert sequencing (Maxam and Gilbert, 1977)
- DNA sequences (ATTCGA) are marked at the 5' end
- The DNA sequences are modified at A, T, G or C and split off from the DNA backbone
- Gel electrophoresis (lanes A, T, G, C) is used to reconstruct the sequence ATTCGA

Rarely used, because:
- Sequences are marked at the 5' and 3' end with radioactive phosphorus (32P) or with non-radioactive biotin or fluorescein
- It is hard to automate
Sanger sequencing (Sanger et al., 1977)
[Figure: Sanger sequencing of the template ATTCGA with a primer. Four reactions (Polymerase + ddATP + a lot of dATP; Polymerase + ddTTP + a lot of dTTP; Polymerase + ddGTP + a lot of dGTP; Polymerase + ddCTP + a lot of dCTP) yield chain-terminated fragments that are separated in the gel lanes A, T, G and C.]

Widely used, because:
- Fewer toxic and radioactive substances are needed
- It is more efficient due to automation
- The method works for short sequence strands from 100 bp up to 1.5 kbp
Shotgun sequencing (Sanger et al., 1980, 1982)
- Start from several copies of the DNA
- Use mechanical shear forces to break the DNA into pieces at random positions
- Find overlapping pieces and reconstruct the original sequence

- For the first time it was possible to sequence long sequences, even whole genomes
- Efficient bioinformatic techniques are essential to reconstruct the original sequence
- The Institute for Genomic Research, led by Craig Venter, proposed a concept of highly parallel sequencing
Shotgun sequencing
- Error-prone due to sequencing errors, repetitive nucleotide patterns or similar reads from distinct genomic positions
→ DNA molecules are copied and sequenced numerous times
- The fold-coverage c measures the average number of reads covering a given nucleotide:

$$c(R, g) = \frac{1}{|g|} \sum_{i=1}^{|R|} |R_i| \qquad (1)$$

where g is a DNA sequence of length n and R a set of DNA reads with an average length m.

Example: A sequence of length 4000 bp is reconstructed using 20 reads, each with an average length of 600 bp → c = (20 · 600) / 4000 = 3x coverage (3-fold coverage)
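The coverage formula (1) can be checked numerically. The following is a minimal Python sketch; the function name `fold_coverage` is chosen here for illustration and is not part of the lecture material.

```python
def fold_coverage(read_lengths, genome_length):
    """Equation (1): summed read length divided by the genome length |g|."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(read_lengths) / genome_length


# The example from the slide: 20 reads of 600 bp on a 4000 bp sequence.
coverage = fold_coverage([600] * 20, 4000)
print(f"{coverage:.1f}x coverage")
```

Because only the summed read length enters the formula, reads of unequal length can be passed as well.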
Paired-end sequencing (Edwards and Caskey, 1991)
- In paired-end (or mate pair) sequencing both ends of the same fragment are sequenced
- The distance between the two reads of a paired-end read is known
- Reads that are reassembled approximately the known distance apart from each other are called happy, otherwise they are unhappy
[Figure: a DNA sequence with Read 1, Read 2 (happy) and Read 2 (unhappy), and the expected distance between the two reads]

Paired-end reads and their distance information help to reconstruct the original sequence more reliably.
Next Generation Sequencing
Overview
- In the last three decades Sanger sequencing was the most widely used and most productive sequencing method
→ But classical Sanger sequencing is still expensive: a lot of scientists, huge sequencing centers and a lot of time are required for whole genomes
→ There is a high demand for low-cost sequencing techniques
Cyclic-array sequencing technologies (Shendure and Ji, 2008)
- 454 pyrosequencing, used in the 454 Genome Sequencer, Roche Applied Science
- SOLiD platform, developed by Applied Biosystems
- Polonator, developed by George M. Church's group at Harvard Medical School
- HeliScope Single Molecule Sequencer by Helicos
- Solexa technology, used in the Illumina Genome Analyzer
→ "Cyclic-array based sequencing can be summarized as the sequencing of a dense array of DNA features by iterative cycles of enzymatic manipulation and imaging-based data collection" (Shendure and Ji, 2008)
The Illumina Genome Analyzer II
1. A DNA sequence library has to be prepared
(a) Create several copies of the DNA strand
(b) Fragment the strand using nebulization or sonication
(c) Amplify ends of fragments
(d) Phosphorylate the 5' end and add an adenosine overhang to the 3' end
(e) Ligate Illumina adapters
[Figure: library preparation steps: 1. DNA end-repair, 2. Phosphorylation, 3. Adenosine addition, 4. Adapter ligation]
The Illumina Genome Analyzer II
2. Flow cell preparation
[Figure: bridge amplification on the flow cell: 5. 3' extension, 6. Denaturation, 7. Bridge formation, 8. 3' amplification, 9. Bridge denaturation, 10. Bridge formation; repeated 36 times]
The Illumina Genome Analyzer II
3. Sequencing
- 11. Sequencing primers are hybridized
- 12. Single base extension and laser-based imaging
- 13. Fluorescent base cleavage; the terminator gets unblocked
- 14. Repeated more than 50 times to determine the sequence
→ Now it is possible to generate gigabases of high-quality reads within one day using only one machine and less than 6 hours of hands-on time
Sequencing costs (Wetterstrand, 2013)
Genome reconstruction
Reference guided mapping
- Millions of short reads (~30 up to 200 bp) are generated
→ It is a challenge to reconstruct the original sequence using assembly methods
- New approach: align short reads against a known genome of the same species (reference genome) → also referred to as mapping (tools: SHORE (Ossowski et al., 2008), SSAHA2 (Ning et al., 2001))
[Figure: paired-end short reads aligned against the reference genome]
Reference guided mapping with SHORE (Ossowski et al., 2008)
- SHORE uses the best-match strategy to map reads
→ Best matches are mapped first and then the number of mismatches and gaps is increased iteratively
- Reads with 0 mismatches and gaps are mapped first, followed by alignments with Levenshtein Edit Distance (LED) = 1 (Levenshtein, 1965) and Hamming Distance (HD) = 1 (Hamming, 1950), then LED = 2 and HD = 2, LED = 3 and HD = 3, LED = 4 and HD = 4.
Hamming Distance (HD) (Hamming, 1950)
The HD $d_{HD}$ measures the number of positions at which two strings s1 and s2 of equal length m differ:

$$d_{HD}(s_1, s_2) = \sum_{\substack{i = 1, \dots, m \\ s_{1i} \neq s_{2i}}} 1 \qquad (2)$$

Example: s1 = "ATCCATGC" and s2 = "ATGGATAC"
s1: "ATCCATGC"
s2: "ATGGATAC"
→ d_HD = 3
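Equation (2) translates directly into a few lines of Python. This is a minimal sketch; the function name `hamming_distance` is an illustrative choice, not something defined in the lecture.

```python
def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ (equation 2)."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(1 for a, b in zip(s1, s2) if a != b)


# The example from the slide.
print(hamming_distance("ATCCATGC", "ATGGATAC"))
```

The length check matters: the HD is only defined for strings of equal length, which is why gaps require the Levenshtein Edit Distance instead.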
Levenshtein Edit Distance (LED) (Levenshtein, 1965)
The LED $d_{LED}$ is the minimum number of edit operations needed to transform a string s1 of length n into a string s2 of length m. Allowed edit operations are deletion, insertion or substitution of a single character. It can be computed in O(nm) using dynamic programming.

$$\omega(a_i, b_j) = \begin{cases} 0, & \text{if } a_i = b_j \\ 1, & \text{if } a_i \neq b_j \text{ (substitution)} \end{cases}$$

$$d_{LED}(i, j) = \min \begin{cases} d_{LED}(i-1, j-1) + \omega(s_{1i}, s_{2j}) \\ d_{LED}(i, j-1) + 1 & \text{(insertion)} \\ d_{LED}(i-1, j) + 1 & \text{(deletion)} \end{cases} \qquad (3)$$

Example: s1 = "sole" and s2 = "solid"
s1: "sole"
s2: "solid"
→ d_LED = 2, two operations: substitution (e → i), insertion (d at the end)
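The recurrence (3) can be filled in row by row. The sketch below is a plain Python version; the boundary conditions d(i, 0) = i and d(0, j) = j are the standard ones and are an assumption here, since the slide does not state them explicitly.

```python
def levenshtein(s1, s2):
    """Levenshtein Edit Distance via the O(nm) dynamic program of equation (3)."""
    n, m = len(s1), len(s2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary: turning a prefix into the empty string costs its length.
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            omega = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + omega,  # match or substitution
                d[i][j - 1] + 1,          # insertion
                d[i - 1][j] + 1,          # deletion
            )
    return d[n][m]


# The example from the slide.
print(levenshtein("sole", "solid"))
```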
Reconstruction is not trivial
[Figure: paired-end reads mapped against the reference genome]
- Left-over reads: both reads could not be mapped
- Left-over reads: a single read could not be mapped
Reconstruction is not trivial
[Figure: two cases of paired-end reads mapped against the reference genome, each compared with the true insertion size: (1) the insertion size is larger than expected; (2) the insertion size is smaller than expected]
Detecting structural variations
- One decisive factor for genetic differences between individuals is structural variation
- Examples of structural variations (SVs):
  - Single nucleotide variations (SNVs)
  - Single nucleotide polymorphisms (SNPs)
  - Insertions and deletions (indels)
  - Inversions, translocations, duplications ...
- SVs provide important insight and are widely used for evolutionary studies, association studies or to generate genetic markers for clinical studies
Inferring indels with discordant paired-end reads (Tuzun et al., 2005)
[Figure: reference genome versus sequenced genome, once for a deletion and once for an insertion]
- If the distance between two reads is significantly smaller than expected, it is an indicator of an insertion
- If the distance between two reads is significantly larger than expected, it is an indicator of a deletion
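The two rules above can be sketched as a small decision function. The function name and the `tolerance` parameter are assumptions made for illustration: the slide only says "significantly" smaller or larger, so the threshold that defines significance (for example a multiple of the insert-size standard deviation of the library) has to be supplied by the user.

```python
def classify_read_pair(observed_distance, expected_distance, tolerance):
    """Label a mapped read pair by comparing its distance with the expectation.

    tolerance is a hypothetical, user-chosen margin that stands in for
    "significantly" smaller or larger in the rule on the slide.
    """
    if observed_distance < expected_distance - tolerance:
        return "insertion"   # reads map closer together than expected
    if observed_distance > expected_distance + tolerance:
        return "deletion"    # reads map further apart than expected
    return "concordant"      # a "happy" pair, no indel signal


for distance in (150, 320, 500):
    print(distance, classify_read_pair(distance, expected_distance=300, tolerance=50))
```

A single discordant pair is weak evidence, so in practice several pairs pointing at the same region would be collected before calling an indel.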
How can we improve SV calling?
Pindel: A pattern growth approach (Ye et al., 2009)
[Figure: split-read mapping with Pindel. Deletion: one read of the pair is mapped, and the unmapped read is split at the breakpoint on the reference genome. Insertion: one read of the pair is mapped, and the unmapped read spans the insertion.]
- Keep all paired-end reads for which one read can be mapped uniquely and exactly (no mismatches are allowed)
- Use PrefixSpan (Pei et al., 2004) to search for the minimum and maximum unique prefix from the 3' end of the unmapped read onto the reference within a window of two times the average insertion size
- Try to map the remaining part of the unmapped read at the 5' end
- A deletion is reported if at least two unmapped reads support the same mapping position
Pindel: A pattern growth approach (Ye et al., 2009)
- Pindel is fast and can be applied to large and whole genomes
- Pindel is able to detect indels at the base-pair level
- Pindel uses left-over reads to get more accurate candidates
Which drawbacks does this approach have?
What can we do to improve SV calling?
Drawbacks
- The method uses exclusively unique and exact matches → a lot of reads are excluded because of sequencing errors, repetitive regions and multiple mappings
- The second partner has to be mapped within two times the average insertion size
- Only one weak feature is used to accurately identify indels (at least two unmapped reads supporting the same mapping position)
Accurate indel prediction
Objectives to accomplish
- Use reads that are not error-free and non-unique reads (reads with multiple mappings)
- Allow mismatches and gaps in the realignment step
- Allow realignments larger than two times the average insertion size
- Find a more comprehensive set of features
- Use a discriminative classifier to predict whether an indel is a true or a false one
Realignment using Gotoh’s algorithm (Gotoh, 1982)
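The affine gap model: $g(k) = g_o + (k - 1) g_e$, where k is the number of gaps, $g_o$ the cost to open a gap and $g_e$ the cost to extend a gap.

$$D(i, j) = \max \begin{cases} D(i-1, j-1) + \omega(s_{1i}, s_{2j}) & \text{Match/Mismatch} \\ F(i, j) & \text{Deletion matrix} \\ E(i, j) & \text{Insertion matrix} \end{cases} \qquad (4)$$

where $1 \le i \le n$, $1 \le j \le m$ and

$$F(i, j) = \max \begin{cases} D(i-1, j) + g_o, & \forall i \ge 1 \\ F(i-1, j) + g_e, & \forall i \ge 2 \end{cases} \qquad (5)$$

$$E(i, j) = \max \begin{cases} D(i, j-1) + g_o, & \forall j \ge 1 \\ E(i, j-1) + g_e, & \forall j \ge 2 \end{cases} \qquad (6)$$

$$D(0, 0) = D(i, 0) = D(0, j) = 0, \qquad F(i, 0) = E(0, j) = -\infty$$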
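To make the three-matrix recurrence concrete, here is a minimal Python sketch of equations (4) to (6) with the boundary conditions given above. The scoring values (match +1, mismatch -1, gap open -2, gap extend -1) are assumptions chosen for illustration; because the recurrences take a maximum, the gap costs enter as negative scores. The sketch returns only the score D(n, m) and does no traceback.

```python
NEG_INF = float("-inf")


def affine_gap(k, gap_open=-2, gap_extend=-1):
    """Affine gap model g(k) = g_o + (k - 1) * g_e for a gap of length k >= 1."""
    return gap_open + (k - 1) * gap_extend


def gotoh_score(s1, s2, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Fill D, F and E as in equations (4)-(6) and return D(n, m)."""
    n, m = len(s1), len(s2)
    # Boundary as on the slide: D is 0 on the first row and column,
    # F and E start at minus infinity.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # For i = 1 (or j = 1) the extension term is -inf, so only gap opening applies.
            F[i][j] = max(D[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            E[i][j] = max(D[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            omega = match if s1[i - 1] == s2[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + omega, F[i][j], E[i][j])
    return D[n][m]


print(affine_gap(3))                # cost of one gap of length 3
print(gotoh_score("ACGT", "ACT"))   # one base of s1 has to be skipped
```

With these scores a gap of length 3 costs less than three separate gaps of length 1, which is what lets the realignment step favour one long indel over many scattered ones.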
Realignment using Gotoh’s algorithm (Gotoh, 1982)
[Figure: example dynamic programming matrices for aligning the sequences ATGGCC and ATCC, with cells initialized to -inf]
Realignment
- The position of the mapped read is known
- Anchor: mapped_read_position + read_length
- Alignment window: user-defined size, but it has to be at least 1.5 times the average insertion size
- The unmapped read (sequence y) is aligned against the reference sequence x
[Figure labels: Reference Genome, Sequence x, Sequence z]

Mapped read, reference genome and unmapped read (sequence y):

     GCACGATCACTCACTTCA
AAATTGCACGATCACTCACTTCATCATCTACTTCATATCCCACACCCACCCACAGTGACACACACTGTGCATCGTACGATCGATCGAAAGAACTAG
CACCCACACACAGTATCGTACGATCG

Alignment with a gap in the reference:

     GCACGATCACTCACTTCA
AAATTGCACGATCACTCACTTCATCATCTACTTCATATCCCACACCCACCCACAGT--------------ATCGTACGATCGATCGAAAGAACTAG
                                          |||||||.||||||              ||||||||||||
                                          CACCCACACACAGTGACACACACTGTGCATCGTACGATCG

Alignment with a gap in the read:

     GCACGATCACTCACTTCA
AAATTGCACGATCACTCACTTCATCATCTACTTCATATCCCACACCCACCCACAGTGACACACACTGTGCATCGTACGATCGATCGAAAGAACTAG
                                          |||||||.||||||              ||||||||||||
                                          CACCCACACACAGT--------------ATCGTACGATCG
Pindel uses at least two unmapped reads supporting the same position as its only evidence!
How can we improve this?
Are there other features we could use?
Alignment features
Category 1 (only for deletions)
1. Number of uniquely mapped reads (UMRs) overlapping the deletion candidate
2. Number of error-free UMRs overlapping the deletion candidate
3. Number of non-uniquely mapped reads (N-UMRs) overlapping the deletion candidate
4. Number of error-free N-UMRs overlapping the deletion candidate

Category 2 (for all indels)
5. Number of UMRs mapping 60 bp upstream of the indel candidate
6. Number of error-free UMRs mapping 60 bp upstream of the indel candidate
7. Number of N-UMRs mapping 60 bp upstream of the indel candidate
8. Number of error-free N-UMRs mapping 60 bp upstream of the indel candidate
9. Number of UMRs mapping 60 bp downstream of the indel candidate
10. Number of error-free UMRs mapping 60 bp downstream of the indel candidate
11. Number of N-UMRs mapping 60 bp downstream of the indel candidate
12. Number of error-free N-UMRs mapping 60 bp downstream of the indel candidate

[Figure: reference genome with a deletion candidate and the overlapping reads]
Alignment features
Category 3 (for all indels)
16. Number of split reads supporting the same indel location (split read alignment support)
17. Deletion/Insertion length

Category 4 (for all indels)
13. Single position variation (SVP) from split read alignment confirmed by SVP from the mapping algorithm
14. SVP from split read alignment not confirmed by SVP from the mapping algorithm
15. SVP from mapping algorithm not confirmed by SVP from the split read alignment

[Figure: reference genome with an indel candidate, showing SVPs that are mapping supported, split read alignment supported, split read alignment not supported and mapping not supported]
Model validation
[Figure: scatter plot over the features X1 and X2 showing training data and testing data, the negative class (-) and the positive class (+), with examples of tp, fn, fp and tn]
Confusion Matrix
- True Positive (TP): Person is sick and test is positive
- False Positive (FP): Person is healthy but test is positive
- True Negative (TN): Person is healthy and test is negative
- False Negative (FN): Person is sick but test is negative

Test outcome \ Ground truth | Person is sick, P = (TP + FN)       | Person is healthy, N = (TN + FP)
Test positive (TP + FP)     | true positive (TP)                  | false positive (FP) (type I error)
Test negative (FN + TN)     | false negative (FN) (type II error) | true negative (TN)
Derivations from a confusion matrix
Sensitivity, True Positive Rate (TPR) or Recall:
$$TPR = \frac{TP}{P} = \frac{TP}{TP + FN} \qquad (7)$$

Specificity or True Negative Rate (TNR):
$$TNR = \frac{TN}{N} = \frac{TN}{FP + TN} = 1 - FPR \qquad (8)$$

False Positive Rate (FPR):
$$FPR = \frac{FP}{N} = \frac{FP}{FP + TN} \qquad (9)$$

Precision or Positive Predictive Value (PPV):
$$PPV = \frac{TP}{TP + FP} \qquad (10)$$

Accuracy (ACC):
$$ACC = \frac{TP + TN}{P + N} = \frac{TP + TN}{TP + FN + TN + FP} \qquad (11)$$
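Equations (7) to (11) can be evaluated directly from the four confusion matrix counts. The following Python sketch uses made-up counts purely for illustration; the function name `confusion_metrics` is likewise an assumption.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive the measures of equations (7)-(11) from confusion matrix counts."""
    p = tp + fn  # all truly positive cases
    n = tn + fp  # all truly negative cases
    return {
        "TPR": tp / p,               # sensitivity / recall, eq. (7)
        "TNR": tn / n,               # specificity, eq. (8)
        "FPR": fp / n,               # eq. (9)
        "PPV": tp / (tp + fp),       # precision, eq. (10)
        "ACC": (tp + tn) / (p + n),  # accuracy, eq. (11)
    }


# Hypothetical counts, not results from the lecture.
for name, value in confusion_metrics(tp=40, fp=10, tn=45, fn=5).items():
    print(f"{name} = {value:.3f}")
```

The sketch assumes that both classes are present and that at least one positive prediction was made; otherwise the divisions are undefined.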
Receiver operating characteristic curve (ROC)
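- The ROC curve plots the fraction of TP over all positives (TP + FN), i.e. the TPR, against the fraction of TN over all negatives (TN + FP), i.e. the TNR. The conventional ROC plot uses the FPR = 1 - TNR on the horizontal axis, which shows the same information mirrored.
- The Specificity-Sensitivity Break-Even-Point (Spec-Sens-BEP) is the point where the TPR is equal to the TNR.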
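One way to locate the Spec-Sens-BEP is to sweep a decision threshold over the classifier scores and pick the threshold at which TPR and TNR are closest. The sketch below is an illustration under assumptions: the function names, the rule "predict positive if score >= threshold" and the toy scores are not from the lecture, and with finitely many samples TPR and TNR may never be exactly equal, so the closest point is returned.

```python
def sens_spec_curve(scores, labels):
    """Return (threshold, TPR, TNR) for each distinct score used as threshold.

    labels are 1 for the positive class and 0 for the negative class;
    both classes must be present.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    curve = []
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y != 1)
        curve.append((threshold, tp / positives, tn / negatives))
    return curve


def break_even_point(curve):
    """Pick the curve point where TPR and TNR are closest to each other."""
    return min(curve, key=lambda point: abs(point[1] - point[2]))


# Toy classifier scores, purely illustrative.
curve = sens_spec_curve([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(break_even_point(curve))
```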
How can we find out if an indel candidate is a TP, TN, FP or FN?
How can we build a predictor?
- We randomly select a set of indel candidates found by the realignment step
- We use Sanger sequencing to determine whether each candidate is a true or a false indel
- We train a C-SVM (using a linear kernel) on our sequenced ground truth and all alignment features from the realignment step
- We cross-validate to determine the best parameters and compute the AUC and BEP
- We use the best parameters to build a predictor using a C-SVM with a linear kernel
Model complexity
[Figure: error as a function of C, showing the training-error and validation-error curves and the optimal model]
C-SVM training using a 10-fold cross-validation
[Figure: cross-validation scheme, shown for Fold 1 to Fold k]
- The data set is split into k different training and test sets (k-fold): training set t1 ... tk (90%) and test set e1 ... ek (10%)
- Within each fold the training set is split into a subtraining set (81%) and a subtest set (9%)
- An SVM is trained on the subtraining set and tested on the subtest set for each C from 10^-5 to 10^5 (101 steps)
- The SVM is then trained with the best performing C and tested
- The Spec-Sens-BEP is averaged over the k folds
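The scheme above can be sketched as a small, library-free skeleton in Python. Several points are assumptions made for illustration: the 101 values of C are taken to be log-spaced, the inner split holds out one tenth of the training set once (giving the 81% / 9% split), and `fit_and_score` is a hypothetical callback that would train a linear C-SVM on the given training indices and return its Spec-Sens-BEP on the given test indices.

```python
import random


def c_grid(low_exp=-5, high_exp=5, steps=101):
    """Assumed log-spaced grid of C values from 10**low_exp to 10**high_exp."""
    width = (high_exp - low_exp) / (steps - 1)
    return [10.0 ** (low_exp + i * width) for i in range(steps)]


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and return k (train, test) index splits."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    splits = []
    for fold in (order[i::k] for i in range(k)):
        held_out = set(fold)
        splits.append(([i for i in order if i not in held_out], fold))
    return splits


def nested_cv(n, k, fit_and_score, grid, seed=0):
    """Select C on a subtraining/subtest split, then score on the outer test set."""
    outer_scores = []
    for train, test in k_fold_indices(n, k, seed):
        cut = max(1, len(train) // 10)  # a tenth of the training set becomes the subtest set
        subtest, subtrain = train[:cut], train[cut:]
        best_c = max(grid, key=lambda c: fit_and_score(subtrain, subtest, c))
        outer_scores.append(fit_and_score(train, test, best_c))
    return sum(outer_scores) / len(outer_scores)


# Dummy callback standing in for "train SVM, return Spec-Sens-BEP".
average = nested_cv(100, 10, lambda train, test, c: -abs(c - 1.0), c_grid())
print(average)
```

Selecting C only on data inside the training set keeps the outer test sets untouched, so the averaged score is not biased by the parameter search.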
Results
Feature weights
Feature weights
Population structure
(A) PCA using all positively classified indels from our approach. Classifying indel candidates as true or false ones leads to a more distinct population structure
(B) PCA for all indels from the tool Pindel (version 0.1)
(C) PCA for all indels from the tool Pindel (version 0.24)
Functional indel annotation
[Figure: two panels showing the share (%) of functional indel annotations per allele frequency bin (1, 2-9, 10-80)]
Panel 1: Intergenic (17,568), Transposon (1,071), UTR (5,712), Intron (9,970), In-Frame (3,417), Premature stop (1,491), Lost start (451), Lost stop (436), Splicesite (915), Whole gene (52)
Panel 2: Intergenic (1,289), Transposon (13), UTR (574), Intron (805), In-Frame (17), Premature stop (117), Lost start (1), Lost stop (16), Splicesite (1)
Summary
- Using exact alignment algorithms leads to a larger set of indel candidates
→ More computational resources are required
- Using more features and machine learning techniques leads to fewer false positive candidates
→ Fewer spurious biological interpretations
- Learning a classifier provides insights into the importance of different features
References and further reading
References
Edwards, A. and Caskey, C. T. (1991). Closure strategies for random DNA sequencing. Methods, 3(1), 41–47.
Gilbert, W. and Maxam, A. (1973). The nucleotide sequence of the lac operator. Proc Natl Acad Sci U S A, 70(12), 3581–4.
Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3), 705–708.
Grimm, D., Hagmann, J., Koenig, D., Weigel, D., and Borgwardt, K. (2013). Accurate indel prediction using paired-end short reads. BMC Genomics.
Hamming, R. (1950). Error detecting and error correcting codes. Bell System Technical Journal.
Levenshtein, V. I. (1965). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.
Maxam, A. M. and Gilbert, W. (1977). A new method for sequencing DNA. Biotechnology, 24, 99–103.
Min Jou, W., Haegeman, G., and Fiers, W. (1971). Nucleotide sequences corresponding to parts of the MS2 coat protein cistron. Arch Int Physiol Biochim, 79(2), 420–2.
Min Jou, W., Haegeman, G., Ysebaert, M., and Fiers, W. (1972). Nucleotide sequence of the gene coding for the bacteriophage MS2 coat protein. Nature, 237(5350), 82–8.
Ning, Z., Cox, A., and Mullikin, J. (2001). SSAHA: a fast search method for large DNA databases. Genome Research, 11(10), 1725–1729.
Ossowski, S., Schneeberger, K., Clark, R., Lanz, C., Warthmann, N., and Weigel, D. (2008). Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Research, 18(12), 2024–2033.
Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., and Hsu, M.-C. (2004). Mining sequential patterns by pattern-growth: the PrefixSpan approach. Transactions on Knowledge and Data Engineering, 16(11), 1424–1440.
Sanger, F., Nicklen, S., and Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A, 74(12), 5463–7.
Sanger, F., Coulson, A. R., Barrell, B. G., Smith, A. J., and Roe, B. A. (1980). Cloning in single-stranded bacteriophage as an aid to rapid DNA sequencing. J Mol Biol, 143(2), 161–78.
Sanger, F., Coulson, A. R., Hong, G. F., Hill, D. F., and Petersen, G. B. (1982). Nucleotide sequence of bacteriophage lambda DNA. J Mol Biol, 162(4), 729–73.
Shendure, J. and Ji, H. (2008). Next-generation DNA sequencing. Nature Biotechnology, 26(10), 1135–1145. PMID: 18846087.
Tuzun, E., Sharp, A., Bailey, J., et al. (2005). Fine-scale structural variation of the human genome. Nat Genet, 37, 727–32.
Wetterstrand, K. (2013). DNA sequencing costs: Data from the NHGRI Genome Sequencing Program (GSP). Accessed 19.03.2013. Available at: www.genome.gov/sequencingcosts.
Ye, K., Schulz, M., Long, Q., Apweiler, R., and Ning, Z. (2009). Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25(21), 2865–2871.