

SLIDE 1

MapReduce for accurate error correction of next-generation sequencing data

Assoc. Prof. Liang Zhao

School of Computing and Electronic Information

Guangxi University & Taihe Hospital

Oct 5th, 2016

(GXU, THH) error correction GIW2016 1 / 26

SLIDE 2

NGS introduction

What is next-generation sequencing (NGS)?

  • High throughput, e.g., millions of sequences per run
  • Low cost, e.g., $1000 per human genome

What does NGS data look like?

[Figure: sequencing produces reads composed of the bases adenine, thymine, guanine, and cytosine.]

SLIDE 3

Applications of NGS data analysis

  • De novo genome assembly
  • Genome re-sequencing
  • Genetic variations:
    ◮ Single nucleotide polymorphisms (SNPs)
    ◮ Small insertions/deletions (indels)
    ◮ Structural variations
  • Linking genetic variations to diseases:
    ◮ Genome-wide association studies (GWAS)
    ◮ Functional categorization of SNPs
  • RNA-Seq:
    ◮ Gene expression
    ◮ Exon-intron structure

[Figure: SNP-associated trait categories on human chromosome 6 as of 2014. Source: EBI, http://www.ebi.ac.uk/fgpt/gwas/images/timeseries/gwas-2014-05.png]

SLIDE 4

Errors in NGS data

Types of errors:

  • Substitution
    ◮ Error rate of Illumina sequences: 0.5% to 2.5%
    ◮ Other platforms: negligible
  • Insertion/deletion (indel)
    ◮ Error rate of Roche 454 sequences: 1.5% to 5%
    ◮ Error rate of PacBio sequences: 15% to 20%
    ◮ Error rate of Oxford Nanopore sequences: 25% to 40%
    ◮ Error rate of Illumina sequences: negligible

[Figure (a): example reads (such as TGACTGCAGC and ACT-CAACGGG) aligned against the sequence TCTGACTGCAACGGGCAATAT--GTCTCTGT, with position indices along the bottom.]

SLIDE 5

Extra complexity introduced by errors

  • De novo assembly (de Bruijn graph-based):
    ◮ Branches
    ◮ Bubbles
    ◮ Tips
  • Mapping:
    ◮ Reads map to an incorrect place
    ◮ Reads map to multiple places
    ◮ Reads cannot be mapped
  • Variant calling:
    ◮ False-positive SNPs occur at a rate of 1/300

[Figures (b) and (c): de Bruijn graphs over 4-mers such as TGCA, GGCA, and GGGC, with numeric labels of 2 or 3.]

SLIDE 6

Existing error correction approaches

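  • K-spectrum-based approaches
  • Suffix tree-/array-based approaches
  • Multiple sequence alignment-based approaches
  • Cluster-based approaches
  • Probabilistic model-based approaches

[Figure: panels (a), (b), and (c), which reappear on the following slides with the k-spectrum-based, suffix tree-/array-based, and multiple sequence alignment-based approaches.]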

SLIDE 7

K-spectrum-based approach

Algorithm briefing

  • Decompose reads into k-mers;
  • Count the frequencies of the k-mers;
  • Replace each low-frequency k-mer with the nearest high-frequency k-mer.
    ◮ Bloom filter
    ◮ Graph model

Pros & cons:

  • Pretty fast
  • Good scalability
  • Very sensitive to k

[Figure: panel (a).]
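The three steps of the k-spectrum approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of any tool named in these slides: the function names `count_kmers` and `correct_read`, the `min_count` threshold separating low-frequency from high-frequency k-mers, and the choice of "nearest" as a single substitution (Hamming distance 1) are all hypothetical.

```python
from collections import Counter

BASES = "ACGT"

def count_kmers(reads, k):
    """Steps 1 and 2: decompose reads into k-mers and count their frequencies."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, min_count):
    """Step 3: replace each low-frequency k-mer with the most frequent
    high-frequency k-mer that differs by one substitution (assumed notion of "nearest")."""
    bases = list(read)
    for i in range(len(bases) - k + 1):
        kmer = "".join(bases[i:i + k])
        if counts[kmer] >= min_count:
            continue  # already a high-frequency k-mer
        best, best_count = None, min_count - 1
        for pos in range(k):
            for b in BASES:
                if b == kmer[pos]:
                    continue
                cand = kmer[:pos] + b + kmer[pos + 1:]
                if counts[cand] > best_count:
                    best, best_count = cand, counts[cand]
        if best is not None:
            bases[i:i + k] = best  # overwrite the window with the chosen k-mer
    return "".join(bases)

# Four error-free copies of a read and one copy with a substitution (A -> G).
reads = ["TGACTGCAAC"] * 4 + ["TGACTGCAGC"]
counts = count_kmers(reads, 4)
fixed = correct_read("TGACTGCAGC", counts, 4, 2)
```

Changing `k` or `min_count` changes which k-mers count as low-frequency, which is one way to see why this family of methods is sensitive to k.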

SLIDE 8

Suffix tree-/array-based approach

Algorithm briefing

  • Construct a suffix tree/array for all the reads;
  • Count the frequencies of all the nodes;
  • Replace each low-frequency branch with its high-frequency neighbor.

[Figure: panel (b), a suffix tree with nodes x and y, their counts nx and ny, and branches A, B, and C.]

Pros & cons:

  • k is flexible
  • Memory consuming
  • Slow

SLIDE 9

Multiple sequence alignment-based approach

Algorithm briefing

  • Group reads by k-mers;
  • Perform multiple sequence alignment;
  • Edit reads using the consensus of the alignment.

[Figure: panel (c).]

Pros & cons:

  • More accurate
  • Not very sensitive to k
  • Very high time and space complexity
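The consensus-editing step can be illustrated with a deliberately simplified sketch. It assumes the reads have already been placed at known offsets with no gaps, which replaces a real multiple sequence alignment with plain column stacking; the function name `consensus_correct` and the `(offset, read)` input format are hypothetical.

```python
from collections import Counter

def consensus_correct(placed_reads):
    """placed_reads: list of (offset, read) pairs, already aligned without gaps.
    Returns the same list with every base replaced by the majority base of its column.
    Ties go to the base that was seen first in that column."""
    columns = {}
    for offset, read in placed_reads:
        for i, base in enumerate(read):
            columns.setdefault(offset + i, Counter())[base] += 1
    consensus = {pos: c.most_common(1)[0][0] for pos, c in columns.items()}
    return [
        (offset, "".join(consensus[offset + i] for i in range(len(read))))
        for offset, read in placed_reads
    ]

# The second read carries a substitution (G instead of A) at column 8.
placed = [(0, "TGACTGCAAC"), (2, "ACTGCAGC"), (3, "CTGCAACG"), (4, "TGCAACGG")]
corrected = consensus_correct(placed)
```

Gapped alignment, which is what makes the real approach expensive in time and space, is left out here on purpose.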

SLIDE 10

The completeness-of-coverage

Existing approaches cannot guarantee the completeness of coverage.

[Figure: reads aligned to the reference, covering position i.]

Suppose the read length is l, the per-base error rate is e, the position coverage is d, and the k-mer size is k. Then the expected number of occurrences of the same k-mer covering each position is:

$$d' = d \cdot \frac{l - k + 1}{l} \cdot (1 - e)^k$$
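To get a feel for the formula, the short sketch below evaluates d′ for one set of values. The numbers (30× coverage, 101 bp reads, k = 21, 1% error rate) are illustrative assumptions chosen to resemble the data sets discussed later, and the function name `expected_kmer_coverage` is hypothetical.

```python
def expected_kmer_coverage(d, l, k, e):
    """d' = d * (l - k + 1) / l * (1 - e)**k"""
    return d * (l - k + 1) / l * (1 - e) ** k

# Roughly 19.5 of the 30 reads covering a position still contribute the same k-mer.
d_prime = expected_kmer_coverage(d=30, l=101, k=21, e=0.01)
```

Under these assumptions about a third of the nominal coverage is lost to a single k-mer group, which is the gap that the completeness-of-coverage argument addresses.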

SLIDE 11

Two-layered MapReduce framework

[Figure: overview of the two-layered framework, showing alignment1 to alignment3 and graph1 to graph3.]

SLIDE 12

The first layer of MapReduce

[Figure: reads are decomposed into k-mers and assigned to groups via tuples (κ, j, i, ι).]

SLIDE 13

The first layer of MapReduce

Input:

  • A set of paired-end reads R.

Goal:

  • Identify prospective erroneous bases in R.

Procedures:

  • Map all the reads of R into groups:
    ◮ The keys are the k-mers;
    ◮ The values are the tuples (κ, j, i), indicating that the k-mer κ occurs in read r_j starting at position i;
    ◮ Tuples having the same κ are assigned to the same group.
  • Perform multiple read alignment within each group, taking the k-mer κ as the seed.
  • Identify positions in the alignments that have inconsistent base composition.
  • Recombine the reads covering each position that has been identified as erroneous.
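The map, group, and identify steps of the first layer can be sketched in plain Python, without any MapReduce runtime. This is a simplified illustration, not the MEC implementation: the seed-anchored stacking is gap-free, the recombination step is omitted, and the function names `map_reads`, `group_by_kmer`, and `flag_inconsistent` are hypothetical.

```python
from collections import defaultdict

def map_reads(reads, k):
    """Map: emit (kmer, (j, i)) for each k-mer occurring in read j at position i."""
    for j, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            yield read[i:i + k], (j, i)

def group_by_kmer(pairs):
    """Shuffle: values sharing the same k-mer key land in the same group."""
    groups = defaultdict(list)
    for kmer, value in pairs:
        groups[kmer].append(value)
    return groups

def flag_inconsistent(reads, members):
    """Reduce: stack the group's reads so that the seed k-mer lines up, then
    report every column holding more than one distinct base.
    Each reported column is a list of (read index, position in read)."""
    columns = defaultdict(list)
    for j, i in members:
        for p, base in enumerate(reads[j]):
            columns[p - i].append((j, p, base))  # column index relative to the seed
    flagged = []
    for col in sorted(columns):
        entries = columns[col]
        if len({base for _, _, base in entries}) > 1:
            flagged.append([(j, p) for j, p, _ in entries])
    return flagged

reads = ["TGACTGCAAC", "ACTGCAGC", "CTGCAACG"]
groups = group_by_kmer(map_reads(reads, 4))
suspects = flag_inconsistent(reads, groups["CTGC"])
```

Each group depends only on its own key, so on a cluster every k-mer group could be processed by a separate reducer.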

SLIDE 14

The first layer of MapReduce

  • Completes the coverage by collecting reads from multiple groups.
  • Improves the accuracy markedly.

SLIDE 15

The second layer of MapReduce

Input:

  • The prospective erroneous positions, together with the reads covering them.

Goal:

  • Correct the erroneous bases of all the reads.

Procedure:

  • Map all the positions to computing units:
    ◮ The key is the position;
    ◮ The value is the set of reads covering the position.
  • Correct the prospective erroneous bases using the following statistic:

$$L_{x/x_0} = \log \frac{\prod_j \left[ I(\hat{j} = x)\, p_j + I(\hat{j} \neq x)\, (1 - p_j)/3 \right]}{\prod_j \left[ I(\hat{j} = x_0)\, p_j + I(\hat{j} \neq x_0)\, (1 - p_j)/3 \right]}$$

where $I(\cdot)$ is an indicator function, $p_j$ is the probability that base $j$ was called correctly, $x_0$ is the base having the largest support, and $x$ is the prospective erroneous base to be corrected.
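A hedged sketch of how such a statistic could be evaluated is given below. It rests on two assumptions that the extracted slide text does not settle: that the per-base terms are combined as a product over the bases covering the position (a standard likelihood ratio), and that the second indicator in each term is the "not equal" case. The function names and the `(called_base, p)` input format are hypothetical, and every `p` must lie strictly between 0 and 1.

```python
import math

def log_likelihood(candidate, observations):
    """observations: list of (called_base, p), where p is the probability that
    the call is correct. Sums the log of p when the call agrees with the
    candidate and the log of (1 - p) / 3 otherwise."""
    total = 0.0
    for base, p in observations:
        total += math.log(p if base == candidate else (1.0 - p) / 3.0)
    return total

def log_ratio(x, x0, observations):
    """L_{x/x0}: log-likelihood of candidate x minus that of the best-supported base x0."""
    return log_likelihood(x, observations) - log_likelihood(x0, observations)

# Five confident calls of A and one less confident call of G at the same position.
obs = [("A", 0.99)] * 5 + [("G", 0.9)]
score = log_ratio("G", "A", obs)
```

A strongly negative `score` says the observed bases are far better explained by `x0` than by `x`; the threshold at which a base is actually rewritten is not given in the slides.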

SLIDE 16

Data sets

Data sets used for performance evaluation

Data set  Genome name     Genome size (Mbp)  Read length (bp)  Coverage  Number of paired-end reads  Per base error rate (%)
R1        S. aureus       2.8                101               46.3×     1,294,104                   1.17
R2        R. sphaeroides  4.6                101               33.6×     766,646                     1.28
R3        H. sapiens 14   88.3               101               38.3×     16,757,120                  0.86
R4        B. impatiens    249.2              124               150.8×    303,118,594                 0.96
D1        E. coli         4.6                101               30.0×     689,927                     1.35
D2        S. cerevisiae   12.4               101               60.2×     3,599,533                   1.53
D3        H. sapiens 22   41.8               101               30.0×     6,209,209                   1.47
D4        H. sapiens 22   41.8               150               60.0×     8,361,240                   1.59

R1 to R4 are real data sets. D1 to D4 are simulated data sets.

SLIDE 17

Performance evaluation & comparison

data  corrector  reca   prec   gain    pber*  |  data  corrector  reca   prec   gain    pber*
R1    MEC        0.893  0.924  0.874   0.103  |  R2    MEC        0.944  0.963  0.894   0.120
      Coral      0.803  0.858  0.728   0.210  |        Coral      0.663  0.970  0.642   0.460
      Racer      0.822  0.929  0.760   0.190  |        Racer      0.921  0.949  0.872   0.150
      BLESS      0.409  0.650  0.189   0.879  |        BLESS      0.722  0.989  0.714   0.340
      BFC        0.817  0.927  0.753   0.196  |        BFC        0.726  0.990  0.716   0.323
      SGA        0.815  0.922  0.746   0.196  |        SGA        0.641  0.985  0.631   0.460
R3    MEC        0.874  0.936  0.814   0.260  |  R4    MEC        0.836  0.884  0.746   0.271
      Coral      0.690  0.779  0.495   0.430  |        Coral      -      -      -       -
      Racer      0.797  0.890  0.699   0.230  |        Racer      0.541  0.703  0.313   0.484
      BLESS      0.558  0.965  0.538   0.390  |        BLESS      0.018  0.003  -0.517  0.862
      BFC        0.641  0.966  0.613   0.319  |        BFC        0.457  0.636  0.195   0.607
      SGA        0.663  0.967  0.641   0.310  |        SGA        0.690  0.823  0.542   0.289
D1    MEC        0.999  0.996  0.995   0.007  |  D2    MEC        0.997  0.984  0.983   0.031
      Coral      0.998  0.986  0.984   0.024  |        Coral      0.995  0.954  0.947   0.081
      Racer      0.998  0.981  0.980   0.032  |        Racer      0.994  0.957  0.949   0.077
      BLESS      0.980  0.998  0.979   0.033  |        BLESS      0.984  0.997  0.981   0.029
      BFC        0.991  0.999  0.990   0.016  |        BFC        0.970  0.997  0.968   0.043
      SGA        0.990  0.999  0.989   0.016  |        SGA        0.958  0.998  0.956   0.067
D3    MEC        0.987  0.915  0.896   0.340  |  D4    MEC        0.996  0.939  0.928   0.281
      Coral      0.973  0.762  0.670   0.510  |        Coral      0.971  0.783  0.702   0.467
      Racer      0.880  0.466  -0.130  1.800  |        Racer      0.883  0.487  -0.017  1.610
      BLESS      0.881  0.892  0.774   0.350  |        BLESS      0.909  0.897  0.795   0.328
      BFC        0.883  0.907  0.792   0.340  |        BFC        0.891  0.889  0.819   0.316
      SGA        0.854  0.957  0.815   0.290  |        SGA        0.846  0.965  0.807   0.297

pber* is pber × 10^-4; gain = (TP - FP)/(TP + FN); reca = TP/(TP + FN); prec = TP/(TP + FP).
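The three count-based metrics can be computed directly from their definitions, as in the sketch below. The slide does not define TP, FP, and FN, so the usual convention is assumed here: TP counts erroneous bases that were corrected properly, FP counts correct bases that were changed wrongly, and FN counts erroneous bases left uncorrected. The function name `correction_metrics` and the example counts are hypothetical.

```python
def correction_metrics(tp, fp, fn):
    """Return (reca, prec, gain) from the counts of true positives,
    false positives, and false negatives."""
    reca = tp / (tp + fn)
    prec = tp / (tp + fp)
    gain = (tp - fp) / (tp + fn)
    return reca, prec, gain

# A corrector that fixes 90 of 100 errors and introduces 5 new ones.
reca, prec, gain = correction_metrics(tp=90, fp=5, fn=10)

# A corrector that introduces more errors than it removes has a negative gain.
_, _, bad_gain = correction_metrics(tp=10, fp=20, fn=10)
```

The second call shows how gain can drop below zero, as it does for some correctors in the table.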

SLIDE 18

Impact of coverage on error correction

[Figure: gain versus coverage (10 to 30) for BFC, BLESS, Coral, MEC, Racer, and SGA.]

The experiments were conducted on R3. MEC is less sensitive to changes in coverage than the other correctors.

SLIDE 19

High impact of k on k-spectrum-based approach

[Figure: gain (axis from -0.2 to 1.0) of BLESS, BFC, and SGA for k-mer sizes 15, 17, 19, 21, and 23.]

The experiments were carried out on R3. The choice of k has a high impact on existing k-spectrum-based approaches.

SLIDE 20

Low impact of k on MEC

[Figure: gain of MEC (axis from 0.7 to 1.0) versus k-mer size (14 to 24), with one curve per data set: D1, D2, D3, D4, R1, R2, R3, and R4.]

The experiments were carried out on R3. The choice of k has a low impact on MEC.

SLIDE 21

Running time comparison

Environment of experiments:

  • CPU: 2 six-core Intel Xeon X5690, 3.47 GHz
  • RAM: 96 GB

[Figure: running time in hours (axis from 2 to 10) of MEC, Coral, Racer, BLESS, BFC, and SGA on R1 to R4 and D1 to D4.]

SLIDE 22

RAM usage comparison

Environment of experiments:

  • CPU: 2 six-core Intel Xeon X5690, 3.47 GHz
  • RAM: 96 GB

[Figure: RAM usage in GB (axis from 20 to 100) of MEC, Coral, Racer, BLESS, BFC, and SGA on R1 to R4 and D1 to D4.]

SLIDE 23

Concluding remarks

MEC is an accurate approach for correcting NGS substitution errors. It has the following advantages:

  • Markedly better accuracy
    ◮ Completeness of coverage
  • Tolerant of various sizes of k
    ◮ k-mers are used only to group reads, not to correct errors
  • Easy to deploy on cloud computing platforms
    ◮ Both identifying prospective erroneous bases and correcting errors can be carried out in parallel.

Future directions:

  • Improve the time and space complexity.

[1, 2, 3, 4, 5, 6, 7]

SLIDE 24

References

[1] A. S. Alic, D. Ruzafa, J. Dopazo, and I. Blanquer, “Objective review of de novo stand-alone error correction methods for NGS data,” WIREs Comput Mol Sci, vol. 6, pp. 111–146, 2016.

[2] S. Goodwin, J. Gurtowski, S. Ethe-Sayers, P. Deshpande, M. C. Schatz, and W. R. McCombie, “Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome,” Genome Research, 2015.

[3] T. Hackl, R. Hedrich, J. Schultz, and F. Förster, “proovread: large-scale high-accuracy PacBio correction through iterative short read consensus,” Bioinformatics, vol. 30, no. 21, pp. 3004–3011, 2014.

[4] Y. Heo, X.-L. Wu, D. Chen, J. Ma, and W.-M. Hwu, “BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads,” Bioinformatics, vol. 30, no. 10, pp. 1354–1362, 2014.

[5] S. L. Salzberg, A. M. Phillippy, A. Zimin, D. Puiu, T. Magoc, S. Koren, T. J. Treangen, M. C. Schatz, A. L. Delcher, M. Roberts, G. Marcais, M. Pop, and J. A. Yorke, “GAGE: A critical evaluation of genome assemblies and assembly algorithms,” Genome Research, vol. 22, no. 3, pp. 557–567, 2011.

[6] L. Salmela and J. Schröder, “Correcting errors in short reads by multiple alignments,” Bioinformatics, vol. 27, no. 11, pp. 1455–1461, 2011.

[7] The 1000 Genomes Project Consortium, “A map of human genome variation from population-scale sequencing,” Nature, vol. 467, pp. 1061–1073, 2010.

SLIDE 25

Acknowledgment

  • Prof. Limsoon Wong, NUS
  • Prof. Jinyan Li, UTS

  • National Science Foundation of China (No. 31501070)
  • Scientific Research Foundation of GXU (No. XGZ150316)

SLIDE 26

Good scalability of MEC

[Figure: running time of MEC in hours versus the number of nodes.]

SLIDE 27

Implementation of coverage completion

[Figure: coverage completion implemented with self-join and join operations, panels (A), (B), and (C).]
