Parallel and Memory-efficient Preprocessing for Metagenome Assembly


SLIDE 1

Parallel and Memory-efficient Preprocessing for Metagenome Assembly

Vasudevan Rengasamy, Paul Medvedev, Kamesh Madduri

School of EECS, The Pennsylvania State University
{vxr162, pashadag, madduri}@cse.psu.edu
HiCOMB 2017

SLIDE 2

Talk Outline

◮ Motivation for our work
◮ METAPREP: a new metagenome pre-processing strategy
◮ METAPREP evaluation
◮ Parallel Scaling
◮ Comparison to prior work
◮ Impact on Metagenome assembly
◮ Conclusions and Future work

SLIDE 3

Metagenome assembly

What is metagenome assembly?

◮ Metagenome: Mixed genomes present in an environment sample (soil, human gut, etc.).
◮ Assembly: Reconstructing genome sequences from reads.

SLIDE 4

Metagenome assembly

Why is metagenome assembly challenging?

1. Uneven coverage of genomes.
2. Repeated sequences across genomes.
3. Variable sizes of genomes.
4. Large dataset sizes (as the output from multiple sequencing runs may be merged).

Metagenome assembly tools (MEGAHIT, MetaVelvet, metaSPAdes, etc.) attempt to overcome these challenges.

SLIDE 5

MEGAHIT [Li2016] metagenome assembler

◮ State-of-the-art metagenome assembler.
◮ Uses a highly compressed de Bruijn graph representation.
◮ Refines assembly quality by using multiple k-mer lengths.
◮ Supports single-node shared-memory parallelism (both CPUs and GPUs).
◮ Takes 47 minutes to assemble a metagenome dataset containing 4.26 Gbp.

SLIDE 6

A preprocessing strategy for Metagenome assembly

◮ Introduced by Howe et al. [Howe2014].
◮ After filtering low-frequency k-mers, partition the de Bruijn graph into weakly connected components (WCCs).
◮ Assemble each large component independently.

SLIDE 7

Recent work on metagenome partitioning [Flick2015]

◮ Construct an undirected read graph instead of a de Bruijn graph.
◮ Find connected components in the read graph using a distributed-memory parallel approach based on the Shiloach-Vishkin algorithm.
◮ Read graph components correspond to de Bruijn graph WCCs.

[Figure: example reads R0: TAACGACC, R1: AACGACCT, R2: ACTCAAAT, R3: CTCAACGA.]
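The read graph idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the distributed algorithm of [Flick2015]: two reads are joined whenever they share a k-mer, only the forward strand is considered, and the function names `kmers` and `read_graph_components` are hypothetical. The four reads are the ones shown on this slide, in the order they appear.

```python
def kmers(read, k):
    """Return every k-mer of a read (forward strand only, an assumption)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def read_graph_components(reads, k):
    """Group read indices into connected components of the read graph."""
    parent = list(range(len(reads)))

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # k-mer -> first read that contained it
    for rid, read in enumerate(reads):
        for kmer in kmers(read, k):
            if kmer in first_owner:
                a, b = find(rid), find(first_owner[kmer])
                if a != b:
                    parent[max(a, b)] = min(a, b)
            else:
                first_owner[kmer] = rid

    groups = {}
    for rid in range(len(reads)):
        groups.setdefault(find(rid), []).append(rid)
    return sorted(groups.values())

reads = ["TAACGACC", "AACGACCT", "ACTCAAAT", "CTCAACGA"]
print(read_graph_components(reads, 5))  # [[0, 1, 2, 3]]
print(read_graph_components(reads, 6))  # [[0, 1], [2], [3]]
```

With k = 5 all four reads fall into one component, while k = 6 leaves only the first two reads connected, which previews the effect of k on component size discussed later in the talk.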

SLIDE 8

Our contributions

◮ Novel multi-stage algorithm to find connected components from read graphs.
◮ End-to-end hybrid parallelism using MPI and OpenMP.
◮ Memory-aware implementation.
◮ Evaluation of the impact of preprocessing on metagenome assembly.

SLIDE 9

METAPREP

◮ New metagenome preprocessing tool.
◮ Main memory use is parameterized.
  ◮ Multipass approach: only enumerate a subset of k-mers in each pass.
  ◮ e.g., 10 passes ⇒ 10× memory reduction.
◮ log(P) inter-node communication steps.
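A minimal sketch of the multipass idea follows. It assumes a deliberately simple, hypothetical rule for assigning k-mers to passes (the 2-bit encoded k-mer value modulo the number of passes); the slides do not state the rule METAPREP actually uses, and the names `encode` and `enumerate_pass` are illustrative only.

```python
def encode(kmer):
    """Pack a DNA k-mer into an integer, 2 bits per base."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for base in kmer:
        value = (value << 2) | code[base]
    return value

def enumerate_pass(reads, k, num_passes, pass_id):
    """Return the (k-mer, read_id) tuples that belong to one pass."""
    tuples = []
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Hypothetical assignment rule: k-mer value modulo the pass count.
            if encode(kmer) % num_passes == pass_id:
                tuples.append((kmer, rid))
    return tuples

reads = ["ACTAGGCTGTAA", "CTGTAACTAGGA"]
k, num_passes = 5, 4
per_pass = [enumerate_pass(reads, k, num_passes, s) for s in range(num_passes)]
total = sum(len(t) for t in per_pass)
print("tuples per pass:", [len(t) for t in per_pass])
print("total tuples:", total)
```

Every tuple lands in exactly one pass, so only the tuples of the current pass need to be held in memory at once. The price is that the reads are scanned once per pass.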

SLIDE 11

METAPREP overview

Input: FASTQ files → Construct Read Graph → Find Connected Components → Output: FASTQ files

SLIDE 12

METAPREP overview

     METAPREP step   Function
     IndexCreate     Create index files for parallel runs.
1    KmerGen         Enumerate ⟨k-mer, read_i⟩ tuples.
2    KmerGen-Comm    Transfer ⟨k-mer, read_i⟩ tuples to owner tasks.
3    LocalSort       Sort tuples by k-mers.
4    LocalCC         Identify connected components (CCs).
5    MergeCC         Merge components across tasks, create output FASTQ files with reads from largest CC and other CCs.

[Figure: METAPREP pipeline from input FASTQ files to output FASTQ files. Boxes: IndexCreate, KmerHist, FASTQPart, KmerGen, KmerGen-Comm, LocalSort, LocalCC, MergeCC; part of the pipeline is marked "Multiple Passes".]

SLIDE 13

A simple strategy for static work partitioning

◮ Precompute an m-mer histogram (m ≪ k; defaults are k = 27, m = 10).
◮ Used to partition k-mers across MPI tasks and threads in a load-balanced manner.

Example (k = 5, m = 2):
Reads: R1: ACTAGG, R2: CTGTAA
k-mers: ACTAG, CTAGG (from R1); CTGTA, TGTAA (from R2)
m-mer histogram: AC - 1, CT - 2, TG - 1
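The example above can be reproduced with a short sketch. The histogram counts each k-mer under its leading m-mer, as in the slide; the greedy splitting rule in `partition_bins` is an assumption made for illustration, since the slides do not spell out how the bins are assigned to tasks and threads.

```python
from collections import Counter

def mmer_histogram(reads, k, m):
    """Count k-mers by their leading m-mer."""
    hist = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            hist[read[i:i + m]] += 1
    return hist

def partition_bins(hist, parts):
    """Split the sorted m-mer bins into contiguous ranges of similar weight."""
    total = sum(hist.values())
    ranges, current, seen = [], [], 0
    for b in sorted(hist):
        current.append(b)
        seen += hist[b]
        # Close a range once the running count reaches its proportional share.
        if seen * parts >= total * (len(ranges) + 1) and len(ranges) < parts - 1:
            ranges.append(current)
            current = []
    ranges.append(current)
    return ranges

hist = mmer_histogram(["ACTAGG", "CTGTAA"], k=5, m=2)
print(dict(hist))               # {'AC': 1, 'CT': 2, 'TG': 1}
print(partition_bins(hist, 2))  # [['AC', 'CT'], ['TG']]
```

Because the histogram is computed once up front, the same static partition can be reused in every pass without any runtime load balancing.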

SLIDE 14

Notation

Notation   Description
M          Total number of k-mers enumerated
R          Paired-end read count
S          Number of I/O passes
P          Number of MPI tasks
T          Number of threads per task

SLIDE 15

k-mer Enumeration

◮ Generate ⟨k-mer, read_id⟩ tuples.
◮ Multiple threads write to a single array without synchronization. Offsets are precomputed.
◮ Output: a buffer on each MPI task.
  ◮ k-mers are partially sorted.
◮ Time: $O\left(\frac{MS}{PT}\right)$, Space $\approx \frac{24M}{SP}$ bytes.

[Figure: send buffer at MPI task i, divided into regions destined for MPI tasks 1 through P; within each region, threads 1 through T write at their precomputed offsets.]
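The synchronization-free write can be sketched as a count pass, a prefix sum, and a fill pass. The sketch below simulates the threads sequentially and makes two assumptions that are not taken from the slides: reads are dealt to threads in a strided fashion, and the owner of a k-mer is chosen by its first base (`owner_task` is a hypothetical stand-in for the m-mer range lookup).

```python
def owner_task(kmer, num_tasks):
    """Hypothetical owner rule: bucket by first base."""
    return "ACGT".index(kmer[0]) % num_tasks

def fill_send_buffer(reads, k, num_tasks, num_threads):
    # Each simulated thread handles a strided subset of the reads.
    chunks = [list(range(t, len(reads), num_threads)) for t in range(num_threads)]

    def tuples_of(rid):
        read = reads[rid]
        return [(read[i:i + k], rid) for i in range(len(read) - k + 1)]

    # Phase 1: count tuples per (destination task, thread).
    counts = [[0] * num_threads for _ in range(num_tasks)]
    for t, chunk in enumerate(chunks):
        for rid in chunk:
            for kmer, _ in tuples_of(rid):
                counts[owner_task(kmer, num_tasks)][t] += 1

    # Exclusive prefix sum in (task, thread) order gives each writer its start slot.
    offsets, running = [[0] * num_threads for _ in range(num_tasks)], 0
    for dest in range(num_tasks):
        for t in range(num_threads):
            offsets[dest][t] = running
            running += counts[dest][t]

    # Phase 2: every thread writes only inside its own slots, so no locks are needed.
    buffer = [None] * running
    for t, chunk in enumerate(chunks):
        cursor = [offsets[dest][t] for dest in range(num_tasks)]
        for rid in chunk:
            for kmer, r in tuples_of(rid):
                dest = owner_task(kmer, num_tasks)
                buffer[cursor[dest]] = (kmer, r)
                cursor[dest] += 1
    return buffer, offsets

buffer, offsets = fill_send_buffer(["ACTAGG", "CTGTAA", "GGATCC"],
                                   k=5, num_tasks=2, num_threads=2)
print(offsets)
print(buffer)
```

The resulting buffer is grouped by destination task, which is one way to read the "partially sorted" remark on the slide.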

SLIDE 16

Sort by k-mer

◮ Sort tuples by k-mer value to identify reads with a common k-mer and create read graph edges.
◮ Radix sort implementation.
  ◮ Reuse send buffer ⇒ no additional memory.
  ◮ Partition tuples into T disjoint ranges.
  ◮ Sort ranges in parallel using T threads.
◮ Time: $O\left(\frac{M}{PT}\right)$, Space $\approx \frac{24M}{SP}$ bytes.
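As a sequential illustration of this step, the sketch below sorts integer-encoded tuples with a least-significant-digit radix sort and then emits one edge between consecutive reads that carry the same k-mer. Linking neighbours in the sorted order is an assumption chosen for brevity; it yields the same connected components as linking every pair. The function names and the toy keys are hypothetical, and unlike the in-place sort described on the slide this version allocates temporary buckets.

```python
def radix_sort(tuples, key_bits, digit_bits=8):
    """LSD radix sort of (key, read_id) tuples by integer key."""
    mask = (1 << digit_bits) - 1
    for shift in range(0, key_bits, digit_bits):
        buckets = [[] for _ in range(1 << digit_bits)]
        for item in tuples:
            buckets[(item[0] >> shift) & mask].append(item)
        # Concatenating buckets in order keeps the sort stable.
        tuples = [item for bucket in buckets for item in bucket]
    return tuples

def read_graph_edges(sorted_tuples):
    """Link each read to the previous read that carries the same k-mer."""
    edges = []
    for (prev_key, prev_rid), (key, rid) in zip(sorted_tuples, sorted_tuples[1:]):
        if key == prev_key and rid != prev_rid:
            edges.append((prev_rid, rid))
    return edges

tuples = [(0x2F, 3), (0x10, 0), (0x2F, 1), (0x10, 2), (0x07, 1)]
ordered = radix_sort(tuples, key_bits=8, digit_bits=4)
edges = read_graph_edges(ordered)
print(ordered)  # [(7, 1), (16, 0), (16, 2), (47, 3), (47, 1)]
print(edges)    # [(0, 2), (3, 1)]
```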

SLIDE 17

Identify connected components

◮ Find connected components using edges from local k-mers.
◮ Union-by-index and path splitting.

[Figure: a forest over elements 10, 8, 2, 6, 20, 5, shown before and after Union(6, 5) with union-by-index, and before and after Find(6) with path splitting.]

SLIDE 18

Identify connected components

◮ Find connected components using edges from local k-mers.
◮ Union-by-index and path splitting.
◮ No critical sections.
  ◮ Store edges that merge components (similar to [Patwary2012]).
  ◮ Process edges again in case of lost updates.
◮ Time: $O\left(\frac{M}{PT}\log^{*}R\right)$, Space $\approx \frac{12M}{SP} + 4R$ bytes.
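A sequential sketch of a disjoint-set structure with union-by-index and path splitting is shown below. It leaves out the lock-free multithreaded part (re-processing stored edges after lost updates) and assumes the convention that the root with the higher index becomes the parent; the slides do not say which direction METAPREP uses. The class and function names are illustrative.

```python
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        """Return the root of x, re-pointing each visited node to its grandparent."""
        while self.parent[x] != x:
            nxt = self.parent[x]
            self.parent[x] = self.parent[nxt]  # path splitting
            x = nxt
        return x

    def union(self, a, b):
        """Merge the sets of a and b; return True if they were separate."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        lo, hi = min(ra, rb), max(ra, rb)
        self.parent[lo] = hi  # union-by-index: higher index becomes the root (assumed)
        return True

def connected_components(num_reads, edges):
    ds = DisjointSet(num_reads)
    # Keep only the edges that actually merged two components.
    merging_edges = [e for e in edges if ds.union(*e)]
    labels = [ds.find(r) for r in range(num_reads)]
    return labels, merging_edges

labels, kept = connected_components(6, [(0, 1), (1, 2), (0, 2), (4, 5)])
print(labels)  # [2, 2, 2, 3, 5, 5]
print(kept)    # [(0, 1), (1, 2), (4, 5)]
```

The list of merging edges forms a spanning forest. In a multithreaded setting it is the small set of edges that would need to be replayed to repair lost updates.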

SLIDE 19

Merge components

◮ Merge the per-task component forests in log P iterations.
◮ Time: $O(R \log P \log^{*} R)$, Space $\approx 8R$ bytes.

[Figure: component forests over reads R1 to R4, in three stages labeled 0, 1, 2: the initial forests on tasks P0 to P3, the merged forests on P0 and P2, and the final forest on P0.]
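The pairwise merge pattern in the figure can be sketched as a tree reduction. Each forest is a parent array over all reads; in every round the receiving task folds in the forest of its partner, and task 0 ends up with the global result. This runs in a single process and stands in for the MPI exchange, so it is an illustration of the communication pattern only. The helper names, the zero-based read numbering, and the higher-index-wins union rule are assumptions.

```python
def find(parent, x):
    """Root lookup that shortens the path as it goes."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def merge_into(local, remote):
    """Fold a remote component forest into the local one, in place."""
    for read, remote_parent in enumerate(remote):
        a, b = find(local, read), find(local, remote_parent)
        if a != b:
            local[min(a, b)] = max(a, b)

def merge_all(forests):
    """Pairwise tree reduction; returns task 0's forest and the round count."""
    forests = [list(f) for f in forests]
    stride, rounds = 1, 0
    while stride < len(forests):
        for receiver in range(0, len(forests) - stride, 2 * stride):
            merge_into(forests[receiver], forests[receiver + stride])
        stride *= 2
        rounds += 1
    return forests[0], rounds

# Four tasks, four reads: each task only knows about its local edges.
forests = [
    [1, 1, 2, 3],  # task 0 saw edge (0, 1)
    [0, 1, 3, 3],  # task 1 saw edge (2, 3)
    [0, 2, 2, 3],  # task 2 saw edge (1, 2)
    [0, 1, 2, 3],  # task 3 saw no edges
]
final, rounds = merge_all(forests)
print(rounds)                                     # 2
print(len({find(final, r) for r in range(4)}))    # 1
```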

SLIDE 21

Experiments and Results

Description of datasets

ID   Dataset                      Read count R (×10^6)   Size (Gbp)   Source
HG   Human gut                    12.7                   2.29         NCBI (SRR341725)
LL   Lake Lanier                  21.3                   4.26         NCBI (SRR947737)
MM   Mock microbial community     54.8                   11.07        NCBI (SRX200676)
IS   Iowa, Continuous corn soil   1132.8                 223.26       JGI (402461)

Machine configuration

◮ Edison supercomputer at NERSC.
  ◮ Each node has 2× 12-core Ivy Bridge processors and 64 GB memory.

SLIDE 23

Single node scaling for Human Gut (HG) Dataset

[Figure: time (seconds) versus threads (1, 2, 4, 8, 12, 24), broken down by step (KmerGen-I/O, KmerGen, LocalSort, LocalCC-Opt, CC-I/O), with relative speedup on a second axis.]

SLIDE 24

Multi-node scaling for Human Gut (HG) Dataset

[Figure: time (seconds) versus nodes (1, 2, 4, 8, 16), broken down by step (KmerGen-I/O, KmerGen, KmerGen-Comm, LocalSort, LocalCC-Opt, Merge-Comm, MergeCC, CC-I/O), with speedup overlaid.]

SLIDE 25

Multi-node scaling for LL and MM datasets

[Figure: two panels, LL (S=2) and MM (S=4), each showing time (seconds) versus nodes (1, 2, 4, 8, 16), broken down by step (KmerGen-I/O, KmerGen, KmerGen-Comm, LocalSort, LocalCC-Opt, Merge-Comm, MergeCC, CC-I/O), with speedup overlaid.]

SLIDE 26

Multi-node scaling for Iowa Continuous Soil dataset

[Figure: time (seconds) for 16-node and 64-node runs, broken down by step (KmerGen-I/O, KmerGen, KmerGen-Comm, LocalSort, LocalCC-Opt, Merge-Comm, MergeCC, CC-I/O); the bars are annotated 1X and 3.25X.]

For the 16-node run, S = 8. For the 64-node run, S = 2.
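As a rough reading of the annotations, and assuming the 3.25X label marks the 64-node run relative to the 16-node baseline (1X), the parallel efficiency of going from 16 to 64 nodes works out as below. The symbol $E$ is introduced here only for illustration, and the two runs also differ in the number of passes, so this is not a like-for-like comparison.

```latex
E = \frac{\text{speedup}}{\text{node ratio}} = \frac{3.25}{64/16} = \frac{3.25}{4} \approx 0.81
```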

SLIDE 28

KmerGen performance comparison with KMC-2 k-mer counter [Deorowicz2015]

[Figure: two panels of time (seconds) for the HG, LL, and MM datasets. Left panel: MetaPrep versus KMC-2, annotated 1.56X, 1.76X, 1.57X. Right panel: KMC-2 versus MetaPrep16, annotated 2.72X, 3.18X, 6.76X.]

◮ MetaPrep16: METAPREP run using 16 nodes.

SLIDE 29

Comparison with read graph connectivity [Flick2015]

Table 1: Execution time comparison with Metagenome partitioning work (AP_LB) using 16 nodes.

Dataset   METAPREP time (s)   AP_LB time (s)   METAPREP speedup
HG        5.5                 23.6             4.22×
LL        11.5                25.9             2.25×
MM        19.6                56.1             2.86×

◮ 21 iterations for AP_LB vs 4 for METAPREP for MM dataset.

SLIDE 31

Largest component size

◮ Largest component size can be reduced by using filters.
  1. k-mer size (k): longer k-mers occur in fewer components.
  2. k-mer frequency (KF): filter erroneous (low-frequency) and repeat (high-frequency) k-mers.
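The frequency filter can be sketched as a count followed by a range test. The half-open interval (lower bound inclusive, upper bound exclusive) mirrors the "10 ≤ KF < 30" notation used in this talk; the function name, the toy reads, and the tiny thresholds are made up for illustration.

```python
from collections import Counter

def filter_kmers(reads, k, min_freq=None, max_freq=None):
    """Keep k-mers whose frequency f satisfies min_freq <= f < max_freq."""
    counts = Counter(read[i:i + k]
                     for read in reads
                     for i in range(len(read) - k + 1))
    return {kmer for kmer, f in counts.items()
            if (min_freq is None or f >= min_freq)
            and (max_freq is None or f < max_freq)}

reads = ["ACGTAC", "ACGTTT", "ACGAAA"]
print(filter_kmers(reads, 3, min_freq=2, max_freq=3))       # {'CGT'}
print("ACG" in filter_kmers(reads, 3, max_freq=3))          # False
```

Only k-mers that survive the filter would contribute read graph edges, so dropping very frequent k-mers such as ACG in this toy input removes the links most likely to glue unrelated reads together.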

SLIDE 32

Largest component size

[Figure: largest component size (%) for the HG dataset under the filters None, KF < 30, and 10 ≤ KF < 30, for k=27 and k=63.]

SLIDE 33

MEGAHIT single-node execution time for MM dataset

[Figure: time (seconds) for the inputs Full, LC (NoFilter), and LC (KF<30), split into Megahit and MetaPrep time; one bar is annotated 1.36X.]

SLIDE 34

MEGAHIT assembly quality

Table 2: Assembly Quality Comparison - MM dataset.

Type         Contigs   Total (Mbp)   N50 (bp)
No Preproc   24 931    203.65        50 607
No Filter    25 002    203.65        50 550
KF < 30      40 632    208.24        23 126
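For readers unfamiliar with the N50 column, the sketch below computes it under the usual definition: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly size. The contig lengths in the example are invented and unrelated to the table.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    return 0

print(n50([80, 70, 50, 40, 30, 20]))  # 70
```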

SLIDE 36

Conclusions

1. Developed a new memory-efficient parallel workflow for partitioning a metagenome dataset into connected components.
2. Up to 4.22× speedup over the AP_LB approach of [Flick2015].
3. We can process a metagenome dataset with 1.13 billion reads (Iowa continuous corn soil) in 14 minutes using 16 nodes of Edison.
4. Preprocessing time (METAPREP) ≪ Assembly time.

SLIDE 37

Future Work

1. For most datasets, we observe creation of a single large connected component after partitioning the read graph.
  ◮ Splitting components using filters impacts assembly quality.
  ◮ Does scaffolding help in improving assembly quality?
2. Reduce data exchanged in the inter-node communication step of connected components.

SLIDE 38

Acknowledgment

This research is supported in part by NSF award #1439057. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

SLIDE 39

References I

[Deorowicz2015] Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, and Agnieszka Debudaj-Grabysz. KMC 2: Fast and resource-frugal k-mer counting. Bioinformatics, 31(10):1569–1576, 2015.

[Flick2015] Patrick Flick, Chirag Jain, Tony Pan, and Srinivas Aluru. A parallel connectivity algorithm for de Bruijn graphs in metagenomic applications. In Proc. Int’l. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2015.

[Howe2014] Adina Chuang Howe, Janet K. Jansson, Stephanie A. Malfatti, Susannah G. Tringe, James M. Tiedje, and C. Titus Brown. Tackling soil diversity with the assembly of large, complex metagenomes. Proceedings of the National Academy of Sciences, 111(13):4904–4909, 2014.

SLIDE 40

References II

[Li2016] Dinghua Li, Ruibang Luo, Chi-Man Liu, Chi-Ming Leung, Hing-Fung Ting, Kunihiko Sadakane, Hiroshi Yamashita, and Tak-Wah Lam. MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods, 102:3–11, 2016.

[Patwary2012] Md Mostofa Ali Patwary, Peder Refsnes, and Fredrik Manne. Multi-core spanning forest algorithms using the disjoint-set data structure. In Proc. IEEE Int’l. Parallel & Distributed Processing Symposium (IPDPS), 2012.

SLIDE 41

Thank You

SLIDE 42

Load Balance among 16 MPI tasks - MM dataset

[Figure: time (seconds) per preprocessing step (KmerGen-I/O, KmerGen, KmerGen-Comm, LocalSort, LocalCC-Opt, MergeCC-Comm, MergeCC, CC-I/O) across the 16 MPI tasks.]

SLIDE 43

Multi-pass Execution - MM dataset

[Figure: time (seconds) versus number of passes (1, 2, 4, 8), broken down by step (KmerGen, KmerGen-Comm, LocalSort, LocalCC-Opt, MergeCC, CC-I/O), with memory per node (GB) on a second axis.]

SLIDE 44

Table 3: Index creation time (sequential).

Dataset   # Chunks   FASTQPart time (s)   merHist time (s)
HG        384        32                   109
LL        384        32                   154
MM        384        33                   343
IS        1536       180                  5160

SLIDE 45

Table 4: Impact of k on single-node METAPREP execution time (MM dataset).

Time (seconds):

k    KmerGen   LocalSort   LocalCC-Opt   CC-I/O   Total
27   77.02     55.33       6.41          5.40     144.16
63   59.73     67.60       5.16          5.35     137.84

SLIDE 46

Table 5: Assembly Quality Comparison.

MEGAHIT assembly output statistics:

Dataset   Type         Contigs    Total (Mbp)   Max (bp)    N50 (bp)
HG        No Preproc   63 519     116.19        217 183     5071
          No Filter    63 483     116.18        217 183     5098
            LC         58 770     113.83        217 183     5510
            Other      4713       2.35          2860        513
          KF < 30      64 571     119.01        217 183     5123
            LC         56 732     110.13        217 183     5687
            Other      7839       8.87          43 863      2271
LL        No Preproc   179 828    165.63        225 770     1273
          No Filter    181 751    166.67        225 805     1263
            LC         141 136    148.75        225 805     1593
            Other      40 615     17.9          4028        432
          KF < 30      182 717    168.42        225 770     1275
            LC         140 081    147.51        225 770     1587
            Other      42 636     20.90         43 718      465
MM        No Preproc   24 931     203.65        1 067 762   50 607
          No Filter    25 002     203.65        1 067 762   50 550
            LC         23 959     202.99        1 067 762   50 781
            Other      1043       0.66          5788        695
          KF < 30      40 632     208.24        611 608     23 126
            LC         26 233     156.04        611 608     28 135
            Other      14 399     52.19         591 560     12 285
