Parallel and Memory-efficient Preprocessing for Metagenome Assembly
Vasudevan Rengasamy, Paul Medvedev, Kamesh Madduri
School of EECS, The Pennsylvania State University
{vxr162, pashadag, madduri}@cse.psu.edu
HiCOMB 2017

Talk Outline
◮ Motivation for our work
◮ MetaPrep: a new metagenome preprocessing strategy
◮ MetaPrep evaluation
  ◮ Parallel scaling
  ◮ Comparison to prior work
  ◮ Impact on metagenome assembly
◮ Conclusions and future work

Metagenome assembly
What is metagenome assembly?
◮ Metagenome: Mixed genomes present in an environmental sample (soil, human gut, etc.).
◮ Assembly: Reconstructing genome sequences from reads.
Why is metagenome assembly challenging?
1. Uneven coverage of genomes.
2. Repeated sequences across genomes.
3. Variable sizes of genomes.
4. Large dataset sizes (as the output from multiple sequencing runs may be merged).
Metagenome assembly tools (MEGAHIT, MetaVelvet, metaSPAdes, etc.) attempt to overcome these challenges.

MEGAHIT [Li2016] metagenome assembler
◮ State-of-the-art metagenome assembler.
◮ Uses a highly compressed de Bruijn graph representation.
◮ Refines assembly quality by using multiple k-mer lengths.
◮ Supports single-node shared-memory parallelism (both CPUs and GPUs).
◮ Takes 47 minutes to assemble a metagenome dataset containing 4.26 Gbp.

A preprocessing strategy for metagenome assembly
◮ Introduced by Howe et al. [Howe2014].
◮ After filtering low-frequency k-mers, partition the de Bruijn graph into weakly connected components (WCCs).
◮ Assemble each large component independently.

Recent work on metagenome partitioning [Flick2015]
◮ Construct an undirected read graph instead of a de Bruijn graph.
◮ Find connected components in the read graph using a distributed-memory parallel approach based on the Shiloach-Vishkin algorithm.
◮ Read graph components correspond to de Bruijn graph WCCs.
[Figure: example read graph over four reads. R0: TAACGACC, R1: AACGACCT, R2: ACTCAAAT, R3: CTCAACGA]
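
A minimal sketch of the read graph idea, using the four reads from the figure: two reads are joined by an edge when they share at least one k-mer. The value k = 5, the assignment of the labels R1 and R3 to their sequences, and the function names are assumptions made for illustration. The sketch also ignores reverse complements and is not the distributed algorithm of [Flick2015].

```python
from collections import defaultdict
from itertools import combinations

def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def read_graph_edges(reads, k):
    """Return the set of undirected edges between reads that share a k-mer."""
    owners = defaultdict(set)              # k-mer -> ids of reads containing it
    for rid, seq in reads.items():
        for km in kmers(seq, k):
            owners[km].add(rid)
    edges = set()
    for rids in owners.values():
        edges.update(combinations(sorted(rids), 2))
    return edges

# Reads taken from the figure; k = 5 is an assumed value.
reads = {"R0": "TAACGACC", "R1": "AACGACCT", "R2": "ACTCAAAT", "R3": "CTCAACGA"}
edges = read_graph_edges(reads, k=5)
```

With these inputs all four reads end up in one component: R0, R1, and R3 share AACGA, and R2 and R3 share CTCAA.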

Our contributions
◮ Novel multi-stage algorithm to find connected components in read graphs.
◮ End-to-end hybrid parallelism using MPI and OpenMP.
◮ Memory-aware implementation.
◮ Evaluation of the impact of preprocessing on metagenome assembly.

MetaPrep
◮ New Metagenome Preprocessing tool.
◮ Main memory use is parameterized.
◮ Multipass approach: Only enumerate a subset of k-mers in each pass.
  ◮ e.g., 10 passes ⇒ 10× memory reduction.
◮ log(P) inter-node communication steps.
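
The multipass idea can be sketched as follows: every pass rescans the reads but keeps only the tuples whose k-mer belongs to that pass, so the largest batch held at once shrinks as the number of passes grows. The checksum-based rule in `pass_of` is a hypothetical stand-in; the slides do not say how k-mers are assigned to passes, and all function names here are illustrative.

```python
import zlib

def kmer_tuples(reads, k):
    """Yield (k-mer, read_id) for every k-mer of every read."""
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k], rid

def pass_of(kmer, num_passes):
    """Hypothetical assignment rule: a checksum of the k-mer picks its pass."""
    return zlib.crc32(kmer.encode()) % num_passes

def multipass(reads, k, num_passes):
    """Return (total tuples processed, largest batch held at once)."""
    total = peak = 0
    for p in range(num_passes):
        # Each pass rescans all reads but keeps only its own share of k-mers.
        batch = [t for t in kmer_tuples(reads, k)
                 if pass_of(t[0], num_passes) == p]
        total += len(batch)
        peak = max(peak, len(batch))
    return total, peak
```

The trade-off is visible in the loop: memory per pass drops, while the input is read once per pass.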

MetaPrep overview
Input: FASTQ files → Construct read graph → Find connected components → Output: FASTQ files

MetaPrep overview
[Figure: pipeline from input FASTQ files to output FASTQ files. Box labels: KmerHist, FASTQPart, IndexCreate, KmerGen, KmerGen-Comm, LocalSort, LocalCC, MergeCC. A "Multiple Passes" bracket marks the steps that are repeated.]

     MetaPrep step   Function
     IndexCreate     Create index files for parallel runs.
  1  KmerGen         Enumerate ⟨k-mer, read_i⟩ tuples.
  2  KmerGen-Comm    Transfer ⟨k-mer, read_i⟩ tuples to owner tasks.
  3  LocalSort       Sort tuples by k-mers.
  4  LocalCC         Identify connected components (CCs).
  5  MergeCC         Merge components across tasks; create output FASTQ files with reads from the largest CC and other CCs.

A simple strategy for static work partitioning
◮ Precompute an m-mer histogram (m ≪ k; defaults are k = 27, m = 10).
◮ Used to partition k-mers across MPI tasks and threads in a load-balanced manner.
Example (here k = 5, m = 2):
  Reads         k-mers
  R1: ACTAGG    ACTAG, CTAGG
  R2: CTGTAA    CTGTA, TGTAA
  m-mer histogram: AC - 1, CT - 2, TG - 1
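
The histogram in the example can be reproduced with a short sketch that counts k-mers by their leading m-mer. The `partition_bins` heuristic, which cuts the sorted m-mer bins into contiguous ranges of roughly equal k-mer count, is an assumption for illustration; the slides state only that the histogram drives load-balanced partitioning.

```python
from collections import Counter

def mmer_histogram(reads, k, m):
    """Count k-mers by their leading m-mer."""
    hist = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            hist[seq[i:i + m]] += 1
    return hist

def partition_bins(hist, parts):
    """Split sorted m-mer bins into `parts` contiguous ranges (hypothetical rule)."""
    total = sum(hist.values())
    ranges = [[] for _ in range(parts)]
    running = 0
    for mmer in sorted(hist):
        # Place the bin according to where its first k-mer falls in the total.
        idx = min(parts - 1, running * parts // total)
        ranges[idx].append(mmer)
        running += hist[mmer]
    return ranges

hist = mmer_histogram(["ACTAGG", "CTGTAA"], k=5, m=2)
ranges = partition_bins(hist, parts=2)
```

Because the histogram has only 4^m bins, it is cheap to compute up front and lets every task derive the same partition boundaries without communication during enumeration.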

Notation
  Notation   Description
  M          Total number of k-mers enumerated
  R          Paired-end read count
  S          Number of I/O passes
  P          Number of MPI tasks
  T          Number of threads per task

k-mer enumeration
◮ Generate ⟨k-mer, read_id⟩ tuples.
◮ Multiple threads write to a single array without synchronization; offsets are precomputed.
◮ Output: a buffer on each MPI task.
◮ k-mers are partially sorted.
◮ Time: O(MS/(PT)), Space ≈ 24M/(SP) bytes.
[Figure: send buffer at MPI task i, divided into segments destined for MPI tasks 1 to P, with per-thread offsets marked for threads 1 to T.]
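
One way to let several threads fill a single buffer without locks is to count first and then take an exclusive prefix sum, so that every (destination task, thread) pair owns a disjoint slice. The sketch below simulates the threads with ordinary loops; the destination-major layout, the `owner` callback, and the function names are assumptions for illustration rather than MetaPrep's actual code.

```python
from itertools import accumulate

def plan_offsets(counts):
    """counts[t][d] = number of tuples thread t sends to task d.
    Returns offsets[t][d], the start of thread t's slice for task d, with the
    buffer laid out destination-major (all tuples for task 0 come first)."""
    T, P = len(counts), len(counts[0])
    flat = [counts[t][d] for d in range(P) for t in range(T)]
    starts = [0] + list(accumulate(flat))[:-1]      # exclusive prefix sum
    return [[starts[d * T + t] for d in range(P)] for t in range(T)]

def fill_buffer(per_thread_tuples, owner, P):
    """per_thread_tuples[t] = list of (kmer, read_id) produced by thread t."""
    T = len(per_thread_tuples)
    counts = [[0] * P for _ in range(T)]
    for t, tuples in enumerate(per_thread_tuples):
        for kmer, _ in tuples:
            counts[t][owner(kmer)] += 1
    offsets = plan_offsets(counts)
    buf = [None] * sum(map(sum, counts))
    # Each outer iteration writes only to its own slices, so real threads
    # could run these loops concurrently without synchronization.
    for t, tuples in enumerate(per_thread_tuples):
        cursor = list(offsets[t])
        for tup in tuples:
            d = owner(tup[0])
            buf[cursor[d]] = tup
            cursor[d] += 1
    return buf
```

The resulting buffer is grouped by destination task, which is one sense in which the k-mers come out partially sorted.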

Sort by k-mer
◮ Sort tuples by k-mer value to identify reads with a common k-mer and create read graph edges.
◮ Radix sort implementation.
◮ Reuse send buffer ⇒ no additional memory.
◮ Partition tuples into T disjoint ranges.
◮ Sort ranges in parallel using T threads.
◮ Time: O(M/(PT)), Space ≈ 24M/(SP) bytes.
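
A small sequential sketch of this step: k-mers are packed into integers, the tuples are ordered with a least-significant-digit radix sort, and reads that sit next to each other with the same k-mer are linked. The 2-bit encoding, the bucket-based sort (which is not in place, unlike the buffer reuse described above), and the choice to link only consecutive reads are assumptions for illustration.

```python
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    """Pack a k-mer into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENC[base]
    return value

def radix_sort(tuples, key_bits, digit_bits=8):
    """Stable LSD radix sort of (key, read_id) tuples by integer key."""
    mask = (1 << digit_bits) - 1
    for shift in range(0, key_bits, digit_bits):
        buckets = [[] for _ in range(mask + 1)]
        for tup in tuples:
            buckets[(tup[0] >> shift) & mask].append(tup)
        tuples = [tup for bucket in buckets for tup in bucket]
    return tuples

def edges_from_sorted(tuples):
    """Link consecutive reads that share the same k-mer."""
    edges = []
    for (k1, r1), (k2, r2) in zip(tuples, tuples[1:]):
        if k1 == k2 and r1 != r2:
            edges.append((r1, r2))
    return edges
```

Chaining consecutive reads yields one fewer edge than the number of reads sharing a k-mer, which is enough to place them all in the same connected component.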

Identify connected components
◮ Find connected components using edges from local k-mers.
◮ Union-by-index and path splitting.
[Figure: two examples on a forest over nodes 2, 5, 6, 8, 10, and 20. Union-by-index: Union(6,5). Path splitting: Find(6).]

Identify connected components
◮ No critical sections.
◮ Store edges that merge components (similar to [Patwary2012]).
◮ Process edges again in case of lost updates.
◮ Time: O((M/(PT)) log* R), Space ≈ 12M/(SP) + 4R bytes.
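
A sequential sketch of a disjoint-set forest with union-by-index and path splitting, which also records the edges that actually merged two components. The direction of the union (lower-index root attached under the higher-index root) is an assumption inferred from the Union(6,5) figure, and the lock-free multithreaded variant with edge reprocessing is not shown.

```python
def find(parent, x):
    """Return the root of x; path splitting points each visited node
    at its grandparent along the way."""
    while parent[x] != x:
        nxt = parent[x]
        parent[x] = parent[nxt]
        x = nxt
    return x

def union(parent, a, b):
    """Union-by-index: attach the lower-index root under the higher-index
    root. Returns True if two components were merged."""
    ra, rb = find(parent, a), find(parent, b)
    if ra == rb:
        return False
    if ra < rb:
        parent[ra] = rb
    else:
        parent[rb] = ra
    return True

def local_cc(num_reads, edges):
    """Build the forest and keep only the edges that merged components."""
    parent = list(range(num_reads))
    merging = [e for e in edges if union(parent, *e)]
    return parent, merging
```

Union-by-index needs no rank or size array: the forest is a single array of parent indices, one entry per read.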

Merge components
◮ Merge the component forests from each MPI task in log P iterations.
◮ Time: O(R log P log* R), Space ≈ 8R bytes.
[Figure: component forests over reads R1 to R4. Step 0: tasks P0, P1, P2, and P3 each hold a forest. Step 1: P0 and P2 hold merged forests. Step 2: P0 holds the final forest.]
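
The pairwise merge pattern can be simulated in a single process: in each iteration, half of the remaining tasks hand their forest to a partner, which replays the received parent links as edges. This is a sketch under assumptions; the list of per-task parent arrays stands in for MPI messages, and the helper names are illustrative.

```python
def find(parent, x):
    """Root lookup with path splitting."""
    while parent[x] != x:
        nxt = parent[x]
        parent[x] = parent[nxt]
        x = nxt
    return x

def union(parent, a, b):
    """Attach the lower-index root under the higher-index root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[min(ra, rb)] = max(ra, rb)

def merge_forests(mine, theirs):
    """Fold another task's forest into mine by replaying its parent links."""
    for read, par in enumerate(theirs):
        if read != par:
            union(mine, read, par)

def tree_merge(forests):
    """Pairwise-merge P forests; returns (final forest, iterations used)."""
    P = len(forests)
    step = 1
    rounds = 0
    while step < P:
        for p in range(0, P, 2 * step):
            if p + step < P:
                # Task p + step "sends" its forest to task p.
                merge_forests(forests[p], forests[p + step])
        step *= 2
        rounds += 1
    return forests[0], rounds
```

With four tasks the loop runs twice, matching the figure: first P1 and P3 fold into P0 and P2, then P2 folds into P0.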

Experiments and Results
Description of datasets
  ID   Dataset                      Read count R (×10^6)   Size (Gbp)   Source
  HG   Human gut                    12.7                   2.29         NCBI (SRR341725)
  LL   Lake Lanier                  21.3                   4.26         NCBI (SRR947737)
  MM   Mock microbial community     54.8                   11.07        NCBI (SRX200676)
  IS   Iowa, Continuous corn soil   1132.8                 223.26       JGI (402461)
Machine configuration
◮ Edison supercomputer at NERSC.
◮ Each node has 2 × 12-core Ivy Bridge processors and 64 GB memory.

Single-node scaling for Human Gut (HG) dataset
[Figure: time (seconds) per step (KmerGen-I/O, KmerGen, LocalSort, LocalCC-Opt, CC-I/O) and relative speedup for 1, 2, 4, 8, 12, and 24 threads.]

Multi-node scaling for Human Gut (HG) dataset
[Figure: time (seconds) per step (KmerGen-I/O, KmerGen, KmerGen-Comm, LocalSort, LocalCC-Opt, Merge-Comm, MergeCC, CC-I/O) and speedup for 1, 2, 4, 8, and 16 nodes.]

Multi-node scaling for LL and MM datasets
[Figure: two panels, LL (S=2) and MM (S=4). Each shows time (seconds) per step (KmerGen-I/O, KmerGen, KmerGen-Comm, LocalSort, LocalCC-Opt, Merge-Comm, MergeCC, CC-I/O) and speedup for 1, 2, 4, 8, and 16 nodes.]

Multi-node scaling for Iowa Continuous Soil dataset
[Figure: time (seconds) per step (KmerGen-I/O, KmerGen, KmerGen-Comm, LocalSort, LocalCC-Opt, Merge-Comm, MergeCC, CC-I/O) for 16 and 64 nodes. The 16-node bar is annotated 1X and the 64-node bar 3.25X.]
For the 16-node run, S = 8. For the 64-node run, S = 2.

KmerGen performance comparison with KMC-2 k-mer counter [Deorowicz2015]
[Figure: two panels of time (seconds) per dataset (HG, LL, MM). One panel compares MetaPrep with KMC-2; the other compares KMC-2 with MetaPrep16. Annotated speedups: 1.57X, 6.76X, 1.76X, 1.56X, 3.18X, 2.72X.]
◮ MetaPrep16: MetaPrep run using 16 nodes.

Comparison with read graph connectivity [Flick2015]
Table 1: Execution time comparison with metagenome partitioning work (AP_LB) using 16 nodes.
  Dataset   MetaPrep time (s)   AP_LB time (s)   MetaPrep speedup
  HG        5.5                 23.6             4.22×
  LL        11.5                25.9             2.25×
  MM        19.6                56.1             2.86×
◮ 21 iterations for AP_LB vs. 4 for MetaPrep on the MM dataset.

Largest component size
◮ Largest component size can be reduced by using filters.
1. k-mer size (k): Longer k-mers occur in fewer components.
2. k-mer frequency (KF): Filter erroneous (low-frequency) and repeat (high-frequency) k-mers.
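
A minimal sketch of a k-mer frequency filter: count every k-mer and keep only those whose frequency lies inside a chosen window, so that likely errors (rare k-mers) and repeats (very frequent k-mers) create no read graph edges. The parameter names `min_freq` and `max_freq` are hypothetical, and the sketch ignores reverse complements.

```python
from collections import Counter

def filter_kmers(reads, k, min_freq, max_freq):
    """Return the k-mers whose frequency lies in [min_freq, max_freq]."""
    counts = Counter(
        seq[i:i + k] for seq in reads for i in range(len(seq) - k + 1)
    )
    return {km for km, c in counts.items() if min_freq <= c <= max_freq}
```

Only the surviving k-mers would then be used to connect reads, which tends to break the largest component into smaller pieces.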
