
SWMapper: Scalable Read Mapper on SunWay TaihuLight. Kai Xu 1,2, Xiaohui Duan 1,2, Xiangxu Meng 1, Xin Li 1, Bertil Schmidt 3, Weiguo Liu 1,2. 1 Shandong University, 2 National Supercomputing Center in Wuxi, 3 Johannes Gutenberg University


SLIDE 1

SWMapper: Scalable Read Mapper on SunWay TaihuLight

Kai Xu1,2, Xiaohui Duan1,2, Xiangxu Meng1, Xin Li1, Bertil Schmidt3, Weiguo Liu1,2

1 Shandong University
2 National Supercomputing Center in Wuxi
3 Johannes Gutenberg University

SLIDE 2

Outline

  • Introduction and Background
  • Implementation
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 3

Introduction

Related work

  • Trie-based

BWA, Bowtie, …

  • k-mer-based

BWA-MEM, Bowtie2, GEM, FastHASH, mrsFAST, RazerS3, FEM, S-Aligner, BitMapper2, Hobbes3, …

  • Read mappers on compute clusters

BWA, pMAP, SEAL, BigBWA, SparkBWA, parSRA, CUSHAW3-UPC++, mer-Aligner, S-Aligner, …

SLIDE 4

Introduction

  • 1. Index the k-mers of the reference into a hash table (or other similar data structure)

Reference: ACGTACGTAGCATGCATCGATCGTACGCATCGAT
ACGT -> 0xE4, CGTA -> 0x39, GTAC -> 0x4E, …

  • 2. Extract and hash the k-mers ("seeds") of the read

Read: ACGTCCGTAGCATGCT
ACGT -> 0xE4, CCGT -> 0xE5, AGCA -> 0x18

  • 3. Probe the hash table to find candidate locations

0xE4: 1, 234, 456, 1246, 17983
0xE5: 89, 284, 956, 2246, 27983
0x18: 9, 423, 929, 3645, 47228

  • 4. Compute an alignment ("extend") at the candidate locations

Reference ACGTACGTAGCATGCATCGATCGTACGCATCGAT
Read          ACGTCCGTAGCATGCT
(the read aligns at a candidate location with a small number of mismatches)

This is the seed-and-extend strategy.
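The four steps above can be sketched in a few lines of Python. This is a toy illustration, not SWMapper's implementation: the function names are invented, k is fixed at 4, and the "extend" step is a plain Hamming-distance check rather than a banded alignment.

```python
# Minimal seed-and-extend sketch (illustrative only; names are hypothetical).
K = 4

def build_index(reference):
    """Step 1: index every k-mer of the reference in a hash table."""
    index = {}
    for pos in range(len(reference) - K + 1):
        index.setdefault(reference[pos:pos + K], []).append(pos)
    return index

def map_read(read, reference, index, max_errors=2):
    """Steps 2-4: seed the read, probe the index, verify each candidate."""
    hits = []
    for offset in range(0, len(read) - K + 1, K):       # step 2: extract seeds
        seed = read[offset:offset + K]
        for pos in index.get(seed, []):                 # step 3: probe the table
            start = pos - offset                        # implied read origin
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            errors = sum(a != b for a, b in zip(read, window))  # step 4: verify
            if errors <= max_errors:                    # (Hamming distance here;
                hits.append((start, errors))            #  SWMapper uses banded Myers)
    return sorted(set(hits))

ref = "ACGTACGTAGCATGCATCGATCGTACGCATCGAT"
print(map_read("ACGTAGCATGCATCGA", ref, build_index(ref)))  # → [(4, 0)]
```

The real mapper replaces the exhaustive per-position comparison with the filtration and bit-vector verification described on the following slides.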

SLIDE 5

Outline

  • Introduction and Background
  • Implementation
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 6

Implementation

SW26010 Architecture

[Diagram: one SW26010 consists of four core groups, each with one management processing element (MPE), one 8×8 cluster of computing processing elements (CPEs), and a memory controller (MC), all connected by a network on chip (NoC); each CPE has a small local device memory (LDM) and reaches main memory via DMA]

  • Limited LDM: just 64 KB per CPE
  • One MPI process can attach to one core group (CG)
  • CPE access to main memory uses the DMA transfer pattern and incurs latency
  • Memory size of one CG: 8 GB
  • Memory bandwidth of one CG: up to 34 GB/s

SLIDE 7

Implementation

MPE Workflow

1. Build the hash index
2. Allocate two buffers each for reads and results (double buffering)
3. Initialize the buffer indices rds_id and res_id
4. Load one batch of reads into read buffer rds_id and update rds_id
5. Call athread_spawn to start the CPE alignment
6. While the CPEs compute, load the next batch into the other read buffer and update rds_id
7. Call athread_join
8. Write result buffer res_id to disk and update res_id
9. Repeat from step 4 while there are still reads
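A minimal sketch of this double-buffered pipeline follows. The names are hypothetical: a Python thread stands in for athread_spawn/athread_join, and align_batch is a placeholder for the CPE alignment kernel.

```python
# Double-buffering sketch of the MPE workflow (illustrative only).
import threading

def align_batch(batch):
    """Placeholder for the CPE alignment of one batch of reads."""
    return [f"mapped:{read}" for read in batch]

def mpe_workflow(batches):
    read_buf = [None, None]          # two read buffers
    res_buf = [None, None]           # two result buffers
    results, worker = [], None
    rds_id = res_id = 0
    for batch in batches:
        read_buf[rds_id] = batch     # load one batch into read buffer rds_id
        if worker is not None:       # a previous batch is still being aligned:
            worker.join()            #   athread_join, then flush its results
            results.extend(res_buf[res_id])
            res_id ^= 1
        def run(i=rds_id, j=res_id): # athread_spawn: align in the background
            res_buf[j] = align_batch(read_buf[i])
        worker = threading.Thread(target=run)
        worker.start()
        rds_id ^= 1                  # the next batch goes into the other buffer
    if worker is not None:
        worker.join()
        results.extend(res_buf[res_id])
    return results

print(mpe_workflow([["r1", "r2"], ["r3"]]))  # → ['mapped:r1', 'mapped:r2', 'mapped:r3']
```

The point of the two-buffer scheme is that loading batch k+1 and writing the results of batch k-1 overlap with the CPE computation on batch k.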

SLIDE 8

Implementation

Succinct hash index

[Diagram: reference and read split into overlapping seeds; only every s-th seed of the reference is stored, at positions 42, 45, 48, 51, …; a long seed of the read matches the reference where one of its seeds hits a stored position]

  • Seed length l
  • Reference length r
  • Store only one seed out of every s consecutive reference seeds
  • Number of stored seeds = (r - l + 1)/s

  • Long seed length = l + s - 1
  • Divide the read into long seeds
  • Divide each long seed into its s overlapping seeds
  • Use these seeds to find candidate positions: if a long seed occurs in the reference, at least one of its s seeds starts at a sampled position, so no match is lost
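The sampling scheme can be sketched as follows (toy Python with invented names; l = 4 and s = 3 here, so the long-seed length is 6). Because the reference stores a seed at every s-th position, any long seed that occurs in the reference is found through at least one of its s constituent seeds.

```python
# Sketch of the succinct (sampled) hash index: store only every s-th seed of the
# reference, and query each long seed of the read via its s overlapping seeds.
L, S = 4, 3                    # seed length l, sampling stride s
LONG = L + S - 1               # long seed length = l + s - 1

def build_sampled_index(reference):
    index = {}
    for pos in range(0, len(reference) - L + 1, S):   # one seed every s positions
        index.setdefault(reference[pos:pos + L], []).append(pos)
    return index

def candidate_positions(read, index):
    """For each long seed of the read, probe all s seeds inside it."""
    candidates = set()
    for lo in range(0, len(read) - LONG + 1, LONG):   # divide read into long seeds
        for shift in range(S):                        # s seeds per long seed
            seed = read[lo + shift:lo + shift + L]
            for pos in index.get(seed, []):
                candidates.add(pos - lo - shift)      # implied read origin
    return sorted(candidates)

ref = "ACGTACGTAGCATGCATCGATCGTACG"
idx = build_sampled_index(ref)
print(candidate_positions(ref[6:18], idx))   # the true origin 6 is among the candidates
```

The index is "succinct" because it stores roughly a 1/s fraction of the reference seeds while the long-seed query keeps full sensitivity for exact seed matches.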

SLIDE 9

Implementation

Build hash index

[Diagram: each process scans its part of the reference and buckets seed locations by hash value; the per-process buckets are then concatenated per hash value, with per-bucket offsets, and the merged index is written to the file system]

  • Process 0 and Process 1 each build their own hash index
  • The two processes then merge them into a single hash index
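In miniature, the two-process construction looks like this (plain Python; the names are invented, and MPI exchange and the on-disk layout are omitted):

```python
# Sketch of distributed index construction: each process indexes its slice of
# the reference, then the partial indexes are merged hash value by hash value.
def build_partial_index(reference, start, end, k=4):
    """Index the k-mers whose start position lies in [start, end)."""
    index = {}
    for pos in range(start, min(end, len(reference) - k + 1)):
        index.setdefault(reference[pos:pos + k], []).append(pos)
    return index

def merge_indexes(parts):
    merged = {}
    for part in parts:
        for kmer, locations in part.items():
            merged.setdefault(kmer, []).extend(locations)
    for locations in merged.values():
        locations.sort()           # keep each per-k-mer location list sorted
    return merged

ref = "ACGTACGTAGCATG"
half = len(ref) // 2
p0 = build_partial_index(ref, 0, half)          # "process 0"
p1 = build_partial_index(ref, half, len(ref))   # "process 1"
full = merge_indexes([p0, p1])
```

Merging is cheap because each partial index already groups locations by hash value; only per-bucket concatenation (with offset bookkeeping in the real, flattened layout) remains.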

SLIDE 10

Implementation

CPE Workflow

1. Get parameters from the MPE
2. local_id = faaw(&global_id, 1)
3. While local_id < read_number:
   a. Get the read from memory and encode it into bit-vectors
   b. Divide the read into seeds and choose e + 1 long seeds to compute
   c. Get the candidate locations and use the long seeds to filter them
   d. Remove duplicate locations
   e. Call banded Myers to verify the remaining locations
   f. Generate results and put them into memory
   g. local_id = faaw(&global_id, 1)
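The dynamic scheduling loop can be sketched with ordinary threads. This is illustrative only: itertools.count (whose next() is atomic under CPython's GIL) stands in for the atomic fetch-and-add faaw, and the per-read work is a placeholder.

```python
# Sketch of the CPEs' dynamic scheduling: each worker repeatedly claims the
# next unprocessed read with a fetch-and-add on a shared counter.
import itertools
import threading

def cpe_worker(counter, reads, results):
    while True:
        local_id = next(counter)          # local_id = faaw(&global_id, 1)
        if local_id >= len(reads):        # no more reads: this worker is done
            return
        # placeholder for encode / seed / filter / verify of one read
        results[local_id] = reads[local_id].lower()

reads = [f"READ-{i}" for i in range(1000)]
results = [None] * len(reads)
counter = itertools.count()
threads = [threading.Thread(target=cpe_worker, args=(counter, reads, results))
           for _ in range(8)]             # 8 worker threads instead of 64 CPEs
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Fetch-and-add scheduling keeps all workers busy even when reads take very different amounts of time, which static chunking would not.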

SLIDE 11

Implementation

Removing Duplicate Locations

[Diagram: per-seed location buffers feeding a minimum heap keyed by location, with a seed index attached to each heap entry]

  • Load one block of locations for each seed
  • Use one location from each seed to create a minimum heap
  • Pop the smallest location
  • Push the next location from the same buffer; discard a location that already exists in the heap
  • Load the next block when a block has no more locations
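The heap-based merge can be sketched as follows (plain Python with heapq; the per-seed location lists are invented sample data, and the block-wise loading from memory is omitted):

```python
# Sketch of duplicate removal: a k-way merge of the per-seed, sorted location
# lists through a min-heap, emitting each location exactly once.
import heapq

def merged_unique_locations(per_seed_locations):
    heap = []
    for seed_id, locs in enumerate(per_seed_locations):
        if locs:                                   # one location per seed seeds the heap
            heapq.heappush(heap, (locs[0], seed_id, 0))
    out = []
    while heap:
        loc, seed_id, idx = heapq.heappop(heap)    # pop the smallest location
        if not out or out[-1] != loc:              # discard duplicates
            out.append(loc)
        if idx + 1 < len(per_seed_locations[seed_id]):
            nxt = per_seed_locations[seed_id][idx + 1]
            heapq.heappush(heap, (nxt, seed_id, idx + 1))
    return out

print(merged_unique_locations([[2, 17], [2, 31], [1, 34], [1, 36]]))
# → [1, 2, 17, 31, 34, 36]
```

Because each per-seed list is sorted, the heap produces locations in globally increasing order, so duplicates are always adjacent and can be dropped with a single comparison against the last emitted value.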

SLIDE 12

Implementation

Seed Filtration

[Diagram: a short seed of the read matches the reference at location 720 (green), but the bases of the surrounding long seed do not match the read]

  • The short seed matches at position 720, but the long seed does not match the read; the position is therefore discarded.
  • To enable this check, we store the bases before and after each short seed along with its locations in the hash index.
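A toy version of this check (the names and field layout are our invention, not SWMapper's actual index format):

```python
# Illustrative flanking-base check for seed filtration: the index stores, with
# each location, the reference bases just before and after the short seed, so a
# candidate whose flanks disagree with the read can be discarded without
# touching the reference again.
def passes_long_seed_filter(left_bases, right_bases, read, seed_offset, seed_len):
    """Keep the candidate only if the stored flanks match the read around the
    seed. Assumes the seed lies far enough inside the read for both flanks."""
    flank = len(left_bases)
    read_left = read[seed_offset - flank:seed_offset]
    read_right = read[seed_offset + seed_len:seed_offset + seed_len + len(right_bases)]
    return read_left == left_bases and read_right == right_bases

# short seed "GCAT" at offset 4 of the read, stored flanks "TA" / "GC"
print(passes_long_seed_filter("TA", "GC", "CGTAGCATGCAT", 4, 4))  # → True
```

The comparison touches only a few bytes that are already in the index entry, so failing candidates never cost a random access into the reference.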

SLIDE 13

Implementation

Vectorization of the banded Myers Algorithm

[Diagram: 4×4 blocks of bytes before and after transposition; after the transposition, the words of sub-ref 1 … sub-ref 4 are interleaved across the vector lanes]

  • Transposition of the data layout.
  • Each row represents contiguous memory.
  • Data of the same color corresponds to the bit-vectors of the sub-references at different candidate locations.
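The transposition can be sketched as follows (toy Python; the byte values are arbitrary, and on the real hardware this is done with vector loads and shuffles rather than list indexing):

```python
# Sketch of the layout transposition: the bit-vector words of four
# sub-references are interleaved so that one SIMD load picks up the d-th word
# of all four lanes at once.
def transpose_layout(words, lanes=4):
    """words[i] is the word list of sub-reference i (all of equal length).
    Returns a flat buffer: word 0 of every lane, then word 1 of every lane..."""
    depth = len(words[0])
    return [words[lane][d] for d in range(depth) for lane in range(lanes)]

buf = transpose_layout([[0x48, 0xef], [0xa5, 0xb4], [0xfe, 0x4e], [0x78, 0xe8]])
print([hex(w) for w in buf])
```

After the transposition, each contiguous group of four words feeds one vector register, so the four candidate alignments advance in lockstep.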
SLIDE 14

Implementation

Vectorization of banded Myers Algorithm

Algorithm: SIMD pseudocode
Require: bit-vectors of the sub-references ref_hi, ref_lo; bit-vectors of the read read_hi, read_lo
Ensure: edit distance err
 1: // get match of read and reference
 2: t1 = simd_vxorw(read_hi, ref_hi);
 3: t2 = simd_vxorw(read_lo, ref_lo);
 4: matchv = simd_vorw(t1, t2);
 5: // x = match | vn
 6: xv = simd_vorw(matchv, vnv);
 7: // d0 = vp + (x & vp)
 8: d0v = simd_vandw(xv, vpv);
 9: d0v = simd_vaddw(d0v, vpv);
10: // d0 = (d0 ^ vp) | x
11: d0v = simd_log3x(d0v, vpv, xv, table[0]);
12: // hn = vp & d0
13: hnv = simd_vandw(vpv, d0v);
14: // hp = vn | ~(vp | d0)
15: hpv = simd_log3x(vnv, vpv, d0v, table[1]);
16: // x = d0 >> 1
17: xv = simd_vsrlw(d0v, 1);
18: // vn = x & hp
19: vnv = simd_vandw(xv, hpv);
20: // vp = hn | ~(x | hp)
21: vpv = simd_log3x(hnv, xv, hpv, table[2]);
22: // tmp_res = (d0 & 1) ^ 1
23: tmp_resv = simd_log3x(d0v, 1, 1, table[3]);
24: // err = err + tmp_res
25: errv = simd_vaddw(errv, tmp_resv);

[Diagram: registers a, b, c hold the three operands, register d holds the 8-entry truth table, and register e receives the result of log3x]

  • The instruction log3x computes an arbitrary three-input logic function, e.g. e = (a & b) | c.
  • The truth table d of size 8 = 2^3 is precomputed; e.g. the value of the first entry of d is computed by ((0 & 1) | 1).
  • For each bit position, the corresponding bits of a, b, c are combined to index into d, and the selected entry becomes the corresponding bit of the final result e.
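For reference, here is a scalar Python rendering of the (unbanded, semi-global) Myers bit-vector recurrence that each SIMD lane evaluates. It follows Myers' original scalar update order, which differs cosmetically from the slide's banded variant (shift direction and how err is accumulated), and the names are ours, not SWMapper's.

```python
# Scalar Myers bit-vector recurrence: minimum edit distance of `pattern`
# against any alignment ending anywhere in `text`.
def myers_min_edit_distance(pattern, text):
    m = len(pattern)
    ones = (1 << m) - 1
    peq = {}
    for i, ch in enumerate(pattern):                 # per-symbol match masks
        peq[ch] = peq.get(ch, 0) | (1 << i)
    vp, vn, score = ones, 0, m
    best = score
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | vn                                 # x = match | vn
        d0 = ((((eq & vp) + vp) & ones) ^ vp) | xv   # d0 = ((vp + (x & vp)) ^ vp) | x
        hn = vp & d0                                 # hn = vp & d0
        hp = vn | (~(vp | d0) & ones)                # hp = vn | ~(vp | d0)
        if hp & (1 << (m - 1)):                      # error count from the last row
            score += 1
        elif hn & (1 << (m - 1)):
            score -= 1
        hp = (hp << 1) & ones
        hn = (hn << 1) & ones
        vp = hn | (~(xv | hp) & ones)                # vp = hn | ~(x | hp)
        vn = hp & xv                                 # vn = x & hp
        best = min(best, score)
    return best

print(myers_min_edit_distance("ACGT", "TTACGTTT"))   # exact occurrence → 0
```

Each of the logic lines above maps to one SIMD instruction on the slide, with the three-operand expressions (such as hp = vn | ~(vp | d0)) fused into a single log3x.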

SLIDE 15

Outline

  • Introduction and Background
  • Implementation
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 16

Performance evaluation

Datasets

Table 1: Datasets used for performance evaluation.
Dataset  NCBI Acc No.  Read length (bps)  Number of reads
D1       N/A           100                100000
R1       ERR013135     108                20M

Intel machine configuration:
CPU: Intel Xeon W-2123v3 (4 cores in total, operating at 3.6 GHz)
Memory: 16 GB
OS: Ubuntu 16.04 with Linux kernel 4.4.0-28-generic

SLIDE 17

Performance evaluation

Accuracy

Table 2: Results using different accuracy measures based on the Rabema benchmark.
Tool        Mapped Reads  All[%]  All-best[%]  Any-best[%]
RazerS3     99993         100.00  100.00       100.00
BitMapper2  99993         100.00  100.00       100.00
Hobbes3     99993         100.00  100.00       100.00
S-Aligner   99993         99.98   99.99        100.00
SWMapper    99993         100.00  100.00       100.00

SLIDE 18

Performance evaluation

Hash Index Construction Time

Table 3: Hash index construction times (in seconds) for the first chromosome of GRCh38.
Tools    BitMapper2  Hobbes3.0  S-Aligner  SWMapper
Time(s)  54          47         238        37

Table 4: Strong scaling test for index construction of SWMapper using a full human reference genome (GRCh38) with different numbers of MPI processes on Sunway TaihuLight.
Processes   1     2    4    8    16
Time(s)     1632  857  443  257  63
Efficiency  100%  95%  92%  79%  63%

SLIDE 19

Performance evaluation

[Chart: runtime comparison (in seconds) for mapping all reads of Dataset R1 to the first chromosome of GRCh38; tools are executed on a Xeon CPU (green) or on a single SW26010 (red)]

[Chart: runtimes (in seconds) on a single CG after incrementally applying the different optimization steps]

SLIDE 20

Performance evaluation

Strong scaling

Processes   1     2     4     8    16   32   64   128
Time(s)     6635  3368  1705  867  437  230  124  70
Efficiency  100%  99%   97%   96%  95%  90%  83%  74%

Table 5: Strong scaling test for mapping all reads of Dataset R1 to a whole human genome reference (GRCh38) using different numbers of MPI processes on Sunway TaihuLight.

SLIDE 21

Outline

  • Introduction and Background
  • Implementation
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 22

Conclusion

  • We present the design, construction, and usage of a memory-efficient distributed succinct hash index structure for k-mers (k-mers are substrings of length k).
  • Centred around this index structure, we propose a number of algorithmic optimizations, such as seed filtration and the removal of duplicate candidate locations, in addition to architecture-specific optimizations such as dynamic scheduling, asynchronous data transfer, and the overlapping of I/O and computation.
  • We design and implement a vectorized version of the banded Myers algorithm for pairwise alignment that takes full advantage of the SW26010 instruction set.

SLIDE 23

SWMapper: Scalable Read Mapper on SunWay TaihuLight

Thank You!

Speaker: Kai Xu (xukai16@foxmail.com)
Shandong University / National Supercomputing Center in Wuxi