SWMapper: Scalable Read Mapper on SunWay TaihuLight
Kai Xu1,2, Xiaohui Duan1,2, Xiangxu Meng1, Xin Li1, Bertil Schmidt3, Weiguo Liu1,2
1 Shandong University
2 National Supercomputing Center in Wuxi
3 Johannes Gutenberg University
- trie-based
  BWA, Bowtie, …
- k-mer-based
  BWA-MEM, Bowtie2, GEM, FastHASH, mrsFAST, RazerS3, FEM, S-Aligner, BitMapper2, Hobbes3, …
- read mappers on compute clusters
  BWA, pMAP, SEAL, BigBWA, SparkBWA, parSRA, CUSHAW3-UPC++, mer-Aligner, S-Aligner, …
Reference: ACGTACGTAGCATGCATCGATCGTACGCATCGAT
  ACGT -> 0xE4
  CGTA -> 0x39
  GTAC -> 0x4E
Read: ACGTCCGTAGCATGCT
  ACGT
  CCGT
  AGCA -> 0x18
Hash index (hash value: locations):
  0xE4: 1, 234, 456, 1246, 17983
  0xE5: 89, 284, 956, 2246, 27983
  0x18: 9, 423, 929, 3645, 47228
Alignment:
  Reference  ACGTACGTAGCATGCATCGATCGTACGCATCGAT
             |||| ||||||||||
  Read       ACGTCCGTAGCATGCT
seed-and-extend strategy
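The hash values on this slide are consistent with a 2-bit encoding (A=0, C=1, G=2, T=3) in which the first base occupies the lowest bits. The sketch below reproduces that encoding and the seed lookup step. It is a minimal illustration, not SWMapper's implementation: the function names, the 0-based positions, and the use of non-overlapping seeds are assumptions.

```python
# Assumed 2-bit code; it reproduces the values shown on the slide.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_kmer(kmer):
    """Pack a k-mer into an integer, first base in the lowest two bits."""
    value = 0
    for i, base in enumerate(kmer):
        value |= CODE[base] << (2 * i)
    return value

def build_index(reference, k):
    """Map each k-mer hash value to the sorted list of its 0-based positions."""
    index = {}
    for pos in range(len(reference) - k + 1):
        index.setdefault(encode_kmer(reference[pos:pos + k]), []).append(pos)
    return index

def candidate_locations(index, read, k):
    """Seed step: look up non-overlapping seeds and project each hit
    back to the reference position where the read would start."""
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        seed = encode_kmer(read[offset:offset + k])
        for pos in index.get(seed, []):
            if pos - offset >= 0:
                candidates.add(pos - offset)
    return sorted(candidates)

reference = "ACGTACGTAGCATGCATCGATCGTACGCATCGAT"
read = "ACGTCCGTAGCATGCT"
index = build_index(reference, 4)
candidates = candidate_locations(index, read, 4)
```

Each candidate would then be passed to the extend step, where the read is verified against the reference at that position.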
[Figure: SW26010 architecture. Four core groups, each with a memory controller (MC), an MPE, an 8×8 CPE cluster, and its own memory, connected by a network on chip (NoC).]
[Figure: CPE data path. Labels: Memory, DMA, LDM, Compute.]
- Limited LDM: just 64 KB
- One MPI process can attach to one core group (CG)
- Latency of CPE-to-MPE data transfers via DMA
- Memory size of one CG: 8 GB
- Memory bandwidth of one CG: up to 34 GB/s
MPE Workflow
[Flowchart; boxes listed in extraction order]
- Build hash index
- Malloc two buffers for reads and results respectively
- Init index rds_id and res_id
- Load one batch of reads to read-buffer rds_id
- Update rds_id
- Call athread_spawn and start CPE alignment
- Load one batch of reads to read-buffer rds_id
- Update rds_id
- Call athread_spawn and start CPE alignment
- Call athread_join
- Call athread_join
- Write result-buffer res_id to disk
- Update res_id
- Write result-buffer res_id to disk
- Update res_id
Loop condition: If there are still reads
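The two buffers and the paired spawn/join boxes suggest that loading the next batch overlaps with alignment of the current one. The sketch below illustrates that overlap under this assumption. A Python worker thread stands in for athread_spawn and athread_join, and `run_pipeline` and `align_batch` are hypothetical names, not part of SWMapper.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(batches, align_batch):
    """Align batches in order while the next batch is being loaded."""
    results = []
    batch_iter = iter(batches)
    current = next(batch_iter, None)      # load the first batch
    with ThreadPoolExecutor(max_workers=1) as pool:
        while current is not None:        # "if there are still reads"
            # Stand-in for athread_spawn: start alignment of the current batch.
            future = pool.submit(align_batch, current)
            # Meanwhile, load the next batch into the other buffer.
            upcoming = next(batch_iter, None)
            # Stand-in for athread_join, followed by writing the results.
            results.append(future.result())
            current = upcoming
    return results

aligned = run_pipeline([[1, 2], [3], [4, 5, 6]],
                       lambda batch: [x * 10 for x in batch])
```

In a real run the load step would read from disk and the append would write to a result file; here both are replaced by in-memory operations to keep the sketch self-contained.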
succinct hash index
[Figure: k-mers of the reference and of the read around the substring …CACATCGTAGCAT…, with long seeds marked on both, reference positions 42, 45, 48, 51, and a match indicated.]
Build hash index
[Figure: distributed hash index construction. Labels: Process 0, Process 1, file system, reference, hash index, hash value (0x00 to 0x03), locations.]
… their own hash index
… them into a single hash index
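The two caption fragments suggest that processes first build their own indices, which are then merged into one. The sketch below illustrates this under stated assumptions: each process indexes a contiguous range of k-mer start positions, and the index is keyed by the k-mer string rather than by its hash value. All function names are hypothetical.

```python
def index_slice(reference, k, start, end):
    """Partial index for k-mers whose start position lies in [start, end)."""
    partial = {}
    for pos in range(start, end):
        partial.setdefault(reference[pos:pos + k], []).append(pos)
    return partial

def merge_indices(partials):
    """Merge partial indices; partials must be ordered by slice start so
    that every merged location list stays sorted."""
    merged = {}
    for partial in partials:
        for kmer, positions in partial.items():
            merged.setdefault(kmer, []).extend(positions)
    return merged

def build_distributed(reference, k, num_procs):
    """Simulate num_procs processes, each indexing one slice of the reference."""
    total = max(len(reference) - k + 1, 0)   # number of k-mer start positions
    chunk = -(-total // num_procs)           # ceiling division
    partials = [
        index_slice(reference, k, p * chunk, min((p + 1) * chunk, total))
        for p in range(num_procs)
    ]
    return merge_indices(partials)

reference = "ACGTACGTAGCATGCATCGATCGTACGCATCGAT"
single = build_distributed(reference, 4, 1)
merged = build_distributed(reference, 4, 3)
```

Because slices are defined over k-mer start positions, a k-mer that straddles a slice boundary is still indexed exactly once, by the process that owns its start position.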
CPE Workflow
- Get parameters from MPE
- local_id = faaw(&global_id, 1)
- If local_id < read_number:
  - Get read from memory
  - Encode read to bit-vector
  - Divide read into seeds
  - Choose e + 1 long seeds to compute
  - Get the candidate locations
  - Use long seeds to filter candidate locations
  - Remove duplicate locations
  - Call banded Myers to verify locations
  - Generate results and put the results to memory
  - local_id = faaw(&global_id, 1)
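The faaw calls implement dynamic scheduling: each worker claims the next unprocessed read by atomically incrementing a shared counter. The sketch below mimics that loop with Python threads. `AtomicCounter`, `worker`, and `map_reads` are hypothetical names, and a lock stands in for the hardware fetch-and-add.

```python
import threading

class AtomicCounter:
    """Lock-based stand-in for an atomic fetch-and-add."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old

def worker(counter, reads, process_read, results):
    local_id = counter.fetch_and_add(1)        # claim the first read
    while local_id < len(reads):
        results[local_id] = process_read(reads[local_id])
        local_id = counter.fetch_and_add(1)    # claim the next read

def map_reads(reads, process_read, num_workers=4):
    """Process every read exactly once, with reads claimed on demand."""
    counter = AtomicCounter()
    results = [None] * len(reads)
    threads = [
        threading.Thread(target=worker,
                         args=(counter, reads, process_read, results))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

lengths = map_reads(["AC", "GT", "A"], len, num_workers=2)
```

Compared with a static split of the read batch, claiming reads one at a time keeps workers busy when some reads have many more candidate locations than others.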
Removing Duplicate Locations
[Figure: location lists of seeds s1 to s4, a minimum heap holding one (seed, index, location) entry per seed, shown as (s1, 2, 17), (s2, 2, 31), (s3, 1, 34), (s4, 1, 36), and a locations buffer.]
minimum heap
discard the location which already exists in heap
locations
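A minimum heap with one entry per seed can merge the sorted location lists and drop duplicates in a single pass. The sketch below is one way to do this with Python's heapq. It differs from the slide in one detail that is stated plainly here: a duplicate is detected by comparing against the last location written to the buffer, not by checking whether the location already exists in the heap. `merge_unique` is a hypothetical name.

```python
import heapq

def merge_unique(location_lists):
    """Merge sorted per-seed location lists into one sorted list
    without duplicates."""
    # One heap entry per seed: (location, seed number, index into its list).
    heap = [(locations[0], seed, 0)
            for seed, locations in enumerate(location_lists) if locations]
    heapq.heapify(heap)
    buffer = []
    while heap:
        location, seed, idx = heapq.heappop(heap)
        if not buffer or buffer[-1] != location:
            buffer.append(location)            # new location: keep it
        if idx + 1 < len(location_lists[seed]):
            # Advance within the same seed's list.
            next_location = location_lists[seed][idx + 1]
            heapq.heappush(heap, (next_location, seed, idx + 1))
    return buffer

unique = merge_unique([[12, 17], [16, 17, 31], [18, 34], [12, 36]])
```

The heap never holds more than one entry per seed, so its size is bounded by the number of seeds rather than by the number of locations.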
Seed Filtration
[Figure: seed filtration. Labels: reference, read, long seed, hash index, locations (128, 400, …, 720, 890, …); location 720 is highlighted.]
720 (green) but not matching long
discarded.
short seeds along with locations in the hash index.
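The filtration step keeps a candidate location only if the long seed also matches there. The sketch below shows the idea under an assumption that the slide text does not confirm: the long seed is taken to be the short seed extended to the right, and each candidate is the reference position of a short-seed hit. `filter_by_long_seed` and its parameters are hypothetical.

```python
def filter_by_long_seed(reference, read, seed_offset, long_len, candidates):
    """Keep candidate positions where the long seed of the read matches
    the reference exactly."""
    long_seed = read[seed_offset:seed_offset + long_len]
    kept = []
    for pos in candidates:
        if reference[pos:pos + long_len] == long_seed:
            kept.append(pos)   # long seed matches: keep for verification
        # otherwise the location is discarded before verification
    return kept

reference = "ACGTACGTAGCATGCATCGATCGTACGCATCGAT"
read = "ACGTAGCA"
# The short seed ACGT occurs at reference positions 0 and 4.
kept = filter_by_long_seed(reference, read, 0, 8, [0, 4])
```

Discarding locations at this stage is cheap compared with running the full verification on every short-seed hit.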
Vectorization of banded Myers Algorithm
[Figure: data layout for vectorization. The words of four sub-references (sub-ref 1 to sub-ref 4) are transposed so that each register holds one word from each sub-reference.]
Vectorization of banded Myers Algorithm
Algorithm: SIMD pseudocode
Require: bit-vectors of sub-reference ref_hi, ref_lo; bit-vectors of read read_hi, read_lo
Ensure: edit distance err
  // get match of read and reference
  t1 = simd_vxorw(read_hi, ref_hi);
  t2 = simd_vxorw(read_lo, ref_lo);
  matchv = simd_vorw(t1, t2);
  // x = match | vn
  xv = simd_vorw(matchv, vnv);
  // d0 = vp + (x & vp)
  d0v = simd_vandw(xv, vpv);
  d0v = simd_vaddw(d0v, vpv);
  // d0 = (d0 ^ vp) | x
  d0v = simd_log3x(d0v, vpv, xv, table[0]);
  // hn = vp & d0
  hnv = simd_vandw(vpv, d0v);
  // hp = vn | ~(vp | d0)
  hpv = simd_log3x(vnv, vpv, d0v, table[1]);
  // x = d0 >> 1
  xv = simd_vsrlw(d0v, 1);
  // vn = x & hp
  vnv = simd_vandw(xv, hpv);
  // vp = hn | ~(x | hp)
  vpv = simd_log3x(hnv, xv, hpv, table[2]);
  // tmp_res = (d0 & 1) ^ 1
  tmp_resv = simd_log3x(d0v, 1, 1, table[3]);
  // err = err + tmp_res
  errv = simd_add(errv, tmp_resv);
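As a scalar point of comparison for the recurrences in the pseudocode, the sketch below implements the unbanded bit-parallel Myers edit distance in Python. It is not the banded, vectorized kernel shown on the slide: it computes the global edit distance between two strings, shifts the horizontal deltas left instead of shifting d0 right, and relies on Python's arbitrary-precision integers. The function name is hypothetical.

```python
def myers_edit_distance(pattern, text):
    """Global edit distance using bit-parallel delta vectors."""
    m = len(pattern)
    if m == 0:
        return len(text)
    # peq[c] has bit i set when pattern[i] == c.
    peq = {}
    for i, char in enumerate(pattern):
        peq[char] = peq.get(char, 0) | (1 << i)
    mask = (1 << m) - 1
    last = 1 << (m - 1)
    vp, vn = mask, 0        # vertical +1 / -1 deltas of the current column
    score = m               # distance between the pattern and the empty text
    for char in text:
        x = peq.get(char, 0) | vn
        d0 = ((((x & vp) + vp) ^ vp) | x) & mask   # diagonal zero deltas
        hn = vp & d0                               # horizontal -1 deltas
        hp = (vn | ~(vp | d0)) & mask              # horizontal +1 deltas
        if hp & last:
            score += 1
        elif hn & last:
            score -= 1
        x = ((hp << 1) | 1) & mask   # shift in +1: first row grows with the text
        vn = x & d0
        vp = ((hn << 1) | ~(x | d0)) & mask
    return score

distance = myers_edit_distance("ACGTCCGTAGCATGCT", "ACGTACGTAGCATGCA")
```

The banded variant restricts this computation to a diagonal band of the dynamic programming matrix, which is what allows a fixed-width bit-vector per alignment.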
[Figure: bitwise example with registers a, b, c, d, and e.]
value of the first bit in d is computed by ((0 & 1) | 1).
the corresponding location in d to get the final result stored in e.
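The fragments above describe a three-input logic operation in which a lookup table selects the result bit for each combination of input bits. The sketch below shows a generic version of that idea. The bit ordering of the table, the 32-bit width, and the names `make_table` and `ternary_logic` are assumptions of this sketch, not confirmed semantics of the SW26010 instruction.

```python
def make_table(func):
    """Build an 8-bit truth table: bit (a<<2 | b<<1 | c) holds func(a, b, c)."""
    table = 0
    for idx in range(8):
        a, b, c = (idx >> 2) & 1, (idx >> 1) & 1, idx & 1
        table |= (func(a, b, c) & 1) << idx
    return table

def ternary_logic(a, b, c, table, width=32):
    """Apply the truth table independently to every bit position."""
    result = 0
    for bit in range(width):
        idx = ((((a >> bit) & 1) << 2)
               | (((b >> bit) & 1) << 1)
               | ((c >> bit) & 1))
        result |= ((table >> idx) & 1) << bit
    return result

# Two of the fused expressions from the pseudocode comments.
xor_or = make_table(lambda a, b, c: (a ^ b) | c)      # (d0 ^ vp) | x
or_not = make_table(lambda a, b, c: a | ~(b | c))     # hn | ~(x | hp)

a, b, c = 0x48A5FD78, 0xEFB44EE8, 0x0F0F0F0F
fused = ternary_logic(a, b, c, xor_or)
```

Fusing two logic operations into one table lookup is what lets a step such as (d0 ^ vp) | x take a single instruction instead of two.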
Datasets
Datasets   NCBI Acc. No.   Read length (bp)   Number of reads
D1         N/A             100                100000
R1         ERR013135       108                20M
Table 1: Datasets used for performance evaluation.
Accuracy
Tools        Mapped Reads   All [%]   All-best [%]   Any-best [%]
RazerS3      99993          100.00    100.00         100.00
BitMapper2   99993          100.00    100.00         100.00
Hobbes3      99993          100.00    100.00         100.00
S-Aligner    99993          99.98     99.99          100.00
SWMapper     99993          100.00    100.00         100.00
Table 2: Results using different accuracy measures based on the Rabema benchmark.
Hash Index Construction Time
Tools      BitMapper2   Hobbes3.0   S-Aligner   SWMapper
Time (s)   54           47          238         37
Table 3: Hash index construction times (in seconds) for the first chromosome of GRCh38.

Processes    1      2     4     8     16
Time (s)     1632   857   443   257   63
Efficiency   100%   95%   92%   79%   63%
Table 4: Strong scaling test for index construction of SWMapper using a full human reference genome (GRCh38) with different numbers of MPI processes on Sunway TaihuLight.
Runtime comparison (in seconds) for mapping all reads of Dataset R1 to the first chromosome of
Runtimes (in seconds) on a single CG after incrementally applying different optimization steps.
Strong scaling
Processes    1      2      4      8     16    32    64    128
Time (s)     6635   3368   1705   867   437   230   124   70
Efficiency   100%   99%    97%    96%   95%   90%   83%   74%
Table 5: Strong scaling test for mapping all reads of Dataset R1 to a whole human genome reference (GRCh38) using different numbers of MPI processes on Sunway TaihuLight.
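The efficiency row of Table 5 agrees, to within about one percentage point, with the usual strong-scaling definition E(p) = T(1) / (p · T(p)). The sketch below applies that formula to the runtimes in the table. The slides do not state the formula, so treating it as the one used here is an assumption, and the function name is hypothetical.

```python
def strong_scaling_efficiency(processes, times):
    """Parallel efficiency relative to the first entry: T1 * p1 / (p * Tp)."""
    base_p, base_t = processes[0], times[0]
    return [base_t * base_p / (p * t) for p, t in zip(processes, times)]

processes = [1, 2, 4, 8, 16, 32, 64, 128]
times = [6635, 3368, 1705, 867, 437, 230, 124, 70]        # seconds, Table 5
reported = [100, 99, 97, 96, 95, 90, 83, 74]              # percent, Table 5

efficiency = strong_scaling_efficiency(processes, times)
for p, e in zip(processes, efficiency):
    print(f"{p:4d} processes: {100 * e:.1f}%")
```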
- We present the design, construction, and usage of a memory-efficient distributed succinct hash index structure for k-mers (k-mers are substrings of length k).
- Centred around this index structure we propose a number of algorithmic … in addition to architecture-specific optimizations such as dynamic scheduling, asynchronous data transfer, and the overlapping of I/O and computation.
- We design and implement a vectorized version of the banded Myers algorithm for pairwise alignment that takes full advantage of the SW26010 instruction set.
National Supercomputing Center in Wuxi Shandong University Speaker: Kai Xu xukai16@foxmail.com