Fast Burrows Wheeler Compression Using All-Cores

SLIDE 1

IIIT Hyderabad

Fast Burrows Wheeler Compression Using All-Cores

Aditya Deshpande* and P J Narayanan

Centre for Visual Information Technology
International Institute of Information Technology, Hyderabad

AsHES, 2015

*Now at the University of Illinois at Urbana Champaign

SLIDE 2

Outline

  • What?
      - Use all-cores (CPU + GPU) for a common end-to-end application
      - Our focus: Burrows Wheeler Compression (Bzip2)

  • How?
      - Use fast GPU String Sort [Deshpande and Narayanan, HiPC’13]
      - Domain specific techniques for GPU BW Compression
      - All-core framework to use both CPU and GPU together

  • Why?
      - Commodity computers have multi-core CPU + many-core GPU
      - All-core end-to-end applications help end user leverage them both

May 29, 2015 AsHES 2015

SLIDE 3

Previous Work

  • Multi-core CPUs (Coarse/Task Parallelism)
      - LU, QR, Cholesky Decomposition, Random PDF Generators etc.
      - FFT, PBzip2, String Processing, Bioinformatics, Data Struct. etc.
      - Intel MKL and other libraries

  • Many-core GPUs (Fine/Data Parallelism)
      - Scan, Sort, Hashing, SpMV, Lists, Linear Algebra etc.
      - Graph Algorithms: BFS, SSSP, APSP, SCC, MST etc.
      - cuBLAS, cuFFT, NvPP, Magma, cuSparse, CUDPP, Thrust etc.

  • The focus is typically not end-to-end and/or all-core applications

SLIDE 4

Burrows Wheeler Compression

End-users compress/de-compress files on a daily basis. Best compressor: BW Compression (or Bzip2), a three step procedure:

  • 1. Burrows Wheeler Transform
       Suffix sort and use the last column (most compute intensive)
  • 2. Move-to-Front Transform
       Similar to run-length encoding (~10% of runtime)
  • 3. Huffman Encoding
       Standard frequency of chars based encoding (~10% of runtime)

INPUT:
I meant what I said and I said what I meant From there to here from here to there I said what I meant

After BWT (amenable to RLE):
tt edetttdomIIIIIIomeeddt sss eeehhhiniirrrrmmmhhhh wwwt t aaaoo aaaattrreeeefF nnaaan
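The Move-to-Front step can be illustrated with a minimal Python sketch. This is not the Bzip2 code or the code from this work; the function names are illustrative and the recency list over all 256 byte values is an assumption. It shows why the runs of equal characters produced by BWT turn into runs of zeros, which the later stages encode cheaply.

```python
def move_to_front(data: bytes) -> list[int]:
    """Replace each byte by its position in a recency list, then move it to the front."""
    recency = list(range(256))
    codes = []
    for b in data:
        idx = recency.index(b)
        codes.append(idx)
        recency.pop(idx)
        recency.insert(0, b)
    return codes


def inverse_move_to_front(codes: list[int]) -> bytes:
    """Undo move_to_front by replaying the same recency-list updates."""
    recency = list(range(256))
    out = bytearray()
    for idx in codes:
        b = recency.pop(idx)
        out.append(b)
        recency.insert(0, b)
    return bytes(out)


# A run of equal bytes becomes one index followed by zeros.
print(move_to_front(b"aaabbb"))  # [97, 0, 0, 98, 0, 0]
```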

SLIDE 5

Burrows Wheeler Transform

  • Input String: I, S
  • Sort all cyclic shifts of S
  • Last column of sorted strings, with index of original string, is BWT
  • O(N) strings are sorted, each with length O(N)
  • Suffix sort in BWT has long ties: 10^3 to 10^5 characters
  • Need a good GPU String Sort that works on longer ties

INPUT: S[N] = b a n a n a, I[N] = 1 2 3 4 5 6

OUTPUT MATRIX (sorted cyclic shifts) with shuffled I[N]:

  a b a n a n    6
  a n a b a n    4
  a n a n a b    2
  b a n a n a    1
  n a b a n a    5
  n a n a b a    3

Last column can be easily computed by offset addition even if we output this shuffled I[N]. Last column along with the index of original string (i.e. 4 since I[4]=1) is the BWT OUTPUT.
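The transform on this slide can be written as a naive Python sketch. It sorts the cyclic shifts directly, which is far slower than the GPU string sort discussed in the talk, and it uses 0-based indices, so the slide's index 4 appears as 3. The function name `bwt` is illustrative, and the input is assumed to be non-empty.

```python
def bwt(s: str) -> tuple[str, int]:
    """Naive Burrows Wheeler Transform of a non-empty string."""
    n = len(s)
    # order[j] is the start offset of the j-th smallest cyclic shift
    # (the shuffled index array I[N] on the slide, 0-based here).
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    # "Offset addition": the last character of the shift starting at i
    # is simply the character just before i, wrapping around.
    last_column = "".join(s[(i - 1) % n] for i in order)
    # Position of the original string among the sorted shifts.
    return last_column, order.index(0)


print(bwt("banana"))  # ('nnbaaa', 3)
```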

SLIDE 6

Sorting

  • Textbooks teach us many popular sorting methods: Quicksort, Mergesort, Radixsort

Data is always numbers!

[Figure: the numbers 29, 12, 39, 42, 37 sorted into 12, 29, 37, 39, 42]

  • Real data is beyond just numbers
      - Dictionary words or sentences
      - DNA sequences, multi-dimensional db records
      - File Paths

Can we sort strings efficiently?

SLIDE 7

Irregularity in String Sorting

  • Number Sorting (or Fixed Length Sorting): FIXED LENGTH KEYS
      - Fixed Length Keys (8 to 128 bits)
      - Standard containers: float, int, double etc.
      - Keys fit into registers
      - Comparisons take O(1) time

  • String Sorting (or Variable/Long Length Sorting): VARIABLE LENGTH KEYS
      - Keys have no restriction on length
      - Iteratively load keys from main memory
      - Comparisons not O(1) time
      - Suffix Sort (1M strings of 1M length!)

Variable work per thread and arbitrary memory accesses: IRREGULARITY

SLIDE 8

Previous String Sort

CPU

  • Burstsort [Sinha et al., JEA’07]
  • MSD Radix Sort [Kärkkäinen and Rantala, SPIRE’08]
  • Multi-key Quicksort [Bentley and Sedgewick, SODA’97]

GPU

  • Thrust Merge Sort [Satish et al., IPDPS’09]
  • Fixed/Var. Merge Sort [Davidson et al., InPar’12]
  • Hybrid Merge Sort [Banerjee et al., AsHES’13]
  • Our String Sort [HiPC’13] (Radix Sort)

SLIDE 9

Merge Sort: Iterative Comparisons

  • Repetitive loading for resolving ties in every merge step
  • Davidson et al. show that “After every merge step comparisons are between more similar strings”
  • Previous GPU String Sorts are based on Merge Sort

Iterative comparisons = High Latency Global Memory Access = Divergence

We develop a Radix Sort based String Sort

SLIDE 10

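Radix Sort for String Sorting

[Figure: the First Sort uses the k-char prefix as the key. Future Sorts use Seg ID + k-char prefix as keys, with the Segment ID (a proxy for the prefix sorted so far) in the MSBs. Example Segment IDs after the first sort: 0 0 0 0 0 1 1 1]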
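The idea in the figure can be sketched on the CPU in Python. This is only an illustration under assumptions: Python's built-in sort stands in for the GPU radix sort, every string is re-keyed in each step (rather than only the strings that are still tied), and `prefix_radix_sort` and its parameters are hypothetical names. Each step sorts on the pair (segment ID, next k characters), then assigns new segment IDs so that strings with equal keys stay in the same segment.

```python
def prefix_radix_sort(strings: list[str], k: int = 2) -> tuple[list[str], int]:
    """Sort strings k characters at a time; return (sorted strings, number of sort steps)."""
    n = len(strings)
    order = list(range(n))   # current permutation of the input
    seg = [0] * n            # segment ID of the string at each position of `order`
    offset, steps = 0, 0
    while True:
        # Key = (segment ID, next k-char chunk); the segment ID is the more significant part.
        keys = [(seg[p], strings[order[p]][offset:offset + k]) for p in range(n)]
        perm = sorted(range(n), key=lambda p: keys[p])
        order = [order[p] for p in perm]
        keys = [keys[p] for p in perm]
        # A new segment starts wherever the key differs from its predecessor.
        seg = [0] * n
        for p in range(1, n):
            seg[p] = seg[p - 1] + (keys[p] != keys[p - 1])
        steps += 1
        offset += k
        all_distinct = len(set(seg)) == n
        exhausted = all(offset >= len(strings[i]) for i in order)
        if all_distinct or exhausted:
            return [strings[i] for i in order], steps


words = ["banana", "band", "ban", "apple", "bandana"]
print(prefix_radix_sort(words, k=2)[0])  # ['apple', 'ban', 'banana', 'band', 'bandana']
```

Longer ties between strings mean more sort steps before every segment holds a single string, which is the cost the later slides try to reduce.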

SLIDE 11

Results

  • After-Sort Tie Length: indicates difficulty of sorting a dataset
  • Suffix sort of BWT has still higher ties and requires many sort steps
  • We develop domain-specific sort techniques for BWT

Code from cvit.iiit.ac.in

SLIDE 12

Modified String Sort for BWT

  • Doubling MCU length of String Sort
      - MCU length determines #sort steps
      - Large #sort steps for long ties and thus, longer runtime
      - Use fixed length MCU initially, then double to reduce sort steps
      - 1.5 to 2.5x speedup
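A rough count of sort steps shows why doubling helps on long ties. The Python sketch below is a back-of-the-envelope model, not the schedule used in this work: the tie length of 1000 characters, the initial MCU length of 8, and the rule "double after `warmup` fixed-length steps" are all assumptions chosen for illustration.

```python
import math


def steps_fixed(tie_length: int, mcu: int) -> int:
    """Sort steps needed to cover a tie when every step consumes `mcu` characters."""
    return math.ceil(tie_length / mcu)


def steps_doubling(tie_length: int, mcu: int, warmup: int) -> int:
    """Sort steps when the MCU length doubles after `warmup` fixed-length steps."""
    covered, steps = 0, 0
    while covered < tie_length:
        covered += mcu
        steps += 1
        if steps >= warmup:
            mcu *= 2
    return steps


print(steps_fixed(1000, 8))        # 125
print(steps_doubling(1000, 8, 2))  # 8
```

In this model the number of steps grows linearly with the tie length for a fixed MCU and only logarithmically once doubling starts, at the price of larger keys per step.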

SLIDE 13

Modified String Sort for BWT

  • Partial GPU Sort and CPU Merge
      - Cyclically shifted strings have special property
      - We can sort only 2/3rd strings, synthesize rest w/o iterative sort
      - Sort all (mod 3) ≠ 0 strings iteratively
      - 1st char of (mod 3) = 0 string, rank of next in 2/3rd sort enough to sort remaining 1/3rd strings
      - Non-iterative overlapped merge also possible (CPU)
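The 2/3rd + 1/3rd idea can be sketched in Python as follows. This is a simplified illustration, not the code from this work: it sorts suffixes rather than cyclic shifts, a plain comparison sort stands in for the iterative GPU sort of the 2/3rd set, and the function name is hypothetical. The remaining 1/3rd is ordered from one character plus a known rank, and the merge needs at most two characters plus a known rank per comparison.

```python
def suffix_array_two_thirds(s: str) -> list[int]:
    """Suffix order of s, sorting only the (i mod 3) != 0 suffixes directly."""
    n = len(s)

    # Step 1 (stand-in for the iterative sort): the 2/3rd sample.
    sample = sorted((i for i in range(n) if i % 3 != 0), key=lambda i: s[i:])
    rank = [0] * (n + 3)  # rank 0 means "past the end of the string"
    for r, i in enumerate(sample, start=1):
        rank[i] = r

    def ch(i: int) -> str:
        return s[i] if i < n else ""

    # Step 2: a (mod 3) = 0 suffix is ordered by its first character and
    # the rank of the next suffix, which is already in the sample.
    rest = sorted(range(0, n, 3), key=lambda i: (s[i], rank[i + 1]))

    # Step 3: non-iterative merge of the two sorted lists.
    def rest_goes_first(i: int, j: int) -> bool:
        if j % 3 == 1:
            return (ch(i), rank[i + 1]) < (ch(j), rank[j + 1])
        return (ch(i), ch(i + 1), rank[i + 2]) < (ch(j), ch(j + 1), rank[j + 2])

    merged, a, b = [], 0, 0
    while a < len(rest) and b < len(sample):
        if rest_goes_first(rest[a], sample[b]):
            merged.append(rest[a])
            a += 1
        else:
            merged.append(sample[b])
            b += 1
    return merged + rest[a:] + sample[b:]


print(suffix_array_two_thirds("banana"))  # [5, 3, 1, 0, 4, 2]
```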

SLIDE 14

Datasets GPU BWT

  • Datasets
      - Enwik8: First 10^8 bytes of English Wikipedia Dump (96MB)
      - Wiki-xml: Wikipedia xml dump (151MB)
      - Linux-2.6.11.tar: Publicly available linux kernel (199 MB)
      - Silesia Corpus: Data-compression benchmark (208MB)
  • Tie-Length vs. Block Size

SLIDE 15

GPU BWT vs. Bzip2 BWT

Average runtime (secs per block) for CPU and GPU BWT algorithms. GPU BWT time is GPU Sort (2/3rd + 1/3rd) plus CPU Merge (1/3rd + 2/3rd). GPU BWT time increases with MSD/ASD; the merge is a constant time operation.

Block Size: 900KB

Dataset             MSD      ASD     GPU Sort   CPU Merge   CPU BWT (bzip2)
enwik8              960      298     0.092      0.021       0.07
wiki-xml            960      614     0.212      0.021       0.209
silesia.tar         16320    1406    0.18       0.02        0.104
linux-2.6.11.tar    65472    2836    0.397      0.02        0.09

Block Size: 4.5MB

Dataset             MSD      ASD     GPU Sort   CPU Merge   CPU BWT (bzip2)
enwik8              960      576     0.28       0.183       0.46
wiki-xml            960      874     0.878      0.186       1.69
silesia.tar         16320    4075    0.75       0.171       0.768
linux-2.6.11.tar    65472    10078   3.152      0.162       0.95

Block Size: 9MB

Dataset             MSD      ASD     GPU Sort   CPU Merge   CPU BWT (bzip2)
enwik8              960      813     0.526      0.414       1.08
wiki-xml            960      929     1.81       0.406       4.53
silesia.tar         16320    8430    1.57       0.385       1.79
linux-2.6.11.tar    262080   27340   9.39       0.412       2.31

  • No speedup for small blocks: GPU not utilized sufficiently
  • Speedup on large blocks
  • GPU still slow for worst-case linux dataset

SLIDE 16

String Perturbation

  • Large #sort steps result from repeated substrings/long ties
  • Runtime reduces greatly if we break ties
  • Perturbation: ‘add random chars at fixed interval’ to break ties
  • Useful for applications where BWT transformed string is irrelevant, and BWT+IBWT are used in pairs (viz. BW Compression)
  • Fixed Perturbation can be removed after IBWT

Time for GPU BWT and CPU BWT vs. % perturbation for linux-2.6.11.tar. GPU Sort time decreases with perturbation; the merge is a constant time operation.

Block Size: 4.5MB

Perturbation   MSD      ASD     GPU Sort (2/3rd + 1/3rd)   CPU Merge (2/3rd + 1/3rd)   CPU BWT
0%             4032     2329    3.152                      0.163                       1.151
0.01%          4032     2535    0.971                      0.164                       1.154
0.1%           960      828     0.559                      0.166                       1.15
1%             192      76      0.323                      0.165                       1.112

Block Size: 9MB

Perturbation   MSD      ASD     GPU Sort (2/3rd + 1/3rd)   CPU Merge (2/3rd + 1/3rd)   CPU BWT
0%             262080   27340   9.39                       0.37                        2.31
0.01%          8128     5689    2.12                       0.367                       2.22
0.1%           960      911     1.09                       0.37                        1.32
1%             192      185     0.59                       0.381                       1.22

Linux-9MB Blocks, 8.2x speedup with 0.1% perturbation
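A fixed-interval perturbation and its removal can be sketched in Python as below. This is an illustration only: the function names, the use of `random.Random` with a seed, and the choice to insert one random byte after every complete run of `interval` bytes are assumptions, not details given on the slide. Under this reading, 0.1% perturbation would correspond to `interval = 1000`.

```python
import random


def perturb(data: bytes, interval: int, seed: int = 0) -> bytes:
    """Insert one random byte after every complete run of `interval` input bytes."""
    rng = random.Random(seed)
    out = bytearray()
    for start in range(0, len(data), interval):
        chunk = data[start:start + interval]
        out += chunk
        if len(chunk) == interval:
            out.append(rng.randrange(256))
    return bytes(out)


def remove_perturbation(data: bytes, interval: int) -> bytes:
    """Drop the byte that follows each run of `interval` bytes (applied after IBWT)."""
    out = bytearray()
    for start in range(0, len(data), interval + 1):
        out += data[start:start + interval]
    return bytes(out)


block = b"abcabcabcabcabcabcabcabc"
assert remove_perturbation(perturb(block, 5), 5) == block
```

Because the positions are fixed, the decompressor needs only the interval to strip the extra bytes; the random values themselves never have to be stored separately.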

SLIDE 17

All-Core Framework

[Figure: the INPUT is split into Work Items held in a FIFO WORK QUEUE. CoSt’s atomically dequeue Work Items and execute them in parallel, then enqueue each Output to the OUTPUT QUEUE.]

  • System made of CoSt’s:
      - GPU with controlling CPU thread is a CoSt
      - Other CPU cores are CoSt’s
  • Split blocks across CoSt’s, dequeued from work-queue

Only CPU+GPU CoSt = Hybrid BWC
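The work-queue structure can be sketched with Python threads. This is a minimal illustration, not the framework from this work: `run_all_core` is a hypothetical name, each worker function stands in for one CoSt, and the standard library's `bz2.compress` is used for every worker where the real system would dispatch to a GPU-backed or CPU-only compressor.

```python
import bz2
import queue
import threading


def run_all_core(blocks: list[bytes], workers: list) -> list[bytes]:
    """Let each worker (one per CoSt) pull blocks from a shared FIFO work queue."""
    work = queue.Queue()
    for item in enumerate(blocks):
        work.put(item)
    output = queue.Queue()

    def cost_loop(compress):
        while True:
            try:
                index, block = work.get_nowait()  # thread-safe dequeue of one work item
            except queue.Empty:
                return
            output.put((index, compress(block)))

    threads = [threading.Thread(target=cost_loop, args=(w,)) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Blocks may finish out of order; restore the input order by index.
    done = sorted(output.get() for _ in range(len(blocks)))
    return [compressed for _, compressed in done]


blocks = [bytes([65 + i]) * 1000 for i in range(8)]
compressed = run_all_core(blocks, [bz2.compress] * 4)
assert [bz2.decompress(c) for c in compressed] == blocks
```

Pulling from a shared queue means a faster CoSt simply takes more blocks, so no static split between CPU and GPU has to be chosen in advance.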

SLIDE 18

Hybrid BWC on CPU+GPU CoSt

[Figure: pipeline timeline. GPU: (i) 2/3rd Sort and (ii) 1/3rd Sort for Block #1, #2, #3, #4 in sequence. CPU: MERGE, MTF + HUFF and O/P of each finished block, plus I/P of upcoming blocks, overlapped with the GPU sorts of later blocks.]

  • Patel et al. did all 3 steps on GPU, 2.78X slowdown
  • Map appropriate operation to appropriate compute platform
  • GPU for sorts of BWT, CPU does sequential merge, MTF, Huff
  • Pipeline blocks such that CPU computation overlaps with GPU
  • Throughput BWC = BWT, barring first and last block offset
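The effect of overlapping CPU work with GPU sorts can be checked with a small timing model in Python. The per-block times below are hypothetical round numbers, not measurements from this work, and the model assumes a simple two-stage pipeline in which the CPU stage of a block starts once its GPU stage and the previous block's CPU stage have both finished.

```python
def serial_makespan(gpu_times: list[float], cpu_times: list[float]) -> float:
    """Total time when each block's CPU stage runs before the next GPU stage starts."""
    return sum(gpu_times) + sum(cpu_times)


def pipelined_makespan(gpu_times: list[float], cpu_times: list[float]) -> float:
    """Total time when the CPU stage of block i overlaps the GPU stage of block i+1."""
    gpu_done = 0.0
    cpu_done = 0.0
    for g, c in zip(gpu_times, cpu_times):
        gpu_done += g                            # GPU sorts run back to back
        cpu_done = max(gpu_done, cpu_done) + c   # CPU waits for the sort and for its previous block
    return cpu_done


gpu = [1.0, 1.0, 1.0, 1.0]   # hypothetical seconds per block for the GPU sorts
cpu = [0.5, 0.5, 0.5, 0.5]   # hypothetical seconds per block for merge + MTF + Huffman
print(serial_makespan(gpu, cpu))     # 6.0
print(pipelined_makespan(gpu, cpu))  # 4.5
```

When the CPU stage is shorter than the GPU stage, the pipelined total is the sum of the GPU times plus one trailing CPU stage, which matches the slide's point that throughput is governed by the BWT sorts apart from the first and last block.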

SLIDE 19

Hybrid (CPU+GPU) BWC

[Figure: the All-Core Framework diagram (INPUT, FIFO WORK QUEUE of Work Items, CoSt’s with atomic dequeue and parallel execution, OUTPUT QUEUE), annotated with two configurations: "Only CPU+GPU CoSt" and "Use All CoSt’s".]

Seward’s Bzip2 used for blocks on CPU-only CoSt’s

SLIDE 20

Results: Hybrid BWC

  • Compression Ratio improves with increase in Block Size
  • GPU runtime is better with larger blocks compared to CPU

  • GPU runtime improves with perturbation, CPU runtime stays the same
  • Compressed file size increases, but reasonable till 0.1% (< state-of-the-art)

  • Runtime & compressed file size better than state-of-the-art (Bzip2, 900KB)
  • Note, CPU does much less work using 900KB blocks, GPU uses 9MB.

SLIDE 21

Results: Hybrid BWC

Speed Up and Percent Reduction in Compressed File Size

Dataset               Hybrid BWC (9MB) Speed Up      Hybrid BWC (9MB) Speed Up     % Reduction in File Size
                      vs. CPU BWC (900 KB)           vs. CPU BWC (9MB)             (with 9MB Blocks)
enwik8 (96MB)         1.297                          1.828                         8.4
wiki-xml (151MB)      1.386                          2.92                          8.1
linux (199MB)         1.047                          2.25                          4.6
silesia.tar (203MB)   1.207                          1.903                         2.9

SLIDE 22

Results: All-Core BWC

  • Using CPU CoSt’s only: 3.06x speedup
  • Using all CoSt’s (CPU and GPU): 4.87x speedup

SLIDE 23

Conclusions

  • Developed techniques for efficient use of all-core (CPU + GPU) systems
  • String sort outperforms state-of-the-art significantly, adapts to future GPUs
  • Our CPU+GPU hybrid BWC shows a speed up for the first time on BWC using GPUs
  • All-Core BWC shows improvement over using only the CPU or GPU cores for BWC
  • Our results should encourage other developers to focus on development of fast end-to-end applications

SLIDE 24

Thank you!

All codes are available for download at http://cvit.iiit.ac.in/ or https://web.engr.illinois.edu/~ardeshp2, CVIT/Personal Webpage

Please contact ardeshp2@illinois.edu or pjn@iiit.ac.in for more details

We thank the ‘Indo-Israeli Project’ by Department of Science and Technology for partial financial support for this work

Questions?
