Lightweight Compression Methods Achieving 120GBps and More

SLIDE 1

Lightweight Compression Methods Achieving 120GBps and More

Piotr Przymus

Laboratoire d’Informatique Fondamentale de Marseille Aix-Marseille University, France

GPU Technology Conference Silicon Valley May 2017

  • P. Przymus

Lightweight Compression Methods Achieving 120GBps and More 1/25

SLIDE 2
A lightweight compression library for GPU: github.com/mis-wut/feathergpu (MIT-licensed).

  • K. Kaczmarski and P. Przymus, Fixed Length Lightweight Compression for GPU Revised, Journal of Parallel and Distributed Computing, 2017.

This project was partly funded by National Science Centre, decision DEC-2012/07/D/ST6/02483.

Team

  • Krzysztof Kaczmarski: Warsaw University of Technology, Poland.
  • Piotr Przymus: Aix-Marseille University, France; Nicolaus Copernicus University in Toruń, Poland.

SLIDE 3

Lightweight compression on GPU – motivation

Lightweight compression algorithms favour compression and decompression speed over compression ratio.

Improved data transfer:

  • Disk ↔ RAM ↔ GPU.
  • GPU ↔ GPU:
      • exchange of already compressed data,
      • compress → transfer → decompress.

Lower memory footprint:

  • Less disk space used.
  • Less RAM used.
  • Less GPU memory used.

Improved internal memory access:

  • In some cases improved internal GPU memory access.

SLIDE 10

Fixed length compression

Fixed length (FL) is a simple, well-known compression scheme in which a fixed number of bits is suppressed from every value. The suppressed bits must all be equal to 0.

Figure: Original data, only 4 bits are used in each byte.

Figure: Compressed data (each byte encodes two words of length 4 bits).

$\text{compression ratio (CR)} = \dfrac{\text{uncompressed size}}{\text{compressed size}} = 2$
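The 4-bits-per-byte example can be reproduced with a minimal CPU sketch. The function names `fl_pack` and `fl_unpack` are illustrative and are not taken from the feathergpu library; the sketch uses Python's arbitrary-precision integers as a bit buffer, which is simple but not how a GPU kernel would do it.

```python
def fl_pack(values, bits):
    """Pack non-negative integers into bytes, keeping only `bits` bits of each."""
    acc = 0  # one big integer used as a bit buffer
    for i, v in enumerate(values):
        if v < 0 or v >> bits:
            raise ValueError(f"value {v} does not fit in {bits} bits")
        acc |= v << (i * bits)
    nbytes = (len(values) * bits + 7) // 8
    return acc.to_bytes(nbytes, "little")


def fl_unpack(packed, bits, count):
    """Inverse of fl_pack: extract `count` values of `bits` bits each."""
    acc = int.from_bytes(packed, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]


# Original data: one value per byte, only the low 4 bits are used.
data = [1, 2, 3, 2, 5, 15, 0, 9]
packed = fl_pack(data, 4)

# Compression ratio = uncompressed size / compressed size.
cr = len(bytes(data)) / len(packed)
```

Here each output byte holds two 4-bit words, so the 8 input bytes shrink to 4 and `cr` equals 2.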

SLIDE 11

Fixed length compression

Fixed length (FL) compression:

  • easy to implement,
  • easy to achieve high data throughput.

Many applications:

  • database compression: columns, indexes,
  • time series compression,
  • graph compression, etc.

Many variants: Patched FL, Adaptive FL, DELTA-*

SLIDE 12

Fixed length compression on GPU

Performance over flexibility (Fang et al. 2010)

  • High performance, but a highly simplified version of the algorithm.
  • Words are mapped to full bytes, e.g. a 4-bit word will be mapped to 1 byte.
  • Uses the map primitive.
  • Coalesced reads and writes: YES.
  • Direct memory access: YES.

Flexibility over performance (NVBIO and Kaczmarski, Przymus 2012-2017)

  • No simplifications, at the cost of lower performance.
  • Supports all possible bit encodings.
  • Uses the allgather or gather primitive.
  • Coalesced reads and writes: NO.
  • Direct memory access: YES.
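The cost of mapping words to full bytes can be estimated with a small calculation. This is a sketch under the assumption that the simplified variant rounds every bit width up to a whole number of bytes, as the 4-bit example suggests; both function names are illustrative.

```python
def cr_byte_aligned(word_bits, bits_needed):
    """Compression ratio when each value is rounded up to whole bytes."""
    stored_bits = 8 * ((bits_needed + 7) // 8)
    return word_bits / stored_bits


def cr_bit_packed(word_bits, bits_needed):
    """Compression ratio when each value uses exactly `bits_needed` bits."""
    return word_bits / bits_needed


# 32-bit integers that only need 4 bits each:
byte_aligned = cr_byte_aligned(32, 4)  # 4 bits stored in 1 byte
bit_packed = cr_bit_packed(32, 4)      # 4 bits stored in 4 bits
```

For this input the byte-aligned variant reaches a ratio of 4 while exact bit packing reaches 8, which is the flexibility the second approach pays for with lower throughput.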

SLIDE 13

Fixed length compression on GPU

Figure: Read pattern: GPU version of FL algorithm (input cells 0 to 1055, laid out in rows of 32).

Figure: Write pattern: GPU version of FL algorithm (output cells 0 to 95, laid out in rows of 32).

SLIDE 14

Fixed length on GPU (C+D, GTX Titan Black)

Figure: Bandwidth (GB/s) versus data size (MB). Series: int max, int min, long max, long min, int, long.

SLIDE 15

Fixed length on GPU (1 GB of data, GTX Titan Black)

Figure: Two panels, compression bandwidth (Compr. GB/s) and decompression bandwidth (Decompr. GB/s), versus bit encoding. Series: int, long.

SLIDE 16

Can we do better?

Aligned Fixed Length (AFL) algorithm.

  • The FL algorithm is optimized for the CPU memory access scheme.
  • We can do better with a GPU-friendly memory organisation scheme.

Features

  • No simplifications, high performance on GPU.
  • Still works quite well on CPU, but loses some of the cache hit benefits.
  • Supports all possible bit encodings.
  • Uses the allgather or gather primitive.
  • Coalesced reads and writes: YES.
  • Direct memory access: YES.

SLIDE 17

Aligned FL on GPU

Figure: Read pattern: GPU version of Aligned FL algorithm (threads 0 to 31 over input cells 0 to 1055, laid out in rows of 32).

Figure: Write pattern: GPU version of Aligned FL algorithm (threads 0 to 31 over output cells 0 to 95, laid out in rows of 32).
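The aligned layout can be illustrated with a CPU simulation of one warp. This is a sketch, not the feathergpu kernel: it assumes 32 lanes, 32-bit output words and 32 values per lane per block, with lane k reading values k, k+32, k+64, ... and writing its j-th output word to position j*32+k, so that in every step the 32 lanes touch 32 consecutive cells. All names are illustrative.

```python
WARP = 32        # lanes (threads) per warp
WORD_BITS = 32   # bits per output word
PER_LANE = 32    # values handled by each lane in one block (assumption)


def afl_compress_block(block, bits):
    """Compress WARP * PER_LANE values into WARP * bits interleaved words."""
    assert len(block) == WARP * PER_LANE
    out = [0] * (WARP * bits)
    for lane in range(WARP):
        acc = 0
        for i in range(PER_LANE):
            v = block[lane + i * WARP]  # strided read: lane, lane+32, ...
            if v < 0 or v >> bits:
                raise ValueError(f"value {v} does not fit in {bits} bits")
            acc |= v << (i * bits)
        for j in range(bits):
            # interleaved write: word j of every lane sits in one row of 32
            out[j * WARP + lane] = (acc >> (j * WORD_BITS)) & 0xFFFFFFFF
    return out


def afl_decompress_block(words, bits):
    """Inverse of afl_compress_block."""
    mask = (1 << bits) - 1
    block = [0] * (WARP * PER_LANE)
    for lane in range(WARP):
        acc = 0
        for j in range(bits):
            acc |= words[j * WARP + lane] << (j * WORD_BITS)
        for i in range(PER_LANE):
            block[lane + i * WARP] = (acc >> (i * bits)) & mask
    return block


block = [(7 * i) % 8 for i in range(WARP * PER_LANE)]  # values fit in 3 bits
words = afl_compress_block(block, 3)
```

With 3-bit values, the 1024 inputs of a block become 96 words, and the first row of the output holds the first word of each of the 32 lanes.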

SLIDE 18

Aligned FL on GPU (C+D, GTX Titan Black)

Figure: Bandwidth (GB/s) versus data size (MB). Series: int max, int min, long max, long min, int, long.

SLIDE 19

Aligned FL on GPU (1 GB of data, GTX Titan Black)

Figure: Two panels, compression bandwidth (Compr. GB/s) and decompression bandwidth (Decompr. GB/s), versus bit encoding. Series: int, long.

SLIDE 20

Direct memory access

Direct memory access (random access):

  • single value access with decompression on-the-fly,
  • no need for an explicit decompression step,
  • simple integration with existing algorithms.

Example applications of direct access FL:

  • Bioinformatics: NVBIO,
  • Databases: Fang et al. 2010,
  • Time series, graphs: Kaczmarski and Przymus 2012-2017.

Aligned FL supports direct access!
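Direct access means that the position of any value can be computed from its index and the bit width alone. The sketch below shows this for the plain contiguous FL layout in 32-bit words, chosen for brevity; the aligned layout needs a different index calculation. The names `fl_pack_words` and `fl_get` are illustrative.

```python
WORD_BITS = 32


def fl_pack_words(values, bits):
    """Pack values of `bits` bits each (bits <= 32) into 32-bit words."""
    nwords = (len(values) * bits + WORD_BITS - 1) // WORD_BITS
    words = [0] * nwords
    for i, v in enumerate(values):
        w, shift = divmod(i * bits, WORD_BITS)
        words[w] |= (v << shift) & 0xFFFFFFFF
        if shift + bits > WORD_BITS:  # value straddles two words
            words[w + 1] |= v >> (WORD_BITS - shift)
    return words


def fl_get(words, bits, i):
    """Read value i directly, without decompressing anything else."""
    w, shift = divmod(i * bits, WORD_BITS)
    value = words[w] >> shift
    if shift + bits > WORD_BITS:  # pick up the high part from the next word
        value |= words[w + 1] << (WORD_BITS - shift)
    return value & ((1 << bits) - 1)


values = [(i * 37) % 1000 for i in range(100)]  # every value fits in 10 bits
words = fl_pack_words(values, 10)
```

A call such as `fl_get(words, 10, 42)` touches at most two words, which is why an existing algorithm can read compressed data in place of a plain array.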

SLIDE 21

Direct access AFL on GPU (C+D, GTX Titan Black)

Figure: Bandwidth (GB/s) versus data size (MB). Series: int max, int min, long max, long min, int, long.

SLIDE 22

Initial transformations

An initial data transformation is often applied in order to minimize the number of bits used.

Frame of reference (FOR):

  • Subtracts a given value $f$ from all values: $\{v_0, v_1, \dots, v_n\} \to \{v_0 - f, v_1 - f, \dots, v_n - f\}$.
  • Straightforward and simple integration with AFL.

DELTA:

  • Transforms data into the differences between successive values: $\{v_0, v_1, \dots, v_n\} \to \{v_1 - v_0, v_2 - v_1, \dots, v_n - v_{n-1}\}$.
  • Integration with the AFL algorithm is not that simple:
      • Coalesced memory reads and writes require a different scheme of data reads.
      • Threads operate on the data subsequence $\{a_k, a_{k+32}, a_{k+64}, \dots\}$.
  • Solution: inter-thread communication within a warp using the shuffle instruction, available starting from the Kepler architecture.
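Both transformations, and the reason DELTA needs inter-thread communication, can be sketched on the CPU. Two assumptions are made here that are not in the slides: the first value is kept so that DELTA can be inverted, and the warp shuffle is emulated by shifting a list of 32 per-lane registers by one position. All function names are illustrative.

```python
WARP = 32


def for_encode(values, f):
    """Frame of reference: subtract the reference value f from every value."""
    return [v - f for v in values]


def for_decode(encoded, f):
    return [e + f for e in encoded]


def delta_encode(values):
    """Differences between successive values; values[0] is kept as-is."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def delta_encode_warp(values):
    """DELTA as a warp would compute it: lane k owns values k, k+32, k+64, ..."""
    assert len(values) % WARP == 0
    out = [0] * len(values)
    carry = 0  # value held by lane 31 in the previous row (0 before row 0)
    for row in range(0, len(values), WARP):
        regs = values[row:row + WARP]  # one register per lane
        # emulated shuffle-up by one lane: lane k receives lane k-1's register
        prev = [carry] + regs[:-1]
        for lane in range(WARP):
            out[row + lane] = regs[lane] - prev[lane]
        carry = regs[-1]
    return out


vals = [3 * i + (i % 5) for i in range(2 * WARP)]
```

The predecessor of a value owned by lane k lives in lane k-1, not in lane k's own subsequence, so each lane has to obtain a neighbour's register before it can subtract.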

SLIDE 23

DELTA Aligned FL on GPU (C+D, GTX Titan Black)

Figure: Bandwidth (GB/s) versus data size (MB). Series: int max, int min, long max, long min, int, long.

SLIDE 24

Dealing with the data outliers

FL and AFL algorithms are sensitive to data outliers. Example:

  • Input data: {1, 2, 3, 2, 2, 3, 1, 1, 3, 2, 3, 1, 1}. A 2-bit FL (or AFL) encoding may be used.
  • Input data: {1, 2, 3, 2, 2, 3, 1, 1, 64, 2, 3, 1, 1}. A single outlier forces a 7-bit FL (or AFL) encoding, since 6 bits only cover values up to 63.

Solution: outlier-aware algorithms

  • Patched FL and Patched Aligned FL
  • Adaptive FL and Adaptive Aligned FL
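The effect of a single outlier on the bit width can be checked directly. A minimal sketch, assuming non-negative integers; `bits_needed` is an illustrative helper, not a library function. Note that 64 itself needs 7 bits, because 6 bits only cover values up to 63.

```python
def bits_needed(values):
    """Smallest bit width that can represent every value in the input."""
    return max(1, max(values).bit_length())


clean = [1, 2, 3, 2, 2, 3, 1, 1, 3, 2, 3, 1, 1]
with_outlier = [1, 2, 3, 2, 2, 3, 1, 1, 64, 2, 3, 1, 1]

clean_bits = bits_needed(clean) * len(clean)
outlier_bits = bits_needed(with_outlier) * len(with_outlier)
```

One large value out of thirteen raises the payload from 26 bits to 91 bits, which is the waste that the patched and adaptive variants try to avoid.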

SLIDE 25

Patched Aligned FL

  • Patched FL: Zukowski et al. 2006, for CPU.
  • Patched FL (new memory organisation): Yan et al. 2009, for CPU.
  • Patched FL on GPU: Kaczmarski and Przymus 2012, 2013.

Structure: compressed data, exception positions, exception values.

Patched Aligned FL:

  • One step compression:
      • threads gather outliers in local memory buffers,
      • buffer overflow is managed per warp (voting of threads).
  • Two step decompression:
      • step 1: AFL decompression,
      • step 2: exceptions extraction.
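The three-part structure (compressed data, exception positions, exception values) and the two-step decompression can be sketched as follows. This is a simplified CPU illustration, not the GPU algorithm: the main stream is left as a plain list that would afterwards be packed with FL or AFL, exceptions are replaced by a placeholder 0, and the per-thread buffers and warp voting are not modelled. Function names are illustrative.

```python
def patched_encode(values, bits):
    """Split values into a low-width main stream plus a list of exceptions."""
    limit = 1 << bits
    base, positions, exceptions = [], [], []
    for i, v in enumerate(values):
        if v < limit:
            base.append(v)
        else:
            base.append(0)        # placeholder kept in the main stream
            positions.append(i)   # where the exception belongs
            exceptions.append(v)  # the original, uncompressed value
    return base, positions, exceptions


def patched_decode(base, positions, exceptions):
    out = list(base)                         # step 1: decompress main stream
    for p, v in zip(positions, exceptions):  # step 2: patch in the exceptions
        out[p] = v
    return out


data = [1, 2, 3, 2, 2, 3, 1, 1, 64, 2, 3, 1, 1]
base, positions, exceptions = patched_encode(data, 2)
```

For this input the main stream fits in 2 bits per value and the outlier 64 is stored once, together with its position 8.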

SLIDE 26

Patched Aligned FL on GPU (C+D, GTX Titan Black)

Figure: Bandwidth (GB/s) versus data size (MB). Series: int optim., int pessim., long optim., long pessim., int, long.

SLIDE 27

Adaptive FL

First introduced by Delbru et al. 2010 for CPU:

  • processes data in chunks,
  • sets a different bit length encoding for each chunk.

Adaptive FL does not fit the GPU memory model very well.

Adaptive Aligned FL:

  • Aligned FL + new organisation of compressed chunks.
  • Bit length encoding is established per warp.
  • Compressed chunks are ordered according to warp completion time. Note that warps may finish in a different order than the data order.
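Choosing the bit width per chunk can be sketched in a few lines. The chunk size of 32 and the function names are assumptions made for the illustration; the chunk payloads are left unpacked so that only the bit-width decision is shown.

```python
def bits_needed(values):
    """Smallest bit width that can represent every value in the input."""
    return max(1, max(values).bit_length())


def adaptive_encode(values, chunk_size=32):
    """Return (bit width, chunk) pairs, one bit width per chunk."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        chunks.append((bits_needed(chunk), chunk))
    return chunks


def adaptive_size_bits(chunks):
    """Payload size when every chunk is packed with its own bit width."""
    return sum(bits * len(chunk) for bits, chunk in chunks)


def fixed_size_bits(values):
    """Payload size when one global bit width is used for all values."""
    return bits_needed(values) * len(values)


values = [1] * 32 + [64] + [1] * 31  # a single outlier in the second chunk
chunks = adaptive_encode(values)
```

Only the chunk containing the outlier pays for the wider encoding: the payload is 256 bits instead of the 448 bits a single global width would need (per-chunk metadata not counted).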

SLIDE 28

Adaptive Aligned FL on GPU

Figure: Adaptive aligned memory organisation: main compression array.

Figure: Adaptive aligned memory organisation: warp offset index.
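The role of the warp offset index can be illustrated with a small sketch: chunks land in the main array in the order in which warps finish, and the index records where each chunk starts so that it can still be found by its logical position. Chunk contents, the completion order and the function names are made up for the example.

```python
def store_in_completion_order(chunks, completion_order):
    """Append chunks as warps finish; record each chunk's start offset."""
    storage = []                    # main compression array
    offsets = [None] * len(chunks)  # warp offset index: chunk id -> offset
    for chunk_id in completion_order:
        offsets[chunk_id] = len(storage)
        storage.extend(chunks[chunk_id])
    return storage, offsets


def read_chunk(storage, offsets, lengths, chunk_id):
    """Fetch a chunk by its logical id using the offset index."""
    start = offsets[chunk_id]
    return storage[start:start + lengths[chunk_id]]


chunks = [[10, 11], [20], [30, 31, 32]]  # compressed chunks of varying length
lengths = [len(c) for c in chunks]
storage, offsets = store_in_completion_order(chunks, completion_order=[2, 0, 1])
```

In this run the third chunk finishes first and is stored at offset 0, yet `read_chunk(storage, offsets, lengths, 0)` still returns the first chunk.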

SLIDE 29

Adaptive Aligned FL on GPU (C+D, GTX Titan Black)

Figure: Bandwidth (GB/s) versus data size (MB). Series: int optim., int pessim., long optim., long pessim., int, long.

SLIDE 30

TPC-H Data Benchmark

Column      Org. / used type  Sort.  Size [MB]  Alg.        CR      Comp. [GB/s]  Decomp. [GB/s]
l_discount  dec. / int(4)     N      239        FL          8.00    52.7137       34.0756
                                                AFL         8.00    211.8846      181.8800
                                                PAFL        8.00    200.6869      72.8897
                                                AAFL        30.79   101.4848      125.6707
l_quantity  int(4) / int(4)   N      239        FL          2.46    45.4952       32.0122
                                                AFL         2.46    154.2067      141.1335
                                                PAFL        2.46    153.2897      65.5559
                                                AAFL        9.72    81.5276       105.6029
l_partkey   id / int(4)       N      239        FL          1.52    37.7608       29.4157
                                                AFL         1.52    125.7666      120.2389
                                                PAFL        1.52    125.0484      60.6303
                                                AAFL        6.05    72.4169       93.3693
l_shipdate  date / int(4)     Y      239        FL          1.06    28.2662       26.2429
                                                DELTA-AFL   1.88    136.6975      127.8505
                                                DELTA-PAFL  31.91   205.3076      55.5019
                                                DELTA-AAFL  126.31  210.5169      209.3501

SLIDE 31

A lightweight compression library for GPU: github.com/mis-wut/feathergpu (MIT-licensed).

Supported algorithms:

  • FL, FOR, Adaptive FL, Patched AFL
  • Aligned FL, Aligned FOR, Adaptive Aligned FL, Patched Aligned AFL

GTX Titan Black:

  • Compression up to 250GB/s
  • Decompression up to 250GB/s

Tesla P100 16 GB (New!):

  • Compression up to 550GB/s
  • Decompression up to 450GB/s

  • K. Kaczmarski and P. Przymus, Fixed Length Lightweight Compression for GPU Revised, Journal of Parallel and Distributed Computing, 2017.

SLIDE 32

Appendix

SLIDE 33

Aligned FL on GPU (C+D, Tesla P100 16 GB)

Figure: Bandwidth (GB/s) versus data size (MB). Series: int max, int min, long max, long min, int, long.

SLIDE 34

Aligned FL on GPU (1 GB of data, Tesla P100 16 GB)

Figure: Two panels, compression bandwidth (Compr. GB/s) and decompression bandwidth (Decompr. GB/s), versus bit encoding. Series: int, long.

SLIDE 35

Direct access Aligned FL on GPU

Figure: Two panels, compression bandwidth (Compr. GB/s) and decompression bandwidth (Decompr. GB/s), versus bit encoding. Series: int, long.

SLIDE 36

DELTA Aligned FL on GPU

Figure: Two panels, compression bandwidth (Compr. GB/s) and decompression bandwidth (Decompr. GB/s), versus bit encoding. Series: int, long.

SLIDE 37

Patched Aligned FL on GPU (compression)

Figure: Bandwidth (GB/s) versus data size (MB). Series: int optim., int pessim., long optim., long pessim., int, long.

SLIDE 38

Patched Aligned FL on GPU (decompression)

Figure: Bandwidth (GB/s) versus data size (MB). Series: int optim., int pessim., long optim., long pessim., int, long.

SLIDE 39

Adaptive Aligned FL on GPU (compression)

Figure: Bandwidth (GB/s) versus data size (MB). Series: int optim., int pessim., long optim., long pessim., int, long.

SLIDE 40

Adaptive Aligned FL on GPU (decompression)

Figure: Bandwidth (GB/s) versus data size (MB). Series: int optim., int pessim., long optim., long pessim., int, long.
