Gzip Compression Using Altera OpenCL Mohamed Abdelfattah (University - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Gzip Compression Using Altera OpenCL

Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh

slide-2
SLIDE 2

Gzip

  • Widely-used lossless compression program
  • Gzip = LZ77 + Huffman
  • Big data needs fast compression
      • Lower disk space in data centers
      • Less power on communication networks

Gigabyte-per-second

slide-3
SLIDE 3

LZ77 Compression Example

This sentence is an easy sentence to compress.

  • 1. Scan file byte by byte
  • 2. Look for matches
  • 3. Replace with a reference to previous occurrence
slide-9
SLIDE 9

LZ77 Compression Example

This sentence is an easy sentence to compress.

  • 1. Scan file byte by byte
  • 2. Look for matches
      • Match length
      • Match offset
  • 3. Replace with a reference to previous occurrence
slide-10
SLIDE 10

LZ77 Compression Example

This sentence is an easy sentence to compress.

  • 1. Scan file byte by byte
  • 2. Look for matches
      • Match length = 2
      • Match offset
  • 3. Replace with a reference to previous occurrence
slide-11
SLIDE 11

LZ77 Compression Example

This sentence is an easy sentence to compress.

  • 1. Scan file byte by byte
  • 2. Look for matches
      • Match length = 3
      • Match offset
  • 3. Replace with a reference to previous occurrence
slide-12
SLIDE 12

LZ77 Compression Example

This sentence is an easy sentence to compress.

  • 1. Scan file byte by byte
  • 2. Look for matches
      • Match length = 8
      • Match offset
  • 3. Replace with a reference to previous occurrence

Match offset = 20 bytes

slide-13
SLIDE 13

LZ77 Compression Example

This sentence is an easy sentence to compress.

  • 1. Scan file byte by byte
  • 2. Look for matches
      • Match length = 8
      • Match offset = 20
  • 3. Replace with a reference to previous occurrence

Match offset = 20 bytes

slide-14
SLIDE 14

LZ77 Compression Example

This sentence is an easy @(8,20) to compress.

  • 1. Scan file byte by byte
  • 2. Look for matches
      • Match length = 8
      • Match offset = 20
  • 3. Replace with a reference to previous occurrence
      • Marker, length, offset
slide-15
SLIDE 15

LZ77 Compression Example

Before: This sentence is an easy sentence to compress.
After:  This sentence is an easy @(8,20) to compress.

  • 1. Scan file byte by byte
  • 2. Look for matches
      • Match length = 8
      • Match offset = 20
  • 3. Replace with a reference to previous occurrence
      • Marker, length, offset

Saved 5 bytes!
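The three steps above can be sketched as a brute-force match search. This is only an illustration of the LZ77 idea, not the authors' hardware design; the names `lz_match` and `longest_match` are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* A match: how many bytes repeat, and how far back the earlier copy starts.
 * (Hypothetical type, introduced only for this illustration.) */
typedef struct {
    size_t length;
    size_t offset;
} lz_match;

/* Brute-force search: compare the text at `pos` against every earlier
 * starting point and keep the longest run of equal bytes. */
lz_match longest_match(const char *buf, size_t n, size_t pos)
{
    assert(buf != NULL && pos <= n);
    lz_match best = {0, 0};
    for (size_t start = 0; start < pos; start++) {
        size_t len = 0;
        while (pos + len < n && buf[start + len] == buf[pos + len])
            len++;
        if (len > best.length) {
            best.length = len;
            best.offset = pos - start;
        }
    }
    return best;
}
```

Called on the example sentence at position 25 (the second "sentence"), this search reports offset 20, as on the slides. It reports length 9 rather than 8, because the space after the word also matches; the slides count only the word itself.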

slide-16
SLIDE 16

Altera OpenCL Compiler for FPGAs

OpenCL Single-threaded Code

    kernel void simple(global int *input, int size, global int *output)
    {
        // Loop bounds follow the slide's "i = 1..size"
        for (int i = 1; i <= size; i++) {
            int x = input[i];        // Load x
            int y = input[i + 1];    // Load y
            int z = x + y;
            output[i] = z;           // Store z
        }
    }

Host Code

    //host code
    //Enqueue buffer
    //Enqueue Kernel(s)
    //dequeue buffers
    …

[Diagram labels: Host CPU, PCIe, FPGA Accelerator (Load x, Load y, Store z), DDRx Memory, Altera’s OpenCL Compiler]

slide-20
SLIDE 20

FPGAs can be VERY Custom

[Diagram labels: Host CPU, PCIe, FPGA Accelerator (Load x, Load y, Store z), DDRx Memory]

  • IO Channels
  • Different memory types (QDR? RDL?)
  • ARM Host on FPGA chip

slide-21
SLIDE 21

Implementation Overview

  • 1. Shift In New Data
  • 2. Dictionary Lookup/Update
  • 3. Match Search & Filtering
  • 4. Write to Output
slide-22
SLIDE 22
1. Shift In New Data

[Diagram labels: Current Window, Input from DDR memory]

slide-23
SLIDE 23
1. Shift In New Data

Input, e.g.: sample_text

Current Window: o l d _ t e x t

Cycle boundary

slide-24
SLIDE 24
1. Shift In New Data

Input, e.g.: sample_text

Current Window: o l d _ t e x t

Cycle boundary

VEC = 4

We use text in our example, but the input can be anything.

slide-25
SLIDE 25
1. Shift In New Data

Input, e.g.: sample_text

Current Window: t e x t

Cycle boundary

slide-26
SLIDE 26
1. Shift In New Data

Input, e.g.: le_text

Current Window: t e x t s a m p

Cycle boundary
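The shift-in step can be sketched in plain C as a window of 2 × VEC bytes. Each cycle it drops its oldest VEC bytes and loads VEC new ones. The function name `shift_in` and the 2 × VEC window size are assumptions read off the slide example (an 8-byte window with VEC = 4), not the authors' kernel code.

```c
#include <assert.h>
#include <string.h>

#define VEC 4  /* bytes consumed per cycle; the slide example uses VEC = 4 */

/* Advance the current window by one cycle: the newer half becomes the
 * older half, and VEC fresh input bytes fill the newer half. */
void shift_in(unsigned char window[2 * VEC], const unsigned char next[VEC])
{
    assert(window != NULL && next != NULL);
    memmove(window, window + VEC, VEC);  /* discard the oldest VEC bytes */
    memcpy(window + VEC, next, VEC);     /* load VEC new bytes from the input */
}
```

Starting from the window "old_text" and shifting in "samp" gives "textsamp", which matches the progression shown on the slides.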

slide-28
SLIDE 28

2. Dictionary Lookup/Update

Current Window: t e x t s a m p
Substrings: t e x t, e x t s, x t s a, t s a m

  • 1. Compute hash
  • 2. Look for match in 4 dictionaries
  • 3. Update dictionaries

Dictionaries: Dictionary 0, Dictionary 1, Dictionary 2, Dictionary 3

Dictionaries buffer the text that we have already processed, e.g.:

slide-29
SLIDE 29
2. Dictionary Lookup/Update

Current Window: t e x t s a m p
Substrings: t e x t, e x t s, x t s a, t s a m

Dictionaries: Dictionary 0, Dictionary 1, Dictionary 2, Dictionary 3

Hash lookup results:

  • t e x t: t a n _, t e x t, t e x l, t e e n

slide-30
SLIDE 30
2. Dictionary Lookup/Update

Current Window: t e x t s a m p
Substrings: t e x t, e x t s, x t s a, t s a m

Dictionaries: Dictionary 0, Dictionary 1, Dictionary 2, Dictionary 3

Hash lookup results:

  • t e x t: t a n _, t e x t, t e x l, t e e n
  • e x t s: e a t e, e a r s, e e p s, e n t e

slide-31
SLIDE 31
2. Dictionary Lookup/Update

Current Window: t e x t s a m p
Substrings: t e x t, e x t s, x t s a, t s a m

Dictionaries: Dictionary 0, Dictionary 1, Dictionary 2, Dictionary 3

Hash lookup results:

  • t e x t: t a n _, t e x t, t e x l, t e e n
  • e x t s: e a t e, e a r s, e e p s, e n t e
  • x t s a: x a n t, x y l o, x e l y, x i r t

slide-32
SLIDE 32
2. Dictionary Lookup/Update

Current Window: t e x t s a m p
Substrings: t e x t, e x t s, x t s a, t s a m

Dictionaries: Dictionary 0, Dictionary 1, Dictionary 2, Dictionary 3

Hash lookup results:

  • t e x t: t a n _, t e x t, t e x l, t e e n
  • e x t s: e a t e, e a r s, e e p s, e n t e
  • x t s a: x a n t, x y l o, x e l y, x i r t
  • t s a m: t e e n, t e a l, t a n _, t a m e

Possible matches from history (dictionaries)

slide-33
SLIDE 33
2. Dictionary Lookup/Update

Current Window: t e x t s a m p
Substrings: t e x t, e x t s, x t s a, t s a m

Hash

Dictionaries: Dictionary 0, Dictionary 1, Dictionary 2, Dictionary 3

slide-34
SLIDE 34
2. Dictionary Lookup/Update

Current Window: t e x t s a m p

Dictionary      Write port    Read ports
Dictionary 0    W0            RD00, RD01, RD02, RD03
Dictionary 1    W1            RD10, RD11, RD12, RD13
Dictionary 2    W2            RD20, RD21, RD22, RD23
Dictionary 3    W3            RD30, RD31, RD32, RD33

Generate exactly the number of read/write ports that we need, and the width.

[Figure: substring t e x t with the candidates t a n _, t e x t, t e x l, t e e n]

256 read ports, 16 write ports; 128 bits wide
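A software sketch of one lookup/update cycle is given below. It assumes that substring i reads one candidate from every dictionary and is then written into dictionary i, which is how the port names W0..W3 and RD00..RD33 read. The hash function, `DICT_SIZE`, and the `dict_entry` layout are hypothetical choices for this illustration, not the authors' design.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VEC 4            /* substrings per cycle, as in the slide example */
#define DICT_SIZE 1024   /* assumed number of entries per dictionary */

typedef struct {
    uint8_t  text[VEC];  /* bytes stored for the later comparison step */
    uint32_t pos;        /* input position where those bytes started */
    int      valid;      /* 0 until the slot has been written */
} dict_entry;

static dict_entry dict[VEC][DICT_SIZE];

/* Hypothetical hash of one VEC-byte substring. */
static unsigned hash_substring(const uint8_t *s)
{
    unsigned h = 0;
    for (int k = 0; k < VEC; k++)
        h = (h << 2) ^ s[k];
    return h & (DICT_SIZE - 1);
}

/* One cycle: hash the VEC substrings of the window, read VEC candidates
 * for each of them (one per dictionary), then store substring i in
 * dictionary i. All reads happen before any write. */
void dict_lookup_update(const uint8_t window[2 * VEC], uint32_t pos,
                        dict_entry cand[VEC][VEC])
{
    unsigned h[VEC];
    assert(window != NULL && cand != NULL);
    for (int i = 0; i < VEC; i++)
        h[i] = hash_substring(window + i);
    for (int i = 0; i < VEC; i++)          /* VEC * VEC reads */
        for (int j = 0; j < VEC; j++)
            cand[i][j] = dict[j][h[i]];
    for (int i = 0; i < VEC; i++) {        /* VEC writes */
        dict_entry *e = &dict[i][h[i]];
        memcpy(e->text, window + i, VEC);
        e->pos = pos + (uint32_t)i;
        e->valid = 1;
    }
}
```

With this structure each dictionary needs VEC read ports and one write port. For VEC = 16 that gives 16 × 16 = 256 reads and 16 writes per cycle, consistent with the port counts on the slide.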

slide-36
SLIDE 36
3. Match Search & Filtering

Current Windows (the substrings), each with its Comparison Windows (a set of candidate matches for each incoming substring):

  • t e x t: t a n _, t e x t, t e x l, t e e n
  • e x t s: e a t e, e a r s, e e p s, e n t e
  • x t s a: x a n t, x y l o, x e l y, x i r t
  • t s a m: t e e n, t e a l, t a n _, t a m e

Compare each current window against each of its 4 comparison windows.

slide-37
SLIDE 37
3. Match Search & Filtering

Current Window: t e x t

Comparison Window    Match Length
t a n _              1
t e x t              4
t e x l              3
t e e n              2

Comparators: compare each byte. We have another 3 of those.

slide-38
SLIDE 38
3. Match Search & Filtering

Current Window: t e x t

Comparison Window    Match Length
t a n _              1
t e x t              4
t e x l              3
t e e n              2

Comparators, then Match Reduction

Best Length: 4
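The comparators and the match reduction can be sketched as two loops with fixed bounds. The function names `match_length` and `best_length` are made up for this sketch; it only illustrates the technique and is not the kernel code from the talk.

```c
#include <assert.h>
#include <stddef.h>

#define VEC 4  /* bytes per window in the slide example */

/* Comparators: count how many leading bytes of the two windows agree.
 * The loop bound is a compile-time constant, so it can be fully unrolled. */
int match_length(const unsigned char cur[VEC], const unsigned char cmp[VEC])
{
    int len = 0;
    int still_matching = 1;
    for (int k = 0; k < VEC; k++) {
        still_matching = still_matching && (cur[k] == cmp[k]);
        len += still_matching;
    }
    return len;
}

/* Match reduction: keep the longest match among the VEC comparison windows
 * and report which window produced it. */
int best_length(const unsigned char cur[VEC],
                const unsigned char cmp[VEC][VEC], int *best_index)
{
    assert(best_index != NULL);
    int best = 0;
    *best_index = 0;
    for (int j = 0; j < VEC; j++) {
        int len = match_length(cur, cmp[j]);
        if (len > best) {
            best = len;
            *best_index = j;
        }
    }
    return best;
}
```

For the current window "text" and the comparison windows "tan_", "text", "texl", "teen", the lengths are 1, 4, 3, 2 and the best length is 4, as on the slide.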

slide-42
SLIDE 42
3. Match Search & Filtering

Typical C-code
Fixed loop bounds: compiler can unroll loop

slide-43
SLIDE 43
3. Match Search & Filtering

One bestlength associated with each current_window

[Figure: the substrings t e x t, e x t s, x t s a, t s a m of the window t e x t s a m p, each annotated with its best length]

slide-44
SLIDE 44
3. Match Search & Filtering

[Figure: window t e x t s a m p with the cycle boundary marked, the best lengths 3, 1, 3, 4, and the corresponding candidate matches]

Select the best combination of matches from the set of candidate matches:

  • 1. Remove matches that are longer when encoded than the original
  • 2. From the remaining set, select the best ones (heuristic for bin-packing): last-fit
  • 3. Compute the “first valid position” for the next step

slide-45
SLIDE 45
3. Match Search & Filtering

[Figure: window t e x t s a m p with the cycle boundary marked, the best lengths 3, 1, 3, 4, and the candidate matches labelled “Too short”, “Overlap”, and “Last-fit” (twice)]

Select the best combination of matches from the set of candidate matches:

  • 1. Remove matches that are longer when encoded than the original
  • 2. From the remaining set, select the best ones (heuristic for bin-packing): last-fit
  • 3. Compute the “first valid position” for the next step

slide-46
SLIDE 46
3. Match Search & Filtering

[Figure: window t e x t s a m p with the cycle boundary marked; the “Too short” and “Overlap” matches are being removed, and the two “Last-fit” matches remain]

Select the best combination of matches from the set of candidate matches:

  • 1. Remove matches that are longer when encoded than the original
  • 2. From the remaining set, select the best ones (heuristic for bin-packing): last-fit
  • 3. Compute the “first valid position” for the next step

slide-47
SLIDE 47
3. Match Search & Filtering

[Figure: window t e x t s a m p with the cycle boundary marked, the two matches selected by last-fit, and the first valid position in the next cycle]

Select the best combination of matches from the set of candidate matches:

  • 1. Remove matches that are longer when encoded than the original
  • 2. From the remaining set, select the best ones (heuristic for bin-packing): last-fit
  • 3. Compute the “first valid position” for the next step
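One possible reading of the three filtering steps is sketched below. It assumes that "last-fit" means visiting positions from last to first and keeping a match whenever it does not overlap one already kept. The threshold `MIN_MATCH` of 3 bytes is also an assumption, based on the smallest encoded reference being 3 bytes. The function name `filter_matches` is made up, and the authors' actual heuristic may differ in detail.

```c
#include <assert.h>
#include <limits.h>

#define VEC 4
#define MIN_MATCH 3  /* assumed: shorter matches cost more encoded than raw */

/* Select non-overlapping matches for one cycle.
 *   best_len[i]  best match length starting at position i of this cycle
 *   first_valid  first position not already covered by the previous cycle
 *   keep[i]      set to 1 if the match at position i is selected
 * Returns the first valid position for the next cycle. */
int filter_matches(const int best_len[VEC], int first_valid, int keep[VEC])
{
    assert(first_valid >= 0 && first_valid <= VEC);
    int next_start = INT_MAX;   /* start of the nearest later match kept */
    int next_first_valid = 0;
    for (int i = VEC - 1; i >= 0; i--) {
        keep[i] = 0;
        if (i < first_valid)
            continue;                       /* covered by an earlier match */
        if (best_len[i] < MIN_MATCH)
            continue;                       /* too short */
        if (i + best_len[i] > next_start)
            continue;                       /* overlaps a kept match */
        keep[i] = 1;
        /* The last kept match decides how far we spill into the next cycle. */
        if (next_start == INT_MAX && i + best_len[i] > VEC)
            next_first_valid = i + best_len[i] - VEC;
        next_start = i;
    }
    return next_first_valid;
}
```

With best lengths 3, 1, 3, 4 (which appears to be the slide's example) the sketch keeps the matches at positions 0 and 3, drops position 1 as too short and position 2 as overlapping, and returns 3 as the first valid position of the next cycle.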

slide-49
SLIDE 49
4. Writing to Output

  • Marker, length, offset
      • Length is limited by VEC (=16 in our case): fits in 4 bits
      • Offset is limited by 0x40000 (doesn’t make sense to be more): fits in 21 bits
  • Use either 3 or 4 bytes for this:
      • Offset < 2048: 3 bytes
      • Offset = 2048 .. 262144: 4 bytes

[Figure: byte layouts]
  3 bytes: MARKER | LENGTH, OFFSET | OFFSET
  4 bytes: MARKER | LENGTH, OFFSET | OFFSET | OFFSET
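A sketch of how such a reference might be packed is given below. The slides give the field order and the 3-byte/4-byte split but not the exact bit positions, so the layout here is an assumption: the marker value `0xFF`, storing length minus one in 4 bits, and using one flag bit to select the 4-byte form are all hypothetical choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MARKER             0xFFu     /* hypothetical marker byte */
#define SHORT_OFFSET_LIMIT 2048u     /* offsets below this use 3 bytes */
#define MAX_OFFSET         0x40000u  /* largest offset on the slide (262144) */

/* Pack (length, offset) into out[]; returns the number of bytes used.
 * Assumed layout (not confirmed by the slides):
 *   byte 0: MARKER
 *   byte 1: LLLL F OOO   L = length - 1, F = 1 for the 4-byte form,
 *                        O = top 3 offset bits
 *   byte 2 (and 3): remaining offset bits, most significant first */
size_t encode_match(unsigned length, uint32_t offset, uint8_t out[4])
{
    assert(out != NULL);
    assert(length >= 1 && length <= 16);
    assert(offset <= MAX_OFFSET);
    unsigned len_bits = length - 1;          /* 1..16 stored as 0..15 */
    out[0] = (uint8_t)MARKER;
    if (offset < SHORT_OFFSET_LIMIT) {       /* 11 offset bits in total */
        out[1] = (uint8_t)((len_bits << 4) | (offset >> 8));
        out[2] = (uint8_t)(offset & 0xFFu);
        return 3;
    }
    /* 19 offset bits in total */
    out[1] = (uint8_t)((len_bits << 4) | 0x08u | (offset >> 16));
    out[2] = (uint8_t)((offset >> 8) & 0xFFu);
    out[3] = (uint8_t)(offset & 0xFFu);
    return 4;
}
```

For the earlier example, `encode_match(8, 20, out)` uses 3 bytes, which is consistent with the 8-byte word "sentence" shrinking by 5 bytes.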

slide-50
SLIDE 50

Results


slide-51
SLIDE 51

Comparison against CPU/Verilog – Best Gzips out there!

slide-52
SLIDE 52

Comparison against CPU/Verilog

  • Best implementation of Gzip on CPU
  • By Intel corporation
  • On Intel Core i5 (32nm) processor
  • 2013
  • Compression Speed: 338 MB/s
  • Compression ratio: 2.18X
slide-53
SLIDE 53

Comparison against CPU/Verilog

  • Best implementation on ASICs
  • AHA products group
  • Coming up Q2 2014
  • Compression Speed: 2.5 GB/s
slide-54
SLIDE 54

Comparison against CPU/Verilog

  • Best implementation on FPGAs
  • Verilog
  • IBM Corporation
  • Nov. 2013 ICCAD
  • Altera Stratix-V A7
  • Compression Speed: 3 GB/s
slide-55
SLIDE 55

Comparison against CPU/Verilog

  • OpenCL design example
  • Altera Stratix-V A7
  • Developed in 1 month
  • Compression speed ?
  • Compression Ratio ?
slide-56
SLIDE 56

Comparison against CPU/Verilog

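Implementation                          Compression speed
OpenCL (Altera Stratix-V A7)            2.7 GB/s
Verilog, IBM (Altera Stratix-V A7)      3 GB/s
ASIC (AHA products group)               2.5 GB/s
CPU (Intel Core i5)                     0.3 GB/s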

slide-57
SLIDE 57

Comparison against CPU

  • Same compression ratio
  • 12X better performance/Watt

slide-58
SLIDE 58

Comparison against Verilog

  • 12% more resources
  • 10% slower
  • Much lower design effort and design time: days instead of months

slide-59
SLIDE 59

Thank You