SLIDE 1

Background Semantically-informed byte-level compression User-level semantic compression

Compressing Intermediate Keys between Mappers and Reducers in SciHadoop

Adam Crume, Joe Buck, Carlos Maltzahn, Scott Brandt {adamcrume,buck,carlosm,scott}@cs.ucsc.edu

University of California, Santa Cruz

November 12, 2012

Adam Crume, Joe Buck, Carlos Maltzahn, Scott Brandt Compressing Intermediate Keys in SciHadoop 1 / 22

SLIDE 2

Outline

1. Background
2. Semantically-informed byte-level compression
3. User-level semantic compression

SLIDE 3

MapReduce overview

[Diagram: Input → Mappers (five shown) → Reducers (two shown) → Output (one per reducer), coordinated by a Scheduler]

SLIDE 4

Hadoop internal data flow

[Diagram: Disk → Mapper → Combiner → Disk → (network transfer) → Sort → Reducer]

SLIDE 5

Array key/value pairs

A 4 × 4 array holding the values 0–15, and the key/value pair for each cell:

(0, 0) → 0    (0, 1) → 1    (0, 2) → 2    (0, 3) → 3
(1, 0) → 4    (1, 1) → 5    (1, 2) → 6    (1, 3) → 7
(2, 0) → 8    (2, 1) → 9    (2, 2) → 10   (2, 3) → 11
(3, 0) → 12   (3, 1) → 13   (3, 2) → 14   (3, 3) → 15
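The mapping on this slide can be reproduced with a short sketch. It assumes row-major ordering (the last coordinate varies fastest), which is what the listed pairs suggest; the function name `array_to_pairs` is illustrative and not taken from SciHadoop.

```python
from itertools import product

def array_to_pairs(shape):
    """Yield (key, value) pairs for an array holding 0, 1, 2, ... in row-major order."""
    coordinates = product(*(range(n) for n in shape))
    for value, key in enumerate(coordinates):
        yield key, value

pairs = list(array_to_pairs((4, 4)))
for key, value in pairs:
    print(f"{key} -> {value}")
```

Every cell carries its full coordinate tuple as a key, so the keys can easily take more space than the values they address.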

SLIDE 6

Outline

1. Background
2. Semantically-informed byte-level compression
3. User-level semantic compression

SLIDE 7

Linear sequences

00000000  14 04 00 00 00 0d 00 00 00 03 00 00 00 00 00 00
00000010  00 00 00 00 00 01 c2 11 37 34 14 04 00 00 00 0d
00000020  00 00 00 03 00 00 00 00 00 00 00 01 00 00 00 01
00000030  9c 65 aa 33 14 04 00 00 00 0d 00 00 00 03 00 00
00000040  00 00 00 00 00 02 00 00 00 01 8d fc 61 b2 14 04
00000050  00 00 00 0d 00 00 00 03 00 00 00 00 00 00 00 03
00000060  00 00 00 01 f9 3c 62 ab 14 04 00 00 00 0d 00 00
00000070  00 03 00 00 00 00 00 00 00 04 00 00 00 01 a4 ba
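To make the "linear sequences" idea concrete, the sketch below checks which byte positions in this dump form an arithmetic sequence when sampled at a fixed stride. The stride of 26 is an assumption read off the dump (the bytes `14 04` recur every 26 bytes); the helper names are illustrative, not the authors'.

```python
# Bytes copied from the hex dump on this slide (offset column omitted).
DUMP = """
14 04 00 00 00 0d 00 00 00 03 00 00 00 00 00 00
00 00 00 00 00 01 c2 11 37 34 14 04 00 00 00 0d
00 00 00 03 00 00 00 00 00 00 00 01 00 00 00 01
9c 65 aa 33 14 04 00 00 00 0d 00 00 00 03 00 00
00 00 00 00 00 02 00 00 00 01 8d fc 61 b2 14 04
00 00 00 0d 00 00 00 03 00 00 00 00 00 00 00 03
00 00 00 01 f9 3c 62 ab 14 04 00 00 00 0d 00 00
00 03 00 00 00 00 00 00 00 04 00 00 00 01 a4 ba
"""
data = bytes.fromhex("".join(DUMP.split()))

STRIDE = 26  # assumption: apparent record length in this dump

def linear_delta(seq):
    """Return the common difference (mod 256) if seq is arithmetic, else None."""
    deltas = {(b - a) % 256 for a, b in zip(seq, seq[1:])}
    return deltas.pop() if len(deltas) == 1 else None

def linear_phases(data, stride):
    """Map each byte offset within a record to its delta, for offsets that are linear."""
    result = {}
    for phase in range(stride):
        delta = linear_delta(data[phase::stride])
        if delta is not None:
            result[phase] = delta
    return result

phases = linear_phases(data, STRIDE)
for phase, delta in phases.items():
    print(f"offset {phase:2d}: delta {delta}")
```

On this dump, 22 of the 26 byte offsets are linear: 21 stay constant and one increases by 1 per record. The remaining four bytes of each record vary irregularly.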

SLIDE 8

Sequence detection

[Figure: table of detected sequences, indexed by stride and phase. Each entry is written δ; r, where δ is the increment and r is the run length.]
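One way to build such a stride/phase table is sketched below. It is a reconstruction under assumptions, not the authors' detector: phase is taken to be the byte index modulo the stride, increments are computed modulo 256, and the run length counts how many of the most recent increments equal δ.

```python
def detect_sequences(data, max_stride):
    """Return {(stride, phase): (delta, run)} for the most recent run in each subsequence.

    delta is the last observed increment (mod 256); run is the number of
    consecutive increments, counting back from the end, that equal delta.
    """
    table = {}
    for stride in range(1, max_stride + 1):
        for phase in range(stride):
            seq = data[phase::stride]
            if len(seq) < 2:
                continue  # at least two bytes are needed to observe an increment
            diffs = [(b - a) % 256 for a, b in zip(seq, seq[1:])]
            delta = diffs[-1]
            run = 0
            for d in reversed(diffs):
                if d != delta:
                    break
                run += 1
            table[(stride, phase)] = (delta, run)
    return table

# Two interleaved sequences: 0, 1, 2, 3 and 5, 5, 5, 5.
sample = bytes([0, 5, 1, 5, 2, 5, 3, 5])
table = detect_sequences(sample, max_stride=2)
for (stride, phase), (delta, run) in sorted(table.items()):
    print(f"stride {stride}, phase {phase}: {delta}; {run}")
```

A long run at some stride and phase signals that the bytes at that position are predictable, which is what the next step exploits.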

SLIDE 9

Predictive coding

Keys:            (1,1) (1,2) (1,3) (1,4) (1,5) (2,1) (2,2) (2,3) (2,4) (2,5)
Original:        1 1 1 2 1 3 1 4 1 5 2 1
Predictions:     1 4 1 5 1 6
Delta (output):  1 1 1 2 1 3 1 -7
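A minimal sketch of predictive coding on the flattened key stream follows. It assumes linear extrapolation at stride 2 (one stream position per key component) and a warm-up of six values that are emitted unchanged; both choices were picked because they reproduce the Predictions row above, and neither is confirmed by the slide. The residuals here are plain integer differences, so the output of this sketch should not be read as the authors' exact output row.

```python
def predict(values, i, stride):
    """Extrapolate linearly from the two previous values at the same phase."""
    prev, prev2 = values[i - stride], values[i - 2 * stride]
    return prev + (prev - prev2)

def encode(values, stride, warmup):
    """Emit the first `warmup` values verbatim, then actual minus predicted."""
    out = list(values[:warmup])
    for i in range(warmup, len(values)):
        out.append(values[i] - predict(values, i, stride))
    return out

def decode(encoded, stride, warmup):
    """Invert encode by re-running the same predictor on the decoded prefix."""
    values = list(encoded[:warmup])
    for i in range(warmup, len(encoded)):
        values.append(encoded[i] + predict(values, i, stride))
    return values

original = [1, 1, 1, 2, 1, 3, 1, 4, 1, 5, 2, 1]
predictions = [predict(original, i, stride=2) for i in range(6, len(original))]
encoded = encode(original, stride=2, warmup=6)

print("predictions:", predictions)
print("encoded:    ", encoded)
assert decode(encoded, stride=2, warmup=6) == original
```

Wherever the prediction is right the residual is zero, and long runs of zeros are what a general-purpose compressor such as gzip or bzip2 handles best.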

SLIDE 10

Semantically-informed byte-level compression (results)

File size by compression method (bar chart, y-axis in megabytes)

Method               Size relative to original
Original             100%
gzip                 13.6%
transform + gzip     0.28%
bzip2                4.27%
transform + bzip2    0.0039%

Tested on grid points from a 100 × 100 × 100 rectangle
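The shape of this experiment can be tried with the Python standard library. The sketch below is not the authors' transform: it substitutes a plain byte-wise delta at a fixed stride of 12 bytes (three 32-bit coordinates per key, an assumed serialization) and uses a 20 × 20 × 20 grid so it runs quickly. The sizes it prints will differ from the percentages above.

```python
import bz2
import gzip
import struct
from itertools import product

def grid_keys(n):
    """Serialize every point of an n x n x n grid as three big-endian 32-bit integers."""
    return b"".join(struct.pack(">3i", *p) for p in product(range(n), repeat=3))

def stride_delta(data, stride):
    """Replace each byte with its difference (mod 256) from the byte one stride earlier."""
    out = bytearray(data[:stride])
    out.extend((data[i] - data[i - stride]) % 256 for i in range(stride, len(data)))
    return bytes(out)

def stride_undelta(data, stride):
    """Invert stride_delta."""
    out = bytearray(data[:stride])
    for i in range(stride, len(data)):
        out.append((data[i] + out[i - stride]) % 256)
    return bytes(out)

raw = grid_keys(20)
transformed = stride_delta(raw, stride=12)
assert stride_undelta(transformed, stride=12) == raw  # the transform is lossless

sizes = {
    "original": len(raw),
    "gzip": len(gzip.compress(raw)),
    "transform + gzip": len(gzip.compress(transformed)),
    "bzip2": len(bz2.compress(raw)),
    "transform + bzip2": len(bz2.compress(transformed)),
}
for name, size in sizes.items():
    print(f"{name:18} {size:8d} bytes  {100 * size / len(raw):.4f}%")
```

The transform turns most key bytes into zeros before the general-purpose compressor sees them, which is why the transformed input should compress much further than the raw bytes.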

SLIDE 11

Outline

1. Background
2. Semantically-informed byte-level compression
3. User-level semantic compression

SLIDE 12

Key redundancy

Key/value pairs are independent in MapReduce
[Diagram: inputs 1, 2, 3, 4 → Mapper → 1, 4, 9, 16 → Reducer]

Pairs are not independent conceptually
[Diagram: inputs 1, 2, 3, 4 → Mapper → 1, 4, 9, 16 → Reducer]

SLIDE 13

SciHadoop semantic compression

[Figure: a 2 × 2 grid holding the values 1, 2, 3, 4]

Address per cell:
(0, 0) → 1
(0, 1) → 2
(1, 0) → 3
(1, 1) → 4

vs.

Address range per block:
(0, 0) - (1, 1) → {1, 2, 3, 4}
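The two addressing schemes on this slide can be compared in a few lines. This is a sketch with illustrative names (`per_cell`, `per_block`, `expand`), assuming values inside a block are stored in row-major order; it does not reflect SciHadoop's actual key classes.

```python
from itertools import product

def per_cell(block):
    """One key/value pair per cell: (row, col) -> value."""
    return {(r, c): v for r, row in enumerate(block) for c, v in enumerate(row)}

def per_block(block, origin=(0, 0)):
    """One pair per block: (low corner, high corner) -> values in row-major order."""
    rows, cols = len(block), len(block[0])
    low = origin
    high = (origin[0] + rows - 1, origin[1] + cols - 1)
    values = [v for row in block for v in row]
    return (low, high), values

def expand(key, values):
    """Recover the per-cell pairs from a block pair."""
    (r0, c0), (r1, c1) = key
    cells = product(range(r0, r1 + 1), range(c0, c1 + 1))
    return dict(zip(cells, values))

block = [[1, 2],
         [3, 4]]
cell_pairs = per_cell(block)
block_key, block_values = per_block(block)

print(cell_pairs)
print(block_key, block_values)
assert expand(block_key, block_values) == cell_pairs
```

The block form stores two corners regardless of how many cells it covers, so the key cost is amortized over the whole block.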

SLIDE 14

N-dimensional aggregation

Optimal choice is not obvious

SLIDE 15

Linearizing with a space-filling curve

[Figure: a grid of 16 cells numbered 1–16]

Ranges: 1–5, 7, 9–10, 13

Cells are numbered with a space-filling curve, and contiguous numbers are collapsed into ranges.
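The collapsing step can be sketched as follows. The slide does not say which space-filling curve is used, so the Z-order (Morton) numbering here is an assumption chosen for brevity; `collapse` is independent of the curve.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of x and y to get a position on the Z-order curve."""
    index = 0
    for b in range(bits):
        index |= ((x >> b) & 1) << (2 * b)
        index |= ((y >> b) & 1) << (2 * b + 1)
    return index

def collapse(indices):
    """Collapse integers into inclusive (start, end) ranges of contiguous values."""
    ranges = []
    for i in sorted(set(indices)):
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], i)  # extend the current range
        else:
            ranges.append((i, i))            # start a new range
    return ranges

# The cell numbers listed on this slide.
slide_ranges = collapse([1, 2, 3, 4, 5, 7, 9, 10, 13])
print(slide_ranges)

# A 2 x 2 block of cells is contiguous on the Z-order curve.
block = [morton_index(x, y) for x in range(2) for y in range(2)]
print(collapse(block))
```

Applied to the slide's cell numbers, `collapse` yields the four ranges shown there: 1–5, 7, 9–10, and 13.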

SLIDE 16

Overlapping keys problem

The ranges overlap but are not equal, so the reducer won’t combine them

SLIDE 17

Unavoidable overlap

Alignment is insufficient

SLIDE 18

Key splitting

Overlapping ranges are split on the overlap boundaries
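A minimal sketch of splitting on overlap boundaries, assuming keys are inclusive one-dimensional ranges over linearized cell numbers. The function name is illustrative, and the sketch handles keys only; a real implementation would also have to split the values attached to each range.

```python
def split_ranges(ranges):
    """Split inclusive (start, end) ranges at every boundary of any range.

    Afterwards, any two resulting pieces are either identical or disjoint,
    so equal pieces can meet at the same reducer key.
    """
    # Cut points: every range start, and one past every range end.
    cuts = sorted({s for s, _ in ranges} | {e + 1 for _, e in ranges})
    result = []
    for start, end in ranges:
        inner = [c for c in cuts if start < c <= end]
        bounds = [start] + inner + [end + 1]
        result.append([(a, b - 1) for a, b in zip(bounds, bounds[1:])])
    return result

pieces = split_ranges([(1, 5), (3, 8)])
print(pieces)
```

In this example the shared cells 3–5 become the identical key `(3, 5)` in both outputs, while `(1, 2)` and `(6, 8)` remain unique to their original ranges.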

SLIDE 19

Effect of key aggregation

[Figure: stacked bar chart of total dataset size (MB), Original vs. Compressed, broken down into file overhead, keys, and values. Data labels: 3.81 MB, 15.26 MB, 1.91 MB, 3.81 MB, 25.05 KB, 5.84 KB]

Data size is reduced by 84.5% for a 100×100×100 grid of integers

SLIDE 20

Result

Query: median of a sliding 3 × 3 × 3 window in an 800 × 800 × 800 grid of integers
Cluster: 5 nodes, with 5 reducers and 10 map slots

Intermediate data (“Map output materialized bytes”) was reduced by 60.7%
Intermediate key/value pair count (“Reduce input records”) was reduced by 73.3%
Total runtime was reduced by 28.5%

SLIDE 21

Conclusion

Compression must be fast to be useful
Semantic compression has an advantage with multiple read/write cycles
Scientific processing in Hadoop is becoming more feasible

SLIDE 22

Questions?
