Compressing Intermediate Keys between Mappers and Reducers in SciHadoop
Adam Crume, Joe Buck, Carlos Maltzahn, Scott Brandt
{adamcrume,buck,carlosm,scott}@cs.ucsc.edu
University of California, Santa Cruz
November 12, 2012

Outline
1. Background
2. Semantically-informed byte-level compression
3. User-level semantic compression

MapReduce overview
[Figure: MapReduce pipeline with a scheduler, input, mappers, reducers, and output]

Hadoop internal data flow
[Figure: data flow through mapper, combiner, sort, and reducer, with writes to disk and a network transfer]

Array key/value pairs
Array:
   0   1   2   3
   4   5   6   7
   8   9  10  11
  12  13  14  15
Key/value pairs:
  (0, 0) → 0    (2, 0) → 8
  (0, 1) → 1    (2, 1) → 9
  (0, 2) → 2    (2, 2) → 10
  (0, 3) → 3    (2, 3) → 11
  (1, 0) → 4    (3, 0) → 12
  (1, 1) → 5    (3, 1) → 13
  (1, 2) → 6    (3, 2) → 14
  (1, 3) → 7    (3, 3) → 15
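
The mapping from array cells to key/value pairs can be sketched in a few lines of Python. This is only an illustration: the function name `array_to_pairs` and the nested-list representation are assumptions made for the example, not SciHadoop's actual data types.

```python
from itertools import product


def array_to_pairs(array):
    """Yield ((row, col), value) for every cell of a rectangular 2-D list."""
    rows, cols = len(array), len(array[0])
    for r, c in product(range(rows), range(cols)):
        yield (r, c), array[r][c]


# The 4 x 4 array from the slide, holding the values 0..15 in row-major order.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
pairs = dict(array_to_pairs(grid))
print(pairs[(2, 1)])  # 9
```

Every value carries its full coordinate, which is why the keys can end up larger than the data they describe.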

Linear sequences
00000000  14 04 00 00 00 0d 00 00  00 03 00 00 00 00 00 00
00000010  00 00 00 00 00 01 c2 11  37 34 14 04 00 00 00 0d
00000020  00 00 00 03 00 00 00 00  00 00 00 01 00 00 00 01
00000030  9c 65 aa 33 14 04 00 00  00 0d 00 00 00 03 00 00
00000040  00 00 00 00 00 02 00 00  00 01 8d fc 61 b2 14 04
00000050  00 00 00 0d 00 00 00 03  00 00 00 00 00 00 00 03
00000060  00 00 00 01 f9 3c 62 ab  14 04 00 00 00 0d 00 00
00000070  00 03 00 00 00 00 00 00  00 04 00 00 00 01 a4 ba

Sequence detection
[Figure: triangular table of detected sequences, indexed by stride (1 to 5) and phase (0 to 4); each cell holds a pair δ;r]
δ;r ≡ increment = δ, run length = r
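
One way to read the δ;r notation is as per-stride, per-phase bookkeeping over a value stream. The sketch below is a guess at that bookkeeping, not the presenters' implementation: the function name `detect_runs` is hypothetical, and so is the convention that a newly seen increment starts with run length 0.

```python
def detect_runs(data, max_stride):
    """Track the latest (increment, run length) for each (stride, phase).

    For stride s, position i is compared with position i - s, and the
    phase is i % s. Assumed convention: a newly seen increment has run
    length 0, and each consecutive repeat adds 1.
    """
    state = {}
    for s in range(1, max_stride + 1):
        for i in range(s, len(data)):
            key = (s, i % s)
            delta = data[i] - data[i - s]
            prev = state.get(key)
            if prev is not None and prev[0] == delta:
                state[key] = (delta, prev[1] + 1)
            else:
                state[key] = (delta, 0)
    return state


# Flattened keys (1,1) .. (1,5): the first coordinate is constant and the
# second counts upward, so stride 2 exposes two clean linear sequences.
runs = detect_runs([1, 1, 1, 2, 1, 3, 1, 4, 1, 5], max_stride=2)
print(runs[(2, 0)], runs[(2, 1)])  # (0, 3) (1, 3)
```

At stride 1 the same data shows no stable increment, which is why several strides are examined.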

Predictive coding
Keys:            (1,1) (1,2) (1,3) (1,4) (1,5)
                 (2,1) (2,2) (2,3) (2,4) (2,5)
Original:         1  1  1  2  1  3  1  4  1  5  2  1
Predictions:                        1  4  1  5  1  6
Delta (output):   1  1  1  2  1  3  0  0  0  0  1 -7
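
A minimal sketch of delta-against-prediction coding, assuming a fixed stride and a linear extrapolation from the two previous values at that stride. The presentation's encoder detects sequences dynamically, so its output for the same keys may differ from what this sketch produces; `encode` and `decode` are illustrative names.

```python
def predict(history, i, stride):
    """Linear extrapolation from the two previous values at this stride."""
    return 2 * history[i - stride] - history[i - 2 * stride]


def encode(values, stride):
    """Emit raw values until a prediction exists, then value - prediction."""
    out = []
    for i, v in enumerate(values):
        if i < 2 * stride:
            out.append(v)
        else:
            out.append(v - predict(values, i, stride))
    return out


def decode(deltas, stride):
    """Invert encode() by rebuilding each prediction from decoded values."""
    values = []
    for i, d in enumerate(deltas):
        if i < 2 * stride:
            values.append(d)
        else:
            values.append(d + predict(values, i, stride))
    return values


# Keys (1,1) .. (2,5), flattened into a single integer stream.
keys = [(r, c) for r in (1, 2) for c in range(1, 6)]
flat = [x for key in keys for x in key]
deltas = encode(flat, stride=2)
assert decode(deltas, stride=2) == flat
```

Wherever the prediction is right the output is 0, and long runs of zeros are what a general-purpose compressor applied afterwards handles well.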

Semantically-informed byte-level compression (results)
[Chart: file size by compression method, in megabytes. Methods compared: original, gzip, bzip2, transform + gzip, transform + bzip2. Relative sizes shown on the bars: 100%, 13.6%, 4.27%, 0.28%, 0.0039%]
Tested on grid points from a 100 × 100 × 100 rectangle

Key redundancy
Key/value pairs are independent in MapReduce
[Figure: the pairs (1, 1), (2, 4), (3, 9), (4, 16) shown individually between mapper and reducer]
Pairs are not independent conceptually
[Figure: the blocks 1, 2, 3, 4 and 1, 4, 9, 16 shown as grouped units between mapper and reducer]

SciHadoop semantic compression
Array:
  1  2
  3  4
Address per cell:
  (0, 0) → 1
  (0, 1) → 2
  (1, 0) → 3
  (1, 1) → 4
vs.
Address range per block:
  (0, 0) - (1, 1) → { 1, 2, 3, 4 }
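
The per-cell versus per-block contrast can be sketched as follows. The function name `aggregate_block` and the dictionary representation are assumptions for illustration; the sketch only handles the simple case where the cells fill a complete rectangle.

```python
def aggregate_block(pairs):
    """Collapse per-cell pairs that fill a rectangle into one range key.

    Returns ((low_corner, high_corner), values) with the values listed in
    row-major order, so the cell addresses no longer need to be stored.
    """
    coords = sorted(pairs)  # tuple ordering gives row-major order
    dims = range(len(coords[0]))
    low = tuple(min(c[d] for c in coords) for d in dims)
    high = tuple(max(c[d] for c in coords) for d in dims)

    # The bounding box must be completely covered for the range to be exact.
    volume = 1
    for lo, hi in zip(low, high):
        volume *= hi - lo + 1
    if volume != len(coords):
        raise ValueError("cells do not fill a complete rectangle")

    return (low, high), [pairs[c] for c in coords]


cells = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
block_key, block_values = aggregate_block(cells)
print(block_key, block_values)  # ((0, 0), (1, 1)) [1, 2, 3, 4]
```

Four keys become one, and the saving grows with the size of the block.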

N-dimensional aggregation
Optimal choice is not obvious

Linearizing with a space-filling curve
   1   2   5   6
   3   4   7   8
   9  10  13  14
  11  12  15  16
1–5, 7, 9–10, 13
Cells are numbered with a space-filling curve, and contiguous numbers are collapsed into ranges
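
The numbering in the grid above matches a Z-order (Morton) curve with the column bit in the lowest position. The sketch below assumes that curve and assumes the selected cells are the 3 × 3 block in the top-left corner, which reproduces the ranges listed on the slide; `morton` and `collapse` are illustrative names.

```python
def morton(row, col, bits=16):
    """Interleave the bits of row and col, with col in the low position."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def collapse(numbers):
    """Collapse integers into sorted, inclusive (start, end) ranges."""
    ranges = []
    for n in sorted(set(numbers)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], n)
        else:
            ranges.append((n, n))
    return ranges


# Cells of the assumed 3 x 3 block, numbered from 1 as on the slide.
block = [(r, c) for r in range(3) for c in range(3)]
ranges = collapse(morton(r, c) + 1 for r, c in block)
print(ranges)  # [(1, 5), (7, 7), (9, 10), (13, 13)]
```

Nine cells collapse into four ranges here; how well this works depends on how the selected region lines up with the curve.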

Overlapping keys problem
Ranges are unequal, so the reducer won’t reduce

Unavoidable overlap
Alignment is insufficient

Key splitting
Overlapping ranges are split on the overlap boundaries
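
Splitting on overlap boundaries can be sketched for one-dimensional ranges, such as those produced after linearization. This is an illustration under that assumption rather than the SciHadoop implementation, and `split_on_overlaps` is a hypothetical name.

```python
def split_on_overlaps(ranges):
    """Split inclusive (start, end) ranges wherever another range begins
    or ends inside them, so overlapping parts become identical keys."""
    # Every range start, and every position just past a range end, is a cut.
    cuts = sorted({s for s, _ in ranges} | {e + 1 for _, e in ranges})
    result = []
    for start, end in ranges:
        inner = [c for c in cuts if start <= c <= end + 1]
        result.append([(a, b - 1) for a, b in zip(inner, inner[1:])])
    return result


pieces = split_on_overlaps([(1, 5), (4, 8)])
print(pieces)  # [[(1, 3), (4, 5)], [(4, 5), (6, 8)]]
```

After the split, both inputs contain the identical key (4, 5), so the reducer can group the values for the shared cells.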

Effect of key aggregation
[Chart: total dataset size (MB), split into values, keys, and file overhead, for the original and compressed data. Segment labels: 3.81 MB, 15.26 MB, 3.81 MB, 25.05 KB, 1.91 MB, 5.84 KB]
Data size is reduced by 84.5% for a 100 × 100 × 100 grid of integers

Result
Query: median of a sliding 3 × 3 × 3 window in an 800 × 800 × 800 grid of integers
Cluster: 5 nodes, with 5 reducers and 10 map slots
Intermediate data (“Map output materialized bytes”) was reduced by 60.7%
Intermediate key/value pair count (“Reduce input records”) was reduced by 73.3%
Total runtime was reduced by 28.5%

Conclusion
Compression must be fast to be useful
Semantic compression has an advantage with multiple read/write cycles
Scientific processing in Hadoop is becoming more feasible

Questions?
