SLIDE 1

SpongeFiles:

Mitigating Data Skew in MapReduce Using Distributed Memory

Khaled Elmeleegy, Turn Inc. (kelmeleegy@turn.com)

Benjamin Reed, Facebook Inc. (br33d@fb.com)

Christopher Olston, Google Inc. (olston@google.com)
SLIDE 2

Background

  • MapReduce is the primary platform for processing web & social networking data sets
  • These data sets tend to be heavily skewed
  • Natural skew: hot news, hot people (e.g., holistic aggregation)
  • Machine learning: catch-all buckets like "unknown topic" or "unknown city"
  • Skew can overwhelm a node's memory capacity, forcing spills to disk

SLIDE 3

Data Skew:

Harms & Solutions

  • Harm
  • Skew is a major cause of MapReduce job slowdowns
  • Solutions
  • Provide sufficient memory
  • Use data-skew avoidance techniques
  • ……

SLIDE 4

Solution 1:

Provide Sufficient Memory

  • Method:
  • Provide every task with enough memory to avoid spilling
  • Shortcomings:
  • A task's memory needs are only known at run time
  • The required memory may not be available on the node executing the task
  • Conclusion:
  • Can mitigate spilling, but is very wasteful

SLIDE 5

Solution 2:

Data Skew Avoidance Techniques

  • Methods:
  • Skew-resistant partitioning schemes
  • Skew detection and work-migration techniques
  • Shortcoming:
  • User-defined functions (UDFs) may still be vulnerable to data skew
  • Conclusion:
  • Alleviates some skew, but not all

SLIDE 6

Sometimes, we have to resort to spilling

SLIDE 7

Hadoop’s Map Phase Spill

[Figure: layout of Hadoop's map-side sort buffer, 100 MB by default: kvoffsets[] (~1.25%), kvindices[] (~3.75%), and kvbuffer[] holding the raw key/value records (~95%).]
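For concreteness, here is a simplified Java reconstruction of how Hadoop 0.20's MapOutputBuffer carves the sort buffer into these three regions, using the stock defaults (io.sort.mb = 100, io.sort.record.percent = 0.05); it mirrors the accounting arithmetic, not the actual class.

```java
// Simplified sketch of Hadoop 0.20's map-side sort buffer accounting.
public class SortBufferLayout {
    static final int RECSIZE = 16;  // bytes of accounting per record (4 ints)
    static final int ACCTSIZE = 3;  // ints per record in kvindices

    public static void main(String[] args) {
        int sortmb = 100;                // io.sort.mb (MB)
        float recper = 0.05f;            // io.sort.record.percent
        int maxMemUsage = sortmb << 20;  // total buffer size in bytes
        int recordCapacity = (int) (maxMemUsage * recper);
        recordCapacity -= recordCapacity % RECSIZE;   // align to whole records
        byte[] kvbuffer = new byte[maxMemUsage - recordCapacity]; // ~95%: raw key/value bytes
        int numRecords = recordCapacity / RECSIZE;
        int[] kvoffsets = new int[numRecords];             // ~1.25%: record offsets
        int[] kvindices = new int[numRecords * ACCTSIZE];  // ~3.75%: partition/key/value indices
        System.out.printf("kvbuffer=%d B, kvoffsets=%d ints, kvindices=%d ints%n",
                kvbuffer.length, kvoffsets.length, kvindices.length);
    }
}
```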

SLIDE 8

Hadoop’s Reduce Phase Spill

[Figure: Hadoop's reduce-side spill path. MapOutputCopier threads fetch map outputs into an in-memory buffer; when the buffer fills, InMemFSMergerThread merges its contents and spills them to local disk, while LocalFSMerger merges the on-disk segments.]

SLIDE 9

How we expect that we can share the memory ……

SLIDE 10

Here comes the SpongeFile


  • Share memory within the same node
  • Share memory between peers

SLIDE 11

SpongeFile

  • Utilizes remote idle memory
  • Operates at the application level (unlike remote paging, which works at the OS level)
  • A single spilled object is stored in a single sponge file
  • Composed of large chunks
  • Used to complement a process's memory pool
  • Much simpler than regular files, for fast reads & writes (sketched below)
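Because a sponge file has a single writer, a single reader, and a bounded lifetime, its interface can be far narrower than a POSIX file's. A hypothetical sketch of that surface (the names are illustrative, not the paper's API):

```java
// Hypothetical minimal interface for a sponge file (illustrative names).
public interface SpongeFileApi {
    void write(byte[] src, int off, int len); // sequential append by the single writer
    void close();                             // seal the file; no further writes
    int read(byte[] dst, int off, int len);   // sequential read-back by the single reader
    void delete();                            // free all chunks once the data is consumed
}
```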

SLIDE 12

Design

  • No concurrent access
  • A single writer and a single reader
  • Does not persist after it is read
  • Its lifetime is well defined
  • No naming service needed
  • Each chunk can lie in any of the following (see the sketch after this list):
  • the machine's local memory,
  • a remote machine's memory,
  • a local file system, or a distributed file system
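A minimal sketch of the placement cascade this list implies, trying each location in increasing order of cost; the type names here are hypothetical:

```java
// Illustrative chunk placement: local memory, then remote memory,
// then the local file system, then the distributed file system.
interface ChunkHandle {}

interface ChunkAllocator {
    ChunkHandle tryAllocate(int size); // returns null if this tier has no room
}

class ChunkPlacer {
    private final ChunkAllocator[] tiers;

    ChunkPlacer(ChunkAllocator... tiers) {
        this.tiers = tiers; // ordered: local shm, remote memory, local FS, DFS
    }

    ChunkHandle allocate(int size) {
        for (ChunkAllocator tier : tiers) {
            ChunkHandle h = tier.tryAllocate(size);
            if (h != null) {
                return h; // first tier with free space wins
            }
        }
        throw new IllegalStateException("no tier could hold the chunk");
    }
}
```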

SLIDE 13

Local Memory Chunk Allocator

Effect: shares memory between tasks on the same node

Steps (sketched below):

  • 1. Acquires the shared pool's lock
  • 2. Tries to find a free chunk
  • 3. Releases the lock & returns the chunk handle (metadata)
  • 4. Returns an error if no free chunk exists
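A minimal in-process sketch of steps 1-4, assuming a free list guarded by a single lock; the paper's pool lives in memory shared across tasks, which this sketch does not model:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative lock-protected chunk pool; chunk ids stand in for handles.
class LocalChunkPool {
    private final ReentrantLock lock = new ReentrantLock();
    private final Deque<Integer> freeChunks = new ArrayDeque<>();

    LocalChunkPool(int numChunks) {
        for (int i = 0; i < numChunks; i++) freeChunks.add(i);
    }

    Integer allocate() {
        lock.lock();                           // 1. acquire the shared pool's lock
        try {
            Integer chunk = freeChunks.poll(); // 2. try to find a free chunk
            if (chunk == null) {
                throw new IllegalStateException("no free chunk"); // 4. error
            }
            return chunk;                      // 3. return the chunk handle
        } finally {
            lock.unlock();                     // 3. release the lock
        }
    }
}
```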

SLIDE 14

Remote Memory Chunk Allocator

Effect: shares memory between tasks across peer nodes

Steps (sketched below):

  • 1. Gets the list of sponge servers with free memory
  • 2. Finds a server with free space (preferring the same rack)
  • 3. Writes the data & gets back a handle
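A sketch of the server-selection step, preferring a rack-local sponge server with enough free memory; SpongeServerInfo and its fields are hypothetical names:

```java
import java.util.List;

// Illustrative rack-aware selection of a sponge server for a remote chunk.
class RemoteChunkAllocator {
    record SpongeServerInfo(String host, String rack, long freeBytes) {}

    SpongeServerInfo choose(List<SpongeServerInfo> servers, String myRack, long size) {
        for (SpongeServerInfo s : servers) {  // 2. prefer a server on the same rack
            if (s.rack().equals(myRack) && s.freeBytes() >= size) return s;
        }
        for (SpongeServerInfo s : servers) {  // otherwise any server with room
            if (s.freeBytes() >= size) return s;
        }
        return null; // caller falls through to the disk chunk allocator
    }
}
```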

SLIDE 15

Disk Chunk Allocator

Effect: the last resort, similar to an ordinary spill to disk

Steps:

  • 1. Tries the underlying local file system
  • 2. If the local disks have no free space, tries the distributed file system

SLIDE 16

Garbage Collection

Live tasks: delete their sponge files before they exit
Failed tasks: sponge servers perform periodic garbage collection (sketched below)
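A sketch of the failure path: the sponge server periodically scans its chunks and reclaims those whose owning task is gone. isTaskAlive() stands in for whatever liveness check the framework provides (e.g., asking the JobTracker); it is an assumed helper, not the paper's code.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative periodic garbage collection on a sponge server.
class SpongeServerGc {
    private final Map<String, byte[]> chunksByTaskId = new ConcurrentHashMap<>();

    void collect() {
        for (Iterator<String> it = chunksByTaskId.keySet().iterator(); it.hasNext(); ) {
            String taskId = it.next();
            if (!isTaskAlive(taskId)) {
                it.remove(); // reclaim chunks owned by a dead task
            }
        }
    }

    private boolean isTaskAlive(String taskId) {
        return true; // placeholder: would query the framework in practice
    }
}
```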

SLIDE 17

Potential Weakness Analysis

  • May increase the probability of task failure, since a task now also depends on the remote machines holding its chunks
  • But the added risk is small:
  • N: number of machines
  • t: running time
  • MTTF: mean time to failure
  • Added failure probability ≈ N · t / MTTF, assuming a machine failure rate of about 1% per month
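To make this concrete, a worked example with illustrative numbers (N = 30 machines, a 10-minute task, and the assumed 1%-per-month failure rate, i.e., an MTTF of about 100 months per machine):

$$P_{\text{fail}} \approx \frac{N \cdot t}{\mathrm{MTTF}} = \frac{30 \times 10\,\text{min}}{100 \times 43{,}200\,\text{min}} \approx 7 \times 10^{-5}$$

so the added failure probability is negligible.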

SLIDE 18

Evaluation

  • Microbenchmarks
  • 2.5 GHz quad-core Xeon CPUs
  • 16 GB of memory
  • 7200 RPM 300 GB ATA drives
  • 1 Gb Ethernet
  • Red Hat Enterprise Linux Server release 5.3
  • ext4 file system
  • Macrobenchmarks
  • Hadoop 0.20.2 on 30 nodes (2 map task slots & 1 reduce slot per node)
  • Pig 0.7
  • Same hardware as above

SLIDE 19

Microbenchmarks

In Memory, Time (ms):
  • Local shared memory: 1
  • Local memory (through sponge server): 7
  • Remote memory (over the network): 9

On Disk, Time (ms):
  • Disk: 25
  • Disk with background IO: 174
  • Disk with background IO and memory pressure: 499


Benchmark: spill a 1 MB buffer 10,000 times to disk and memory
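For the "Disk" rows, a minimal harness in the spirit of this benchmark might look as follows; whether the original synced each write or reused the same file is not stated, so both choices here are assumptions.

```java
import java.io.FileOutputStream;
import java.io.IOException;

// Illustrative disk-spill microbenchmark: write a 1 MB buffer 10,000 times
// and report the mean per-spill latency.
public class SpillBench {
    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[1 << 20]; // 1 MB payload
        long totalNanos = 0;
        for (int i = 0; i < 10_000; i++) {
            long start = System.nanoTime();
            try (FileOutputStream out = new FileOutputStream("/tmp/spill.bin")) {
                out.write(buf);
                out.getFD().sync(); // force the data to the drive
            }
            totalNanos += System.nanoTime() - start;
        }
        System.out.printf("mean spill time: %.2f ms%n", totalNanos / 10_000 / 1e6);
    }
}
```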

SLIDE 20

Microbenchmark Conclusions


  • 1. Spilling to local shared memory is the least expensive
  • 2. Next comes spilling locally via the sponge server (more processing and multiple message exchanges)
  • 3. Disk spilling is two orders of magnitude slower than memory

SLIDE 21

Macrobenchmarks

  • The jobs’ data sets:
  • Two versions of Hadoop:
  • The original
  • With SpongeFiles
  • Two configurations of memory size:
  • 4 GB
  • 16 GB

SLIDE 22

Test 1

  • When memory is scarce, spilling to SpongeFiles performs better than spilling to disk
  • When memory is abundant, performance depends on the amount of data spilled and the time difference between when the data is spilled and when it is read back

SLIDE 23

Test 2

  • Using SpongeFiles reduces the first job's runtime by over 85% in the case of disk contention and memory pressure (similar behavior is seen for the spam-quantiles job)
  • For the frequent anchor text job, when memory is abundant, even with disk contention, spilling to disk performs slightly better than spilling to SpongeFiles

SLIDE 24

Test 3

  • No spilling performs best
  • Spilling to local sponge memory comes second
  • But spilling to SpongeFiles is the only practical option

SLIDE 25

Related work

  • Cooperative caching (for sharing)
  • Network memory (for small objects)
  • Remote paging systems (operate at a different level than SpongeFiles)


Conclusion

  • Complementary to skew-avoidance techniques
  • Reduces job runtimes by up to 55% in the absence of disk contention and by up to 85% in its presence