LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud

slide-1
SLIDE 1

LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud

Shadi Ibrahim, Hai Jin, Lu Lu, Song Wu, Bingsheng He*, Qi Li#
Huazhong University of Science and Technology
*Nanyang Technological University
#China Development Bank

slide-2
SLIDE 2

Motivation

MapReduce is becoming very popular

 Hadoop is widely used by enterprise and academia
   Yahoo!, Facebook, Baidu, …
   Cornell, Maryland, HUST, …

The wide diversity of today's data-intensive applications:
 Search engines
 Social networks
 Scientific applications

slide-3
SLIDE 3

Motivation

Some applications experienced data skew in the shuffle phase [1, 2]

The current MapReduce implementations have overlooked the skew issue

Results:
 Hash partitioning is inadequate in the presence of data skew
 Design LEEN: locality- and fairness-aware key partitioning

1. X. Qiu, J. Ekanayake, S. Beason, T. Gunarathne, G. Fox, R. Barga, and D. Gannon, "Cloud technologies for bioinformatics applications", Proc. ACM Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS 2009), ACM Press, Nov. 2009.
2. J. Lin, "The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce", Proc. Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR'09), Jul. 2009.

slide-4
SLIDE 4

Outline

Motivation
Hash partitioning in MapReduce
LEEN: Locality- and Fairness-Aware Key Partitioning
Evaluation
Conclusion

slide-5
SLIDE 5

Hash partitioning in Hadoop

The current Hadoop hash partitioning works well when the keys appear equally often and are uniformly stored across the data nodes

In the presence of partitioning skew:
 Variation in intermediate keys' frequencies
 Variation in intermediate keys' distribution amongst the different data nodes

The native blind hash partitioning is inadequate and will lead to:
 Network congestion
 Unfairness in reducers' inputs
 Reduce computation skew
 Performance degradation
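As a concrete reference for the rule above, here is a minimal stand-alone sketch of hash partitioning (intermediate key's hash code modulo the number of reduce tasks). It mirrors the behaviour of Hadoop's default HashPartitioner but uses plain Strings instead of Hadoop's Writable types, so it is an illustration rather than the actual Partitioner implementation:

import java.util.Arrays;

public class HashPartitionSketch {

    /** Map an intermediate key to a reducer index, regardless of where its records live. */
    static int partition(String key, int numReduceTasks) {
        // Mask the sign bit so the index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String key : Arrays.asList("K1", "K2", "K3", "K4", "K5", "K6")) {
            System.out.println(key + " -> reducer " + partition(key, reducers));
        }
    }
}

Because the target reducer depends only on the key's hash value, the decision ignores both where a key's records actually reside and how frequent the key is, which is exactly what the motivational example on the next slide exercises.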

slide-6
SLIDE 6

The Problem (Motivational Example)

Partitioning rule: hash(Hash code(Intermediate key) modulo ReduceID)

Data Node1: K1 K1 K1 K2 K2 K2 K2 K2 K3 K3 K3 K3 K4 K4 K4 K4 K5 K6
Data Node2: K1 K1 K1 K1 K1 K1 K1 K1 K1 K2 K4 K4 K4 K5 K5 K6 K6 K6
Data Node3: K1 K1 K1 K1 K2 K2 K2 K4 K4 K4 K4 K4 K4 K5 K5 K5 K5 K5

With hash partitioning over three reducers (K1 and K4 go to Data Node1, K2 and K5 to Data Node2, K3 and K6 to Data Node3):

                 Data Node1   Data Node2   Data Node3   Total
Data Transfer        11           15           18       44/54
Reduce Input         29           17            8       cv 58%
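The summary numbers can be reproduced directly from the per-node key counts. A minimal sketch, assuming (as the reduce-input totals imply) that hashing sends K1 and K4 to the reducer on Data Node1, K2 and K5 to Data Node2, and K3 and K6 to Data Node3:

import java.util.Arrays;

public class HashSkewExample {
    public static void main(String[] args) {
        // counts[node][key]: occurrences of K1..K6 on Data Node1..3 (from the slide).
        long[][] counts = {
            {3, 5, 4, 4, 1, 1},   // Data Node1
            {9, 1, 0, 3, 2, 3},   // Data Node2
            {4, 3, 0, 6, 5, 0},   // Data Node3
        };
        // Assumed hash outcome: reducer index for K1..K6.
        int[] reducerOfKey = {0, 1, 2, 0, 1, 2};

        long[] reduceInput = new long[3];
        long transferred = 0, total = 0;
        for (int node = 0; node < 3; node++) {
            for (int key = 0; key < 6; key++) {
                long c = counts[node][key];
                total += c;
                reduceInput[reducerOfKey[key]] += c;
                if (reducerOfKey[key] != node) transferred += c; // record leaves its node
            }
        }
        System.out.println("Reduce inputs: " + Arrays.toString(reduceInput)); // [29, 17, 8]
        System.out.println("Transferred: " + transferred + "/" + total);      // 44/54
    }
}

The 58% reported above is the (sample) coefficient of variation of the three reducer inputs 29, 17 and 8.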

slide-7
SLIDE 7

Example: WordCount

[Figure: per-node map output vs. reduce input sizes (MB) for DataNode01-06, broken down into transferred data, local data, and data during failed reduce]

 6 nodes, 2 GB data set; the Combine function is disabled
 Data distribution: max-min ratio 20%, cv 42%
 Transferred data is relatively large: 83% of the maps' output is transferred over the network
 Data distribution is imbalanced
slide-8
SLIDE 8

Our Work

Asynchronous map and reduce execution

Locality-aware and fairness-aware key partitioning

LEEN

slide-9
SLIDE 9

Asynchronous Map and Reduce execution

Default Hadoop:
 Several maps and reduces are concurrently running on each data node
 Overlap computation and data transfer

Our approach:
 Keep track of all the intermediate keys' frequencies and key distributions (using a DataNode-Keys Frequency Table)
 Could bring a little overhead due to the unutilized network during the map phase
 But it can speed up map execution because the complete disk I/O resources are reserved for the map tasks
 For example, the average execution time of map tasks: 32 in default Hadoop vs. 26 using our approach
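The slides do not show the table's layout, so the following is a hypothetical sketch of a DataNode-Keys Frequency Table: one counter per (intermediate key, data node) pair, updated as map outputs are reported (class and method names are illustrative):

import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a DataNode-Keys Frequency Table (names are illustrative). */
public class KeyFrequencyTable {
    private final int numNodes;
    // intermediate key -> per-data-node occurrence counts
    private final Map<String, long[]> counts = new HashMap<>();

    public KeyFrequencyTable(int numNodes) {
        this.numNodes = numNodes;
    }

    /** Called whenever a map task running on nodeId emits key. */
    public void record(String key, int nodeId) {
        counts.computeIfAbsent(key, k -> new long[numNodes])[nodeId]++;
    }

    /** Per-node frequencies of a key, consumed later by the partitioning decision. */
    public long[] frequencies(String key) {
        return counts.getOrDefault(key, new long[numNodes]);
    }

    /** Total occurrences of a key across all data nodes. */
    public long total(String key) {
        long sum = 0;
        for (long c : frequencies(key)) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        KeyFrequencyTable table = new KeyFrequencyTable(3);
        table.record("K1", 1);
        table.record("K1", 1);
        table.record("K1", 0);
        System.out.println(java.util.Arrays.toString(table.frequencies("K1"))); // [1, 2, 0]
    }
}

With these per-node frequencies available at the end of the map phase, the partitioner can weigh both where a key is cheapest to reduce (locality) and how large each reducer's input would become (fairness).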

slide-10
SLIDE 10

LEEN Partitioning Algorithm

Extend the locality-aware concept to the reduce tasks

Consider fair distribution of reducers' inputs

Results:
 Balanced distribution of reducers' input
 Minimize the data transfer during the shuffle phase
 Improve the response time

Close-to-optimal trade-off between data locality and reducers' input fairness

[Diagram: trade-off spectrum between fairness ([0,1]) and locality ([0,100])]

slide-11
SLIDE 11

LEEN Partitioning Algorithm (details)

Keys are sorted according to their Fairness-Locality values:

  FLK_i = Fairness_i / Locality_i

 Fairness_i: fairness in the distribution of K_i amongst the data nodes
 Locality_i: the node with the best locality for K_i (its largest share on a single node)

For each key, the nodes are sorted in descending order according to the frequency of that specific key

Partition a key to a node using the Fairness-Score value:
 For a specific key K_i:
   If Fairness-Score(N_j) > Fairness-Score(N_j+1), move to the next node
   Else partition K_i to N_j
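The slides do not spell out the two scores, but the worked example on the next slide is consistent with FLK being the standard deviation of a key's per-node frequencies divided by its best-node share, and with the Fairness-Score being the standard deviation of the projected reducers' inputs (still-unassigned keys counted on their home nodes). A sketch under those assumptions, with illustrative class and method names:

import java.util.*;

public class LeenPartitionerSketch {

    /** Population standard deviation. */
    static double stddev(double[] v) {
        double mean = Arrays.stream(v).average().orElse(0);
        double var = Arrays.stream(v).map(x -> (x - mean) * (x - mean)).average().orElse(0);
        return Math.sqrt(var);
    }

    /** Assumed FLK: variation of the key across nodes divided by its best-node share. */
    static double flk(long[] f) {
        long total = Arrays.stream(f).sum();
        long best = Arrays.stream(f).max().orElse(0);
        if (total == 0) return 0;
        return stddev(Arrays.stream(f).asDoubleStream().toArray()) / ((double) best / total);
    }

    /** Assumed Fairness-Score: imbalance of projected reducer inputs if the key goes entirely to node. */
    static double fairnessScore(double[] projected, long[] f, long total, int node) {
        double[] p = projected.clone();
        for (int n = 0; n < p.length; n++) p[n] -= f[n]; // take the key's data off its home nodes
        p[node] += total;                                // ...and place all of it on the candidate node
        return stddev(p);
    }

    /** keyFreq.get(key)[node] = frequency of key on node. Returns key -> chosen reducer node. */
    static Map<String, Integer> partition(Map<String, long[]> keyFreq, int numNodes) {
        // Process keys in descending FLK order.
        List<String> keys = new ArrayList<>(keyFreq.keySet());
        keys.sort(Comparator.comparingDouble((String k) -> -flk(keyFreq.get(k))));

        // Projected reducer inputs: unassigned data is counted on the node that hosts it.
        double[] projected = new double[numNodes];
        for (long[] f : keyFreq.values())
            for (int n = 0; n < numNodes; n++) projected[n] += f[n];

        Map<String, Integer> assignment = new LinkedHashMap<>();
        for (String key : keys) {
            long[] f = keyFreq.get(key);
            long total = Arrays.stream(f).sum();

            // Candidate nodes in descending order of this key's local frequency (best locality first).
            Integer[] order = new Integer[numNodes];
            for (int n = 0; n < numNodes; n++) order[n] = n;
            Arrays.sort(order, (a, b) -> Long.compare(f[b], f[a]));

            // Move to the next candidate only while it gives a strictly lower Fairness-Score.
            int chosen = order[0];
            double chosenScore = fairnessScore(projected, f, total, chosen);
            for (int i = 1; i < numNodes; i++) {
                double nextScore = fairnessScore(projected, f, total, order[i]);
                if (chosenScore > nextScore) { chosen = order[i]; chosenScore = nextScore; }
                else break;
            }

            assignment.put(key, chosen);
            for (int n = 0; n < numNodes; n++) projected[n] -= f[n]; // commit the decision
            projected[chosen] += total;
        }
        return assignment;
    }

    public static void main(String[] args) {
        // The three-node example from the next slide: per-node counts for K1..K6.
        Map<String, long[]> freq = new LinkedHashMap<>();
        freq.put("K1", new long[]{3, 9, 4});
        freq.put("K2", new long[]{5, 1, 3});
        freq.put("K3", new long[]{4, 0, 0});
        freq.put("K4", new long[]{4, 3, 6});
        freq.put("K5", new long[]{1, 2, 5});
        freq.put("K6", new long[]{1, 3, 0});
        // Prints {K1=1, K2=0, K5=2, K4=2, K3=0, K6=0}; node indices 0..2 are N1..N3.
        System.out.println(partition(freq, 3));
    }
}

With these assumptions the example data yields the result reported on the next slide (24/54 records transferred, cv = 14% over the reducers' inputs).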
slide-12
SLIDE 12

LEEN details (Example)

Data Node1: K1 K1 K1 K2 K2 K2 K2 K2 K3 K3 K3 K3 K4 K4 K4 K4 K5 K6
Data Node2: K1 K1 K1 K1 K1 K1 K1 K1 K1 K2 K4 K4 K4 K5 K5 K6 K6 K6
Data Node3: K1 K1 K1 K1 K2 K2 K2 K4 K4 K4 K4 K4 K4 K5 K5 K5 K5 K5

Key frequencies per node:

            K1     K2     K3     K4     K5     K6    Sum
Node 1       3      5      4      4      1      1     18
Node 2       9      1      0      3      2      3     18
Node 3       4      3      0      6      5      0     18
Total       16      9      4     13      8      4
FLK       4.66   2.93   1.88   2.70   2.71   1.66

For K1 (candidate nodes in descending order of K1's frequency):
 If partitioned to N2, the projected reducers' inputs are 15, 25, 14 -> Fairness-Score = 4.9
 If partitioned to N3, the projected reducers' inputs are 15, 9, 30 -> Fairness-Score = 8.8
 So K1 is partitioned to N2

Result with LEEN: Data Transfer = 24/54, cv = 14%

slide-13
SLIDE 13

Evaluation

 Cluster of 7 nodes
   Intel Xeon, two quad-core CPUs, 2.33 GHz
   8 GB memory
   1 TB disk
   Each node runs RHEL 5 with kernel 2.6.22
   Xen 3.2
   Hadoop version 0.18.0
 Designed 6 test sets
 Manipulated the partitioning skew degree by modifying the existing text-writer code in Hadoop that generates the input data into HDFS

slide-14
SLIDE 14

Test sets

Test set                                 1        2        3        4        5        6
Nodes number                           6 PMs    6 PMs    6 PMs    6 PMs   24 VMs   24 VMs
Data size                              14 GB     8 GB   4.6 GB  12.8 GB     6 GB  10.5 GB
Keys' frequencies variation             230%       1%     117%     230%      25%      85%
Key distribution variation (average)      1%     195%     150%      20%     180%     170%
Locality range                         24-26%  1-97.5%    1-85%   15-35%    1-50%    1-30%

Test set 1 isolates keys' frequencies variation, test set 2 isolates non-uniform key distribution amongst the data nodes, and test sets 3-6 combine both (partitioning skew).

slide-15
SLIDE 15

Keys’ Frequencies Variation

Each key is uniformly distributed among the data nodes

Keys' frequencies are significantly varying

[Figure: results for test set 1 (locality range 24-26%)]

slide-16
SLIDE 16

Non-Uniform Key Distribution

Each key is non-uniformly distributed among the data nodes

Keys' frequencies are nearly equal

[Figure: results for test set 2 (locality range 1-97.5%)]

slide-17
SLIDE 17

Partitioning Skew

[Figure: results for test sets 3-6 (locality ranges 1-85%, 15-35%, 1-50%, 1-30%)]

slide-18
SLIDE 18

Conclusion

 Partitioning skew is a challenge for MapReduce-based applications:
   Today's diversity of data-intensive applications: social networks, search engines, scientific analysis, etc.

 Partitioning skew is due to two factors:
   Significant variance in intermediate keys' frequencies
   Significant variance in intermediate keys' distributions among the different data nodes

 Our solution is to extend the locality concept to the reduce phase
   Partition the keys according to:
     their high frequencies
     fairness in data distribution among the different data nodes

 Up to 40% improvement using a simple application example!

 Future work
   Apply LEEN to different key and value sizes

slide-19
SLIDE 19

Thank you!

Questions? shadi@hust.edu.cn

http://grid.hust.edu.cn/shadi