 
LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud
Shadi Ibrahim, Hai Jin, Lu Lu, Song Wu, Bingsheng He*, Qi Li#
Huazhong University of Science and Technology
*Nanyang Technological University
#China Development Bank

Motivation
• MapReduce is becoming very popular
• Hadoop is widely used by enterprise and academia
  • Yahoo!, Facebook, Baidu, ...
  • Cornell, Maryland, HUST, ...
• The wide diversity of today's data-intensive applications:
  • Search engines
  • Social networks
  • Scientific applications

Motivation
• Some applications experience data skew in the shuffle phase [1,2]
• Current MapReduce implementations have overlooked the skew issue
• Results:
  • Hash partitioning is inadequate in the presence of data skew
  • Design LEEN: locality- and fairness-aware key partitioning

[1] X. Qiu, J. Ekanayake, S. Beason, T. Gunarathne, G. Fox, R. Barga, and D. Gannon, "Cloud technologies for bioinformatics applications", Proc. ACM Work. Many-Task Computing on Grids and Supercomputers (MTAGS 2009), ACM Press, Nov. 2009.
[2] J. Lin, "The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce", Proc. Work. Large-Scale Distributed Systems for Information Retrieval (LSDS-IR'09), Jul. 2009.

Outline
• Motivation
• Hash partitioning in MapReduce
• LEEN: Locality- and Fairness-Aware Key Partitioning
• Evaluation
• Conclusion

Hash partitioning in Hadoop
• Hadoop's current hash partitioning works well when the keys appear equally often and are uniformly stored across the data nodes
• In the presence of partitioning skew:
  • Variation in intermediate keys' frequencies
  • Variation in each intermediate key's distribution amongst the data nodes
• Native blind hash partitioning is inadequate and will lead to:
  • Network congestion
  • Unfairness in reducers' inputs
  • Reduce computation skew
  • Performance degradation

The Problem (Motivational Example)
(Figure: intermediate keys K1 to K6 stored on Data Node1, Data Node2 and Data Node3, partitioned by Hash code (Intermediate-key) Modulo ReduceID.)

                 Data Node1   Data Node2   Data Node3   Total
Data transfer        11           15           18       44/54
Reduce input         29           17            8       cv 58%
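
The numbers on this slide can be reproduced with a short sketch. It assumes the per-node key counts of the running example (the same counts tabulated on the "LEEN details (Example)" slide) and models the hash code of key K_i as i - 1, an assumption chosen only because it reproduces the slide's reduce inputs. The names `FREQ`, `hash_partition` and `evaluate` are illustrative, and the coefficient of variation uses the sample standard deviation, which matches the reported 58%.

```python
from statistics import mean, stdev

# Key counts per node (rows: Node1..Node3, columns: K1..K6), from the running example.
FREQ = [
    [3, 5, 4, 4, 1, 1],
    [9, 1, 0, 3, 2, 3],
    [4, 3, 0, 6, 5, 0],
]

def hash_partition(key_index, num_reducers):
    # Stand-in for hashCode(key) mod numReducers; key_index is 0-based (K1 -> 0).
    return key_index % num_reducers

def evaluate(freq, assignment):
    """Return (reduce input per node, data sent out per node, cv of reduce inputs)."""
    n = len(freq)
    reduce_input = [0] * n
    transfer_out = [0] * n
    for key, target in enumerate(assignment):
        for node in range(n):
            reduce_input[target] += freq[node][key]
            if node != target:
                # Records of this key stored elsewhere must cross the network.
                transfer_out[node] += freq[node][key]
    cv = stdev(reduce_input) / mean(reduce_input)
    return reduce_input, transfer_out, cv

assignment = [hash_partition(k, len(FREQ)) for k in range(len(FREQ[0]))]
reduce_input, transfer_out, cv = evaluate(FREQ, assignment)
print(reduce_input, transfer_out, sum(transfer_out), f"{cv:.1%}")
```

Under these assumptions, 44 of the 54 records leave their node, and one reducer receives more than three times the input of another.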

Example: Wordcount
• 6 nodes, 2 GB data set
• Combine function is disabled
• Transferred data is relatively large: 83% of the maps' output
• Data distribution is imbalanced: max-min ratio 20%, cv 42%
(Figure: data size in MB, 0 to 1000, of map output and reduce input on DataNode01 to DataNode06, split into transferred data, local data, and data during failed reduce.)

Our Work
• Asynchronous map and reduce execution
• Locality-aware and fairness-aware key partitioning: LEEN

Asynchronous Map and Reduce execution
• Default Hadoop:
  • Several maps and reduces run concurrently on each data node
  • Computation and data transfer overlap
• Our approach:
  • Keep track of all the intermediate keys' frequencies and the keys' distributions (using a DataNode-Keys Frequency Table)
  • Could bring a little overhead, because the network is not utilized during the map phase
  • Can speed up the map execution, because the complete disk I/O resources are reserved for the map tasks
  • For example, the average execution time of map tasks: 32 in default Hadoop, 26 using our approach
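
The slide does not specify the layout of the DataNode-Keys Frequency Table. As a minimal sketch, one can assume a mapping from each data node to a per-key counter; the function names and the toy input below are hypothetical.

```python
from collections import Counter

def build_frequency_table(map_outputs):
    """Count, per data node, how often each intermediate key was emitted.

    map_outputs: dict mapping a node name to an iterable of intermediate keys.
    """
    return {node: Counter(keys) for node, keys in map_outputs.items()}

def key_totals(table):
    """Total frequency of each key across all nodes."""
    totals = Counter()
    for counts in table.values():
        totals.update(counts)
    return totals

# Toy input, for illustration only.
outputs = {"node1": ["K1", "K2", "K1"], "node2": ["K1", "K3"]}
table = build_frequency_table(outputs)
print(table["node1"]["K1"], key_totals(table)["K1"])
```

Such a table can only be complete once all map tasks have finished, which is consistent with the slide's point that the shuffle does not overlap with the map phase in this approach.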

LEEN Partitioning Algorithm
• Extend the locality-aware concept to the reduce tasks
• Consider fair distribution of reducers' inputs
• Results:
  • Balanced distribution of reducers' input
  • Minimized data transfer during the shuffle phase
  • Improved response time
• Close to optimal trade-off between data locality and reducers' input fairness
(Figure: fairness, range [0,1], against locality, range [0,100], with the minimum marked.)

LEEN Partitioning Algorithm (details)
• Keys are sorted according to their Fairness-Locality Values (FLK):
  FLK_i = (fairness in distribution of K_i amongst data nodes) / (locality of the node with best locality)
• For each key, nodes are sorted in descending order according to the frequency of that key
• Partition a key to a node using the Fairness-Score value
  • For a specific key K_i:
    • If Fairness-Score(N_j) > Fairness-Score(N_j+1), move to the next node
    • Else partition K_i to N_j
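
The slide gives FLK only as a ratio of fairness to locality. A sketch that reproduces the FLK row of the example slide, under the assumption that fairness is the population standard deviation of a key's per-node frequencies and locality is the largest per-node share of that key, is shown below; `flk` is an illustrative name, not an identifier from the authors' code.

```python
from statistics import pstdev

def flk(freqs):
    """Fairness-locality value of one key, given its frequency on each node.

    Assumed definitions: fairness = population standard deviation of freqs,
    locality = share of the key held by the node that stores most of it.
    Undefined for a key that never occurs (sum(freqs) == 0).
    """
    fairness = pstdev(freqs)
    locality = max(freqs) / sum(freqs)
    return fairness / locality

# Per-node frequencies of K1..K6 (Node1, Node2, Node3) from the running example.
columns = [(3, 9, 4), (5, 1, 3), (4, 0, 0), (4, 3, 6), (1, 2, 5), (1, 3, 0)]
values = [flk(c) for c in columns]
print([round(v, 2) for v in values])
```

The computed values agree with the slide's 4.66, 2.93, 1.88, 2.70, 2.71 and 1.66 up to rounding in the last digit.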

LEEN details (Example)

          K1     K2     K3     K4     K5     K6    Total
Node 1     3      5      4      4      1      1     18
Node 2     9      1      0      3      2      3     18
Node 3     4      3      0      6      5      0     18
Total     16      9      4     13      8      4
FLK      4.66   2.93   1.88   2.70   2.71   1.66

• For K1 (hosted data on N1, N2, N3):
  • If partitioned to N2: 15, 25, 14, Fairness-Score = 4.9
  • If partitioned to N3: 15, 9, 30, Fairness-Score = 8.8
• Result: data transfer = 24/54, cv = 14%
(Figure: resulting placement of keys K1 to K6 on Data Node1, Data Node2 and Data Node3.)
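
The worked example can be traced end to end with the sketch below. It is not the authors' implementation: it assumes that keys are processed in descending FLK order, that the Fairness-Score is the population standard deviation of the data each node would host after a tentative assignment, and that ties keep node order. All function names are illustrative.

```python
from statistics import mean, pstdev, stdev

# Key counts per node (rows: Node1..Node3, columns: K1..K6).
FREQ = [
    [3, 5, 4, 4, 1, 1],
    [9, 1, 0, 3, 2, 3],
    [4, 3, 0, 6, 5, 0],
]

def flk(freqs):
    # Assumed FLK: population std dev divided by the best node's share of the key.
    return pstdev(freqs) / (max(freqs) / sum(freqs))

def hosted_if_assigned(hosted, freq, key, target):
    """Data hosted per node if `key` were partitioned to node `target`."""
    total = sum(row[key] for row in freq)
    return [
        h + total - row[key] if node == target else h - row[key]
        for node, (h, row) in enumerate(zip(hosted, freq))
    ]

def leen_partition(freq):
    nodes = range(len(freq))
    keys = range(len(freq[0]))
    hosted = [sum(row) for row in freq]  # initially, each node hosts its own data
    by_flk = sorted(keys, key=lambda k: flk([row[k] for row in freq]), reverse=True)
    assignment = {}
    for k in by_flk:
        # Candidate nodes, best locality for this key first.
        candidates = sorted(nodes, key=lambda n: freq[n][k], reverse=True)
        j = 0
        # Move on while the next candidate yields a strictly fairer distribution.
        while (j + 1 < len(candidates)
               and pstdev(hosted_if_assigned(hosted, freq, k, candidates[j]))
               > pstdev(hosted_if_assigned(hosted, freq, k, candidates[j + 1]))):
            j += 1
        assignment[k] = candidates[j]
        hosted = hosted_if_assigned(hosted, freq, k, candidates[j])
    return assignment, hosted

assignment, reduce_input = leen_partition(FREQ)
local = sum(FREQ[node][key] for key, node in assignment.items())
transfer = sum(map(sum, FREQ)) - local
cv = stdev(reduce_input) / mean(reduce_input)
print(assignment, reduce_input, transfer, f"{cv:.1%}")
```

Under these assumptions the sketch yields the slide's figures: 24 of 54 records transferred and a cv of about 14% (sample standard deviation), with the first step for K1 scoring roughly 4.97 for N2 against 8.83 for N3.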

Evaluation
• Cluster of 7 nodes
  • Intel Xeon, two quad-core 2.33 GHz
  • 8 GB memory
  • 1 TB disk
• Each node runs RHEL5 with kernel 2.6.22
• Xen 3.2
• Hadoop version 0.18.0
• Designed 6 test sets
  • Manipulate the degree of partitioning skew by modifying the existing textwriter code in Hadoop that generates the input data into HDFS

Test sets

Test set                                1        2         3        4         5        6
Number of nodes                         6 PMs    6 PMs     6 PMs    6 PMs     24 VMs   24 VMs
Data size                               14 GB    8 GB      4.6 GB   12.8 GB   6 GB     10.5 GB
Keys' frequencies variation             230%     1%        117%     230%      25%      85%
Key distribution variation (average)    1%       195%      150%     20%       180%     170%
Locality range                          24-26%   1-97.5%   1-85%    15-35%    1-50%    1-30%

• Test set 1: presence of keys' frequencies variation
• Test set 2: non-uniform key distribution amongst data nodes
• Test sets 3 to 6: partitioning skew

Keys' Frequencies Variation
• Each key is uniformly distributed among the data nodes
• Keys' frequencies vary significantly
• Locality range: 24-26%
(Figure annotations: 6%, 10x)

Non-Uniform Key Distribution
• Each key is non-uniformly distributed among the data nodes
• Keys' frequencies are nearly equal
• Locality range: 1-97.5%
(Figure annotation: 9%)

Partitioning Skew

Test set          3       4        5       6
Locality range    1-85%   15-35%   1-50%   1-30%

Conclusion
• Partitioning skew is a challenge for MapReduce-based applications:
  • Today's data-intensive applications are diverse
  • Social networks, search engines, scientific analysis, etc.
• Partitioning skew is due to two factors:
  • Significant variance in intermediate keys' frequencies
  • Significant variance in each intermediate key's distribution among the different data nodes
• Our solution is to extend the locality concept to the reduce phase
  • Partition the keys according to:
    • Their high frequencies
    • Fairness in data distribution among the different data nodes
• Up to 40% improvement using a simple application example
• Future work
  • Apply LEEN to different key and value sizes

Thank you! Questions ? shadi@hust.edu.cn http://grid.hust.edu.cn/shadi