
LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud (PowerPoint Presentation)

  1. LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud. Shadi Ibrahim, Hai Jin, Lu Lu, Song Wu, Bingsheng He*, Qi Li#. Huazhong University of Science and Technology; *Nanyang Technological University; #China Development Bank

  2. Motivation
  - MapReduce is becoming very popular
    - Hadoop is widely used by enterprise and academia
      - Yahoo!, Facebook, Baidu, ...
      - Cornell, Maryland, HUST, ...
  - The wide diversity of today's data-intensive applications:
    - Search engines
    - Social networks
    - Scientific applications

  3. Motivation
  - Some applications have experienced data skew in the shuffle phase [1,2]
  - The current MapReduce implementations have overlooked the skew issue
  - Results:
    - Hash partitioning is inadequate in the presence of data skew
    - Design LEEN: locality- and fairness-aware key partitioning

  [1] X. Qiu, J. Ekanayake, S. Beason, T. Gunarathne, G. Fox, R. Barga, and D. Gannon, "Cloud technologies for bioinformatics applications", Proc. ACM Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS 2009), ACM Press, Nov. 2009.
  [2] J. Lin, "The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce", Proc. Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR'09), Jul. 2009.

  4. Outline
  - Motivation
  - Hash partitioning in MapReduce
  - LEEN: Locality- and Fairness-Aware Key Partitioning
  - Evaluation
  - Conclusion

  5. Hash partitioning in Hadoop
  - The current Hadoop hash partitioning works well when the keys appear equally often and are uniformly stored across the data nodes
  - In the presence of partitioning skew:
    - Variation in the intermediate keys' frequencies
    - Variation in each intermediate key's distribution amongst the different data nodes
  - The native, blind hash partitioning becomes inadequate and leads to:
    - Network congestion
    - Unfairness in the reducers' inputs
    - Reduce-computation skew
    - Performance degradation
  (A minimal sketch of the default hash rule follows this slide.)
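For reference, the "blind" rule criticized above is simply hash-then-modulo. Below is a minimal, self-contained Java sketch of that rule, mirroring the behaviour of Hadoop's default HashPartitioner; the class name and the small skewed example are illustrative, not taken from the slides.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of the hash-then-modulo rule used by Hadoop's default
// HashPartitioner: partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
public class BlindHashPartitioner {

    // Same rule as on slide 6: hash of the intermediate key modulo the number of reducers.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        Map<Integer, Integer> recordsPerReducer = new HashMap<>();
        // Skewed key frequencies: K1 dominates, as in the motivational example.
        String[] intermediateKeys = {"K1", "K1", "K1", "K1", "K1", "K1",
                                     "K2", "K2", "K3", "K4", "K5", "K6"};
        for (String k : intermediateKeys) {
            recordsPerReducer.merge(getPartition(k, reducers), 1, Integer::sum);
        }
        // The rule ignores both key frequency and where the records were generated,
        // so with skewed frequencies one reducer ends up with far more input.
        System.out.println("Records per reducer: " + recordsPerReducer);
    }
}
```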

  6. The Problem (Motivational Example)
  [Figure: three data nodes, each holding 18 intermediate records drawn from keys K1-K6, with skewed key frequencies and a non-uniform placement of each key across the nodes]
  - Hash partitioning rule: hashcode(intermediate key) modulo number of reduce tasks
  - Result of blind hash partitioning on this data:

                     Data Node 1   Data Node 2   Data Node 3
      Data transfer           11            15            18   (total 44/54)
      Reduce input            29            17             8   (cv = 58%)

  (The cv computation is spelled out after this slide.)
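As a sanity check on the numbers above: assuming cv denotes the sample standard deviation of the reducers' inputs divided by their mean (an assumption on my part, the slides do not define it), the 58% figure is reproduced:

```latex
\bar{x} = \frac{29 + 17 + 8}{3} = 18, \qquad
s = \sqrt{\frac{(29-18)^2 + (17-18)^2 + (8-18)^2}{3-1}} = \sqrt{111} \approx 10.5, \qquad
cv = \frac{s}{\bar{x}} \approx 0.58 = 58\%
```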

  7. Example: WordCount
  - 6-node cluster, 2 GB data set; 83% of the map output is transferred over the network
  - The combine function is disabled
  - The transferred data is relatively large
  - The data distribution is imbalanced: max-min ratio 20%, cv = 42%
  [Chart: map output and reduce input sizes (MB) for DataNode01-DataNode06, broken down into transferred data, local data, and data lost during a failed reduce]

  8. Our Work
  - Asynchronous map and reduce execution
  - Locality-aware and fairness-aware key partitioning: LEEN

  9. Asynchronous Map and Reduce Execution
  - Default Hadoop:
    - Several maps and reduces run concurrently on each data node
    - Overlaps computation and data transfer
  - Our approach:
    - Keep track of all the intermediate keys' frequencies and each key's distribution, using a DataNode-Keys Frequency Table (sketched after this slide)
    - May introduce a little overhead, since the network is unutilized during the map phase
    - Can speed up map execution, because the complete disk I/O resources are reserved for the map tasks
    - For example, the average execution time of map tasks (32 in default Hadoop, 26 using our approach)
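The slides name the DataNode-Keys Frequency Table but do not show its structure. The following is a small, hypothetical Java sketch of such a table (all class and method names here are illustrative, not from the LEEN implementation): for every intermediate key it records how many records each data node produced during the map phase.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a DataNode-Keys Frequency Table: per intermediate key,
// the number of records of that key generated on each data node.
public class KeyFrequencyTable {

    // key -> (data node id -> frequency of that key on that node)
    private final Map<String, Map<String, Long>> table = new HashMap<>();

    /** Called while map tasks run, once per emitted (key, node) record. */
    public void record(String key, String nodeId) {
        table.computeIfAbsent(key, k -> new HashMap<>())
             .merge(nodeId, 1L, Long::sum);
    }

    /** Frequencies of one key across all data nodes (empty map if unseen). */
    public Map<String, Long> frequencies(String key) {
        return table.getOrDefault(key, Map.of());
    }

    /** Total number of records observed for one key. */
    public long total(String key) {
        return frequencies(key).values().stream().mapToLong(Long::longValue).sum();
    }
}
```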

  10. LEEN Partitioning Algorithm
  - Extends the locality-aware concept to the reduce tasks
  - Considers a fair distribution of the reducers' inputs
  - Results:
    - Balanced distribution of the reducers' input
    - Minimized data transfer during the shuffle phase
    - Improved response time
  [Figure: LEEN targets a close-to-optimal trade-off between data locality (0-100%) and fairness of the reducers' inputs]

  11. LEEN Partitioning Algorithm (details)
  - Keys are sorted according to their fairness/locality value:
    - FLK_i = Fairness(K_i) / Locality(K_i), where Fairness(K_i) measures the fairness of K_i's distribution amongst the data nodes and Locality(K_i) is the locality at the node with the best locality for K_i
  - For each key, nodes are sorted in descending order of the frequency of that specific key
  - A key is partitioned to a node using the fairness-score value:
    - For a specific key K_i: if Fairness-Score(N_j) > Fairness-Score(N_j+1), move to the next node; else partition K_i to N_j
  (A sketch of this selection loop follows this slide.)
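Putting slides 11 and 12 together, the selection loop could look roughly like the Java sketch below. This is one reading of the slides, not the authors' code: FLK is taken as the standard deviation of a key's per-node frequencies divided by the fraction held by its best node, and the fairness score as the standard deviation of the per-node data volumes if the key were tentatively assigned to a given node (both assumptions reproduce the numbers on slide 12).

```java
import java.util.*;

// Rough, hypothetical sketch of the LEEN key-partitioning loop as read from
// slides 11-12. freq.get(key)[n] is the number of records of `key` generated
// on data node n (the DataNode-Keys Frequency Table).
public class LeenPartitioner {

    /** Population standard deviation, used here as the fairness measure (assumed). */
    static double stdDev(double[] x) {
        double mean = Arrays.stream(x).average().orElse(0);
        double var = Arrays.stream(x).map(v -> (v - mean) * (v - mean)).average().orElse(0);
        return Math.sqrt(var);
    }

    /** Assign every key to one data node (the node whose reducer will process it). */
    static Map<String, Integer> partition(Map<String, long[]> freq, int nodes) {
        // hosted[n] = data currently residing on node n: unassigned keys stay where
        // they were generated, assigned keys sit entirely on their chosen node.
        double[] hosted = new double[nodes];
        for (long[] f : freq.values())
            for (int n = 0; n < nodes; n++) hosted[n] += f[n];

        // Keys sorted by FLK value, descending.
        List<String> keys = new ArrayList<>(freq.keySet());
        keys.sort(Comparator.comparingDouble((String k) -> flk(freq.get(k))).reversed());

        Map<String, Integer> assignment = new HashMap<>();
        for (String key : keys) {
            long[] f = freq.get(key);
            long total = Arrays.stream(f).sum();
            // Candidate nodes in descending order of this key's local frequency.
            Integer[] order = new Integer[nodes];
            for (int n = 0; n < nodes; n++) order[n] = n;
            Arrays.sort(order, Comparator.comparingLong((Integer n) -> f[n]).reversed());

            int chosen = order[0];
            for (int j = 0; j + 1 < nodes; j++) {
                // Move to the next candidate only while it yields a lower fairness score.
                if (fairnessScore(hosted, f, total, order[j])
                        > fairnessScore(hosted, f, total, order[j + 1])) {
                    chosen = order[j + 1];
                } else {
                    chosen = order[j];
                    break;
                }
            }
            // Commit: all records of this key end up on the chosen node.
            for (int n = 0; n < nodes; n++) hosted[n] -= f[n];
            hosted[chosen] += total;
            assignment.put(key, chosen);
        }
        return assignment;
    }

    /** Std dev of the hosted data if the key (frequencies f, sum total) went to node j. */
    static double fairnessScore(double[] hosted, long[] f, long total, int j) {
        double[] h = hosted.clone();
        for (int n = 0; n < h.length; n++) h[n] -= f[n];
        h[j] += total;
        return stdDev(h);
    }

    /** FLK: variation of the key across nodes over its best achievable locality. */
    static double flk(long[] f) {
        long total = Arrays.stream(f).sum();
        long max = Arrays.stream(f).max().orElse(0);
        double[] d = Arrays.stream(f).asDoubleStream().toArray();
        return stdDev(d) / ((double) max / total);
    }
}
```

On the slide-12 example that follows, this sketch processes K1 first (it has the highest FLK, 4.66) and assigns it to Node 2, matching the worked step on that slide.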

  12. LEEN details (Example)
  - Intermediate-key frequencies per data node (same data as slide 6):

                K1    K2    K3    K4    K5    K6   Per-node total
      Node 1     3     5     4     4     1     1        18
      Node 2     9     1     0     3     2     3        18
      Node 3     4     3     0     6     5     0        18
      Total     16     9     4    13     8     4        54
      FLK     4.66  2.93  1.88  2.70  2.71  1.66

  - Worked step for K1 (candidate nodes in descending K1 frequency: N2, N3, N1):
    - If K1 goes to N2, the per-node data becomes 15, 25, 14 -> Fairness-Score = 4.9
    - If K1 goes to N3, the per-node data becomes 15, 9, 30 -> Fairness-Score = 8.8
    - 4.9 is not greater than 8.8, so by the rule on slide 11, K1 is partitioned to N2
  - Result with LEEN: data transfer = 24/54, cv of the reducers' inputs = 14%
  [Figure: the resulting placement of the intermediate keys on the three data nodes]
  (The FLK value for K1 is checked after this slide.)
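Under the same assumption as the sketch above (fairness = population standard deviation of the key's per-node frequencies, locality = fraction held by its best node), the tabulated FLK value for K1 checks out:

```latex
\mathrm{FLK}_{K_1}
  = \frac{\sqrt{\tfrac{(3 - 16/3)^2 + (9 - 16/3)^2 + (4 - 16/3)^2}{3}}}{9/16}
  \approx \frac{2.62}{0.5625}
  \approx 4.66
```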

  13. Evaluation
  - Cluster of 7 nodes:
    - Intel Xeon, two quad-core CPUs at 2.33 GHz
    - 8 GB memory
    - 1 TB disk
    - Each node runs RHEL 5 with kernel 2.6.22
    - Xen 3.2
  - Hadoop version 0.18.0
  - Designed 6 test sets
    - The degree of partitioning skew is manipulated by modifying the existing text-writer code in Hadoop that generates the input data into HDFS

  14. Test sets

      Test set                                  1        2        3        4        5        6
      Nodes                                 6 PMs    6 PMs    6 PMs    6 PMs   24 VMs   24 VMs
      Data size                             14 GB     8 GB   4.6 GB  12.8 GB     6 GB  10.5 GB
      Keys' frequencies variation            230%       1%     117%     230%      25%      85%
      Key distribution variation (average)     1%     195%     150%      20%     180%     170%
      Locality range                       24-26%  1-97.5%    1-85%   15-35%    1-50%    1-30%

  - Partitioning skew appears as: variation in the keys' frequencies, and non-uniform distribution of each key amongst the data nodes

  15. Keys' Frequencies Variation
  - Each key is uniformly distributed among the data nodes
  - Keys' frequencies vary significantly
  [Figure: results for test set 1, locality range 24-26%]

  16. Non-Uniform Key Distribution
  - Each key is non-uniformly distributed among the data nodes
  - Keys' frequencies are nearly equal
  [Figure: results for test set 2, locality range 1-97.5%]

  17. Partitioning Skew
  [Figure: results for test sets 3-6, with locality ranges of 1-85%, 15-35%, 1-50%, and 1-30% respectively]

  18. Conclusion
  - Partitioning skew is a challenge for MapReduce-based applications:
    - Today's diverse data-intensive applications: social networks, search engines, scientific analysis, etc.
  - Partitioning skew is due to two factors:
    - Significant variance in the intermediate keys' frequencies
    - Significant variance in each intermediate key's distribution among the different data nodes
  - Our solution extends the locality concept to the reduce phase:
    - Partition the keys according to their frequencies and the fairness of the data distribution among the different data nodes
  - Up to 40% improvement on a simple application example!
  - Future work:
    - Apply LEEN to different key and value sizes

  19. Thank you! Questions? shadi@hust.edu.cn http://grid.hust.edu.cn/shadi
