
Graph Processing Frameworks
Lecture 24, CSCI 4974/6971, 5 Dec 2016

Today's Biz: 1. Reminders 2. Review 3. Graph Processing Frameworks 4. 2D Partitioning

Reminders: Assignment 6 due Dec 8th; Final Project


  1. Pregel Execution (3)
The master assigns a portion of the user's input to each worker. The input is treated as a set of records, each of which contains an arbitrary number of vertices and edges. After the input has finished loading, all vertices are marked as active.

  2. Pregel Execution (4)
The master instructs each worker to perform a superstep. The worker loops through its active vertices and calls Compute() for each one. It also delivers messages that were sent in the previous superstep. When the worker finishes, it responds to the master with the number of vertices that will be active in the next superstep.
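To make the superstep loop concrete, here is a minimal Python sketch of the worker's role; all names (run_superstep, compute, the active flag) are illustrative, not the real Pregel API:

# One superstep from a worker's point of view (illustrative sketch only).
def run_superstep(vertices, inbox):
    outbox = {}                      # messages destined for the next superstep
    for v in vertices:
        msgs = inbox.get(v.id, [])
        if v.active or msgs:         # an incoming message reactivates a vertex
            v.active = True
            v.compute(msgs, outbox)  # user code; may call v.vote_to_halt()
    num_active = sum(1 for v in vertices if v.active)
    return outbox, num_active        # the count is reported back to the master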

  3. Pregel Execution (diagram)

  4. Pregel Execution (diagram)

  5. Fault Tolerance
• Checkpointing is used to implement fault tolerance.
– At the start of every superstep, the master may instruct the workers to save the state of their partitions to stable storage.
– This includes vertex values, edge values, and incoming messages.
• The master uses "ping" messages to detect worker failures.

  6. Fault Tolerance
• When one or more workers fail, the current state of their associated partitions is lost.
• The master reassigns these partitions to the available set of workers.
– They reload their partition state from the most recent available checkpoint, which can be many supersteps old.
– The entire system is then restarted from that superstep.
• Confined recovery can be used to reduce this load.
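A rough Python sketch of checkpoint-based recovery may help; the file layout and function names are assumptions of this sketch, not Pregel's actual storage format:

import pickle

# Hypothetical illustration: persist a partition's state at a superstep.
def checkpoint(worker_id, superstep, partition_state, root="ckpt"):
    # State includes vertex values, edge values, and incoming messages.
    with open(f"{root}/w{worker_id}_s{superstep}.pkl", "wb") as f:
        pickle.dump(partition_state, f)

# After a failure, a reassigned worker reloads the most recent checkpoint;
# the whole system then re-runs from that (possibly much older) superstep.
def recover(worker_id, superstep, root="ckpt"):
    with open(f"{root}/w{worker_id}_s{superstep}.pkl", "rb") as f:
        return pickle.load(f)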

  7. Applications: PageRank

  8. PageRank
PageRank is a link analysis algorithm used to determine the importance of a document based on the number of references to it and the importance of the source documents themselves. [It was named after Larry Page, not after the rank of a web page.]

  9. PageRank
A = a given page
T_1 ... T_n = pages that point to page A (citations)
d = damping factor between 0 and 1 (usually 0.85)
C(T) = number of links going out of T
PR(A) = the PageRank of page A

PR(A) = (1 - d) + d * ( PR(T_1)/C(T_1) + PR(T_2)/C(T_2) + ... + PR(T_n)/C(T_n) )

  10. PageRank (illustration; courtesy: Wikipedia)

  11. PageRank
PageRank can be solved in two ways:
• A system of linear equations
• An iterative loop until convergence
We look at the pseudocode of the iterative version:

initial value of PageRank of all pages = 1.0;
while (sum of PageRank of all pages - numPages > epsilon) {
    for each page P_i in list {
        PageRank(P_i) = (1 - d);
        for each page P_j linking to page P_i {
            PageRank(P_i) += d * (PageRank(P_j) / numOutLinks(P_j));
        }
    }
}
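For concreteness, a small runnable Python version of the same loop; the adjacency-list representation is an assumption of this sketch, and convergence is tested on the total per-page change (a common variant of the slide's test):

# Iterative PageRank over an adjacency list {page: [pages it links to]}.
def pagerank(links, d=0.85, eps=1e-6):
    pr = {p: 1.0 for p in links}          # initial PageRank of all pages = 1.0
    while True:
        new = {p: (1 - d) for p in links}
        for src, outs in links.items():
            for dst in outs:
                new[dst] += d * pr[src] / len(outs)
        if sum(abs(new[p] - pr[p]) for p in links) <= eps:
            return new
        pr = new

# Tiny three-page example: A -> B, B -> A and C, C -> A.
print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))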

  12. PageRank in MapReduce – Phase 1: Parsing HTML
• The map task takes (URL, page content) pairs and maps them to (URL, (PR_init, list-of-urls))
– PR_init is the "seed" PageRank for URL
– list-of-urls contains all pages pointed to by URL
• The reduce task is just the identity function

  13. PageRank in MapReduce – Phase 2: PageRank Distribution
• The map task takes (URL, (cur_rank, url_list))
– For each u in url_list, emit (u, cur_rank/|url_list|)
– Emit (URL, url_list) to carry the points-to list along through iterations
• The reduce task gets (URL, url_list) and many (URL, val) values
– Sum the vals and fix up with d
– Emit (URL, (new_rank, url_list))

  14. PageRank in MapReduce – Finalize
• A non-parallelizable component determines whether convergence has been achieved
• If so, write out the PageRank lists – done
• Otherwise, feed the output of Phase 2 into another Phase 2 iteration
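A compact Python simulation of the Phase 2 map and reduce steps (plain dictionaries stand in for Hadoop's plumbing; the names here are assumptions of this sketch, not Hadoop code):

from collections import defaultdict

# One Phase 2 iteration over records of the form {URL: (cur_rank, url_list)}.
def phase2(records, d=0.85):
    partial = defaultdict(list)                  # "map" output: rank shares
    structure = {}
    for url, (rank, outs) in records.items():
        for u in outs:
            partial[u].append(rank / len(outs))  # emit (u, cur_rank/|url_list|)
        structure[url] = outs                    # carry the points-to list along
    new = {}
    for url, outs in structure.items():          # "reduce": sum and fix up with d
        new[url] = ((1 - d) + d * sum(partial.get(url, [])), outs)
    return new

records = {"A": (1.0, ["B"]), "B": (1.0, ["A", "C"]), "C": (1.0, ["A"])}
for _ in range(30):        # stand-in for the non-parallel convergence check
    records = phase2(records)
print({u: round(rank, 3) for u, (rank, _) in records.items()})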

  15. PageRank in Pregel

class PageRankVertex : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    // From superstep 1 on, sum the PageRank fractions arriving as messages.
    if (superstep() >= 1) {
      double sum = 0;
      for (; !msgs->Done(); msgs->Next())
        sum += msgs->Value();
      *MutableValue() = 0.15 + 0.85 * sum;
    }
    // Send rank shares along out-edges for 30 supersteps, then vote to halt.
    if (superstep() < 30) {
      const int64 n = GetOutEdgeIterator().size();
      SendMessageToAllNeighbors(GetValue() / n);
    } else {
      VoteToHalt();
    }
  }
};

  16. PageRank in Pregel
The Pregel implementation contains the PageRankVertex class, which inherits from the Vertex class. Its vertex value type is double, to store the tentative PageRank, and its message type is double, to carry PageRank fractions. The graph is initialized so that in superstep 0 the value of each vertex is 1.0.

  17. PageRank in Pregel
In each superstep, each vertex sends its tentative PageRank divided by the number of outgoing edges along each outgoing edge. Each vertex also sums the values arriving in messages into sum and sets its own tentative PageRank to 0.15 + 0.85 × sum. For convergence, either there is a limit on the number of supersteps or aggregators are used to detect convergence.
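The same computation in a minimal vertex-centric BSP loop, sketched in Python rather than the C++ API above; the driver, message delivery, and fixed superstep count are simplifying assumptions:

# Vertex-centric PageRank: per superstep, each vertex sums incoming messages,
# updates its value, and sends value/num_out_edges along every out-edge.
def pregel_pagerank(out_edges, supersteps=30, d=0.85):
    value = {v: 1.0 for v in out_edges}           # superstep-0 value is 1.0
    inbox = {v: [] for v in out_edges}
    for step in range(supersteps):
        outbox = {v: [] for v in out_edges}
        for v, outs in out_edges.items():
            if step >= 1:
                value[v] = (1 - d) + d * sum(inbox[v])
            for u in outs:
                outbox[u].append(value[v] / len(outs))
        inbox = outbox        # barrier: messages are delivered next superstep
    return value

print(pregel_pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))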

  18. Apache Giraph Large-scale Graph Processing on Hadoop Claudio Martella <claudio@apache.org> @claudiomartella Hadoop Summit @ Amsterdam - 3 April 2014


  20. Graphs are simple

  21. A computer network

  22. A social network

  23. A semantic network

  24. A map

  25. Graphs are huge
• Google's index contains 50B pages
• Facebook has around 1.1B users
• Google+ has around 570M users
• Twitter has around 530M users
VERY rough estimates!


  27. Graphs aren't easy

  28. Graphs are nasty.

  29. Each vertex depends on its neighbours, recursively.

  30. Recursive problems are nicely solved iteratively.

  31. PageRank in MapReduce
• Record: <v_i, pr, [v_j, ..., v_k]>
• Mapper: emits <v_j, pr / #neighbours>
• Reducer: sums the partial values

  32. MapReduce dataflow

  33. Drawbacks
• Each job is executed N times
• Job bootstrap
• Mappers send PR values and structure
• Extensive IO at input, shuffle & sort, output


  35. Timeline
• Inspired by Google Pregel (2010)
• Donated to ASF by Yahoo! in 2011
• Top-level project in 2012
• 1.0 release in January 2013
• 1.1 release in days (2014)

  36. Plays well with Hadoop

  37. Vertex-centric API

  38. BSP machine

  39. BSP & Giraph

  40. Advantages
• No locks: message-based communication
• No semaphores: global synchronization
• Iteration isolation: massively parallelizable

  41. Architecture

  42. Giraph job lifetime

  43. Designed for iterations
• Stateful (in-memory)
• Only intermediate values (messages) sent
• Hits the disk at input, output, checkpoint
• Can go out-of-core

  44. A bunch of other things
• Combiners (minimise messages)
• Aggregators (global aggregations)
• MasterCompute (executed on master)
• WorkerContext (executed per worker)
• PartitionContext (executed per partition)
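Conceptually, combiners and aggregators are both reductions applied at different scopes; a Python sketch of the two ideas (not Giraph's Java API):

# A combiner merges messages bound for the same vertex before they are sent,
# e.g. keeping only the minimum tentative distance in shortest paths:
def combine(messages):
    return [min(messages)]          # one message instead of many

# An aggregator reduces one value per vertex into a single global value that
# every vertex can read in the next superstep, e.g. total rank change:
def aggregate(values):
    return sum(values)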

  45.–49. Shortest Paths (worked example across five diagram slides)

  50. Composable API

  51. Checkpointing

  52. No SPoFs

  53. Giraph scales
ref: https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920

  54. Giraph is fast
• 100x over MR (PageRank)
• jobs run within minutes
• given you have resources ;-)

  55. Serialised objects

  56. Primitive types
• Autoboxing is expensive
• Object overhead (JVM)
• Use primitive types on your own
• Use primitive-type-based libs (e.g. fastutil)

  57. Sharded aggregators

  58. Many stores with Gora

  59. And graph databases

  60. Current and next steps
• Out-of-core graph and messages
• Jython interface
• Remove Writable from <I V E M>
• Partitioned supernodes
• More documentation

  61. GraphLab: A New Framework for Parallel Machine Learning
Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein
Presented by Guozhang Wang, DB Lunch, Nov 8, 2010

  62. Overview
• Programming ML Algorithms in Parallel
◦ Common Parallelism and MapReduce
◦ Global Synchronization Barriers
• GraphLab
◦ Data Dependency as a Graph
◦ Synchronization as Fold/Reduce
• Implementation and Experiments
• From Multicore to Distributed Environment

  63. Parallel Processing for ML
• Parallel ML is a Necessity
◦ 13 million Wikipedia pages
◦ 3.6 billion photos on Flickr
◦ etc.
• Parallel ML is Hard to Program
◦ Concurrency vs. deadlock
◦ Load balancing
◦ Debugging
◦ etc.

  64. MapReduce is the Solution?
• High-level abstraction: Statistical Query Model [Chu et al., 2006]
• Weighted linear regression needs only sufficient statistics:
θ = A⁻¹b, with A = Σᵢ wᵢ (xᵢ xᵢᵀ) and b = Σᵢ wᵢ (xᵢ yᵢ)
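Since A and b are plain sums over the data, each mapper can compute partial sums over its chunk and a reducer adds them; a NumPy sketch (function names are assumptions of this sketch):

import numpy as np

# theta = A^-1 b, with A = sum_i w_i x_i x_i^T and b = sum_i w_i x_i y_i.
def partial_stats(X, y, w):
    # Each "mapper" computes the partial sums over its chunk of the data.
    A = (w[:, None] * X).T @ X
    b = X.T @ (w * y)
    return A, b

def weighted_linear_regression(chunks):
    # The "reducer" adds the partial statistics and solves the small system.
    stats = [partial_stats(X, y, w) for X, y, w in chunks]
    A = sum(s[0] for s in stats)
    b = sum(s[1] for s in stats)
    return np.linalg.solve(A, b)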

  65. MapReduce is the Solution?
• High-level abstraction: Statistical Query Model [Chu et al., 2006]
• K-Means needs only data assignments: embarrassingly parallel, independent computation
class mean = avg(xᵢ), xᵢ in class
• No communication needed
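Likewise for k-means: each point's assignment depends only on that point and the current means, so chunks can be processed independently; a NumPy sketch (assumes no cluster goes empty):

import numpy as np

# One k-means step: assignments are independent per point ("map"),
# and each new class mean is just the average of its points ("reduce").
def kmeans_step(X, means):
    dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    assign = dists.argmin(axis=1)
    new_means = np.array([X[assign == k].mean(axis=0)
                          for k in range(len(means))])
    return assign, new_means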
