Apache Giraph: Large-scale Graph Processing on Hadoop


  1. Apache Giraph Large-scale Graph Processing on Hadoop Claudio Martella <claudio@apache.org> @claudiomartella

  2. 2

  3. Graphs are simple 3

  4. A computer network 4

  5. A social network 5

  6. A semantic network 6

  7. A map 7

  8. Predicting break-ups: graph approach vs. aggregation approach 8

  9. Graphs are nasty. 9

  10. Each vertex depends on its neighbours, recursively. 10

  11. Recursive problems are nicely solved iteratively. 11

  12. 12

  13. PageRank in MapReduce • Record: < v_i, pr, [ v_j, ..., v_k ] > • Mapper: emits < v_j, pr / #neighbours > • Reducer: sums the partial values 13
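The record/mapper/reducer recipe above can be simulated in a few lines of Python. This is a single-process sketch, with an in-memory dict standing in for Hadoop's shuffle & sort; the damping factor and vertex count are illustrative assumptions not given on the slide:

```python
from collections import defaultdict

def mapper(vertex, pr, neighbours):
    # Re-emit the structure so the next iteration can rebuild its records,
    # plus a partial rank for each neighbour.
    yield vertex, ('structure', neighbours)
    for n in neighbours:
        yield n, ('partial', pr / len(neighbours))

def reducer(vertex, values, damping=0.85, num_vertices=3):
    # Sum the partial values and reattach the adjacency list.
    neighbours, total = [], 0.0
    for kind, v in values:
        if kind == 'structure':
            neighbours = v
        else:
            total += v
    return vertex, (1 - damping) / num_vertices + damping * total, neighbours

def one_iteration(records):
    grouped = defaultdict(list)      # stand-in for shuffle & sort
    for vertex, pr, neighbours in records:
        for key, value in mapper(vertex, pr, neighbours):
            grouped[key].append(value)
    return [reducer(v, vals) for v, vals in grouped.items()]

# Toy 3-cycle: uniform ranks are already the fixed point.
graph = [('a', 1/3, ['b']), ('b', 1/3, ['c']), ('c', 1/3, ['a'])]
```

Running `one_iteration(graph)` repeatedly is exactly the "each job is executed N times" pattern the next slide criticises: every pass re-ships the adjacency lists through the shuffle.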

  14. MapReduce dataflow 14

  15. Drawbacks • Each job is executed N times • Job bootstrap • Mappers send PR values and structure • Extensive IO at input, shuffle & sort, output 15

  16. 16

  17. Timeline • Inspired by Google Pregel (2010) • Donated to ASF by Yahoo! in 2011 • Top-level project in 2012 • 1.0 release in January 2013 • 1.1 release in November 2014 17

  18. Plays well with Hadoop 18

  19. Vertex-centric API 19

  20. Shortest Paths 20

  21. Shortest Paths 21

  22. Shortest Paths 22

  23. Shortest Paths 23

  24. Shortest Paths 24

  25. Code 25

      def compute(vertex, messages):
          minValue = float('inf')
          for m in messages:
              minValue = min(minValue, m)
          if minValue < vertex.getValue():
              vertex.setValue(minValue)
              for edge in vertex.getEdges():
                  message = minValue + edge.getValue()
                  sendMessage(edge.getTargetId(), message)
          vertex.voteToHalt()
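To see the compute() logic above in action, here is a minimal single-process simulation of the BSP superstep loop on a hypothetical three-vertex weighted graph. Plain dicts stand in for Giraph's workers and message delivery; this is a sketch of the model, not the Giraph runtime:

```python
import math
from collections import defaultdict

# Toy graph: vertex -> [(target, edge_weight)], plus per-vertex distance.
edges = {'s': [('a', 1), ('b', 4)], 'a': [('b', 1)], 'b': []}
value = {v: math.inf for v in edges}
inbox = defaultdict(list)
inbox['s'].append(0)               # seed the source vertex with distance 0

while any(inbox.values()):         # run supersteps until no messages remain
    outbox = defaultdict(list)
    for v, messages in inbox.items():
        min_value = min(messages)
        if min_value < value[v]:   # same logic as compute() above
            value[v] = min_value
            for target, weight in edges[v]:
                outbox[target].append(min_value + weight)
    inbox = outbox                 # barrier: messages arrive next superstep

# value is now {'s': 0, 'a': 1, 'b': 2}
```

Note that b first learns distance 4 via the direct edge and is corrected to 2 one superstep later, which is why vertices that voted to halt are reactivated by incoming messages.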

  26. 26

  27. 27

  28. 28

  29. 29

  30. BSP & Giraph 30

  31. Advantages • No locks: message-based communication • No semaphores: global synchronization • Iteration isolation: massively parallelizable 31

  32. Designed for iterations • Stateful (in-memory) • Only intermediate values (messages) sent • Hits the disk at input, output, checkpoint • Can go out-of-core 32

  33. Giraph job lifetime 33

  34. Architecture 34

  35. Composable API 35

  36. Checkpointing 36

  37. No SPoFs 37

  38. Giraph scales ref: https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920 38

  39. Giraph is fast • 100x over MapReduce (PageRank) • jobs run within minutes • given you have resources ;-) 39

  40. Serialised objects 40

  41. Primitive types • Autoboxing is expensive • Objects carry JVM overhead • Use primitive types directly • Use primitive-type collections (e.g. fastutil) 41

  42. Sharded aggregators 42
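The idea behind sharded aggregators can be sketched as follows. This is an illustrative Python model, not the Giraph API: each worker produces partial aggregates over its own vertices, and each aggregator is assigned to an owner shard, so combining is spread across workers instead of funnelled through the master:

```python
from collections import defaultdict

def shard_owner(agg_name, num_workers):
    # Deterministically assign each aggregator to one worker shard.
    return hash(agg_name) % num_workers

def aggregate(worker_partials, num_workers):
    """worker_partials: one {aggregator_name: partial_value} dict per worker."""
    shards = defaultdict(list)
    for partials in worker_partials:
        for name, value in partials.items():
            shards[(shard_owner(name, num_workers), name)].append(value)
    # Each owner combines only its own shard's partials
    # (here: sum; real aggregators define their own commutative combine op).
    return {name: sum(vals) for (_, name), vals in shards.items()}
```

With a single unsharded aggregator, every worker's partial flows to one node; hashing aggregator names across shards bounds the fan-in at any one worker.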

  43. Okapi • Apache Mahout for graphs • Graph-based recommenders: ALS, SGD, SVD++, etc. • Graph analytics: Graph partitioning, Community Detection, K-Core, etc. 43

  44. Thank you http://giraph.apache.org <claudio@apache.org> @claudiomartella
