Apache Giraph Large-scale Graph Processing on Hadoop Claudio - - PowerPoint PPT Presentation



SLIDE 1

Apache Giraph

Large-scale Graph Processing on Hadoop

Claudio Martella <claudio@apache.org> @claudiomartella

SLIDE 2

SLIDE 3

Graphs are simple

SLIDE 4

A computer network

SLIDE 5

A social network

SLIDE 6

A semantic network

SLIDE 7

A map

SLIDE 8

Predicting break ups


Aggregation approach vs. graph approach

SLIDE 9

Graphs are nasty.

SLIDE 10

Each vertex depends on its neighbours, recursively.

SLIDE 11

Recursive problems are nicely solved iteratively.

SLIDE 12

SLIDE 13

PageRank in MapReduce

  • Record: < v_i, pr, [ v_j, ..., v_k ] >
  • Mapper: emits < v_j, pr / #neighbours >
  • Reducer: sums the partial values
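
The record, mapper and reducer above can be sketched as one in-memory PageRank iteration. This is a minimal sketch, not Hadoop code: the toy graph, the "structure"/"rank" tags and the run_iteration helper are assumptions made for illustration, and no damping factor is applied because the slide does not mention one.

```python
from collections import defaultdict

def mapper(vertex, pr, neighbours):
    # Re-emit the adjacency list so the next iteration still has the graph.
    yield vertex, ("structure", neighbours)
    # Emit < v_j, pr / #neighbours > for every neighbour v_j.
    for target in neighbours:
        yield target, ("rank", pr / len(neighbours))

def reducer(vertex, values):
    # Sum the partial rank values and recover the adjacency list.
    neighbours, total = [], 0.0
    for kind, payload in values:
        if kind == "structure":
            neighbours = payload
        else:
            total += payload
    return vertex, total, neighbours

def run_iteration(records):
    # Stand-in for shuffle & sort: group mapper output by key.
    shuffled = defaultdict(list)
    for vertex, pr, neighbours in records:
        for key, value in mapper(vertex, pr, neighbours):
            shuffled[key].append(value)
    return [reducer(v, values) for v, values in sorted(shuffled.items())]

# Hypothetical toy graph: a -> b, c; b -> c; c -> a
records = [("a", 1 / 3, ["b", "c"]), ("b", 1 / 3, ["c"]), ("c", 1 / 3, ["a"])]
for _ in range(20):  # each pass corresponds to one full MapReduce job
    records = run_iteration(records)
ranks = {vertex: pr for vertex, pr, _ in records}
print(ranks)
```

Note that the mapper has to ship the adjacency list through the shuffle on every pass just to keep the graph available for the next job.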

SLIDE 14

MapReduce dataflow

SLIDE 15

Drawbacks

  • Each job is executed N times
  • Job bootstrap
  • Mappers send PR values and structure
  • Extensive IO at input, shuffle & sort, output

SLIDE 16

SLIDE 17

Timeline

  • Inspired by Google Pregel (2010)
  • Donated to ASF by Yahoo! in 2011
  • Top-level project in 2012
  • 1.0 release in January 2013
  • 1.1 release in November 2014

SLIDE 18

Plays well with Hadoop

SLIDE 19

Vertex-centric API

SLIDE 20

Shortest Paths

SLIDE 21

Shortest Paths

SLIDE 22

Shortest Paths

SLIDE 23

Shortest Paths

SLIDE 24

Shortest Paths

SLIDE 25

Code

    def compute(vertex, messages):
        # Smallest distance proposed by any incoming message.
        minValue = float('inf')
        for m in messages:
            minValue = min(minValue, m)
        # Update and propagate only when a shorter path was found.
        if minValue < vertex.getValue():
            vertex.setValue(minValue)
            for edge in vertex.getEdges():
                message = minValue + edge.getValue()
                sendMessage(edge.getTargetId(), message)
        # Go idle until a new message arrives.
        vertex.voteToHalt()
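
To see the vertex program run, here is a small single-process simulation of the superstep loop. It is only a sketch under stated assumptions: the toy graph, the shortest_paths helper and the dict-based inbox/outbox are invented for illustration and are not Giraph's API. The source is seeded with a message carrying distance 0, and every other vertex starts at infinity.

```python
import math

# Hypothetical toy graph: vertex -> {target: edge weight}
EDGES = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"c": 2.0, "d": 6.0},
    "c": {"d": 3.0},
    "d": {},
}

def shortest_paths(edges, source):
    """Run the vertex program in synchronous supersteps until no messages remain."""
    value = {v: math.inf for v in edges}    # per-vertex state, kept in memory
    inbox = {source: [0.0]}                 # seed: the source receives distance 0
    supersteps = 0
    while inbox:                            # every vertex halted and no messages: done
        outbox = {}
        for v, messages in inbox.items():   # only vertices with messages wake up
            min_value = min(messages)
            if min_value < value[v]:
                value[v] = min_value
                for target, weight in edges[v].items():
                    outbox.setdefault(target, []).append(min_value + weight)
            # Implicit voteToHalt(): v stays idle unless a message arrives.
        inbox = outbox                      # barrier: delivery happens next superstep
        supersteps += 1
    return value, supersteps

distances, supersteps = shortest_paths(EDGES, "a")
print(distances, supersteps)
```

Messages written in one superstep are only read in the next, so vertices never share mutable state and no locks are needed.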

SLIDE 26

SLIDE 27

SLIDE 28

SLIDE 29

SLIDE 30

BSP & Giraph

SLIDE 31

Advantages

  • No locks: message-based communication
  • No semaphores: global synchronization
  • Iteration isolation: massively parallelizable

SLIDE 32

Designed for iterations

  • Stateful (in-memory)
  • Only intermediate values (messages) sent

  • Hits the disk at input, output, checkpoint
  • Can go out-of-core

SLIDE 33

Giraph job lifetime

SLIDE 34

Architecture

SLIDE 35

Composable API

SLIDE 36

Checkpointing

SLIDE 37

No SPoFs

SLIDE 38

Giraph scales


ref: https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920

SLIDE 39

Giraph is fast

  • 100x faster than MapReduce (PageRank)
  • Jobs run within minutes
  • Given you have the resources ;-)

SLIDE 40

Serialised objects

SLIDE 41

Primitive types

  • Autoboxing is expensive
  • Objects overhead (JVM)
  • Use primitive types on your own
  • Use libraries built on primitive types (e.g. fastutil)

SLIDE 42

Sharded aggregators

SLIDE 43

Okapi

  • Apache Mahout for graphs
  • Graph-based recommenders: ALS, SGD, SVD++, etc.
  • Graph analytics: graph partitioning, community detection, k-core, etc.

SLIDE 44

Thank you

<claudio@apache.org> @claudiomartella http://giraph.apache.org