CS535 Big Data 4/7/2020 Week 11-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1

CS535 BIG DATA

PART B. GEAR SESSIONS

SESSION 3: BIG GRAPH ANALYSIS

Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535

CS535 Big Data | Computer Science | Colorado State University

FAQs

  • The online GEAR presentation will be available on 4/6
  • You will have a 3-day discussion period on Piazza (4/6 – 4/8)


Topics of Today's Class

  • GraphX: Graph Processing in a Distributed Dataflow Framework
  • Part 1: Introduction and Graph parallelism
  • Part 2: Distributed Graph Representation
  • Part 3: Implementation of Distributed Graph Processing


GEAR Session 3. Big Graph Analysis

Lecture 2. Distributed Large Graph Analysis-II

GraphX: Graph Processing in a Distributed Dataflow Framework



This material is built based on

  • Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J. and Stoica, I., 2014. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 599–613.
  • Karypis, G. and Kumar, V., 1998. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48(1), pp. 96–129.
  • GraphX Programming Guide. https://spark.apache.org/docs/latest/graphx-programming-guide.html


Introduction

  • GraphX is a library built on top of Apache Spark for graphs and graph-parallel computation
  • Introduces a Graph abstraction
  • A directed multigraph with properties attached to each vertex and edge
  • Provides a set of graph operators
  • E.g. subgraph, joinVertices, and aggregateMessages
  • Provides an optimized variant of the Pregel API
  • Implements graph algorithms and builders
  • PageRank
  • Connected Components
  • Triangle Counting


Computational Challenges

  • Graph processing systems outperform general-purpose distributed dataflow frameworks with their own specialized optimization schemes
  • E.g. Pregel, PowerGraph, BLAS, Kineograph
  • However, graphs are often only a part of a larger analytics process
  • Combines graphs with unstructured and tabular data
  • Analytics pipelines are forced to compose multiple systems
  • Extra data movement and duplication
  • Fault tolerance
  • Designing graph processing systems on top of general-purpose distributed dataflow systems is needed


GEAR Session 3. Big Graph Analysis

Lecture 2. Distributed Large Graph Analysis-II

GraphX: Graph Processing in a Distributed Dataflow Framework

Distributed Dataflow Model and Optimization Schemes for Graph Processing


Dataflow Models - Traditional Network Programming

  • Message-passing between nodes (e.g. MPI)
  • Very difficult to do at scale
  • How to split the problem across nodes?
  • Network communication & data locality
  • How to deal with failures? (inevitable at scale)
  • Stragglers?
  • Node not failed but slow
  • Writing programs for each machine
  • Rarely used in commodity datacenters!


Dataflow Models – Modern distributed dataflow models

  • Restrict the programming interface
  • System can do more automatically
  • Express jobs as graphs of high-level operators
  • System picks how to split each operator into tasks and where to run each task
  • Run parts multiple times for fault recovery
  • Examples: MapReduce, Spark, Dryad, Storm, Pig, Hive…
  • Examples of dataflow operators
  • join, map, groupby, … most of the operators introduced in the Apache Spark discussion


Why did these graph processing systems evolve separately from distributed dataflow frameworks?

  • Early emphasis on single-stage computation and on-disk processing
  • Limited capability to handle iterative graph algorithms
  • These repeatedly and randomly access subsets of the graph
  • E.g. MapReduce
  • Early distributed dataflow frameworks did not support fine-grained control over data partitioning
  • Recent frameworks (e.g. Spark and Naiad) support in-memory representation and fine-grained control over data partitioning


Optimization used in GraphX

  • Encoding the graph as collections
  • Vertex-cut partitioning
  • Executing graph algorithms as common dataflow operators
  • Join optimizations
  • E.g. CSR indexing, join elimination, and join-site specification
  • Materialized view maintenance
  • Vertex mirroring and delta updates
  • Applying the above techniques, GraphX provides a new set of Spark dataflow operators for graph processing
  • Reducing memory overhead and improving system performance
  • Immutability: GraphX reuses indices across graph and collection views over multiple iterations


GEAR Session 3. Big Graph Analysis

Lecture 2. Distributed Large Graph Analysis-II

GraphX: Graph Processing in a Distributed Dataflow Framework

The Property Graphs as Collections and Executing Graph Algorithms


Property Graph

  • User-defined properties attached to each vertex and edge
  • Meta-data
  • e.g. user profiles and time stamps
  • Program state
  • e.g. the PageRank of vertices or inferred affinities
  • Applicable to natural phenomena such as social networks and web graphs
  • Often highly skewed
  • Power-law degree distributions
  • Orders of magnitude more edges than vertices


Transforming a Property Graph to a Pair of Collections

  • Vertex collection
  • Vertex properties (with a unique key: the vertex identifier)
  • Vertex identifiers are 64-bit integers
  • Derived externally (e.g. using a userID) or by applying a hash function to a vertex property (e.g. a URL)
  • Edge collection
  • Edge properties (with source and destination vertex identifiers)
  • Having a pair of collections enables the system to compute graph algorithms with existing dataflow operations
  • Join: adding additional vertex properties
  • Creating new collections: creating a new graph
  • E.g. maintaining one graph for PageRanks and another for membership information while sharing the same edge collection
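The two-collection encoding and the join-based derivation of a new graph can be sketched in plain Python (illustrative data structures, not the GraphX API):

```python
# A property graph encoded as a pair of collections (plain-Python sketch).
# Vertex collection: vertex id -> vertex property.
vertices = {1: {"name": "alice"}, 2: {"name": "bob"}, 3: {"name": "carol"}}

# Edge collection: (source id, destination id, edge property).
edges = [(1, 2, "follows"), (3, 2, "follows"), (2, 3, "follows")]

# Join: derive a new vertex collection carrying an additional property
# (e.g. a PageRank value) while sharing the same edge collection.
ranks = {1: 0.5, 2: 1.2, 3: 0.8}
ranked_vertices = {vid: {**props, "rank": ranks[vid]}
                   for vid, props in vertices.items()}
```

Both `vertices` and `ranked_vertices` describe graphs over the single shared `edges` list, mirroring the PageRank/membership example above.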


The Graph-Parallel Abstraction (Discussed in W10-A)

  • Iterative local transformations
  • E.g. the PageRank algorithm
  • Vertex program
  • The system launches the vertex program for each vertex; programs interact with adjacent vertex programs through messages (e.g. Pregel) or shared state (e.g. PowerGraph)
  • Example with the PageRank algorithm

PageRank in Pregel:

def PageRank(v: Id, msgs: List[Double]) {
  // Compute the message sum
  var msgSum = 0
  for (m <- msgs) { msgSum += m }
  // Update the PageRank
  PR(v) = 0.15 + 0.85 * msgSum
  // Broadcast messages with new PR
  for (j <- OutNbrs(v)) {
    msg = PR(v) / NumLinks(v)
    send_msg(to=j, msg)
  }
  // Check for termination
  if (converged(PR(v))) voteToHalt(v)
}


The Graph-Parallel Abstraction (Discussed in W10-A)

  • Advantage
  • Well-suited for iterative graph algorithms over the static neighborhood structure of the graph
  • Disadvantage
  • It cannot express computation where disconnected vertices interact
  • It cannot process computations that change the graph structure in the course of the computation


The GAS Decomposition

  • Gonzalez et al.1 observed that most vertex programs interact with neighboring vertices by collecting messages in the form of a generalized commutative, associative sum and then broadcasting new messages in an inherently parallel loop

1 Gonzalez, J.E., Low, Y., Gu, H., Bickson, D. and Guestrin, C. "PowerGraph: Distributed graph-parallel computation on natural graphs," OSDI '12, USENIX Association, pp. 17–30.


Types of graph computation [1/3]

  • Gather: the computation gathers information from neighboring vertices
  • e.g. the authority value in the HITS algorithm
  • e.g. the current PageRank value


Types of graph computation [2/3]

  • Apply: the vertex applies an update to the vertex property
  • e.g. update the authority value with the sum of new authority values after normalizing the value
  • e.g. add the passed PageRank values, normalize the sum, and update the current PageRank value


Types of graph computation [3/3]

  • Scatter: a vertex should send out information to neighboring vertices.


The GAS Decomposition

  • The GAS decomposition splits vertex programs into three data-parallel stages
  • Gather
  • Apply
  • Scatter

def PageRank(v: Id, msgs: List[Double]) {
  // Gather: compute the message sum
  var msgSum = 0
  for (m <- msgs) { msgSum += m }
  // Apply: update the PageRank
  PR(v) = 0.15 + 0.85 * msgSum
  // Scatter: broadcast messages with new PR
  for (j <- OutNbrs(v)) {
    msg = PR(v) / NumLinks(v)
    send_msg(to=j, msg)
  }
  // Check for termination
  if (converged(PR(v))) voteToHalt(v)
}
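One PageRank superstep under the GAS split can be written out in plain Python (a toy sketch over a hand-made three-vertex graph, not the GraphX implementation):

```python
# GAS decomposition of one PageRank superstep (plain-Python sketch).
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]   # directed edges (src, dst)
pr = {1: 1.0, 2: 1.0, 3: 1.0}              # current PageRank per vertex
out_deg = {v: sum(1 for s, _ in edges if s == v) for v in pr}

# Gather: collect messages from in-neighbors as a commutative, associative sum.
msg_sum = {v: 0.0 for v in pr}
for src, dst in edges:
    msg_sum[dst] += pr[src] / out_deg[src]

# Apply: update each vertex property from its gathered sum.
pr = {v: 0.15 + 0.85 * msg_sum[v] for v in pr}

# Scatter: each vertex redistributes pr[v] / out_deg[v] along its out-edges;
# in this sketch that happens in the next superstep's Gather pass.
```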


The GAS Decomposition

  • Pull-based model of message computation
  • The system asks the vertex program for the value of the message between adjacent vertices
  • Rather than the user sending messages directly from the vertex program
  • Therefore, vertex-cut partitioning is suitable for this style of computation
  • Limited communication pattern
  • Supports communication only between adjacent vertices


Graph Computation as Dataflow Ops.

  • Graph-parallel computation can be expressed as a sequence of join stages, group-by stages, and map operations
  • Join stage
  • Vertex and edge properties are joined to form the triplets view
  • Consists of each edge and its corresponding source and destination vertex properties
  • Group-by stage
  • The triplets are grouped by source or destination vertex to construct the neighborhood of each vertex and compute aggregates
  • Gathers messages destined to the same vertex
  • Map operation
  • Applies the final aggregated message for each vertex to update the vertex property
  • Join operation
  • Distributes the updated values to the vertices
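The join / group-by / map pipeline for one graph-parallel step can be rendered in plain Python (values follow the PageRank example; all names are illustrative):

```python
# One graph-parallel step as dataflow operators (plain-Python sketch).
vertices = {1: 1.0, 2: 1.0, 3: 1.0}                 # vertex property: current rank
edges = [(1, 2, None), (2, 3, None), (3, 1, None)]  # (src, dst, edge property)
out_deg = {}
for src, _, _ in edges:
    out_deg[src] = out_deg.get(src, 0) + 1

# Join stage: form the triplets view (edge plus source and destination properties).
triplets = [(src, dst, vertices[src], e, vertices[dst]) for src, dst, e in edges]

# Group-by stage: group messages by destination vertex and aggregate them.
msgs = {}
for src, dst, src_p, e, dst_p in triplets:
    msgs[dst] = msgs.get(dst, 0.0) + src_p / out_deg[src]

# Map operation: apply the aggregate to update each vertex property.
vertices = {v: 0.15 + 0.85 * msgs.get(v, 0.0) for v in vertices}
```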


Discussions

  • Assume that you implement the PageRank algorithm using the three stages in GraphX. Which stage will be applied for line 5?
  • a. Join stage
  • b. GroupBy stage
  • c. Map operation
  • d. All of the above

 0: def PageRank(v: Id, msgs: List[Double]) {
 1:   // Compute the message sum
 2:   var msgSum = 0
 3:   for (m <- msgs) { msgSum += m }
 4:   // Update the PageRank
 5:   PR(v) = 0.15 + 0.85 * msgSum
 6:   // Broadcast messages with new PR
 7:   for (j <- OutNbrs(v)) {
 8:     msg = PR(v) / NumLinks(v)
 9:     send_msg(to=j, msg)
10:   }
11:   // Check for termination
12:   if (converged(PR(v))) voteToHalt(v)
13: }


Discussions

  • Assume that you implement the PageRank algorithm using the three stages in GraphX. Which stage will be applied for line 3?
  • a. Join stage
  • b. GroupBy stage
  • c. Map operation
  • d. All of the above

 0: def PageRank(v: Id, msgs: List[Double]) {
 1:   // Compute the message sum
 2:   var msgSum = 0
 3:   for (m <- msgs) { msgSum += m }
 4:   // Update the PageRank
 5:   PR(v) = 0.15 + 0.85 * msgSum
 6:   // Broadcast messages with new PR
 7:   for (j <- OutNbrs(v)) {
 8:     msg = PR(v) / NumLinks(v)
 9:     send_msg(to=j, msg)
10:   }
11:   // Check for termination
12:   if (converged(PR(v))) voteToHalt(v)
13: }


The GAS Decomposition with GraphX

  • Gather
  • GroupBy stage
  • Apply
  • Map operation
  • Scatter
  • Join stage


Triplets view

  • Each edge and its corresponding source and destination vertex properties

(Figure: vertex and edge collections joined to form the triplets view)

Constructing triplets in SQL:

CREATE VIEW triplets AS
SELECT s.Id, d.Id, s.P, e.P, d.P
FROM edges AS e
JOIN vertices AS s JOIN vertices AS d
ON e.srcId = s.Id AND e.dstId = d.Id


GraphX Graph Operators

  • Transform vertex and edge collections
  • Graph constructor
  • Logically binds a pair of vertex and edge property collections into a property graph
  • Verifies integrity constraints: every vertex occurs only once and edges do not link to missing vertices
  • def Graph(v: Collection[(Id, V)], e: Collection[(Id, Id, E)])
  • Collection views
  • The vertices and edges operators expose the graph's vertex and edge property collections
  • The triplets operator returns the triplets view of the graph
  • def vertices: Collection[(Id, V)]
  • def edges: Collection[(Id, Id, E)]
  • def triplets: Collection[Triplet]
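The constructor's integrity checks can be sketched in plain Python (illustrative names, not the GraphX implementation):

```python
# Logically bind vertex and edge collections into a graph, verifying the
# integrity constraints the constructor enforces (plain-Python sketch).
def make_graph(vertices, edges):
    vids = [vid for vid, _ in vertices]
    # Every vertex occurs only once...
    assert len(vids) == len(set(vids)), "duplicate vertex id"
    vid_set = set(vids)
    # ...and edges do not link to missing vertices.
    for src, dst, _ in edges:
        assert src in vid_set and dst in vid_set, "edge links to missing vertex"
    return {"vertices": dict(vertices), "edges": list(edges)}

g = make_graph([(1, "a"), (2, "b")], [(1, 2, "e")])
```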


GraphX Graph Operators

  • Graph-parallel computation
  • The mrTriplets (MapReduce triplets) operator encodes the two-stage process of graph-parallel computation
  • Composes the map and group-by dataflow operators on the triplets view
  • A user-defined map function is applied to each triplet
  • Generates values and aggregates them at the destination vertex using a user-defined binary aggregation function
  • def mrTriplets(f: (Triplet) => M, sum: (M, M) => M): Collection[(Id, M)]
  • In SQL:

SELECT t.dstId, reduce(mapF(t)) AS msgSum
FROM triplets AS t
GROUP BY t.dstId


GraphX Graph Operators

  • Convenience functions
  • def mapV(f: (Id, V) => V): Graph[V, E]
  • def mapE(f: (Id, Id, E) => E): Graph[V, E]
  • def leftJoinV(v: Collection[(Id, V)], f: (Id, V, V) => V): Graph[V, E]
  • def leftJoinE(e: Collection[(Id, Id, E)], f: (Id, Id, E, E) => E): Graph[V, E]
  • def subgraph(vPred: (Id, V) => Boolean, ePred: (Triplet) => Boolean): Graph[V, E]
  • def reverse: Graph[V, E]


Example use of mrTriplets

(Figure: a six-user follower graph with ages 42, 23, 30, 75, 19, and 16; mapF emits 1 along the edge A→B because the source property 42 exceeds the target property 23, and the message is delivered to vertex B)

Compute the number of older followers for each user in a social network:

val graph: Graph[User, Double]
def mapUDF(t: Triplet[User, Double]) = ???  // What will be your computation here?
def reduceUDF(a: Int, b: Int): Int = a + b
val seniors: Collection[(Id, Int)] = graph.mrTriplets(mapUDF, reduceUDF)


Example use of mrTriplets

(Figure: resulting vertex collection for the example graph — B: 2, C: 1, D: 1, F: 3)

Compute the number of older followers for each user in a social network:

val graph: Graph[User, Double]
def mapUDF(t: Triplet[User, Double]) =
  if (t.src.age > t.dst.age) 1 else 0
def reduceUDF(a: Int, b: Int): Int = a + b
val seniors: Collection[(Id, Int)] = graph.mrTriplets(mapUDF, reduceUDF)
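The same computation can be emulated in plain Python. The ages match the slide's figure, but the edge list here is a hand-made illustration, not the figure's exact graph:

```python
# mrTriplets emulated in plain Python: map over triplets, then reduce the
# messages arriving at each destination vertex (illustrative, not GraphX).
def mr_triplets(vertices, edges, map_f, reduce_f):
    msgs = {}
    for src, dst in edges:
        m = map_f(vertices[src], vertices[dst])
        msgs[dst] = reduce_f(msgs[dst], m) if dst in msgs else m
    return msgs

# Vertex property: the user's age. An edge (a, b) means a follows b.
ages = {"A": 42, "B": 23, "C": 30, "D": 75, "E": 19, "F": 16}
follows = [("A", "B"), ("C", "B"), ("D", "C"), ("E", "F")]

older = mr_triplets(ages, follows,
                    map_f=lambda src_age, dst_age: 1 if src_age > dst_age else 0,
                    reduce_f=lambda a, b: a + b)
```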


Implementation of the Pregel abstraction using GraphX

  • Initializes the vertex properties with an additional field to track active vertices
  • While vertices are active, messages are computed using the mrTriplets operator
  • Edge-parallel map operation
  • Message computation
  • Commutative, associative aggregation

def Pregel(g: Graph[V, E],
           vprog: (Id, V, M) => V,
           sendMsg: (Triplet) => M,
           gather: (M, M) => M): Collection[V] = {
  // Set all vertices as active
  g = g.mapV((id, v) => (v, halt=false))
  // Loop until convergence
  while (g.vertices.exists(v => !v.halt)) {
    // Compute the messages
    val msgs: Collection[(Id, M)] =
      // Restrict to edges with active source
      g.subgraph(ePred=(s,d,sP,eP,dP)=>!sP.halt)
       // Compute messages
       .mrTriplets(sendMsg, gather)
    // Receive messages and run vertex program
    g = g.leftJoinV(msgs).mapV(vprog)
  }
  return g.vertices
}
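The same loop can be sketched in plain Python, using max-value propagation as the vertex program. This is illustrative: the vertex program here returns a (new value, vote-to-halt) pair, and `max_iter` is a safety cap not present in the slide's version:

```python
# Pregel-style loop (plain-Python sketch): active vertices exchange messages
# until every vertex votes to halt.
def pregel(vertices, edges, vprog, send_msg, gather, max_iter=50):
    halted = {v: False for v in vertices}
    for _ in range(max_iter):
        if all(halted.values()):
            break
        # Restrict to edges whose source is still active, then compute messages.
        msgs = {}
        for src, dst in edges:
            if not halted[src]:
                m = send_msg(src, vertices[src], dst, vertices[dst])
                msgs[dst] = gather(msgs[dst], m) if dst in msgs else m
        # Receive messages and run the vertex program on every vertex.
        for v in vertices:
            vertices[v], halted[v] = vprog(v, vertices[v], msgs.get(v))
    return vertices

# Example vertex program: propagate the maximum value through the graph.
def vprog(v, val, msg):
    if msg is None or msg <= val:
        return val, True      # no improvement: vote to halt
    return msg, False         # adopted a larger value: stay active

verts = {1: 3, 2: 6, 3: 2}
ring = [(1, 2), (2, 3), (3, 1)]
result = pregel(verts, ring, vprog,
                send_msg=lambda s, sv, d, dv: sv, gather=max)
```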


GEAR Session 3. Big Graph Analysis

Lecture 2. Distributed Large Graph Analysis-II

GraphX: Graph Processing in a Distributed Dataflow Framework

Distributed Representation of a Graph


Distributed Graph Representation

  • GraphX represents graphs internally as a pair of vertex and edge collections built on the Spark RDD abstraction
  • Indexing and graph-specific partitioning are layered on top of RDDs

(Figure: a six-vertex example graph split into three edge partitions (A, B, C) and two vertex partitions; each vertex partition carries a bitmask for vertex visibility and a routing table recording which edge partitions each vertex appears in)


Vertices and Edges

  • The vertex collection is hash-partitioned by vertex id
  • Vertices are stored in a local hash index within each partition
  • A bitmask stores the visibility of each vertex
  • Soft deletions promote index reuse
  • If vertex 5 and its adjacent edges are restricted from the graph, they are removed from the corresponding collections by updating the bitmasks
  • Later computation can reuse this index
  • Edges are divided into three edge partitions by applying a partition function
  • E.g. 2D partitioning
  • Vertices are partitioned by vertex id
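The bitmask-based soft deletion can be sketched in plain Python (illustrative data structures):

```python
# Soft deletion via a visibility bitmask (plain-Python sketch): restricting
# a subgraph flips bits instead of rebuilding the hash index.
vertex_index = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5}  # vertex id -> slot in partition
properties = ["a", "b", "c", "d", "e", "f"]           # stored once, shared across views
bitmask = [True] * 6                                  # visibility of each slot

# Restrict vertex 5 (in the full system, its adjacent edges as well):
bitmask[vertex_index[5]] = False

visible = [v for v, slot in vertex_index.items() if bitmask[slot]]
# The hash index and property store are untouched, so later views reuse them.
```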


Routing table

  • Encodes the edge partitions for each vertex
  • Join-site information is stored in the routing table
  • For the example graph: vertex partition A records {A: 1,2,3; B: 1; C: 1} and vertex partition B records {B: 4,5; C: 5,6}


Graph Partitioning: EdgePartition2D

  • Inspired by multilevel k-way partitioning1
  • 2D graph partitioning
  • Upper bound of 2√n − 1 on the vertex replication factor, where n is the number of partitions

1 Karypis, G. and Kumar, V. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48(1), 1998, pp. 96–129.


Graph Partitioning: EdgePartition2D

  • Consider a graph G = (V, E)
  • where V is the set of vertices and E is the set of edges
  • Every vertex in V has a vertex identifier and a vertex property
  • Every edge in E has source and destination vertex identifiers and an edge property
  • Goal
  • Create n partitions of G such that:
  • The partitions incur minimum communication
  • The workload is balanced


Step 1: Creating a partition table

If n is a perfect square:
  rows (# of rows) = √n
  cols (# of columns) = √n
If n is not a perfect square:
  cols = the ceiling of √n
  rows = the floor of (n + cols − 1) / cols
For example, if n = 27, cols = 6 and rows = 5; the last column holds only the remaining 27 − 5 × 5 = 2 partitions.

The partitions are arranged as a grid of rows × cols cells.


Step 2: Assigning vertices and edges

Vertex assignment
  Using an elementary modular hash: v % n
  Vertices are equally distributed among the partitions

Edge assignment
  The source vertex (src) is mapped to a column:
    col = (src × mixingPrime) % √n, if n is a perfect square
    col = ((src × mixingPrime) % n) / rows (integer division), otherwise
  where mixingPrime is a large prime number used to improve the balance of the edge distribution
  The destination vertex (dst) is mapped to a row:
    row = (dst × mixingPrime) % √n, if n is a perfect square
    row = (dst × mixingPrime) % rows, if n is not a perfect square and col < cols − 1
    row = (dst × mixingPrime) % lastColRows, otherwise
  where lastColRows = n − rows × (cols − 1) is the number of cells in the last column


Step 3: Storing edge properties

The edge property is stored in the partition with index:
  col × √n + row, if n is a perfect square
  col × rows + row, otherwise
where col and row are computed as in Step 2.

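The three steps above can be combined into a single partition function in plain Python, following the formulas on these slides (the mixing-prime value mirrors the one Spark uses, but treat the exact constant as illustrative; vertex ids are assumed non-negative):

```python
import math

# EdgePartition2D assignment (plain-Python sketch of the scheme above).
MIXING_PRIME = 1125899906842597

def edge_partition_2d(src, dst, num_parts):
    s = math.isqrt(num_parts)
    if s * s == num_parts:                    # perfect square: an s x s grid
        col = (src * MIXING_PRIME) % s
        row = (dst * MIXING_PRIME) % s
        return col * s + row
    cols = math.ceil(math.sqrt(num_parts))
    rows = (num_parts + cols - 1) // cols     # ceiling of num_parts / cols
    last_col_rows = num_parts - rows * (cols - 1)
    col = ((src * MIXING_PRIME) % num_parts) // rows
    row = (dst * MIXING_PRIME) % (rows if col < cols - 1 else last_col_rows)
    return col * rows + row

# All edges sharing a source vertex land in a single column.
pids = {edge_partition_2d(7, d, 25) for d in range(50)}
```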

Discussions

  • Let's locate a set of edges using EdgePartition2D
  • {(s, d1), (s, d2), (s, d3), (s, d4), (s, d5)} (sharing the same source vertex)
  • Where will they be located?
  • a. a single cell
  • b. a single row
  • c. a single column
  • d. randomly dispersed


Discussions

  • Let's locate a set of edges using EdgePartition2D
  • {(s, d1), (s, d2), (s, d3), (s, d4), (s, d5)} (sharing the same source vertex)
  • Where will they be located?
  • a. a single cell
  • b. a single row
  • c. a single column (answer: the source vertex determines the column)
  • d. randomly dispersed


Understanding the effect of EdgePartition2D

  • Let's locate an edge (vsrc, vdst)
  • All the edges where vsrc is the source vertex
  • Would be placed in the same column, col
  • Example:
  • If vsrc = 9 and mixingPrime = 3 for n = 25 partitions
  • col = (9 × 3) % 5 = 2
  • The actual cell is determined by the destination vertex
  • If vdst = 2 and mixingPrime = 3
  • row = (2 × 3) % 5 = 1
  • Therefore, the edge (vsrc, vdst) is stored in partition 11 = col × 5 + row (the cell in the 2nd row and 3rd column)


Understanding the effect of EdgePartition2D

  • A vertex with vertex id v can be in any cell of the column (v × mixingPrime) % √n
  • If it is a source vertex
  • Similarly, a vertex with vertex id v can be in any cell of the row (v × mixingPrime) % √n
  • If it is a destination vertex
  • Can a vertex v be in any other cells except the aforementioned set of cells?
  • No!


Understanding the effect of EdgePartition2D

  • Therefore, any edge containing v has to be placed in one of √n + √n − 1 = 2√n − 1 partitions
  • The upper bound on the vertex replication factor is 2√n − 1
  • This is directly related to the communication cost of synchronizing the vertex properties

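The 2√n − 1 bound can be checked numerically for a perfect-square n (a toy check; the mixing prime is dropped for clarity, since multiplying by a fixed prime does not change which cells a vertex can reach):

```python
import math

# Count the partitions that can hold copies of a single vertex when n = 25.
n = 25
s = math.isqrt(n)          # partitions form an s x s grid (s = sqrt(n))
v = 3                      # the vertex whose replicas we count

partitions = set()
for other in range(100):
    # v as a source: its column is fixed, the row follows the destination.
    partitions.add((v % s) * s + (other % s))
    # v as a destination: its row is fixed, the column follows the source.
    partitions.add((other % s) * s + (v % s))
# One column (s cells) plus one row (s cells) sharing exactly one cell.
```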
