

  1. CS535 Big Data | Computer Science | Colorado State University
     4/7/2020, Week 11-A, Sangmi Lee Pallickara
     http://www.cs.colostate.edu/~cs535, Spring 2020

     CS535 BIG DATA
     PART B. GEAR SESSIONS
     SESSION 3: BIG GRAPH ANALYSIS
     Sangmi Lee Pallickara, Computer Science, Colorado State University

     FAQs
     • Online GEAR presentation will be available on 4/6
     • You will have 3 days of discussion period on Piazza
       • 4/6 ~ 4/8

  2. Topics of Today's Class
     • GraphX: Graph Processing in a Distributed Dataflow Framework
       • Part 1: Introduction and Graph parallelism
       • Part 2: Distributed Graph Representation
       • Part 3: Implementation of Distributed Graph Processing

     GEAR Session 3. Big Graph Analysis
     Lecture 2. Distributed Large Graph Analysis-II
     GraphX: Graph Processing in a Distributed Dataflow Framework

  3. This material is built based on
     • Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J. and Stoica, I., 2014. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (pp. 599-613).
     • Karypis, G. and Kumar, V., 1998. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48, 1, 96–129.
     • GraphX Programming Guide https://spark.apache.org/docs/latest/graphx-programming-guide.html

     Introduction
     • GraphX is a library built on top of Apache Spark for graphs and graph-parallel computation
     • Introduces a Graph abstraction
       • Directed multigraph with properties attached to each vertex and edge
     • Provides a set of graph operators
       • E.g. subgraph, joinVertices, and aggregateMessages
     • Provides an optimized variant of the Pregel API
     • Implements graph algorithms and builders
       • PageRank
       • Connected Components
       • Triangle Counting

  4. Computational Challenges
     • Graph processing systems outperform general-purpose distributed dataflow frameworks by using their own specialized optimization schemes
       • E.g. Pregel, PowerGraph, BLAS, Kineograph
     • Graphs are often only one part of a larger analytics process
       • Combines graphs with unstructured and tabular data
     • Analytics pipelines are forced to compose multiple systems
       • Extra data movement and duplication
       • Fault tolerance
     • A design for graph processing systems on top of general-purpose distributed dataflow systems is needed

     GEAR Session 3. Big Graph Analysis
     Lecture 2. Distributed Large Graph Analysis-II
     GraphX: Graph Processing in a Distributed Dataflow Framework
     Distributed Dataflow Model and Optimization Schemes for Graph Processing

  5. Dataflow Models - Traditional Network Programming
     • Message-passing between nodes (e.g. MPI)
     • Very difficult to do at scale
       • How to split the problem across nodes?
         • Network communication & data locality
       • How to deal with failures? (inevitable at scale)
       • Stragglers?
         • Node not failed but slow
       • Writing programs for each machine
     • Rarely used in commodity datacenters!

     Dataflow Models - Modern distributed dataflow models
     • Restrict the programming interface
       • System can do more automatically
     • Express jobs as graphs of high-level operators
       • System picks how to split each operator into tasks and where to run each task
       • Run parts multiple times for fault recovery
     • Examples: MapReduce, Spark, Dryad, Storm, Pig, Hive, ...
     • Examples of dataflow operators
       • join, map, groupby, ... most of the operators introduced in the Apache Spark discussion
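     To make "expressing a job as a graph of high-level operators" concrete, here is a minimal sketch in plain Python (not Spark) that computes out-degrees from an edge list using only a map step and a group-by step. The function names, the edge data, and the split into two partitions are all assumptions made for illustration.

```python
from collections import defaultdict

def map_partition(edges):
    """map: emit (source, 1) for every edge in one partition."""
    return [(src, 1) for src, _dst in edges]

def group_by_key(pairs):
    """groupby: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# Hypothetical edge collection, already split into two partitions ("tasks").
partitions = [
    [(1, 2), (1, 3)],
    [(2, 3), (3, 1), (1, 4)],
]

# Each partition is mapped independently of the others, so a failed or
# slow task could simply be re-run on its own partition.
mapped = [pair for part in partitions for pair in map_partition(part)]

# Group by source vertex and sum the ones to get each out-degree.
out_degree = {v: sum(ones) for v, ones in group_by_key(mapped).items()}
```

     Because the program only states which operators to apply, a real dataflow system is free to decide how many tasks to create and where to run them.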

  6. Why did these graph processing systems evolve separately from distributed dataflow frameworks?
     • Early emphasis on single-stage computation and on-disk processing
       • Limited capability to handle iterative graph algorithms
         • These repeatedly and randomly access subsets of the graph
       • E.g. MapReduce
     • Early distributed dataflow frameworks did not support fine-grained control over data partitioning
     • Recent frameworks (e.g. Spark and Naiad) support in-memory representation and fine-grained control over data partitioning

     Optimizations used in GraphX
     • Encoding graphs as collections
       • Vertex-cut partitioning
     • Executing graph algorithms as common dataflow operators
       • Join optimizations
         • E.g. CSR indexing, join elimination, and join-site specification
       • Materialized view maintenance
         • Vertex mirroring and delta updates
     • GraphX applies the above techniques and provides a new set of Spark dataflow operators for graph processing
     • Reducing memory overhead and improving system performance
       • Immutability: GraphX reuses indices across graph and collection views over multiple iterations
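     The slide names CSR (compressed sparse row) indexing without showing it. The sketch below is a generic CSR index over an edge list in plain Python, included only to illustrate the idea of clustering edges by source vertex so that a vertex's out-edges can be found without scanning the whole edge collection. It is not GraphX's internal implementation, and it assumes vertex ids are the dense integers 0..n-1.

```python
def build_csr(num_vertices, edges):
    """Build a CSR index (offsets, targets) from (src, dst) pairs.

    The out-edges of vertex v end up in targets[offsets[v]:offsets[v + 1]].
    """
    # Count the out-degree of every source vertex.
    offsets = [0] * (num_vertices + 1)
    for src, _dst in edges:
        offsets[src + 1] += 1
    # A prefix sum turns the counts into start positions.
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    # Place every destination into the slot range owned by its source.
    targets = [0] * len(edges)
    cursor = offsets[:-1]  # slicing copies, so offsets is not modified
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

def out_neighbors(offsets, targets, v):
    """Look up the out-neighbors of v with two index reads and one slice."""
    return targets[offsets[v]:offsets[v + 1]]

# Hypothetical edge list over vertices 0, 1, 2.
edges = [(0, 1), (2, 0), (0, 2), (1, 2)]
offsets, targets = build_csr(3, edges)
```

     Once the index is built it can be reused across iterations as long as the edge structure does not change, which is the situation the immutability bullet describes.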

  7. GEAR Session 3. Big Graph Analysis
     Lecture 2. Distributed Large Graph Analysis-II
     GraphX: Graph Processing in a Distributed Dataflow Framework
     The Property Graphs as Collections and Executing Graph Algorithms

     Property Graph
     • User-defined properties associated with each vertex and edge
       • Meta-data
         • E.g. user profiles and time stamps
       • Program state
         • E.g. the PageRank of vertices or inferred affinities
     • Applicable to natural phenomena such as social networks and web graphs
       • Often highly skewed
         • Power-law degree distributions
       • Orders of magnitude more edges than vertices

  8. Transforming a Property Graph to a Pair of Collections
     • Vertex collection
       • Vertex properties (with a unique key: Vertex Identifier)
       • Vertex Identifiers are 64-bit integers
         • Derived externally (e.g. using userID) or by applying a hash function to the vertex property (e.g. URL)
     • Edge collection
       • Edge properties (with source and destination vertex identifiers)
     • Having a pair of collections enables the system to compute graph algorithms with existing dataflow operations
       • Join: adding additional vertex properties
       • Creating new collections: creating a new graph
         • E.g. maintaining a graph for PageRanks and another graph for membership information while sharing the same edge collection

     The Graph-Parallel Abstraction (Discussed in W10-A)
     • Iterative local transformations
       • E.g. PageRank algorithm
     • Vertex program
       • Launches the vertex program for each vertex and interacts with adjacent vertex programs through messages (e.g. Pregel), or shared state (e.g. PowerGraph)
     • Example with the PageRank algorithm

     PageRank in Pregel (pseudocode):

       def PageRank(v: Id, msgs: List[Double]) {
         // Compute the message sum
         var msgSum = 0.0
         for (m <- msgs) { msgSum += m }
         // Update the PageRank
         PR(v) = 0.15 + 0.85 * msgSum
         // Broadcast messages with new PR
         for (j <- OutNbrs(v)) {
           msg = PR(v) / NumLinks(v)
           send_msg(to=j, msg)
         }
         // Check for termination
         if (converged(PR(v))) voteToHalt(v)
       }
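     The pair-of-collections representation described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the GraphX API: the URLs, property names, and helper functions are hypothetical, and SHA-256 truncated to 8 bytes stands in for whatever hash function a real system would use to derive 64-bit vertex identifiers.

```python
import hashlib

def vertex_id(key):
    """Derive a 64-bit vertex identifier by hashing a vertex property (e.g. a URL)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def join_vertices(vertices, new_props):
    """Join: build a NEW vertex collection with extra properties merged in by id."""
    return {vid: {**props, **new_props.get(vid, {})}
            for vid, props in vertices.items()}

# Hypothetical web graph.
urls = ["a.example", "b.example", "c.example"]
ids = {u: vertex_id(u) for u in urls}

# Vertex collection: vertex id -> vertex properties.
vertices = {ids[u]: {"url": u} for u in urls}

# Edge collection: (source id, destination id, edge properties).
edges = [
    (ids["a.example"], ids["b.example"], {"anchor": "see b"}),
    (ids["b.example"], ids["c.example"], {"anchor": "see c"}),
]

# Two graphs with different vertex properties that share ONE edge collection.
ranks = {vid: {"rank": 1.0} for vid in vertices}
groups = {ids["a.example"]: {"group": "news"}}
rank_graph = (join_vertices(vertices, ranks), edges)
membership_graph = (join_vertices(vertices, groups), edges)
```

     Because join_vertices returns a new collection instead of mutating its input, the original vertex collection and the shared edge collection stay valid for both graphs.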

  9. The Graph-Parallel Abstraction (Discussed in W10-A)
     • Advantage
       • Well-suited for iterative graph algorithms that rely on the static neighborhood structure of the graph
     • Disadvantage
       • It cannot express computation where disconnected vertices interact
       • It cannot express computation that changes the graph structure in the course of the computation

     The GAS Decomposition
     • Gonzalez et al. [1] observed that most vertex programs interact with neighboring vertices by collecting messages in the form of a generalized commutative associative sum and then broadcasting new messages in an inherently parallel loop

     [1] Gonzalez, J.E., Low, Y., Gu, H., Bickson, D. and Guestrin, C. "PowerGraph: Distributed graph-parallel computation on natural graphs," OSDI '12, USENIX Association, pp. 17–30.
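     As a rough illustration of the gather, sum, apply pattern behind the GAS decomposition, the sketch below runs PageRank in plain Python with the same update rule as the Pregel pseudocode (0.15 + 0.85 * message sum). It is a single-machine sketch, not PowerGraph or GraphX code: the function name, the fixed iteration count, and the synchronous update are assumptions, and termination checking is omitted.

```python
from functools import reduce

def pagerank_gas(edges, num_iters=20):
    """Synchronous PageRank written as gather / sum / apply / scatter steps."""
    vertices = sorted({v for edge in edges for v in edge})
    out_deg = {v: 0 for v in vertices}
    in_nbrs = {v: [] for v in vertices}
    for src, dst in edges:
        out_deg[src] += 1
        in_nbrs[dst].append(src)

    rank = {v: 1.0 for v in vertices}
    for _ in range(num_iters):
        new_rank = {}
        for v in vertices:
            # Gather: one message per in-edge, computed from the neighbor's state.
            msgs = (rank[u] / out_deg[u] for u in in_nbrs[v])
            # Sum: addition is commutative and associative, so the messages
            # could be combined in any order and on any machine.
            total = reduce(lambda a, b: a + b, msgs, 0.0)
            # Apply: update the vertex state from the combined message.
            new_rank[v] = 0.15 + 0.85 * total
        # Scatter: the new values become visible to neighbors next iteration.
        rank = new_rank
    return rank

# Hypothetical 3-vertex graph in which every vertex has an out-edge.
ranks = pagerank_gas([(1, 2), (1, 3), (2, 3), (3, 1)])
```

     The loop over vertices has no dependencies between iterations of the inner body, which is what the slide means by an inherently parallel loop.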
