CS535 Big Data 4/7/2020 Week 11-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1

CS535 BIG DATA

PART B. GEAR SESSIONS

SESSION 3: BIG GRAPH ANALYSIS

Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535

CS535 Big Data | Computer Science | Colorado State University

FAQs

  • The online GEAR presentation will be available on 4/6
  • You will have a 3-day discussion period on Piazza (4/6 – 4/8)


Topics of Today's Class

  • GraphX: Graph Processing in a Distributed Dataflow Framework
  • Part 1: Introduction and Graph parallelism
  • Part 2: Distributed Graph Representation
  • Part 3: Implementation of Distributed Graph Processing


GEAR Session 3. Big Graph Analysis

Lecture 2. Distributed Large Graph Analysis-II

GraphX: Graph Processing in a Distributed Dataflow Framework



This material is built based on

  • Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J. and Stoica, I., 2014. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 599–613.
  • Karypis, G. and Kumar, V., 1998. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48(1), pp. 96–129.
  • GraphX Programming Guide. https://spark.apache.org/docs/latest/graphx-programming-guide.html


Introduction

  • GraphX is a library built on top of Apache Spark for graphs and graph-parallel computation
  • Introduces a Graph abstraction
  • A directed multigraph with properties attached to each vertex and edge
  • Provides a set of graph operators
  • E.g. subgraph, joinVertices, and aggregateMessages
  • Provides an optimized variant of the Pregel API
  • Implements graph algorithms and builders
  • PageRank
  • Connected Components
  • Triangle Counting


Computational Challenges

  • Graph processing systems outperform general-purpose distributed dataflow frameworks with their own specialized optimization schemes
  • E.g. Pregel, PowerGraph, BLAS, Kineograph
  • However, graphs are often only a part of a larger analytics process
  • Combines graphs with unstructured and tabular data
  • Analytics pipelines are forced to compose multiple systems
  • Extra data movement and duplication
  • Fault tolerance
  • Designing graph processing systems on top of general-purpose distributed dataflow systems is needed


GEAR Session 3. Big Graph Analysis

Lecture 2. Distributed Large Graph Analysis-II

GraphX: Graph Processing in a Distributed Dataflow Framework

Distributed Dataflow Model and Optimization Schemes for Graph Processing


Dataflow Models - Traditional Network Programming

  • Message-passing between nodes (e.g. MPI)
  • Very difficult to do at scale
  • How to split the problem across nodes?
  • Network communication & data locality
  • How to deal with failures? (inevitable at scale)
  • Stragglers?
  • Node not failed but slow
  • Writing programs for each machine
  • Rarely used in commodity datacenters!


Dataflow Models – Modern distributed dataflow models

  • Restrict the programming interface
  • System can do more automatically
  • Express jobs as graphs of high-level operators
  • System picks how to split each operator into tasks and where to run each task
  • Run parts multiple times for fault recovery
  • Examples: MapReduce, Spark, Dryad, Storm, Pig, Hive…
  • Examples of dataflow operators
  • join, map, groupby, … most of the operators introduced in the Apache Spark discussion


Why did these graph processing systems evolve separately from distributed dataflow frameworks?

  • Early emphasis on single-stage computation and on-disk processing
  • Limited capability to handle iterative graph algorithms
  • These repeatedly and randomly access subsets of the graph
  • E.g. MapReduce
  • Early distributed dataflow frameworks did not support fine-grained control over data partitioning
  • Recent frameworks (e.g. Spark and Naiad) support in-memory representation and fine-grained control over data partitioning


Optimization used in GraphX

  • Encoding the graph as collections
  • Vertex-cut partitioning
  • Executing graph algorithms as common dataflow operators
  • Join optimizations
  • E.g. CSR indexing, join elimination, and join-site specification
  • Materialized view maintenance
  • Vertex mirroring and delta updates
  • Applying the above techniques, GraphX provides a new set of Spark dataflow operators for graph processing
  • Reducing memory overhead and improving system performance
  • Immutability: GraphX reuses indices across graph and collection views over multiple iterations


GEAR Session 3. Big Graph Analysis

Lecture 2. Distributed Large Graph Analysis-II

GraphX: Graph Processing in a Distributed Dataflow Framework

The Property Graphs as Collections and Executing Graph Algorithms


Property Graph

  • User-defined properties attached to each vertex and edge
  • Meta-data
  • e.g. user profiles and time stamps
  • Program state
  • e.g. the PageRank of vertices or inferred affinities
  • Applicable to natural phenomena such as social networks and web graphs
  • Often highly skewed
  • Power-law degree distributions
  • Orders of magnitude more edges than vertices


Transforming a Property Graph to a Pair of Collections

  • Vertex collection
  • Vertex properties (with a unique key: the vertex identifier)
  • Vertex identifiers are 64-bit integers
  • Derived externally (e.g. using a userID) or by applying a hash function to a vertex property (e.g. a URL)
  • Edge collection
  • Edge properties (with source and destination vertex identifiers)
  • Having a pair of collections enables the system to compute graph algorithms with existing dataflow operations
  • Join: adding additional vertex properties
  • Creating new collections: creating a new graph
  • E.g. maintaining one graph for PageRanks and another for membership information while sharing the same edge collection
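The two-collection encoding and the join-based derivation of a new graph can be sketched in plain Python (illustrative data structures, not the GraphX API):

```python
# A property graph encoded as a pair of collections (plain-Python sketch).
# Vertex collection: vertex id -> vertex property.
vertices = {1: {"name": "alice"}, 2: {"name": "bob"}, 3: {"name": "carol"}}

# Edge collection: (source id, destination id, edge property).
edges = [(1, 2, "follows"), (3, 2, "follows"), (2, 3, "follows")]

# Join: derive a new vertex collection carrying an additional property
# (e.g. a PageRank value) while sharing the same edge collection.
ranks = {1: 0.5, 2: 1.2, 3: 0.8}
ranked_vertices = {vid: {**props, "rank": ranks[vid]}
                   for vid, props in vertices.items()}
```

Both `vertices` and `ranked_vertices` describe graphs over the single shared `edges` list, mirroring the PageRank/membership example above.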


The Graph-Parallel Abstraction (Discussed in W10-A)

  • Iterative local transformations
  • E.g. the PageRank algorithm
  • Vertex program
  • The system launches the vertex program for each vertex; programs interact with adjacent vertex programs through messages (e.g. Pregel) or shared state (e.g. PowerGraph)
  • Example with the PageRank algorithm

PageRank in Pregel:

def PageRank(v: Id, msgs: List[Double]) {
  // Compute the message sum
  var msgSum = 0
  for (m <- msgs) { msgSum += m }
  // Update the PageRank
  PR(v) = 0.15 + 0.85 * msgSum
  // Broadcast messages with new PR
  for (j <- OutNbrs(v)) {
    msg = PR(v) / NumLinks(v)
    send_msg(to=j, msg)
  }
  // Check for termination
  if (converged(PR(v))) voteToHalt(v)
}


The Graph-Parallel Abstraction (Discussed in W10-A)

  • Advantage
  • Well-suited for iterative graph algorithms over the static neighborhood structure of the graph
  • Disadvantage
  • It cannot express computation where disconnected vertices interact
  • It cannot process computations that change the graph structure in the course of the computation


The GAS Decomposition

  • Gonzalez et al.1 observed that most vertex programs interact with neighboring vertices by collecting messages in the form of a generalized commutative, associative sum and then broadcasting new messages in an inherently parallel loop

1 Gonzalez, J.E., Low, Y., Gu, H., Bickson, D. and Guestrin, C. "PowerGraph: Distributed graph-parallel computation on natural graphs," OSDI '12, USENIX Association, pp. 17–30.


Types of graph computation [1/3]

  • Gather: the computation gathers information from neighboring vertices
  • e.g. the authority value in the HITS algorithm
  • e.g. the current PageRank value


Types of graph computation [2/3]

  • Apply: the vertex applies an update to the vertex property
  • e.g. update the authority value with the sum of new authority values after normalizing the value
  • e.g. add the passed PageRank values, normalize the sum, and update the current PageRank value


Types of graph computation [3/3]

  • Scatter: a vertex should send out information to neighboring vertices.


The GAS Decomposition

  • The GAS decomposition splits vertex programs into three data-parallel stages
  • Gather
  • Apply
  • Scatter

def PageRank(v: Id, msgs: List[Double]) {
  // Gather: compute the message sum
  var msgSum = 0
  for (m <- msgs) { msgSum += m }
  // Apply: update the PageRank
  PR(v) = 0.15 + 0.85 * msgSum
  // Scatter: broadcast messages with new PR
  for (j <- OutNbrs(v)) {
    msg = PR(v) / NumLinks(v)
    send_msg(to=j, msg)
  }
  // Check for termination
  if (converged(PR(v))) voteToHalt(v)
}
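One PageRank superstep under the GAS split can be written out in plain Python (a toy sketch over a hand-made three-vertex graph, not the GraphX implementation):

```python
# GAS decomposition of one PageRank superstep (plain-Python sketch).
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]   # directed edges (src, dst)
pr = {1: 1.0, 2: 1.0, 3: 1.0}              # current PageRank per vertex
out_deg = {v: sum(1 for s, _ in edges if s == v) for v in pr}

# Gather: collect messages from in-neighbors as a commutative, associative sum.
msg_sum = {v: 0.0 for v in pr}
for src, dst in edges:
    msg_sum[dst] += pr[src] / out_deg[src]

# Apply: update each vertex property from its gathered sum.
pr = {v: 0.15 + 0.85 * msg_sum[v] for v in pr}

# Scatter: each vertex redistributes pr[v] / out_deg[v] along its out-edges;
# in this sketch that happens in the next superstep's Gather pass.
```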


The GAS Decomposition

  • Pull-based model of message computation
  • The system asks the vertex program for the value of the message between adjacent vertices
  • Rather than the user sending messages directly from the vertex program
  • Therefore, vertex-cut partitioning is suitable for this style of computation
  • Limited communication pattern
  • Supports communication only between adjacent vertices


Graph Computation as Dataflow Ops.

  • Graph-parallel computation can be expressed as a sequence of join stages, group-by stages, and map operations
  • Join stage
  • Vertex and edge properties are joined to form the triplets view
  • Consists of each edge and its corresponding source and destination vertex properties
  • Group-by stage
  • The triplets are grouped by source or destination vertex to construct the neighborhood of each vertex and compute aggregates
  • Gathers messages destined to the same vertex
  • Map operation
  • Applies the final aggregated message for each vertex to update the vertex property
  • Join operation
  • Distributes the updated values to the vertices
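The join / group-by / map pipeline for one graph-parallel step can be rendered in plain Python (values follow the PageRank example; all names are illustrative):

```python
# One graph-parallel step as dataflow operators (plain-Python sketch).
vertices = {1: 1.0, 2: 1.0, 3: 1.0}                 # vertex property: current rank
edges = [(1, 2, None), (2, 3, None), (3, 1, None)]  # (src, dst, edge property)
out_deg = {}
for src, _, _ in edges:
    out_deg[src] = out_deg.get(src, 0) + 1

# Join stage: form the triplets view (edge plus source and destination properties).
triplets = [(src, dst, vertices[src], e, vertices[dst]) for src, dst, e in edges]

# Group-by stage: group messages by destination vertex and aggregate them.
msgs = {}
for src, dst, src_p, e, dst_p in triplets:
    msgs[dst] = msgs.get(dst, 0.0) + src_p / out_deg[src]

# Map operation: apply the aggregate to update each vertex property.
vertices = {v: 0.15 + 0.85 * msgs.get(v, 0.0) for v in vertices}
```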


Discussions

  • Assume that you implement the PageRank algorithm using the three stages in GraphX. Which stage will be applied for line 5?
  • a. Join stage
  • b. GroupBy stage
  • c. Map operation
  • d. All of the above

 0: def PageRank(v: Id, msgs: List[Double]) {
 1:   // Compute the message sum
 2:   var msgSum = 0
 3:   for (m <- msgs) { msgSum += m }
 4:   // Update the PageRank
 5:   PR(v) = 0.15 + 0.85 * msgSum
 6:   // Broadcast messages with new PR
 7:   for (j <- OutNbrs(v)) {
 8:     msg = PR(v) / NumLinks(v)
 9:     send_msg(to=j, msg)
10:   }
11:   // Check for termination
12:   if (converged(PR(v))) voteToHalt(v)
13: }


Discussions

  • Assume that you implement the PageRank algorithm using the three stages in GraphX. Which stage will be applied for line 3?
  • a. Join stage
  • b. GroupBy stage
  • c. Map operation
  • d. All of the above

 0: def PageRank(v: Id, msgs: List[Double]) {
 1:   // Compute the message sum
 2:   var msgSum = 0
 3:   for (m <- msgs) { msgSum += m }
 4:   // Update the PageRank
 5:   PR(v) = 0.15 + 0.85 * msgSum
 6:   // Broadcast messages with new PR
 7:   for (j <- OutNbrs(v)) {
 8:     msg = PR(v) / NumLinks(v)
 9:     send_msg(to=j, msg)
10:   }
11:   // Check for termination
12:   if (converged(PR(v))) voteToHalt(v)
13: }


The GAS Decomposition with GraphX

  • Gather
  • GroupBy stage
  • Apply
  • Map operation
  • Scatter
  • Join stage


Triplets view

  • Each edge and its corresponding source and destination vertex properties

(Figure: vertex and edge collections joined to form the triplets view)

Constructing triplets in SQL:

CREATE VIEW triplets AS
SELECT s.Id, d.Id, s.P, e.P, d.P
FROM edges AS e
JOIN vertices AS s JOIN vertices AS d
ON e.srcId = s.Id AND e.dstId = d.Id


GraphX Graph Operators

  • Transform vertex and edge collections
  • Graph constructor
  • Logically binds a pair of vertex and edge property collections into a property graph
  • Verifies integrity constraints: every vertex occurs only once and edges do not link to missing vertices
  • def Graph(v: Collection[(Id, V)], e: Collection[(Id, Id, E)])
  • Collection views
  • The vertices and edges operators expose the graph's vertex and edge property collections
  • The triplets operator returns the triplets view of the graph
  • def vertices: Collection[(Id, V)]
  • def edges: Collection[(Id, Id, E)]
  • def triplets: Collection[Triplet]
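The constructor's integrity checks can be sketched in plain Python (illustrative names, not the GraphX implementation):

```python
# Logically bind vertex and edge collections into a graph, verifying the
# integrity constraints the constructor enforces (plain-Python sketch).
def make_graph(vertices, edges):
    vids = [vid for vid, _ in vertices]
    # Every vertex occurs only once...
    assert len(vids) == len(set(vids)), "duplicate vertex id"
    vid_set = set(vids)
    # ...and edges do not link to missing vertices.
    for src, dst, _ in edges:
        assert src in vid_set and dst in vid_set, "edge links to missing vertex"
    return {"vertices": dict(vertices), "edges": list(edges)}

g = make_graph([(1, "a"), (2, "b")], [(1, 2, "e")])
```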


GraphX Graph Operators

  • Graph-parallel computation
  • The mrTriplets (MapReduce triplets) operator encodes the two-stage process of graph-parallel computation
  • Composes the map and group-by dataflow operators on the triplets view
  • A user-defined map function is applied to each triplet
  • Generates values and aggregates them at the destination vertex using a user-defined binary aggregation function
  • def mrTriplets(f: (Triplet) => M, sum: (M, M) => M): Collection[(Id, M)]
  • In SQL:

SELECT t.dstId, reduce(mapF(t)) AS msgSum
FROM triplets AS t
GROUP BY t.dstId


GraphX Graph Operators

  • Convenience functions
  • def mapV(f: (Id, V) => V): Graph[V, E]
  • def mapE(f: (Id, Id, E) => E): Graph[V, E]
  • def leftJoinV(v: Collection[(Id, V)], f: (Id, V, V) => V): Graph[V, E]
  • def leftJoinE(e: Collection[(Id, Id, E)], f: (Id, Id, E, E) => E): Graph[V, E]
  • def subgraph(vPred: (Id, V) => Boolean, ePred: (Triplet) => Boolean): Graph[V, E]
  • def reverse: Graph[V, E]


Example use of mrTriplets

(Figure: a six-user follower graph with ages 42, 23, 30, 75, 19, and 16; mapF emits 1 along the edge A→B because the source property 42 exceeds the target property 23, and the message is delivered to vertex B)

Compute the number of older followers for each user in a social network:

val graph: Graph[User, Double]
def mapUDF(t: Triplet[User, Double]) = ???  // What will be your computation here?
def reduceUDF(a: Int, b: Int): Int = a + b
val seniors: Collection[(Id, Int)] = graph.mrTriplets(mapUDF, reduceUDF)


Example use of mrTriplets

(Figure: resulting vertex collection for the example graph — B: 2, C: 1, D: 1, F: 3)

Compute the number of older followers for each user in a social network:

val graph: Graph[User, Double]
def mapUDF(t: Triplet[User, Double]) =
  if (t.src.age > t.dst.age) 1 else 0
def reduceUDF(a: Int, b: Int): Int = a + b
val seniors: Collection[(Id, Int)] = graph.mrTriplets(mapUDF, reduceUDF)
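The same computation can be emulated in plain Python. The ages match the slide's figure, but the edge list here is a hand-made illustration, not the figure's exact graph:

```python
# mrTriplets emulated in plain Python: map over triplets, then reduce the
# messages arriving at each destination vertex (illustrative, not GraphX).
def mr_triplets(vertices, edges, map_f, reduce_f):
    msgs = {}
    for src, dst in edges:
        m = map_f(vertices[src], vertices[dst])
        msgs[dst] = reduce_f(msgs[dst], m) if dst in msgs else m
    return msgs

# Vertex property: the user's age. An edge (a, b) means a follows b.
ages = {"A": 42, "B": 23, "C": 30, "D": 75, "E": 19, "F": 16}
follows = [("A", "B"), ("C", "B"), ("D", "C"), ("E", "F")]

older = mr_triplets(ages, follows,
                    map_f=lambda src_age, dst_age: 1 if src_age > dst_age else 0,
                    reduce_f=lambda a, b: a + b)
```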


Implementation of the Pregel abstraction using GraphX

  • Initializes the vertex properties with an additional field to track active vertices
  • While vertices are active, messages are computed using the mrTriplets operator
  • Edge-parallel map operation
  • Message computation
  • Commutative, associative aggregation

def Pregel(g: Graph[V, E],
           vprog: (Id, V, M) => V,
           sendMsg: (Triplet) => M,
           gather: (M, M) => M): Collection[V] = {
  // Set all vertices as active
  g = g.mapV((id, v) => (v, halt=false))
  // Loop until convergence
  while (g.vertices.exists(v => !v.halt)) {
    // Compute the messages
    val msgs: Collection[(Id, M)] =
      // Restrict to edges with active source
      g.subgraph(ePred=(s,d,sP,eP,dP)=>!sP.halt)
       // Compute messages
       .mrTriplets(sendMsg, gather)
    // Receive messages and run vertex program
    g = g.leftJoinV(msgs).mapV(vprog)
  }
  return g.vertices
}
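The same loop can be sketched in plain Python, using max-value propagation as the vertex program. This is illustrative: the vertex program here returns a (new value, vote-to-halt) pair, and `max_iter` is a safety cap not present in the slide's version:

```python
# Pregel-style loop (plain-Python sketch): active vertices exchange messages
# until every vertex votes to halt.
def pregel(vertices, edges, vprog, send_msg, gather, max_iter=50):
    halted = {v: False for v in vertices}
    for _ in range(max_iter):
        if all(halted.values()):
            break
        # Restrict to edges whose source is still active, then compute messages.
        msgs = {}
        for src, dst in edges:
            if not halted[src]:
                m = send_msg(src, vertices[src], dst, vertices[dst])
                msgs[dst] = gather(msgs[dst], m) if dst in msgs else m
        # Receive messages and run the vertex program on every vertex.
        for v in vertices:
            vertices[v], halted[v] = vprog(v, vertices[v], msgs.get(v))
    return vertices

# Example vertex program: propagate the maximum value through the graph.
def vprog(v, val, msg):
    if msg is None or msg <= val:
        return val, True      # no improvement: vote to halt
    return msg, False         # adopted a larger value: stay active

verts = {1: 3, 2: 6, 3: 2}
ring = [(1, 2), (2, 3), (3, 1)]
result = pregel(verts, ring, vprog,
                send_msg=lambda s, sv, d, dv: sv, gather=max)
```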


GEAR Session 3. Big Graph Analysis

Lecture 2. Distributed Large Graph Analysis-II

GraphX: Graph Processing in a Distributed Dataflow Framework

Distributed Representation of a Graph


Distributed Graph Representation

  • GraphX represents graphs internally as a pair of vertex and edge collections built on the Spark RDD abstraction
  • Indexing and graph-specific partitioning are layered on top of RDDs

(Figure: a six-vertex example graph split into three edge partitions (A, B, C) and two vertex partitions; each vertex partition carries a bitmask for vertex visibility and a routing table recording which edge partitions each vertex appears in)


Vertices and Edges

  • The vertex collection is hash-partitioned by vertex id
  • Vertices are stored in a local hash index within each partition
  • A bitmask stores the visibility of each vertex
  • Soft deletions promote index reuse
  • If vertex 5 and its adjacent edges are restricted from the graph, they are removed from the corresponding collections by updating the bitmasks
  • Later computation can reuse this index
  • Edges are divided into three edge partitions by applying a partition function
  • E.g. 2D partitioning
  • Vertices are partitioned by vertex id
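The bitmask-based soft deletion can be sketched in plain Python (illustrative data structures):

```python
# Soft deletion via a visibility bitmask (plain-Python sketch): restricting
# a subgraph flips bits instead of rebuilding the hash index.
vertex_index = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5}  # vertex id -> slot in partition
properties = ["a", "b", "c", "d", "e", "f"]           # stored once, shared across views
bitmask = [True] * 6                                  # visibility of each slot

# Restrict vertex 5 (in the full system, its adjacent edges as well):
bitmask[vertex_index[5]] = False

visible = [v for v, slot in vertex_index.items() if bitmask[slot]]
# The hash index and property store are untouched, so later views reuse them.
```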


Routing table

  • Encodes the edge partitions for each vertex
  • Join-site information is stored in the routing table
  • For the example graph: vertex partition A records {A: 1,2,3; B: 1; C: 1} and vertex partition B records {B: 4,5; C: 5,6}


Graph Partitioning: EdgePartition2D

  • Inspired by multilevel k-way partitioning1
  • 2D graph partitioning
  • Upper bound of 2√n − 1 on the vertex replication factor, where n is the number of partitions

1 Karypis, G. and Kumar, V. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48(1), 1998, pp. 96–129.


Graph Partitioning: EdgePartition2D

  • Consider a graph G = (V, E)
  • where V is the set of vertices and E is the set of edges
  • Every vertex in V has a vertex identifier and a vertex property
  • Every edge in E has source and destination vertex identifiers and an edge property
  • Goal
  • Create n partitions of G such that:
  • The partitions incur minimum communication
  • The workload is balanced


Step 1: Creating a partition table

If n is a perfect square:
  rows (# of rows) = √n
  cols (# of columns) = √n
If n is not a perfect square:
  cols = the ceiling of √n
  rows = the floor of (n + cols − 1) / cols
For example, if n = 27, cols = 6 and rows = 5; the last column holds only the remaining 27 − 5 × 5 = 2 partitions.

The partitions are arranged as a grid of rows × cols cells.


Step 2: Assigning vertices and edges

Vertex assignment
  Using an elementary modular hash: v % n
  Vertices are equally distributed among the partitions

Edge assignment
  The source vertex (src) is mapped to a column:
    col = (src × mixingPrime) % √n, if n is a perfect square
    col = ((src × mixingPrime) % n) / rows (integer division), otherwise
  where mixingPrime is a large prime number used to improve the balance of the edge distribution
  The destination vertex (dst) is mapped to a row:
    row = (dst × mixingPrime) % √n, if n is a perfect square
    row = (dst × mixingPrime) % rows, if n is not a perfect square and col < cols − 1
    row = (dst × mixingPrime) % lastColRows, otherwise
  where lastColRows = n − rows × (cols − 1) is the number of cells in the last column


Step 3: Storing edge properties

The edge property is stored in the partition with index:
  col × √n + row, if n is a perfect square
  col × rows + row, otherwise
where col and row are computed as in Step 2.

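The three steps above can be combined into a single partition function in plain Python, following the formulas on these slides (the mixing-prime value mirrors the one Spark uses, but treat the exact constant as illustrative; vertex ids are assumed non-negative):

```python
import math

# EdgePartition2D assignment (plain-Python sketch of the scheme above).
MIXING_PRIME = 1125899906842597

def edge_partition_2d(src, dst, num_parts):
    s = math.isqrt(num_parts)
    if s * s == num_parts:                    # perfect square: an s x s grid
        col = (src * MIXING_PRIME) % s
        row = (dst * MIXING_PRIME) % s
        return col * s + row
    cols = math.ceil(math.sqrt(num_parts))
    rows = (num_parts + cols - 1) // cols     # ceiling of num_parts / cols
    last_col_rows = num_parts - rows * (cols - 1)
    col = ((src * MIXING_PRIME) % num_parts) // rows
    row = (dst * MIXING_PRIME) % (rows if col < cols - 1 else last_col_rows)
    return col * rows + row

# All edges sharing a source vertex land in a single column.
pids = {edge_partition_2d(7, d, 25) for d in range(50)}
```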

Discussions

  • Let's locate a set of edges using EdgePartition2D
  • {(s, d1), (s, d2), (s, d3), (s, d4), (s, d5)} (sharing the same source vertex)
  • Where will they be located?
  • a. a single cell
  • b. a single row
  • c. a single column
  • d. randomly dispersed


Discussions

  • Let's locate a set of edges using EdgePartition2D
  • {(s, d1), (s, d2), (s, d3), (s, d4), (s, d5)} (sharing the same source vertex)
  • Where will they be located?
  • a. a single cell
  • b. a single row
  • c. a single column (answer: the source vertex determines the column)
  • d. randomly dispersed


Understanding the effect of EdgePartition2D

  • Let's locate an edge (vsrc, vdst)
  • All the edges where vsrc is the source vertex
  • Would be placed in the same column, col
  • Example:
  • If vsrc = 9 and mixingPrime = 3 for n = 25 partitions
  • col = (9 × 3) % 5 = 2
  • The actual cell is determined by the destination vertex
  • If vdst = 2 and mixingPrime = 3
  • row = (2 × 3) % 5 = 1
  • Therefore, the edge (vsrc, vdst) is stored in partition 11 = col × 5 + row (the cell in the 2nd row and 3rd column)


Understanding the effect of EdgePartition2D

  • A vertex with vertex id v can be in any cell of the column (v × mixingPrime) % √n
  • If it is a source vertex
  • Similarly, a vertex with vertex id v can be in any cell of the row (v × mixingPrime) % √n
  • If it is a destination vertex
  • Can a vertex v be in any other cells except the aforementioned set of cells?
  • No!


Understanding the effect of EdgePartition2D

  • Therefore, any edge containing v has to be placed in one of √n + √n − 1 = 2√n − 1 partitions
  • The upper bound on the vertex replication factor is 2√n − 1
  • This is directly related to the communication cost of synchronizing the vertex properties

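The 2√n − 1 bound can be checked numerically for a perfect-square n (a toy check; the mixing prime is dropped for clarity, since multiplying by a fixed prime does not change which cells a vertex can reach):

```python
import math

# Count the partitions that can hold copies of a single vertex when n = 25.
n = 25
s = math.isqrt(n)          # partitions form an s x s grid (s = sqrt(n))
v = 3                      # the vertex whose replicas we count

partitions = set()
for other in range(100):
    # v as a source: its column is fixed, the row follows the destination.
    partitions.add((v % s) * s + (other % s))
    # v as a destination: its row is fixed, the column follows the source.
    partitions.add((other % s) * s + (v % s))
# One column (s cells) plus one row (s cells) sharing exactly one cell.
```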
