Tackling Large Graphs with Secondary Storage. Amitabha Roy, EPFL.

  1. Tackling Large Graphs with Secondary Storage. Amitabha Roy, EPFL.

  2. Graphs • Social networks • Document networks • Biological networks • Humans, phones, bank accounts

  3. Graphs are Difficult • Graph mining is a challenging problem • Traversal leads to data-dependent accesses • Little predictability • Hard to parallelize efficiently

  4. Tackling Large Graphs • Normal approach: throw resources at the problem • What does it take to process a trillion edges?

  5. Big Iron • HPC/Graph500 benchmarks (June 2014) • Tsubame: 1 trillion edges • Cray: 1 trillion edges • Blue Gene: 1 trillion edges • NEC: 1 trillion edges

  6. Large Clusters • Avery Ching, Facebook @Strata, 2/13/2014: yes, using 3940 machines

  7. Big Data • Data is growing exponentially • 40 zettabytes by 2020 • Unlikely you can put it all in DRAM • Need PM, SSDs, magnetic disks • Secondary storage != DRAM • Also applicable to graphs

  8. Motivation • If I can store the graph, why can't I process it? • 32 machines x 2 TB magnetic disk = 64 TB of storage • 1 trillion edges x 16 bytes per edge = 16 TB of storage
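
A quick back-of-envelope check of the two numbers on this slide (a minimal sketch; the 16 bytes per edge comes straight from the slide, not from any particular on-disk format):

```python
# Back-of-envelope check: the cluster's disks comfortably hold a trillion-edge graph.
TB = 10**12                       # 1 terabyte in bytes (decimal)

cluster_capacity = 32 * 2 * TB    # 32 machines x 2 TB magnetic disk each
graph_size = 10**12 * 16          # 1 trillion edges x 16 bytes per edge

print(f"cluster capacity: {cluster_capacity / TB:.0f} TB")   # 64 TB
print(f"graph size:       {graph_size / TB:.0f} TB")         # 16 TB
```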

  9. Problem #1 • Irregular access patterns (figure: example six-vertex graph)

  10. Problem #1 • Random access penalties (chart: roughly 1.4X for RAM, 20X for SSD, 200X for disk) • 2 ms seeks on a graph with a trillion edges ~ 1 year!

  11. Problem #2 • Partitioning graphs across machines is hard • Random partitions are very poor for real-world graphs • Twitter graph: 20X difference with 32 machines!

  12. Outline • X-Stream (addresses problem #1) • SlipStream (addresses problem #2)

  13. X-Stream • Single-machine graph processing system [SOSP'13] • Turns graph processing into sequential access • Changes the computation model • Changes the partitioning of the graph

  14. Scatter-Gather • The existing computational model (figure: example six-vertex graph)

  15. Scatter-Gather • Activate a vertex (figure: one vertex of the example graph is activated)

  16. Scatter-Gather • Scatter updates (figure: updates sent along the active vertex's out-edges)

  17. Scatter-Gather • Gather updates (figure: updates applied at the destination vertices)

  18. Storage • Vertices (1-6) and an edge list (1 → 5, 1 → 6, 6 → 2, 6 → 4) for the example graph (figure)

  19. Edge File • The same example, with the edge list (1 → 5, 1 → 6, 6 → 2, 6 → 4) stored as an edge file (figure)

  20. Edge File • Vertex-centric access requires a SEEK between each vertex's group of edges (figure: 1 → 5, 1 → 6, then a seek to 6 → 2, 6 → 4)
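
For contrast with the scan introduced on the next slides, here is a minimal sketch of the vertex-centric access pattern this slide illustrates: reading the edges of each vertex through an index costs a seek into the on-disk edge file per vertex. The CSR-style index and 8-byte edge records are assumptions made for illustration, not X-Stream's actual format.

```python
import struct, tempfile

# Write the slide's edge file to disk: each edge is two 32-bit ints (src, dst).
edges = [(1, 5), (1, 6), (6, 2), (6, 4)]
edge_file = tempfile.NamedTemporaryFile(delete=False)
for e in edges:
    edge_file.write(struct.pack("ii", *e))
edge_file.close()

# CSR-style index: (first edge offset, edge count) per source vertex.
index = {1: (0, 2), 6: (2, 2)}

def edges_of(vertex):
    """Vertex-centric access: one SEEK into the edge file per vertex visited."""
    start, count = index[vertex]
    with open(edge_file.name, "rb") as f:
        f.seek(start * 8)                       # the seek the slide highlights
        data = f.read(count * 8)
    return [struct.unpack("ii", data[i:i + 8]) for i in range(0, len(data), 8)]

print(edges_of(6))   # [(6, 2), (6, 4)]
```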

  21. Edge-centric Scatter-Gather • Scan the entire edge list sequentially (figure: SCAN over 1 → 5, 1 → 6, 6 → 2, 6 → 4)

  22. Edge-centric Scatter-Gather • Use only the necessary edges during the scan (figure)
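
A minimal sketch of the edge-centric scatter-gather loop these slides describe, run on the slide's example graph. The edge list lives in memory here, whereas X-Stream streams it from storage, and the BFS-style update rule is just one illustrative choice of scatter/gather pair.

```python
# Edge-centric scatter-gather: every iteration SCANs the whole edge list
# sequentially instead of seeking to the edges of each active vertex.
edges = [(1, 5), (1, 6), (6, 2), (6, 4)]          # the slide's example graph
dist = {v: None for v in range(1, 7)}             # per-vertex state (BFS level)
dist[1] = 0                                       # activate vertex 1
active = {1}

while active:
    # Scatter: scan all edges, but only active sources produce updates.
    updates = [(dst, dist[src] + 1) for (src, dst) in edges if src in active]
    # Gather: apply updates to vertex state; newly reached vertices become active.
    active = set()
    for dst, d in updates:
        if dist[dst] is None or d < dist[dst]:
            dist[dst] = d
            active.add(dst)

print(dist)   # {1: 0, 2: 2, 3: None, 4: 2, 5: 1, 6: 1}
```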

  23. Tradeoff • ✔ Achieve sequential bandwidth • ✖ Need to scan the entire edge list • A winning tradeoff!

  24. Winning Tradeoff • Real-world graphs have small diameter • Traversals complete in just a few iterations of scatter-gather • Large number of active vertices in most iterations

  25. Benefit • The scan is order oblivious: any ordering of the edge list works (figure: SCAN over 1 → 5, 6 → 4, 1 → 6, 6 → 2)

  26. What about the vertices? • Scanning the edge file still forces seeks into the vertex array (figure)

  27. What about the vertices? • Seeking in RAM is free! • How can we fit the vertices in RAM?

  28. Streaming Partitions • Split the vertices into subsets that fit in RAM; each edge goes with its source vertex's subset (figure: vertices 1-3 with edges 1 → 5, 1 → 6, 2 → 3, 3 → 5; vertices 4-6 with edges 6 → 2, 6 → 4)

  29. Streaming Partitions • Load one partition's vertices into RAM, then SCAN its edges sequentially (figure)

  30. Producing Partitions • No requirement on quality (# of cross edges) • Need only fit into RAM • Random partitions work great
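
A minimal sketch of producing streaming partitions along these lines: vertices are assigned to partitions at random and every edge follows its source vertex. The two-partition count and helper names are illustrative; in practice the count is chosen so each vertex subset fits in RAM.

```python
import random

def make_partitions(vertices, edges, num_partitions, seed=0):
    """Randomly assign vertices to partitions; each edge follows its source."""
    rng = random.Random(seed)
    owner = {v: rng.randrange(num_partitions) for v in vertices}
    parts = [{"vertices": [], "edges": []} for _ in range(num_partitions)]
    for v in vertices:
        parts[owner[v]]["vertices"].append(v)
    for src, dst in edges:
        parts[owner[src]]["edges"].append((src, dst))   # no quality requirement
    return parts

vertices = range(1, 7)
edges = [(1, 5), (1, 6), (2, 3), (3, 5), (6, 2), (6, 4)]
for i, p in enumerate(make_partitions(vertices, edges, num_partitions=2)):
    print(i, p["vertices"], p["edges"])
```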

  31. Algorithms Supported • Traversal algorithms: BFS, WCC, MIS, SCC, K-Cores, SSSP, BC • Algebraic operations on the graph: BP, ALS, SpMV, PageRank • Good testbed for newer streaming algorithms: HyperANF, semi-streaming triangle counting

  32. Competition • GraphChi: another on-disk graph processing system [OSDI'12] • Special on-disk data structure: shards • Makes accesses look sequential • Producing shards requires sorting edges

  33. SSD (chart: time in seconds, GraphChi including sharding vs X-Stream total time, on Netflix/ALS, Twitter/PageRank, and RMAT27/WCC)

  34. More Competition • The idea applies to any two-level memory, including CPU cache and DRAM • Main-memory graph processing? • Looked at Ligra (PPoPP 2013)

  35. BFS (chart: time in seconds for Ligra vs X-Stream, scaling from 1 to 16 CPUs)

  36. BFS (chart: the same comparison with Ligra's setup time included)

  37. Where we stand (chart: edges processed vs machines used, from 1 machine to 300 machines) • Pregel [SIGMOD'10]: 1 trillion edges • PowerGraph [OSDI'12]: 100 billion edges • X-Stream [SOSP'13] and Ligra [PPoPP'13]: about 10 billion edges on 1 machine • How do we get further? Scale out

  38. SlipStream • Aggregates the bandwidth and storage of a cluster • Solves the graph partitioning problem • Rethinks storage access • Rethinks streaming partition execution • We know how to do it right for one machine

  39. Scaling Out • Assign different streaming partitions to different machines • Graph partitioning is hard to get right

  40. Load Imbalance (figure: Red and Blue machines, each processing its own streaming partitions)

  41. Load Imbalance (figure: one machine is still working on a streaming partition while the other sits idle)

  42. Flat Storage • Stripe data across all disks • Allow any machine to access any disk • ✔ Balances capacity ✔ Balances bandwidth (figure: Red and Blue machines with streaming partitions)

  43. Flat Storage • Stripe data across all disks • Allow any machine to access any disk (figure: Red and Blue machines running streaming partitions on top of a shared flat storage box)

  44. Flat Storage • Assumes a full-bisection-bandwidth network • Can be done at data-center scale: Nightingale et al., OSDI 2012, using Clos switches • Already true at rack scale, as in our cluster

  45. Flat Storage (figure: Red and Blue machines processing streaming partitions over the flat storage box)

  46. Flat Storage • With one machine idle we use only half the available bandwidth (figure: the Red machine works while the other sits idle over the flat storage box)

  47. Extracting Parallelism • Edge-centric loop: stream in edges/updates, access vertices • What if… • Independent copies of the vertices on each machine

  48. Extracting Parallelism (figure: a scan of edges feeding scatter/gather against the vertex state)

  49. Scatter Step (figure: scan the edges, scatter updates against the vertices)

  50. Scatter Step (figure: machine 1 and machine 2 each scan a share of the edges from the flat storage box and scatter into their own copies of the vertices)

  51. Gather Step (figure: machine 1 and machine 2 each scan a share of the updates from the flat storage box and gather into their own copies of the vertices)

  52. Merge Step • Application of updates is commutative, so the per-machine vertex copies are merged directly • No need to go to disk (figure: machine 1 and machine 2 vertex copies merged into one)

  53. X-Stream to SlipStream • SlipStream graph algorithms = X-Stream graph algorithms + a merge function • The merge function is easy to write (it looks like gather)
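
A minimal sketch of what such a merge function could look like for one concrete algorithm, connected components by label propagation, where the per-vertex state is a component label and merging keeps the smaller one. The dictionary-based state and function names are illustrative assumptions, not SlipStream's interface.

```python
def merge(vertex_copies):
    """Combine per-machine vertex states; like gather, the merge is commutative."""
    merged = {}
    for copy in vertex_copies:
        for v, label in copy.items():
            # For connected components, merging two copies keeps the smaller label.
            merged[v] = label if v not in merged else min(merged[v], label)
    return merged

machine1 = {1: 1, 2: 2, 3: 1, 4: 4}   # vertex -> component label after gather
machine2 = {1: 1, 2: 1, 3: 3, 4: 2}
print(merge([machine1, machine2]))     # {1: 1, 2: 1, 3: 1, 4: 2}
```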

  54. Putting it Together (figure: the Red streaming partitions on one machine over the flat storage box)

  55. Putting it Together (figure: the Red partition's vertices are copied to a second machine)

  56. Putting it Together • ✔ Back to full bandwidth (figure: both machines work on Red streaming partitions over the flat storage box)

  57. Automatic Load Balancing (figure: a compute box of interchangeable machines on top of the flat storage box)

  58. Recap • Graph partitioning across machines is hard • Drop locality using flat storage: make it look like one disk • Run the same streaming partition on multiple nodes • Extract full bandwidth from the aggregated disks • A systems approach to solving an algorithms problem

  59. Flat Storage • The distributed storage layer for SlipStream • Looked at other designs: FDS (OSDI 2012), GFS (SOSP 2003), … • Implementing distributed storage is hard ☹

  60. The Hard Bit (figure: a client asks the storage layer to store block X)

  61. The Hard Bit • Where is block X? • Need a location service: f: (file, block) → (machine, offset)
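
For concreteness, one hypothetical way a conventional location service could implement that mapping is round-robin striping of fixed-size blocks across machines. The block size, machine count, and striping rule are all assumptions for illustration; this is exactly the machinery SlipStream gets to avoid.

```python
BLOCK_SIZE = 1 << 20           # 1 MiB blocks (illustrative)
NUM_MACHINES = 32

def locate(file_id: int, block_index: int) -> tuple[int, int]:
    """f: (file, block) -> (machine, offset); blocks striped round-robin."""
    machine = (file_id + block_index) % NUM_MACHINES
    offset = (block_index // NUM_MACHINES) * BLOCK_SIZE
    return machine, offset

print(locate(file_id=7, block_index=100))   # (11, 3145728)
```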

  62. Block Location (figure: storing a block of updates somewhere in the storage layer)

  63. Block Location is Irrelevant • Give me any block of updates • Streaming is order oblivious!

  64. Random Schedule • Replace the centralized metadata service with randomization • Connect to a random machine for each load/store • Extremely simple implementation
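
A minimal sketch of the randomized schedule on this slide (the store/load interface and names are assumptions; the point is only that no metadata lookup is needed, because any block of updates will do):

```python
import random

class RandomScheduler:
    """Store to and load from a uniformly random machine; no location metadata."""
    def __init__(self, machines):
        self.machines = machines          # each machine holds a list of blocks

    def store(self, block):
        random.choice(self.machines).append(block)

    def load(self):
        # Streaming is order oblivious: any machine with a pending block will do.
        candidates = [m for m in self.machines if m]
        return random.choice(candidates).pop() if candidates else None

sched = RandomScheduler([[] for _ in range(4)])
for b in range(8):
    sched.store(f"update-block-{b}")
print([sched.load() for _ in range(8)])
```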

  65. Downside? • Can lead to collisions • Collisions reduce utilization (figure: Red and Blue both draw rand() = 1 and pick the same machine)

  66. No Downside • Utilization has a lower bound of (1 - 1/e) ≈ 63%
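
The (1 - 1/e) figure is the classic balls-into-bins bound: when n machines each direct a request at an independently chosen random disk out of n, the expected fraction of disks that receive at least one request is 1 - (1 - 1/n)^n, which decreases toward 1 - 1/e ≈ 0.63 as n grows. A quick simulation (the parameters are illustrative):

```python
import random

def expected_utilization(n, trials=2000, seed=0):
    """Average fraction of disks hit by at least one of n random requests."""
    rng = random.Random(seed)
    busy = 0
    for _ in range(trials):
        busy += len({rng.randrange(n) for _ in range(n)})
    return busy / (n * trials)

print(expected_utilization(32))        # close to the analytic value below
print(1 - (1 - 1 / 32) ** 32)          # 1 - (1 - 1/n)^n for n = 32, about 0.64
```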

  67. Recap • Building distributed storage is hard • An algorithms approach to solving a systems problem • Streaming algorithms are order oblivious • So a randomized schedule works

  68. Evaluation • Rack of 32 machines, each with 32 cores, 32 GB RAM, a 200 GB SSD, and a 2 TB 5200 RPM magnetic disk • 10 GigE network with full bisection bandwidth

  69. Scalability • Solve larger problems using more machines • Used synthetic scale-free graphs • Double the problem size (vertices and edges) as we double the machine count • Up to 32 machines, 4 billion vertices, 64 billion edges

  70. Scaling RMAT (SSD) • 32X the problem size at 2.7X the cost (chart: normalized wall time for PR, BFS, SCC, WCC, BP, MCST, Cond., MIS, SpMV, and SSSP on 1 to 32 machines)

  71. Scaling RMAT (SSD) • The extra cost breaks down into roughly 0.5X engineering overhead, 1X loss of sequentiality, and 0.5X collisions (chart: same scaling plot, annotated)

  72. Capacity • Largest graph we can fit in our cluster: 32 billion vertices, 1 trillion edges on magnetic disks • Ran BFS • The projected time with seeks was 1 year
