

  1. Piccolo: Building fast distributed programs with partitioned tables Russell Power Jinyang Li New York University

  2. Motivating Example: PageRank

       Repeat until convergence:
         for each node X in graph:
           for each edge X → Z:
             next[Z] += curr[X]

     [Diagram: the input graph (A → B,C,D; B → E; C → D; …) and the Curr/Next rank tables (A: 0.25 → 0.2 → …; B: 0.17 → 0.16 → …) fit in memory.]
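     For reference, the update rule above can be written directly as serial Python, using the out-degree normalization the later kernel slides apply (curr[s.id] / len(s.out)). This is a minimal sketch under the slide's assumption that everything fits in one machine's memory; pagerank_serial and its dict-of-out-links input are illustrative, not Piccolo API:

       # Serial PageRank sketch: graph maps each page to its out-links.
       # Assumes every out-link target also appears as a key in graph.
       def pagerank_serial(graph, iterations=50):
           curr = {x: 1.0 / len(graph) for x in graph}
           for _ in range(iterations):
               nxt = {x: 0.0 for x in graph}
               for x, out in graph.items():
                   for z in out:
                       nxt[z] += curr[x] / len(out)  # spread X's rank over its out-edges
               curr = nxt
           return curr

       ranks = pagerank_serial({'A': ['B', 'C'], 'B': ['C'], 'C': ['A']})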

  3-4. PageRank in MapReduce: data flow models do not expose global state. [Diagram: each iteration, three workers join the rank stream (A: 0.1, B: 0.2, …) against the graph stream (A → B,C; B → D; …) and write a new rank stream back to distributed storage.]

  5. PageRank with MPI/RPC: the user explicitly programs all communication. [Diagram: three workers each hold a shard of the graph (A → B,C; B → D; C → E,F) and of the ranks, loaded from distributed storage, and exchange rank updates directly.]

  6. Piccolo's Goal: Distributed Shared State. [Diagram: workers read/write distributed in-memory state (the Graph and Ranks tables), loaded from distributed storage.]

  7. Piccolo's Goal: Distributed Shared State. The Piccolo runtime handles communication. [Diagram: the Graph and Ranks tables partitioned across three workers.]

  8. Design goals: ease of use and performance.

  9. Talk outline  Motivation  Piccolo's Programming Model  Runtime Scheduling  Evaluation

  10. Programming Model: implemented as a library for C++ and Python. Tables support get/put (read/write), update, and iterate. [Diagram: kernel instances on each worker operating on shared Graph and Ranks tables.]

  11. Naïve PageRank with Piccolo

       curr = Table(key=PageID, value=double)
       next = Table(key=PageID, value=double)

       # Jobs run by many machines
       def pr_kernel(graph, curr, next):
           i = my_instance
           n = len(graph) / NUM_MACHINES
           for s in graph[(i - 1) * n : i * n]:
               for t in s.out:
                   next[t] += curr[s.id] / len(s.out)

       # Run by a single controller
       def main():
           for i in range(50):
               # Controller launches jobs in parallel
               launch_jobs(NUM_MACHINES, pr_kernel, graph, curr, next)
               swap(curr, next)
               next.clear()

  12. Naïve PageRank is Slow. [Diagram: each kernel's gets and puts on the Graph and Ranks tables go to whichever worker holds the entry, so most accesses cross the network.]

  13. PageRank: Exploiting Locality

       # Control table partitioning
       curr = Table(..., partitions=100, partition_by=site)
       next = Table(..., partitions=100, partition_by=site)
       # Co-locate tables
       group_tables(curr, next, graph)

       def pr_kernel(graph, curr, next):
           for s in graph.get_iterator(my_instance):
               for t in s.out:
                   next[t] += curr[s.id] / len(s.out)

       def main():
           for i in range(50):
               # Co-locate execution with table
               launch_jobs(curr.num_partitions, pr_kernel,
                           graph, curr, next, locality=curr)
               swap(curr, next)
               next.clear()

  14-15. Exploiting Locality. [Diagram: with partitions grouped by site, each worker's gets and puts land on its own Graph and Ranks partitions; only updates to pages owned elsewhere leave the machine.]

  16. Synchronization: how should concurrent writes be handled? [Diagram: workers 2 and 3 simultaneously issue put(a=0.2) and put(a=0.3) against the same rank entry on worker 1.]

  17. Synchronization Primitives
       Avoid write conflicts with accumulation functions: NewValue = Accum(OldValue, Update), e.g. sum, product, min, max
       Global barriers are sufficient
       Tables provide release consistency
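     To make the accumulation contract concrete, here is a minimal Python sketch; make_accumulator and apply_update are hypothetical illustrations, not Piccolo's API:

       # NewValue = Accum(OldValue, Update): with a commutative accumulator
       # (sum, product, min, max), the runtime can apply updates in any order.
       def make_accumulator(accum, initial):
           table = {}
           def apply_update(key, update):
               table[key] = accum(table.get(key, initial), update)
               return table[key]
           return apply_update

       update_rank = make_accumulator(accum=lambda old, upd: old + upd, initial=0.0)
       update_rank('A', 0.2)          # worker 2's update
       update_rank('A', 0.3)          # worker 3's update; order is irrelevant
       print(update_rank('A', 0.0))   # 0.5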

  18. PageRank: Efficient Synchronization

       # Accumulation via sum
       curr = Table(..., partition_by=site, accumulate=sum)
       next = Table(..., partition_by=site, accumulate=sum)
       group_tables(curr, next, graph)

       def pr_kernel(graph, curr, next):
           for s in graph.get_iterator(my_instance):
               for t in s.out:
                   # update() invokes the accumulation function
                   next.update(t, curr.get(s.id) / len(s.out))

       def main():
           for i in range(50):
               handle = launch_jobs(curr.num_partitions, pr_kernel,
                                    graph, curr, next, locality=curr)
               # Explicitly wait between iterations
               barrier(handle)
               swap(curr, next)
               next.clear()

  19. Efficient Synchronization: the runtime computes the sum, and workers buffer updates locally (release consistency). [Diagram: workers 2 and 3 issue update(a, 0.2) and update(a, 0.3); the runtime combines them into puts on worker 1's Ranks partition.]
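     The local buffering the slide describes can be sketched as follows. This is an illustrative model only; LocalBuffer and flush are hypothetical names, not Piccolo internals:

       # Each worker accumulates updates in a local buffer and ships the
       # combined value to the owning worker at a release point (a barrier),
       # so many update() calls cost one network message per key.
       class LocalBuffer:
           def __init__(self, accum, initial):
               self.accum, self.initial = accum, initial
               self.pending = {}

           def update(self, key, value):
               self.pending[key] = self.accum(self.pending.get(key, self.initial), value)

           def flush(self, send):
               for key, value in self.pending.items():
                   send(key, value)          # one message per key, not per update
               self.pending.clear()

       buf = LocalBuffer(accum=lambda a, b: a + b, initial=0.0)
       buf.update('a', 0.2)
       buf.update('a', 0.3)
       buf.flush(lambda k, v: print(f"update({k}, {v})"))   # update(a, 0.5)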

  20. Table Consistency. [Diagram: the same scenario as above; buffered update(a, 0.2) and update(a, 0.3) are folded into the puts the runtime applies to entry a on worker 1.]

  21. PageRank with Checkpointing

       curr = Table(..., partition_by=site, accumulate=sum)
       next = Table(..., partition_by=site, accumulate=sum)
       group_tables(curr, next)

       def pr_kernel(graph, curr, next):
           for s in graph.get_iterator(my_instance):
               for t in s.out:
                   next.update(t, curr.get(s.id) / len(s.out))

       def main():
           # Restore previous computation; the user decides which
           # tables to checkpoint and when
           curr, userdata = restore()
           last = userdata.get('iter', 0)
           for i in range(last, 50):
               handle = launch_jobs(curr.num_partitions, pr_kernel,
                                    graph, curr, next, locality=curr)
               cp_barrier(handle, tables=(next,), userdata={'iter': i})
               swap(curr, next)
               next.clear()

  22. Recovery via Checkpointing. [Diagram: the runtime saves the Graph and Ranks partitions on all workers to distributed storage using the Chandy-Lamport snapshot protocol.]

  23. Talk Outline  Motivation  Piccolo's Programming Model  Runtime Scheduling  Evaluation

  24. Load Balancing: the master coordinates work-stealing. [Diagram: jobs J1-J6 are assigned across partitions P1-P6 on three workers; when a worker runs dry it steals a pending job (e.g., J6 on P6), and because other workers are still updating P6, the master pauses updates to P6 while the partition moves.]
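     A minimal sketch of that coordination, with hypothetical names (Master, steal); the real runtime also migrates partition state and re-routes buffered updates:

       # The master reassigns an unstarted job from the busiest worker to an
       # idle one, pausing updates to the job's partition during the move.
       class Master:
           def __init__(self, pending):
               self.pending = pending       # worker id -> unstarted (job, partition) list
               self.paused = set()          # partitions mid-migration

           def steal(self, idle_worker):
               busiest = max(self.pending, key=lambda w: len(self.pending[w]))
               if not self.pending[busiest]:
                   return None              # nothing left to steal
               job, partition = self.pending[busiest].pop()
               self.paused.add(partition)   # "Pause updates!" while ownership moves
               # ... partition state would migrate to idle_worker here ...
               self.pending[idle_worker].append((job, partition))
               self.paused.discard(partition)
               return job

       master = Master({1: [], 2: [('J6', 'P6')], 3: []})
       print(master.steal(idle_worker=1))   # J6 (and P6) move to worker 1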

  25. Talk Outline  Motivation  Piccolo's Programming Model  Runtime Scheduling  Evaluation

  26. Piccolo is Fast. [Chart: PageRank iteration time in seconds vs. number of workers (8, 16, 32, 64), Hadoop vs. Piccolo.] Main Hadoop overheads: sorting, HDFS, serialization. Setup: NYU cluster, 12 nodes, 64 cores; 100M-page graph.

  27. Piccolo Scales Well. [Chart: PageRank iteration time in seconds vs. number of workers (12-200) on a 1-billion-page graph, tracking the ideal curve.] Setup: EC2 cluster; input graph scaled linearly with cluster size.

  28. Other applications
       Iterative applications: n-body simulation, matrix multiply (no straightforward Hadoop implementation)
       Asynchronous applications: distributed web crawler

  29. Related Work
       Data flow: MapReduce, Dryad
       Tuple spaces: Linda, JavaSpaces
       Distributed shared memory: CRL, TreadMarks, Munin, Ivy; UPC, Titanium

  30. Conclusion
       Distributed shared table model
       User-specified policies provide for effective use of locality, efficient synchronization, and robust failure recovery

  31. [Gratuitous cat picture] I can haz kwestions? Try it out: piccolo.news.cs.nyu.edu
