
Optimizing Indirect Memory References with milk

Vladimir Kiriansky, Yunming Zhang, Saman Amarasinghe, MIT. PACT ’16, September 13, 2016, Haifa, Israel.


  1. Optimizing Indirect Memory References with milk
 Vladimir Kiriansky, Yunming Zhang, Saman Amarasinghe
 MIT, PACT ’16
 September 13, 2016, Haifa, Israel

  2. Indirect Accesses

  3. Indirect Accesses with OpenMP

  4. Indirect Accesses with OpenMP
 [Chart: speedup (axis 0 to 5); legend: OpenMP, +Milk; indices uniform in [0..100M); 8 threads, 8 MB L3]

  5. Indirect Accesses with milk
 Code annotations as extracted: milk, if(!milk)
 [Chart: speedup (axis 0 to 5); legend: OpenMP, +Milk; indices uniform in [0..100M); 8 threads, 8 MB L3]
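The kind of loop these slides discuss can be sketched as follows. This is a minimal sketch, assuming the update has the form `count[d[i]] += f(i)`; the names `d`, `count` and `f` are borrowed from the later Collection slides, while `indirect_update` and the body of `f` are hypothetical. It shows the plain OpenMP baseline, where every update to a randomly indexed element needs an atomic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical payload function; the slides only call it f(i).
inline std::uint64_t f(std::size_t i) { return i + 1; }

// Baseline indirect update: count[d[i]] += f(i).
// Built with -fopenmp the loop runs in parallel and the atomic protects
// concurrent updates to the same element. Without -fopenmp the pragmas are
// ignored and the loop runs serially with the same result.
void indirect_update(const std::vector<std::uint32_t>& d,
                     std::vector<std::uint64_t>& count) {
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(d.size());
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    assert(d[i] < count.size());
    std::uint64_t& slot = count[d[i]];  // random location: likely cache miss
    const std::uint64_t v = f(static_cast<std::size_t>(i));
    #pragma omp atomic
    slot += v;
  }
}
```

Calling `indirect_update` with `d = {7, 0, 14, 5, 18, 7, 0, 7}` and a zeroed `count` of 20 elements reproduces the small example used on the Collection slides.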

  6. No Locality?
 [Plot: address vs. time]

  7. No Locality?
 • Cache miss
 • TLB miss
 • DRAM row miss
 • No prefetching
 [Plot: address vs. time]

  8. No Locality?
 [Plot: address vs. time]

  9. No Locality?
 [Plot: address vs. time]

  10. No Locality?
 [Plot: address vs. time]

  11. Milk Clustering
 [Plot: address vs. time, 8 threads]

  12. Milk Clustering
 • Cache hit
 • TLB hit
 • DRAM row hit
 • Effective prefetching
 [Plot: address vs. time]

  13. Milk Clustering
 • Cache hit
 • TLB hit
 • DRAM row hit
 • Effective prefetching
 • No need for atomics!
 [Plot: address vs. time]

  14. Big (sparse) Data
 [Image: http://research.blogs.lincoln.ac.uk/files/2011/02/map-of-internet.png]

  15. Big (sparse) Data
 • Terabyte Working Sets
   - AWS 2TB VM
 • In-memory Databases, Key-value stores
 • Machine Learning
 • Graph Analytics

  16. Outline
 • Milk programming model
 • milk syntax
 • MILK compiler and runtime

  17. Foundations
 • Milk programming model: extending BSP
 • milk syntax: OpenMP for C/C++
 • MILK compiler and runtime: LLVM/Clang

  18. Milk: BSP extension
 • Bulk-synchronous parallel (BSP) superstep
   - updates visible after a barrier
 • Milk virtual processors can access only
   - One random cache line from DRAM
   - Sequential streams
   - Cache-resident data

  19. Superstep Locality in Graph Applications
 [Chart: ideal cache hit % (0 to 100) in two panels, Temporal Locality (infinite cache) and Spatial Locality (64 byte). Graphs: Road (d=2.4), Twitter (d=24), Web (d=39). Benchmarks [GAPBS]: Betweenness Centrality, Breadth-First Search, Connected Components, Single-Source Shortest Paths, PageRank]

  20. Milk Execution Model
 • Collection
 • Distribution
 • Delivery

  21. Collection
 Statement as extracted: += f(i);
 i:    0 1 2  3 4  5 6 7
 d[i]: 7 0 14 5 18 7 0 7
 count: 20 elements, indices 0 to 19

  22. Collection
 Statement as extracted: += f(i);
 d[i]: 7    0    14   5    18   7    0    7
 tag:  7    0    14   5    18   7    0    7
 pack: f(0) f(1) f(2) f(3) f(4) f(5) f(6) f(7)
 count: 20 elements, indices 0 to 19

  23. Distribution
 Statement as extracted: += f(i);
 d[i]: 7    0    14   5    18   7    0    7
 tag:  7    0    14   5    18   7    0    7
 pack: f(0) f(1) f(2) f(3) f(4) f(5) f(6) f(7)
 count: 20 elements, indices 0 to 19

  24. Distribution
 Statement as extracted: += f(i);
 d[i]: 7 0 14 5 18 7 0 7
 [Diagram, animation in progress; tags as extracted: 7 0 0 5 7 7 14 18; packs as extracted: f(3) f(5) f(6) f(7) f(2) f(4) f(1) f(0)]
 count: 20 elements, indices 0 to 19

  25. Delivery
 Statement as extracted: += f(i);
 d[i]: 7 0 14 5 18 7 0 7
 [Diagram, animation in progress; tags as extracted: 7 0 5 7 0 7 14 18; packs as extracted: f(1) f(3) f(5) f(6) f(7) f(2) f(4) f(0)]
 count: 20 elements, indices 0 to 19

  26. Delivery
 Statement as extracted: += f(i);
 d[i]: 7 0 14 5 18 7 0 7
 count: 20 elements, indices 0 to 19
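The three phases on the Collection, Distribution and Delivery slides can be sketched in plain C++. This is a serial sketch under stated assumptions, not the MILK compiler's output: the statement is assumed to be `count[d[i]] += f(i)`, the body of `f` is hypothetical, and `std::stable_sort` stands in for the runtime's radix partitioning. The function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using TagPack = std::pair<std::uint32_t, std::uint64_t>;  // (tag, pack)

// Hypothetical payload function; the slides only call it f(i).
inline std::uint64_t f(std::size_t i) { return i + 1; }

// Collection: record (tag, pack) pairs instead of touching count.
std::vector<TagPack> collect(const std::vector<std::uint32_t>& d) {
  std::vector<TagPack> recs;
  recs.reserve(d.size());
  for (std::size_t i = 0; i < d.size(); ++i) recs.emplace_back(d[i], f(i));
  return recs;
}

// Distribution: group records by tag (a stable sort stands in for
// the radix partitioning named on the compiler and runtime slide).
void distribute(std::vector<TagPack>& recs) {
  std::stable_sort(recs.begin(), recs.end(),
                   [](const TagPack& a, const TagPack& b) {
                     return a.first < b.first;
                   });
}

// Delivery: apply the deferred updates in tag order, so accesses to
// count walk through memory in increasing address order.
void deliver(const std::vector<TagPack>& recs,
             std::vector<std::uint64_t>& count) {
  for (const TagPack& r : recs) {
    assert(r.first < count.size());
    count[r.first] += r.second;
  }
}

// Runs all three phases and returns the resulting count array.
std::vector<std::uint64_t> clustered_update(
    const std::vector<std::uint32_t>& d, std::size_t count_size) {
  std::vector<std::uint64_t> count(count_size, 0);
  std::vector<TagPack> recs = collect(d);
  distribute(recs);
  deliver(recs, count);
  return count;
}
```

With the slide's input `d = {7, 0, 14, 5, 18, 7, 0, 7}`, the tags after `distribute` appear in the order 0, 0, 5, 7, 7, 7, 14, 18, and the final `count` matches what the direct loop would produce.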

  27. milk syntax
 • milk clause in parallel loop
 • milk directive per indirect access
   - tag: address to group by (diagram example: 0)
   - pack: additional state (diagram example: f(1))

  28. pack Combiners

  29. Combiners
 Statement as extracted: += f(i);
 d[i]: 7 0 14 5 18 7 0 7
 tag:  0    0    5    7    7    7    14   18
 pack: f(1) f(6) f(3) f(0) f(5) f(7) f(2) f(4)
 count: 20 elements, indices 0 to 19

  30. Combiners
 d[i]: 7 0 14 5 18 7 0 7
 tag:  0         5    7              14   18
 pack: f(1)+f(6) f(3) f(0)+f(5)+f(7) f(2) f(4)
 count: 20 elements, indices 0 to 19
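The combining step on the Combiners slides can be illustrated with a short sketch: once records are grouped by tag, adjacent records that share a tag are merged with `+`, so each target element is updated once. The function name `combine_adjacent` and the record type are hypothetical, and addition is assumed as the combiner because the slides show `+`.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using TagPack = std::pair<std::uint32_t, std::uint64_t>;  // (tag, pack)

// Merges neighbouring records that share a tag by adding their packs.
// The input must already be grouped by tag in non-decreasing order.
std::vector<TagPack> combine_adjacent(const std::vector<TagPack>& grouped) {
  std::vector<TagPack> out;
  for (const TagPack& r : grouped) {
    assert(out.empty() || out.back().first <= r.first);  // input is sorted
    if (!out.empty() && out.back().first == r.first) {
      out.back().second += r.second;  // same tag: combine packs
    } else {
      out.push_back(r);               // new tag: start a new record
    }
  }
  return out;
}
```

For the slide's eight records this leaves five, one per distinct tag (0, 5, 7, 14, 18).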

  31. MILK compiler and runtime
 • Collection: loop transformation
 • Delivery: outlined function with continuation
 • Distribution: runtime library, parallel multipass radix partitioning
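Multipass radix partitioning, named above as the Distribution mechanism, can be sketched as repeated stable counting passes over one digit of the tag. This is a serial illustration only: the pass order (least significant digit first), the bit widths, and the names `radix_pass` and `radix_partition` are assumptions made for brevity, not a description of the MILK runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using TagPack = std::pair<std::uint32_t, std::uint64_t>;  // (tag, pack)

// One stable partitioning pass on the digit (tag >> shift) & (2^bits - 1).
std::vector<TagPack> radix_pass(const std::vector<TagPack>& in,
                                unsigned shift, unsigned bits) {
  assert(bits > 0 && bits <= 16 && shift < 32);
  const std::size_t buckets = std::size_t{1} << bits;
  const std::uint32_t mask = static_cast<std::uint32_t>(buckets - 1);
  std::vector<std::size_t> start(buckets + 1, 0);
  // Histogram: count records per bucket.
  for (const TagPack& r : in) ++start[((r.first >> shift) & mask) + 1];
  // Prefix sum: start[b] becomes the first output slot of bucket b.
  for (std::size_t b = 0; b < buckets; ++b) start[b + 1] += start[b];
  // Scatter: copy each record to the next free slot of its bucket.
  std::vector<TagPack> out(in.size());
  for (const TagPack& r : in) out[start[(r.first >> shift) & mask]++] = r;
  return out;
}

// Several passes, low digit first; because each pass is stable, the result
// is ordered by the low tag_bits bits of the tag.
std::vector<TagPack> radix_partition(std::vector<TagPack> recs,
                                     unsigned tag_bits,
                                     unsigned bits_per_pass) {
  assert(tag_bits <= 32 && bits_per_pass > 0);
  for (unsigned shift = 0; shift < tag_bits; shift += bits_per_pass)
    recs = radix_pass(recs, shift, bits_per_pass);
  return recs;
}
```

Each pass reads its input sequentially and writes to a bounded number of output streams, which is what makes the technique friendly to caches and prefetchers.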

  32. Example: PageRank

  33. Example: PageRank
 [Diagram values as extracted: 7, 17, 0.5]

  34. PageRank with OpenMP

  35. PageRank with milk

  36. PageRank with milk

  37. PageRank with milk
 [Diagram values as extracted: 7, 17, 0.5]

  38. PageRank: Collection
 [Diagram values as extracted: 0.5, 7]
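The code shown on the PageRank slides did not survive extraction, so the following is only a hedged sketch of the idea: a push-style PageRank step written once as a direct scatter and once in collect, distribute, deliver form. The `Graph` layout (CSR), the function names, and the omission of the damping factor are all assumptions; `std::stable_sort` again stands in for radix partitioning.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical CSR graph: out-neighbours of u are dst[offset[u] .. offset[u+1]).
struct Graph {
  std::vector<std::size_t> offset;
  std::vector<std::uint32_t> dst;
  std::size_t num_vertices() const { return offset.size() - 1; }
};

// Direct scatter: next[v] += rank[u] / out_degree(u) for every edge u -> v.
std::vector<double> push_baseline(const Graph& g,
                                  const std::vector<double>& rank) {
  std::vector<double> next(g.num_vertices(), 0.0);
  for (std::size_t u = 0; u < g.num_vertices(); ++u) {
    const std::size_t deg = g.offset[u + 1] - g.offset[u];
    for (std::size_t e = g.offset[u]; e < g.offset[u + 1]; ++e) {
      assert(g.dst[e] < next.size());
      next[g.dst[e]] += rank[u] / static_cast<double>(deg);  // indirect access
    }
  }
  return next;
}

// Same step with the updates deferred and grouped by destination.
std::vector<double> push_clustered(const Graph& g,
                                   const std::vector<double>& rank) {
  using Rec = std::pair<std::uint32_t, double>;  // (tag = destination, pack)
  std::vector<Rec> recs;
  recs.reserve(g.dst.size());
  for (std::size_t u = 0; u < g.num_vertices(); ++u) {  // Collection
    const std::size_t deg = g.offset[u + 1] - g.offset[u];
    for (std::size_t e = g.offset[u]; e < g.offset[u + 1]; ++e)
      recs.emplace_back(g.dst[e], rank[u] / static_cast<double>(deg));
  }
  std::stable_sort(recs.begin(), recs.end(),             // Distribution
                   [](const Rec& a, const Rec& b) { return a.first < b.first; });
  std::vector<double> next(g.num_vertices(), 0.0);
  for (const Rec& r : recs) {                             // Delivery
    assert(r.first < next.size());
    next[r.first] += r.second;
  }
  return next;
}
```

Because the sort is stable, contributions to each destination are added in the same order in both versions, so the two functions return identical vectors.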

  39. Tag Distribution
 [Diagram: L2 pails …; 9-bit radix partition]

  40. Tag Distribution
 [Diagram: L2 pails; values as extracted: 0.5 … 7 17 17 17 0.5; p=7]

  41. Tag Distribution
 [Diagram: L2 pails; values as extracted: … 17 17 7 17 0.5; p=7; 0.5]

  42. Tag Distribution
 [Diagram: L2 pails; values as extracted: … 17 17 7 7 17 0.5 7 0.5; p=7]

  43. Distribution: Pail Overflow
 [Diagram: pails in L2, tubs in DRAM; values as extracted: 0.2 … 17 17 7 17 17 0.5 7 0.5 0.2; p=7; 0.2]

  44. Milk Delivery
 [Diagram: tubs in DRAM, L2; values as extracted: 17 0.2 27 0.1 7 0.3 17 27 17 7 17 17 0.5 7 0.5 0.2]

  45. Milk Delivery
 [Diagram: tubs in DRAM, L2; values as extracted: 17 0.2 27 0.1 7 0.3 17 27 17 7 17 17 0.5 7 0.5 0.2]
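The pail and tub diagrams above suggest small per-partition buffers that stay in cache (pails) and spill into larger per-partition buffers in DRAM (tubs) when full. The sketch below illustrates that reading only; the class name `PailDistributor`, its interface, the pail capacity, and the choice of which tag bits select the partition are all hypothetical and not taken from the MILK runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Record {
  std::uint32_t tag;  // address (index) to group by
  double pack;        // value carried along with the tag
};

class PailDistributor {
 public:
  PailDistributor(unsigned radix_bits, unsigned shift,
                  std::size_t pail_capacity)
      : shift_(shift),
        mask_((std::uint32_t{1} << radix_bits) - 1),
        capacity_(pail_capacity),
        pails_(std::size_t{1} << radix_bits),
        tubs_(std::size_t{1} << radix_bits) {
    assert(radix_bits > 0 && radix_bits <= 16 && shift < 32);
    assert(pail_capacity > 0);
  }

  // Append to the record's pail; spill the pail into its tub when full.
  void push(const Record& r) {
    const std::size_t p = (r.tag >> shift_) & mask_;
    pails_[p].push_back(r);
    if (pails_[p].size() == capacity_) flush(p);
  }

  // Spill every partially filled pail; call before deliver().
  void flush_all() {
    for (std::size_t p = 0; p < pails_.size(); ++p) flush(p);
  }

  // Delivery: process one tub at a time, so updates to target stay within
  // the address range covered by that partition.
  void deliver(std::vector<double>& target) const {
    for (const std::vector<Record>& tub : tubs_) {
      for (const Record& r : tub) {
        assert(r.tag < target.size());
        target[r.tag] += r.pack;
      }
    }
  }

  std::size_t pail_size(std::size_t p) const { return pails_[p].size(); }
  std::size_t tub_size(std::size_t p) const { return tubs_[p].size(); }

 private:
  void flush(std::size_t p) {
    tubs_[p].insert(tubs_[p].end(), pails_[p].begin(), pails_[p].end());
    pails_[p].clear();
  }

  unsigned shift_;
  std::uint32_t mask_;
  std::size_t capacity_;
  std::vector<std::vector<Record>> pails_;  // small, meant to stay in cache
  std::vector<std::vector<Record>> tubs_;   // large, sequentially appended
};
```

A distributor built as `PailDistributor(9, 0, 2)` has 512 partitions, matching the 9-bit radix on the Tag Distribution slide, and spills a pail to its tub after every second record.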

  46. Related Work
 • Database JOIN optimizations
 • [Shatdal94] cache partitioning
 • [Manegold02, Kim09, Albutiu12, Balkesen15] TLB, SIMD, NUMA, non-temporal writes, software write buffers

  47. Overall Speedup with milk
 [Chart: speedup (axis 0x to 3x) on [GAPBS] benchmarks: BC (Betweenness Centrality), BFS (Breadth-First Search), CC (Connected Components), PR (PageRank), SSSP (Single-Source Shortest Paths). Annotated values: 2.7× and 1.4×. Setup: V=32M, [i7-4790K], 8 MB L3]

  48. Indirect Access Cache Hit %
 [Chart: cache hit % (axis 0 to 100), baseline vs. milk, for BC, BFS, CC, PR, SSSP. Setup: V=32M, [i7-4790K], 8 MB L3, 256 KB L2]
 > 80% DRAM → < 22%

  49. Stall Cycle Reduction
 [Chart: % of total cycles (axis 0% to 100%), baseline vs. milk, for L2 miss stalls (256 KB L2) and L3 miss stalls (8 MB L3). Workload: PageRank, V=32M, d=16, uniform]
 baseline: 6 of 7 cycles stalled!

  50. Larger Graphs → Larger Speedups
 [Chart: speedup (axis 0x to 3x) for graph sizes 2M, 8M, 32M on BC, BFS, CC, PR, SSSP. Setup: d=16, uniform, 8 MB L3, [i7-4790K]]

  51. Higher Degree → Higher Locality
 [Chart: CountDegree speedup (axis 0x to 5x) vs. average degree (1, 2, 4, 8, 16, 32, 64) for V=16M and V=32M; edge counts range from 16M edges to 2B edges]

  52. Q & A
 http://milk-lang.org/

  53. Backup Slides

  54. Graph Datasets
 Type    Graph      Vertices  Degree
 Social  Facebook   1.5 B     290
 Social  Twitter    300 M     200
 Social  Twitter62  62 M      24
 Web     CC12       3.5 B     36
 Web     .sk        51 M      39
 Road    US         24 M      2.4
 Sources: [Backstrom14] [Ching15] [Beamer15] [CommonCrawl]

  55. Degree Distribution
 [Chart: % cumulative edges (axis 0% to 100%) vs. vertex degree rank, with an L3 marker. Series: RMAT25, Uniform25, Twitter’. Labels as extracted: V=62M, d=24; V=32M, d=16]
