Optimizing Indirect Memory References with milk



SLIDE 1

Optimizing Indirect Memory References with milk

Vladimir Kiriansky, Yunming Zhang, Saman Amarasinghe
MIT
PACT ’16, September 13, 2016, Haifa, Israel

SLIDE 2

Indirect Accesses

SLIDE 3

Indirect Accesses with OpenMP

SLIDE 4

Indirect Accesses with OpenMP

[Chart: speedup (axis 1 to 5) for OpenMP and +Milk; uniform [0..100M), 8 threads, 8MB L3]

SLIDE 5

Indirect Accesses with milk

[Chart: speedup (axis 1 to 5) for OpenMP and +Milk; code annotations: milk, if(!milk)]

uniform [0..100M) 8 threads, 8MB L3
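
The code shown on these slides is not part of the transcript. As a rough illustration of the OpenMP baseline in the chart, the sketch below counts occurrences through an index array. The function name `count_indirect` and the exact loop body are assumptions, not the authors' benchmark code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical baseline kernel: count[d[i]] += 1 for every i.
// When d is uniformly random over a range far larger than the cache,
// nearly every update touches a different cache line.
std::vector<std::uint32_t> count_indirect(const std::vector<std::uint32_t>& d,
                                          std::size_t num_bins) {
    std::vector<std::uint32_t> count(num_bins, 0);
    std::uint32_t* c = count.data();
    const std::int64_t n = static_cast<std::int64_t>(d.size());
    // The pragmas take effect only when OpenMP is enabled at compile time
    // (for example with -fopenmp); otherwise the loop runs serially.
    #pragma omp parallel for
    for (std::int64_t i = 0; i < n; ++i) {
        assert(d[i] < num_bins);
        // Two threads may hit the same bin, so the update must be atomic.
        #pragma omp atomic
        c[d[i]] += 1;
    }
    return count;
}
```

Calling `count_indirect(d, 100000000)` with uniformly random `d` would correspond to the "uniform [0..100M)" setting named on the slide.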

SLIDE 6

No Locality?

[Plot: address vs. time]

SLIDE 7

No Locality?


  • Cache miss
  • TLB miss
  • DRAM row miss
  • No prefetching

[Plot: address vs. time]

SLIDE 8

No Locality?

[Plot: address vs. time]

SLIDE 9

No Locality?

[Plot: address vs. time]

SLIDE 10

No Locality?

[Plot: address vs. time]

SLIDE 11

Milk Clustering

[Plot: address vs. time, 8 threads]

SLIDE 12

Milk Clustering


  • Cache hit
  • TLB hit
  • DRAM row hit
  • Effective prefetching

[Plot: address vs. time]

SLIDE 13

Milk Clustering


  • Cache hit
  • TLB hit
  • DRAM row hit
  • Effective prefetching
  • No need for atomics!
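
One way to read the last bullet: once updates are clustered by address, each thread can be handed a range that shares no address with any other thread's range, so plain stores suffice. The sketch below computes such ranges over tag-sorted data. The function name `split_by_tag` and the splitting rule are assumptions for illustration, not the milk runtime's scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split tag-sorted entries into at most `parts` contiguous chunks without
// ever separating two entries that carry the same tag. The result lists the
// chunk boundaries: chunk k covers [bounds[k], bounds[k + 1]).
std::vector<std::size_t> split_by_tag(const std::vector<std::uint32_t>& sorted_tags,
                                      std::size_t parts) {
    assert(parts > 0);
    const std::size_t n = sorted_tags.size();
    std::vector<std::size_t> bounds{0};
    for (std::size_t p = 1; p < parts; ++p) {
        std::size_t cut = p * n / parts;
        // Move the cut forward past any run of equal tags.
        while (cut > 0 && cut < n && sorted_tags[cut] == sorted_tags[cut - 1]) ++cut;
        if (cut > bounds.back()) bounds.push_back(cut);
    }
    if (bounds.back() != n) bounds.push_back(n);
    return bounds;
}
```

Because no tag straddles a boundary, two threads working on different chunks never write the same location.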

[Plot: address vs. time]

SLIDE 14

Big (sparse) Data

[Image: map of the Internet, http://research.blogs.lincoln.ac.uk/files/2011/02/map-of-internet.png]

SLIDE 15

Big (sparse) Data

  • Terabyte Working Sets

  • AWS 2TB VM
  • In-memory Databases, Key-value stores
  • Machine Learning
  • Graph Analytics

SLIDE 16

Outline

  • Milk programming model

  • milk syntax

  • MILK compiler and runtime

SLIDE 17

Foundations

  • Milk programming model — extending BSP

  • milk syntax — OpenMP for C/C++

  • MILK compiler and runtime — LLVM/Clang

SLIDE 18

Milk — BSP extension

  • Bulk-synchronous parallel (BSP) superstep
      • updates visible after a barrier
  • Milk virtual processors can access only
      • One random cache line from DRAM
      • Sequential streams
      • Cache-resident data

SLIDE 19

Superstep Locality in Graph Applications

[Figure: ideal cache hit % (0 to 100), temporal locality (infinite cache) and spatial locality (64 byte), for Betweenness Centrality, Breadth-First Search, Connected Components, PageRank, and Single-Source Shortest Paths [GAPBS]; graphs: Road (d=2.4), Twitter (d=24), Web (d=39)]

SLIDE 20

Milk Execution Model


  • Collection
  • Distribution
  • Delivery
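
The three phases listed above can be sketched in plain C++ for an update of the form `count[d[i]] += f(i)`. This is a serial illustration under stated assumptions: `Deferred`, `clustered_update`, and the payload function `f` are hypothetical names, and a stable sort stands in for whatever partitioning the milk runtime performs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One deferred update: `tag` is the index that would have been dereferenced,
// `pack` is the value needed to finish the update later.
struct Deferred {
    std::uint32_t tag;
    std::int64_t pack;
};

// Hypothetical payload standing in for f(i) on the slides.
inline std::int64_t f(std::size_t i) { return static_cast<std::int64_t>(i) + 1; }

std::vector<std::int64_t> clustered_update(const std::vector<std::uint32_t>& d,
                                           std::size_t num_bins) {
    // Collection: record (tag, pack) pairs instead of touching count[].
    std::vector<Deferred> buf;
    buf.reserve(d.size());
    for (std::size_t i = 0; i < d.size(); ++i) {
        assert(d[i] < num_bins);
        buf.push_back({d[i], f(i)});
    }

    // Distribution: group the pairs by tag. The stable sort keeps the
    // original order among entries with the same tag.
    std::stable_sort(buf.begin(), buf.end(),
                     [](const Deferred& a, const Deferred& b) { return a.tag < b.tag; });

    // Delivery: apply the updates in tag order, so accesses to count[]
    // move monotonically through memory.
    std::vector<std::int64_t> count(num_bins, 0);
    for (const Deferred& u : buf) count[u.tag] += u.pack;
    return count;
}
```

The result matches what the direct loop would compute; only the order in which `count` is touched changes.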
SLIDE 21

Collection

[Diagram: the statement "+= f(i);" with index array d (values shown: 7 14 5 18 7 7) and array count (slots numbered up to 19)]

SLIDE 22

Collection

[Diagram: the statement "+= f(i);" with index array d (values shown: 7 14 5 18 7 7) and array count (slots numbered up to 19); tags 7 14 5 18 7 7 collected next to payloads f(0) through f(7)]

SLIDE 23

Distribution

[Diagram: collected tags 7 14 5 18 7 7 and payloads f(0) through f(7), shown with the statement "+= f(i);", index array d, and array count (slots numbered up to 19)]

SLIDE 25

Delivery

[Diagram: tags in sorted order 5 7 7 7 14 18 with payloads in the order f(5) f(1) f(3) f(0) f(6) f(7) f(2) f(4), shown with the statement "+= f(i);", index array d, and array count]

SLIDE 26

Delivery

[Diagram: the statement "+= f(i);" with index array d (values shown: 7 14 5 18 7 7) and array count (slots numbered up to 19)]

SLIDE 27

milk syntax

  • milk clause in parallel loop
  • milk directive per indirect access

tag: address to group by
pack: additional state

SLIDE 28

pack Combiners
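
The slide's code is not part of the transcript. As a hedged illustration of what a combiner does, the sketch below merges the pack values of updates that share a tag, so one combined update reaches memory instead of several. It assumes addition as the combining operation and input already grouped by tag; `Deferred` and `combine_add` are illustrative names, not milk syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Deferred {
    std::uint32_t tag;   // address to group by
    std::int64_t pack;   // additional state carried with the tag
};

// Merge adjacent entries that share a tag by adding their packs.
// Precondition: `sorted` is ordered by tag.
std::vector<Deferred> combine_add(const std::vector<Deferred>& sorted) {
    std::vector<Deferred> out;
    for (const Deferred& u : sorted) {
        assert(out.empty() || out.back().tag <= u.tag);
        if (!out.empty() && out.back().tag == u.tag) {
            out.back().pack += u.pack;  // same tag: fold into the last entry
        } else {
            out.push_back(u);           // new tag: start a new entry
        }
    }
    return out;
}
```

With tags 5 7 7 7 14 18 as input, the three entries for tag 7 collapse into one, leaving four entries to deliver.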

SLIDE 29

Combiners

[Diagram: sorted tags 5 7 7 7 14 18 with payloads f(1) f(6) f(3) f(0) f(5) f(7) f(2) f(4), shown with the statement "+= f(i);", index array d (values shown: 7 14 5 18 7 7), and array count]

SLIDE 30

Combiners

[Diagram: tags 7 14 18 with payloads f(1) f(6) f(0) f(5) f(7) f(2) f(4) joined by + signs, and tag 5 with payload f(3); index array d (values shown: 7 14 5 18 7 7) and array count]

SLIDE 31

MILK compiler and runtime


  • Collection — loop transformation
  • Delivery — outlined function with continuation

  • Distribution — runtime library


parallel multipass radix partitioning
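
The Distribution bullet above names multipass radix partitioning. The sketch below shows only a serial core of such a scheme and makes several assumptions: passes run least-significant digit first, the digit width defaults to 9 bits (the width shown on the Tag Distribution slide), and `Deferred`, `radix_pass`, and `radix_partition` are hypothetical names. The pass order and the parallelization used by the actual runtime are not stated in this transcript.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Deferred {
    std::uint32_t tag;
    std::int64_t pack;
};

// One stable partitioning pass on `bits` bits of the tag, starting at `shift`.
std::vector<Deferred> radix_pass(const std::vector<Deferred>& in,
                                 unsigned shift, unsigned bits) {
    assert(bits >= 1 && bits <= 16 && shift < 32);
    const std::size_t buckets = std::size_t{1} << bits;
    const std::uint32_t mask = static_cast<std::uint32_t>(buckets - 1);
    // Histogram of bucket sizes, stored one slot up so that the prefix sum
    // leaves start[b] equal to the first output index of bucket b.
    std::vector<std::size_t> start(buckets + 1, 0);
    for (const Deferred& u : in) ++start[((u.tag >> shift) & mask) + 1];
    for (std::size_t b = 0; b < buckets; ++b) start[b + 1] += start[b];
    // Scatter: start[b] now tracks the next free slot of bucket b.
    std::vector<Deferred> out(in.size());
    for (const Deferred& u : in) out[start[(u.tag >> shift) & mask]++] = u;
    return out;
}

// Repeated passes, least-significant digit first, order the entries by the
// low `tag_bits` bits of the tag.
std::vector<Deferred> radix_partition(std::vector<Deferred> v,
                                      unsigned tag_bits, unsigned bits = 9) {
    assert(tag_bits <= 32);
    for (unsigned shift = 0; shift < tag_bits; shift += bits)
        v = radix_pass(v, shift, bits);
    return v;
}
```

Each pass reads its input sequentially and writes to at most 2^bits output positions at a time, which is what lets the write targets stay in cache when the digit is narrow.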

SLIDE 32

Example: PageRank

SLIDE 33

Example: PageRank

[Diagram: graph example with labels 7, 0.5, and 17]

SLIDE 34

PageRank with OpenMP
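
The code for this slide is not part of the transcript. To show where the indirect accesses in PageRank come from, here is a rough push-style step over a graph in compressed sparse row form. The `Csr` layout, the function name, and the damping default are assumptions of this sketch. It is serial; an OpenMP version that parallelizes over `u` would need an atomic update or per-thread buffers for `sum`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Compressed sparse row graph: the out-neighbors of vertex u are
// dst[offset[u]] .. dst[offset[u + 1] - 1].
struct Csr {
    std::vector<std::size_t> offset;
    std::vector<std::uint32_t> dst;
};

// One push-style PageRank step: every vertex splits its rank evenly
// among its out-neighbors.
std::vector<double> pagerank_step(const Csr& g, const std::vector<double>& rank,
                                  double damping = 0.85) {
    const std::size_t n = rank.size();
    assert(g.offset.size() == n + 1);
    std::vector<double> sum(n, 0.0);
    for (std::size_t u = 0; u < n; ++u) {
        const std::size_t deg = g.offset[u + 1] - g.offset[u];
        if (deg == 0) continue;
        const double contrib = rank[u] / static_cast<double>(deg);
        for (std::size_t e = g.offset[u]; e < g.offset[u + 1]; ++e) {
            // Indirect access: dst[e] is read sequentially, but
            // sum[dst[e]] lands at an arbitrary address.
            sum[g.dst[e]] += contrib;
        }
    }
    std::vector<double> next(n);
    for (std::size_t v = 0; v < n; ++v)
        next[v] = (1.0 - damping) / static_cast<double>(n) + damping * sum[v];
    return next;
}
```

The update `sum[g.dst[e]] += contrib` has the same shape as the `+= f(i)` statement on the earlier slides, with the destination vertex as the address to group by and the contribution as the value to carry.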

SLIDE 35

PageRank with milk

SLIDE 36

PageRank with milk

SLIDE 37

PageRank with milk

[Diagram: graph example with labels 7, 0.5, and 17]

SLIDE 38

PageRank: Collection

[Diagram: collected values 0.5 and 7]

SLIDE 39

Tag Distribution

9-bit radix partition

[Diagram: pails in L2]

SLIDE 40

Tag Distribution

[Diagram: pails in L2, pail p=7; values shown: 17 0.5; 17 17 7; 0.5]

SLIDE 42

Tag Distribution

[Diagram: pails in L2, pail p=7; values shown: 17 0.5 7 0.5; 17 7 17 7]

SLIDE 43

Distribution: Pail Overflow

[Diagram: pails in L2 overflowing into tubs in DRAM, pail p=7; values shown: 17 0.5 7 0.5 0.2; 17 7 17; 0.2; 17]
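
The overflow step can be pictured as a small fixed-capacity buffer that spills into a large one when it fills up. The class below is a minimal sketch under that reading: the names `Pail`, `push`, and `flush`, the capacity, and the use of `std::vector` for both levels are assumptions, not the milk runtime's data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Deferred {
    std::uint32_t tag;
    double pack;
};

// A small buffer (pail) that spills into a large one (tub) when full.
class Pail {
public:
    explicit Pail(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
        buf_.reserve(capacity);
    }

    // Add one entry; a full pail is emptied into the tub first.
    void push(const Deferred& u) {
        if (buf_.size() == capacity_) flush();  // pail overflow
        buf_.push_back(u);
    }

    // Append the pail's contents to the tub in one sequential burst.
    void flush() {
        tub_.insert(tub_.end(), buf_.begin(), buf_.end());
        buf_.clear();
    }

    std::size_t buffered() const { return buf_.size(); }
    const std::vector<Deferred>& tub() const { return tub_; }

private:
    std::size_t capacity_;
    std::vector<Deferred> buf_;  // meant to stay cache-resident
    std::vector<Deferred> tub_;  // stands in for the DRAM-resident tub
};
```

Writes to the tub happen only in pail-sized bursts, which turns many scattered small writes into fewer sequential ones.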

SLIDE 44

Milk Delivery

[Diagram: tubs in DRAM and L2; values shown: 0.2 27 0.1 7 0.3; 17 27 17; 17 0.5 7 0.5 0.2; 17 7 17]

SLIDE 45

Milk Delivery

[Diagram: tubs in DRAM and L2; values shown: 0.2 27 0.1 7 0.3; 17 27 17; 17 0.5 7 0.5 0.2; 17 7 17]

SLIDE 46

Related Work

  • Database JOIN optimizations
      • [Shatdal94] cache partitioning
      • [Manegold02, Kim09, Albutiu12, Balkesen15] TLB, SIMD, NUMA, non-temporal writes, software write buffers

SLIDE 47

Overall Speedup with milk

[Chart: speedup (0x to 3x) with milk for Betweenness Centrality (BC), Breadth-First Search (BFS), Connected Components (CC), PageRank (PR), and Single-Source Shortest Paths (SSSP) [GAPBS]; labeled values: 1.4×, 2.7×; V=32M, [i7-4790K], 8 MB L3]

SLIDE 48

Indirect Access Cache Hit%

[Chart: cache hit % (20 to 100) for BC, BFS, CC, PR, SSSP, baseline vs. milk; V=32M, [i7-4790K], 8 MB L3, 256KB L2]

> 80% DRAM → < 22%

SLIDE 49

Stall Cycle Reduction

[Chart: % of total cycles (0% to 100%) spent in L2 miss stalls (256 KB L2) and L3 miss stalls (8 MB L3), baseline vs. milk; PageRank, V=32M, d=16, uniform]

baseline: 6 of 7 cycles stalled!

SLIDE 50

Larger Graphs → Larger Speedups

[Chart: speedup (0x to 3x) for BC, BFS, CC, PR, SSSP at 2M, 8M, and 32M; d=16, uniform, 8 MB L3, [i7-4790K]]

SLIDE 51

Higher Degree → Higher Locality

[Chart: CountDegree speedup (0x to 5x) vs. average degree (1 to 64) for V=16M and V=32M; 16M edges to 2B edges]

SLIDE 52

Q & A


http://milk-lang.org/

SLIDE 53

Backup Slides

SLIDE 54

Graph Datasets

Category  Graph      Vertices  Degree
Social    Facebook   1.5 B     290
Social    Twitter    300 M     200
Social    Twitter62  62 M      24
Web       CC12       3.5 B     36
Web       .sk        51 M      39
Road      US         24 M      2.4

[Backstrom14][Ching15][Beamer15] [CommonCrawl]

SLIDE 55

Degree Distribution

[Chart: cumulative edges % (0% to 100%) vs. vertex degree rank for RMAT25 and Uniform25 (V=32M, d=16) and Twitter’ (V=62M, d=24); annotation: L3]