PHI: ARCHITECTURAL SUPPORT FOR SYNCHRONIZATION- AND BANDWIDTH-EFFICIENT COMMUTATIVE SCATTER UPDATES (PowerPoint Presentation)
SLIDE 1

PHI: ARCHITECTURAL SUPPORT FOR SYNCHRONIZATION- AND BANDWIDTH-EFFICIENT COMMUTATIVE SCATTER UPDATES

Anurag Mukkara, Nathan Beckmann, Daniel Sanchez
MICRO 2019

SLIDE 2

Scatter updates are common but inefficient

- Scatter updates are common in sparse algorithms
  - e.g., in push graph algorithms, vertices scatter updates to outgoing neighbors
- Current memory hierarchies are optimized for reads
  - Scatter updates suffer from high synchronization overhead and high memory bandwidth usage
- Key insight: Many scatter updates are commutative and can be reordered for performance
- PHI extends the cache hierarchy to exploit the temporal and spatial locality of commutative scatter updates

[Figure: a conventional hierarchy fetches lines from memory into the cache and writes them back (Fetch/WB); PHI instead pushes updates down the hierarchy and merges them (Push/Merge)]
SLIDE 10

PHI gives large benefits

- PageRank algorithm on the UK web graph
- 16-core processor with 32MB cache, 4 memory controllers

[Charts: speedup over Push and normalized memory traffic for Push, Pull, UB, and PHI; PHI reduces memory traffic by 3.5x]

SLIDE 14

Agenda

- Background
- PHI Design
- Evaluation

SLIDE 15

Scatter updates are important

- Sparse algorithms perform push- or pull-based indirect accesses
- Push mode: indirect accesses are scatter updates
- Pull mode: indirect accesses are gather reads
- Important to support scatter updates efficiently
  - Push mode performs less work when few vertices are active
  - Some algorithms do not admit a pull implementation

Push:

    for src in vertices:
        for dst in outNeighbors(src):
            vertex(dst) += vertex(src)

Pull:

    for dst in vertices:
        for src in inNeighbors(dst):
            vertex(dst) += vertex(src)

[Figure: an example four-vertex graph (vertices 1-4) traversed edge-by-edge in push order and in pull order]
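The two loop nests above can be run side by side on a toy graph to see why they are interchangeable: because `+` is commutative and associative, scattering in push order and gathering in pull order produce the same vertex values. This is a minimal sketch; the graph, vertex IDs, and values are illustrative, not from the talk.

```python
# Toy graph: adjacency lists, vertex id -> list of out-neighbors (illustrative).
out_neighbors = {0: [1, 2], 1: [2], 2: [3], 3: [0]}

# Build the reverse adjacency (in-neighbors) for the pull version.
in_neighbors = {v: [] for v in out_neighbors}
for src, dsts in out_neighbors.items():
    for dst in dsts:
        in_neighbors[dst].append(src)

values = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}

# Push mode: each source scatters its value to its out-neighbors (scatter updates).
push_result = dict(values)
for src in out_neighbors:
    for dst in out_neighbors[src]:
        push_result[dst] += values[src]

# Pull mode: each destination gathers from its in-neighbors (gather reads).
pull_result = dict(values)
for dst in in_neighbors:
    for src in in_neighbors[dst]:
        pull_result[dst] += values[src]

# Because addition commutes, both traversal orders agree.
assert push_result == pull_result
```

The push version writes to arbitrary destinations (the access pattern PHI targets), while the pull version only reads indirectly and writes each destination once.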

SLIDE 28

Scatter updates are inefficient on conventional hierarchies

- Poor temporal and spatial locality when inputs do not fit in cache
  - Wasteful data transfers from main memory
- Multiple threads update the same vertex
  - Cache-line ping-ponging

[Figure: two cores updating vertex 2 bounce its cache line between their private caches through the shared cache]

[Chart: memory requests per edge for Push PageRank on the uk-2005 graph, broken down into updates, destination vertex, source vertex, and CSR traffic; 93% of the traffic is due to scatter updates, 10x more traffic than compulsory]

SLIDE 39

Prior hardware support for scatter updates

- Remote memory operations (RMOs) send and perform update operations at a fixed location (e.g., shared cache banks)
  - Avoids cache-line ping-ponging
- COUP [MICRO'15] modifies the coherence protocol to perform commutative operations in a distributed fashion
- Neither RMOs nor COUP improves locality
  - Bottlenecked by memory traffic with large inputs

SLIDE 45

PHI builds on Update Batching (UB)

Propagation Blocking [IPDPS'17], MILK [PACT'16]

- Maximizes the spatial locality of memory transfers using two-phase execution
- Binning phase: logs updates to memory, dividing them into cache-fitting slices (bins) of vertices
- Accumulation phase: reads and applies the logged updates bin-by-bin

[Figure: source vertices A-D scatter updates with destination IDs 11, 5, 9, 0, 7, 4, 6, 8, 3, 12; with a cache-fitting slice of 8 destination vertices, updates to destinations 0-7 are logged as (destination, value) records in Bin 0 and updates to destinations 8-15 in Bin 1, both in main memory; the accumulation phase then drains each bin in turn]
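The two phases above can be sketched in software. This is a minimal illustration of the Propagation Blocking idea, not the hardware mechanism: the number of slices, the update list, and the array sizes are assumptions for the example.

```python
from collections import defaultdict

def update_batching(num_vertices, updates, num_slices=2):
    """Apply (dst, delta) updates via binning + accumulation."""
    values = [0] * num_vertices
    slice_size = (num_vertices + num_slices - 1) // num_slices

    # 1. Binning phase: log (dst, delta) records into per-slice bins
    #    instead of touching vertex state directly. Each bin is written
    #    sequentially, giving perfect spatial locality.
    bins = defaultdict(list)
    for dst, delta in updates:
        bins[dst // slice_size].append((dst, delta))

    # 2. Accumulation phase: apply each bin's updates while only the
    #    corresponding cache-fitting slice of vertices is hot.
    for b in sorted(bins):
        for dst, delta in bins[b]:
            values[dst] += delta
    return values

vals = update_batching(8, [(5, 1), (0, 2), (7, 3), (5, 4)])
```

Note that correctness relies on the updates being commutative: reordering them by destination slice does not change the final values.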

SLIDE 57

Update Batching tradeoffs

- Perfect spatial locality for all main-memory transfers
  - Compulsory memory traffic for all data structures
- Binning phase ignores temporal locality
  - Generates a large stream of updates even with structured inputs

[Charts: memory requests per edge (updates, destination vertex, source vertex, CSR) for Push, UB, and PHI on an unstructured and a structured input; Push PageRank on the uk-2005 graph]

SLIDE 64

Agenda

- Background
- PHI Design
- Evaluation

SLIDE 65

Key techniques of PHI

- In-cache update buffering and coalescing
  - Exploits temporal locality
- Selective update batching
  - Achieves high spatial locality
- Hierarchical buffering and coalescing
  - Enables update parallelism
  - Eliminates synchronization overheads

The first two techniques make PHI bandwidth-efficient; the third makes it synchronization-efficient.

SLIDE 70

In-cache buffering and coalescing

- Buffer updates in cache without ever accessing main memory
- Treat the cache as a large coalescing buffer for updates
- A reduction ALU in each cache bank performs the coalescing

[Figure: the core issues UPDATE 0xF00, +4; the cache allocates line 0xF00 holding just the delta +4 without fetching the line, whose value in memory is still 10; a later UPDATE 0xF00, +2 coalesces in place, leaving a buffered delta of +6]
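The buffering behavior above can be modeled with a few lines of Python. This is a software sketch of the idea, not the hardware design: a cache line is reduced to a single buffered delta, and the class name and counter are illustrative.

```python
class CoalescingCache:
    """Model of a cache used as a coalescing buffer for commutative updates."""

    def __init__(self):
        self.deltas = {}          # line address -> buffered delta
        self.memory_fetches = 0   # would count main-memory reads; stays 0 here

    def update(self, addr, delta):
        # The reduction ALU in the cache bank merges the incoming delta
        # with any delta already buffered for this address. The line is
        # never fetched from memory on an update.
        self.deltas[addr] = self.deltas.get(addr, 0) + delta

cache = CoalescingCache()
cache.update(0xF00, 4)   # buffered without fetching 0xF00 from memory
cache.update(0xF00, 2)   # coalesced: buffered delta becomes +6
```

Memory's copy of the line (10 in the slide's example) is only reconciled with the buffered delta later, when the line is evicted.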

SLIDE 80

Handling cache evictions

- PHI adapts to the amount of spatial locality in the evicted line
- The cache controller performs update batching selectively
  - Achieves good spatial locality in all cases
- Key insight: update batching is a good tradeoff only when the evicted line has poor spatial locality
slide-84
SLIDE 84

Case 1: Evicted line has few updates

14

slide-85
SLIDE 85

Case 1: Evicted line has few updates

14

¨ Log updates to temporary buffers (stored in cache)

slide-86
SLIDE 86

Case 1: Evicted line has few updates

14

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

slide-87
SLIDE 87

Case 1: Evicted line has few updates

14

Cache Memory

7

0xA4:

3

0xF8:

F00 4

0x10:

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-88
SLIDE 88

Case 1: Evicted line has few updates

14 Evict 0xA4

Cache Memory

7

0xA4:

3

0xF8:

F00 4

0x10:

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-89
SLIDE 89

Case 1: Evicted line has few updates

14 Evict 0xA4

Cache Memory

3

0xF8:

F00 4

0x10: INV

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-90
SLIDE 90

Case 1: Evicted line has few updates

14 Evict 0xA4

Cache Memory

3

0xF8:

F00 4 A48 7

0x10: INV

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-91
SLIDE 91

Case 1: Evicted line has few updates

14

Cache Memory

3

0xF8:

F00 4 A48 7

0x10: INV

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-92
SLIDE 92

Case 1: Evicted line has few updates

14

Cache Memory

3

0xF8:

F00 4 A48 7

0x10: INV

Evict 0xF8

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-93
SLIDE 93

Case 1: Evicted line has few updates

14

Cache Memory

3

0xF8:

F00 4 A48 7

0x10: INV

Evict 0xF8 Evict 0x10

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-94
SLIDE 94

Case 1: Evicted line has few updates

14

Cache Memory

3

0xF8:

F00 4 A48 7

0x10: INV

Evict 0xF8 Evict 0x10

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-95
SLIDE 95

Case 1: Evicted line has few updates

14

Cache Memory

3

0xF8:

F00 4 A48 7

0x10: INV

Evict 0xF8 Evict 0x10

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

F00 4 A48 7

0x10:

slide-96
SLIDE 96

Case 1: Evicted line has few updates

14

Cache Memory

3

0xF8: INV

Evict 0xF8

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

INV

F00 4 A48 7

0x10:

slide-97
SLIDE 97

Case 1: Evicted line has few updates

14

Cache Memory

INV INV

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

F84 3

0x11:

F00 4 A48 7

0x10:

slide-98
SLIDE 98

Case 1: Evicted line has many valid updates

15

___

slide-99
SLIDE 99

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

___

slide-100
SLIDE 100

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

1 2 1 7

0xF0:

slide-101
SLIDE 101

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

Evict 0xF0

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

1 2 1 7

0xF0:

slide-102
SLIDE 102

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

Evict 0xF0

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

1 2 1 7

0xF0:

slide-103
SLIDE 103

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

Evict 0xF0

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

1 2 1 7

0xF0:

slide-104
SLIDE 104

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

Evict 0xF0

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

1 2 1 7

0xF0:

MERGE

slide-105
SLIDE 105

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

Evict 0xF0

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

1 2 1 7

0xF0:

MERGE

slide-106
SLIDE 106

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

Evict 0xF0

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

5 8 4 7

0xF0:

MERGE

slide-107
SLIDE 107

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

Evict 0xF0

___

INV

5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

5 8 4 7

0xF0:

MERGE
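The two eviction cases can be combined into one decision rule: count the valid deltas in the evicted line, and either log them (few updates, poor spatial locality) or fetch-and-merge (many updates, good spatial locality). The sketch below is illustrative; the threshold, the word-granular memory model, and the function name are assumptions, not the paper's exact policy.

```python
def evict_line(addr, deltas, memory, update_log, threshold=2):
    """Handle eviction of a line of buffered deltas.

    deltas: per-word buffered deltas (None = no update for that word).
    memory: word address -> value (a simplification of line-granular DRAM).
    """
    valid = [(i, d) for i, d in enumerate(deltas) if d is not None]
    if len(valid) <= threshold:
        # Case 1: few updates. Log (word address, delta) records for
        # later batched accumulation instead of fetching the line.
        for i, d in valid:
            update_log.append((addr + i, d))
    else:
        # Case 2: many updates. Fetch the line and merge the buffered
        # deltas into it, amortizing the transfer over many updates.
        for i, d in valid:
            memory[addr + i] += d

memory = {0xF0: 1, 0xF1: 2, 0xF2: 1, 0xF3: 7}
log = []
evict_line(0xF0, [4, 6, 3, None], memory, log)   # many updates: merge
evict_line(0xA4, [7, None, None, None], memory, log)  # few updates: log
```

The first eviction reproduces the slide's merge (1, 2, 1, 7 becomes 5, 8, 4, 7); the second logs a single record rather than paying a full line fetch for one update.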

SLIDE 108

PHI avoids synchronization costs

- Private caches buffer and coalesce updates locally and push them to the shared cache on evictions
  - No need for a coherence protocol

[Figure: Core 0 issues Update 0xF04, +4, buffered in its private cache; Core 1 issues Update 0xF08, +3, buffered in its own private cache; neither update triggers coherence traffic; when a private cache evicts line 0xF0, its buffered deltas are pushed down to the shared cache and merged there; a later Update 0xB5C, +2 is buffered the same way]
slide-120
SLIDE 120

PHI avoids synchronization costs

16

¨ Private caches buffer and coalesce

updates locally, push them to shared cache on evictions

¤No need for a coherence protocol Core 1 Private Cache 0 Shared Cache Core 0 Private Cache 1

4

0xF0:

0 0 3

0xF0:

0 0 0 2

0xB5:

Update 0xA00, +1 Memory

slide-121
SLIDE 121

PHI avoids synchronization costs

16

¨ Private caches buffer and coalesce

updates locally, push them to shared cache on evictions

¤No need for a coherence protocol Core 1 Private Cache 0 Shared Cache Core 0 Private Cache 1

4

0xF0:

0 0 3

0xF0:

0 0 0 2

0xB5:

Update 0xA00, +1 Evict 0xF0 Memory

slide-122
SLIDE 122

PHI avoids synchronization costs

16

¨ Private caches buffer and coalesce

updates locally, push them to shared cache on evictions

¤No need for a coherence protocol Core 1 Private Cache 0 Shared Cache Core 0 Private Cache 1

4

0xF0:

0 4 3

0xF0:

0 0 0 2

0xB5:

Update 0xA00, +1 Evict 0xF0 Memory

slide-123
SLIDE 123

PHI avoids synchronization costs

16

¨ Private caches buffer and coalesce

updates locally, push them to shared cache on evictions

¤No need for a coherence protocol Core 1 Private Cache 0 Shared Cache Core 0 Private Cache 1

0 4 3

0xF0:

0 0 0 2

0xB5:

Update 0xA00, +1 Memory

slide-124
SLIDE 124

PHI avoids synchronization costs

16

¨ Private caches buffer and coalesce

updates locally, push them to shared cache on evictions

¤No need for a coherence protocol Core 1 Private Cache 0 Shared Cache Core 0 Private Cache 1

1 0 0

0xA0:

0 4 3

0xF0:

0 0 0 2

0xB5:

Memory

slide-125
SLIDE 125

PHI avoids synchronization costs

16

¨ Private caches buffer and coalesce

updates locally, push them to shared cache on evictions

¤No need for a coherence protocol

¨ Private caches do not perform update

batching

¤Simply evict buffered-update lines to

shared cache

Core 1 Private Cache 0 Shared Cache Core 0 Private Cache 1

1 0 0

0xA0:

0 4 3

0xF0:

0 0 0 2

0xB5:

Memory
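The eviction-driven flow on this slide can be modeled in software. This is a hedged sketch under assumed names (`PrivateCache`, `update`, `evict_all`), not the paper's interface; the point is that a commutative reduction makes eviction order irrelevant, so no coherence protocol is needed.

```python
# Software model of PHI's private-cache buffering and coalescing.
WORDS_PER_LINE = 4

class PrivateCache:
    def __init__(self):
        self.lines = {}  # line address -> buffered partial sums

    def update(self, line_addr, word, delta):
        """Coalesce a scatter update into a locally buffered line."""
        line = self.lines.setdefault(line_addr, [0] * WORDS_PER_LINE)
        line[word] += delta

    def evict_all(self, shared):
        """Push every buffered-updates line down to the shared cache."""
        for addr, partial in self.lines.items():
            line = shared.setdefault(addr, [0] * WORDS_PER_LINE)
            for i, v in enumerate(partial):
                line[i] += v
        self.lines.clear()

# Replaying the slide's updates: both cores touch line 0xF0 independently,
# and the shared cache sees the merged result regardless of eviction order.
shared = {}
c0, c1 = PrivateCache(), PrivateCache()
c0.update(0xF0, 1, +4)   # Core 0: Update 0xF04, +4
c1.update(0xF0, 2, +3)   # Core 1: Update 0xF08, +3
c1.evict_all(shared)
c0.evict_all(shared)
print(shared[0xF0])      # [0, 4, 3, 0]
```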

slide-126
SLIDE 126

PHI has minimal hardware costs

¨ Per-line buffered-updates bit

¤0.17% additional storage with 64-byte lines

¨ Reduction unit for each cache bank

¤Supports 64-bit floating-point and integer additions, and logical operations
¤0.06% of chip area in a 16-core system (0.09 mm² in 45 nm)
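As a sanity check on the 0.17% figure (the tag/metadata size below is an assumption, not from the talk):

```python
# One buffered-updates bit per cache line. Relative to a 64-byte line's
# 512 data bits plus an assumed ~64 bits of tag/state metadata, that is
# roughly 1/576 of the cache's storage.
data_bits = 64 * 8               # 64-byte line
tag_and_state_bits = 64          # assumption for this estimate
overhead = 1 / (data_bits + tag_and_state_bits)
print(f"{overhead:.2%}")         # 0.17%
```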

slide-129
SLIDE 129

Agenda

¨ Background
¨ PHI Design
¨ Evaluation

slide-130
SLIDE 130

Evaluation methodology

¨ Event-driven simulation using ZSim
¨ 16-core processor

¤Haswell-like OOO cores
¤32 MB L3 cache
¤4 memory controllers

¨ Graph applications

¤PageRank, PageRank Delta, Connected Components, Radii Estimation

¨ Degree Counting (No Pull)
¨ SpMV
¨ Large real-world inputs

¤Up to 100 million vertices
¤Up to 1 billion edges

[Diagram: Core 0 … Core 15, each with a private cache, sharing a cache connected to memory.]
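The scatter-update pattern these benchmarks share can be illustrated with degree counting (a hedged sketch; function and graph names are illustrative): each edge pushes a commutative `+1` to its destination vertex, which is exactly the update stream PHI buffers and merges.

```python
# Push-style scatter updates, as in the Degree Counting benchmark.
def degree_count(num_vertices, edges):
    in_degree = [0] * num_vertices
    for src, dst in edges:
        in_degree[dst] += 1   # commutative scatter update: any order works
    return in_degree

edges = [(0, 1), (2, 1), (1, 3), (0, 3)]
print(degree_count(4, edges))  # [0, 2, 0, 2]
```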

slide-135
SLIDE 135

PHI improves performance significantly

¨ Pull and UB show mixed results
¨ Push-RMO improves performance by avoiding synchronization costs
¨ PHI consistently outperforms the other schemes

[Chart: performance of Push, Pull, UB, Push-RMO, and PHI across benchmarks.]

slide-141
SLIDE 141

PHI reduces memory traffic

¨ Pull incurs higher memory traffic for non-all-active algorithms (CC, RE)
¨ UB increases memory traffic when the input has good locality
¨ PHI reduces memory traffic over UB by exploiting temporal locality

[Chart: memory traffic of Push, Pull, UB, and PHI across benchmarks.]

slide-147
SLIDE 147

Conclusion

¨ Scatter updates are inefficient on conventional memory hierarchies
¨ PHI extends the cache hierarchy to make commutative scatter updates efficient
¨ Exploits both temporal and spatial locality
¨ Incurs low memory traffic and minimal synchronization

Thanks For Your Attention! Questions Are Welcome!