SLIDE 1

A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Junwhan Ahn, Sungpack Hong*, Sungjoo Yoo, Onur Mutlu+, Kiyoung Choi
Seoul National University, *Oracle Labs, +Carnegie Mellon University
International Symposium on Computer Architecture 2015

Seminar on Computer Architecture

Roknoddin Azizibarzoki

SLIDE 2

Executive Summary

Problem: Performance of graph processing on conventional systems does not scale with graph size.

Observation: High memory bandwidth can sustain scalability in graph processing.

Key Idea: Use Processing-In-Memory to provide high bandwidth, and design specially architected cores to utilize that bandwidth.

Goal: Design an infrastructure with scalable performance for graph processing.

Results: Up to 13.8x performance improvement and 87% memory energy reduction.

SLIDE 3

Graph Processing

SLIDE 4

Graphs

Graphs are abstractions used to represent objects and their relations: vertices represent objects, and edges represent the relations between them. In real-world applications these representations can become very large: the graphs used in this paper reach up to 200 million edges and 7 million vertices, with a memory footprint of 3-5 GB.

Image obtained from: Grandjean, Martin (2015), "Introduction à la visualisation de données, l'analyse de réseau en histoire", Geschichte und Informatik 18/19, pp. 109-128.

SLIDE 5

Graph Processing Workloads

A large amount of data is processed in parallel, almost independently.

Example: PageRank. Originally designed at Google to rank webpages by their link structure, so as to return better search results.

    1  for (v: graph.vertices):
    2    for (u: v.successors):
    3      u.new_rank += v.rank * weight
    4  for (v: graph.vertices):
    5    v.rank = v.new_rank
    6    v.new_rank = alpha

Parallel computation almost independent for each vertex
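As a concrete illustration, the following is a minimal sequential C++ sketch of the two-phase update above; the Vertex type, the weight formula, and the 0.15 reset value are illustrative assumptions, not taken from the paper.

    #include <vector>

    // Minimal PageRank iteration sketch (illustrative only).
    struct Vertex {
        double rank = 1.0;
        double new_rank = 0.15;          // reset value (alpha in the pseudocode)
        std::vector<int> successors;     // indices of outgoing neighbors
    };

    void pagerank_iteration(std::vector<Vertex>& graph, double alpha) {
        // Phase 1 (lines 1-3): scatter each vertex's rank to its successors.
        for (auto& v : graph) {
            if (v.successors.empty()) continue;
            double weight = (1.0 - alpha) / v.successors.size();
            for (int u : v.successors)
                graph[u].new_rank += v.rank * weight;
        }
        // Phase 2 (lines 4-6): commit new ranks and reset the accumulators.
        for (auto& v : graph) {
            v.rank = v.new_rank;
            v.new_rank = alpha;
        }
    }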

SLIDE 6

Graph Processing Workloads Characteristics

Characteristics of this parallel, vertex-independent computation:

  • 1. Frequent random memory accesses
  • 2. Small amount of computation per vertex

Lines 2-3 of the PageRank kernel:

    2    for (u: v.successors):
    3      u.new_rank += v.rank * weight

1. Each successor might lead into a whole new subgraph: frequent random memory accesses.
2. The per-vertex work is a simple multiplication: little computation per vertex.
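To make characteristic 1 concrete, here is a small C++ sketch (mine, not the paper's) using the common compressed sparse row (CSR) layout: the successor indices in col_idx can point anywhere in the rank array, so each new_rank[u] update is effectively a random memory access paired with one multiply-add.

    #include <vector>

    // CSR adjacency: successors of v are col_idx[row_ptr[v] .. row_ptr[v+1]).
    struct CsrGraph {
        std::vector<int> row_ptr;   // size = num_vertices + 1
        std::vector<int> col_idx;   // size = num_edges
    };

    void scatter_ranks(const CsrGraph& g, const std::vector<double>& rank,
                       std::vector<double>& new_rank, double weight) {
        int n = static_cast<int>(g.row_ptr.size()) - 1;
        for (int v = 0; v < n; ++v) {
            for (int e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
                int u = g.col_idx[e];             // u can be any vertex id...
                new_rank[u] += rank[v] * weight;  // ...so this access is random
            }
        }
    }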

SLIDE 7

Graph Processing on Conventional Systems

PageRank performance on conventional graph processing infrastructures:

[Figure: PageRank speedup. 32 cores + DDR3 (102.4 GB/s): baseline; 128 cores + DDR3 (102.4 GB/s): +42%; 128 cores + HMC (640 GB/s): +89%; 128 cores using HMC internal bandwidth (8 TB/s): 5.3x, ideally!]

1. More bandwidth helps!
2. Conventional systems do not utilize the available bandwidth.

IDEA:

1. Use HMC-based Processing-In-Memory to provide high bandwidth
2. Design specially architected cores (Tesseract cores) to exploit this bandwidth

INSIGHT: High bandwidth can mitigate the performance bottleneck!

SLIDE 8

Tesseract System

SLIDE 9

Tesseract System

[Figure: Host processor connected to a network of HMC cubes.]

  • A network of HMC cubes
  • Memory-mapped accelerator interface: non-cacheable, with no support for virtualization
  • Each HMC cube contains 32 vaults, each armed with a simple in-order core in its logic layer, so the cores can use the HMC's internal bandwidth
  • Vaults communicate over a crossbar network for remote function calls
  • Specialized cores, armed with a latency-tolerant programming model and graph-processing-based prefetching mechanisms
  • Message-passing interface and prefetching mechanisms
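For orientation, a minimal structural sketch of this organization in C++; the 32-vault count is from the slide, the 16-cube network size is the paper's configuration, and all type names are mine.

    #include <array>
    #include <cstdint>
    #include <vector>

    // Structural sketch of the Tesseract organization (illustrative).
    struct Vault {
        // Each vault pairs a DRAM partition with a simple in-order core
        // on the logic layer, so the core sees the vault's full bandwidth.
        std::vector<uint8_t> dram;     // this vault's memory partition
        // in-order core, message queue, prefetchers ... omitted here
    };

    struct HmcCube {
        std::array<Vault, 32> vaults;  // vaults talk over an in-cube crossbar
    };

    struct TesseractSystem {
        std::array<HmcCube, 16> cubes; // cubes connected in a memory network
        // The host reaches the accelerator through a non-cacheable,
        // memory-mapped interface.
    };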

SLIDE 10

Processing-In-Memory with 3D stacked DRAM

3D stacking makes a large amount of bandwidth available for the cores to utilize. Tesseract pairs it with specialized cores, armed with a latency-tolerant programming model and graph-processing-based prefetching mechanisms.

SLIDE 11

Communications in Tesseract

[Figure: Two Tesseract vaults, each containing an in-order core, DRAM controller, list prefetcher, message-triggered prefetcher, prefetch buffer, and network interface (NI) with a message queue.]

Data needed by a Tesseract core might be present in another vault's memory region.

SLIDE 12

Communications in Tesseract

Data needed by a Tesseract core might be present in another vault's memory region:

    2    for (u: v.successors):
    3      u.new_rank += v.rank * weight

[Figure: Vertex v resides in Vault #x (Tesseract core #x); its successor w resides in Vault #y (Tesseract core #y).]

    for (w: v.successors):
      put(w.id, function() { w.next_rank += weight * v.rank; })
    barrier()

This non-blocking remote function call sends the function address and arguments to the remote core; it increases latency tolerance in the source core and guarantees atomicity.
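To illustrate the put()/barrier() pattern in runnable form, here is a hedged single-threaded C++ emulation (my own sketch, not the paper's hardware): each vault owns a message queue, put() enqueues a closure on the vault that owns the target vertex, and barrier() drains all queues before execution continues.

    #include <functional>
    #include <queue>
    #include <vector>

    // Software emulation sketch of Tesseract-style non-blocking remote
    // function calls. Names (RemoteCallEmu, owner_of) are illustrative.
    class RemoteCallEmu {
    public:
        explicit RemoteCallEmu(int num_vaults) : queues_(num_vaults) {}

        // Non-blocking: enqueue the closure on the vault owning vertex_id
        // and return immediately, like Tesseract's put().
        void put(int vertex_id, std::function<void()> fn) {
            queues_[owner_of(vertex_id)].push(std::move(fn));
        }

        // barrier(): run every pending remote call before proceeding.
        // Executing each vault's queue serially keeps the updates atomic
        // with respect to that vault's data.
        void barrier() {
            for (auto& q : queues_)
                while (!q.empty()) { q.front()(); q.pop(); }
        }

    private:
        int owner_of(int vertex_id) const {
            return vertex_id % static_cast<int>(queues_.size());
        }
        std::vector<std::queue<std::function<void()>>> queues_;
    };

With this, the scatter loop becomes put(w, [=]{ new_rank[w] += rank[v] * weight; }) for each successor, followed by a single barrier(), mirroring the pseudocode above.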

SLIDE 13

Prefetching in Tesseract

[Figure: Tesseract core block diagram, highlighting the message-triggered prefetcher and its prefetch buffer alongside the in-order core, DRAM controller, list prefetcher, and NI message queue.]

The message-triggered prefetcher (later noted as MTP in the evaluation section) prefetches the data referenced by messages in the message queue: when a message enters the queue, a prefetch request is issued, and the message is ready to be serviced once its data is present.
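A minimal event-level model of this mechanism in C++ (my own sketch; the fixed prefetch latency and one-address-per-message simplification are assumptions):

    #include <cstdint>
    #include <deque>

    // Toy timeline model of a message-triggered prefetcher (MTP).
    constexpr uint64_t kPrefetchLatency = 36;  // illustrative, in cycles

    struct Message {
        uint64_t addr;         // data the remote function will touch
        uint64_t ready_cycle;  // cycle at which its prefetch completes
    };

    class MessageTriggeredPrefetcher {
    public:
        // On message arrival, immediately issue a prefetch for its data.
        void enqueue(uint64_t addr, uint64_t now) {
            queue_.push_back({addr, now + kPrefetchLatency});
        }

        // Service a message only once its data sits in the prefetch
        // buffer, so the in-order core never stalls on this access.
        bool try_service(uint64_t now, Message& out) {
            if (queue_.empty() || queue_.front().ready_cycle > now)
                return false;
            out = queue_.front();
            queue_.pop_front();
            return true;
        }

    private:
        std::deque<Message> queue_;
    };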

SLIDE 14

Tesseract Core

[Figure: Tesseract core block diagram: in-order core, DRAM controller, list prefetcher, message-triggered prefetcher, prefetch buffer, and NI message queue.]

Novelties of Tesseract:

  • Usage of PIM (logic-layer integration) to increase the bandwidth available to the cores
  • Message passing, employed to increase latency tolerance and guarantee atomicity
  • Specially crafted prefetching mechanisms to utilize the abundant bandwidth available for graph processing

Other constructs of Tesseract (a list-prefetcher sketch follows after the list):

  • 1. List prefetching: prefetching based on the next elements in the traversal list, with a constant stride (later noted as LP in the evaluation section)
  • 2. Programming API
  • 3. Blocking remote function calls
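The list prefetcher, as I read the slide, can be sketched as follows in C++; the prefetch distance and interface names are illustrative, though start() deliberately mirrors the list_begin(address, size, stride) call from the API in the backup slides.

    #include <cstdint>

    // Toy model of Tesseract's list prefetcher (LP): while walking a
    // list region sequentially, prefetch a fixed number of elements
    // ahead using a constant stride.
    struct ListPrefetcher {
        uint64_t begin = 0, end = 0, stride = 0;
        uint64_t distance = 4;  // elements ahead to prefetch (assumed)

        // Mirrors list_begin(A address, S size, S stride).
        void start(uint64_t address, uint64_t size, uint64_t str) {
            begin = address; end = address + size; stride = str;
        }

        // Called on each demand access inside [begin, end): issue a
        // prefetch for the element `distance` strides ahead, if any.
        void on_access(uint64_t addr) {
            if (stride == 0 || addr < begin || addr >= end) return;
            uint64_t target = addr + distance * stride;
            if (target < end) issue_prefetch(target);
        }

        void issue_prefetch(uint64_t /*addr*/) { /* to DRAM controller */ }
    };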

SLIDE 15

Evaluation

SLIDE 16

Evaluation Methodology

Workloads:

3 real-world graphs:
  • ljournal-2008 (social network)
  • enwiki-2003 (Wikipedia)
  • indochina-2004 (web graph)

5 graph processing algorithms:
  • Average teenage follower
  • Conductance
  • PageRank
  • Single-source shortest path
  • Vertex cover

Simulated systems:
  • DDR3-OoO: DDR3 + out-of-order cores
  • HMC-OoO: HMC + out-of-order cores, higher bandwidth
  • HMC-MC: HMC + a larger number of simpler, less powerful cores
  • Tesseract: Tesseract cores integrated into the HMC logic layer
SLIDE 17

Evaluation Results

Average Performance

[Figure: Average speedup. DDR3-OoO: 1x (baseline); HMC-OoO: +56%; HMC-MC: +25%; Tesseract: 9.0x; Tesseract + LP: 11.6x; Tesseract + LP + MTP: 13.8x.]

SLIDE 18

Evaluation Results

Average Bandwidth Utilization

[Figure: Average memory bandwidth utilization. DDR3-OoO: 80 GB/s; HMC-OoO: 190 GB/s; HMC-MC: 243 GB/s; Tesseract: 1.3 TB/s; Tesseract + LP: 2.2 TB/s; Tesseract + LP + MTP: 2.9 TB/s.]

SLIDE 19

Evaluation Results

Average Memory Energy Consumption

[Figure: Normalized memory energy, broken down into memory layers, logic layers, and cores, for HMC-OoO vs. Tesseract + LP + MTP: an 87% reduction.]

SLIDE 20

Executive Summary

Problem: Performance of graph processing on conventional systems does not scale with graph size.

Observation: High memory bandwidth can sustain scalability in graph processing.

Key Idea: Use Processing-In-Memory to provide high bandwidth, and design specially architected cores to utilize that bandwidth.

Goal: Design an infrastructure with scalable performance for graph processing.

Results: Up to 13.8x performance improvement and 87% memory energy reduction.

SLIDE 21

Analysis

SLIDE 22

Strengths

  • 1. First work to introduce Processing-In-Memory to graph computations
  • 2. Employs specially designed prefetching mechanisms to better utilize bandwidth
  • 3. Non-blocking remote function calls are an effective way to increase latency tolerance
  • 4. The paper is written in a way that is easy to follow
SLIDE 23

Weaknesses

  • 1. Data placement is not treated as a serious concern in this work (GraphP [1] reduces communication in Tesseract with efficient data placement)
  • 2. The paper does not discuss why the approach is limited to graph applications
  • 3. Introducing barriers raises load-balancing concerns
  • 4. No comparison against prevalent graph processing platforms such as GPUs is included in the paper
  • 5. Adapting common applications to the programming model is not easy
SLIDE 24

Takeaways

  • 1. Optimizing a narrow set of factors might lead to underutilization of resources
  • 2. If designed effectively, PIM might be a promising approach to provide high bandwidth for large-scale data processing

SLIDE 25

Discussions

  • 1. Tesseract also provides the other construct, blocking remote function calls. The difference is that the source core has return values it must wait for before continuing. Can you think of ways to optimize blocking remote function calls?

SLIDE 26

Discussions


  • 2. How hard will it be to expand Tesseract to other applications?
SLIDE 27

Discussions

  • 3. How badly will Tesseract suffer from unbalanced workloads?

SLIDE 28

Discussions

  • 4. What if we replace Tesseract cores with GPU streaming multiprocessors (SMs)?

TOM [2]: Transparent Offloading and Mapping

  • 1. What to offload to the GPU-PIM accelerator: decided by the expected bandwidth gain
  • 2. How to map the data and schedule the computation to benefit the most: subsequent accesses have a fixed offset, so they can be mapped together

Result: 30% average performance gain over a baseline GPU without offloading

SLIDE 29

Discussions

  • 4. What if we replace Tesseract cores with GPU streaming multiprocessors?

Still, TOM does not employ specially designed mechanisms to mitigate communication between vaults, so this problem remains. New question: if we have a PIM cube with GPU cores in its logic layer, how can we reduce data movement?

SLIDE 30

Discussions

[Figure: SM access breakdown over vaults for BFS, showing the percentage of accesses from each SM ID to each vault ID.]

  • 1. Remapping?
  • 2. CTA migration?

A CTA (cooperative thread array) is a block of threads scheduled to run on a single GPU SM.

SLIDE 31

Discussions

[Figure: SM access breakdown over vaults for MUMmerGPU, showing the percentage of accesses from each SM ID to each vault ID.]

SLIDE 32

Discussions

  • 5. What about data movement between cubes?

GraphP [1] reduces communication between the cubes in Tesseract with efficient data placement, using 3 key techniques (a partitioning sketch follows after the list):

  • 1. “Source-cut” partitioning: an algorithm ensuring that a vertex and all its incoming edges reside in the same cube
  • 2. “Two-phase vertex program”: a programming model designed for source-cut partitioning
  • 3. “Hierarchical communication and overlapping”
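Following the slide's description of source-cut partitioning (a vertex kept together with all of its incoming edges), a minimal C++ sketch might look like this; the modulo placement function is my assumption, not GraphP's actual algorithm.

    #include <unordered_map>
    #include <vector>

    // Sketch: co-locate every vertex with all of its incoming edges,
    // per the slide's description of "source-cut" partitioning.
    struct Edge { int src, dst; };

    int cube_of(int vertex, int num_cubes) { return vertex % num_cubes; }

    std::unordered_map<int, std::vector<Edge>>
    partition_source_cut(const std::vector<Edge>& edges, int num_cubes) {
        std::unordered_map<int, std::vector<Edge>> per_cube;
        for (const Edge& e : edges) {
            // An edge follows its destination vertex, so a vertex and
            // all of its incoming edges land in one cube; only edges
            // whose source lives elsewhere need cross-cube traffic.
            per_cube[cube_of(e.dst, num_cubes)].push_back(e);
        }
        return per_cube;
    }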
SLIDE 33

Discussions

  • 6. Other mechanisms for the same problem:

GraphR [3]: Accelerating Graph Processing Using ReRAM. Graph computations are performed on dense ReRAM crossbars, which also allow analog computation.

Results: up to 4.12x speedup and 10.96% energy saving over Tesseract

SLIDE 34

References

SLIDE 35

References

[1] M. Zhang et al., "GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition," 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Vienna, 2018, pp. 544-557.

[2] K. Hsieh et al., "Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems," 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 204-216.

[3] L. Song et al., "GraphR: Accelerating Graph Processing Using ReRAM," arXiv preprint, 2017.

SLIDE 36

A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Junwhan Ahn, Sungpack Hong*, Sungjoo Yoo, Onur Mutlu+, Kiyoung Choi
Seoul National University, *Oracle Labs, +Carnegie Mellon University
International Symposium on Computer Architecture 2015

Seminar on Computer Architecture

Roknoddin Azizibarzoki

SLIDE 37

Backup Slides

SLIDE 38

Backup Slides

SLIDE 39

Backup Slides

SLIDE 40

Backup Slides

SLIDE 41

Backup Slides

Tesseract programming API:

    get(id, A func, A arg, S arg_size, A ret, S ret_size)
    put(id, A func, A arg, S arg_size, A prefetch_addr)
    disable_interrupt(), enable_interrupt()
    copy(id, A local, A remote, S size)
    list_begin(A address, S size, S stride)
    list_end(A address, S size, S stride)
    barrier()
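To show how these primitives compose, here is a hedged pseudocode sketch of the PageRank scatter phase written against this API, in the style of the slides; the argument marshalling (the arg struct and the add_to_rank helper) is my guess, not the paper's actual listing.

    // Hedged sketch: PageRank scatter using the Tesseract API above.
    list_begin(v.successors, size, stride)      // enable the list prefetcher
    for (w: v.successors):
        arg = { &w.next_rank, v.rank * weight } // what the remote call needs
        // Non-blocking remote call on w's home vault; the last argument
        // is prefetch_addr, letting the message-triggered prefetcher
        // fetch w.next_rank before the call is serviced.
        put(w.id, add_to_rank, &arg, sizeof(arg), &w.next_rank)
    list_end(v.successors, size, stride)
    barrier()                                   // wait for all puts to finish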