SLIDE 1

Scalable and Energy-efficient Architecture Lab (SEAL)

Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads

Abanti Basak, Shuangchen Li, Xing Hu, Sang Min Oh, Xinfeng Xie, Li Zhao*, Xiaowei Jiang*, and Yuan Xie

University of California, Santa Barbara
*Server Architecture Group, Alibaba Inc.

SLIDE 2

Executive Summary

  • Memory-bound behavior in single-machine in-memory graph processing
  • Data-aware characterization of the core and the cache hierarchy to understand the memory-bound behavior
  • Architecture design and evaluation of DROPLET, a data-aware and decoupled prefetcher for graphs, to solve the memory access bottleneck

SLIDE 3

Section I

Memory-bound behavior in single-machine in-memory graph processing

  • Application domains
  • Single-machine in-memory graph processing
  • Memory access bottleneck

SLIDE 4

Graph Processing

  • Transportation
  • Financial money flows

SLIDE 5

Interest in Graph Processing

SLIDE 6

Single-Machine In-Memory Graph Processing

Many common-case industry and academic graphs fit in the RAM of a high-end server.

Examples of big-memory machines:

  1) Intel Xeon with 1.5TB RAM
  2) HPE’s MACHINE

SLIDE 7

Bottleneck…

[Figure: Cycle stack of PageRank on the Orkut dataset. Axis: fraction of time (0.0 to 1.0). Stack components: synchronization, DRAM (45%), L3 cache, front-end, no stalls (15%).]

  • 45% of cycles are DRAM-bound stall cycles
  • Only 15% of cycles are fully utilized by the core without stalling

Data collected using Sniper on a quad-core architecture

SLIDE 8

Section II

Data-aware characterization of the core and the cache hierarchy to understand the memory-bound behavior

  • Novelty
  • Background
  • Characterization setup
  • Profiling observations
  • Summary

SLIDE 9

Characterization of Core and Cache Hierarchy

Novelty compared to prior characterization [IISWC ’15, MASCOTS ’16, SC ’15]:

  • Data-aware profiling: guidelines for managing different data types
  • Simulated environment: explicit exploration of performance sensitivity of hardware design parameters

SLIDE 10

Background: Data Type Terminology

Compressed Sparse Row (CSR) data layout

  • Structure data -> neighbor ID array
  • Property data -> vertex data array
  • Intermediate data -> any other data
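
As a rough illustration of the three data types, the sketch below builds a toy CSR graph in Python. The array names (offsets, neighbors, ranks) and the neighbor_sum kernel are assumptions made for this example; they are not taken from the slides or from the GAP benchmark code.

```python
# Toy CSR graph with 4 vertices; all names here are illustrative.
offsets = [0, 2, 3, 5, 6]         # start index of each vertex's neighbor list
neighbors = [1, 2, 2, 0, 3, 1]    # structure data: neighbor ID array
ranks = [0.25, 0.25, 0.25, 0.25]  # property data: one value per vertex

def neighbor_sum(v):
    """Sum the property values of v's neighbors."""
    # By the slide's definition, everything that is neither array above
    # (including `offsets` and `total`) counts as intermediate data.
    total = 0.0
    for i in range(offsets[v], offsets[v + 1]):
        u = neighbors[i]          # sequential read of structure data
        total += ranks[u]         # indirect read of property data
    return total

sums = [neighbor_sum(v) for v in range(len(offsets) - 1)]
print(sums)
```

The inner loop reads neighbors sequentially but indexes ranks with whichever ID it just loaded, so the property access cannot start until the structure access has returned.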

SLIDE 11

Experimental Setup

Algorithms (GAP Benchmark)

  • Connected Components (CC)
  • PageRank (PR)
  • Betweenness Centrality (BC)
  • Breadth First Search (BFS)
  • Single Source Shortest Path (SSSP)

Hardware Characteristics on SniperSIM

  • 4-core, 128-entry ROB, 2.66GHz
  • Private L1D/I caches, 32KB, 8-way SA, 4 cycles
  • Private L2 cache, 256KB, 8-way SA, 8 cycles
  • Shared L3, 8MB, 16-way SA, 30 cycles
  • DDR3 DRAM, access latency = 45 ns

Datasets (|V| = # vertices, |E| = # edges)

  • Kron: 16.8M |V|, 260M |E|
  • Urand: 8.4M |V|, 134M |E|
  • Orkut: 3M |V|, 117M |E|
  • LiveJournal: 4.8M |V|, 68.5M |E|
  • Road: 23.9M |V|, 57.7M |E|
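
A back-of-the-envelope check compares each dataset's property array with the 8MB shared L3 listed above. It assumes 4 bytes of property data per vertex, which is an assumption for this sketch (actual sizes depend on the algorithm), and it treats 1 MB as 10**6 bytes for simplicity.

```python
# Vertex counts (in millions) come from the dataset list on this slide.
vertices_m = {"Kron": 16.8, "Urand": 8.4, "Orkut": 3.0,
              "LiveJournal": 4.8, "Road": 23.9}
BYTES_PER_PROPERTY = 4   # assumption: one 4-byte value per vertex
LLC_MB = 8               # shared L3 capacity from the hardware list

# Property array footprint in MB (millions of vertices x bytes per vertex).
footprint_mb = {name: v * BYTES_PER_PROPERTY for name, v in vertices_m.items()}

for name, mb in footprint_mb.items():
    print(f"{name:12s} {mb:6.1f} MB  ({mb / LLC_MB:.1f}x the LLC)")
```

Under that assumption, even the smallest property array (Orkut) is larger than the LLC, so property data cannot stay fully resident on-chip.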

SLIDE 12

Profiling Overview

  • Can we achieve higher Memory-Level Parallelism (MLP)?
  • If not, what factor is restricting MLP?
  • What is the relative performance sensitivity of different cache levels?
  • How do different data types use the memory hierarchy?

SLIDE 13

Instruction Window size does not hinder MLP

We increase the instruction window (IW) size to 4X…

[Figure, two panels: (1) speedup with larger ROB and (2) DRAM BW utilization change (%), for CC, SSSP, PR, BFS, and BC on kron, livejournal, orkut, road, and urand, plus the mean.]

  • Average memory BW utilization increases by only 2.7%
  • Average speedup is only 1.44%

SLIDE 14

Load-load dependency hinders MLP

43.2% of loads are part of a dependency chain, with an average chain length of 2.5

For every load in the ROB, we track its dependency backward until we find an older load…

  LD[R4] -> R2         (consumer load)
  ADD R1, R3 -> R4
  LD[R5] -> R3         (producer load)
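
The backward tracking described on this slide can be sketched as follows. This is a minimal illustration, not the profiling code used by the authors: the Inst record, the list-based ROB, and find_producer_load are all hypothetical names, and the ROB is listed oldest first.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination register, source registers.
Inst = namedtuple("Inst", "op dst srcs")

# The three-instruction example from the slide, in program order (oldest first).
rob = [
    Inst("LD",  "R3", ["R5"]),         # producer load
    Inst("ADD", "R4", ["R1", "R3"]),
    Inst("LD",  "R2", ["R4"]),         # consumer load
]

def find_producer_load(rob, idx):
    """Return the index of an older load that feeds rob[idx], or None."""
    pending = [(idx, reg) for reg in rob[idx].srcs]  # (reader index, register)
    visited = set()
    while pending:
        reader, reg = pending.pop()
        # Walk backward to the most recent older writer of `reg`.
        for j in range(reader - 1, -1, -1):
            if rob[j].dst == reg:
                if rob[j].op == "LD":
                    return j
                if j not in visited:   # keep following non-load producers
                    visited.add(j)
                    pending.extend((j, r) for r in rob[j].srcs)
                break
    return None

consumers = [i for i, inst in enumerate(rob)
             if inst.op == "LD" and find_producer_load(rob, i) is not None]
print(consumers)
```

Here the second load depends on the first through the ADD, so it cannot issue until the first load's data returns, which is what serializes the two memory accesses.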

SLIDE 15

Property data is consumer in load-load dependency

We break down producer and consumer loads by application data type…

  • Property data is mostly a consumer (54%) rather than a producer (6%)
  • Structure data is mostly a producer (41%) rather than a consumer (6%)

SLIDE 16

Private L2 cache shows negligible performance sensitivity

We vary L2 cache configurations…

An architecture without private L2 caches performs just as well for graph processing

SLIDE 17

Shared LLC shows higher performance sensitivity

We vary shared LLC capacities…

17.4% performance improvement for a 4X increase in LLC capacity

SLIDE 18

Heterogeneous Reuse Distances

We break down memory hierarchy usage by application data type…

[Figure, three panels: structure (%), property (%), and intermediate (%) accesses serviced by L1, L2, L3, and DRAM, for CC, SSSP, PR, BFS, and BC on kron, livejournal, orkut, road, and urand.]

  • Structure data has the largest reuse distance: serviced by L1 and DRAM
  • Property data has a larger reuse distance than that serviced by the L2 cache
  • Intermediate data accesses are mostly on-chip cache hits
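
Reuse distance here means the number of distinct cachelines touched between two accesses to the same line. A minimal sketch, using a hypothetical six-access trace (the line names are made up for illustration, and the quadratic scan is only suitable for small traces):

```python
def reuse_distances(trace):
    """For each access, count the distinct lines touched since the previous
    access to the same line (None for a first-time access)."""
    last_seen = {}   # line -> index of its most recent access
    result = []
    for i, line in enumerate(trace):
        if line in last_seen:
            between = trace[last_seen[line] + 1 : i]
            result.append(len(set(between)))
        else:
            result.append(None)
        last_seen[line] = i
    return result

# Hypothetical trace: structure lines S0..S3 are each touched once,
# while property line P0 is revisited after two other lines.
trace = ["S0", "P0", "S1", "S2", "P0", "S3"]
distances = reuse_distances(trace)
print(distances)
```

Under a fully associative LRU cache holding C lines, an access hits when its reuse distance is below C, which is why data with long reuse distances tends to fall through to DRAM.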

SLIDE 19

To Summarize…

Memory-bound behavior is caused by:

  • Heterogeneous reuse distances of different data types, leading to intensive DRAM accesses for structure and property data
  • Low MLP due to load-load dependency chains, limiting the possibility of overlapping DRAM accesses

SLIDE 20

Section III

Architecture design and evaluation of DROPLET, a data-aware and decoupled prefetcher for graphs to solve the memory access bottleneck

  • DROPLET introduction
  • DROPLET overview
  • L2 structure streamer
  • Property prefetcher
  • Evaluation

SLIDE 21

DROPLET: Data-AwaRe DecOuPLed PrEfeTcher

  • Data-aware: prefetches data types according to reuse distances
  • Decoupled: overcomes serialization from load-load dependency

[Figure: DROPLET block diagram. Components: private L2 cache, shared inclusive LLC, coherence engine, memory controller, data-aware structure streamer, and data-aware property prefetcher. Signals: struct_trigger, struct_req, struct_dat, prop_trigger, prop_req, prop_dat.]

SLIDE 22

DROPLET Overview

[Figure: DROPLET block diagram, repeated from slide 21.]

The data-aware structure streamer sends a prefetch request for structure data

SLIDE 23

DROPLET Overview

[Figure: DROPLET block diagram, repeated from slide 21.]

  • A copy of the prefetched structure cacheline triggers the property prefetcher in the MC.
  • The property prefetcher uses information in the structure cacheline to calculate property prefetch addresses.

SLIDE 24

DROPLET Overview

[Figure: DROPLET block diagram, repeated from slide 21.]

  • The property prefetch address is used to check the coherence engine for on-chip presence of data
  • If not on-chip, line up the request in the MC
  • If on-chip, query the LLC for property data
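
The routing decision on this slide can be sketched as below. This is a software illustration only: the function name, the set standing in for the coherence engine, and the two lists standing in for hardware queues are all hypothetical.

```python
def route_property_prefetch(addr, on_chip_lines, mc_queue, llc_queries):
    """Route one property prefetch address following the slide's three steps."""
    if addr in on_chip_lines:      # coherence engine reports on-chip presence
        llc_queries.append(addr)   # query the LLC for the property data
        return "llc"
    mc_queue.append(addr)          # otherwise line the request up in the MC
    return "mc"

on_chip = {0x1000}                 # lines the coherence engine knows are on-chip
mc_queue, llc_queries = [], []
results = [route_property_prefetch(a, on_chip, mc_queue, llc_queries)
           for a in (0x1000, 0x2000)]
print(results)
```

Checking on-chip presence first avoids fetching a line from DRAM when a copy already sits in the cache hierarchy.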

SLIDE 25

DROPLET Overview

[Figure: DROPLET block diagram, repeated from slide 21.]

Place prefetched property data in L2

SLIDE 26

L2 Structure Streamer

[Figure: flowchart of the data-aware L2 streamer. Elements: core, L1D, TLB, L2 req queue, L2, data-aware L2 streamer. Decision labels: "miss?", "bit set?", "hit with bit set?". Changes shaded in purple and blue.]

Extra bit in (1) TLB & (2) L2 req queue to identify structure data
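
A minimal sketch of how an extra bit could travel from the TLB into an L2 request so that a streamer reacts only to structure data. It models the tagging idea only: the dictionary TLB, the 4KB pages, the 64-byte lines, and the simple next-line streaming are assumptions for illustration, and the exact trigger conditions of the real streamer are those in the flowchart and the paper.

```python
LINE = 64        # assumed cacheline size in bytes
PAGE_BITS = 12   # assumed 4KB pages

# Hypothetical TLB: virtual page number -> (physical page number, structure bit)
tlb = {0x10: (0x80, True),     # page holding part of the neighbor ID array
       0x20: (0x90, False)}    # page holding other data

def make_l2_request(vaddr):
    """Translate an L1D miss and carry the structure bit into the L2 request."""
    ppn, is_structure = tlb[vaddr >> PAGE_BITS]
    paddr = (ppn << PAGE_BITS) | (vaddr & ((1 << PAGE_BITS) - 1))
    return {"paddr": paddr, "structure": is_structure}

def streamer_prefetches(request, depth=2):
    """Return next-line prefetch addresses, but only for structure requests."""
    if not request["structure"]:
        return []
    base = request["paddr"] - request["paddr"] % LINE
    return [base + LINE * k for k in range(1, depth + 1)]

structure_req = make_l2_request(0x10040)
other_req = make_l2_request(0x20000)
print([hex(a) for a in streamer_prefetches(structure_req)])
print(streamer_prefetches(other_req))
```

Requests without the bit pass through untouched, so the streamer's bandwidth is spent only on the neighbor ID array.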

SLIDE 27

Property Prefetcher

[Figure: MC-based property prefetcher (MPP) block diagram. Data from DRAM with the C-bit set is a prefetched structure cacheline. The property address generator (PAG) scans the cacheline at a granularity of 4B or 8B, shifts each neighbor ID left by 2, and adds the base address of the property array to produce the target property virtual addresses for prefetch. Other elements: virtual address buffer (VAB), MTLB, physical address buffer (PAB), memory request buffer (MRB), core ID fields, labels C, FCFS, and RH, and outputs to the caches and to the coherence engine.]

PAG calculates virtual property prefetch addresses

SLIDE 28

Property Prefetcher

[Figure: MPP block diagram, repeated from slide 27. Callouts: information obtained from application layer; prefetched structure cacheline.]

Equation to calculate property addresses: address = base + 4 × neighbor ID
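
The equation can be exercised in a few lines of Python. This is a software sketch of the arithmetic only, not the hardware PAG: the function name, the little-endian unsigned encoding of neighbor IDs, the 64-byte cacheline, and the example base address are assumptions for illustration.

```python
import struct

def property_addresses(cacheline: bytes, base: int, id_bytes: int = 4):
    """Compute address = base + 4 * neighbor_id for each ID in the cacheline."""
    fmt = "<I" if id_bytes == 4 else "<Q"   # 4B or 8B scan granularity
    addrs = []
    for off in range(0, len(cacheline), id_bytes):
        (neighbor_id,) = struct.unpack_from(fmt, cacheline, off)
        addrs.append(base + (neighbor_id << 2))   # <<2 multiplies by 4
    return addrs

# A hypothetical 64-byte structure cacheline holding neighbor IDs 100..115.
line = struct.pack("<16I", *range(100, 116))
BASE = 0x7F000000   # made-up base address of the property array
addrs = property_addresses(line, base=BASE)
print(len(addrs), hex(addrs[0]), hex(addrs[-1]))
```

One structure cacheline yields up to 16 independent property addresses at 4B granularity, so the prefetches can be issued together instead of waiting on each load in turn.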

SLIDE 32

Help From Application Layer

[Figure: MPP block diagram, repeated from slide 27.]

A specialized malloc passes this information from the application to the hardware. More in the paper!

SLIDE 33

Experiments

DROPLET is compared to:

  • No-prefetch baseline
  • Conventional L2 stream prefetcher
  • Variable Length Delta Prefetcher (VLDP) at L2
  • Global History Buffer (GHB) at L2
  • streamMPP1: conventional L2 streamer + property prefetcher in MC
  • monoDROPLETL1: monolithic data-aware streamer and property prefetcher at L1. Similar to the state-of-the-art graph prefetcher (ICS ’16*).

* S. Ainsworth and T. M. Jones, “Graph Prefetching Using Data Structure Knowledge,” ICS 2016

SLIDE 34

Data-Aware + Decoupled = High Performance

DROPLET achieves performance improvements of:

  • 19%-102% over a no-prefetch baseline
  • 9%-74% over a stream prefetcher
  • 14%-74% over VLDP
  • 19%-115% over GHB
  • 4%-12% over the state-of-the-art graph prefetcher

[Figure: speedup over no-prefetch baseline (0.0 to 2.0) by graph algorithm (CC, PR, BC, BFS, SSSP) for GHB, VLDP, stream, streamMPP1, DROPLET, and monoDROPLETL1. Labeled values: 102%, 30%, 19%, 32%, 26%.]

More experiments in paper!

SLIDE 35

Conclusions

  • Memory access is the primary bottleneck in single-machine in-memory graph processing.
  • The memory access bottleneck arises from:
      1) Heterogeneous reuse distances of different data types, leading to DRAM accesses for graph structure and property data.
      2) Load-load dependency, which restricts MLP.
  • DROPLET, a data-aware and decoupled prefetcher, effectively solves the memory access bottleneck.

SLIDE 36

Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads

Abanti Basak, Shuangchen Li, Xing Hu, Sang Min Oh, Xinfeng Xie, Li Zhao*, Xiaowei Jiang*, and Yuan Xie

University of California, Santa Barbara
*Server Architecture Group, Alibaba Inc.

Thanks! Questions?

SLIDE 37

Backup Slides

SLIDE 38

Property data benefits from LLC capacity

With increasing LLC capacity, most of the reduction in DRAM accesses comes from property data

SLIDE 39

Property Prefetcher

[Figure: MPP block diagram, repeated from slide 27, with callouts on the C bit and the core ID.]

  • C bit: identifies prefetched structure data
  • Core ID: identifies which private L2 cache property prefetches should be sent to

SLIDE 40

Hardware Overhead

  • Extra bits in TLB: 1.56% storage overhead in paging structure
  • Extra bits in L2 request queue: 1.54% storage overhead
  • Property prefetcher in MC: 0.0348% area overhead compared to entire chip