

SLIDE 1

GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks

Lifeng Nai, Ramyad Hadidi, Jaewoong Sim*,

Hyojong Kim, Pranith Kumar, Hyesoon Kim
Georgia Tech, *Intel Labs

SLIDE 2

INTRODUCTION

⎮ Graph computing: processing big network data

} Social network, knowledge network, bioinformatics, etc.

⎮ Graph computing is inefficient on conventional architectures

} Inefficiency in memory subsystems

2

SLIDE 3

INTRODUCTION

⎮ Processing-in-memory (PIM)

} PIM has the potential of helping graph computing performance
} PIM is being realized in real products: Hybrid Memory Cube (HMC) 2.0

3

Enable PIM for graph computing

SLIDE 4

CHALLENGES

⎮ What are the benefits of PIM for graph computing?

} Known benefits of PIM
} Bandwidth savings, latency reduction, more computation power
} But they are not good enough
} We explore something more!

⎮ How to enable PIM for graph in a practical way?

} Minor hardware/software change
} No programmer burden

4

SLIDE 5

OVERVIEW

5

GraphPIM: a PIM-enabled graph framework

We identify a new benefit of PIM offloading
We determine PIM offloading targets
We enable PIM without user-application change

SLIDE 6

KNOWN PIM BENEFITS

⎮ More computation power

} Extra computation units in memory

⎮ Bandwidth savings

} Pulling data vs. pushing computation command

⎮ Latency reduction

} Bypassing cache for offloaded accesses avoids cache-checking overhead
} Avoids cache pollution and increases effective cache size

                                      Request               Response        Total
64-byte READ                          1 FLIT (addr)         5 FLITs (data)  6 FLITs
64-byte WRITE                         5 FLITs (addr, data)  1 FLIT (ack)    6 FLITs
CPU rd-modify-wr (rd-miss; wr-evict)  6 FLITs               6 FLITs         12 FLITs
PIM rd-modify-wr                      2 FLITs (addr, imm)   1 FLIT (ack)    3 FLITs

6

(FLIT: 16 byte, basic flow unit)

SLIDE 7

PERFORMANCE BENEFITS

⎮ More computation power?

} Limited # of FUs in memory

⎮ Bandwidth savings?

} Not BW saturated

⎮ Latency reduction?

} Yes, but small

7

SLIDE 8

GRAPHPIM EXPLORES…

⎮ Atomic overhead reduction

} Atomic instructions on CPUs have substantial overhead [Schweizer’15]

} RMW: read-modify-write
} Cache operation: cache-line invalidation, coherence traffic, etc.
} Data ordering: write-buffer draining, pipeline freezes, etc.

} Because of the characteristics of the graph programming model, PIM offloading can avoid the atomic overhead

  • H. Schweizer et al., “Evaluating the Cost of Atomic Operations on Modern Architectures,” PACT’15

(Diagram: Atomic Instruction = RMW + Data Ordering + Cache Operation)

8

SLIDE 9

ATOMIC OVERHEAD REDUCTION

(Diagram: on a CPU, an atomic instruction stalls the pipeline through the RMW, data-ordering, and cache operations before it retires; with PIM, the core offloads the atomic, receives an ACK, retires, and continues execution while the serialization happens in PIM.)

SLIDE 10

ATOMIC OVERHEAD ESTIMATION

⎮ Atomic overhead experiments on a Xeon E5 machine

} Atomic RMW → regular load + compute + store

⎮ Atomic instructions incur 30% performance degradation

(Chart: normalized execution time, Atomic vs. Non-Atomic, for BFS, CComp, DC, kCore, SSSP, TC, BC, PRank, and GMean)

10

(Non-Atomic: artificial experiment, not precise estimation)

SLIDE 11

PERFORMANCE BENEFITS

⎮ More computation power?

} Limited # of FUs in memory

⎮ Bandwidth savings?

} Not BW saturated

⎮ Latency reduction?

} Yes, but small

⎮ Atomic overhead reduction?

} Yes, and significant!
} Main source of PIM benefit for graph

11

SLIDE 12

OFFLOADING TARGETS

⎮ Code snippet: Breadth-first search (BFS)

 1  F ← {source}
 2  while F is not empty
 3    F’ ← {∅}
 4    for each u ∈ F in parallel
 5      d ← u.depth + 1
 6      for each v ∈ neighbor(u)
 7        ret ← CAS(v.depth, inf, d)
 8        if ret == success
 9          F’ ← F’ ∪ v
10        endif
11      endfor
12    endfor
13    barrier()
14    F ← F’
15  endwhile

F: frontier vertex set of the current step
F’: frontier vertex set of the next step
u.depth: depth value of vertex u
neighbor(u): neighbor vertices of u
CAS(v.depth, inf, d): atomic compare-and-swap operation

lines 4-5, 8-10: accessing metadata (cache friendly ✔)
line 6: accessing graph structure (cache friendly ✔)
line 7: accessing graph property (cache unfriendly + atomic)

Offload atomic operations on graph property

SLIDE 13

INDICATE OFFLOADING TARGETS

13

⎮ How to indicate offloading targets?
⎮ Option #1: Mark instructions

} New instructions in ISA
} Requires changes in user-level applications

⎮ Option #2: Mark memory regions

} Special memory region for offloading data
} Can be transparent to application programmers

SLIDE 14

GRAPH FRAMEWORK

⎮ Graph computing is framework-based

} User application is designed on top of framework interfaces
} Data is managed within the framework

14

User Application:
  G = load_graph(“Fruit”);
  V1 = G.find_vertex(“Apple”);
  V1.property().price = 5;
  V1.add_neighbor(“Orange”);

load_graph (Framework):
  G_structure = malloc(size1);
  G_property = malloc(size2);
  // open file & load data

(Graph APIs → Middleware)

SLIDE 15

ENABLE PIM IN GRAPH FRAMEWORK

(Diagram: software stack (user application, graph API, middleware, graph data management, OS) above the hardware architecture)

15

SW: the graph framework swaps malloc() → pmr_malloc() for the graph property; no user application change.
HW: the host processor core gains a PIM offloading unit.

load_graph (Framework):
  G_structure = malloc(size1);
  G_property = pmr_malloc(size2);   // was malloc(size2)
  // open file & load data

SLIDE 16

FRAMEWORK CHANGE

⎮ PIM memory region (PMR)

} Uncacheable memory region in the virtual memory space
} Utilizes existing uncacheable (UC) support in x86

⎮ Framework change

} malloc() → pmr_malloc()
} pmr_malloc(): customized malloc function that allocates memory objects in the PMR

(Diagram: in the virtual memory space, pmr_malloc() places the graph property in the PIM memory region, while malloc() keeps the graph structure and other data, managed by the framework, in regular memory.)

16

SLIDE 17

ARCHITECTURE CHANGE

⎮ PIM offloading unit (POU)

} Identifies atomic instructions that access the PIM memory region
} Offloads them as PIM instructions

(Diagram: in the host processor, the POU sits between the core's caches and the HMC; the HMC's atomic units execute the offloaded instructions.)

17

SLIDE 18

CHANGES

⎮ Software changes:

} No user application change
} Minor change in framework: malloc() → pmr_malloc()

⎮ Hardware changes:

} PIM memory region: utilizes existing uncacheable (UC) support
} PIM offloading unit (POU): identifies offloading targets

No burden on programmers + Minor HW/SW change

18

SLIDE 19

EVALUATION

⎮ Simulation Environment

} SST (framework) + MacSim (CPU) + VaultSim (HMC)

⎮ Benchmark

} GraphBIG benchmark suite [Nai’15] (https://github.com/graphbig)
} LDBC dataset from the Linked Data Benchmarking Council (LDBC)

⎮ Configuration

} 16 OoO cores, 2 GHz, 4-issue
} 32KB L1 / 256KB L2 / 16MB shared L3
} HMC 2.0 spec, 8GB, 32 vaults, 512 banks, 4 links, 120GB/s per link

19

  • L. Nai et al., “GraphBIG: Understanding Graph Computing in the Context of Industrial Solutions,” SC’15

SLIDE 20

EVALUATION: PERFORMANCE

⎮ Baseline: no PIM offloading
⎮ U-PEI: performance upper bound of PIM-enabled instructions [Ahn’15]

(Chart: speedup over Baseline for U-PEI and GraphPIM, with %PIM-Atomic in all instructions, 0-3%, on a secondary axis)

Up to 2.4X speedup
On average 1.6X speedup

20

  • J. Ahn et al., “PIM-Enabled Instructions: A Low-Overhead, Locality-Aware PIM Architecture,” ISCA’15

SLIDE 21

EVALUATION: EXECUTION TIME BREAKDOWN

⎮ Breakdown of normalized execution time

} Atomic-inCore: atomic overhead of offloading targets (atomic inst.)
} Atomic-inCache: cache-checking overhead of offloading targets

(Chart: normalized execution time, Baseline vs. GraphPIM, for BFS, CComp, DC, kCore, SSSP, TC, BC, and PRank, broken down into Other, Atomic-inCore, and Atomic-inCache)

SLIDE 22

CONCLUSION

⎮ Graph computing is inefficient on conventional architectures
⎮ GraphPIM enables PIM in graph computing frameworks

} Explores a new benefit of PIM offloading: atomic overhead reduction
} Identifies atomic operations on graph property as the offloading target
} Requires no user-application change and only minor changes in the framework and architecture

22

SLIDE 23

THANK YOU!

SLIDE 24

BACKUP SLIDES

SLIDE 25

DYNAMIC CACHE BEHAVIOR?

25

⎮ GraphPIM marks memory region statically

} Cons: cannot adapt to working-set sizes
} But property accesses to graphs have very high cache miss rates regardless of graph inputs, except for really small graphs [JPDC’16, SC’15]
} Pros: coherence support between memory and the processor cache is not required

SLIDE 26

CONSISTENCY?

26

⎮ PIM offloading for atomic instructions works fine because…

} The programming model of graph applications naturally avoids consistency issues: all PIM writes are done before reads
} Graph applications require only atomicity from atomic instructions
} But atomic instructions on CPUs don’t allow specifying atomicity without a fence
} We also have follow-up work discussing the consistency issue for PIM instructions in the context of general applications [MEMSYS’17]

SLIDE 27

CONSISTENCY?

27

⎮ Graph applications with the BSP model naturally avoid consistency issues

} Barriers ensure all PIM writes are done before reads

Program phases and operations:

loop:
  foreach vertex in task queue:
    read property                // reads
    fetch neighbor list
    foreach neighbor:
      update neighbor property   // HMC inst.
  update next-iter task queue
  barrier

SLIDE 28

WHY GRAPHPIM IS A FRAMEWORK?

28

⎮ GraphPIM

} Considers the separation of framework and user application
} Proposes a full-stack solution: SW framework + HW architecture
} Requires no effort from application programmers

⎮ Users can easily enable/disable GraphPIM by switching between different framework libraries.

SLIDE 29

BANDWIDTH SENSITIVITY?

29

⎮ Graph workloads on CPUs are not very sensitive to BW changes

} Speedup over baseline system with different HMC link bandwidth

(Chart: speedup over baseline for BFS, CComp, DC, kCore, SSSP, TC, BC, PRank, and GMean under Baseline, Baseline-half-BW, Baseline-double-BW, GraphPIM, GraphPIM-Half-BW, and GraphPIM-Double-BW)

SLIDE 30

APPLICABILITY?

30

Category         Workload                Applicable?  Offloading Target      PIM Inst.
Graph Traversal  Breadth-first search    ✔            lock cmpxchg           CAS if equal
                 Degree centrality       ✔            lock addw              Signed add
                 Betweenness centrality  ✘            (Floating-point add)   (FP add)
                 Shortest path           ✔            lock cmpxchg           CAS if equal
                 K-core decomposition    ✔            lock subw              Signed add
                 Connected component     ✔            lock cmpxchg           CAS if equal
                 Page rank               ✘            (Floating-point add)   (FP add)
Dynamic Graph    Graph construction      ✘            (Complex operation)
                 Graph update            ✘            (Complex operation)
                 Topology morphing       ✘            (Complex operation)
Rich Property    Triangle count          ✔            lock add               Signed add
                 Gibbs inference         ✘            (Compute intensive)

SLIDE 31

GRAPHPIM: EVALUATION

31

⎮ GraphPIM speedup over baseline with different dataset sizes

(Chart: GraphPIM speedup, 1x-5x, for BFS, CComp, DC, kCore, SSSP, TC, BC, and PRank with the LDBC-1M, LDBC-100k, LDBC-10k, and LDBC-1k datasets)

SLIDE 32

GRAPHPIM: EVALUATION

32

⎮ Normalized uncore energy consumption

(Chart: normalized uncore energy breakdown (Caches, HMC Link, HMC FU, HMC LogicLayer, HMC DRAM) for Baseline vs. GraphPIM across BFS, CComp, DC, kCore, SSSP, TC, BC, PRank, and GMean)

On average, GraphPIM saves 37% of uncore energy because of reductions in cache accesses and memory bandwidth

SLIDE 33

GRAPHPIM: EVALUATION

33

⎮ Normalized bandwidth consumption with request/response breakdown

(Chart: normalized memory bandwidth, split into request and response traffic, for Baseline, U-PEI, and GraphPIM across BFS, CComp, DC, kCore, SSSP, TC, BC, and PRank)

SLIDE 34

GRAPHPIM: EVALUATION

34

⎮ Performance and energy results of two real-world applications

} Based on an analytical model
} FD: financial fraud detection; RS: recommender system

(Charts: speedup over baseline, and normalized energy breakdown (Caches, HMC Link, HMC Other), for Baseline vs. GraphPIM on FD and RS)

SLIDE 35

HARDWARE CHANGES

35

⎮ PIM Memory Region (PMR)

} An uncacheable memory region in the virtual memory space
} Utilizing existing uncacheable (UC) support in x86

⎮ PIM Offloading Unit (POU)

(Diagram: the host contains cores with L1/L2, a POU per core, a last-level cache, and an HMC controller; the HMC contains a switch, serial links, and 32 vaults, each with vault logic, an atomic unit, and a DRAM partition. The POU routes each memory instruction: not in the PIM memory region → to L1; in the region but not atomic → regular memory request to the HMC; atomic and in the region → HMC PIM request.)

SLIDE 36

MOTIVATION

36

⎮ Profiling using HW performance counters

} Execution cycle breakdown: top-down methodology from Intel

(Charts: misses per kilo-instructions for L1D, L2, and L3; execution cycle breakdown into Backend, Frontend, BadSpeculation, and Retiring)

Bottleneck caused by backend stalls
High number of cache misses

SLIDE 37

BACKGROUND: PIM OFFLOADING IN HMC 2.0

37

⎮ Hybrid Memory Cube (HMC) 2.0

} One of the first industrial PIM proposals
} Instruction-level PIM offloading
} 1 logic die + 4/8 DRAM dies
} 32 vaults
} 4 serial links

(Diagram: HMC stack: DRAM layers over a logic layer, connected by TSVs, organized into vaults, with serial links to the host)

SLIDE 38

BACKGROUND: PIM OFFLOADING IN HMC 2.0

38

⎮ Packet-based protocol
⎮ Regular READ/WRITE

} FLIT: 16 bytes; the basic flow unit
} Packet layout: Header (8 bytes) + Payload (0-256 bytes) + Tail (8 bytes)

               Request   Response
64-byte READ   1 FLIT    5 FLITs
64-byte WRITE  5 FLITs   1 FLIT

SLIDE 39

BACKGROUND: PIM OFFLOADING IN HMC 2.0

39

⎮ PIM Instruction: read-modify-write (RMW) operation

} Similar to a regular READ/WRITE, just a different CMD in the header
} The DRAM bank is locked during the whole RMW for atomicity

PIM-ADD(addr, imm): Header (PIM-ADD) + addr, imm + Tail → ACK