SLIDE 1

Balancing Graph Processing Workloads Using Work Stealing on Heterogeneous CPU-FPGA Systems

Matthew Agostini, Francis O’Brien and Tarek S. Abdelrahman

The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
matt.agostini@mail.utoronto.ca  francis.obrien@mail.utoronto.ca  tsa@ece.utoronto.ca

ICPP - August 20, 2020

SLIDE 2

Outline

  • Research context and motivation
  • Work stealing
  • Heterogeneous Work Stealing (HWS)
  • Evaluation
  • Related work
  • Conclusions and future work

SLIDE 3

Accelerator-based Computing

  • Field Programmable Gate Arrays (FPGAs)
    – Enable user-defined, application-specific circuits
    – Potential for faster, more power-efficient computing
  • Accelerators are prevalent in computing, from personal to cloud platforms

SLIDE 4

Emerging FPGA-Accelerated Servers

  • A new generation of high-performance systems tightly integrates FPGAs with multicores, targeting data centers
    – Exemplified by the Intel HARP, IBM CAPI and Xilinx Zynq systems
  • An FPGA circuit can directly access system memory in a manner that is coherent with the processor caches
    – Enables CPU threads and FPGA hardware to cooperatively accelerate an application, sharing data in system memory
    – In contrast to typical offload FPGA acceleration, which leaves the CPU idle during FPGA processing

SLIDE 5

Research Question

  • The concurrent use of (multiple) CPU threads and FPGA hardware requires balancing of workloads
  • Research question: how to balance workloads between software (CPU threads) and hardware (FPGA accelerator) such that:
    – The accelerator is fully utilized
    – Load imbalance is minimized
    – Scheduling overheads are reduced

SLIDE 6

Graph Analytics

  • We answer our research question in the context of graph analytics: applications that process large graphs to deduce some properties
    – Prevalent in social networks, targeted advertising and web searches
  • Processing graphs is notoriously load imbalanced
    – Graph structure (varying outgoing edge degrees)
    – Distribution of active/inactive vertices (computations vary across processing iterations)

SLIDE 7

This Work

  • We develop Heterogeneous Work Stealing (HWS): a strategy for balancing graph processing workloads on tightly-coupled CPU+FPGA systems
    – We identify and address some unique challenges that arise in this context
  • We implement and evaluate HWS on the Intel HARP platform
    – Use it for 3 kernels processing large real-world graphs
    – Effectively balances workloads
    – Outperforms state-of-the-art strategies
  • Supported by an Intel Strategic Research Alliance (ISRA) grant

SLIDE 8

Work Stealing

Thread T[i](Workload: workItems)
  while true do
    if has workItems then
      // Normal Execution
      Process(workItem)
    else
      // Steal
      AcquireWork(k)   // k = id of victim thread

  • Allows fine-grained workload partitioning with low overhead
  • Previously considered unsuitable for heterogeneous systems due to explicit copying of data to accelerators

SLIDE 9

Work Stealing for Graph Processing

Thread T[i](Start, End, sync)
  while true do
    if Start < End then
      // Normal Execution
      Process(vtx[Start])
      Start = Start + 1
    else   // Start == End
      // Steal
      if CAS(T[k].sync) then   // k = id of randomly chosen thread
        T[i].Start = (T[k].Start + T[k].End) / 2   // Steal half
        T[i].End = T[k].End
        T[k].End = T[i].Start
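The steal-half loop above can be sketched in runnable form. The following Python model is illustrative, not the authors' implementation: `Worker` and `steal_half` are assumed names, and a non-blocking lock acquire stands in for the CAS on T[k].sync.

```python
import threading

class Worker:
    """Per-thread descriptor mirroring T[i] above: a [start, end) vertex range."""
    def __init__(self, start, end):
        self.start = start            # next vertex to process
        self.end = end                # one past the last vertex owned
        self.lock = threading.Lock()  # stands in for CAS(T[k].sync)

def steal_half(thief, victim):
    """Steal the upper half of the victim's remaining range; True on success."""
    # A failed try-lock aborts the steal, like a failed CAS.
    if not victim.lock.acquire(blocking=False):
        return False
    try:
        if victim.end - victim.start <= 1:
            return False                           # too little left to split
        mid = (victim.start + victim.end) // 2
        thief.start, thief.end = mid, victim.end   # thief takes the upper half
        victim.end = mid                           # victim keeps the lower half
        return True
    finally:
        victim.lock.release()
```

For example, an idle thief stealing from a victim that owns vertices [0, 100) ends up with [50, 100), while the victim's range shrinks to [0, 50).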

SLIDE 10

Challenges with Heterogeneity

  • Non-Linear FPGA performance with workload size
  • How to steal from hardware?
  • Duplicate work caused by FPGA read latency
  • Hardware Limitations

SLIDE 11

Non-Linear FPGA Performance

  • The FPGA accelerator's performance depends on the size of the workload assigned to it
    – Larger workloads better amortize accelerator startup and initial latency
    – HWS assigns large enough workloads, stealing only when CPU threads idle

[Chart: execution time (s) vs. workload partition size (vertices), FPGA only vs. a single CPU thread]

SLIDE 12

Steal Mechanism

  • How does software steal from hardware?
    – Internal accelerator state is often not accessible by software
  • The accelerator exposes two CSRs: start and end
    – The thief thread reads start to determine how much to steal
    – The thief thread calculates the new FPGA end bound and then writes it to end

[Diagram: the thief CPU thread reads the start CSR and writes the end CSR; FPGA processing is uninterrupted during the steal]
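Modeling the two CSRs as plain shared fields, the software side of such a steal might look as follows (`FpgaCsrs` and `thief_steal` are hypothetical names; a real thief would read and write memory-mapped registers):

```python
class FpgaCsrs:
    """Models the two CSRs the accelerator exposes to software."""
    def __init__(self, start, end):
        self.start = start  # advanced by the FPGA as it processes vertices
        self.end = end      # written by a CPU thief to shrink the FPGA's range

def thief_steal(csrs):
    """Read start, compute a new bound, and finish the steal with one write to end.
    The FPGA is never interrupted; only its upper bound moves."""
    observed_start = csrs.start               # may already be stale
    new_end = (observed_start + csrs.end) // 2
    stolen = (new_end, csrs.end)              # range the CPU threads will process
    csrs.end = new_end
    return stolen
```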

SLIDE 13

Duplicate Work

  • The delay in reading the CSR registers leads to potential work duplication
    – The value of the start register read by the thief is stale

[Diagram: the FPGA workload spans from the start to the end CSR; start is sent to the CPU, and a new end is written from the CPU]


SLIDE 15

Duplicate Work

  • The delay in reading the CSR registers leads to potential work duplication
    – The value of the start register read by the thief is stale
    – When a small amount of work remains, the thief may steal work already performed by the FPGA

[Diagram: because start is stale, the new end written by the thief overlaps vertices the FPGA has already processed, duplicating work]

SLIDE 16

Duplicate Work

  • The delay in reading the CSR registers leads to potential work duplication
    – The value of the start register read by the thief is stale
    – When a small amount of work remains, the thief may steal work already performed by the FPGA
  • We estimate FPGA progress P, ensuring that a steal from the FPGA fails if too small a workload remains
    – Enabled by the relatively deterministic nature of the accelerator

T[i].Start = ((T[k].Start + P) + T[k].End) / 2
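A sketch of this progress-compensated steal, folding in the cache-line alignment discussed on the hardware-limitations slide; `steal_from_fpga`, the 8-vertex granularity constant, and the exact abort conditions are illustrative assumptions rather than the paper's precise logic.

```python
CACHE_LINE_VERTICES = 8  # vertices per cache line (see the hardware-limitations slide)

def steal_from_fpga(start, end, progress, min_steal=CACHE_LINE_VERTICES):
    """Compute the thief's range when stealing [start, end) work from the FPGA,
    using an estimate `progress` (P) of how far the stale start CSR has advanced.
    Returns (thief_start, thief_end, new_fpga_end), or None to abort the steal."""
    est_start = start + progress          # where the FPGA likely is by now
    thief_start = (est_start + end) // 2  # T[i].Start = ((T[k].Start + P) + T[k].End) / 2
    thief_start -= thief_start % CACHE_LINE_VERTICES  # keep the FPGA bound cache-line aligned
    # Abort if too little remains: stealing here risks duplicating finished work.
    if end - thief_start < min_steal or thief_start <= est_start:
        return None
    return thief_start, end, thief_start
```

For example, with the FPGA assigned [0, 1000) and an estimated progress of 100 vertices, the thief would take [544, 1000) after alignment; with only [0, 16) outstanding, the steal aborts.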

SLIDE 17

Hardware Limitations

  • FPGA memory requests are aligned to cache lines
    – Misaligned requests can negatively affect performance
  • HWS aligns FPGA workloads with cache lines and imposes a lower bound on stealing granularity
    – Only 8 vertices fit per cache line

SLIDE 18

Evaluation

  • Graph benchmarks
  • Platform
  • Metrics of performance
  • Results
    – Load balancing effectiveness
    – Comparison to state-of-the-art
    – Steal characteristics
    – Graph processing throughput

SLIDE 19

Graph Benchmarks

  • We use three common graph processing benchmarks:
    – Breadth-First Search (BFS)
    – Single Source Shortest Path (SSSP)
    – PageRank (PR)
  • Implemented in the Scatter-Gather paradigm
    – A common paradigm for graph processing
    – Scatter: sweep over vertices, producing updates to neighboring vertices
    – Gather: sweep over updates, applying them to destination vertices

BFS and SSSP are common benchmarks also used by Graph500.
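A minimal runnable sketch of the scatter-gather pattern, instantiated here for BFS (the function names and the level-synchronous Python driver are illustrative; in the paper the scatter phase runs on the FPGA):

```python
import math

def scatter(adj, dist, frontier):
    """Scatter: sweep over active vertices, producing updates for their neighbors."""
    return [(v, dist[u] + 1) for u in frontier for v in adj[u]]

def gather(dist, updates):
    """Gather: sweep over updates, applying them to destination vertices.
    Returns the vertices whose value improved (the next frontier)."""
    frontier = []
    for v, d in updates:
        if d < dist[v]:
            dist[v] = d
            frontier.append(v)
    return frontier

def bfs(adj, src):
    """Level-synchronous BFS built from alternating scatter and gather sweeps."""
    dist = [math.inf] * len(adj)
    dist[src] = 0
    frontier = [src]
    while frontier:
        frontier = gather(dist, scatter(adj, dist, frontier))
    return dist
```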

SLIDE 20

Evaluation Graphs

  • We process 7 large graphs, mostly drawn from SNAP

Graph          Vertices  Edges     Description
Twitter        62M       1,468M    Follower data
LiveJournal    4.8M      69M       Friendship relations data
Orkut          3M        234M      Social connections
StackOverflow  2.6M      36M       Questions and answers
Skitter        1.7M      22M       2005 Internet topology graph
Pokec          1.6M      31M       Social connections
Higgs          460K      15M       Twitter subset
SLIDE 21

Platform

  • Intel's Heterogeneous Architecture Research Platform (HARP)
    – Xeon E5-2680 v4 CPU + Arria 10 GX1150 FPGA
    – The AFU issues cache-coherent reads/writes to system memory
  • AFUs for the scatter phase of graph processing [O'Brien 2020]
    – The gather phase is done by CPU threads


[Diagram: Xeon multicore and Arria 10 FPGA connected to system memory over a QPI/PCIe interconnect; AFU: user's Accelerator Function Unit; FIU: QPI/PCIe link protocols, data cache, address translation]

SLIDE 22

Performance Metrics

  • Execution time: time for processing, excluding loading the graph into memory
  • Load imbalance: the maximum useful work time of a thread relative to the average useful work time
  • Throughput: the number of traversed edges per second (MTEPS)


λ = (maximum useful work time) / (average useful work time); ideally, λ is 1
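As a worked example of this metric (a trivial helper, not from the paper):

```python
def load_imbalance(useful_times):
    """lambda = max useful work time / average useful work time; 1.0 is perfect balance."""
    return max(useful_times) / (sum(useful_times) / len(useful_times))
```

Four threads busy for 4, 2, 2 and 0 seconds give λ = 4 / 2 = 2.0, while equal times give the ideal λ = 1.0.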

SLIDE 23

Comparisons

  • We compare HWS to different load balancing strategies
    – Static: equal-sized partitions to all threads, giving the FPGA 2.5X more
    – Best-Dynamic: a chunk self-scheduling load balancer with a priori knowledge of the optimal chunk size
    – HAP: Heterogeneous Adaptive Partitioning scheduler [Rodriguez 2019]
  • We define speedup as the ratio of the execution time of Static to that of a load balancing strategy

SLIDE 24

Disclaimer

  • The results in this paper were generated using pre-production hardware and software, and may not reflect the performance of production or future systems.

SLIDE 25

BFS Scatter λ

[Chart: λ per graph for Static, Best-Dynamic, HAP and HWS; HARPv2, BFS scatter, 15 threads + AFU]


SLIDE 27

BFS Scatter Performance

[Chart: speedup over Static per graph for Static, Best-Dynamic, HAP and HWS; HARPv2, BFS scatter, 15 threads + AFU]

SLIDE 28

HWS Steal Characteristics

[Chart: steal counts per graph, showing steals by the FPGA, steals from the FPGA, and aborted steals; HARPv2, SSSP, 7 threads + AFU]

SLIDE 29

Average FPGA Chunk Size

[Chart: average FPGA chunk (stream) size per graph, log scale from 1 to 1,000,000, for Best-Dynamic, HAP and HWS; HARPv2, SSSP, 7 threads + AFU]

SLIDE 30

Overall Throughput

[Chart: throughput (MTEPS) per graph for HWS with 4, 8 and 16 threads; HARPv2, BFS]

SLIDE 31

Related Work

  • Heterogeneous scheduling: Tripp (2005), Belviranli (2013), Vilches (2015), Song (2016), Navarro (2019), Rodriguez (2019), Wang (2019)
    – Focus is on adaptive chunk size selection
    – We introduce work stealing and demonstrate its effectiveness
  • Work stealing: Acar (2013), Cong (2008), Dinan (2009), Hendler (2002), Khayyat (2013), Nakashima (2019)
    – We extend work stealing to heterogeneous systems
  • FPGA accelerators for graph processing: Dai (2016), Engelhardt (2016), Zhou (2017), Zhou (2019)
    – We extend to concurrent CPU-FPGA usage

SLIDE 32

Concluding Remarks

  • HWS addresses some unique challenges when processing graphs on CPU-FPGA systems
    – HWS maximizes FPGA throughput by ensuring large workloads
    – HWS achieves perfect load balance (λ = 1)
    – HWS outperforms other state-of-the-art schedulers
  • Our results collectively demonstrate that work stealing is an effective solution for balancing graph processing workloads on tightly-coupled heterogeneous CPU-FPGA systems

SLIDE 33

Future Work

  • Heterogeneous acceleration of the gather phase of graph processing
  • Integration of HWS with multiple FPGAs
  • Work stealing optimizations: priority-based victim selection
  • Processing of dynamically changing graphs

SLIDE 34

Thank You