SLIDE 1

Balancing Graph Processing Workloads Using Work Stealing on Heterogeneous CPU-FPGA Systems

Matthew Agostini, Francis O’Brien and Tarek S. Abdelrahman

The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
matt.agostini@mail.utoronto.ca  francis.obrien@mail.utoronto.ca  tsa@ece.utoronto.ca

ICPP - August 20, 2020

SLIDE 2

Outline

  • Research context and motivation
  • Work stealing
  • Heterogeneous Work Stealing (HWS)
  • Evaluation
  • Related work
  • Conclusions and future work

SLIDE 3

Accelerator-based Computing

  • Field Programmable Gate Arrays (FPGAs)
    – Enable user-defined, application-specific circuits
    – Potential for faster, more power-efficient computing
  • Accelerators are prevalent in computing, from personal to cloud platforms

SLIDE 4

Emerging FPGA-Accelerated Servers

  • A new generation of high-performance systems tightly integrates FPGAs with multicores, targeting data centers
    – Exemplified by the Intel HARP, IBM CAPI and Xilinx Zynq systems
  • An FPGA circuit can directly access system memory in a manner that is coherent with the processor caches
    – Enables CPU threads and FPGA hardware to cooperatively accelerate an application, sharing data in system memory
    – In contrast to typical offload FPGA acceleration, which leaves the CPU idle during FPGA processing

SLIDE 5

Research Question

  • The concurrent use of (multiple) CPU threads and FPGA hardware requires balancing of workloads
  • Research question: how to balance workloads between software (CPU threads) and hardware (FPGA accelerator) such that:
    – The accelerator is fully utilized
    – Load imbalance is minimized
    – Scheduling overheads are reduced

SLIDE 6

Graph Analytics

  • We answer our research question in the context of graph analytics: applications that process large graphs to deduce some properties
    – Prevalent in social networks, targeted advertising and web searches
  • Processing graphs is notoriously load imbalanced
    – Graph structure (varying outgoing edge degrees)
    – Distribution of active/inactive vertices (computations vary across processing iterations)

SLIDE 7

This Work

  • We develop Heterogeneous Work Stealing (HWS): a strategy for balancing graph processing workloads on tightly-coupled CPU+FPGA systems
    – We identify and address some unique challenges that arise in this context
  • We implement and evaluate HWS on the Intel HARP platform
    – Use it for 3 kernels processing large real-world graphs
    – Effectively balances workloads
    – Outperforms state-of-the-art strategies
  • Supported by an Intel Strategic Research Alliance (ISRA) grant

SLIDE 8

Work Stealing

Thread T[i](Workload: workItems)
  while true do
    if has workItems then
      // Normal Execution
      Process(workItem)
    else
      // Steal
      AcquireWork(k)   // k = id of victim thread

  • Allows fine-grained workload partitioning with low overhead
  • Previously considered unsuitable for heterogeneous systems due to explicit copying of data to accelerators

SLIDE 9

Work Stealing for Graph Processing

Thread T[i](Start, End, sync)
  while true do
    if Start < End then
      // Normal Execution
      Process(vtx[Start])
      Start = Start + 1
    else   // Start == End
      // Steal
      if CAS(T[k].sync) then   // k = id of randomly chosen thread
        T[i].Start = (T[k].Start + T[k].End) / 2   // Steal half
        T[i].End = T[k].End
        T[k].End = T[i].Start
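The steal-half loop above can be sketched in runnable form. The following Python model is illustrative, not the authors' implementation: `Worker` and `steal_half` are assumed names, and a non-blocking lock acquire stands in for the CAS on T[k].sync.

```python
import threading

class Worker:
    """Per-thread descriptor mirroring T[i] above: a [start, end) vertex range."""
    def __init__(self, start, end):
        self.start = start            # next vertex to process
        self.end = end                # one past the last vertex owned
        self.lock = threading.Lock()  # stands in for CAS(T[k].sync)

def steal_half(thief, victim):
    """Steal the upper half of the victim's remaining range; True on success."""
    # A failed try-lock aborts the steal, like a failed CAS.
    if not victim.lock.acquire(blocking=False):
        return False
    try:
        if victim.end - victim.start <= 1:
            return False                           # too little left to split
        mid = (victim.start + victim.end) // 2
        thief.start, thief.end = mid, victim.end   # thief takes the upper half
        victim.end = mid                           # victim keeps the lower half
        return True
    finally:
        victim.lock.release()
```

For example, an idle thief stealing from a victim that owns vertices [0, 100) ends up with [50, 100), while the victim's range shrinks to [0, 50).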

SLIDE 10

Challenges with Heterogeneity

  • Non-Linear FPGA performance with workload size
  • How to steal from hardware?
  • Duplicate work caused by FPGA read latency
  • Hardware Limitations

SLIDE 11

Non-Linear FPGA Performance

  • The FPGA accelerator's performance depends on the size of the workload assigned to it
    – Larger workloads better amortize accelerator startup and initial latency
    – HWS assigns large enough workloads, stealing only when CPU threads idle

[Chart: execution time (s) vs. workload partition size (vertices), FPGA only vs. a single CPU thread]

SLIDE 12

Steal Mechanism

  • How does software steal from hardware?
    – Internal accelerator state is often not accessible by software
  • The accelerator exposes two CSRs: start and end
    – The thief thread reads start to determine how much to steal
    – The thief thread calculates the new FPGA end bound and then writes it to end

[Diagram: the thief CPU thread reads the start CSR and writes the end CSR; FPGA processing is uninterrupted during the steal]
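Modeling the two CSRs as plain shared fields, the software side of such a steal might look as follows (`FpgaCsrs` and `thief_steal` are hypothetical names; a real thief would read and write memory-mapped registers):

```python
class FpgaCsrs:
    """Models the two CSRs the accelerator exposes to software."""
    def __init__(self, start, end):
        self.start = start  # advanced by the FPGA as it processes vertices
        self.end = end      # written by a CPU thief to shrink the FPGA's range

def thief_steal(csrs):
    """Read start, compute a new bound, and finish the steal with one write to end.
    The FPGA is never interrupted; only its upper bound moves."""
    observed_start = csrs.start               # may already be stale
    new_end = (observed_start + csrs.end) // 2
    stolen = (new_end, csrs.end)              # range the CPU threads will process
    csrs.end = new_end
    return stolen
```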

SLIDE 13

Duplicate Work

  • The delay in reading the CSR registers leads to potential work duplication
    – The value of the start register read by the thief is stale

[Diagram: the FPGA workload spans from the start to the end CSR; start is sent to the CPU, and a new end is written from the CPU]


SLIDE 15

Duplicate Work

  • The delay in reading the CSR registers leads to potential work duplication
    – The value of the start register read by the thief is stale
    – When a small amount of work remains, the thief may steal work already performed by the FPGA

[Diagram: because start is stale, the new end written by the thief overlaps vertices the FPGA has already processed, duplicating work]

SLIDE 16

Duplicate Work

  • The delay in reading the CSR registers leads to potential work duplication
    – The value of the start register read by the thief is stale
    – When a small amount of work remains, the thief may steal work already performed by the FPGA
  • We estimate FPGA progress P, ensuring that a steal from the FPGA fails if too small a workload remains
    – Enabled by the relatively deterministic nature of the accelerator

T[i].Start = ((T[k].Start + P) + T[k].End) / 2
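A sketch of this progress-compensated steal, folding in the cache-line alignment discussed on the hardware-limitations slide; `steal_from_fpga`, the 8-vertex granularity constant, and the exact abort conditions are illustrative assumptions rather than the paper's precise logic.

```python
CACHE_LINE_VERTICES = 8  # vertices per cache line (see the hardware-limitations slide)

def steal_from_fpga(start, end, progress, min_steal=CACHE_LINE_VERTICES):
    """Compute the thief's range when stealing [start, end) work from the FPGA,
    using an estimate `progress` (P) of how far the stale start CSR has advanced.
    Returns (thief_start, thief_end, new_fpga_end), or None to abort the steal."""
    est_start = start + progress          # where the FPGA likely is by now
    thief_start = (est_start + end) // 2  # T[i].Start = ((T[k].Start + P) + T[k].End) / 2
    thief_start -= thief_start % CACHE_LINE_VERTICES  # keep the FPGA bound cache-line aligned
    # Abort if too little remains: stealing here risks duplicating finished work.
    if end - thief_start < min_steal or thief_start <= est_start:
        return None
    return thief_start, end, thief_start
```

For example, with the FPGA assigned [0, 1000) and an estimated progress of 100 vertices, the thief would take [544, 1000) after alignment; with only [0, 16) outstanding, the steal aborts.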

SLIDE 17

Hardware Limitations

  • FPGA memory requests are aligned to cache lines
    – Misaligned requests can negatively affect performance
  • HWS aligns FPGA workloads with cache lines and imposes a lower bound on stealing granularity
    – Only 8 vertices fit per cache line

SLIDE 18

Evaluation

  • Graph benchmarks
  • Platform
  • Metrics of performance
  • Results
    – Load balancing effectiveness
    – Comparison to state-of-the-art
    – Steal characteristics
    – Graph processing throughput

SLIDE 19

Graph Benchmarks

  • We use three common graph processing benchmarks:
    – Breadth-First Search (BFS)
    – Single Source Shortest Path (SSSP)
    – PageRank (PR)
  • Implemented in the Scatter-Gather paradigm
    – A common paradigm for graph processing
    – Scatter: sweep over vertices, producing updates to neighboring vertices
    – Gather: sweep over updates, applying them to destination vertices

BFS and SSSP are common benchmarks also used by Graph500.
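A minimal runnable sketch of the scatter-gather pattern, instantiated here for BFS (the function names and the level-synchronous Python driver are illustrative; in the paper the scatter phase runs on the FPGA):

```python
import math

def scatter(adj, dist, frontier):
    """Scatter: sweep over active vertices, producing updates for their neighbors."""
    return [(v, dist[u] + 1) for u in frontier for v in adj[u]]

def gather(dist, updates):
    """Gather: sweep over updates, applying them to destination vertices.
    Returns the vertices whose value improved (the next frontier)."""
    frontier = []
    for v, d in updates:
        if d < dist[v]:
            dist[v] = d
            frontier.append(v)
    return frontier

def bfs(adj, src):
    """Level-synchronous BFS built from alternating scatter and gather sweeps."""
    dist = [math.inf] * len(adj)
    dist[src] = 0
    frontier = [src]
    while frontier:
        frontier = gather(dist, scatter(adj, dist, frontier))
    return dist
```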

SLIDE 20

Evaluation Graphs

  • We process 7 large graphs, mostly drawn from SNAP

Graph          Vertices  Edges     Description
Twitter        62M       1,468M    Follower data
LiveJournal    4.8M      69M       Friendship relations data
Orkut          3M        234M      Social connections
StackOverflow  2.6M      36M       Questions and answers
Skitter        1.7M      22M       2005 Internet topology graph
Pokec          1.6M      31M       Social connections
Higgs          460K      15M       Twitter subset
SLIDE 21

Platform

  • Intel's Heterogeneous Architecture Research Platform (HARP)
    – Xeon E5-2680 v4 CPU + Arria 10 GX1150 FPGA
    – The AFU issues cache-coherent reads/writes to system memory
  • AFUs for the scatter phase of graph processing [O'Brien 2020]
    – The gather phase is done by CPU threads


[Diagram: Xeon multicore and Arria 10 FPGA connected to system memory over a QPI/PCIe interconnect; AFU: user's Accelerator Function Unit; FIU: QPI/PCIe link protocols, data cache, address translation]

SLIDE 22

Performance Metrics

  • Execution time: time for processing, excluding loading the graph into memory
  • Load imbalance: the maximum useful work time of a thread relative to the average useful work time
  • Throughput: the number of traversed edges per second (MTEPS)


λ = (maximum useful work time) / (average useful work time); ideally, λ is 1
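As a worked example of this metric (a trivial helper, not from the paper):

```python
def load_imbalance(useful_times):
    """lambda = max useful work time / average useful work time; 1.0 is perfect balance."""
    return max(useful_times) / (sum(useful_times) / len(useful_times))
```

Four threads busy for 4, 2, 2 and 0 seconds give λ = 4 / 2 = 2.0, while equal times give the ideal λ = 1.0.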

SLIDE 23

Comparisons

  • We compare HWS to different load balancing strategies
    – Static: equal-sized partitions to all threads, giving the FPGA 2.5X more
    – Best-Dynamic: a chunk self-scheduling load balancer with a priori knowledge of the optimal chunk size
    – HAP: Heterogeneous Adaptive Partitioning scheduler [Rodriguez 2019]
  • We define speedup as the ratio of the execution time of Static to that of a load balancing strategy

SLIDE 24

Disclaimer

  • The results in this paper were generated using pre-production hardware and software, and may not reflect the performance of production or future systems.

SLIDE 25

BFS Scatter λ

[Chart: λ per graph for Static, Best-Dynamic, HAP and HWS; HARPv2, BFS scatter, 15 threads + AFU]


SLIDE 27

BFS Scatter Performance

[Chart: speedup over Static per graph for Static, Best-Dynamic, HAP and HWS; HARPv2, BFS scatter, 15 threads + AFU]

SLIDE 28

HWS Steal Characteristics

[Chart: steal counts per graph, showing steals by the FPGA, steals from the FPGA, and aborted steals; HARPv2, SSSP, 7 threads + AFU]

SLIDE 29

Average FPGA Chunk Size

[Chart: average FPGA chunk (stream) size per graph, log scale from 1 to 1,000,000, for Best-Dynamic, HAP and HWS; HARPv2, SSSP, 7 threads + AFU]

SLIDE 30

Overall Throughput

[Chart: throughput (MTEPS) per graph for HWS with 4, 8 and 16 threads; HARPv2, BFS]

SLIDE 31

Related Work

  • Heterogeneous scheduling: Tripp (2005), Belviranli (2013), Vilches (2015), Song (2016), Navarro (2019), Rodriguez (2019), Wang (2019)
    – Focus is on adaptive chunk size selection
    – We introduce work stealing and demonstrate its effectiveness
  • Work stealing: Acar (2013), Cong (2008), Dinan (2009), Hendler (2002), Khayyat (2013), Nakashima (2019)
    – We extend work stealing to heterogeneous systems
  • FPGA accelerators for graph processing: Dai (2016), Engelhardt (2016), Zhou (2017), Zhou (2019)
    – We extend to concurrent CPU-FPGA usage

SLIDE 32

Concluding Remarks

  • HWS addresses some unique challenges when processing graphs on CPU-FPGA systems
    – HWS maximizes FPGA throughput by ensuring large workloads
    – HWS achieves perfect load balance (λ = 1)
    – HWS outperforms other state-of-the-art schedulers
  • Our results collectively demonstrate that work stealing is an effective solution for balancing graph processing workloads on tightly-coupled heterogeneous CPU-FPGA systems

SLIDE 33

Future Work

  • Heterogeneous acceleration of the gather phase of graph processing
  • Integration of HWS with multiple FPGAs
  • Work stealing optimizations: priority-based victim selection
  • Processing of dynamically changing graphs

SLIDE 34

Thank You