

SLIDE 1

Alternative GPU friendly assignment algorithms

Paul Richmond and Peter Heywood, Department of Computer Science, The University of Sheffield

SLIDE 2

Graphics Processing Units (GPUs)

SLIDE 3

Context: GPU Performance

  • Serial Computing: 1 core, ~40 GigaFLOPS
  • Parallel Computing: 16 cores, 384 GigaFLOPS
  • Accelerated Computing: GPU with 4992 cores, 8.74 TeraFLOPS

SLIDE 4

1 CPU core: ~40 GigaFLOPS vs. GPU (4992 cores): 8.74 TeraFLOPS

6 hours CPU time vs. 1 minute GPU time

SLIDE 5

Accelerators

  • Much of the functionality of CPUs is unused for HPC
  • Complex pipelines, branch prediction, out-of-order execution, etc.
  • Ideally for HPC we want: simple, low-power and highly parallel cores

SLIDE 6

An accelerated system

  • Co-processor not a CPU replacement

[Diagram: CPU with DRAM and I/O, connected over PCIe to a GPU/accelerator with its own GDRAM and I/O]

SLIDE 7

Thinking Parallel

  • Hardware considerations
  • High Memory Latency (PCI-e)
  • Huge Numbers of processing cores
  • Algorithmic changes required
  • High level of parallelism required
  • Data parallel != task parallel

“If your problem is not parallel then think again”

SLIDE 8

Amdahl’s Law

  • Speedup of a program is limited by the proportion that can be parallelised

  • Addition of processing cores gives diminishing returns

$S(N) = \dfrac{1}{(1 - P) + \dfrac{P}{N}}$

[Plot: Speedup (S) against Number of Processors (N) for P = 25%, 50%, 90% and 95%]
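
As a worked example of the diminishing returns the plot shows, take the P = 95% curve: even with an unlimited number of processors, the remaining serial 5% caps the achievable speedup at 20x.

$\lim_{N \to \infty} S(N) = \dfrac{1}{1 - P} = \dfrac{1}{0.05} = 20 \quad (P = 0.95)$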

SLIDE 9

SATALL Optimisation

SLIDE 10

Profile the Application

  • 11 hour runtime
  • Function A
  • 97.4% runtime
  • 2000 calls
  • Hardware
  • Intel Core i7-4770k 3.50GHz
  • 16GB DDR3
  • Nvidia GeForce Titan X

[Chart: time (s) per function for the largest network (LoHAM); Function A accounts for 97.4% of runtime, with functions B–F at 0.3% or less each]

SLIDE 11

Function A

  • Input
  • Network (directed weighted graph)
  • Origin-Destination Matrix
  • Output
  • Traffic flow per edge
  • 2 Distinct Steps

1. Single Source Shortest Path (SSSP): an all-or-nothing path for each origin in the O-D matrix
2. Flow Accumulation: apply the OD value for each trip to each link on the route

[Chart: distribution of runtime between SSSP and Flow Accumulation for the serial version (LoHAM)]

SLIDE 12

Single Source Shortest Path

For a single origin vertex (centroid), find the route to each destination vertex with the lowest cumulative weight (cost).

SLIDE 13

Serial SSSP Algorithm

  • D’Esopo-Pape (1974)
  • Maintains a queue (deque) of vertices to explore
  • Highly serial: not a data-parallel algorithm
  • We must change the algorithm to match the hardware

Pape, U. "Implementation and efficiency of Moore-algorithms for the shortest route problem." Mathematical Programming 7.1 (1974): 212-222.
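
For reference, a minimal host-side sketch of D’Esopo-Pape label-correcting SSSP on a CSR graph (not the SATALL code; the array names are illustrative):

#include <deque>
#include <vector>
#include <limits>

// Minimal D'Esopo-Pape SSSP sketch over a CSR graph (illustrative names).
void desopoPapeSSSP(int numVertices,
                    const std::vector<int>& rowStart,   // CSR row offsets
                    const std::vector<int>& colIdx,     // edge target vertices
                    const std::vector<float>& weight,   // edge costs
                    int origin,
                    std::vector<float>& dist)
{
    const float INF = std::numeric_limits<float>::infinity();
    dist.assign(numVertices, INF);
    std::vector<int> state(numVertices, 0);  // 0 = never queued, 1 = queued, 2 = previously queued
    std::deque<int> queue;

    dist[origin] = 0.0f;
    queue.push_back(origin);
    state[origin] = 1;

    while (!queue.empty()) {
        int u = queue.front(); queue.pop_front();
        state[u] = 2;
        for (int e = rowStart[u]; e < rowStart[u + 1]; ++e) {
            int v = colIdx[e];
            float d = dist[u] + weight[e];
            if (d < dist[v]) {
                dist[v] = d;
                if (state[v] == 0)      queue.push_back(v);   // never seen: append
                else if (state[v] == 2) queue.push_front(v);  // seen before: push to front
                state[v] = 1;
            }
        }
    }
}

The data-dependent front/back re-insertion order is what makes the traversal hard to parallelise, motivating the change of algorithm on the next slide.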

SLIDE 14

Parallel SSSP Algorithm

  • Bellman-Ford Algorithm (1956)
  • Poor serial performance & time complexity
  • Performs significantly more work
  • Highly Parallel
  • Suitable for GPU acceleration

Bellman, Richard. On a routing problem. 1956.
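
A minimal sketch of the textbook edge-parallel approach (one thread per edge per pass); this is not the SATALL/A2 kernel, and the array names, integer edge costs and launch configuration are assumptions made for the sketch:

#include <climits>

// One naive Bellman-Ford relaxation pass: each thread relaxes one edge.
__global__ void relaxAllEdges(int numEdges, const int* srcIdx, const int* dstIdx,
                              const int* weight, int* dist, int* changed)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numEdges) return;
    int du = dist[srcIdx[e]];
    if (du == INT_MAX) return;                       // edge source not reached yet
    int candidate = du + weight[e];
    if (candidate < atomicMin(&dist[dstIdx[e]], candidate))
        *changed = 1;                                // some distance improved this pass
}

// Host driver: dist[] is a device array initialised to INT_MAX with
// dist[origin] = 0; d_changed is a single device int.
void bellmanFord(int numVertices, int numEdges, const int* d_src, const int* d_dst,
                 const int* d_weight, int* d_dist, int* d_changed)
{
    int blocks = (numEdges + 255) / 256;
    for (int pass = 0; pass < numVertices - 1; ++pass) {
        int changed = 0;
        cudaMemcpy(d_changed, &changed, sizeof(int), cudaMemcpyHostToDevice);
        relaxAllEdges<<<blocks, 256>>>(numEdges, d_src, d_dst, d_weight, d_dist, d_changed);
        cudaMemcpy(&changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);
        if (!changed) break;   // stopping once no distance changes is the A3 "early termination" idea
    }
}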

[Chart: total edges considered, Bellman-Ford vs D’Esopo-Pape]

SLIDE 15

Implementation

  • A2 – Naïve Bellman-Ford using CUDA
  • Up to 369x slower
  • Striped bars continue off the scale

[Chart: time (seconds) to compute all required SSSP results per model, Serial vs A2, for Derby, CLoHAM and LoHAM; annotated values: Derby 36.5 s, CLoHAM 1777.2 s, LoHAM 5712.6 s]

SLIDE 16

Implementation

  • Followed an iterative cycle of performance optimisations
  • A3 – Early termination
  • A4 – Node frontier
  • A8 – Multiple origins concurrently: SSSP for each origin in the OD matrix
  • A10 – Improved load balancing (Cooperative Thread Array)
  • A11 – Improved array access
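
As an illustration of the A4 node-frontier idea, a hedged CUDA sketch (not the authors' kernel; the CSR arrays and flag names are illustrative, and integer costs are assumed as before):

// One thread per frontier vertex relaxes that vertex's outgoing edges.
__global__ void relaxFrontier(int frontierSize, const int* frontier,
                              const int* rowStart, const int* colIdx,
                              const int* weight, int* dist, int* updatedFlag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontierSize) return;
    int u = frontier[i];
    int du = dist[u];
    for (int e = rowStart[u]; e < rowStart[u + 1]; ++e) {
        int v = colIdx[e];
        int candidate = du + weight[e];
        if (candidate < atomicMin(&dist[v], candidate))
            updatedFlag[v] = 1;   // v improved, so it joins the next frontier
    }
}

Only vertices flagged here need expanding in the next iteration, so each pass launches far fewer threads than one per edge (see the frontier-size charts in the backup slides).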

[Chart: time (seconds) to compute all required SSSP results per model for Serial, A2, A3, A4, A8, A10 and A11; Derby, CLoHAM, LoHAM]

SLIDE 17

Limiting Factor (Function A)

[Charts: distribution of runtime between SSSP and Flow Accumulation for the serial version, CLoHAM and LoHAM]

SLIDE 18

Limiting Factor (Function A)

  • Limiting Factor has now changed
  • Need to parallelise Flow Accumulation

[Charts: distribution of runtime between SSSP and Flow Accumulation, Serial vs A11, for CLoHAM and LoHAM]

SLIDE 19

Flow Accumulation

  • Shortest Path + OD = Flow-per-link

For each origin-destination pair, trace the route from the destination back to the origin, increasing the flow value for each link visited.

  • Parallel problem
  • But requires synchronised access to a shared data structure for all trips (atomic operations)

[Illustration: an O-D matrix and the resulting flow value per link]
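
A hedged CUDA sketch of this step (one thread per O-D pair; not the authors' kernel, and the predecessor arrays describing each origin's shortest-path tree are illustrative assumptions):

// Each thread walks one O-D pair's route back from destination to origin,
// adding the pair's demand to every link it visits. Links are shared between
// trips, so the per-link flow must be accumulated atomically.
__global__ void accumulateFlow(int numPairs,
                               const int* pairOriginIdx,     // index of the origin's SSSP tree
                               const int* pairOriginVertex,  // the origin (centroid) vertex itself
                               const int* pairDestVertex,    // the destination vertex
                               const float* pairDemand,      // O-D demand for this pair
                               const int* predEdge,          // per-origin tree: incoming edge of each vertex
                               const int* predVertex,        // per-origin tree: predecessor of each vertex
                               int numVertices, float* linkFlow)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPairs) return;
    int tree = pairOriginIdx[p] * numVertices;
    int v = pairDestVertex[p];
    float demand = pairDemand[p];
    while (v != pairOriginVertex[p]) {
        atomicAdd(&linkFlow[predEdge[tree + v]], demand);  // links shared between trips
        v = predVertex[tree + v];
    }
}

The atomicAdd on shared links is exactly the contention that A12 exposes and that A14/A15 address on the following slides.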

SLIDE 20

Flow Accumulation

  • Problem:
  • A12 – lots of atomic operations serialise the execution

[Chart: time (seconds) to compute flow values per link, Serial vs A12, for Derby, CLoHAM and LoHAM]

SLIDE 21

Flow Accumulation

  • Problem:
  • A12 – lots of atomic operations serialise the execution
  • Solutions:
  • A15 – Reduce the number of atomic operations: solve in batches using parallel reduction
  • A14 – Use fast hardware-supported single-precision atomics: minimise loss of precision using multiple 32-bit summations (0.000022% total error)
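
One way to picture the "multiple 32-bit summations" idea, offered as a sketch of my interpretation rather than the authors' code (the bin count and names are assumptions): keep several single-precision partial accumulators per link, add into the bin for the current O-D batch, and only combine the bins at the end, which both spreads the atomic traffic across addresses and keeps each partial sum small.

#define NUM_BINS 8   // assumed number of partial accumulators per link

// Atomically add a trip's demand into the bin belonging to its batch.
__device__ void addFlowBinned(float* partialFlow, int link, int batch, float demand)
{
    atomicAdd(&partialFlow[link * NUM_BINS + (batch % NUM_BINS)], demand);
}

// Final pass: combine each link's bins into a double-precision total.
__global__ void combineBins(int numLinks, const float* partialFlow, double* linkFlow)
{
    int link = blockIdx.x * blockDim.x + threadIdx.x;
    if (link >= numLinks) return;
    double total = 0.0;
    for (int b = 0; b < NUM_BINS; ++b)
        total += partialFlow[link * NUM_BINS + b];
    linkFlow[link] = total;
}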

[Chart: time (seconds) to compute flow values per link for Serial, A12, A15 (double) and A14 (single); Derby, CLoHAM, LoHAM]

SLIDE 22

Integrated Results

SLIDE 23

Assignment Speedup relative to Serial

Serial

  • LoHAM – 12h 12m

Double precision

  • LoHAM – 35m 22s

Single precision

  • Reduced loss of precision
  • LoHAM – 17m 32s

Hardware:

  • Intel Core i7-4770K
  • 16GB DDR3
  • Nvidia GeForce Titan X

Relative assignment runtime performance vs Serial (speedup):

Model     Serial   Multicore   A15 (double)   A14 (single)
Derby     1.00     3.42        3.73           6.14
CLoHAM    1.00     5.51        11.24          18.24
LoHAM     1.00     6.81        20.70          41.77

SLIDE 24

Assignment Speedup relative to Multicore

Multicore

  • LoHAM – 1h 47m

Double precision

  • LoHAM – 35m 22s

Single precision

  • Reduced loss of precision
  • LoHAM – 17m 32s

Hardware:

  • Intel Core i7-4770K
  • 16GB DDR3
  • Nvidia GeForce Titan X

Relative assignment runtime performance vs Multicore (speedup):

Model     Serial   Multicore   A15 (double)   A14 (single)
Derby     0.29     1.00        1.09           1.79
CLoHAM    0.18     1.00        2.04           3.31
LoHAM     0.15     1.00        3.04           6.14

SLIDE 25

GPU Computing at UoS

SLIDE 26

Expertise at Sheffield

  • Specialists in GPU computing and performance optimisation
  • Complex systems simulations via FLAME and FLAME GPU
  • Visual simulation, computer graphics and virtual reality
  • Training and education for GPU computing

SLIDE 27

Thank You

Paul Richmond   p.richmond@sheffield.ac.uk   paulrichmond.shef.ac.uk
Peter Heywood   p.heywood@sheffield.ac.uk    ptheywood.uk

Largest Model (LoHAM) results:

Version                  Runtime     Speedup vs Serial   Speedup vs Multicore
Serial                   12:12:24    1.00                0.15
Multicore                01:47:36    6.81                1.00
A15 (double precision)   00:35:22    20.70               3.04
A14 (single precision)   00:17:32    41.77               6.14

SLIDE 28

Backup Slides

SLIDE 29

Benchmark Models

  • 3 Benchmark networks
  • Range of sizes
  • Small to V. Large
  • Up to 12 hour runtime

Model    Vertices (Nodes)   Edges (Links)   O-D trips
Derby    2700               25385           547²
CLoHAM   15179              132600          2548²
LoHAM    18427              192711          5194²

[Chart: benchmark model performance, time (s) for the Serial, Salford and Multicore versions; Derby, CLoHAM, LoHAM]

SLIDE 30

Edges considered per algorithm

[Charts: edges considered per iteration and total edges considered, Bellman-Ford vs D’Esopo-Pape]

SLIDE 31

Vertex Frontier (A4)

  • Only vertices which were updated in the previous iteration can result in an update
  • Far fewer threads launched per iteration
  • Up to 2500 instead of 18427 per iteration
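
A hedged sketch of how the next frontier could be gathered from per-vertex update flags (one possible realisation, not the authors' code; a scan-based stream compaction would avoid the global atomic counter used here for brevity):

// Gather every flagged vertex into the next frontier and clear its flag.
__global__ void buildFrontier(int numVertices, int* updatedFlag,
                              int* frontier, int* frontierSize)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVertices) return;
    if (updatedFlag[v]) {
        updatedFlag[v] = 0;                     // reset for the next iteration
        int slot = atomicAdd(frontierSize, 1);  // reserve a slot in the frontier
        frontier[slot] = v;
    }
}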

[Chart: frontier size per iteration for source 1; Derby, CLoHAM, LoHAM]

SLIDE 32

Multiple Concurrent Origins (A8)

[Charts: frontier size per iteration for source 1, and frontier size per iteration summed over all concurrent sources; Derby, CLoHAM, LoHAM]
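
One hedged way to realise the A8 idea of processing many origins concurrently (an illustrative sketch, not the authors' kernel): put (origin, vertex) pairs in a single combined frontier and give each origin its own slice of the distance array.

// One thread per (origin, vertex) frontier entry; each origin has its own
// distance slice of length numVertices. Integer costs assumed as before.
__global__ void relaxMultiOriginFrontier(int frontierSize,
                                         const int* frontierOrigin,
                                         const int* frontierVertex,
                                         const int* rowStart, const int* colIdx,
                                         const int* weight, int numVertices,
                                         int* dist, int* updatedFlag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontierSize) return;
    int o = frontierOrigin[i];
    int u = frontierVertex[i];
    int du = dist[o * numVertices + u];
    for (int e = rowStart[u]; e < rowStart[u + 1]; ++e) {
        int idx = o * numVertices + colIdx[e];
        int candidate = du + weight[e];
        if (candidate < atomicMin(&dist[idx], candidate))
            updatedFlag[idx] = 1;   // this (origin, vertex) joins the next frontier
    }
}

Combining all origins into one launch keeps the GPU saturated even when any single origin's frontier is small, which is what the right-hand chart illustrates.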

SLIDE 33

Atomic Contention

  • Atomic operations are guaranteed to occur
  • Atomic contention: multiple threads atomically modify the same address
  • Serialised!
  • atomicAdd(double) not implemented in hardware (not yet)
  • Solutions:

1. Algorithmic change to minimise atomic contention
2. Single precision
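
For context, the standard software fallback for double-precision atomicAdd on GPUs without hardware support is a compare-and-swap loop on the 64-bit bit pattern (the well-known pattern from the CUDA programming guide, reproduced here as a sketch); each retry under contention is what makes it slow:

__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* addr = (unsigned long long int*)address;
    unsigned long long int old = *addr, assumed;
    do {
        assumed = old;
        // Re-read the value, add, and attempt to swap in the new bit pattern.
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // another thread intervened: retry
    return __longlong_as_double(old);
}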

[Chart: atomic contention per iteration for CLoHAM and LoHAM]

SLIDE 34

Raw Performance

Hardware:

  • Intel Core i7-4770K
  • 16GB DDR3
  • Nvidia GeForce Titan X

[Chart: assignment runtime (seconds) per algorithm for Serial, Multicore, A8, A10, A11, A15, A13 (single) and A14 (single); Derby, CLoHAM, LoHAM]