Alternative GPU friendly assignment algorithms
Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield
Graphics Processing Units (GPUs)
Context: GPU Performance
Serial Computing: 1 core, ~40 GigaFLOPS
Parallel Computing: 16 cores, 384 GigaFLOPS
Accelerated Computing: 4992 cores, 8.74 TeraFLOPS
[Figure: bar chart of performance in TFlops (axis 0.0 to 10.0): 1 CPU Core at ~40 GigaFLOPS vs GPU (4992 cores) at 8.74 TeraFLOPS]
6 hours of CPU time vs. 1 minute of GPU time
etc.
Parallel cores
[Figure: accelerated system layout. CPU with DRAM and GPU/Accelerator with GDRAM, each with I/O, connected over PCIe]
“If your problem is not parallel then think again”
than can be parallelised
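Amdahl's law: $\text{Speedup } S = \dfrac{1}{\dfrac{P}{N} + (1 - P)}$, where $P$ is the proportion of the runtime that can be parallelised and $N$ is the number of processors.
[Figure: Speedup (S) against Number of Processors (N) for P = 25%, P = 50%, P = 90% and P = 95%]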
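The speedup formula can be evaluated directly to see how quickly the curves flatten. The following is a minimal sketch, not taken from the slides; the function names `amdahl_speedup` and `speedup_limit` are hypothetical, and the core count of 4992 is reused from the GPU figure quoted earlier purely as an illustration.

```python
import math


def amdahl_speedup(p: float, n: int) -> float:
    """Speedup S for parallel proportion p (0..1) on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def speedup_limit(p: float) -> float:
    """Upper bound on speedup as the number of processors grows without limit."""
    return math.inf if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    # Same parallel proportions as the curves in the figure.
    for p in (0.25, 0.50, 0.90, 0.95):
        s = amdahl_speedup(p, 4992)
        print(f"P = {p:.0%}: S(4992) = {s:.2f}, limit = {speedup_limit(p):.2f}")
```

Even with thousands of cores, the serial proportion (1 - P) bounds the achievable speedup, which is why profiling for the dominant function matters.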
Profile the Application
[Figure: Time per function for Largest Network (LoHAM). Time (s) against Function (A to F); one function accounts for 97.4% of the runtime, the others for 0.3%, 0.3%, 0.1%, 0.1% and 0.1%]
1. Single Source Shortest Path (SSSP)
   All-or-Nothing Path: for each origin in the O-D matrix
2. Flow Accumulation
   Apply the O-D value for each trip to each link on the route
[Figure: Distribution of Runtime (LoHAM) between Flow and SSSP for the Serial version]
For a single origin vertex (centroid), find the route to each destination vertex with the lowest cumulative weight (cost).
explore
Highly Serial
the hardware
Pape, U. "Implementation and efficiency of Moore-algorithms for the shortest route problem." Mathematical Programming 7.1 (1974): 212-222.
Bellman, R. "On a routing problem." (1956).
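For reference, the serial D'Esopo-Pape approach described by Pape can be sketched as below. This is a textbook-style illustration rather than the authors' code: the function name `desopo_pape`, the `(u, v, weight)` edge format and the small test network are assumptions made for the example.

```python
from collections import deque


def desopo_pape(num_vertices, edges, source):
    """Single-source shortest paths; edges are (u, v, weight) tuples.

    Returns (dist, pred), where pred[v] is the vertex before v on the
    cheapest route from source, or -1 if v has no predecessor.
    """
    adj = [[] for _ in range(num_vertices)]
    for u, v, w in edges:
        adj[u].append((v, w))

    dist = [float("inf")] * num_vertices
    pred = [-1] * num_vertices
    # 0 = never queued, 1 = currently queued, 2 = queued before and removed
    state = [0] * num_vertices

    dist[source] = 0.0
    queue = deque([source])
    state[source] = 1
    while queue:
        u = queue.popleft()
        state[u] = 2
        for v, w in adj[u]:
            candidate = dist[u] + w
            if candidate < dist[v]:
                dist[v] = candidate
                pred[v] = u
                if state[v] == 0:
                    queue.append(v)      # first visit: back of the queue
                    state[v] = 1
                elif state[v] == 2:
                    queue.appendleft(v)  # revisit: front of the queue
                    state[v] = 1
    return dist, pred


if __name__ == "__main__":
    # Hypothetical 4-vertex network used only for illustration.
    example_edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0), (2, 3, 5.0)]
    print(desopo_pape(4, example_edges, 0))
```

The single queue processed one vertex at a time is what makes this formulation hard to map onto thousands of GPU threads.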
[Figure: Total Edges Considered. Number of Edges for Bellman-Ford vs D'Esopo-Pape]
Ford using CUDA
[Figure: Time to compute all required SSSP results per model. Time (seconds) against Optimisation (Serial, A2) for Derby, CLoHAM and LoHAM. Labelled values: Derby 36.5s, CLoHAM 1777.2s, LoHAM 5712.6s]
performance optimisations
Concurrently
the OD matrix
Balancing
[Figure: Time to compute all required SSSP results per model. Time (seconds) against Optimisation (Serial, A2, A3, A4, A8, A10, A11) for Derby, CLoHAM and LoHAM]
[Figure: Distribution of Runtime (CLoHAM) and Distribution of Runtime (LoHAM) between Flow and SSSP for the Serial version]
[Figure: Distribution of Runtime (CLoHAM) and Distribution of Runtime (LoHAM) between Flow and SSSP for the Serial and A11 versions]
For each origin-destination pair, trace the route from the destination back to the origin, increasing the flow value of each link visited.
[Figure: worked example of flow accumulation. Extracted table values: 1 2 … / 0 0 2 3 … / 1 1 0 4 … / 2 2 5 0 … / … ; Link: 1 2 3 … ; Flow: 9 8 6 9 …]
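The trace-back step can be sketched as follows. This is a minimal serial illustration under stated assumptions: the function name `accumulate_flows`, the predecessor-array representation of routes, the `link_index` lookup and the 3-vertex example network with its O-D values are all hypothetical and are not taken from the slides.

```python
def accumulate_flows(od_matrix, pred_by_origin, link_index, num_links):
    """Add each O-D trip count to every link on its route.

    od_matrix[o][d]   : number of trips from origin o to destination d
    pred_by_origin[o] : predecessor array from an SSSP run rooted at o
    link_index[(u, v)]: index of the link from vertex u to vertex v
    """
    flows = [0.0] * num_links
    for origin, row in enumerate(od_matrix):
        pred = pred_by_origin[origin]
        for dest, trips in enumerate(row):
            if dest == origin or trips == 0:
                continue
            v = dest
            # Walk from the destination back towards the origin.
            while v != origin:
                u = pred[v]
                if u < 0:  # destination not reachable from this origin
                    break
                flows[link_index[(u, v)]] += trips
                v = u
    return flows


if __name__ == "__main__":
    # Hypothetical one-way ring 0 -> 1 -> 2 -> 0 with illustrative trip counts.
    links = {(0, 1): 0, (1, 2): 1, (2, 0): 2}
    preds = [[-1, 0, 1], [2, -1, 1], [2, 0, -1]]
    od = [[0, 2, 3], [1, 0, 4], [2, 5, 0]]
    print(accumulate_flows(od, preds, links, 3))
```

In a parallel version each origin-destination pair can be traced independently, but several traces may then add to the same link at the same time, so the additions need to be atomic.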
the execution
[Figure: Time to compute Flow values per link. Time (Seconds) for Serial and A12, for Derby, CLoHAM and LoHAM]
the execution
precision atomics
32-bit summations
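The trade-off of summing in 32-bit rather than 64-bit floating point can be illustrated on the CPU. The sketch below emulates single precision accumulation by rounding through the standard `struct` module; the helper names and the choice of 100,000 additions of 0.1 are assumptions for illustration and say nothing about the error observed in the actual models.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_single(values):
    """Accumulate with the running total rounded to 32 bits after each add."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


def sum_double(values):
    """Accumulate in ordinary 64-bit floating point."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1] * 100_000

if __name__ == "__main__":
    exact = 10_000.0
    print("single precision error:", abs(sum_single(values) - exact))
    print("double precision error:", abs(sum_double(values) - exact))
```

Single precision sums drift further from the exact result as more values are accumulated, which is the accuracy cost weighed against the faster 32-bit path.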
[Figure: Time to compute Flow values per link. Time (Seconds) for Serial, A12, A15 (double) and A14 (single), for Derby, CLoHAM and LoHAM]
Assignment Speedup relative to Serial
Serial
Double precision
Single precision
Hardware:
Relative Assignment Runtime Performance vs Serial (assignment time relative to serial)
Model     Serial   Multicore   A15      A14 (single)
Derby     1.00     3.42        3.73     6.14
CLoHAM    1.00     5.51        11.24    18.24
LoHAM     1.00     6.81        20.70    41.77
Assignment Speedup relative to Multicore
Multicore
Double precision
Single precision
Hardware:
Relative Assignment Runtime Performance vs Multicore (assignment time relative to multicore)
Model     Serial   Multicore   A15      A14 (single)
Derby     0.29     1.00        1.09     1.79
CLoHAM    0.18     1.00        2.04     3.31
LoHAM     0.15     1.00        3.04     6.14
performance optimisation
and FLAME GPU
Virtual Reality
Computing
Thank You
Paul Richmond p.richmond@sheffield.ac.uk paulrichmond.shef.ac.uk
Peter Heywood p.heywood@sheffield.ac.uk ptheywood.uk

Largest Model (LoHAM) results
Version                  Runtime    Speedup Serial   Speedup Multicore
Serial                   12:12:24   1.00             0.15
Multicore                01:47:36   6.81             1.00
A15 (double precision)   00:35:22   20.70            3.04
A14 (single precision)   00:17:32   41.77            6.14
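The speedup columns follow from the runtimes by simple division. The sketch below recomputes them, assuming the runtimes are in hh:mm:ss format (the slides do not state the unit explicitly); the helper names are hypothetical. Because the published runtimes are rounded to whole seconds, the recomputed values agree with the table only to within rounding.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speedups(runtimes: dict, baseline: str) -> dict:
    """Speedup of every version relative to the named baseline version."""
    base = to_seconds(runtimes[baseline])
    return {name: base / to_seconds(t) for name, t in runtimes.items()}


# Runtimes as reported for the largest model (LoHAM).
loham_runtimes = {
    "Serial": "12:12:24",
    "Multicore": "01:47:36",
    "A15 (double precision)": "00:35:22",
    "A14 (single precision)": "00:17:32",
}

if __name__ == "__main__":
    vs_serial = speedups(loham_runtimes, "Serial")
    vs_multicore = speedups(loham_runtimes, "Multicore")
    for name in loham_runtimes:
        print(f"{name}: {vs_serial[name]:.2f} vs serial, {vs_multicore[name]:.2f} vs multicore")
```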
Model     Vertices (Nodes)   Edges (Links)   O-D trips
Derby     2700               25385           547²
CLoHAM    15179              132600          2548²
LoHAM     18427              192711          5194²
[Figure: Benchmark model performance. Time (s) against Version (tick labels as extracted: Serial Salford Multicore) for Derby, CLoHAM and LoHAM]
[Figure: Edges Considered Per Iteration. Number of Edges against Iteration for Bellman-Ford vs D'Esopo-Pape]
[Figure: Total Edges Considered. Number of Edges for Bellman-Ford vs D'Esopo-Pape]
updated in the previous iteration can result in an update
launched per iteration
per iteration
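The frontier idea, relaxing only the edges that leave vertices updated in the previous iteration, can be sketched serially as below. This is an illustration under assumptions rather than the authors' CUDA implementation: the function name, the edge format and the test network are hypothetical, and a GPU version would process each frontier in parallel instead of in a loop.

```python
def frontier_bellman_ford(num_vertices, edges, source):
    """Bellman-Ford restricted to a frontier of recently updated vertices.

    edges are (u, v, weight) tuples; the graph must have no negative cycles.
    Returns (dist, pred, frontier_sizes), one frontier size per iteration.
    """
    adj = [[] for _ in range(num_vertices)]
    for u, v, w in edges:
        adj[u].append((v, w))

    dist = [float("inf")] * num_vertices
    pred = [-1] * num_vertices
    dist[source] = 0.0

    frontier = [source]
    frontier_sizes = []
    while frontier:
        frontier_sizes.append(len(frontier))
        updated = set()
        for u in frontier:
            for v, w in adj[u]:
                candidate = dist[u] + w
                if candidate < dist[v]:
                    dist[v] = candidate
                    pred[v] = u
                    updated.add(v)
        # Only vertices improved in this pass can improve others in the next.
        frontier = sorted(updated)
    return dist, pred, frontier_sizes


if __name__ == "__main__":
    # Hypothetical 4-vertex network used only for illustration.
    example_edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0), (2, 3, 5.0)]
    print(frontier_bellman_ford(4, example_edges, 0))
```

Recording the frontier size per iteration, as done here, gives the kind of data plotted in the frontier size charts: small frontiers leave most of a GPU idle unless several sources are processed concurrently.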
[Figure: Frontier size for each element for source 1. Frontier Size (0 to 3000) against Iteration for Derby, CLoHAM and LoHAM]
[Figure: Frontier Size for all concurrent sources. Frontier Size (0 to 18000000) against Iteration for Derby, CLoHAM and LoHAM]
to occur
same address
implemented in hardware
1. Algorithmic change to minimise atomic contention
2. Single precision
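The slides do not define how contention was measured or which algorithmic change was made, so the following is only a generic sketch of the idea. It assumes contention can be approximated as the number of writes in one step that target a link already being written by another thread, and it shows one common mitigation, combining updates per link before writing; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def count_contention(link_writes):
    """Writes in one step that collide with an earlier write to the same link."""
    counts = Counter(link_writes)
    return sum(c - 1 for c in counts.values() if c > 1)


def aggregate_then_add(flows, updates):
    """Combine (link, value) updates per link, then write once per distinct link.

    Returns the number of writes performed on the shared flows array.
    """
    combined = defaultdict(float)
    for link, value in updates:
        combined[link] += value
    for link, value in combined.items():
        flows[link] += value
    return len(combined)


if __name__ == "__main__":
    # Hypothetical step in which six threads write to three links.
    print(count_contention([0, 0, 0, 1, 2, 2]))
    flows = [0.0, 0.0]
    writes = aggregate_then_add(flows, [(0, 2.0), (0, 3.0), (1, 1.0), (0, 5.0)])
    print(flows, writes)
```

Fewer writes to the same address means fewer serialised atomic operations, independent of whether the additions are performed in single or double precision.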
[Figure: Atomic Contention per Iteration. Atomic Contention (0 to 30000000) against Iteration (10 to 100) for CLoHAM and LoHAM]
Raw Performance
Hardware:
[Figure: Assignment runtime per algorithm. Time (seconds) for Serial, Multicore, A8, A10, A11, A15, A13 (single) and A14 (single), for Derby, CLoHAM and LoHAM]