

SLIDE 1

Alternative GPU friendly assignment algorithms

Paul Richmond and Peter Heywood, Department of Computer Science, The University of Sheffield

SLIDE 2

Graphics Processing Units (GPUs)

SLIDE 3

Context: GPU Performance

  • Serial Computing: 1 core, ~40 GigaFLOPS
  • Parallel Computing: 16 cores, 384 GigaFLOPS
  • Accelerated Computing: GPU with 4992 cores, 8.74 TeraFLOPS

SLIDE 4

1 CPU core: ~40 GigaFLOPS vs. GPU (4992 cores): 8.74 TeraFLOPS

6 hours CPU time vs. 1 minute GPU time

SLIDE 5

Accelerators

  • Much of the functionality of CPUs is unused for HPC
  • Complex pipelines, branch prediction, out-of-order execution, etc.
  • Ideally for HPC we want: simple, low-power and highly parallel cores

SLIDE 6

An accelerated system

  • Co-processor not a CPU replacement

[Diagram: CPU with DRAM and I/O, connected over PCIe to a GPU/accelerator with its own GDRAM and I/O]

SLIDE 7

Thinking Parallel

  • Hardware considerations
  • High Memory Latency (PCI-e)
  • Huge Numbers of processing cores
  • Algorithmic changes required
  • High level of parallelism required
  • Data parallel != task parallel

“If your problem is not parallel then think again”

SLIDE 8

Amdahl’s Law

  • Speedup of a program is limited by the proportion that can be parallelised

  • Addition of processing cores gives diminishing returns

$S(N) = \dfrac{1}{(1 - P) + \dfrac{P}{N}}$

[Plot: Speedup (S) against Number of Processors (N) for P = 25%, 50%, 90% and 95%]
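
As a worked example of the diminishing returns the plot shows, take the P = 95% curve: even with an unlimited number of processors, the remaining serial 5% caps the achievable speedup at 20x.

$\lim_{N \to \infty} S(N) = \dfrac{1}{1 - P} = \dfrac{1}{0.05} = 20 \quad (P = 0.95)$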

SLIDE 9

SATALL Optimisation

SLIDE 10

Profile the Application

  • 11 hour runtime
  • Function A
  • 97.4% runtime
  • 2000 calls
  • Hardware
  • Intel Core i7-4770k 3.50GHz
  • 16GB DDR3
  • Nvidia GeForce Titan X

[Chart: time (s) per function for the largest network (LoHAM); Function A accounts for 97.4% of runtime, with functions B–F at 0.3% or less each]

SLIDE 11

Function A

  • Input
  • Network (directed weighted graph)
  • Origin-Destination Matrix
  • Output
  • Traffic flow per edge
  • 2 Distinct Steps

1. Single Source Shortest Path (SSSP): an all-or-nothing path for each origin in the O-D matrix
2. Flow Accumulation: apply the OD value for each trip to each link on the route

[Chart: distribution of runtime between SSSP and Flow Accumulation for the serial version (LoHAM)]

SLIDE 12

Single Source Shortest Path

For a single origin vertex (centroid), find the route to each destination vertex with the lowest cumulative weight (cost).

SLIDE 13

Serial SSSP Algorithm

  • D’Esopo-Pape (1974)
  • Maintains a queue (deque) of vertices to explore
  • Highly serial: not a data-parallel algorithm
  • We must change the algorithm to match the hardware

Pape, U. "Implementation and efficiency of Moore-algorithms for the shortest route problem." Mathematical Programming 7.1 (1974): 212-222.
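
For reference, a minimal host-side sketch of D’Esopo-Pape label-correcting SSSP on a CSR graph (not the SATALL code; the array names are illustrative):

#include <deque>
#include <vector>
#include <limits>

// Minimal D'Esopo-Pape SSSP sketch over a CSR graph (illustrative names).
void desopoPapeSSSP(int numVertices,
                    const std::vector<int>& rowStart,   // CSR row offsets
                    const std::vector<int>& colIdx,     // edge target vertices
                    const std::vector<float>& weight,   // edge costs
                    int origin,
                    std::vector<float>& dist)
{
    const float INF = std::numeric_limits<float>::infinity();
    dist.assign(numVertices, INF);
    std::vector<int> state(numVertices, 0);  // 0 = never queued, 1 = queued, 2 = previously queued
    std::deque<int> queue;

    dist[origin] = 0.0f;
    queue.push_back(origin);
    state[origin] = 1;

    while (!queue.empty()) {
        int u = queue.front(); queue.pop_front();
        state[u] = 2;
        for (int e = rowStart[u]; e < rowStart[u + 1]; ++e) {
            int v = colIdx[e];
            float d = dist[u] + weight[e];
            if (d < dist[v]) {
                dist[v] = d;
                if (state[v] == 0)      queue.push_back(v);   // never seen: append
                else if (state[v] == 2) queue.push_front(v);  // seen before: push to front
                state[v] = 1;
            }
        }
    }
}

The data-dependent front/back re-insertion order is what makes the traversal hard to parallelise, motivating the change of algorithm on the next slide.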

SLIDE 14

Parallel SSSP Algorithm

  • Bellman-Ford Algorithm (1956)
  • Poor serial performance & time complexity
  • Performs significantly more work
  • Highly Parallel
  • Suitable for GPU acceleration

Bellman, Richard. On a routing problem. 1956.
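
A minimal sketch of the textbook edge-parallel approach (one thread per edge per pass); this is not the SATALL/A2 kernel, and the array names, integer edge costs and launch configuration are assumptions made for the sketch:

#include <climits>

// One naive Bellman-Ford relaxation pass: each thread relaxes one edge.
__global__ void relaxAllEdges(int numEdges, const int* srcIdx, const int* dstIdx,
                              const int* weight, int* dist, int* changed)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numEdges) return;
    int du = dist[srcIdx[e]];
    if (du == INT_MAX) return;                       // edge source not reached yet
    int candidate = du + weight[e];
    if (candidate < atomicMin(&dist[dstIdx[e]], candidate))
        *changed = 1;                                // some distance improved this pass
}

// Host driver: dist[] is a device array initialised to INT_MAX with
// dist[origin] = 0; d_changed is a single device int.
void bellmanFord(int numVertices, int numEdges, const int* d_src, const int* d_dst,
                 const int* d_weight, int* d_dist, int* d_changed)
{
    int blocks = (numEdges + 255) / 256;
    for (int pass = 0; pass < numVertices - 1; ++pass) {
        int changed = 0;
        cudaMemcpy(d_changed, &changed, sizeof(int), cudaMemcpyHostToDevice);
        relaxAllEdges<<<blocks, 256>>>(numEdges, d_src, d_dst, d_weight, d_dist, d_changed);
        cudaMemcpy(&changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);
        if (!changed) break;   // stopping once no distance changes is the A3 "early termination" idea
    }
}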

[Chart: total edges considered, Bellman-Ford vs D’Esopo-Pape]

SLIDE 15

Implementation

  • A2 – Naïve Bellman-Ford using CUDA
  • Up to 369x slower
  • Striped bars continue off the scale

[Chart: time (seconds) to compute all required SSSP results per model, Serial vs A2, for Derby, CLoHAM and LoHAM; annotated values: Derby 36.5 s, CLoHAM 1777.2 s, LoHAM 5712.6 s]

SLIDE 16

Implementation

  • Followed an iterative cycle of performance optimisations
  • A3 – Early termination
  • A4 – Node frontier
  • A8 – Multiple origins concurrently: SSSP for each origin in the OD matrix
  • A10 – Improved load balancing (Cooperative Thread Array)
  • A11 – Improved array access
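
As an illustration of the A4 node-frontier idea, a hedged CUDA sketch (not the authors' kernel; the CSR arrays and flag names are illustrative, and integer costs are assumed as before):

// One thread per frontier vertex relaxes that vertex's outgoing edges.
__global__ void relaxFrontier(int frontierSize, const int* frontier,
                              const int* rowStart, const int* colIdx,
                              const int* weight, int* dist, int* updatedFlag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontierSize) return;
    int u = frontier[i];
    int du = dist[u];
    for (int e = rowStart[u]; e < rowStart[u + 1]; ++e) {
        int v = colIdx[e];
        int candidate = du + weight[e];
        if (candidate < atomicMin(&dist[v], candidate))
            updatedFlag[v] = 1;   // v improved, so it joins the next frontier
    }
}

Only vertices flagged here need expanding in the next iteration, so each pass launches far fewer threads than one per edge (see the frontier-size charts in the backup slides).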

[Chart: time (seconds) to compute all required SSSP results per model for Serial, A2, A3, A4, A8, A10 and A11; Derby, CLoHAM, LoHAM]

SLIDE 17

Limiting Factor (Function A)

[Charts: distribution of runtime between SSSP and Flow Accumulation for the serial version, CLoHAM and LoHAM]

SLIDE 18

Limiting Factor (Function A)

  • Limiting Factor has now changed
  • Need to parallelise Flow Accumulation

[Charts: distribution of runtime between SSSP and Flow Accumulation, Serial vs A11, for CLoHAM and LoHAM]

SLIDE 19

Flow Accumulation

  • Shortest Path + OD = Flow-per-link

For each origin-destination pair, trace the route from the destination back to the origin, increasing the flow value for each link visited.

  • Parallel problem
  • But requires synchronised access to a shared data structure for all trips (atomic operations)

[Illustration: an O-D matrix and the resulting flow value per link]
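
A hedged CUDA sketch of this step (one thread per O-D pair; not the authors' kernel, and the predecessor arrays describing each origin's shortest-path tree are illustrative assumptions):

// Each thread walks one O-D pair's route back from destination to origin,
// adding the pair's demand to every link it visits. Links are shared between
// trips, so the per-link flow must be accumulated atomically.
__global__ void accumulateFlow(int numPairs,
                               const int* pairOriginIdx,     // index of the origin's SSSP tree
                               const int* pairOriginVertex,  // the origin (centroid) vertex itself
                               const int* pairDestVertex,    // the destination vertex
                               const float* pairDemand,      // O-D demand for this pair
                               const int* predEdge,          // per-origin tree: incoming edge of each vertex
                               const int* predVertex,        // per-origin tree: predecessor of each vertex
                               int numVertices, float* linkFlow)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPairs) return;
    int tree = pairOriginIdx[p] * numVertices;
    int v = pairDestVertex[p];
    float demand = pairDemand[p];
    while (v != pairOriginVertex[p]) {
        atomicAdd(&linkFlow[predEdge[tree + v]], demand);  // links shared between trips
        v = predVertex[tree + v];
    }
}

The atomicAdd on shared links is exactly the contention that A12 exposes and that A14/A15 address on the following slides.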

SLIDE 20

Flow Accumulation

  • Problem:
  • A12 – lots of atomic operations serialise the execution

[Chart: time (seconds) to compute flow values per link, Serial vs A12, for Derby, CLoHAM and LoHAM]

SLIDE 21

Flow Accumulation

  • Problem:
  • A12 – lots of atomic operations serialise the execution
  • Solutions:
  • A15 – Reduce the number of atomic operations: solve in batches using parallel reduction
  • A14 – Use fast hardware-supported single-precision atomics: minimise loss of precision using multiple 32-bit summations (0.000022% total error)
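
One way to picture the "multiple 32-bit summations" idea, offered as a sketch of my interpretation rather than the authors' code (the bin count and names are assumptions): keep several single-precision partial accumulators per link, add into the bin for the current O-D batch, and only combine the bins at the end, which both spreads the atomic traffic across addresses and keeps each partial sum small.

#define NUM_BINS 8   // assumed number of partial accumulators per link

// Atomically add a trip's demand into the bin belonging to its batch.
__device__ void addFlowBinned(float* partialFlow, int link, int batch, float demand)
{
    atomicAdd(&partialFlow[link * NUM_BINS + (batch % NUM_BINS)], demand);
}

// Final pass: combine each link's bins into a double-precision total.
__global__ void combineBins(int numLinks, const float* partialFlow, double* linkFlow)
{
    int link = blockIdx.x * blockDim.x + threadIdx.x;
    if (link >= numLinks) return;
    double total = 0.0;
    for (int b = 0; b < NUM_BINS; ++b)
        total += partialFlow[link * NUM_BINS + b];
    linkFlow[link] = total;
}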

[Chart: time (seconds) to compute flow values per link for Serial, A12, A15 (double) and A14 (single); Derby, CLoHAM, LoHAM]

SLIDE 22

Integrated Results

SLIDE 23

Assignment Speedup relative to Serial

Serial

  • LoHAM – 12h 12m

Double precision

  • LoHAM – 35m 22s

Single precision

  • Reduced loss of precision
  • LoHAM – 17m 32s

Hardware:

  • Intel Core i7-4770K
  • 16GB DDR3
  • Nvidia GeForce Titan X

Relative assignment runtime performance vs Serial (speedup):

Model     Serial   Multicore   A15 (double)   A14 (single)
Derby     1.00     3.42        3.73           6.14
CLoHAM    1.00     5.51        11.24          18.24
LoHAM     1.00     6.81        20.70          41.77

SLIDE 24

Assignment Speedup relative to Multicore

Multicore

  • LoHAM – 1h 47m

Double precision

  • LoHAM – 35m 22s

Single precision

  • Reduced loss of precision
  • LoHAM – 17m 32s

Hardware:

  • Intel Core i7-4770K
  • 16GB DDR3
  • Nvidia GeForce Titan X

Relative assignment runtime performance vs Multicore (speedup):

Model     Serial   Multicore   A15 (double)   A14 (single)
Derby     0.29     1.00        1.09           1.79
CLoHAM    0.18     1.00        2.04           3.31
LoHAM     0.15     1.00        3.04           6.14

SLIDE 25

GPU Computing at UoS

SLIDE 26

Expertise at Sheffield

  • Specialists in GPU computing and performance optimisation
  • Complex systems simulations via FLAME and FLAME GPU
  • Visual simulation, computer graphics and virtual reality
  • Training and education for GPU computing

SLIDE 27

Thank You

Paul Richmond   p.richmond@sheffield.ac.uk   paulrichmond.shef.ac.uk
Peter Heywood   p.heywood@sheffield.ac.uk    ptheywood.uk

Largest Model (LoHAM) results:

Version                  Runtime     Speedup vs Serial   Speedup vs Multicore
Serial                   12:12:24    1.00                0.15
Multicore                01:47:36    6.81                1.00
A15 (double precision)   00:35:22    20.70               3.04
A14 (single precision)   00:17:32    41.77               6.14

SLIDE 28

Backup Slides

SLIDE 29

Benchmark Models

  • 3 Benchmark networks
  • Range of sizes
  • Small to V. Large
  • Up to 12 hour runtime

Model    Vertices (Nodes)   Edges (Links)   O-D trips
Derby    2700               25385           547²
CLoHAM   15179              132600          2548²
LoHAM    18427              192711          5194²

[Chart: benchmark model performance, time (s) for the Serial, Salford and Multicore versions; Derby, CLoHAM, LoHAM]

SLIDE 30

Edges considered per algorithm

[Charts: edges considered per iteration and total edges considered, Bellman-Ford vs D’Esopo-Pape]

SLIDE 31

Vertex Frontier (A4)

  • Only vertices which were updated in the previous iteration can result in an update
  • Far fewer threads launched per iteration
  • Up to 2500 instead of 18427 per iteration
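
A hedged sketch of how the next frontier could be gathered from per-vertex update flags (one possible realisation, not the authors' code; a scan-based stream compaction would avoid the global atomic counter used here for brevity):

// Gather every flagged vertex into the next frontier and clear its flag.
__global__ void buildFrontier(int numVertices, int* updatedFlag,
                              int* frontier, int* frontierSize)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVertices) return;
    if (updatedFlag[v]) {
        updatedFlag[v] = 0;                     // reset for the next iteration
        int slot = atomicAdd(frontierSize, 1);  // reserve a slot in the frontier
        frontier[slot] = v;
    }
}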

[Chart: frontier size per iteration for source 1; Derby, CLoHAM, LoHAM]

SLIDE 32

Multiple Concurrent Origins (A8)

[Charts: frontier size per iteration for source 1, and frontier size per iteration summed over all concurrent sources; Derby, CLoHAM, LoHAM]
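
One hedged way to realise the A8 idea of processing many origins concurrently (an illustrative sketch, not the authors' kernel): put (origin, vertex) pairs in a single combined frontier and give each origin its own slice of the distance array.

// One thread per (origin, vertex) frontier entry; each origin has its own
// distance slice of length numVertices. Integer costs assumed as before.
__global__ void relaxMultiOriginFrontier(int frontierSize,
                                         const int* frontierOrigin,
                                         const int* frontierVertex,
                                         const int* rowStart, const int* colIdx,
                                         const int* weight, int numVertices,
                                         int* dist, int* updatedFlag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontierSize) return;
    int o = frontierOrigin[i];
    int u = frontierVertex[i];
    int du = dist[o * numVertices + u];
    for (int e = rowStart[u]; e < rowStart[u + 1]; ++e) {
        int idx = o * numVertices + colIdx[e];
        int candidate = du + weight[e];
        if (candidate < atomicMin(&dist[idx], candidate))
            updatedFlag[idx] = 1;   // this (origin, vertex) joins the next frontier
    }
}

Combining all origins into one launch keeps the GPU saturated even when any single origin's frontier is small, which is what the right-hand chart illustrates.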

SLIDE 33

Atomic Contention

  • Atomic operations are guaranteed to occur
  • Atomic contention: multiple threads atomically modify the same address
  • Serialised!
  • atomicAdd(double) not implemented in hardware (not yet)
  • Solutions:

1. Algorithmic change to minimise atomic contention
2. Single precision
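
For context, the standard software fallback for double-precision atomicAdd on GPUs without hardware support is a compare-and-swap loop on the 64-bit bit pattern (the well-known pattern from the CUDA programming guide, reproduced here as a sketch); each retry under contention is what makes it slow:

__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* addr = (unsigned long long int*)address;
    unsigned long long int old = *addr, assumed;
    do {
        assumed = old;
        // Re-read the value, add, and attempt to swap in the new bit pattern.
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // another thread intervened: retry
    return __longlong_as_double(old);
}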

[Chart: atomic contention per iteration for CLoHAM and LoHAM]

SLIDE 34

Raw Performance

Hardware:

  • Intel Core i7-4770K
  • 16GB DDR3
  • Nvidia GeForce Titan X

[Chart: assignment runtime (seconds) per algorithm for Serial, Multicore, A8, A10, A11, A15, A13 (single) and A14 (single); Derby, CLoHAM, LoHAM]