
SLIDE 1

Introduction

Every new desktop/laptop is now equipped with a graphics processing unit (GPU).
GPU = Massively Parallel Architecture.
For most of their life, such GPUs are idle.
General Purpose GPU applications: numerical analysis (MathWorks MATLAB), bioinformatics, deep learning.

F. Fioretto, T. Le, E. Pontelli, W. Yeoh, T. Son

SLIDE 2

(Distributed) Constraint Optimization

A (D)COP is a tuple <X, D, F, (A, α)>, where:
X is a set of variables.
D is a set of finite domains.
F is a set of utility functions, $f_i : \times_{x_j \in \mathrm{scope}(f_i)} D_j \mapsto \mathbb{N} \cup \{0, \infty\}$.
A is a set of agents, controlling the variables in X.
α maps variables to agents.
GOAL: Find a utility maximal assignment:

$x^* = \arg\max_{x} F(x) = \arg\max_{x} \sum_{f \in F} f(x|_{\mathrm{scope}(f)})$
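
To make the objective concrete, the following Python sketch enumerates every assignment of a toy COP and keeps the one with the highest total utility. The three binary variables mirror the x1, x2, x3 example used later in the deck. Reusing one utility table for all three functions (in particular for f12, whose values the slides do not give) is an assumption made only for illustration, and the function names are hypothetical.

```python
from itertools import product

# Toy COP (assumed values): three binary variables, three binary utility functions.
domains = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1]}
table = {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}
functions = [(("x1", "x2"), table), (("x1", "x3"), table), (("x2", "x3"), table)]


def total_utility(assignment, functions):
    """F(x): sum every f over the assignment restricted to scope(f)."""
    return sum(tab[tuple(assignment[v] for v in scope)] for scope, tab in functions)


def solve_by_enumeration(domains, functions):
    """Return (assignment, utility) maximizing F(x) by trying all combinations."""
    names = sorted(domains)
    best, best_value = None, float("-inf")
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        value = total_utility(assignment, functions)
        if value > best_value:
            best, best_value = assignment, value
    return best, best_value


best, value = solve_by_enumeration(domains, functions)
print(best, value)
```

Enumeration visits every combination of domain values, so its cost grows exponentially with the number of variables.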

SLIDE 3

Graphics Processing Units (GPUs)

A GPU is a massively parallel architecture:
Thousands of multi-threaded computing cores.
Very high memory bandwidths.
~80% of transistors devoted to data processing rather than caching.
However:
GPU cores are slower than CPU cores.
GPU memories have different sizes and access times.
GPU programming is more challenging and time consuming.

SLIDE 4

Execution Model

A Thread is the basic parallel unit.
Threads are organized into a Block.
Several warps are scheduled for the execution of a GPU function.
Several Streaming Multiprocessors (SMs) are scheduled in parallel.
Single Instruction Multiple Thread (SIMT) parallel model.

[Figure: the host launches Kernel 1 and Kernel 2 on the device. Each kernel runs as a grid of blocks, Block (0,0) through Block (2,1); each block contains threads, Thread (0,0) through Thread (4,3), which are grouped into warps.]
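
As a rough illustration of how threads, blocks, and warps relate, this Python sketch computes a flat global thread index and the number of warps in a block. The warp size of 32 holds for current NVIDIA GPUs; the helper names are hypothetical, and a one-dimensional launch is assumed for simplicity.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def global_thread_id(block_idx, block_dim, thread_idx):
    """Flat index of a thread in a 1-D grid of 1-D blocks."""
    return block_idx * block_dim + thread_idx


def warps_per_block(block_dim):
    """Number of warps needed to cover a block (ceiling division)."""
    return -(-block_dim // WARP_SIZE)


print(global_thread_id(2, 192, 5), warps_per_block(192))
```

In CUDA itself the same index is usually written as `blockIdx.x * blockDim.x + threadIdx.x`.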

SLIDE 5

Memory Hierarchy

The GPU memory architecture is rather involved.
Registers
Shared memory
Global memory

[Figure: the host connects to the device's global memory and constant memory; within the grid, each block has its own shared memory and each thread its own registers.]

SLIDE 6

Memory Hierarchy

The GPU memory architecture is rather involved.
Registers: fastest; only accessible by a thread; lifetime of a thread.
Shared memory
Global memory

SLIDE 7

Memory Hierarchy

The GPU memory architecture is rather involved.
Registers
Shared memory: extremely fast; highly parallel; restricted to a block.
Global memory

SLIDE 8

Memory Hierarchy

The GPU memory architecture is rather involved.
Registers
Shared memory
Global memory: typically implemented in DRAM; high access latency (400-800 cycles); potential for traffic congestion.

SLIDE 9

Memory Hierarchy

The GPU memory architecture is rather involved.
Registers
Shared memory
Global memory
Challenge: using memory effectively likely requires redesigning the algorithm.

SLIDE 10

CUDA: Compute Unified Device Architecture

[Figure: the host (CPU) and the device (GPU).]

SLIDE 11

CUDA: Compute Unified Device Architecture

Host to Device: allocate global memory on the device and copy the data into it.

cudaMalloc(&deviceV, sizeV); cudaMemcpy(deviceV, hostV, sizeV, ...)

SLIDE 12

CUDA: Compute Unified Device Architecture

Kernel invocation: the host launches the kernel, which runs on the device and works on global memory.

cudaKernel<<<nBlocks, nThreads>>>( )

SLIDE 13

CUDA: Compute Unified Device Architecture

Device to Host: copy the data from global memory back to the host.

cudaMemcpy(hostV, deviceV, sizeV, ...)

SLIDE 14

Bucket Elimination and DPOP

Dynamic Programming procedures to solve (D)COPs.
Both procedures rely on the use of two operators:
Projection Operator: $\pi_{-x_i}(f_{ij})$
Aggregation Operator: $f_{ij} + f_{ik}$

Projection example.

fij:
xi  xj  U
0   0   5
0   1   8
1   0   20
1   1   3

$\pi_{-x_i}(f_{ij})$:
xj  U
0   20
1   8

SLIDE 15

Bucket Elimination and DPOP

Projection Operator $\pi_{-x_i}(f_{ij})$, row xj = 0: max(5, 20) = 20.

SLIDE 16

Bucket Elimination and DPOP

Projection Operator $\pi_{-x_i}(f_{ij})$, row xj = 1: max(8, 3) = 8.
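
The projection step can be sketched in a few lines of Python. The dictionary-based table layout and the name `project_out` are assumptions chosen for readability, not the representation used on the GPU; the values are those of the fij table in the example.

```python
def project_out(scope, table, var):
    """Maximize `var` away: return (new_scope, new_table)."""
    pos = scope.index(var)
    new_scope = scope[:pos] + scope[pos + 1:]
    new_table = {}
    for values, utility in table.items():
        key = values[:pos] + values[pos + 1:]  # drop the projected variable
        new_table[key] = max(utility, new_table.get(key, float("-inf")))
    return new_scope, new_table


# fij from the slides, keyed by (xi, xj).
f_ij = {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}
scope, projected = project_out(("xi", "xj"), f_ij, "xi")
print(scope, projected)
```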

SLIDE 17

Bucket Elimination and DPOP

Aggregation example.

fij:
xi  xj  U
0   0   5
0   1   8
1   0   20
1   1   3

fik:
xi  xk  U
0   0   2
0   1   6
1   0   11
1   1   4

$f_{ij} + f_{ik}$:
xi  xj  xk  U
0   0   0   7
0   0   1   11
0   1   0   10
0   1   1   14
...

SLIDE 18

Bucket Elimination and DPOP

Aggregation Operator $f_{ij} + f_{ik}$, row (xi = 0, xj = 0, xk = 0): 5 + 2 = 7.

SLIDE 19

Bucket Elimination and DPOP

Aggregation Operator $f_{ij} + f_{ik}$, row (xi = 0, xj = 0, xk = 1): 5 + 6 = 11.
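
A minimal Python sketch of the aggregation step follows. It joins two utility tables over the union of their scopes and adds the utilities of matching rows. The dictionary layout and the name `aggregate` are illustrative assumptions; the table values are those of fij and fik in the example.

```python
from itertools import product


def aggregate(scope_a, table_a, scope_b, table_b, domains):
    """Return (scope, table) of f_a + f_b over the union of both scopes."""
    scope = scope_a + tuple(v for v in scope_b if v not in scope_a)
    result = {}
    for values in product(*(domains[v] for v in scope)):
        assignment = dict(zip(scope, values))
        key_a = tuple(assignment[v] for v in scope_a)
        key_b = tuple(assignment[v] for v in scope_b)
        result[values] = table_a[key_a] + table_b[key_b]
    return scope, result


domains = {"xi": [0, 1], "xj": [0, 1], "xk": [0, 1]}
f_ij = {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}
f_ik = {(0, 0): 2, (0, 1): 6, (1, 0): 11, (1, 1): 4}
scope, f_sum = aggregate(("xi", "xj"), f_ij, ("xi", "xk"), f_ik, domains)
print(scope, f_sum)
```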

SLIDE 20

Bucket Elimination and DPOP

1. Imposes an ordering on the problem's variables.

X = {x1, x2, x3}   F = {f12, f13, f23}

[Figure: constraint graph over x1, x2, x3 with edges f12, f13, f23, next to the variable ordering x3, x2, x1.]

SLIDE 21

Bucket Elimination and DPOP

2. Selects the variable xi with highest priority, and creates a bucket:

$B_i = \{ f_j \in F \mid x_i \in \mathrm{scope}(f_j) \wedge i = \max\{k \mid x_k \in \mathrm{scope}(f_j)\} \}$

X = {x1, x2, x3}   F = {f12, f13, f23}
B3 = {f13, f23}
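
The bucket definition translates almost directly into code. In this Python sketch, scopes are tuples of variable indices and the name `bucket` is hypothetical. Since i = max{k | xk ∈ scope(fj)} already implies that xi is in the scope, a single test suffices.

```python
def bucket(i, functions):
    """B_i: the functions whose highest-indexed scope variable is x_i."""
    return [name for name, scope in functions.items() if max(scope) == i]


# Scopes from the running example: f12 over (x1, x2), and so on.
functions = {"f12": (1, 2), "f13": (1, 3), "f23": (2, 3)}
print(bucket(3, functions))
```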

SLIDE 22

Bucket Elimination and DPOP

3. It computes a new utility function fi' by aggregating the functions in Bi and projecting out xi.

X = {x1, x2, x3}   F = {f12, f13, f23}
B3 = {f13, f23}
$f_3' = \pi_{-x_3}(f_{13} + f_{23})$

f13:
x1  x3  U
0   0   5
0   1   8
1   0   20
1   1   3

f23:
x2  x3  U
0   0   5
0   1   8
1   0   20
1   1   3

f3' (x3 projected out):
x1  x2  U
0   0   max(5 + 5, 8 + 8) = 16

SLIDE 23

Bucket Elimination and DPOP

3. It computes a new utility function fi' by aggregating the functions in Bi and projecting out xi.

$f_3' = \pi_{-x_3}(f_{13} + f_{23})$, row (x1 = 0, x2 = 1): max(5 + 20, 8 + 3) = 25.

SLIDE 24

Bucket Elimination and DPOP

3. It computes a new utility function fi' by aggregating the functions in Bi and projecting out xi.

$f_3' = \pi_{-x_3}(f_{13} + f_{23})$, row (x1 = 1, x2 = 0): max(20 + 5, 3 + 8) = 25.

SLIDE 25

Bucket Elimination and DPOP

3. It computes a new utility function fi' by aggregating the functions in Bi and projecting out xi.

$f_3' = \pi_{-x_3}(f_{13} + f_{23})$, row (x1 = 1, x2 = 1): max(20 + 20, 3 + 3) = 40.

f3':
x1  x2  U
0   0   16
0   1   25
1   0   25
1   1   40
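
The four rows of f3' can be reproduced with a short Python sketch that performs the two operators in sequence: first the aggregation f13 + f23 over (x1, x2, x3), then the projection that maximizes x3 away. The dictionary layout is an assumption made for readability; the utilities are those of the example tables.

```python
from itertools import product

f13 = {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}  # keyed by (x1, x3)
f23 = {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}  # keyed by (x2, x3)

# Aggregation: joint table over (x1, x2, x3).
joint = {
    (x1, x2, x3): f13[(x1, x3)] + f23[(x2, x3)]
    for x1, x2, x3 in product((0, 1), repeat=3)
}

# Projection: keep, for each (x1, x2), the best utility over x3.
f3_prime = {}
for (x1, x2, x3), utility in joint.items():
    current = f3_prime.get((x1, x2), float("-inf"))
    f3_prime[(x1, x2)] = max(utility, current)

print(f3_prime)
```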

SLIDE 26

Bucket Elimination and DPOP

4. It updates the set of variables: $X \leftarrow X \setminus \{x_i\}$
5. It updates the set of functions: $F \leftarrow (F \cup \{f_i'\}) \setminus B_i$

X = {x1, x2}   F = {f12, f3'}
B2 = {f12, f3'}

SLIDE 27

Bucket Elimination and DPOP

Repeat...

X = {x1, x2}   F = {f12, f3'}

DPOP is a distributed version of BE.
It operates on a Pseudotree ordering of the constraint graph.

[Figure: pseudotree fragment in which xi receives the functions fj' and fk' from xj and xk.]
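
Putting the steps together, the loop below is a minimal centralized sketch of Bucket Elimination in Python, not the authors' implementation. It returns only the optimal utility; recovering the maximizing assignment is omitted. The table for f12 is an assumed copy of the example table, and the function names are hypothetical.

```python
from itertools import product


def eliminate(bucket_fns, var, domains):
    """Aggregate the bucket's functions and project out `var`."""
    scope = tuple(sorted({v for s, _ in bucket_fns for v in s} - {var}))
    table = {}
    for values in product(*(domains[v] for v in scope)):
        assignment = dict(zip(scope, values))
        best = float("-inf")
        for d in domains[var]:
            assignment[var] = d
            total = sum(t[tuple(assignment[v] for v in s)] for s, t in bucket_fns)
            best = max(best, total)
        table[values] = best
    return scope, table


def bucket_elimination(order, domains, functions):
    """Eliminate variables from last to first in `order`; return the optimum."""
    functions = list(functions)
    for var in reversed(order):
        bucket_fns = [f for f in functions if var in f[0]]    # B_i
        rest = [f for f in functions if var not in f[0]]      # F \ B_i
        functions = rest + [eliminate(bucket_fns, var, domains)]
    # Every remaining function has an empty scope and a single entry.
    return sum(table[()] for _, table in functions)


domains = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1]}
t = {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}
functions = [(("x1", "x2"), t), (("x1", "x3"), t), (("x2", "x3"), t)]
print(bucket_elimination(["x1", "x2", "x3"], domains, functions))
```

Processing the order in reverse means that, when a variable is reached, all higher-indexed variables are already gone, so the functions that still mention it form exactly its bucket.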

SLIDE 28

GPU-(D)BE

$f_3' = \pi_{-x_3}(f_{13} + f_{23})$

f3':
x1  x2  U
0   0   max(5 + 5, 8 + 8) = 16
0   1   max(5 + 20, 8 + 3) = 25
1   0   max(20 + 5, 3 + 8) = 25
1   1   max(20 + 20, 3 + 3) = 40

BE and DPOP complexity: $O(d^{w^*})$.
d = max. domain size; w* = induced width of the constraint graph.

Can the projection and aggregation operators be executed in parallel?
Do they fit the SIMT parallel model?

SLIDE 29

GPU-(D)BE

$f_3' = \pi_{-x_3}(f_{13} + f_{23})$

f3':
x1  x2  U
0   0   max(5 + 5, 8 + 8) = 16
0   1   max(5 + 20, 8 + 3) = 25
1   0   max(20 + 5, 3 + 8) = 25
1   1   max(20 + 20, 3 + 3) = 40

BE and DPOP complexity: $O(d^{w^*})$.
d = max. domain size; w* = induced width of the constraint graph.

Can the projection and aggregation operators be executed in parallel?
Do they fit the SIMT parallel model?
Obs.: The computation of each row of the Utility tables is independent from the computation of other rows.
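
The observation can be illustrated by writing the work of a single row as a function of the row index alone, which is the shape a GPU thread would take. This Python sketch only simulates the idea on the CPU: the flat-array layout and the name `row_kernel` are assumptions, and the values are those of the f13 and f23 example tables.

```python
def row_kernel(row, f13, f23, d):
    """Work of one 'thread': compute row `row` of f3' from flat arrays.

    Tables are stored row-major, so f13[x1 * d + x3] is f13(x1, x3).
    """
    x1, x2 = divmod(row, d)
    return max(f13[x1 * d + x3] + f23[x2 * d + x3] for x3 in range(d))


f13 = [5, 8, 20, 3]  # (x1, x3) in row-major order
f23 = [5, 8, 20, 3]  # (x2, x3) in row-major order

# Rows can be computed in any order (here: reversed) with the same result.
rows = {r: row_kernel(r, f13, f23, 2) for r in reversed(range(4))}
print([rows[r] for r in range(4)])
```

Because no row reads the result of another, every row can be assigned to its own thread without synchronization.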

SLIDE 30

Algorithm design and data structure

Limit the amount of host-device data transfers.
  Static Entities: require a single data transaction.
    Variables; Domains; Utility functions; Constraint Graph.
  Dynamic Entities: might require multiple data transactions.
    Utility tables.
Minimize the accesses to the global memory.
  Padding Utility Tables' rows; Perfect hashing.
Ensure data accesses are coalesced.
  Mono-dimensional array organization.

[Figure: a coalesced memory access pattern (good) next to a non-coalesced one (bad).]
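
One common way to store a utility table in a mono-dimensional array is to treat each row's assignment as a mixed-radix number, which gives a collision-free mapping between assignments and array positions. The Python sketch below shows that mapping; whether it matches the perfect hashing used in GPU-(D)BE is an assumption, and the names `encode` and `decode` are hypothetical.

```python
def encode(values, domain_sizes):
    """Map an assignment (one value per variable) to its flat array index."""
    index = 0
    for value, size in zip(values, domain_sizes):
        index = index * size + value
    return index


def decode(index, domain_sizes):
    """Inverse of encode: recover the assignment stored at `index`."""
    values = []
    for size in reversed(domain_sizes):
        index, value = divmod(index, size)
        values.append(value)
    return tuple(reversed(values))


sizes = (2, 2, 2)
print(encode((1, 0, 1), sizes), decode(5, sizes))
```

With this layout, consecutive thread indices map to consecutive array positions, which is the access pattern that coalescing favors.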

SLIDE 31

Parallel Projection and Aggregation

Mapping between the fi' table rows and the CUDA blocks:
Each thread in a block is associated with the computation of one permutation of values in scope(fi').
1 block = 64k threads (1 ≤ k ≤ 16).
k depends on the architecture and is chosen so as to maximize the number of threads that can be scheduled concurrently.

Obs.: The maximum number of fi' table rows computed in parallel is M = |SMs| · 64k.
In our experiments, |SMs| = 14 and k = 3. Thus M = 2688.

[Figure: a GPU kernel distributes blocks B0, B1, ..., B29 over the streaming multiprocessors SM0, SM1, ..., SM13. Within a block, threads Th0, Th1, ..., Th192 each compute one row R+0, R+1, ..., R+192 of the table U'i in GPU global memory.]

The thread handling row $r_t$ computes

$\max_{d \in D_i} \sum_{f_j \in B_i} f_j(\sigma_{x^i} = r_t \wedge x_i = d)$
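
The arithmetic behind M can be checked with a small Python sketch. The helper names are hypothetical, and `rounds_needed` is a deliberately simplified model that assumes rows are processed in full batches of M, ignoring the details of real GPU scheduling.

```python
def max_parallel_rows(num_sms, k):
    """M = |SMs| * 64k, with 1 <= k <= 16 as stated on the slide."""
    if not 1 <= k <= 16:
        raise ValueError("k must be between 1 and 16")
    return num_sms * 64 * k


def rounds_needed(num_rows, num_sms, k):
    """Batches of M rows needed to cover a table (ceiling division)."""
    m = max_parallel_rows(num_sms, k)
    return -(-num_rows // m)


print(max_parallel_rows(14, 3))      # configuration reported in the experiments
print(rounds_needed(5 ** 6, 14, 3))  # a table over 6 variables with |Di| = 5
```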

SLIDE 32

Experimental Results

Regular Grid Networks

|Di| = 5; p2 = 0.5; timeout = 300s
CPU: 2.3GHz, 128 GB RAM
GPU: 14 SMs, 837MHz.

Speedup
avg. max.: 125.1x
avg. min.: 42.6x
Similar trends at increasing |Di|.
Similar trends for DPOP vs GPU-DBE.

[Plot: runtime (sec, log scale, 0.01 to 10) vs. number of variables (9 to 100) for BE and GPU-BE.]

SLIDE 33

Experimental Results

Random Networks

|Di| = 5; p2 = 0.5; p1 = 0.3; timeout = 300s
CPU: 2.3GHz, 128 GB RAM
GPU: 14 SMs, 837MHz.

Speedup
avg. max.: 69.3x
avg. min.: 16.1x
Similar trends at increasing |Di|.
Similar trends for DPOP vs GPU-DBE.

[Plot: runtime (sec, log scale, 0.01 to 50) vs. number of variables (5 to 25) for BE and GPU-BE.]

SLIDE 34

Experimental Results

Scale Free Networks

|Di| = 5; p2 = 0.5; timeout = 300s
CPU: 2.3GHz, 128 GB RAM
GPU: 14 SMs, 837MHz.

Speedup
avg. max.: 34.9x
avg. min.: 9.5x
Similar trends at increasing |Di|.
Similar trends for DPOP vs GPU-DBE.

[Plot: runtime (sec, log scale, 0.1 to 50) vs. number of variables (10 to 50) for BE and GPU-BE.]

SLIDE 35

Lesson Learned #1

The fi' table size increases exponentially with w*.
Limited GPU global memory (2GB).
The fi' table plus the Bi tables, to be used in the aggregation operation, might exceed global memory capacity!

Partition the fi' computations into multiple chunks.
Alternate GPU and CPU to compute fi'.
GPU: Aggregates the functions in Bi, excluding those which do not fit in the global memory.
CPU: Aggregates the other functions in Bi; projects out the variable xi.
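
A back-of-the-envelope Python sketch shows how quickly a utility table outgrows 2 GB and how many chunks it would then need. The 4 bytes per entry, the use of the whole global memory as the chunk budget, and the helper names are all assumptions made for illustration.

```python
def table_entries(d, scope_size):
    """Rows in a utility table over `scope_size` variables of domain size d."""
    return d ** scope_size


def chunks_needed(d, scope_size, bytes_per_entry, budget_bytes):
    """Chunks required so that each chunk fits in `budget_bytes`."""
    total_bytes = table_entries(d, scope_size) * bytes_per_entry
    return -(-total_bytes // budget_bytes)  # ceiling division


GLOBAL_MEMORY = 2 * 1024 ** 3  # 2 GB, as on the slide

print(table_entries(5, 14))                    # 6103515625 rows
print(chunks_needed(5, 14, 4, GLOBAL_MEMORY))  # chunks for a 14-variable table
```

In practice the budget per chunk is smaller, since the Bi tables used in the aggregation must reside in global memory as well.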


SLIDE 36

Experimental Results

Random Networks

|A| = 10; |Di| = 5; p2 = 0.5; timeout = 300s
CPU: 2.3GHz, 128 GB RAM
GPU: 14 SMs, 837MHz.

Phase transition at p1 = 0.4.
Small p1 corresponds to smaller w*.
New results (work in progress).

[Plot: speedup (GPU vs CPU, 40 to 80) vs. graph density p1 (0.2 to 0.9).]

SLIDE 37

Lesson Learned #2

Host and Device concurrency.
Possible when the fi' tables are computed in chunks.
It may hide host-device data transfers as a byproduct.

[Figure: timeline of host (CPU) and device (GPU) activity. After an initial host-to-device copy, the GPU executes kernels K1, K2, ... to compute U1, U2, ... and updates global memory; each result is copied from device to host, where the CPU compresses U1, U2, ... while the GPU executes the next kernel.]

SLIDE 38

Discussion

Exploiting the integration of CPU and GPU is a key factor to obtain competitive solver performance.
How to determine good tradeoffs of such integration?
GPU:
Repeated, non memory intensive operations;
Operations requiring regular memory access;
CPU:
Memory intensive operations;

SLIDE 39

Conclusions

Exploit GPU-style parallelism from DP-based (D)COP resolution methods.
GPU-(D)BE: Exploits GPUs to parallelize the aggregation and projection operators.
Observed speedups ranging from 34.9x to 125.1x across several network topologies.
Discussed several possible optimization techniques.
FUTURE WORK:
Exploit GPUs in DP-based propagators.
Investigate GPUs in higher forms of consistency.

SLIDE 40

Exploiting GPUs in Solving (Distributed) Constraint Optimization Problems with Dynamic Programming

F. Fioretto, T. Le, E. Pontelli, W. Yeoh, T. Son

Ferdinando Fioretto
New Mexico State University, University of Udine
Email: ffiorett@cs.nmsu.edu
Web: www.cs.nmsu.edu/~ffiorett

Thank You!
