A Dynamic Programming-based MCMC Framework for Solving DCOPs with GPUs - PowerPoint PPT Presentation



slide-1
SLIDE 1

A Dynamic Programming-based MCMC Framework for Solving DCOPs with GPUs

Ferdinando Fioretto (1,2), joint work with William Yeoh (2) and Enrico Pontelli (2)

(1) University of Michigan, (2) New Mexico State University

CP 2016, Toulouse

slide-2
SLIDE 2


Distributed Discrete Optimization with Preferences

Introduction GPUs DMCMC Results Conclusions

slide-3
SLIDE 3

GPUs

  • Every new desktop/laptop is now equipped with a graphics processing unit (GPU).
  • GPU = Massively Parallel Architecture.
  • For most of their life, such GPUs sit idle.
  • General-purpose GPU applications: numerical analysis (MathWorks MATLAB), bioinformatics, deep learning.

slide-4
SLIDE 4

Outline

  • Introduction
  • GPUs
  • D-MCMC
  • Results
  • Conclusions


slide-5
SLIDE 5

[Figure: agents with variables and constraints, versus a centralized DCOP solver.]

Multi-Agent Constraint Optimization

  • A DCOP is a tuple <X, D, F, A, α>, where:
  • X is a set of variables.
  • D is a set of finite domains for each variable.
  • F is a set of constraints between variables.
  • A is a set of agents, controlling the variables in X.
  • α is a mapping from variables to agents.


[Example utility table U for the binary constraint between xa and xb.]

slide-6
SLIDE 6

Multi-Agent Constraint Optimization

  • A DCOP is a tuple <X, D, F, A, α>, where:
  • X is a set of variables.
  • D is a set of finite domains for each variable.
  • F is a set of constraints between variables.
  • A is a set of agents, controlling the variables in X.
  • α is a mapping from variables to agents.


[Figure: agent ai controlling variables x1–x5, partitioned into boundary variables Bi and local variables Li.]

slide-7
SLIDE 7

Multi-Agent Constraint Optimization

  • A DCOP is a tuple <X, D, F, A, α>, where:
  • X is a set of variables.
  • D is a set of finite domains for each variable.
  • F is a set of constraints between variables.
  • A is a set of agents, controlling the variables in X.
  • α is a mapping from variables to agents.
  • GOAL: Find a utility maximal assignment.


x* = argmax_x F(x) = argmax_x Σ_{f ∈ F} f(x|scope(f))
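To make the objective concrete, the argmax above can be sketched as a brute-force enumeration over a toy instance (the variables, domains, and utility values here are invented for illustration, not from the talk):

```python
from itertools import product

# Toy DCOP: two variables with domain {1, 2, 3} and a single binary
# constraint f(xa, xb); the utility values are invented for illustration.
domains = {"xa": [1, 2, 3], "xb": [1, 2, 3]}

def f(xa, xb):
    # Arbitrary utility: reward agreement, penalize disagreement by distance.
    return 10 if xa == xb else -abs(xa - xb)

def solve_bruteforce(domains):
    """Return x* = argmax_x sum over constraints f of f(x|scope(f))."""
    names = list(domains)
    best, best_util = None, float("-inf")
    for values in product(*(domains[n] for n in names)):
        x = dict(zip(names, values))
        util = f(x["xa"], x["xb"])  # sum over F (one constraint here)
        if util > best_util:
            best, best_util = x, util
    return best, best_util

best, util = solve_bruteforce(domains)
```

Brute force is exponential in the number of variables, which is exactly why the talk turns to sampling and dynamic programming.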

slide-8
SLIDE 8

MCMC Sampling

  • MCMC algorithms approximate probability distributions.
  • They use a proposal distribution to generate a sequence of samples z(1), z(2), …, which forms a Markov chain.
  • The quality of the samples improves as a function of the number of steps.
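As a hedged illustration of such a sampling loop (a generic sketch, not the implementation from the talk), here is a minimal Metropolis-Hastings chain with a symmetric Gaussian random-walk proposal, targeting an unnormalized log-density:

```python
import math
import random

def metropolis_hastings(log_p, x0, steps, step_size=1.0, seed=0):
    """Generate samples z(1), z(2), ... forming a Markov chain.

    log_p: unnormalized log-density of the target distribution.
    Proposal: symmetric Gaussian random walk, so the acceptance
    probability reduces to min(1, p(candidate) / p(current)).
    """
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(steps):
        cand = x + rng.gauss(0.0, step_size)
        # Accept with probability min(1, p(cand)/p(x)), in log space.
        if math.log(rng.random() + 1e-300) < log_p(cand) - log_p(x):
            x = cand
        samples.append(x)
    return samples

# Target: standard normal. After burn-in the sample mean should
# approach the true mean (0), illustrating "quality improves with steps".
samples = metropolis_hastings(lambda z: -0.5 * z * z, x0=5.0, steps=5000)
mean = sum(samples[1000:]) / len(samples[1000:])
```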


Source: http://xr0038.hatenadiary.jp/

slide-9
SLIDE 9

MCMC Sampling

  • MCMC sampling algorithms can be used to solve DCOPs [Nguyen et al., AAMAS 2013].
  • MCMC sampling algorithms can be used to solve the Maximum A Posteriori (MAP) estimation problem.
  • The authors provide a mapping from solving a DCOP to solving a MAP.
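The idea behind the reduction can be sketched as follows (a simplified illustration, not the exact construction of [Nguyen et al., 2013]; the toy instance is invented): utilities become unnormalized log-probabilities, so the MAP assignment of p(x) ∝ exp(F(x)) coincides with the utility-maximal DCOP solution.

```python
import math
from itertools import product

# Toy instance: F(x) is the sum of two invented constraint utilities.
domains = {"x1": [0, 1], "x2": [0, 1]}

def F(x):
    return (3 if x["x1"] != x["x2"] else 1) + (2 if x["x2"] == 1 else 0)

assignments = [dict(zip(domains, v)) for v in product(*domains.values())]

# Define p(x) proportional to exp(F(x)); exp is monotone, so
# argmax_x p(x) (the MAP) equals argmax_x F(x) (the DCOP optimum).
Z = sum(math.exp(F(x)) for x in assignments)
p = {tuple(x.values()): math.exp(F(x)) / Z for x in assignments}

map_x = max(assignments, key=lambda x: p[tuple(x.values())])
dcop_x = max(assignments, key=F)
```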


slide-10
SLIDE 10

Graphical Processing Units (GPUs)

  • A GPU is a massively parallel architecture:
  • Thousands of multi-threaded computing cores.
  • Very high memory bandwidths.
  • ~80% of transistors devoted to data processing rather than caching.
  • However:
  • GPU cores are slower than CPU cores.
  • GPU memories have different sizes and access times.
  • GPU programming is more challenging and time consuming.


slide-11
SLIDE 11

Execution Model

[Figure: CUDA execution model: the CPU launches kernels on the GPU; each kernel runs as a grid of blocks, and each block as a grid of threads.]

  • A thread is the basic parallel unit, identified by a thread ID.
  • Threads are organized into blocks.
  • Blocks are scheduled in parallel across several Streaming Multiprocessors (SMs).
  • Single Instruction Multiple Thread (SIMT) parallel model.


slide-12
SLIDE 12

[Figure: GPU memory hierarchy: per-thread registers; per-block shared memory; grid-wide global and constant memory; host memory on the CPU.]

Memory Hierarchy

  • The GPU memory architecture is rather involved.
  • Registers:
  • Fastest.
  • Only accessible by a thread.
  • Lifetime of a thread.
  • Shared memory:
  • Fast.
  • Accessible by all threads in a block.
  • Global memory:
  • High access latency.
  • Potential for traffic congestion.


slide-13
SLIDE 13

CUDA: Compute Unified Device Architecture


Host Device

slide-14
SLIDE 14

CUDA: Compute Unified Device Architecture


Host Device

cudaMalloc(&deviceV, sizeV);
cudaMemcpy(deviceV, hostV, sizeV, ...)

[Figure: data copied from the host into the device's global memory.]

slide-15
SLIDE 15

CUDA: Compute Unified Device Architecture


Host Device

cudaKernel<<<nBlocks, nThreads>>>( )

[Figure: kernel invocation; the kernel runs on the device and operates on data in global memory.]

slide-16
SLIDE 16

CUDA: Compute Unified Device Architecture


Host Device

cudaMemcpy(hostV, deviceV, sizeV, ...)

[Figure: results copied from the device's global memory back to the host.]

slide-17
SLIDE 17

D-MCMC: Related Work

  • Computing the normalizing constant can be expensive.
  • Many samples may be needed to converge.

Algorithm 1: Gibbs(z1, ..., zn)
1: for i = 1 to n do
2:     z(0)_i ← Initialize(z_i)
3: end
4: for t = 1 to T do
5:     for i = 1 to n do
6:         z(t)_i ← Sample( P(z_i | z(t)_1, ..., z(t)_{i-1}, z(t-1)_{i+1}, ..., z(t-1)_n) )
7:     end
8: end

[Nguyen et al. AAMAS-2013]

D-Gibbs Sampling
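The Gibbs pseudocode above can be sketched as a plain (centralized) sampler over a discrete joint distribution p(z) ∝ exp(F(z)); the toy utility F, the domains, and all names below are invented for illustration:

```python
import math
import random

def gibbs(domains, F, T, seed=0):
    """Gibbs sampling: repeatedly resample each variable from its
    conditional P(z_i | z_{-i}) ∝ exp(F(z)), holding the others fixed."""
    rng = random.Random(seed)
    names = list(domains)
    z = {n: rng.choice(domains[n]) for n in names}  # Initialize(z_i)
    samples = []
    for _ in range(T):
        for n in names:
            # Unnormalized conditional over the domain of variable n.
            weights = []
            for d in domains[n]:
                z[n] = d
                weights.append(math.exp(F(z)))
            total = sum(weights)  # normalizing constant (expensive in general)
            r = rng.random() * total
            acc = 0.0
            for d, w in zip(domains[n], weights):
                acc += w
                if r <= acc:
                    z[n] = d
                    break
        samples.append(dict(z))
    return samples

# Toy utility rewarding x1 == x2; most samples should agree.
domains = {"x1": [0, 1], "x2": [0, 1]}
toy_F = lambda z: 3.0 if z["x1"] == z["x2"] else 0.0
samples = gibbs(domains, toy_F, T=2000)
agree = sum(s["x1"] == s["x2"] for s in samples) / len(samples)
```

Under p(z) ∝ exp(F(z)) the agreeing states carry weight e³ each, so roughly e³/(e³+1) ≈ 95% of samples should agree, which the empirical fraction approaches.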

slide-18
SLIDE 18

DMCMC

  • Each agent controls several variables.
  • Given values for its boundary variables, each agent can solve its local sub-problem independently of the other agents.

[Figure: agents ai and aj, each with local variables (Li, Lj) and boundary variables (Bi, Bj) over x1–x10.]

slide-19
SLIDE 19

DMCMC

  • Each agent controls several variables.
  • Given values for its boundary variables, each agent finds a solution for its local sub-problem using MCMC algorithms: Gibbs sampling and Metropolis–Hastings.

[Figure: the local sub-problem of agent ai, with boundary variables Bi and local variables Li, and its joint utility table over x1–x5 with utility column U.]

slide-20
SLIDE 20

DMCMC: Local Sampling Process


3 Levels of Parallelism

(1) Each row of the joint utility table is computed in parallel, using several blocks [Fioretto et al., CP 2015].

[Figure: GPU blocks computing the rows of the joint utility table in parallel.]

slide-21
SLIDE 21

DMCMC: Local Sampling Process


3 Levels of Parallelism

(2) R multiple samples are generated in parallel.

[Figure: GPU blocks producing R samples of the joint utility table in parallel.]

slide-22
SLIDE 22

DMCMC: Local Sampling Process


3 Levels of Parallelism

(3) The Gibbs sampling process runs in parallel across the threads of each block, drawing each variable x_k from the conditional

q(x_k = d | x_l ∈ L_i \ {x_k}) = (1/Z_π) exp( Σ_{f_j ∈ F_i} f_j(z|x_{f_j}) )

[Figure: threads within each block executing the Gibbs sampling steps in parallel.]
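A small executable sketch of that conditional (the constraint representation and every name below are invented for illustration): each candidate value d for x_k is scored by exp of its summed local utilities, then normalized by the constant Z_π:

```python
import math

def local_conditional(k, z, domain, constraints):
    """q(x_k = d | other local variables fixed): a softmax over the
    summed utilities of the local constraints involving x_k.

    constraints: list of (scope, fn) pairs, where fn maps an
    assignment dict to a utility value.
    """
    scores = []
    for d in domain:
        trial = dict(z, **{k: d})  # fix x_k = d, keep the rest of z
        util = sum(fn(trial) for scope, fn in constraints if k in scope)
        scores.append(math.exp(util))
    Z = sum(scores)  # the normalizing constant Z_pi
    return [s / Z for s in scores]

# Toy local problem: one constraint rewarding x1 == x2.
cons = [(("x1", "x2"), lambda a: 2.0 if a["x1"] == a["x2"] else 0.0)]
q = local_conditional("x1", {"x2": 1}, [0, 1], cons)
```

Here q puts weight e²/(1+e²) ≈ 0.88 on the agreeing value, so a Gibbs step would usually set x1 = 1.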

slide-23
SLIDE 23

Algorithm design and data structure

  • Ensure data accesses are coalesced.
  • Minimize accesses to global memory.
  • Techniques: padding the utility tables' rows; perfect hashing.
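As an illustrative sketch of the row-padding idea (the alignment value and the function are hypothetical, not from the talk): each utility-table row is padded so that rows start at aligned offsets in a flat array, which is what enables coalesced access on a GPU:

```python
def pad_rows(table, align=4, fill=0):
    """Pad each row of a utility table to a multiple of `align` entries,
    so that every row starts at an aligned offset once the table is
    flattened into a single contiguous array."""
    # Ceiling of the widest row to the next multiple of `align`.
    width = -(-max(len(r) for r in table) // align) * align
    return [list(r) + [fill] * (width - len(r)) for r in table]

# Two uneven rows (values borrowed from the toy utility tables above);
# after padding to align=4, both rows occupy 8 slots in the flat array.
flat = [v for row in pad_rows([[3, 2, 5, 21], [1, 2, 1, 4, 20]]) for v in row]
```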

[Figure: coalesced (good) vs. non-coalesced (bad) memory access patterns.]

slide-24
SLIDE 24


Results

[Plots: solution quality (ratio) and simulated time (sec.) as a function of the number of samples S, for DPOP, Gibbs (CPU), Gibbs (GPU), MH (CPU), MH (GPU), and MGM2.]

  • Main results (random networks):
  • Runtime: GPU-MCMC algorithms are more than one order of magnitude faster than CPU-MCMC ones.
  • Quality: GPU-Gibbs dominates MGM2 for S > 100; GPU-MH solution quality is comparable to that of MGM2.

slide-25
SLIDE 25


Results

Main results:

  • Gibbs on GPU is up to two orders of magnitude faster than MGM(2) and finds solutions of higher quality.

Meeting Scheduling, S = 100, R = 10 (wct = wall-clock time, st = simulated time; each cell is wct / st / quality; oot = out of time):

Algorithm    |A|=5                   |A|=10                  |A|=25                  |A|=50
DPOP         125.39 / 94.98 / 1661   oot                     oot                     oot
MGM          7.435 / 0.435 / 1379    11.910 / 0.446 / 2766   24.211 / 0.417 / 6692   45.771 / 0.462 / 13802
MGM2         8.939 / 0.979 / 1389    23.903 / 1.526 / 2783   56.035 / 1.629 / 7116   112.54 / 1.788 / 14145
Gibbs (CPU)  6.146 / 1.101 / 1638    12.093 / 1.190 / 3319   31.031 / 1.347 / 8344   62.411 / 1.489 / 16577
Gibbs (GPU)  0.162 / 0.033 / 1635    0.301 / 0.034 / 3338    0.708 / 0.041 / 8344    1.416 / 0.048 / 16550
MH (CPU)     0.561 / 0.113 / 1131    1.091 / 0.121 / 2775    2.281 / 0.176 / 6921    3.921 / 0.185 / 12112
MH (GPU)     0.047 / 0.014 / 1143    0.102 / 0.016 / 2663    0.196 / 0.017 / 6925    0.360 / 0.022 / 11856

slide-26
SLIDE 26

Conclusions

  • Exploit GPU-style parallelism in DP-based DCOP resolution methods and MCMC sampling.
  • D-MCMC framework: decomposes a DCOP into independent sub-problems that can be sampled in parallel with GPUs.
  • D-MCMC with Gibbs produces high-quality solutions with runtimes up to two orders of magnitude faster than other state-of-the-art incomplete solvers.
  • Future work:
  • Exploit similar techniques to solve WCSPs.
  • Extend the proposed method using memory-bounded solutions.


slide-27
SLIDE 27

Thank you!

References

  • [1] D. T. Nguyen, W. Yeoh, and H. C. Lau, "Distributed Gibbs: A Memory-Bounded Sampling-Based DCOP Algorithm," AAMAS, 2013.
  • [2] F. Fioretto, T. Le, E. Pontelli, W. Yeoh, and T. Son, "Exploiting GPUs in Solving (Distributed) Constraint Optimization Problems with Dynamic Programming," CP, 2015.