

SLIDE 1

Exploring the Use of GPUs in Constraint Solving

A Preliminary Investigation

Federico Campeotto (1,2), Alessandro Dal Palù (3), Agostino Dovier (1), Ferdinando Fioretto (1,2), Enrico Pontelli (2)

  • 1. Università di Udine
  • 2. New Mexico State University
  • 3. Università di Parma

San Diego, CA, January 2014

SLIDE 2

Introduction

Introduction

Every new desktop/laptop comes equipped with a powerful graphics processing unit (GPU). These GPUs are general purpose, i.e., we can program them. For most of their life, however, they are absolutely idle (unless some kid is continuously playing with your PC). The question is: can we exploit this computational power for constraint solving? We present a preliminary investigation.

FD-ADP-AD-FF-EP (UD-NMSU-PR) Exploring the Use of GPUs in Constraint Solving 2 / 1

SLIDE 3

Introduction

Constraint Satisfaction Problems

A Constraint Satisfaction Problem (CSP) is defined by:

  • X = {x1, . . . , xn}, an n-tuple of variables;
  • D = {Dx1, . . . , Dxn}, the set of the variables' domains;
  • C, a finite set of constraints over X: a constraint c(xi1, . . . , xim) is a relation c(xi1, . . . , xim) ⊆ Dxi1 × · · · × Dxim.

A solution of a CSP is a tuple (s1, . . . , sn) ∈ Dx1 × · · · × Dxn such that, for each c(xi1, . . . , xim) ∈ C, we have (si1, . . . , sim) ∈ c.

CSP solvers alternate two steps:

  1. Labeling: select a variable and (non-deterministically) assign a value from its domain.
  2. Constraint propagation: propagate the assignment through the constraints, possibly detecting inconsistencies.
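The two alternating steps can be sketched in a few lines of Python. This is an illustrative toy (binary constraints given as sets of allowed pairs), not the solver presented in this talk:

```python
def solve(domains, constraints):
    """domains: {var: set of values}; constraints: {(x, y): set of allowed (a, b) pairs}."""

    # Constraint propagation: drop unsupported values until a fixpoint.
    def propagate(doms):
        doms = {v: set(d) for v, d in doms.items()}
        changed = True
        while changed:
            changed = False
            for (x, y), rel in constraints.items():
                keep_x = {a for a in doms[x] if any((a, b) in rel for b in doms[y])}
                keep_y = {b for b in doms[y] if any((a, b) in rel for a in doms[x])}
                if keep_x != doms[x] or keep_y != doms[y]:
                    doms[x], doms[y], changed = keep_x, keep_y, True
        # An empty domain signals an inconsistency.
        return doms if all(doms.values()) else None

    # Labeling: pick an unbound variable and try each of its values.
    def label(doms):
        doms = propagate(doms)
        if doms is None:
            return None
        if all(len(d) == 1 for d in doms.values()):
            return {v: next(iter(d)) for v, d in doms.items()}
        var = min((v for v in doms if len(doms[v]) > 1), key=lambda v: len(doms[v]))
        for val in sorted(doms[var]):
            sol = label({**doms, var: {val}})
            if sol is not None:
                return sol
        return None

    return label(domains)
```

For example, on x < y < z with domains {1, 2, 3}, propagation alone already narrows every domain to a single value.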

SLIDE 4

Introduction

Consistency techniques

Idea: replace the current CSP with a "simpler", yet equivalent, one.

Definition (Arc Consistency). The most common notion of local consistency is arc consistency (AC). Consider a binary constraint c ∈ C, where scp(c) = {xi, xj} and xi, xj ∈ X. We say that c is arc consistent if:

  • ∀a ∈ Dxi ∃b ∈ Dxj such that (a, b) ∈ c;
  • ∀b ∈ Dxj ∃a ∈ Dxi such that (a, b) ∈ c.

AC can be ensured by iteratively removing all the values of the variables involved in a constraint that are not consistent with it, until a fixpoint is reached. The propagation engine computes a mutual fixpoint of all the constraints. Several algorithms, based on a fixpoint loop, achieve (arc) consistency: AC3, AC4, AC6, etc.
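The AC fixpoint loop can be sketched in AC-3 style: revise one arc at a time, re-enqueueing the arcs that the pruning may affect. A minimal illustration, not the engine described in this talk:

```python
from collections import deque

def ac3(domains, constraints):
    """AC-3: revise arcs until a fixpoint. constraints: {(x, y): set of allowed pairs}."""
    doms = {v: set(d) for v, d in domains.items()}
    # Build a directed arc for each orientation of every constraint.
    arcs = {(x, y): rel for (x, y), rel in constraints.items()}
    arcs.update({(y, x): {(b, a) for (a, b) in rel}
                 for (x, y), rel in constraints.items()})
    queue = deque(arcs)
    while queue:
        x, y = queue.popleft()
        # Revise arc (x, y): drop values of x with no support in y.
        pruned = {a for a in doms[x]
                  if not any((a, b) in arcs[(x, y)] for b in doms[y])}
        if pruned:
            doms[x] -= pruned
            if not doms[x]:
                return None  # domain wipeout: inconsistency detected
            # The pruning may invalidate supports of x's other neighbors.
            queue.extend((z, w) for (z, w) in arcs if w == x and z != y)
    return doms
```

On x < y < z with domains {1, 2, 3}, the unique AC fixpoint is x = {1}, y = {2}, z = {3}.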

SLIDE 7

GPUs, in a few minutes / Why GPUs?

GPUs, in a few minutes

SLIDE 8

GPUs, in a few minutes / Compute Unified Device Architecture

GPUs, in a few minutes

A GPU is a parallel machine with many computing cores, with shared and local memories, able to schedule the execution of a large number of threads. However, things are not that easy: cores are organized hierarchically, memories have different behaviors, . . . it is not easy to obtain a good speed-up.

SLIDE 10

GPUs, in a few minutes / Compute Unified Device Architecture

CUDA: Host, Global, Device

[Figure-only slides: step-by-step diagrams of the CUDA host/global/device organization]

SLIDE 19

GPUs, in a few minutes / Compute Unified Device Architecture

CUDA: Memories

[Figure-only slide: the CUDA memory hierarchy]

SLIDE 20

GPUs, in a few minutes / Compute Unified Device Architecture

How to. . .

Can we perform propagation on GPGPUs? We will see a constraint engine that uses the GPU to propagate constraints in parallel. Several issues arise: memory accesses, slow GPU cores, data transfers, ... We discuss the different design choices and preliminary results.

SLIDE 21

Parallel CP

Parallel Constraint Solving: Parallel Consistency

Establishing arc consistency is P-complete. There are different parallel AC-based algorithms that can achieve a 3–4× speedup. Two main parallel strategies:

  1. parallel AC algorithms using shared memory;
  2. distributed AC algorithms.

We focus on a shared-memory AC algorithm.

SLIDE 22

Parallel CP

Parallel AC algorithm - 1

Parallel algorithms for solving node and (bound) arc consistency. Strategy: check consistency on all the arcs in the constraint queue simultaneously → O(nd) instead of O(ed³). We adopted three levels of parallelism:

  • Constraints: one parallel block for each constraint;
  • Variables: one parallel thread for each variable;
  • CPU for the efficient propagators, GPU for the expensive ones.
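The mapping can be imitated on the CPU with one task per constraint (standing in for a CUDA block), each revising its variables against the same snapshot of the domains, after which the prunings are intersected. A sketch only; the names and the thread-pool stand-in are made up for this illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_revise(domains, constraints):
    """One revision step: every constraint is revised simultaneously
    against the same snapshot of the domains, then the pruned domains
    are intersected."""
    snapshot = {v: frozenset(d) for v, d in domains.items()}

    def revise(arc):  # the "block" assigned to one constraint
        (x, y), rel = arc
        new_x = {a for a in snapshot[x]
                 if any((a, b) in rel for b in snapshot[y])}
        new_y = {b for b in snapshot[y]
                 if any((a, b) in rel for a in snapshot[x])}
        return {x: new_x, y: new_y}

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(revise, constraints.items()))

    # Combine the prunings produced by all "blocks".
    out = {v: set(d) for v, d in snapshot.items()}
    for pruned in results:
        for v, d in pruned.items():
            out[v] &= d
    return out
```

One such step on x < y with domains {1, 2, 3} prunes x to {1, 2} and y to {2, 3}; iterating the step yields the fixpoint.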

SLIDE 23

Parallel CP

Parallel AC algorithm - 2

[Figure-only slide: structure of the parallel AC algorithm]

SLIDE 24

Parallel CP

Parallel AC algorithm - 3

The constraint engine is based on the notion of events (it is not AC3!). An event is a change in the domain of a variable. The queue of propagators is updated accordingly...
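An event-driven engine can be sketched as follows: a propagator reports which variables it changed, and only the propagators subscribed to those variables are re-enqueued (rather than revising every constraint, as AC3 would). Illustrative code under those assumptions, not the actual engine:

```python
from collections import deque

def propagate(domains, propagators, subscriptions):
    """Event-driven fixpoint. propagators: callables mutating `domains` and
    returning the set of changed variables; subscriptions: {var: [prop ids]}."""
    queue = deque(range(len(propagators)))
    queued = set(queue)
    while queue:
        p = queue.popleft()
        queued.discard(p)
        changed = propagators[p](domains)       # each change is an "event"
        for var in changed:
            if not domains[var]:
                return None                     # wipeout: inconsistency
            for q in subscriptions.get(var, ()):
                if q != p and q not in queued:  # wake the subscribed propagators
                    queue.append(q)
                    queued.add(q)
    return domains

def less(x, y):
    """Example propagator enforcing x < y; returns the changed variables."""
    def run(doms):
        changed = set()
        nx = {a for a in doms[x] if any(a < b for b in doms[y])}
        ny = {b for b in doms[y] if any(a < b for a in doms[x])}
        if nx != doms[x]:
            doms[x] = nx; changed.add(x)
        if ny != doms[y]:
            doms[y] = ny; changed.add(y)
        return changed
    return run
```

On x < y < z with domains {1, 2, 3}, the engine reaches the same fixpoint as AC3, but each domain change wakes only the interested propagators.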

SLIDE 25

Parallel CP

Choices: Domain representation

The domain is represented as a bitset. Four extra words are used: (1) sign, (2) min, (3) max, and (4) event. The use of bit-wise operators on domains reduces the performance gap between the GPU cores and the CPU cores.
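A bitset domain with cached bounds and a last-raised event can be sketched as below (the sign word is omitted, and the class and event names are made up for this illustration, not taken from the solver):

```python
class BitsetDomain:
    """Domain {0..n-1} as a bitmask, plus cached min/max and the event
    raised by the last update."""
    def __init__(self, n):
        self.bits = (1 << n) - 1          # all n values present
        self.min, self.max = 0, n - 1
        self.event = "none"

    def __contains__(self, v):
        return (self.bits >> v) & 1 == 1

    def remove(self, v):
        if v not in self:
            return
        old_bounds = (self.min, self.max)
        self.bits &= ~(1 << v)            # bit-wise removal: a single AND
        if self.bits == 0:
            self.event = "fail"
            return
        self.min = (self.bits & -self.bits).bit_length() - 1  # lowest set bit
        self.max = self.bits.bit_length() - 1                 # highest set bit
        if self.bits & (self.bits - 1) == 0:
            self.event = "singleton"      # one value left
        elif (self.min, self.max) != old_bounds:
            self.event = "bounds"         # min or max moved
        else:
            self.event = "changed"        # an inner value was removed
```

The event field is what the engine of the previous slides reacts to when updating the propagator queue.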

SLIDE 26

Parallel CP

Choices: Status representation

The status of the computation is represented by a vector of M · |V| integer values, where M is a multiple of 32. We take advantage of the device cache, since global memory accesses are cached and served as 128-byte memory transactions. Memory accesses are coalesced: the accesses to contiguous locations in global memory are combined.
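A rough illustration of the layout (M, `word_index`, and the 3-variable vector are made up for this sketch): consecutive threads of a block read consecutive words of the same variable, i.e., contiguous addresses, which is exactly the pattern the hardware coalesces.

```python
M = 32  # status words per variable (a multiple of 32, as on the slide)

def word_index(var, word):
    """Index of the `word`-th status word of variable `var` in the flat
    M * |V| vector. Thread t reading word t of a variable touches the
    address right after thread t-1: 32 consecutive 4-byte words fit in
    one 128-byte transaction."""
    return var * M + word

status = [0] * (M * 3)  # status vector for |V| = 3 variables
# The 32 threads handling variable 1 read a contiguous window:
window = [word_index(1, t) for t in range(32)]
```

Strided or scattered indices, by contrast, would split the warp's accesses over several transactions.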

SLIDE 27

Parallel CP

Choices: Data transfers

[Figure-only slide: host–device data transfers]

SLIDE 28

Parallel CP

Choices: Propagators - 1

Standard language for modelling CP problems: MiniZinc/FlatZinc. FlatZinc is a low-level solver input language (translated from MiniZinc models). Our solver parses FlatZinc models. We implemented propagators for the FlatZinc constraints, plus specific propagators for some global constraints. Every propagator is implemented as a specific device function invoked by a single block.

SLIDE 29

Parallel CP

Choices: Propagators - 2

Intuitive example: the all_different constraint C on the variables x1, . . . , xn can be naively encoded as a quadratic number of binary ≠ constraints. It can instead be implemented by a set of n propagators p1, . . . , pn: once xi is assigned, pi takes care of the constraints xi ≠ xj, for j ≠ i.

Algorithm 1

1: if threadIdx ≠ i then
2:     xj ← scp(C)[threadIdx];
3:     Dxj[xi] ← 0;
4: end if
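Algorithm 1 can be simulated sequentially, with one loop iteration standing in for each thread of the block (a sketch of the idea, not the CUDA kernel itself; sets stand in for the bitset domains):

```python
def all_different_propagate(domains, scope, i):
    """Naive all_different propagation as in Algorithm 1: variable scope[i]
    has just been assigned, so its value is removed from every other domain."""
    value = next(iter(domains[scope[i]]))  # the value assigned to x_i
    for t in range(len(scope)):            # one "thread" per variable
        if t != i:                         # if threadIdx != i then ...
            xj = scope[t]                  #   x_j <- scp(C)[threadIdx]
            domains[xj].discard(value)     #   D_xj[x_i] <- 0
    return domains
```

On the GPU all iterations run at once, one per thread, with no write conflicts since each thread touches a different domain.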

SLIDE 30

Results

Results

[Plots: runtime (Sec.) vs. instance size, comparing CPU, GPU, Gecode, and JaCoP on three benchmarks: nQueens (N = 24–30), Schur (N = 40–43, B = 4), and Propagation Stress (M = 200–500, k = 10, n = 20)]

Host: AMD Opteron 270, 2.01 GHz, 4 GB RAM. Device: NVIDIA GeForce GTS 450, 192 cores (4 MPs), processor clock 1.566 GHz.

SLIDE 31

Results

Main drawbacks

Data transfers: many failures imply more backtracking and more copies between host and device. GPU memory latency and coalesced access patterns. Difference between the GPU clock and the CPU clock. We can partially reduce some of these issues using an upper-bound parameter: if the number of CPU propagators is higher than a given upper bound, they are all propagated on the GPU.

SLIDE 32

Results

Main drawbacks

We can improve performance using an upper-bound parameter: if the number of CPU propagators is higher than a given upper bound, they are all propagated on the GPU. This handles the cases where a large number of efficient propagators are assigned to the CPU, while they could take advantage of parallel propagation.

Example: Golomb ruler (runtime in seconds)

  CPU     UB = 0   UB = 100   UB = 500   UB = 1000   UB = 1500
  266.4   223.4    216.4      214.2      210.4       207.8

SLIDE 33

Results

Global constraints

A higher speedup can be achieved with expensive constraints: more parallel work to do! We considered two expensive global constraints:

  • the inverse constraint: inverse(x, y) holds if y is the inverse function of x (and vice versa);
  • the table constraint: it enforces that a tuple of variables takes its values from a given set of allowed tuples.
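Propagation for the table constraint can be sketched as follows: keep the tuples that are still valid under the current domains, then keep only the values supported by some remaining tuple. On the GPU the tuples can be checked in parallel; here a plain loop stands in for the threads (an illustration, not the solver's propagator):

```python
def table_propagate(domains, scope, tuples):
    """GAC on an extensional (table) constraint over `scope`."""
    # Keep the tuples that are valid w.r.t. the current domains.
    valid = [t for t in tuples
             if all(t[k] in domains[scope[k]] for k in range(len(scope)))]
    if not valid:
        return None                              # no support: inconsistency
    # Keep only domain values supported by some remaining tuple.
    for k, var in enumerate(scope):
        domains[var] &= {t[k] for t in valid}
    return domains
```

The expensive part, scanning the tuple set, is exactly what parallelizes well.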

SLIDE 34

Results

[Plot: runtime (Sec.) vs. N for CPU and GPU on the inverse constraint (N = 100–700)]

Table constraint results (runtime in seconds):

  Instance             CPU     GPU     Speedup
  CW-m1c-lex-vg4-6     0.015   0.005   3.00
  langford-2-50        44.06   15.16   2.94
  CW-m1c-uk-vg16-20    1.488   0.225   6.61
  ModRen 0             0.381   0.154   2.74
  CW-m1c-lex-vg7-7     209.4   43.87   4.77
  ModRen 49            0.317   0.117   2.74
  langford-2-40        136.4   46.39   2.90
  RD k5 n10 d10 m15    0.138   0.053   2.60


SLIDE 35

Results

Conclusions (or Start?)

The question was: can we exploit the computational power of GPGPUs for constraint solving? The answer is yes, but . . . First results are encouraging, especially for global constraints, so . . .

  • GPUs can be used for effective exploitation of parallelism in the case of domain-specific constraints with complex propagation strategies.
  • Parallel constraint propagation on GPGPUs should be used depending on the type of problem: real-world problems with complex constraints are good!
  • We experimented with an ad-hoc constraint-based implementation of protein structure prediction via fragment assembly, parallelized on GPUs using similar techniques, with excellent performance results (up to 50×).

Future work: combine parallel search with parallel constraint propagation.

SLIDE 42

Results

Conclusions (or Start?)

Meanwhile, using the GPU in the correct way:

[Figure-only slide]