GPU Parallel SubTree Interpreter for Genetic Programming
Alberto Cano and Sebastián Ventura
Knowledge Discovery and Intelligent Systems Research Group, University of Córdoba, Spain
Vancouver, Canada, July 12-16, 2014
Overview
- 1. Parallelization approaches for GP evaluation
- 2. Stack-based GP interpreter
- 3. Parallel SubTree interpreter
- 4. Experiments
- 5. Conclusions
- 6. Future work
Parallelization approaches for GP evaluation
- Data parallel
- GP run on multiple fitness cases (thousands, millions)
- GPU SIMD viewpoint
“Genetic Programming is embarrassingly parallel”
- Population parallel
- Multi-core CPUs (acceptable for small population sizes)
- Many-core GPUs (required for large population sizes)
Parallelization approaches for GP evaluation
- Population and data parallel
- 2D grid of threads for individuals and fitness cases
[Figure: 2D thread grid — one row per GP individual, one column per fitness-case datum]
- Performance hints:
- Warp: single GP individual run on 32 fitness cases
- GP individual in constant memory: single read - broadcast
- Data coalescence: transposed data matrix
[Figure: example GP expression tree — AND/OR internal nodes over attribute-value comparisons (AT1 > V1, AT2 < V2, AT3 > V3, AT4 < V4, AT5 > V5, AT6 < V6)]
Stack-based GP interpreter
- Postfix notation: expression is evaluated left-to-right
V6 AT6 < V5 AT5 > OR V4 AT4 < AND V3 AT3 > V2 AT2 < V1 AT1 > AND OR AND
- O(n) complexity
- 23 push and 22 pop operations
Stack-based GP interpreter
- Mixed prefix and postfix notation:
< AT6 V6 > AT5 V5 OR < AT4 V4 AND > AT3 V3 < AT2 V2 > AT1 V1 AND OR AND
- O(n) complexity
- 11 push and 10 pop operations
Parallel SubTree interpreter
- Computation of independent subtrees can be parallelized
- O(depth) complexity
- No stack depth needed
- Thread cooperation via shared memory
- Best performance on balanced trees
Full code at: http://www.uco.es/grupos/kdis/wiki/GPevaluation
Experiments
- GPU: GTX 780 donated by NVIDIA
- Comparison: population and data parallel vs subtree parallel
- Datasets: 15
- Population size: 32, 64, 128
- Tree size: 31, 63, 127
- Performance measure: GPops/s
- How do population size, tree size, and dataset size affect performance?
Performance (billion GPops/s); columns give tree size / population size.

Population and data parallel:

| Dataset | Instances | Atts | 31/32 | 31/64 | 31/128 | 63/32 | 63/64 | 63/128 | 127/32 | 127/64 | 127/128 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| fars | 100968 | 29 | 35.15 | 35.33 | 35.53 | 44.26 | 42.73 | 44.57 | 45.75 | 45.88 | 43.80 |
| glass | 214 | 9 | 8.08 | 14.49 | 19.76 | 10.97 | 19.15 | 24.20 | 13.08 | 23.36 | 27.33 |
| ionosphere | 351 | 33 | 11.55 | 15.57 | 23.44 | 15.18 | 19.35 | 27.08 | 18.36 | 20.70 | 29.21 |
| iris | 150 | 4 | 5.69 | 10.53 | 13.84 | 7.36 | 14.41 | 17.50 | 9.54 | 17.64 | 19.69 |
| kddcup | 494020 | 42 | 34.13 | 34.61 | 34.51 | 43.32 | 44.49 | 44.48 | 45.92 | 44.66 | 48.15 |
| pima | 768 | 8 | 22.26 | 29.02 | 34.67 | 29.70 | 34.07 | 42.17 | 36.84 | 43.10 | 48.34 |
| satimage | 6435 | 36 | 37.07 | 40.00 | 41.55 | 40.31 | 42.45 | 48.01 | 42.29 | 45.63 | 51.07 |
| shuttle | 58000 | 9 | 35.20 | 35.57 | 35.69 | 42.82 | 44.47 | 44.67 | 45.52 | 45.15 | 45.90 |
| texture | 5500 | 40 | 36.44 | 39.36 | 41.61 | 40.05 | 42.45 | 43.76 | 41.77 | 43.76 | 43.26 |
| vowel | 990 | 13 | 21.49 | 29.83 | 35.20 | 27.52 | 34.81 | 39.01 | 30.66 | 37.62 | 39.98 |

Subtree parallel:

| Dataset | Instances | Atts | 31/32 | 31/64 | 31/128 | 63/32 | 63/64 | 63/128 | 127/32 | 127/64 | 127/128 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| fars | 100968 | 29 | 45.63 | 45.87 | 43.59 | 51.03 | 51.22 | 51.28 | 49.88 | 49.93 | 49.99 |
| heart | 270 | 13 | 10.99 | 17.81 | 28.37 | 20.29 | 27.94 | 37.14 | 27.58 | 35.19 | 41.29 |
| ionosphere | 351 | 33 | 15.56 | 24.07 | 32.88 | 23.40 | 32.67 | 39.68 | 29.95 | 37.91 | 43.05 |
| iris | 150 | 4 | 8.21 | 14.24 | 20.93 | 13.70 | 20.87 | 29.18 | 20.07 | 28.53 | 36.19 |
| kddcup | 494020 | 42 | 45.89 | 44.92 | 45.94 | 50.96 | 50.92 | 51.11 | 49.79 | 50.78 | 50.88 |
| pima | 768 | 8 | 25.80 | 34.01 | 46.60 | 33.65 | 39.96 | 49.72 | 38.42 | 47.75 | 51.16 |
| satimage | 6435 | 36 | 41.03 | 43.82 | 45.55 | 47.28 | 49.11 | 55.44 | 47.90 | 54.00 | 54.64 |
| shuttle | 58000 | 9 | 45.36 | 43.12 | 43.16 | 48.27 | 51.01 | 51.20 | 49.77 | 49.87 | 49.88 |
| texture | 5500 | 40 | 39.90 | 43.56 | 45.19 | 46.50 | 48.87 | 45.77 | 47.40 | 48.70 | 49.30 |
| vowel | 990 | 13 | 28.83 | 36.22 | 42.52 | 35.86 | 41.93 | 38.30 | 40.57 | 44.55 | 47.06 |
Experiments
[Chart: GPops/s (billion) for both interpreters across datasets and configurations]
Experiments
- Performance variation when increasing population and tree size
Experiments
- Performance variation when increasing data and tree size
- Performance increases as soon as there are enough individuals, subtrees, or data to fill the GPU compute units
Conclusions
- Positive:
- Mixed prefix/postfix notation
- O(depth) complexity
- No stack depth needed
- Best for balanced trees
- The higher the tree density, the better the performance
- Negative:
- Inappropriate for extremely unbalanced trees
- Synchronization at each depth level
- The number of active threads is reduced at each level
- Limited by kernel size
- Limited by shared memory
Future work
- Performance analysis: balance, density, and branch factor
- Scalability to bigger trees
- CUDA dynamic parallelism
- A parent kernel can launch smaller nested kernels
- Kepler’s shuffle instruction to avoid shared memory
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GTX 780 GPU used for this research.
Alberto Cano and Sebastián Ventura Knowledge Discovery and Intelligent Systems Research Group University of Córdoba, Spain
GPU Parallel SubTree Interpreter for Genetic Programming
Vancouver, Canada, July 12-16, 2014
Alberto Cano acano@uco.es http://www.uco.es/users/i52caroa http://www.uco.es/grupos/kdis
Parallelization approaches for GP evaluation
- Association rule mining [3]
- Antecedent and consequent evaluated in parallel
- Concurrent kernels
- Pittsburgh-style encoding [1]
- Individuals represent variable-length rule sets
- 3D grid of threads for individuals, rules, and fitness cases
- Multi-instance classification [2]
- Examples represent sets of instances
1) A. Cano, A. Zafra, and S. Ventura. Parallel evaluation of Pittsburgh rule-based classifiers on GPUs. Neurocomputing, vol. 126, pages 45-57, 2014.
2) A. Cano, A. Zafra, and S. Ventura. Speeding up multiple instance learning classification rules on GPUs. Knowledge and Information Systems, in press, 2014.
3) A. Cano, A. Zafra, and S. Ventura. Parallel evaluation of Pittsburgh rule-based classifiers on GPUs. Neurocomputing, vol. 126, pages 45-57, 2014.