GPU Parallel SubTree Interpreter for Genetic Programming


SLIDE 1

Alberto Cano and Sebastián Ventura Knowledge Discovery and Intelligent Systems Research Group University of Córdoba, Spain

GPU Parallel SubTree Interpreter for Genetic Programming

Vancouver, Canada, July 12-16, 2014

SLIDE 2

Overview

  • 1. Parallelization approaches for GP evaluation
  • 2. Stack-based GP interpreter
  • 3. Parallel SubTree interpreter
  • 4. Experiments
  • 5. Conclusions
  • 6. Future work


SLIDE 3

Parallelization approaches for GP evaluation

“Genetic Programming is embarrassingly parallel”

  • Data parallel
  • GP run on multiple fitness cases (thousands, millions)
  • GPU SIMD viewpoint
  • Population parallel
  • Multi-core CPUs (acceptable for small population sizes)
  • Many-core GPUs (required for large population sizes)
SLIDE 4

Parallelization approaches for GP evaluation

  • Population and data parallel
  • 2D grid of threads for individuals and fitness cases

[Figure: 2D thread grid — one row of threads per GP individual crossed with one column per fitness case of the data matrix]

  • Performance hints:
  • Warp: single GP individual run on 32 fitness cases
  • GP individual in constant memory: single read - broadcast
  • Data coalescence: transposed data matrix
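The transposed-data hint can be sketched on the host side. A minimal C++ illustration (function name and layout are assumptions, not the paper's code): storing the dataset attribute-major means the 32 threads of a warp, which handle 32 consecutive fitness cases of one individual, read a given attribute from consecutive addresses, which coalesces on the GPU.

```cpp
#include <vector>
#include <cstddef>

// Transpose a case-major dataset (one fitness case per row) into the
// attribute-major layout suggested on the slide. With one warp covering
// 32 consecutive fitness cases of the same individual, all 32 threads
// then read attribute `a` from consecutive addresses.
std::vector<double> transposeForCoalescing(const std::vector<double>& data,
                                           std::size_t numCases,
                                           std::size_t numAttributes) {
    std::vector<double> dataT(data.size());
    for (std::size_t c = 0; c < numCases; ++c)
        for (std::size_t a = 0; a < numAttributes; ++a)
            // original index: case-major; transposed index: attribute-major
            dataT[a * numCases + c] = data[c * numAttributes + a];
    return dataT;
}
```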
SLIDE 5

Stack-based GP interpreter

[Figure: GP expression tree — AND/OR nodes combining comparisons (<, >) of attributes AT1–AT6 against values V1–V6]

  • Postfix notation: expression is evaluated left-to-right

V6 AT6 < V5 AT5 > OR V4 AT4 < AND V3 AT3 > V2 AT2 < V1 AT1 > AND OR AND


  • O(n) complexity
  • 23 push and 22 pop operations
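A minimal host-side sketch of such a stack-based postfix interpreter (token encoding and names are illustrative, not the paper's): the program is scanned left to right, operands are pushed, and every operator pops two values and pushes one result, which is where the 23 pushes and 22 pops come from.

```cpp
#include <vector>
#include <stack>

// Illustrative token encoding: PUSH carries a value; the rest are operators.
enum Op { PUSH, LT, GT, AND, OR };
struct Token { Op op; double value; }; // value used only with PUSH

// Evaluates a postfix expression left to right in O(n):
// operands are pushed; each operator pops two results and pushes one.
bool evalPostfix(const std::vector<Token>& program) {
    std::stack<double> s;
    for (const Token& t : program) {
        if (t.op == PUSH) { s.push(t.value); continue; }
        double b = s.top(); s.pop(); // right operand (pushed last)
        double a = s.top(); s.pop(); // left operand
        switch (t.op) {
            case LT:  s.push(a <  b); break;
            case GT:  s.push(a >  b); break;
            case AND: s.push(a != 0 && b != 0); break;
            default:  s.push(a != 0 || b != 0); break; // OR
        }
    }
    return s.top() != 0;
}
```

For example, the expression (3 < 5) OR (2 > 7) becomes the postfix stream 3 5 < 2 7 > OR and evaluates to true.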
SLIDE 6

Stack-based GP interpreter

  • Mixed prefix and postfix notation:

< AT6 V6 > AT5 V5 OR < AT4 V4 AND > AT3 V3 < AT2 V2 > AT1 V1 AND OR AND


  • O(n) complexity
  • 11 push and 10 pop operations
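The saving can be seen in a small sketch where the leaf comparisons arrive in prefix form (operator first, both operands inline) and only the AND/OR combinators stay postfix; each comparison then costs a single push of its boolean result instead of two operand pushes plus pops. The token layout here is an illustrative assumption, not the paper's encoding.

```cpp
#include <vector>
#include <stack>

// Mixed notation: comparisons are prefix and self-contained; AND/OR are postfix.
enum MTok { CMP_LT, CMP_GT, BOOL_AND, BOOL_OR };
struct MixedToken { MTok op; double lhs, rhs; }; // lhs/rhs only for CMP_*

bool evalMixed(const std::vector<MixedToken>& program) {
    std::stack<bool> s;
    for (const MixedToken& t : program) {
        switch (t.op) {
            // prefix comparison: operands travel with the operator,
            // so only the boolean result is pushed
            case CMP_LT: s.push(t.lhs < t.rhs); break;
            case CMP_GT: s.push(t.lhs > t.rhs); break;
            default: { // postfix AND/OR: pop two results, push one
                bool b = s.top(); s.pop();
                bool a = s.top(); s.pop();
                s.push(t.op == BOOL_AND ? (a && b) : (a || b));
            }
        }
    }
    return s.top();
}
```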

[Figure: GP expression tree — AND/OR nodes combining comparisons (<, >) of attributes AT1–AT6 against values V1–V6]

SLIDE 7

Parallel SubTree interpreter

[Figure: GP expression tree — AND/OR nodes combining comparisons (<, >) of attributes AT1–AT6 against values V1–V6]

  • Computation of independent subtrees can be parallelized


  • O(depth) complexity
  • No stack depth needed
  • Thread cooperation via shared memory
  • Best performance on balanced trees
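A sequential sketch of the level-synchronous idea, assuming a complete binary tree in heap layout (an illustrative encoding, not the paper's): each tree level maps to a batch of GPU threads, the per-level loop boundary stands in for a barrier, and the active-thread count halves per level, giving O(depth) parallel steps with no evaluation stack.

```cpp
#include <vector>
#include <cstddef>

// Heap layout: node i has children 2i+1 and 2i+2. Leaves (deepest level)
// hold boolean values; internal nodes hold their operator: 1 = AND, 0 = OR,
// overwritten in place by the node's result. On the GPU each node of a
// level is one thread and the level boundary is a barrier.
bool evalSubtreeParallel(std::vector<int>& nodes, std::size_t depth) {
    // process internal levels bottom-up; the leaf level needs no work
    for (std::size_t level = depth; level-- > 0; ) {
        std::size_t first = (std::size_t(1) << level) - 1; // first node of level
        std::size_t count = std::size_t(1) << level;       // "active threads"
        for (std::size_t i = first; i < first + count; ++i) {
            int a = nodes[2 * i + 1], b = nodes[2 * i + 2];
            nodes[i] = nodes[i] ? (a && b) : (a || b); // apply the node's op
        } // <- on the GPU: barrier before the next (shallower) level
    }
    return nodes[0] != 0; // root holds the result after O(depth) steps
}
```

This also makes the negatives visible: a barrier per level, and half the threads going idle at each step up the tree.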
SLIDE 8

Parallel SubTree interpreter

Full code at: http://www.uco.es/grupos/kdis/wiki/GPevaluation

SLIDE 9

Experiments

  • GPU: GTX 780 donated by NVIDIA
  • Comparison: population and data parallel vs subtree parallel
  • Datasets: 15
  • Population size: 32, 64, 128
  • Tree size: 31, 63, 127
  • Performance measure: GPops/s
  • How do population, tree, and dataset size affect performance?


SLIDE 10

Experiments — GPops/s (Billion)

Population and data parallel (columns: tree size × population size):

| Dataset    | Instances | Atts | 31×32 | 31×64 | 31×128 | 63×32 | 63×64 | 63×128 | 127×32 | 127×64 | 127×128 |
|------------|-----------|------|-------|-------|--------|-------|-------|--------|--------|--------|---------|
| fars       | 100968    | 29   | 35.15 | 35.33 | 35.53  | 44.26 | 42.73 | 44.57  | 45.75  | 45.88  | 43.80   |
| glass      | 214       | 9    | 8.08  | 14.49 | 19.76  | 10.97 | 19.15 | 24.20  | 13.08  | 23.36  | 27.33   |
| ionosphere | 351       | 33   | 11.55 | 15.57 | 23.44  | 15.18 | 19.35 | 27.08  | 18.36  | 20.70  | 29.21   |
| iris       | 150       | 4    | 5.69  | 10.53 | 13.84  | 7.36  | 14.41 | 17.50  | 9.54   | 17.64  | 19.69   |
| kddcup     | 494020    | 42   | 34.13 | 34.61 | 34.51  | 43.32 | 44.49 | 44.48  | 45.92  | 44.66  | 48.15   |
| pima       | 768       | 8    | 22.26 | 29.02 | 34.67  | 29.70 | 34.07 | 42.17  | 36.84  | 43.10  | 48.34   |
| satimage   | 6435      | 36   | 37.07 | 40.00 | 41.55  | 40.31 | 42.45 | 48.01  | 42.29  | 45.63  | 51.07   |
| shuttle    | 58000     | 9    | 35.20 | 35.57 | 35.69  | 42.82 | 44.47 | 44.67  | 45.52  | 45.15  | 45.90   |
| texture    | 5500      | 40   | 36.44 | 39.36 | 41.61  | 40.05 | 42.45 | 43.76  | 41.77  | 43.76  | 43.26   |
| vowel      | 990       | 13   | 21.49 | 29.83 | 35.20  | 27.52 | 34.81 | 39.01  | 30.66  | 37.62  | 39.98   |

Subtree parallel (columns: tree size × population size):

| Dataset    | Instances | Atts | 31×32 | 31×64 | 31×128 | 63×32 | 63×64 | 63×128 | 127×32 | 127×64 | 127×128 |
|------------|-----------|------|-------|-------|--------|-------|-------|--------|--------|--------|---------|
| fars       | 100968    | 29   | 45.63 | 45.87 | 43.59  | 51.03 | 51.22 | 51.28  | 49.88  | 49.93  | 49.99   |
| heart      | 270       | 13   | 10.99 | 17.81 | 28.37  | 20.29 | 27.94 | 37.14  | 27.58  | 35.19  | 41.29   |
| ionosphere | 351       | 33   | 15.56 | 24.07 | 32.88  | 23.40 | 32.67 | 39.68  | 29.95  | 37.91  | 43.05   |
| iris       | 150       | 4    | 8.21  | 14.24 | 20.93  | 13.70 | 20.87 | 29.18  | 20.07  | 28.53  | 36.19   |
| kddcup     | 494020    | 42   | 45.89 | 44.92 | 45.94  | 50.96 | 50.92 | 51.11  | 49.79  | 50.78  | 50.88   |
| pima       | 768       | 8    | 25.80 | 34.01 | 46.60  | 33.65 | 39.96 | 49.72  | 38.42  | 47.75  | 51.16   |
| satimage   | 6435      | 36   | 41.03 | 43.82 | 45.55  | 47.28 | 49.11 | 55.44  | 47.90  | 54.00  | 54.64   |
| shuttle    | 58000     | 9    | 45.36 | 43.12 | 43.16  | 48.27 | 51.01 | 51.20  | 49.77  | 49.87  | 49.88   |
| texture    | 5500      | 40   | 39.90 | 43.56 | 45.19  | 46.50 | 48.87 | 45.77  | 47.40  | 48.70  | 49.30   |
| vowel      | 990       | 13   | 28.83 | 36.22 | 42.52  | 35.86 | 41.93 | 38.30  | 40.57  | 44.55  | 47.06   |


SLIDE 11

Experiments


  • Performance variation when increasing population and tree size
SLIDE 12

Experiments


  • Performance variation when increasing data and tree size
  • Performance increases as soon as there are enough individuals, subtrees, or data to fill the GPU compute units

SLIDE 13

Conclusions

  • Positive:
  • Mixed prefix/postfix notation
  • O(depth) complexity
  • No stack depth needed
  • Best for balanced trees
  • The higher the tree density, the better the performance
  • Negative:
  • Inappropriate for extremely unbalanced trees
  • Synchronization at each depth level
  • The number of active threads is reduced at each level
  • Limited by kernel size
  • Limited by shared memory


SLIDE 14

Future work

  • Performance analysis: balance, density, and branch factor
  • Scalability to bigger trees
  • CUDA dynamic parallelism
  • A parent kernel can launch smaller nested kernels
  • Kepler’s shuffle instruction to avoid shared memory


We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GTX 780 GPU used for this research.

SLIDE 15

Alberto Cano and Sebastián Ventura Knowledge Discovery and Intelligent Systems Research Group University of Córdoba, Spain

GPU Parallel SubTree Interpreter for Genetic Programming

Vancouver, Canada, July 12-16, 2014

Alberto Cano acano@uco.es http://www.uco.es/users/i52caroa http://www.uco.es/grupos/kdis


SLIDE 16

GPU Parallel SubTree Interpreter for GP

SLIDE 17

GPU Parallel SubTree Interpreter for GP

SLIDE 18

Parallelization approaches for GP evaluation

  • Association rule mining [3]
  • Antecedent and consequent evaluated in parallel
  • Concurrent kernels
  • Pittsburgh-style encoding [1]
  • Individuals represent variable-length rule sets
  • 3D grid of threads for individuals, rules, and fitness cases
  • Multi-instance classification [2]
  • Examples represent sets of instances
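The 3D grid of the Pittsburgh encoding boils down to an index mapping: one thread per (individual, rule, fitness case) triple. A hypothetical host-side sketch (names and layout are assumptions; on the GPU the indices would come from blockIdx/threadIdx):

```cpp
#include <cstddef>

// One logical thread per (individual, rule, fitness case) triple.
struct GridIndex { std::size_t individual, rule, fitnessCase; };

// Recover the 3D coordinates from a flat thread id, assuming an
// individual-major, then rule, then fitness-case layout.
GridIndex unflatten(std::size_t tid, std::size_t numRules, std::size_t numCases) {
    return { tid / (numRules * numCases),   // which individual (rule-set)
             (tid / numCases) % numRules,   // which rule of that individual
             tid % numCases };              // which fitness case
}
```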

[1] A. Cano, A. Zafra, and S. Ventura. Parallel evaluation of Pittsburgh rule-based classifiers on GPUs. Neurocomputing, vol. 126, pages 45-57, 2014.
[2] A. Cano, A. Zafra, and S. Ventura. Speeding up multiple instance learning classification rules on GPUs. Knowledge and Information Systems, in press, 2014.
[3] A. Cano, A. Zafra, and S. Ventura. Parallel evaluation of Pittsburgh rule-based classifiers on GPUs. Neurocomputing, vol. 126, pages 45-57, 2014.