SLIDE 1

AUTOMATIC PARALLELISATION OF SOFTWARE USING GENETIC IMPROVEMENT

Bobby R. Bruce

SLIDE 8

INSPIRATION

Samsung Galaxy S7 (Mali-T880 MP12): 265.2 GFLOPs
Intel i7-2500K (overclocked to 5GHz): 70 GFLOPs
nVidia GTX 1060: 4327 GFLOPs

SLIDE 10

WHY DON’T WE UTILISE THIS POWERFUL HARDWARE?

Developers lack the skills
Hardware specialisation
Developers’ time is expensive; translating code to run on the GPU is expensive
Getting decent optimisation requires manual trial and error

An automated approach would be ideal

SLIDE 18

BACKGROUND: WHAT’S CURRENTLY AVAILABLE?

Domain: Automatic Parallelisation Compilers
Pros: Does not require any skills, or knowledge of, parallelisation
Cons: Only targets very specific loops where dependencies are fully understood

Domain: CUDA/OpenCL
Pros: When implemented well offers the best performance
Cons: Difficult to learn, harder to master. Very manual

Domain: Directive-based
Pros: Considerably easier to implement
Cons: Still requires some skill, practice, and trial and error

SLIDE 21

BACKGROUND: OPENACC

x20 Speed Up

SLIDE 29

OUR GOAL: AUTOMATICALLY ADD OPENACC DIRECTIVES

OPENACC_GI

[Diagram of the OPENACC_GI pipeline. Labels: SOURCE CODE, LEXICAL ANALYSER, PROGRAM DATA GRAMMAR, OPENACC GRAMMAR, GRAMMAR, CFG-GP, "Creates", Patch, FITNESS FUNCTION]

SLIDE 34

GRAMMAR

<start> ::= <base> | <base> <start>
<base> ::= "#pragma acc " <choice>
<choice> ::= "loop " <private> <loop_line_number>
<private> ::= "private(" <variables> ") " | " "
<variables> ::= <variable> | <variable> "," <variables>
<variable> ::= <variable_placeholder>
<variable_placeholder> ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" …
<loop_line_number> ::= "15@example1.c" | "145@example2.c"

Example sentence:

#pragma acc loop private(1,2) 15@example1.c

Example patch:

--- example1.c
+++ example1.c
@@ -15,0 +15,1 @@
+ #pragma acc loop private(x,y)
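The path from grammar to patch can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the OPENACC_GI implementation: the function names (`expand`, `derive_directives`, `to_patch`) and the `variables_at` lookup table are hypothetical, and the rule that placeholder k selects the k-th candidate variable at that loop (wrapping around) is an assumption. The slide only shows that `private(1,2)` became `private(x,y)`. The placeholder list stops at "7" because the slide's "…" is left unfilled.

```python
import random
import re

# Grammar from the slide: non-terminal -> list of alternative productions.
GRAMMAR = {
    "<start>": [["<base>"], ["<base>", "<start>"]],
    "<base>": [["#pragma acc ", "<choice>"]],
    "<choice>": [["loop ", "<private>", "<loop_line_number>"]],
    "<private>": [["private(", "<variables>", ") "], [" "]],
    "<variables>": [["<variable>"], ["<variable>", ",", "<variables>"]],
    "<variable>": [["<variable_placeholder>"]],
    "<variable_placeholder>": [[str(i)] for i in range(1, 8)],
    "<loop_line_number>": [["15@example1.c"], ["145@example2.c"]],
}


def expand(symbol, rng):
    """Randomly expand one grammar symbol into a string of terminals."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emitted as-is
    production = rng.choice(GRAMMAR[symbol])
    return "".join(expand(part, rng) for part in production)


def derive_directives(rng):
    """Expand <start> into a list of sentences, one per <base>."""
    directives = []
    symbol = "<start>"
    while symbol == "<start>":
        production = rng.choice(GRAMMAR["<start>"])
        directives.append(expand("<base>", rng))
        symbol = production[-1]  # "<start>" means another directive follows
    return directives


def to_patch(sentence, variables_at):
    """Turn one sentence into a patch in the hunk format shown on the slide.

    variables_at maps "line@file" to candidate variable names (hypothetical).
    """
    directive, location = sentence.rstrip().rsplit(" ", 1)
    line, filename = location.split("@")
    names = variables_at[location]

    def resolve(match):
        # Assumption: placeholder k picks the k-th candidate, wrapping around.
        return names[(int(match.group()) - 1) % len(names)]

    directive = re.sub(r"\d+", resolve, directive.strip())
    return (f"--- {filename}\n+++ {filename}\n"
            f"@@ -{line},0 +{line},1 @@\n+{directive}\n")


if __name__ == "__main__":
    rng = random.Random(0)
    table = {"15@example1.c": ["x", "y"], "145@example2.c": ["i", "tmp"]}
    for sentence in derive_directives(rng):
        print(to_patch(sentence, table))
```

Keeping the variable names out of the grammar and resolving placeholders per loop means one grammar can serve every loop in the program; how the real tool performs that resolution is not stated on the slides.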

SLIDE 35

INITIAL INVESTIGATION

Chose to run a very small example as a sanity check
nVidia provide an n-body simulation example already containing OpenACC directives
These directives were stripped for OPENACC_GI to replicate
Ran for 100 generations with a population of 100
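A search with those settings (100 generations, population of 100) can be sketched as a simple generational loop. This is a simplified stand-in, not the CFG-GP algorithm used by the tool: it uses truncation selection and mutation only, an individual is just the set of loops that receive a directive, and the cost tables are invented purely for illustration. A real fitness function would apply the patch, compile, run, and time the program, and those timings would be noisy.

```python
import random


def evolve(new_individual, mutate, fitness, pop_size=100, generations=100, seed=0):
    """Minimise fitness; return (best individual, best fitness per generation)."""
    rng = random.Random(seed)
    population = [new_individual(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]  # the fitter half survives unchanged
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return min(population, key=fitness), history


# Hypothetical cost model: time per loop without and with a directive.
# Some loops get slower when annotated, so "annotate everything" is not optimal.
SEQ_COST = [50, 30, 5, 2, 40, 1]
PAR_COST = [5, 3, 8, 6, 4, 9]
N_LOOPS = len(SEQ_COST)


def toy_fitness(individual):
    """Simulated execution time for a set of annotated loop indices."""
    return sum(PAR_COST[i] if i in individual else SEQ_COST[i]
               for i in range(N_LOOPS))


def new_individual(rng):
    """Random subset of loops to annotate."""
    return frozenset(i for i in range(N_LOOPS) if rng.random() < 0.5)


def mutate(individual, rng):
    """Toggle the directive on one randomly chosen loop."""
    return individual ^ {rng.randrange(N_LOOPS)}


if __name__ == "__main__":
    best, history = evolve(new_individual, mutate, toy_fitness)
    print(sorted(best), toy_fitness(best))
```

Because the fitter half is carried over unchanged, the best fitness recorded per generation never gets worse under this deterministic toy cost model.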

SLIDE 36

RESULTS

[Chart: execution time (ms) for sequential, original, and gi_best; axis ticks from 50 to 350 ms]

SLIDE 37

RESULTS

[Chart: execution time (ms) for original and gi_best; axis ticks from 11.6 to 13.0 ms]

SLIDE 38

RESULTS: OTHER NOTES

Seems like much of the gain is due to random search
We’d like to be able to beat human-written alternatives
This example is very small; future investigations will show how well the tool scales

[Chart: elite performance (ms) against generation; generation ticks from 20 to 100, performance ticks from 50 to 300 ms]

SLIDE 39

CURRENT/FUTURE WORK

Currently applying the tool to larger
At present can only work with C/C++; expanding code to work with FORTRAN

Possible improvements:

Seed initial generation with basic solutions
Introduce some clever profiling
Get working with OpenMP as well as OpenACC

SLIDE 40

ANY QUESTIONS?
