slide-1
SLIDE 1

AUTOMATIC PARALLELISATION OF SOFTWARE USING GENETIC IMPROVEMENT

Bobby R. Bruce

slide-7
SLIDE 7

INSPIRATION

  • Samsung Galaxy S7 (Mali-T880 MP12 GPU): 265.2 GFLOPs
  • Intel i7-2500K (overclocked to 5GHz): 70 GFLOPs

slide-8
SLIDE 8

INSPIRATION

  • Mali-T880 MP12: 265.2 GFLOPs
  • Intel i7-2500K (overclocked to 5GHz): 70 GFLOPs
  • nVidia GTX 1060: 4327 GFLOPs

slide-10
SLIDE 10

WHY DON’T WE UTILISE THIS POWERFUL HARDWARE?

  • Developers lack the skills
  • Hardware specialisation
  • Developers’ time is expensive; translating code to run on the GPU is expensive
  • Getting decent optimisation requires manual trial and error

An automated approach would be ideal

slide-18
SLIDE 18

BACKGROUND: WHAT’S CURRENTLY AVAILABLE?

Automatic Parallelisation Compilers
  • Pros: Does not require any skills in, or knowledge of, parallelisation
  • Cons: Only targets very specific loops where dependencies are fully understood

CUDA/OpenCL
  • Pros: When implemented well, offers the best performance
  • Cons: Difficult to learn, harder to master. Very manual

Directive-based
  • Pros: Considerably easier to implement
  • Cons: Still requires some skill, practice, and trial and error

slide-21
SLIDE 21

BACKGROUND: OPENACC

x20 Speed Up

slide-29
SLIDE 29

OUR GOAL: AUTOMATICALLY ADD OPENACC DIRECTIVES

OPENACC_GI

[Diagram components: SOURCE CODE, LEXICAL ANALYSER, PROGRAM DATA GRAMMAR, OPENACC GRAMMAR, GRAMMAR, CFG-GP (creates Patch), Patch, FITNESS FUNCTION]
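
The slides do not say how the fitness function scores a patch. Since the results are reported as execution time in milliseconds, one plausible reading is that fitness is the measured run time of the patched program, with failing patches penalised. A minimal sketch under that assumption follows; `fitness`, `run_patched`, and `repeats` are hypothetical names, not part of OPENACC_GI.

```python
import math
import time


def fitness(run_patched, repeats=3):
    """Return the best wall-clock time in ms over `repeats` runs (lower is better).

    `run_patched` is a hypothetical callable that builds and runs the patched
    program and raises an exception if the build fails or the output is wrong.
    """
    best = math.inf
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            run_patched()
        except Exception:
            # Assumption: a patch that breaks the program gets the worst score.
            return math.inf
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best
```

Taking the minimum over a few runs is one common way to reduce timing noise; the talk may have used a different statistic.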

slide-32
SLIDE 32

GRAMMAR

<start> ::= <base> | <base> <start>
<base> ::= "#pragma acc " <choice>
<choice> ::= "loop " <private> <loop_line_number>
<private> ::= "private(" <variables> ") " | " "
<variables> ::= <variable> | <variable> "," <variables>
<variable> ::= <variable_placeholder>
<variable_placeholder> ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" …
<loop_line_number> ::= "15@example1.c" | "145@example2.c"

Example derivation:

#pragma acc loop private(1,2) 15@example1.c
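
To make the grammar concrete, here is a minimal sketch of how a sentence could be derived from it by picking production alternatives at random. It is an illustration, not the CFG-GP implementation from the talk: `GRAMMAR`, `derive`, `random_directive`, and the `max_depth` cut-off are all assumptions, and only the seven placeholders shown on the slide are included (the slide elides the rest with "…").

```python
import random

# The grammar from the slide, as a dict from non-terminal to alternatives.
# Quoted terminals are stored without their quotes.
GRAMMAR = {
    "<start>": [["<base>"], ["<base>", "<start>"]],
    "<base>": [["#pragma acc ", "<choice>"]],
    "<choice>": [["loop ", "<private>", "<loop_line_number>"]],
    "<private>": [["private(", "<variables>", ") "], [" "]],
    "<variables>": [["<variable>"], ["<variable>", ",", "<variables>"]],
    "<variable>": [["<variable_placeholder>"]],
    "<variable_placeholder>": [[str(n)] for n in range(1, 8)],
    "<loop_line_number>": [["15@example1.c"], ["145@example2.c"]],
}


def derive(symbol, rng, depth=0, max_depth=8):
    """Expand `symbol` into a list of terminal strings by random rule choice."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit as-is
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Assumption: past the depth limit, always take the first alternative,
        # which is never self-recursive in this grammar, so expansion stops.
        alternatives = alternatives[:1]
    out = []
    for part in rng.choice(alternatives):
        out.extend(derive(part, rng, depth + 1, max_depth))
    return out


def random_directive(rng):
    """Derive one <base> sentence, i.e. a single directive with its target."""
    return "".join(derive("<base>", rng))
```

Calling `random_directive(random.Random())` yields strings of the same shape as the example derivation, with a `private(...)` clause present or absent depending on the alternative chosen.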

slide-34
SLIDE 34

GRAMMAR

--- example1.c
+++ example1.c
@@ -15,0 +15,1 @@
+ #pragma acc loop private(x,y)
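
The step from a derived sentence such as `#pragma acc loop private(1,2) 15@example1.c` to the patch above can be sketched as follows. This assumes the numeric placeholders are looked up in a table of variable names; `directive_to_patch` and `variable_names` are hypothetical, standing in for whatever the lexical analyser actually reports. The hunk header simply mirrors the one shown on the slide.

```python
import re


def directive_to_patch(directive, variable_names):
    """Turn a derived directive into a unified-diff hunk like the one on the slide.

    `variable_names` maps placeholder numbers to identifiers; it is a
    hypothetical stand-in for the output of the lexical analyser.
    """
    # Split "<pragma text> <line>@<file>" into its three parts.
    match = re.fullmatch(r"(.*?)\s*(\d+)@(\S+)", directive)
    if match is None:
        raise ValueError(f"not a derived directive: {directive!r}")
    pragma, line, filename = match.group(1), int(match.group(2)), match.group(3)

    # Replace each numeric placeholder inside private(...) with its variable name.
    def resolve(m):
        names = [variable_names[int(n)] for n in m.group(1).split(",")]
        return "private(" + ",".join(names) + ")"

    pragma = re.sub(r"private\(([\d,]+)\)", resolve, pragma)
    return (
        f"--- {filename}\n"
        f"+++ {filename}\n"
        f"@@ -{line},0 +{line},1 @@\n"
        f"+ {pragma}\n"
    )
```

For example, `directive_to_patch("#pragma acc loop private(1,2) 15@example1.c", {1: "x", 2: "y"})` produces the four patch lines shown on the slide.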

slide-35
SLIDE 35

INITIAL INVESTIGATION

  • Chose to run a very small example as a sanity check
  • nVidia provide an n-body simulation example already containing OpenACC directives
  • These directives were stripped for OPENACC_GI to replicate
  • Ran for 100 generations with a population of 100
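
The slide gives only the run size (100 generations, population of 100), not the search operators. As a rough illustration of what such a run looks like, here is a simplified elitist loop with mutation only. It is not CFG-GP as used in the talk: `evolve`, `random_individual`, `mutate`, `fitness`, and `elite_fraction` are all hypothetical.

```python
import random


def evolve(random_individual, mutate, fitness, rng,
           population_size=100, generations=100, elite_fraction=0.1):
    """Minimise `fitness` with a simple elitist loop; returns (best, best_score).

    All three callables are hypothetical stand-ins: `random_individual(rng)`
    creates a candidate patch, `mutate(ind, rng)` returns a modified copy, and
    `fitness(ind)` returns execution time in ms (lower is better).
    """
    population = [random_individual(rng) for _ in range(population_size)]
    n_elite = max(1, int(population_size * elite_fraction))
    for _ in range(generations):
        scored = sorted(population, key=fitness)
        elite = scored[:n_elite]
        # Keep the elite unchanged and refill the rest with mutated elite members.
        children = [mutate(rng.choice(elite), rng)
                    for _ in range(population_size - n_elite)]
        population = elite + children
    best = min(population, key=fitness)
    return best, fitness(best)
```

Because the elite is carried over unchanged, the best score never gets worse from one generation to the next. In a real run each fitness evaluation means compiling and timing a program, so scores would normally be cached rather than recomputed.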

slide-36
SLIDE 36

RESULTS

[Figure: execution time (ms) for sequential, original, and gi_best; axis from 50 to 350 ms]

slide-37
SLIDE 37

RESULTS

[Figure: execution time (ms) for original and gi_best; axis from 11.6 to 13.0 ms]

slide-38
SLIDE 38

RESULTS: OTHER NOTES

  • Seems like much of the gain is due to random search
  • We’d like to be able to beat human-written alternatives
  • This example is very small; future investigations will show how well the tool scales

[Figure: elite performance (ms) against generation]

slide-39
SLIDE 39

CURRENT/FUTURE WORK

  • Currently applying the tool to larger
  • At present the tool only works with C/C++; expanding the code to work with FORTRAN

Possible Improvements:

  • Seed initial generation with basic solutions
  • Introduce some clever profiling
  • Get working with OpenMP as well as OpenACC

slide-40
SLIDE 40

ANY QUESTIONS?

[Figures: execution time (ms) for sequential, original, and optimal; elite performance (ms) against generation; execution time (ms) for original and optimal]