TMBL Kernels for CUDA GPUs Compile Faster Using PTX Tony E Lewis - - PowerPoint PPT Presentation



SLIDE 1

TMBL Kernels for CUDA GPUs Compile Faster Using PTX

Tony E Lewis George D Magoulas

SLIDE 2

Two Major Approaches to GPU Acceleration of GP

  • Data parallel: compile new GPU code for each new batch
  • Population parallel: write one GPU interpreter to process all batches

SLIDE 3

The Aim of the Work: To Minimise the Weakness of Data-parallel

  • Data parallel. Evaluation: very fast. Compilation: long.
  • Population parallel. Evaluation: fast. Compilation: none.

SLIDE 4

The Problem: Compilation Stops Small Datasets Getting Top Speed

SLIDE 5

Two Strategies to Ease Load for Compiler; This Talk is about the First

  • 1. PTX

Write the individuals in a lower level language

  • 2. Alignment

Exploit similarities between individuals
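
The first strategy, writing individuals directly in PTX, can be pictured as a small code generator. The following is a minimal sketch, not the authors' generator: the `Op` enum, the `Instr` struct and the `emit_ptx` function are hypothetical names introduced here, and only the `%slotN` register naming and the `add.f32` / `sub.f32` / `mul.f32` mnemonics are taken from the PTX example in this talk.

```c
#include <stdio.h>

/* Hypothetical opcode set, for illustration only. */
typedef enum { OP_ADD, OP_SUB, OP_MUL } Op;

/* One instruction of an individual: dst is modified using src. */
typedef struct {
    Op  op;
    int dst;  /* index of the slot register that is read and written */
    int src;  /* index of the slot register that is only read */
} Instr;

/* Write one PTX statement, e.g. "add.f32 %slot4, %slot4, %slot3;", into buf.
   Returns the value of snprintf, or -1 for an unknown opcode. */
int emit_ptx(const Instr *in, char *buf, size_t len)
{
    const char *mnemonic;

    switch (in->op) {
    case OP_ADD: mnemonic = "add.f32"; break;
    case OP_SUB: mnemonic = "sub.f32"; break;
    case OP_MUL: mnemonic = "mul.f32"; break;
    default:     return -1;
    }
    /* "%%" prints a literal '%', which prefixes PTX register names. */
    return snprintf(buf, len, "%s %%slot%d, %%slot%d, %%slot%d;",
                    mnemonic, in->dst, in->dst, in->src);
}
```

Calling `emit_ptx` on an `Instr` of `{ OP_ADD, 4, 3 }` fills the buffer with `add.f32 %slot4, %slot4, %slot3;`. A full kernel would also need register declarations, parameter loads and stores, which this sketch leaves out.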

SLIDE 6

Compilation Creates a GPU-ready Binary from C Source Code

SLIDE 7

Compilation Uses Two Slow Steps; This Work Eliminates the First

SLIDE 8

Compilation Uses Two Slow Steps; This Work Eliminates the First

SLIDE 9

PTX is a Bit Like Assembly

C Example

    slot0 = -1.64101672f;
    slot4 += slot3;
    slot1 -= testcase0;
    slot0 *= slot3;
    slot2 = ( (slot3 == 0.0f) ? 0.0f : slot2/slot3 );

PTX Example

    mov.f32      %slot0, 0fBFD20CD6;
    add.f32      %slot4, %slot4, %slot3;
    sub.f32      %slot1, %slot1, %testcase0;
    mul.f32      %slot0, %slot0, %slot3;
    div.full.f32 %slot2, %slot2, %slot3;
    setp.eq.f32  %divPred, %slot3, 0f00000000;
    selp.f32     %slot2, 0f00000000, %slot2, %divPred;
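
The constant `0fBFD20CD6` in the PTX example is the C constant `-1.64101672f` written as its 32-bit IEEE 754 bit pattern: PTX accepts single-precision literals as `0f` followed by eight hexadecimal digits. A minimal sketch of the conversion follows; the function name `ptx_float_literal` is hypothetical, and the code assumes `float` is a 32-bit IEEE 754 type.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a float as a PTX hexadecimal single-precision literal:
   "0f" followed by the eight hex digits of its bit pattern.
   Assumes float is 32-bit IEEE 754. Returns the value of snprintf. */
int ptx_float_literal(float value, char *buf, size_t len)
{
    uint32_t bits;

    /* memcpy reinterprets the bytes without breaking aliasing rules. */
    memcpy(&bits, &value, sizeof bits);
    return snprintf(buf, len, "0f%08lX", (unsigned long)bits);
}
```

Passing `-1.64101672f` yields `0fBFD20CD6`, and `0.0f` yields `0f00000000`, matching the two literals in the PTX example.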

SLIDE 10

Take a Step Back: What is the Reason For Doing This Work?

SLIDE 11

Take a Step Back: What is the Reason For Doing This Work?

Long Term Fitness Growth

SLIDE 12

Thought Experiment:

SLIDE 13

Thought Experiment: Toy Blocks

SLIDE 14

Thought Experiment: A Tower of Blocks

SLIDE 15

The Same Problem Is Faced by a GP Tree

SLIDE 16

How Can We Encourage Long Term Fitness Growth?

SLIDE 17

How Can We Encourage Long Term Fitness Growth?

Encourage tweaks: mutations that can easily change behaviour without ruining existing functionality

SLIDE 18

A Representation to Encourage Tweaks

  • Linear form, not node-based
  • Registers, not stack
  • Iterated execution, not point of execution
  • Instructions that modify, not overwrite
  • Long programs
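
These properties can be illustrated with a very small interpreter. This is a rough sketch under stated assumptions, not the TMBL implementation: `Op`, `Instr` and `run_linear` are hypothetical names, and "iterated execution" is read here as running the whole instruction list several times over the same registers.

```c
#include <stddef.h>

/* Hypothetical opcode set, for illustration only. */
typedef enum { OP_ADD, OP_SUB, OP_MUL } Op;

/* Linear form: a program is a flat array of these, not a tree. */
typedef struct {
    Op  op;
    int dst;  /* register that is modified in place */
    int src;  /* register that is only read */
} Instr;

/* Run the n-instruction program `iterations` times over the registers.
   Every instruction modifies dst from its current value rather than
   overwriting it with an unrelated result. */
void run_linear(const Instr *prog, size_t n, float *reg, int iterations)
{
    for (int it = 0; it < iterations; ++it) {
        for (size_t i = 0; i < n; ++i) {
            const Instr *in = &prog[i];

            switch (in->op) {
            case OP_ADD: reg[in->dst] += reg[in->src]; break;
            case OP_SUB: reg[in->dst] -= reg[in->src]; break;
            case OP_MUL: reg[in->dst] *= reg[in->src]; break;
            }
        }
    }
}
```

With registers `{1, 2}` and the single instruction `{ OP_ADD, 0, 1 }`, three iterations leave register 0 holding 7. Inserting or removing one instruction nudges the result instead of replacing a whole subtree, which is the kind of tweak the slide describes.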

SLIDE 19

The Result: TMBL

Tony E Lewis, George D Magoulas. Tweaking a Tower of Blocks Leads to a TMBL: Pursuing Long Term Fitness Growth in Program Evolution. IEEE Congress on Evolutionary Computation (CEC), 2010, pages 4465-4472.

takesatmbl.wordpress.com

SLIDE 20

PTX is a Bit Like Assembly

C Example

    slot0 = -1.64101672f;
    slot4 += slot3;
    slot1 -= testcase0;
    slot0 *= slot3;
    slot2 = ( (slot3 == 0.0f) ? 0.0f : slot2/slot3 );

PTX Example

    mov.f32      %slot0, 0fBFD20CD6;
    add.f32      %slot4, %slot4, %slot3;
    sub.f32      %slot1, %slot1, %testcase0;
    mul.f32      %slot0, %slot0, %slot3;
    div.full.f32 %slot2, %slot2, %slot3;
    setp.eq.f32  %divPred, %slot3, 0f00000000;
    selp.f32     %slot2, 0f00000000, %slot2, %divPred;

SLIDE 21

...but PTX isn't Exactly Like Assembly

Doesn't directly correspond with the resulting binary

  • E.g. many registers get compiled to few
SLIDE 22

Will PTX Code Evaluate Slower?

  • Maybe Yes: Competing with the CUDA compiler's developers
  • Maybe No: We know our code better than the compiler does:
      • Can guarantee non-divergent branches
      • Can use non-divergent instructions (a=b?c:d)
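
The non-divergent form (a=b?c:d) is what the `div.full.f32` / `setp.eq.f32` / `selp.f32` sequence in the PTX example does for protected division: compute the quotient unconditionally, then select between it and zero. The same shape can be sketched in C. The function name `protected_div` is hypothetical, the code assumes IEEE 754 floating point (so dividing by zero gives infinity or NaN rather than trapping), and a C compiler remains free to emit a branch for it, which is exactly the control that hand-written PTX gives back.

```c
/* Protected division in select form: both candidate results exist
   before the choice is made, so no branch is needed. */
float protected_div(float num, float den)
{
    float quotient = num / den;       /* like div.full.f32: always computed */
    int   is_zero  = (den == 0.0f);   /* like setp.eq.f32: set a predicate */

    return is_zero ? 0.0f : quotient; /* like selp.f32: pick one value */
}
```

For example, `protected_div(6.0f, 3.0f)` gives 2, while `protected_div(5.0f, 0.0f)` discards the infinite quotient and gives 0.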

SLIDE 23

Results: Load time is small

SLIDE 24

Results: Evaluation Speed is Improved

SLIDE 25

Results: Compile Time is Considerably Reduced (~5.8x)

SLIDE 26

Conclusions

  • Complexity
  • Maintainability
  • Effectiveness
  • Possibility of going further

SLIDE 27

Thanks

  • EPSRC
  • Reviewers
  • You