Genetic Improvement of GPU Code
  1. Genetic improvement of GPU code
     Jhe-Yu (Jerry) Liou, Stephanie Forrest, Carole-Jean Wu
     Computer Science and Engineering, Biodesign Institute
     Arizona State University, Tempe, AZ

  2. Motivation
     • GPUs are the de facto co-processor for computation-intensive
       applications: deep learning, image processing, protein folding, …
     • GPU programs are often poorly optimized
     • Optimization requires both architecture and domain expertise
     • The C++-like programming interface encourages novice programmers
     [Figure: schematic of a GPU as a grid of cores.]

  3. Approach: Use Genetic Programming to find optimizations
     • GPU programs are usually small, but critical to performance
       • The search space is smaller
       • Any improvement can be significant
     • Many GPU applications are error-tolerant
       • More resilient to the program transformations from GP
       • Error can be co-optimized along with performance
         (multi-objective: error vs. runtime)

  4. Outline
     • Motivation
     • Proposed Design – GEVO
     • Experimental Setup
     • Results and Analysis
     • Conclusion

  5. Compilation flow of GPU programs
     A CUDA source file mixes host code with device code. The host code is
     pure C/C++ (cudaInit, cudaMemoryCopy, kernel launches such as
     kernel<<<…>>>(a, b)); the device code consists of __global__ kernels.
     The device code is lowered to LLVM IR and then to Nvidia PTX before
     being linked into the application binary. GEVO – Gpu EVOlve – operates
     on the device code at the LLVM IR stage.
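
     As a rough illustration of this flow (not from the slides): clang's CUDA
     support can lower just the device side of a .cu file to textual LLVM IR,
     the representation GEVO manipulates. The file name and GPU architecture
     below are placeholders.

       # Sketch: lower CUDA device code to LLVM IR with clang.
       # The file name and architecture are placeholder assumptions.
       import subprocess

       def cuda_to_llvm_ir(source_file="kernel.cu", gpu_arch="sm_60"):
           """Compile only the device (GPU) side of a CUDA file to LLVM IR."""
           out_file = source_file.replace(".cu", ".ll")
           subprocess.run(
               ["clang++", "-S", "-emit-llvm",       # emit textual LLVM IR
                "--cuda-device-only",                # skip host-side compilation
                f"--cuda-gpu-arch={gpu_arch}",       # target GPU architecture
                source_file, "-o", out_file],
               check=True)
           return out_file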

  6. Overview of Gpu EVOlution (GEVO)
     • Input: GPU kernel code, the host application, test cases, and a
       fitness function
     • GEVO evolves a population of kernel variants through a loop of
       evaluation, selection, mutation, and crossover
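
     A minimal sketch of the evaluation step, assuming hypothetical helpers
     compile_and_run() and output_error(); GEVO's actual harness is more
     involved. Fitness is the (runtime, error) pair that selection minimizes.

       def evaluate(individual, test_cases):
           """Fitness of a kernel variant: (total runtime, max error)."""
           total_runtime, max_error = 0.0, 0.0
           for case in test_cases:
               result = compile_and_run(individual, case)  # hypothetical helper
               if result is None:                          # crash or invalid IR
                   return (float("inf"), float("inf"))     # worst fitness
               total_runtime += result.runtime
               max_error = max(max_error,
                               output_error(result.output, case.expected))
           return (total_runtime, max_error)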

  7. Selection
     • Multi-objective selection over (runtime, error)
     • NSGA-II: Non-dominated Sorting Genetic Algorithm [1]
     • Combines dominance and crowding distance for ranking
     [Figure: individuals A, B, and C on an error-vs-runtime plot, ranked
     from high to low by dominance.]

     [1] Deb et al., "A fast and elitist multiobjective genetic algorithm:
     NSGA-II," IEEE Transactions on Evolutionary Computation, 2002.
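
     A compact sketch of the two NSGA-II ingredients named above, with
     fitness as a (runtime, error) tuple where both objectives are minimized;
     this is a textbook rendering, not GEVO's code.

       def dominates(a, b):
           """a dominates b: no worse in every objective, better in one."""
           return (all(x <= y for x, y in zip(a, b))
                   and any(x < y for x, y in zip(a, b)))

       def non_dominated_fronts(fitnesses):
           """Partition indices into fronts: front 0 = Pareto-optimal, etc."""
           remaining = set(range(len(fitnesses)))
           fronts = []
           while remaining:
               front = {i for i in remaining
                        if not any(dominates(fitnesses[j], fitnesses[i])
                                   for j in remaining if j != i)}
               fronts.append(sorted(front))
               remaining -= front
           return fronts

       def crowding_distance(front, fitnesses):
           """Spread measure used to break ties within one front."""
           dist = {i: 0.0 for i in front}
           for m in range(len(fitnesses[front[0]])):
               ordered = sorted(front, key=lambda i: fitnesses[i][m])
               dist[ordered[0]] = dist[ordered[-1]] = float("inf")  # extremes
               span = fitnesses[ordered[-1]][m] - fitnesses[ordered[0]][m] or 1.0
               for k in range(1, len(ordered) - 1):
                   dist[ordered[k]] += (fitnesses[ordered[k + 1]][m]
                                        - fitnesses[ordered[k - 1]][m]) / span
           return dist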

  8. Mutation
     • Copy, delete, move, replace, and swap instructions/operands
     • Mutations often break LLVM syntax and require repairs. In the slide's
       example, the instruction %4 = mul float, %3, 1.0 is copied as %4i and
       inserted near the top of the function; the repair then connects its
       input (%3 is not yet defined at that point) to an available value and
       applies its output %4i to a later use:

       Function(int %0)
       %1 = load int, %0
       %4i = mul float, %3, 1.0    ; copied instruction, input/output repaired
       %2 = add int, %1, %1
       %3 = conv float int %2
       %4 = mul float, %3, 1.0
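
     A toy sketch of copy-with-repair on a list of pseudo-instructions; the
     (dest, op, operands) format is invented for illustration and is not
     GEVO's actual LLVM pass.

       import random

       # Each toy instruction is (dest, op, operands), a stand-in for LLVM IR.
       def copy_and_repair(instructions):
           """Copy a random instruction to a random point, then repair its
           operands so they only reference values defined above that point."""
           src = random.randrange(len(instructions))
           dest_pos = random.randrange(len(instructions) + 1)
           dest, op, operands = instructions[src]
           defined = [d for d, _, _ in instructions[:dest_pos]]  # in scope
           repaired = tuple(o if (o in defined or not o.startswith("%"))
                            else random.choice(defined or ["%undef"])
                            for o in operands)
           new = (dest + "i", op, repaired)      # fresh name, like %4 -> %4i
           return instructions[:dest_pos] + [new] + instructions[dest_pos:]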

  9. Individual representation
     • An individual is the original LLVM-IR plus a patch: an ordered list of
       mutation operations, e.g., Copy 3,4; Move 9,3; Del 4
     • Crossover and mutation operate on the patch
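
     A sketch of this representation in Python; parse_ir() and apply_edit()
     are hypothetical helpers standing in for GEVO's LLVM machinery.

       from dataclasses import dataclass, field
       from typing import List, Tuple

       @dataclass
       class Individual:
           base_ir: str                                 # unmodified kernel IR
           patch: List[Tuple] = field(default_factory=list)  # e.g. ("copy", 3, 4)

           def materialize(self):
               """Reapply every edit to the base IR to get the kernel variant."""
               ir = parse_ir(self.base_ir)              # hypothetical helper
               for edit in self.patch:
                   ir = apply_edit(ir, edit)            # hypothetical helper
               return ir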

  10. Crossover
      • Uses the patch-based representation
      • The two parents' patch lists are combined, reordered and shuffled,
        then separated at a random point into two new patch lists; reapplying
        each list to the original kernel yields two new kernel variants
      [Figure: the patch lists of two individuals (copy, delete, move, swap
      operations) being combined, shuffled, split, and reapplied.]
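
      The same operation as a short Python sketch over two patch lists (a
      plausible rendering of the diagram, not GEVO's exact code):

        import random

        def patch_crossover(patch1, patch2):
            """Combine both parents' patches, shuffle, split at a random point."""
            combined = patch1 + patch2
            random.shuffle(combined)                 # combine, reorder & shuffle
            cut = random.randrange(len(combined) + 1)
            return combined[:cut], combined[cut:]    # random-point separation

      Each child patch list is then reapplied to the original kernel to
      produce a new variant.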

  11. Outline
      • Motivation
      • Proposed Design – GEVO
      • Experimental Setup
      • Results and Analysis
      • Conclusion

  12. Experimental Setup
      • Platform
        • GPU: Nvidia P100
        • Driver: CUDA 9.2 with Nvidia driver 410
        • CUDA kernel compiler: Clang/LLVM 8.0
      • GEVO parameters
        • Population size: 256
        • Crossover rate: 80%
        • Mutation rate: 30%
        • Search time: 48 hours (20 – 100 generations)

  13. Benchmarks
      • Rodinia benchmark suite (GPGPU, 13 applications) [2]: Bfs, B+tree,
        Particle filter, Stream cluster, …
        • Error metric: max raw output difference
        • Test suite: built-in data generator
        • Post-optimization validation: held-out tests
      • ML workloads trained using ThunderSVM [3]: MNIST, a9a, MNIST large
        • Error metric: model training error
        • Test suite: training datasets
        • Post-optimization validation: testing datasets

      [2] S. Che et al., "Rodinia: A benchmark suite for heterogeneous
      computing," IISWC 2009.
      [3] Z. Wen et al., "ThunderSVM: A Fast SVM Library on GPUs and CPUs,"
      JMLR 2018.

  14. Outline
      • Motivation
      • Proposed Design – GEVO
      • Experimental Setup
      • Results and Analysis
        • Rodinia benchmark suite
        • ML workloads trained under ThunderSVM
      • Conclusion

  15. GEVO results – Rodinia
      [Figure: performance improvement per benchmark (0% – 50%) under two
      settings: no error tolerance, and accepting up to 1% error.]

  16. Temporal analysis – hotspot (epistasis)
      [Figure: normalized runtime and maximum error rate (%) of the best
      individual over 60 generations.]
      1. A sub-optimal individual can serve as a stepping stone toward a
         better combination of optimizations.
      2. This implies that error tolerance can be used to circumvent barriers
         and reach otherwise inaccessible regions of the program space.
      • Observed 3 key mutations that introduce a 0.3 error rate
        individually, but incur only a 0.1 error rate when combined.

  17. Optimization analysis – removing a redundant store (LU decomposition)
      All three versions share the same prologue:

        __shared__ s[BLOCK];
        int c = CONST;
        int tid = ThreadId.x;
        for (i = 0; i < 16; i++)
          s[tid] = init(tid);
        __syncthread();

      (a) Unmodified:

        for (i = 0; i < 16; i++)
          s[tid] = s[tid] - s[i]*s[i];
        s[tid] = s[tid] / c;
        __syncthread();

      (b) Post-compilation (the compiler keeps s[tid] in the register temp,
      but still stores it back on every iteration):

        float temp = s[tid];
        for (i = 0; i < 16; i++) {
          temp = temp - s[i]*s[i];
          s[tid] = temp; }
        s[tid] = temp / c;
        __syncthread();

      (c) GEVO optimized (the per-iteration store is removed from the loop):

        float temp = s[tid];
        for (i = 0; i < 16; i++)
          temp = temp - s[i]*s[i];
        s[tid] = temp;
        s[tid] = temp / c;
        __syncthread();

      • Interpretation: the GPU executes the load instructions without
        waiting for the outstanding store to finish, which renders the
        per-iteration store redundant.

  18. Representative Rodinia optimizations
      Architecture-specific:
      • Removing redundant synchronization primitives: Hotspot,
        LU decomposition
      • Removing redundant stores: LU decomposition, Hotspot
      Application-specific:
      • Removing conditional execution: Hotspot, LU decomposition,
        Particle filter, Needleman-Wunsch
      • Loop perforation: Stream cluster, LU decomposition
      • Memoization: Hotspot

  19. GEVO results – ML workloads in ThunderSVM
      [Figure: Pareto frontiers of prediction error (%) vs. runtime (s)
      against the unmodified baseline; on MNIST the best variant is 3.24x
      faster and 0.17% more accurate, on a9a 2.93x faster and 0.04% more
      accurate.]
      • GEVO variants supersede the baseline in both objectives!
      • The same prediction-error trend holds on the testing dataset
      • 10x training-time reduction on the MNIST large dataset (1182 minutes
        to 121 minutes), with nearly the same training accuracy (100% to
        99.997%)

  20. Optimization analysis – terminating the loop earlier (MNIST)
      • Sequential minimal optimization iteratively improves the solution
        until progress slows down
      • GEVO changes the termination condition so the loop exits earlier
      • The accuracy is not affected by this change
      • This might only be applicable to particular types of datasets

        while (1) {
          // select f Up
          if (is_I_up(...))
            f_val_reduce[tid] = f;
          up_val = f_val_reduce[...];

          // select f Low
          if (is_I_low(...))
            // f_val_reduce[tid] = -f;    // original
            f_val_reduce[tid] = 1 - f;    // GEVO-modified
          down_val = f_val_reduce[...];

          if (up_val - down_val < epsilon)
            break;
        }

  21. Conclusion
      • GEVO finds 3 classes of optimization:
        • Architecture-specific
        • Application-specific
        • Dataset-specific
      • Machine learning is a promising GEVO target
        • Error-tolerant
        • Expensive training times
        • Currently experimenting with deep learning frameworks
      • Multi-objective search allows GEVO to find stepping stones for
        exploring a larger program space

  22. Thanks for Your Attention!
      Genetic improvement of GPU code
      Jhe-Yu (Jerry) Liou, Stephanie Forrest, Carole-Jean Wu
      Computer Science and Engineering, Biodesign Institute
      Arizona State University, Tempe, AZ

  23. Main loop of GEVO

        POP_SIZE = 256
        CROSS_RATE = 0.8
        MUTATE_RATE = 0.3

        # Initialization
        pop = Initialization(POP_SIZE, PROGRAM)
        for all individual from pop:
            Mutate(individual) * 3
        rank = NonDominateSorting(pop)

        while true:
            # Selection
            offspring = SelTournament(pop, rank, POP_SIZE)
            elites = SelBest(pop, rank, POP_SIZE/4)

            # Crossover & Mutation
            for every 2 individuals (ind1, ind2) from offspring:
                if random(0,1) < CROSS_RATE:
                    Crossover(ind1, ind2)
            for every ind from offspring:
                if random(0,1) < MUTATE_RATE:
                    Mutate(ind)

            # Elitism
            rank = NonDominateSorting(elites + offspring)
            pop = SelBest(elites + offspring, rank, POP_SIZE)
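
      A runnable pure-Python rendering of this pseudocode; new_individual(),
      rank_by_nsga2(), and tournament_select() are hypothetical helpers
      (rank_by_nsga2 could combine the non-dominated sorting and
      crowding-distance sketches from slide 7), and the mutate/crossover
      arguments are the operators sketched on slides 8 and 10.

        import random

        POP_SIZE, CROSS_RATE, MUTATE_RATE = 256, 0.8, 0.3

        def gevo_main_loop(program, test_cases, mutate, crossover,
                           generations=20):
            # hypothetical helper: wrap the program as (LLVM-IR, empty patch)
            pop = [new_individual(program) for _ in range(POP_SIZE)]
            for ind in pop:
                for _ in range(3):                 # "Mutate(individual) * 3"
                    mutate(ind)
            for _ in range(generations):
                # hypothetical helper: evaluate, sort best-first by NSGA-II
                ranked = rank_by_nsga2(pop, test_cases)
                offspring = tournament_select(ranked, POP_SIZE)  # hypothetical
                elites = ranked[:POP_SIZE // 4]
                for ind1, ind2 in zip(offspring[::2], offspring[1::2]):
                    if random.random() < CROSS_RATE:
                        crossover(ind1, ind2)
                for ind in offspring:
                    if random.random() < MUTATE_RATE:
                        mutate(ind)
                # Elitism: keep the best POP_SIZE of elites plus offspring
                pop = rank_by_nsga2(elites + offspring, test_cases)[:POP_SIZE]
            return pop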
