Algorithms for Auto-tuning OpenACC Accelerated Kernels

Algorithms for Auto-tuning OpenACC Accelerated Kernels
Fatemah Al-Zayer (1), Ameerah Al-Mutiry (1), Mona Al-Shahrani (1), Saber Feki (2), and David Keyes (1)
(1) Extreme Computing Research Center, (2) KAUST Supercomputing Laboratory,
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
GTC 2016, San Jose, CA, April 6th, 2016
GPU Technology Conference 2016

Outline
• Motivation
• Auto-Tuning Methodology
• Performance Results
• Toward a Vector Model
• Preliminary Results
• Conclusion and Future Work

Motivation
• Tuning loop execution is an important step in optimizing the performance of OpenACC-accelerated code.
  – Gang (num_gangs) and vector (vector_length) clauses
  – Tile
  – Collapse
• Manually tuning these parameters can be tedious and time-consuming.
Auto-tuning

Tuning Time!
• Brute-force search is very time-consuming.
➢ Historic learning approach
➢ Faster derivative-free search algorithms:
  – Random search
  – Simulated annealing
  – Genetic algorithm
  – Nelder-Mead
➢ A hybrid solution combining search algorithms with historic learning
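To make the derivative-free searches listed above concrete, here is a minimal sketch of random search and simulated annealing over a (gang, vector) grid. It is an illustration under stated assumptions, not the authors' implementation: `synthetic_runtime` is a hypothetical stand-in for timing a real kernel, and the grid, neighborhood sizes, temperature, and cooling rate are arbitrary choices. The optional `start` argument shows one plausible way a previously tuned configuration could seed the search, which is only a guess at how historic learning might be combined with a search algorithm.

```python
import math
import random

# Assumed candidate grid: vector lengths in multiples of 32,
# even gang counts from 2 to 1024.
VECTORS = list(range(32, 1025, 32))
GANGS = list(range(2, 1025, 2))


def synthetic_runtime(gang, vector):
    """Hypothetical stand-in for timing a kernel; minimum at (256, 128)."""
    return 1.0 + ((gang - 256) / 1024) ** 2 + ((vector - 128) / 1024) ** 2


def random_search(cost, trials, rng):
    """Evaluate `trials` random configurations and keep the fastest."""
    best = None
    for _ in range(trials):
        cfg = (rng.choice(GANGS), rng.choice(VECTORS))
        t = cost(*cfg)
        if best is None or t < best[1]:
            best = (cfg, t)
    return best


def simulated_annealing(cost, steps, rng, start=None, temp=0.05, cooling=0.97):
    """Walk between neighboring grid points, sometimes accepting worse ones."""
    if start is None:
        start = (rng.randrange(len(GANGS)), rng.randrange(len(VECTORS)))
    gi, vi = start  # indices into GANGS and VECTORS
    current = cost(GANGS[gi], VECTORS[vi])
    best = ((GANGS[gi], VECTORS[vi]), current)
    for _ in range(steps):
        # Propose a nearby grid point, clamped to the grid boundaries.
        ngi = min(max(gi + rng.randint(-8, 8), 0), len(GANGS) - 1)
        nvi = min(max(vi + rng.randint(-2, 2), 0), len(VECTORS) - 1)
        t = cost(GANGS[ngi], VECTORS[nvi])
        # Always accept improvements; accept regressions with a
        # probability that shrinks as the temperature cools.
        if t < current or rng.random() < math.exp((current - t) / temp):
            gi, vi, current = ngi, nvi, t
            if current < best[1]:
                best = ((GANGS[gi], VECTORS[vi]), current)
        temp *= cooling
    return best
```

Both functions return a `((gang, vector), runtime)` pair. In a real tuner the cost function would compile and time the kernel, so the number of cost evaluations, not the search logic, dominates tuning time.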

Automatic Performance Tuning Method
[Figure: overview of the automatic performance tuning method]

Seismic Imaging Kernel
• Solves the acoustic wave equation (isotropic case)
• Finite-difference scheme: 2nd order in time, 4th or 8th order in space
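As a rough illustration of this kind of scheme, the sketch below applies a leapfrog update (2nd order in time) with the standard 4th-order central-difference weights for the second derivative. It is deliberately simplified and is not the kernel from the talk: it is one-dimensional rather than three-dimensional, uses a constant velocity, covers only the 4th-order case, and holds boundary points fixed. The function names are illustrative.

```python
# Standard central-difference weights for the second derivative,
# fourth-order accurate, for offsets -2, -1, 0, +1, +2.
C4 = [-1 / 12, 4 / 3, -5 / 2, 4 / 3, -1 / 12]


def second_derivative(u, i, h):
    """Fourth-order approximation of u'' at interior index i, grid spacing h."""
    return sum(c * u[i + k - 2] for k, c in enumerate(C4)) / (h * h)


def time_step(u_prev, u_curr, velocity, dt, h):
    """One leapfrog update (second order in time) of the 1D wave equation."""
    u_next = list(u_curr)  # the two points at each boundary are held fixed
    for i in range(2, len(u_curr) - 2):
        lap = second_derivative(u_curr, i, h)
        u_next[i] = 2 * u_curr[i] - u_prev[i] + (velocity * dt) ** 2 * lap
    return u_next
```

In a 3D version the same stencil is applied along each axis inside a triple loop nest, which is the loop structure whose gang and vector mapping the auto-tuner would adjust.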

Seismic Imaging Kernel: CPU Version

Seismic Imaging Kernel: OpenACC Implementation

Experimental Results
• Seismic kernel solving the acoustic wave equation with a finite-difference scheme, 8th and 4th order in space.
• Performance is reported on NVIDIA K20 and K40 GPUs.
• NVIDIA recommends that the vector size be a multiple of the warp size (32), so we explored the values [32, 64, 96, …, 1024].
• Gang values were tested in increments of 2, from 2 up to 1024.
• We report performance and tuning time for the different search algorithms, alone and in combination with historic learning.
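The size of the brute-force search space follows directly from these ranges. The short sketch below enumerates it, reading "increments of 2" as an additive step (an assumption; the slide does not say more). The variable names are illustrative.

```python
from itertools import product

WARP_SIZE = 32

# Vector lengths: multiples of the warp size, 32, 64, ..., 1024.
vector_values = list(range(WARP_SIZE, 1024 + 1, WARP_SIZE))
# Gang counts: 2, 4, ..., 1024 (additive step of 2, as assumed above).
gang_values = list(range(2, 1024 + 1, 2))

# Every (gang, vector) pair a brute-force search would have to time.
search_space = list(product(gang_values, vector_values))

print(len(vector_values), len(gang_values), len(search_space))
# prints: 32 512 16384
```

Under that reading, brute force times 16,384 configurations per problem size, which is why the cheaper search algorithms matter.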

K20, 8th order: speedup
[Figure: speedup (axis from 1.00 to 2.80) versus problem size for Brute Force, Random Walk, Simulated Annealing, Nelder-Mead, Genetic Algorithm 1, and Genetic Algorithm 2]

K20, 8th order: tuning time
[Figure: tuning time in seconds (axis from 0.1 to 10000) versus problem size for Brute Force, Random Walk, Simulated Annealing, Nelder-Mead, Genetic Algorithm 1, and Genetic Algorithm 2]

K20, 4th order: speedup
[Figure: speedup (axis from 1.00 to 2.60) versus problem size for Brute Force, Random Walk, Simulated Annealing, Nelder-Mead, Genetic Algorithm 1, and Genetic Algorithm 2]

K20, 4th order: tuning time
[Figure: tuning time in seconds (axis from 0.1 to 10000) versus problem size for Brute Force, Random Walk, Simulated Annealing, Nelder-Mead, Genetic Algorithm 1, and Genetic Algorithm 2]

K40, 8th order: speedup
[Figure: speedup (axis from 1.00 to 3.50) versus problem size for Brute Force, Random Walk, Simulated Annealing, Nelder-Mead, Genetic Algorithm 1, and Genetic Algorithm 2]

K40, 8th order: tuning time
[Figure: tuning time in seconds (axis from 0.1 to 10000) versus problem size for Brute Force, Random Walk, Simulated Annealing, Nelder-Mead, Genetic Algorithm 1, and Genetic Algorithm 2]

K20, 8th order: speedup with historical learning
[Figure: speedup (axis from 1.00 to 2.80) for problem sizes 80x140x275, 1500x150x15, 120x83x402, 98x418x392, and 288x288x288, comparing Brute Force, Historic Learning and Brute Force, Historic Learning and Random Walk, Historic Learning and Nelder-Mead, and Historic Learning and Genetic Algorithm]

K20, 8th order: tuning time with historical learning
[Figure: tuning time in seconds (axis from 0.1 to 10000) for problem sizes 80x140x275, 1500x150x15, 120x83x402, 98x418x392, and 288x288x288, comparing Brute Force, Historic Learning and Brute Force, Historic Learning and Random Walk, Historic Learning and Nelder-Mead, and Historic Learning and Genetic Algorithm]

Model for Gang and Vector?
• Can we provide the compiler with a better model for selecting the best gang and/or vector values?
• For a given:
  – Three-dimensional problem size
  – Application: e.g. 8th order vs. 4th order
  – GPU specification: K20 (K40 results: work in progress)
• Correlations between the problem size and the best values for the gang and vector parameters. Which dimensions?

Vector as a Function of the Z Dimension: 8th order on K20
[Figure: best vector value (axis from 0 to 1200) versus Z dimension (from 30 to 1024)]

Vector Models as a Function of the Z Dimension
[Figure: two panels of vector value versus Z dimension, one covering Z from 30 to 402 and the other Z from 478 to 1024. One panel compares BF Vector with M1 Vector and M2 Vector; the other compares BF Vector with M3 Vector and M4 Vector.]

Performance Speedup (I)
• Best value: average speedup over all problem sizes using the auto-tuned pair (g_AT, v_AT) in comparison to the compiler's pair (g_c, v_c).
• We report the average speedup when using the model value for vector, v_M, together with:
  – Computed value of gang: (g_c * v_c / v_M, v_M)
  – Auto-tuned value of gang: (g_AT, v_M)
  – Compiler value of gang: (g_c, v_M)
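The computed gang value keeps the product of gang and vector close to the compiler's choice, g_c * v_c. A minimal sketch of that arithmetic and of the averaged speedup follows. Two details are assumptions not stated on the slide: rounding the gang count to an integer of at least 1, and averaging with the arithmetic mean. The function names are illustrative.

```python
def computed_gang(g_c, v_c, v_m):
    """Gang count that keeps gang * vector close to the compiler's g_c * v_c."""
    # Assumption: round to the nearest integer and never go below 1;
    # the slide only gives the expression g_c * v_c / v_M.
    return max(1, round(g_c * v_c / v_m))


def speedup(baseline_time, tuned_time):
    """Speedup of a tuned configuration over the compiler's default."""
    return baseline_time / tuned_time


def average_speedup(baseline_times, tuned_times):
    """Arithmetic mean of the per-problem-size speedups (assumed averaging)."""
    ratios = [speedup(b, t) for b, t in zip(baseline_times, tuned_times)]
    return sum(ratios) / len(ratios)
```

For example, if the compiler picks 256 gangs with a vector length of 128 and the model suggests a vector length of 256, `computed_gang(256, 128, 256)` gives 128 gangs.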
