

SLIDE 1

BUILDING HIGH PERFORMANCE INPUT-ADAPTIVE GPU APPLICATIONS WITH NITRO

Saurav Muralidharan, University of Utah (nitro-tuner.github.io)

SLIDE 2

Disclaimers

• This research was funded in part by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

• This research was funded by DARPA contract HR0011-13-3-0001.

• Co-authors of this paper own stock in NVIDIA Corporation.

SLIDE 3

Motivation

• Some computations may have many implementations
  • Examples: BFS, SpMV, solvers, sort, etc.
• Performance of an implementation may depend on the input and the architecture
• The set of implementations constitutes a 'search space'
• The best implementation may not be known until runtime
• This talk describes a framework that tries to dynamically select the best implementation

SLIDE 4

Sparse Matrix-Vector Multiplication

• Sparse matrices are represented using many formats
• Example formats: Compressed Sparse Row (CSR), DIA, etc.
• Optimized implementations exist for each format
  • Each exploits as much structure of the matrix as possible
• Running example: SpMV implementations in the CUSP library

[Figure: example sparse matrix layouts for the DIA, ELL, and CSR-VEC formats]
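To make the format trade-offs concrete, here is a minimal sketch (not from the slides) of the CSR representation and a CSR-based SpMV, in plain Python; the matrix and all names are illustrative:

# CSR (Compressed Sparse Row) sketch of the 3x3 matrix
#   [10  0  0]
#   [ 0 20 30]
#   [ 0  0 40]
values  = [10, 20, 30, 40]   # nonzero entries, stored row by row
col_idx = [0, 1, 2, 2]       # column index of each nonzero
row_ptr = [0, 1, 3, 4]       # row i spans values[row_ptr[i]:row_ptr[i+1]]

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A * x for a CSR matrix; illustrative, not CUSP's implementation."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

print(spmv_csr(values, col_idx, row_ptr, [1, 1, 1]))  # [10, 50, 40]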

SLIDE 5

Input Dependence in SpMV

SLIDE 6

Autotuning Systems

• Navigate a search space of:
  • Parameters
  • Implementations, a.k.a. 'code variants'
• Objective: find the best 'point' in the search space
  • According to some optimization criterion, usually performance
• Why autotuning?

SLIDE 7

Tuning Code Variants

• Parameter tuning systems
• Can we tune variants using parameter tuning systems?
• How do we 'prune' the search space?
  • Most information is known only at runtime
• Do we run a search heuristic on every execution of the program?
• We need some sort of 'model' or mapping

[Figure: a parameter tuning system runs a search heuristic over a search space of param_1 and param_2 and returns a single point, e.g. param_1: 5.0, param_2: 3.5]

SLIDE 8

Nitro: Introduction

What is Nitro?

• A programmer-directed code variant tuning framework
• Infers a mapping from input features to variant labels, and uses that mapping to select variants at runtime
• Goal: provide a general productivity tool for experts
  • Both library and application developers

Some Terminology

• Model: maps input features to a variant label
• Feature: a characteristic or property of the input data
• Constraint: a check that prevents execution of an invalid variant

SLIDE 9

Tuning Process Overview

[Figure: tuning process overview. Training inputs, a C++ library driver, and a Python tuning script feed the Nitro tuning subsystem (feature evaluator, constraint evaluator, active learner, and classifier). The Nitro library registers the user library's (my_lib) SpMV variants (CSR_VEC, DIA, ELL, ...) along with features F1..Fj and constraints C1..Ck; the subsystem's output is a set of trained models.]

SLIDE 10

Nitro Library

SpMV (...) CSR_VEC DIA ELL ... F1 F2 … … Fj C1 C2 … … Ck

Query

Models SpMV Model

my_lib::SpMV(matrix);

Run DIA User Library (my_lib)

SpMV (...) CSR_VEC DIA ELL ... F1 F2 … … Fj C1 C2 … … Ck

DIA

End User User Library

Nitro Production Use
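As a rough illustration of what happens on such a call, here is a hypothetical selection loop (all names are illustrative; this is not Nitro's actual API):

# Hypothetical sketch of production-time variant selection; names are
# illustrative, not Nitro's actual API.
def call_variant(variants, features, constraints, model, input_data):
    fv = [f(input_data) for f in features]          # evaluate input features
    for label in model.rank(fv):                    # model orders variant labels
        if all(check(input_data, label) for check in constraints):
            return variants[label](input_data)      # run first valid variant
    raise RuntimeError("no valid variant for this input")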

SLIDE 11

SpMV Library Driver (C++)

// Create Nitro tuning context
context cx;
...
code_variant<tuning_policies::spmv, ArgTuple> spmv(cx);

// Declare and add variants
csr_vector_type<T> csr_vector_variant;
dia_type<T> dia_variant;
...
spmv.add_variant(&csr_vector_variant);
spmv.add_variant(&dia_variant);

Annotations from the slide: tuning_policies::spmv is auto-generated from the tuning script; csr_vector_type<T> and dia_type<T> are C++ functors containing the variants; ArgTuple is a thrust::tuple of the variant's argument types.

SLIDE 12

SpMV Library Driver (C++)

// Declare and add features...
avg_nnz_per_row_type<T> avg_nnz_feature;
...
spmv.add_input_feature(&avg_nnz_feature);
...
// ... and constraints
dia_cutoff_type dia_cutoff;
spmv.add_constraint(&dia_cutoff);
...
// Call variant
spmv(input_matrix);

Annotation from the slide: the dia_cutoff constraint computes a padding estimate for conversion to the DIA format.
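For intuition, a hypothetical sketch of what the avg_nnz_per_row feature might compute (in Python for brevity; Nitro's features are C++ functors):

# Hypothetical average-nonzeros-per-row feature on a CSR matrix.
def avg_nnz_per_row(row_ptr):
    num_rows = len(row_ptr) - 1
    nnz = row_ptr[-1] - row_ptr[0]   # total nonzeros in the matrix
    return nnz / num_rows

print(avg_nnz_per_row([0, 1, 3, 4]))  # 4 nonzeros over 3 rows -> 1.33...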

SLIDE 13

SpMV Tuning Script (Python)

# Provide application, fn name, number of variants
tuner = autotuner("spmv")
spmv = code_variant("spmv", 6)

# Set variant-specific tuning options
spmv.classifier = svm_classifier()
spmv.constraints = True

# Provide training data for classifier
tuner.set_training_args(input)

# Perform autotuning of variant
tuner.tune([spmv])

SLIDE 14

Model Construction

• The tuning subsystem builds a model that maps a given feature vector to the label of the corresponding optimal variant
• Offline training phase
• Plug-in support for classifiers
• Support Vector Machines (via libSVM) are currently used by default
  • The RBF kernel is the default; its parameters are found using a cross-validation-based parameter search

[Figure: training inputs are labeled by exhaustive search over the variants, combined with feature and constraint evaluation, to produce labeled training data (e.g., DIA, CSRV).]
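To make the classifier step concrete, a minimal sketch of the offline training, assuming scikit-learn in place of Nitro's libSVM backend (feature vectors and labels are made up):

# Minimal sketch of offline model construction: an RBF-kernel SVM with
# cross-validated parameters. Assumes scikit-learn; data is illustrative.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.array([[8, 0.1], [10, 0.2], [12, 0.15],    # feature vectors
              [64, 0.9], [70, 0.8], [80, 0.95]])
y = np.array([0, 0, 0, 1, 1, 1])                  # label = best variant index

grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]},
                    cv=3)
grid.fit(X, y)

print(grid.best_params_)                 # parameters found by cross-validation
print(grid.predict([[32, 0.5]]))         # predicted best variant for new input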

SLIDE 15

Improving Training & Runtime Overheads

• Incremental tuning through active learning
• Parallel feature and constraint evaluation
• Asynchronous feature function execution

[Figure: active learning loop. A BvSB (best-vs-second-best) criterion picks instances from the active pool, moves them into the training pool, and the model is retrained.]
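To illustrate the BvSB pick, a minimal sketch assuming the classifier exposes class-probability estimates (all names illustrative):

# BvSB selection sketch: pick the pool instances whose top-two class
# probabilities are closest, i.e. where the model is least certain.
import numpy as np

def bvsb_pick(probs, k):
    """probs: (n_samples, n_classes) predicted class probabilities."""
    srt = np.sort(probs, axis=1)
    margin = srt[:, -1] - srt[:, -2]      # best minus second best
    return np.argsort(margin)[:k]         # k smallest margins

probs = np.array([[0.50, 0.45, 0.05],    # margin 0.05 (uncertain)
                  [0.90, 0.05, 0.05],    # margin 0.85 (confident)
                  [0.40, 0.35, 0.25]])   # margin 0.05 (uncertain)
print(bvsb_pick(probs, 2))               # -> [0 2]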

SLIDE 16

Experimental Setup

• Target architecture: Tesla C2050 (Fermi)
• Training inputs
  • Taken from standard input sets
  • At minimum, an exemplar input for each variant
• Test inputs
  • Distinct from the training data
  • Test set much larger than the training set, to test generalization

SLIDE 17

Benchmarks

• Features are specific to each benchmark; details in the paper

Benchmark (Library)              Variants
SpMV (CUSP)                      CSR Scalar (Tex/Non-Tex), CSR Vector (Tex/Non-Tex), ELL, DIA
Preconditioner + Solver (CULA)   Solvers (CG, BiCGStab) x Preconditioners (Jacobi, Blocked Jacobi, FAInv)
BFS (Back40Computing)            E-C (Fused/Iterative), C-E (Fused/Iterative), 2-Phase (Fused/Iterative)
Histogram (CUB)                  Variants (Sort, Global-Atomic, Shared-Atomic) x Grid Mappings (Even-Share, Dynamic)
GPU Sort (CUB, ModernGPU)        Merge, Locality, Radix

SLIDE 18

Results: Nitro vs. Other Variants

On average, Nitro achieves at least 93% of the performance of exhaustive search.

SLIDE 19

Performance Breakdown

~80% of the test set achieves at least 90% of the best available performance.

SLIDE 20

Results: Incremental Tuning

[Plot: incremental tuning performance for SpMV, Solvers, BFS, Histogram, and Sort. X-axis: number of training instances (1-51); Y-axis: average % performance w.r.t. the best variant (50-100%).]

Achieves 90% of the performance of the full training set in ~25 iterations.

SLIDE 21

Related Work

• Variant tuning systems: PetaBricks, STAPL, etc.
  • Tuning based on general input characteristics
• Parameter tuning systems: Active Harmony, Orio, etc.
• Domain-specific autotuners: OSKI, SPIRAL, etc.
• Other solutions to the algorithm selection problem
  • MDPs, reinforcement learning, etc.
  • These can be integrated into Nitro's learning subsystem

SLIDE 22

Conclusions & Future Work

 Nitro  Programmer-directed code variant tuning system  Uses supervised learning to select variants based on input

dataset features

 For 5 high-performance GPU benchmarks, Nitro-tuned variants

achieve over 93% of performance w.r.t exhaustive search

 Incremental tuning supported via Active Learning  Future Work  Optimization parameter support  Architectural tuning support  Tuning for energy and power efficiency

SLIDE 23

• Nitro is a collaborative project between the University of Utah and NVIDIA Research
• Original paper: S. Muralidharan, M. Shantharam, M. Hall, M. Garland, B. Catanzaro, "Nitro: A Framework for Adaptive Code Variant Tuning", IPDPS 2014
• Nitro web page: nitro-tuner.github.io
• Contact: sauravm@cs.utah.edu

SLIDE 24
SLIDE 25

Feature Evaluation Overhead

[Plot: performance w.r.t. feature evaluation overhead for SpMV, Solvers, BFS, Histogram, and Sort. X-axis: number of features (1-8); Y-axis: % performance w.r.t. the best variant (70-100%).]

Analysis helps remove features with high asymptotic complexity

SLIDE 26

Library and Tuning Interfaces

SLIDE 27

Benchmarks: Features

• Sparse Matrix-Vector Multiplication: AvgNZPerRow, RL-SD, MaxDeviation, DIA and ELL Fillin (see the sketches after this list)
• Preconditioner + Solvers: NNZ, #Rows, Trace, DiagAvg, DiagVar, DiagDominance, LBw, Norm1
• Breadth-First Search: AvgOutDeg, Deg-SD, MaxDeviation, #Vertices, #Edges
• Histogram: N, N/#Bins, SubSampleSD
• GPU Sort: N, #Bits, #AscSeq
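As referenced above, hypothetical sketches of two of the SpMV features (in Python for brevity; Nitro's real features are C++ functors, and these formulas are illustrative):

# RL-SD: standard deviation of CSR row lengths.
import numpy as np

def rl_sd(row_ptr):
    return float(np.std(np.diff(row_ptr)))

# DIA Fillin: rough estimate of the padding introduced by converting a
# COO matrix to DIA, as the ratio of stored entries (one dense column
# per occupied diagonal) to true nonzeros.
def dia_fillin(row_idx, col_idx, num_rows):
    diags = np.unique(np.asarray(col_idx) - np.asarray(row_idx))
    return len(diags) * num_rows / len(col_idx)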