Prasanth Chatarasi PhD Thesis Defense Habanero Extreme Scale Software - PowerPoint PPT Presentation

ADVANCING COMPILER OPTIMIZATIONS FOR GENERAL-PURPOSE & DOMAIN-SPECIFIC PARALLEL ARCHITECTURES Prasanth Chatarasi PhD Thesis Defense Habanero Extreme Scale Software Research Group School of Computer Science Georgia Institute of Technology July 27th, 2020 1

Disruption in Computer Hardware • Transistor scaling is reaching its limits (7nm today) • Leading to the end of Moore’s law General-Purpose Domain-specific Parallel Architectures Parallel Architectures Multi-core CPUs Many-core CPUs Spatial accelerators Specialized SIMD Thread Migratory Quantum SIMD GPGPUs These architectures are evolving rapidly! 2 Images are taken from public domain

Application domains that demand high performance are also increasing Scientific computing Large scale graph processing applications Machine learning (Deep Neural Networks) Furthermore, these domains are rapidly evolving with new algorithms! 3

Ways to achieve high-performance 2) High-performance 3) Optimizing 1) Ninja/Expert libraries compilers programmers — Easy to develop high — Easy to develop high — Achieve close to performance applications peak performance performance applications — Portable across platforms — Hard to port to new — Not portable across hardware platforms platforms — Easily supports rapidly evolving applications — Only a small fraction — Hard to support rapidly evolving applications of developers are Ninja — Enables full-program programmers optimizations — Inhibits optimizations across library calls — Promising direction, but requires advancements! 4

Thesis statement “Given the increasing demand for performance across multiple application domains and the major disruptions in future computer hardware as we approach the end of Moore’s Law, our thesis is that advances in compiler optimizations are critical for enabling a wide range of applications to exploit future advances in both general-purpose and domain-specific parallel architectures.” 5

Key Contributions Advancing Compiler Optimizations for General-Purpose Parallel Architectures Analysis and optimization of explicitly Multi-core/Many-core 1) parallel programs (PACT’15) CPUs Unification of storage transformations Vector Units 2) with loop transformations (LCPC’18) (SIMD, SIMT) Advancing Compiler Optimizations for Domain-Specific Parallel Architectures Domain-specific compiler for graph Thread migratory 3) analytics on thread migratory hardware (EMU) (MCHPC’18) Data-centric compiler for DNN Flexible Spatial 4) operators on flexible spatial accelerators accelerators (ArXiv’20) Domain-specific compiler for tensor Specialized vector 5) convolutions on 2D SIMD units units (AI Engine) (Under submission) 6

Analysis and Optimizations of Explicitly-Parallel Programs "Polyhedral Optimizations of Explicitly Parallel Program" Prasanth Chatarasi , Jun Shirako, and Vivek Sarkar,   In Proceedings of the 24th International Conference on Parallel Architecture and Compilation (PACT'15)   (One of four papers selected for Best Paper session) 7

Explicit parallel software on the rise! • Parallel programming of multi-cores, many-cores in CPUs, GPUs have become mainstream • E.g., OpenMP for CPUs, CUDA for GPUs • Programmers explicitly specify parallelism in the program Key Challenges: 1) How to extend foundations of optimizing compilers to support explicit parallelism? 2) Can explicit-parallelism be used to refine conservative (imprecise) dependences? 8

Background: Explicit Parallelism • Parallel programs have partial execution order • Described by Happens-before relations • Loop-level parallelism (since OpenMP 1.0) • Iterations of the loop can be run in parallel • Task-level parallelism (since OpenMP 3.0 & 4.0) • Synchronization b/w parents and children — “omp taskwait” • Synchronization b/w siblings — “depend” clause 9

Background: Serial-Elision property Removal of all parallel constructs results in a sequential program that is a valid (albeit ine ffi cient) implementation of the parallel program semantics. Task dependence graph Graph after removing Original program of the program parallel constructs Satisfies serial-elision property 10

Our Approach (PoPP) (satisfying serial-elision property) PoPP — Polyhedral optimizations of Parallel Programs 11

Step-1: Compute dependences based on the sequential order (use serial-elision and ignore parallel constructs) Jacobi scientific benchmark from the KASTORS suite 12

Step-2: Compute happens-before relations using parallel constructs (ignoring statement bodies) Jacobi scientific benchmark from the KASTORS suite 13

Step-3: Intersect dependences (Best of both worlds) 14

Step-4: Pass refined dependences to Polyhedral optimizers (PolyAST) ’ ’ • Refined dependences enable a broad set of transformations • i-loop is parallel, but invalid rectangular tiling • Skewing transformation to enable rectangular tiling 15

Step-5: Generate code Omitted tiling for brevity ’ ’ • Invoke polyhedral code generators (PolyAST) • Capable of scanning the complex iteration space • Fine-grained (point-to-point ) synchronization instead of barriers 16

Evaluation • PoPP was implemented in ROSE source to source compiler framework and evaluated on the following benchmarks. • KASTORS — Task parallel benchmarks (3) • Jacobi, Jacobi-blocked, Sparse LU • RODINIA — Loop parallel benchmarks (8) • Back propagation, CFD solver, Hotspot, Kmeans, LUD, Needleman– Wunsch, particle filter, path finder 17

Variants • Original OpenMP program • Written by programmer/application developer • Automatic optimization and parallelization of serial-elision version of the OpenMP program • Automatic optimizers (PolyAST) • Optimized OpenMP program with our approach • Our framework (PoPP) which extends PolyAST with the intersection of happens- before and data dependence relations 18

Evaluation on IBM Power 8 19

Summary & Related Work • Summary: • Extended the foundations of optimizing compiler for analyzing parallel programs and also advanced the dependence analysis. • Broadened the range of applicable legal transformations • Geometric mean performance improvements of 1.62X on Intel westmere and 2.75X on IBM Power8 • Related work: • Data-flow analysis of explicitly parallel programs [Yuki et al. PPoPP’13] • Improved loop dependence analysis for GCC auto-vectorization [Jenson et al. TACO’17] • Enabled classical scalar optimizations for explicitly-parallel programs using “serial-elision” property [TAPIR — Tao et al. PPoPP’17] 20

Key Contributions Advancing Compiler Optimizations for General-Purpose Parallel Architectures Analysis and optimization of explicitly Multi-core/Many-core 1) parallel programs (PACT’15) CPUs Unification of storage transformations Vector Units 2) with loop transformations (LCPC’18) (SIMD, SIMT) Advancing Compiler Optimizations for Domain-Specific Parallel Architectures Domain-specific compiler for graph Thread migratory 3) analytics on thread migratory hardware (EMU) (MCHPC’18) Data-centric compiler for DNN Flexible Spatial 4) operators on flexible spatial accelerators accelerators (ArXiv’20) Domain-specific compiler for tensor Specialized vector 5) convolutions on 2D SIMD units units (AI Engine) (Under submission)

Marvel: A Data-Centric Compiler for DNN Operators onto Flexible Spatial Accelerators "Marvel: A Data-centric Compiler for DNN Operators on Spatial Accelerators" Prasanth Chatarasi , Hyoukjun Kwon, Natesh Raina, Saurabh Malik, Vaisakh Haridas, Angshuman Parashar, Michael Pellauer, Tushar Krishna, and Vivek Sarkar, (ArXiv’20) 22

Deep Learning (DNN Models) Examples of DNN Regular CONV2D over 4D Tensors Operators (Layers) Weights C — Regular CONV1D Inputs Partial Sums K R — Regular CONV2D — Depth-wise CONV2D Q = Y –R S Y C — Transposed CONV2D — Regular CONV3D C P = X – S X — Strided variants N K — GEMM (MatMul) N — LSTM (RNNs) — Element-wise — Pooling — Fully Connected/MLP Involves billions of computations — ….. 23 Parashar et al., ISPASS 2019

Spatial Accelerators Abstract overview DNN Operators DRAM unit — Regular CONV1D Problem statement: — Regular CONV2D To/From DRAM How to map for — Depth-wise CONV2D Shared Bu ff er (L2 Scratch Pad) — Transposed CONV2D low latency, — Regular CONV3D Network-on-Chip (NoC) high energy e ffi ciency? — Strided variants PE PE PE — GEMM (MatMul) L1 Scratch Pad L1 Scratch Pad L1 Scratch Pad — LSTM (RNNs) ALU (MAC Unit) ALU (MAC Unit) ALU (MAC Unit) — Element-wise — Pooling PE PE PE — Fully Connected/MLP L1 Scratch Pad L1 Scratch Pad L1 Scratch Pad — ….. ALU (MAC Unit) ALU (MAC Unit) ALU (MAC Unit) PE PE PE Mapping involves L1 Scratch Pad L1 Scratch Pad L1 Scratch Pad ALU (MAC Unit) ALU (MAC Unit) ALU (MAC Unit) 1) Parallelization onto compute resources, 3-level accelerator 2) Tiling across memory resources, and 3) Exploitation of data reuse E.g., TPU, Eyeriss, NVDLA 24

Prasanth Chatarasi PhD Thesis Defense Habanero Extreme Scale Software - PowerPoint PPT Presentation

ADVANCING COMPILER OPTIMIZATIONS FOR GENERAL-PURPOSE & DOMAIN-SPECIFIC PARALLEL ARCHITECTURES Prasanth Chatarasi PhD Thesis Defense Habanero Extreme Scale Software Research Group School of Computer Science Georgia Institute of Technology

Polyhedral Transformations of Explicitly Parallel Programs Prasanth Chatarasi, Jun Shirako, Vivek

A Preliminary Study of Compiler Transformations for Graph Applications on the EMU System

Static Data Race Detection for SPMD Programs via an Extended Polyhedral Representation Prasanth

A quantum algorithm for model independent searches for new physics Prasanth Shyamsundar

Extending the Polyhedral Compilation Model for Debugging and Optimization of SPMD-style Explicitly

C O R P U S C H R I S T I P A R I S H PASTOR: Fr. Prasanth George CMI OFFICE HOURS: Tuesday to

Using ONNX for accelerated inferencing on cloud and edge Prasanth Pulavarthi (Microsoft) Kevin

Investor Presentation July 2018 Senior management presenting Prasanth Manghat CEO Prashanth

Investor Presentation August 2018 Senior management presenting Prasanth Manghat CEO Prashanth

OASIS: Better simulated events to allow for fewer simulated events Prasanth Shyamsundar

FY 2014 Results Presentation 24 February, 2015 Senior management presenting Prasanth Manghat

Anokha Indo Fren rench musical journ rney ey Prasanth Anouk Algrain Shankararaman

Anomaly Detection for Network Connection Logs Swapneel Mehta Prasanth Kothuri, Daniel Lanza

A Big Data Architecture for the Detection of Anomalies within Database Connection Logs Swapneel

GF2UD and UD2GF UD: Universal Dependencies Prasanth Kolachina GF Summer school, 2017 the black

Improving the Efficiency of Manual Ground Truth Labeling Using Automated Anatomy Segmentation

Computational Social Choice: Spring 2017 Ulle Endriss Institute for Logic, Language and

CS6200 Information Retrieval Jesse Anderton College of Computer and Information Science

Fair division Lirong Xia March 11, 2016 Last class: two-sided 1-1 stable matching Boys Kyle

Kick Off Celebration & Giving Day - Wed., August 29, 2018, 7am - 7pm 108 Donors Contributed

DCU at FIRE 2013: Cross Language !ndian News Story Search Piyush Arora, Jennifer Foster, Gareth

Semantic Web: a short introduction Ivan Herman, Semantic Web Activity Lead, W3C Webelopers

Web CS490W: Web I nformation Search & Management Web opened the door for many important

Why do we use multiplexing on cars ? EVOLUTION DU CABLAGE METRES (longueur de cablage) NOMBRE

Prasanth Chatarasi PhD Thesis Defense Habanero Extreme Scale Software - PowerPoint PPT Presentation

ADVANCING COMPILER OPTIMIZATIONS FOR GENERAL-PURPOSE & DOMAIN-SPECIFIC PARALLEL ARCHITECTURES Prasanth Chatarasi PhD Thesis Defense Habanero Extreme Scale Software Research Group School of Computer Science Georgia Institute of Technology

Polyhedral Transformations of Explicitly Parallel Programs Prasanth Chatarasi, Jun Shirako, Vivek

A Preliminary Study of Compiler Transformations for Graph Applications on the EMU System

Static Data Race Detection for SPMD Programs via an Extended Polyhedral Representation Prasanth

A quantum algorithm for model independent searches for new physics Prasanth Shyamsundar

Extending the Polyhedral Compilation Model for Debugging and Optimization of SPMD-style Explicitly

C O R P U S C H R I S T I P A R I S H PASTOR: Fr. Prasanth George CMI OFFICE HOURS: Tuesday to

Using ONNX for accelerated inferencing on cloud and edge Prasanth Pulavarthi (Microsoft) Kevin

Investor Presentation July 2018 Senior management presenting Prasanth Manghat CEO Prashanth

Investor Presentation August 2018 Senior management presenting Prasanth Manghat CEO Prashanth

OASIS: Better simulated events to allow for fewer simulated events Prasanth Shyamsundar

FY 2014 Results Presentation 24 February, 2015 Senior management presenting Prasanth Manghat

Anokha Indo Fren rench musical journ rney ey Prasanth Anouk Algrain Shankararaman

Anomaly Detection for Network Connection Logs Swapneel Mehta Prasanth Kothuri, Daniel Lanza

A Big Data Architecture for the Detection of Anomalies within Database Connection Logs Swapneel

GF2UD and UD2GF UD: Universal Dependencies Prasanth Kolachina GF Summer school, 2017 the black

Improving the Efficiency of Manual Ground Truth Labeling Using Automated Anatomy Segmentation

Computational Social Choice: Spring 2017 Ulle Endriss Institute for Logic, Language and

CS6200 Information Retrieval Jesse Anderton College of Computer and Information Science

Fair division Lirong Xia March 11, 2016 Last class: two-sided 1-1 stable matching Boys Kyle

Kick Off Celebration &amp; Giving Day - Wed., August 29, 2018, 7am - 7pm 108 Donors Contributed

DCU at FIRE 2013: Cross Language !ndian News Story Search Piyush Arora, Jennifer Foster, Gareth

Semantic Web: a short introduction Ivan Herman, Semantic Web Activity Lead, W3C Webelopers

Web CS490W: Web I nformation Search &amp; Management Web opened the door for many important

Why do we use multiplexing on cars ? EVOLUTION DU CABLAGE METRES (longueur de cablage) NOMBRE

Kick Off Celebration & Giving Day - Wed., August 29, 2018, 7am - 7pm 108 Donors Contributed

Web CS490W: Web I nformation Search & Management Web opened the door for many important