Vector Forward Mode Automatic Differentiation on SIMD/SIMT architectures
Jan Hückelheim, Michel Schanen, Sri Hari Krishna Narayanan, Paul Hovland
Argonne National Laboratory
August 10, 2020
Outline
◮ Automatic Differentiation Modes
◮ Parallelization and Vectorization for AD
◮ Test Cases, Results
◮ Final Remarks
Automatic Differentiation Modes
Automatic Differentiation (AD)
◮ Given: a program that computes y ← F(x).
◮ AD produces a program that computes derivatives of F.
◮ AD can be implemented using source-to-source compilers, run-time tracing, JIT, ...
◮ Many tools exist for a variety of languages, including Python, C/C++, and Fortran; see the autodiff.org community website for an overview.
◮ AD is not "numerical differentiation" or "finite differences": there is no finite step size and no truncation error.
◮ AD can be seen as symbolic differentiation for real-world programs, including branches, loops, function calls, etc.
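To make the forward mode concrete, here is a minimal sketch of scalar forward-mode AD by operator overloading with dual numbers, in C++. This is our illustration, not one of the tools mentioned above; names like Dual and f are hypothetical.

```cpp
#include <cmath>
#include <cstdio>

// A dual number: primal value v and tangent (directional derivative) d.
struct Dual {
    double v, d;
};

// Overloaded operators propagate tangents by the chain rule.
Dual operator+(Dual a, Dual b) { return {a.v + b.v, a.d + b.d}; }
Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }
Dual sin(Dual a) { return {std::sin(a.v), std::cos(a.v) * a.d}; }

// The user's program, written once and evaluated on duals.
Dual f(Dual x) { return x * x + sin(x); }

int main() {
    Dual x{1.5, 1.0};                               // seed the tangent: dx/dx = 1
    Dual y = f(x);
    std::printf("y = %g, dy/dx = %g\n", y.v, y.d);  // dy/dx = 2x + cos(x)
}
```

Seeding the tangent of one input with 1 (and all others with 0) yields one directional derivative per run; that per-input cost is exactly what the next slides are about.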
Forward vs Reverse, scalar vs vector
[Figure: data-flow diagrams of the original program, forward mode, reverse mode, and vector forward mode, mapping inputs x through intermediate values to outputs y]
◮ Forward mode is simple to understand and implement, but needs to be re-run for every input.
◮ Reverse mode is cheaper for many inputs and few outputs (run once, get all directional derivatives).
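In matrix terms (standard AD background, added here for clarity): with the Jacobian J of F, one forward run computes a Jacobian-vector product and one reverse run a transposed product, so the number of runs scales with the number of inputs or outputs, respectively.

```latex
% Forward mode: one run propagates a tangent \dot{x} and yields one column of J.
\dot{y} = J\,\dot{x},
  \qquad J = \frac{\partial F}{\partial x} \in \mathbb{R}^{m \times n}
% The full Jacobian therefore needs n forward runs (\dot{x} = e_1, \dots, e_n).
% Reverse mode: one run propagates an adjoint \bar{y} and yields one row of J.
\bar{x} = J^{\top}\bar{y}
% A gradient (m = 1) therefore needs a single reverse run.
```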
Parallelization and Vectorization for AD
Parallelization challenges and opportunities
◮ Forward mode AD has the same data flow as the original program, so its parallelism can be kept.
◮ Vector forward mode adds another dimension of parallelism. Its direction loop is free of branch divergence, which also makes it a good fit for SIMD/SIMT (see the sketch below).
[Figure: data-flow diagrams as above; in vector forward mode, each intermediate value carries a vector of tangents that maps naturally onto SIMD lanes]
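A minimal sketch of what vector forward mode looks like at the statement level, in C++ with OpenMP; the width NDIRS and the function names are our assumptions, not the paper's generated code:

```cpp
#include <cstddef>

constexpr std::size_t NDIRS = 8;  // number of tangent directions (assumed width)

// y = a * b, with NDIRS tangents propagated in one pass.
void mul_vfwd(double a, const double ad[NDIRS],
              double b, const double bd[NDIRS],
              double &y, double yd[NDIRS]) {
    y = a * b;
    // The direction loop is uniform across lanes, so it vectorizes cleanly.
    #pragma omp simd
    for (std::size_t i = 0; i < NDIRS; ++i)
        yd[i] = ad[i] * b + a * bd[i];
}

// Branches depend only on primal values, so all tangent lanes take the
// same path: no branch divergence, on SIMD or SIMT hardware.
void relu_vfwd(double x, const double xd[NDIRS],
               double &y, double yd[NDIRS]) {
    if (x > 0.0) {
        y = x;
        #pragma omp simd
        for (std::size_t i = 0; i < NDIRS; ++i) yd[i] = xd[i];
    } else {
        y = 0.0;
        #pragma omp simd
        for (std::size_t i = 0; i < NDIRS; ++i) yd[i] = 0.0;
    }
}
```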
Parallelization challenges and opportunities (continued)
◮ Reverse mode is cheaper for many inputs and few outputs (run once, get all directional derivatives).
◮ Reverse mode changes the data flow and can introduce new data races, which are hard for an AD tool to analyze (see the sketch below).
[Figure: in reverse mode, the reversed data flow turns the original program's concurrent reads into concurrent writes to the corresponding adjoints]
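To see where the new races come from, consider this illustrative fragment (ours, not from the paper). The primal loop only reads the shared scalar s and is safe to parallelize; its adjoint must accumulate into sb from every iteration, so the reverse sweep has a write race unless the tool inserts an atomic or a reduction.

```cpp
// Primal: every iteration READS the shared scalar s; parallelization is safe.
void primal(const double *x, double s, double *y, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = s * x[i];
}

// Reverse sweep: the concurrent read of s becomes a concurrent WRITE to its
// adjoint sb. Without the atomic, this is a data race.
void primal_adjoint(const double *x, double *xb, double s, double &sb,
                    const double *yb, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        xb[i] += s * yb[i];
        #pragma omp atomic
        sb += x[i] * yb[i];
    }
}
```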
Wait... But didn’t back-propagation work in parallel?
◮ Some frameworks (e.g. TensorFlow, PyTorch, Halide) support parallel reverse mode, but:
◮ they operate at a higher level of abstraction, e.g. composing manually parallelized building blocks, or generating code while exploiting problem structure;
◮ they are not general-purpose.
What AD mode is best for my application?
◮ This is commonly known in the AD literature:
◮ forward mode is easy and has low overhead, but gets costly for many inputs;
◮ reverse mode is hard and has high overhead, but it is worth it for many inputs.
◮ But what is "many"?
◮ Where is the break-even point on today's hardware?
◮ What might it depend on?
◮ We present a case study here.
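A rough cost model makes the question concrete; the constants below are our assumptions for illustration, not measurements from the paper:

```latex
% Vector forward mode with n directions, for a program of cost C(F):
T_{\mathrm{fwd}}(n) \approx (1 + \alpha n)\, C(F)
% with \alpha \ll 1 when the direction loop vectorizes well.
% Reverse mode pays a roughly fixed overhead factor \beta
% (taping, reversal, extra memory traffic):
T_{\mathrm{rev}} \approx \beta\, C(F)
% Forward mode wins while
n \lesssim \frac{\beta - 1}{\alpha},
% so cheaper tangents (smaller \alpha, e.g. via SIMD) raise the break-even point.
```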
Test Cases, Results
Test Case 1 Description: Stencil
◮ Stencil computations are common in PDE solvers, convolutions, and image processing.
◮ They are easy to parallelize, but vectorize poorly due to misaligned data.
◮ Reverse mode AD is difficult to parallelize (not supported by AD tools).
◮ We compare the performance of primal, forward, and reverse mode on a CPU (Intel Skylake) and a GPU (NVIDIA Quadro GV100).
◮ The stencil is taken from the Parboil benchmark suite and differentiated with the Tapenade AD tool.
◮ The vector-forward-mode code is post-processed to insert OpenMP SIMD directives before the direction loops (see the sketch below).
◮ The GPU version is created by manual translation to Julia, then using Julia's built-in AD and GPU support.
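For illustration, here is what that transformation looks like on a minimal 1D stencil; this is a hand-written analogue, not the Parboil kernel or Tapenade's actual output. The derivative statement sits in a direction loop of assumed width NDIRS, with the OpenMP SIMD directive placed in front of it:

```cpp
constexpr int NDIRS = 64;  // tangent directions propagated together (assumed)

// Primal 1D three-point stencil.
void stencil(const double *u, double *v, int n) {
    for (int i = 1; i < n - 1; ++i)
        v[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1];
}

// Vector forward mode: ud and vd hold NDIRS tangents per grid point,
// stored contiguously so the direction loop is unit-stride and aligned.
void stencil_vfwd(const double *u, const double *ud,
                  double *v, double *vd, int n) {
    for (int i = 1; i < n - 1; ++i) {
        #pragma omp simd
        for (int d = 0; d < NDIRS; ++d)
            vd[i * NDIRS + d] = 0.25 * ud[(i - 1) * NDIRS + d]
                              + 0.5  * ud[i * NDIRS + d]
                              + 0.25 * ud[(i + 1) * NDIRS + d];
        v[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1];
    }
}
```

Note how the misaligned neighbor accesses stay in the outer spatial loop while the inner direction loop is aligned and divergence-free, which is why the vector mode can vectorize well even when the primal stencil does not.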
Test Case 1 Results: Stencil
[Figure: time [s] on a log scale vs. number of computed derivatives (8 to 512); series: 1-core AVX-512 double/float, 28-core hyperthreaded AVX-512 double/float, reverse mode double/float, GPU double/float]
Final Remarks
How useful are O(100)-O(1000) inputs?
◮ The number of inputs is not the same as the number of state variables. For example: CAD parameters in a simulation with many more state variables.
◮ Coloring can reduce the number of derivative directions. For example: a power-flow application with over half a million inputs needs, due to sparsity, only 30 forward-mode evaluations (see paper for details).
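As a toy illustration of coloring (our example, not the power-flow code): in a tridiagonal Jacobian, columns that share no row can be probed with one combined tangent direction, so three vector-forward directions suffice for any problem size.

```cpp
#include <vector>

// Seed vectors for a tridiagonal Jacobian: column j touches rows j-1..j+1,
// so columns with the same j % 3 never overlap and can share one direction.
// Three forward-mode evaluations then recover all of J, regardless of n.
std::vector<std::vector<double>> tridiagonal_seeds(int n) {
    std::vector<std::vector<double>> seeds(3, std::vector<double>(n, 0.0));
    for (int j = 0; j < n; ++j)
        seeds[j % 3][j] = 1.0;
    return seeds;
}
```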
Conclusions
◮ Forward mode can be surprisingly competitive with reverse mode for hundreds of inputs.
◮ With current hardware trends (more parallelism, longer vectors, ...), this number may grow further.
◮ We assume here that reverse mode cannot be auto-parallelized or auto-vectorized. Maybe (hopefully) this will change?
◮ We are happy to hear your questions and comments by email: jhueckelheim@anl.gov
◮ If you are watching this as part of ICPP20, please visit our Q&A session.
This work was funded in part by support from the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory.