Tornado VM: Running Java on GPUs and FPGAs
Juan Fumero, PhD
http://jjfumero.github.io
QCon London 2020, 3rd March 2020
Agenda
1. Motivation & Background
2. TornadoVM
3. Performance Results
4. Related Work & Future Directions
5. Conclusions
Lead Developer of TornadoVM
Postdoc @ University of Manchester
juan.fumero@manchester.ac.uk | @snatverk
CPU: Intel Ice Lake (10nm)
* 8 cores, HT, AVX-512 SIMD
* ~1 TFLOP* (including the iGPU)
* TDP ~28 W
(Source: Intel docs)

GPU: NVIDIA GP100 (Pascal, 16nm)
* 60 SMs, 64 cores each: 3584 FP32 cores
* 10.6 TFLOPs (FP32)
* TDP ~300 W
(Source: NVIDIA docs)

FPGA: Intel Stratix 10 (14nm)
* Reconfigurable hardware
* ~10 TFLOPs
* TDP ~225 W
(Source: Intel docs)
A GPU contains a set of Streaming Multiprocessors (SMx); the Pascal architecture has 60 SMx with ~3500 CUDA cores in total. Users need to know:
A) The programming model (normally CUDA or OpenCL).
B) Details about the architecture, which are essential to achieve performance (non-sequential consistency, manual barriers, etc.).
(Source: NVIDIA docs)
With an FPGA, you can configure the design of your hardware after manufacturing. It is like having your algorithms directly wired on hardware, with only the parts you need.
https://github.com/beehive-lab/kfusion-tornadovm
* Computer vision application
* ~7K lines of Java code
* Thousands of lines of OpenCL code generated
The TornadoVM stack:
* API: Tasks = methods; Task-Schedules = groups of methods; API annotations
* Runtime: data-flow analysis & optimizer; TornadoVM bytecode generation
* Execution Engine: bytecode interpreter; device drivers; device's heap management
* Just-In-Time Compiler: compiler / Graal JIT extensions
class Compute {
    public static void mxm(Matrix2DFloat A, Matrix2DFloat B, Matrix2DFloat C, final int size) {
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    sum += A.get(i, k) * B.get(k, j);
                }
                C.set(i, j, sum);
            }
        }
    }
}
class Compute {
    public static void mxm(Matrix2DFloat A, Matrix2DFloat B, Matrix2DFloat C, final int size) {
        for (@Parallel int i = 0; i < size; i++) {
            for (@Parallel int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    sum += A.get(i, k) * B.get(k, j);
                }
                C.set(i, j, sum);
            }
        }
    }
}
We add the @Parallel annotation as a hint for the compiler.
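To illustrate what the hint means, here is a minimal sketch with a hypothetical saxpy kernel (the class name is illustrative; the annotation is assumed to live in uk.ac.manchester.tornado.api.annotations as in TornadoVM 0.x). @Parallel only tells the TornadoVM JIT that iterations may run in parallel; loops whose iterations depend on each other should not be annotated.

import uk.ac.manchester.tornado.api.annotations.Parallel;

class Saxpy {
    // Each iteration writes a distinct y[i] and reads only x[i] and y[i],
    // so it is safe to mark the loop with @Parallel.
    public static void saxpy(float alpha, float[] x, float[] y) {
        for (@Parallel int i = 0; i < y.length; i++) {
            y[i] = alpha * x[i] + y[i];
        }
    }
}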
TaskSchedule ts = new TaskSchedule("s0");
ts.task("t0", Compute::mxm, matrixA, matrixB, matrixC, size)
  .streamOut(matrixC)
  .execute();
$ tornado Compute
To run: the tornado command is just an alias for java plus all the parameters needed to enable TornadoVM.
void compute(float[] input, float[] output) {
    for (@Parallel int i = 0; …) {
        for (int j = 0; ...) {
            // Computation
        }
    }
}
From slowdowns without specializations to 240x speedups with automatic specializations on Intel FPGAs.
https://github.com/jjfumero/qconlondon2020-tornadovm
Project    | Production-Ready | Supported Devices       | Live Task Migration | Compiler Specializations | Dynamic Languages
Sumatra    | No               | AMD GPUs                | No                  | No                       | No
Marawacc   | No               | Multi-core, GPUs        | No                  | Yes                      | Yes
JaBEE      | No               | NVIDIA GPUs             | No                  | No                       | No
RootBeer   | No               | NVIDIA GPUs             | No                  | No                       | No
Aparapi    | Yes              | GPUs, multi-core        | No                  | No                       | No
IBM GPU J9 | Yes              | NVIDIA GPUs             | No                  | No                       | No
grCUDA     | No (*)           | NVIDIA GPUs             | No                  | No                       | Yes
TornadoVM  | Not yet (*)      | Multi-core, GPUs, FPGAs | Yes                 | Yes                      | Yes
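Live task migration in TornadoVM is exposed through dynamic reconfiguration: instead of a plain execute(), the TaskSchedule can be executed with a policy that lets the runtime explore the available devices and migrate the task to the best one. A minimal sketch, assuming the Policy API of TornadoVM 0.x (uk.ac.manchester.tornado.api.Policy):

import uk.ac.manchester.tornado.api.Policy;
import uk.ac.manchester.tornado.api.TaskSchedule;

TaskSchedule ts = new TaskSchedule("s0");
ts.task("t0", Compute::mxm, matrixA, matrixB, matrixC, size)
  .streamOut(matrixC)
  // The runtime profiles the task across available devices (e.g., CPU,
  // GPU, FPGA) and keeps it running on the fastest one.
  .execute(Policy.PERFORMANCE);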
* TornadoVM performs up to 7.7x speedup
* Up to >4500x over sequential Java
https://github.com/beehive-lab/TornadoVM/blob/master/assembly/src/docs/Publications.md
EU H2020 E2Data Project
https://e2data.eu/
"End-to-end solutions for Big Data deployments that fully exploit heterogeneous hardware"
Funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 780245.
E2Data Project – Distributed Heterogeneous System with Apache Flink & TornadoVM
For example, in health care and machine learning.
Using TornadoVM for the training phase (2M patients): from ~2615s down to 185s (~14x speedup).
Thanks to Gerald Mema from Exus for sharing the numbers and the use case
https://github.com/beehive-lab/TornadoVM

$ docker pull beehivelab/tornado-gpu

# And run!
$ ./run_nvidia.sh javac.py YourApp
$ ./run_nvidia.sh tornado YourApp

https://github.com/beehive-lab/docker-tornado
Christos Kotselidis
Juan Fumero, Athanasios Stratikopoulos, Foivos Zakkak, Florin Blanaru, Nikos Foutris, Michail Papadimitriou, Maria Xekalaki
Undergraduates: Gyorgy Rethy, Mihai-Christian Olteanu, Ian Vaughan
We are looking for collaborations (industrial & academic) -> talk to us!
James Clarkson, Benjamin Bell, Amad Aslam
>4500x vs HotSpot
CPU cores:
* 4-8 cores per CPU
* Local caches (L1-L3)

GPU cores:
* Thousands of cores per GPU card
* > 60 cores per SM
* Small caches per SM
* Global memory within the GPU
* A few thread schedulers per SM

FPGAs:
* A chip with LUTs, BRAMs, and the wires to connect them
* Normally global memory within the chip
class Compute {
    public static void map(float[] input, float[] output) {
        for (@Parallel int i = 0; i < size; i++) {
            // map computation (elided)
        }
    }
    public static void reduce(@Reduce float[] data) {
        for (@Parallel int i = 0; i < size; i++) {
            data[0] += data[i];
        }
    }
}

TaskSchedule ts = new TaskSchedule("MapReduce");
ts.streamIn(input)
  .task("map", Compute::map, input, output)
  .task("reduce", Compute::reduce, output)
  .streamOut(output)
  .execute();

github.com/beehive-lab/TornadoVM/tree/master/examples
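A design note on grouping both tasks in one TaskSchedule: within a schedule, the intermediate output array can stay resident on the device between the map and reduce kernels. For contrast, a sketch (hypothetical, for illustration only) of the same pipeline split into two schedules, which forces the intermediate result through the host:

// Two separate schedules: output is copied device -> host after "map"
// and host -> device again before "reduce".
TaskSchedule mapOnly = new TaskSchedule("s1");
mapOnly.streamIn(input)
       .task("map", Compute::map, input, output)
       .streamOut(output)
       .execute();

TaskSchedule reduceOnly = new TaskSchedule("s2");
reduceOnly.streamIn(output)
          .task("reduce", Compute::reduce, output)
          .streamOut(output)
          .execute();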
Performance for each device against Java HotSpot:
* Up to 4500x by using a GPU
* 240x by using an FPGA
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void vectorAdd(__global uchar *_heap_base, ulong _frame_base, ..) {
    int i_9, i_11, i_4, i_3, i_13, i_14, i_15;
    long l_7, l_5, l_6;
    ulong ul_0, ul_1, ul_2, ul_12, ul_8, ul_10;

    // Access to the Java frame
    __global ulong *_frame = (__global ulong *) &_heap_base[_frame_base];

    // BLOCK 0: access the data within the frame
    ul_0 = (ulong) _frame[6];
    ul_1 = (ulong) _frame[7];
    ul_2 = (ulong) _frame[8];
    i_3 = get_global_id(0);

    // BLOCK 1 MERGES [0 2]
    i_4 = i_3;
    for (; i_4 < 256; ) {
        // BLOCK 2: access the arrays (skip the object header)
        l_5 = (long) i_4;
        l_6 = l_5 << 2;
        l_7 = l_6 + 24L;
        ul_8 = ul_0 + l_7;
        i_9 = *((__global int *) ul_8);
        ul_10 = ul_1 + l_7;
        i_11 = *((__global int *) ul_10);
        ul_12 = ul_2 + l_7;
        // Operation
        i_13 = i_9 + i_11;
        // Final store
        *((__global int *) ul_12) = i_13;
        i_14 = get_global_size(0);
        i_15 = i_14 + i_4;
        i_4 = i_15;
    }
    // BLOCK 3
    return;
}
The Java source it was generated from:

private void vectorAdd(int[] a, int[] b, int[] c) {
    for (@Parallel int i = 0; i < c.length; i++) {
        c[i] = a[i] + b[i];
    }
}
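And a sketch of how this kernel could be launched (variable names and sizes are illustrative; this::vectorAdd assumes we are inside the class that declares the method). The execute() call is what triggers the JIT compilation to the OpenCL kernel shown above; note that the generated loop bound of 256 matches the array size used here:

int size = 256;  // matches the bound baked into the generated kernel above
int[] a = new int[size];
int[] b = new int[size];
int[] c = new int[size];
// ... initialize a and b ...

TaskSchedule ts = new TaskSchedule("s0");
ts.task("t0", this::vectorAdd, a, b, c)
  .streamOut(c)
  .execute();  // JIT-compiles vectorAdd to OpenCL and runs it on the device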
library ieee;
use ieee.std_logic_1164.all;

entity half_adder is
    port (a, b: in std_logic;
          sum, carry: out std_logic);
end half_adder;

architecture structure of half_adder is
    component xor_gate
        port (i1, i2: in std_logic;
              o1: out std_logic);
    end component;
    component and_gate
        port (i1, i2: in std_logic;
              o1: out std_logic);
    end component;
begin
    u1: xor_gate port map (i1 => a, i2 => b, o1 => sum);
    u2: and_gate port map (i1 => a, i2 => b, o1 => carry);
end structure;
Industry is pushing for OpenCL on FPGAs!
__kernel void sum(float a, float b, __global float *result) {
    result[0] = a + b;
}
$ tornado YourProgram
$ tornado -Dtornado.fpga.aot.bitstream=<path> YourProgram
$ tornado -Dtornado.fpga.emulation=True YourProgram
void compute(float[] input, float[] output) {
    for (@Parallel int i = 0; …) {
        for (@Parallel int j = 0; ...) {
            // Computation
        }
    }
}
Java -> TornadoVM -> Physical hardware. Program FPGAs within your favourite IDE: Eclipse, IntelliJ, …
With compiler specializations, TornadoVM performs from 5x to 240x against Java HotSpot for DFT!
void compute(float[] input, float[] output) {
    for (@Parallel int i = 0; …) {
        for (int j = 0; ...) {
            // Computation
        }
    }
}
Non-specialized version vs. specialized version.
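For reference, a DFT kernel of this shape written in Java; this is a sketch modelled on TornadoVM's DFT example (method and array names here are illustrative, not the exact shipped code):

// Naive O(n^2) DFT: the outer loop over output bins is data-parallel,
// so only the k-loop carries the @Parallel annotation.
public static void computeDft(float[] inReal, float[] inImag,
                              float[] outReal, float[] outImag) {
    int n = inReal.length;
    for (@Parallel int k = 0; k < n; k++) {
        float sumReal = 0.0f;
        float sumImag = 0.0f;
        for (int t = 0; t < n; t++) {
            float angle = (2.0f * (float) Math.PI * t * k) / n;
            sumReal += inReal[t] * (float) Math.cos(angle) + inImag[t] * (float) Math.sin(angle);
            sumImag += -inReal[t] * (float) Math.sin(angle) + inImag[t] * (float) Math.cos(angle);
        }
        outReal[k] = sumReal;
        outImag[k] = sumImag;
    }
}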
void reduce(float[] input, @Reduce float[] output) {
    for (@Parallel int i = 0; i < N; i++) {
        // reduction computation (elided)
    }
}
… but how?
With reduction specializations, we execute the code within 80% of native performance (manually written code).
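A sketch of how such a reduction is written and launched with the TornadoVM API (the kernel body and names are illustrative, and reductionSum is assumed to live in the Compute class used earlier; @Reduce tells the compiler to generate the device-specific parallel reduction):

public static void reductionSum(float[] input, @Reduce float[] result) {
    result[0] = 0.0f;
    for (@Parallel int i = 0; i < input.length; i++) {
        result[0] += input[i];  // combined across threads by the generated code
    }
}

TaskSchedule ts = new TaskSchedule("s0");
ts.streamIn(input)
  .task("t0", Compute::reductionSum, input, result)
  .streamOut(result)
  .execute();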
$ tornado --igv YourProgram
* TornadoVM on FPGA is up to 19x over Java multi-threaded (8 cores)
* Slowdown for small sizes
[Figure: speedup against Java multi-threaded for VectorAdd, BlackScholes, RenderTrack, and DFT, at small, medium, and large input sizes]
GOALS (JEP 8047074: http://openjdk.java.net/jeps/8047074) and whether they are implemented in Tornado:
* No syntactic changes to Java 8 parallel stream API -> Tornado uses its own API
* Autodetection of hardware and software stack
* Heuristic to decide when to offload to GPU gives perf gains
* Performance improvement for embarrassingly parallel workloads
* Code accuracy has the same (non-)guarantees you can get with multi-core parallelism
* Code will always run with fallback to normal CPU execution if offload fails -> in progress!
* Will not expose any additional security risks -> under research
* Offloaded code will maintain Java memory model correctness (find JSR) -> under formal specification (several trade-offs have to be considered)
* Where possible enable JVM languages to be offloaded -> plan to integrate with Truffle, e.g. FastR-GPU: https://bitbucket.org/juanfumero/fastr-gpu/src/default/
Additional features (not included in JEP 8047074) and their status in Tornado:
* Include GPUs, integrated GPUs, FPGAs, multi-core CPUs
* Live task migration between devices
* Code specialization for each accelerator
* Potentially accelerate existing Java libraries (e.g., Lucene)
* Automatic use of tiered memory on the device (e.g., local memory) -> in progress
* Virtual Shared Memory (OpenCL 2.0) -> in progress