Exploiting High-Performance Heterogeneous Hardware for Java Programs

Exploiting High-Performance Heterogeneous Hardware for Java Programs using Graal
James Clarkson±, Juan Fumero∗, Michalis Papadimitriou∗, Foivos S. Zakkak∗, Christos Kotselidis∗ and Mikel Luján∗
±Dyson, ∗The University of Manchester
ManLang’18, Linz (Austria), 12th September 2018

Outline
• Background
• Tornado
• Tornado-API
• Tornado Runtime
• Tornado JIT Compiler
• Performance Results
• Conclusions

Context of this project
• Started as the PhD thesis of James Clarkson: Compiler and Runtime Support for Heterogeneous Programming
• James Clarkson, Christos Kotselidis, Gavin Brown, and Mikel Luján. Boosting Java Performance using GPGPUs. In Proceedings of the 30th International Conference on Architecture of Computing Systems
• Christos Kotselidis, James Clarkson, Andrey Rodchenko, Andy Nisbet, John Mawer, and Mikel Luján. Heterogeneous Managed Runtime Systems: A Computer Vision Case Study. ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’17)
• Partially funded by the EPSRC AnyScale grant EP/L000725/1

Currently part of the EU H2020 E2Data Project
"End-to-end solution for heterogeneous Big Data deployments that fully exploits and advances the state-of-the-art in infrastructure"
https://e2data.eu/
European Union’s Horizon H2020 research and innovation programme under grant agreement No 780622

1. Background

Current Heterogeneous Computing Landscape

Current Virtual Machines

Our Solution: VM + Heterogeneous Runtime

2. Tornado: A Practical Heterogeneous Programming Framework

Tornado
• A Java-based Heterogeneous Programming Framework
• It exposes a task-based parallel programming API
• It contains an OpenCL JIT Compiler and a Runtime for running on heterogeneous devices
• Modular system currently using:
  – OpenJDK/Graal
  – OpenCL
• It currently runs on CPUs, GPUs and FPGAs*

Tornado Overview

Tornado API: @Parallel
"It’s a developer provided annotation that instructs the JIT compiler that it is OK for each iteration to be executed independently."
It does not specify or imply:
• iterations should be executed in parallel;
• the parallelization scheme to be used
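As a minimal sketch of what the annotation looks like at the source level, the example below declares a stand-in `@Parallel` annotation and applies it to a loop variable. The stand-in is hypothetical and exists only so the example compiles on its own; Tornado ships its own annotation, whose package is not shown in these slides. Plain Java ignores the annotation, so the loop still runs sequentially with the same result, which matches the point that the annotation permits independent execution without requiring it.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Target;
import java.util.Arrays;

// Hypothetical stand-in for Tornado's annotation, declared here only so the
// example is self-contained.
@Target(ElementType.LOCAL_VARIABLE)
@interface Parallel {}

class VectorAdd {
    // Each iteration reads a[i], b[i] and writes only c[i], so no iteration
    // depends on another: this is the property the annotation asserts.
    static void add(int[] a, int[] b, int[] c) {
        for (@Parallel int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] b = {10, 20, 30};
        int[] c = new int[3];
        add(a, b, c);
        System.out.println(Arrays.toString(c)); // prints [11, 22, 33]
    }
}
```

A loop such as `c[i] = c[i - 1] + a[i]` would not qualify, because each iteration depends on the previous one.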

Task Schedules
"A task schedule describes how to co-ordinate the execution of tasks across heterogeneous hardware."
• Composability
• Sequential consistency
• Task-based parallelism
• Automatic and optimised data movement

Tornado API: enabling task-based parallelism

Task Schedules: example

    class Ex {
        public static void multiply(Double4[] a, Double4[] b, Double4[] c) {
            // code here
        }

        public static void add(Double4[] a, Double4[] b, Double4[] c) {
            // code here
        }
    }
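To make the idea of composing tasks such as `multiply` and `add` concrete, here is a toy, purely sequential stand-in for a task schedule. `ToySchedule` and `Task3` are hypothetical names invented for this sketch and are not Tornado's API; `double[]` replaces `Double4[]` so the code runs without Tornado. It only illustrates two of the listed properties: tasks are composed by chaining, and they execute in the order they were added.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, sequential stand-in for a task schedule; NOT Tornado's API.
class ToySchedule {
    // A task taking three arguments, matching the shape of multiply/add.
    interface Task3<A, B, C> {
        void run(A a, B b, C c);
    }

    private final String name; // kept only to mirror a named schedule
    private final List<Runnable> tasks = new ArrayList<>();

    ToySchedule(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    // Records a task together with its arguments; nothing runs yet.
    <A, B, C> ToySchedule add(Task3<A, B, C> task, A a, B b, C c) {
        tasks.add(() -> task.run(a, b, c));
        return this;
    }

    // Runs the recorded tasks one after another, in insertion order.
    void execute() {
        for (Runnable t : tasks) {
            t.run();
        }
    }
}

class Ex {
    static void multiply(double[] a, double[] b, double[] c) {
        for (int i = 0; i < c.length; i++) c[i] = a[i] * b[i];
    }

    static void add(double[] a, double[] b, double[] c) {
        for (int i = 0; i < c.length; i++) c[i] = a[i] + b[i];
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {4, 5, 6};
        double[] tmp = new double[3];
        double[] out = new double[3];
        new ToySchedule("s0")
            .add(Ex::multiply, a, b, tmp) // tmp = a * b
            .add(Ex::add, tmp, b, out)    // out = tmp + b
            .execute();
        System.out.println(Arrays.toString(out)); // prints [8.0, 15.0, 24.0]
    }
}
```

A real heterogeneous runtime would additionally decide where each task runs and when data has to move; the toy leaves both out.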

Task Schedules: example

3. Tornado Runtime

Tornado: Workflow
[Diagram: a task schedule passes through the Tornado Runtime and the Tornado Compiler down to a pluggable driver, in seven numbered steps.]

Tornado API
• Task Schedule:

    new TaskSchedule("s0")
        .add(Ex1::add, a, b, c)
        .streamOut(c)
        .execute();

• Task Source:

    void add(int[] a, int[] b, int[] c) {
        for (@Parallel int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

• Task Graph: describes a data-flow graph; each node is a task

Tornado Runtime
• Graph Optimizer (Optimize Task Schedule): task placement, data-flow optimization, inserts low-level tasks
• Task Executor (Execute Task Schedule): maps tasks onto the driver API, triggers JIT compilation, triggers data movements

Tornado Compiler
• Sketcher: code reachability analysis, data dependency analysis; results go to the HIR Cache
• Code Generator: compiles cached sketches, parallelization, device-specific built-ins; results go to the Code Cache

Pluggable Driver
• OpenCL C: __kernel void foo(…) { … }
• OpenCL Runtime (Driver API): clEnqueueWriteBuffer(), clEnqueueNDRangeKernel(), clEnqueueReadBuffer()

Other labels in the diagram: Serialized, Runtime Optimizations, Task Execution

Data parallelism - Task specialisation
E.g., currently we have two parallel schemes: coarse-grain and fine-grain

    // Loop for GPUs
    int idx = get_global_id(0);
    int size = get_global_size(0);
    for (int i = idx; i < c.length; i += size) {
        // computation
        c[i] = a[i] + b[i];
    }

    // Loop for CPUs
    int id = get_global_id(0);
    int size = get_global_size(0);
    int block_size = (size + inputSize - 1) / size;
    int start = id * block_size;
    int end = min(start + block_size, c.length);
    for (int i = start; i < end; i++) {
        // computation
        c[i] = a[i] + b[i];
    }
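The two loop shapes can be checked on the JVM by emulating the work-items sequentially. The sketch below is not generated Tornado code: `id` and `size` stand in for `get_global_id(0)` and `get_global_size(0)`, the class and method names are made up for illustration, and `inputSize` is assumed to equal `c.length`. It shows that the strided (GPU-style) and blocked (CPU-style) schemes both touch every index exactly once and therefore produce the same result.

```java
import java.util.Arrays;

class LoopSchemes {
    // Strided scheme: work-item `id` handles id, id + size, id + 2*size, ...
    static void strided(int id, int size, int[] a, int[] b, int[] c) {
        for (int i = id; i < c.length; i += size) {
            c[i] += a[i] + b[i];
        }
    }

    // Blocked scheme: work-item `id` handles one contiguous block of indices.
    static void blocked(int id, int size, int[] a, int[] b, int[] c) {
        int blockSize = (c.length + size - 1) / size; // ceiling division
        int start = id * blockSize;
        int end = Math.min(start + blockSize, c.length);
        for (int i = start; i < end; i++) {
            c[i] += a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int n = 10;
        int size = 4; // number of emulated work-items
        int[] a = new int[n];
        int[] b = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = 100;
        }
        int[] c1 = new int[n];
        int[] c2 = new int[n];
        // Emulate the work-items one after another. `+=` is used in the loop
        // bodies so that an index visited twice would give a wrong value.
        for (int id = 0; id < size; id++) {
            strided(id, size, a, b, c1);
            blocked(id, size, a, b, c2);
        }
        System.out.println(Arrays.equals(c1, c2)); // prints true
    }
}
```

With 10 elements and 4 work-items the block size is 3, so the last work-item handles only index 9; the `Math.min` clamp is what keeps it inside the array.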

Memory Management
• Each heterogeneous device has a managed heap
• Enables objects to persist on devices
• Currently we duplicate objects which reside in the JVM heap
• No object creation on devices

4. Tornado JIT Compiler

Tornado JIT Compiler

5. Case study

Case study
Kinect Fusion: a complex computer vision application that reconstructs a 3D model from an RGB-D camera in real time.

Why KFusion?
• Not a normal Java application
• Complex multi-kernel pipeline
  – Sustains the execution of 540-1620 kernels per second
  – SLA of 30 FPS
• Representative of cutting-edge robotics/computer vision applications
• Want to deploy across many platform and accelerator combinations
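The two pipeline figures can be related with simple arithmetic. The sketch below assumes the quoted kernel rates are measured while running at exactly the 30 FPS target, which the slide does not state; the class and method names are illustrative only.

```java
import java.util.Locale;

class FrameBudget {
    // Kernels launched per frame, given a kernel rate and a frame rate.
    static int kernelsPerFrame(int kernelsPerSecond, int fps) {
        return kernelsPerSecond / fps;
    }

    // Time available to process one frame, in milliseconds.
    static double frameBudgetMillis(int fps) {
        return 1000.0 / fps;
    }

    public static void main(String[] args) {
        int fps = 30; // the SLA from the slide
        System.out.println(kernelsPerFrame(540, fps) + " to "
                + kernelsPerFrame(1620, fps) + " kernels per frame");
        System.out.println(String.format(Locale.ROOT, "%.1f ms per frame",
                frameBudgetMillis(fps)));
    }
}
```

Under that assumption each frame launches between 18 and 54 kernels and has to finish in about 33 ms.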

What did we get with Tornado?
Running on NVIDIA Tesla, up to 150 fps

And compared to native code?
[Plot: frames per second (0 to 250) against frame number (0 to 750) for OpenCL, Tornado-OR and Tornado-JR.]
Tornado is 28% slower than the best OpenCL native code.

6. Announcement & Conclusions

Tornado is now Open Source!
• We also have a poster tomorrow, come along!
• If you are interested, we can also show you demos on GPUs and FPGAs!

Takeaway
• We have presented Tornado
• We have shown runtime code generation for OpenCL
• We have shown a case study for computer vision
• It is open source, give it a try!
We are looking forward to your feedback!

Thank you very much for your attention
This work is partially supported by the EPSRC grants PAMELA EP/K008730/1 and AnyScale Apps EP/L000725/1, and the EU Horizon 2020 E2Data 780245.
Juan Fumero <juan.fumero@manchester.ac.uk>

Compilation times
[Bar chart: compilation time in seconds (0.00 to 0.20), OpenCL vs Graal, on AMD A10 7850K, Intel i7 4850HQ, Intel E5 2620, AMD Radeon R7, Intel Iris Pro 5200, NVIDIA GT 750M and NVIDIA Tesla K20m.]

OpenCL Device Driver: Just In Time Compiler
OpenCL JIT Compiler and Runtime
