Tornado VM: Running Java on GPUs and FPGAs
Juan Fumero, PhD (http://jjfumero.github.io)
QCon-London 2020, 3rd March 2020


  1. Tornado VM: Running Java on GPUs and FPGAs Juan Fumero, PhD http://jjfumero.github.io QCon-London 2020, 3rd March 2020

  2. Agenda
     1. Motivation & Background
     2. TornadoVM
        • API - examples
        • Runtime & Just In Time Compiler
        • Live Task Migration
        • Demos
     3. Performance Results
     4. Related Work & Future Directions
     5. Conclusions

  3. Who am I?
     Dr. Juan Fumero
     Lead Developer of TornadoVM
     Postdoc @ University of Manchester
     juan.fumero@manchester.ac.uk
     @snatverk

  4. Motivation

  5. Why should we care about GPUs/FPGAs, etc.?
     • CPU: Intel Ice Lake (10nm); 8 cores HT, AVX(512 SIMD); ~1 TFlops* (including the iGPU); TDP ~28W. Source: Intel docs
     • GPU: NVIDIA GP 100, Pascal, 16nm; 60 SMs, 64 cores each; 3584 FP32 cores; 10.6 TFlops (FP32); TDP ~300 Watts. Source: NVIDIA docs
     • FPGA: Intel FPGA Stratix 10 (14nm); reconfigurable hardware; ~10 TFlops; TDP ~225 Watts. Source: Intel docs

  6. What is a GPU? Graphics Processing Unit
     Contains a set of Streaming Multiprocessor cores (SMx)
     * Pascal arch. 60 SMx
     * ~3500 CUDA cores
     Users need to know:
     A) Programming model (normally CUDA or OpenCL)
     B) Details about the architecture are essential to achieve performance
        * Non sequential consistency, manual barriers, etc.
     Source: NVIDIA docs

  7. What is an FPGA? Field Programmable Gate Array
     You can configure the design of your hardware after manufacturing.
     It is like having "your algorithms directly wired on hardware" with only the parts you need.

  8. Current Computer Systems & Prog. Lang.

  9. Ideal System for Managed Languages

  10. TornadoVM

  11. Demo: Kinect Fusion with TornadoVM
      * Computer Vision Application
      * ~7K LOC
      * Thousands of OpenCL LOC generated.
      https://github.com/beehive-lab/kfusion-tornadovm

  12. TornadoVM Overview
      API: Tasks = Methods; Task-Schedulers = Group of Methods; Annotations

  13. TornadoVM Overview
      API: Tasks = Methods; Task-Schedulers = Group of Methods; Annotations
      Runtime: Data-Flow & Optimizer; TornadoVM Bytecode Generation

  14. TornadoVM Overview
      API: Tasks = Methods; Task-Schedulers = Group of Methods; Annotations
      Runtime: Data-Flow & Optimizer; TornadoVM Bytecode Generation
      Execution Engine: Bytecode interpreter; Device Drivers

  15. TornadoVM Overview
      API: Tasks = Methods; Task-Schedulers = Group of Methods; Annotations
      Runtime: Data-Flow & Optimizer; TornadoVM Bytecode Generation
      Execution Engine: Bytecode interpreter; Device Drivers; Device's heap
      Compiler: Just-In-Time Compiler; Graal JIT Extensions

  16. TornadoVM Overview
      API: Tasks = Methods; Task-Schedulers = Group of Methods; Annotations
      Runtime: Data-Flow & Optimizer; TornadoVM Bytecode Generation
      Execution Engine: Bytecode interpreter; Device Drivers; Device's heap
      Compiler: Just-In-Time Compiler; Graal JIT Extensions
      Requirements:
      • OpenJDK 8 > 141
      • OpenJDK 11
      • GraalVM 19.3.0
      • OpenCL >= 1.2
      Support for:
      • NVIDIA GPUs
      • Intel HD Graphics
      • AMD GPUs
      • Intel Altera FPGAs
      • Xilinx FPGAs
      • Multi-core CPUs

  17. Tornado API – example
      class Compute {
          public static void mxm(Matrix2DFloat A, Matrix2DFloat B,
                                 Matrix2DFloat C, final int size) {
              for (int i = 0; i < size; i++) {
                  for (int j = 0; j < size; j++) {
                      float sum = 0.0f;
                      for (int k = 0; k < size; k++) {
                          sum += A.get(i, k) * B.get(k, j);
                      }
                      C.set(i, j, sum);
                  }
              }
          }
      }

  18. Tornado API – example
      We add the parallel annotation as a hint for the compiler.
      class Compute {
          public static void mxm(Matrix2DFloat A, Matrix2DFloat B,
                                 Matrix2DFloat C, final int size) {
              for (@Parallel int i = 0; i < size; i++) {
                  for (@Parallel int j = 0; j < size; j++) {
                      float sum = 0.0f;
                      for (int k = 0; k < size; k++) {
                          sum += A.get(i, k) * B.get(k, j);
                      }
                      C.set(i, j, sum);
                  }
              }
          }
      }

  19. Tornado API – example
      class Compute {
          public static void mxm(Matrix2DFloat A, Matrix2DFloat B,
                                 Matrix2DFloat C, final int size) {
              for (@Parallel int i = 0; i < size; i++) {
                  for (@Parallel int j = 0; j < size; j++) {
                      float sum = 0.0f;
                      for (int k = 0; k < size; k++) {
                          sum += A.get(i, k) * B.get(k, j);
                      }
                      C.set(i, j, sum);
                  }
              }
          }
      }

      TaskSchedule ts = new TaskSchedule("s0");
      ts.task("t0", Compute::mxm, matrixA, matrixB, matrixC, size)
        .streamOut(matrixC)
        .execute();

  20. Tornado API – example
      class Compute {
          public static void mxm(Matrix2DFloat A, Matrix2DFloat B,
                                 Matrix2DFloat C, final int size) {
              for (@Parallel int i = 0; i < size; i++) {
                  for (@Parallel int j = 0; j < size; j++) {
                      float sum = 0.0f;
                      for (int k = 0; k < size; k++) {
                          sum += A.get(i, k) * B.get(k, j);
                      }
                      C.set(i, j, sum);
                  }
              }
          }
      }

      TaskSchedule ts = new TaskSchedule("s0");
      ts.task("t0", Compute::mxm, matrixA, matrixB, matrixC, size)
        .streamOut(matrixC)
        .execute();

      To run:
      $ tornado Compute
      The tornado command is just an alias to java plus all the parameters needed to enable TornadoVM.
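
When trying the matrix-multiplication example, it can help to have a sequential reference to compare results against. The following is a minimal plain-Java sketch with no TornadoVM dependency. It assumes flat, row-major float[] arrays in place of Matrix2DFloat, and the class name MxmReference is made up for illustration.

```java
import java.util.Arrays;

// Plain-Java sketch, not TornadoVM code: flat row-major float[] arrays
// stand in for Matrix2DFloat (an assumption to keep this self-contained).
final class MxmReference {

    // Same loop nest as Compute.mxm on the slides, executed sequentially.
    // Under TornadoVM, the i and j loops are the ones annotated with @Parallel.
    static void mxm(float[] a, float[] b, float[] c, final int size) {
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    sum += a[i * size + k] * b[k * size + j];
                }
                c[i * size + j] = sum;
            }
        }
    }

    public static void main(String[] args) {
        final int size = 2;
        float[] a = {1, 2, 3, 4};   // [[1, 2], [3, 4]]
        float[] b = {5, 6, 7, 8};   // [[5, 6], [7, 8]]
        float[] c = new float[size * size];
        mxm(a, b, c, size);
        System.out.println(Arrays.toString(c)); // prints [19.0, 22.0, 43.0, 50.0]
    }
}
```

Each (i, j) iteration writes a distinct element of c and only reads a and b, which is why the two outer loops are safe candidates for the parallel hint.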

  21. Demo: Running Matrix Multiplication https://github.com/jjfumero/qconlondon2020-tornadovm

  22. TornadoVM Compiler & Runtime Overview

  23. TornadoVM & Dynamic Languages

  24. TornadoVM & Dynamic Languages

  25. Demo 2: Node.js example https://github.com/jjfumero/qconlondon2020-tornadovm

  26. TornadoVM Compiler & Runtime Overview

  27. TornadoVM Compiler & Runtime Overview

  28. TornadoVM JIT Compiler Specializations

  29. FPGA Specializations
      void compute(float[] input, float[] output) {
          for (@Parallel int i = 0; …) {
              for (int j = 0; ...) {
                  // Computation
              }
          }
      }
      From slowdowns without specializations to 240x with automatic specializations on Intel FPGAs

  30. TornadoVM: VM in a VM

  31. TornadoVM: VM in a VM

  32. TornadoVM Bytecodes - Example

  33. TornadoVM Bytecodes - Example

  34. TornadoVM Bytecodes - Example

  35. TornadoVM Bytecodes - Example

  36. TornadoVM Bytecodes - Example

  37. TornadoVM Bytecodes - Example

  38. TornadoVM Bytecodes - Example

  39. TornadoVM Bytecodes - Example

  40. Batch Processing: 16GB into 1GB GPU

  41. Batch Processing: 16GB into 1GB GPU

  42. Batch Processing: 16GB into 1GB GPU
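
The slide titles only state the goal (pushing 16GB of data through a GPU with 1GB of memory). As a rough illustration of the general idea, not of TornadoVM's actual implementation, the sketch below splits the input into chunks that fit a fixed-size buffer and runs the same kernel on each chunk. The class BatchSketch, its methods, and the doubling "kernel" are all hypothetical.

```java
// Plain-Java sketch of batch processing, not TornadoVM code.
// A small reusable array plays the role of the limited device memory.
final class BatchSketch {

    // Number of batches needed to cover totalBytes with at most batchBytes each
    // (ceiling division).
    static long batchCount(long totalBytes, long batchBytes) {
        return (totalBytes + batchBytes - 1) / batchBytes;
    }

    // Runs a stand-in kernel (doubling each element) one batch at a time.
    static void processInBatches(float[] input, float[] output, int batchElements) {
        float[] deviceBuffer = new float[batchElements];
        for (int start = 0; start < input.length; start += batchElements) {
            int length = Math.min(batchElements, input.length - start);
            System.arraycopy(input, start, deviceBuffer, 0, length);   // copy in
            for (int i = 0; i < length; i++) {                         // kernel
                deviceBuffer[i] = deviceBuffer[i] * 2.0f;
            }
            System.arraycopy(deviceBuffer, 0, output, start, length);  // copy out
        }
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        System.out.println(batchCount(16 * gb, gb)); // prints 16

        float[] input = new float[10];
        for (int i = 0; i < input.length; i++) {
            input[i] = i;
        }
        float[] output = new float[input.length];
        processInBatches(input, output, 4);          // batches of 4, 4 and 2 elements
        System.out.println(output[9]);               // prints 18.0
    }
}
```

This only works unchanged when each output element depends on data inside its own batch; kernels that read across batch boundaries would need extra handling.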

  43. Live Task Migration

  44. Dynamic Reconfiguration

  45. Dynamic Reconfiguration

  46. Dynamic Reconfiguration

  47. How is the decision made?
      • End-to-end: including JIT compilation time
      • Peak Performance: without JIT and after warming-up
      • Latency: does not wait for all threads to finish
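
The first two policies differ only in whether JIT compilation time counts towards a device's cost. The sketch below illustrates that difference in plain Java. The names (PolicySketch, Policy, DeviceTiming, select) are hypothetical and are not TornadoVM's API, and the timings are invented numbers chosen only to show the two policies disagreeing.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of policy-based device selection, not TornadoVM code.
final class PolicySketch {

    // Hypothetical policy names mirroring the first two bullets on the slide.
    enum Policy { END_TO_END, PEAK_PERFORMANCE }

    // Hypothetical measurement per device: JIT compilation time and
    // kernel time after warm-up, both in milliseconds.
    static final class DeviceTiming {
        final String device;
        final double jitMillis;
        final double kernelMillis;

        DeviceTiming(String device, double jitMillis, double kernelMillis) {
            this.device = device;
            this.jitMillis = jitMillis;
            this.kernelMillis = kernelMillis;
        }
    }

    // Returns the device with the lowest cost under the given policy.
    static String select(List<DeviceTiming> timings, Policy policy) {
        if (timings.isEmpty()) {
            throw new IllegalArgumentException("no device timings");
        }
        DeviceTiming best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (DeviceTiming t : timings) {
            double cost = (policy == Policy.END_TO_END)
                    ? t.jitMillis + t.kernelMillis   // compilation counts
                    : t.kernelMillis;                // warmed-up kernel only
            if (best == null || cost < bestCost) {
                best = t;
                bestCost = cost;
            }
        }
        return best.device;
    }

    public static void main(String[] args) {
        // Invented numbers, for illustration only.
        List<DeviceTiming> timings = Arrays.asList(
                new DeviceTiming("cpu", 50, 400),
                new DeviceTiming("gpu", 300, 20),
                new DeviceTiming("fpga", 60000, 10));
        System.out.println(select(timings, Policy.END_TO_END));       // prints gpu
        System.out.println(select(timings, Policy.PEAK_PERFORMANCE)); // prints fpga
    }
}
```

The latency policy is not modeled here: it would return as soon as the first device finishes instead of comparing completed measurements.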

  48. Demo: Live Task Migration – Server/Client App https://github.com/jjfumero/qconlondon2020-tornadovm

  49. New compilation tier for Heterogeneous Systems

  50. New compilation tier for Heterogeneous Systems

  51. Related Work

  52. Related Work (in the Java context)
      Project | Production-Ready | Supported Devices | Live Task Migration | Compiler Specializations | Dynamic Languages
      Sumatra | No | AMD GPUs | No | No | No
      Marawacc | No | Multi-core, GPUs | No | No | No
      JaBEE | No | NVIDIA GPUs | No | No | No
      RootBeer | No | NVIDIA GPUs | No | No | No
      Aparapi | Yes | GPUs, multi-core | No | No | No
      IBM GPU J9 | Yes | NVIDIA GPUs | No | No | No
      grCUDA | No (*) | NVIDIA GPUs | No | No | Yes
      TornadoVM | Not yet (*) | Multi-core, GPUs, FPGAs | Yes | Yes | Yes

  53. Related Work (in the Java context)
      Project | Production-Ready | Supported Devices | Live Task Migration | Compiler Specializations | Dynamic Languages
      Sumatra | No | AMD GPUs | No | No | No
      Marawacc | No | Multi-core, GPUs | No | Yes | Yes
      JaBEE | No | NVIDIA GPUs | No | No | No
      RootBeer | No | NVIDIA GPUs | No | No | No
      Aparapi | Yes | GPUs, multi-core | No | No | No
      IBM GPU J9 | Yes | NVIDIA GPUs | No | No | No
      grCUDA | No (*) | NVIDIA GPUs | No | No | Yes
      TornadoVM | Not yet (*) | Multi-core, GPUs, FPGAs | Yes | Yes | Yes

  54. Ok, cool! What about performance?

  55. Performance
      * TornadoVM performs up to 7.7x faster than the best statically selected device.
      * Up to >4500x over Java sequential
      Devices:
      - NVIDIA GTX 1060
      - Intel FPGA Nallatech 385a
      - Intel Core i7-7700K

  56. Performance on GPUs, iGPUs, and CPUs

  57. More details in our papers! https://github.com/beehive-lab/TornadoVM/blob/master/assembly/src/docs/Publications.md

  58. Limitations & Future Work

  59. Limitations
      We inherit limitations from the underlying programming model:
      • No object support (except for a few cases)
      • No recursion
      • No dynamic memory allocation (*)
      • No support for exceptions (*)

  60. Future Work
      • GPU/FPGA full capabilities
      • Exploitation of tiered memories such as local memory (in progress)
      • Policies for energy efficiency
      • Multi-device within a task-schedule
      • More parallel skeletons (reductions, stencil, scan, filter, …)
      • PTX Backend for NVIDIA

  61. Current Applicability of TornadoVM

  62. EU H2020 E2Data Project https://e2data.eu/
      "End-to-end solutions for Big Data deployments that fully exploit heterogeneous hardware"
      European Union’s Horizon H2020 research and innovation programme under grant agreement No 780245
