Easy and High Performance GPU Programming for Java Programmers (GTC 2016)



SLIDE 1

GTC 2016

Kazuaki Ishizaki (kiszk@acm.org) +, Gita Koblents -, Alon Shalev Housfater -, Jimmy Kwa -, Marcel Mitran -, Akihiro Hayashi *, Vivek Sarkar *

+ IBM Research – Tokyo

- IBM Canada

* Rice University

Easy and High Performance GPU Programming for Java Programmers


SLIDE 2

Java Program Runs on GPU with IBM Java 8

http://www-01.ibm.com/support/docview.wss?uid=swg21696670

https://devblogs.nvidia.com/parallelforall/next-wave-enterprise-performance-java-power-systems-nvidia-gpus/

SLIDE 3

Java Meets GPUs


SLIDE 4

What You Will Learn from this Talk

  • How to program GPUs in pure Java
    – using standard parallel stream APIs
  • How the IBM Java 8 runtime executes the parallel program on GPUs
    – with optimizations, without annotations:
      • GPU read-only cache exploitation
      • data copy reductions between CPU and GPU
      • exception check eliminations for Java
  • Achieved good performance results using one K40 card:
    – 58.9x over 1-CPU-thread sequential execution on POWER8
    – 3.7x over 160-CPU-thread parallel execution on POWER8


SLIDE 5

Outline

  • Goal
  • Motivation
  • How to Write a Parallel Program in Java
  • Overview of IBM Java 8 Runtime
  • Performance Evaluation
  • Conclusion


SLIDE 6

Why We Want to Use Java for GPU Programming

  • High productivity
    – Safety and flexibility
    – Good program portability among different machines
      • “write once, run anywhere”
    – Ease of writing a program
  • Hard to use CUDA and OpenCL for non-expert programmers
  • Many computation-intensive applications in non-HPC areas
    – Data analytics and data science (Hadoop, Spark, etc.)
    – Security analysis (events in log files)
    – Natural language processing (messages in social network systems)


From https://www.flickr.com/photos/dlato/5530553658

SLIDE 7

Programmability of CUDA vs. Java for GPUs

  • CUDA requires programmers to explicitly write operations for
    – managing device memories
    – copying data between CPU and GPU
    – expressing parallelism
  • Java 8 enables programmers to just focus on
    – expressing parallelism


// Java code
void fooJava(float[] a, float[] b, int n) {
  // similar to: for (int i = 0; i < n; i++)
  IntStream.range(0, n).parallel().forEach(i -> {
    b[i] = a[i] * 2.0f;
  });
}

// CUDA code for GPU
__global__ void GPU(float *d_a, float *d_b, int n) {
  int i = threadIdx.x;
  if (n <= i) return;
  d_b[i] = d_a[i] * 2.0;
}

// CUDA code for CPU
void fooCUDA(float *A, float *B, int N) {
  int sizeN = N * sizeof(float);
  float *d_A, *d_B;
  cudaMalloc(&d_A, sizeN);
  cudaMalloc(&d_B, sizeN);
  cudaMemcpy(d_A, A, sizeN, cudaMemcpyHostToDevice);
  GPU<<<N, 1>>>(d_A, d_B, N);
  cudaMemcpy(B, d_B, sizeN, cudaMemcpyDeviceToHost);
  cudaFree(d_B);
  cudaFree(d_A);
}

SLIDE 8

Safety and Flexibility in Java

  • Automatic memory management
    – No memory leaks
  • Object-oriented
  • Exception checks
    – No unsafe memory accesses


float[] a = new float[N], b = new float[N];
new Par().foo(a, b, N);
// unnecessary to explicitly free a[] and b[]

class Par {
  void foo(float[] a, float[] b, int n) {
    // similar to: for (int i = 0; i < n; i++)
    IntStream.range(0, n).parallel().forEach(i -> {
      // throws an exception if
      // a[] == null, b[] == null,
      // i < 0, a.length <= i, or b.length <= i
      b[i] = a[i] * 2.0f;
    });
  }
}
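These checks can be observed on any standard JVM. A small self-contained sketch (the class name BoundsDemo is made up for illustration) shows that an out-of-bounds access inside the parallel loop surfaces as an exception for the caller:

```java
import java.util.stream.IntStream;

public class BoundsDemo {
    public static void main(String[] args) {
        float[] a = new float[4];
        float[] b = new float[3]; // deliberately too short
        try {
            IntStream.range(0, 4).parallel().forEach(i -> {
                b[i] = a[i] * 2.0f; // i == 3 is out of bounds for b[]
            });
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("caught out-of-bounds access");
        }
    }
}
```

The parallel stream propagates the first exception thrown by any iteration back to the calling thread, so an unsafe access cannot silently corrupt memory as it could in CUDA.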

SLIDE 9

Portability among Different Hardware

  • How a Java program works
    – the ‘javac’ command creates machine-independent Java bytecode
    – the ‘java’ command launches the Java runtime with the Java bytecode
      • An interpreter executes a program by processing each Java bytecode
      • A just-in-time compiler generates native instructions for a target machine from the Java bytecode of a hotspot method


[Figure: ‘javac Seq.java’ compiles a Java program (.java) into Java bytecode (.class, .jar); ‘java Seq’ runs the bytecode on the Java runtime, whose interpreter and just-in-time compiler execute it on the target machine]

SLIDE 10

Outline

  • Goal
  • Motivation
  • How to Write a Parallel Program in Java
  • Overview of IBM Java 8 Runtime
  • Performance Evaluation
  • Conclusion


SLIDE 11

How to Write a Parallel Loop in Java 8

  • Express parallelism by using parallel stream APIs among iterations of a lambda expression (index variable: i)


Example:

IntStream.range(0, 5).parallel()
         .forEach(i -> { System.out.println(i); });

Example output: 0 3 2 4 1

The reference implementation of Java 8 can execute this on multiple CPU threads, e.g. over time:

println(0) on thread 0
println(3) on thread 1
println(2) on thread 2
println(4) on thread 3
println(1) on thread 0
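This behavior can be checked on any JVM. A self-contained sketch (the class name ParallelOrder is made up) records the indices instead of printing them, which makes the nondeterministic order easy to inspect:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;

public class ParallelOrder {
    public static void main(String[] args) {
        // Thread-safe list records the order in which iterations actually ran
        List<Integer> seen = Collections.synchronizedList(new ArrayList<>());
        IntStream.range(0, 5).parallel().forEach(seen::add);
        // Every index 0..4 runs exactly once, but in no guaranteed order
        System.out.println(seen);
        List<Integer> sorted = new ArrayList<>(seen);
        Collections.sort(sorted);
        System.out.println(sorted); // always [0, 1, 2, 3, 4]
    }
}
```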

SLIDE 12

Outline

  • Goal
  • Motivation
  • How to Write and Execute a Parallel Program in Java
  • Overview of IBM Java 8 Runtime
  • Performance Evaluation
  • Conclusion


SLIDE 13

Portability among Different Hardware (including GPUs)

  • The just-in-time compiler in the IBM Java 8 runtime generates native instructions
    – for a target machine, including GPUs, from Java bytecode
    – for GPUs, exploiting device-specific capabilities more easily than OpenCL


[Figure: ‘javac Par.java’ compiles the program into Java bytecode (.class, .jar); ‘java Par’ runs it on the IBM Java 8 runtime, whose just-in-time compiler generates code for the target machine and for the GPU from a parallel loop such as IntStream.range(0, n).parallel().forEach(i -> { ... })]

SLIDE 14

IBM Java 8 Can Execute the Code on CPU or GPU

  • Generates code for GPU execution from a parallel loop
    – GPU instructions for the code in blue (the lambda body)
    – CPU instructions for GPU memory management and data copy
  • Executes this loop on the CPU or the GPU based on a cost model
    – e.g., executes on the CPU if ‘n’ is very small


class Par {
  void foo(float[] a, float[] b, float[] c, int n) {
    IntStream.range(0, n).parallel().forEach(i -> {
      b[i] = a[i] * 2.0f;
      c[i] = a[i] * 3.0f;
    });
  }
}

Note: GPU support in current version is limited to lambdas with one-dimensional arrays and primitive types
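For reference, a complete runnable version of the class above with a small, hypothetical driver; on a JVM without GPU support the loop simply runs across CPU threads and produces the same results:

```java
import java.util.stream.IntStream;

public class Par {
    // Parallel loop from the slide; IBM's JIT may offload it to the GPU,
    // while any other JVM runs it across CPU threads with the same result.
    void foo(float[] a, float[] b, float[] c, int n) {
        IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        });
    }

    public static void main(String[] args) {
        int n = 4;
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = new float[n], c = new float[n];
        new Par().foo(a, b, c, n);
        System.out.println(b[1] + " " + c[1]); // prints "4.0 6.0"
    }
}
```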

SLIDE 15

Optimizations for GPUs in IBM Just-In-Time Compiler

  • Using the read-only cache
    – reduces the number of memory transactions to the GPU global memory
  • Optimizing data copy between CPU and GPU
    – reduces the amount of data copied
  • Eliminating redundant exception checks for Java on the GPU
    – reduces the number of instructions in the GPU binary


SLIDE 16

Using Read-Only Cache

  • Automatically detect a read-only array and access it through the read-only cache
    – the read-only cache is faster than other memories in the GPU


float[] A = new float[N], B = new float[N], C = new float[N];
foo(A, B, C, N);

void foo(float[] a, float[] b, float[] c, int n) {
  IntStream.range(0, n).parallel().forEach(i -> {
    b[i] = a[i] * 2.0f;
    c[i] = a[i] * 3.0f;
  });
}

Equivalent to CUDA code:

__device__ void foo(float *a, float *b, float *c, int n) {
  ...
  b[i] = __ldg(&a[i]) * 2.0;
  c[i] = __ldg(&a[i]) * 3.0;
}

SLIDE 17

Optimizing Data Copy between CPU and GPU

  • Eliminate data copy from GPU to CPU
    – if an array (e.g., a[]) is not written on the GPU
  • Eliminate data copy from CPU to GPU
    – if an array (e.g., b[] and c[]) is not read on the GPU


void foo(float[] a, float[] b, float[] c, int n) {
  // Data copy for a[] from CPU to GPU;
  // no data copy for b[] and c[]
  IntStream.range(0, n).parallel().forEach(i -> {
    b[i] = a[i] * 2.0f;
    c[i] = a[i] * 3.0f;
  });
  // Data copy for b[] and c[] from GPU to CPU;
  // no data copy for a[]
}

SLIDE 18

Optimizing Data Copy between CPU and GPU

  • Eliminate data copy between CPU and GPU
    – if an array (e.g., a[] and b[]) that was accessed on the GPU is not accessed on the CPU


// Data copy for a[] from CPU to GPU
for (int t = 0; t < T; t++) {
  IntStream.range(0, N*N).parallel().forEach(idx -> {
    b[idx] = a[...];
  });
  // No data copy for b[] between GPU and CPU
  IntStream.range(0, N*N).parallel().forEach(idx -> {
    a[idx] = b[...];
  });
  // No data copy for a[] between GPU and CPU
}
// Data copy for a[] and b[] from GPU to CPU
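A runnable sketch of this double-buffer pattern follows; since the slide elides the actual array indices, a simple made-up element update is substituted. On a standard JVM the arrays live in host memory, but this access pattern is what lets IBM's JIT keep a[] and b[] on the GPU across all T iterations:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class DoubleBuffer {
    public static void main(String[] args) {
        final int N = 4, T = 10;
        double[] a = new double[N * N], b = new double[N * N];
        Arrays.fill(a, 1.0);
        // a[] would be copied to the GPU once, here
        for (int t = 0; t < T; t++) {
            // b[] never crosses the CPU-GPU boundary inside the loop
            IntStream.range(0, N * N).parallel().forEach(idx -> b[idx] = a[idx] * 0.5);
            IntStream.range(0, N * N).parallel().forEach(idx -> a[idx] = b[idx] * 2.0);
        }
        // a[] and b[] would be copied back once, here
        System.out.println(a[0] + " " + b[0]); // prints "1.0 0.5"
    }
}
```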

SLIDE 19

How to Support Exception Checks on GPUs

  • The IBM just-in-time compiler inserts exception checks in the GPU kernel


// code for CPU
{
  ...
  launch GPUkernel(...)
  if (exception) { goto handle_exception; }
  ...
}

__device__ GPUkernel(...) {
  int i = ...;
  if ((a == NULL) || i < 0 || a.length <= i) {
    exception = true; return;
  }
  if ((b == NULL) || b.length <= i) {
    exception = true; return;
  }
  b[i] = a[i] * 2.0;
  if ((c == NULL) || c.length <= i) {
    exception = true; return;
  }
  c[i] = a[i] * 3.0;
}

// Java program
IntStream.range(0, n).parallel().forEach(i -> {
  b[i] = a[i] * 2.0f;
  c[i] = a[i] * 3.0f;
});

SLIDE 20

Eliminating Redundant Exception Checks

  • Speculatively perform exception checks on the CPU if the form of an array index is simple (x*i + y)


// code for CPU
if ( // check conditions for null pointers
     a != null && b != null && c != null &&
     // check conditions for out-of-bounds array indexes
     n <= a.length && n <= b.length && n <= c.length) {
  ...
  launch GPUkernel(...)
  ...
} else {
  // execute this loop on the CPU to produce the exception
}

__device__ GPUkernel(...) {
  // no exception check is required
  i = ...;
  b[i] = a[i] * 2.0;
  c[i] = a[i] * 3.0;
}

IntStream.range(0, n).parallel().forEach(i -> {
  b[i] = a[i] * 2.0f;
  c[i] = a[i] * 3.0f;
});
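The compiler's speculative guard can be mimicked in plain Java. This hypothetical helper (not part of the IBM runtime) returns true exactly when the loop above cannot raise a NullPointerException or ArrayIndexOutOfBoundsException for any i in [0, n), which is the condition under which per-iteration checks can be dropped:

```java
public class SpeculativeCheck {
    // True iff every access a[i], b[i], c[i] with 0 <= i < n is safe,
    // i.e. the GPU kernel could run without per-iteration exception checks.
    static boolean safeWithoutChecks(float[] a, float[] b, float[] c, int n) {
        return a != null && b != null && c != null
            && n <= a.length && n <= b.length && n <= c.length;
    }

    public static void main(String[] args) {
        System.out.println(safeWithoutChecks(new float[4], new float[4], new float[4], 4)); // true
        System.out.println(safeWithoutChecks(new float[4], new float[3], new float[4], 4)); // false: b[] too short
        System.out.println(safeWithoutChecks(null, new float[4], new float[4], 4));         // false: null pointer
    }
}
```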

SLIDE 21

Outline

  • Goal
  • Motivation
  • How to Write and Execute a Parallel Program in Java
  • Overview of IBM Java 8 Runtime
  • Performance Evaluation
  • Conclusion


SLIDE 22

Performance Evaluation Methodology

  • Measured performance improvement by the GPU using four programs (on the next slide) over
    – 1-CPU-thread sequential execution
    – 160-CPU-thread parallel execution
  • Experimental environment used
    – IBM Java 8 Service Release 2 for PowerPC Little Endian
      • Download for free at http://www.ibm.com/java/jdk/
    – Two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256 GB memory (160 hardware threads in total)
      • With one NVIDIA Kepler K40m GPU (2880 CUDA cores in total) at 876 MHz with 12 GB global memory (ECC off)
    – Ubuntu 14.10, CUDA 5.5


SLIDE 23

Benchmark Programs

  • Prepare sequential and parallel stream API versions in Java


Name      Summary                                       Data size           Type
MM        A dense matrix multiplication: C = A.B        1,024 × 1,024       double
SpMM      A sparse matrix multiplication: C = A.B       500,000 × 500,000   double
Jacobi2D  Solve an equation using the Jacobi method     8,192 × 8,192       double
LifeGame  Conway’s Game of Life, iterated 10,000 times  512 × 512           byte
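The benchmarks' source code is not shown in the slides; as an illustrative sketch, MM written in the same parallel-stream style (one parallel iteration per output row, row-major 1-D arrays as the GPU support requires) might look like:

```java
import java.util.stream.IntStream;

public class MM {
    // C = A.B for n x n row-major matrices stored as 1-D arrays,
    // parallelized over rows with a parallel stream.
    static void mm(double[] A, double[] B, double[] C, int n) {
        IntStream.range(0, n).parallel().forEach(i -> {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += A[i * n + k] * B[k * n + j];
                }
                C[i * n + j] = sum;
            }
        });
    }

    public static void main(String[] args) {
        double[] A = {1, 2, 3, 4};   // [[1,2],[3,4]]
        double[] B = {5, 6, 7, 8};   // [[5,6],[7,8]]
        double[] C = new double[4];
        mm(A, B, C, 2);
        System.out.println(java.util.Arrays.toString(C)); // [19.0, 22.0, 43.0, 50.0]
    }
}
```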

SLIDE 24

Performance Improvements of GPU Version over Sequential and Parallel CPU Versions

  • Achieved 58.9x on geomean and 317.0x for Jacobi2D over 1 CPU thread
  • Achieved 3.7x on geomean and 14.8x for Jacobi2D over 160 CPU threads
  • Performance degraded for SpMM relative to 160 CPU threads


SLIDE 25

Conclusion

  • Program GPUs using pure Java with standard parallel stream APIs
  • The IBM Java 8 runtime compiles a Java program for GPUs, without annotations, with optimizations:
    – read-only cache exploitation
    – data copy optimizations between CPU and GPU
    – exception check eliminations
  • Offers performance improvements using GPUs of
    – 58.9x over sequential execution
    – 3.7x over 160-CPU-thread parallel execution


Details are in our paper “Compiling and Optimizing Java 8 Programs for GPU Execution” (PACT2015)