
Easy and High Performance GPU Programming for Java Programmers (GTC 2016)


  1. Easy and High Performance GPU Programming for Java Programmers
     GTC 2016
     Kazuaki Ishizaki (kiszk@acm.org) +, Gita Koblents -, Alon Shalev Housfater -, Jimmy Kwa -, Marcel Mitran -, Akihiro Hayashi *, Vivek Sarkar *
     + IBM Research – Tokyo
     - IBM Canada
     * Rice University

  2. Java Program Runs on GPU with IBM Java 8
     http://www-01.ibm.com/support/docview.wss?uid=swg21696670
     https://devblogs.nvidia.com/parallelforall/next-wave-enterprise-performance-java-power-systems-nvidia-gpus/

  3. Java Meets GPUs

  4. What You Will Learn from this Talk
     • How to program GPUs in pure Java
       – using standard parallel stream APIs
     • How IBM Java 8 runtime executes the parallel program on GPUs
       – with optimizations without annotations
         • GPU read-only cache exploitation
         • data copy reductions between CPU and GPU
         • exception check eliminations for Java
     • Achieve good performance results using one K40 card with
       – 58.9x over 1-CPU-thread sequential execution on POWER8
       – 3.7x over 160-CPU-thread parallel execution on POWER8

  5. Outline
     • Goal
     • Motivation
     • How to Write a Parallel Program in Java
     • Overview of IBM Java 8 Runtime
     • Performance Evaluation
     • Conclusion

  6. Why We Want to Use Java for GPU Programming
     • High productivity
       – Safety and flexibility
       – Good program portability among different machines
         • “write once, run anywhere”
       – Ease of writing a program
         • Hard to use CUDA and OpenCL for non-expert programmers
     • Many computation-intensive applications in non-HPC areas
       – Data analytics and data science (Hadoop, Spark, etc.)
       – Security analysis (events in log files)
       – Natural language processing (messages in social network systems)
     Image from https://www.flickr.com/photos/dlato/5530553658

  7. Programmability of CUDA vs. Java for GPUs
     • CUDA requires programmers to explicitly write operations for
       – managing device memories
       – copying data between CPU and GPU
       – expressing parallelism

         // code for GPU
         __global__ void GPU(float *d_a, float *d_b, int n) {
           int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
           if (n <= i) return;
           d_b[i] = d_a[i] * 2.0f;
         }

         void fooCUDA(float *A, float *B, int N) {
           float *d_A, *d_B;
           int sizeN = N * sizeof(float);
           cudaMalloc(&d_A, sizeN); cudaMalloc(&d_B, sizeN);   // manage device memory
           cudaMemcpy(d_A, A, sizeN, cudaMemcpyHostToDevice);   // copy CPU -> GPU
           GPU<<<N, 1>>>(d_A, d_B, N);                          // express parallelism
           cudaMemcpy(B, d_B, sizeN, cudaMemcpyDeviceToHost);   // copy GPU -> CPU
           cudaFree(d_B); cudaFree(d_A);
         }

     • Java 8 enables programmers to just focus on
       – expressing parallelism

         void fooJava(float[] a, float[] b, int n) {
           // similar to for (i = 0; i < n; i++)
           IntStream.range(0, n).parallel().forEach(i -> {
             b[i] = a[i] * 2.0f;
           });
         }
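The Java side of this comparison can be tried on any standard Java 8 runtime; without GPU support the same loop simply runs on CPU threads. The following is a minimal sketch, not code from the talk: the class name `ScaleArray` and the sample values are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class ScaleArray {
    // Writes a[i] * 2 into b[i] for the first n elements.
    // Each index is handled by an independent iteration of the parallel stream.
    static void fooJava(float[] a, float[] b, int n) {
        IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f; // float literal; a double literal would not compile here
        });
    }

    public static void main(String[] args) {
        float[] a = {1.0f, 2.5f, -3.0f, 0.0f}; // illustrative input
        float[] b = new float[a.length];
        fooJava(a, b, a.length);
        System.out.println(Arrays.toString(b)); // prints [2.0, 5.0, -6.0, 0.0]
    }
}
```

There is no allocation, copy, or free call for device memory in this version; the only thing the programmer states is that the iterations may run in parallel.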

  8. Safety and Flexibility in Java
     • Automatic memory management
       – No memory leak
     • Object-oriented
     • Exception checks
       – No unsafe memory accesses

         float[] a = new float[N], b = new float[N];
         new Par().foo(a, b, N);
         // unnecessary to explicitly free a[] and b[]

         class Par {
           void foo(float[] a, float[] b, int n) {
             // similar to for (i = 0; i < n; i++)
             IntStream.range(0, n).parallel().forEach(i -> {
               // throw an exception if
               //   a[] == null, b[] == null
               //   i < 0, a.length <= i, b.length <= i
               b[i] = a[i] * 2.0f;
             });
           }
         }
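The exception checks listed on this slide can be observed directly on a standard Java 8 runtime running on CPU threads. The sketch below is an illustration under that assumption; `SafetyDemo` and `outcome` are hypothetical names, not part of the talk.

```java
import java.util.stream.IntStream;

final class SafetyDemo {
    static void foo(float[] a, float[] b, int n) {
        IntStream.range(0, n).parallel().forEach(i -> b[i] = a[i] * 2.0f);
    }

    // Runs foo and reports the simple class name of the exception it raised, or "none".
    static String outcome(float[] a, float[] b, int n) {
        try {
            foo(a, b, n);
            return "none";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // Valid arrays: no exception.
        System.out.println(outcome(new float[4], new float[4], 4));
        // b[] is shorter than n: the bounds check fails instead of writing out of range.
        System.out.println(outcome(new float[4], new float[2], 4));
        // a[] is null: the null check fails instead of dereferencing an invalid pointer.
        System.out.println(outcome(null, new float[4], 4));
    }
}
```

The failing cases surface as ordinary Java exceptions in the caller, which is the behavior a GPU back end has to preserve.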

  9. Portability among Different Hardware
     • How a Java program works
       – ‘javac’ command creates machine-independent Java bytecode
       – ‘java’ command launches Java runtime with Java bytecode
         • An interpreter executes a program by processing each Java bytecode
         • A just-in-time compiler generates native instructions for a target machine from Java bytecode of a hotspot method
     Figure: Java program (.java) -> "> javac Seq.java" -> Java bytecode (.class, .jar) -> "> java Seq" -> Java runtime (interpreter, just-in-time compiler) -> target machine

  10. Outline
      • Goal
      • Motivation
      • How to Write a Parallel Program in Java
      • Overview of IBM Java 8 Runtime
      • Performance Evaluation
      • Conclusion

  11. How to Write a Parallel Loop in Java 8
      • Express parallelism among iterations of a lambda expression (index variable: i) by using parallel stream APIs

        Example:

          IntStream.range(0, 5).parallel().
            forEach(i -> { System.out.println(i); });

        Sample output (the order can vary between runs): 0 3 2 4 1

      • Reference implementation of Java 8 can execute this on multiple CPU threads
      Figure (timeline): println(0) and then println(1) on thread 0, println(3) on thread 1, println(2) on thread 2, println(4) on thread 3
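To see which thread runs which iteration, the example loop can record the thread name next to each index. This is a small sketch for a standard Java 8 runtime; the class `ParallelOrder` and the method `visitLog` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.IntStream;

final class ParallelOrder {
    // Runs the loop body for indices 0..n-1 in parallel and returns
    // "index@thread" entries in the order the iterations were recorded.
    static List<String> visitLog(int n) {
        ConcurrentLinkedQueue<String> log = new ConcurrentLinkedQueue<>();
        IntStream.range(0, n).parallel().forEach(i ->
            log.add(i + "@" + Thread.currentThread().getName()));
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        // Both the order and the thread names can differ from run to run.
        visitLog(5).forEach(System.out::println);
    }
}
```

Every index appears exactly once, but nothing fixes the order, so a loop body written this way must not depend on one iteration running before another.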

  12. Outline
      • Goal
      • Motivation
      • How to Write and Execute a Parallel Program in Java
      • Overview of IBM Java 8 Runtime
      • Performance Evaluation
      • Conclusion

  13. Portability among Different Hardware (including GPUs)
      • A just-in-time compiler in IBM Java 8 runtime generates native instructions
        – for a target machine including GPUs from Java bytecode
        – for GPU which exploit device-specific capabilities more easily than OpenCL
      Figure: Java program (.java) -> "> javac Par.java" -> Java bytecode (.class, .jar) -> "> java Par" -> IBM Java 8 runtime (interpreter, just-in-time compiler for GPU) -> target machine

          IntStream.range(0, n)
            .parallel().forEach(i -> { ... });

  14. IBM Java 8 Can Execute the Code on CPU or GPU
      • Generate code for GPU execution from a parallel loop
        – GPU instructions for code in blue
        – CPU instructions for GPU memory management and data copy
      • Execute this loop on CPU or GPU based on a cost model
        – e.g., execute this on CPU if ‘n’ is very small

          class Par {
            void foo(float[] a, float[] b, float[] c, int n) {
              IntStream.range(0, n).parallel().forEach(i -> {
                b[i] = a[i] * 2.0f;
                c[i] = a[i] * 3.0f;
              });
            }
          }

      Note: GPU support in the current version is limited to lambdas with one-dimensional arrays and primitive types

  15. Optimizations for GPUs in IBM Just-In-Time Compiler
      • Using read-only cache
        – reduce # of memory transactions to a GPU global memory
      • Optimizing data copy between CPU and GPU
        – reduce amount of data copy
      • Eliminating redundant exception checks for Java on GPU
        – reduce # of instructions in GPU binary

  16. Using Read-Only Cache
      • Automatically detect a read-only array and access it through the read-only cache
        – read-only cache is faster than other memories in GPU

          float[] A = new float[N], B = new float[N], C = new float[N];
          foo(A, B, C, N);

          void foo(float[] a, float[] b, float[] c, int n) {
            IntStream.range(0, n).parallel().forEach(i -> {
              b[i] = a[i] * 2.0f;
              c[i] = a[i] * 3.0f;
            });
          }

        Equivalent to CUDA code:

          __device__ foo(*a, *b, *c, N) {
            b[i] = __ldg(&a[i]) * 2.0;
            c[i] = __ldg(&a[i]) * 3.0;
          }

  17. Optimizing Data Copy between CPU and GPU
      • Eliminate data copy from GPU to CPU
        – if an array (e.g., a[]) is not written on GPU
      • Eliminate data copy from CPU to GPU
        – if an array (e.g., b[] and c[]) is not read on GPU

          void foo(float[] a, float[] b, float[] c, int n) {
            // Data copy for a[] from CPU to GPU
            // No data copy for b[] and c[]
            IntStream.range(0, n).parallel().forEach(i -> {
              b[i] = a[i] * 2.0f;
              c[i] = a[i] * 3.0f;
            });
            // Data copy for b[] and c[] from GPU to CPU
            // No data copy for a[]
          }
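The two rules on this slide amount to a simple decision per array: copy it to the GPU only if the kernel reads it, and copy it back only if the kernel writes it. The sketch below models just that decision in plain Java. It is an assumption-laden illustration, not the IBM compiler's analysis: `CopyPlanner`, `ArrayUse`, and the read/written flags are hypothetical, and the flags are supplied by hand rather than derived from bytecode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CopyPlanner {
    // Hypothetical summary of how the lambda body uses one array.
    static final class ArrayUse {
        final String name;
        final boolean readOnGpu;
        final boolean writtenOnGpu;

        ArrayUse(String name, boolean readOnGpu, boolean writtenOnGpu) {
            this.name = name;
            this.readOnGpu = readOnGpu;
            this.writtenOnGpu = writtenOnGpu;
        }
    }

    // CPU -> GPU copy is kept only for arrays that are read on the GPU.
    static List<String> copyToGpu(List<ArrayUse> uses) {
        List<String> names = new ArrayList<>();
        for (ArrayUse u : uses) {
            if (u.readOnGpu) names.add(u.name);
        }
        return names;
    }

    // GPU -> CPU copy is kept only for arrays that are written on the GPU.
    static List<String> copyToCpu(List<ArrayUse> uses) {
        List<String> names = new ArrayList<>();
        for (ArrayUse u : uses) {
            if (u.writtenOnGpu) names.add(u.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Usage of a[], b[], c[] in the loop b[i] = a[i] * 2.0f; c[i] = a[i] * 3.0f;
        List<ArrayUse> uses = Arrays.asList(
            new ArrayUse("a", true, false),
            new ArrayUse("b", false, true),
            new ArrayUse("c", false, true));
        System.out.println("CPU -> GPU: " + copyToGpu(uses)); // prints CPU -> GPU: [a]
        System.out.println("GPU -> CPU: " + copyToCpu(uses)); // prints GPU -> CPU: [b, c]
    }
}
```

For the example loop this yields one copy in and two copies out, instead of three in each direction.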

  18. Optimizing Data Copy between CPU and GPU
      • Eliminate data copy between CPU and GPU
        – if an array (e.g., a[] and b[]), which was accessed on GPU, is not accessed on CPU

          // Data copy for a[] from CPU to GPU
          for (int t = 0; t < T; t++) {
            IntStream.range(0, N*N).parallel().forEach(idx -> {
              b[idx] = a[...];
            });
            // No data copy for b[] between GPU and CPU
            IntStream.range(0, N*N).parallel().forEach(idx -> {
              a[idx] = b[...];
            });
            // No data copy for a[] between GPU and CPU
          }
          // Data copy for a[] and b[] from GPU to CPU

  19. How to Support Exception Checks on GPUs
      • IBM just-in-time compiler inserts exception checks in GPU kernel

          // Java program
          IntStream.range(0, n).parallel().
            forEach(i -> {
              b[i] = a[i] * 2.0f;
              c[i] = a[i] * 3.0f;
            });

          // code for CPU
          {
            ...
            launch GPUkernel(...)
            if (exception) {
              goto handle_exception;
            }
            ...
          }

          // code for GPU
          __device__ GPUkernel (…) {
            int i = ...;
            if ((a == NULL) || i < 0 || a.length <= i) {
              exception = true; return; }
            if ((b == NULL) || b.length <= i) {
              exception = true; return; }
            b[i] = a[i] * 2.0;
            if ((c == NULL) || c.length <= i) {
              exception = true; return; }
            c[i] = a[i] * 3.0;
          }
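The flag-based scheme on this slide can be emulated in plain Java to make the control flow concrete: each "GPU thread" performs the explicit checks and sets a shared flag instead of throwing, and the "CPU side" inspects the flag after the launch. This is a sketch under stated assumptions, not the generated code: `CheckedKernel`, `kernel`, and `launch` are hypothetical names, a parallel stream stands in for the kernel launch, and what the exception handler does afterwards is not modeled because the slide does not say.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.IntStream;

final class CheckedKernel {
    // Emulates one GPU thread: explicit null and bounds checks set a flag and return early.
    static void kernel(int i, float[] a, float[] b, float[] c, AtomicBoolean exception) {
        if (a == null || i < 0 || a.length <= i) { exception.set(true); return; }
        if (b == null || b.length <= i) { exception.set(true); return; }
        b[i] = a[i] * 2.0f;
        if (c == null || c.length <= i) { exception.set(true); return; }
        c[i] = a[i] * 3.0f;
    }

    // Emulates the CPU side: launch all threads, then check the flag.
    // Returns true if no thread reported an exception condition.
    static boolean launch(float[] a, float[] b, float[] c, int n) {
        AtomicBoolean exception = new AtomicBoolean(false);
        IntStream.range(0, n).parallel().forEach(i -> kernel(i, a, b, c, exception));
        return !exception.get();
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = new float[3];
        float[] c = new float[3];
        System.out.println(launch(a, b, c, 3));            // prints true
        System.out.println(launch(a, b, new float[1], 3)); // prints false: c[] is too short
    }
}
```

Each array access costs a comparison and a branch in this form, which is why removing checks that can be proven redundant reduces the instruction count of the GPU binary.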
