Easy and High Performance GPU Programming for Java Programmers (GTC 2016)



SLIDE 1

GTC 2016

Kazuaki Ishizaki (kiszk@acm.org) +, Gita Koblents -, Alon Shalev Housfater -, Jimmy Kwa -, Marcel Mitran -, Akihiro Hayashi *, Vivek Sarkar *

+ IBM Research – Tokyo

- IBM Canada

* Rice University

Easy and High Performance GPU Programming for Java Programmers


SLIDE 2

Java Program Runs on GPU with IBM Java 8

http://www-01.ibm.com/support/docview.wss?uid=swg21696670

https://devblogs.nvidia.com/parallelforall/next-wave-enterprise-performance-java-power-systems-nvidia-gpus/

SLIDE 3

Java Meets GPUs


SLIDE 4

What You Will Learn from this Talk

  • How to program GPUs in pure Java
    – using standard parallel stream APIs
  • How the IBM Java 8 runtime executes the parallel program on GPUs
    – with optimizations, without annotations:
      • GPU read-only cache exploitation
      • data copy reductions between CPU and GPU
      • exception check eliminations for Java
  • Achieved good performance results using one K40 card:
    – 58.9x over 1-CPU-thread sequential execution on POWER8
    – 3.7x over 160-CPU-thread parallel execution on POWER8


SLIDE 5

Outline

  • Goal
  • Motivation
  • How to Write a Parallel Program in Java
  • Overview of IBM Java 8 Runtime
  • Performance Evaluation
  • Conclusion


SLIDE 6

Why We Want to Use Java for GPU Programming

  • High productivity
    – Safety and flexibility
    – Good program portability among different machines
      • “write once, run anywhere”
    – Ease of writing a program
  • Hard to use CUDA and OpenCL for non-expert programmers
  • Many computation-intensive applications in non-HPC areas
    – Data analytics and data science (Hadoop, Spark, etc.)
    – Security analysis (events in log files)
    – Natural language processing (messages in social network systems)


From https://www.flickr.com/photos/dlato/5530553658

SLIDE 7

Programmability of CUDA vs. Java for GPUs

  • CUDA requires programmers to explicitly write operations for
    – managing device memories
    – copying data between CPU and GPU
    – expressing parallelism
  • Java 8 enables programmers to just focus on
    – expressing parallelism


// Java code
void fooJava(float[] a, float[] b, int n) {
  // similar to: for (int i = 0; i < n; i++)
  IntStream.range(0, n).parallel().forEach(i -> {
    b[i] = a[i] * 2.0f;
  });
}

// CUDA code for GPU
__global__ void GPU(float *d_a, float *d_b, int n) {
  int i = threadIdx.x;
  if (n <= i) return;
  d_b[i] = d_a[i] * 2.0;
}

// CUDA code for CPU
void fooCUDA(float *A, float *B, int N) {
  int sizeN = N * sizeof(float);
  float *d_A, *d_B;
  cudaMalloc(&d_A, sizeN);
  cudaMalloc(&d_B, sizeN);
  cudaMemcpy(d_A, A, sizeN, cudaMemcpyHostToDevice);
  GPU<<<N, 1>>>(d_A, d_B, N);
  cudaMemcpy(B, d_B, sizeN, cudaMemcpyDeviceToHost);
  cudaFree(d_B);
  cudaFree(d_A);
}

SLIDE 8

Safety and Flexibility in Java

  • Automatic memory management
    – No memory leaks
  • Object-oriented
  • Exception checks
    – No unsafe memory accesses


float[] a = new float[N], b = new float[N];
new Par().foo(a, b, N);
// unnecessary to explicitly free a[] and b[]

class Par {
  void foo(float[] a, float[] b, int n) {
    // similar to: for (int i = 0; i < n; i++)
    IntStream.range(0, n).parallel().forEach(i -> {
      // throws an exception if
      // a[] == null, b[] == null,
      // i < 0, a.length <= i, or b.length <= i
      b[i] = a[i] * 2.0f;
    });
  }
}
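These checks can be observed on any standard JVM. A small self-contained sketch (the class name BoundsDemo is made up for illustration) shows that an out-of-bounds access inside the parallel loop surfaces as an exception for the caller:

```java
import java.util.stream.IntStream;

public class BoundsDemo {
    public static void main(String[] args) {
        float[] a = new float[4];
        float[] b = new float[3]; // deliberately too short
        try {
            IntStream.range(0, 4).parallel().forEach(i -> {
                b[i] = a[i] * 2.0f; // i == 3 is out of bounds for b[]
            });
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("caught out-of-bounds access");
        }
    }
}
```

The parallel stream propagates the first exception thrown by any iteration back to the calling thread, so an unsafe access cannot silently corrupt memory as it could in CUDA.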

SLIDE 9

Portability among Different Hardware

  • How a Java program works
    – the ‘javac’ command creates machine-independent Java bytecode
    – the ‘java’ command launches the Java runtime with the Java bytecode
      • An interpreter executes a program by processing each Java bytecode
      • A just-in-time compiler generates native instructions for a target machine from the Java bytecode of a hotspot method


[Figure: ‘javac Seq.java’ compiles a Java program (.java) into Java bytecode (.class, .jar); ‘java Seq’ runs the bytecode on the Java runtime, whose interpreter and just-in-time compiler execute it on the target machine]

SLIDE 10

Outline

  • Goal
  • Motivation
  • How to Write a Parallel Program in Java
  • Overview of IBM Java 8 Runtime
  • Performance Evaluation
  • Conclusion


SLIDE 11

How to Write a Parallel Loop in Java 8

  • Express parallelism by using parallel stream APIs among iterations of a lambda expression (index variable: i)


Example:

IntStream.range(0, 5).parallel()
         .forEach(i -> { System.out.println(i); });

Example output: 0 3 2 4 1

The reference implementation of Java 8 can execute this on multiple CPU threads, e.g. over time:

println(0) on thread 0
println(3) on thread 1
println(2) on thread 2
println(4) on thread 3
println(1) on thread 0
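This behavior can be checked on any JVM. A self-contained sketch (the class name ParallelOrder is made up) records the indices instead of printing them, which makes the nondeterministic order easy to inspect:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;

public class ParallelOrder {
    public static void main(String[] args) {
        // Thread-safe list records the order in which iterations actually ran
        List<Integer> seen = Collections.synchronizedList(new ArrayList<>());
        IntStream.range(0, 5).parallel().forEach(seen::add);
        // Every index 0..4 runs exactly once, but in no guaranteed order
        System.out.println(seen);
        List<Integer> sorted = new ArrayList<>(seen);
        Collections.sort(sorted);
        System.out.println(sorted); // always [0, 1, 2, 3, 4]
    }
}
```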

SLIDE 12

Outline

  • Goal
  • Motivation
  • How to Write and Execute a Parallel Program in Java
  • Overview of IBM Java 8 Runtime
  • Performance Evaluation
  • Conclusion


SLIDE 13

Portability among Different Hardware (including GPUs)

  • The just-in-time compiler in the IBM Java 8 runtime generates native instructions
    – for a target machine, including GPUs, from Java bytecode
    – for GPUs, exploiting device-specific capabilities more easily than OpenCL


[Figure: ‘javac Par.java’ compiles the program into Java bytecode (.class, .jar); ‘java Par’ runs it on the IBM Java 8 runtime, whose just-in-time compiler generates code for the target machine and for the GPU from a parallel loop such as IntStream.range(0, n).parallel().forEach(i -> { ... })]

SLIDE 14

IBM Java 8 Can Execute the Code on CPU or GPU

  • Generates code for GPU execution from a parallel loop
    – GPU instructions for the code in blue (the lambda body)
    – CPU instructions for GPU memory management and data copy
  • Executes this loop on the CPU or the GPU based on a cost model
    – e.g., executes on the CPU if ‘n’ is very small


class Par {
  void foo(float[] a, float[] b, float[] c, int n) {
    IntStream.range(0, n).parallel().forEach(i -> {
      b[i] = a[i] * 2.0f;
      c[i] = a[i] * 3.0f;
    });
  }
}

Note: GPU support in current version is limited to lambdas with one-dimensional arrays and primitive types
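For reference, a complete runnable version of the class above with a small, hypothetical driver; on a JVM without GPU support the loop simply runs across CPU threads and produces the same results:

```java
import java.util.stream.IntStream;

public class Par {
    // Parallel loop from the slide; IBM's JIT may offload it to the GPU,
    // while any other JVM runs it across CPU threads with the same result.
    void foo(float[] a, float[] b, float[] c, int n) {
        IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        });
    }

    public static void main(String[] args) {
        int n = 4;
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = new float[n], c = new float[n];
        new Par().foo(a, b, c, n);
        System.out.println(b[1] + " " + c[1]); // prints "4.0 6.0"
    }
}
```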

SLIDE 15

Optimizations for GPUs in IBM Just-In-Time Compiler

  • Using the read-only cache
    – reduces the number of memory transactions to the GPU global memory
  • Optimizing data copy between CPU and GPU
    – reduces the amount of data copied
  • Eliminating redundant exception checks for Java on the GPU
    – reduces the number of instructions in the GPU binary


SLIDE 16

Using Read-Only Cache

  • Automatically detect a read-only array and access it through the read-only cache
    – the read-only cache is faster than other memories in the GPU


float[] A = new float[N], B = new float[N], C = new float[N];
foo(A, B, C, N);

void foo(float[] a, float[] b, float[] c, int n) {
  IntStream.range(0, n).parallel().forEach(i -> {
    b[i] = a[i] * 2.0f;
    c[i] = a[i] * 3.0f;
  });
}

Equivalent to CUDA code:

__device__ void foo(float *a, float *b, float *c, int n) {
  ...
  b[i] = __ldg(&a[i]) * 2.0;
  c[i] = __ldg(&a[i]) * 3.0;
}

SLIDE 17

Optimizing Data Copy between CPU and GPU

  • Eliminate data copy from GPU to CPU
    – if an array (e.g., a[]) is not written on the GPU
  • Eliminate data copy from CPU to GPU
    – if an array (e.g., b[] and c[]) is not read on the GPU


void foo(float[] a, float[] b, float[] c, int n) {
  // Data copy for a[] from CPU to GPU;
  // no data copy for b[] and c[]
  IntStream.range(0, n).parallel().forEach(i -> {
    b[i] = a[i] * 2.0f;
    c[i] = a[i] * 3.0f;
  });
  // Data copy for b[] and c[] from GPU to CPU;
  // no data copy for a[]
}

SLIDE 18

Optimizing Data Copy between CPU and GPU

  • Eliminate data copy between CPU and GPU
    – if an array (e.g., a[] and b[]) that was accessed on the GPU is not accessed on the CPU


// Data copy for a[] from CPU to GPU
for (int t = 0; t < T; t++) {
  IntStream.range(0, N*N).parallel().forEach(idx -> {
    b[idx] = a[...];
  });
  // No data copy for b[] between GPU and CPU
  IntStream.range(0, N*N).parallel().forEach(idx -> {
    a[idx] = b[...];
  });
  // No data copy for a[] between GPU and CPU
}
// Data copy for a[] and b[] from GPU to CPU
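A runnable sketch of this double-buffer pattern follows; since the slide elides the actual array indices, a simple made-up element update is substituted. On a standard JVM the arrays live in host memory, but this access pattern is what lets IBM's JIT keep a[] and b[] on the GPU across all T iterations:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class DoubleBuffer {
    public static void main(String[] args) {
        final int N = 4, T = 10;
        double[] a = new double[N * N], b = new double[N * N];
        Arrays.fill(a, 1.0);
        // a[] would be copied to the GPU once, here
        for (int t = 0; t < T; t++) {
            // b[] never crosses the CPU-GPU boundary inside the loop
            IntStream.range(0, N * N).parallel().forEach(idx -> b[idx] = a[idx] * 0.5);
            IntStream.range(0, N * N).parallel().forEach(idx -> a[idx] = b[idx] * 2.0);
        }
        // a[] and b[] would be copied back once, here
        System.out.println(a[0] + " " + b[0]); // prints "1.0 0.5"
    }
}
```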

SLIDE 19

How to Support Exception Checks on GPUs

  • The IBM just-in-time compiler inserts exception checks in the GPU kernel


// code for CPU
{
  ...
  launch GPUkernel(...)
  if (exception) { goto handle_exception; }
  ...
}

__device__ GPUkernel(...) {
  int i = ...;
  if ((a == NULL) || i < 0 || a.length <= i) {
    exception = true; return;
  }
  if ((b == NULL) || b.length <= i) {
    exception = true; return;
  }
  b[i] = a[i] * 2.0;
  if ((c == NULL) || c.length <= i) {
    exception = true; return;
  }
  c[i] = a[i] * 3.0;
}

// Java program
IntStream.range(0, n).parallel().forEach(i -> {
  b[i] = a[i] * 2.0f;
  c[i] = a[i] * 3.0f;
});

SLIDE 20

Eliminating Redundant Exception Checks

  • Speculatively perform exception checks on the CPU if the form of an array index is simple (x*i + y)


// code for CPU
if ( // check conditions for null pointers
     a != null && b != null && c != null &&
     // check conditions for out-of-bounds array indexes
     n <= a.length && n <= b.length && n <= c.length) {
  ...
  launch GPUkernel(...)
  ...
} else {
  // execute this loop on the CPU to produce the exception
}

__device__ GPUkernel(...) {
  // no exception check is required
  i = ...;
  b[i] = a[i] * 2.0;
  c[i] = a[i] * 3.0;
}

IntStream.range(0, n).parallel().forEach(i -> {
  b[i] = a[i] * 2.0f;
  c[i] = a[i] * 3.0f;
});
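The compiler's speculative guard can be mimicked in plain Java. This hypothetical helper (not part of the IBM runtime) returns true exactly when the loop above cannot raise a NullPointerException or ArrayIndexOutOfBoundsException for any i in [0, n), which is the condition under which per-iteration checks can be dropped:

```java
public class SpeculativeCheck {
    // True iff every access a[i], b[i], c[i] with 0 <= i < n is safe,
    // i.e. the GPU kernel could run without per-iteration exception checks.
    static boolean safeWithoutChecks(float[] a, float[] b, float[] c, int n) {
        return a != null && b != null && c != null
            && n <= a.length && n <= b.length && n <= c.length;
    }

    public static void main(String[] args) {
        System.out.println(safeWithoutChecks(new float[4], new float[4], new float[4], 4)); // true
        System.out.println(safeWithoutChecks(new float[4], new float[3], new float[4], 4)); // false: b[] too short
        System.out.println(safeWithoutChecks(null, new float[4], new float[4], 4));         // false: null pointer
    }
}
```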

SLIDE 21

Outline

  • Goal
  • Motivation
  • How to Write and Execute a Parallel Program in Java
  • Overview of IBM Java 8 Runtime
  • Performance Evaluation
  • Conclusion


SLIDE 22

Performance Evaluation Methodology

  • Measured performance improvement by the GPU using four programs (on the next slide) over
    – 1-CPU-thread sequential execution
    – 160-CPU-thread parallel execution
  • Experimental environment used
    – IBM Java 8 Service Release 2 for PowerPC Little Endian
      • Download for free at http://www.ibm.com/java/jdk/
    – Two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256 GB memory (160 hardware threads in total)
      • With one NVIDIA Kepler K40m GPU (2880 CUDA cores in total) at 876 MHz with 12 GB global memory (ECC off)
    – Ubuntu 14.10, CUDA 5.5


SLIDE 23

Benchmark Programs

  • Prepare sequential and parallel stream API versions in Java


Name      Summary                                       Data size           Type
MM        A dense matrix multiplication: C = A.B        1,024 × 1,024       double
SpMM      A sparse matrix multiplication: C = A.B       500,000 × 500,000   double
Jacobi2D  Solve an equation using the Jacobi method     8,192 × 8,192       double
LifeGame  Conway’s Game of Life, iterated 10,000 times  512 × 512           byte
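The benchmarks' source code is not shown in the slides; as an illustrative sketch, MM written in the same parallel-stream style (one parallel iteration per output row, row-major 1-D arrays as the GPU support requires) might look like:

```java
import java.util.stream.IntStream;

public class MM {
    // C = A.B for n x n row-major matrices stored as 1-D arrays,
    // parallelized over rows with a parallel stream.
    static void mm(double[] A, double[] B, double[] C, int n) {
        IntStream.range(0, n).parallel().forEach(i -> {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += A[i * n + k] * B[k * n + j];
                }
                C[i * n + j] = sum;
            }
        });
    }

    public static void main(String[] args) {
        double[] A = {1, 2, 3, 4};   // [[1,2],[3,4]]
        double[] B = {5, 6, 7, 8};   // [[5,6],[7,8]]
        double[] C = new double[4];
        mm(A, B, C, 2);
        System.out.println(java.util.Arrays.toString(C)); // [19.0, 22.0, 43.0, 50.0]
    }
}
```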

SLIDE 24

Performance Improvements of GPU Version over Sequential and Parallel CPU Versions

  • Achieved 58.9x on geomean and 317.0x for Jacobi2D over 1 CPU thread
  • Achieved 3.7x on geomean and 14.8x for Jacobi2D over 160 CPU threads
  • Performance degraded for SpMM relative to 160 CPU threads


SLIDE 25

Conclusion

  • Program GPUs using pure Java with standard parallel stream APIs
  • The IBM Java 8 runtime compiles a Java program for GPUs, without annotations, with optimizations:
    – read-only cache exploitation
    – data copy optimizations between CPU and GPU
    – exception check eliminations
  • Offers performance improvements using GPUs of
    – 58.9x over sequential execution
    – 3.7x over 160-CPU-thread parallel execution


Details are in our paper “Compiling and Optimizing Java 8 Programs for GPU Execution” (PACT2015)