Leveraging the GPU on Spark

Tobias Polzer, Friedrich-Alexander University Erlangen-Nuremberg
Josef Adersberger, QAware GmbH
May 17, 2017


Contents

Motivation
Challenges
Prototype Architecture
Benchmarks
Conclusions
The Way Forward
Motivation

◮ Initial motivation: time series analysis in Chronix
◮ Accelerating operations with high arithmetic intensity is "easy":
  ◮ copy from Spark to the accelerated native application
  ◮ compute…
  ◮ copy back results

Motivation

◮ What if intermediate results need to be exchanged? (e.g. in outlier detection)
◮ More generally: accelerate operations with low arithmetic intensity
◮ Typically, CPU ↔ GPU transfers are slow while GPU RAM is fast
◮ Can we just keep the data on the GPU all the time?

GPU ↔ Java

◮ Project Sumatra aimed for deep integration into HotSpot. It didn't happen (the project is "currently inactive").
◮ OpenCL and CUDA are native APIs; interfacing via JNI is possible but tedious.
◮ A standard way of doing GPU acceleration from Java has yet to emerge.
◮ Many publications, but few publish code.

Transpilers

There are two serious transpilers publicly available:

◮ Rootbeer (Java→CUDA)
◮ Aparapi (Java→OpenCL)

Both could use some love...

jocl/jcuda

Near 1:1 wrappers around OpenCL/CUDA:

◮ Very flexible in usage.
◮ Direct OpenCL usage makes runtime code generation easy.
◮ Buffer management with exceptions but without proper destructors is awkward.
◮ Currently the only reasonable choices.

CUDA vs. OpenCL

CUDA
◮ has a mature ecosystem
◮ needs separate compilation
◮ works only on Nvidia GPUs

OpenCL
◮ "works" on lots of devices (CPUs, GPUs, FPGAs, etc.)
◮ supports JIT compilation of kernels (from C)
◮ most implementations are fragile/quirky

GPU ↔ Spark

◮ Project Tungsten (theoretically)
◮ IBM GPUEnabler (Tungsten prototype?)
  ◮ looks promising
  ◮ but mostly undocumented
  ◮ uses internal Spark APIs
  ◮ had randomly failing tests
  ◮ their example code is faster on the CPU

CLRDD

CLRDD[T](val wrapped: RDD[CLPartition[T]]) extends RDD[T]

◮ One CLPartition yields one context and an iterator of binary chunks.
◮ The context provides asynchronous methods on chunks.
◮ Provides GPU functions on the RDD.
◮ The user can choose caching on the GPU at runtime.
◮ If data is not cached on the GPU, it is streamed as needed.
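The partition abstraction above can be sketched in plain Scala. All names and member signatures here are illustrative assumptions, not the actual spark-clrdd API; in the real prototype the context and chunks wrap jocl objects and device buffers:

```scala
import java.nio.ByteBuffer

// Stand-ins for an OpenCL context/queue and one binary chunk of data
// (assumed names, for illustration only).
trait CLContextLike
final case class Chunk(data: ByteBuffer)

// One CLPartition yields one context plus an iterator of binary chunks,
// and records whether its chunks stay resident in GPU memory.
trait CLPartition[T] {
  def get: (CLContextLike, Iterator[Chunk])
  def cached: Boolean
}
```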

Storage

◮ All useful operations on CLRDD[T] require a typeclass instance CLType[T].
◮ The minimal definition includes the OpenCL type and the mapping to/from ByteBuffer storage.
◮ Optionally: OpenCL arithmetic.
◮ Macro-generated instances for all primitive vector/tuple types.
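A minimal sketch of such a typeclass, assuming illustrative member names (the real spark-clrdd definition differs in detail): it names the OpenCL type and maps elements to and from ByteBuffer storage.

```scala
import java.nio.ByteBuffer

// Sketch of the CLType typeclass described above (member names are
// assumptions, not the actual spark-clrdd API).
trait CLType[T] {
  def clName: String                        // OpenCL C type name
  def sizeOf: Int                           // bytes per element in storage
  def toBuffer(x: T, b: ByteBuffer): Unit   // serialize one element
  def fromBuffer(b: ByteBuffer): T          // deserialize one element
  def zeroName: String                      // OpenCL literal for zero
}

// Example instance for Double.
implicit object CLDouble extends CLType[Double] {
  val clName = "double"
  val sizeOf = 8
  def toBuffer(x: Double, b: ByteBuffer): Unit = b.putDouble(x)
  def fromBuffer(b: ByteBuffer): Double = b.getDouble()
  val zeroName = "0.0"
}
```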

Operations

Operations are represented as composable case classes that can generate a kernel source:

case class MapReduceKernel[A, B](
  f: MapKernel[A, B],
  reduceBody: String,
  identity: String,
  cpu: Boolean,
  implicit val clA: CLType[A],
  implicit val clB: CLType[B]
) extends CLProgramSource {
  def generateSource(supply: Iterator[String]): Array[String] = ...
  ...
}
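To make the idea concrete, here is a hedged sketch of splicing a reduce body and identity string into OpenCL C source. The template is illustrative only; the real MapReduceKernel.generateSource emits a proper parallel work-group reduction, not this sequential skeleton.

```scala
// Illustrative source composition (assumed names, not the spark-clrdd API):
// the user-supplied reduce body and identity literal are spliced into a
// string template of OpenCL C.
case class ReduceSource(reduceBody: String, identity: String, clType: String) {
  def generateSource: String =
    s"""$clType reduce_op($clType x, $clType y) { $reduceBody }
       |__kernel void reduce_all(__global const $clType *in, int n,
       |                         __global $clType *out) {
       |  $clType acc = $identity;
       |  for (int i = 0; i < n; ++i) acc = reduce_op(acc, in[i]);
       |  *out = acc;
       |}""".stripMargin
}
```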

Functions on the GPU

High-level functions that are implemented:

◮ One-to-one map functions (in-place/copying):

  crdd.map[Byte]("return x%2;")

◮ Simple reduction:

  def sum(implicit num: Numeric[T]): T = {
    val clT = implicitly[CLType[T]]
    reduce(MapReduceKernel(
      MapKernel.identity[T], // first map
      "return x+y;",         // then reduce
      clT.zeroName,          // string zero
      useCPU,                // algorithm selection
      clT, clT               // explicit typeclasses
    ), num.zero, ((x: T, y: T) => num.plus(x, y)))
  }
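As a reference for what these two calls compute (not part of the spark-clrdd API), the OpenCL bodies correspond to ordinary Scala functions: "return x%2;" is x => x % 2, and the sum kernel folds with + starting from zero.

```scala
// Plain-Scala reference semantics for the GPU operations above,
// assuming an input element type of Long for the map example.
def mapMod2(xs: Seq[Long]): Seq[Byte] = xs.map(x => (x % 2).toByte)
def sumRef(xs: Seq[Double]): Double   = xs.foldLeft(0.0)(_ + _)
```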

Functions on the GPU

◮ Many-to-one sliding window map:

  def movingAverage(width: Int)(implicit clT: CLType[T])
    // polymorphic return type, e.g. CLRDD[(Double, Double)]
    : CLRDD[clT.doubleCLInstance.elemType] = {
    val clRes = clT.doubleCLInstance
    sliding[clT.doubleCLInstance.elemType](
      width, 1, // width, stride
      s"""${clRes.clName} res = ${clRes.zeroName};
          for(int i=0; i<$width; ++i)
            res += convert_${clRes.clName}(GET(i));
          return res/$width;"""
    ) // just scala things...
    (clT.doubleCLInstance.selfInstance, clT.doubleCLInstance.elemClassTag)
  }
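The kernel above averages `width` consecutive elements at stride 1. A plain-Scala reference (not part of the API) for checking its results:

```scala
// Each output element is the mean of `width` consecutive inputs,
// windows advancing by one element (stride 1).
def movingAverageRef(xs: Vector[Double], width: Int): Vector[Double] =
  xs.sliding(width).map(w => w.sum / width).toVector
```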

Benchmarking Setup

Workstation
◮ Spark local mode
◮ Intel i7-3770: 4 cores, 8 threads, ~20 GiB/s
◮ Radeon HD 7950, ~200 GiB/s

Cluster
◮ Spark standalone cluster mode
◮ 4 nodes, 40 Gbit/s Infiniband interconnect
◮ two Xeon 2660v2: 20 cores, 40 threads, ~100 GiB/s
◮ two K20m, ~400 GiB/s

Benchmarks

◮ All benchmarks operate on RDD[Double]s.
◮ AMD's OpenCL implementation is used for the CPUs.
◮ All data is cached in RAM/graphics RAM before benchmarking.
◮ Solid lines show throughput; dashed lines show time to process one RDD.

Workstation sum

[Figure: throughput (MiB/s, solid) and time (s, dashed) vs. data size (MiB) for GPU, CPU, and Scala¹]

¹ The "Scala" result was obtained with neither rdd.sum() nor rdd.reduce().

Workstation stats

[Figure: throughput (MiB/s, solid) and time (s, dashed) vs. data size (MiB) for GPU, CPU, and Scala]

Workstation moving average

[Figure: throughput (MiB/s, solid) and time (s, dashed) vs. data size (MiB) for GPU, CPU, and Scala]

Cluster sum

[Figure: throughput (MiB/s, solid) and time (s, dashed) vs. data size (MiB) for GPU, CPU, and Scala]

Cluster stats

[Figure: throughput (MiB/s, solid) and time (s, dashed) vs. data size (MiB) for GPU, CPU, and Scala]

Cluster moving average

[Figure: throughput (MiB/s, solid) and time (s, dashed) vs. data size (MiB) for GPU, CPU, and Scala]

Conclusions

◮ Simple aggregations could be faster even without GPUs.
◮ Large speedups for big datasets in GPU memory.
◮ Implementation effort vs. plain Spark is a lot higher:
  ◮ fit data into GPU RAM
  ◮ special GPU code?
  ◮ debugging
  ◮ deploying

The Way Forward

◮ Efficiently using GPUs (for arbitrary tasks) is a hard problem.
◮ Builtins could benefit, especially with intelligent caching in GPU memory (typically scarce).
◮ Bytecode inspection for simple operations (see SPARK-14083)?
◮ Spark as a compiler?

Code

◮ Remember that complaint about not publishing code?
◮ Fully functioning prototype implementation at: https://github.com/TPolzer/spark-clrdd

Questions?