Leveraging the GPU on Spark

Tobias Polzer, Friedrich-Alexander University Erlangen-Nuremberg
Josef Adersberger, QAware GmbH
May 17, 2017


Contents

Motivation
Challenges
Prototype Architecture
Benchmarks
Conclusions
The Way Forward
Motivation

◮ Initial motivation: time series analysis in Chronix
◮ Accelerating operations with high arithmetic intensity is "easy":
  ◮ copy from Spark to the accelerated native application
  ◮ compute…
  ◮ copy back results

Motivation

◮ What if intermediate results need to be exchanged? (e.g. in outlier detection)
◮ More generally: accelerate operations with low arithmetic intensity
◮ Typically, CPU ↔ GPU transfers are slow while GPU RAM is fast
◮ Can we just keep the data on the GPU all the time?

GPU ↔ Java

◮ Project Sumatra aimed for deep integration into HotSpot. It didn't happen (the project is "currently inactive").
◮ OpenCL and CUDA are native APIs; interfacing via JNI is possible but tedious.
◮ A standard way of doing GPU acceleration from Java has yet to emerge.
◮ Many publications, but few publish code.

Transpilers

There are two serious transpilers publicly available:

◮ Rootbeer (Java→CUDA)
◮ Aparapi (Java→OpenCL)

Both could use some love...

jocl/jcuda

Near 1:1 wrappers around OpenCL/CUDA:

◮ Very flexible in usage.
◮ Direct OpenCL usage makes runtime code generation easy.
◮ Buffer management with exceptions but without proper destructors is awkward.
◮ Currently the only reasonable choices.

CUDA vs. OpenCL

CUDA
◮ has a mature ecosystem
◮ needs separate compilation
◮ works only on Nvidia GPUs

OpenCL
◮ "works" on lots of devices (CPUs, GPUs, FPGAs, etc.)
◮ supports JIT compilation of kernels (from C)
◮ most implementations are fragile/quirky

GPU ↔ Spark

◮ Project Tungsten (theoretically)
◮ IBM GPUEnabler (Tungsten prototype?)
  ◮ looks promising
  ◮ but mostly undocumented
  ◮ uses internal Spark APIs
  ◮ had randomly failing tests
  ◮ their example code is faster on the CPU

CLRDD

CLRDD[T](val wrapped: RDD[CLPartition[T]]) extends RDD[T]

◮ One CLPartition yields one context and an iterator of binary chunks.
◮ The context provides asynchronous methods on chunks.
◮ Provides GPU functions on the RDD.
◮ The user can choose caching on the GPU at runtime.
◮ If data is not cached on the GPU, it is streamed as needed.
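The partition abstraction above can be sketched in plain Scala. All names and member signatures here are illustrative assumptions, not the actual spark-clrdd API; in the real prototype the context and chunks wrap jocl objects and device buffers:

```scala
import java.nio.ByteBuffer

// Stand-ins for an OpenCL context/queue and one binary chunk of data
// (assumed names, for illustration only).
trait CLContextLike
final case class Chunk(data: ByteBuffer)

// One CLPartition yields one context plus an iterator of binary chunks,
// and records whether its chunks stay resident in GPU memory.
trait CLPartition[T] {
  def get: (CLContextLike, Iterator[Chunk])
  def cached: Boolean
}
```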

Storage

◮ All useful operations on CLRDD[T] require a typeclass instance CLType[T].
◮ The minimal definition includes the OpenCL type and the mapping to/from ByteBuffer storage.
◮ Optionally: OpenCL arithmetic.
◮ Macro-generated instances for all primitive vector/tuple types.
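A minimal sketch of such a typeclass, assuming illustrative member names (the real spark-clrdd definition differs in detail): it names the OpenCL type and maps elements to and from ByteBuffer storage.

```scala
import java.nio.ByteBuffer

// Sketch of the CLType typeclass described above (member names are
// assumptions, not the actual spark-clrdd API).
trait CLType[T] {
  def clName: String                        // OpenCL C type name
  def sizeOf: Int                           // bytes per element in storage
  def toBuffer(x: T, b: ByteBuffer): Unit   // serialize one element
  def fromBuffer(b: ByteBuffer): T          // deserialize one element
  def zeroName: String                      // OpenCL literal for zero
}

// Example instance for Double.
implicit object CLDouble extends CLType[Double] {
  val clName = "double"
  val sizeOf = 8
  def toBuffer(x: Double, b: ByteBuffer): Unit = b.putDouble(x)
  def fromBuffer(b: ByteBuffer): Double = b.getDouble()
  val zeroName = "0.0"
}
```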

Operations

Operations are represented as composable case classes that can generate a kernel source:

case class MapReduceKernel[A, B](
  f: MapKernel[A, B],
  reduceBody: String,
  identity: String,
  cpu: Boolean,
  implicit val clA: CLType[A],
  implicit val clB: CLType[B]
) extends CLProgramSource {
  def generateSource(supply: Iterator[String]): Array[String] = ...
  ...
}
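To make the idea concrete, here is a hedged sketch of splicing a reduce body and identity string into OpenCL C source. The template is illustrative only; the real MapReduceKernel.generateSource emits a proper parallel work-group reduction, not this sequential skeleton.

```scala
// Illustrative source composition (assumed names, not the spark-clrdd API):
// the user-supplied reduce body and identity literal are spliced into a
// string template of OpenCL C.
case class ReduceSource(reduceBody: String, identity: String, clType: String) {
  def generateSource: String =
    s"""$clType reduce_op($clType x, $clType y) { $reduceBody }
       |__kernel void reduce_all(__global const $clType *in, int n,
       |                         __global $clType *out) {
       |  $clType acc = $identity;
       |  for (int i = 0; i < n; ++i) acc = reduce_op(acc, in[i]);
       |  *out = acc;
       |}""".stripMargin
}
```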

Functions on the GPU

High-level functions that are implemented:

◮ One-to-one map functions (in-place/copying):

  crdd.map[Byte]("return x%2;")

◮ Simple reduction:

  def sum(implicit num: Numeric[T]): T = {
    val clT = implicitly[CLType[T]]
    reduce(MapReduceKernel(
      MapKernel.identity[T], // first map
      "return x+y;",         // then reduce
      clT.zeroName,          // string zero
      useCPU,                // algorithm selection
      clT, clT               // explicit typeclasses
    ), num.zero, ((x: T, y: T) => num.plus(x, y)))
  }
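As a reference for what these two calls compute (not part of the spark-clrdd API), the OpenCL bodies correspond to ordinary Scala functions: "return x%2;" is x => x % 2, and the sum kernel folds with + starting from zero.

```scala
// Plain-Scala reference semantics for the GPU operations above,
// assuming an input element type of Long for the map example.
def mapMod2(xs: Seq[Long]): Seq[Byte] = xs.map(x => (x % 2).toByte)
def sumRef(xs: Seq[Double]): Double   = xs.foldLeft(0.0)(_ + _)
```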

Functions on the GPU

◮ Many-to-one sliding window map:

  def movingAverage(width: Int)(implicit clT: CLType[T])
    // polymorphic return type, e.g. CLRDD[(Double, Double)]
    : CLRDD[clT.doubleCLInstance.elemType] = {
    val clRes = clT.doubleCLInstance
    sliding[clT.doubleCLInstance.elemType](
      width, 1, // width, stride
      s"""${clRes.clName} res = ${clRes.zeroName};
          for(int i=0; i<$width; ++i)
            res += convert_${clRes.clName}(GET(i));
          return res/$width;"""
    ) // just scala things...
    (clT.doubleCLInstance.selfInstance, clT.doubleCLInstance.elemClassTag)
  }
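The kernel above averages `width` consecutive elements at stride 1. A plain-Scala reference (not part of the API) for checking its results:

```scala
// Each output element is the mean of `width` consecutive inputs,
// windows advancing by one element (stride 1).
def movingAverageRef(xs: Vector[Double], width: Int): Vector[Double] =
  xs.sliding(width).map(w => w.sum / width).toVector
```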

Benchmarking Setup

Workstation
◮ Spark local mode
◮ Intel i7-3770: 4 cores, 8 threads, ~20 GiB/s
◮ Radeon HD 7950, ~200 GiB/s

Cluster
◮ Spark standalone cluster mode
◮ 4 nodes, 40 Gbit/s Infiniband interconnect
◮ two Xeon 2660v2: 20 cores, 40 threads, ~100 GiB/s
◮ two K20m, ~400 GiB/s

Benchmarks

◮ All benchmarks operate on RDD[Double]s.
◮ AMD's OpenCL implementation is used for the CPUs.
◮ All data is cached in RAM/graphics RAM before benchmarking.
◮ Solid lines show throughput; dashed lines show time to process one RDD.

Workstation sum

[Figure: throughput (MiB/s, solid) and time (s, dashed) vs. data size (MiB) for GPU, CPU, and Scala¹]

¹ The "Scala" result was obtained with neither rdd.sum() nor rdd.reduce().

Workstation stats

[Figure: throughput (MiB/s, solid) and time (s, dashed) vs. data size (MiB) for GPU, CPU, and Scala]

Workstation moving average

[Figure: throughput (MiB/s, solid) and time (s, dashed) vs. data size (MiB) for GPU, CPU, and Scala]

Cluster sum

[Figure: throughput (MiB/s, solid) and time (s, dashed) vs. data size (MiB) for GPU, CPU, and Scala]

Cluster stats

[Figure: throughput (MiB/s, solid) and time (s, dashed) vs. data size (MiB) for GPU, CPU, and Scala]

Cluster moving average

[Figure: throughput (MiB/s, solid) and time (s, dashed) vs. data size (MiB) for GPU, CPU, and Scala]

Conclusions

◮ Simple aggregations could be faster even without GPUs.
◮ Large speedups for big datasets in GPU memory.
◮ Implementation effort vs. plain Spark is a lot higher:
  ◮ fit data into GPU RAM
  ◮ special GPU code?
  ◮ debugging
  ◮ deploying

The Way Forward

◮ Efficiently using GPUs (for arbitrary tasks) is a hard problem.
◮ Builtins could benefit, especially with intelligent caching in GPU memory (typically scarce).
◮ Bytecode inspection for simple operations (see SPARK-14083)?
◮ Spark as a compiler?

Code

◮ Remember that complaint about not publishing code?
◮ Fully functioning prototype implementation at: https://github.com/TPolzer/spark-clrdd

Questions?