Marawacc: A Framework for Heterogeneous Computing in Java (PowerPoint PPT Presentation)



SLIDE 1

Marawacc Motivation Marawacc-API Runtime Code Generation Runtime Management Results Conclusion

Marawacc: A Framework for Heterogeneous Computing in Java

Juan Fumero, Michel Steuwer, Christophe Dubach

The University of Edinburgh

UK Many-Core Developer Conference 2016

1 / 23

SLIDE 2

Motivation

2 / 23

SLIDE 3

Motivation

3 / 23

SLIDE 4

Motivation

4 / 23

SLIDE 5

Motivation

5 / 23

SLIDE 6

Marawacc: our approach

Three levels of abstraction

6 / 23

SLIDE 7

Marawacc API

7 / 23

SLIDE 8

Example: Saxpy in Java

float[] v1 = new float[size];
float[] v2 = new float[size];
float[] result = new float[size];

for (int i = 0; i < size; i++) {
    result[i] = alpha * v1[i] + v2[i];
}

8 / 23
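The sequential loop above can be wrapped in a small self-contained method; a minimal sketch (class and method names are illustrative, not part of Marawacc):

```java
import java.util.Arrays;

public class SaxpySequential {
    // Sequential saxpy: result[i] = alpha * v1[i] + v2[i]
    public static float[] saxpy(float alpha, float[] v1, float[] v2) {
        float[] result = new float[v1.length];
        for (int i = 0; i < v1.length; i++) {
            result[i] = alpha * v1[i] + v2[i];
        }
        return result;
    }

    public static void main(String[] args) {
        float[] r = saxpy(2.0f, new float[]{1f, 2f, 3f}, new float[]{10f, 20f, 30f});
        System.out.println(Arrays.toString(r)); // [12.0, 24.0, 36.0]
    }
}
```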

SLIDE 9

Example: Saxpy in Java

Float[] v1 = new Float[size];
Float[] v2 = new Float[size];

ArrayFunc<Tuple2<Float, Float>, Float> f;
f = new MapFunction<>(t -> alpha * t._1() + t._2());

Float[] result = f.zip(v1, v2).apply();

9 / 23
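Without the Marawacc library on the classpath, the element-wise computation that `f.zip(v1, v2).apply()` expresses can be sketched in plain Java; a minimal illustration of the zip-then-map pattern (all names here are hypothetical, not Marawacc's API):

```java
import java.util.Arrays;
import java.util.function.BiFunction;

public class ZipMapSketch {
    // Plain-Java sketch of what a zipped map computes:
    // apply a binary function element-wise over two arrays.
    public static Float[] zipMap(Float[] v1, Float[] v2,
                                 BiFunction<Float, Float, Float> f) {
        Float[] out = new Float[v1.length];
        for (int i = 0; i < v1.length; i++) {
            out[i] = f.apply(v1[i], v2[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        float alpha = 2.0f;
        Float[] r = zipMap(new Float[]{1f, 2f}, new Float[]{10f, 20f},
                           (a, b) -> alpha * a + b);
        System.out.println(Arrays.toString(r)); // [12.0, 24.0]
    }
}
```

In Marawacc the same pattern is captured by an `ArrayFunc` object, which lets the runtime decide later whether to run it on the CPU or generate an OpenCL kernel for the GPU.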

SLIDE 10

Runtime Code Generation

10 / 23

SLIDE 11

Runtime Code Generation

Workflow

Java source (Map.apply(f)) → Java bytecode → Graal VM → CFG + dataflow (Graal IR) → OpenCL kernel (void kernel(global float* input, global float* output) { ... })

  • 1. Type inference
  • 2. IR generation
  • 3. Optimizations
  • 4. Kernel generation

11 / 23

SLIDE 12

Runtime Code Generation

MapFunction<Integer, Double>(x -> x * 2.0)

(Graal IR figure: the initial graph contains calls to Integer.intValue and Double.valueOf, a null check with a guard, and Box/Unbox nodes; after optimization only Param, DoubleConvert, Const(2.0), the multiply, and Return remain.)

inline double lambda0(int p0) {
    double cast_1 = (double) p0;
    double result_2 = cast_1 * 2.0;
    return result_2;
}

12 / 23
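The same simplification can be shown in plain Java: the lambda as written operates on boxed wrapper types, while the optimized graph (and the generated OpenCL function) computes with primitives only. A minimal sketch (class and method names are illustrative):

```java
import java.util.function.Function;

public class BoxingExample {
    // The lambda as written operates on boxed Integer/Double, which is
    // why the naive graph contains intValue/valueOf calls, null checks,
    // and Box/Unbox nodes.
    static final Function<Integer, Double> boxed = x -> x * 2.0;

    // What the optimized graph computes: primitive-only arithmetic,
    // mirroring the generated function lambda0 above.
    public static double unboxed(int p0) {
        double cast_1 = (double) p0;
        double result_2 = cast_1 * 2.0;
        return result_2;
    }

    public static void main(String[] args) {
        System.out.println(boxed.apply(21)); // 42.0
        System.out.println(unboxed(21));     // 42.0
    }
}
```

Both paths produce the same value; the point of the Graal optimizations is that the boxing never has to happen in the generated kernel.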

SLIDE 13

Marawacc: Runtime Management

13 / 23

SLIDE 14

Where is the time spent?

Black-Scholes benchmark. Float[] ⇒ Tuple2<Float, Float>[]

(Bar chart: breakdown of total runtime into unmarshalling, CopyToCPU, GPU execution, CopyToGPU, marshalling, and Java overhead.)

◮ Un/marshalling the data takes up to 90% of the time

◮ The computation step should be dominant

This is not acceptable. Can we do better?

14 / 23
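"Marshalling" here means converting Java object arrays (Float[], Tuple2[]) into the flat primitive buffers OpenCL expects, and "unmarshalling" is the reverse copy. A minimal sketch of that copy (names are illustrative, not Marawacc's internals):

```java
public class MarshalSketch {
    // Marshalling: copy a boxed Float[] into the flat primitive float[]
    // an OpenCL buffer expects. This element-by-element O(n) copy (and
    // the reverse, unmarshalling) is the overhead measured on this slide.
    public static float[] marshal(Float[] boxed) {
        float[] flat = new float[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
            flat[i] = boxed[i]; // unboxing copy
        }
        return flat;
    }

    // Unmarshalling: copy the primitive results back into boxed objects.
    public static Float[] unmarshal(float[] flat) {
        Float[] boxed = new Float[flat.length];
        for (int i = 0; i < flat.length; i++) {
            boxed[i] = flat[i]; // boxing copy
        }
        return boxed;
    }
}
```

For tuple arrays the cost is worse still, since every Tuple2 object must be visited and its fields scattered into separate buffers.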

SLIDE 15

Custom Array Type: PArray

(Figure: from the programmer's view a PArray<Tuple2<Float, Double>> is a single array of tuples; inside the Graal-OCL VM it is stored as one flat FloatBuffer holding all the float fields and one flat DoubleBuffer holding all the double fields.)

With this layout, un/marshalling operations are not necessary.

15 / 23
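The struct-of-arrays idea behind PArray can be sketched in a few lines, assuming a Tuple2<Float, Double> element type (class and method names here are illustrative, not Marawacc's actual implementation):

```java
import java.nio.DoubleBuffer;
import java.nio.FloatBuffer;

public class PArraySketch {
    // Struct-of-arrays storage for a Tuple2<Float, Double> array:
    // one flat FloatBuffer for all _1 fields, one flat DoubleBuffer
    // for all _2 fields. These buffers match the layout OpenCL expects,
    // so no un/marshalling step is needed before a GPU transfer.
    private final FloatBuffer firsts;
    private final DoubleBuffer seconds;

    public PArraySketch(int size) {
        this.firsts = FloatBuffer.allocate(size);
        this.seconds = DoubleBuffer.allocate(size);
    }

    public void put(int i, float first, double second) {
        firsts.put(i, first);
        seconds.put(i, second);
    }

    public float getFirst(int i)   { return firsts.get(i); }
    public double getSecond(int i) { return seconds.get(i); }
}
```

The tuple objects exist only in the programmer's view; reads and writes go straight to the primitive buffers.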

SLIDE 16

Saxpy example

Float[] v1 = new Float[size];
Double[] v2 = new Double[size];

ArrayFunc<Tuple2<Float, Double>, Double> f;
f = new MapFunction<>(t -> alpha * t._1() + t._2());

Double[] result = f.zip(v1, v2).apply();

16 / 23

SLIDE 17

Saxpy with our Custom PArrays

Float[] v1 = new Float[size];
Double[] v2 = new Double[size];
PArray input = new PArray(v1, v2);

ArrayFunc<Tuple2<Float, Double>, Double> f;
f = new MapFunction<>(t -> alpha * t._1() + t._2());

PArray<Double> output = f.apply(input);

17 / 23

SLIDE 18

Results

18 / 23

SLIDE 19

OpenCL GPU Execution

AMD R9 and NVIDIA GeForce GTX Titan

(Two bar charts, log scale 0.1 to 1000: speedup over sequential Java for Saxpy, K-Means, Black-Scholes, N-Body, and Monte Carlo on small and large inputs, comparing the marshalling and the optimized (PArray) versions on the AMD R9 and on the NVIDIA GTX Titan.)

19 / 23

SLIDE 20

Comparison with OpenCL C++

AMD R9 and NVIDIA GeForce GTX Titan

(Two bar charts: speedup over sequential code on AMD and on NVIDIA for Saxpy, K-Means, Black-Scholes, N-Body, and Monte Carlo on small and large inputs, comparing Marawacc, Aparapi, and OpenCL C++.)

20 / 23

SLIDE 21

.zip(Conclusions).map(Future)

Present

◮ We have presented the Marawacc framework for programming GPUs from Java

◮ Custom array type to reduce overheads when transforming the data

◮ Runtime system to run heterogeneous applications within Java

Future

◮ Code generation for multiple devices

◮ Runtime scheduling (where is the best place to run the code?)

21 / 23

SLIDE 22

Thanks so much for your attention

Juan Fumero <juan.fumero@ed.ac.uk>

22 / 23

SLIDE 23

OpenCL code generated

double lambda0(float p0) {
    double cast_1 = (double) p0;
    double result_2 = cast_1 * 2.0;
    return result_2;
}

kernel void lambdaComputationKernel(
        global float *p0,
        global int *p0_index_data,
        global double *p1,
        global int *p1_index_data) {
    int p0_dim_1 = 0; int p1_dim_1 = 0;
    int gs = get_global_size(0);
    int loop_1 = get_global_id(0);
    for (;; loop_1 += gs) {
        int p0_len_dim_1 = p0_index_data[p0_dim_1];
        bool cond_2 = loop_1 < p0_len_dim_1;
        if (cond_2) {
            float auxVar0 = p0[loop_1];
            double res = lambda0(auxVar0);
            p1[p1_index_data[p1_dim_1 + 1] + loop_1] = res;
        } else { break; }
    }
}

23 / 23