SLIDE 1

University of Copenhagen, Niels Bohr Institute

Faculty of Science

Automatic Code Generation for Library Method Inclusion in Domain Specific Languages

Communicating Process Architectures 2017 – University of Malta

Mads Ohm Larsen

Niels Bohr Institute, University of Copenhagen, Denmark

21 August 2017 Slide 1/21

SLIDE 4

Why use libraries?

Introduction

Somebody else has already written a faster method than you could ever write. One example is fast matrix multiplication, available through the BLAS library call *gemm. If possible, we always want to use this faster method.

Slide 2/21 — M. O. Larsen — Automatic Code Generation for Library Method Inclusion in DSLs — August 21

SLIDE 6

cBLAS, Accelerate, clBLAS, LAPACK

Why can’t we?

So why not just use one of these specialized libraries all the time? Many different libraries exist, each targeting different purposes, architectures, and operating systems.

SLIDE 8

cBLAS, Accelerate, clBLAS, LAPACK

Use the best?

We have cBLAS, Accelerate, clBLAS, LAPACK, and many more. “Best” is hard to define: no single one of these is “best”, and each is “best” in its own way.

SLIDE 9

Coding BLAS

Code

cBLAS code:

    #include <cblas.h>

    ... // Set up m, n, k, A_data, B_data, and C_data

    // Calculates
    //   C := alpha * op(A) * op(B) + beta * C
    // where op(X) is either X or X^T
    cblas_sgemm(
        CblasRowMajor, // Memory layout
        CblasNoTrans,  // Transpose A?
        CblasNoTrans,  // Transpose B?
        m,             // Number of rows of op(A)
        n,             // Number of columns of op(B)
        k,             // Number of columns of op(A) and rows of op(B)
        1.0,           // Alpha argument
        A_data,        // Array of size m*k
        k,             // First dimension of A / Stride of A
        B_data,        // Array of size k*n
        n,             // Stride of B
        0.0,           // Beta argument
        C_data,        // Array of size m*n
        n              // Stride of C
    );
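For reference, the operation named in the comment above, C := alpha * op(A) * op(B) + beta * C, can be written as a naive triple loop. The sketch below uses plain Python lists. The function name `gemm_reference` and its `trans_a`/`trans_b` flags are made up for illustration and are not part of any BLAS API. It only shows what *gemm computes, not how an optimized library computes it.

```python
def transpose(x):
    """Return the transpose of a row-major list-of-lists matrix."""
    return [list(col) for col in zip(*x)]


def gemm_reference(alpha, a, b, beta, c, trans_a=False, trans_b=False):
    """Naive C := alpha * op(A) * op(B) + beta * C (hypothetical helper)."""
    op_a = transpose(a) if trans_a else a
    op_b = transpose(b) if trans_b else b
    m, k, n = len(op_a), len(op_b), len(op_b[0])
    if any(len(row) != k for row in op_a):
        raise ValueError("inner dimensions of op(A) and op(B) differ")
    return [
        [
            alpha * sum(op_a[i][p] * op_b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # 2 x 3
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 x 2
c = [[0.0, 0.0], [0.0, 0.0]]              # 2 x 2
result = gemm_reference(1.0, a, b, 0.0, c)
print(result)
```

With alpha = 1 and beta = 0, as in the cBLAS call, this reduces to the plain product C = AB.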

SLIDE 10

Coding BLAS

Code

Python code:

    import numpy as np
    ... # Set up a and b
    c = np.matmul(a, b)

SLIDE 12

Python/NumPy

NumPy

NumPy already uses BLAS for calls like matmul. Problem solved?

  • No. Python/NumPy is “slow” (single threaded) and cannot utilize GPGPUs or other accelerators out-of-the-box.

SLIDE 14

Python/NumPy vs. Bohrium

Bohrium

Bohrium can use GPGPUs, but does not support BLAS. Let us make it support library methods such as BLAS.

SLIDE 17

Code-generation for Bohrium

Compile

When you compile or install Bohrium, CMake can look for installed libraries to link against. NumPy does the same when you compile or install it. If we find BLAS, we want to link against it. However, if we also find clBLAS, we want to link against that as well.
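CMake performs this discovery at build time. As a rough analogy only, the same kind of probing can be sketched with Python's standard `ctypes.util.find_library`. The candidate list below is an assumption chosen for illustration and does not reproduce Bohrium's actual CMake logic.

```python
from ctypes.util import find_library

# Candidate library names; chosen for illustration only.
CANDIDATES = ["blas", "openblas", "clBLAS"]


def probe_libraries(names):
    """Map each library name to what the system reports for it, or None."""
    return {name: find_library(name) for name in names}


found = probe_libraries(CANDIDATES)
available = [name for name, location in found.items() if location is not None]
print("linkable libraries:", available)
```

Every library that is found would be linked in, so that the choice between them can be deferred.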

SLIDE 19

Code-generation for Bohrium

Choose

With automatic code inclusion, we can choose which library to use at both compile time and run time! We want Bohrium to link to both BLAS libraries and choose the correct one later.
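One way to picture "link to both, choose later" is a registry that is filled when the code is built and consulted when the program runs. The following is a minimal sketch under stated assumptions: the backend names, the `register` decorator, and the `MATMUL_BACKEND` variable are all hypothetical, and the two functions are pure-Python stand-ins for calls into a CPU or GPU BLAS.

```python
import os

# Registry of implementations that were "linked in" (hypothetical names).
BACKENDS = {}


def register(name):
    """Decorator that records an implementation under a backend name."""
    def wrap(func):
        BACKENDS[name] = func
        return func
    return wrap


@register("cpu")
def matmul_cpu(a, b):
    # Stand-in for a call into a CPU BLAS such as cblas_sgemm.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


@register("gpu")
def matmul_gpu(a, b):
    # Stand-in for a call into a GPU BLAS such as clBLAS; same result here.
    return matmul_cpu(a, b)


def select_backend(env=None):
    """Pick a backend at run time; MATMUL_BACKEND is an illustrative variable."""
    env = os.environ if env is None else env
    name = env.get("MATMUL_BACKEND", "cpu")
    if name not in BACKENDS:
        raise KeyError(f"backend {name!r} was not linked in")
    return BACKENDS[name]


matmul = select_backend({"MATMUL_BACKEND": "gpu"})
product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```

The caller's code stays the same whichever backend is selected, which is the property the slide is after.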

SLIDE 21

Code-generation for Bohrium

Implementing

We can implement all the BLAS calls ourselves.

  • Tedious. Let’s generate it instead!

SLIDE 23

JSON, template, generate!

JSON

All of the BLAS methods follow a similar pattern. Let’s use that to our advantage.

SLIDE 24

JSON, template, generate!

JSON

    {
      "methods": [
        {
          "name": "gemm",
          "types": [ "s", "d", "c", "z" ],
          "options": [
            "layout", "notransA", "notransB",
            "m", "n", "k",
            "A", "B", "C"
          ]
        },
        ...
      ]
    }
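To make the role of this description concrete, the sketch below expands it into one function name per type letter and turns each option into a boolean flag. It is an illustration under assumptions, not the actual Bohrium generator: the `expand` helper is hypothetical, the `cblas_` prefix and the `if_` flag naming are borrowed from the cBLAS call and the template shown in this talk, and only the `gemm` entry is included because the remaining entries are elided on the slide.

```python
import json

SPEC = json.loads("""
{
  "methods": [
    {
      "name": "gemm",
      "types": ["s", "d", "c", "z"],
      "options": ["layout", "notransA", "notransB",
                  "m", "n", "k", "A", "B", "C"]
    }
  ]
}
""")


def expand(spec, prefix="cblas_"):
    """Yield (function_name, flags) for every method/type combination."""
    for method in spec["methods"]:
        flags = {"if_" + option: True for option in method["options"]}
        for type_letter in method["types"]:
            yield prefix + type_letter + method["name"], flags


functions = dict(expand(SPEC))
print(list(functions))
```

One short JSON entry therefore stands for four library calls that differ only in their element type.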

SLIDE 25

JSON, template, generate!

C++

    case @!utype!@: {
        @!alpha!@
        @!beta!@
        cblas_@!t!@@!name!@(
            <!--(if if_layout)--> CblasRowMajor, <!--(end)-->
            <!--(if if_side)--> CblasLeft, <!--(end)-->
            <!--(if if_uplo)--> CblasUpper, <!--(end)-->
            <!--(if if_notransA)--> CblasNoTrans, <!--(end)-->
            <!--(if if_transA)--> CblasTrans, <!--(end)-->
            <!--(if if_notransB)--> CblasNoTrans, <!--(end)-->
            <!--(if if_diag)--> CblasUnit, <!--(end)-->
            <!--(if if_m)--> m, <!--(end)-->
            <!--(if if_n)--> n, <!--(end)-->
            <!--(if if_k)--> k, <!--(end)-->
            @!alpha_arg!@,
            (@!blas_type!@*)(((@!type!@*) A_data) + A->start),
            k,
            <!--(if if_B)-->
            (@!blas_type!@*)(((@!type!@*) B_data) + B->start),
            n<!--(if if_C)-->,<!--(end)-->
            <!--(end)-->
            <!--(if if_C)-->
            @!beta_arg!@,
            (@!blas_type!@*)(((@!type!@*) C_data) + C->start),
            n
            <!--(end)-->
        );
        break;
    }
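The template combines two mechanisms: `@!name!@` placeholders that are replaced by values, and `<!--(if flag)-->` ... `<!--(end)-->` blocks that are kept or dropped depending on a flag. The toy renderer below illustrates both on a shortened template. It is a sketch only, not the template engine used to build Bohrium, and the shortened template uses bare `A`, `B`, `C` in place of the pointer expressions on the slide.

```python
import re

# Innermost conditional: an (if ...) block whose body holds no further (if ...).
IF_BLOCK = re.compile(
    r"<!--\(if (\w+)\)-->((?:(?!<!--\(if ).)*?)<!--\(end\)-->", re.DOTALL
)
PLACEHOLDER = re.compile(r"@!(\w+)!@")


def render(template, values, flags):
    """Resolve conditionals from the inside out, then fill placeholders."""
    def keep_or_drop(match):
        return match.group(2) if flags.get(match.group(1)) else ""

    previous = None
    while previous != template:
        previous = template
        template = IF_BLOCK.sub(keep_or_drop, template)
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)


TEMPLATE = (
    "cblas_@!t!@@!name!@("
    "<!--(if if_layout)-->CblasRowMajor, <!--(end)-->"
    "<!--(if if_side)-->CblasLeft, <!--(end)-->"
    "<!--(if if_m)-->m, <!--(end)-->"
    "A<!--(if if_B)-->, B<!--(if if_C)-->, C<!--(end)--><!--(end)-->);"
)

call = render(
    TEMPLATE,
    values={"t": "s", "name": "gemm"},
    flags={"if_layout": True, "if_m": True, "if_B": True, "if_C": True},
)
print(call)
```

Flags that are absent, such as `if_side` here, simply remove their argument, which is how one template can cover methods with different argument lists.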

SLIDE 26

Examples and results

Run

All of the following CPU examples were run on an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz. The GPGPU examples were run on a GeForce GTX 680 (OpenCL C 1.1). Each example was run 10 times, and the average run time is reported.
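The measurement procedure described here (10 runs, averaged) can be sketched as a small timing helper. The helper name `average_runtime` and the placeholder workload are assumptions for illustration; the actual benchmark scripts are not shown in the slides.

```python
import time


def average_runtime(func, repeats=10):
    """Call func() `repeats` times and return the mean wall-clock time in seconds."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        total += time.perf_counter() - start
    return total / repeats


def workload():
    # Placeholder workload; the talk measured matrix multiplication and erode.
    return sum(i * i for i in range(10_000))


mean_seconds = average_runtime(workload)
print(f"mean over 10 runs: {mean_seconds:.6f} s")
```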

SLIDE 27

Examples and results

Implementation              Time [s]
OpenBLAS (C)                0.47
NumPy                       0.47
Bohrium                     10.91
Bohrium w/Ext               0.46
clBLAS (C++)                0.08
Bohrium w/OpenCL            17.12
Bohrium w/OpenCL w/Ext      0.1

Figure: General matrix multiplication (C = AB) for multiple implementations. Calculating 2000 × 3000 multiplied with 3000 × 4000 floating point elements. The time axis of the bar chart runs from 0 s to 1 s.
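The effect of the extension method can be expressed as a speedup ratio. The sketch below assumes that the seven bar values pair with the seven implementation labels in the order in which both appear in the figure. That pairing is an assumption read off the figure, not something stated in the slides.

```python
# Times in seconds, paired with labels in the order they appear in the figure.
times = {
    "OpenBLAS (C)": 0.47,
    "NumPy": 0.47,
    "Bohrium": 10.91,
    "Bohrium w/Ext": 0.46,
    "clBLAS (C++)": 0.08,
    "Bohrium w/OpenCL": 17.12,
    "Bohrium w/OpenCL w/Ext": 0.1,
}

# Speedup = time without the extension method / time with it.
cpu_speedup = times["Bohrium"] / times["Bohrium w/Ext"]
gpu_speedup = times["Bohrium w/OpenCL"] / times["Bohrium w/OpenCL w/Ext"]
print(f"CPU: {cpu_speedup:.1f}x, OpenCL: {gpu_speedup:.1f}x")
```

Under that pairing, the extension method brings Bohrium to roughly the same run time as the hand-written library versions.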

SLIDE 30

Examples and results

The C version (OpenBLAS) is 76 lines of code. The C++ version (clBLAS) is 265 lines of code. The NumPy version is 11 lines of code, and Bohrium is just another library import. Using Bohrium with OpenCL only requires setting an environment variable.

SLIDE 31

Examples and results

Implementation      Time [s]
OpenCV              0.02
NumPy               10.81
Bohrium             0.49
Bohrium w/Ext       0.03

Figure: Erode for multiple implementations. We use a 10,000 × 10,000 pixel random binary image. The time axis of the bar chart runs from 0 s to 0.6 s.
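For readers unfamiliar with the operation being timed, binary erosion keeps a pixel set only if its whole neighbourhood is set. The sketch below is a naive pure-Python version under stated assumptions: a 3 × 3 square neighbourhood and zero padding outside the image. The slides do not specify the structuring element or border handling, and library implementations may differ on both.

```python
def erode(image):
    """Binary erosion with a 3x3 square structuring element (assumed here).

    A pixel stays 1 only if it and all eight neighbours are 1; pixels
    outside the image are treated as 0.
    """
    rows, cols = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    return [
        [
            int(all(pixel(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


image = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
eroded = erode(image)
print(eroded)
```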

SLIDE 34

Conclusion

Productivity and performance

We can now utilize the same libraries in Bohrium as in NumPy ... and they work just as fast ... and even faster in cases where we can utilize GPGPUs.

SLIDE 35

Future work

DSL

A more generic DSL-like interface for creating these method inclusions:

    extmethod_name "blas"
    method_prefix "blas"

    method do
      names "sgemm", "dgemm", "cgemm", "zgemm"
      data_operand 0
      types "bh_float32", "bh_float64", "bh_complex64", "bh_complex128"

      load_data "C"
      load_data "A"
      load_data "B"

      ...
    end

    ...

SLIDE 36

Questions?

Comments?

Feel free to ask anything.
