COMPILER-ASSISTED TEST ACCELERATION ON GPUS FOR EMBEDDED SOFTWARE




SLIDE 1

COMPILER-ASSISTED TEST ACCELERATION ON GPUS FOR EMBEDDED SOFTWARE

VANYA YANEVA, Ajitha Rajan, Christophe Dubach

ISSTA 2017, 10 July 2017, Santa Barbara, CA

SLIDE 2

EMBEDDED SOFTWARE IS EVERYWHERE
ITS SAFETY AND CORRECTNESS ARE CRUCIAL
FUNCTIONAL TESTING IS CRITICAL

SLIDE 3

FUNCTIONAL TESTING CAN BE EXTREMELY TIME CONSUMING

SLIDE 4

FUNCTIONAL TESTING CAN BE EXTREMELY TIME CONSUMING

Test suite

[Diagram: test case 1 … test case n are run through the application; its outputs are checked against expected result 1 … expected result n]

SLIDE 5


TESTING IS AN IDEAL CANDIDATE FOR PARALLELISATION

SLIDE 6

CPU SERVERS

  • Expensive
  • Do not scale easily as test suites grow
  • Can be extremely underutilised

SLIDE 7


GPUS

  • Cheap and widely available
  • Large-scale parallelism: thousands of threads
  • SIMD architecture suited to functional testing

SLIDE 8

Test suite: test case 1 … test case n, with expected result 1 … expected result n

  1. Read test cases: INPUT[] = {test case 1 … test case n}
  2. Transfer INPUT[] to GPU memory
  3. Build and launch the tested program on the GPU threads
  4. Each thread th_id (0, 1, 2 … n-1) computes: OUTPUT[th_id] = program(INPUT[th_id])
  5. Transfer OUTPUT[] to CPU memory

A. Rajan, S. Sharma, P. Schrammel, D. Kroening. Accelerated test execution using GPUs. In Proceedings of ASE 2014, pages 97-102, Sweden, Nov 2014.

EXECUTE TESTS IN PARALLEL ON THE GPU THREADS

SLIDE 9


CHALLENGES Usability ✘ Scope ✘ Performance ?

SLIDE 10

INTRODUCING PARTECL

[Diagram: unmodified source files + config file → ParTeCL CodeGen → OpenCL → ParTeCL Runtime (with test cases in CSV format) → execution on the GPU]

SLIDE 11

Example (C source):

#include <stdio.h>
#include <stdlib.h>

int c;

int addc(int a, int b) {
    return a + b + c;
}

int main(int argc, char* argv[]) {
    int a = atoi(argv[1]);
    int b = atoi(argv[2]);
    c = 3;
    int sum = addc(a, b);
    printf("%d + %d + %d = %d\n", a, b, c, sum);
}

Configuration:

input: int a 1
input: int b 2
result: int sum
variable: sum

Test cases (INPUTS: id, a, b):

1 13 7
2 50 22
3 1000 0
4 0 1000
5 0 0

SLIDE 12

PARTECL CODEGEN

Example (original C):

#include <stdio.h>
#include <stdlib.h>

int c;

int addc(int a, int b) {
    return a + b + c;
}

int main(int argc, char* argv[]) {
    int a = atoi(argv[1]);
    int b = atoi(argv[2]);
    c = 3;
    int sum = addc(a, b);
    printf("%d + %d + %d = %d\n", a, b, c, sum);
}

Generated OpenCL:

#include "structs.h"
//#include <stdio.h>
//#include <stdlib.h>

/*int c;*/
int addc(int a, int b, int *c) {
    return a + b + (*c);
}

kernel void main_kernel(
        global struct test_input* inputs,
        global struct test_result* results) {
    int idx = get_global_id(0);
    struct test_input input_gen = inputs[idx];
    global struct test_result *result_gen = &results[idx];
    int argc = input_gen.argc;
    result_gen->test_case_num = input_gen.test_case_num;
    int c;
    int a = input_gen.a;
    int b = input_gen.b;
    c = 3;
    int sum = addc(a, b, &c);
    /*printf("%d + %d + %d = %d\n", a, b, c, sum);*/
    result_gen->sum = sum;
}

SLIDE 13

CODE TRANSFORMATIONS
  • global scope variables
  • command line arguments
  • standard in/out
  • standard library (partial support): clClibc

SLIDE 14

PARTECL RUNTIME

  1. Read test cases: INPUT[] = {test case 1 … test case n}
  2. Transfer INPUT[] to GPU memory
  3. Build and launch the tested program on the GPU threads
  4. Each thread th_id (0, 1, 2 … n-1) computes: OUTPUT[th_id] = program(INPUT[th_id])
  5. Transfer OUTPUT[] to CPU memory

Automatically generated OpenCL

SLIDE 15

CHALLENGES Usability ✔ Scope ✔ Performance ?

[Diagram: unmodified source files + config file → ParTeCL CodeGen → OpenCL → ParTeCL Runtime (with test cases in CSV format) → execution on the GPU]

SLIDE 16

EVALUATION

  • 1. Speedup against CPU
  • 2. Data transfer overhead
  • 3. Comparison to a multi-core CPU
  • 4. Correctness
SLIDE 17

EXPERIMENT

Subjects: EEMBC, an industry-standard benchmark suite for embedded software
Hardware: GPU - NVidia Tesla K40m; CPU - Intel Xeon, 8 cores
Test suite size: 130K test cases

SLIDE 18

SPEEDUP AGAINST CPU

SLIDE 19

DATA TRANSFER OVERHEAD

[Charts: execution time (ms) vs. number of tests (2^8 to 2^17, log base 2 scale) for nine EEMBC benchmarks - a2time01, autcor00, conven00, fbital00, fft00, puwmod01, rspeed01, tblook01, viterb00 - broken down into input transfer, output transfer, and kernel execution]

SLIDE 20

DATA TRANSFER OVERHEAD

SLIDE 21

COMPARISON TO A MULTI-CORE CPU

SLIDE 22

CHALLENGES Usability ✔ Scope ✔ Performance ✔

SLIDE 23

CORRECTNESS For all 9 benchmarks, testing results from the GPU are an exact match to the testing results from the CPU.

SLIDE 24

SUMMARY
  • Automatic GPU code generation
  • Automatic test execution on the GPU threads
  • Speedup of up to 53x (avg 16x) on EEMBC benchmarks
  • Correct testing results

SLIDE 25

FUTURE WORK
  • Extend evaluation & scope
  • Analyse & improve performance

SLIDE 26

THANKS

ParTeCL CodeGen: github.com/wyaneva/partecl-codegen
ParTeCL Runtime: github.com/wyaneva/partecl-runtime
clClibc: github.com/wyaneva/clClibc

SLIDE 28

C FEATURES

Out of the box:
  • pure functions, function calls, double precision (for OpenCL 1.2)
With transformations:
  • standard in/out
  • global scope variables
  • standard library calls (partial support)
Unsupported (yet):
  • dynamic memory allocation
  • file I/O
  • recursion