QUASAR (GPU Programming Language) on GDaaS Accelerates Coding from Months to Days - PowerPoint PPT Presentation



SLIDE 1

QUASAR

(GPU Programming Language)

on GDaaS Accelerates Coding from Months to Days

SLIDE 2

OUTLINE

1. Cause
2. The Offer
3. How Does It Work
4. Demo
5. Results
6. Conclusion

SLIDE 3

GPUs are everywhere

SLIDE 4

… ALMOST EVERYWHERE ???

  • Each HW platform requires a new implementation
  • Long development lead times
  • Strong coupling between algorithm development and implementation
  • Low-level coding experts are required

SLIDE 5

OBSERVATION

  • While breakthrough results are achieved, GPU usage in research is still limited
    — Scientific articles mentioning CUDA: 90K
    — Scientific articles mentioning a specific scripting language: 400K-1900K
  • A continuous investment in easy access:
    — Optimized libraries: cuFFT, cuDNN, cuBLAS, …
    — Tools: Digits
    — Limited to specific applications
  • High-level access increases the potential market by a factor of 7

SLIDE 6

QUASAR on GPU Desktop as a Service (GDaaS)

  • High-level programming language
  • IDE & runtime
  • Optimization knowledge base & libraries

SLIDE 7

QUASAR'S VALUE PROPOSITION

  • Lowering the barrier of entry: larger market and faster take-up; days instead of months to get started; algorithm development decoupled from implementation
  • Faster development: earlier product launch; development cycle reduction by a factor of 3 to 10; reduced R&D and maintenance costs
  • Efficient code: number of lines of code reduced by a factor of 2 to 3, with the same performance
  • Future-proof: early access to the highest performance; code can also target other GPU models
  • Better products: better algorithms, with distinctive tools for coding, design analysis and exploration

SLIDE 8

HOW DOES IT WORK?

Y = sum(A + B .* C + D)

Code analysis and target-dependent lowering
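The high-level expression is ordinary array arithmetic followed by a full reduction. A minimal NumPy sketch of the semantics (the array values here are illustrative, not part of the slides):

```python
import numpy as np

# Illustrative stand-ins for the Quasar vectors A, B, D and scalar C.
A = np.array([1.0, 2.0, 3.0, 4.0])
B = np.array([0.5, 0.5, 0.5, 0.5])
C = 2.0
D = np.array([1.0, 1.0, 1.0, 1.0])

# Y = sum(A + B .* C + D): elementwise multiply and add, then a full reduction.
Y = np.sum(A + B * C + D)
print(Y)  # 18.0
```

The compiler's job, sketched on the next slides, is to turn this one-liner into a single parallel kernel rather than several elementwise passes plus a separate sum.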

SLIDE 9

HOW DOES IT WORK?

Y = sum(A + B .* C + D)

function $out:scalar = __kernel__ kernel$1(A:vec'col'unchecked, B:vec'unchecked, C:scalar,
        D:vec'unchecked, $datadims:int, blkpos:int, blkdim:int, blkidx:int)
    $bins:vec'unchecked = shared(blkdim)
    $accum0 = 0.
    for $m = (blkpos+(blkidx*blkdim))..(64*blkdim)..($datadims-1)
        pos = $m
        $accum0 += (A[pos]+(B[pos].*C)+D[pos])
    end
    $bins[blkpos] = $accum0
    syncthreads
    $bit = 1
    while ($bit < blkdim)
        if (mod(blkpos,(2*$bit)) == 0)
            $bins[blkpos] = ($bins[blkpos]+$bins[blkpos+$bit])
        endif
        syncthreads
        $bit *= 2
        continue
    end
    if (blkpos == 0)
        $out += $bins[0]
    endif
end
$out = parallel_do([($blksz.*[1,64,1]),$blksz], A, B, C, D, numel(A), kernel$1)

  • Code analysis and target-dependent lowering
  • Parallel reduction algorithm using shared memory
  • Automatic generation of a kernel function
  • Compile-time handling of boundary checks
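The generated kernel's while-loop is a classic tree reduction in shared memory. A hedged Python sketch of that pattern, with the lockstep behaviour of a thread block simulated by a sequential loop (the function name and sequential simulation are illustrative, not Quasar output):

```python
def block_tree_reduce(values):
    """Tree reduction over one 'thread block': at each step, threads
    whose index is a multiple of 2*bit add in the partial sum that
    sits bit positions away, halving the number of active threads."""
    bins = list(values)        # stands in for the shared-memory array $bins
    blkdim = len(bins)
    bit = 1
    while bit < blkdim:
        for blkpos in range(blkdim):          # all 'threads' in lockstep
            if blkpos % (2 * bit) == 0 and blkpos + bit < blkdim:
                bins[blkpos] += bins[blkpos + bit]
        bit *= 2               # on a GPU, syncthreads separates the steps
    return bins[0]

print(block_tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

For blkdim threads this takes log2(blkdim) synchronized steps instead of blkdim-1 sequential additions, which is why the compiler emits this shape rather than a serial loop.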

SLIDE 10

HOW DOES IT WORK?

Y = sum(A + B .* C + D)

function $out:scalar = __kernel__ kernel$1(A:vec'col'unchecked, B:vec'unchecked, C:scalar,
        D:vec'unchecked, $datadims:int, blkpos:int, blkdim:int, blkidx:int)
    $bins:vec'unchecked = shared(blkdim)
    $accum0 = 0.
    for $m = (blkpos+(blkidx*blkdim))..(64*blkdim)..($datadims-1)
        pos = $m
        $accum0 += (A[pos]+(B[pos].*C)+D[pos])
    end
    $bins[blkpos] = $accum0
    syncthreads
    $bit = 1
    while ($bit < blkdim)
        if (mod(blkpos,(2*$bit)) == 0)
            $bins[blkpos] = ($bins[blkpos]+$bins[blkpos+$bit])
        endif
        syncthreads
        $bit *= 2
        continue
    end
    if (blkpos == 0)
        $out += $bins[0]
    endif
end
$out = parallel_do([($blksz.*[1,64,1]),$blksz], A, B, C, D, numel(A), kernel$1)

  • Code analysis and target-dependent lowering

__global__ void kernel(scalar *ret, Vector _PA, Vector _PB, scalar _PC,
                       Vector _PD, int _P_datadims)
{
    shmem shmem; shmem_init(&shmem);
    int blkpos = threadIdx.x, blkdim = blockDim.x, blkidx = blockIdx.x;
    Matrix o35, bins;
    scalar accum0;
    int m, bit;
    bins = shmem_alloc<scalar>(&shmem, blkdim);
    accum0 = 0.0f;
    for (m = (blkpos + (blkidx * blkdim)); m <= _P_datadims - 1; m += (64 * blkdim)) {
        accum0 += vector_get_at<scalar>(_PA, m) +
                  vector_get_at_checked<scalar>(_PB, m) * _PC +
                  vector_get_at_checked<scalar>(_PD, m);
    }
    vector_set_at<scalar>(bins, blkpos, accum0);
    __syncthreads();
    for (bit = 1; bit < blkdim; bit *= 2) {
        if (mod(blkpos, (2 * bit)) == 0) {
            scalar t05 = vector_get_at_safe<scalar>(bins, blkpos + bit);
            scalar t15 = vector_get_at<scalar>(bins, blkpos);
            vector_set_at<scalar>(bins, blkpos, (t15 + t05));
        }
        __syncthreads();
    }
    if (blkpos == 0)
        atomicAdd(ret, vector_get_at<scalar>(bins, 0));
}

  • Parallel reduction algorithm using shared memory
  • Automatic generation of a kernel function
  • Compile-time handling of boundary checks
  • Automatic generation of CUDA/OpenCL/C++ code

SLIDE 11

QUASAR'S WORKFLOW

DEVELOPMENT
  • High-level scripting: ideal for rapid prototyping
  • Compact, readable code

OPTIMAL RUNTIME EXECUTION
  • Runtime information

[Diagram: SCRIPTING LANGUAGE, CODE ANALYSIS & LOWERING, INPUT DATA, COMPILATION, HARDWARE]

SLIDE 12

QUASAR'S WORKFLOW

OPTIMAL RUNTIME EXECUTION
  • Runtime information
  • Optimization hints
  • Automatic detection of parallelism

[Diagram: SCRIPTING LANGUAGE, CODE ANALYSIS & LOWERING, INPUT DATA, COMPILATION, HARDWARE]

DEVELOPMENT

SLIDE 13

QUASAR'S WORKFLOW

OPTIMAL RUNTIME EXECUTION
  • Runtime information

[Diagram: SCRIPTING LANGUAGE, CODE ANALYSIS & LOWERING, INPUT DATA, COMPILATION, HARDWARE]

DEVELOPMENT

SLIDE 14

QUASAR'S WORKFLOW

OPTIMAL RUNTIME EXECUTION
  • Runtime information: HW setup, load, memory state, scheduling

[Diagram: SCRIPTING LANGUAGE, CODE ANALYSIS & LOWERING, INPUT DATA, COMPILATION, HARDWARE]

DEVELOPMENT
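To make the role of runtime information concrete, here is a hypothetical Python sketch of a scheduler that picks an execution target from the current hardware state. The device descriptions, threshold, and field names are invented for illustration and are not part of Quasar or GDaaS:

```python
def choose_device(n_elements, devices):
    """Pick an execution target from runtime information: prefer a GPU
    with enough free memory and spare capacity for large workloads,
    otherwise fall back to the CPU. 'devices' describes the HW state."""
    BYTES_PER_ELEMENT = 4            # illustrative: single-precision data
    needed = n_elements * BYTES_PER_ELEMENT
    if n_elements >= 10_000:         # invented threshold: small jobs stay on CPU
        for dev in devices:
            if dev["kind"] == "gpu" and dev["free_mem"] >= needed and dev["load"] < 0.9:
                return dev["name"]
    return "cpu"

# Example hardware state at launch time (values invented).
state = [
    {"name": "gpu0", "kind": "gpu", "free_mem": 8 << 30, "load": 0.2},
    {"name": "cpu",  "kind": "cpu", "free_mem": 32 << 30, "load": 0.5},
]
print(choose_device(1_000_000, state))  # gpu0
print(choose_device(256, state))        # cpu
```

The point of the slide is that this decision is made at run time, per kernel launch, using live load and memory state rather than choices frozen at compile time.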

SLIDE 15

QUASAR ON GDaaS BENEFITS

  • Anyness (screen, device, GPU power)
  • Hourly model (1-4 GPUs)
  • Monthly Quasar licenses
  • Instant app distribution
  • Today: M60; Coming: Multi-GPU

SLIDE 16

DEMO

SLIDE 17

RESULTS

[Chart: Lines of code (CUDA-LOC vs QUASAR-LOC) for three benchmarks: 32-tap filter with global memory, 32x32 separable 2D spatial filter with global memory, wavelet filter with global memory]
[Chart: Development time (CUDA-Dev Time vs QUASAR-Dev Time) for the same three benchmarks]
[Chart: Execution time in ms (CUDA-time vs QUASAR-time) for the same three benchmarks]

Implementation of an MRI reconstruction algorithm in under 14 days using QUASAR versus 3 months using CUDA

More efficient code and shorter development times while keeping the same performance

SLIDE 18

QUASAR APPLICATIONS

Quasar on GDaaS Accelerates Coding from Months to Days

SLIDE 19

CONCLUSION

  • High-level scripting language: ideal for rapid prototyping, fast development, maintainable and compact code
  • Optimal usage of heterogeneous hardware (multi-core, GPUs): context-aware execution
  • Build once, execute on any system: different hardware, different optimization; future-proof code
  • Better, faster and smarter development thanks to GDaaS

Quasar on GDaaS Accelerates Coding from Months to Days

SLIDE 20

www.gdaas.com/quasar
www.gepura.io
Visit us at booth 826

LEAVE YOUR BUSINESS CARD TO REQUEST YOUR FREE TRIAL OR GO TO www.gdaas.com