SOCAO: Source-to-Source OpenCL Compiler for Intel-Altera FPGAs - - PowerPoint PPT Presentation

socao source to source opencl compiler for intel altera
SMART_READER_LITE
LIVE PREVIEW

SOCAO: Source-to-Source OpenCL Compiler for Intel-Altera FPGAs - - PowerPoint PPT Presentation

SOCAO: Source-to-Source OpenCL Compiler for Intel-Altera FPGAs Johanna Rohde 1 , Marcos Martinez-Peir 2 , Rafael Gadea-Girons 2 Date: 7.9.2017 1 Computer Systems Group, TU Darmstadt 2 Department of Electronic Engineering, Universitat


slide-1
SLIDE 1

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 1

SOCAO: Source-to-Source OpenCL Compiler for Intel-Altera FPGAs

Johanna Rohde1, Marcos Martinez-Peiró2, Rafael Gadea-Gironés2

Date: 7.9.2017

1Computer Systems Group, TU Darmstadt 2Department of Electronic Engineering, Universitat

Politècnica de València

slide-2
SLIDE 2

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 2

Overview

  • Introduction
  • Background
  • Design
  • Implementation
  • Evaluation
  • Conclusion
slide-3
SLIDE 3

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 3

Introduction Problem: Accelerate a program with an FPGA

  • How do I program the FPGA?
  • How do I communicate with the FPGA?
  • How much time do I need to rewrite the code?
slide-4
SLIDE 4

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 4

Introduction

ASIC FPGA Programmers

Parallel Programmers Sofuware Programmers Low Level Tools OpenCL for FPGA SOCAO Compiler for C to OpenCL

slide-5
SLIDE 5

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 5

Background

  • OpenCL

Open programming standard for heterogeneous parallel systems

Calculations are passed to external accelerator

Accelerator can be

  • CPU
  • GPU
  • FPGA
  • ...
slide-6
SLIDE 6

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 6

Background OpenCL

  • Platform

Host

  • Manages the system
  • Is connected to one or more

compute devices

Compute Device

  • Executes a kernel
  • Consists of multiple compute

units

Compute Unit

  • Consists of multiple processing

elements

Host Compute Device

Compute Unit PE PE PE PE

...

Compute Unit PE PE PE PE

...

Compute Unit PE PE PE PE

... ...

slide-7
SLIDE 7

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 7

  • Memory Model

– Global Memory

  • Used to transfer data
  • Accessible by all work groups
  • Normally the slowest memory

– Constant Memory

  • Used to save constants

– Local Memory

  • Accessible by all work items of
  • ne work group
  • Not accessible by host

– Private Memory

  • Accessible by one work item
  • Holds intermediate values

HOST

Host Memory

DEVICE

Global Memory Constant Memory

WORK GROUP

Local Memory

WORK ITEM

Private Memory

WORK ITEM

Private Memory

...

Background OpenCL

slide-8
SLIDE 8

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 8

  • Host Program Flow

Background OpenCL

Initialize Environment Load Kernel Set Kernel Arguments Launch Kernel Execution Write Buffer Read Buffer

Host Program Host Program OpenCL Kernel OpenCL Kernel

slide-9
SLIDE 9

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 9

  • 2 additional forms of parallelism
  • Instruction-level parallelism

– Instructions that are independent of each other can be

calculated at the same time

Background OpenCL for FPGAs

x = (a+b)*(c+d); y = x - e; z = x << 4;

+ a b *

<<

4 z e

  • y

c d +

time

+ a b c d * +

<<

  • e

4 y z

slide-10
SLIDE 10

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 10

  • Loop Pipelining

– Iterations are overlapped – Ideal case: One Iteration per clock cycle – Problem when the loop has loop-carried dependencies

Background OpenCL for FPGAs

for(int a=0; a<4; a++) { x = (a+b)*(c+d); y = x << 4; z = x - e; }

+/+ * <</- +/+ * <</- +/+ * <</- +/+ * <</-

time iteration 1: iteration 2: iteration 3: iteration 4:

slide-11
SLIDE 11

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 11

  • Intel’s SDK for OpenCL

Background OpenCL for FPGAs

Cross Compiler Cross Compiler Intel Offmine Compiler Intel Offmine Compiler

.exe .exe .aocx .aocx

__kernel void sum( __global int *A, __global int *B, __global int *res, int size) { for(int i=0; i<size; i++) res[i] = A[i] + B[i]; } __kernel void sum( __global int *A, __global int *B, __global int *res, int size) { for(int i=0; i<size; i++) res[i] = A[i] + B[i]; }

OpenCL Accelerator Code

void sum(int *A, int *B, int *res, int size) { clEnqueueWriteBuffer(...); ClEnqueueTask(...); ClEnqueueReadBuffer(...); } void sum(int *A, int *B, int *res, int size) { clEnqueueWriteBuffer(...); ClEnqueueTask(...); ClEnqueueReadBuffer(...); }

Host Code

slide-12
SLIDE 12

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 12

Design

SOCAO Compiler SOCAO Compiler SOCAO Compiler SOCAO Compiler

//Altera_OpenCL_Accelerate //Altera_OpcnCL_size A size //Altera_OpenCL... void sum(int *A, int *B, int *res, int size) { for(int i=0; i<size; i++) res[i] = A[i]+B[i]; } //Altera_OpenCL_Accelerate //Altera_OpcnCL_size A size //Altera_OpenCL... void sum(int *A, int *B, int *res, int size) { for(int i=0; i<size; i++) res[i] = A[i]+B[i]; }

Input Program

void sum(int *A, int *B, int *res, int size) { clEnqueueWriteBuffer( ... ); clEnqueueTask( … ); clEnqueueReadBuffer( … ); } void sum(int *A, int *B, int *res, int size) { clEnqueueWriteBuffer( ... ); clEnqueueTask( … ); clEnqueueReadBuffer( … ); }

Host Code

__kernel void aocl_generated_kernel( __global int *A, __global int *B, __global int *res, int size) { for(int i=0; i < size; i++) res[i] = A[i] + B[i]; } __kernel void aocl_generated_kernel( __global int *A, __global int *B, __global int *res, int size) { for(int i=0; i < size; i++) res[i] = A[i] + B[i]; }

OpenCL Accelerator Code

slide-13
SLIDE 13

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 13

Implementation

  • ROSE Framework

– Open-source compiler framework – Provides front-end, back-end and additional functionalities

Middle-end Front-end Back-end

Start Abort Abort End Parse Unparse Semantic Check Function Detection Kernel Creation Function Analysis & Transformation Host Program Transformation

yes no

Successful

yes no

Successful

slide-14
SLIDE 14

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 14

Implementation

  • ROSE Framework

– Open-source compiler framework – Provides front-end, back-end and additional functionalities

Middle-end Front-end Back-end

Start Abort Abort End Parse Unparse Semantic Check Function Detection Kernel Creation Function Analysis & Transformation Host Program Transformation

yes no

Successful

yes no

Successful

slide-15
SLIDE 15

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 15

Implementation

  • The Function Analysis & Transformation phase is the most

important

  • All decisions are made during this phase
  • Consists of 10 analysis/transformation steps

Inline Transformation Constant Value Transformation Constant Folding Constant Array Analysis 2D to 1D Array Transformation In/Out Analysis Parameter Analysis Typedef Analysis Memory Analysis Loop Unrolling

slide-16
SLIDE 16

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 16

Implementation

void vector_process( char *input, char value) { int i; for(i = 0; i < 64; i++) input[i] += value; } void vector_update(char *input, int ilen) { vector_process(input, 'c'); } void vector_update(char *input, int ilen) { { char value_1 = 'c'; int i; for(i = 0; i < 64; i++) input[i] += value_1; } }

Inline Transformation Constant Value Transformation Constant Folding Constant Array Analysis 2D to 1D Array Transformation In/Out Analysis Parameter Analysis Typedef Analysis Memory Analysis Loop Unrolling

slide-17
SLIDE 17

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 17

Implementation

__constant const uint32_t K[] = { 0x428A2F98, …}; __kernel void aocl_generated_kernel( …. ) { … } const uint32_t K[] = {0x428A2F98, …}; //Altera_OpenCL_Accelerate //Altera_OpenCL_size K 64 //Altera_OpenCL_const_vec K … void update_accelerated( … )

program.cpp aocl_kernel.cl

Inline Transformation Constant Value Transformation Constant Folding Constant Array Analysis 2D to 1D Array Transformation In/Out Analysis Parameter Analysis Typedef Analysis Memory Analysis Loop Unrolling

slide-18
SLIDE 18

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 18

Implementation

#define WIDTH 512 __kernel void aocl_generated_kernel(int __global __restrict__ *A, ...) { ... int tmp = A[2*WIDTH + 3]; } #define WIDTH 512 int[5][WIDTH] A; //Altera_OpenCL_Accelerate //Altera_OpenCL_size A 2560 void func(...) { int tmp = A[2][3]; }

Inline Transformation Constant Value Transformation Constant Folding Constant Array Analysis 2D to 1D Array Transformation In/Out Analysis Parameter Analysis Typedef Analysis Memory Analysis Loop Unrolling

slide-19
SLIDE 19

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 19

Implementation

int *res; //Altera_OpenCL_Accelerate //Altera_OpenCL_size a size //Altera_OpenCL_size res size void vec_accumulate(int *a, int size) { for(int i = 0; i < size; i++) res[i] = res[i]+a[i]; }

Input Variables: {i, size, res, a} Output Variables: {i, res} Kernel Parameter: {a, size, res}

__kernel void aocl_generated_kernel( int __global __restrict__ *a, const int size, int __global __restrict__ *res ); Inline Transformation Constant Value Transformation Constant Folding Constant Array Analysis 2D to 1D Array Transformation In/Out Analysis Parameter Analysis Typedef Analysis Memory Analysis Loop Unrolling

slide-20
SLIDE 20

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 20

Implementation

Transfer Allocatjon Internal Constant Normal No copy Shared No copy Global Normal No copy Local copy Shared No copy Local copy

  • Determine which memory buffer is used

Inline Transformation Constant Value Transformation Constant Folding Constant Array Analysis 2D to 1D Array Transformation In/Out Analysis Parameter Analysis Typedef Analysis Memory Analysis Loop Unrolling

slide-21
SLIDE 21

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 21

Implementation

yes yes no Inline Transformation Constant Value Transformation Constant Folding Constant Array Analysis 2D to 1D Array Transformation In/Out Analysis Parameter Analysis Typedef Analysis Memory Analysis Loop Unrolling

Is SoC? Is size constant? Enough space? Is runtime constant? Shared memory allocation Normal memory allocation No internal copy Local internal copy Constant address space Global address space

yes yes yes no no no

slide-22
SLIDE 22

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 22

Implementation

#pragma unroll for (j = 0; j < 8; j++) A[j] = state[j];

  • Insert #pragma unroll in front
  • f a loop to unroll it
  • Only unroll inner most loop
  • Exclusion criteria

Number of iterations is not static

Number of iterations exceeds 16

The loop contains an operation that requires a lot of area

Inline Transformation Constant Value Transformation Constant Folding Constant Array Analysis 2D to 1D Array Transformation In/Out Analysis Parameter Analysis Typedef Analysis Memory Analysis Loop Unrolling

slide-23
SLIDE 23

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 23

Evaluation

  • DE1SoC evaluation board

– ARM Cortex

  • Dual Core
  • 800 MHz

– Cyclone V FPGA

slide-24
SLIDE 24

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 24

Evaluation Secure Hash Algorithm (SHA-256)

  • Contains two nested for loops
  • Both loops have loop-carried dependencies
  • Hard to parallelize
  • Has a lot of instruction parallelism

//Altera_OpenCL_Accelerate //Altera_OpenCL_size input len //Altera_OpenCL_size K 64 //Altera_OpenCL_size state 8 //Altera_OpenCL_const_vec K //Altera_OpenCL_soc void mbedtls_sha256_update_accelerated( uint32_t *state, const unsigned char *input, uint64_t len )

slide-25
SLIDE 25

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 25

  • Inner loop is pipelined perfectly
  • Outer loop is not pipelined well
  • Maximum speedup-factor: 1,54
  • Break even point: 10kBytes

Evaluation Secure Hash Algorithm (SHA-256)

slide-26
SLIDE 26

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 26

  • Contains two nested loops
  • Inner loop has loop-carried

dependencies and is unrolled

  • Uses 10 constant arrays
  • Loop Unrolling

//Altera_OpenCL_Accelerate //Altera_OpenCL_size input length //Altera_OpenCL_size output length //Altera_OpenCL_size RK 68 //Altera_OpenCL_const_vec FT3 //Altera_OpenCL_const_vec FT2 //Altera_OpenCL_const_vec FT1 //Altera_OpenCL_const_vec FT0 //Altera_OpenCL_const_vec RT3 //Altera_OpenCL_const_vec RT2 //Altera_OpenCL_const_vec RT1 //Altera_OpenCL_const_vec RT0 //Altera_OpenCL_const_vec FSb //Altera_OpenCL_const_vec RSb //Altera_OpenCL_soc int mbedtls_aes_crypt_ctr_nr10( uint32_t *RK, uint64_t length, unsigned char nonce_counter[16], unsigned char stream_block[16], const unsigned char *input, unsigned char *output )

Evaluation

Advanced Encryption Standard (AES-CTR)

slide-27
SLIDE 27

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 27

  • Remaining loop can be pipelined perfectly
  • Maximum speedup-factor: 3,78
  • Break even point: 3,3 kBytes

Evaluation

Advanced Encryption Standard (AES-CTR)

slide-28
SLIDE 28

07.09.17 | TU Darmstadt | Computer Systems Group | Johanna Rohde | 28

  • C code → OpenCL for FPGAs
  • Two test cases

– SHA-256

  • Speedup of 1.54

– AES-CTR

  • Speedup 3.78

Conclusion