A Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs

  1. A Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs. Taylor Lloyd, Artem Chikin, Erick Ochoa, Karim Ali, José Nelson Amaral. University of Alberta. Sept 7, FSP 2017.

  2. University of Alberta Systems Group
     ● Focused on compiler optimizations, heterogeneous systems
     ● Recently working primarily on GPU computing

  3. So can traditional compiler techniques help with OpenCL for FPGAs?

  4. Background: OpenCL Execution Models
     Data Parallelism (NDRange):
       ● Kernel defined per-thread
       ● Kernel execution defines number and grouping of threads
       ● Behaviour varies by querying thread ID
     Task Parallelism (Single Work-Item):
       ● Kernel defines complete unit of work
       ● Kernel execution starts single thread

  5. Background: OpenCL Execution Model
     NDRange Example:
       __kernel void memcpy(char* tgt, char* src, int length) {
         int index = get_global_id(0);
         while (index < length) {
           tgt[index] = src[index];
           index += get_global_size(0);
         }
       }

  6. Background: OpenCL Execution Model
     NDRange Example (kernel):
       __kernel void memcpy(char* tgt, char* src, int length) {
         int index = get_global_id(0);
         while (index < length) {
           tgt[index] = src[index];
           index += get_global_size(0);
         }
       }
     NDRange Example (host):
       int offset = 0, threads = 2048, groupsize = 128;
       clSetKernelArg(kernel, 0, sizeof(char*), tgtbuf);
       clSetKernelArg(kernel, 1, sizeof(char*), srcbuf);
       clSetKernelArg(kernel, 2, sizeof(int), length);
       clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &threads, &groupsize,
                              0, NULL, NULL);

  7. Background: OpenCL Execution Model
     NDRange Example (kernel):
       __kernel void memcpy(char* tgt, char* src, int length) {
         int index = get_global_id(0);
         while (index < length) {
           tgt[index] = src[index];
           index += get_global_size(0);
         }
       }
     Single Work-Item Example (kernel):
       __kernel void memcpy(char* tgt, char* src, int length) {
         for (int i = 0; i < length; i++) {
           tgt[i] = src[i];
         }
       }
     NDRange Example (host):
       int offset = 0, threads = 2048, groupsize = 128;
       clSetKernelArg(kernel, 0, sizeof(char*), tgtbuf);
       clSetKernelArg(kernel, 1, sizeof(char*), srcbuf);
       clSetKernelArg(kernel, 2, sizeof(int), length);
       clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &threads, &groupsize,
                              0, NULL, NULL);

  8. Background: OpenCL Execution Model
     NDRange Example (kernel):
       __kernel void memcpy(char* tgt, char* src, int length) {
         int index = get_global_id(0);
         while (index < length) {
           tgt[index] = src[index];
           index += get_global_size(0);
         }
       }
     Single Work-Item Example (kernel):
       __kernel void memcpy(char* tgt, char* src, int length) {
         for (int i = 0; i < length; i++) {
           tgt[i] = src[i];
         }
       }
     NDRange Example (host):
       int offset = 0, threads = 2048, groupsize = 128;
       clSetKernelArg(kernel, 0, sizeof(char*), tgtbuf);
       clSetKernelArg(kernel, 1, sizeof(char*), srcbuf);
       clSetKernelArg(kernel, 2, sizeof(int), length);
       clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &threads, &groupsize,
                              0, NULL, NULL);
     Single Work-Item Example (host):
       clSetKernelArg(kernel, 0, sizeof(char*), tgtbuf);
       clSetKernelArg(kernel, 1, sizeof(char*), srcbuf);
       clSetKernelArg(kernel, 2, sizeof(int), length);
       clEnqueueTask(queue, kernel, 0, NULL, NULL);

  9. Single Work-Item Kernel versus NDRange Kernel
     “Intel recommends that you structure your OpenCL kernel as a single work-item, if possible” [1]

  10. NDRange Kernel → Single Work-Item
      __kernel void memcpy(char* tgt, char* src, int length) {
        int index = get_global_id(0);
        while (index < length) {
          tgt[index] = src[index];
          index += get_global_size(0);
        }
      }

  11. NDRange Kernel → Single Work-Item
      __kernel void memcpy(char* tgt, char* src, int length,
                           int offset, int threads, int groups) {
        int index = get_global_id(0);
        while (index < length) {
          tgt[index] = src[index];
          index += get_global_size(0);
        }
      }

  12. NDRange Kernel → Single Work-Item
      __kernel void memcpy(char* tgt, char* src, int length,
                           int offset, int threads, int groups) {
        for (int tid = offset; tid < offset+threads; tid++) {
          int index = tid;
          while (index < length) {
            tgt[index] = src[index];
            index += threads;
          }
        }
      }

  13. Is that really better?

  14. Loop Canonicalization
      __kernel void memcpy(char* tgt, char* src, int length,
                           int offset, int threads, int groups) {
        for (int tid = offset; tid < offset+threads; tid++) {
          int index = tid;
          for (int i = 0; i < length/threads; i++) {
            if (index + i*threads < length)
              tgt[index + i*threads] = src[index + i*threads];
          }
        }
      }

  15. Loop Canonicalization
      __kernel void memcpy(char* tgt, char* src, int length,
                           int offset, int threads, int groups) {
        for (int j = 0; j < threads; j++) {
          int tid = j + offset;
          int index = tid;
          for (int i = 0; i < length/threads; i++) {
            if (index + i*threads < length)
              tgt[index + i*threads] = src[index + i*threads];
          }
        }
      }

  16. Loop Collapsing
      __kernel void memcpy(char* tgt, char* src, int length,
                           int offset, int threads, int groups) {
        for (int x = 0; x < threads*length/threads; x++) {
          int j = x / (length/threads);
          int i = x % (length/threads);
          int tid = j + offset;
          int index = tid;
          if (index + i*threads < length)
            tgt[index + i*threads] = src[index + i*threads];
        }
      }

  17. Copy Propagation
      __kernel void memcpy(char* tgt, char* src, int length,
                           int offset, int threads, int groups) {
        for (int x = 0; x < length; x++) {
          int j = x / (length/threads);
          int i = x % (length/threads);
          if (j + offset + i*threads < length)
            tgt[j + offset + i*threads] = src[j + offset + i*threads];
        }
      }

  18. Why isn’t this done today?

  19. Recall: Host OpenCL API
      ● Host code must be rewritten to pass new arguments and call a different API

  20. Recall: Host OpenCL API
      ● Host code must be rewritten to pass new arguments and call a different API
      NDRange launch (original kernel):
        int offset = 0, threads = 2048, groupsize = 128;
        clSetKernelArg(kernel, 0, sizeof(char*), tgtbuf);
        clSetKernelArg(kernel, 1, sizeof(char*), srcbuf);
        clSetKernelArg(kernel, 2, sizeof(int), length);
        clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &threads, &groupsize,
                               0, NULL, NULL);
      Single work-item launch (transformed kernel):
        int offset = 0, threads = 2048, groupsize = 128;
        clSetKernelArg(kernel, 0, sizeof(char*), tgtbuf);
        clSetKernelArg(kernel, 1, sizeof(char*), srcbuf);
        clSetKernelArg(kernel, 2, sizeof(int), length);
        clSetKernelArg(kernel, 3, sizeof(int), offset);
        clSetKernelArg(kernel, 4, sizeof(int), threads);
        clSetKernelArg(kernel, 5, sizeof(int), groups);
        clEnqueueTask(queue, kernel, 0, NULL, NULL);

  21. The Altera OpenCL Toolchain (diagram)
      ● Device side: Kernel Code (.cl) → Altera OpenCL Compiler (LLVM-based) → Kernel Code (Verilog) → Quartus Placement & Routing → FPGA Bitstream
      ● Host side: Host Code (.c/.cpp) + OpenCL Runtime Library → C/C++ Compiler → Host Binary

  22. The Argument for Separation
      ● Device-side code can be Just-In-Time (JIT) compiled for each device

  23. The Argument for Separation
      ● Device-side code can be Just-In-Time (JIT) compiled for each device
      ● Host compilers can be separately maintained by experts (icc, xlc, gcc, clang)

  24. The Argument for Separation
      ● Device-side code can be Just-In-Time (JIT) compiled for each device
      ● Host compilers can be separately maintained by experts (icc, xlc, gcc, clang)
      ● Host code can be recompiled without needing to recompile device code

  25. The Argument for Combined Compilation
      ● Execution context information (constants, pointer aliases) can be passed from host to device (illustrated below)
      ● Context information allows for better compiler transformations (Strength Reduction, Pipelining)
      ● Better transformations improve final executables
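  Note: to make the pointer-alias bullet concrete, the following is a minimal sketch (not taken from the slides) of what a combined toolchain could emit for the single work-item memcpy kernel once host-side analysis proves that tgtbuf and srcbuf never overlap. Marking the kernel pointers restrict removes the assumed store-to-load dependence between tgt and src, which is what lets the FPGA compiler pipeline the loop tightly; this is the kind of information targeted by the "Restricted Pointer Analysis" transformation listed on slide 31.

      // Sketch only: the restrict qualifiers are justified solely by host-side
      // knowledge that tgtbuf and srcbuf refer to distinct, non-overlapping buffers.
      __kernel void memcpy(__global char * restrict tgt,
                           __global char * restrict src,
                           int length) {
        for (int i = 0; i < length; i++) {
          tgt[i] = src[i];   // no assumed cross-iteration dependence through memory
        }
      }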

  26. Our Proposed OpenCL Toolchain (diagram)
      ● Host Code (.c/.cpp), Kernel Code (.cl), and the OpenCL Runtime Library all feed a Combined Host/Device Compiler
      ● The combined compiler produces the Host Binary and Kernel Code (Verilog); Quartus Placement & Routing then turns the Verilog into the FPGA Bitstream

  27. Research Question: Can OpenCL be better targeted to FPGAs given communication between host and device compilers?

  28. Inspiration [SC 16]

  29. Inspiration [SC 16]
      ● Zohouri et al. hand-tuned OpenCL benchmarks for FPGA execution
      ● Achieved speedups of 30% to 100x
      ● Can we match their performance through compiler transformations?

  30. Prototype OpenCL Toolchain (diagram)
      ● Device side: Kernel Code (.cl) → Altera OpenCL Compiler (LLVM 3-based) with Prototype Transformations → Kernel Code (Verilog) → Quartus Placement & Routing → FPGA Bitstream
      ● Host side: Host Code (.c/.cpp) + OpenCL Runtime Library → Prototype LLVM 4.0 Transformations → Host Binary
      ● The two compilers exchange information: Host Context Information flows to the device compiler, Kernel Information flows back to the host compiler

  31. Prototype Transformations
      1. Geometry Propagation
      2. NDRange To Loop
      3. Restricted Pointer Analysis
      4. Reduction Dependence Elimination

  32. 1. Geometry Propagation - Motivation
      ● Operations on constants in the kernel can undergo strength reduction

  33. 1. Geometry Propagation - Motivation
      ● Operations on constants in the kernel can undergo strength reduction
      ● Loops of known size are easier for the compiler to manipulate

  34. 1. Geometry Propagation
      1. Collect host-side kernel invocations:
         int offset = 0, threads = 2048, groupsize = 128;
         cl_kernel kernel = clCreateKernel(program, "memcpy", &err);
         clSetKernelArg(kernel, 0, sizeof(char*), tgtbuf);
         clSetKernelArg(kernel, 1, sizeof(char*), srcbuf);
         clSetKernelArg(kernel, 2, sizeof(int), length);
         clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &threads, &groupsize,
                                0, NULL, NULL);
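  Note: the captured slides stop at step 1, so the following is only a sketch of where geometry propagation is headed, assuming the collected launch geometry (offset = 0, threads = 2048, groupsize = 128) is substituted into the kernel of slide 17. The helper name per_thread and the shift-based strength reductions are illustrative, not the authors' generated code.

      // Sketch: slide 17's kernel specialized with offset = 0 and threads = 2048
      // from the host call site above. Because 2048 is a power of two, the
      // divide and multiply by threads reduce to shifts (assuming length >= 0
      // and, as in the slide version, length >= threads so per_thread > 0).
      __kernel void memcpy(__global char* tgt, __global char* src, int length) {
        int per_thread = length >> 11;       // length / 2048
        for (int x = 0; x < length; x++) {
          int j = x / per_thread;            // divisor is now loop-invariant
          int i = x - j * per_thread;        // x % per_thread without a second divide
          int index = j + (i << 11);         // j + offset + i*threads with offset = 0
          if (index < length)
            tgt[index] = src[index];
        }
      }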
