HETEROREFACTOR: Refactoring for Heterogeneous Computing with FPGA



slide-1
SLIDE 1

HETEROREFACTOR: Refactoring for Heterogeneous Computing with FPGA

Jason Lau*, Aishwarya Sivaraman*, Qian Zhang*, Muhammad Ali Gulzar, Jason Cong, Miryung Kim

University of California, Los Angeles *Equal co-first authors in alphabetical order

slide-2
SLIDE 2

HETEROREFACTOR: Refactoring for Heterogeneous Computing with FPGA

Jason Lau Aishwarya Sivaraman Qian Zhang Muhammad Ali Gulzar Jason Cong Miryung Kim

slide-3
SLIDE 3

FPGA*-based Acceleration

Fast Efficient

* FPGA: Field Programmable Gate Array

slide-4
SLIDE 4

FPGA*-based Acceleration

Fast Efficient Effort

* FPGA: Field Programmable Gate Array

slide-5
SLIDE 5

Evolution of Programming Model

module vecdot(a, b, c, clk, rst);
  input [67:0] a, b;
  output [16:0] c;
  reg [5:0] s;
  reg [16:0] prod [0:7];
  ...
  always @(posedge clk or posedge rst)
    if (!rst) begin
      if (s == 6'b00001)
        prod[0] = a[..] * b[..]; prod[1] =... s = 6'b00010;
      else if (s == 6'b00010)
        reg1 = prod[0] + prod[1] + prod[2]; s = 6'b00100; // goto L00100;
      else if (s == 6'b00100)
        reg1 = reg1 + prod[3] + prod[4]; s = 6'b01000;
      else ... ;
  ...
endmodule

Verilog HDL*: goto-style control, instructions, typeless, registers.

* HDL: Hardware Description Language

slide-6
SLIDE 6

Evolution of Programming Model

fpga_float<8,15> vecdot(fpga_float<8,15> a[], fpga_float<8,15> b[], fpga_int<31> n) {
  fpga_float<8,15> sum = 0;
  for (fpga_int<31> i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}

Merlin HLS*, etc.: auto optimization, typed, auto schedule, auto resource.

* HLS: High-Level Synthesis

slide-7
SLIDE 7

Something is missing...

fpga_float<8,15> vecdot(fpga_float<8,15> a[], fpga_float<8,15> b[], fpga_int<31> n) {
  fpga_float<8,15> sum = 0;
  for (fpga_int<31> i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}

Merlin HLS*, etc.: bit-width.

bitwidth = 31: wastes scarce memory! FPGA memory: < 100 MB

* HLS: High-Level Synthesis

slide-8
SLIDE 8

Something is missing...

fpga_float<8,15> vecdot(fpga_float<8,15> a[], fpga_float<8,15> b[], fpga_int<31> n) {
  fpga_float<8,15> sum = 0;
  for (fpga_int<31> i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}

Merlin HLS*, etc.: bit-width, floating-point precision.

exponent: 8 bits, fraction: 15 bits. Precision? Memory?

* HLS: High-Level Synthesis

slide-9
SLIDE 9

Something is missing...

struct Node {
  Node *left, *right;
  int val;
};
void init(Node **root) {
  *root = (Node *)malloc(sizeof(Node));
}
void insert(Node **root, int *arr);
void delete_tree(Node *root) {
  // …
  free(root);
}
void traverse(Node *curr) {
  if (curr == NULL) return;
  int ret = visit(curr->val);
  traverse(curr->left);
  traverse(curr->right);
}

Merlin HLS*, etc.: bit-width, floating-point precision, recursive data structure.

4 errors in 14 lines of code: nested pointers.

* HLS: High-Level Synthesis

slide-10
SLIDE 10

Something is missing...

struct Node {
  Node *left, *right;
  int val;
};
void init(Node **root) {
  *root = (Node *)malloc(sizeof(Node));
}
void insert(Node **root, int *arr);
void delete_tree(Node *root) {
  // …
  free(root);
}
void traverse(Node *curr) {
  if (curr == NULL) return;
  int ret = visit(curr->val);
  traverse(curr->left);
  traverse(curr->right);
}

Merlin HLS*, etc.: bit-width, floating-point precision, recursive data structure.

4 errors in 14 lines of code: nested pointers, dynamic memory management (preallocated size?).

* HLS: High-Level Synthesis

slide-11
SLIDE 11

Something is missing...

struct Node {
  Node *left, *right;
  int val;
};
void init(Node **root) {
  *root = (Node *)malloc(sizeof(Node));
}
void insert(Node **root, int *arr);
void delete_tree(Node *root) {
  // …
  free(root);
}
void traverse(Node *curr) {
  if (curr == NULL) return;
  int ret = visit(curr->val);
  traverse(curr->left);
  traverse(curr->right);
}

Merlin HLS*, etc.: bit-width, floating-point precision, recursive data structure.

4 errors in 14 lines of code: nested pointers, dynamic memory management (preallocated size?), pointer operations.

* HLS: High-Level Synthesis

slide-12
SLIDE 12

Something is missing...

struct Node {
  Node *left, *right;
  int val;
};
void init(Node **root) {
  *root = (Node *)malloc(sizeof(Node));
}
void insert(Node **root, int *arr);
void delete_tree(Node *root) {
  // …
  free(root);
}
void traverse(Node *curr) {
  if (curr == NULL) return;
  int ret = visit(curr->val);
  traverse(curr->left);
  traverse(curr->right);
}

Merlin HLS*, etc.: bit-width, floating-point precision, recursive data structure.

4 errors in 14 lines of code: nested pointers, dynamic memory management (preallocated size?), pointer operations, recursive functions.

* HLS: High-Level Synthesis

slide-13
SLIDE 13

Evolution of Programming Model

[Figure: programmability (language features, programming difficulty, etc.) plotted against year (1989, 2003, 2011, 2014, 2017). The CPU line is labeled "evolving" and passes through ANSI C, C++03, C++11, C++14, and C++17. The FPGA line is labeled HDL, untimed descriptions, Vivado HLS C/C++, and Merlin HLS C/C++ (pragma simplified, evolving). A "gap" is marked between the two lines.]

Credit: A Multi-Paradigm Programming Infrastructure for Heterogeneous Architectures by Cong et al.

slide-14
SLIDE 14

Evolution of Programming Model

Significant human effort and error-prone.

[Figure: the programmability-versus-year chart from slide 13, CPU (ANSI C to C++17) versus FPGA (HDL to Merlin HLS C/C++), with the gap marked.]

Credit: A Multi-Paradigm Programming Infrastructure for Heterogeneous Architectures by Cong et al.

slide-15
SLIDE 15

Evolution of Programming Model

Significant human effort and error-prone. Wastes scarce memory.

[Figure: the programmability-versus-year chart from slide 13, CPU (ANSI C to C++17) versus FPGA (HDL to Merlin HLS C/C++), with the gap marked.]

Credit: A Multi-Paradigm Programming Infrastructure for Heterogeneous Architectures by Cong et al.

slide-16
SLIDE 16

I want it to run!


slide-17
SLIDE 17

I want it to run efficiently!


slide-18
SLIDE 18

Automation!


slide-19
SLIDE 19

HETEROREFACTOR

Pipeline (one-click): C++ inputs, then instrumentation, refactoring, and selective offloading, producing code for Vivado HLS / Merlin.

Optimizations:
  • Recursive data structures: support and optimization
  • Integers: bitwidth optimization
  • Floating points: bitwidth optimization

slide-20
SLIDE 20

Part 1. Dynamic Data Structures


slide-21
SLIDE 21

Dynamic Data Structures: Instrumentation

Recursive data structures: instrumentation records the data structure size and the recursion depth.

slide-22
SLIDE 22

Dynamic Data Structures: Refactoring

Refactoring for recursive data structures, using the data structure size and recursion depth from instrumentation:
  • Rewrite memory management
  • Modify pointer access
  • Convert recursion

slide-23
SLIDE 23

Example Program

void init(Node **root) {
  *root = (Node *)malloc(sizeof(Node));
}
void delete_tree(Node *root) {
  // …
  free(root);
}
void traverse(Node *curr) {
  // entry
  if (curr == NULL) return;
  int ret = visit(curr->val);
  traverse(curr->left);
  traverse(curr->right);
  // return
}
// top-level function
float kernel(float input[], int n) {
  float value = computation(float(..), ..);
}

slide-24
SLIDE 24

Refactoring Rule 1: Rewrite Mem. Mgmt.

Before:

void init(Node **root) {
  *root = (Node *)malloc(sizeof(Node));
}
void delete_tree(Node *root) {
  // …
  free(root);
}

After:

void init(Node_ptr *root) {
  *root = (Node_ptr)Node_malloc(sizeof(Node));
}
void delete_tree(Node_ptr root) {
  // …
  Node_free(root);
}
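To make Rule 1 concrete, here is a minimal sketch of what an array-backed `Node_malloc` / `Node_free` pair could look like. The slides only show the call sites, so the free-list design, the `uint16_t` index type, the reserved null slot, and the pool size of 8 are all assumptions for illustration, not the tool's actual generated code.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed representation: a "pointer" is an index into a preallocated array.
typedef uint16_t Node_ptr;
const Node_ptr NODE_NULL = 0;          // assumption: slot 0 is reserved as null
const std::size_t NODE_ARR_SIZE = 8;   // assumed size; would come from profiling

struct Node {
    Node_ptr left, right;
    int val;
};

Node Node_arr[NODE_ARR_SIZE];
static Node_ptr free_head = NODE_NULL; // head of the list of freed slots
static Node_ptr next_unused = 1;       // first slot never handed out yet

// Returns an index into Node_arr, or NODE_NULL when the pool is exhausted.
Node_ptr Node_malloc(std::size_t /*size, kept for signature parity*/) {
    if (free_head != NODE_NULL) {      // reuse a freed slot first
        Node_ptr p = free_head;
        free_head = Node_arr[p].left;  // free list is threaded through `left`
        return p;
    }
    if (next_unused < NODE_ARR_SIZE) return next_unused++;
    return NODE_NULL;                  // out of preallocated memory
}

// Returns a slot to the pool by pushing it onto the free list.
void Node_free(Node_ptr p) {
    if (p == NODE_NULL) return;
    Node_arr[p].left = free_head;
    free_head = p;
}
```

An exhausted pool returns `NODE_NULL` here; in a real flow that condition is the kind of memory failure a guard check would report to the host.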


slide-26
SLIDE 26

Refactoring Rule 2: Rewrite Pointer Access

Before:

void traverse(Node_ptr curr) {
  if (curr == NULL) return;
  int ret = visit(curr->val);
  traverse(curr->left);
  traverse(curr->right);
}

After:

Node Node_arr[NODE_ARR_SIZE];
void traverse(Node_ptr curr) {
  if (curr == NULL) return;
  int ret = visit(Node_arr[curr].val);
  traverse(Node_arr[curr].left);
  traverse(Node_arr[curr].right);
}
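The pointer-access rewrite of Rule 2 can be tried on a toy tree. The sketch below is self-contained and assumes, for illustration only, that `Node_ptr` is a 16-bit index and that index 0 stands for null; `sum_tree` and `build_example` are hypothetical helpers, not names from the slides.

```cpp
#include <cstdint>

typedef uint16_t Node_ptr;        // array index standing in for Node*
const Node_ptr NODE_NULL = 0;     // assumption: index 0 is reserved as null
const int NODE_ARR_SIZE = 8;      // assumed capacity

struct Node { Node_ptr left, right; int val; };
Node Node_arr[NODE_ARR_SIZE];

// Every `curr->field` of the pointer version becomes `Node_arr[curr].field`.
int sum_tree(Node_ptr curr) {
    if (curr == NODE_NULL) return 0;
    return Node_arr[curr].val
         + sum_tree(Node_arr[curr].left)
         + sum_tree(Node_arr[curr].right);
}

// Builds a three-node tree in slots 1..3 and returns the root index.
Node_ptr build_example() {
    Node_arr[1] = {2, 3, 10};
    Node_arr[2] = {NODE_NULL, NODE_NULL, 20};
    Node_arr[3] = {NODE_NULL, NODE_NULL, 30};
    return 1;
}
```

The function is still recursive at this stage; removing the recursion is the job of Rule 3.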

slide-27
SLIDE 27

Refactoring Rule 3: Convert Recursion

Before:

void traverse(Node_ptr curr) {
  traverse(Node_arr[curr].left);
  traverse(Node_arr[curr].right);
}

After:

void traverse_converted(Node_ptr curr) {
  stack<context> s(STACK_SIZE);
  while (!s.empty()) {
    context c = s.pop();
    goto c.location;
    L0: // traverse(Node_arr[curr].left);
      c.location = L1;
      s.push(c);
      s.push({curr: Node_arr[curr].left});
      continue;
    L1: // ...
  }
}
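The slide's converted code is pseudocode (`goto c.location` is not valid C++). A compilable approximation, under the assumption that a `switch` on a saved location replaces the computed goto and that a fixed-size array replaces `stack<context>`, might look like the sketch below. The `Context` struct, the output buffer, and the sizes are illustrative choices.

```cpp
#include <cstdint>

typedef uint16_t Node_ptr;
const Node_ptr NODE_NULL = 0;     // assumption: index 0 is reserved as null
const int NODE_ARR_SIZE = 8;      // assumed capacity
const int STACK_SIZE = 16;        // assumed bound, e.g. from profiled recursion depth

struct Node { Node_ptr left, right; int val; };
Node Node_arr[NODE_ARR_SIZE];

struct Context { Node_ptr curr; int location; };  // saved "call frame"

// Pre-order traversal without recursion. Writes visited values to `out` and
// returns how many were written, or -1 if the stack or `out` would overflow.
int traverse_converted(Node_ptr root, int out[], int out_cap) {
    Context stack[STACK_SIZE];
    int top = 0, n = 0;
    stack[top++] = {root, 0};
    while (top > 0) {
        Context c = stack[--top];
        if (c.curr == NODE_NULL) continue;        // base case of the recursion
        switch (c.location) {
        case 0:                                   // entry: visit, then "call" on left
            if (n >= out_cap || top + 2 > STACK_SIZE) return -1;
            out[n++] = Node_arr[c.curr].val;
            stack[top++] = {c.curr, 1};           // resume at location 1 afterwards
            stack[top++] = {Node_arr[c.curr].left, 0};
            break;
        case 1:                                   // left has returned: "call" on right
            stack[top++] = {Node_arr[c.curr].right, 0};
            break;
        }
    }
    return n;
}
```

The `-1` return plays the role of a stack-overflow signal, which is where a fallback to the CPU would be triggered.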

slide-28
SLIDE 28

Part 2. Integers


slide-29
SLIDE 29

Integers: Kvasir-based Instrumentation

Integers: instrumentation records value ranges.

slide-30
SLIDE 30

Integers: Refactoring

Integers: refactoring rewrites integer types using the value ranges from instrumentation.

slide-31
SLIDE 31

Refactoring Rule: Modify Integer Type

Node Node_arr[NODE_ARR_SIZE];
void init(Node_ptr *root) {
  *root = (Node_ptr)Node_malloc(sizeof(Node));
}
void delete_tree(Node_ptr root) {
  // …
  Node_free(root);
}
void traverse(Node_ptr curr) {
  if (curr == NULL) return;
  // @invariants(ret[21,255])
  // int ret = visit(Node_arr[curr].val);
  fpga_uint<8> ret = visit(Node_arr[curr].val);
  traverse(Node_arr[curr].left);
  traverse(Node_arr[curr].right);
}
float kernel(float input[], int n) {
  float value = computation(float(..), ..);
}
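The step from the invariant `ret[21,255]` to `fpga_uint<8>` is a small calculation: find the narrowest width that holds every observed value. The helper below is a hypothetical sketch of that arithmetic (the name `bits_for_range` and the choice to treat non-negative ranges as unsigned are assumptions, not the tool's documented behavior).

```cpp
#include <cassert>
#include <cstdint>

// Smallest number of bits that can represent every value in [lo, hi].
// Unsigned if lo >= 0; otherwise two's-complement signed (sign bit included).
int bits_for_range(int64_t lo, int64_t hi) {
    assert(lo <= hi);
    if (lo >= 0) {
        int bits = 1;                       // even the value 0 occupies one bit
        while (bits < 63 && (int64_t(1) << bits) <= hi) ++bits;
        return bits;
    }
    int bits = 1;                           // signed range is [-2^(b-1), 2^(b-1) - 1]
    while (bits < 64) {
        int64_t min_v = -(int64_t(1) << (bits - 1));
        int64_t max_v = (int64_t(1) << (bits - 1)) - 1;
        if (lo >= min_v && hi <= max_v) break;
        ++bits;
    }
    return bits;
}
```

For the range on the slide, `bits_for_range(21, 255)` gives 8, matching `fpga_uint<8>`. Because the range comes from observed executions, an input outside it can still overflow, which is why the refactored kernel needs a guard.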

slide-32
SLIDE 32

Part 3. Floating Points

Pipeline for floating points (one-click): C++ inputs, then program variants generation, differential execution and probabilistic verification, and selective offloading, producing code for Vivado HLS / Merlin.

slide-33
SLIDE 33

Floating Points: Program Variants Generation

Floating points: program variants generation produces pre-transformed programs with different precisions.

slide-34
SLIDE 34

Floating Points: Program Variants Generation

float kernel(float input[], int n) {
  float value = computation(float(..), ..);
}
float low_bit(float input[], int n) {
  fpga_float<8,16> value = computation(fpga_float<8,16>(..), ..);
}
float high_bit(float input[], int n) {
  fpga_float<8,23> value = computation(fpga_float<8,23>(..), ..);
}

fpga_float<Exponent, Fraction> to customize FP precision

* note: fpga_float<8,23> is the 32-bit float type; fpga_float<5,16> uses 22 bits in total
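To get a feel for what a narrower fraction does to a value, one can emulate it on a CPU by zeroing the low fraction bits of an ordinary `float`. This is only a rough stand-in: it assumes `float` is IEEE 754 binary32, keeps the 8-bit exponent, and truncates instead of rounding, so it is not a model of how `fpga_float` is actually implemented. `truncate_fraction` is a hypothetical helper name.

```cpp
#include <cstdint>
#include <cstring>

// Keeps only the top `fraction_bits` of the 23-bit single-precision fraction
// by zeroing the rest (truncation, no rounding). The exponent is untouched.
float truncate_fraction(float x, int fraction_bits) {
    if (fraction_bits >= 23) return x;            // nothing to drop
    if (fraction_bits < 0) fraction_bits = 0;
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);          // well-defined type punning
    uint32_t mask = ~((uint32_t(1) << (23 - fraction_bits)) - 1u);
    bits &= mask;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```

Values whose fraction already fits (such as `1.5f` with one fraction bit) pass through unchanged, while a value like `0.1f` loses a little magnitude when reduced to 16 fraction bits.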

slide-35
SLIDE 35

Floating Points: Differential Execution

Floating points: differential execution of the pre-transformed programs with different precisions measures the precision loss.

slide-36
SLIDE 36

Floating Points: Differential Execution

float kernel(float input[], int n) {
  float value = computation(float(..), ..);
}
float low_bit(float input[], int n) {
  fpga_float<8,16> value = computation(fpga_float<8,16>(..), ..);
}
float high_bit(float input[], int n) {
  fpga_float<8,23> value = computation(fpga_float<8,23>(..), ..);
}
void verification() {
  float diff = high_bit(..) - low_bit(..);
  if (diff > epsilon)
    // failed sample
}
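The differential check can be illustrated on a CPU with two ordinary types. In the sketch below, `double` and `float` are assumed substitutes for the high- and low-precision variants, the kernel is a dot product chosen for illustration, and the comparison uses the absolute difference; none of these choices come from the slides.

```cpp
#include <cmath>
#include <cstddef>

// Dot product accumulated in type T (T = double or float here).
template <typename T>
T dot(const float a[], const float b[], std::size_t n) {
    T acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc = acc + T(a[i]) * T(b[i]);
    return acc;
}

// True when the low-precision result differs from the high-precision one
// by more than epsilon, i.e. the sample counts as failed.
bool sample_fails(const float a[], const float b[], std::size_t n, double epsilon) {
    double high = dot<double>(a, b, n);
    double low = static_cast<double>(dot<float>(a, b, n));
    return std::fabs(high - low) > epsilon;
}
```

Small integer inputs agree exactly in both precisions, whereas adding 1 to 16777216 (2^24) is lost in single precision and produces a failed sample for any epsilon below 1.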

slide-37
SLIDE 37

Floating Points: Probabilistic Verification

void verification() {
  float diff = high_bit(..) - low_bit(..);
  if (diff > epsilon)
    // failed sample
}

Use Hoeffding's inequality [1] to calculate the number of samples needed to meet the required confidence level, alpha.

[1] Hoeffding, Wassily (1963). "Probability inequalities for sums of bounded random variables".
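One common way to turn Hoeffding's inequality into a sample count is the two-sided bound P(|p_hat - p| >= eps) <= 2 exp(-2 n eps^2); requiring the right-hand side to be at most 1 - alpha gives n >= ln(2 / (1 - alpha)) / (2 eps^2). The slide names only alpha, so the accuracy parameter `eps` and the exact form used by the authors are assumptions in this sketch.

```cpp
#include <cassert>
#include <cmath>

// Number of samples n such that the observed failure rate is within eps of
// the true rate with confidence alpha, from the two-sided Hoeffding bound:
//   n >= ln(2 / (1 - alpha)) / (2 * eps^2)
long required_samples(double alpha, double eps) {
    assert(alpha > 0.0 && alpha < 1.0 && eps > 0.0);
    double n = std::log(2.0 / (1.0 - alpha)) / (2.0 * eps * eps);
    return static_cast<long>(std::ceil(n));
}
```

For example, a 95% confidence level with `eps = 0.03` needs 2050 samples, and tightening either parameter increases the count.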

slide-38
SLIDE 38

Guard Checking

  • Input check on host and intermediate check on device
  • Send a signal to the host to indicate fallback when:

○ Recursive programs: stack overflow, memory failure
○ Integers: overflow

  • The host restarts the computation on the CPU
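The fallback protocol above can be sketched in plain C++ with the "device" simulated by an ordinary function. Everything here is hypothetical scaffolding: `Status`, `kernel_device`, `kernel_cpu`, and `run` are illustrative names, and the 8-bit accumulator stands in for an integer narrowed by the refactoring.

```cpp
#include <cstdint>

enum class Status { Ok, Fallback };

// Simulated device kernel: sums into an 8-bit accumulator, as if the type had
// been narrowed to 8 bits. The guard signals fallback instead of overflowing.
Status kernel_device(const int in[], int n, int* out) {
    uint8_t acc = 0;
    for (int i = 0; i < n; ++i) {
        if (in[i] < 0 || in[i] > 255 - acc) return Status::Fallback;  // guard check
        acc = static_cast<uint8_t>(acc + in[i]);
    }
    *out = acc;
    return Status::Ok;
}

// Original full-width computation, kept on the host.
int kernel_cpu(const int in[], int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i) acc += in[i];
    return acc;
}

// Host side: try the device first, restart on the CPU if it signals fallback.
int run(const int in[], int n, bool* used_fallback) {
    int out = 0;
    if (kernel_device(in, n, &out) == Status::Ok) {
        *used_fallback = false;
        return out;
    }
    *used_fallback = true;
    return kernel_cpu(in, n);
}
```

Inputs that stay within the profiled range run entirely on the narrowed kernel; an input that would overflow is recomputed from scratch on the CPU, so the result is correct either way.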

slide-39
SLIDE 39

Evaluation: Coding Effort

ID / Program    Orig. LOC  Manual LOC  Δ LOC  Auto. LOC  Orig. Chars  Manual Chars  Δ Chars
R1/A.-C.              190         291    33%        557         5673          8776      35%
R2/DFS                 86         198    57%        464         2236          5699      61%
R3/L. List            131         235    44%        329         3061          6686      54%
R4/M. Sort            128         342    63%        390         3267          9124      64%
R5/Strassen’s         342         735    53%       1006        10026         40971      76%
Geomean                                  49%                                            56%

49% reduction in LOC; 56% reduction in chars.
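The Geomean row can be reproduced from the per-program percentages. The sketch below computes a geometric mean through logarithms; `geomean` is a hypothetical helper, and the inputs are the Δ LOC and Δ Chars values copied from the table.

```cpp
#include <cmath>
#include <cstddef>

// Geometric mean of n positive values: exp of the average of their logs.
double geomean(const double xs[], std::size_t n) {
    double log_sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) log_sum += std::log(xs[i]);
    return std::exp(log_sum / static_cast<double>(n));
}
```

Applied to {33, 57, 44, 63, 53} the result rounds to 49, and applied to {35, 61, 54, 64, 76} it rounds to 56, consistent with the reported figures.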

slide-40
SLIDE 40

Evaluation: Resource Reduction

Recursive Data Structures*
  • 83% reduction in BRAM
  • 42% increase in Fmax

Integer
  • 22% reduction in FF
  • 21% reduction in LUT
  • 41% reduction in BRAM
  • 52% increase in DSP

Floating-point
  • 61% reduction in FF
  • 39% reduction in LUT
  • 50% increase in DSP

* assuming a typical size of 2k, + a conservative size of 16k

slide-41
SLIDE 41

Acknowledgement

  • Guy Van den Broeck, Brett Chalabian, Todd Millstein, Peng Wei, Cody Hao Yu, Janice Wheeler
  • Intel CAPA grant
  • CRISP, one of six centers in JUMP, an SRC program
  • NSF grants: CCF-1764077, CCF-1527923, CCF-1723773
  • ONR grant: N00014-18-1-2037
  • Samsung grant
  • Center for Domain-Specific Computing (CDSC)

○ Xilinx and VMWare.

slide-42
SLIDE 42
  • We adapt and expand automated refactoring to heterogeneous computing with FPGA.
  • HETEROREFACTOR provides a novel, end-to-end solution that combines:

○ dynamic invariant analysis for identifying the common case.
○ kernel refactoring to enhance HLS synthesizability and to reduce memory usage.
○ selective offloading with guard checking to guarantee correctness.

  • The proposed combination is unique to the best of our knowledge.

HETEROREFACTOR: Refactoring for Heterogeneous Computing with FPGA

Jason Lau*, Aishwarya Sivaraman*, Qian Zhang*, Muhammad Ali Gulzar, Jason Cong, Miryung Kim

University of California, Los Angeles *Equal co-first authors in alphabetical order