SLIDE 1

µIR – An intermediate representation for transforming and optimizing the microarchitecture of application accelerators

Amirali Sharifian¹, Reza Hojabr¹, Navid Rahimi¹, Sihao Liu², Apala Guha¹, Tony Nowatzki², and Arrvindh Shriraman¹

¹Simon Fraser University, ²UCLA

https://github.com/sfu-arch/uir

SLIDE 2

The Accelerator Flowchart

Study the Application: Object Detection Model

SLIDE 3

The Accelerator Flowchart

Study the Application: Object Detection Model

Design Hardware: RTL Design

[Figure: RTL block f() with signals Valid, Input, Weights, Output, Clock, Reset, Valid]

SLIDE 4

The Accelerator Flowchart

Study the Application: Object Detection Model

Design Hardware: RTL Design

[Figure: RTL block f() with signals Valid, Input, Weights, Output, Clock, Reset, Valid]

New Software: Compiler

SLIDE 5

Problems With Design Flowchart

[Figure: design cost (M$) by process node (65nm, 45/40nm, 28nm, 20nm, 16/14nm), split into Architecture, Software, and Validation, plotted alongside feature dimension (transistor count). Source: IBS]

SLIDE 6

ISA-based Flowchart

[Diagram: Application → Compile → ISA → Execute → GPU]

✓ Isolate Application from Architecture

SLIDE 8

ISA-based Flowchart

[Diagram: Application → Compile → ISA → Execute → Accelerator?]

✓ Isolate Application from Architecture
  • Not expressive enough to create hardware

SLIDE 9

ISA-based Flowchart

[Diagram: Application → Compile → ISA → Execute → ?]

✓ Isolate Application from Architecture
  • Not expressive enough to create hardware
  • Not precise enough to explore hardware

SLIDE 10

µIR: A New Accelerator Flowchart

✓ End-to-End flow: Existing software for behavior/functionality

[Diagram labels: Applications, Behavior, CREATE, Microarch. Representation (µIR), Auto-Synthesis, RTL]

SLIDE 11

µIR: A New Accelerator Flowchart

✓ End-to-End flow: Existing software for behavior/functionality
✓ Reduce effort: Compiler to extract behaviour

[Diagram labels: Applications, Behavior, CREATE, Microarch. Representation (µIR), Auto-Synthesis, RTL]

SLIDE 12

µIR: A New Accelerator Flowchart

✓ End-to-End flow: Existing software for behavior/functionality
✓ Reduce effort: Compiler to extract behaviour
✓ Design exploration: New model for exploring architectures

[Diagram labels: Applications, Behavior, CREATE, Microarch. Representation (µIR), Auto-Synthesis]

SLIDE 13

µIR: A New Accelerator Flowchart

✓ End-to-End flow: Existing software for behavior/functionality
✓ Reduce effort: Compiler to extract behaviour
✓ Design exploration: New model for exploring architectures
✓ Extensibility: Extensible to capture domain information

[Diagram labels: Applications, Behavior, CREATE, Microarch. Representation (µIR), Auto-Synthesis]

SLIDE 15

[Diagram labels: Application, Compile, LLVM IR/Cilk, GPU, TPU]

SLIDE 16

µIR: A New Accelerator Flowchart

[Diagram: Application → Compile → LLVM IR/Cilk → µIR Graph]

SLIDE 17

µIR: A New Accelerator Flowchart

[Diagram: Application → Compile → LLVM IR/Cilk → µIR Graph (hierarchical data flow, structure graph)]

SLIDE 18

µIR: A New Accelerator Flowchart

[Diagram: Application → Compile → LLVM IR/Cilk → µIR Graph (hierarchical data flow, structure graph), with µOpt applied to the µIR Graph]

SLIDE 19

µIR: A New Accelerator Flowchart

[Diagram: Application → Compile → LLVM IR/Cilk → µIR Graph (hierarchical data flow, structure graph), with µOpt applied to the µIR Graph; further labels: Chisel, µLib]

SLIDE 20
  • Motivation
  • µIR behaviour graph
  • µIR structural graph
  • Evaluation
  • Summary

SLIDE 21

µIR Behavioural Graph (1/3)

parallel_for(i=0; i<(M-W); i++)
  parallel_for(j=0; j<W; j++)
    output[i] += input[i+j] * weight[j];

[Figure: multiply (×) and accumulate (+=) of Input and Weights into Output]

$$\mathrm{Conv} = \sum_{i=0}^{N-W} \sum_{j=0}^{W} W_{ij} \ast IN_{ij}$$

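For reference, the loop nest on this slide can be written as a minimal sequential Python sketch. The function name `conv1d` and its arguments are illustrative assumptions, not part of µIR. The loop bounds (i < M−W, j < W) are kept exactly as written on the slide.

```python
def conv1d(inp, weight):
    """Sequential reference for the slide's loop nest.

    Bounds follow the slide: i in [0, M-W), j in [0, W),
    where M = len(inp) and W = len(weight).
    """
    m, w = len(inp), len(weight)
    out = [0] * (m - w)
    for i in range(m - w):      # outer parallel_for on the slide
        for j in range(w):      # inner parallel_for on the slide
            out[i] += inp[i + j] * weight[j]
    return out


if __name__ == "__main__":
    print(conv1d([1, 2, 3, 4, 5], [1, 1]))
```

Every outer iteration writes a distinct `output[i]`, which is what makes the outer loop safe to run in parallel.
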
SLIDE 22

µIR Behavioural Graph (2/3)

parallel_for(i=0; i<(M-W); i++)
  parallel_for(j=0; j<W; j++)
    output[i] += input[i+j] * weight[j];

  • Hierarchical and heterogeneous task graph
  • Decompose the input program into task blocks
  • Graph of task blocks
  • Implement arbitrary heterogeneous parallel and serial

[Figure: task blocks "for i", "for j", and "Body", connected by Spawn and Sync edges]
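One way to picture the spawn/sync structure above is the following Python sketch. `TaskBlock`, `build_conv_task_graph`, and `run_conv` are hypothetical names, and Python threads only stand in for hardware task blocks. The sketch assumes each outer iteration is spawned as a child task and that the inner loop runs serially inside it, which avoids the `+=` race; the slide's inner `parallel_for` is not modelled.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class TaskBlock:
    """One node of a hierarchical task graph (illustrative only)."""
    name: str
    children: list = field(default_factory=list)

    def depth(self):
        # Number of nesting levels below and including this block.
        return 1 + max((c.depth() for c in self.children), default=0)


def build_conv_task_graph():
    # "for i" contains "for j", which contains the loop body.
    body = TaskBlock("body")
    for_j = TaskBlock("for j", [body])
    return TaskBlock("for i", [for_j])


def run_conv(inp, weight, workers=4):
    m, w = len(inp), len(weight)

    def for_j(i):
        # Child task: serial reduction over j for one output element.
        return sum(inp[i + j] * weight[j] for j in range(w))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(for_j, i) for i in range(m - w)]  # spawn
        return [f.result() for f in futures]                     # sync
```
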

SLIDE 23

µIR Behavioural Graph (3/3)

  • µIR dataflow graph for each task block
  • Pipelined (single and multi-cycle)
  • Typed (e.g., FP precision)
  • Ops with non-deterministic latency (e.g., memory ops)
  • Permits local transformations because tasks are asynchronous

[Figure: dataflow graph for output[i] += input[i+j] * weight[j]; with nodes LD, LD, ×, +, LD, ST]
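The dataflow graph for the loop body can be sketched in Python as a list of typed nodes evaluated in topological order. The `Node` fields, the `"i32"` type tag, and the function names are assumptions made for illustration; they are not µIR's actual node format, and timing is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    op: str              # "LD", "MUL", "ADD", or "ST"
    inputs: tuple = ()   # positions of producer nodes in the list
    arg: tuple = ()      # (array name, index) for memory ops
    dtype: str = "i32"   # illustrative type tag, unused by the evaluator


def body_graph(i, j):
    """Nodes for: output[i] += input[i+j] * weight[j]."""
    return [
        Node("LD", arg=("input", i + j)),
        Node("LD", arg=("weight", j)),
        Node("MUL", inputs=(0, 1)),
        Node("LD", arg=("output", i)),
        Node("ADD", inputs=(2, 3)),
        Node("ST", inputs=(4,), arg=("output", i)),
    ]


def run_body(nodes, memory):
    """Evaluate nodes in list order, which is assumed to be topological."""
    values = []
    for n in nodes:
        ins = [values[k] for k in n.inputs]
        if n.op == "LD":
            name, idx = n.arg
            values.append(memory[name][idx])
        elif n.op == "MUL":
            values.append(ins[0] * ins[1])
        elif n.op == "ADD":
            values.append(ins[0] + ins[1])
        elif n.op == "ST":
            name, idx = n.arg
            memory[name][idx] = ins[0]
            values.append(None)  # keep positions aligned with the node list
        else:
            raise ValueError(f"unknown op {n.op}")
    return memory
```
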

SLIDE 24
  • Motivation
  • µIR behaviour graph
  • µIR structural graph
  • Evaluation
  • Summary

SLIDE 25

µIR Structural Graph, Stage 0: Unoptimized

[Diagram: DRAM connected directly to a MAC unit]

SLIDE 26

µIR Structural Graph, Stage 1: Locality

[Diagram: DRAM, Buffer, MAC]

  • Define custom memory hierarchy: Buffer, FIFO

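To see why a buffer between DRAM and the MAC helps, the sketch below counts DRAM reads for the convolution loop with and without a scratchpad. `Dram`, `Buffer`, and `conv_via` are hypothetical names, and the buffer is assumed to have unbounded capacity, which a real design would not have.

```python
class Dram:
    """Backing memory that counts every read."""
    def __init__(self, data):
        self.data = data
        self.reads = 0

    def read(self, name, idx):
        self.reads += 1
        return self.data[name][idx]


class Buffer:
    """Scratchpad in front of DRAM: fetch each word once, then reuse it."""
    def __init__(self, dram):
        self.dram = dram
        self.store = {}

    def read(self, name, idx):
        key = (name, idx)
        if key not in self.store:
            self.store[key] = self.dram.read(name, idx)
        return self.store[key]


def conv_via(mem, m, w):
    """Run the convolution loop, reading operands through `mem`."""
    out = [0] * (m - w)
    for i in range(m - w):
        for j in range(w):
            out[i] += mem.read("input", i + j) * mem.read("weight", j)
    return out
```

With M = 8 and W = 3, the direct version issues 2 × 5 × 3 = 30 operand reads, while the buffered version touches DRAM once per distinct word (7 inputs plus 3 weights).
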
SLIDE 28

µIR Structural Graph, Stage 2: Tiling

  • Define custom memory hierarchy: Buffer, FIFO
  • Tiling asynchronous task blocks

[Diagram: DRAM feeding two MAC tiles]
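Tiling can be pictured as splitting the outer iterations across several MAC tiles. The sketch below is a software analogy under that assumption; `tile_ranges` and `tiled_conv` are hypothetical helpers, and the chunks run one after another here rather than concurrently.

```python
def tile_ranges(n, tiles):
    """Split range(n) into `tiles` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(n, tiles)
    ranges, start = [], 0
    for t in range(tiles):
        size = base + (1 if t < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def tiled_conv(inp, weight, tiles=2):
    m, w = len(inp), len(weight)
    out = [0] * (m - w)
    for chunk in tile_ranges(m - w, tiles):  # each chunk models one MAC tile
        for i in chunk:
            for j in range(w):
                out[i] += inp[i + j] * weight[j]
    return out
```

Because each tile writes a disjoint slice of `out`, the tiles need no synchronization with each other until the final result is read.
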

SLIDE 29

µIR Structural Graph, Stage 3: Pipelining

  • Define custom memory hierarchy: Buffer, FIFO
  • Tiling asynchronous task blocks
  • Start pipelining the operands

[Diagram: DRAM feeding four MAC units]
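The benefit of pipelining can be estimated with the usual textbook cycle model, sketched below. This is a generic approximation, not µIR's timing model; the function names, the `depth` parameter (pipeline stages), and `ii` (initiation interval) are assumptions.

```python
def cycles_unpipelined(n_ops, depth):
    """Each operation occupies the unit for `depth` cycles before the next starts."""
    return n_ops * depth


def cycles_pipelined(n_ops, depth, ii=1):
    """First result after `depth` cycles, then one result every `ii` cycles."""
    if n_ops == 0:
        return 0
    return depth + (n_ops - 1) * ii


if __name__ == "__main__":
    n, d = 100, 4
    print(cycles_unpipelined(n, d), cycles_pipelined(n, d))
```

For 100 operations through a 4-stage unit, the model gives 400 cycles unpipelined versus 4 + 99 = 103 cycles pipelined.
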

SLIDE 30

µIR Structural Graph, Stage 3: Pipelining

  • Define custom memory hierarchy: Buffer, FIFO
  • Tiling asynchronous task blocks
  • Start pipelining the operands
  • Define custom operator

[Diagram: DRAM feeding two systolic arrays]

SLIDE 31

µIR Structural Graph, Stage 4: Higher-Order Ops

  • Extensible IR
  • Introduce new ops and new types
  • Existing stages/transformations must keep working
  • µIR components are generic: dataflow nodes, buffers, and memory network

[Diagram: DRAM, Tensor Network, GEMM]
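The extensibility idea, adding a higher-order op without touching the existing ones, can be sketched with a small op registry in Python. The registry, the decorator, and the list-of-rows "tensor" are illustrative assumptions and do not reflect how µIR registers operators.

```python
OPS = {}


def register_op(name):
    """Decorator that adds a function to the op table under `name`."""
    def wrap(fn):
        OPS[name] = fn
        return fn
    return wrap


@register_op("mul")
def _mul(a, b):
    return a * b


@register_op("add")
def _add(a, b):
    return a + b


@register_op("gemm")
def _gemm(a, b):
    """Higher-order op over a new 'tensor' type (a list of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def evaluate(op, *args):
    # Generic dispatch: existing callers work unchanged when ops are added.
    return OPS[op](*args)
```

Scalar ops and the new `gemm` op go through the same `evaluate` entry point, which mirrors the slide's requirement that existing stages keep working when new ops and types are introduced.
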

SLIDE 32
  • Motivation
  • µIR behaviour graph
  • µIR structural graph
  • Evaluation
  • Summary

SLIDE 33

µIR Targets

Simulator (https://github.com/sfu-arch/uir-sim)
  • Cycle-accurate simulation (e.g., gem5)
  • C++ driver and a Python binding for ease of use
  • Cache-based memory interface

FPGA (https://github.com/sfu-arch/uir-fpga)
  • Arria 10 SoC
  • Scripted process

SLIDE 34

µIR Iterative Optimizations: Stacking

  • Normalized to the baseline accelerator

[Figure: per-benchmark results on a 0.25 to 1 scale for STENCIL, IMA.SC, GEMM, FFT, SPMV, CONV., DENSE16, and SOFTM16, with Banking, Localization, Op-Fusion, and Tiling stacked]

Between: 20% to 80%

SLIDE 35

µIR Productivity

Benchmark     #Changes µIR   #Changes FIRRTL   Ratio
Saxpy         9              73                9.3
Stencil       12             142               12.4
Image Sca.    10             84                8.4

SLIDE 36

Available Now!

µIR lib: https://github.com/sfu-arch/uir
Simulator: https://github.com/sfu-arch/uir-sim
FPGA: https://github.com/sfu-arch/uir-fpga
Docker: https://github.com/sfu-arch/uir-docker

Thanks to the open-source community!