COSMOS: Coordination of High-Level Synthesis and Memory Optimization


slide-1
SLIDE 1

COSMOS: Coordination of High-Level Synthesis and Memory Optimization for Hardware Accelerators

Luca Piccolboni, Paolo Mantovani, Giuseppe Di Guglielmo, Luca Carloni

Columbia University, New York, USA

ACM/IEEE CODES+ISSS 2017, Seoul, South Korea

slide-2
SLIDE 2

Hardware Accelerators

Motivations

  • Hardware accelerators are devices designed and optimized to realize very specific functionalities

[Figure: efficiency versus generality, contrasting hardware accelerators (e.g., DianNao [T. Chen et al., ASPLOS’14]) with general-purpose processor cores]

slide-3
SLIDE 3

Hardware Accelerators

Architecture

[Figure: an accelerator attached to the on-chip interconnect, made of Component #1, Component #2, … Component #K. Each component shows component logic, a component interface, a component datapath (Loop #1 … Loop #N), and a Private Local Memory (PLM) made of multiple banks]

slide-4
SLIDE 4

Hardware Accelerators

High-Level Synthesis (HLS)

[Figure: a SystemC specification goes through high-level synthesis to produce the RTL of a component (logic, interface, datapath with Loop #1 … Loop #N, and PLM banks). Knob configurations #1 and #2 give two different implementations on a plot of cost (area) versus performance (latency)]

slide-5
SLIDE 5

Hardware Accelerators

High-Level Synthesis (HLS)

[Figure: SystemC specification, high-level synthesis, and resulting RTL of a component. The plot of cost (area) versus performance (latency) highlights the Pareto-optimal implementations]

slide-6
SLIDE 6

Hardware Accelerators

High-Level Synthesis (HLS)

[Figure: SystemC specification, high-level synthesis, and resulting RTL of a component. The plot of cost (area) versus performance (latency) highlights the Pareto-dominated implementations]
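The distinction between Pareto-optimal and Pareto-dominated implementations can be sketched in a few lines of C. This is only an illustration, not the tool flow from the talk: the type design_point and the functions dominates and pareto_filter are hypothetical names, and both metrics (area and latency) are assumed to be minimized.

```c
#include <stdbool.h>
#include <stddef.h>

/* One synthesized implementation: cost (area) and performance (latency).
 * Hypothetical type, introduced only for this sketch. */
typedef struct {
    double area_mm2;
    double latency_ms;
} design_point;

/* a dominates b if a is no worse on both metrics and strictly better on one. */
static bool dominates(design_point a, design_point b)
{
    return a.area_mm2 <= b.area_mm2 && a.latency_ms <= b.latency_ms &&
           (a.area_mm2 < b.area_mm2 || a.latency_ms < b.latency_ms);
}

/* Copies the Pareto-optimal points of pts[0..n) into out (capacity n),
 * preserving their order, and returns how many were kept. */
size_t pareto_filter(const design_point *pts, size_t n, design_point *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        bool dominated = false;
        for (size_t j = 0; j < n && !dominated; ++j)
            dominated = (j != i) && dominates(pts[j], pts[i]);
        if (!dominated)
            out[kept++] = pts[i];
    }
    return kept;
}
```

The quadratic scan is enough for the handful of points produced by a few HLS runs; a sort-based sweep would be the usual choice for larger sets.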

slide-7
SLIDE 7

Hardware Accelerators

High-Level Synthesis (HLS)

Which knobs can be used to obtain several RTL implementations?

  • 1. Loop unrolling

for (k = 0; k < N; ++k) a[k] = b[k] + c[k];

[Figure: datapath reading b[k] and c[k] and writing a[k]; the resulting RTL is placed on a plot of cost (area) versus performance (latency)]

slide-8
SLIDE 8

Hardware Accelerators

High-Level Synthesis (HLS)

Which knobs can be used to obtain several RTL implementations?

  • 1. Loop unrolling

apply unrolling

for (k = 0; k < N; k += 2) { a[k+0] = b[k+0] + c[k+0]; a[k+1] = b[k+1] + c[k+1]; }

[Figure: two parallel datapaths, one reading b[k+0] and c[k+0] and writing a[k+0], the other reading b[k+1] and c[k+1] and writing a[k+1]; the resulting RTL is placed on a plot of cost (area) versus performance (latency)]
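As a software-level sketch of the same transformation, the loop from the slide can be unrolled by a factor of 2 by hand. The function name vec_add_unroll2 and the epilogue for odd lengths are additions made for this example; in an HLS flow the unrolling would normally be requested through a tool knob rather than written out.

```c
#include <stddef.h>

/* Element-wise sum a[k] = b[k] + c[k] with the loop body unrolled by 2. */
void vec_add_unroll2(int *a, const int *b, const int *c, size_t n)
{
    size_t k = 0;

    /* Main loop: two independent additions per iteration, which an HLS
     * tool could schedule in parallel if the memory offers enough ports. */
    for (; k + 1 < n; k += 2) {
        a[k + 0] = b[k + 0] + c[k + 0];
        a[k + 1] = b[k + 1] + c[k + 1];
    }

    /* Epilogue: handles the last element when n is odd. */
    if (k < n)
        a[k] = b[k] + c[k];
}
```

The two additions in the unrolled body need two reads of b, two reads of c, and two writes of a per iteration, which is why the unrolling factor and the number of memory ports interact.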

slide-9
SLIDE 9

Hardware Accelerators

High-Level Synthesis (HLS)

Which knobs can be used to obtain several RTL implementations?

  • 2. Memory Ports

[Figure: Private Local Memory (PLM) with four banks accessed through port 1 and port 2; the resulting RTL is placed on a plot of cost (area) versus performance (latency)]

slide-10
SLIDE 10

Hardware Accelerators

High-Level Synthesis (HLS)

Which knobs can be used to obtain several RTL implementations?

  • 2. Memory Ports

increase number of ports

[Figure: Private Local Memory (PLM) with eight banks accessed through port 1, port 2, port 3, and port 4; the resulting RTL is placed on a plot of cost (area) versus performance (latency)]
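One common way to let several ports work in parallel is to spread an array over the banks in a cyclic (interleaved) fashion, so that consecutive indices land in different banks. The slides do not say how COSMOS or its memory generator distributes data, so the sketch below is an assumption made purely for illustration; plm_location and plm_map_cyclic are hypothetical names.

```c
#include <stddef.h>

/* Location of one array element inside a multi-bank memory. */
typedef struct {
    size_t bank;    /* which bank holds the element */
    size_t offset;  /* word address inside that bank */
} plm_location;

/* Cyclic interleaving: index i goes to bank i mod num_banks,
 * at word i / num_banks within that bank. */
plm_location plm_map_cyclic(size_t index, size_t num_banks)
{
    plm_location loc = { index % num_banks, index / num_banks };
    return loc;
}
```

With this mapping, the accesses a[k+0] and a[k+1] of an unrolled loop fall into different banks, so they do not compete for the same bank.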

slide-11
SLIDE 11

Motivational Examples

  • Performing an accurate and exhaustive design-space exploration for a hardware accelerator is complex:

slide-12
SLIDE 12

Motivational Examples

  • Performing an accurate and exhaustive design-space exploration for a hardware accelerator is complex:

  • 1. HLS tools do not always support the generation (and optimization) of the private local memories

slide-13
SLIDE 13

Motivational Examples

Need of multi-port memories

[Figure: Gradient component, area (mm2) versus effective latency (ms), for 1, 2, 4, and 8 ports, using standard memories. Latency span: 1.4×; area span: 1.2×]

slide-14
SLIDE 14

Motivational Examples

Need of multi-port memories

[Figure: Gradient component, area (mm2) versus effective latency (ms), for 1, 2, 4, and 8 ports, using multi-port memories. Latency span: 7.9×; area span: 3.7×]
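The span figures quoted for the Gradient component can be read as the ratio between the largest and the smallest value of a metric over the explored implementations. The slides do not define the term, so that reading is an assumption; the helper span below is a hypothetical name used only for this sketch.

```c
#include <stddef.h>

/* Ratio between the largest and the smallest value in values[0..n).
 * Assumes n >= 1 and strictly positive values (areas or latencies). */
double span(const double *values, size_t n)
{
    double lo = values[0];
    double hi = values[0];
    for (size_t i = 1; i < n; ++i) {
        if (values[i] < lo) lo = values[i];
        if (values[i] > hi) hi = values[i];
    }
    return hi / lo;
}
```

Under this reading, a latency span of 7.9× means the slowest explored implementation is 7.9 times slower than the fastest one.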

slide-15
SLIDE 15

Motivational Examples

  • Performing an accurate and exhaustive design-space exploration for a hardware accelerator is complex:

  • 1. HLS tools do not always support the generation (and optimization) of the private local memories

  • 2. The algorithms adopted by HLS tools are based on heuristics that make it hard to set the knobs

slide-16
SLIDE 16

Motivational Examples

Unpredictability of HLS tools

[Figure: Gradient component, area (mm2) versus effective latency (ms), for 1, 2, 4, and 8 ports. An inset zooms on the implementations obtained with 2u, 3u, 4u, 5u, 6u, 7u, 8u, 9u, 10u, and 14u (# unrolls); the point labels appear in the order 2, 3, 4, 5, 6, 9, 8, 7, 10, 14]

slide-19
SLIDE 19

Motivational Examples

  • Performing an accurate and exhaustive design-space exploration for a hardware accelerator is complex:

  • 1. HLS tools do not always support the generation (and optimization) of the private local memories

  • 2. The algorithms adopted by HLS tools are based on heuristics that make it hard to set the knobs

  • 3. HLS tools do not handle the simultaneous optimization of multiple components

slide-20
SLIDE 20

Motivational Examples

Need of compositionality

[Figure: three plots. Gradient: area (mm2) versus effective latency (ms). Grayscale: area (mm2) versus effective latency (ms). Composition: area (mm2) versus effective throughput (1/ms), with the Pareto-dominated combinations marked]

slide-21
SLIDE 21

Contributions

  • We propose COSMOS, an automatic methodology for the design-space exploration of complex accelerators

  • 1. COSMOS is able to efficiently coordinate high-level synthesis and memory generator tools

slide-22
SLIDE 22

Contributions

  • We propose COSMOS, an automatic methodology for the design-space exploration of complex accelerators

  • 1. COSMOS is able to efficiently coordinate high-level synthesis and memory generator tools

  • 2. COSMOS leverages a scalable compositional design-space exploration methodology

slide-23
SLIDE 23

Contributions

  • We propose COSMOS, an automatic methodology for the design-space exploration of complex accelerators

§ Step 1: Component Characterization

[Figure: starting from the SystemC specification, Step 1 characterizes each component of the accelerator (#1 … #K) as a set of regions (region 1, region 2) in the latency versus area plane]

slide-24
SLIDE 24

Contributions

  • We propose COSMOS, an automatic methodology for the design-space exploration of complex accelerators

§ Step 2: Design-Space Exploration

[Figure: Step 2 combines the latency versus area regions (region 1, region 2) of components #1 … #K into the design space of the accelerator, plotted as throughput versus area]

slide-25
SLIDE 25

Component Characterization

  • Goal: for each component of the accelerator identify the regions with the Pareto-optimal implementations

[Figure: area (mm2) versus effective latency (ms), with points for 1 port, 2 ports, and 4 ports and two regions (region 1, region 2)]

slide-26
SLIDE 26

Component Characterization

  • Goal: for each component of the accelerator identify the regions with the Pareto-optimal implementations

[Figure: area (mm2) versus effective latency (ms), with points for 1 port, 2 ports, and 4 ports and two regions (region 1, region 2); the lower-right point and the upper-left point of a region are marked]

slide-27
SLIDE 27

Component Characterization

How to identify the lower-right point

MAX latency, MIN area

EASY → set the number of unrolls equal to the number of ports

[Figure: area (mm2) versus effective latency (ms) for 1 port]

slide-28
SLIDE 28

Component Characterization

How to identify the upper-left point

MIN latency, MAX area

EASY → set the number of unrolls equal to the max we can afford

[Figure: area (mm2) versus effective latency (ms) for 1 port, with a point labeled 16]

slide-29
SLIDE 29

Component Characterization

How to identify the upper-left point

EASY → set the number of unrolls equal to the max we can afford

NO!

[Figure: area (mm2) versus effective latency (ms) for 1 port, with points labeled 14, 15, and 16]

slide-30
SLIDE 30

Component Characterization

How to identify the upper-left point

SOLUTION → introduce a constraint on latency

$h_{ports}(unrolls) = \frac{unrolls \ast \gamma_{read}}{ports} + \frac{\gamma_{write}}{ports} + \eta$

[Figure: area (mm2) versus effective latency (ms) for 1 port, with points labeled 14, 15, and 16]

slide-31
SLIDE 31

Component Characterization

How to identify the upper-left point

$h_{ports}(unrolls) = \frac{unrolls \ast \gamma_{read}}{ports} + \frac{\gamma_{write}}{ports} + \eta$

$\gamma_{read}$: max number of read accesses to the same array per loop iteration

[Figure: area (mm2) versus effective latency (ms) for 1 port, with points labeled 14, 15, and 16]

slide-32
SLIDE 32

Component Characterization

How to identify the upper-left point

$h_{ports}(unrolls) = \frac{unrolls \ast \gamma_{read}}{ports} + \frac{\gamma_{write}}{ports} + \eta$

$\gamma_{write}$: max number of write accesses to the same array per loop iteration

[Figure: area (mm2) versus effective latency (ms) for 1 port, with points labeled 14, 15, and 16]

slide-33
SLIDE 33

Component Characterization

How to identify the upper-left point

$h_{ports}(unrolls) = \frac{unrolls \ast \gamma_{read}}{ports} + \frac{\gamma_{write}}{ports} + \eta$

$\eta$: this accounts for the latency of the operations that do not access the local memory

[Figure: area (mm2) versus effective latency (ms) for 1 port, with points labeled 14, 15, and 16]

slide-34
SLIDE 34

Component Characterization

Identifying the upper-left point

$h_{ports}(unrolls) = \frac{unrolls \ast \gamma_{read}}{ports} + \frac{\gamma_{write}}{ports} + \eta$

[Figure: area (mm2) versus effective latency (ms) for 1 port, with points labeled 14, 15, and 16; the implementations that violate the constraint are discarded]
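A minimal sketch of how the latency constraint could be evaluated, reading the formula on these slides as h_ports(unrolls) = unrolls * gamma_read / ports + gamma_write / ports + eta. Several things here are assumptions: the function names latency_bound and violates_constraint are hypothetical, the divisions are kept real-valued because the slides show no rounding, and an implementation is taken to violate the constraint when its measured latency exceeds the bound.

```c
#include <stdbool.h>

/* Bound h_ports(unrolls) as read from the slides (assumed reading).
 * gamma_read / gamma_write: max read / write accesses to the same array
 * per loop iteration; eta: latency of the operations that do not access
 * the local memory. */
double latency_bound(unsigned unrolls, unsigned ports,
                     unsigned gamma_read, unsigned gamma_write, double eta)
{
    return (double)(unrolls * gamma_read) / ports
         + (double)gamma_write / ports
         + eta;
}

/* Assumed direction of the check: an implementation is discarded when
 * its measured latency is larger than the bound. */
bool violates_constraint(double measured_latency, unsigned unrolls,
                         unsigned ports, unsigned gamma_read,
                         unsigned gamma_write, double eta)
{
    return measured_latency >
           latency_bound(unrolls, ports, gamma_read, gamma_write, eta);
}
```

For example, with 16 unrolls, 4 ports, gamma_read = 2, gamma_write = 1, and eta = 3, the bound evaluates to 8 + 0.25 + 3 = 11.25.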

slide-35
SLIDE 35

Design-Space Exploration

  • Goal: obtain the combinations that are Pareto-optimal for the accelerator

[Figure: the latency versus area regions (region 1, region 2) of Component #1, Component #2, and Component #3 are combined into the throughput versus area design space of the accelerator]

slide-39
SLIDE 39

Design-Space Exploration

Step 1: Synthesis Planning

[Figure: computational dependencies among the components of the accelerator (COMPONENT #1, COMPONENT #2, COMPONENT #3)]

slide-40
SLIDE 40

Design-Space Exploration

Step 1: Synthesis Planning

[Figure: COMPONENT #1, COMPONENT #2, and COMPONENT #3 modeled as a Timed Marked Graph (TMG) with latencies $\lambda_1$, $\lambda_2$, $\lambda_3$]

From component characterization: $\lambda_1^{min} < \lambda_1 < \lambda_1^{max}$

slide-41
SLIDE 41

Design-Space Exploration

Step 1: Synthesis Planning

[Figure: COMPONENT #1, COMPONENT #2, and COMPONENT #3 modeled as a Timed Marked Graph (TMG) with latencies $\lambda_1$, $\lambda_2$, $\lambda_3$]

throughput of the accelerator:

$\theta = \min\left(\frac{1}{\lambda_1 + \lambda_2}, \frac{1}{\lambda_1 + \lambda_3}\right)$
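For a timed marked graph, the sustainable throughput is the minimum, over all cycles, of the number of tokens on the cycle divided by the total latency along it. The sketch below applies that standard result; the type tmg_cycle and the function tmg_throughput are hypothetical names, and the cycles are assumed to be enumerated by the caller rather than extracted from a graph.

```c
#include <stddef.h>

/* One cycle of a timed marked graph (hypothetical type for this sketch). */
typedef struct {
    unsigned tokens;      /* tokens initially placed on the cycle */
    double latency_sum;   /* sum of the component latencies on the cycle */
} tmg_cycle;

/* Throughput = minimum over cycles of tokens / latency_sum.
 * Returns 0.0 when no cycle is given. */
double tmg_throughput(const tmg_cycle *cycles, size_t n)
{
    double theta = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double ratio = cycles[i].tokens / cycles[i].latency_sum;
        if (i == 0 || ratio < theta)
            theta = ratio;
    }
    return theta;
}
```

With one token on each of two cycles whose latencies sum to 5 ms and 8 ms, the slower cycle dominates and the throughput is 1/8 = 0.125 per ms.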

slide-42
SLIDE 42

Design-Space Exploration

Step 1: Synthesis Planning

[Figure: COMPONENT #1, COMPONENT #2, and COMPONENT #3 modeled as a Timed Marked Graph (TMG) with latencies $\lambda_1$, $\lambda_2$, $\lambda_3$]

throughput of the accelerator:

$\theta = \min\left(\frac{1}{\lambda_1 + \lambda_2}, \frac{1}{\lambda_1 + \lambda_3}\right)$

throughput $\theta$ → Linear Programming [Liu et al., DATE ’12] → $\lambda_1$, $\lambda_2$, $\lambda_3$

  • minimize area

slide-43
SLIDE 43

Design-Space Exploration

Step 2: Synthesis Mapping

throughput $\theta$ → Linear Programming [Liu et al., DATE ’12] → $\lambda$

  • minimize area

$\lambda$ is a theoretical solution, thus we need to map $\lambda$ to a knob setting

how do we choose the knob setting?

[Figure: $\lambda$ placed on the latency axis of a latency versus area plot with region 1 and region 2]

slide-44
SLIDE 44

Design-Space Exploration

Step 2: Synthesis Mapping

throughput $\theta$ → Linear Programming [Liu et al., DATE ’12] → $\lambda$

  • minimize area

CASE 1: $\lambda$ corresponds to one of the extreme points of a region → no synthesis required!

  • unrolls = ports
  • ports = ports of that region

[Figure: $\lambda$ placed on the latency axis of a latency versus area plot with region 1 and region 2]

slide-45
SLIDE 45

Design-Space Exploration

Step 2: Synthesis Mapping

throughput $\theta$ → Linear Programming [Liu et al., DATE ’12] → $\lambda$

  • minimize area

CASE 2: $\lambda$ falls outside the regions → no synthesis, and preserving throughput is our objective

  • unrolls = ports
  • ports = ports of the next region

[Figure: $\lambda$ placed on the latency axis of a latency versus area plot with region 1 and region 2]

slide-46
SLIDE 46

Design-Space Exploration

Step 2: Synthesis Mapping

throughput $\theta$ → Linear Programming [Liu et al., DATE ’12] → $\lambda$

  • minimize area

CASE 3: $\lambda$ falls inside a region → synthesis required to get the actual implementation

[Figure: $\lambda$ placed on the latency axis of a latency versus area plot with region 1 and region 2]

slide-47
SLIDE 47

Design-Space Exploration

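Step 2: Synthesis Mapping

CASE 3: $\lambda$ falls inside a region

[Figure: latency axis marking $\lambda_{max}$, $\lambda_{min}$, and $\lambda_{target}$, with the corresponding numbers of unrolls $\mu_{min}$, $\mu_{max}$, and $\mu_{target}$]

Amdahl’s Law: $S = \frac{1}{1 - P + \frac{P}{F}}$

$S = \frac{\lambda_{target}}{\lambda_{max}}$

$F = \frac{\lambda_{min}}{\lambda_{max}}$

$P = \frac{\mu_{target} - \mu_{min}}{\mu_{max} - \mu_{min}}$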

slide-48
SLIDE 48

Design-Space Exploration

Step 2: Synthesis Mapping

$\lambda_{max}$ = 40 ms, $\mu_{min}$ = 1, $\mu_{max}$ = 30, $\lambda_{min}$ = 10 ms

[Figure: effective latency (ms) versus number of unrolls. Labeled points: latency = 40, unrolls = 1; latency = 30, unrolls = 4; latency = 20, unrolls = 11; latency = 10, unrolls = 30]
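The labeled points of this example can be reproduced with a short calculation, reading the Amdahl-style relation of the previous slide as S = 1 / (1 - P + P / F) with S = latency_target / latency_max, F = latency_min / latency_max, and P = (unrolls_target - unrolls_min) / (unrolls_max - unrolls_min). Solving for P gives P = (1/S - 1) / (1/F - 1). That reading, the function name unrolls_for_latency, and the rounding to the nearest integer are assumptions made for this sketch.

```c
/* Maps a target latency to a number of unrolls (assumed reading of the
 * slides). Requires lat_min < lat_max and lat_min <= lat_target <= lat_max. */
long unrolls_for_latency(double lat_target, double lat_min, double lat_max,
                         long unr_min, long unr_max)
{
    double s = lat_target / lat_max;               /* S */
    double f = lat_min / lat_max;                  /* F */
    double p = (1.0 / s - 1.0) / (1.0 / f - 1.0);  /* P, from S = 1/(1-P+P/F) */
    double unrolls = unr_min + p * (double)(unr_max - unr_min);

    /* Round to the nearest integer (the value is positive here). */
    return (long)(unrolls + 0.5);
}
```

With latency bounds of 10 ms and 40 ms and unroll bounds of 1 and 30, target latencies of 40, 30, 20, and 10 ms map to 1, 4, 11, and 30 unrolls, which matches the points labeled in the plot.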

slide-49
SLIDE 49

Experimental Results

Case Study

WAMI (Wide-Area Motion Imagery): C specification → SystemC HLS-ready specification → RTL code for 32nm ASIC technology

[Figure: block diagram of the WAMI application with the components DEBAYER, GRAYSCALE, GRADIENT, WARP-DX, WARP-DY, STEEP.-DESCENT, HESSIAN, MATRIX-INV, WARP-GRAY, MATRIX-SUB, SD-UPDATE, MATRIX-MUL, MATRIX-RESH, MATRIX-ADD, WARP-IWXP, and CHANGE-DET., with a group labeled LUCAS-KANADE]

slide-50
SLIDE 50

Experimental Results

Component Characterization

[Figure: Hessian component, area (mm2) versus effective latency (ms), for 2, 4, 8, and 16 ports, with a zoomed inset]

slide-51
SLIDE 51

Experimental Results

Component Characterization

[Figure: two bar charts, latency span and area span, for the components DEBAYER, GRAYSCALE, GRADIENT, MATRIX-SUB, WARP, MATRIX-ADD, MATRIX-MUL, MATRIX-RESH, HESSIAN, STEEP-DESCENT, SD-UPDATE, and CHANGE-DET, on a scale from 1 to 10. Series shown: No Memory. The higher the better]

slide-52
SLIDE 52

Experimental Results

Component Characterization

[Figure: two bar charts, latency span and area span, for the components DEBAYER, GRAYSCALE, GRADIENT, MATRIX-SUB, WARP, MATRIX-ADD, MATRIX-MUL, MATRIX-RESH, HESSIAN, STEEP-DESCENT, SD-UPDATE, and CHANGE-DET, on a scale from 1 to 10. Series shown: No Memory and COSMOS. The higher the better]

slide-53
SLIDE 53

Experimental Results

Design-Space Exploration (Efficiency)

[Figure: invocations to the HLS tool (scale from 10 to 120) for the components DEBAYER, GRAYSCALE, GRADIENT, MATRIX-SUB, WARP, MATRIX-ADD, MATRIX-MUL, MATRIX-RESH, HESSIAN, STEEP-DESCENT, SD-UPDATE, and CHANGE-DET. Legend: Exhaustive, Characterization, Failed λ-constraints, Synthesis Mapping; the COSMOS bars and the non-critical components are marked. The lower the better]

slide-54
SLIDE 54

Experimental Results

Design-Space Exploration (Efficiency)

calls to the HLS tool reduced by up to one order of magnitude!!

[Figure: invocations to the HLS tool (scale from 10 to 120) for the components DEBAYER, GRAYSCALE, GRADIENT, MATRIX-SUB, WARP, MATRIX-ADD, MATRIX-MUL, MATRIX-RESH, HESSIAN, STEEP-DESCENT, SD-UPDATE, and CHANGE-DET. Legend: Exhaustive, Characterization, Failed λ-constraints, Synthesis Mapping. The lower the better]

slide-55
SLIDE 55

Experimental Results

Design-Space Exploration (Accuracy)

[Figure: area (mm2) versus throughput (frames/s), comparing planned design points (theoretical) with mapped design points (algorithm). Percentage of area mismatch for the plotted points: 2.5%, 11.9%, 13.0%, 1.5%, 2.5%, 0.1%, 2.1%, 1.8%, 1.6%, 1.8%]

slide-56
SLIDE 56

Experimental Results

Design-Space Exploration (Accuracy)

[Figure: area (mm2) versus throughput (frames/s), comparing planned design points (theoretical) with mapped design points (algorithm). Percentage of area mismatch for the plotted points: 2.5%, 11.9%, 13.0%, 1.5%, 2.5%, 0.1%, 2.1%, 1.8%, 1.6%, 1.8%]

large mismatch in area because:

[Figure: illustration involving the planned latency $\lambda$]

slide-57
SLIDE 57

Concluding Remarks

  • We presented COSMOS, an automatic methodology for design-space exploration (DSE) of accelerators that coordinates HLS and memory generator tools

slide-58
SLIDE 58

Concluding Remarks

  • We presented COSMOS, an automatic methodology for design-space exploration (DSE) of accelerators that coordinates HLS and memory generator tools

  • 1. COSMOS guarantees a richer DSE compared to the methods that do not consider the accelerator PLMs

slide-59
SLIDE 59

Concluding Remarks

  • We presented COSMOS, an automatic methodology for design-space exploration (DSE) of accelerators that coordinates HLS and memory generator tools

  • 1. COSMOS guarantees a richer DSE compared to the methods that do not consider the accelerator PLMs

  • 2. COSMOS guarantees a much faster DSE compared to exhaustive methods in case of complex accelerators

slide-60
SLIDE 60

Concluding Remarks

  • We presented COSMOS, an automatic methodology for design-space exploration (DSE) of accelerators that coordinates HLS and memory generator tools

  • 1. COSMOS guarantees a richer DSE compared to the methods that do not consider the accelerator PLMs

  • 2. COSMOS guarantees a much faster DSE compared to exhaustive methods in case of complex accelerators

  • 3. COSMOS is a scalable methodology for DSE

slide-61
SLIDE 61

Speaker: Luca Piccolboni Columbia University, NY

Questions?

COSMOS: Coordination of High-Level Synthesis and Memory Optimization for Hardware Accelerators

Images from: https://www.flaticon.com/