SLIDE 1

Analysis and Optimization of the Implicit Broadcasts in FPGA HLS to improve Maximum Frequency

Licheng Guo*, Jason Lau*, Yuze Chi, Jie Wang, Cody Hao Yu, Zhe Chen, Zhiru Zhang and Jason Cong
University of California, Los Angeles; Cornell University

* indicates co-first authors
https://github.com/Licheng-Guo/vivado-hls-broadcast-optimization

SLIDE 2

Outline

  • Introduction
  • Problem Classification
  • Solution
  • Experiments
SLIDE 3

RTL Verilog vs. Untimed C/C++

  • Much higher development efficiency
  • Lower achievable frequency than RTL designs
  • Critical paths are hard to debug
SLIDE 8

We Analyze the Timing Issues of Complex Designs

  • Most critical paths are related to broadcasts
  • Some are hidden in user codes
  • Some are inferred by the HLS compiler
  • Lead to high-fanout interconnects and bad timing quality
  • We categorize common types of broadcasts in HLS-based designs.
  • We analyze the inherent limitations of current HLS tools exposed by the broadcast problem
  • Our lightweight solutions bring a significant frequency boost on real-world HLS designs

SLIDE 11

Classification of Broadcasts

  • Data Broadcast
      • Originates from the source code
      • High fan-out signals in the datapath
      • Can be mapped back to specific lines in the source code
  • Control Broadcast
      • Originates from the compiler
      • High fan-out signals from the control logic
      • Completely transparent to users
SLIDE 15

Data Broadcast

  • Scenario 1: unrolled loop

Problem: the current HLS delay model does not account for the additional net delay

SLIDE 16

Data Broadcast

  • Scenario 1: unrolled loop

Underestimated delay -> inadequate register insertion
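To make the unrolled-loop scenario concrete, here is a minimal Python sketch that counts how many operators read each signal once a loop is fully unrolled. The loop body (`out[i] = data[i] + offset`) and every name in it are hypothetical and not taken from the slides; the point is only that a loop-invariant value ends up with a fan-out equal to the unroll factor, while per-iteration data keeps a fan-out of one.

```python
from collections import Counter

def unrolled_fanout(unroll_factor):
    """Count the readers of each signal after fully unrolling the loop
    `for i in range(N): out[i] = data[i] + offset` (hypothetical example),
    where `offset` is loop-invariant."""
    readers = Counter()
    for i in range(unroll_factor):
        readers[f"data[{i}]"] += 1  # each adder copy reads its own element
        readers["offset"] += 1      # every adder copy reads the same value
    return readers

fanout = unrolled_fanout(64)
print(fanout["offset"], fanout["data[0]"])
```

A delay model that only sees one adder per copy has no term for the wire that carries `offset` to all 64 copies, which is the gap the slide points at.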

SLIDE 17

Data Broadcast

  • Scenario 2: Large buffer
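As a rough illustration of why a large buffer turns into a broadcast, the sketch below estimates how many physical memory blocks one logical buffer occupies. It assumes 36 Kb blocks and ignores real aspect-ratio and port constraints, so the result is only a lower bound; the function name and the example sizes are hypothetical.

```python
BLOCK_BITS = 36 * 1024  # assumption: one physical memory block holds 36 Kb

def physical_units(depth, width_bits, block_bits=BLOCK_BITS):
    """Lower bound on the memory blocks needed for a depth x width buffer."""
    total_bits = depth * width_bits
    return (total_bits + block_bits - 1) // block_bits  # integer ceiling

# A single logical array, e.g. 65536 entries of 512 bits (hypothetical size):
print(physical_units(65536, 512))
```

Address and write-data signals for the logical buffer must reach every one of these blocks, so a single array access in the source code can imply a fan-out in the hundreds.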
SLIDE 18

Control Broadcast

  • Scenario 1: Pipeline backpressure
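The backpressure scenario can be sketched as a cycle-level model in which one global stall signal freezes every stage of the pipeline. This is an illustrative simplification, not the control logic any particular HLS tool emits; `stall_fanout` assumes each register bit is one load on the stall net.

```python
def step_stalled_pipeline(stages, new_item, stall):
    """Advance the pipeline by one cycle under a single global stall signal.

    Returns (next_stages, output_item). When `stall` is set, every stage
    register holds its value, so the one signal controls all of them.
    """
    if stall:
        return list(stages), None
    return [new_item] + list(stages[:-1]), stages[-1]

def stall_fanout(stage_widths):
    """Loads on the stall net, assuming one load per register bit."""
    return sum(stage_widths)

print(step_stalled_pipeline([1, 2, 3], 0, stall=False))
print(stall_fanout([32, 64, 512]))
```

The fan-out grows with both the depth and the width of the pipeline, which is why the slides treat this broadcast as a scalability problem.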
SLIDE 24

Control Broadcast

  • Scenario 2: Synchronization of parallel logic
      • The compiler infers parallelism from sequential code
      • It inserts synchronization logic to guarantee correctness
      • Pattern: reduce-then-broadcast
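The reduce-then-broadcast pattern can be sketched as a controller that first combines the `done` flags of all parallel modules and then drives the same `start` decision back to each of them. The function name and signal names are hypothetical.

```python
def sync_controller(done_flags):
    """Reduce all `done` signals to one bit, then broadcast `start`."""
    all_done = all(done_flags)           # reduce: an N-input AND
    return [all_done] * len(done_flags)  # broadcast: one source, N sinks

print(sync_controller([True, True, False]))
print(sync_controller([True, True, True]))
```

Both halves scale with the number of parallel modules: the reduction gathers signals from across the design, and the broadcast sends one back out to every module.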

SLIDE 25

Summary of Broadcast Types

  • Data Broadcast
      • Loop unrolling: loop-invariant variables are broadcast
      • Large buffer: a logical buffer becomes scattered physical memory units
      • Both lead to incorrect delay prediction -> poor register insertion
  • Control Broadcast
      • Pipeline control: backpressure signals are broadcast to the whole datapath
      • Synchronization control: guarantees the correctness of concurrent execution
      • Control-signal broadcasts do not scale -> they break down for large designs
SLIDE 28

Broadcast-Aware Scheduling

  • Isolate the broadcast skeletons and measure their delay
  • The additional delay serves as a conservative calibration

[Figure: a broadcast skeleton whose delay is measured]
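One way to picture the calibration step is a lookup table of measured skeleton delays that is added to the operator delay before deciding how many clock cycles a path needs. The table values, the clock period, and the function names below are all hypothetical; the slides do not give the actual measurements or the scheduler interface.

```python
import math

# Hypothetical skeleton measurements in ns, keyed by fan-out.
SKELETON_DELAY_NS = {1: 0.0, 8: 0.3, 64: 0.8, 512: 1.9}

def broadcast_delay(fanout, table=SKELETON_DELAY_NS):
    """Conservative lookup: use the smallest measured fan-out >= the request."""
    for measured in sorted(table):
        if fanout <= measured:
            return table[measured]
    raise ValueError("fan-out exceeds the measured range")

def cycles_needed(op_delay_ns, fanout, clock_ns):
    """Clock cycles to allot once the broadcast delay is included."""
    calibrated = op_delay_ns + broadcast_delay(fanout)
    return max(1, math.ceil(calibrated / clock_ns))

print(cycles_needed(2.5, 1, 3.0))   # no broadcast
print(cycles_needed(2.5, 64, 3.0))  # same operator fed by a 64-way broadcast
```

Rounding the fan-out up to the next measured point keeps the estimate on the pessimistic side, in the spirit of the "conservative calibration" named on the slide.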

SLIDE 30

Broadcast-Aware Scheduling

  • Example: a genome sequencing accelerator design
  • Broadcast elements to 64 datapaths

0.78 ns

SLIDE 33

Broadcast-Aware Scheduling

[Figures: delay of the aforementioned path; overall frequency improvements]

SLIDE 34

Skid-Buffer-Based Pipeline Control

  • Adopt skid buffer for flow control

# item <= 1
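A skid buffer replaces the global stall with a local decision at the pipeline input. The sketch below is a cycle-level model under stated assumptions: the pipeline always advances, the input gate only looks at the buffer occupancy seen at the start of the cycle, and the buffer is sized as `depth + threshold + 1` so that everything already in flight still fits. The threshold and sizing rule are illustrative choices, not values taken from the slides.

```python
from collections import deque

def run_skid_pipeline(items, depth, ready_pattern, threshold=1):
    """Simulate a never-stalling pipeline followed by a skid buffer.

    ready_pattern[c] says whether downstream accepts data in cycle c
    (treated as always ready once the pattern runs out).
    Returns (outputs, peak_occupancy, capacity).
    """
    pipe = [None] * depth
    skid = deque()
    capacity = depth + threshold + 1  # assumption: room for all in-flight items
    pending = deque(items)
    out, peak, cycle = [], 0, 0
    while pending or skid or any(s is not None for s in pipe):
        ready = ready_pattern[cycle] if cycle < len(ready_pattern) else True
        seen = len(skid)              # occupancy visible to the input gate
        if ready and skid:
            out.append(skid.popleft())
        tail = pipe[-1]
        pipe = [None] + pipe[:-1]     # every stage advances, no stall signal
        if tail is not None:
            skid.append(tail)
        if pending and seen <= threshold:
            pipe[0] = pending.popleft()  # admit input only while buffer is low
        peak = max(peak, len(skid))
        assert len(skid) <= capacity, "skid buffer overflow"
        cycle += 1
    return out, peak, capacity

out, peak, cap = run_skid_pipeline(list(range(20)), 4, [False] * 15)
print(out == list(range(20)), peak, cap)
```

No signal in this model reaches every stage register: backpressure only touches the input gate, and data in flight drains into the buffer instead of being frozen in place.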

SLIDE 39

Skid-Buffer-Based Pipeline Control

  • Buffer width equals that of the pipeline output
  • Different pipeline stages have different output widths
  • Dynamic programming is used to minimize the area overhead

# item <= 1 # item <= 1 # item <= 1
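The slides do not spell out the dynamic program, so the following is only a sketch under an assumed cost model: a skid buffer placed after stage `j` has the width of that stage's output and a depth equal to the number of stages it covers plus a fixed number of extra slots. Under that assumption, placing buffers after narrow stages can reduce total area, and a one-dimensional DP over cut positions finds the cheapest placement.

```python
def min_area_skid_buffers(widths, extra_slots=2):
    """Choose where to place skid buffers along a pipeline.

    widths[k] is the output width of stage k+1. A buffer after stage j that
    covers stages i+1..j costs widths[j-1] * ((j - i) + extra_slots) bits
    (assumed cost model). The last stage always gets a buffer.
    Returns (total_bits, cut_positions).
    """
    n = len(widths)
    best = [0] + [float("inf")] * n  # best[j]: min area covering stages 1..j
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + widths[j - 1] * ((j - i) + extra_slots)
            if cost < best[j]:
                best[j], prev[j] = cost, i
    cuts, j = [], n
    while j > 0:                     # walk back through the chosen cuts
        cuts.append(j)
        j = prev[j]
    return best[n], cuts[::-1]

print(min_area_skid_buffers([64, 8, 64, 64]))
```

For the example widths, a single buffer at the end would cost 64 * 6 = 384 bits, while adding a second buffer after the narrow 8-bit stage brings the total down to 288 bits.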

SLIDE 40
Synchronization Logic Pruning

  • Prune away redundant synchronization logic
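The slide does not say which synchronization signals count as redundant. One plausible reading, shown here purely as an assumption, is that when parallel modules start in the same cycle and some have a latency fixed at compile time, the controller only needs to wait for the slowest of those plus any module whose latency is data-dependent.

```python
def prune_sync(latencies):
    """Pick the modules whose `done` signal must still be observed.

    latencies maps a module name to a fixed cycle count, or to None when
    the latency is data-dependent. Assumes all modules start together.
    """
    unknown = sorted(m for m, lat in latencies.items() if lat is None)
    fixed = {m: lat for m, lat in latencies.items() if lat is not None}
    keep = list(unknown)
    if fixed:
        keep.append(max(fixed, key=fixed.get))  # slowest fixed-latency module
    return keep

print(prune_sync({"load": 10, "compute": 25, "store": 7}))
print(prune_sync({"load": 10, "compute": None, "store": 25}))
```

Every module dropped from the list removes one input from the reduction, which shortens the reduce half of the reduce-then-broadcast structure.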

SLIDE 41
Experiment Results

  • More than 50% frequency improvement on our benchmarks
  • For more details, please check our paper :)
  • https://github.com/Licheng-Guo/vivado-hls-broadcast-optimization

SLIDE 42
  • We classify and analyze the common types of broadcasts in HLS
  • We propose three methods:
      • Delay model calibration to optimize data broadcasts
      • Minimum-area skid buffers to optimize pipeline control
      • Synchronization pruning to optimize synchronization broadcasts
  • We achieve a frequency gain of over 50% on well-optimized designs
  • https://github.com/Licheng-Guo/vivado-hls-broadcast-optimization
