LogCA: A High-Level Performance Model for Hardware Accelerators

SLIDE 1

LogCA: A High-Level Performance Model for Hardware Accelerators

Muhammad Shoaib Bin Altaf* David A. Wood University of Wisconsin-Madison

Everything should be made as simple as possible, but not simpler—Albert Einstein

*Now at AMD Research, Austin TX

SLIDE 2

Executive Summary

  • Accelerators do not always perform as expected
  • Crucial for programmers and architects to understand the factors that affect performance
  • Simple analytical models are beneficial early in the design stage
  • Our proposal: LogCA

– High-level performance model
– Helps identify design bottlenecks and possible optimizations

  • Validated across a variety of on-chip and off-chip accelerators
  • Two retrospective case studies demonstrate the usefulness of the model

SLIDE 3

Outline

  • Motivation
  • LogCA
  • Results
  • Conclusion

SLIDE 4

Why Do We Need a Model?

“An accelerator is a separate architectural substructure ... that is architected using a different set of objectives than the base processor, ... the accelerator is tuned to provide HIGHER PERFORMANCE ... than with the general-purpose base hardware”

S. Patel and W. Hwu. Accelerator Architectures. IEEE Micro, 2008.

[Figures: M7: Next Generation SPARC (Hot Chips 26, 2014); POWER8 (Hot Chips 25, 2013)]

SLIDE 5

Why a Model?

[Figure: Encryption algorithm on UltraSPARC T2. Time (ms) versus block size (bytes) for the host and the accelerator. Below the break-even point the host outperforms; above it the accelerator outperforms.]

Amdahl’s Law for Accelerators

SLIDE 6

Why a Model?

[Figure: Advanced Encryption Standard (AES). Speedup versus offloaded data (bytes) for UltraSPARC T2, SPARC T4, and GPU, with break-even points marked.]

Running the same kernel, accelerators can have different break-even points

SLIDE 7

Outline

  • Motivation
  • LogCA
  • Results
  • Conclusion

SLIDE 8

The Performance Model

  • Inspired by LogP [CACM 1996]
  • Abstracts an accelerator using five parameters

– L (Latency): Cycles to move data
– o (Overhead): Setup cost
– g (Granularity): Size of the off-loaded data
– C (Computational index): Amount of work done per byte of data
– A (Acceleration): Speedup ignoring overheads

  • Sixth parameter $\beta$ generalizes to kernels with non-linear complexity

[Figure: Host and accelerator connected through an interface.]

SLIDE 9

The Performance Model

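  • Execution without an accelerator

– $T_0(g) = C_0(g)$

  • Execution with one accelerator

– $T_1(g) = o_1(g) + L_1(g) + C_1(g)$, where $C_1(g) = C_0(g)/A$

[Figure: Timeline across host, interface, and accelerator. The unaccelerated time is $T_0(g) = C_0(g)$; the accelerated time $T_1(g)$ is made up of $o_1(g)$, $L_1(g)$, and $C_1(g)$; the difference between the two is the gain.]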
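The two equations above can be turned into a small calculator. The sketch below is illustrative only and rests on assumptions the slides do not spell out: host compute time is modeled as `C * g**beta`, overhead `o` is constant, and latency is either constant or grows as `L * g`. The function name and the `latency_scales` flag are hypothetical.

```python
def speedup(g, L, o, C, A, beta=1.0, latency_scales=False):
    """Speedup(g) = T0(g) / T1(g) for one accelerator.

    Assumed parametric forms (not stated on the slides):
      C0(g) = C * g**beta            host compute time
      C1(g) = C0(g) / A              accelerator compute time
      L1(g) = L * g or L             latency, with or without dependence on g
      o1(g) = o                      constant setup overhead
    """
    if g <= 0 or A <= 0:
        raise ValueError("g and A must be positive")
    host_compute = C * g ** beta                  # T0(g) = C0(g)
    latency = L * g if latency_scales else L      # L1(g)
    accelerated = o + latency + host_compute / A  # T1(g)
    return host_compute / accelerated
```

Calling `speedup(g, L=10, o=100, C=5, A=20)` over a range of `g` values (the numbers are made up) traces a curve that starts well below 1 and rises toward `A`.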

SLIDE 10

Granularity-independent latency

  • Captures the effect of granularity on speedup
  • Speedup bounded by acceleration

– $\lim_{g \to \infty} \mathrm{Speedup}(g) = A$

  • Overheads dominate at smaller granularities

– $\mathrm{Speedup}(g)\big|_{g=1} = \dfrac{C}{o + L + C/A} < \dfrac{C}{o + L}$

[Figure: Speedup(g) versus granularity (bytes), with the bound A marked. Amdahl’s law for accelerators.]
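Both results follow from $T_0(g) = C_0(g)$ and $T_1(g) = o_1(g) + L_1(g) + C_1(g)$ once concrete forms are chosen for the terms. The derivation below is a sketch under assumptions not stated on the slides: $C_0(g) = C g^{\beta}$ with $\beta > 0$, $C_1(g) = C_0(g)/A$, and constant $o$ and $L$.

```latex
\begin{align*}
\mathrm{Speedup}(g) &= \frac{T_0(g)}{T_1(g)}
  = \frac{C g^{\beta}}{o + L + \dfrac{C g^{\beta}}{A}}
  = \frac{1}{\dfrac{o + L}{C g^{\beta}} + \dfrac{1}{A}} \\[4pt]
\lim_{g \to \infty} \mathrm{Speedup}(g) &= \frac{1}{0 + \dfrac{1}{A}} = A \\[4pt]
\mathrm{Speedup}(1) &= \frac{C}{o + L + \dfrac{C}{A}} < \frac{C}{o + L}
\end{align*}
```

The first term in the denominator vanishes as $g$ grows, which is why the fixed costs $o$ and $L$ matter only at small granularities.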

SLIDE 11

Performance Metrics

  • Right amount of off-loaded data?
  • Inspired by vector machine metrics $N_v$, $N_{1/2}$
  • $g_1$: Granularity for a speedup of 1

– $g_1$ is essentially independent of acceleration
– Identifies the complexity of the interface

  • $g_{A/2}$: Granularity for a speedup of $A/2$

– Increasing A also increases $g_{A/2}$

[Figure: Speedup versus granularity (bytes), marking $g_1$, $g_{A/2}$, and the bound A. A simple interface gives a small $g_1$; a complex interface gives a large $g_1$.]
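Both metrics have closed forms if one assumes constant overhead and latency and a host compute time of `C * g**beta`. None of these forms appears on the slides, and the function names are hypothetical. Solving Speedup(g) = S for g gives `g**beta = S * (o + L) / (C * (1 - S / A))`, which the sketch below evaluates at S = 1 and S = A/2.

```python
def speedup(g, L, o, C, A, beta=1.0):
    """Speedup with constant overhead o and constant latency L (assumed forms)."""
    host_compute = C * g ** beta
    return host_compute / (o + L + host_compute / A)


def g_1(L, o, C, A, beta=1.0):
    """Granularity at which speedup reaches 1 (break-even)."""
    if A <= 1:
        raise ValueError("a speedup of 1 is unreachable unless A > 1")
    return (A * (o + L) / (C * (A - 1))) ** (1.0 / beta)


def g_half(L, o, C, A, beta=1.0):
    """Granularity at which speedup reaches A / 2."""
    return (A * (o + L) / C) ** (1.0 / beta)
```

With made-up values L=10, o=100, C=5, raising A from 20 to 200 moves `g_1` only from about 23 to about 22 bytes, while `g_half` grows from 440 to 4400 bytes. This matches the two sub-bullets: `g_1` tracks the interface cost `(o + L) / C`, and `g_half` scales with A.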

SLIDE 12

Granularity-dependent latency

  • Speedup bounded by computational intensity C/L

– $\lim_{g \to \infty} \mathrm{Speedup}(g) < \dfrac{C}{L}$ (linear algorithms)

  • Speedup for sub-linear algorithms asymptotically decreases as granularity increases

[Figure: Two plots of Speedup(g) versus granularity (bytes), marking the bounds A and C/L and labeling the linear and sub-linear cases.]
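A quick numeric check makes both bullets concrete. The sketch assumes latency of the form `L * g`, host compute time `C * g**beta`, and constant overhead; the parameter values are made up for illustration.

```python
def speedup(g, L, o, C, A, beta=1.0):
    """Speedup when latency grows linearly with granularity (assumed form L * g)."""
    host_compute = C * g ** beta
    return host_compute / (o + L * g + host_compute / A)


# Made-up parameters with computational intensity C / L = 5.
L, o, C, A = 2.0, 100.0, 10.0, 20.0

for g in (1e2, 1e4, 1e6, 1e8):
    linear = speedup(g, L, o, C, A, beta=1.0)      # levels off below C / L
    sublinear = speedup(g, L, o, C, A, beta=0.5)   # shrinks as g grows
    print(f"g={g:.0e}  linear={linear:.4f}  sub-linear={sublinear:.6f}")
```

In the linear case the speedup settles at `1 / (L / C + 1 / A)`, which lies below both `C / L` and `A`. In the sub-linear case the `L * g` term eventually outgrows the compute term, so the speedup falls toward zero.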

SLIDE 13

Granularity-dependent latency

  • Computational intensity must be greater than 1 to achieve any speedup
  • Computational intensity should be greater than the peak acceleration A to achieve a speedup of A/2

[Figure: Speedup versus granularity (bytes), marking $g_1$ at a speedup of 1 and $g_{A/2}$ at a speedup of A/2, with the conditions $C/L \ge 1$ and $C/L \ge A$.]

Performance metrics help programmers early in the design cycle
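The two thresholds can be recovered from the asymptotic speedup. The derivation below is a sketch for a linear kernel and assumes $C_0(g) = C g$, $L_1(g) = L g$, constant $o$, and $A > 1$; these forms are not given on the slides.

```latex
\begin{align*}
\lim_{g \to \infty} \mathrm{Speedup}(g)
  &= \lim_{g \to \infty} \frac{C g}{o + L g + \dfrac{C g}{A}}
   = \frac{1}{\dfrac{L}{C} + \dfrac{1}{A}} \\[4pt]
\frac{1}{\dfrac{L}{C} + \dfrac{1}{A}} > 1
  &\iff \frac{L}{C} < 1 - \frac{1}{A}
  \implies \frac{C}{L} > 1 \\[4pt]
\frac{1}{\dfrac{L}{C} + \dfrac{1}{A}} \ge \frac{A}{2}
  &\iff \frac{L}{C} \le \frac{1}{A}
  \iff \frac{C}{L} \ge A
\end{align*}
```

Under these assumptions $g_1$ exists only if $C/L$ exceeds 1, and $g_{A/2}$ is reachable only once $C/L$ is at least $A$.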

SLIDE 14

Bottleneck Analysis using LogCA

[Figure: Speedup versus granularity (bytes) for the baseline LogCA curve and four variants: L_0.1x, o_0.1x, C_10x, and A_10x. Plot regions are annotated with parameter labels (A, oC, oCA) and the bounds A and C/L.]

  • 10x change in a parameter ⇒ 20% performance gain
  • Helps focus on performance bottlenecks
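One way to read the rule of thumb above is that a parameter counts as a bottleneck at a given granularity when changing it by 10x improves speedup by at least 20%. That reading is an interpretation of the slide's shorthand. The parametric forms, the function names, and the numbers in the sketch below are likewise assumptions.

```python
def speedup(g, L, o, C, A, beta=1.0):
    """Speedup with constant overhead and latency (assumed forms)."""
    host_compute = C * g ** beta
    return host_compute / (o + L + host_compute / A)


def bottlenecks(g, params, threshold=0.20):
    """Return the parameters whose 10x change lifts speedup by >= threshold."""
    # Costs (L, o) are cut to a tenth; C and A are raised tenfold,
    # mirroring the L_0.1x, o_0.1x, C_10x, A_10x curves.
    factors = {"L": 0.1, "o": 0.1, "C": 10.0, "A": 10.0}
    base = speedup(g, **params)
    found = []
    for name, factor in factors.items():
        changed = dict(params)
        changed[name] = params[name] * factor
        gain = speedup(g, **changed) / base - 1.0
        if gain >= threshold:
            found.append(name)
    return found


params = {"L": 10.0, "o": 100.0, "C": 5.0, "A": 20.0}  # made-up values
small = bottlenecks(1, params)      # overhead-dominated regime
large = bottlenecks(1e6, params)    # acceleration-bound regime
```

For these values, `small` comes out as `["o", "C"]` and `large` as `["A"]`: at small granularity only a lower overhead or more work per byte helps, and at large granularity only a faster accelerator does.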

SLIDE 15

Outline

  • Motivation
  • LogCA
  • Results
  • Conclusion

SLIDE 16

Experimental Methodology

  • Fixed-function and general-purpose accelerators

– Cryptographic accelerators on SPARC architectures
– Discrete and integrated GPUs

  • Kernels with varying complexities

– Encryption, Hashing, Matrix Multiplication, FFT, Search, Radix Sort

  • Retrospective case studies

– Cryptographic interface in SPARC architectures
– Memory interface in GPUs

SLIDE 17

Case Study I: Cryptographic Interface in the SPARC Architecture

PCIe Crypto Accelerator, UltraSPARC T2, SPARC T3, SPARC T4 engine, SPARC T4 instructions

SLIDE 18

Conclusion

  • Simple models are effective in predicting the performance of accelerators
  • Proposed a high-level performance model for hardware accelerators
  • These models help programmers and architects visually identify bottlenecks and suggest optimizations
  • Performance metrics help programmers decide the right amount of offloaded data
  • Limitations include the inability to model resource contention, caches, and irregular memory access patterns

SLIDE 19

Questions?

Source: http://www.medarcade.com/