LogCA: A High-Level Performance Model for Hardware Accelerators

Apr 23, 2023

SLIDE 1

LogCA: A High-Level Performance Model for Hardware Accelerators

Muhammad Shoaib Bin Altaf* David A. Wood University of Wisconsin-Madison

Everything should be made as simple as possible, but not simpler—Albert Einstein

*Now at AMD Research, Austin TX

SLIDE 2

Executive Summary

Accelerators do not always perform as expected
Crucial for programmers and architects to understand the factors that affect performance
Simple analytical models beneficial early in the design stage
Our proposal: LogCA

– High-level performance model
– Help identify design bottlenecks and possible optimizations

Validation across a variety of on-chip and off-chip accelerators
Two retrospective case studies demonstrate the usefulness of the model

SLIDE 3

Outline

Motivation
LogCA
Results
Conclusion

SLIDE 4

Why Do We Need a Model?

“An accelerator is a separate architectural substructure ... that is architected using a different set of objectives than the base processor, ...., the accelerator is tuned to provide HIGHER PERFORMANCE ….. than with the general-purpose base hardware”

S. Patel and W. Hwu. Accelerator Architectures. IEEE Micro, 2008

[Figures: M7: Next Generation SPARC (Hot Chips 26, 2014); Power8 (Hot Chips 25, 2013)]

SLIDE 5

Why a Model?

[Figure: Encryption algorithm on UltraSPARC T2. Time (ms) vs. block size (bytes) for host and accelerator, with the break-even point separating the regions labeled "Accelerator outperforms" and "Host outperforms".]

Amdahl’s Law for Accelerators

SLIDE 6

Why a Model?

[Figure: Advanced Encryption Standard (AES). Speedup vs. off-loaded data (bytes) for UltraSPARC T2, SPARC T4, and GPU, with break-even points marked.]

Running the same kernel, accelerators can have different break-even points

SLIDE 7

Outline

Motivation
LogCA
Results
Conclusion

SLIDE 8

The Performance Model

Inspired by LogP [CACM 1996]
Abstract accelerator using five parameters

– L Latency: Cycles to move data
– o Overhead: Setup cost
– g Granularity: Size of the off-loaded data
– C Computational index: Amount of work done per byte of data
– A Acceleration: Speedup ignoring overheads

Sixth parameter β generalizes to kernels with non-linear complexity

[Figure: block diagram labeled "Host", "Accelerator", and "Interface"]

SLIDE 9

The Performance Model

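Execution w/o an accelerator

– $T_0(g) = C_0(g)$

Execution with one accelerator

– $T_1(g) = o_1(g) + L_1(g) + C_1(g)$, where $C_1(g) = C_0(g) / A$

[Figure: timeline comparing $T_0(g) = C_0(g)$ on the host with $T_1(g) = o_1(g) + L_1(g) + C_1(g)$ across host, interface, and accelerator; the difference is labeled "Gain".]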
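The two execution-time equations on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes $C_0(g) = C \cdot g^{\beta}$ (the computational index times the granularity raised to the complexity exponent) and treats overhead and latency as constants in $g$; the function names and the sample parameter values are made up for the example.

```python
def host_time(g, C, beta=1.0):
    """T0(g) = C0(g): execution time on the host alone.

    Assumption: C0(g) = C * g**beta, with beta = 1 for a linear kernel.
    """
    return C * g ** beta


def accelerator_time(g, L, o, C, A, beta=1.0):
    """T1(g) = o1(g) + L1(g) + C1(g), with C1(g) = C0(g) / A.

    Assumption: overhead o and latency L do not depend on g here.
    """
    compute = host_time(g, C, beta) / A
    return o + L + compute


def speedup(g, L, o, C, A, beta=1.0):
    """Speedup(g) = T0(g) / T1(g)."""
    return host_time(g, C, beta) / accelerator_time(g, L, o, C, A, beta)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the trend.
    params = dict(L=5.0, o=10.0, C=2.0, A=4.0)
    for g in (1, 16, 256, 4096):
        print(g, round(speedup(g, **params), 3))
```

With these illustrative values the speedup is below 1 at small granularities and climbs toward A as g grows, which is the shape the later slides analyze.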

SLIDE 10

Granularity independent latency

Captures the effect of granularity on speedup
Speedup bounded by acceleration

– $\lim_{g \to \infty} \text{Speedup}(g) = A$

Overheads dominate at smaller granularities

– $\text{Speedup}(g)\big|_{g=1} = \dfrac{C}{o + L + \frac{C}{A}} < \dfrac{C}{o + L}$

[Figure: Speedup(g) vs. granularity (bytes)]

Amdahl’s law for Accelerators
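Both bounds on this slide can be checked numerically. The sketch below assumes a linear kernel ($C_0(g) = C \cdot g$) with constant overhead and latency; the parameter values are hypothetical.

```python
def speedup(g, L, o, C, A):
    """Speedup for a linear kernel with granularity-independent latency.

    Assumption: T0(g) = C * g and T1(g) = o + L + C * g / A.
    """
    return (C * g) / (o + L + C * g / A)


if __name__ == "__main__":
    L, o, C, A = 5.0, 10.0, 2.0, 4.0  # hypothetical values

    # At g = 1 the overheads dominate: speedup stays below C / (o + L).
    print("g = 1:", speedup(1, L, o, C, A), "bound:", C / (o + L))

    # At large g the speedup approaches, but never exceeds, A.
    for g in (1e3, 1e6, 1e9):
        print("g =", g, "speedup:", speedup(g, L, o, C, A), "bound:", A)
```

The fixed cost o + L is paid once per offload, so its share of the total time shrinks as g grows. This is the same argument as Amdahl's law, with the offload cost playing the role of the serial fraction.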

SLIDE 11

Performance Metrics

Right amount of off-loaded data?
Inspired by vector machine metrics $N_v$, $N_{1/2}$

$g_1$: Granularity for a speedup of 1

– $g_1$ is essentially independent of acceleration
– Identify complexity of the interface

$g_{A/2}$: Granularity for a speedup of $A/2$

– Increasing A also increases $g_{A/2}$

[Figure: Speedup vs. granularity (bytes), marking $A$, $g_1$, and $g_{A/2}$. Labels: "Simple Interface", "Complex Interface", "Large $g_1$", "Small $g_1$".]
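For a linear kernel with constant overhead and latency, both metrics have closed forms. Solving Speedup(g) = 1 and Speedup(g) = A/2 gives $g_1 = A(o+L) / (C(A-1))$ and $g_{A/2} = A(o+L)/C$. The sketch below is derived under those assumptions, not taken from the slides, and uses hypothetical parameter values.

```python
def speedup(g, L, o, C, A):
    """Assumed model: linear kernel, granularity-independent latency."""
    return (C * g) / (o + L + C * g / A)


def g_1(L, o, C, A):
    """Granularity at which the accelerator breaks even (speedup of 1)."""
    if A <= 1:
        raise ValueError("no break-even point exists when A <= 1")
    return A * (o + L) / (C * (A - 1))


def g_half(L, o, C, A):
    """Granularity at which the speedup reaches A / 2."""
    return A * (o + L) / C


if __name__ == "__main__":
    L, o, C = 5.0, 10.0, 2.0  # hypothetical values
    for A in (4.0, 40.0, 400.0):
        print("A =", A, "g_1 =", g_1(L, o, C, A), "g_A/2 =", g_half(L, o, C, A))
```

Under these assumptions $g_1$ tends to $(o+L)/C$ as A grows, so it mostly reflects the interface cost, while $g_{A/2}$ grows in proportion to A. Both observations match the bullets on this slide.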

SLIDE 12

Granularity dependent latency

Speedup bounded by computational intensity C/L

– $\lim_{g \to \infty} \text{Speedup}(g) < \dfrac{C}{L}$ (linear algorithms)

Speedup for sub-linear algorithms asymptotically decreases as granularity increases

[Figures: Speedup(g) vs. granularity (bytes). One panel marks the bounds $A$ and $C/L$; the other compares curves labeled "Sub-linearly" and "Linearly" against $A$.]
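The two claims on this slide can be illustrated with a short sketch. It assumes latency grows linearly with granularity ($L_1(g) = L \cdot g$), overhead is constant, and $C_0(g) = C \cdot g^{\beta}$; these forms and the parameter values are illustrative assumptions.

```python
def speedup(g, L, o, C, A, beta=1.0):
    """Speedup with granularity-dependent latency.

    Assumptions: T0(g) = C * g**beta and T1(g) = o + L * g + T0(g) / A.
    """
    host = C * g ** beta
    return host / (o + L * g + host / A)


if __name__ == "__main__":
    L, o, C, A = 0.5, 10.0, 2.0, 4.0  # hypothetical values

    # Linear kernel: the speedup levels off at C / (L + C / A),
    # which is below both C / L and A.
    print("limit:", C / (L + C / A), "C/L:", C / L, "A:", A)
    for g in (1e2, 1e4, 1e6):
        print("linear     g =", g, speedup(g, L, o, C, A))

    # Sub-linear kernel (beta < 1): data movement outgrows the work,
    # so the speedup eventually falls as g increases.
    for g in (1e2, 1e4, 1e6):
        print("sub-linear g =", g, speedup(g, L, o, C, A, beta=0.5))
```

With latency proportional to g, moving the data costs as much per byte as it does at small sizes, so a kernel whose work grows more slowly than its data loses ground at large granularities.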

SLIDE 13

Granularity dependent latency

Computational intensity must be greater than 1 to achieve any speedup

Computational intensity should be greater than peak performance to achieve A/2

[Figure: Speedup vs. granularity (bytes), marking $g_1$ at a speedup of 1 and $g_{A/2}$ at a speedup of $A/2$, with annotations $C/L \geq 1$ and $C/L \geq A$.]

Performance metrics help programmers early in the design cycle
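A short derivation shows where the two thresholds come from. It is a sketch under stated assumptions, not the authors' derivation: a linear kernel, $C_0(g) = C g$, latency proportional to granularity, $L_1(g) = L g$, and a constant overhead $o$.

```latex
\[
\lim_{g \to \infty} \text{Speedup}(g)
  = \lim_{g \to \infty} \frac{C g}{o + L g + \frac{C g}{A}}
  = \frac{C}{L + \frac{C}{A}}
\]
% Condition for any speedup (limit of at least 1):
\[
\frac{C}{L + \frac{C}{A}} \geq 1
  \iff C \left(1 - \frac{1}{A}\right) \geq L
  \iff \frac{C}{L} \geq \frac{A}{A - 1} > 1
\]
% Condition for reaching half the peak (limit of at least A/2):
\[
\frac{C}{L + \frac{C}{A}} \geq \frac{A}{2}
  \iff 2C \geq A L + C
  \iff \frac{C}{L} \geq A
\]
```

These are conditions on the asymptote. At any finite granularity the constant overhead $o$ lowers the speedup further, so the inequalities are necessary rather than sufficient.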

SLIDE 14

Bottleneck Analysis using LogCA

[Figure: Speedup vs. granularity (bytes), bounded by $A$ and $C/L$. Curves: LogCA, L_0.1x, o_0.1x, C_10x, A_10x.]

10X change in parameter → 20% performance gain
Helps focus on performance bottlenecks
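One way to read the rule on this slide: improve each parameter by 10x in turn and flag it as a bottleneck if the speedup rises by at least 20%. The sketch below follows that reading. The latency model (L scaled by g), the choice of which direction counts as an improvement, and the parameter values are all assumptions for illustration.

```python
# 10x improvement per parameter, mirroring the curve labels
# L_0.1x, o_0.1x, C_10x, A_10x.
SCALES = {"L": 0.1, "o": 0.1, "C": 10.0, "A": 10.0}


def speedup(g, L, o, C, A):
    """Assumed model: linear kernel, latency proportional to g."""
    host = C * g
    return host / (o + L * g + host / A)


def bottlenecks(g, params, threshold=0.20):
    """Return the parameters whose 10x change gains at least `threshold`."""
    base = speedup(g, **params)
    found = []
    for name, factor in SCALES.items():
        changed = dict(params, **{name: params[name] * factor})
        gain = speedup(g, **changed) / base - 1.0
        if gain >= threshold:
            found.append(name)
    return found


if __name__ == "__main__":
    params = dict(L=0.5, o=10.0, C=2.0, A=4.0)  # hypothetical values
    for g in (1, 1_000_000):
        print("g =", g, "bottlenecks:", bottlenecks(g, params))
```

With these values the overhead shows up as a bottleneck at small granularities and stops mattering at large ones, where latency and acceleration take over.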

SLIDE 15

Outline

Motivation
LogCA
Results
Conclusion

SLIDE 16

Experimental Methodology

Fixed-function and general-purpose accelerators

– Cryptographic accelerators on SPARC architectures
– Discrete and integrated GPUs

Kernels with varying complexities

– Encryption, Hashing, Matrix Multiplication, FFT, Search, Radix Sort

Retrospective case studies

– Cryptographic interface in SPARC architectures
– Memory interface in GPUs

SLIDE 17

Case Study I: Cryptographic Interface in the SPARC Architecture

PCIe Crypto Accelerator, UltraSPARC T2, SPARC T3, SPARC T4 engine, SPARC T4 instructions

SLIDE 18

Conclusion

Simple models effective in predicting performance of accelerators
Proposed a high-level performance model for hardware accelerators
These models help programmers and architects visually identify bottlenecks and suggest optimizations
Performance metrics help programmers decide the right amount of off-loaded data
Limitations include inability to model resource contention, caches, and irregular memory access patterns

SLIDE 19

Questions?

Source: http://www.medarcade.com/