LogCA: A High-Level Performance Model for Hardware Accelerators


  1. "Everything should be made as simple as possible, but not simpler" (Albert Einstein) • LogCA: A High-Level Performance Model for Hardware Accelerators • Muhammad Shoaib Bin Altaf* and David A. Wood, University of Wisconsin-Madison • *Now at AMD Research, Austin, TX

  2. Executive Summary • Accelerators do not always perform as expected • It is crucial for programmers and architects to understand the factors that affect performance • Simple analytical models are beneficial early in the design stage • Our proposal: LogCA – A high-level performance model – Helps identify design bottlenecks and possible optimizations • Validated across a variety of on-chip and off-chip accelerators • Two retrospective case studies demonstrate the usefulness of the model

  3. Outline • Motivation • LogCA • Results • Conclusion

  4. Why Do We Need a Model? • “An accelerator is a separate architectural substructure ... that is architected using a different set of objectives than the base processor, ...., the accelerator is tuned to provide HIGHER PERFORMANCE ….. than with the general-purpose base hardware” • M7: Next Generation SPARC, Hot Chips 26, 2014 • Power8, Hot Chips 25, 2013 • S. Patel and W. Hwu, Accelerator Architectures, IEEE Micro, 2008

  5. Why a Model? • Encryption algorithm on UltraSPARC T2 • [Figure: execution time (ms) vs. block size (bytes) for the host and the accelerator; a break-even point separates the region where the host outperforms from the region where the accelerator outperforms] • Amdahl’s Law for Accelerators

  6. Why a Model? • Advanced Encryption Standard (AES) • [Figure: speedup vs. offloaded data (bytes) for UltraSPARC T2, SPARC T4, and GPU, with the break-even points marked] • Running the same kernel, accelerators can have different break-even points

  7. Outline • Motivation • LogCA • Results • Conclusion

  8. The Performance Model • Inspired by LogP [CACM 1996] • Abstracts an accelerator using five parameters – L (Latency): cycles to move data – o (Overhead): setup cost – g (Granularity): size of the off-loaded data – C (Computational index): amount of work done per byte of data – A (Acceleration): speedup ignoring overheads • A sixth parameter, β, generalizes the model to kernels with non-linear complexity • [Figure: host and accelerator connected through an interface]

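  9. The Performance Model • Execution without an accelerator – $T_0(g) = C_0(g)$ • Execution with one accelerator – $T_1(g) = o_1(g) + L_1(g) + C_1(g)$, where $C_1(g) = C_0(g)/A$ • [Figure: timeline across host, interface, and accelerator comparing $T_0(g) = C_0(g)$ with $T_1(g)$, which is made up of $o_1(g)$, $L_1(g)$, and $C_1(g)$; the difference between the two is the gain]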
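
The two execution-time equations on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes a constant overhead `o`, a constant latency `L`, and host work of the form `C * g**exponent`, where `exponent` is a hypothetical name for the sixth (complexity) parameter. The parameter values in the usage loop are made up.

```python
def host_time(g, C, exponent=1.0):
    """T0(g) = C0(g): time on the host alone (assumed form: C * g**exponent)."""
    return C * g ** exponent


def accelerated_time(g, C, o, L, A, exponent=1.0):
    """T1(g) = o1(g) + L1(g) + C1(g), with C1(g) = C0(g) / A.

    Assumption: overhead o and latency L do not depend on g here.
    """
    return o + L + host_time(g, C, exponent) / A


def speedup(g, C, o, L, A, exponent=1.0):
    """Speedup(g) = T0(g) / T1(g)."""
    return host_time(g, C, exponent) / accelerated_time(g, C, o, L, A, exponent)


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from the talk.
    params = dict(C=10.0, o=100.0, L=50.0, A=20.0)
    for g in (1, 16, 256, 4096, 65536):
        print(f"g={g:>6}  speedup={speedup(g, **params):.3f}")
```

With these made-up values the speedup is below 1 for small `g` and climbs toward `A` as `g` grows, which is the shape the later slides discuss.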

  10. Granularity independent latency • Captures the effect of granularity on speedup • Speedup is bounded by acceleration – $\lim_{g \to \infty} \mathrm{Speedup}(g) = A$ • Overheads dominate at smaller granularities – $\mathrm{Speedup}(g)\big|_{g=1} = \frac{C}{o + L + \frac{C}{A}} < \frac{C}{o + L}$ • [Figure: speedup vs. granularity (bytes), approaching A] • Amdahl’s law for accelerators
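
Both bullets follow from the execution-time model in a couple of steps. The derivation below is a sketch under stated assumptions: a linear kernel, $C_0(g) = C g$, with overhead $o$ and latency $L$ that do not depend on $g$.

```latex
\begin{align*}
\mathrm{Speedup}(g) &= \frac{T_0(g)}{T_1(g)}
  = \frac{C g}{o + L + \frac{C g}{A}}
  = \frac{A}{1 + \frac{A\,(o + L)}{C g}} \\[4pt]
\lim_{g \to \infty} \mathrm{Speedup}(g) &= A
  \qquad \text{(the overhead term vanishes as } g \text{ grows)} \\[4pt]
\mathrm{Speedup}(1) &= \frac{C}{o + L + \frac{C}{A}} < \frac{C}{o + L}
  \qquad \text{(dropping } \tfrac{C}{A} > 0 \text{ shrinks the denominator)}
\end{align*}
```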

  11. Performance Metrics • What is the right amount of off-loaded data? • Inspired by the vector machine metrics $N_{1/2}$ and $N_v$ • $g_1$: granularity for a speedup of 1 – $g_1$ is essentially independent of acceleration – Identifies the complexity of the interface: a small $g_1$ indicates a simple interface, a large $g_1$ a complex interface • $g_{A/2}$: granularity for a speedup of $A/2$ – Increasing A also increases $g_{A/2}$ • [Figure: speedup vs. granularity (bytes), with $g_1$ and $g_{A/2}$ marked]
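
A small numeric check of the two metrics, as a sketch only. It assumes a linear kernel with constant overhead and latency, so that Speedup(g) = C·g / (o + L + C·g/A); solving Speedup(g) = s for g gives g = s·(o + L) / (C·(1 − s/A)). The function name and the parameter values are illustrative, not taken from the talk.

```python
def granularity_for_speedup(target, C, o, L, A):
    """Granularity g at which Speedup(g) == target.

    Assumes Speedup(g) = C*g / (o + L + C*g/A), i.e. a linear kernel with
    granularity-independent overhead o and latency L.
    """
    if not 0 < target < A:
        raise ValueError("target speedup must lie strictly between 0 and A")
    return target * (o + L) / (C * (1.0 - target / A))


if __name__ == "__main__":
    base = dict(C=10.0, o=100.0, L=50.0)  # made-up values
    for A in (20.0, 200.0):
        g_break_even = granularity_for_speedup(1.0, A=A, **base)
        g_half_peak = granularity_for_speedup(A / 2.0, A=A, **base)
        print(f"A={A:>5}: g_1={g_break_even:.2f}  g_A/2={g_half_peak:.2f}")
```

Under these assumptions the break-even granularity is (o + L)/C scaled by A/(A − 1), so it barely moves when A grows tenfold, while the half-peak granularity equals A·(o + L)/C and grows in proportion to A.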

  12. Granularity dependent latency • Speedup is bounded by computational intensity, C/L – $\lim_{g \to \infty} \mathrm{Speedup}(g) < \frac{C}{L}$ (linear algorithms) • Speedup for sub-linear algorithms asymptotically decreases as granularity increases • [Figures: speedup vs. granularity (bytes) for a linear kernel and for a sub-linear kernel]
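
The contrast between linear and sub-linear kernels can be reproduced numerically. This is a sketch under assumptions: latency is taken to grow as `L * g`, host work as `C * g**exponent` (`exponent` is a hypothetical name for the complexity parameter), and all numbers are made up.

```python
def speedup(g, C, o, L, A, exponent=1.0):
    """Speedup with latency proportional to the offloaded data (L * g)."""
    work = C * g ** exponent           # assumed host time C0(g)
    return work / (o + L * g + work / A)


if __name__ == "__main__":
    params = dict(C=10.0, o=100.0, L=1.0, A=100.0)  # illustrative only
    print(f"C/L bound = {params['C'] / params['L']:.1f}")
    for g in (1e2, 1e4, 1e6, 1e8):
        linear = speedup(g, **params)
        sub_linear = speedup(g, exponent=0.5, **params)
        print(f"g={g:.0e}  linear={linear:.3f}  sub-linear={sub_linear:.4f}")
```

For the linear kernel the speedup levels off below both A and C/L; for the sub-linear kernel the data-movement term eventually outgrows the work, so speedup falls as g increases.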

  13. Granularity dependent latency • Computational intensity must be greater than 1 to achieve any speedup – $\frac{C}{L} \geq 1$ • Computational intensity should be greater than peak performance to achieve A/2 – $\frac{C}{L} \geq A$ • [Figure: speedup vs. granularity (bytes), with $g_1$ and $g_{A/2}$ marked at speedups of 1 and A/2] • Performance metrics help programmers early in the design cycle
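
Both conditions can be checked against the asymptotic speedup. The derivation below is a sketch that assumes a linear kernel, $C_0(g) = C g$, and latency that grows as $L g$; $S_\infty$ is a name introduced here for the limit.

```latex
\begin{align*}
S_\infty &= \lim_{g \to \infty} \frac{C g}{o + L g + \frac{C g}{A}}
  = \frac{C}{L + \frac{C}{A}} \\[4pt]
S_\infty > 1
  &\iff C\left(1 - \tfrac{1}{A}\right) > L
  \iff \frac{C}{L} > \frac{A}{A - 1} > 1 \\[4pt]
S_\infty \geq \frac{A}{2}
  &\iff 2C \geq A L + C
  \iff \frac{C}{L} \geq A
\end{align*}
```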

  14. Bottleneck Analysis using LogCA • A 10x change in a parameter → 20% performance gain • Helps focus on performance bottlenecks • [Figure: speedup vs. granularity (bytes) for the LogCA baseline and the L_0.1x, o_0.1x, C_10x, and A_10x variants, with the A and C/L bounds marked and regions labelled oC, oCA, and A]
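
One way to read the bottleneck test is sketched below. It rests on assumptions that the slide does not spell out: a parameter is flagged at a given granularity when a 10x improvement (L and o scaled by 0.1, C and A scaled by 10, following the figure labels) raises speedup by at least 20%, the kernel is linear, and latency grows as `L * g`. The function names and values are hypothetical.

```python
def speedup(g, C, o, L, A):
    """Linear kernel with latency proportional to g (assumed form)."""
    return (C * g) / (o + L * g + C * g / A)


def bottlenecks(g, params, threshold=0.20):
    """Return the parameters whose 10x improvement gains >= threshold at g."""
    improvement = {"L": 0.1, "o": 0.1, "C": 10.0, "A": 10.0}
    base = speedup(g, **params)
    found = []
    for name, factor in improvement.items():
        trial = dict(params)
        trial[name] *= factor          # apply the 10x change to one parameter
        if speedup(g, **trial) >= (1.0 + threshold) * base:
            found.append(name)
    return found


if __name__ == "__main__":
    params = dict(C=10.0, o=1000.0, L=1.0, A=20.0)  # made-up values
    for g in (1, 1_000, 1_000_000):
        print(f"g={g:>9}: bottlenecks = {bottlenecks(g, params)}")
```

With these values, overhead and computational index are flagged at small granularities, while latency, computational index, and acceleration are flagged at large ones.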

  15. Outline • Motivation • LogCA • Results • Conclusion

  16. Experimental Methodology • Fixed-function and general-purpose accelerators – Cryptographic accelerators on SPARC architectures – Discrete and integrated GPUs • Kernels with varying complexities – Encryption, Hashing, Matrix Multiplication, FFT, Search, Radix Sort • Retrospective case studies – Cryptographic interface in SPARC architectures – Memory interface in GPUs

  17. Case Study I: Cryptographic Interface in the SPARC Architecture • [Figure labels: UltraSPARC T2, PCIe Crypto Accelerator, SPARC T4 engine, SPARC T3, SPARC T4 instructions]

  18. Conclusion • Simple models are effective in predicting the performance of accelerators • Proposed a high-level performance model for hardware accelerators • The model helps programmers and architects visually identify bottlenecks and suggests optimizations • Performance metrics help programmers decide the right amount of offloaded data • Limitations include the inability to model resource contention, caches, and irregular memory access patterns

  19. Questions? Source: http://www.medarcade.com/
