Johnathan Alsop *, Matthew D. Sinclair* , Sarita V. Adve* *Illinois, - PowerPoint PPT Presentation


  1. Johnathan Alsop*, Matthew D. Sinclair*†‡, Sarita V. Adve* (*Illinois, †AMD, ‡Wisconsin). Sponsors: NSF, C-FAR, ADA (JUMP center by SRC, DARPA)

  2. Specialized architectures are increasingly important in all compute domains

  3. Specialization Requires Better Memory Systems
      • Traditional heterogeneity (CPU and accelerator each have their own memory space, with explicit data in / data out): ✗ No fine-grain synchronization, ✗ No irregular access patterns, ✗ Wasteful data movement
      • Shared coherent memory (CPU and accelerators share one coherent memory space): ✓ Fine-grain synchronization, ✓ Irregular access, ✓ Implicit data reuse
      • Existing solutions: complex and inflexible

  4. Heterogeneous devices have diverse memory demands [Figure: radar chart with axes fine-grain synchronization, latency sensitivity, temporal locality, spatial locality, throughput sensitivity]

  5. Heterogeneous devices have diverse memory demands. Typical CPU workloads: fine-grain synchronization, latency sensitive

  6. Heterogeneous devices have diverse memory demands. Typical GPU workloads: spatial locality, throughput sensitive

  7. MESI protocol fits CPU workloads
      Protocol properties (MESI / GPU coherence / DeNovo):
      • Granularity: line / reads: line, writes: word / reads: flexible, writes: word
      • Invalidation: writer-invalidate / self-invalidate / self-invalidate
      • Updates: ownership / write-through / ownership
      MESI trade-offs:
      • Granularity: ✓ Spatial locality, ✗ False sharing
      • Invalidation: ✓ Temporal locality for reads, ✗ Overheads limit throughput
      • Updates: ✓ Temporal locality for writes, ✗ Indirection if low locality
      Good for: MESI → CPU

  8. GPUs prefer simpler protocols
      GPU coherence trade-offs (same property table as slide 7):
      • Granularity: ✓ No false sharing, ✗ Synch limits spatial locality
      • Invalidation: ✓ Simple, scalable, ✗ Synch limits read reuse
      • Updates: ✓ Simple, low overhead, ✗ Synch limits write reuse
      Good for: MESI → CPU; GPU coherence → GPU

  9. DeNovo is a good fit for CPU and GPU
      Same property table as slide 7. Good for: MESI → CPU; GPU coherence → GPU; DeNovo → CPU or GPU

  10. Existing Solutions: Inflexible and Inefficient [Diagram: CPU, GPU, Accel 1 and Accel 2 (?) with MESI, GPU coh. and DeNovo L1 caches, a MESI/GPU coh. hybrid L2, and a MESI LLC] Examples: ARM ACE, IBM CAPI, AMD APU

  11. Existing Solutions: Inflexible and Inefficient. If the glove doesn’t fit… There’s limited benefit! [Diagram: CPU, GPU, Accel 1 and Accel 2 (?) with MESI and GPU coh. L1 caches, a MESI/GPU coh. hybrid L2, and a MESI LLC] Examples: ARM ACE, IBM CAPI, AMD APU

  13. Spandex: Flexible Heterogeneous Coherence Interface [Diagram: CPU, GPU, Accel 1 and Accel 2 (?) with MESI, GPU coh. and DeNovo L1 caches, all connected to Spandex]
      • Adapts to exploit each device’s workload attributes
      • Better performance, lower complexity
      ⇒ Fits like a glove for any heterogeneous system!

  14. Spandex Overview [Diagram: CPU, GPU and Accel (?) with MESI, GPU coh. and DeNovo L1 caches, each behind a TU, connected to the Spandex LLC through the device request interface and the external request interface]
      Key components:
      • Flexible device request interface
      • DeNovo-based LLC
      • External request interface
      Device may need a translation unit (TU)

  16. Device Request Interface
      Action: request (indicates)
      • Read: ReqV (self-invalidation), ReqS (writer-invalidation)
      • Write: ReqWT (write-through), ReqO (ownership only)
      • Read + Write: ReqWT+data (atomic write-through), ReqO+data (ownership + data)
      • Writeback: ReqWB (owned data eviction)
      Requests also specify granularity and (optionally) a bitmask
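
The request table on this slide can be read as a lookup from a device's action and coherence strategy to a Spandex request type. The sketch below is only an illustration of that reading: the function `select_request`, its parameters, and the strategy strings are hypothetical names, not part of the Spandex interface, and granularity and bitmask are left out.

```python
# Lookup table transcribed from the slide: (action, strategy) -> request type.
# The key strings and the function below are illustrative assumptions.
REQUESTS = {
    ("read", "self-invalidate"): "ReqV",
    ("read", "writer-invalidate"): "ReqS",
    ("write", "write-through"): "ReqWT",
    ("write", "ownership"): "ReqO",
    ("read+write", "write-through"): "ReqWT+data",
    ("read+write", "ownership"): "ReqO+data",
}


def select_request(action, invalidation=None, update=None):
    """Pick a request type for an action given the cache's strategies."""
    if action == "writeback":
        # Evicting owned data does not depend on the strategy.
        return "ReqWB"
    # Reads are governed by the invalidation strategy,
    # writes and read+write operations by the update strategy.
    strategy = invalidation if action == "read" else update
    try:
        return REQUESTS[(action, strategy)]
    except KeyError:
        raise ValueError(f"no request for {action!r} with {strategy!r}") from None


# A self-invalidating, write-through cache (the GPU coherence row of slide 7).
print(select_request("read", invalidation="self-invalidate"))
print(select_request("write", update="write-through"))
```

A cache that self-invalidates but obtains ownership for updates (the DeNovo row of the comparison table) would instead pass `update="ownership"` for its writes.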

  18. Spandex LLC
      • States: I, V, O, S
      • Allocation at line granularity
      • Ownership at word granularity
      • Data field tracks owner ID
      • May generate requests to owner/sharer
      ✓ No false sharing
      ✓ Non-blocking ownership transfer
      [Diagram: stores (ST) from L1 caches reach the LLC as ReqWT and ReqO and are answered with RspWT and RspO; an LLC entry has Tag, State, O Mask and Data fields, and the data field holds owner IDs]
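
To make word-granularity ownership concrete, here is a minimal sketch of a single LLC entry, assuming a 4-word line and a line-level state plus a per-word owned mask as in the slide's diagram. The class `LLCLine`, its method names, and the device IDs are hypothetical, and the model ignores memory fetches, sharers, and response messages.

```python
from dataclasses import dataclass, field

WORDS_PER_LINE = 4  # assumption: a small line keeps the example readable


@dataclass
class LLCLine:
    state: str = "I"     # line-level state: "I", "V" or "S"
    owned_mask: int = 0  # bit i set => word i is owned (state O)
    # For owned words the data slot holds the owner ID instead of a value.
    data: list = field(default_factory=lambda: [0] * WORDS_PER_LINE)

    def _previous_owners(self, mask, requester):
        """Owned words in mask whose current owner is another device."""
        return {i: self.data[i] for i in range(WORDS_PER_LINE)
                if mask >> i & 1 and self.owned_mask >> i & 1
                and self.data[i] != requester}

    def req_o(self, requester, mask):
        """Grant ownership of the masked words; return owners to notify."""
        previous = self._previous_owners(mask, requester)
        for i in range(WORDS_PER_LINE):
            if mask >> i & 1:
                self.data[i] = requester
        self.owned_mask |= mask
        if self.state == "I":
            self.state = "V"  # line is now allocated
        return previous

    def req_wt(self, requester, mask, values):
        """Write the masked words through; they are no longer owned."""
        previous = self._previous_owners(mask, requester)
        for i in range(WORDS_PER_LINE):
            if mask >> i & 1:
                self.data[i] = values[i]
        self.owned_mask &= ~mask
        if self.state == "I":
            self.state = "V"
        return previous


line = LLCLine()
line.req_o("gpu0", 0b0011)          # gpu0 owns words 0 and 1
moved = line.req_o("cpu0", 0b0010)  # cpu0 takes word 1 only
print(moved, line.data)
```

Because ownership is tracked per word, the second request moves only word 1 and leaves `gpu0` as owner of word 0, which is the "no false sharing" property listed on the slide.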

  20. External Request Interface
      External request: must be handled if the device supports state
      • ReqV: O
      • ReqO: O
      • ReqO+data: O
      • RvkO: O
      • Inv: S
      • ReqS: S and O
      L1 states: GPU coh. L1: I, V; DeNovo L1: I, V, O; MESI L1: I, S, O
      • Translation Unit may implement functionality if not supported by device
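
The table on this slide implies that the set of external requests a device (or its translation unit) must handle follows from the states its cache supports. The sketch below encodes that rule, reading "S and O" as requiring both states; `required_handlers` and the dictionary names are hypothetical.

```python
# External request -> states the device must support before it has to
# handle that request (transcribed from the slide).
EXTERNAL_REQUESTS = {
    "ReqV": {"O"},
    "ReqO": {"O"},
    "ReqO+data": {"O"},
    "RvkO": {"O"},
    "Inv": {"S"},
    "ReqS": {"S", "O"},  # assumption: "S and O" means both are required
}

# L1 state sets as listed on the slide.
L1_STATES = {
    "GPU coherence": {"I", "V"},
    "DeNovo": {"I", "V", "O"},
    "MESI": {"I", "S", "O"},
}


def required_handlers(supported_states):
    """Return the external requests a cache with these states must handle."""
    supported = set(supported_states)
    return {request for request, needed in EXTERNAL_REQUESTS.items()
            if needed <= supported}


for protocol, states in L1_STATES.items():
    print(protocol, sorted(required_handlers(states)))
```

Under this reading a GPU coherence L1, which has neither O nor S, handles no external requests at all, while a MESI L1 handles all six.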

  21. Evaluation: Configurations
      Configuration: LLC protocol / CPU protocol / GPU protocol
      • HMG: Hierarchical MESI / MESI / GPU coherence
      • HMD: Hierarchical MESI / MESI / DeNovo
      • SMG: Spandex / MESI / GPU coherence
      • SMD: Spandex / MESI / DeNovo
      • SDG: Spandex / DeNovo / GPU coherence
      • SDD: Spandex / DeNovo / DeNovo
      [Diagrams: Hierarchical MESI has CPU L1s and GPU L1s, a GPU L2, and a MESI LLC; Spandex has CPU L1s and GPU L1s attached to the Spandex LLC]
      CPU-GPU workloads from Pannotia and Chai benchmark suites
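
The three-letter configuration names encode the LLC, CPU, and GPU protocols in that order. A small decoder, sketched here with hypothetical names (`decode`, `EVALUATED`), makes the scheme explicit and rejects combinations that are not among the six listed on the slide.

```python
LLC = {"H": "Hierarchical MESI", "S": "Spandex"}
CPU = {"M": "MESI", "D": "DeNovo"}
GPU = {"G": "GPU coherence", "D": "DeNovo"}

# Only these six configurations appear on the slide.
EVALUATED = ("HMG", "HMD", "SMG", "SMD", "SDG", "SDD")


def decode(name):
    """Expand a configuration name into (LLC, CPU, GPU) protocols."""
    if name not in EVALUATED:
        raise ValueError(f"{name!r} is not one of the evaluated configurations")
    llc, cpu, gpu = name
    return LLC[llc], CPU[cpu], GPU[gpu]


for config in EVALUATED:
    print(config, decode(config))
```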

  22. Evaluation: CPU-GPU Applications [Figure: bar chart of execution time (cycles), 0% to 120%, for configurations HMG, HMD, SMG, SMD, SDG and SDD; benchmark labels BC, PR, HSTI, TRNS, RSCT, TQH; the Average group shows Hbest and Sbest]
      • Different workloads prefer different protocols
      • Spandex flexibility ⇒ consistently better execution time (avg 16% lower)

  23. Evaluation: CPU-GPU Applications [Figure: stacked bar chart of network traffic (flits), 0% to 100%, broken down into Probe, ReqWT+data, ReqWB/WT, ReqO[+data] and ReqV/S, for configurations HMG, HMD, SMG, SMD, SDG and SDD; benchmark labels BC, PR, HSTI, TRNS, RSCT, TQH; the Average group shows Hbest and Sbest]
      • Spandex flexibility ⇒ consistently better network traffic (avg 27% lower)

  24. Conclusion and Future Work [Diagram labels: Producer, Consumer, MESI, LLC] ⇒ Simple, Flexible, Efficient
      Future work: exploit SW or HW hints about data access patterns
      • Dynamic Spandex request selection
      • Producer-consumer forwarding
      • Extended granularity flexibility
