

  1. Programming Systems for Specialized Architectures: Interface, Data, Approximation
Sarita Adve, with Vikram Adve, Johnathan Alsop, Maria Kotsifakou, Sasa Misailovic, Matt Sinclair, Prakalp Srivastava
University of Illinois at Urbana-Champaign, sadve@illinois.edu
Sponsors: NSF, C-FAR, ADA (a JUMP center by SRC, DARPA)

  2. A Modern Mobile SoC
[Figure: SoC block diagram: two CPUs with vector units and L1 caches sharing an L2 cache, a GPU, A/V hardware accelerators, multimedia, three DSPs, modem, and GPS, all on an interconnect to main memory]
• Different hardware ISAs, different parallelism models, incompatible memory systems
• Increasing diversity in and across SoCs, and in supercomputers, data centers, ...
⇒ Need a common interface (abstractions): hardware-independent software development, "object code" portability
⇒ Data movement is critical: memory structures, communication, consistency, synchronization
⇒ Approximation: application-driven solution-quality trade-offs to increase efficiency

  3. Interfaces: Back to the Future
April 7, 1964: IBM announced the System/360
• A family of machines with a common abstraction/interface/ISA
  – Programmer freedom: no reprogramming
  – Designer freedom: implementation creativity
• Not unique: CPUs: ISAs; Internet: IP; GPUs: CUDA; Databases: SQL; ...

  4. Current Interface Levels
• App. productivity: domain-specific programming languages (TensorFlow, MXNet, Halide, ...)
• App. performance: general-purpose programming languages (CUDA, OpenCL, OpenACC, OpenMP, Python, Julia)
• Language innovation: language-level compiler IRs (Delite DSL IR, DLVM, TVM, ...)
• Compiler investment: language-neutral compiler IRs (Delite IR, HPVM, OSCAR, Polly)
• Object-code portability: virtual ISAs (SPIR, HPVM)
• Hardware innovation: "hardware" ISAs (IBM AS/400, Transmeta, PTX, HSAIL, codesigned virtual machines, ...)
Hardware targets: CPUs + SIMD units, vector DSPs, FPGAs, GPUs, domain-specific accelerators
Source: Vikram Adve, HPVM project, https://publish.illinois.edu/hpvm-project/

  5. Which Interface Levels Can Be Uniform?
• Domain-specific and general-purpose programming languages: too diverse to define a uniform interface
• Language-level compiler IRs, language-neutral compiler IRs, virtual ISAs: much more uniform
• "Hardware" ISAs and the hardware below them (CPUs + SIMD units, vector DSPs, FPGAs, GPUs, domain-specific accelerators): also too diverse
Source: Vikram Adve, HPVM project, https://publish.illinois.edu/hpvm-project/

  6. One Example. HPVM: Heterogeneous Parallel Virtual Machine [PPoPP'18]
A parallel program representation for heterogeneous parallel hardware:
• Virtual ISA: portable virtual object code, simpler translators
• Compiler IR: optimizations; maps diverse parallel languages
• Runtime representation: flexible scheduling (mapping, load balancing)
A generalization of LLVM IR for parallel heterogeneous hardware.
PPoPP'18: results on GPU (NVIDIA), vector ISA (AVX), and multicore (Intel Xeon)
Ongoing: FPGAs, novel domain-specific SoCs

  7. HPVM Abstractions
Hierarchical dataflow graph with side effects. Each leaf node contains either vector code, e.g.
    %VA = load <4 x float>* %A
    %VB = load <4 x float>* %B
    ...
    %VC = fmul <4 x float> %VA, %VB
or a nested dataflow graph.

  8. HPVM Abstractions (continued)
The same hierarchical dataflow graph with side effects (leaf nodes hold vector code or a nested graph) expresses:
• Task, data, and vector parallelism
• Streams, pipelines
• Shared memory
• High-level optimizations
• FPGAs (more custom hardware?)
N different parallelism models ⇒ a single unified model (see the sketch below)
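A minimal, self-contained sketch of the abstraction these two slides describe: a hierarchical dataflow graph whose internal nodes hold nested graphs and whose leaf nodes hold (possibly replicated) computation. This is not HPVM's actual IR or API; every name here is illustrative, and execution is sequential only to keep the semantics visible.

```cpp
// Illustrative model of a hierarchical dataflow graph (NOT HPVM's real API).
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

struct Node {
    std::function<void(size_t)> leaf;            // leaf work, given an instance id
    size_t instances = 1;                        // data parallelism: N dynamic instances
    std::vector<std::unique_ptr<Node>> children; // non-empty => nested dataflow graph
    std::vector<Node*> deps;                     // incoming dataflow edges
    bool done = false;
};

// Run a node after its dependences; leaves replicate their work per instance.
// A real system would map instances onto GPU threads, vector lanes, or FPGA
// pipelines; running them in a loop here just pins down the semantics.
void run(Node& n) {
    if (n.done) return;
    for (Node* d : n.deps) run(*d);              // dataflow edges: deps run first
    if (!n.children.empty()) {
        for (auto& c : n.children) run(*c);      // internal node: run nested graph
    } else {
        for (size_t i = 0; i < n.instances; ++i) n.leaf(i);  // leaf: data parallel
    }
    n.done = true;
}
```

A two-node pipeline (producer feeding consumer) is just two leaves joined by a deps edge; task, data, and pipeline parallelism all reduce to this one graph structure, which is the "single unified model" claim above.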

  9. Data
Data movement is critical to efficiency:
• Memory structures
• Communication
• Coherence
• Consistency
• Synchronization
[Figure: four accelerators, each behind its own interface (IF), connected by diverse mechanisms: cache-coherent access, FIFOs, stashes, RDMA, and inter-chip interfaces]
Goal: a uniform communication interface for hardware, abstracted up to the software interface

  10. Application-Customized Accelerator Communication Architecture
Problem: design and integrate multiple accelerator memory systems and their communication.
Challenges:
‒ Friction between different app-specific specializations
‒ Inefficiencies due to deep memory hierarchies
‒ Multiple scales: on-chip to cloud
[Figure: the accelerator diagram from slide 9]
We need a new accelerator communication architecture:
‒ Coherent, global address space
‒ App-specialized coherence, communication, storage, and solution quality
One example next, focused on coherence: Spandex [ISCA'18]

  11. Heterogeneous devices have diverse memory demands
[Figure: design-space chart with axes for fine-grain synchronization, latency sensitivity, temporal locality, spatial locality, and throughput sensitivity]

  12. Heterogeneous devices have diverse memory demands (continued)
Typical CPU workloads: fine-grain synchronization, latency sensitive

  13. Heterogeneous devices have diverse memory demands (continued)
Typical GPU workloads: spatial locality, throughput sensitive

  14. MESI coherence targets CPU workloads

Protocol properties      MESI               GPU coherence              DeNovo
Granularity              Line               Reads: line; writes: word  Reads: flexible; writes: word
Stale-data invalidation  Writer-invalidate  Self-invalidate            Self-invalidate
Write propagation        Ownership          Write-through              Ownership
Good for                 CPU                GPU                        CPU or GPU

MESI trade-offs:
• Coarse-grain state: spatial locality (+), but false sharing (‒)
• Writer-initiated invalidation: temporal locality for reads (+), but overheads limit throughput and scalability (‒)
• Ownership-based updates: temporal locality for writes (+), but indirection when locality is low (‒)

  15. GPU coherence fits GPU workloads
(Same protocol-properties table as slide 14.)
GPU coherence trade-offs:
• Fine-grain writes: no false sharing (+), but reduced spatial locality (‒)
• Self-invalidation: simple and scalable (+), but synchronization limits read reuse (‒)
• Write-through caches: simple, low overhead (+), but synchronization limits write reuse (‒)

  16. DeNovo is a good fit for CPU and GPU
(Same protocol-properties table as slide 14.)
DeNovo combines flexible read granularity with word-granularity writes, self-invalidation of non-owned data, and ownership-based write propagation: good for CPU or GPU. A sketch of the invalidation-strategy row follows.
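To make the "stale-data invalidation" row concrete, here is a minimal sketch, assuming a simple directory and a data-race-free programming model, of the two strategies: writer-invalidate (MESI) pushes invalidations to sharers when a core writes, while self-invalidate (GPU coherence, DeNovo) drops potentially stale data from the local cache at a synchronization acquire. All structures here are illustrative, not taken from the papers.

```cpp
// Two stale-data invalidation strategies, sketched for contrast.
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

using Addr = uint64_t;

// MESI-style writer-invalidate: a directory tracks sharers and invalidates
// them on a write. Reads can then be reused across synchronization
// (temporal locality), at the cost of invalidation traffic and indirection.
struct WriterInvalidateDirectory {
    std::unordered_map<Addr, std::unordered_set<int>> sharers;
    void onWrite(Addr a, int writer) {
        for (int c : sharers[a])
            if (c != writer) sendInvalidate(c, a);
        sharers[a] = {writer};
    }
    void sendInvalidate(int cache, Addr a) { /* network message, elided */ }
};

// GPU/DeNovo-style self-invalidate: no sharer lists. At a synchronization
// acquire, the local cache drops anything it does not own, so later reads
// refetch fresh data. This is why the table says "synch limits read reuse":
// only owned (DeNovo "registered") data survives the acquire.
struct SelfInvalidateCache {
    std::unordered_set<Addr> valid;  // clean lines that may be stale
    std::unordered_set<Addr> owned;  // lines this cache owns (DeNovo only)
    void onAcquire() {
        for (auto it = valid.begin(); it != valid.end(); ) {
            if (owned.count(*it)) ++it;    // owned data stays valid
            else it = valid.erase(it);     // may be stale: self-invalidate
        }
    }
};
```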

  17. Integrating Diverse Coherence Strategies
Existing solutions: MESI-based LLC
‒ Accelerator requests are forced through MESI
‒ Added latency for inter-device communication
‒ MESI is complex: extensions are difficult
[Figure: GPUs with GPU-coherence L1s, CPUs with MESI L1s, and an FPGA/ASIC with a hybrid MESI/GPU-coherence L2, all behind a MESI LLC]
Spandex: a DeNovo-based interface [ISCA'18]
+ Supports write-through and write-back
+ Supports self-invalidate and writer-invalidate
+ Supports requests of variable granularity
+ Directly interfaces MESI, GPU-coherence, and hybrid (e.g., DeNovo) caches
[Figure: CPU with MESI L1, GPU with GPU-coherence L1, and FPGA/ASIC with DeNovo L1, all attached directly to a Spandex LLC; a request-class sketch follows]
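The ISCA'18 paper organizes device traffic into a small set of LLC request classes; the sketch below shows how the three device protocols from the table could map onto them. The enum names follow the paper's ReqV/ReqS/ReqWT/ReqO terminology, but the classify function and its signature are my illustrative assumption, not Spandex's actual interface.

```cpp
// Illustrative mapping of device protocols onto Spandex-style request classes.
enum class Req {
    V,   // read valid data; requester self-invalidates at synch (GPU/DeNovo reads)
    S,   // read shared; requester expects writer-initiated invalidation (MESI reads)
    WT,  // write a word through to the LLC (GPU-coherence writes)
    O,   // request ownership; write back later (MESI / DeNovo writes)
};

enum class DeviceProtocol { MESI, GPUCoherence, DeNovo };

// Hypothetical policy: each device keeps its native L1 protocol, and the LLC
// only needs to understand these few request classes.
inline Req classify(DeviceProtocol p, bool isWrite) {
    switch (p) {
        case DeviceProtocol::MESI:         return isWrite ? Req::O  : Req::S;
        case DeviceProtocol::GPUCoherence: return isWrite ? Req::WT : Req::V;
        case DeviceProtocol::DeNovo:       return isWrite ? Req::O  : Req::V;
    }
    return Req::V;  // unreachable
}
```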

  18. Example: Collaborative Graph Applications
Vertex-centric algorithms distribute vertices among CPU and GPU threads.

Application: Pull-based PageRank
  Access pattern: read neighbor vertices, update local vertex
  Important dimension: a flat LLC avoids indirection for read misses
  Results: Spandex LLC ⇒ 37% better exec. time, 9% better NW traffic

Application: Push-based Betweenness Centrality
  Access pattern: read local vertex, update (RMW) neighbor vertices
  Important dimension: ownership-based write propagation exploits locality in updates
  Results: DeNovo at GPU ⇒ 18% better exec. time, 61% better NW traffic

The code sketch after this table contrasts the two access patterns.
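The pull/push distinction is easiest to see in code. Below is a minimal sequential PageRank-style sketch (illustrative constants; CSR adjacency assumed; the slide's push-based example is Betweenness Centrality, but the same read-vs-RMW contrast appears here): pull iterations are dominated by reads of neighbor data, which a flat LLC can serve without indirection, while push iterations are dominated by read-modify-write updates to neighbor data, which ownership-based write propagation keeps local.

```cpp
#include <vector>

struct Graph {
    // CSR-style adjacency: edges of vertex v are nbr[off[v]] .. nbr[off[v+1]-1]
    std::vector<int> off, nbr;
    std::vector<int> outDegree;
    int n() const { return static_cast<int>(off.size()) - 1; }
};

// Pull-based: each vertex READS its in-neighbors and writes only its own
// rank, so read misses to remote data dominate.
void pullIteration(const Graph& in, const std::vector<double>& rank,
                   std::vector<double>& next) {
    for (int v = 0; v < in.n(); ++v) {
        double sum = 0.0;
        for (int e = in.off[v]; e < in.off[v + 1]; ++e)
            sum += rank[in.nbr[e]] / in.outDegree[in.nbr[e]];
        next[v] = 0.15 + 0.85 * sum;   // damping constant is illustrative
    }
}

// Push-based: each vertex reads its own rank and UPDATES (read-modify-write)
// its out-neighbors, so repeated writes to shared lines dominate.
void pushIteration(const Graph& out, const std::vector<double>& rank,
                   std::vector<double>& next) {
    for (double& x : next) x = 0.15;
    for (int v = 0; v < out.n(); ++v) {
        double share = 0.85 * rank[v] / out.outDegree[v];
        for (int e = out.off[v]; e < out.off[v + 1]; ++e)
            next[out.nbr[e]] += share;  // RMW on neighbor data
    }
}
```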

  19. Looking Forward...
Software innovations can expose: coarse-grain operations, producer/consumer relationships, synchronization visibility, data locality
  (HPVM + DRF consistency + ???)
Hardware innovations can exploit them: coherent scratchpads (Stash, ISCA'15), hardware queues, adaptive laziness (hLRC), Spandex, HBM caches, NVRAM, dynamic caches

  20. Approximation
How should an application express the required quality of solution to the hardware? Integrate approximation (quality) into the interface. One possible shape for such an interface is sketched below.
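The slide leaves the mechanism open. As one purely hypothetical illustration of "quality in the interface", the sketch below lets an application attach an error tolerance to a computation and lets a runtime pick the cheapest implementation whose (offline-profiled) error bound meets it. None of these names come from the talk.

```cpp
#include <functional>
#include <vector>

struct Quality {
    double maxRelError;  // application-declared tolerance, e.g. 0.01 = 1%
};

// Candidate implementations of the same kernel, each with an (assumed,
// offline-profiled) error bound and relative cost.
struct Impl {
    double relErrorBound;
    double relativeCost;  // 1.0 = exact baseline
    std::function<void(std::vector<float>&)> run;
};

// Hypothetical runtime policy: the cheapest implementation that meets the
// tolerance. Assumes impls.front() is the exact (zero-error) baseline.
inline const Impl& select(const std::vector<Impl>& impls, Quality q) {
    const Impl* best = &impls.front();
    for (const Impl& i : impls)
        if (i.relErrorBound <= q.maxRelError && i.relativeCost < best->relativeCost)
            best = &i;
    return *best;
}
```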

  21. Summary
• Interfaces
• Data
• Approximation
