Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling



  1. Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling
  Tao Chen and G. Edward Suh, Computer Systems Laboratory, Cornell University

  2. Accelerator-Rich Computing Systems
  • Computing systems are becoming accelerator-rich: general-purpose cores plus a large number of accelerators
  • Challenge: design and verification complexity, i.e., the non-recurring engineering (NRE) cost per accelerator; manual effort is a major source of this cost
  • Creating the computation pipelines is handled by High-Level Synthesis (HLS); managing the data supply from memory is not
  • This work: an automated framework for generating accelerators with efficient data supply

  3. Inefficiencies in Accelerator Data Supply
  Scratchpad-based accelerators
  • On-chip scratchpad memory (SPM), with manually designed preload logic to move data between the SPM and main memory
  • Pros: good performance
  • Cons: high design effort; the preload logic is accelerator-specific and not reusable
  [Diagram: accelerator containing compute logic and preload logic around an SPM, attached to the memory bus]

  4. Inefficiencies in Accelerator Data Supply (cont.)
  Cache-based accelerators
  • Pros: low design effort; the cache can be reused
  • Cons: uncertain memory latency impacts performance
  [Diagram: accelerator compute logic backed by a cache on the memory bus]

  5. Optimize Data Supply for Cache-Based Accelerators
  Approach: an automated framework that transforms accelerator source code into an accelerator with efficient data supply
  [Diagram: Accelerator Source → Automated Framework → Accelerator w/ Efficient Data Supply]

  6. Optimize Data Supply for Cache-Based Accelerators (cont.)
  Techniques
  • Prefetching, with tagging of memory accesses
  [Diagram: accelerator compute logic served by a hardware prefetcher and cache on the memory bus]

  7. Optimize Data Supply for Cache-Based Accelerators (cont.)
  Techniques
  • Prefetching, with tagging of memory accesses
  • Access/execute decoupling, via program slicing plus an architecture template
  [Diagram: accelerator split into access logic and execute logic, served by a hardware prefetcher and cache on the memory bus]

  8. Impact of Uncertain Memory Latency
  • Example: sparse matrix-vector multiplication (spmv)
  • Pipeline generated with High-Level Synthesis (HLS):

      // inner loop of sparse matrix-vector multiplication
      for (j = begin; j < end; j++) {
      #pragma HLS pipeline
          Si = val[j] * vec[cols[j]];
          sum = sum + Si;
      }

  [Diagram: pipeline timeline showing LD, MUL, and ADD operations from successive iterations overlapping over time]

  9. Impact of Uncertain Memory Latency (cont.)
  • In the spmv loop, val[j] and cols[j] are regular (constant-stride) accesses, while vec[cols[j]] is irregular
  • A cache miss stalls the entire accelerator pipeline
  • Reduce cache misses for regular accesses: prefetch data into the cache
  • Tolerate cache misses for irregular accesses: access/execute decoupling
  [Diagram: the same pipeline timeline with a load miss stalling all downstream operations]

  10. Hardware Prefetching
  • Predict future memory accesses
  • The PC is often used as a hint, e.g., for stream localization and spatial correlation prediction
  [Diagram: the spmv loop next to its global address stream, in which the addresses of val[j], cols[j], and vec[cols[j]] are interleaved (2380, 541C, 8010, 2384, 5420, 8328, ...)]

  11. Hardware Prefetching (cont.)
  • Stream localization splits the global address stream into per-PC local streams
  • For spmv, the val[j] and cols[j] streams have regular strides and are predictable; the vec[cols[j]] stream is irregular, with no prediction possible
  • Problem: accelerators lack a PC
  • Solution: generate PC-like tags for accelerator memory accesses
  [Diagram: the global address stream separated into local streams for PC x, PC y, and PC z; the first two are strided, the third irregular]

  12. Hardware Prefetching (cont.)
  • The PC-like tags are derived from the accelerator's control/data-flow graph (CDFG): each static load gets its own tag, so each of the loads x, y, and z in the spmv loop body forms its own localized stream
  [Diagram: CDFG of the loop with basic blocks BB1-BB3; the three loads x, y, and z feed the multiply and add]
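To make the tagging idea concrete, here is a minimal C model of a stream-localized stride prefetcher. The table organization, confidence policy, and all names are our own illustrative choices, not the paper's hardware design; the only assumption taken from the slides is that the tag identifies a static load (e.g., derived from basic-block and load IDs in the CDFG) rather than a PC.

    #include <stdint.h>

    #define N_STREAMS 64

    /* One entry per localized stream, indexed by a PC-like tag. */
    typedef struct {
        uint64_t last_addr;   /* last address seen for this stream */
        int64_t  stride;      /* last observed stride              */
        int      confidence;  /* consecutive stride confirmations  */
    } stream_entry_t;

    static stream_entry_t streams[N_STREAMS];

    /* Called on every tagged access; returns an address to prefetch,
       or 0 when the stream is not (yet) predictable. */
    uint64_t on_access(uint32_t tag, uint64_t addr) {
        stream_entry_t *e = &streams[tag % N_STREAMS];
        int64_t stride = (int64_t)(addr - e->last_addr);

        if (stride != 0 && stride == e->stride)
            e->confidence++;          /* stride confirmed         */
        else
            e->confidence = 0;        /* new or irregular stream  */

        e->stride    = stride;
        e->last_addr = addr;

        /* val[j] and cols[j] lock in after a few accesses;
           vec[cols[j]] never builds confidence and stays quiet. */
        return (e->confidence >= 2) ? addr + (uint64_t)stride : 0;
    }

Localizing by tag is what makes the stride check meaningful: in the single global stream of slide 10, consecutive addresses come from different loads, so no stable stride exists.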

  13. Decoupled Access/Execute (DAE)
  • Limitations of hardware prefetching: not accurate for complex patterns, and needs warm-up time; the fundamental reason is the lack of semantic information
  • Decoupled access/execute: allow memory accesses to run ahead to preload data
  [Diagram: timelines contrasting a cache-based DAE accelerator, in which the access logic runs ahead of the execute logic to hide memory latency, with an SPM-based design that hides latency via manual preload logic]
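As a rough sizing sketch (our own back-of-envelope reasoning, not a number from the talk): if the execute pipeline consumes one element per cycle and a miss costs L cycles, the access slice must run about L elements ahead to fully hide a single outstanding miss, so the decoupling queues need on the order of L entries; e.g., for L = 100 and an initiation interval of 1, roughly 100 entries. The paper sizes its queues empirically, via the design space exploration listed on slide 22.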

  14. Traditional DAE is not Effective for Accelerators
  • Traditional DAE: the access part forwards data to the execute part
  • Problem: the access pipeline stalls on misses, so throughput is limited by the access pipeline
  • Goal: allow the access pipeline to continue to flow under misses
  [Diagram: original vs. decoupled pipeline timelines; a load miss in the decoupled version still stalls the access pipeline]

  15. DAE Accelerator with Decoupled Loads
  • Anatomy of a load: address generation (AGen), request (Req), and response (Resp)
  • Solution: delegate request/response handling to a separate unit, so the access pipeline performs only address generation and never waits for responses
  [Diagram: a load split into AGen, Req, and Resp; the Req/Resp handling is moved out of the access pipeline]
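A minimal source-level sketch of what decoupling a load means. The FIFO primitives and names below are hypothetical stand-ins for the queues between the slices and the memory unit, not the paper's API.

    #include <stdint.h>

    /* Hypothetical FIFO primitives standing in for hardware queues. */
    typedef struct fifo fifo_t;
    void  fifo_push_addr(fifo_t *q, uintptr_t addr);
    float fifo_pop_data(fifo_t *q);

    extern float vec[];   /* the irregularly indexed spmv vector */

    /* Access side: AGen + Req only. It pushes the address and moves
       on immediately, so a miss does not stall the access pipeline. */
    void access_side(fifo_t *load_addr_q, const int *cols, int j) {
        fifo_push_addr(load_addr_q, (uintptr_t)&vec[cols[j]]);
    }

    /* Execute side: consumes the Resp whenever it arrives; it blocks
       only if the data queue is empty. */
    float execute_side(fifo_t *load_data_q) {
        return fifo_pop_data(load_data_q);
    }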

  16. Memory Unit
  • Proxy for handling memory requests and responses
  • Supports response reordering and store-to-load forwarding
  [Diagram: memory unit containing load address, store address, and store data queues plus a load data queue; a dependency check forwards store data to matching loads; load data returns to the execute unit over the memreq/memresp ports]

  17. Memory Unit (cont.)
  • Store addresses and data from the execute unit are buffered in the store queues; the dependency check compares incoming loads against pending stores and forwards matching store data
  [Diagram: the same memory unit, highlighting the store path]
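A minimal C model of the dependency check the diagram shows. The circular-buffer layout and names are our own sketch of the mechanism, not the actual RTL.

    #include <stdbool.h>
    #include <stdint.h>

    #define SQ_SIZE 16

    /* Pending stores buffered in the store address/data queues. */
    typedef struct {
        uint64_t addr[SQ_SIZE];
        uint32_t data[SQ_SIZE];
        int      head, count;    /* circular buffer, head = oldest */
    } store_queue_t;

    /* Dependency check for an incoming load: scan pending stores from
       youngest to oldest; on an address match, forward the store data
       instead of issuing a memory request. */
    bool dep_check_and_forward(const store_queue_t *sq,
                               uint64_t load_addr, uint32_t *out) {
        for (int i = sq->count - 1; i >= 0; i--) {
            int idx = (sq->head + i) % SQ_SIZE;
            if (sq->addr[idx] == load_addr) {
                *out = sq->data[idx];   /* store-to-load forwarding */
                return true;
            }
        }
        return false;                   /* no match: send memreq    */
    }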

  18. Automated DAE Accelerator Generation
  • Program slicing generates the access and execute slices: accel.c is sliced into access.c and execute.c, each synthesized with HLS into access.v and execute.v
  • An architectural template with configurable parameters (queue sizes, port width, MemUnit configuration, etc.), written in PyMTL, assembles the slices into the access/execute decoupled accelerator
  [Diagram: tool flow from accel.c through slicing, HLS, and template-based hardware generation]
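To illustrate the slicing step on the spmv loop from slide 8, here is a hand-worked sketch of what the two slices could look like. The issue_load/pop_load primitives are hypothetical names for the compiler-inserted communication with the memory unit, and the real slicer may partition the code differently.

    /* Hypothetical queue primitives inserted by the slicer. */
    void  issue_load(const float *addr);  /* push address to the MemUnit */
    float pop_load(void);                 /* pop load data, in order     */

    /* access.c: keeps only the code needed to generate addresses,
       so it can run ahead of the computation. */
    void spmv_access(int begin, int end, const float *val,
                     const int *cols, const float *vec) {
        for (int j = begin; j < end; j++) {
            issue_load(&val[j]);     /* regular stride             */
            int c = cols[j];         /* needed to form an address  */
            issue_load(&vec[c]);     /* irregular, data-dependent  */
        }
    }

    /* execute.c: pure computation; all memory values arrive through
       the decoupling queues in program order. */
    float spmv_execute(int begin, int end) {
        float sum = 0.0f;
        for (int j = begin; j < end; j++) {
            float v = pop_load();    /* val[j]       */
            float x = pop_load();    /* vec[cols[j]] */
            sum += v * x;
        }
        return sum;
    }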

  19. Evaluation Methodology
  • Vertically integrated modeling methodology
    • System components: cycle-level (gem5)
    • Accelerators: register-transfer-level (Vivado HLS, PyMTL)
    • Area, power, and energy: gate-level (commercial ASIC flow)
  • Benchmark accelerators from MachSuite:

    Name       Description
    bbgemm     Blocked matrix multiplication
    bfsbulk    Breadth-first search
    gemm       Dense matrix multiplication
    mdknn      Molecular dynamics (k-nearest neighbor)
    nw         Needleman-Wunsch algorithm
    spmvcrs    Sparse matrix-vector multiplication
    stencil2d  2D stencil computation
    viterbi    Viterbi algorithm

  20. Performance Comparison
  • 2.28x speedup on average
  • Prefetching and DAE work in synergy
  [Chart: speedup across the benchmark accelerators]

  21. Energy Comparison
  • 15% energy reduction on average, due to reduced stalls
  • The MemUnits and queues consume only a small amount of energy
  [Chart: energy across the benchmark accelerators]

  22. More Details in the Paper
  • Deadlock avoidance
  • Customization of memory units
  • Baseline validation
  • Power and area comparison
  • Energy, power, and area breakdown
  • Sensitivity study on varying queue sizes
  • Design space exploration: queue size customization
