

  1. NEAR DATA PROCESSING Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 7810: Advanced Computer Architecture

  2. Requirements for Your Presentations
     - Prepare for exactly a 20-minute talk followed by 5 minutes of Q&A
     - Required components in your presentation:
       - Summary of the work: clearly present the key ideas, mechanisms, and results
       - Strengths and weaknesses: slides highlighting the strengths and weaknesses
       - Discussion: future research directions; alternative ways to solve the problem

  3. Data Processing Trends
     - Why processing in memory?
     [Figure: scaling limitations (end of Dennard scaling), emerging applications (graph analytics, deep neural nets, etc.), and new technologies (3D stacking, resistive memories, etc.) converge on near data processing; ref: GraphLab, ZME]

  4. Requirements for Efficient NDP
     - Throughput: high processing throughput to match the high memory bandwidth
     - Power: thermal constraints limit the clock frequency
     - Flexibility: must amortize manufacturing cost through reuse across apps

  5. Memory Technologies
     - HMC: hybrid memory cube [ref: Micron]

  6. High-Bandwidth Memory Buses
     - Current DDR4 maxes out at 25.6 GB/sec
     - High Bandwidth Memory (HBM), led by AMD and NVIDIA
       - Supports a 1,024-bit-wide bus at 125 GB/sec
     - Hybrid Memory Cube (HMC) consortium, led by Intel
       - Claims that 400 GB/sec is possible
     - Both are based on stacked memory chips
       - Limited capacity (won't replace DRAM), but much higher than on-chip caches
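As a rough sanity check on these numbers, peak bandwidth is just bus width times transfer rate. A minimal sketch follows; the transfer rates (DDR4-3200 and about 1 GT/s per HBM pin) are assumptions for illustration, not figures from the slide.

```python
# Peak bandwidth (GB/s) = (bus width in bits / 8) * transfer rate in GT/s.
# The transfer rates below are illustrative assumptions.
def peak_bandwidth_gbs(bus_width_bits, transfer_rate_gts):
    return bus_width_bits / 8 * transfer_rate_gts

print(peak_bandwidth_gbs(64, 3.2))    # one DDR4-3200 channel -> 25.6 GB/s
print(peak_bandwidth_gbs(1024, 1.0))  # 1,024-bit HBM bus at 1 GT/s -> 128 GB/s (close to the 125 GB/s claim)
```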

  7. High Bandwidth Memory

  8. High Bandwidth Memory

  9. Resistive Memory Technology
     - 3D crosspoint promises virtually unlimited memory
       - Non-volatile
       - 10x higher density
       - Main limit for now: the 6 GB/sec interface
     [Figure: crosspoint array with line drivers, decoder, and sense amplifiers (SA)]

  10. Near Data Processing
      - How to map applications to NDP technology?
      [Figure: scaling limitations (end of Dennard scaling), emerging applications (graph analytics, deep neural nets, etc.), and new technologies (3D stacking, resistive memories, etc.) converge on near data processing; ref: GraphLab, ZME]

  11. Near Data Processing
      - Example
        - Stacks run memory-intensive code
        - Multiple stacks linked to the host processor via SerDes links
      [Figure: host processor connected to several HMC stacks]

  12. Design Challenges
      - Communication within and across stacks
        - Fine-grained vs. coarse-grained blocks [PACT'15]

  13. Efficient NDP Architecture
      - Flexibility, area, and power efficiency
      - HRL: heterogeneous reconfigurable logic [HPCA'16]
        - Configurable Logic Block: LUTs for embedded control logic; special functions such as sigmoid, tanh, etc.
        - Functional Unit: efficient 48-bit arithmetic/logic ops; registers for pipelining and retiming
        - Output MUX: configurable MUXes (tree, cascading, parallel); put close to the output, low cost and flexible

  14. Power Efficiency
      - 2.2x performance/Watt over FPGA
      - Within 92% of ASIC performance [HPCA'16]

  15. Combinatorial Optimization
      - Numerous critical problems in science and engineering can be cast within the combinatorial optimization framework.
      [Figure: combinatorial optimization problems (traveling salesman, knapsack, bin packing, scheduling) arising in pharmaceuticals, communication networks, artificial intelligence, machine learning, data mining, and DNA analysis; approximate heuristic algorithms include genetic algorithms, ant colony optimization, semi-definite programming, tabu search, simulated annealing, and the massively parallel Boltzmann machine]

  16. The Boltzmann Machine
      - Two-state units connected with real-valued edge weights form a stochastic neural network.
      - Goal: iteratively update the state or weight variables to minimize the network energy
            E = -1/2 * Σ_i Σ_j x_i x_j w_ij
      - Each unit x_j is updated stochastically with probability 1 / (1 + e^(δ/C)), where C is the control parameter and
            δ = (2*x_j - 1) * Σ_i x_i w_ij
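A minimal NumPy sketch of the slide's two formulas, the network energy and the stochastic unit update. Interpreting 1/(1 + e^(δ/C)) as the probability of flipping unit j, and the random sweep order, are assumptions made for illustration.

```python
import numpy as np

def energy(x, w):
    """Network energy E = -1/2 * sum_{i,j} x_i * x_j * w_ij."""
    return -0.5 * x @ w @ x

def update_unit(x, w, j, c, rng):
    """Stochastically update unit j at control parameter c.
    delta = (2*x_j - 1) * sum_i x_i * w_ij; the flip of x_j is accepted with
    probability 1 / (1 + exp(delta / c)) -- the flip interpretation is assumed."""
    delta = (2 * x[j] - 1) * (x @ w[:, j])
    if rng.random() < 1.0 / (1.0 + np.exp(delta / c)):
        x[j] = 1.0 - x[j]
    return x

rng = np.random.default_rng(0)
n = 8
w = rng.normal(size=(n, n))
w = (w + w.T) / 2.0                  # symmetric weights
np.fill_diagonal(w, 0.0)             # no self-connections
x = rng.integers(0, 2, size=n).astype(float)
for _ in range(2000):
    x = update_unit(x, w, rng.integers(n), c=0.5, rng=rng)
print(energy(x, w))
```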

  17. Computational Model
      - Network energy is minimized by adjusting either the edge weights or recomputing the states.
      - Iterative matrix-vector multiplication between the weights and states is critical to finding the minimal network energy.
      [Figure: weight matrix held in memory arrays, state vector moved to functional units that compute Σ, ×, and 1/(1 + e^x)]
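To make the role of the matrix-vector product concrete, here is a sketch of one update sweep in which the dominant cost is w @ x, which produces Σ_i x_i w_ij for every unit at once. Computing the sums once per sweep rather than after every flip is a simplification, not the accelerator's exact schedule.

```python
import numpy as np

def sweep(x, w, c, rng):
    """One full update sweep; the critical kernel is the matrix-vector product."""
    sums = w @ x                           # sum_i x_i * w_ij for all units j at once
    for j in range(len(x)):
        delta = (2 * x[j] - 1) * sums[j]   # stale after earlier flips: a simplification
        if rng.random() < 1.0 / (1.0 + np.exp(delta / c)):
            x[j] = 1.0 - x[j]
    return x
```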

  18. Resistive Random Access Memory
      - An RRAM (resistive RAM) cell comprises an access transistor and a resistive switching medium.
      [Figure: RRAM cell at the crossing of a wordline and a bitline within an RRAM array (source: HP, 2009)]

  19. Resistive Random Access Memory
      - A read is performed by activating a wordline and measuring the bitline current (I).
      [Figure: a selected cell storing '1' with resistance R_1 contributes current I = V/R_1 to the bitline]

  20. Memristive Boltzmann Machine
      - Key idea: exploit current summation on the RRAM bitlines to compute the dot product.
      [Figure: multiple activated cells storing '1' share a bitline, so the bitline current is I = Σ_i V/R_i]
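A small numerical model of this idea (illustrative voltages and resistances, not device parameters): every activated cell contributes V/R_i to the shared bitline, so the measured current is a dot product between the wordline activations and the cell conductances.

```python
import numpy as np

def bitline_current(v, active, resistance):
    """Current summed on one bitline: I = V * sum_i active_i * (1 / R_i)."""
    conductance = 1.0 / resistance
    return v * np.dot(active, conductance)

resistance = np.array([2e3, 1e3, 4e3, 1e3])      # ohms, illustrative values
active = np.array([1.0, 1.0, 0.0, 1.0])          # which wordlines are driven
print(bitline_current(0.5, active, resistance))  # amps
```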

  21. Memristive Boltzmann Machine
      - Memory cells represent the weights, and the state variables are used to control the bitlines and wordlines.
      [Figure: states x_0..x_4 gate the wordlines of cells storing w_01..w_04, so the bitline current I = Σ_i x_0 x_i w_0i, realized as Σ_i V/R_i]
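Continuing the model above, the weights are stored as cell conductances and the state variables decide which wordlines are driven, so the bitline of unit 0 directly yields Σ_i x_i w_0i. The linear weight-to-conductance mapping and the omission of negative weights below are simplifications, not the paper's exact encoding.

```python
import numpy as np

def to_conductance(w_row, g_min=1e-6, g_max=1e-3):
    """Map non-negative weight magnitudes onto the available conductance range
    (an assumed linear encoding; sign handling is omitted)."""
    return g_min + (g_max - g_min) * w_row / w_row.max()

def unit_dot_product(x, w_row, v=1.0):
    """Bitline current of unit 0: only wordlines with x_i = 1 are driven,
    so I is proportional to sum_i x_i * w_0i."""
    return v * np.dot(x, to_conductance(w_row))

w_row = np.array([0.2, 0.7, 0.1, 0.9])   # w_01..w_04, illustrative magnitudes
x = np.array([1.0, 0.0, 1.0, 1.0])       # states x_1..x_4
print(unit_dot_product(x, w_row))
```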

  22. Chip Organization
      - A hierarchical organization with a configurable reduction tree is used to compute large sums of products.
      [Figure: chip → banks → subbanks → mats, connected by an H-tree; a chip controller and reduction tree combine the partial results]
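A sketch of the configurable reduction step: partial dot products produced by the mats are combined level by level until a single sum of products remains. The fan-in of 4 is an arbitrary illustrative choice.

```python
def reduce_partial_sums(partials, fan_in=4):
    """Combine partial sums hierarchically, as a reduction tree would."""
    level = list(partials)
    while len(level) > 1:
        level = [sum(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]

# Eight partial sums from eight mats, reduced to one result.
print(reduce_partial_sums([1.0, 2.5, 0.5, 3.0, 1.5, 2.0, 0.0, 4.5]))  # 15.0
```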

  23. System Integration
      - DDR3 reads and writes are used for configuration and data transfer.
      - Software configures the on-chip data layout and initiates the optimization by writing to a memory-mapped control register:
        1. Configure the DIMM
        2. Write the weights and states
        3. Compute
        4. Read the outcome
      - To maintain ordering, accesses to the accelerator are made uncacheable by the CPU.
      [Figure: host processor and DRAM connected over DDR3 to the accelerator DIMM and its controller; the m × n weight matrix is written to the DIMM, and a start/ready handshake controls the computation]
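The host-side sequence implied by the slide, as a sketch: the register offsets, command codes, and the mmio_read/mmio_write interface are hypothetical stand-ins, since the slide does not give the accelerator's actual memory map.

```python
# Hypothetical register map: offsets and command codes are illustrative only.
CTRL_REG, STATUS_REG = 0x00, 0x08
WEIGHT_BASE, STATE_BASE, RESULT_BASE = 0x1000, 0x9000, 0xA000
CMD_CONFIGURE, CMD_START, STATUS_READY = 0x1, 0x2, 0x1

def run_optimization(mmio_write, mmio_read, weights, states):
    """mmio_write/mmio_read stand for uncacheable DDR3 accesses to the
    accelerator DIMM, so each step reaches the device in program order."""
    mmio_write(CTRL_REG, CMD_CONFIGURE)           # 1. configure the DIMM (data layout)
    for i, w in enumerate(weights):               # 2. write the weights and states
        mmio_write(WEIGHT_BASE + i, w)
    for i, s in enumerate(states):
        mmio_write(STATE_BASE + i, s)
    mmio_write(CTRL_REG, CMD_START)               # 3. compute
    while mmio_read(STATUS_REG) != STATUS_READY:  #    poll until the device signals ready
        pass
    return [mmio_read(RESULT_BASE + i) for i in range(len(states))]  # 4. read the outcome
```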

  28. Summary of Results
      [Chart: execution time vs. system energy, both normalized to the single-threaded kernel, on log-log axes from 0.01 to 1, for the multi-threaded kernel, the PIM accelerator, and the memristive accelerator; annotated improvements of 6x, 9x, 34x, and 60x]
