Near Data Processing
Mahdi Nazm Bojnordi, Assistant Professor, School of Computing, University of Utah
CS/ECE 7810: Advanced Computer Architecture
Requirements for Your Presentations
- Prepare for exactly a 20-minute talk followed by 5 minutes of Q&A
- Required components in your presentation:
  - Summary of the work
    - Clearly present the key ideas, mechanisms, and results
  - Strengths and weaknesses
    - Slides highlighting the strengths and weaknesses
  - Discussion
    - Future research directions
    - Alternative ways of solving the problem
Data Processing Trends
- Why processing in memory? Three converging forces drive near-data processing:
  - Scaling limitations: end of Dennard scaling
  - Emerging applications: graph analytics, deep neural nets, etc.
  - New technologies: 3D stacking, resistive memories, etc.
[ref: GraphLab, ZME]
Requirements for Efficient NDP
- Throughput: high processing throughput to match the high memory bandwidth
- Power: thermal constraints limit the clock frequency
- Flexibility: must amortize manufacturing cost through reuse across apps
Memory Technologies
- HMC: hybrid memory cube
[ref: Micron]
High-Bandwidth Memory Buses
- Current DDR4 maxes out at 25.6 GB/sec
- High Bandwidth Memory (HBM), led by AMD and NVIDIA
  - Supports a 1,024-bit-wide bus at 125 GB/sec
- Hybrid Memory Cube (HMC) consortium, led by Intel
  - Claims that 400 GB/sec is possible
- Both are based on stacked memory chips
  - Limited capacity (won't replace DRAM), but much higher than on-chip caches
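The bandwidth figures on the slide follow directly from bus width times transfer rate. A quick back-of-the-envelope sketch (the DDR4-3200 and 1 GT/s-per-pin HBM rates are illustrative assumptions, not from the slide):

```python
# Peak bandwidth = bus width (bytes) x transfer rate (GT/s).
def bandwidth_gb_per_s(bus_bits: int, transfer_rate_gt_s: float) -> float:
    return (bus_bits / 8) * transfer_rate_gt_s

# DDR4-3200: one 64-bit channel at 3.2 GT/s.
ddr4 = bandwidth_gb_per_s(64, 3.2)       # 25.6 GB/s, matching the slide

# First-generation HBM: 1,024-bit interface at ~1 GT/s per pin.
hbm = bandwidth_gb_per_s(1024, 1.0)      # 128 GB/s, close to the quoted 125 GB/s

print(ddr4, hbm)
```

The small gap between 128 and the quoted 125 GB/sec comes from the exact per-pin rate assumed.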
High Bandwidth Memory
Resistive Memory Technology
- 3D crosspoint promises virtually unlimited memory capacity
  - Non-volatile
  - 10x higher density
  - Main limit for now: the 6 GB/sec interface
[figure: crosspoint array with line drivers, decoder, and sense amplifiers]
Near Data Processing
- How to map applications to NDP technology?
- Same converging forces: scaling limitations (end of Dennard scaling), emerging applications (graph analytics, deep neural nets, etc.), new technologies (3D stacking, resistive memories, etc.)
[ref: GraphLab, ZME]
Near Data Processing
- Example:
  - Stacks run memory-intensive code
  - Multiple stacks are linked to the host processor via SerDes (HMC)
Design Challenges
- Communication within and across stacks
  - Fine-grained vs. coarse-grained blocks
[PACT'15]
Efficient NDP Architecture
- Goals: flexibility, area and power efficiency
- HRL: heterogeneous reconfigurable logic
  - Configurable Logic Block: LUTs for embedded control logic; special functions such as sigmoid, tanh, etc.
  - Functional Unit: efficient 48-bit arithmetic/logic ops; registers for pipelining and retiming
  - Output MUX Block: configurable MUXes (tree, cascading, parallel); placed close to the output, low cost and flexible
[HPCA'16]
Power Efficiency
- 2.2x performance/Watt over FPGA
- Within 92% of ASIC performance
[HPCA'16]
Combinatorial Optimization
- Numerous critical problems in science and engineering can be cast within the combinatorial optimization framework.
- Example problems: traveling salesman, knapsack, bin packing, scheduling
- Application domains: pharmaceuticals, communication networks, artificial intelligence, machine learning, data mining, DNA analysis
- Approximate heuristic algorithms: genetic algorithms, ant colony optimization, semi-definite programming, tabu search, simulated annealing, the massively parallel Boltzmann machine
The Boltzmann Machine
- Two-state units connected with real-valued edge weights form a stochastic neural network.
- Goal: iteratively update the state or weight variables to minimize the network energy
    E = -1/2 * sum_i sum_j x_i x_j w_ij
- Each unit j computes its local energy gap from its inputs,
    delta = (2 x_j - 1) * sum_i x_i w_ij
  and switches state with probability 1 / (1 + e^(delta / C)), where C is the control parameter.
Computational Model
- Network energy is minimized by adjusting either the edge weights or recomputing the states.
- Iterative matrix-vector multiplication between the weights and states is critical to finding the minimal network energy.
- Data moves between the memory arrays (holding the w_ij and x_i) and the functional units (computing sums, products, and 1/(1 + e^x)).
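The critical kernel is the repeated weight-matrix-times-state-vector product. A deterministic sketch of that iteration (the thresholding rule below is a simplifying assumption standing in for the stochastic unit update):

```python
import numpy as np

rng = np.random.default_rng(1)

# Symmetric zero-diagonal weights, binary state vector.
n = 8
W = rng.normal(size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
x = rng.integers(0, 2, size=n).astype(float)

# Each sweep is dominated by one matrix-vector product W @ x;
# this is the operation the accelerator maps onto memory arrays.
for _ in range(20):
    x = (W @ x > 0).astype(float)
print(x)
```

Every iteration reads the full weight matrix, which is why data movement between the arrays and the functional units dominates and motivates computing in or near the memory.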
Resistive Random Access Memory
- An RRAM (resistive RAM) cell comprises an access transistor and a resistive switching medium.
- Cells are organized into arrays addressed by wordlines and bitlines.
(source: HP, 2009)
Resistive Random Access Memory
- A read is performed by activating a wordline and measuring the bitline current: I = V / R
Memristive Boltzmann Machine
- Key idea: exploit current summation on the RRAM bitlines to compute the dot product.
- With multiple wordlines active, the bitline current is I = sum_i V / R_i
Memristive Boltzmann Machine
- Memory cells represent the weights; the state variables control the bitlines and wordlines.
- The bitline current thus realizes I = sum_i x_0 x_i w_0i, the sum of products needed by the energy computation.
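The analog trick above can be checked numerically: if each cell's conductance G_ij = 1/R_ij encodes a weight and the binary state vector gates the wordlines, each bitline current is a dot product. A sketch with illustrative (assumed) conductance values:

```python
import numpy as np

V = 1.0                                  # read voltage on active wordlines
G = np.array([[1e-6, 5e-6],              # conductances G_ij = 1/R_ij (siemens),
              [2e-6, 1e-6],              # one column per bitline
              [4e-6, 3e-6]])
x = np.array([1, 0, 1])                  # binary states gate the wordlines

# Kirchhoff's current law sums per-cell currents on each bitline:
# I_j = sum_i x_i * V * G_ij, i.e. a dot product of states and weights.
bitline_current = (x * V) @ G
print(bitline_current)                   # [5e-06 8e-06]
```

Reading the dot product as a current means the weight matrix never leaves the array, eliminating the data movement highlighted in the computational model.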
Chip Organization
- A hierarchical organization (chip, bank, subbank, mat) with a configurable reduction tree is used to compute large sums of products.
- Mats connect through an H-tree; the chip controller and reduction tree combine partial results.
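The reduction tree's role can be sketched as pairwise summation of per-mat partial dot products up the hierarchy. The pairing granularity below is an assumption for illustration:

```python
def reduce_tree(partials):
    """Pairwise-sum partial results, level by level, as a reduction tree would.

    Each pass models one tree level: adjacent partials are combined,
    halving the count until a single sum remains.
    """
    while len(partials) > 1:
        partials = [sum(partials[i:i + 2]) for i in range(0, len(partials), 2)]
    return partials[0]

# Eight mats each contribute a partial sum of products.
print(reduce_tree([3, 1, 4, 1, 5, 9, 2, 6]))  # 31
```

A tree of depth log2(mats) keeps the combine latency low even when many mats contribute to one large dot product.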
System Integration
- DDR3 reads and writes are used for configuration and data transfer.
- Software configures the on-chip data layout and initiates the optimization by writing to a memory-mapped control register (Start/Ready).
- Four-step protocol:
  1. Configure the DIMM
  2. Write the weights and states (the m x n model)
  3. Compute
  4. Read the outcome
- To maintain ordering, accesses to the accelerator DIMM are made uncacheable by the CPU.
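A host-side view of the four-step protocol can be sketched as a software model. The register offsets, method names, and the in-model compute are all illustrative assumptions; the real DIMM is driven by ordinary DDR3 reads and writes to an uncacheable address range:

```python
class AcceleratorDIMM:
    """Software model of the memory-mapped accelerator interface (hypothetical)."""
    CTRL_START, CTRL_READY = 0x0, 0x4    # assumed control-register offsets

    def __init__(self):
        self.regs, self.weights, self.states, self.result = {}, None, None, None

    def configure(self, m, n):           # step 1: set the on-chip data layout
        self.shape = (m, n)

    def write(self, weights, states):    # step 2: load the m x n model
        self.weights, self.states = weights, states

    def compute(self):                   # step 3: write Start, device raises Ready
        self.regs[self.CTRL_START] = 1
        self.result = sum(w * s for w, s in zip(self.weights, self.states))
        self.regs[self.CTRL_READY] = 1

    def read_outcome(self):              # step 4: read the outcome back
        assert self.regs.get(self.CTRL_READY), "accelerator not done"
        return self.result

dimm = AcceleratorDIMM()
dimm.configure(1, 4)
dimm.write([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
dimm.compute()
print(dimm.read_outcome())               # 8.0
```

Because the range is uncacheable, each of these steps maps to DDR3 transactions that reach the device in program order, which is what the protocol relies on.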
Summary of Results
[figure: execution time vs. system energy, both normalized to the single-threaded kernel (log-log); configurations: multi-threaded kernel, PIM accelerator, memristive accelerator; annotated improvements of 6x, 9x, 34x, and 60x]