NEAR DATA PROCESSING
Mahdi Nazm Bojnordi, Assistant Professor
School of Computing, University of Utah
CS/ECE 7810: Advanced Computer Architecture
Requirements for Your Presentations
¨ Prepare for exactly 20m talk followed by 5m Q&A
¨ Required components in your presentation
¤ Summary of the work
n Clearly present key ideas, mechanisms, and results
¤ Strengths and weaknesses
n Slides highlighting the strengths and weaknesses
¤ Discussion
n Future research directions
n Alternative ways to solve the problem
Data Processing Trends
¨ Why processing in memory …
¨ Trends driving near data processing
¤ Emerging applications: graph analytics, deep neural nets, etc.
¤ Scaling limitations: end of Dennard scaling
¤ New technologies: 3D stacking, resistive memories, etc.
[ref: GraphLab, ZME]
Requirements for Efficient NDP
¨ Throughput
¤ High processing throughput to match the high memory bandwidth
¨ Power
¤ Thermal constraints limit clock frequency
¨ Flexibility
¤ Must amortize manufacturing cost through reuse across apps
Memory Technologies
¨ HMC: hybrid memory cube
[ref: micron]
High-Bandwidth Memory Buses
¨ Current DDR4 maxes out at 25.6 GB/sec
¨ High Bandwidth Memory (HBM) led by AMD and NVIDIA
¤ Supports 1,024-bit-wide bus at 125 GB/sec
¨ Hybrid Memory Cube (HMC) consortium led by Intel
¤ Claimed that 400 GB/sec possible
¨ Both based on stacked memory chips
¤ Limited capacity (won’t replace DRAM), but much higher than on-chip caches
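The peak figures above follow from bus width multiplied by transfer rate. A quick check, assuming a single 64-bit DDR4 channel at 3200 MT/s (the slide does not name a speed grade; this is the configuration that yields 25.6 GB/sec), and deriving the per-pin rate implied by the HBM numbers:

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, transfers_per_s: float) -> float:
    """Peak bandwidth in GB/s = (bus width in bytes) * (transfers per second) / 1e9."""
    return bus_width_bits / 8 * transfers_per_s / 1e9


# Assumption: DDR4-3200 with one 64-bit channel.
ddr4_gb_per_s = peak_bandwidth_gb_per_s(64, 3200e6)

# Per-pin data rate implied by 125 GB/s over a 1,024-bit bus.
hbm_gbit_per_pin = 125e9 * 8 / 1024 / 1e9

print(f"DDR4 channel: {ddr4_gb_per_s:.1f} GB/s")
print(f"HBM per-pin rate: {hbm_gbit_per_pin:.2f} Gbit/s")
```

The wide HBM bus reaches its bandwidth at roughly 1 Gbit/s per pin, well below the 3.2 Gbit/s per pin assumed for DDR4 here: the gain comes from width, which 3D stacking makes practical.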
High Bandwidth Memory
Resistive Memory Technology
¨ 3D crosspoint promises virtually unlimited memory
¤ Non-volatile
¤ 10x higher density
¤ Main limit for now: 6 GB/sec interface
[Figure: crosspoint array with line driver, decoder, and sense amplifiers (SA)]
Near Data Processing
¨ How to map applications to NDP technology?
Near Data Processing
¨ Example
¤ Stacks run memory-intensive code
¤ Multiple stacks linked to host processor (SerDes)
[Figure: HMC stacks and host processor]
Design Challenges
¨ Communication within and across stacks
¤ Fine-grained vs. coarse-grained blocks [PACT’15]
Efficient NDP Architecture
¨ Flexibility, area and power efficiency
¨ HRL: heterogeneous reconfigurable logic [HPCA’16]
¤ Configurable Logic Block
n LUTs for embedded control logic
n Special functions: sigmoid, tanh, etc.
¤ Output MUX Block
n Configurable MUXes (tree, cascading, parallel)
n Put close to output, low cost and flexible
¤ Functional Unit
n Efficient 48-bit arithmetic/logic ops
n Registers for pipelining and retiming
Power Efficiency
¨ 2.2x performance/Watt over FPGA
¨ Achieves 92% of ASIC performance
[HPCA’16]
Combinatorial Optimization
¨ Numerous critical problems in science and engineering can be cast within the combinatorial optimization framework.
¤ Application domains: communication networks, data mining, DNA analysis, artificial intelligence, pharmaceuticals
¤ Combinatorial optimization problems: traveling salesman, knapsack, scheduling, machine learning, bin packing
¤ Approximate heuristic algorithms: genetic algorithms, ant colony optimization, semi-definite programming, simulated annealing, tabu search
¤ Massively parallel Boltzmann machine
The Boltzmann Machine
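¨ Two-state units connected with real-valued edge weights form a stochastic neural network.
¨ Goal: iteratively update the state or weight variables to minimize the network energy (E).
[Figure: unit x_j sums inputs x_0 … x_3 weighted by w_{0,j} … w_{3,j} and applies a sigmoid]
$$\delta = (2x_j - 1) \sum_i x_i w_{i,j}$$
$$\frac{1}{1 + e^{\delta / C}} \qquad (C\text{: control parameter})$$
$$E = -\frac{1}{2} \sum_i \sum_j x_i x_j w_{i,j}$$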
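The formulas above can be sketched in a few lines. This is a minimal illustration, assuming symmetric weights with a zero diagonal and treating 1 / (1 + e^(δ/C)) as the probability of accepting a flip of unit j (an interpretation; the slide only labels C as the control parameter). The 3-unit network is made up for the example. Under these assumptions δ equals the change in E caused by flipping x_j, which the last lines check.

```python
import math


def energy(x, w):
    """Network energy E = -1/2 * sum_i sum_j x_i * x_j * w[i][j]."""
    n = len(x)
    return -0.5 * sum(x[i] * x[j] * w[i][j] for i in range(n) for j in range(n))


def delta(x, w, j):
    """delta = (2 * x_j - 1) * sum_i x_i * w[i][j]."""
    return (2 * x[j] - 1) * sum(x[i] * w[i][j] for i in range(len(x)))


def flip_probability(d, c):
    """1 / (1 + e^(delta / C)), with C the control parameter."""
    return 1.0 / (1.0 + math.exp(d / c))


# Hypothetical 3-unit network: symmetric weights, zero diagonal.
w = [[0.0, 1.0, -2.0],
     [1.0, 0.0, 0.5],
     [-2.0, 0.5, 0.0]]
x = [1, 0, 1]

j = 1
flipped = list(x)
flipped[j] = 1 - flipped[j]

d = delta(x, w, j)
energy_change = energy(flipped, w) - energy(x, w)
print(d, energy_change, flip_probability(d, 1.0))
```

A negative δ means the flip lowers the energy, so the acceptance probability exceeds 1/2; a smaller C pushes that probability toward 0 or 1.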
Computational Model
¨ Network energy is minimized by either adjusting the edge weights or recomputing the states.
¨ Iterative matrix-vector multiplication between weights and states is critical to finding minimal network energy.
[Figure: weights (w_{0,0}, w_{0,1}, …, w_{1,0}, …) and states (x_0, x_1, …) move from the memory arrays to functional units that compute Σ, ×, and 1/(1 + e^x)]
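To see why the matrix-vector product dominates, the sketch below evaluates every unit's net input with one pass over the weight matrix and then applies the sigmoid per unit. It is an illustration under assumptions: the function names and the 3-unit example are hypothetical, and the exponent clamp is only there to avoid floating-point overflow.

```python
import math


def net_inputs(w, x):
    """Matrix-vector product: s[j] = sum_i x[i] * w[i][j] for every unit j."""
    n = len(x)
    return [sum(x[i] * w[i][j] for i in range(n)) for j in range(n)]


def flip_probabilities(w, x, c):
    """Per-unit 1 / (1 + e^(delta_j / C)) from a single matrix-vector product."""
    s = net_inputs(w, x)  # n * n multiply-adds: the dominant cost
    probs = []
    for j in range(len(x)):
        d = (2 * x[j] - 1) * s[j]
        probs.append(1.0 / (1.0 + math.exp(min(d / c, 700.0))))
    return probs


# Hypothetical 3-unit network: symmetric weights, zero diagonal.
w = [[0.0, 1.0, -2.0],
     [1.0, 0.0, 0.5],
     [-2.0, 0.5, 0.0]]
x = [1, 0, 1]

s = net_inputs(w, x)
p = flip_probabilities(w, x, c=1.0)
print(s, p)
```

Each iteration reads all n × n weights but produces only n sums, so moving the weights to the functional units costs far more than the arithmetic itself.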
Resistive Random Access Memory
¨ An RRAM cell comprises an access transistor and a resistive switching medium.
[Figure: RRAM cell connected to a wordline and a bitline]
RRAM: Resistive RAM (source: HP, 2009)
¨ A read is performed by activating a wordline and measuring the bitline current (I).
$$I = V / R_1$$
[Figure: voltage V applied to a selected cell ('1') with resistance R_1]
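A read can be modeled with Ohm's law and a reference current. This is a sketch with assumptions: the voltage, resistance, and reference values are hypothetical, and it assumes the low-resistance state stores '1'.

```python
def bitline_current(v_read: float, resistance_ohm: float) -> float:
    """Ohm's law for the single activated cell: I = V / R."""
    return v_read / resistance_ohm


def read_bit(v_read: float, resistance_ohm: float, i_ref: float) -> int:
    """Sense '1' when the bitline current exceeds the reference current."""
    return 1 if bitline_current(v_read, resistance_ohm) > i_ref else 0


# Hypothetical device values, chosen only for illustration.
V_READ = 0.2   # volts
R_LOW = 10e3   # ohms, low-resistance state (assumed to store '1')
R_HIGH = 1e6   # ohms, high-resistance state (assumed to store '0')
I_REF = 5e-6   # amperes, between the two expected currents

print(read_bit(V_READ, R_LOW, I_REF), read_bit(V_READ, R_HIGH, I_REF))
```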
Memristive Boltzmann Machine
¨ Key idea: exploit current summation on the RRAM bitlines to compute the dot product.
$$I = \sum_i \frac{V}{R_i}$$
[Figure: four selected cells ('1') sharing one bitline driven by voltage V]
¨ Memory cells represent the weights, and state variables are used to control the bitline and wordlines.
$$I = \sum_i w_{0,i} V$$
$$I = \sum_i x_0 x_i w_{0,i}$$
[Figure: cells storing w_{0,1} … w_{0,4} on one bitline, gated by states x_1 … x_4 and x_0]
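The current summation can be mimicked numerically. This is a sketch with assumptions: each weight is stored as a cell conductance G_i = 1/R_i (so only non-negative weights are representable here), x_i enables wordline i, and x_0 enables the bitline. The function name and all values are hypothetical.

```python
def bitline_dot_product(v, conductances, x_row, x_col):
    """Summed bitline current: I = sum_i x_col * x_row[i] * G_i * V."""
    return sum(x_col * x_i * g * v for x_i, g in zip(x_row, conductances))


V = 0.2                                # volts (hypothetical)
G = [1e-4, 2e-4, 5e-5, 3e-4]           # siemens, encoding w_{0,1} … w_{0,4}
X_ROW = [1, 0, 1, 1]                   # states x_1 … x_4 gating the wordlines

i_on = bitline_dot_product(V, G, X_ROW, x_col=1)
i_off = bitline_dot_product(V, G, X_ROW, x_col=0)
print(i_on, i_off)
```

Every enabled cell contributes its current to the same wire, so the sum comes from the circuit itself, with no adder and no weight leaving the array.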
Chip Organization
¨ A hierarchical organization with a configurable reduction tree is used to compute large sums of products.
[Figure: chip hierarchy of mats, subbanks, and banks connected by an H-tree, with a reduction tree and a controller]
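One way to picture the hierarchy: each mat produces a partial sum of products over its slice of the data, and a tree adds the partial sums pairwise. This is a sketch with assumptions: the mat size of four cells, the binary tree shape, and the function names are hypothetical choices for illustration.

```python
def chunk(seq, size):
    """Split a sequence into consecutive slices, one per mat."""
    return [seq[k:k + size] for k in range(0, len(seq), size)]


def mat_partial_sum(weights, states):
    """Sum of products computed locally inside one mat."""
    return sum(w * x for w, x in zip(weights, states))


def tree_reduce(values):
    """Pairwise reduction; returns (total, number of tree levels)."""
    level = list(values)
    if not level:
        return 0, 0
    depth = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd levels with a zero input
        level = [level[k] + level[k + 1] for k in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


weights = list(range(1, 17))  # 16 hypothetical weights
states = [1, 0] * 8           # 16 hypothetical binary states
MAT_SIZE = 4                  # hypothetical number of cells per mat

partials = [mat_partial_sum(w, x)
            for w, x in zip(chunk(weights, MAT_SIZE), chunk(states, MAT_SIZE))]
total, depth = tree_reduce(partials)
print(partials, total, depth)
```

With four mats the tree needs two levels; doubling the number of mats adds one level, so latency grows logarithmically with problem size.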
System Integration
Software configures the on-chip data layout and initiates the optimization by writing to a memory-mapped control register. To maintain ordering, accesses to the accelerator are made uncacheable by the processor. DDR3 reads and writes are used for configuration and data transfer.
1. Configure the DIMM
2. Write weights and states
3. Compute
4. Read the outcome
[Figure: CPU and controller connected to DRAM and to the accelerator DIMM, which holds an m×n model; labels: Start, Ready]
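The four steps can be sketched from the host's point of view. Everything device-specific here is hypothetical: the register offsets, the data window, and the `FakeAcceleratorDimm` class are stand-ins invented for illustration, and the "compute" step is a placeholder sum of products, not the actual optimization. On real hardware the reads and writes would be uncacheable DDR3 accesses to a memory-mapped region.

```python
class FakeAcceleratorDimm:
    """Software stand-in for the accelerator DIMM (hypothetical register map)."""
    CONFIG, START, READY, RESULT = 0, 1, 2, 3  # hypothetical register offsets
    DATA_BASE = 16                             # hypothetical data window

    def __init__(self):
        self.mem = {}

    def write(self, addr, value):
        self.mem[addr] = value
        if addr == self.START and value == 1:
            self.mem[self.READY] = 0
            self._compute()

    def read(self, addr):
        return self.mem.get(addr, 0)

    def _compute(self):
        # Placeholder for the in-memory optimization: a sum of products.
        n = self.mem[self.CONFIG]
        base = self.DATA_BASE
        self.mem[self.RESULT] = sum(
            self.mem[base + i] * self.mem[base + n + i] for i in range(n))
        self.mem[self.READY] = 1


def run_on_accelerator(dev, weights, states):
    n = len(weights)
    dev.write(dev.CONFIG, n)                 # 1. configure the DIMM
    for i, w in enumerate(weights):          # 2. write weights and states
        dev.write(dev.DATA_BASE + i, w)
    for i, x in enumerate(states):
        dev.write(dev.DATA_BASE + n + i, x)
    dev.write(dev.START, 1)                  # 3. compute
    while dev.read(dev.READY) != 1:          # poll until the device is ready
        pass
    return dev.read(dev.RESULT)              # 4. read the outcome


outcome = run_on_accelerator(FakeAcceleratorDimm(), [2, 3, 4], [1, 0, 1])
print(outcome)
```

Program order matters: the start write must reach the device after all data writes, which is why the accesses are made uncacheable.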
Summary of Results
[Figure: two log-scale bar charts (0.01 to 1) comparing the multi-threaded kernel, PIM accelerator, and memristive accelerator. Left: execution time normalized to the single-threaded kernel. Right: system energy normalized to the single-threaded baseline. Annotated improvements: 60x, 34x, 9x, 6x.]