Near Data Processing
Mahdi Nazm Bojnordi, Assistant Professor, School of Computing, University of Utah
CS/ECE 7810: Advanced Computer Architecture
Requirements for Your Presentations
- Prepare for exactly a 20-minute talk followed by 5 minutes of Q&A
- Required components in your presentation:
  - Summary of the work
    - Clearly present the key ideas, mechanisms, and results
  - Strengths and weaknesses
    - Slides highlighting the strengths and weaknesses
  - Discussion
    - Future research directions
    - Alternative ways of solving the problem
Data Processing Trends
- Why processing in memory? Three converging forces drive near-data processing:
  - Scaling limitations: end of Dennard scaling
  - Emerging applications: graph analytics, deep neural nets, etc.
  - New technologies: 3D stacking, resistive memories, etc.
[ref: GraphLab, ZME]
Requirements for Efficient NDP
- Throughput: high processing throughput to match the high memory bandwidth
- Power: thermal constraints limit the clock frequency
- Flexibility: must amortize manufacturing cost through reuse across apps
Memory Technologies
- HMC: hybrid memory cube
[ref: Micron]
High-Bandwidth Memory Buses
- Current DDR4 maxes out at 25.6 GB/sec
- High Bandwidth Memory (HBM), led by AMD and NVIDIA
  - Supports a 1,024-bit-wide bus at 125 GB/sec
- Hybrid Memory Cube (HMC) consortium, led by Intel
  - Claims that 400 GB/sec is possible
- Both are based on stacked memory chips
  - Limited capacity (won't replace DRAM), but much higher than on-chip caches
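The bandwidth figures on the slide follow directly from bus width times transfer rate. A quick back-of-the-envelope sketch (the DDR4-3200 and 1 GT/s-per-pin HBM rates are illustrative assumptions, not from the slide):

```python
# Peak bandwidth = bus width (bytes) x transfer rate (GT/s).
def bandwidth_gb_per_s(bus_bits: int, transfer_rate_gt_s: float) -> float:
    return (bus_bits / 8) * transfer_rate_gt_s

# DDR4-3200: one 64-bit channel at 3.2 GT/s.
ddr4 = bandwidth_gb_per_s(64, 3.2)       # 25.6 GB/s, matching the slide

# First-generation HBM: 1,024-bit interface at ~1 GT/s per pin.
hbm = bandwidth_gb_per_s(1024, 1.0)      # 128 GB/s, close to the quoted 125 GB/s

print(ddr4, hbm)
```

The small gap between 128 and the quoted 125 GB/sec comes from the exact per-pin rate assumed.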
High Bandwidth Memory
Resistive Memory Technology
- 3D crosspoint promises virtually unlimited memory capacity
  - Non-volatile
  - 10x higher density
  - Main limit for now: the 6 GB/sec interface
[figure: crosspoint array with line drivers, decoder, and sense amplifiers]
Near Data Processing
- How to map applications to NDP technology?
- Same converging forces: scaling limitations (end of Dennard scaling), emerging applications (graph analytics, deep neural nets, etc.), new technologies (3D stacking, resistive memories, etc.)
[ref: GraphLab, ZME]
Near Data Processing
- Example:
  - Stacks run memory-intensive code
  - Multiple stacks are linked to the host processor via SerDes (HMC)
Design Challenges
- Communication within and across stacks
  - Fine-grained vs. coarse-grained blocks
[PACT'15]
Efficient NDP Architecture
- Goals: flexibility, area and power efficiency
- HRL: heterogeneous reconfigurable logic
  - Configurable Logic Block: LUTs for embedded control logic; special functions such as sigmoid, tanh, etc.
  - Functional Unit: efficient 48-bit arithmetic/logic ops; registers for pipelining and retiming
  - Output MUX Block: configurable MUXes (tree, cascading, parallel); placed close to the output, low cost and flexible
[HPCA'16]
Power Efficiency
- 2.2x performance/Watt over FPGA
- Within 92% of ASIC performance
[HPCA'16]
Combinatorial Optimization
- Numerous critical problems in science and engineering can be cast within the combinatorial optimization framework.
- Example problems: traveling salesman, knapsack, bin packing, scheduling
- Application domains: pharmaceuticals, communication networks, artificial intelligence, machine learning, data mining, DNA analysis
- Approximate heuristic algorithms: genetic algorithms, ant colony optimization, semi-definite programming, tabu search, simulated annealing, the massively parallel Boltzmann machine
The Boltzmann Machine
- Two-state units connected with real-valued edge weights form a stochastic neural network.
- Goal: iteratively update the state or weight variables to minimize the network energy
    E = -1/2 * sum_i sum_j x_i x_j w_ij
- Each unit j computes its local energy gap from its inputs,
    delta = (2 x_j - 1) * sum_i x_i w_ij
  and switches state with probability 1 / (1 + e^(delta / C)), where C is the control parameter.
Computational Model
- Network energy is minimized by adjusting either the edge weights or recomputing the states.
- Iterative matrix-vector multiplication between the weights and states is critical to finding the minimal network energy.
- Data moves between the memory arrays (holding the w_ij and x_i) and the functional units (computing sums, products, and 1/(1 + e^x)).
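The critical kernel is the repeated weight-matrix-times-state-vector product. A deterministic sketch of that iteration (the thresholding rule below is a simplifying assumption standing in for the stochastic unit update):

```python
import numpy as np

rng = np.random.default_rng(1)

# Symmetric zero-diagonal weights, binary state vector.
n = 8
W = rng.normal(size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
x = rng.integers(0, 2, size=n).astype(float)

# Each sweep is dominated by one matrix-vector product W @ x;
# this is the operation the accelerator maps onto memory arrays.
for _ in range(20):
    x = (W @ x > 0).astype(float)
print(x)
```

Every iteration reads the full weight matrix, which is why data movement between the arrays and the functional units dominates and motivates computing in or near the memory.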
Resistive Random Access Memory
- An RRAM (resistive RAM) cell comprises an access transistor and a resistive switching medium.
- Cells are organized into arrays addressed by wordlines and bitlines.
(source: HP, 2009)
Resistive Random Access Memory
- A read is performed by activating a wordline and measuring the bitline current: I = V / R
Memristive Boltzmann Machine
- Key idea: exploit current summation on the RRAM bitlines to compute the dot product.
- With multiple wordlines active, the bitline current is I = sum_i V / R_i
Memristive Boltzmann Machine
- Memory cells represent the weights; the state variables control the bitlines and wordlines.
- The bitline current thus realizes I = sum_i x_0 x_i w_0i, the sum of products needed by the energy computation.
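The analog trick above can be checked numerically: if each cell's conductance G_ij = 1/R_ij encodes a weight and the binary state vector gates the wordlines, each bitline current is a dot product. A sketch with illustrative (assumed) conductance values:

```python
import numpy as np

V = 1.0                                  # read voltage on active wordlines
G = np.array([[1e-6, 5e-6],              # conductances G_ij = 1/R_ij (siemens),
              [2e-6, 1e-6],              # one column per bitline
              [4e-6, 3e-6]])
x = np.array([1, 0, 1])                  # binary states gate the wordlines

# Kirchhoff's current law sums per-cell currents on each bitline:
# I_j = sum_i x_i * V * G_ij, i.e. a dot product of states and weights.
bitline_current = (x * V) @ G
print(bitline_current)                   # [5e-06 8e-06]
```

Reading the dot product as a current means the weight matrix never leaves the array, eliminating the data movement highlighted in the computational model.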
Chip Organization
- A hierarchical organization (chip, bank, subbank, mat) with a configurable reduction tree is used to compute large sums of products.
- Mats connect through an H-tree; the chip controller and reduction tree combine partial results.
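The reduction tree's role can be sketched as pairwise summation of per-mat partial dot products up the hierarchy. The pairing granularity below is an assumption for illustration:

```python
def reduce_tree(partials):
    """Pairwise-sum partial results, level by level, as a reduction tree would.

    Each pass models one tree level: adjacent partials are combined,
    halving the count until a single sum remains.
    """
    while len(partials) > 1:
        partials = [sum(partials[i:i + 2]) for i in range(0, len(partials), 2)]
    return partials[0]

# Eight mats each contribute a partial sum of products.
print(reduce_tree([3, 1, 4, 1, 5, 9, 2, 6]))  # 31
```

A tree of depth log2(mats) keeps the combine latency low even when many mats contribute to one large dot product.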
System Integration
- DDR3 reads and writes are used for configuration and data transfer.
- Software configures the on-chip data layout and initiates the optimization by writing to a memory-mapped control register (Start/Ready).
- Four-step protocol:
  1. Configure the DIMM
  2. Write the weights and states (the m x n model)
  3. Compute
  4. Read the outcome
- To maintain ordering, accesses to the accelerator DIMM are made uncacheable by the CPU.
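A host-side view of the four-step protocol can be sketched as a software model. The register offsets, method names, and the in-model compute are all illustrative assumptions; the real DIMM is driven by ordinary DDR3 reads and writes to an uncacheable address range:

```python
class AcceleratorDIMM:
    """Software model of the memory-mapped accelerator interface (hypothetical)."""
    CTRL_START, CTRL_READY = 0x0, 0x4    # assumed control-register offsets

    def __init__(self):
        self.regs, self.weights, self.states, self.result = {}, None, None, None

    def configure(self, m, n):           # step 1: set the on-chip data layout
        self.shape = (m, n)

    def write(self, weights, states):    # step 2: load the m x n model
        self.weights, self.states = weights, states

    def compute(self):                   # step 3: write Start, device raises Ready
        self.regs[self.CTRL_START] = 1
        self.result = sum(w * s for w, s in zip(self.weights, self.states))
        self.regs[self.CTRL_READY] = 1

    def read_outcome(self):              # step 4: read the outcome back
        assert self.regs.get(self.CTRL_READY), "accelerator not done"
        return self.result

dimm = AcceleratorDIMM()
dimm.configure(1, 4)
dimm.write([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
dimm.compute()
print(dimm.read_outcome())               # 8.0
```

Because the range is uncacheable, each of these steps maps to DDR3 transactions that reach the device in program order, which is what the protocol relies on.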
Summary of Results
[figure: execution time vs. system energy, both normalized to the single-threaded kernel (log-log); configurations: multi-threaded kernel, PIM accelerator, memristive accelerator; annotated improvements of 6x, 9x, 34x, and 60x]