

SLIDE 1

NEAR DATA PROCESSING

CS/ECE 7810: Advanced Computer Architecture

Mahdi Nazm Bojnordi

Assistant Professor School of Computing University of Utah

SLIDE 2

Requirement for Your Presentations

Prepare for exactly a 20-minute talk followed by 5 minutes of Q&A.

Required components in your presentation:

- Summary of the work: clearly present key ideas, mechanisms, and results
- Strengths and weaknesses: slides highlighting the strengths and weaknesses
- Discussion: future research directions; alternative ways of solving the problem

SLIDE 3

Data Processing Trends

Why processing in memory?

- Emerging applications: graph analytics, deep neural nets, etc.
- Scaling limitations: the end of Dennard scaling
- New technologies: 3D stacking, resistive memories, etc.

[ref: GraphLab, ZME]

SLIDE 4

Requirements for Efficient NDP

- Throughput: high processing throughput to match the high memory bandwidth
- Power: thermal constraints limit clock frequency
- Flexibility: must amortize manufacturing cost through reuse across apps

SLIDE 5

Memory Technologies

HMC: Hybrid Memory Cube

[ref: Micron]

SLIDE 6

High-Bandwidth Memory Buses

- Current DDR4 maxes out at 25.6 GB/s
- High Bandwidth Memory (HBM), led by AMD and NVIDIA: supports a 1,024-bit-wide bus at 125 GB/s
- Hybrid Memory Cube (HMC) consortium, led by Intel: claimed that 400 GB/s is possible
- Both are based on stacked memory chips: limited capacity (won't replace DRAM), but much higher than on-chip caches
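These peak figures follow from bus width times transfer rate. A quick sanity check in Python; the transfer rates below are illustrative assumptions, not values from the slide:

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_per_second):
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * transfers_per_second / 1e9

# DDR4-3200 (assumed): one 64-bit channel at 3.2 GT/s
ddr4 = peak_bandwidth_gb_s(64, 3.2e9)    # 25.6 GB/s
# HBM (assumed): a 1,024-bit stack interface at 1 GT/s
hbm = peak_bandwidth_gb_s(1024, 1.0e9)   # 128 GB/s, in the ballpark of the cited 125 GB/s
```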

SLIDE 7

High Bandwidth Memory

SLIDE 8

High Bandwidth Memory

SLIDE 9

Resistive Memory Technology

3D crosspoint promises virtually unlimited memory:

- Non-volatile
- 10x higher density
- Main limit for now: a 6 GB/s interface

[Figure: crosspoint array with line drivers, sense amplifiers (SA), and a decoder]

SLIDE 10

Near Data Processing

How to map applications to NDP technology?


SLIDE 11

Near Data Processing

Example:

- Stacks run memory-intensive code
- Multiple stacks are linked to the host processor via SerDes links (HMC)

SLIDE 12

Design Challenges

- Communication within and across stacks: fine-grained vs. coarse-grained blocks [PACT'15]

SLIDE 13

Efficient NDP Architecture

- Flexibility, area, and power efficiency
- HRL: heterogeneous reconfigurable logic [HPCA'16]

Configurable Logic Block
- LUTs for embedded control logic
- Special functions: sigmoid, tanh, etc.

Output MUX Block
- Configurable MUXes (tree, cascading, parallel)
- Placed close to the outputs; low cost and flexible

Functional Unit
- Efficient 48-bit arithmetic/logic ops
- Registers for pipelining and retiming

SLIDE 14

Power Efficiency

- 2.2x performance/Watt over FPGA
- Within 92% of ASIC performance

[HPCA’16]

SLIDE 15

Combinatorial Optimization

Numerous critical problems in science and engineering can be cast within the combinatorial optimization framework.

- Combinatorial optimization problems: Traveling Salesman, Knapsack, Scheduling, Machine Learning, Bin Packing
- Application domains: communication networks, data mining, DNA analysis, artificial intelligence, pharmaceuticals
- Approximate heuristic algorithms: genetic algorithms, ant colony optimization, semi-definite programming, simulated annealing, tabu search
- Massively parallel Boltzmann machine (the approach explored here)

SLIDE 16

The Boltzmann Machine

Two-state units connected with real-valued edge weights form a stochastic neural network.

Goal: iteratively update the state or weight variables to minimize the network energy (E).

A unit xⱼ computes the weighted sum of its neighbors' states (x₀w₀,ⱼ, …, x₃w₃,ⱼ in the figure) and flips with probability

P = 1 / (1 + e^(δ/C)),  where C is the control parameter,

δ = (2xⱼ − 1) Σᵢ xᵢ wᵢ,ⱼ

E = −½ Σᵢ Σⱼ xᵢ xⱼ wᵢ,ⱼ
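The update rule above can be sketched in a few lines of Python; this is a plain software model of the slide's equations, not the accelerator's implementation:

```python
import math
import random

def energy(x, w):
    """Network energy E = -1/2 * sum_ij x_i x_j w_ij for binary states x."""
    n = len(x)
    return -0.5 * sum(x[i] * x[j] * w[i][j] for i in range(n) for j in range(n))

def update_unit(x, w, j, C):
    """Stochastically flip unit j with probability 1 / (1 + e^(delta/C)).

    delta = (2*x_j - 1) * sum_i x_i * w_ij is the energy change the flip
    would cause; C is the control parameter (annealing temperature).
    """
    delta = (2 * x[j] - 1) * sum(x[i] * w[i][j] for i in range(len(x)))
    if random.random() < 1.0 / (1.0 + math.exp(delta / C)):
        x[j] = 1 - x[j]
```

Lowering C over time makes energy-increasing flips progressively less likely, as in simulated annealing.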

SLIDE 17

Computational Model

Network energy is minimized by adjusting either the edge weights or recomputing the states.

Iterative matrix-vector multiplication between the weights and states is critical to finding the minimal network energy.

[Figure: weight matrix w and state vector x held in memory arrays, with data movement to functional units computing Σ, ×, and 1/(1 + eˣ)]
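Since every δ needs the weighted sum Σᵢ xᵢwᵢ,ⱼ, a full update sweep is dominated by one matrix-vector product. A minimal NumPy sketch; note it is a batch approximation that reuses a single matvec for the whole sweep:

```python
import numpy as np

def sweep(x, w, C, rng):
    """One update sweep over all units.

    The matvec w @ x computes every unit's weighted sum at once; this is
    the data-movement-heavy step that near-data designs target.
    The sums are not recomputed after each flip, so this approximates
    strictly sequential updates.
    """
    sums = w @ x
    for j in range(len(x)):
        delta = (2 * x[j] - 1) * sums[j]
        if rng.random() < 1.0 / (1.0 + np.exp(delta / C)):
            x[j] = 1 - x[j]
    return x
```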

SLIDE 18

Resistive Random Access Memory

An RRAM cell comprises an access transistor and a resistive switching medium.

[Figure: RRAM cell with a wordline-gated access transistor connecting the bitline to the resistive element, biased at voltage V]

RRAM: Resistive RAM (source: HP, 2009)

SLIDE 19

Resistive Random Access Memory

A read is performed by activating a wordline and measuring the bitline current (I). For a cell in state '1' with resistance R₁:

I = V / R₁

SLIDE 20

Memristive Boltzmann Machine

Key idea: exploit current summation on the RRAM bitlines to compute the dot product.

With multiple wordlines activated ('1'), the shared bitline carries I = Σᵢ V/Rᵢ.

SLIDE 21

Memristive Boltzmann Machine

Memory cells represent the weights, and the state variables are used to control the bitlines and wordlines.

Starting from I = Σᵢ V/Rᵢ: programming the cell conductances so that w₀ᵢ ∝ 1/Rᵢ gives I ∝ Σᵢ w₀ᵢ V, and gating the wordlines with the states x₁…x₄ (and the bitline with x₀) gives I ∝ Σᵢ x₀ xᵢ w₀ᵢ — exactly the sum of products the Boltzmann machine needs.
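The bitline arithmetic can be modeled directly. A small sketch of I = Σ V/Rᵢ as an analog dot product, assuming ideal wires and sensing:

```python
def bitline_current(V, resistances, states):
    """Current on a shared bitline: I = sum of V / R_i over the cells whose
    wordlines are driven (state '1').  With conductances 1/R_i programmed
    proportional to the weights w_i, this evaluates sum_i x_i * w_i * V.
    """
    return sum(V / r for r, x in zip(resistances, states) if x)

# Two active cells of 2 and 4 ohms at 1 V: I = 1/2 + 1/4 = 0.75 A
```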

SLIDE 22

Chip Organization

A hierarchical organization with a configurable reduction tree is used to compute large sums of products.

[Figure: chip organized as banks of subbanks of mats, connected by an H-tree; a reduction tree and a controller combine the partial sums]
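A reduction tree combines per-array partial sums pairwise, level by level. A software sketch of that dataflow (the exact hardware topology is an assumption here):

```python
def reduce_tree(partials):
    """Pairwise reduction: each level adds adjacent partial sums, halving the
    count, the way a hardware reduction tree combines mat/subbank outputs."""
    level = list(partials)
    while len(level) > 1:
        if len(level) % 2:        # pad odd levels so every value has a partner
            level.append(0.0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]
```

A tree of depth ⌈log₂ n⌉ replaces n − 1 sequential additions, which is what makes wide sums cheap in hardware.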

SLIDE 23

System Integration

Software configures the on-chip data layout and initiates the optimization by writing to a memory-mapped control register. To maintain ordering, accesses to the accelerator are made uncacheable by the processor. DDR3 reads and writes are used for configuration and data transfer.

Using the accelerator DIMM:

1. Configure the DIMM
2. Write the weights and states
3. Compute
4. Read the outcome

[Figure: CPU connected through the memory controller (Start/Ready control bits) to DRAM and the accelerator DIMM holding the m×n model]
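The four-step flow can be sketched as a driver loop. Everything below is a hypothetical illustration: the `AcceleratorStub` class and its method names stand in for the real memory-mapped interface, which the paper implements with uncacheable DDR3 reads and writes:

```python
class AcceleratorStub:
    """In-memory stand-in for the accelerator DIMM (illustration only)."""
    def configure(self, rows, cols):
        self.rows, self.cols = rows, cols            # 1. configure the DIMM layout
    def write(self, weights, states):
        self.weights, self.states = weights, states  # 2. write weights and states
        self.done = False
    def start(self):
        self.done = True    # 3. a real device computes asynchronously after Start
    def ready(self):
        return self.done    # Ready bit in the memory-mapped control register
    def read_states(self):
        return self.states  # 4. read the outcome

def run_optimization(accel, weights, states):
    accel.configure(len(weights), len(weights[0]))
    accel.write(weights, states)
    accel.start()
    while not accel.ready():   # poll Ready; uncacheable reads hit the device
        pass
    return accel.read_states()
```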


SLIDE 28

Summary of Results

[Figure: execution time normalized to the single-threaded kernel and system energy normalized to the single-threaded baseline, on log scales from 0.01 to 1, comparing a multi-threaded kernel, a PIM accelerator, and the memristive accelerator; callouts mark 60x, 34x, 9x, and 6x improvements]