Emerging NVM Enabled Storage Architecture: From Evolution to Revolution (PowerPoint PPT Presentation)



SLIDE 1

Yiran Chen

Electrical and Computer Engineering University of Pittsburgh Sponsors: NSF, DARPA, AFRL, and HP Labs

Emerging NVM Enabled Storage Architecture:

From Evolution to Revolution.

SLIDE 2

Outline

  • Introduction
  • Evolution with eNVM:

– On‐chip high speed storage;
– Off‐chip secondary storage;

  • Revolution with eNVM:

– Memristor‐based neuromorphic accelerator

  • Conclusion
SLIDE 3

Conventional Memory Scaling

Years     | Node      | M                           | P            | A                           | G         | C  | V
2012–2013 | 38nm–32nm | Stacked MIM                 | Planar       | 6F2, bWL                    | poly/SiO2 | Si | 1.35V
2014–2015 | 29nm–22nm | Stacked MIM                 | Planar, HKMG | 6F2, bWL                    | HKMG      | Si | 1.2V
2016–2017 | 22nm–16nm | Stacked MIM                 | Planar       | 6F2, bBL, LBL, 1T1C (VFET)  | HKMG      | Si | 1.1V
2018–2019 | 16nm–14nm | FBRAM, STT‐RAM, RRAM, PCRAM | Planar       | 4F2, 1T, 1T1R, 1TMTJ (VFET) | HKMG      | Si | ~1V

[Figures: DRAM capacitor aspect ratio (A/R) has climbed toward 100, far beyond the Burj Khalifa's A/R=6; gate oxide thickness TOX has thinned from ~60Å toward 11Å, 9Å, 8Å, 7Å, 5Å, 3Å with technology node; Mb/chip has grown from ~10^1 to ~10^4 between 1990 and 2010; interface speed has risen from EDO (50 Mbps) through SDRAM (133) and DDR1 (200–400) to DDR2 (400–800) and DDR3 (800–1600 Mbps).]

Sources: ASML, ITRS, IMEC, Hynix, IBM

Intrinsic difficulty of charge-based computing and storage!

SLIDE 4

Emerging Nonvolatile Memory

SLIDE 5

Memory Technologies Comparison

           | Data Retention | Memory Cell (F2) | Read Time | Write/Erase Time | Number of Rewrites | R/W Power Consumption | Power Consumption Other than R/W
SRAM       | N              | 120–140          | 0.2 ns    | 70 ps            | 10^16              | Low                   | Leakage current
DRAM       | 4 ms           | 7–9              | 2 ns      | 1 ns             | 10^16              | Low                   | Refresh power
NAND FLASH | 10 y           | 4                | 0.1 ms    | 1/0.1 ms         | 10^5               | High                  | None
PCRAM      | >10 y          | 4                | 12 ns     | <50 ns           | 10^8               | Low                   | None
STT‐RAM    | >10 y          | 8                | 5–10 ns   | 5–10 ns          | 10^15              | Low                   | None
ReRAM      | >10 y          | <1               | <10 ns    | <10 ns           | 10^15              | Low                   | None

Source: ITRS ERD workshop presentation by Prof. Y. Chen

SLIDE 6

Challenges:

  • Identifying the evolutionary applications that can
    – be easily and seamlessly integrated into the current memory hierarchy and computing platform;
    – fully leverage the advantages of emerging NVM;
    – not be easily replaced by alternative technologies or architectures.
  • Inventing a revolutionary computing and storage architecture that can
    – offer a high‐performance, power‐efficient, and scalable computing model;
    – provide a truly seamless integration between computing and memory.

SLIDE 7

Outline

  • Introduction
  • Evolution with eNVM:

– On‐chip high speed storage;

  • STT‐RAM based 3D cache for CPU.
  • Racetrack based register file for GPU.

– Off‐chip secondary storage;

  • Revolution with eNVM:

– Memristor‐based neuromorphic accelerator.

  • Conclusion
SLIDE 8

STT‐RAM based 3D cache

Spin‐Transfer Torque Random Access Memory (STT‐RAM): a scalable technology.

[Figure: 1T‐1MTJ STT‐RAM cell schematic. The magnetic tunneling junction (MTJ) stacks a free layer, an MgO layer, and a reference layer between the bit‐line and the source‐line, gated by the word‐line; the direction of the write current through the MTJ writes '1' or '0'.]

SLIDE 9

  • Pros: Low leakage power, high density.
  • Cons: Long write latency and large write power.

SRAM vs. MRAM (STT‐RAM)

              | SRAM     | MRAM
Area (65nm)   | 3.66 mm2 | 3.30 mm2
Capacity/Bank | 128KB    | 512KB
Read latency  | 2.25 ns  | 2.32 ns
Write latency | 2.26 ns  | 11.02 ns
Read energy   | 0.90 nJ  | 0.86 nJ
Write energy  | 0.80 nJ  | 5.00 nJ

Cache configuration       | Leakage power
2MB (16x128KB) SRAM cache | 2.09 W
8MB (16x512KB) MRAM cache | 0.26 W

SLIDE 10

STT‐RAM based 3D cache

  • Baseline 3D Architecture

– Core layer + cache layers;
– NUCA caches with NoC connections.

[Figure: the core layer (layer 1) is stacked on cache layers (layer 2) through TSVs; each cache bank attaches to a router (R), and data migration takes horizontal and vertical hops, managed by the cache controller.]

  • G. Sun, X. Dong, Y. Xie, J. Li, Y. Chen, HPCA, 2009.
SLIDE 11

STT‐RAM based 3D cache

  • Challenges: long write latency of STT‐RAM.
  • Solution 1 (S1): Read‐Preemptive Write Buffer.

[Figure: cores send write and read requests to the STT‐RAM caches through a FIFO write buffer. A read request preempts a write that has just begun, but waits when the write is almost done; read data returns directly to the cores.]
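The read‐preemptive policy can be sketched in a few lines of Python. This is an illustrative model, not the authors' implementation: the FIFO buffer, the progress tracking, and the ALMOST_DONE threshold are all assumptions.

```python
from collections import deque

# Assumed cutoff: a write this far along is allowed to finish before a read.
ALMOST_DONE = 0.9

class ReadPreemptiveWriteBuffer:
    def __init__(self):
        self.pending = deque()   # FIFO write buffer of addresses
        self.active = None       # (address, progress in [0, 1]) of the in-flight write

    def issue_write(self, addr):
        self.pending.append(addr)

    def issue_read(self, addr):
        """Preempt the in-flight write unless it is almost done."""
        if self.active is not None and self.active[1] < ALMOST_DONE:
            # Cancel the write and re-queue it at the head of the buffer.
            self.pending.appendleft(self.active[0])
            self.active = None
        return addr  # the read now proceeds to the STT-RAM bank

    def tick(self, step=0.25):
        """Advance the in-flight write; start the next buffered write when idle."""
        if self.active is None:
            if self.pending:
                self.active = (self.pending.popleft(), 0.0)
        else:
            addr, p = self.active
            self.active = None if p + step >= 1.0 else (addr, p + step)
```

A read arriving while a write has just begun cancels it and is served first, hiding STT‐RAM's long write latency from the critical read path.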

SLIDE 12

STT‐RAM based 3D cache

  • Solution S2: SRAM‐MRAM Hybrid L2 Cache

[Figure: eight cores stacked over the L2 cache layer through TSVs. Each cache set replaces one STT‐RAM (MRAM) way with an SRAM way: a 32‐way STT‐RAM cache becomes a 31‐way STT‐RAM + 1‐way SRAM hybrid.]
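A minimal sketch of how such a hybrid set might steer write‐hot blocks into its single SRAM way. The migration counter and the WRITE_HOT threshold are assumptions for illustration, not the paper's policy.

```python
SRAM_WAY = 0    # way 0 is SRAM; ways 1..31 are STT-RAM (MRAM)
WRITE_HOT = 2   # assumed number of writes before a block is considered write-hot

class HybridCacheSet:
    def __init__(self, ways=32):
        self.tags = [None] * ways
        self.write_counts = [0] * ways

    def on_write(self, tag):
        """Handle a write to `tag`; return the way it ends up in."""
        way = self._find(tag)
        if way is None:
            way = self._fill(tag)
        self.write_counts[way] += 1
        # Migrate a write-hot block from an MRAM way into the SRAM way,
        # keeping expensive STT-RAM writes off the hot block.
        if way != SRAM_WAY and self.write_counts[way] >= WRITE_HOT:
            self._swap(way, SRAM_WAY)
            way = SRAM_WAY
        return way

    def _find(self, tag):
        return self.tags.index(tag) if tag in self.tags else None

    def _fill(self, tag):
        # Prefer an empty MRAM way; fall back to evicting way 1 (no LRU here).
        for w in range(1, len(self.tags)):
            if self.tags[w] is None:
                self.tags[w] = tag
                return w
        self.tags[1] = tag
        self.write_counts[1] = 0
        return 1

    def _swap(self, a, b):
        self.tags[a], self.tags[b] = self.tags[b], self.tags[a]
        self.write_counts[a], self.write_counts[b] = self.write_counts[b], self.write_counts[a]
```

Repeated writes to the same block end up hitting the fast, write-cheap SRAM way, while read-mostly blocks stay in the dense, low-leakage MRAM ways.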

SLIDE 13

STT‐RAM based 3D cache

  • Result (S1 & S2):

– Performance is improved by 4.91% compared with the STT‐RAM baseline;
– Power consumption is reduced by 73.5%.

[Figure: normalized IPC and power for 2M‐SRAM‐DNUCA, 8M‐MRAM‐DNUCA, and 8M‐Hybrid‐DNUCA configurations.]

SLIDE 14

Outline

  • Introduction
  • Evolution with eNVM:

– On‐chip high speed storage;

  • STT‐RAM based 3D cache for CPU.
  • Racetrack based register file for GPU.

– Off‐chip secondary storage;

  • Revolution with eNVM:

– Memristor‐based neuromorphic accelerator.

  • Conclusion
SLIDE 15

Racetrack for GPU

  • Racetrack cell:

– Two fixed pinning regions delimit the free region and the fixed region;
– Write '0', write '1', and read operations are controlled through the write/read word‐lines (WWL, RWL), source‐line (SL), and bit‐line (BL).

[Figure: cell stack of pinning layers, free layer, and reference layer.]

  • Racetrack:

– A racetrack is a magnetic track;
– Injected current moves the cells along the track;
– Data is accessed at fixed access ports.

SLIDE 16

Racetrack for GPU

  • Benefits from Racetrack:

– Extremely small cell size;

  • Major challenges:

– Shifting caused delay/energy.

  • Warp register remapping (WRR)

– 60.0% of the register file is allocated during execution;
– With a non‐optimal warp register mapping, the maximum shift distance is 8 cells;
– WRR interleaves the warp registers across the access ports, reducing the maximum shift distance to 4 cells.

[Figure: racetrack register file array with write/read word‐lines (WWL, RWL) and source/bit lines (SL, BL), plus row decoder, write/read/shift drivers, column mux, sense amplifier arrays, shift controller, and an arbitrator remapping warp registers.]

  • M. Mao, W. Wen, Y. Zhang, Y. Chen, H. Li, DAC 2014
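The effect of interleaving can be sketched numerically. The geometry below (a 64‐cell track with 4 evenly spaced access ports, 8 registers per warp) is an illustrative assumption, not the configuration from the paper; the point is only that interleaving keeps every register close to some port.

```python
# Assumed track geometry for illustration.
CELLS_PER_TRACK = 64
NUM_PORTS = 4
PORT_SPACING = CELLS_PER_TRACK // NUM_PORTS
PORTS = [p * PORT_SPACING for p in range(NUM_PORTS)]  # ports at cells 0, 16, 32, 48

def shift_distance(cell):
    """Cells the track must shift to align `cell` with its nearest access port."""
    return min(abs(cell - port) for port in PORTS)

def max_shift(cells):
    """Worst-case shift distance over a set of register cell positions."""
    return max(shift_distance(c) for c in cells)

# One warp's 8 registers, mapped two ways:
naive = list(range(8))                                           # packed beside port 0
wrr = [PORTS[i % NUM_PORTS] + i // NUM_PORTS for i in range(8)]  # interleaved across ports
```

With the packed mapping the farthest register sits 7 cells from a port; with interleaving no register is more than 1 cell away, so the worst-case shift (and its delay/energy) drops sharply.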
SLIDE 17

Racetrack for GPU

  • Write buffer

– "Piggyback‐write" writes back to the RF from the write buffer;
– It relies on the track movement triggered by read requests;
– Positive side effect: it filters redundant RF reads/writes by leveraging RAW and WAW dependencies.

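The filtering side effect can be sketched as a small model. The buffer organization here is an assumption for illustration: a read that hits a buffered write is served from the buffer (RAW), a second write to the same register overwrites the buffered value (WAW), and a buffered value whose track position is already aligned by a read drains for free.

```python
class RFWriteBuffer:
    def __init__(self):
        self.buffered = {}    # register id -> pending value awaiting write-back
        self.rf_accesses = 0  # accesses that actually reach the racetrack RF

    def write(self, reg, value):
        # WAW: overwrite in place; the older pending write never reaches the RF.
        self.buffered[reg] = value

    def read(self, reg, rf):
        # RAW: a read that hits a buffered write is served from the buffer.
        if reg in self.buffered:
            return self.buffered[reg]
        self.rf_accesses += 1
        return rf.get(reg)

    def piggyback_drain(self, reg_at_port):
        """When a read has already shifted the track under an access port,
        drain any buffered value destined for that position at no shift cost."""
        if reg_at_port in self.buffered:
            return (reg_at_port, self.buffered.pop(reg_at_port))
        return None
```

Neither the RAW-served read nor the overwritten WAW write touches the racetrack array, which is where the energy savings come from.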

SLIDE 18

Racetrack for GPU

  • Experimental results:

– Baseline: SRAM‐based register files;
– Energy reduction: 59%;
– Performance improvement: 4%.

SLIDE 19

Outline

  • Introduction
  • Evolution with eNVM:

– On‐chip high speed storage;
– Secondary storage;

  • PCRAM and NAND hybrid SSD;
  • Revolution with eNVM:

– Memristor‐based neuromorphic accelerator.

  • Conclusion
SLIDE 20

Hybrid SSD

  • Memory hierarchy

[Figure: memory hierarchy access latencies (courtesy: Al Fazio, Intel): on‐chip memory 1~30 cycles, off‐chip memory 100~300 cycles, solid state disk (flash) 25K~2M cycles. Flash uses page mode rather than random access and requires erase‐before‐write (EBW) instead of in‐place update (IPU); an erase unit holds pages PN=0..n, some of which become invalid.]

SLIDE 21

PRAM (PCM) Cell

  • One transistor/diode and one GST (GeSbTe) element.
  • Supports in‐place updating (IPU).

High resistance (amorphous state): '0'; low resistance (crystalline state): '1'.

[Figure: cell cross‐sections in the amorphous and crystalline states: top electrode, GST, heater, bottom electrode, over an N+ substrate.]

SLIDE 22

Hybrid SSD

  • Conventional SSD: FLASH.
  • Promising candidate: PRAM (phase change).
  • To combine the benefits of both technologies: hybrid SSD.
  • Two uses:
    – Performance;
    – Reliability.

SLIDE 23

Hybrid SSD: performance enhancement

[Figure: the merge operation, which is time‐consuming. Valid and invalid pages (PN=page number; V=valid; I=invalid) scattered across erase units 1 and 2 are consolidated into erase unit 3, leaving empty pages at the end. Erase unit = 128/256KB; page = 512 Bytes ~ 8KB.]

G. Sun, Y. Joo, Y. Chen, Y. Xie, Y. Chen, H. Li, HPCA, 2010.
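The merge a log‐structured flash FTL pays for can be sketched as follows. This is a simplified model (dict‐based page maps, fixed erase cost of two units), not the paper's FTL: the newest copy of each page is taken from the log unit, surviving pages are copied from the data unit, and both old units are then erased. PRAM log pages sidestep this entirely because they can be updated in place.

```python
def merge(data_unit, log_unit):
    """Merge a data erase unit with its log unit into a fresh unit.

    Each unit maps page number -> (value, valid_flag).
    Returns (merged pages, page copies paid, unit erases paid)."""
    merged = {}
    # Log pages hold the newest versions and take priority.
    for pn, (value, valid) in log_unit.items():
        if valid:
            merged[pn] = value
    # Copy surviving data pages that the log did not supersede.
    for pn, (value, valid) in data_unit.items():
        if valid and pn not in merged:
            merged[pn] = value
    copies = len(merged)   # every surviving page is physically copied once
    erases = 2             # both the old data unit and the log unit are erased
    return merged, copies, erases
```

Even a single updated page forces copying every still-valid page plus two block erases, which is why the slide calls the merge operation time-consuming.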

SLIDE 24

Hybrid SSD: performance enhancement

[Figure: hybrid architecture, physical and structural views. A data buffer in memory backs a data region of NAND flash erase units and a log region of PRAM sectors (512 Bytes) that support in‐place updating.]

SLIDE 25

Different Log Assignments

[Figure: three log assignment schemes between the data region (erase units) and the log region:
– Static log assignment: fixed assignment of log pages to each erase unit;
– Group log assignment: log pages organized and shared within a group of erase units;
– Dynamic log assignment: log pages assigned to erase units on demand.]
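A minimal sketch of the dynamic scheme, where log pages come from a shared pool so write‐hot erase units can claim more of them. The pool/counter structure below is an assumption for illustration, not the paper's data structure.

```python
class DynamicLogAllocator:
    """Log pages are drawn on demand from one shared pool (dynamic assignment),
    unlike the static scheme that fixes log pages per erase unit."""

    def __init__(self, total_log_pages):
        self.free = total_log_pages
        self.per_unit = {}   # erase unit id -> log pages currently held

    def request(self, unit):
        """Grant one log page to `unit`, or refuse if the pool is empty."""
        if self.free == 0:
            return False     # caller must merge/reclaim before logging more
        self.free -= 1
        self.per_unit[unit] = self.per_unit.get(unit, 0) + 1
        return True

    def reclaim(self, unit):
        """After a merge, the unit's log pages return to the shared pool."""
        self.free += self.per_unit.pop(unit, 0)
```

A write-hot unit can absorb most of the pool until it is merged, whereas a static scheme would stall it as soon as its fixed quota filled, even with idle log pages elsewhere.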

SLIDE 26

Hybrid SSD: performance enhancement

SLIDE 27

Outline

  • Introduction
  • Evolution with eNVM:

– On‐chip high speed storage;
– Secondary storage;

  • Revolution with eNVM:

– Memristor‐based neuromorphic accelerator.

  • Conclusion
SLIDE 28

Computing: Present and Future

[Figure: clock frequency (MHz, log scale) from 1990 to 2010 flattens as the multi‐core era begins.]

New Trend:

  • Multi‐core, advanced power management, large on‐chip storage.

Future:

  • Heterogeneous system, Brain‐like computing.

Source: CPU DB, Intel

[Figure: power density (mW/mm2, log scale) from 1990 to 2010 climbs toward hot plate, nuclear reactor, and rocket launch levels; inset: a neural network.]

SLIDE 29

[Figure: gray matter and white matter; the neocortex has 6 layers, and signals travel within and between layers.]

Brain – The Most Efficient Computing Machine

Brain:

  • 15–30B neurons;
  • An extremely complex organ;
  • ~4 km of wiring per mm3;
  • ~35 W power budget.

Neuron:

  • Processes signals from other neurons.

Synapse:

  • Memory; weights the signals in a neural network.

SLIDE 30

Brain‐like Neuromorphic Circuits

Highly parallel, ultra power efficient, flexible, and extremely robust; takes real‐world input and produces human‐friendly output; data friendly.

Slow progress in neuromorphic hardware implementation:

  • Lack of efficient synapse designs;
  • No support for massive connectivity.
SLIDE 31

Memristor – Rebirth of Neuromorphic Circuits

[Figure: measured device behavior: resistance vs. pulse number over ~70 pulses, and a current–voltage sweep over ±4 V.]

  • Memristor: two‐terminal, high density; non‐volatile; analog/multi‐level states.
  • Crossbar: natural matrix function; a MIMO system; a good combination with the memristor.

Memristor ↔ Synapse    Crossbar ↔ Network

[Figures: a TaN1+x device (HP lab, 2012); a crossbar array schematic (EI lab, DAC'12); a TiN‐TaOx device programmed with pulses growing linearly in amplitude (EI lab, APL'13; EI lab & HP lab).]
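The crossbar's "natural matrix function" follows from Ohm's and Kirchhoff's laws: with input voltages applied to the rows and memristor conductances at the crosspoints, each column current is a dot product of voltages and conductances, so the array computes a matrix-vector multiply in one analog step. A numerical sketch (an idealized model that ignores wire resistance and device non-idealities):

```python
def crossbar_output(voltages, conductances):
    """Column currents of an ideal memristor crossbar.

    voltages:     V[i], input on row i
    conductances: G[i][j], memristor at the crossing of row i and column j
    Column j collects I[j] = sum_i V[i] * G[i][j] (Kirchhoff's current law),
    i.e. the crossbar evaluates I = V * G in a single analog step."""
    rows, cols = len(conductances), len(conductances[0])
    assert len(voltages) == rows
    return [sum(voltages[i] * conductances[i][j] for i in range(rows))
            for j in range(cols)]
```

Because each conductance G[i][j] plays the role of a synaptic weight, programming the crossbar programs the network's weight matrix, which is what makes the memristor-synapse and crossbar-network pairing natural.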

SLIDE 32

Conclusion

  • Emerging nonvolatile memory (NVM) technologies such as STT‐RAM, racetrack, and PRAM deliver significant improvements for various applications.
  • Challenges exist and can be solved by architecture‐level optimization.
  • Innovation in revolutionary architectures that provide multi‐order speedup, power efficiency improvement, and hardware cost reduction is promising.