Emerging NVM Enabled Storage Architecture: From Evolution to Revolution
Yiran Chen, Electrical and Computer Engineering, University of Pittsburgh
Sponsors: NSF, DARPA, AFRL, and HP Labs
Outline
- Introduction
- Evolution with eNVM:
– On‐chip high speed storage;
– Off‐chip secondary storage;
- Revolution with eNVM:
– Memristor‐based neuromorphic accelerator
- Conclusion
Conventional Memory Scaling
2012–2013, 38nm–32nm: M: Stacked MIM; P: Planar; A: 6F2, bWL; G: poly/SiO2; C: Si; V: 1.35V
2014–2015, 29nm–22nm: M: Stacked MIM; P: Planar, HKMG; A: 6F2, bWL; G: HKMG; C: Si; V: 1.2V
2016–2017, 22nm–16nm: M: Stacked MIM; P: Planar; A: 6F2, bBL, LBL, 1T1C (VFET); G: HKMG; C: Si; V: 1.1V
2018–2019, 16nm–14nm: M: FBRAM, STT‐RAM, RRAM, PCRAM; P: Planar; A: 4F2, 1T, 1T1R, 1TMTJ (VFET); G: HKMG; C: Si; V: ~1V
[Figure: DRAM scaling trends, 1990–2010: capacitor aspect ratio (A/R) rising steeply toward 100 (for comparison, the Burj Khalifa has A/R=6); gate oxide thickness (TOX) shrinking from 11Å to 3Å across technology nodes; Mb/chip growing by three orders of magnitude; interface bandwidth rising from EDO (50 Mbps) through SDRAM (133 Mbps), DDR1 (200–400 Mbps), and DDR2 (400–800 Mbps) to DDR3 (800–1600 Mbps).]
Sources: ASML, ITRS, IMEC, Hynix, IBM
Intrinsic difficulty of charge‐based computing and storage!
Emerging Nonvolatile Memory
Memory Technologies Comparison
| Technology | Data Retention | Memory Cell (F²) | Read Time | Write/Erase Time | Number of Rewrites | R/W Power | Power Other than R/W |
|---|---|---|---|---|---|---|---|
| SRAM | N | 120–140 | 0.2 ns | 70 ps | 10^16 | Low | Leakage current |
| DRAM | 4 ms | 7–9 | 2 ns | 1 ns | 10^16 | Low | Refresh power |
| NAND Flash | 10 y | 4 | 0.1 ms | 1/0.1 ms | 10^5 | High | None |
| PCRAM | >10 y | 4 | 12 ns | <50 ns | 10^8 | Low | None |
| STT‐RAM | >10 y | 8 | 5–10 ns | 5–10 ns | 10^15 | Low | None |
| ReRAM | >10 y | <1 | <10 ns | <10 ns | 10^15 | Low | None |
Source: ITRS ERD workshop presentation by Prof. Y. Chen
Challenges:
- Identifying the evolutionary applications that can:
– Be easily and seamlessly integrated into the current memory hierarchy and computing platform;
– Fully leverage the advantages of emerging NVM;
– Not be easily replaced by other alternative technologies or architectures.
- Inventing a revolutionary computing and storage architecture that can:
– Offer a high‐performance, power‐efficient, and scalable computing model;
– Provide a truly seamless integration between computing and memory.
Outline
- Introduction
- Evolution with eNVM:
– On‐chip high speed storage;
- STT‐RAM based 3D cache for CPU.
- Racetrack based register file for GPU.
– Off‐chip secondary storage;
- Revolution with eNVM:
– Memristor‐based neuromorphic accelerator.
- Conclusion
STT‐RAM based 3D cache
Spin‐Transfer Torque Random Access Memory: a scalable technology.
[Figure: 1T‐1MTJ STT‐RAM cell schematic. A magnetic tunneling junction (MTJ) of a reference layer, an MgO barrier layer, and a free layer sits between the bit‐line and the source‐line, selected by a word‐line transistor; the direction of the write current through the MTJ writes ‘0’ or ‘1’.]
- Pros: low leakage power, high density.
- Cons: long write latency and large write power.
SRAM vs. MRAM (STT‐RAM)
| | SRAM | MRAM (STT‐RAM) |
|---|---|---|
| Area (65nm) | 3.66 mm² | 3.30 mm² |
| Capacity/Bank | 128 KB | 512 KB |
| Read latency | 2.25 ns | 2.32 ns |
| Write latency | 2.26 ns | 11.02 ns |
| Read energy | 0.90 nJ | 0.86 nJ |
| Write energy | 0.80 nJ | 5.00 nJ |
| Cache configuration | 2 MB (16×128 KB) | 8 MB (16×512 KB) |
| Leakage power | 2.09 W | 0.26 W |
STT‐RAM based 3D cache
- Baseline 3D Architecture:
– Core layer + cache layers.
– NUCA caches with NoC connections.
[Figure: a core layer (Layer 1) stacked with a cache layer (Layer 2) connected by TSVs; each cache bank attaches to a router (R), with horizontal hops within a layer, vertical hops between layers, and data migration between banks.]
- G. Sun, X. Dong, Y. Xie, J. Li, Y. Chen, HPCA, 2009.
STT‐RAM based 3D cache
- Challenges: long write latency of STT‐RAM.
- Solution 1 (S1): Read‐Preemptive Write Buffer.
[Figure: a FIFO write buffer between the cores and the STT‐RAM caches. A read request may preempt a write that has just begun, but a write that is almost done is allowed to finish before the read is served.]
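The read‐preemptive policy can be sketched as a small model. This is an illustrative sketch, not the paper's implementation: the write latency, the preemption threshold, and the class structure are all assumptions.

```python
# Sketch of a read-preemptive write buffer policy (illustrative only).
# Assumption: a read may cancel an in-flight write only if the write has
# completed less than `threshold` of its (long) STT-RAM write latency.

from collections import deque

class ReadPreemptiveWriteBuffer:
    def __init__(self, write_latency=11, threshold=0.5):
        self.pending_writes = deque()   # FIFO of buffered write addresses
        self.in_flight = None           # (address, cycles_done) or None
        self.write_latency = write_latency
        self.threshold = threshold

    def issue_write(self, addr):
        self.pending_writes.append(addr)

    def on_read(self, addr):
        """Return True if the read can proceed immediately."""
        if self.in_flight is None:
            return True
        waddr, done = self.in_flight
        if done < self.threshold * self.write_latency:
            # Write just began: preempt it and retry it later.
            self.pending_writes.appendleft(waddr)
            self.in_flight = None
            return True
        return False  # Write is almost done: let it finish first.

    def tick(self):
        """Advance one cycle; start or progress a buffered write."""
        if self.in_flight is None and self.pending_writes:
            self.in_flight = (self.pending_writes.popleft(), 0)
        elif self.in_flight is not None:
            addr, done = self.in_flight
            if done + 1 >= self.write_latency:
                self.in_flight = None  # Write retired.
            else:
                self.in_flight = (addr, done + 1)
```

A read that arrives early in a write's service cancels and re‐queues the write; a read that arrives late simply waits, so the long STT‐RAM write is not wasted.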
STT‐RAM based 3D cache
- Solution S2: SRAM‐MRAM Hybrid L2 Cache
[Figure: a layer of eight cores stacked over a cache layer of MRAM banks and SRAM banks connected by TSVs; each cache set changes from 32‐way STT‐RAM to 31‐way STT‐RAM plus 1‐way SRAM.]
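One way to sketch the hybrid‐set idea: a block that is written repeatedly is migrated into the set's single SRAM way, so the long STT‐RAM write latency is paid less often. The migration threshold, the SRAM way index, and the data structures below are hypothetical illustrations, not the paper's mechanism.

```python
# Illustrative hybrid-set placement policy. Assumptions: way 0 of each
# set is the SRAM way, and a block migrates there after MIGRATE_AFTER
# writes (a hypothetical threshold).

SRAM_WAY = 0          # way index implemented in SRAM (assumption)
MIGRATE_AFTER = 2     # writes before migration (hypothetical threshold)

def on_write_hit(cache_set, way, write_counts):
    """Count writes per block; migrate hot-written blocks to the SRAM way.
    Returns the way the block occupies after the write."""
    addr = cache_set[way]
    write_counts[addr] = write_counts.get(addr, 0) + 1
    if way != SRAM_WAY and write_counts[addr] >= MIGRATE_AFTER:
        # Swap the write-hot block into the SRAM way.
        cache_set[SRAM_WAY], cache_set[way] = cache_set[way], cache_set[SRAM_WAY]
        return SRAM_WAY
    return way
```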
STT‐RAM based 3D cache
- Result (S1 & S2):
– Performance is improved by 4.91% compared with the STT‐RAM baseline.
– Power consumption is reduced by 73.5%.
[Figure: normalized IPC and power for the 2M SRAM DNUCA, 8M MRAM DNUCA, and 8M hybrid DNUCA configurations.]
Outline
- Introduction
- Evolution with eNVM:
– On‐chip high speed storage;
- STT‐RAM based 3D cache for CPU.
- Racetrack based register file for GPU.
– Off‐chip secondary storage;
- Revolution with eNVM:
– Memristor‐based neuromorphic accelerator.
- Conclusion
Racetrack for GPU
- Racetrack cell:
– Two fixed pinning regions and a free region;
– Write ‘0’, write ‘1’, and read operations via the WWL, RWL, SL, and BL terminals.
[Figure: cell stack with pinning layers, a free layer, and a reference layer.]
- Racetrack:
– A magnetic track of cells;
– Inject current to move the cells along the track;
– Cells are read and written at access ports.
Racetrack for GPU
- Benefits from Racetrack:
– Extremely small cell size;
- Major challenges:
– Delay/energy overhead caused by shifting.
- Warp register remapping (WRR):
– 60.0% of the RF is allocated during execution;
– A non‐optimal warp register mapping has a maximum shift distance of 8 cells;
– WRR interleaves the warp registers across the access ports, reducing the maximum shift distance to 4 cells.
[Figure: racetrack register file organization with a row decoder, write/read/shift drivers, a column mux, sense amplifier arrays, a shift controller, and an arbitrator interleaving warp registers across access ports.]
- M. Mao, W. Wen, Y. Zhang, Y. Chen, H. Li, DAC 2014
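The shift‐distance claim can be illustrated with a toy model. The assumptions here (one access port per 8‐cell segment, each port parked at offset 0, and a shift cost equal to a register's offset within its segment) are simplifications for illustration, not the paper's exact model.

```python
# Toy model of warp-register remapping (WRR) on a racetrack register file.
# Assumption: ports sit every SEG cells, parked at offset 0 of their
# segment, and reaching a register costs its offset in shifts.

SEG = 8  # cells per access-port segment (assumption)

def max_shift(mapping):
    """Worst-case number of shifts to access any register of one warp,
    given the warp's register -> cell mapping."""
    return max(cell % SEG for cell in mapping)

# Non-optimal mapping: all 8 registers of a warp under a single port.
naive = list(range(8))            # offsets 0..7 under one port

# WRR: interleave the warp's registers across two ports' segments.
wrr = [0, 1, 2, 3, 8, 9, 10, 11]  # offsets 0..3 under each of two ports
```

In this model the naive layout needs up to 7 shifts while the interleaved layout needs at most 3, mirroring the roughly halved worst‐case shift distance on the slide.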
Racetrack for GPU
- Write buffer
– “Piggyback write”: write back from the write buffer to the RF;
– Relies on the track movement triggered by read requests;
– Positive side effect: filters redundant RF reads/writes by leveraging RAW and WAW dependencies.
[Figure: write buffer entries feeding the EXE/MEM stages.]
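The RAW/WAW filtering can be sketched as a small buffer model; the class and its fields are illustrative assumptions, not the hardware structure from the paper.

```python
# Sketch of the write buffer's redundancy filtering (structures assumed):
# a later write to the same register overwrites the buffered value (WAW),
# and a read of a buffered register is served from the buffer (RAW), so
# neither access has to touch the racetrack register file.

class RFWriteBuffer:
    def __init__(self):
        self.entries = {}      # reg -> value awaiting write-back
        self.rf_accesses = 0   # racetrack RF accesses actually issued

    def write(self, reg, value):
        # WAW: overwriting a buffered entry costs no RF access.
        self.entries[reg] = value

    def read(self, reg, rf):
        if reg in self.entries:    # RAW: forward from the buffer.
            return self.entries[reg]
        self.rf_accesses += 1      # Only a miss reaches the RF.
        return rf[reg]
```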
Racetrack for GPU
- Experiment results:
– Baseline: SRAM‐based register files.
– Energy reduction: 59%.
– Performance improvement: 4%.
Outline
- Introduction
- Evolution with eNVM:
– On‐chip high speed storage; – Secondary storage;
- PCRAM and NAND hybrid SSD;
- Revolution with eNVM:
– Memristor‐based neuromorphic accelerator.
- Conclusion
Hybrid SSD
- Memory hierarchy:
– On‐chip memory: 1–30 cycles;
– Off‐chip memory: 100–300 cycles;
– Solid state disk (Flash): 25K–2M cycles. (Courtesy: Al Fazio, Intel)
- Flash is accessed in page mode rather than by random access, and requires erase‐before‐write (EBW) rather than in‐place update (IPU).
[Figure: an erase unit holding pages PN=0, V through PN=n, V; invalidated pages (X) cannot be rewritten until the whole unit is erased.]
PRAM (PCM) Cell
- One transistor/diode and one GST (GeSbTe) element.
- Supports in‐place updating (IPU).
- High resistance (amorphous GST): ‘0’; low resistance (crystalline GST): ‘1’.
[Figure: cell cross‐sections in the amorphous and crystalline states, showing the top electrode, GST, heater, N+ region, bottom electrode, and substrate.]
Hybrid SSD
- Conventional SSD: FLASH.
- Promising candidate: PRAM (Phase change).
- To combine the benefits of both technologies: hybrid SSD.
- Two usages:
– Performance;
– Reliability.
Hybrid SSD: performance enhancement
[Figure: merge operation. Erase Unit 1 holds valid pages PN=0, V through PN=n, V; updated copies are written to log pages in Erase Unit 2, invalidating the originals (PN=2, I; PN=n, I); the merge copies all valid pages into the empty Erase Unit 3. PN = page number; V = valid; I = invalid.]
The merge operation is time consuming. Erase unit = 128/256 KB; page = 512 bytes to 8 KB.
G. Sun, Y. Joo, Y. Chen, Y. Xie, Y. Chen, H. Li, HPCA 2010.
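The cost contrast behind the hybrid design can be sketched with a toy model: NAND updates are out of place and eventually force merges, while PRAM log pages update in place. The page counts and the one‐merge‐per‐full‐log cost model below are simplifying assumptions.

```python
# Toy cost model: out-of-place NAND updates vs. in-place PRAM updates.
# Assumption: every `log_pages` out-of-place updates fill the log and
# force one time-consuming merge (copy valid pages + erase a unit).

def nand_merges(n_updates, log_pages):
    """Merges triggered by n_updates out-of-place NAND page updates."""
    used, merges = 0, 0
    for _ in range(n_updates):
        used += 1
        if used == log_pages:
            merges += 1   # the time-consuming merge operation
            used = 0
    return merges

def pram_merges(n_updates):
    """PRAM log pages support in-place update (IPU): no merges at all."""
    return 0
```

Under this model, 100 updates with an 8‐page log force 12 merges on NAND but none on PRAM, which is the motivation for putting the log region in PRAM.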
Hybrid SSD: performance enhancement
[Figure: hybrid architecture, physical and structural views. NAND flash provides the data region, organized in erase units of 512‐byte sectors; PRAM provides the log region with in‐place updating; a data buffer resides in memory.]
Different Log Assignments
[Figure: three schemes mapping erase units in the data region to log pages in the log region.]
- Static log assignment: each erase unit has a fixed set of log pages.
- Group log assignment: log pages are organized per group of erase units and shared within the group.
- Dynamic log assignment: log pages are assigned to erase units on demand.
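The three assignment schemes can be sketched as follows; the functions and their parameters are hypothetical illustrations of the mappings, not the paper's data structures.

```python
# Illustrative sketches of the three log-assignment policies.
# All parameter names and layouts are assumptions for illustration.

def assign_static(unit, logs_per_unit):
    """Static: erase unit i always owns log pages [i*k, (i+1)*k)."""
    k = logs_per_unit
    return list(range(unit * k, (unit + 1) * k))

def assign_group(unit, group_size, logs_per_group):
    """Group: erase units in the same group share one pool of log pages."""
    g = unit // group_size
    return list(range(g * logs_per_group, (g + 1) * logs_per_group))

def assign_dynamic(free_logs, need):
    """Dynamic: hand out any free log pages from a shared pool."""
    return [free_logs.pop() for _ in range(min(need, len(free_logs)))]
```

Static is simplest but wastes log pages on cold units; group and dynamic trade bookkeeping for better utilization.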
Hybrid SSD: performance enhancement
Outline
- Introduction
- Evolution with eNVM:
– On‐chip high speed storage; – Secondary storage;
- Revolution with eNVM:
– Memristor‐based neuromorphic accelerator.
- Conclusion
Computing: Present and Future
[Figure: CPU clock frequency (MHz) and power density (mW/mm²), 1990–2010: single‐core frequency scaling stalls as power density approaches that of a hot plate, a nuclear reactor, and a rocket launch, and the industry shifts to multi‐core.]
New trend:
- Multi‐core, advanced power management, large on‐chip storage.
Future:
- Heterogeneous systems, brain‐like computing.
Source: CPU DB, Intel
Brain – The Most Efficient Computing Machine
[Figure: gray matter and white matter; the neocortex has 6 layers, and signals travel within and between layers.]
Brain: 15–30B neurons; an extremely complex organ; ~4 km of wiring per mm³; ~35 W.
Neuron: processes signals from other neurons.
Synapse: memory; weights the signals in the neural network.
Brain‐like Neuromorphic Circuits
- Highly parallel, ultra power efficient, flexible, and extremely robust.
- Real‐world input, human‐friendly output, data friendly.
Slow progress in neuromorphic hardware implementation:
- Lack of efficient synapse designs;
- Not supportive of mass connection.
Memristor – Rebirth of Neuromorphic Circuits
- Two‐terminal, high density;
- Non‐volatility;
- Analog/multi‐level states;
- Natural matrix function;
- A MIMO system;
- The crossbar is a good combination with the memristor.
Memristor ↔ Synapse; Crossbar ↔ Network
[Figures: I–V hysteresis and resistance vs. pulse number for a TaN1+x device (HP Labs, 2012; EI Lab, DAC’12); an n×n crossbar array; a TiN‐TaOx device programmed with pulses growing linearly in amplitude (EI Lab, APL’13; EI Lab & HP Labs).]
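The "natural matrix function" can be made concrete: applying voltages on the crossbar columns and summing the currents on each row computes a matrix‐vector product by Ohm's and Kirchhoff's laws. The conductance and voltage values below are arbitrary illustrations.

```python
# Why a memristor crossbar computes a matrix-vector product: with column
# voltages V_j and cell conductances G[i][j], the current summed on row i
# is I_i = sum_j G[i][j] * V_j (Ohm's law per cell, Kirchhoff's current
# law per row). Values are arbitrary illustrations.

def crossbar_mvm(G, V):
    """Row currents read out of a memristor crossbar."""
    return [sum(g * v for g, v in zip(row, V)) for row in G]

G = [[0.5, 1.0],
     [2.0, 0.0]]   # conductances (synaptic weights)
V = [1.0, 2.0]     # input voltages (neuron activations)
# crossbar_mvm(G, V) -> [2.5, 2.0]
```

This is why the crossbar maps so directly onto a neural network layer: every synaptic multiply‐accumulate happens in the analog domain in one step.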
Conclusion
- Emerging nonvolatile memory (NVM) technologies such as STT‐RAM, racetrack memory, and PRAM deliver significant improvements for various applications.
- Challenges exist and can be solved by architecture‐level optimization.
- Innovation of revolutionary architectures, such as the memristor‐based neuromorphic accelerator, can provide a truly seamless integration between computing and memory.