Novel Nonvolatile Memory Hierarchies to Realize "Normally-Off Mobile Processors"
ASP-DAC 2014
Shinobu Fujita, Kumiko Nomura, Hiroki Noguchi, Susumu Takeda, Keiko Abe
Toshiba Corporation, R&D Center, Advanced LSI technology


SLIDE 1

ASP-DAC 2014

Novel Nonvolatile Memory Hierarchies to Realize "Normally-Off Mobile Processors"

Shinobu Fujita, Kumiko Nomura, Hiroki Noguchi, Susumu Takeda, Keiko Abe
Toshiba Corporation, R&D Center, Advanced LSI Technology Laboratory

Acknowledgement: This work was partly supported by the Normally-off Computing PJ (NEDO) in Japan.

SLIDE 2

OUTLINE

  • Introduction: Normally-off (N-off) Processor (from Ver.0 to Ver.1)
  • Key Point 1: Advanced STT-MRAM
  • Key Point 2: Decrease in power for short CPU standby state by applying new memory cell design
  • Key Point 3: Power decrease for long CPU standby state by ultra-fast power gating
  • Conclusions: Towards N-off Ver.2

SLIDE 3

History of Concept on Normally-Off Computer

Normally-Off Computer Ver.0 (2001): proposed by K. Ando, AIST, Japan (FED Journal, Japan, 2001).

The same Ver.0 concept was presented by T. Kawahara at ASP-DAC 2011 (based on MRAM): "Nonvolatile memory and normally-off computing".

[Figures: memory hierarchy diagrams (ALU/flip-flops, register files, L1 cache, L2 cache, main memory, storage) on a volatile to non-volatile axis, with the non-volatile memory (MRAM) levels marked.]

SLIDE 4

Rethink Normally-off Concept Ver.0

Attention:

  • Active power (write power) of nonvolatile memory is very large!
  • NV memory is much slower than SRAM.

(CPU core power and performance are largely degraded by Ver.0!)

[Figure: power vs. time for ALU, register, L1 cache, L2 cache, main memory, and storage, with regions marked "Active power is dominant" and "Standby power is dominant".]

Ver.0 (all-nonvolatile memory hierarchy) is not suitable for decreasing power.

SLIDE 5

History of Concept on Normally-Off Computer (2)

  • Ver.0 (2001): Normally-Off Computer, K. Ando, AIST, Japan (FED Journal, Japan, 2001).
  • Ver.0 (2011): T. Kawahara, ASP-DAC 2011.
  • Ver.1 (2010): K. Abe, S. Fujita et al., Toshiba, SSDM 2010.

[Figures: one memory hierarchy diagram per version (ALU/flip-flops, register files, L1 cache, L2 cache, main memory, storage) on a volatile to non-volatile axis, with the non-volatile memory (MRAM) levels marked.]

CPU core power and performance are largely degraded! Ultra-low-power applications such as sensor networks, etc.

SLIDE 6

Why nonvolatile L2, L3, LL cache?

Capacity of cache memory in CPUs is increasing, which increases the standby power of processors!

<Background>

  • Increase performance not by increasing clock frequency.
  • Multi-core.

[Figure: cache capacity (MB, 0.5 to 64) by year ('08 to '13) for server/workstation, desktop PC, notebook PC, and smartphone CPUs.]

[Figure: CPU core (registers, register file, L1 cache) with L2/L3 cache. More cache, more leakage.]

SLIDE 7

Consumed Energy Caused by Leakage Power of Last Level Cache (L2$)

Especially for mobile processors, it is not standby power but leakage power that is dominant!

(Evaluation from one-day use case.)

[Figure: energy consumed (%), 0% to 100%, broken down into active state, clock gated state, and power gated state (except L2 (retention)).]

SLIDE 8

[Figure: memory technologies (SRAM, DRAM, STT-MRAM, MRAM, FeRAM, PCM, ReRAM, NAND flash, HDD) plotted by access speed (ns, 1 to 10,000,000) and memory capacity (bit, 1M to 1T), grouped into RAM with practically unlimited endurance, RAM with limited endurance, and storage.]

STT-MRAM is the best among NVMs, but its operation speed is too slow and its power too high for cache memory.

SLIDE 9

"Dilemma of Nonvolatile Memory!"

[Figure: energy consumed (%), 0% to 100%, for active state, clock gated state, and power gated state (L2 retention). Annotations: with general STT-MRAM, standby power decreases largely, but active power increases drastically.]

With conventional STT-MRAM, standby power is low, but active energy is far higher than that of SRAM.
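
The dilemma can be illustrated with a two-state energy model: cache energy is active power times active time plus standby (leakage) power times standby time. The sketch below is only an illustration under that assumption, not the evaluation behind the slide. The function names and every power value are hypothetical.

```python
def cache_energy(active_mw, standby_mw, active_s, standby_s):
    """Energy in mJ for a cache that is active for active_s seconds and
    in standby for standby_s seconds (two-state model, an assumption)."""
    return active_mw * active_s + standby_mw * standby_s


def breakeven_standby_fraction(sram, nvm):
    """Standby-time fraction above which the NVM cache uses less energy.

    sram and nvm are (active_mw, standby_mw) tuples. Assumes the NVM has
    higher active power and lower standby power than the SRAM.
    """
    extra_active = nvm[0] - sram[0]    # penalty paid while active
    saved_standby = sram[1] - nvm[1]   # saving earned while in standby
    return extra_active / (extra_active + saved_standby)


# Hypothetical numbers, chosen only to show the shape of the trade-off.
SRAM = (10.0, 5.0)   # (active mW, standby mW)
NVM = (30.0, 0.0)    # higher active power, no leakage

fraction = breakeven_standby_fraction(SRAM, NVM)
print(f"NVM wins when standby exceeds {fraction:.0%} of the time")
```

With these made-up values the nonvolatile cache only pays off when the cache is idle most of the time, which is why lowering the active (write) power matters as much as removing leakage.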

SLIDE 10

OUTLINE

  • Introduction: New Design Concept, Normally-off (N-off) Processor (from Ver.0 to Ver.1)
  • Key Point 1: Advanced STT-MRAM
  • Key Point 2: Decrease in power for short CPU standby state (in CPU active state) by applying new memory cell design
  • Key Point 3: Power decrease for long CPU standby state by ultra-fast power gating
  • Conclusions: Towards N-off Ver.2

SLIDE 11

[Figure: memory technologies (SRAM, DRAM, STT-MRAM, MRAM, FeRAM, PCM, ReRAM, NAND flash, HDD) plotted by access speed (ns, 1 to 10,000,000) and memory capacity (bit, 1M to 1T), grouped into RAM with practically unlimited endurance, RAM with limited endurance, and storage. Advanced p-STT-MRAM is added to the plot.]

Advanced STT-MRAM has been developed!

SLIDE 12

Breakthrough by Toshiba's advanced STT-MRAM

[Figure: published STT-MRAM devices [1] to [7] and Toshiba 2012 (advanced STT-MRAM, 28-30 nm) plotted by programming time (axis labeled nsec, ticks 1.E-10 to 1.E-07) and programming current (A, ticks 1.0E-05 to 1.0E-02). A line marks the breakeven point for the replacement of hp-SRAM in power of cache memory: on one side the power of cache memory is increased, on the other it is reduced. Arrows: power down, higher speed.]

[1] Sony Corp., IEDM (2005).
[2] New York Univ., Applied Physics Letters 97, 242510 (2010).
[3] Cornell Univ., Applied Physics Letters 95, 012506 (2009).
[4] Minnesota Univ., J. Phys. D: Appl. Phys. 45, 025001 (2012).
[5] NEC Corp., Symposium on VLSI Circuits 7.3 (2012).
[6] IBM Corp., Appl. Phys. Lett. 98, 022501 (2011).
[7] TDK-Headway, Applied Physics Express 5, 093008 (2012).
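
Why both axes matter can be seen from the charge moved per write, which is programming current times programming time. The sketch below compares the two write-time/current pairs quoted in the simulation setup of this deck (3 ns / 50 uA for the advanced p-MTJ, 25 ns / 120 uA for the reference p-MTJ). The function names are hypothetical, and the 1.0 V write voltage used for the energy figure is an assumption, not a value from the slides.

```python
def write_charge_fc(current_ua, time_ns):
    """Charge per write in femtocoulombs (1 uA x 1 ns = 1 fC)."""
    return current_ua * time_ns


def write_energy_fj(current_ua, time_ns, voltage_v=1.0):
    """Energy per write in femtojoules. voltage_v is an assumed value."""
    return write_charge_fc(current_ua, time_ns) * voltage_v


advanced = write_charge_fc(50, 3)      # advanced p-MTJ: 3 ns / 50 uA
reference = write_charge_fc(120, 25)   # reference p-MTJ: 25 ns / 120 uA
ratio = reference / advanced

print(f"advanced: {advanced} fC, reference: {reference} fC, ratio: {ratio:.0f}x")
```

Under these assumptions, shortening the pulse and lowering the current together cut the per-write charge by a factor of 20, which is the direction of the "power down, higher speed" arrows in the plot.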

SLIDE 13

STT-MRAM Test Chip

Embedded Memory Integration (by Toshiba N-off PJ): access time < 4 ns

Access Time Measurements

[Figure: pass/fail map of access time (3 to 6 ns) against voltage (0.8 to 1.2 V).]

  • 3.3 ns @ 1.2 V
  • 4 ns @ 1.05 V
  • Read: 17.8 mW @ 250 MHz

[Figure: macro layout with XDEC, read & write circuits, driver, SA, 1K BL, 1K WL, and four 256R x 256C sub-arrays each with a 4:1 MUX. Dimension labels: 502.03 μm, 502.03 μm, 55.26 μm, 560.12 μm, 32.69 μm, 592.81 μm, 1059.32 μm.]

  Cell:          2T-2MTJ
  Process:       65-nm CMOS process
  Macro size:    0.628 mm²
  Organization:  4K words x 256 bits = 1 Mb

  • H. Noguchi et al., VLSI Circuits Symposium, 2013
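
The test-chip figures can be cross-checked with a few lines of arithmetic. The sketch below derives the capacity from the organization, and estimates read energy and bit density from the quoted power, frequency, and macro size. The per-access energy assumes one full 256-bit word is read every cycle at 250 MHz, which the slide does not state; the variable names are illustrative.

```python
WORDS = 4 * 1024          # 4K words
BITS_PER_WORD = 256
READ_POWER_W = 17.8e-3    # 17.8 mW read power
READ_FREQ_HZ = 250e6      # measured at 250 MHz
MACRO_AREA_MM2 = 0.628

# 4K words x 256 bits
capacity_bits = WORDS * BITS_PER_WORD

# Assumption: one 256-bit read access per clock cycle.
energy_per_read_pj = READ_POWER_W / READ_FREQ_HZ * 1e12
energy_per_bit_pj = energy_per_read_pj / BITS_PER_WORD

# Density in Mbit (2**20 bits) per square millimetre of macro area.
density_mbit_per_mm2 = capacity_bits / 2**20 / MACRO_AREA_MM2

print(capacity_bits)
print(f"{energy_per_read_pj:.1f} pJ per read, {energy_per_bit_pj:.3f} pJ per bit")
print(f"{density_mbit_per_mm2:.2f} Mbit/mm^2")
```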

SLIDE 14

High-speed STT-MRAM is NOT for higher CPU performance, but for lower-power CPUs!

[Figure: average time per instruction (ns) vs. access time (ns) for SRAM and STT-MRAM, annotated "< 3% @ 5 ns".]

[Figure: relative average power of a 1 MB L2 cache vs. access time of STT-MRAM (ns), normalized to HP-SRAM = 1, with higher-power and lower-power regions marked.]

SLIDE 15

Development of "STT-MRAM-top Integration"

[Figure: cross-section image showing STT-MRAM on top of the CMOS wiring layers and CMOS.]

Conventional CMOS process (in-house fab, foundry, ...) plus a specific MRAM integration process.

Analogy: pool-top construction (Marina Bay Sands Hotel).

To be presented at VLSI-TSA 2014.

SLIDE 16

OUTLINE

  • Introduction: New Design Concept, Normally-off (N-off) Processor (from Ver.0 to Ver.1)
  • Key Point 1: Advanced STT-MRAM
  • Key Point 2: Decrease in power for short CPU standby state (in CPU active state) by applying new memory cell design (normally-off type design)
  • Key Point 3: Power decrease for long CPU standby state by ultra-fast power gating
  • Conclusions: Towards N-off Ver.2

SLIDE 17

From "Normally-On Type Memory with Power Gating" to "Normally-Off Type Memory without Power Gating"

(1) SRAM and Nonvolatile SRAM without Power Gating

[Figure: power vs. time. Between active periods there are short standby intervals (10~30 ns) during which leakage power is still consumed.]

[Figure: SRAM cell (WL, BL, /BL) and nonvolatile SRAM cell with two MTJs (2004 Toshiba), each with a leakage path marked.]

Normally-on type: NV-SRAM for high speed!

Power gating during short standby?

SLIDE 18

From "Normally-On Type Memory with Power Gating" to "Normally-Off Type Memory without Power Gating"

(1) SRAM and Nonvolatile SRAM with Power Gating

[Figure: SRAM cell (WL, BL, /BL) and nonvolatile SRAM cell with two MTJs (2004 Toshiba), normally-on type, with leakage path.]

The overhead of the power gating switch is very large (delay and power overhead as well).

(2) Normally-off Type Memory without Power Gating

No leakage path, no power gating switch. New design: "Normally-off type" (next page).

[Figure: area comparison of SRAM and NV-SRAM with their power gating switches against STT-MRAM, annotated "x2 ~ x4".]

The STT-MRAM cell is much smaller than the SRAM cell.

SLIDE 19

Various kinds of normally-off type memory cell designs using advanced p-STT-MRAM, presented by Toshiba:

  • (a) D-MRAM: K. Abe et al., IEDM 2012 (Toshiba)
  • (b) 3T-2MTJ: A. Kawasumi et al., IMW 2013 (Toshiba)
  • (c) 2T-2MTJ: H. Noguchi et al., VLSI Circuit 2013 (Toshiba)
  • (d) 4T-2MTJ: C. Tanaka et al., SSDM 2013 (Toshiba)

[Figure: circuit schematics of the four cells, showing word lines (WL, RWL, WWL, SWL), bit lines (BL, /BL, WBL, RBL), source lines (SL, /SL), and MTJs.]

As there are no leakage paths like those in SRAM, no power gating switch is needed in the memory arrays.

SLIDE 20

CPU Simulation (ARM core, Linux OS)

  Processor
    # of cores:   1
    Frequency:    1 GHz
    Issue width:  1 (out of order)
    ISA:          ARMv7
  Memory
    L1 cache:     32+32 kB, 4-way, 64 B line, write-back, 1 read/write port, 1 ns latency
    L2 cache:     1 MB, 8-way, 64 B line, write-back, 1 read/write port
  Execution
    Warm-up:      1M inst.
    Execution:    10M inst.

Cell types compared: D-MRAM (1MTJ-3T, this work), 2MTJ-4T, 2MTJ-6T, SRAM (reference).

MTJ device write time / current: 3 ns / 50 uA (advanced p-MTJ); 25 ns / 120 uA (reference p-MTJ).

Processor benchmark sets: SPEC2006
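
The cache parameters above fix the number of sets and the address split of each cache. A minimal sketch of that calculation follows; it assumes "kB" and "MB" mean binary units (1024 and 1024 x 1024 bytes) and that "32+32 kB" denotes separate 32 kB instruction and data caches. The helper name `cache_geometry` is hypothetical.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, associativity, and line size are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 of the line size
    index_bits = sets.bit_length() - 1          # log2 of the set count
    return sets, offset_bits, index_bits


KIB = 1024

l1 = cache_geometry(32 * KIB, ways=4, line_bytes=64)     # one 32 kB L1
l2 = cache_geometry(1024 * KIB, ways=8, line_bytes=64)   # 1 MB L2

print("L1:", l1)
print("L2:", l2)
```

Under these assumptions each L1 has 128 sets and the L2 has 2048 sets, so the L2 index is 11 bits wide with a 6-bit line offset.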

SLIDE 21

Results of Power of Cache Memory (Short standby state) (case study: (a) D-MRAM)

[Figure: normalized power of the 1 MB, 8-way L2 cache (log scale, 0.1 to 10) for the benchmarks bzip2, gobmk, h264ref, xalancbmk, gromacs, namd, calculix, and lbm, with the SRAM level marked. Series: 2MTJ-6T, 2MTJ-4T, 1MTJ-3T (advanced p-MTJ, this work), 1MTJ-3T (reference p-MTJ). Annotations: "Worse than SRAM (Normally-On without PG)" and "Better than SRAM (Normally-Off)".]

Normally-off memory using low-power, advanced p-STT-MRAM reduces the cache power most effectively.

SLIDE 22

Processor performance (Short standby state) (case study: Normally-off STT-MRAM)

[Figure: normalized instructions per cycle (IPC, a.u., 0.2 to 1, SRAM case = 1) for the benchmarks bzip2, gobmk, h264ref, xalancbmk, gromacs, namd, calculix, and lbm. Series: 2MTJ-6T, 2MTJ-4T, 1MTJ-3T (advanced p-MTJ, this work), 1MTJ-3T (reference p-MTJ).]

The normally-off memory cell design using advanced p-MTJ (STT-MRAM) has the best performance, comparable to that of SRAM.

SLIDE 23

OUTLINE

  • Introduction: New Design Concept, Normally-off (N-off) Processor (from Ver.0 to Ver.1)
  • Key Point 1: Advanced STT-MRAM
  • Key Point 2: Decrease in power for short CPU standby state (in CPU active state) by applying new memory cell design
  • Key Point 3: Power decrease for long CPU standby state by ultra-fast power gating
  • Conclusions: Towards N-off Ver.2

SLIDE 24

Conventional Power Gating (blocks: clock, L1-$, L2-$, others)

  • Active state
  • Clock gated state: clock OFF. Recovery time to active state: ~1 ns
  • CPU core sleep (L2-cache retention): clock OFF, L1-$ OFF. Recovery time: 10 μs
  • CPU core sleep (L2-cache decay): clock OFF, L1-$ OFF, L2-$ decay. Recovery time: 20 μs
  • Deep power-down state (DPS): clock OFF, L1-$ OFF, L2-$ OFF, others OFF except state SRAM. Recovery time: 100 μs

High-speed PG with NV-cache (blocks: clock, L1-$, L2-memory, L2-logic + peripheral)

  • Active state: L2-memory normally OFF
  • Clock gated state: clock OFF, L2-memory normally OFF. Recovery time to active state: ~1 ns
  • CPU core sleep; deep power-down state (DPS): OFF, L2-memory normally OFF. Recovery time: ~10 ns
  • Deeper power-down state: all blocks OFF, L2-memory normally OFF. Recovery time: ~100 ns

SLIDE 25

State transition policy for long-time standby state: conventional power gating (PG) vs. ultra-fast PG with NV-L2-cache

[Figure: power (%, log scale, 1 to 100) vs. idle time (ns, 10^2 to 10^6), from short standby state to long-time standby state. Two curves, "Conventional PG" and "Ultra-fast PG + nonvolatile L2 cache", step down from active state through clock gated, CPU core sleep, and L2 cache decay states to the deep power-down state (DPS). Transition points are labeled 2 ns, 20 ns, 200 ns, 20 μs, 40 μs, and 200 μs.]
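
A state transition policy of this kind can be sketched as a threshold table: the processor enters the deepest state whose entry threshold does not exceed the idle time. The code below is a minimal illustration of that idea, not the policy used by the authors. The function name, the state names, and every threshold value are hypothetical and chosen only to contrast a slow-recovery policy with a fast-recovery one.

```python
from bisect import bisect_right

# Each entry is (minimum idle time in ns, state), sorted by threshold.
# Hypothetical thresholds, not the values used in the slides.
CONVENTIONAL = [(0, "active"), (1_000, "clock gated"),
                (20_000, "core sleep"), (200_000, "deep power-down")]
ULTRA_FAST = [(0, "active"), (2, "clock gated"), (200, "deep power-down")]


def select_state(idle_ns, policy):
    """Pick the deepest state whose entry threshold is <= idle_ns."""
    if idle_ns < 0:
        raise ValueError("idle time must be non-negative")
    thresholds = [t for t, _ in policy]
    return policy[bisect_right(thresholds, idle_ns) - 1][1]


for idle in (1, 500, 50_000):
    print(idle, select_state(idle, CONVENTIONAL), select_state(idle, ULTRA_FAST))
```

With thresholds in the nanosecond range, the fast-recovery policy already reaches its deepest state for idle periods in which the slow-recovery policy has not left the active state.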

SLIDE 26

[Figure: power vs. time with three interrupts, for (a) conventional PG (power gating) and (b) ultra-fast PG + NV-cache. Deep sleep states are marked between the interrupts.]

SLIDE 27

Case studies: Decrease in average power of processor by ultra-fast PG with nonvolatile L2 cache.

[Figure: average power of processor (mW, 0 to 1000) for conventional vs. NV-L2 cache in Case 1, Case 2, and Case 3, with reductions of 29%, 66%, and 90%, respectively.]

Case 1: CPU active state dominant. Case 2: moderate. Case 3: CPU idle state dominant.
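
The trend across the three cases follows from a duty-cycle average: the larger the share of time spent idle, the more a lower idle power reduces the average. The sketch below shows the calculation under a two-state assumption. All power values and active fractions are hypothetical and are not meant to reproduce the 29%, 66%, and 90% figures; the function names are illustrative.

```python
def average_power_mw(active_fraction, active_mw, idle_mw):
    """Time-weighted average of active and idle power (two-state model)."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    return active_fraction * active_mw + (1.0 - active_fraction) * idle_mw


def reduction(baseline_mw, improved_mw):
    """Fractional reduction of improved_mw relative to baseline_mw."""
    return 1.0 - improved_mw / baseline_mw


# Hypothetical values, for illustration only.
ACTIVE_MW = 1000.0
IDLE_CONVENTIONAL_MW = 500.0   # idle power with conventional PG
IDLE_NV_MW = 50.0              # idle power with ultra-fast PG + NV-L2 cache

for active_fraction in (0.9, 0.5, 0.1):
    base = average_power_mw(active_fraction, ACTIVE_MW, IDLE_CONVENTIONAL_MW)
    new = average_power_mw(active_fraction, ACTIVE_MW, IDLE_NV_MW)
    print(f"active {active_fraction:.0%}: {base:.0f} mW -> {new:.0f} mW "
          f"({reduction(base, new):.0%} lower)")
```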

SLIDE 28

Conclusion

  • For HP mobile processors, we proposed N-off processor Ver.1: volatile L1 cache / nonvolatile L2, LLC.
  • To realize N-off processor Ver.1, the three key points are advanced STT-MRAM, normally-off type memory cell design, and ultra-fast power gating.
  • By applying new memory cell designs without leakage paths, not only CPU standby power but also CPU active power has been effectively reduced.
  • An average power reduction of 29 to 90% can be expected with little degradation of CPU performance.
  • The shift of the N-off processor concept from Ver.1 to Ver.2 is in progress.

[Figure: Normally-off processor (Concept Ver.1), 2010: volatile CPU core (registers, register file, L1 cache) with nonvolatile L2, L3 cache. Rethink! Normally-off processor (Concept Ver.2), from 2012: nonvolatile/volatile hybrid.]

Ex.: CPU core (big) and CPU core (LITTLE) with L2 (1 MB) STT-MRAM, L2 (256 kB) SRAM/STT-MRAM, and L3 (16 MB) high-density STT-MRAM.

SLIDE 29

Thank you!