
Normally off computing for smart systems Cache and main memory - PowerPoint PPT Presentation

Normally off computing for smart systems. Cache and main memory architecture based on MRAM: application to High Performance Computing and Embedded Systems. Lionel Torres, Univ. Montpellier, France. Thanks to: G. Sassatelli, A. Gamatie,


  1. "Normally off" computing for smart systems. Cache and main memory architecture based on MRAM: application to High Performance Computing and Embedded Systems. Lionel Torres, Univ. Montpellier, France. Thanks to: G. Sassatelli, A. Gamatie, P. Benoit, P. Nouet, D. Novo, G. Dinatale, A. Todri, A. Virazel, L. Latorre, M. Robert, G. Patrigeon, P. Y. Peneau, J. Modad, F. Ouattara, J. Lopes, O. Coi, K. Sevin. RTNS 2018

  2. General context: current IC integration challenges
     ● Energy is critical
     ● Applications need more and more performance
     ● Current technology (CMOS) limitations: integration is more and more complex, 10^9 transistors/cm²
     ● Reliability is a real problem: X% of the systems encounter an uncorrectable error per year (X ranging from 1 to 5%)
     [Figure: power density (W/cm²) versus time, 1950 to 2010, for bipolar and CMOS machines from the IBM 360 to the Pentium 4, overlaid with technology life-cycle curves (embryonic, growing, mature, aging) for successive technologies. Source: Bernie Meyerson, IBM]

  3. General context. Technology target: CMOS < 20 nm
     ● Transporting 1 bit → 1 pJ/mm
     ● Transporting 10^9 data in 1 s (1 GHz) → 1 pJ/mm × 10^9 = 1 mW/mm
     ● 64-bit bus → 64 mW/mm
     ● On a real IC → several W/cm²
     ● Computation, one bit transition → 1 aJ
     ● Computation, 10^9 data transitions in 1 s → 1 aJ × 10^9 = 1 nW
     → It is better to "calculate" than to "transport" the information
     → In-memory computing is certainly interesting
     → Reminder: the minimal energy to change 1 bit of information is k·T·ln 2 ≈ 2.85 zJ
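The arithmetic on this slide can be checked with a short script. This is only a sketch: the per-bit energies are the figures quoted on the slide, while the temperature of 298 K is an assumption chosen because it gives a value close to the quoted 2.85 zJ. The function names are illustrative.

```python
import math

# Figures quoted on the slide (CMOS < 20 nm).
TRANSPORT_J_PER_BIT_MM = 1e-12   # 1 pJ to move one bit over 1 mm
COMPUTE_J_PER_TRANSITION = 1e-18  # 1 aJ per bit transition

BOLTZMANN_J_PER_K = 1.380649e-23  # exact SI value of the Boltzmann constant
ROOM_TEMPERATURE_K = 298.0        # assumption, not stated on the slide


def transport_power_w_per_mm(bits_per_second, bus_width=1):
    """Power per millimetre of wire for a bus of `bus_width` lines."""
    return TRANSPORT_J_PER_BIT_MM * bits_per_second * bus_width


def compute_power_w(transitions_per_second):
    """Power spent on bit transitions alone."""
    return COMPUTE_J_PER_TRANSITION * transitions_per_second


def minimal_bit_energy_j(temperature_k=ROOM_TEMPERATURE_K):
    """k*T*ln(2): the minimal energy to change one bit of information."""
    return BOLTZMANN_J_PER_K * temperature_k * math.log(2)


print(f"1 line at 1 GHz : {transport_power_w_per_mm(1e9) * 1e3:.1f} mW/mm")
print(f"64-bit bus      : {transport_power_w_per_mm(1e9, 64) * 1e3:.1f} mW/mm")
print(f"1e9 transitions : {compute_power_w(1e9) * 1e9:.1f} nW")
print(f"k*T*ln(2)       : {minimal_bit_energy_j() * 1e21:.2f} zJ")
```

With these numbers, moving a bit over one millimetre costs about a million times more than flipping it, which is the point behind "calculate" rather than "transport".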

  4. One challenge: the memory
     ● Today, 50% of the silicon area of an IC is memory
     ● Watch out for (static) energy!
     [Figure: chip layout with the L1, L2 and L3 cache areas labeled]

  5. Technology evolution
     Current memories: ● SRAM for fast access ● DRAM for applications ● Flash (mass storage) ● …
     Emerging memories: ● Magnetic tunneling junctions ● Phase change memory ● Programmable metallization cells ● OxRRAM ● … (Resistance Switching Memory)
     Universal memory, "non-volatile memory": • SRAM performance • Size of DRAM/Flash • Non-volatility • Scalability
     Emerging memories offer non-volatility, speed and endurance => disruption of the memory hierarchy?

  6. Spin Technology. The conductance of magnetic metal plates is larger in the presence of a magnetic field perpendicular to the current flow (William Thomson, 1824-1907). Currently known as Anisotropic Magnetoresistance (AMR). Resistance variation attained: 2%-5% at room temperature (RT)

  7. Spin Technology. Peter Grünberg and Albert Fert, 2007 Nobel Prize in Physics ● Thin stacks of FM/NM metals show a conductance increase of up to 100% when subjected to a magnetic field. G. Binasch et al., 1989; M. N. Baibich et al., 1988

  8. Spin Technology. Tunnel magnetoresistance (TMR):
     $R(\theta) = R_P + \frac{\Delta R}{2}\,(1 - \cos\theta)$
     $\theta = 0^\circ \rightarrow R = R_P$
     $\theta = 180^\circ \rightarrow R = R_P + \Delta R = R_{AP}$
     $\mathrm{TMR}\,\% = \frac{\Delta R}{R_P} = \frac{R_{AP} - R_P}{R_P}$
     Typical TMR is between 150% and 250% (or more). M. Bowen et al., Nearly total spin polarization…
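The relations on this slide are easy to evaluate numerically. The sketch below assumes the usual cosine dependence of a magnetic tunnel junction's resistance on the angle between the two magnetizations, which matches the two limits given on the slide (parallel at 0°, antiparallel at 180°). The resistance values in the example are hypothetical, chosen only to land inside the quoted 150% to 250% range.

```python
import math


def mtj_resistance(theta_deg, r_p, r_ap):
    """Junction resistance for an angle theta between the magnetizations.

    Assumes R(theta) = R_P + (R_AP - R_P) * (1 - cos(theta)) / 2, so that
    theta = 0 gives R_P (parallel) and theta = 180 gives R_AP (antiparallel).
    """
    delta_r = r_ap - r_p
    return r_p + 0.5 * delta_r * (1.0 - math.cos(math.radians(theta_deg)))


def tmr_percent(r_p, r_ap):
    """TMR ratio in percent: (R_AP - R_P) / R_P."""
    return 100.0 * (r_ap - r_p) / r_p


# Hypothetical junction: 1 kOhm parallel, 3 kOhm antiparallel.
R_P, R_AP = 1000.0, 3000.0
for angle in (0, 90, 180):
    print(f"theta = {angle:3d} deg -> R = {mtj_resistance(angle, R_P, R_AP):.0f} Ohm")
print(f"TMR = {tmr_percent(R_P, R_AP):.0f} %")
```

A memory cell only uses the two end states, so the read circuit has to tell R_P from R_AP; a larger TMR makes that distinction easier.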

  9. Spin Technology ● Compatible with CMOS ● Non-volatile memory ● Switching time < 1 ns ● Write current < 10 µA-100 µA ● Density ×4 vs SRAM ● Immune to radiation ● Samsung demonstrator (8 Mbit STT-MRAM), 2016

  10. Motivations
      • One way: go towards non-volatile systems using emerging NVMs (MRAM, PCRAM, FeRAM, ReRAM)
      • Current NVM issues: speed, dynamic energy, reliability, …
      [Figure: block diagrams of a conventional system (CPU, GPU, SRAM cache, high performance bus, DDR and Flash controllers, FPGA, external Flash, external DRAM) and of an MRAM-based counterpart (NV cache, embedded on-chip non-volatile MRAM, eFPGA, external MRAM)]
      Where and how to place MRAM to: reduce total power consumption? keep the same or get better performance?

  11. Contributions
      1. Evaluation of MRAM-based cache memory hierarchy: • Exploration flow and extraction of memory activity • L1 and L2 caches based on STT-MRAM and TAS-MRAM
      2. Non-volatile computing: • Instant-on/off capability for embedded processors • Analysis and validation of the rollback mechanism

  12. MRAM applied to cache
      Possible studies: performance comparison, new architectures, 3D-stacking capability of MRAM, potential applications? Trade-offs: energy, speed, area
      ● Cache, SRAM vs MRAM. Benefits of MRAM: low leakage, high density, non-volatility
      ● Hybrid SRAM/MRAM cache, with a fast (SRAM) part and a slow (MRAM) part. Take advantage of MRAM: low leakage, high density, non-volatility. Mitigate the drawbacks of MRAM: high write latency, high write energy
      [Figure: CPU with an on-chip NV cache in place of SRAM, high performance bus, DDR controller and external DRAM; 3D stack of NV memory on a logic layer]

  13. MRAM applied to cache: NVM exploration flow
      1. Define the architecture § Single/Multi-core
      2. Explore MRAM-based cache configurations § L1, L2, L3, Hybrid…
      3. Extract useful information § Runtime, cache energy, cache transactions…
      Architecture level: benchmarks run on gem5* (full-system simulator) give the execution time, # reads/writes and # hits/misses
      Circuit level: NVM memory array modeling (NVSim**, SPICE…) and a memory prototype give the latency, access energy and static power
      Both levels are combined into the total L1/L2 energy consumption
      * N. Binkert et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, Aug. 2011.
      ** X. Dong et al., "NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Jul. 2012.
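One plausible way to combine the two levels of this flow is to add static energy (leakage power times runtime) to dynamic energy (access counts times per-access energy). The sketch below assumes that simple model; the slides do not spell out the exact formula. The per-access energies and leakage are the 45 nm, 512kB L2 values given later in the deck, while the runtime and access counts are hypothetical and the same runtime is assumed for both technologies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheTech:
    """Circuit-level figures, as a tool such as NVSim would report them."""
    read_energy_nj: float
    write_energy_nj: float
    leakage_mw: float


@dataclass(frozen=True)
class Activity:
    """Architecture-level figures, as extracted from a simulator run."""
    runtime_s: float
    reads: int
    writes: int


def static_energy_j(tech, act):
    return tech.leakage_mw * 1e-3 * act.runtime_s


def dynamic_energy_j(tech, act):
    return (act.reads * tech.read_energy_nj
            + act.writes * tech.write_energy_nj) * 1e-9


def total_energy_j(tech, act):
    # Assumed model: total = static + dynamic.
    return static_energy_j(tech, act) + dynamic_energy_j(tech, act)


# 512kB L2 at 45 nm, values from the circuit-level table in this deck.
sram_l2 = CacheTech(read_energy_nj=0.27, write_energy_nj=0.02, leakage_mw=320.0)
stt_l2 = CacheTech(read_energy_nj=0.28, write_energy_nj=0.05, leakage_mw=23.0)

# Hypothetical activity, for illustration only.
run = Activity(runtime_s=1.0, reads=20_000_000, writes=6_000_000)

for name, tech in (("SRAM", sram_l2), ("STT-MRAM", stt_l2)):
    total = total_energy_j(tech, run)
    share = static_energy_j(tech, run) / total
    print(f"{name:9s} total = {total * 1e3:7.2f} mJ, static share = {share:.0%}")
```

Because the leakage term dominates at these access counts, the higher write energy of STT-MRAM barely shows in the total.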

  14. MRAM applied to cache: experimental setup
      From single to multi-core architecture: ● ARMv7 ISA ● Private L1 instruction/data ● Shared L2 (additional levels of cache possible) ● Main memory (DDR3)
      Cache technologies: SRAM (baselines), STT-MRAM and TAS-MRAM; technology nodes shown: 45nm, 120nm and 130nm
      [Figure: Core 1, Core 2, … Core N, each with a private L1 I/D, a shared L2 and DDR3 main memory]

  15. MRAM applied to cache: circuit-level analysis, models (NVSim) & prototype. Area
      Node    Technology   512kB L2 (mm²)   32kB L1 (mm²)
      45nm    SRAM         1.36             0.091
      45nm    STT-MRAM     0.82             0.117
      120nm   SRAM         9.7              -
      120nm   TAS-MRAM     11.7             -
      [Figure: area (mm², log scale from 0.01 to 100) versus cache capacity (8kB to 4MB) for SRAM and STT-MRAM]
      ● MRAM is denser for large cache capacity
      ● MRAM cell size is smaller than that of SRAM
      ● MRAM needs large transistors for write
      ● TAS-MRAM cache is larger due to field lines

  16. MRAM applied to cache: circuit-level analysis, models (NVSim) & prototype
      512kB L2 cache
      Node    Technology   Read latency (ns)   Read energy (nJ)   Write latency (ns)   Write energy (nJ)   Standby leakage (mW)
      45nm    SRAM         4.28                0.27               2.87                 0.02                320
      45nm    STT-MRAM     2.61                0.28               6.25                 0.05                23
      120nm   SRAM         5.95                1.05               4.14                 0.08                82
      120nm   TAS-MRAM     35                  4.62               35                   1.96                10
      Read: STT-MRAM ≈ SRAM, TAS-MRAM > SRAM. Write: MRAM > SRAM. Leakage: MRAM << SRAM (/14 at 45nm, /8 at 120nm). Other ratio annotations on the slide: /2.2, 2.1, 2.5
      32kB L1 cache
      Node    Technology   Read latency (ns)   Read energy (nJ)   Write latency (ns)   Write energy (nJ)   Standby leakage (mW)
      45nm    SRAM         1.25                0.024              1.05                 0.006               22
      45nm    STT-MRAM     1.94                0.095              5.94                 0.04                3.3
      Read: MRAM > SRAM. Write: MRAM > SRAM. Leakage: MRAM << SRAM (/7)
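The leakage ratios annotated on this slide (/14, /8, /7) follow directly from the table values. A small sketch that recomputes them, together with the write-side penalties, is given below; the dictionary layout and function names are illustrative, and the numbers are transcribed from the tables on this slide.

```python
# Columns: read latency (ns), read energy (nJ), write latency (ns),
# write energy (nJ), standby leakage (mW).
FIELDS = ("read_ns", "read_nj", "write_ns", "write_nj", "leak_mw")

CACHES = {
    ("L2 512kB", "45nm", "SRAM"): (4.28, 0.27, 2.87, 0.02, 320.0),
    ("L2 512kB", "45nm", "STT-MRAM"): (2.61, 0.28, 6.25, 0.05, 23.0),
    ("L2 512kB", "120nm", "SRAM"): (5.95, 1.05, 4.14, 0.08, 82.0),
    ("L2 512kB", "120nm", "TAS-MRAM"): (35.0, 4.62, 35.0, 1.96, 10.0),
    ("L1 32kB", "45nm", "SRAM"): (1.25, 0.024, 1.05, 0.006, 22.0),
    ("L1 32kB", "45nm", "STT-MRAM"): (1.94, 0.095, 5.94, 0.04, 3.3),
}


def metric(cache, node, tech, field):
    """Look up one table cell by cache, node, technology and column name."""
    return CACHES[(cache, node, tech)][FIELDS.index(field)]


def mram_over_sram(cache, node, mram, field):
    """Ratio MRAM / SRAM for one column (> 1 means MRAM is worse)."""
    return metric(cache, node, mram, field) / metric(cache, node, "SRAM", field)


def leakage_reduction(cache, node, mram):
    """Factor by which MRAM cuts standby leakage relative to SRAM."""
    return 1.0 / mram_over_sram(cache, node, mram, "leak_mw")


for cache, node, mram in (("L2 512kB", "45nm", "STT-MRAM"),
                          ("L2 512kB", "120nm", "TAS-MRAM"),
                          ("L1 32kB", "45nm", "STT-MRAM")):
    print(f"{cache} {node} {mram}: "
          f"leakage /{leakage_reduction(cache, node, mram):.1f}, "
          f"write latency x{mram_over_sram(cache, node, mram, 'write_ns'):.1f}, "
          f"write energy x{mram_over_sram(cache, node, mram, 'write_nj'):.1f}")
```

The same helper can be pointed at any other column, for example to compare read energy at the two cache sizes.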

  17. MRAM applied to cache: case study
      Quad-core architecture: ● Frequency 1 GHz ● ARMv7 ISA ● Private L1 I/D, 32 kB per core ● Shared L2, 512 kB ● DDR3 main memory, 512 MB
      Benchmarks: ● SPLASH-2 ╶ Mostly high performance computing ● PARSEC ╶ Animation, data mining, computer vision, media processing
      [Figure: Core 1 to Core 4, each with a 32 kB L1 I/D, a 512 kB shared L2 and 512 MB of DDR3]

  18. MRAM applied to cache: architecture-level analysis (gem5)
      Number of accesses
      Benchmark   L1 cache                        L2 cache
      SPLASH-2    ~2 billion (0.5 billion/CPU)    ~26 million
      PARSEC      ~12 billion (3 billion/CPU)     ~16 million
      Static/dynamic energy ratio, share of static energy: L2 → 90%, L1 → 80%
      [Figures: read/write ratio; L2/L1 access ratio]

  19. MRAM-based L2: execution time
      [Figure: normalized execution time per benchmark for STT-MRAM L2 (45 nm) and TAS-MRAM L2 (130 nm), relative to the SRAM baseline; inset: cache miss rate (%) over execution time for barnes and ocean2]
      Observations:
      ● STT shows good performance ╶ L2 has a small impact on overall performance
      ● For TAS, a 14% penalty on average (SPLASH-2) ╶ Depends on the application (cache miss rate, L1/L2 access ratio)

  20. MRAM-based L2: total L2 cache energy consumption
      [Figure: normalized L2 energy per benchmark for STT-MRAM L2 (45 nm) and TAS-MRAM L2 (130 nm), relative to the SRAM baseline; inset: read and write bandwidth (GigaBytes/s) over execution time for fluidanimate and radix, with the end of each run marked]
      Observations:
      ● Up to 90% gain for STT
      ● From 40% to 90% for TAS ╶ Due to the very low leakage of the MRAM-based cache
