SLIDE 1

Memory Hierarchy Design Issues in Many-Core Processors

Sangyeun Cho
Dept. of Computer Science
University of Pittsburgh

SLIDE 2

Multicores are Here

  • AMD Opteron Dual-Core
  • IBM Power5
  • SUN UltraSPARC IV+
  • SUN UltraSPARC T1

Tomorrow's Processors?

  • Dance-hall organization
  • Round-table organization
  • Tiled organization

SLIDE 3

Technology/application trends? Potential problems/constraints?

Discussions are based on

  • ITRS 2001/2003/2005
  • Intel's "Platform 2015" whitepapers
  • S. Borkar's MICRO 2004 keynote presentation
  • Other references

Moore's Law

  • ~2300 transistors in Intel 4004 (1971)
  • ~276M transistors in Power5 (2003)
  • ~1.7B transistors (24MB L3 cache) in Intel Montecito (2005)
  • 2016 forecast by ITRS 2005

– 3B transistors @22nm technologies – 40GHz local clock

  • Building a processor with MANY transistors not infeasible

– Single core (OoO/VLIW) scalability is limited – Multicore is the result of natural evolution

SLIDE 4

Power Trend

  • Power drivers

– # of transistors – Faster clock frequency – Increased leakage power

[Chart: unconstrained power trend and max. allowed power (W) over technology generations]

  • Power density (W/cm2)

– Related with temperature – Becomes more critical

  • Perf. & reliability issues

Bandwidth Trend

  • Bandwidth drivers

– # of processors – Faster clock frequency

  • Two fronts

– Off-chip bandwidth – On-chip interconnect bandwidth

  • Today's processor bandwidth is 2~20GB/s.

  • Limited by

– # of pins – Bandwidth per pin

SLIDE 5

On-Chip Wire Speed

  • Scaling leads to faster devices (transistors)
  • Scaling however leads to slower global wires (increased RC delay)
  • Possible implications

– Simpler processor cores – On-chip switched network – Non-uniform memory access latency

Yield & Reliability Issues

  • Errors due to variations (e.g., VTH variation)

– Run-time dependent – Reserving larger margins means lower yield

  • Traditional test methods not enough

– Burn-in/IDDQ less effective

  • Time-dependent device degradation

– ~9% SNM degradation/3yr in SRAM due to NBTI – Electromigration, TDDB, …

  • Soft error

– FIT: ~8% degradation (bit/generation)

SLIDE 6

Application Requirements

  • Applications’ performance demand growing
  • RMS applications (Intel’s term)

– Recognition – Mining – Synthesis

  • More multimedia applications

– Games – Animations

  • Pradip Dubey (Intel)

– “The era of tera is coming quickly. This will be an age when people need teraflops of computing power, terabits of communication bandwidth, and terabytes of data storage to handle the information all around them.”

Issues Summary

  • We must keep scaling performance @Moore’s law
  • Power consumption

– Every component design must (re-)consider power consumption

  • Power density

– Thermal management a must (but not sufficient) – Design/software methods for low temperature further needed

  • Off-chip/on-chip bandwidth requirement

– High-speed/low-power I/O – Larger on-chip memory (e.g., L2) – Package-level memory integration may become more interesting

  • Wire delay dominance

– Smaller cores – Non-uniform memory latency (i.e., hierarchy at same level)

  • Yield/reliability

– Microarchitectural provisions for yield/reliability improvement a must – Dynamic self-test/diagnosis/reconfiguration/adaptation

SLIDE 7

Memory Hierarchy Design Considerations

  • Reduce traffic (and power)

– Off-chip/on-chip traffic ~20% of total power consumption – Off-chip traffic primarily determined by on-chip capacity – On-chip traffic determined by data location – Are there redundant accesses?

  • Improve flexibility

– Data placement in L2 – Cache/line/set/way isolation – Help from OS needed

  • It doesn't assume non-uniform memory latency in uniprocessors… (is a multicore a uniprocessor?)

Remaining Topics

  • An L1 cache traffic reduction technique
  • L1 cache performance sensitivity to faults
  • A flexible L2 cache management approach

SLIDE 8

Macro Data Load: An Efficient Mechanism for Enhancing Loaded Value Reuse

  • L. Jin and S. Cho, ACM Int'l Symp. Low Power Electronics and Design (ISLPED), Oct. 2006

Motivation

  • L1 cache

– Essential for performance, traffic reduction, and power – All high-perf. processors have both i-cache and d-cache

  • Energy consumption

– E ≈ Nmem × Ecache + Nmiss × Emiss – Usually Nmiss ≪ Nmem and Ecache < Emiss – Conventional approaches

  • Reduce Nmiss (victim cache, highly set-associative cache, …)
  • Reduce Ecache (filter cache, cache sub-banking, …)
  • Reduce Emiss
  • Can we reduce Nmem?
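
As a rough, self-contained illustration of the energy model above (a minimal sketch; the access counts and per-access energies below are made-up example values, not measurements from the paper):

```c
#include <stdio.h>

/* Sketch of the slide's L1 energy model: E = Nmem*Ecache + Nmiss*Emiss.
 * All numbers below are illustrative assumptions only. */
int main(void) {
    double n_mem     = 1e9;     /* total L1 data accesses (assumed)      */
    double miss_rate = 0.02;    /* assumed L1 miss rate                  */
    double n_miss    = n_mem * miss_rate;
    double e_cache   = 0.5e-9;  /* J per L1 access (assumed)             */
    double e_miss    = 10e-9;   /* J per miss handled below L1 (assumed) */

    double e_total = n_mem * e_cache + n_miss * e_miss;
    /* Because Nmiss << Nmem, the first term dominates; reducing Nmem
     * (e.g., by reusing values already in the LSQ) attacks that term. */
    printf("E_total = %.3f J\n", e_total);
    return 0;
}
```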

SLIDE 9

L1 Traffic Reduction Ideas

  • Store-to-load forwarding

– Usually needed for correctness in OoO engine – Implemented in LSQ – Design pipeline in such a way that cache is not accessed if the desired value is in LSQ

  • Load-to-load forwarding (“loaded value reuse”)

– A loaded value may be necessary again soon – Use a separate structure or LSQ

  • Silent stores

– Stores that write the same value again – Identify, track, and eliminate silent stores – Lepak and Lipasti, ASPLOS 2002

Store-to-Load Forwarding

  • Basic idea

– Stores are kept in Load Store Queue (LSQ) until they are committed – A load dependent on a previous store may find the value in LSQ

  • Often, a load accesses LSQ and cache together for higher performance

– One can re-design pipeline so that LSQ is looked up before cache is accessed – How to deal with performance impact?
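
A minimal sketch of the LSQ search behind store-to-load forwarding (the structure layout and circular-buffer handling are illustrative assumptions, not the actual pipeline design; partial address overlaps are ignored for simplicity):

```c
#include <stdint.h>
#include <stdbool.h>

#define LSQ_SIZE 64

typedef struct {
    bool     valid;
    bool     is_store;
    uint64_t addr;    /* effective address (full-word granularity here) */
    uint64_t data;    /* store data */
} LSQEntry;

/* Walk backwards from the load toward the oldest entry (head) and return
 * the value of the youngest older store to the same address, if any.
 * On a hit, the load need not access the L1 data cache at all. */
static bool s2l_forward(const LSQEntry *lsq, int head, int load_idx,
                        uint64_t load_addr, uint64_t *out) {
    for (int i = load_idx; i != head; ) {
        i = (i - 1 + LSQ_SIZE) % LSQ_SIZE;            /* next older entry */
        if (lsq[i].valid && lsq[i].is_store && lsq[i].addr == load_addr) {
            *out = lsq[i].data;                       /* forwarded value  */
            return true;
        }
    }
    return false;                                     /* fall back to the cache */
}
```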

SLIDE 10

Load-to-Load Forwarding

  • Basic idea

– Loaded values are kept in Load Store Queue (LSQ) – A load targeting a value previously loaded may find the value in LSQ

  • Related work

– Nicolaescu et al., ISLPED 2003

Macro Data Load

  • Goal

– Maximize loaded value reuse

  • Idea

– Bring full data (64 bits) regardless of load size – Keep it in LSQ – Use partial matching and data alignment

  • Essentially, we want to exploit spatial locality present in cache line
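
A minimal sketch of the partial-matching and alignment step that macro data load relies on (entry fields, little-endian extraction, and 8-byte alignment are illustrative assumptions): the LSQ keeps the full aligned 64-bit datum, and a later narrower load extracts its bytes from it.

```c
#include <stdint.h>
#include <stdbool.h>

/* One LSQ entry holding a full, 8-byte-aligned "macro" datum. */
typedef struct {
    bool     valid;
    uint64_t base;   /* address of the datum, aligned to 8 bytes        */
    uint64_t data;   /* full 64-bit value brought in by an earlier load */
} MacroEntry;

/* A later load of `size` bytes (1, 2, 4, or 8) at `addr` can be served
 * from the entry if it falls entirely inside the stored 64-bit datum. */
static bool macro_load_reuse(const MacroEntry *e, uint64_t addr,
                             unsigned size, uint64_t *out) {
    if (!e->valid || (addr & ~7ULL) != e->base) return false;
    unsigned offset = (unsigned)(addr & 7);       /* byte offset in the datum */
    if (offset + size > 8) return false;          /* would cross the datum    */
    uint64_t v = e->data >> (8 * offset);         /* little-endian extraction */
    if (size < 8) v &= (1ULL << (8 * size)) - 1;  /* keep requested bytes     */
    *out = v;                                     /* zero-extended result     */
    return true;
}
```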

SLIDE 11

Macro Data Load, cont’d

  • Architectural changes

– Relocated data alignment logic – Sequential LSQ-cache access

  • Net impact

– LSQ becomes a small fully associative cache with FIFO replacement


SLIDE 12

Idealized Limit Study

  • MVRT (Memory Value Reuse Table)

– N entries (parameter) – Tracks store-to-load (S2L), load-to-load (L2L), and macro data load (ML)

  • Simple, idealized processor model

– No branch mis-prediction; single-issue pipeline

Overall Result

  • Assuming 256-entry buffer size (maximum in our study)
  • Up to over 70% of accesses are redundant
  • Most programs have significant reuse opportunities

– In certain cases, reuse distance is short and data footprint is small (wupwise)

  • ML consistently boosts loaded value reuse (40~60% in CINT and MiBench)

[Chart: breakdown of redundant load accesses (S2L, L2L, ML) per benchmark for SPEC CINT2k, CFP2k, and MiBench, plus CINT.avg, CFP.avg, and MiB.avg]

SLIDE 13

Load Size Mix

[Chart: load size mix (BYTE, HALF, WORD, DWORD) per benchmark for CINT2k, CFP2k, and MiBench]

  • CINT2k

– Many word (32-bit) accesses

  • CFP2k

– Relatively frequent long-word (64-bit) accesses

  • MiBench

– More frequent half (16-bit) and byte (8-bit) accesses

Per-Type Reuse

  • 8-/16-bit macro data reuse is high


[Chart: reuse rate by access size (8, 16, 32, 64 bits, and Avg.) for CINT2k, CFP2k, and MiBench]

SLIDE 14


On Different Machine Width

  • We considered running a 64-bit binary on a 64-bit machine
  • Consider two cases:

– Running a 32-bit binary on a 32-bit machine – Running a 32-bit binary on a 64-bit machine

[Chart: S2L/L2L/ML reuse for 32-bit vs. 64-bit configurations, CINT2k, CFP2k, and MiBench]

Sensitivity to Buffer Size

[Charts: S2L/L2L/ML reuse vs. buffer size (16, 32, 64, 128, 256 entries) for CINT2k, CFP2k, and MiBench]

  • When MVRT size = 16…
  • When MVRT size = 32…
  • When MVRT size = 64…

SLIDE 15

Performance Study Setup

  • Processor

– 4-issue OoO w/ 64-entry

  • Caches

– L1: 32kB, 2-way, 64B line, 2-cycle access – L2: 2MB, 4-way, 128B line, 10-cycle access

  • Memory

– 120 cycle latency

  • Studied models

– Traffic-optimized pipeline

  • Access LSQ before cache

– Performance-optimized pipeline

  • Access cache and LSQ simultaneously

Traffic Reduction

  • Significant reduction in load traffic
  • Lost opportunities

– Branch misspeculation (LSQ drain) – OoO execution (simultaneous LSQ accesses)

[Chart: load traffic reduction from S2L, L2L, and ML]

SLIDE 16

Energy Reduction

  • LSQ energy should be considered

– Previous works didn’t…

  • Up to 35% (MiBench) energy reduction

[Chart: cache + LSQ energy reduction for CINT, CFP, and MiBench]

Performance Impact

  • Performance impact should be minimized

– Increased time = unwanted energy consumption

  • ML performance impact is minimal

[Chart: performance impact for CINT, CFP, and MiBench]

SLIDE 17

Summary

  • L1 cache traffic reduction techniques evaluated
  • ML (macro data load) outperforms previously proposed S2L and L2L schemes
  • Smart traffic reduction techniques at all levels of memory hierarchy will become more and more important

Assessing the Impact of Defects in Cache Memory

  • H. Lee, S. Cho, and B. R. Childers, submitted to 2nd Workshop on Architectural Reliability (WAR-2), Sep. 2006

SLIDE 18

Motivation

  • Extreme technologies suffer…

– Process variation – VTH, leakage, … – Lifetime reliability – EM, NBTI, TDDB, … – Deteriorated testing capability – IDDQ, burn-in, …

  • Yield may stagger without large design & manufacturing margins

– Burden on keeping Moore’s law – Profitability threatened

  • Resilient designs will become critically important

– Faults are masked – Graceful performance degradation

Our Approach

  • Our focus is on microarchitecture & system
  • Examine simple “delete” schemes

– E.g., cache lines, sets, ways, etc. – Predictable performance degradation @low cost

  • Explore “remap” schemes (still simple)

– Share existing resource on demand – Slight performance degradation @small cost

  • Devise system-level “management” schemes

– Deleting & remapping decisions made intelligently – With little/no hardware addition – Synergistic architectural & system collaboration

SLIDE 19

CAFÉ

  • CAFÉ (Cache Fault Evaluation tool set)

[Figures: CAFÉ evaluation flow and fault map]

Performance of “Delete”

SLIDE 20

Summary

  • We built CAFÉ, a tool for cache reliability-performance trade-off study
  • We characterized the performance impact of simple "delete" schemes
  • We developed more sophisticated (still simple) cache remap schemes

– Results will become available soon

  • Future directions

– Study L2 cache protection techniques – Study interconnection network switch protection schemes – Study directory (cache coherence mechanism) protection schemes

A Flexible L2 Cache Management Approach for Future Multicore Processors

  • S. Cho and L. Jin, IEEE/ACM Int'l Symp. Microarchitecture (MICRO), Dec. 2006
  • L. Jin, H. Lee, and S. Cho, ACM Workshop Memory Systems Performance & Correctness (MSPC), Oct. 2006

SLIDE 21

L2 Cache Basics

  • Physically indexed, physically tagged

– Physical address determines location within cache

  • In the range of 256kB~2MB in single core

– 64B~256B cache line size – 4~8-way set associative

  • OS approach to conflict miss reduction in L2 cache

– Page coloring – Bin hopping – Best bin
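
A minimal sketch of the page-coloring idea named above (cache and page parameters are assumed example values): the low page-frame-number bits that also index the L2 set form the page's "color", and the OS allocates frames so that colors are spread or matched deliberately.

```c
#include <stdint.h>

/* Assumed parameters: 4kB pages; 2MB, 4-way L2 with 128B lines.
 * # colors = (cache size / associativity) / page size = 128. */
#define PAGE_SIZE  4096u
#define NUM_COLORS ((2u * 1024 * 1024 / 4) / PAGE_SIZE)

/* Color of a physical frame: the PFN bits that also select the L2 set. */
static inline unsigned frame_color(uint64_t phys_addr) {
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}

/* Page coloring: when mapping a virtual page, prefer a free frame whose
 * color equals the color derived from the virtual page number, so pages
 * that are used together land in different L2 sets. */
static inline unsigned desired_color(uint64_t virt_addr) {
    return (unsigned)((virt_addr / PAGE_SIZE) % NUM_COLORS);
}
```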

CMP L2 Cache Management

  • Tile-based multicore
  • Conventional L2 cache management strategies

– Treat each cache slice private to a core (“private” design) – Treat all cache slices shared by all (“shared” design)

  • Non-Uniform Cache Architectures (NUCA)

– Hybrid private/shared design (“hybrid” design)

  • Group slices
SLIDE 22

Private Caching Scheme

  • Assume L1 is “private”
  • On an L1 cache miss

– Access local L2 slice (always) – If hit, be happy – If miss, go to directory and see if wanted data is on-chip – If it is, perform coherence action and get data – If data is missing, go to main memory, update directory
  • (+) Low average hit latency

– “Data attraction”

  • (-) Low on-chip hit rate

– Each processor core has limited caching space

  • (-) Complex coherence protocol

– Replication, unknown data location

Shared Caching Scheme

  • Assume L1 is “private”
  • On an L1 cache miss

– Determine which cache slice to access

  • Usually, slice id is directly derived from address (e.g., cache line address % # slices); see the sketch below

– Access data

  • If data is missing, go to main memory
  • (+) Low on-chip miss rate

– Fine distribution of data onto available cache slices

  • (+) Simple and efficient coherence protocol

– Data location is deterministic

  • (-) Higher average hit latency

– Wanted data may be found in cache slice far off

  • Current generation processors use a shared cache model

– IBM Power4/5 – Sun Microsystems Niagara – Intel Core Duo
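
A minimal sketch of the line-interleaved slice mapping mentioned above (line size and slice count are assumed example values, not those of any particular processor):

```c
#include <stdint.h>

#define LINE_SIZE   64u   /* assumed L2 line size in bytes     */
#define NUM_SLICES  16u   /* assumed number of L2 cache slices */

/* Shared design: the home slice of a line is a fixed function of its
 * address, here simply (cache line address) % (# slices).  Location is
 * deterministic, which keeps coherence simple, but the home slice may
 * be far from the requesting core. */
static inline unsigned home_slice(uint64_t phys_addr) {
    uint64_t line_addr = phys_addr / LINE_SIZE;
    return (unsigned)(line_addr % NUM_SLICES);
}
```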

SLIDE 23

Other Variants

  • Clusters of private caches

– Each private cache cluster comprises multiple shared caches – Huh et al., ICS 2005

  • “Victim replication” – based on a shared design

– Victims from L1 are copied to local L2 slice – Zhang and Asanovic, ISCA 2005

  • “Cooperative caching” – based on a private

design

– Limit degree of sharing, evict globally unused blocks, exploit cache-to-cache transfer – Chang and Sohi, ISCA 2006

Proximity vs. Miss Rate

  • Different schemes make different trade-offs between proximity and on-chip miss rate

  • Private cache

– Favors proximity

  • Shared cache

– Favors miss rate

  • Hybrid/variant schemes

– Attempt to achieve both good proximity and low miss rate

SLIDE 24

What About Scalability/Flexibility?

  • Future processors may include 100's of cores and 100's of cache slices

  • Private cache

– Scaling will not help single program – it won’t get more capacity

  • Shared cache

– More caches increase overall capacity – Average latency increases!

  • How can we manage so many cache slices?

– Performance – Power – Reliability

Our Goal

  • Provide a flexible scheme

– Good proximity (~private cache) – Good on-chip hit rate (~shared cache) – Cache slice isolation

  • Unnecessary slices can be turned off
  • Unreliable slices should be pulled out
  • Architectural support

– Mapping information look-up and maintenance – Light-weight performance monitor

  • (System implementation)

– Production-quality OS – Full-system simulation environment

SLIDE 25

Revisiting a Shared Design

  • L1 cache is private

– Coherence information is distributed to L2 cache tags – Each L2 tag entry keeps a bit-map showing nodes keeping data

  • L2 cache slices form a logical single cache
  • Data to cache slice mapping

– Cache line granularity

  • A very balanced way
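
A minimal sketch of an L2 tag entry carrying the per-line sharer bit-map described above (field names and widths are illustrative assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_CORES 64   /* assumed number of tiles/cores */

/* Each L2 tag entry also holds directory state for its line: one bit per
 * node indicating which private L1 caches may hold a copy. */
typedef struct {
    uint64_t tag;
    bool     valid;
    bool     dirty;
    uint64_t sharers;   /* bit i set => node i's L1 may cache this line */
} L2TagEntry;

static inline void add_sharer(L2TagEntry *e, unsigned node) {
    if (node < MAX_CORES) e->sharers |= (1ULL << node);
}

static inline bool is_sharer(const L2TagEntry *e, unsigned node) {
    return node < MAX_CORES && ((e->sharers >> node) & 1ULL);
}
```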

Changing Granularity

  • Benefit of mapping at page granularity

– OS can map program’s virtual pages to decide data location – Mapping creation at page allocation time – Mapping information size is more manageable – Data access behavior (e.g., sequential access) preserved

SLIDE 26

Carrying Mapping Information

  • Simple bit selection method
  • Region-based method
  • Page table (TLB) method
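
As a rough illustration of the page table (TLB) method listed above (a sketch only; the augmented entry and the bit-selection fallback below are assumptions about how the mapping could be carried, not the exact design):

```c
#include <stdint.h>

/* Page-table/TLB method: each translation also carries the L2 slice that
 * homes the page, chosen by the OS at page allocation time. */
typedef struct {
    uint64_t vpn;        /* virtual page number         */
    uint64_t pfn;        /* physical frame number       */
    uint8_t  slice_id;   /* home L2 slice for this page */
} TlbEntry;

/* Simple bit selection method: the slice is read directly from a few
 * physical address bits just above the page offset (4 bits -> 16 slices,
 * both values assumed). */
static inline unsigned slice_from_bits(uint64_t phys_addr) {
    return (unsigned)((phys_addr >> 12) & 0xF);
}
```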

Program-Data Proximity

  • Program & data location determine min. latency to bridge them

  • “Tiers”
  • Page spreading

– Borrow cache space from nearby neighbors

SLIDE 27

Cache Pressure

  • We don't want to overload a single slice
  • Cache pressure = (# of "active" pages × page size) / cache slice size
  • A performance monitor to estimate cache pressure is necessary
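
A minimal sketch of the cache-pressure formula above (page size, slice size, and the active-page count are assumed example values):

```c
#include <stdio.h>

/* Cache pressure = (# of "active" pages * page size) / cache slice size.
 * A value above ~1.0 suggests the slice is oversubscribed and pages
 * should be spread to nearby slices. */
int main(void) {
    const double page_size    = 4096.0;           /* bytes (assumed)           */
    const double slice_size   = 512.0 * 1024.0;   /* 512kB slice (assumed)     */
    const int    active_pages = 160;              /* from a monitor (assumed)  */

    double pressure = (active_pages * page_size) / slice_size;
    printf("cache pressure = %.2f\n", pressure);  /* prints 1.25 here */
    return 0;
}
```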

Virtual Multicore

  • Cluster processor & cache slices
  • Processors in VM share their cache slices
  • Coherence and data transfer traffic contained within VM

SLIDE 28

Single Program Performance

[Charts: relative performance of PRV, SL, SP-RR, SP80, SP60, SP40, and PRV8 for gcc, parser, eon, twolf and for wupwise, galgel, ammp, sixtrack]

Under Different Traffic Load

[Charts: relative performance of SL, SP40, and SP40-CS under low, medium, and high traffic for gcc, parser, eon, twolf and for wupwise, galgel, ammp, sixtrack]

SLIDE 29

Parallel Workloads

[Chart: relative performance of PRV, SL, and VM for FFT, LU, RADIX, and OCEAN]

Deleting Cache Slice

  • When cache slices need to be deleted

– Reliability issues (e.g., faults) – Power issues

  • Conventional design suffers
  • Our approach can simply avoid using certain cache slices

[Chart: relative L2 cache latency vs. # of slices deleted (1, 2, 4, 8), conventional shared design vs. our approach]

SLIDE 30

Summary

  • L2 cache management becomes important in future multicore processors

– Many cores – Many cache slices

  • Current hardware-oriented techniques are less effective in many-core situations
  • We developed an OS-microarchitecture framework for flexible L2 cache management

– Capitalizes on a simple shared cache organization – Page level data to cache slice mapping – Use page allocation for node allocation – Adjustable proximity & on-chip cache miss rate trade-off – Cache slice isolation is trivial

Research Questions

  • OS page mapping & scheduling algorithms

– For best performance – For lowest power

  • Incorporating page coloring techniques

– Now it is 2-dimensional – node allocation & color assignment

  • A very light-weight cache performance monitor
  • Cache data replication & migration techniques

– Read-only data replication is relatively simple – What about writeable data?

SLIDE 31

Homework

  • Write a quicksort program and run it

– Start from 100 integer numbers, 100, 99, …, 2, 1 – Run the quicksort to sort the numbers to 1, 2, …, 99, 100

  • Your job is to estimate the # of memory accesses of your program
  • Refer to the description on the course (2001) website
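
A minimal sketch of one way to do the homework (an illustration only, not the required solution: it counts array-element reads and writes as a proxy for memory accesses):

```c
#include <stdio.h>

#define N 100

static long accesses = 0;   /* counts array element reads and writes */

static int  rd(const int *a, int i)   { accesses++; return a[i]; }
static void wr(int *a, int i, int v)  { accesses++; a[i] = v; }

/* Quicksort with Lomuto partition; every array touch goes through rd/wr. */
static void quicksort(int *a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = rd(a, hi);
    int i = lo;
    for (int j = lo; j < hi; j++) {
        if (rd(a, j) <= pivot) {
            int t = rd(a, i); wr(a, i, rd(a, j)); wr(a, j, t);
            i++;
        }
    }
    int t = rd(a, i); wr(a, i, rd(a, hi)); wr(a, hi, t);
    quicksort(a, lo, i - 1);
    quicksort(a, i + 1, hi);
}

int main(void) {
    int a[N];
    for (int i = 0; i < N; i++) a[i] = N - i;     /* 100, 99, ..., 2, 1 */
    quicksort(a, 0, N - 1);
    printf("a[0]=%d a[99]=%d, array accesses=%ld\n", a[0], a[N - 1], accesses);
    return 0;
}
```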