Memory Hierarchy Design Issues in Many-Core Processors
Sangyeun Cho
Dept. of Computer Science
University of Pittsburgh
AMD Opteron Dual-Core IBM Power5 SUN UltraSPARC IV+ SUN UltraSPARC T1
Dance-hall organization Round-table organization Tiled organization
Discussions are based on
– ITRS 2001/2003/2005 – Intel’s “Platform 2015” whitepapers – Other references
– 3B transistors @22nm technologies – 40GHz local clock
– Single core (OoO/VLIW) scalability is limited – Multicore is the result of natural evolution
Power trend, unconstrained
– # of transistors – Faster clock frequency – Increased leakage power
– Related to temperature – Becomes more critical
– # of processors – Faster clock frequency
– Off-chip bandwidth – On-chip interconnect bandwidth
– Off-chip bandwidth is 2~20GB/s.
– # of pins – Bandwidth per pin
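As a back-of-the-envelope illustration (the pin count and per-pin rate are assumed numbers, not from the slides): off-chip bandwidth ≈ # of pins × bandwidth per pin, e.g., 128 data pins × 1 Gb/s per pin = 128 Gb/s ≈ 16 GB/s, within the 2~20GB/s range quoted above.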
– Simpler processor cores – On-chip switched network – Non-uniform memory access latency
– Run-time dependent – Reserving larger margins means lower yield
– Burn-in/IDDQ less effective
– ~9% SNM degradation/3yr in SRAM due to NBTI – Electromigration, TDDB, …
– FIT: ~8% degradation (bit/generation)
– Recognition – Mining – Synthesis
– Games – Animations
– “The era of tera is coming quickly. This will be an age when people need teraflops of computing power, terabits of communication bandwidth, and terabytes of data storage to handle the information all around them.”
– Every component design must (re-)consider power consumption
– Thermal management a must (but not sufficient) – Design/software methods for low temperature further needed
– High-speed/low-power I/O – Larger on-chip memory (e.g., L2) – Package-level memory integration may become more interesting
– Smaller cores – Non-uniform memory latency (i.e., hierarchy at same level)
– Microarchitectural provisions for yield/reliability improvement a must – Dynamic self-test/diagnosis/reconfiguration/adaptation
– Off-chip/on-chip traffic ~20% of total power consumption – Off-chip traffic primarily determined by on-chip capacity – On-chip traffic determined by data location – Are there redundant accesses?
– Data placement in L2 – Cache/line/set/way isolation – Help from OS needed
uniprocessors… (is a multicore a uniprocessor?)
ACM Int’l Symp. Low Power Electronics and Design (ISLPED)
– Essential for performance, traffic reduction, and power – All high-perf. processors have both i-cache and d-cache
– N_mem×E_cache + N_miss×E_miss – Usually N_miss ≪ N_mem, E_cache < E_miss – Conventional approaches
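Worked example of the energy balance above (the per-access energies are illustrative assumptions, not measurements from the talk): total data-access energy is N_mem×E_cache + N_miss×E_miss. With E_cache = 0.5 nJ, E_miss = 10 nJ, and a 2% miss rate, the average is 0.5 + 0.02×10 = 0.7 nJ per access. Because N_miss ≪ N_mem, the N_mem×E_cache term dominates, so eliminating cache accesses altogether (not only misses) is the main energy lever.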
– Usually needed for correctness in OoO engine – Implemented in LSQ – Design pipeline in such a way that cache is not accessed if the desired value is in LSQ
– A loaded value may be necessary again soon – Use a separate structure or LSQ
– Stores that write a same value again – Identify, track, and eliminate silent stores – Lepak and Lipasti, ASPLOS 2002
– Stores are kept in Load Store Queue (LSQ) until they are committed – A load dependent on a previous store may find the value in LSQ
– One can re-design pipeline so that LSQ is looked up before cache is accessed – How to deal with performance impact?
– Loaded values are kept in Load Store Queue (LSQ) – A load targeting a value previously loaded may find the value in LSQ
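A minimal C sketch of this lookup order (structure and field names are assumptions for illustration, not the actual pipeline design): probe the LSQ for a matching in-flight store (S2L) or an earlier load (L2L) before touching the D-cache.

#include <stdbool.h>
#include <stdint.h>

/* One LSQ entry: an in-flight store or a recently issued load. */
typedef struct {
    bool     valid;
    bool     is_store;     /* store entries enable S2L reuse, load entries enable L2L */
    uint64_t addr;         /* effective address (word-aligned for simplicity)          */
    uint64_t data;         /* value written (store) or previously loaded (load)        */
} LsqEntry;

#define LSQ_SIZE 64

/* Search the LSQ from youngest to oldest; on a hit the D-cache access is skipped. */
static bool lsq_lookup(const LsqEntry lsq[LSQ_SIZE], int youngest,
                       uint64_t addr, uint64_t *value_out)
{
    for (int i = 0; i < LSQ_SIZE; i++) {
        int idx = (youngest - i + LSQ_SIZE) % LSQ_SIZE;  /* walk backwards in age */
        const LsqEntry *e = &lsq[idx];
        if (e->valid && e->addr == addr) {
            *value_out = e->data;    /* S2L forwarding or L2L reuse */
            return true;
        }
    }
    return false;                    /* no reuse: fall through to the D-cache */
}

A load would call lsq_lookup() first and access the L1 D-cache only on a miss; the skipped cache accesses are where the traffic and energy savings come from.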
– Nicolaescu et al., ISLPED 2003
– Maximize loaded value reuse
– Bring full data (64 bits) regardless of load size – Keep it in LSQ – Use partial matching and data alignment
– Relocated data alignment logic – Sequential LSQ-cache access
– LSQ becomes a small fully associative cache with FIFO replacement
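A sketch of the macro data load idea: the LSQ entry always keeps the full 64-bit word covering the access, and a later, narrower load extracts its bytes by shifting and masking. Little-endian byte order is assumed and the function name is illustrative.

#include <stdint.h>

/* Extract a `size`-byte load (size = 1, 2, 4, or 8) at `addr` from the full
 * 64-bit word `dword` captured for the 8-byte-aligned block containing `addr`.
 * Assumes the access does not cross the 8-byte boundary. */
static uint64_t extract_from_macro_load(uint64_t dword, uint64_t addr, int size)
{
    unsigned byte_off = (unsigned)(addr & 7u);       /* offset inside the dword */
    uint64_t shifted  = dword >> (8 * byte_off);     /* move wanted bytes down  */
    if (size >= 8)
        return shifted;                              /* full double word        */
    uint64_t mask = (1ULL << (8 * size)) - 1;        /* keep only `size` bytes  */
    return shifted & mask;
}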
– N entries (parameter) – Tracks store-to-load (S2L), load-to-load (L2L), and macro data load (ML)
– No branch mis-prediction; single-issue pipeline
– In certain cases, reuse distance is short and data footprint is small (wupwise)
[Chart: fraction of loads covered by each reuse type (S2L, L2L, ML) per benchmark, for gzip, vpr, gcc, mcf, parser, perl, gap, vortex, bzip2, twolf, wupwise, swim, mgrid, mesa, art, equake, and the MiBench programs (jpeg, gsm, rijndael, tiff2rgba, rsynth, ispell, search), with CINT, CFP, and MiBench averages]
[Chart: distribution of load access sizes (DWORD, WORD, HALF, BYTE) for the same benchmarks]
– Many word (32-bit) accesses
– Relatively frequent long-word (64-bit) accesses
– More frequent half (16-bit) and byte (8-bit) accesses
[Chart: per-suite breakdown by access size (8, 16, 32, 64 bits) with averages, for CINT2k, CFP2k, and MiBench]
[Chart: reuse type breakdown (S2L, L2L, ML) for 32-bit vs. 64-bit operation, for CINT2k, CFP2k, and MiBench]
– Running a 32-bit binary on a 32-bit machine – Running a 32-bit binary on a 64-bit machine
[Chart: reuse coverage by type (S2L, L2L, ML) as the reuse detector grows from 16 to 256 entries, for CINT2k, CFP2k, and MiBench]
– 4-issue OoO w/ 64-entry
– L1: 32kB, 2-way, 64B line, 2-cycle access – L2: 2MB, 4-way, 128B line, 10-cycle access
– 120 cycle latency
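The simulated machine written out as a configuration record (a sketch: the 64-entry structure is assumed to be the LSQ, and the 120-cycle latency is assumed to be main memory):

/* Simulated machine parameters from the evaluation (sketch). */
typedef struct {
    int issue_width;       /* 4-issue out-of-order core */
    int lsq_entries;       /* 64 (assumed to be the LSQ) */
    int l1_size_kb, l1_ways, l1_line_b, l1_lat_cyc;   /* 32 KB, 2-way, 64 B, 2 cycles   */
    int l2_size_kb, l2_ways, l2_line_b, l2_lat_cyc;   /* 2 MB, 4-way, 128 B, 10 cycles  */
    int mem_lat_cyc;       /* 120 cycles (assumed main memory) */
} SimConfig;

static const SimConfig kBaseline = {
    .issue_width = 4, .lsq_entries = 64,
    .l1_size_kb = 32,   .l1_ways = 2, .l1_line_b = 64,  .l1_lat_cyc = 2,
    .l2_size_kb = 2048, .l2_ways = 4, .l2_line_b = 128, .l2_lat_cyc = 10,
    .mem_lat_cyc = 120,
};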
– Traffic-optimized pipeline
– Performance-optimized pipeline
– Branch misspeculation (LSQ drain) – OoO execution (simultaneous LSQ accesses)
[Charts: results broken down by reuse type (S2L, L2L, ML) for CINT, CFP, and MiBench]
Submitted to 2nd Workshop on Architectural Reliability (WAR-2)
– Process variation – VTH, leakage, … – Lifetime reliability – EM, NBTI, TDDB, … – Deteriorated testing capability – IDDQ, burn-in, …
– Burden on keeping Moore’s law – Profitability threatened
– Faults are masked – Graceful performance degradation
– E.g., cache lines, sets, ways, etc. – Predictable performance degradation @low cost
– Share existing resource on demand – Slight performance degradation @small cost
– Deleting & remapping decisions made intelligently – With little/no hardware addition – Synergistic architectural & system collaboration
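One concrete form of “deleting & remapping” (a sketch, not the exact mechanism evaluated in the talk): keep a fault map per cache set and exclude faulty ways from victim selection, so a set with a bad way degrades to lower associativity instead of failing outright.

#include <stdint.h>

#define WAYS 4

/* Per-set state: bit i of faulty_ways set => way i has a known hard fault. */
typedef struct {
    uint8_t faulty_ways;         /* bitmask over WAYS                          */
    uint8_t lru_order[WAYS];     /* way indices, most- to least-recently used  */
} CacheSetState;

/* Pick a victim way, skipping isolated (faulty) ways.  Returns -1 if every way
 * in the set is faulty; the set itself must then be remapped or disabled. */
static int pick_victim(const CacheSetState *set)
{
    for (int i = WAYS - 1; i >= 0; i--) {            /* start from the LRU end */
        int way = set->lru_order[i];
        if (!(set->faulty_ways & (1u << way)))
            return way;
    }
    return -1;
}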
[Diagram: evaluation flow and fault map]
– Results will become available soon
– Study L2 cache protection techniques – Study interconnection network switch protection schemes – Study directory (cache coherence mechanism) protection schemes
IEEE/ACM Int’l Symp. Microarchitecture (MICRO)
ACM Workshop Memory Systems Performance & Correctness (MSPC)
– Physical address determines location within cache
– 64B~256B cache line size – 4~8-way set associative
– Page coloring – Bin hopping – Best bin
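A minimal sketch of page coloring: the cache “color” of a physical page is the set-index bits that lie above the page offset, so the OS can steer a virtual page to a chosen region of the cache by picking a physical frame of the right color. The cache and page parameters below are assumptions for illustration.

#include <stdint.h>

/* Illustrative parameters (assumed): 4 KB pages; 2 MB, 4-way L2 with 128 B lines,
 * i.e. 4096 sets and a 12-bit set index. */
#define PAGE_SHIFT 12
#define LINE_SHIFT 7             /* 128 B cache lines */
#define SET_BITS   12            /* log2(4096 sets)   */

/* Page colors are the set-index bits above the page offset. */
#define COLOR_BITS (SET_BITS + LINE_SHIFT - PAGE_SHIFT)   /* = 7 -> 128 colors */
#define NUM_COLORS (1u << COLOR_BITS)

/* Color of a physical frame: its low COLOR_BITS bits. */
static unsigned page_color(uint64_t pfn)
{
    return (unsigned)(pfn & (NUM_COLORS - 1));
}

Page coloring picks a frame whose color matches the virtual page; bin hopping instead cycles through the colors on successive page allocations; best bin picks the least-occupied color.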
– Treat each cache slice private to a core (“private” design) – Treat all cache slices shared by all (“shared” design)
– Hybrid private/shared design (“hybrid” design)
– Access local L2 slice (always) – If hit
– If miss
– Perform coherence action and get data
– “Data attraction”
– Each processor core has limited caching space
– Replication, unknown data location
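A sketch of the private (“local-first”) access flow described above: always probe the local slice, and only on a local miss invoke the coherence protocol to fetch the line from a remote slice or memory. The helper functions are assumed to be provided by the cache/coherence model; names are illustrative.

#include <stdbool.h>
#include <stdint.h>

typedef struct CacheSlice CacheSlice;   /* one per-core L2 slice (opaque here) */

/* Assumed helpers supplied by the cache/coherence model (not defined here). */
bool slice_lookup(CacheSlice *s, uint64_t addr, void *line_out);
void coherence_fetch(int core_id, uint64_t addr, void *line_out);  /* directory/bus action */
void slice_fill(CacheSlice *s, uint64_t addr, const void *line);

/* Private-design read: the local slice is always probed first ("data attraction"). */
void private_l2_read(int core_id, CacheSlice *local, uint64_t addr, void *line_out)
{
    if (slice_lookup(local, addr, line_out))
        return;                                 /* local hit: short latency          */

    coherence_fetch(core_id, addr, line_out);   /* remote slice or off-chip memory   */
    slice_fill(local, addr, line_out);          /* attract the data to the local slice */
}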
– Determine which cache slice to access (e.g., address mod # slices)
– Access data
– Fine distribution of data onto available cache slices
– Data location is deterministic
– Wanted data may be found in cache slice far off
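A sketch of the shared-design mapping: the slice holding a line is a fixed function of the address, for example by interleaving cache-line addresses across the slices, so the requesting core computes the home slice directly. The slice count and interleaving granularity are assumptions.

#include <stdint.h>

#define NUM_SLICES 16         /* one L2 slice per tile (illustrative) */
#define LINE_SHIFT 7          /* 128 B cache lines                    */

/* Shared design: consecutive cache lines are interleaved across the slices,
 * giving each line a single, deterministic location -- which may be far from
 * the requesting core. */
static unsigned home_slice(uint64_t paddr)
{
    return (unsigned)((paddr >> LINE_SHIFT) % NUM_SLICES);
}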
– IBM Power4/5 – Sun Microsystems Niagara – Intel Core Duo
– Each private cache cluster comprises multiple shared caches – Huh et al., ICS 2005
– Victims from L1 are copied to local L2 slice – Zhang and Asanovic, ISCA 2005
– Limit degree of sharing, evict globally unused blocks, exploit cache-to-cache transfer – Chang and Sohi, ISCA 2006
– Scaling will not help single program – it won’t get more capacity
– More caches increase overall capacity – Average latency increases!
– Performance – Power – Reliability
– Good proximity (~private cache) – Good on-chip hit rate (~shared cache) – Cache slice isolation
– Mapping information look-up and maintenance – Light-weight performance monitor
– Production-quality OS – Full-system simulation environment
– OS can map program’s virtual pages to decide data location – Mapping creation at page allocation time – Mapping information size is more manageable – Data access behavior (e.g., sequential access) preserved
– Borrow cache space from nearby neighbors
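A sketch of the OS-directed mapping above: at page allocation time the OS makes a two-dimensional choice, which tile’s slice should home the page and which color within that slice, and can spill to a lightly loaded neighbor when the local slice is saturated. The cost model and bookkeeping are illustrative, not the exact policy.

#include <stdint.h>

#define NUM_TILES  16
#define NUM_COLORS 128

/* Per-tile bookkeeping the OS can maintain cheaply (illustrative). */
typedef struct {
    unsigned pages_mapped;            /* rough occupancy of this slice       */
    unsigned next_color;              /* round-robin color within the slice  */
} TileState;

static unsigned hops(unsigned a, unsigned b)   /* assumed on-chip distance metric */
{
    return (a > b) ? a - b : b - a;
}

/* Choose (tile, color) for a page faulted by `core`: prefer the local slice,
 * but borrow from the closest lightly loaded tile when the local slice already
 * holds too many pages. */
static void place_page(TileState tiles[NUM_TILES], unsigned core,
                       unsigned capacity_limit,
                       unsigned *tile_out, unsigned *color_out)
{
    unsigned best = core;
    if (tiles[core].pages_mapped >= capacity_limit) {
        for (unsigned t = 0; t < NUM_TILES; t++)
            if (tiles[t].pages_mapped < tiles[best].pages_mapped ||
                (tiles[t].pages_mapped == tiles[best].pages_mapped &&
                 hops(core, t) < hops(core, best)))
                best = t;
    }
    *tile_out  = best;
    *color_out = tiles[best].next_color;
    tiles[best].next_color = (tiles[best].next_color + 1) % NUM_COLORS;
    tiles[best].pages_mapped++;
}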
[Chart: relative performance of PRV, SL, SP-RR, SP80, SP60, SP40, and PRV8 for gcc, parser, eon, twolf, wupwise, galgel, ammp, and sixtrack]
[Chart: relative performance of SL, SP40, and SP40-CS for the same programs under low, medium, and high network traffic]
[Chart: relative performance of PRV, SL, and VM for FFT, LU, RADIX, and OCEAN]
– Reliability issues (e.g., faults) – Power issues
[Chart: relative L2 cache latency vs. number of slices deleted (1–8), conventional shared design vs. our approach]
– Many cores – Many cache slices
– Capitalizes on a simple shared cache organization – Page level data to cache slice mapping – Use page allocation for node allocation – Adjustable proximity & on-chip cache miss rate trade-off – Cache slice isolation is trivial
– For best performance – For lowest power
– Now it is 2-dimensional – node allocation & color assignment
– Read-only data replication is relatively simple – What about writeable data?
– Start from 100 integer numbers: 100, 99, …, 2, 1 – Run quicksort to sort the numbers into 1, 2, …, 99, 100
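A small C version of that experiment (a sketch; the slides do not show code): fill an array with 100, 99, …, 1 and quicksort it into ascending order.

#include <stdio.h>

#define N 100

static void quicksort(int a[], int lo, int hi)
{
    if (lo >= hi)
        return;
    int pivot = a[(lo + hi) / 2];           /* middle element as pivot */
    int i = lo, j = hi;
    while (i <= j) {                        /* Hoare-style partition   */
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i <= j) {
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            i++; j--;
        }
    }
    quicksort(a, lo, j);
    quicksort(a, i, hi);
}

int main(void)
{
    int a[N];
    for (int i = 0; i < N; i++)
        a[i] = N - i;                       /* 100, 99, ..., 2, 1 */
    quicksort(a, 0, N - 1);
    for (int i = 0; i < N; i++)
        printf("%d%c", a[i], i + 1 < N ? ' ' : '\n');   /* 1 2 ... 100 */
    return 0;
}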