Whirlpool: Improving Dynamic Cache Management with Static Data Classification
Anurag Mukkara, Nathan Beckmann, Daniel Sanchez (MIT CSAIL)
ASPLOS XXI, Atlanta, Georgia, 4 April 2016
Processors are limited by data movement
Data movement often consumes >50% of time & energy
E.g., FP multiply-add: 20 pJ; DRAM access: 20,000 pJ
To scale performance, we must keep data near where it’s used
But how do programs use memory?
[Figure: cache banks. Good: nearby cache banks. Bad: faraway cache banks. Terrible: DRAM access.]
Static policies have limitations
[Figure: program code -> static analysis or profiling -> binary with a fixed policy]
Exploits program semantics
E.g., scratchpads, bypass hints
Can’t adapt to application phases, input-dependent behavior, or shared systems
Dynamic policies have limitations, too
[Figure: binary -> observe loads/stores -> dynamic policy]
Responsive to actual application behavior
E.g., data migration & replication
Difficult to recover program semantics from loads/stores
Expensive mechanisms (e.g., extra data movement & directories)
Combining static and dynamic is best
[Figure: program code -> static analysis or profiling -> binary; loads/stores are observed per pool (Pool A, Pool B, Pool C, Pool D), and each pool gets its own policy (Policy A, Policy B, Policy C, Policy D)]
Exploits program semantics at low overhead
Responsive to actual application behavior
Agenda
Case study
Manual classification
Parallel applications
WhirlTool
System configuration
[Figure: tile with core, L1i, L1d, and private L2]
Non-uniform cache access (NUCA): cache banks have different access latencies
We apply Whirlpool to Jigsaw [Beckmann PACT’13], a state-of-the-art NUCA cache
Allocates virtual caches, collections of parts of cache banks
Significantly outperforms prior D-NUCA schemes
Baseline dynamic NUCA scheme
Reduce cache misses
Reduce on-chip network traversals
Simple mechanisms
Dynamic policies can reduce data movement
App: Delaunay triangulation
[Figure: data placement under Static NUCA and under Jigsaw [Beckmann, PACT’13]]
Dynamic policy performs somewhat better: 4% better performance, 12% lower energy
Static analysis can help!
[Figure: access intensity of Points, Vertices, and Triangles; axes: accesses vs. footprint (MB)]
Jigsaw with Static Classification
[Figure: Jigsaw [Beckmann, PACT’13] vs. Whirlpool; access intensity of Points, Vertices, and Triangles]
Whirlpool vs. Jigsaw: 19% better performance, 42% lower energy
A few data structures are accessed more frequently than others
Whirlpool – Manual classification
Organize application data into memory pools
int poolPoints = pool_create();
Point* points = pool_malloc(sizeof(Point)*n, poolPoints);
int poolTris = pool_create();
Tri* smallTris = pool_malloc(sizeof(Tri)*m, poolTris);
Tri* largeTris = pool_malloc(sizeof(Tri)*M, poolTris);
Insight: Group semantically similar data into a pool
[Figure: pools for Points and Triangles]
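The slide shows only the calls, not what sits behind them. The block below is a minimal user-space sketch of how such an interface could behave, not Whirlpool's actual allocator. MAX_POOLS, POOL_BYTES, the bump allocation, and the helper pool_of() are assumptions made for illustration. The point it demonstrates is that each pool hands out memory from its own region, so a lower layer can tell pools apart by address.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical limits, chosen only for this sketch. */
#define MAX_POOLS  8
#define POOL_BYTES (1u << 20)   /* 1 MB region per pool */

typedef struct {
    unsigned char *base;  /* start of this pool's region */
    size_t used;          /* bytes handed out so far */
} Pool;

static Pool pools[MAX_POOLS];
static int num_pools = 0;

/* Create a pool backed by its own region; returns a pool id, or -1 on failure. */
int pool_create(void) {
    if (num_pools == MAX_POOLS) return -1;
    unsigned char *base = malloc(POOL_BYTES);
    if (base == NULL) return -1;
    pools[num_pools].base = base;
    pools[num_pools].used = 0;
    return num_pools++;
}

/* Bump-allocate `size` bytes from pool `id` at a 16-byte-aligned offset
 * within the pool's region; returns NULL if the id is invalid or the pool is full. */
void *pool_malloc(size_t size, int id) {
    if (id < 0 || id >= num_pools) return NULL;
    Pool *p = &pools[id];
    size_t start = (p->used + 15u) & ~(size_t)15u;
    if (start > POOL_BYTES || size > POOL_BYTES - start) return NULL;
    p->used = start + size;
    return p->base + start;
}

/* Return the id of the pool whose region contains `ptr`, or -1 if none does. */
int pool_of(const void *ptr) {
    uintptr_t a = (uintptr_t)ptr;
    for (int i = 0; i < num_pools; i++) {
        uintptr_t lo = (uintptr_t)pools[i].base;
        if (a >= lo && a < lo + POOL_BYTES) return i;
    }
    return -1;
}
```

With this sketch, smallTris and largeTris from the slide would land in the same region because they share poolTris, while points would land in a different one.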
Minor changes to programs
Application              Suite        Pools  LOC
Delaunay triangulation   PBBS         3      11
Maximal matching         PBBS         3      13
Delaunay refinement      PBBS         3      8
Maximal independent set  PBBS         3      13
Minimal spanning forest  PBBS         3      11
401.bzip2                SPECCPU2006  4      43
470.lbm                  SPECCPU2006  2      21
429.mcf                  SPECCPU2006  2      14
436.cactusADM            SPECCPU2006  2      53
Whirlpool on NUCA placement
Use pools to improve Jigsaw’s decisions
Each pool is allocated to a virtual cache
Jigsaw transparently places pools in NUCA banks
Whirlpool requires no changes to core Jigsaw
Increase size of structures (few KBs)
Minor improvements, e.g., bypassing (see paper)
Pools are useful elsewhere, e.g., for dynamic prefetching
Significant improvements on some apps
[Figure: energy savings vs. Jigsaw (%) and speedup vs. Jigsaw (%) for bzip2, refine, MST, lbm, mcf, cactus, matching, DT, and MIS; one off-scale speedup bar is labeled 38]
Up to 38% better performance
Up to 53% lower energy
Conventional runtimes can harm locality
Optimize load balance, not locality
Whirlpool co-locates tasks and data
Break input into pools
Application indicates task affinity
Schedule and steal tasks near their data
Dynamically adapt data placement
Requires minimal changes to task-parallel runtimes
[Figure: input data split into pools]
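One way to picture "steal tasks near their data" is a thief that tries the closest cores first. The sketch below is a simplified stand-in, not the scheduler from the talk. It assumes cores sit on a square mesh numbered row by row, that distance is the hop count between tiles, and that each core's queue holds tasks whose data is placed near that core. The names mesh_distance and steal_order are hypothetical.

```c
#include <stdlib.h>

/* Manhattan (hop) distance between two cores on a mesh `width` cores wide.
 * Core ids are assumed to be numbered row by row: id = y * width + x. */
int mesh_distance(int a, int b, int width) {
    return abs(a % width - b % width) + abs(a / width - b / width);
}

/* Fill `order` with the ids of all cores except `thief`, nearest first
 * (ties broken by lower id). Returns the number of entries written. */
int steal_order(int thief, int width, int num_cores, int *order) {
    int n = 0;
    for (int c = 0; c < num_cores; c++) {
        if (c == thief) continue;
        /* Insertion sort keyed on distance from the thief. */
        int d = mesh_distance(thief, c, width);
        int i = n++;
        while (i > 0 && mesh_distance(thief, order[i - 1], width) > d) {
            order[i] = order[i - 1];
            i--;
        }
        order[i] = c;
    }
    return n;
}
```

An idle core would walk this list and steal from the first non-empty queue, so work tends to move to a core that is close to the data it touches.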
Whirlpool improves locality
Whirlpool adapts schedule dynamically
Data placement implicitly schedules tasks
Significant improvements at 16 cores
[Figure: speedup vs. Jigsaw (%) and energy savings vs. Jigsaw for MS, FFT, TC, DT, PR, and CC]
Up to 67% better performance
Up to 2.6x lower energy
Applications
Divide and conquer algorithms: Mergesort, FFT
Graph analytics: PageRank, Triangle Counting, Connected Components
Graphics: Delaunay Triangulation
Caveat: Splitting data into pools can be expensive!
WhirlTool – Automated classification
Modifying program code is not always practical
A profile-guided tool can automatically classify data into pools
[Figure: the WhirlTool profiler produces per-callpoint miss curves; the WhirlTool analyzer turns them into a callpoint-to-pool map; at run time, the WhirlTool runtime maps the application's malloc() calls to pool_malloc() in the Whirlpool allocator]
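The runtime step, turning a plain malloc() into a pool_malloc() on the right pool, amounts to a lookup keyed by the allocation's callpoint. The sketch below shows one plausible form of that lookup; it is not taken from the WhirlTool source. The CallpointEntry layout, the sorted-array representation, and DEFAULT_POOL for unprofiled callpoints are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a hypothetical callpoint-to-pool map produced offline. */
typedef struct {
    uintptr_t callpoint;  /* address of the allocation call site */
    int pool;             /* pool id assigned by the analyzer */
} CallpointEntry;

#define DEFAULT_POOL 0    /* assumption: unprofiled callpoints share pool 0 */

/* Binary search for `callpoint`; `map` must be sorted by callpoint.
 * Returns the mapped pool, or DEFAULT_POOL if the callpoint is unknown. */
int pool_for_callpoint(const CallpointEntry *map, size_t n, uintptr_t callpoint) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].callpoint == callpoint) return map[mid].pool;
        if (map[mid].callpoint < callpoint) lo = mid + 1;
        else hi = mid;
    }
    return DEFAULT_POOL;
}
```

A malloc() wrapper built with GCC or Clang could obtain the callpoint from __builtin_return_address(0) and pass the resulting pool id to pool_malloc().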
WhirlTool profiles miss curves
Periodically records per-callpoint miss curves
Groups allocations by callpoint
Profiles accesses to each pool
[Figure: application allocations and accesses grouped by callpoint (A, B, C, ...); miss curves (misses vs. cache size) recorded over time]
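A miss curve gives the number of misses as a function of cache size. The slides do not say how WhirlTool measures it, so the sketch below substitutes the classic stack-distance approach for a fully associative LRU cache: an access hits in a cache of s lines exactly when its stack distance is below s. The function name and the histogram layout are assumptions.

```c
#include <stddef.h>

/* hist[d] = number of accesses whose LRU stack distance is d lines
 *           (d = 0 means the line was the most recently used one).
 * cold    = accesses that miss at every profiled size (first touches, plus
 *           any reuse at distance >= max_size, assumed folded in here).
 * Writes misses[s] for s = 0..max_size: the misses a fully associative
 * LRU cache of s lines would incur. `hist` has max_size entries;
 * `misses` has max_size + 1 entries. */
void miss_curve(const unsigned long *hist, size_t max_size,
                unsigned long cold, unsigned long *misses) {
    misses[max_size] = cold;
    for (size_t s = max_size; s > 0; s--) {
        /* Shrinking the cache from s to s-1 lines loses the hits at distance s-1. */
        misses[s - 1] = misses[s] + hist[s - 1];
    }
}
```

Keeping one histogram per callpoint and recomputing the curve at each profiling interval would yield the per-callpoint curves over time that the slide describes.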
WhirlTool analyzes curves to find pools
Hardware can only support a limited number of pools
Jigsaw uses 3 virtual caches / thread
0.6% area overhead over LLC
Whirlpool adds 4 pools (each mapped to a virtual cache)
1.2% total area overhead over LLC
Must cluster callpoints into semantically similar groups
Per-callpoint miss curves -> agglomerative clustering -> callpoint-to-pool mapping
Example of agglomerative clustering
[Figure: agglomerative clustering example; nodes labeled 1, 1, 1, 2, 2, 3]
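Agglomerative clustering starts with every item in its own cluster and repeatedly merges the closest pair until the target number of clusters remains. The sketch below is a generic version that takes a precomputed pairwise distance matrix and uses single linkage. Both choices are assumptions of this example, since the slides do not give the tool's linkage rule. Here items stand for callpoints and k for the number of pools the hardware supports.

```c
#include <stddef.h>

/* Agglomerative clustering with single linkage (an assumption of this sketch).
 * dist is an n*n symmetric matrix in row-major order; label[i] receives the
 * cluster id of item i, numbered from 0 in order of first appearance. */
void cluster(const double *dist, size_t n, size_t k, int *label) {
    size_t clusters = n;
    for (size_t i = 0; i < n; i++) label[i] = (int)i;  /* one cluster per item */

    while (clusters > k) {
        /* Find the closest pair of items that lie in different clusters. */
        size_t bi = 0, bj = 0;
        int found = 0;
        for (size_t i = 0; i < n; i++)
            for (size_t j = i + 1; j < n; j++)
                if (label[i] != label[j] &&
                    (!found || dist[i * n + j] < dist[bi * n + bj])) {
                    bi = i; bj = j; found = 1;
                }
        if (!found) break;  /* everything is already in one cluster */

        /* Merge into the smaller label, so a cluster's label is always
         * the index of its lowest-numbered member. */
        int to = label[bi] < label[bj] ? label[bi] : label[bj];
        int from = label[bi] < label[bj] ? label[bj] : label[bi];
        for (size_t i = 0; i < n; i++)
            if (label[i] == from) label[i] = to;
        clusters--;
    }

    /* Renumber densely: representatives get fresh ids, members copy them. */
    int next = 0;
    for (size_t i = 0; i < n; i++) {
        if (label[i] == (int)i) label[i] = next++;
        else label[i] = label[label[i]];
    }
}
```

The resulting label array plays the role of the callpoint-to-pool mapping: label[i] is the pool assigned to callpoint i.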
WhirlTool’s distance metric
How many misses are saved by separating pools?
[Figure: miss curves (misses vs. cache size) for Pool 1, Pool 2, and Pool 3, separated and combined; one pair has a small distance, another a large distance]
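The question "how many misses are saved by separating pools?" can be read as a difference between two miss counts at a given cache budget: the misses when both pools share one cache, minus the misses when the budget is split between them in the best possible way. The sketch below implements that reading. The exact formula used by WhirlTool is not on the slides, so the evaluation at a single budget and the names are assumptions, and the combined curve is taken as an input rather than derived.

```c
#include <stddef.h>

/* Fewest total misses when `total` units of cache are split between two
 * pools with miss curves m1 and m2 (each indexed 0..total). */
unsigned long separated_misses(const unsigned long *m1, const unsigned long *m2,
                               size_t total) {
    unsigned long best = m1[0] + m2[total];
    for (size_t s = 1; s <= total; s++) {
        unsigned long m = m1[s] + m2[total - s];
        if (m < best) best = m;
    }
    return best;
}

/* Misses saved by keeping the two pools apart at a budget of `total` units.
 * `combined` is the miss curve of both pools sharing one cache; it is an
 * input here (assumed to be profiled or modeled elsewhere). */
long pool_distance(const unsigned long *m1, const unsigned long *m2,
                   const unsigned long *combined, size_t total) {
    return (long)combined[total] - (long)separated_misses(m1, m2, total);
}
```

Under this reading, two pools with similar curves gain little from separation and get a small distance, while a cache-friendly pool paired with a streaming pool gets a large one.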
WhirlTool matches manual hints
[Figure: speedup vs. Jigsaw (%) for WhirlTool and for manual classification on leslie, gcc, gems, bzip2, omnet, ray, refine, sphinx3, MST, lbm, setCover, soplex, xalanc, mcf, SA, cactus, matching, DT, and MIS; off-scale bars are labeled 38]
Multiprogram mixes
4-core system with random SPECCPU2006 apps
Including those that do not benefit
Whirlpool improves performance (gmean over 20 mixes) by:
35% over S-NUCA
30% over idealized shared-private D-NUCA [Herrero, ISCA’10]
26% over R-NUCA [Hardavellas, ISCA’09]
18% over page placement by Awasthi et al. [Awasthi HPCA’09]
5% over Jigsaw [Beckmann, PACT’13]
Conclusion
Semantic information from applications improves performance of dynamic policies
Coordinated data and task placement gives large improvements in parallel applications
Automated classification reduces programmer burden
THANKS FOR YOUR ATTENTION! QUESTIONS ARE WELCOME!
WhirlTool code available at http://bit.ly/WhirlTool