Whirlpool! Improving Dynamic Cache Management with Static Data Classification



slide-1
SLIDE 1

Anurag Mukkara, Nathan Beckmann, Daniel Sanchez MIT CSAIL

ASPLOS XXI - Atlanta, Georgia – 4 April 2016

WHIRLPOOL!

IMPROVING DYNAMIC CACHE MANAGEMENT

WITH STATIC DATA CLASSIFICATION

slide-2
SLIDE 2

Processors are limited by data movement

• Data movement often consumes >50% of time and energy
• E.g., FP multiply-add: 20 pJ; DRAM access: 20,000 pJ
• To scale performance, must keep data near where it's used
• But how do programs use memory?

[Figure: chip with cache banks. Good: nearby cache banks. Bad: faraway cache banks. Terrible: DRAM access.]
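The gap between the two energy figures on this slide can be made concrete with a little arithmetic. The sketch below uses only the two per-operation costs quoted above; the function names and the operation counts in the usage note are illustrative assumptions, not measurements from the talk.

```c
#include <stddef.h>

/* Per-operation energies quoted on the slide, in picojoules. */
#define FMA_PJ   20.0      /* FP multiply-add */
#define DRAM_PJ  20000.0   /* DRAM access */

/* How many multiply-adds cost as much energy as one DRAM access. */
double dram_to_fma_ratio(void)
{
    return DRAM_PJ / FMA_PJ;
}

/* Energy (pJ) of a hypothetical kernel that performs `fmas`
 * multiply-adds and `dram_accesses` DRAM accesses; nothing else
 * is counted. */
double kernel_energy_pj(size_t fmas, size_t dram_accesses)
{
    return (double)fmas * FMA_PJ + (double)dram_accesses * DRAM_PJ;
}
```

With these numbers one DRAM access costs as much as 1,000 multiply-adds, so a kernel that does 1,000 multiply-adds per DRAM access still spends half of the energy counted here on DRAM.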

slide-3
SLIDE 3

Static policies have limitations

[Diagram: Program code → static analysis or profiling → binary with a fixed policy]

• Exploits program semantics
• E.g., scratchpads, bypass hints
• Can’t adapt to application phases, input-dependent behavior, or shared systems
slide-4
SLIDE 4

Dynamic policies have limitations, too

[Diagram: Binary → observe loads/stores → dynamic policy]

• Responsive to actual application behavior
• E.g., data migration & replication
• Difficult to recover program semantics from loads/stores
• Expensive mechanisms (e.g., extra data movement & directories)

slide-5
SLIDE 5

Combining static and dynamic is best

[Diagram: Program code → static analysis or profiling → binary. Data is classified into Pool A, Pool B, Pool C, Pool D; observing loads/stores drives Policy A, Policy B, Policy C, Policy D.]

• Exploits program semantics at low overhead
• Responsive to actual application behavior

slide-6
SLIDE 6

Agenda

• Case study
• Manual classification
• Parallel applications
• WhirlTool

slide-7
SLIDE 7

System configuration

[Diagram labels: Core, L1i, L1d, Private L2]

Non-uniform cache access (NUCA): cache banks have different access latencies

slide-8
SLIDE 8

Baseline dynamic NUCA scheme

• We apply Whirlpool to Jigsaw [Beckmann, PACT’13], a state-of-the-art NUCA cache
  • Allocates virtual caches, collections of parts of cache banks
  • Significantly outperforms prior D-NUCA schemes

[Diagram labels: reduce cache misses; reduce on-chip network traversals; simple mechanisms]
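The notion of a virtual cache as a collection of parts of cache banks can be pictured as a small data structure. This is only an illustrative sketch: the type names, the `MAX_SHARES` cap, and both helper functions are assumptions made here and are not Jigsaw's actual hardware or software interface.

```c
#include <stddef.h>

#define MAX_SHARES 64   /* hypothetical cap on banks per virtual cache */

/* A slice of one physical cache bank. */
typedef struct {
    int    bank;    /* index of the bank on the chip */
    size_t bytes;   /* capacity taken from that bank */
} bank_share;

/* A virtual cache: a collection of slices drawn from several banks. */
typedef struct {
    bank_share shares[MAX_SHARES];
    size_t     num_shares;
} virtual_cache;

/* Add `bytes` of bank `bank`; returns 0 on success, -1 if full. */
int vc_add_share(virtual_cache *vc, int bank, size_t bytes)
{
    if (vc->num_shares == MAX_SHARES)
        return -1;
    vc->shares[vc->num_shares].bank = bank;
    vc->shares[vc->num_shares].bytes = bytes;
    vc->num_shares++;
    return 0;
}

/* Total capacity of the virtual cache across all of its slices. */
size_t vc_capacity(const virtual_cache *vc)
{
    size_t total = 0;
    for (size_t i = 0; i < vc->num_shares; i++)
        total += vc->shares[i].bytes;
    return total;
}
```

Under this picture, a placement policy chooses which banks contribute slices (to cut network traversals) and how large each virtual cache is (to cut misses).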

slide-9
SLIDE 9

Dynamic policies can reduce data movement

App: Delaunay triangulation

[Figure: Static NUCA compared with Jigsaw [Beckmann, PACT’13]]

Dynamic policy (Jigsaw) performs somewhat better than static NUCA:
• 4% better performance
• 12% lower energy

slide-10
SLIDE 10

Static analysis can help!

[Figure: access intensity of Points, Vertices, and Triangles; axes: accesses and footprint (MB)]

slide-11
SLIDE 11

Jigsaw with Static Classification

[Figure: Jigsaw [Beckmann, PACT’13] compared with Whirlpool; access intensity of Points, Vertices, and Triangles]

Whirlpool vs. Jigsaw:
• 19% better performance
• 42% lower energy

A few data structures are accessed more frequently than others

slide-12
SLIDE 12

Agenda

• Case study
• Manual classification
• Parallel applications
• WhirlTool

slide-13
SLIDE 13

Whirlpool – Manual classification

Organize application data into memory pools

/* One pool for the points */
int poolPoints = pool_create();
Point* points = pool_malloc(sizeof(Point)*n, poolPoints);
/* One pool shared by both triangle arrays */
int poolTris = pool_create();
Tri* smallTris = pool_malloc(sizeof(Tri)*m, poolTris);
Tri* largeTris = pool_malloc(sizeof(Tri)*M, poolTris);

Insight: Group semantically similar data into a pool (here: Points, Triangles)
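To make the shape of the pool interface concrete, here is a minimal stand-in for `pool_create()` and `pool_malloc()` built on plain `malloc`. It is a sketch under stated assumptions: it only records how many bytes each pool has handed out, `MAX_POOLS` and `pool_footprint()` are hypothetical additions, and the part where a real allocator tells the system which pool each allocation belongs to is omitted entirely.

```c
#include <stdlib.h>

#define MAX_POOLS 16   /* hypothetical limit, not from the talk */

static int    pools_created = 0;
static size_t pool_bytes[MAX_POOLS];   /* bytes handed out per pool */

/* Create a new pool; returns its id, or -1 if no ids are left. */
int pool_create(void)
{
    if (pools_created == MAX_POOLS)
        return -1;
    pool_bytes[pools_created] = 0;
    return pools_created++;
}

/* Allocate `size` bytes on behalf of `pool`; NULL on a bad pool id
 * or if malloc fails. Free the result with free(). */
void *pool_malloc(size_t size, int pool)
{
    if (pool < 0 || pool >= pools_created)
        return NULL;
    void *p = malloc(size);
    if (p != NULL)
        pool_bytes[pool] += size;
    return p;
}

/* Hypothetical helper: total bytes allocated from `pool` so far. */
size_t pool_footprint(int pool)
{
    return (pool >= 0 && pool < pools_created) ? pool_bytes[pool] : 0;
}
```

The calling pattern matches the slide: one `pool_create()` per group of semantically similar data, then every allocation for that group names the pool it belongs to.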

slide-14
SLIDE 14

Minor changes to programs

Suite         Application               Pools  LOC
PBBS          Delaunay triangulation    3      11
PBBS          Maximal matching          3      13
PBBS          Delaunay refinement       3      8
PBBS          Maximal independent set   3      13
PBBS          Minimal spanning forest   3      11
SPECCPU 2006  401.bzip2                 4      43
SPECCPU 2006  470.lbm                   2      21
SPECCPU 2006  429.mcf                   2      14
SPECCPU 2006  436.cactusADM             2      53

slide-15
SLIDE 15

Whirlpool on NUCA placement

• Use pools to improve Jigsaw’s decisions
  • Each pool is allocated to a virtual cache
  • Jigsaw transparently places pools in NUCA banks
• Whirlpool requires no changes to core Jigsaw
  • Increase size of structures (few KBs)
  • Minor improvements, e.g., bypassing (see paper)
• Pools are useful elsewhere, e.g., for dynamic prefetching

slide-16
SLIDE 16

Significant improvements on some apps

[Figures: speedup vs. Jigsaw (%) and energy savings vs. Jigsaw (%) for bzip2, refine, MST, lbm, mcf, cactus, matching, DT, MIS]

• Performance: up to 38% better
• Energy: up to 53% lower

slide-17
SLIDE 17

Agenda

• Case study
• Manual classification
• Parallel applications
• WhirlTool

slide-18
SLIDE 18

Conventional runtimes can harm locality


Optimize load balance, not locality

slide-19
SLIDE 19

Whirlpool co-locates tasks and data

• Break input into pools
• Application indicates task affinity
• Schedule and steal tasks near their data
• Dynamically adapt data placement
• Requires minimal changes to task-parallel runtimes

[Figure label: Input]
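One way to picture stealing tasks from near their data is a victim-selection rule that prefers the closest busy core. The sketch below assumes cores laid out on a square mesh and numbered row by row; the mesh layout, both function names, and the per-core queue-length array are illustrative assumptions, not the runtime described in the talk.

```c
#include <stdlib.h>

/* Manhattan hop count between tiles `a` and `b` on a mesh that is
 * `mesh_width` tiles wide, with tiles numbered row by row. */
int mesh_hops(int a, int b, int mesh_width)
{
    int dx = abs(a % mesh_width - b % mesh_width);
    int dy = abs(a / mesh_width - b / mesh_width);
    return dx + dy;
}

/* Pick the closest core (other than `thief`) whose task queue is
 * non-empty. queue_len[c] is the number of tasks queued at core c.
 * Returns the victim's core id, or -1 if every other queue is empty.
 * Ties go to the lowest-numbered core. */
int pick_victim(int thief, const int *queue_len, int num_cores,
                int mesh_width)
{
    int victim = -1;
    int best_hops = 0;
    for (int c = 0; c < num_cores; c++) {
        if (c == thief || queue_len[c] == 0)
            continue;
        int hops = mesh_hops(thief, c, mesh_width);
        if (victim < 0 || hops < best_hops) {
            victim = c;
            best_hops = hops;
        }
    }
    return victim;
}
```

If tasks are first enqueued at the core nearest their pool, a nearest-first steal order keeps stolen work close to its data, whereas a conventional random victim choice balances load without regard to distance.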

slide-20
SLIDE 20

Whirlpool improves locality


slide-21
SLIDE 21

Whirlpool adapts schedule dynamically

• Data placement implicitly schedules tasks

slide-22
SLIDE 22

Significant improvements at 16 cores

[Figures: speedup vs. Jigsaw (%) and energy savings vs. Jigsaw for MS, FFT, TC, DT, PR, CC]

• Up to 67% better performance
• Up to 2.6x lower energy

Applications

• Divide-and-conquer algorithms: Mergesort (MS), FFT
• Graph analytics: PageRank (PR), Triangle Counting (TC), Connected Components (CC)
• Graphics: Delaunay Triangulation (DT)

Caveat: Splitting data into pools can be expensive!

slide-23
SLIDE 23

Agenda

• Case study
• Manual classification
• Parallel applications
• WhirlTool

slide-24
SLIDE 24

WhirlTool – Automated classification

• Modifying program code is not always practical
• A profile-guided tool can automatically classify data into pools

[Diagram: WhirlTool Profiler → per-callpoint miss curves → WhirlTool Analyzer → callpoint-to-pool map → WhirlTool runtime. The application calls malloc(); the WhirlTool runtime calls pool_malloc() in the Whirlpool allocator.]
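The redirection from malloc() to pool_malloc() hinges on looking up the allocating callpoint in the callpoint-to-pool map. A minimal sketch of that lookup follows, assuming the map is an array sorted by callpoint address; the `callpoint_entry` type, the function name, and the default-pool fallback are hypothetical, and how the runtime obtains the callpoint is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of a callpoint-to-pool map (hypothetical layout). */
typedef struct {
    uintptr_t callpoint;   /* address identifying the allocation site */
    int       pool;        /* pool assigned to that site */
} callpoint_entry;

/* Binary-search a map sorted by ascending callpoint. Returns the
 * pool for `callpoint`, or `default_pool` if the site was not seen
 * during profiling. */
int pool_for_callpoint(const callpoint_entry *map, size_t n,
                       uintptr_t callpoint, int default_pool)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].callpoint == callpoint)
            return map[mid].pool;
        if (map[mid].callpoint < callpoint)
            lo = mid + 1;
        else
            hi = mid;
    }
    return default_pool;
}
```

A wrapper around malloc() could then pass the looked-up pool id to pool_malloc(), so the application source stays unchanged.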

slide-25
SLIDE 25

WhirlTool profiles miss curves

• Periodically records per-callpoint miss curves
• Groups allocations by callpoint
• Profiles accesses to each pool

[Figure: application callpoints A, B, C, … with their allocations (Alloc) and accesses (Accs); miss curves (misses vs. cache size) recorded over time]
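A miss curve gives the number of misses as a function of cache size. One textbook way to compute it offline from an access trace is LRU stack-distance analysis, sketched below. This is a swapped-in technique for illustration, not necessarily how the WhirlTool profiler records its curves; the function name, `MAX_LINES`, and the fully associative LRU model are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LINES 1024   /* hypothetical cap on tracked cache lines */

/* Fill misses[c], for c = 0..max_size, with the number of misses a
 * fully associative LRU cache of c lines incurs on `trace` (a
 * sequence of n cache-line addresses). */
void lru_miss_curve(const unsigned long *trace, size_t n,
                    size_t max_size, size_t *misses)
{
    unsigned long stack[MAX_LINES];   /* stack[0] = most recently used */
    size_t depth = 0;

    assert(max_size < MAX_LINES);
    for (size_t c = 0; c <= max_size; c++)
        misses[c] = 0;

    for (size_t i = 0; i < n; i++) {
        /* Position in the LRU stack = number of distinct lines
         * touched since the last access to this line. */
        size_t pos = 0;
        while (pos < depth && stack[pos] != trace[i])
            pos++;

        /* A line at position pos hits only in caches larger than
         * pos lines; a first touch misses at every size. */
        int cold = (pos == depth);
        size_t last_missing = (cold || pos > max_size) ? max_size : pos;
        for (size_t c = 0; c <= last_missing; c++)
            misses[c]++;

        /* Move the line to the top, dropping the deepest entry if
         * the stack is already full. */
        if (cold) {
            if (depth < MAX_LINES)
                depth++;
            pos = depth - 1;
        }
        for (size_t k = pos; k > 0; k--)
            stack[k] = stack[k - 1];
        stack[0] = trace[i];
    }
}
```

Running this once per callpoint, on only the accesses to data allocated at that callpoint, would yield one curve per callpoint in the sense used on this slide.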

slide-26
SLIDE 26

WhirlTool analyzes curves to find pools

• Hardware can only support a limited number of pools
  • Jigsaw uses 3 virtual caches / thread
    • 0.6% area overhead over LLC
  • Whirlpool adds 4 pools (each mapped to a virtual cache)
    • 1.2% total area overhead over LLC
• Must cluster callpoints into semantically similar groups

Per-callpoint miss curves → agglomerative clustering → callpoint-to-pool mapping

slide-27
SLIDE 27

Example of agglomerative clustering

[Figure: clustering example; labels 1 1 1 2 2 3]
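Agglomerative clustering starts with every callpoint in its own cluster and repeatedly merges the closest pair until few enough clusters remain. The sketch below works from a precomputed pairwise distance matrix and uses complete linkage to compare clusters; both choices, along with the function names, are assumptions for illustration, and WhirlTool may instead recompute distances for merged clusters.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Complete-linkage distance between clusters `a` and `b`: the
 * largest pairwise distance between their members. dist is an
 * n-by-n matrix stored row-major. */
static double linkage(const double *dist, size_t n, const int *label,
                      int a, int b)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (label[i] == a && label[j] == b && dist[i * n + j] > worst)
                worst = dist[i * n + j];
    return worst;
}

/* Merge n callpoints bottom-up until at most max_pools clusters
 * remain. On return label[i] is the cluster of callpoint i, named
 * by the lowest-numbered callpoint it contains. Returns the number
 * of clusters left. */
size_t cluster_callpoints(const double *dist, size_t n,
                          size_t max_pools, int *label)
{
    assert(max_pools >= 1);
    for (size_t i = 0; i < n; i++)
        label[i] = (int)i;

    size_t clusters = n;
    while (clusters > max_pools) {
        int best_a = -1, best_b = -1;
        double best = DBL_MAX;
        for (size_t a = 0; a < n; a++) {
            if (label[a] != (int)a)
                continue;               /* a does not name a cluster */
            for (size_t b = a + 1; b < n; b++) {
                if (label[b] != (int)b)
                    continue;
                double d = linkage(dist, n, label, (int)a, (int)b);
                if (d < best) {
                    best = d;
                    best_a = (int)a;
                    best_b = (int)b;
                }
            }
        }
        assert(best_a >= 0);
        for (size_t i = 0; i < n; i++)   /* fold cluster b into a */
            if (label[i] == best_b)
                label[i] = best_a;
        clusters--;
    }
    return clusters;
}
```

Calling it with `max_pools` set to the number of pools the hardware supports produces the callpoint-to-pool mapping: every callpoint with the same label shares a pool.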

slide-28
SLIDE 28

WhirlTool’s distance metric

[Figure: misses vs. cache size for Pool 1, Pool 2, and Pool 3, with separated and combined curves; one case shows a small distance, the other a large distance]

How many misses are saved by separating pools?
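The question of how many misses are saved by separating pools can be turned into a number. The sketch below assumes each miss curve is an array indexed by cache size, that the combined curve of the two pools is supplied by the caller (measured or modeled elsewhere), and that savings are summed over all sizes; the function names and that particular aggregation are assumptions, not the exact formula from the paper.

```c
#include <stddef.h>

/* Fewest total misses when `size` units of cache are split between
 * two pools. miss_a[s] and miss_b[s] are each pool's misses with s
 * units; both arrays need at least size + 1 entries. */
double partitioned_misses(const double *miss_a, const double *miss_b,
                          size_t size)
{
    double best = miss_a[0] + miss_b[size];
    for (size_t s = 1; s <= size; s++) {
        double m = miss_a[s] + miss_b[size - s];
        if (m < best)
            best = m;
    }
    return best;
}

/* Distance between two pools: misses saved by keeping them separate
 * rather than combined, summed over cache sizes 0..max_size. Sizes
 * where separating does not help contribute nothing. */
double pool_distance(const double *miss_a, const double *miss_b,
                     const double *miss_combined, size_t max_size)
{
    double saved = 0.0;
    for (size_t size = 0; size <= max_size; size++) {
        double gain = miss_combined[size]
                    - partitioned_misses(miss_a, miss_b, size);
        if (gain > 0.0)
            saved += gain;
    }
    return saved;
}
```

Two pools with similar behavior lose little when merged, so their distance is small; a cache-friendly pool merged with a streaming pool loses a lot, so their distance is large and clustering keeps them apart.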

slide-29
SLIDE 29

WhirlTool matches manual hints

[Figure: speedup vs. Jigsaw (%), WhirlTool and Manual classification, for leslie, gcc, gems, bzip2, omnet, ray, refine, sphinx3, MST, lbm, setCover, soplex, xalanc, mcf, SA, cactus, matching, DT, MIS; off-scale bars are labeled 38]

slide-30
SLIDE 30

Multiprogram mixes

• 4-core system with random SPECCPU2006 apps
  • Including those that do not benefit
• Whirlpool improves performance by (gmean over 20 mixes):
  • 35% over S-NUCA
  • 30% over idealized shared-private D-NUCA [Herrero, ISCA’10]
  • 26% over R-NUCA [Hardavellas, ISCA’09]
  • 18% over page placement by Awasthi et al. [Awasthi, HPCA’09]
  • 5% over Jigsaw [Beckmann, PACT’13]

slide-31
SLIDE 31

Conclusion

• Semantic information from applications improves performance of dynamic policies
• Coordinated data and task placement gives large improvements in parallel applications
• Automated classification reduces programmer burden

slide-32
SLIDE 32

THANKS FOR YOUR ATTENTION! QUESTIONS ARE WELCOME!


WhirlTool code available at http://bit.ly/WhirlTool