
Whirlpool: Improving Dynamic Cache Management with Static Data Classification
Anurag Mukkara, Nathan Beckmann, Daniel Sanchez
MIT CSAIL
ASPLOS XXI, Atlanta, Georgia, 4 April 2016

Processors are limited by data movement
- Data movement often consumes >50% of time and energy
  - E.g., FP multiply-add: 20 pJ; DRAM access: 20,000 pJ
- To scale performance, must keep data near where it is used
- But how do programs use memory?
[Figure: cache banks around a core. Good: nearby cache banks. Bad: faraway cache banks. Terrible: DRAM access.]

Static policies have limitations
[Diagram: program code -> static analysis or profiling -> fixed policy -> binary]
- Static analysis or profiling exploits program semantics
- A fixed policy can’t adapt to application phases, input-dependent behavior, or shared systems
- E.g., scratchpads, bypass hints

Dynamic policies have limitations, too
[Diagram: binary -> observe loads/stores -> dynamic policy]
- Responsive to actual application behavior
- Difficult to recover program semantics from loads/stores
- Expensive mechanisms (e.g., extra data movement and directories)
- E.g., data migration and replication

Combining static and dynamic is best
[Diagram: program code -> static analysis or profiling -> binary with pools A, B, C, D; observed loads/stores drive a separate policy for each pool]
- Exploits program semantics at low overhead
- Responsive to actual application behavior

Agenda
- Case study
- Manual classification
- Parallel applications
- WhirlTool

System configuration
[Figure: tiled chip; labeled components are the core, L1i, L1d, and private L2]
- Non-uniform cache access (NUCA): cache banks have different access latencies

Baseline dynamic NUCA scheme
- We apply Whirlpool to Jigsaw [Beckmann, PACT’13], a state-of-the-art NUCA cache
- Allocates virtual caches, collections of parts of cache banks
- Significantly outperforms prior D-NUCA schemes
  - Reduces cache misses
  - Reduces on-chip network traversals
  - Simple mechanisms

Dynamic policies can reduce data movement
App: Delaunay triangulation
[Figure: Static NUCA vs. Jigsaw [Beckmann, PACT’13]]
- Dynamic policy performs somewhat better: 4% better performance, 12% lower energy

Static analysis can help!
[Figure: access intensity (accesses vs. footprint in MB) of the points, vertices, and triangles data structures]

Jigsaw with static classification
- A few data structures are accessed more frequently than others
[Figure: access intensity of points, vertices, and triangles; Jigsaw vs. Whirlpool]
- Whirlpool vs. Jigsaw [Beckmann, PACT’13]: 19% better performance, 42% lower energy

Whirlpool: manual classification
- Organize application data into memory pools (here: points, triangles)

    int poolPoints = pool_create();
    Point* points = pool_malloc(sizeof(Point)*n, poolPoints);

    int poolTris = pool_create();
    Tri* smallTris = pool_malloc(sizeof(Tri)*m, poolTris);
    Tri* largeTris = pool_malloc(sizeof(Tri)*M, poolTris);

- Insight: group semantically similar data into a pool
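To make the shape of the interface concrete, here is a minimal sketch of what `pool_create` and `pool_malloc` could look like as a plain C library. It is an assumption-laden toy, not the Whirlpool allocator: it forwards to `malloc` and only tracks how many bytes each pool has requested. `MAX_POOLS` and `pool_footprint` are hypothetical names introduced for illustration, and the sketch does not model how a real allocator would keep pools apart in memory so that hardware can map each pool to a virtual cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_POOLS 8               /* hypothetical limit, for illustration only */

static size_t pool_bytes[MAX_POOLS];  /* bytes requested from each pool */
static int num_pools = 0;

/* Creates a new pool and returns its id, or -1 if no pool ids are left. */
int pool_create(void) {
    if (num_pools == MAX_POOLS) return -1;
    pool_bytes[num_pools] = 0;
    return num_pools++;
}

/* Allocates size bytes and attributes them to the given pool.
   This toy version simply forwards to malloc(). */
void *pool_malloc(size_t size, int pool) {
    assert(pool >= 0 && pool < num_pools);
    void *p = malloc(size);
    if (p != NULL) pool_bytes[pool] += size;
    return p;
}

/* Hypothetical helper: total bytes requested from a pool so far. */
size_t pool_footprint(int pool) {
    assert(pool >= 0 && pool < num_pools);
    return pool_bytes[pool];
}
```

Memory returned by this sketch is released with `free()`; the per-pool byte count is only bookkeeping and is not decremented.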

Minor changes to programs

Suite          Application               Pools  LOC
PBBS           Delaunay triangulation    3      11
PBBS           Maximal matching          3      13
PBBS           Delaunay refinement       3      8
PBBS           Maximal independent set   3      13
PBBS           Minimal spanning forest   3      11
SPEC CPU2006   401.bzip2                 4      43
SPEC CPU2006   470.lbm                   2      21
SPEC CPU2006   429.mcf                   2      14
SPEC CPU2006   436.cactusADM             2      53

Whirlpool on NUCA placement
- Use pools to improve Jigsaw’s decisions
  - Each pool is allocated to a virtual cache
  - Jigsaw transparently places pools in NUCA banks
- Whirlpool requires no changes to core Jigsaw
  - Increase size of structures (few KBs)
  - Minor improvements, e.g., bypassing (see paper)
- Pools are useful elsewhere, e.g., for dynamic prefetching

Significant improvements on some apps
[Figure: speedup vs. Jigsaw (%) and energy savings vs. Jigsaw (%) for bzip2, refine, MST, lbm, mcf, cactus, matching, DT, and MIS]
- Up to 38% better performance
- Up to 53% lower energy

Conventional runtimes can harm locality
- Optimize load balance, not locality

Whirlpool co-locates tasks and data
- Break input into pools
- Application indicates task affinity
- Schedule and steal tasks near their data
- Dynamically adapt data placement
- Requires minimal changes to task-parallel runtimes
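One way to picture stealing tasks near their data is a victim-selection rule based on distance. The sketch below is an illustration under stated assumptions, not the runtime described in the talk: it assumes a 2D mesh of tiles, one task queue per pool, and Manhattan hop count as the distance. `struct task_queue`, `hops`, and `pick_victim` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical description of one task queue: the mesh tile that holds
   the data of the queue's pool, and how many tasks are waiting. */
struct task_queue {
    int tile_x, tile_y;
    int pending;
};

/* Manhattan distance between two tiles (assumed cost model). */
static int hops(int x0, int y0, int x1, int y1) {
    return abs(x0 - x1) + abs(y0 - y1);
}

/* Returns the index of the non-empty queue whose data is closest to the
   idle core at (core_x, core_y), or -1 if every queue is empty. */
int pick_victim(const struct task_queue *queues, int n, int core_x, int core_y) {
    assert(queues != NULL && n >= 0);
    int best = -1;
    int best_hops = 0;
    for (int i = 0; i < n; i++) {
        if (queues[i].pending == 0) continue;   /* nothing to steal here */
        int h = hops(core_x, core_y, queues[i].tile_x, queues[i].tile_y);
        if (best < 0 || h < best_hops) {
            best = i;
            best_hops = h;
        }
    }
    return best;
}
```

A conventional work-stealing runtime would typically pick a random victim instead, which balances load but ignores where the stolen task's data lives.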

Whirlpool improves locality

Whirlpool adapts schedule dynamically
- Data placement implicitly schedules tasks

Significant improvements at 16 cores
Applications:
- Divide-and-conquer algorithms: Mergesort, FFT
- Graph analytics: PageRank, Triangle Counting, Connected Components
- Graphics: Delaunay Triangulation
Caveat: splitting data into pools can be expensive!
[Figure: speedup vs. Jigsaw (%) and energy savings vs. Jigsaw for MS, FFT, TC, DT, PR, and CC]
- Up to 67% better performance
- Up to 2.6x lower energy

WhirlTool: automated classification
- Modifying program code is not always practical
- A profile-guided tool can automatically classify data into pools
[Diagram: application malloc() calls -> WhirlTool profiler -> per-callpoint miss curves -> WhirlTool analyzer -> callpoint-to-pool map -> WhirlTool runtime -> pool_malloc() in the Whirlpool allocator]

WhirlTool profiles miss curves
- Groups allocations by callpoint
- Profiles accesses to each pool
- Periodically records per-callpoint miss curves
[Figure: application allocations and accesses for callpoints A, B, C; miss curves (misses vs. cache size) recorded over time]

WhirlTool analyzes curves to find pools
- Hardware can only support a limited number of pools
  - Jigsaw uses 3 virtual caches / thread: 0.6% area overhead over LLC
  - Whirlpool adds 4 pools (each mapped to a virtual cache): 1.2% total area overhead over LLC
- Must cluster callpoints into semantically similar groups
[Diagram: per-callpoint miss curves -> agglomerative clustering -> callpoint-to-pool mapping]

Example of agglomerative clustering
[Figure: clustering example]
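For readers unfamiliar with agglomerative clustering, the sketch below shows the generic bottom-up procedure: start with every item in its own cluster and repeatedly merge the closest pair until only k clusters remain. It is a textbook variant under stated assumptions, not WhirlTool's code: distances come from a fixed, precomputed matrix and clusters are compared by their closest members (single linkage), whereas the slides do not say how WhirlTool compares merged clusters. `agglomerate` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Clusters n items down to k groups.
   dist is an n*n row-major matrix of pairwise distances (assumed symmetric).
   On return, label[i] == label[j] exactly when items i and j share a cluster. */
void agglomerate(const double *dist, size_t n, size_t k, size_t *label) {
    assert(dist != NULL && label != NULL && k >= 1 && k <= n);

    for (size_t i = 0; i < n; i++) label[i] = i;   /* every item starts alone */

    for (size_t clusters = n; clusters > k; clusters--) {
        size_t best_i = 0, best_j = 0;
        double best = 0.0;
        int found = 0;

        /* Find the closest pair of items that are in different clusters. */
        for (size_t i = 0; i < n; i++) {
            for (size_t j = i + 1; j < n; j++) {
                if (label[i] == label[j]) continue;
                double d = dist[i * n + j];
                if (!found || d < best) {
                    best = d;
                    best_i = i;
                    best_j = j;
                    found = 1;
                }
            }
        }

        /* Merge the two clusters by relabeling one of them. */
        size_t from = label[best_j];
        size_t to = label[best_i];
        for (size_t i = 0; i < n; i++) {
            if (label[i] == from) label[i] = to;
        }
    }
}
```

In this setting the items would be callpoints and k the number of pools the hardware supports.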

WhirlTool’s distance metric
- How many misses are saved by separating pools?
[Figure: miss curves (misses vs. cache size) for Pools 1, 2, and 3, with one small and one large distance marked; combined vs. separated miss curves]
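The question on this slide can be turned into a number in several ways. The sketch below is one plausible reading, under stated assumptions rather than WhirlTool's actual formula: each miss curve is an array indexed by cache size in equal units, the combined curve (both pools sharing one cache) is supplied by the caller, the separated curve is the best static split of the cache between the two pools, and the distance is the total gap between the two curves. `separated_misses` and `pool_distance` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Best-case misses when two pools split a cache of `size` units:
   the minimum over all splits of m1[s1] + m2[size - s1]. */
static double separated_misses(const double *m1, const double *m2, size_t size) {
    double best = m1[0] + m2[size];
    for (size_t s1 = 1; s1 <= size; s1++) {
        double total = m1[s1] + m2[size - s1];
        if (total < best) best = total;
    }
    return best;
}

/* Misses saved by separating two pools, summed over cache sizes 0..n-1.
   m1, m2:   per-pool miss curves (n entries each)
   combined: miss curve when both pools share one cache (n entries),
             assumed to be measured or modeled elsewhere */
double pool_distance(const double *m1, const double *m2,
                     const double *combined, size_t n) {
    assert(m1 != NULL && m2 != NULL && combined != NULL);
    double saved = 0.0;
    for (size_t size = 0; size < n; size++) {
        double diff = combined[size] - separated_misses(m1, m2, size);
        if (diff > 0.0) saved += diff;   /* count only sizes where separating helps */
    }
    return saved;
}
```

Under this reading, two pools with similar behavior gain little from separation and end up at a small distance, while a cache-friendly pool and a streaming pool end up far apart.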

WhirlTool matches manual hints
[Figure: speedup vs. Jigsaw (%) with manual pools and with WhirlTool for leslie, gcc, gems, bzip2, omnet, ray, refine, sphinx3, MST, lbm, setCover, soplex, xalanc, mcf, SA, cactus, matching, DT, and MIS; axis from 0 to 14%, with off-scale bars labeled 38]

Multiprogram mixes
- 4-core system with random SPEC CPU2006 apps
  - Including those that do not benefit
- Whirlpool improves performance (gmean over 20 mixes) by:
  - 35% over S-NUCA
  - 30% over idealized shared-private D-NUCA [Herrero, ISCA’10]
  - 26% over R-NUCA [Hardavellas, ISCA’09]
  - 18% over page placement by Awasthi et al. [Awasthi, HPCA’09]
  - 5% over Jigsaw [Beckmann, PACT’13]

Conclusion
- Semantic information from applications improves performance of dynamic policies
- Coordinated data and task placement gives large improvements in parallel applications
- Automated classification reduces programmer burden

Thanks for your attention! Questions are welcome!
WhirlTool code available at http://bit.ly/WhirlTool
