
Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs? - PowerPoint PPT Presentation



  1. Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs? Eddy Zheng Zhang, Yunlian Jiang, Xipeng Shen (presenter). Computer Science Department, The College of William and Mary, VA, USA

  2. Cache Sharing • A common feature on modern CMP

  3. Cache Sharing on CMP • A double-edged sword • Reduces communication latency • But causes conflicts & contention

  4. Cache Sharing on CMP: Non-Uniformity • A double-edged sword • Reduces communication latency • But causes conflicts & contention

  5. Many Efforts for Exploitation • Example: shared-cache-aware scheduling • Assigning suitable programs/threads to the same chip • Independent jobs • Job Co-Scheduling [Snavely+:00, Snavely+:02, El-Moursy+:06, Fedorova+:07, Jiang+:08, Zhou+:09] • Parallel threads of server applications • Thread Clustering [Tam+:07]

  6. Overview of this Work (1/3) • A surprising finding • Insignificant effects from shared cache on a recent multithreaded benchmark suite (PARSEC) • Drawn from a systematic measurement • thousands of runs • 7 dimensions on the levels of programs, OS, & architecture • derived from timing results • confirmed by hardware performance counters

  7. Overview of this Work (2/3) • A detailed analysis • Reason • three mismatches between executables and CMP cache architecture • Cause • current development and compilation practices are oblivious to cache sharing

  8. Overview of this Work (3/3) • An exploration of the implications • Exploiting cache sharing deserves not less but more attention • But to exert its power, cache-sharing-aware transformations are critical • Cuts cache misses by half • Improves performance by 36%

  9. Outline • Experiment design • Measurement and findings • Cache-sharing-aware transformation • Related work, summary, and conclusions

  10. Benchmarks (1/3) • PARSEC suite by Princeton Univ. [Bienia+:08]: "focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors"

  11. Benchmarks (2/3) • Composed of • RMS applications • Systems applications • … • A wide spectrum of • working sets, locality, data sharing, synchronization, off-chip traffic, etc.

  12. Benchmarks (3/3)
     Program       | Description             | Parallelism  | Working Set
     Blackscholes  | Black-Scholes equation  | data         | 2MB
     Bodytrack     | body tracking           | data         | 8MB
     Canneal       | simulated annealing     | unstructured | 256MB
     Facesim       | face simulation         | data         | 256MB
     Fluidanimate  | fluid dynamics          | data         | 64MB
     Streamcluster | online clustering       | data         | 16MB
     Swaptions     | portfolio pricing       | data         | 0.5MB
     X264          | video encoding          | pipeline     | 16MB
     Dedup         | stream compression      | pipeline     | 256MB
     Ferret        | image search            | pipeline     | 64MB

  13. Factors Covered in Measurements
     Level         | Dimension       | # Variations | Description
     Program level | benchmarks      | 10           | from PARSEC
     Program level | parallelism     | 3            | data, pipeline, unstructured
     Program level | inputs          | 4            | simsmall, simmedium, simlarge, native
     Program level | # of threads    | 4            | 1, 2, 4, 8
     OS level      | assignment      | 3            | thread assignment to cores
     OS level      | binding         | 2            | yes, no
     OS level      | subset of cores | 7            | the cores a program uses
     Arch. level   | platforms       | 2            | Intel Xeon & AMD Opteron
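The binding and assignment dimensions require pinning each thread to a chosen core. The slides do not show how this was done; below is a minimal Linux sketch using glibc's pthread_setaffinity_np, with hypothetical core ids standing in for one "sharing" and one "non-sharing" placement.

```c
/* Minimal sketch of pinning a thread to one core on Linux.
 * The core ids below are hypothetical: the mapping of core ids to
 * shared caches must be read from the machine itself. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void pin_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);           /* allow only this core */
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        fprintf(stderr, "failed to pin thread to core %d\n", core_id);
}

/* Example placements for two threads on a chip where cores 0 and 1
 * share an L2 (sharing case) and cores 0 and 2 do not (non-sharing). */
static const int sharing_cores[2]     = {0, 1};
static const int non_sharing_cores[2] = {0, 2};

int main(void)
{
    /* Demo: realize the "sharing" placement for the calling thread;
     * a real harness would pin each worker thread similarly. */
    pin_to_core(sharing_cores[0]);
    (void)non_sharing_cores;          /* the alternative placement */
    return 0;
}
```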

  14. Machines
     Intel Xeon 5310: 8 cores, 32KB L1 per core, each pair of cores shares a 4MB L2 (four L2 caches in total), 8GB DRAM.
     AMD Opteron 2352: 8 cores on two quad-core chips; per core, 64KB L1 and a private 512KB L2; each chip shares a 2MB L3 and is attached to 4GB DRAM.
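Setting up such placements requires knowing which core ids share a cache on a given machine. Here is a minimal Linux-specific sketch (not from the paper) that reads this from sysfs; index2 is assumed to be the L2 entry and should be verified against the accompanying "level" file on the target system.

```c
/* Minimal sketch (Linux-specific): list the cores that share the
 * cache described by cpu0's cache/index2 entry. */
#include <stdio.h>

int main(void)
{
    char buf[256];
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list", "r");
    if (!f) { perror("fopen"); return 1; }
    if (fgets(buf, sizeof buf, f))
        printf("cores sharing cpu0's index2 cache: %s", buf);
    fclose(f);
    return 0;
}
```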

  15. Measurement Schemes • Running times • Built-in hooks in PARSEC • Hardware performance counters • PAPI • cache misses, memory bus transactions, shared-data accesses
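As a rough illustration of the counter-based scheme, the sketch below reads L2 access and miss counts around a region of interest using PAPI preset events (PAPI_L2_TCA, PAPI_L2_TCM). These are standard PAPI presets but not necessarily the exact events the authors used, and their availability depends on the processor.

```c
/* Minimal sketch: count L2 accesses and misses around a code region
 * with PAPI preset events. Check availability with papi_avail first. */
#include <papi.h>
#include <stdio.h>

int main(void)
{
    int events[2] = { PAPI_L2_TCA, PAPI_L2_TCM };   /* accesses, misses */
    long long counts[2];
    int evset = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }
    PAPI_create_eventset(&evset);
    if (PAPI_add_events(evset, events, 2) != PAPI_OK) {
        fprintf(stderr, "L2 presets not available on this CPU\n");
        return 1;
    }

    PAPI_start(evset);
    /* ... region of interest (e.g., the parallel phase of a benchmark) ... */
    PAPI_stop(evset, counts);

    printf("L2 accesses: %lld, L2 misses: %lld\n", counts[0], counts[1]);
    return 0;
}
```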

  16. Outline • Experiment design • Measurement and findings • Cache-sharing-aware transformation • Related work, summary, and conclusions

  17. Observation I: Sharing vs. Non-sharing

  18. Sharing vs. Non-sharing [Diagram: threads T1 and T2 placed on cores that share one cache vs. on cores with separate caches]

  19. Sharing vs. Non-sharing [Diagram: threads T1–T4 placed pairwise on shared caches vs. spread across separate caches]

  20. Sharing vs. Non-sharing • Performance Evaluation (Intel) [Bar chart: normalized performance of the sharing vs. non-sharing placements, 2 and 4 threads with simlarge and native inputs, for blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, and x264]

  21. Sharing vs. Non-sharing • Performance Evaluation (AMD) [Bar chart: normalized performance of the sharing vs. non-sharing placements, 2 and 4 threads with simlarge and native inputs, for blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, and x264]

  22. Sharing vs. Non-sharing • L2-cache accesses & misses (Intel)

  23. Reasons (1/2) • 1) Small amount of inter-thread data sharing [Bar chart: sharing ratio of reads (%) on Intel, 0–7%, for blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, x264]

  24. Reasons (2/2) • 2) Large working sets (see the benchmark table on slide 12: working sets range from 0.5MB to 256MB, and most exceed the shared cache capacity — 4MB L2 on the Intel machine, 2MB L3 on the AMD)

  25. Observation II: Different Sharing Cases • Threads may differ • Different data to be processed or tasks to be conducted • Non-uniform communication and data sharing • Different thread placements may give different performance in the sharing case

  26. Different Sharing Cases [Diagram: three different pairings of threads T1–T4 onto the two shared caches, e.g., (T1,T3)(T2,T4) vs. (T1,T2)(T3,T4)]

  27. Max. Perf. Diff (%) [Bar chart: maximum performance difference (%) across thread placements, 0–16%, 2 and 4 threads with simlarge and native inputs, for blackscholes, bodytrack, canneal, facesim, fluidanimate, streamcluster, swaptions, x264. The differences are statistically insignificant: there are large fluctuations across runs of the same configuration.]

  28. Two Possible Reasons • Similar interactions among threads • Differences are smoothed by phase shifts

  29. Temporal Traces of L2 Misses

  30. Temporal Traces of L2 Misses

  31. Two Possible Reasons • Similar interactions among threads • Differences are smoothed by phase shifts

  32. Pipeline Programs • Two such programs: ferret and dedup • Numerous concurrent stages • Interactions within and between stages • Large differences between different thread-core assignments • Mainly due to load imbalance rather than differences in cache sharing

  33. A Short Summary • Insignificant influence of cache sharing on performance • Large working sets • Little data sharing • Thread placement does not matter • Due to uniform relations among threads • These findings hold across inputs, numbers of threads, architectures, and phases

  34. Outline • Experiment design • Measurement and findings • Cache-sharing-aware transformation • Related work, summary, and conclusions

  35. Principle • Increase data sharing among sibling threads (threads on cores that share a cache) • Decrease data sharing otherwise • Match non-uniform threads to non-uniform cache sharing

  36. Example: streamcluster — original code
     thread 1:
       for i = 1 to N, step = 1
         ...
         for j = T1 to T2
           dist = foo(p[j], p[c[i]])
         end
         ...
       end
     thread 2:
       for i = 1 to N, step = 1
         ...
         for j = T2+1 to T3
           dist = foo(p[j], p[c[i]])
         end
         ...
       end

  37. Example: streamcluster — optimized code
     thread 1:
       for i = 1 to N, step = 2
         ...
         for j = T1 to T3
           dist = foo(p[j], p[c[i]])
         end
         ...
       end
     thread 2:
       for i = 1 to N, step = 2
         ...
         for j = T1 to T3
           dist = foo(p[j], p[c[i+1]])
         end
         ...
       end
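A C-style sketch of the transformation on slides 36 and 37, using hypothetical names (p, c, dist_to_center, lo/hi bounds) rather than streamcluster's actual data structures: in the original partitioning, sibling threads scan disjoint blocks of points for every center and compete for the shared cache; after the transformation, both siblings scan the same block of points but process interleaved centers, so the data one thread brings into the shared cache is reused by the other.

```c
/* Sketch of the cache-sharing-aware restructuring on slides 36-37.
 * Names (p, c, dist_to_center, lo, hi, N) are illustrative, not the
 * actual streamcluster code. Two sibling threads share one L2 cache. */

/* Hypothetical distance routine standing in for streamcluster's dist(). */
static float dist_to_center(const float *p, int j, int center)
{
    float d = p[j] - p[center];
    return d * d;
}

/* Original: each thread scans its own disjoint block of points [lo,hi)
 * for every center, so the two siblings pull disjoint data into the
 * cache they share and compete for its capacity. */
float original_thread(const float *p, const int *c, int N, int lo, int hi)
{
    float sum = 0.0f;
    for (int i = 0; i < N; i++)            /* every center */
        for (int j = lo; j < hi; j++)      /* this thread's points only */
            sum += dist_to_center(p, j, c[i]);
    return sum;
}

/* Transformed: both siblings scan the SAME block of points [lo,hi) but
 * handle interleaved centers (even vs. odd), so the points one thread
 * loads into the shared cache are reused by the other. */
float transformed_thread(int tid, const float *p, const int *c, int N,
                         int lo, int hi)
{
    float sum = 0.0f;
    for (int i = tid; i < N; i += 2)       /* tid 0: even centers, tid 1: odd */
        for (int j = lo; j < hi; j++)      /* the siblings' combined points */
            sum += dist_to_center(p, j, c[i]);
    return sum;
}
```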

  38. Performance Improvement (streamcluster) [Bar chart: normalized L2 cache misses and memory bus transactions of the transformed code, 4 and 8 threads]

  39. Other Programs [Bar chart: normalized L2 misses on Intel for the transformed blackscholes and bodytrack, 4 and 8 threads]

  40. Implication • To exert the potential of shared cache, program-level transformations are critical • Limited existing explorations: Sarkar & Tullsen '08, Kumar & Tullsen '02, Nikolopoulos '03 • A contrast to the large body of work in OS and architecture
