Compile-Time Detection of False Sharing via Loop Cost Modeling
Munara Tolubaeva, Yonghong Yan and Barbara Chapman High Performance Computing and Tools Group (HPCTools) Computer Science Department University of Houston
OUTLINE
- Introduction
Loop Unrolling (unroll factor = 3):

for (i = 0; i < N; i++) {
    A[i] = B[i] + C[i];
}

for (i = 0; i < N; i += 3) {
    A[i+0] = B[i+0] + C[i+0];
    A[i+1] = B[i+1] + C[i+1];
    A[i+2] = B[i+2] + C[i+2];
}
- Estimates the time needed to execute a specific section of code on a given architecture
- Considers performance-impacting architectural features (processor, cache, parallelism)
- Open64 cost models: the most sophisticated models among open-source compilers
Open64 cost models
- Processor model: computational resource cost, dependency latency cost, register spilling cost, operation cost, issue cost, mem_ref cost
- Cache model: cache cost, TLB cost
- Parallel model: loop overhead, parallel overhead, machine cost, cache cost, reduction cost
For loops
- Optimize single-CPU performance, without considering shared-memory parallel execution
- Limited use of the models for compiler optimizations and performance analysis
- All false sharing detection techniques are implemented at runtime
- Processors maintain data consistency via cache coherency protocols
- Data sharing is at cache line granularity
- A store to a single data element invalidates every other copy of the cache line
- A successive read then suffers a cache miss and must reload the entire cache line
Execution time (sec) for linear_regression:

Code version   Sequential   2 threads   4 threads   8 threads
Unoptimized    0.503        4.563       3.961       4.432
Optimized      0.503        0.263       0.137       0.078
(Wicaksono et al., "Detecting false sharing in OpenMP applications using the DARWIN framework", LCPC 2011)
- False Sharing Detection Methods
  - Cache simulation and memory tracing (Gunther and Weidendorfer, WBIA'09)
  - Hardware performance counters (Marathe et al., Tech. rep. '06; Wicaksono et al., LCPC'11)
  - Memory protection (Liu and Berger, OOPSLA'11)
  - Memory shadowing (Zhao et al., VEE'11)
- False Sharing Elimination Methods
  - Tuning scheduling parameters, i.e. chunk size and chunk stride (Chow and Sarkar, ICPP'97)
  - Compiler transformations such as array padding and memory alignment (Jeremiassen and Eggers, PPoPP'95)
- All FS detection methods are applied at runtime and incur overhead
- Estimates the performance impact of FS on OpenMP parallel loops at compile time
- Outputs the total number of FS cases that will occur in a loop
- Analyzes the performance impact of FS on a parallel loop
- Introduces a linear regression model to reduce the modeling cost
- Number of threads executing the loop
- Loop boundaries
- Step sizes
- Index variables
- Chunk size (if specified for the OpenMP loop)
- Obtain array references made in the innermost loop of a loop nest
- Generate a cache line ownership list for each thread
- Apply stack distance analysis to the cache state of each thread
- Detect false sharing
- Array base name
- Array indices
- Memory offsets for arrays with structured data types
Thread0 Iteration_1: cacheline_a_1_w, cacheline_b_1_r
Thread1 Iteration_1: cacheline_a_1_w, cacheline_b_2_r
Thread7 Iteration_1: cacheline_a_1_w, cacheline_b_8_r
- At compile time it is impossible to know the corresponding cache set of an access, so each thread's cache state is modeled as a fully associative cache with LRU replacement
- Cache lines from each new iteration are inserted at the top of the thread's cache state (e.g., Thread0: cacheline_a_1_w, cacheline_b_1_r, cacheline_a_2_w, cacheline_b_2_r)
- LRU cache lines are evicted if the stack overflows
(Stack distance analysis of memory accesses, SC 2010.)
- Perform a one-to-all comparison: do the cache states of the other threads contain my cache line?
FS_i = Σ_{j=1, j≠i}^{n} Σ_{k=1}^{n_iter} contains(cs_j, cl_{i,k})

where cl_{i,k} is the k-th cache line accessed by thread i, cs_j is the cache state of thread j, and n is the number of threads.
- Predict the total false sharing cases by evaluating a much smaller number of loop iterations and extrapolating with a linear model
[Figure: false sharing cases (y-axis, up to 200) vs. number of iterations (x-axis, 20 to 140); the relationship is linear.]
The regression model fits T(x) = a + b*x over n sampled points (x_i, y_i):

b = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²)
a = (Σ y_i − b Σ x_i) / n

The total false sharing count is then extrapolated as T(x_max), where x_max is the full iteration count.
- Compare the number of false sharing cases estimated by the full false sharing analysis with the number predicted by the regression model
[Equation relating the measured and modeled FS and non-FS (nfs) quantities used to compute the false sharing effect.]
- Four 2.2 GHz 12-core processors (48 cores in total)
- Dedicated L1 and L2 caches, 64KB and 512KB per core
- L3 cache of 10240KB shared among 12 cores
- Cache line size for all caches: 64 bytes
[Figure: false sharing effect (%) (5 to 30%) vs. number of threads (2 to 48); measured, predicted, and modeled values.]
[Figure: false sharing effect (%) (10 to 50%) vs. number of threads (2 to 48); measured, predicted, and modeled values.]
- Assist the compiler in optimizing code during high-level loop transformation phases:
  - Guide traditional loop transformations
  - Assist in generating efficient code