SLIDE 1

Optimization part 1

SLIDE 2

Changelog

Changes made in this version not seen in first lecture:

29 Feb 2018: loop unrolling performance: remove bogus instruction cache overhead remark
29 Feb 2018: spatial locality in Akj: correct reference Bk+1,j to Ak+1,j

SLIDE 3

last time

what things in C code map to same set?

key idea: addresses a multiple of the bytes per way (sets × block size) apart from each other map to the same set

finding conflict misses in C

how “overloaded” is each cache set

cache ‘blocking’ for matrix-like code

maximize work per cache miss

SLIDE 4

some logistics

exam next week: everything up to and including this lecture
yes, I know office hours were very slow… would like to think about how to help with that:
'group' office hours? better tools? different priorities on queue?

SLIDE 5

view as an explicit cache

imagine we explicitly moved things into cache

original loop:

for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i) {
        loadIntoCache(&A[i*N+k]);
        for (int j = 0; j < N; ++j) {
            loadIntoCache(&B[i*N+j]);
            loadIntoCache(&A[k*N+j]);
            B[i*N+j] += A[i*N+k] * A[k*N+j];
        }
    }

SLIDE 6

view as an explicit cache

imagine we explicitly moved things into cache
blocking in k:

for (int kk = 0; kk < N; kk += 2)
    for (int i = 0; i < N; ++i) {
        loadIntoCache(&A[i*N+kk]);
        loadIntoCache(&A[i*N+kk+1]);
        for (int j = 0; j < N; ++j) {
            loadIntoCache(&B[i*N+j]);
            loadIntoCache(&A[kk*N+j]);
            loadIntoCache(&A[(kk+1)*N+j]);
            for (int k = kk; k < kk + 2; ++k)
                B[i*N+j] += A[i*N+k] * A[k*N+j];
        }
    }

SLIDE 7

calculation counting with explicit cache

before: load ∼2 values for one add+multiply
after: load ∼3 values for two add+multiply

SLIDE 8

simple blocking: temporal locality in Bij

for (int k = 0; k < N; k += 2)
    for (int i = 0; i < N; i += 2)
        /* load a block around Aik */
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            Bi+0,j += Ai+0,k+0 * Ak+0,j
            Bi+0,j += Ai+0,k+1 * Ak+1,j
            Bi+1,j += Ai+1,k+0 * Ak+0,j
            Bi+1,j += Ai+1,k+1 * Ak+1,j
        }

before: Bijs accessed once, then not again for N^2 iterations
after: Bijs accessed twice, then not again for N^2 iterations (next k)

SLIDE 9

simple blocking: temporal locality in Akj

for (int k = 0; k < N; k += 2)
    for (int i = 0; i < N; i += 2)
        /* load a block around Aik */
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            Bi+0,j += Ai+0,k+0 * Ak+0,j
            Bi+0,j += Ai+0,k+1 * Ak+1,j
            Bi+1,j += Ai+1,k+0 * Ak+0,j
            Bi+1,j += Ai+1,k+1 * Ak+1,j
        }

before blocking: Akjs accessed once, then not again for N iterations
after blocking: Akjs accessed twice, then not again for N iterations (next i)

SLIDE 10

simple blocking: temporal locality in Aik

for (int k = 0; k < N; k += 2)
    for (int i = 0; i < N; i += 2)
        /* load a block around Aik */
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            Bi+0,j += Ai+0,k+0 * Ak+0,j
            Bi+0,j += Ai+0,k+1 * Ak+1,j
            Bi+1,j += Ai+1,k+0 * Ak+0,j
            Bi+1,j += Ai+1,k+1 * Ak+1,j
        }

before: Aiks accessed N times, then never again
after: Aiks accessed N times,
but other parts of Aik accessed in between: slightly less temporal locality

SLIDE 11

simple blocking: spatial locality in Bij

for (int k = 0; k < N; k += 2)
    for (int i = 0; i < N; i += 2)
        /* load a block around Aik */
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            Bi+0,j += Ai+0,k+0 * Ak+0,j
            Bi+0,j += Ai+0,k+1 * Ak+1,j
            Bi+1,j += Ai+1,k+0 * Ak+0,j
            Bi+1,j += Ai+1,k+1 * Ak+1,j
        }

before blocking: perfect spatial locality (Bi,j and Bi,j+1 adjacent)
after blocking: slightly less spatial locality
Bi,j and Bi+1,j far apart (N elements), but Bi,j+1 still accessed the iteration after Bi,j (adjacent)

SLIDE 12

simple blocking: spatial locality in Akj

for (int k = 0; k < N; k += 2)
    for (int i = 0; i < N; i += 2)
        /* load a block around Aik */
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            Bi+0,j += Ai+0,k+0 * Ak+0,j
            Bi+0,j += Ai+0,k+1 * Ak+1,j
            Bi+1,j += Ai+1,k+0 * Ak+0,j
            Bi+1,j += Ai+1,k+1 * Ak+1,j
        }

before: perfect spatial locality (Ak,j and Ak,j+1 adjacent)
after: slightly less spatial locality
Ak,j and Ak+1,j far apart (N elements), but Ak,j+1 still accessed the iteration after Ak,j (adjacent)

SLIDE 13

simple blocking: spatial locality in Aik

for (int k = 0; k < N; k += 2)
    for (int i = 0; i < N; i += 2)
        /* load a block around Aik */
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            Bi+0,j += Ai+0,k+0 * Ak+0,j
            Bi+0,j += Ai+0,k+1 * Ak+1,j
            Bi+1,j += Ai+1,k+0 * Ak+0,j
            Bi+1,j += Ai+1,k+1 * Ak+1,j
        }

before: very poor spatial locality (Ai,k and Ai+1,k far apart)
after: some spatial locality
Ai,k and Ai+1,k still far apart (N elements), but Ai,k now accessed together with Ai,k+1

SLIDE 14

generalizing cache blocking

for (int kk = 0; kk < N; kk += K) {
    for (int ii = 0; ii < N; ii += I) {
        /* with I by K block of A hopefully cached: */
        for (int jj = 0; jj < N; jj += J) {
            /* with K by J block of A, I by J block of B cached: */
            for i in ii to ii+I:
                for j in jj to jj+J:
                    for k in kk to kk+K:
                        B[i * N + j] += A[i * N + k] * A[k * N + j];

Bij used K times for one miss: N^2/K misses
Aik used J times for one miss: N^2/J misses
Akj used I times for one miss: N^2/I misses
catch: IK + KJ + IJ elements must fit in cache

SLIDE 18

cache blocking overview

reorder calculations: typically work in square-ish chunks of input
goal: maximum calculations per load into cache
typically: use every value several times after loading it

versus naive loop code:
some values loaded, then used once
some values loaded, then used all possible times

SLIDE 19

cache blocking and miss rate

[graph: read misses per multiply or add vs. N (100 to 500), blocked and unblocked]

SLIDE 20

what about performance?

[graph: cycles per multiply/add vs. N (100 to 500), less optimized loop, unblocked vs. blocked]
[graph: cycles per multiply/add vs. N (200 to 1000), optimized loop, unblocked vs. blocked]

SLIDE 21

performance for big sizes

[graph: cycles per multiply or add vs. N (2000 to 10000), unblocked vs. blocked, with a marker for where the matrix fits in the L3 cache]

SLIDE 22
optimized loop???

performance difference wasn't visible at small sizes until I optimized arithmetic in the loop (mostly by supplying better options to GCC):
1: reducing number of loads
2: doing adds/multiplies/etc. with fewer instructions
3: simplifying address computations
but… how can that make cache blocking better???

SLIDE 24
overlapping loads and arithmetic

[timeline diagram: a few loads overlapping with a longer sequence of multiplies and adds]
speed of load might not matter if these are slower

SLIDE 25
optimization and bottlenecks

arithmetic/loop efficiency was the bottleneck
after fixing this, cache performance was the bottleneck

common theme when optimizing:
X may not matter until Y is optimized

SLIDE 26

example assembly (unoptimized)

long sum(long *A, int N) {
    long result = 0;
    for (int i = 0; i < N; ++i)
        result += A[i];
    return result;
}

sum:
    ...
the_loop:
    ...
    leaq 0(,%rax,8), %rdx    // offset ← i * 8
    movq -24(%rbp), %rax     // get A from stack
    addq %rdx, %rax          // add offset
    movq (%rax), %rax        // get *(A+offset)
    addq %rax, -8(%rbp)      // add to sum, on stack
    addl $1, -12(%rbp)       // increment i
condition:
    movl -12(%rbp), %eax
    cmpl -28(%rbp), %eax
    jl the_loop
    ...

SLIDE 27

example assembly (gcc 5.4 -Os)

long sum(long *A, int N) {
    long result = 0;
    for (int i = 0; i < N; ++i)
        result += A[i];
    return result;
}

sum:
    xorl %edx, %edx
    xorl %eax, %eax
the_loop:
    cmpl %edx, %esi
    jle done
    addq (%rdi,%rdx,8), %rax
    incq %rdx
    jmp the_loop
done:
    ret

SLIDE 28

example assembly (gcc 5.4 -O2)

long sum(long *A, int N) {
    long result = 0;
    for (int i = 0; i < N; ++i)
        result += A[i];
    return result;
}

sum:
    testl %esi, %esi
    jle return_zero
    leal -1(%rsi), %eax
    leaq 8(%rdi,%rax,8), %rdx   // rdx = end of A
    xorl %eax, %eax
the_loop:
    addq (%rdi), %rax           // add to sum
    addq $8, %rdi               // advance pointer
    cmpq %rdx, %rdi
    jne the_loop
    rep ret
return_zero:
    ...

SLIDE 29
optimizing compilers

these usually make your code fast
often not done by default

compilers and humans are good at different kinds of optimizations

SLIDE 30

compiler limitations

needs to generate code that does the same thing…
…even in corner cases that “obviously don’t matter”

often doesn’t ‘look into’ a method
needs to assume it might do anything

can’t predict what inputs/values will be
e.g. lots of loop iterations or few?

can’t understand code size versus speed tradeoffs

SLIDE 32

aliasing

void twiddle(long *px, long *py) {
    *px += *py;
    *px += *py;
}

the compiler cannot generate this:

twiddle: // BROKEN
    // %rsi = px, %rdi = py
    movq (%rdi), %rax    // rax ← *py
    addq %rax, %rax      // rax ← 2 * *py
    addq %rax, (%rsi)    // *px ← 2 * *py
    ret

SLIDE 33

aliasing problem

void twiddle(long *px, long *py) {
    *px += *py;
    *px += *py;
    // NOT the same as *px += 2 * *py;
}
...
long x = 1;
twiddle(&x, &x);
// result should be 4, not 3

twiddle: // BROKEN
    // %rsi = px, %rdi = py
    movq (%rdi), %rax    // rax ← *py
    addq %rax, %rax      // rax ← 2 * *py
    addq %rax, (%rsi)    // *px ← 2 * *py
    ret

SLIDE 34

non-contrived aliasing

void sumRows1(int *result, int *matrix, int N) {
    for (int row = 0; row < N; ++row) {
        result[row] = 0;
        for (int col = 0; col < N; ++col)
            result[row] += matrix[row * N + col];
    }
}

void sumRows2(int *result, int *matrix, int N) {
    for (int row = 0; row < N; ++row) {
        int sum = 0;
        for (int col = 0; col < N; ++col)
            sum += matrix[row * N + col];
        result[row] = sum;
    }
}

SLIDE 36

aliasing and performance (1) / GCC 5.4 -O2

[graph: cycles/count vs. N (200 to 1000)]

SLIDE 37

aliasing and performance (2) / GCC 5.4 -O3

[graph: cycles/count vs. N (200 to 1000)]

SLIDE 38

automatic register reuse

Compiler would need to generate overlap check:

if (result > matrix + N * N || result < matrix) {
    for (int row = 0; row < N; ++row) {
        int sum = 0; /* kept in register */
        for (int col = 0; col < N; ++col)
            sum += matrix[row * N + col];
        result[row] = sum;
    }
} else {
    for (int row = 0; row < N; ++row) {
        result[row] = 0;
        for (int col = 0; col < N; ++col)
            result[row] += matrix[row * N + col];
    }
}

SLIDE 39

aliasing and cache optimizations

for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i*N+j] += A[i * N + k] * A[k * N + j];

for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            B[i*N+j] += A[i * N + k] * A[k * N + j];

B = A? B = &A[10]?

compiler can’t generate same code for both

SLIDE 40

aliasing problems with cache blocking

for (int k = 0; k < N; k++) {
    for (int i = 0; i < N; i += 2) {
        for (int j = 0; j < N; j += 2) {
            B[(i+0)*N + j+0] += A[i*N+k]     * A[k*N+j];
            B[(i+1)*N + j+0] += A[(i+1)*N+k] * A[k*N+j];
            B[(i+0)*N + j+1] += A[i*N+k]     * A[k*N+j+1];
            B[(i+1)*N + j+1] += A[(i+1)*N+k] * A[k*N+j+1];
        }
    }
}

can compiler keep A[i*N+k] in a register?

SLIDE 41

“register blocking”

for (int k = 0; k < N; ++k) {
    for (int i = 0; i < N; i += 2) {
        float Ai0k = A[(i+0)*N + k];
        float Ai1k = A[(i+1)*N + k];
        for (int j = 0; j < N; j += 2) {
            float Akj0 = A[k*N + j+0];
            float Akj1 = A[k*N + j+1];
            B[(i+0)*N + j+0] += Ai0k * Akj0;
            B[(i+1)*N + j+0] += Ai1k * Akj0;
            B[(i+0)*N + j+1] += Ai0k * Akj1;
            B[(i+1)*N + j+1] += Ai1k * Akj1;
        }
    }
}

SLIDE 42

avoiding redundant loads summary

move repeated load outside of loop
create variable: tells the compiler “not aliased”

SLIDE 43

aside: the restrict hint

C has a keyword ‘restrict’ for pointers: “I promise this pointer doesn’t alias another”
(if it does — undefined behavior)

maybe will help compiler do optimization itself?

void square(float * restrict B, float * restrict A) { ... }

SLIDE 44

compiler limitations

needs to generate code that does the same thing…
…even in corner cases that “obviously don’t matter”

often doesn’t ‘look into’ a method
needs to assume it might do anything

can’t predict what inputs/values will be
e.g. lots of loop iterations or few?

can’t understand code size versus speed tradeoffs

SLIDE 45

loop with a function call

int addWithLimit(int x, int y) {
    int total = x + y;
    if (total > 10000)
        return 10000;
    else
        return total;
}
...
int sum(int *array, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum = addWithLimit(sum, array[i]);
    return sum;
}

SLIDE 47

function call assembly

movl (%rbx), %esi   // mov array[i]
movl %eax, %edi     // mov sum
call addWithLimit

extra instructions executed: two moves, a call, and a ret

SLIDE 48

manual inlining

int sum(int *array, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum = sum + array[i];
        if (sum > 10000)
            sum = 10000;
    }
    return sum;
}

SLIDE 49

inlining pro/con

avoids call, ret, extra move instructions
allows compiler to use more registers
no caller-saved register problems

but not always faster: worse for instruction cache
(more copies of function body code)

SLIDE 50

compiler inlining

compilers will inline, but…
will usually avoid making code much bigger
heuristic: inline if function is small enough
heuristic: inline if called exactly once

will usually not inline across .o files
some compilers allow hints to say “please inline/do not inline this function”

SLIDE 51

remove redundant operations (1)

char number_of_As(const char *str) {
    int count = 0;
    for (int i = 0; i < strlen(str); ++i) {
        if (str[i] == 'a') count++;
    }
    return count;
}

SLIDE 52

remove redundant operations (1, fix)

int number_of_As(const char *str) {
    int count = 0;
    int length = strlen(str);
    for (int i = 0; i < length; ++i) {
        if (str[i] == 'a') count++;
    }
    return count;
}

call strlen once, not once per character! Big-Oh improvement!

SLIDE 54

remove redundant operations (2)

void shiftArray(int *source, int *dest, int N, int amount) {
    for (int i = 0; i < N; ++i) {
        if (i + amount < N)
            dest[i] = source[i + amount];
        else
            dest[i] = source[N - 1];
    }
}

compare i + amount to N many times

SLIDE 55

remove redundant operations (2, fix)

void shiftArray(int *source, int *dest, int N, int amount) {
    int i;
    for (i = 0; i + amount < N; ++i) {
        dest[i] = source[i + amount];
    }
    for (; i < N; ++i) {
        dest[i] = source[N - 1];
    }
}

eliminate comparisons

SLIDE 56

compiler limitations

needs to generate code that does the same thing…
…even in corner cases that “obviously don’t matter”

often doesn’t ‘look into’ a method
needs to assume it might do anything

can’t predict what inputs/values will be
e.g. lots of loop iterations or few?

can’t understand code size versus speed tradeoffs

SLIDE 57

loop unrolling (ASM)

loop:
    cmpl %edx, %esi
    jle endOfLoop
    addq (%rdi,%rdx,8), %rax
    incq %rdx
    jmp loop
endOfLoop:

loop:
    cmpl %edx, %esi
    jle endOfLoop
    addq (%rdi,%rdx,8), %rax
    addq 8(%rdi,%rdx,8), %rax
    addq $2, %rdx
    jmp loop
    // plus handle leftover?
endOfLoop:

SLIDE 59

loop unrolling (C)

for (int i = 0; i < N; ++i)
    sum += A[i];

int i;
for (i = 0; i + 1 < N; i += 2) {
    sum += A[i];
    sum += A[i+1];
}
// handle leftover, if needed
if (i < N)
    sum += A[i];

SLIDE 60

more loop unrolling (C)

int i;
for (i = 0; i + 4 <= N; i += 4) {
    sum += A[i];
    sum += A[i+1];
    sum += A[i+2];
    sum += A[i+3];
}
// handle leftover, if needed
for (; i < N; i += 1)
    sum += A[i];

SLIDE 61

automatic loop unrolling

loop unrolling is easy for compilers
…but often not done, or not done very much
why not?
slower if small number of iterations
larger code — could exceed instruction cache space

SLIDE 63

loop unrolling performance

on my laptop, with 992 elements (fits in L1 cache)

times unrolled | cycles/element | instructions/element
      1        |      1.33      |        4.02
      2        |      1.03      |        2.52
      4        |      1.02      |        1.77
      8        |      1.01      |        1.39
     16        |      1.01      |        1.21
     32        |      1.01      |        1.15

1.01 cycles/element — latency bound
