SLIDE 1 Cache Impact on Program Performance
- T. Yang. UCSB CS240A. 2017
SLIDE 2
Multi-level cache in computer systems
Topics
- Performance analysis for multi-level cache
- Cache performance optimization through program transformation
[Figure: memory hierarchy: processor (control, datapath), L1 cache, L2 cache, L3 cache, main memory, disk]
SLIDE 3
SLIDE 4
Cache misses and data access time
Memory and cache access time (CPU → L1 → L2 → L3 → memory):
- D0: total memory data accesses
- D1: accesses that miss in L1; local miss ratio of L1: m1 = D1/D0
- D2: accesses that miss in L2; local miss ratio of L2: m2 = D2/D1
- D3: accesses that miss in L3; local miss ratio of L3: m3 = D3/D2
- δi: access time at cache level i; δmem: access time in memory
- Average access time = total time / D0 = δ1 + m1 × penalty
SLIDE 5
Average memory access time (AMAT)
AMAT = δ1 + m1 × penalty
- δ1 ≈ 2 cycles: data found in L1
- penalty: incurred when data is found in L2, L3, or memory
SLIDE 6
Total data access time
Average time = δ1 + m1 [δ2 + m2 × Penalty]
- δ1 ≈ 2 cycles (data found in L1); the δ2 term covers data found in L2; Penalty covers data found in L3 or memory
SLIDE 7 Total data access time
Average time = δ1 + m1 [δ2 + m2 × Penalty]
- δ1 ≈ 2 cycles; here Penalty: data found in L3
SLIDE 8
Total data access time
Average time = δ1 + m1 [δ2 + m2 × δmem]
- No L3: an L2 miss goes to memory
- δ1 ≈ 2 cycles; δ2 ≈ 10 cycles; δmem ≈ 100–200 cycles
SLIDE 9
Total data access time
Average memory access time (AMAT) = δ1 + m1 [δ2 + m2 [δ3 + m3 × δmem]]
- The δ3 term covers data found in L3; the δmem term covers data found in memory
SLIDE 10 Local vs. Global Miss Rates
- Local miss rate: the fraction of references to one level of a cache that miss. For example, m2 = D2/D1. Note that the total number of L2 accesses equals the number of L1 misses.
- Global miss rate: the fraction of all references that miss in every level of a multilevel cache.
  § Global L2 miss rate = D2/D0
  § The L2 local miss rate is much larger than the global miss rate.
- Note: Global L2 miss rate = D2/D0 = (D1/D0) × (D2/D1) = m1 × m2
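The identity m1 × m2 = global L2 miss rate can be checked with hypothetical access counts (the D0, D1, D2 values in the usage below are made-up numbers, not from these slides):

```c
#include <assert.h>
#include <math.h>

/* Local miss rate: misses at a level divided by the accesses
   presented to that level. */
double local_miss_rate(long missed, long presented) {
    return (double)missed / (double)presented;
}

/* Global miss rate: misses at a level divided by ALL data accesses D0. */
double global_miss_rate(long missed, long total) {
    return (double)missed / (double)total;
}
```

For example, with D0 = 1000, D1 = 100, D2 = 20: m1 = 0.1, m2 = 0.2, and the global L2 miss rate is 20/1000 = 0.02 = m1 × m2.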
SLIDE 11
[Figure (from Fall 2013, Lecture #13): global miss rates measured for a machine with 32KB I$ + 32KB D$ L1 caches, a 256 KB L2 cache, and a 4 MB L3 cache]
SLIDE 12
Average memory access time with no L3 cache
AMAT = δ1 + m1 [δ2 + m2 × δmem]
     = δ1 + m1 δ2 + m1 m2 δmem
     = δ1 + m1 δ2 + GMiss2 × δmem
SLIDE 13
Average memory access time with L3 cache
AMAT = δ1 + m1 [δ2 + m2 [δ3 + m3 × δmem]]
     = δ1 + m1 δ2 + m1 m2 δ3 + m1 m2 m3 δmem
     = δ1 + m1 δ2 + GMiss2 × δ3 + GMiss3 × δmem
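As a sketch, the three-level formula can be coded directly. The cycle counts and miss ratios used in the test below are assumptions in the spirit of the ~2 / ~10 / ~100–200-cycle figures on earlier slides, not the slides' own example numbers:

```c
#include <assert.h>
#include <math.h>

/* AMAT = d1 + m1*(d2 + m2*(d3 + m3*dmem)), where di is the access
   time at cache level i, dmem the memory access time, and mi the
   local miss ratio at level i. */
double amat3(double d1, double d2, double d3, double dmem,
             double m1, double m2, double m3) {
    return d1 + m1 * (d2 + m2 * (d3 + m3 * dmem));
}
```

For instance, amat3(2, 10, 40, 200, 0.10, 0.25, 0.50) evaluates to 2 + 0.1 × (10 + 0.25 × (40 + 0.5 × 200)) = 6.5 cycles.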
SLIDE 14
Example
What is average memory access time?
SLIDE 15
Example
What is the average memory access time with L1, L2, and L3?
SLIDE 16
Example
SLIDE 17 Cache-aware Programming
- Reuse values in cache as much as possible
  § Exploit temporal locality in programs
  § Example 1: y[2] is revisited continuously:
      for i = 1 to n
          y[2] = y[2] + 3
  § Example 2: an access sequence in which y[2] is revisited a few instructions later
SLIDE 18
Cache-aware Programming
- Take advantage of better bandwidth by getting a chunk of memory into cache and then using the whole chunk
  § Exploit spatial locality in programs:
      for i = 1 to n
          y[i] = y[i] + 3
[Figure: a 32-byte cache block (tag + data) fetched from memory address 4000, holding Y[0], Y[1], Y[2], …, Y[31]; visiting Y[1] benefits the accesses that follow]
SLIDE 19 2D array layout in memory (just like 1D array)
for (x = 0; x < 3; x++) {
    for (y = 0; y < 3; y++) {
        a[y][x] = 0;   // implemented as array[3*y + x] = 0
    }
}
→ access order a[0][0], a[1][0], a[2][0], a[0][1], …
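The row-major index formula in the comment can be verified directly: a[y][x] and a flat 1D array share the same storage layout. A minimal check (the function name is illustrative):

```c
#include <assert.h>

/* Verify that &a[y][x] equals &flat[3*y + x] for a 3x3 int array,
   i.e., a 2D array is laid out in memory just like a 1D array. */
int layout_matches(void) {
    int a[3][3];
    int *flat = &a[0][0];
    for (int y = 0; y < 3; y++)
        for (int x = 0; x < 3; x++)
            if (&a[y][x] != &flat[3 * y + x])
                return 0;
    return 1;
}
```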
SLIDE 20
Exploit spatial data locality via program rewriting: Example 1
- Each cache block has 64 bytes; the cache holds 128 bytes (2 blocks)
- Program structure (data access pattern):
  § char D[64][64]; each row is stored in one cache block
  § Program 1:
      for (j = 0; j < 64; j++)
          for (i = 0; i < 64; i++)
              D[i][j] = 0;
  § Program 2:
      for (i = 0; i < 64; i++)
          for (j = 0; j < 64; j++)
              D[i][j] = 0;
- 64*64 data byte accesses → what is the cache miss rate of each program?
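The two miss rates can be checked with a toy cache model. This sketch simulates the slide's 128-byte cache as 2 direct-mapped lines of 64 bytes (direct mapping is my assumption; the slides do not state a placement policy) and counts misses for each loop order:

```c
#include <assert.h>

/* Count cache misses for zeroing char D[64][64] with a 2-line,
   64-byte-per-line cache. column_order = 1 models Program 1
   (j outer, i inner); column_order = 0 models Program 2. */
long count_misses(int column_order) {
    long line_tag[2] = {-1, -1};      /* block number held by each line */
    long misses = 0;
    for (int outer = 0; outer < 64; outer++) {
        for (int inner = 0; inner < 64; inner++) {
            int i = column_order ? inner : outer;
            int j = column_order ? outer : inner;
            long block = (i * 64 + j) / 64;   /* each row is one block */
            if (line_tag[block % 2] != block) {
                line_tag[block % 2] = block;  /* fetch block on a miss */
                misses++;
            }
        }
    }
    return misses;
}
```

Under this model, Program 2 (row order) misses 64 times out of 64*64 accesses, while Program 1 misses on every single access.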
SLIDE 21
Data access pattern and cache misses (Program 2, inner loop):
for (i = 0; i < 64; i++)
    for (j = 0; j < 64; j++)
        D[i][j] = 0;
- 1 cache miss per inner-loop pass: miss, hit, hit, …, hit
- 64 cache misses out of 64*64 accesses
- There is spatial locality: a fetched cache block is used 64 times before being evicted (consecutive data accesses within the inner loop)
[Figure: rows D[0,0] … D[0,63] through D[63,0] … D[63,63]; i selects the cache block (row) and j sweeps within it]
SLIDE 22 Memory layout and data access by block
- Memory layout of char D[64][64] is the same as char D[64*64]:
  D[0,0] D[0,1] … D[0,63] | D[1,0] D[1,1] … D[1,63] | … | D[63,0] D[63,1] … D[63,63]
  (each 64-byte row is one cache block)
- The data access order of Program 2 follows the memory layout block by block: miss, hit, hit, …, hit within each block
- 64 cache misses out of 64*64 accesses
SLIDE 23
Data locality and cache misses (Program 1, inner loop):
for (j = 0; j < 64; j++)
    for (i = 0; i < 64; i++)
        D[i][j] = 0;
- 64 cache misses per inner-loop pass: 100% cache miss
- There is no spatial locality: a fetched block is used only once before being evicted
[Figure: the inner loop walks down a column D[0,j], D[1,j], …, D[63,j], touching a different cache block on every access]
SLIDE 24 Memory layout and data access by block
- Memory layout of char D[64][64] (one row per cache block):
  D[0,0] D[0,1] … D[0,63] | D[1,0] D[1,1] … D[1,63] | … | D[63,0] D[63,1] … D[63,63]
- Data access order of Program 1:
  D[0,0], D[1,0], …, D[63,0], D[0,1], D[1,1], …, D[63,1], …, D[0,63], D[1,63], …, D[63,63]
- Every access touches a different cache block: 100% cache miss
SLIDE 25 Summary of Example 1: Loop interchange alters execution order and data access patterns
- Exploit more spatial locality in this case
SLIDE 26
Program rewriting example 2: cache blocking for better temporal locality
- Cache size = 8 blocks = 128 bytes
  § Cache block size = 16 bytes, holding 4 integers
  § int A[64]; // sizeof(int) = 4 bytes
      for (k = 0; k < repcount; k++)
          for (i = 0; i < 64; i += stepsize)
              A[i] = A[i] + 1;
- Analyze the cache hit ratio while varying the cache block size or the step size (stride distance)
SLIDE 27
Example 2: focus on the inner loop
for (i = 0; i < 64; i += stepsize)
    A[i] = A[i] + 1;
[Figure: data access order/index over memory for step size (also called stride distance) S = 1, 2, 4, 8, with cache block boundaries marked]
SLIDE 28
Step size = 2:
for (i = 0; i < 64; i += stepsize)
    A[i] = A[i] + 1;   // read, then write A[i]
[Figure: with S = 2, each cache block sees the access pattern M/H (first element: the read misses, the write hits) then H/H (second element: read and write both hit)]
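The M/H, H/H pattern can be counted for one cold pass over the array. A sketch under the slide's parameters (16-byte blocks holding 4 ints; each A[i] update is a read followed by a write; the cache starts empty):

```c
#include <assert.h>
#include <math.h>

/* Miss ratio for one cold pass of: for (i = 0; i < 64; i += step) A[i]++;
   Blocks hold 4 ints, accesses walk forward, so only the first read
   of each block misses. */
double miss_ratio(int step) {
    int last_block = -1;
    long misses = 0, accesses = 0;
    for (int i = 0; i < 64; i += step) {
        accesses += 2;               /* read A[i], then write A[i] */
        if (i / 4 != last_block) {   /* first touch of this block: miss */
            misses++;
            last_block = i / 4;
        }
    }
    return (double)misses / (double)accesses;
}
```

With step size 2 this gives 16 misses out of 64 accesses, i.e. a 25% miss ratio, matching the M/H, H/H pattern on the slide.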
SLIDE 29
Repeat many times:
for (k = 0; k < repcount; k++)
    for (i = 0; i < 64; i += stepsize)
        A[i] = A[i] + 1;   // read, then write A[i]
- The array has 16 blocks; with S = 2 the inner loop accesses 32 elements and fetches all 16 blocks, each used as R/W/R/W
- Cache size = 8 blocks, which cannot hold all 16 fetched blocks, so each repetition misses again
SLIDE 30
Cache blocking to exploit temporal locality
for (k = 0; k < 100; k++)
    for (i = 0; i < 64; i += S)
        A[i] = f(A[i]);
- Rewrite the program with cache blocking so that the highlighted (pink) code block can execute entirely in cache
[Figure: before blocking, each k sweeps indices 2, 4, 6, 8, …; after blocking, indices 2, 4 run for k = 0 to 100, then 6, 8 for k = 0 to 100, and so on]
SLIDE 31
- Loop blocking (cache blocking): rewrite a program loop for better cache usage
- More generally, given:
      for (i = 0; i < 64; i += S)
          A[i] = f(A[i]);
  rewrite it as (with blocksize = 2):
      for (bi = 0; bi < 64; bi += blocksize)
          for (i = bi; i < bi + blocksize; i += S)
              A[i] = f(A[i]);
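Blocking only reorders the iterations, so both versions must compute the same result. A minimal check, where f is stubbed as +1 and blocksize is assumed to divide 64 and be a multiple of S (both assumptions mine):

```c
#include <assert.h>
#include <string.h>

static int f(int v) { return v + 1; }   /* stand-in for the slide's f */

/* Original loop over A[64] with stride S. */
void update_plain(int *A, int S) {
    for (int i = 0; i < 64; i += S)
        A[i] = f(A[i]);
}

/* Blocked version: same iterations, grouped into blocksize-sized chunks. */
void update_blocked(int *A, int S, int blocksize) {
    for (int bi = 0; bi < 64; bi += blocksize)
        for (int i = bi; i < bi + blocksize; i += S)
            A[i] = f(A[i]);
}
```

Running both on identical input arrays should leave them identical afterwards.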
SLIDE 32
Example 2: cache blocking for better performance
Original:
    for (k = 0; k < 100; k++)
        for (i = 0; i < 64; i += S)
            A[i] = f(A[i]);
With cache blocking:
    for (k = 0; k < 100; k++)
        for (bi = 0; bi < 64; bi += blocksize)
            for (i = bi; i < bi + blocksize; i += S)
                A[i] = f(A[i]);
After loop interchange:
    for (bi = 0; bi < 64; bi += blocksize)
        for (k = 0; k < 100; k++)
            for (i = bi; i < bi + blocksize; i += S)
                A[i] = f(A[i]);
- The highlighted (pink) code block can now execute entirely in cache
SLIDE 33
Example 3: matrix multiplication C = A*B
for i = 0 to n-1
    for j = 0 to n-1
        for k = 0 to n-1
            C[i][j] += A[i][k] * B[k][j]
- C_ij = (row A_i) · (column B_j)
SLIDE 34
Example 3: matrix multiplication code (2D arrays implemented using a 1D column-major layout)
Variant 1:
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
Variant 2:
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            for (i = 0; i < n; i++)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
- The three loops can be interchanged (the C elements are updated independently, with no dependences)
- Which code has better cache performance (i.e., is faster)?
SLIDE 35
Example 3: matrix multiplication code (2D arrays implemented using a 1D column-major layout)
Variant 1:
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
Variant 2:
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            for (i = 0; i < n; i++)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
- The three loops can be interchanged (the C elements are updated independently, with no dependences)
- Study the impact of stride in the innermost loop, which does most of the computation
- Which code has better cache performance (i.e., is faster)?
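The i-innermost variant in runnable form: with the column-major layout C[i + j*n], the inner loop walks C and A with stride 1 while B[k + j*n] stays fixed. In the small test below, B is the identity matrix, so C should come out equal to A:

```c
#include <assert.h>

/* C += A*B for n x n matrices stored column-major in 1D arrays
   (element (i,j) at index i + j*n). The i loop is innermost, giving
   stride-1 access to C and A; B[k + j*n] is loop-invariant there. */
void matmul(int n, const double *A, const double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
}
```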
SLIDE 36
Example 4: Cache blocking for matrix transpose
for (x = 0; x < n; x++) {
    for (y = 0; y < n; y++) {
        dst[y + x * n] = src[x + y * n];
    }
}
- Rewrite the code with cache blocking
[Figure: src and dst matrices, with x and y indexing rows and columns]
SLIDE 37
Example 4: Cache blocking for matrix transpose
Original:
    for (x = 0; x < n; x++) {
        for (y = 0; y < n; y++) {
            dst[y + x * n] = src[x + y * n];
        }
    }
Rewritten with cache blocking (both block loops outermost, so each blocksize × blocksize tile is finished before moving on):
    for (i = 0; i < n; i += blocksize)
        for (j = 0; j < n; j += blocksize)
            for (x = i; x < i + blocksize; ++x)
                for (y = j; y < j + blocksize; ++y)
                    dst[y + x * n] = src[x + y * n];
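The blocked transpose in runnable form. The bound guards (`&& x < n`, `&& y < n`) are my addition so that n need not be a multiple of blocksize; the slide implicitly assumes it divides evenly:

```c
#include <assert.h>

/* Transpose src (n x n, element (x,y) stored at src[x + y*n]) into dst,
   processing blocksize x blocksize tiles so each tile's source and
   destination lines stay in cache while the tile is worked on. */
void transpose_blocked(int n, int blocksize, const int *src, int *dst) {
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < n; j += blocksize)
            for (int x = i; x < i + blocksize && x < n; x++)
                for (int y = j; y < j + blocksize && y < n; y++)
                    dst[y + x * n] = src[x + y * n];
}
```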