SLIDE 1

1 Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition

Cache Memories

Lecture, Oct. 30, 2018

SLIDE 2

General Cache Concept

(figure: main memory partitioned into blocks 0–15; the cache currently holds copies of blocks 8, 9, 14, and 3; on a miss, block 10 is copied in, replacing block 4)

Larger, slower, cheaper memory is viewed as partitioned into "blocks". Data is copied between levels in block-sized transfer units. The smaller, faster, more expensive memory caches a subset of the blocks.

SLIDE 3

SLIDE 4

SLIDE 5

SLIDE 6

Structure Representation

struct rec {
    int a[4];          /* offset  0, 16 bytes */
    size_t i;          /* offset 16,  8 bytes */
    struct rec *next;  /* offset 24,  8 bytes; total size 32 */
};

SLIDE 7

(figure: memory layout of an array of structs: I[0].A, I[0].B[0], I[0].B[1], I[1].A, I[1].B[0], I[1].B[1])

SLIDE 8

(figure: memory layout of an array of structs: I[0].A, I[0].B[0], I[0].B[1], I[1].A, I[1].B[0], I[1].B[1], I[2].A, I[2].B[0], I[2].B[1], I[3].A, I[3].B[0], I[3].B[1])

2^9-byte cache: each block associated with the first half of the array has a unique spot in the cache.

SLIDE 9

for (j = 0; j < 3; j = j+1) {
    for (i = 0; i < 3; i = i+1) {
        x[i][j] = 2*x[i][j];
    }
}

for (i = 0; i < 3; i = i+1) {
    for (j = 0; j < 3; j = j+1) {
        x[i][j] = 2*x[i][j];
    }
}

These two loops compute the same result.

X[0][0] X[0][1] X[0][2] X[1][0] X[1][1] X[1][2] X[2][0] X[2][1] X[2][2]

Array in row major order

X[0][0] X[0][1] X[0][2] X[1][0] X[1][1] X[1][2] X[2][0] X[2][1] X[2][2]

0x0–0x3, 0x4–0x7, 0x8–0xB, 0xC–0xF, 0x10–0x13, 0x14–0x17

Cache Optimization Techniques

Inner loop analysis

SLIDE 10

for (j = 0; j < 3; j = j+1) {
    for (i = 0; i < 3; i = i+1) {
        x[i][j] = 2*x[i][j];
    }
}

for (i = 0; i < 3; i = i+1) {
    for (j = 0; j < 3; j = j+1) {
        x[i][j] = 2*x[i][j];
    }
}

These two loops compute the same result.

X[0][0] X[0][1] X[0][2] X[1][0] X[1][1] X[1][2] X[2][0] X[2][1] X[2][2]

Array in row major order

X[0][0] X[0][1] X[0][2] X[1][0] X[1][1] X[1][2] X[2][0] X[2][1] X[2][2]

0x0–0x3, 0x4–0x7, 0x8–0xB, 0xC–0xF, 0x10–0x13, 0x14–0x17

Cache Optimization Techniques

int *x = malloc(N*N*sizeof(int));
for (i = 0; i < 3; i = i+1) {
    for (j = 0; j < 3; j = j+1) {
        x[i*N + j] = 2*x[i*N + j];
    }
}

SLIDE 11

Matrix Multiplication Refresher

SLIDE 12

Miss Rate Analysis for Matrix Multiply

  • Assume:
  • Block size = 32B (big enough for four doubles)
  • Matrix dimension (N) is very large
  • Cache is not even big enough to hold multiple rows
  • Analysis Method:
  • Look at access pattern of inner loop

(diagram: C = A x B; the inner loop reads row i of A and column j of B to produce element (i, j) of C)

SLIDE 13

Layout of C Arrays in Memory (review)

  • C arrays allocated in row-major order
  • each row in contiguous memory locations
  • Stepping through columns in one row:
  • for (i = 0; i < N; i++)

sum += a[0][i];

  • accesses successive elements
  • if block size (B) > sizeof(aij) bytes, exploit spatial locality
  • miss rate = sizeof(aij) / B
  • Stepping through rows in one column:
  • for (i = 0; i < N; i++)

sum += a[i][0];

  • accesses distant elements
  • no spatial locality!
  • miss rate = 1 (i.e. 100%)
SLIDE 14

Matrix Multiplication (ijk)

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed

Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0

matmult/mm.c

SLIDE 15

Matrix Multiplication (jik)

/* jik */
for (j = 0; j < n; j++) {
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed

Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0

matmult/mm.c

SLIDE 16

Matrix Multiplication (kij)

/* kij */
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise

Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25

matmult/mm.c

SLIDE 17

Matrix Multiplication (ikj)

/* ikj */
for (i = 0; i < n; i++) {
    for (k = 0; k < n; k++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise

Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25

matmult/mm.c

SLIDE 18

Matrix Multiplication (jki)

/* jki */
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise

Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0

matmult/mm.c

SLIDE 19

Matrix Multiplication (kji)

/* kji */
for (k = 0; k < n; k++) {
    for (j = 0; j < n; j++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise

Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0

matmult/mm.c

SLIDE 20

Summary of Matrix Multiplication

ijk (& jik):

  • 2 loads, 0 stores
  • misses/iter = 1.25

kij (& ikj):

  • 2 loads, 1 store
  • misses/iter = 0.5

jki (& kji):

  • 2 loads, 1 store
  • misses/iter = 2.0

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

/* kij */
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

/* jki */
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}

SLIDE 21

Core i7 Matrix Multiply Performance

(chart: cycles per inner loop iteration vs. array size n, for n = 50 to 700; the jki/kji pair is slowest, ijk/jik in between, kij/ikj fastest)

SLIDE 22

Example: Matrix Multiplication

(diagram: c = a * b, element (i, j))

c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

SLIDE 23

Cache Miss Analysis

  • Assume:
  • Matrix elements are doubles
  • Assume the matrix is square
  • Cache block = 8 doubles
  • Cache size C << n (much smaller than n)
  • First iteration:
  • n/8 + n = 9n/8 misses
  • Afterwards in cache:

(schematic: the first iteration touches one row of a, n/8 misses, and one column of b, n misses; blocks are 8 doubles wide)

SLIDE 24

Cache Miss Analysis

  • Assume:
  • Matrix elements are doubles
  • Cache block = 8 doubles
  • Cache size C << n (much smaller than n)
  • Second iteration:
  • Again:

n/8 + n = 9n/8 misses

  • Total misses:
  • (9n/8) * n² = (9/8) n³

(schematic: each of the n² iterations misses 9n/8 times; blocks are 8 doubles wide)

SLIDE 25

Blocked Matrix Multiplication

(diagram: c += a * b computed block-wise; block size B x B, block indices i1, j1)

SLIDE 26

(diagram: c += a * b computed block-wise; block size B x B, block indices i1, j1)

SLIDE 27

(diagram: c += a * b computed block-wise; block size B x B)

a and b are 4 x 4 matrices split into four 2 x 2 blocks; the numbers 1–16 label the elements block by block:

 1  2 |  5  6
 3  4 |  7  8
------+------
 9 10 | 13 14
11 12 | 15 16

SLIDE 28

(diagram: c += a * b computed block-wise; block size B x B)

The top-left block of c is built from two 2 x 2 block products:

[1 2; 3 4] * [1 2; 3 4] + [5 6; 7 8] * [9 10; 11 12]

SLIDE 29

(diagram: c += a * b computed block-wise; block size B x B)

[1 2; 3 4] * [1 2; 3 4] + [5 6; 7 8] * [9 10; 11 12] = [118 132; 166 188]

SLIDE 30

(diagram: c += a * b computed block-wise; block size B x B)

The top-left block of c is now complete: [1 2; 3 4] * [1 2; 3 4] + [5 6; 7 8] * [9 10; 11 12] = [118 132; 166 188]

SLIDE 31

Cache Miss Analysis

  • Assume:
  • Square matrix
  • Cache block = 8 doubles
  • Cache size C << n (much smaller than n)
  • Three blocks fit into cache: 3B² < C (where B² is the size of a B x B block)
  • First (block) iteration:
  • B²/8 misses for each block
  • 2n/B * B²/8 = nB/4

(omitting matrix c)

  • Afterwards in cache

(schematic: one row of blocks of a and one column of blocks of b; block size B x B, n/B blocks per dimension)

SLIDE 32

Cache Miss Analysis

  • Assume:
  • Cache block = 8 doubles
  • Cache size C << n (much smaller than n)
  • Three blocks fit into cache: 3B² < C
  • Second (block) iteration:
  • Same as first iteration
  • 2n/B * B²/8 = nB/4
  • Total misses:
  • nB/4 * (n/B)² = n³/(4B)

(schematic: block size B x B, n/B blocks per dimension)
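Putting the two analyses side by side (under the slide's assumptions of 8 doubles per block and 3B² < C), the miss counts and their ratio are:

```latex
\text{unblocked: } \frac{9n}{8}\cdot n^{2} = \frac{9}{8}n^{3},
\qquad
\text{blocked: } \frac{nB}{4}\cdot\left(\frac{n}{B}\right)^{2} = \frac{n^{3}}{4B},
\qquad
\text{ratio: } \frac{(9/8)\,n^{3}}{n^{3}/(4B)} = \frac{9B}{2}.
```

For example, with B = 8 the blocked version incurs roughly 36 times fewer misses.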

SLIDE 33

Blocking Summary

  • No blocking: (9/8) n³
  • Blocking: n³/(4B)
  • Suggests using the largest possible block size B, subject to the limit 3B² < C!
  • Reason for dramatic difference:
  • Matrix multiplication has inherent temporal locality:
  • Input data: 3n² elements, computation: 2n³ operations
  • Every array element is used O(n) times!
  • But the program has to be written properly
SLIDE 34

Cache Summary

  • Cache memories can have significant performance impact
  • You can write your programs to exploit this!
  • Focus on the inner loops, where the bulk of computations and memory accesses occur.
  • Try to maximize spatial locality by reading data objects sequentially with stride 1.
  • Try to maximize temporal locality by using a data object as often as possible once it's read from memory.
SLIDE 35

Blocked Matrix Multiplication

c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += B)
        for (j = 0; j < n; j += B)
            for (k = 0; k < n; k += B)
                /* B x B mini matrix multiplications */
                for (i1 = i; i1 < i+B; i1++)
                    for (j1 = j; j1 < j+B; j1++)
                        for (k1 = k; k1 < k+B; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

(diagram: c = c + a * b computed block-wise; block size B x B; matmult/bmm.c)

SLIDE 36

Program Optimization

SLIDE 37

Blocked Matrix Multiplication

c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += B)
        for (j = 0; j < n; j += B)
            for (k = 0; k < n; k += B)
                /* B x B mini matrix multiplications */
                for (i1 = i; i1 < i+B; i1++)
                    for (j1 = j; j1 < j+B; j1++)
                        for (k1 = k; k1 < k+B; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

(diagram: c = c + a * b computed block-wise; block size B x B; matmult/bmm.c)

SLIDE 38

Compiler Optimizations

SLIDE 39

Optimizing Compilers

  • Provide efficient mapping of program to machine
  • register allocation
  • code selection and ordering (scheduling)
  • dead code elimination
  • eliminating minor inefficiencies
  • Don’t (usually) improve asymptotic efficiency
  • up to programmer to select best overall algorithm
  • big-O savings are (often) more important than constant factors
  • but constant factors also matter
  • Have difficulty overcoming “optimization blockers”
  • potential memory aliasing
  • potential procedure side-effects
SLIDE 40

Limitations of Optimizing Compilers

  • Operate under fundamental constraint
  • Must not cause any change in program behavior
  • Except, possibly, when the program makes use of nonstandard language features
  • Often prevents it from making optimizations that would only affect behavior

under edge conditions.

  • Most analysis is performed only within procedures
  • Whole-program analysis is too expensive in most cases
  • Newer versions of GCC do interprocedural analysis within individual files
  • But, not between code in different files
  • Most analysis is based only on static information
  • Compiler has difficulty anticipating run-time inputs
  • When in doubt, the compiler must be conservative
SLIDE 41

SLIDE 42

%edx holds i

SLIDE 43

long *p = A;
long *end = A + N;   /* one past the last element */
while (p != end) {
    result += *p;
    p++;
}

The optimization removes i and produces a more efficient compare: because we are now testing for equality, the compiler can use a simple test instruction. It also makes the address calculation simpler.

SLIDE 44

Some categories of optimizations compilers are good at

SLIDE 45

Generally Useful Optimizations

  • Optimizations that you or the compiler should do regardless of processor / compiler
  • Code Motion
  • Reduce frequency with which computation is performed
  • If it will always produce same result
  • Especially moving code out of loop

void set_row(double *a, double *b, long i, long n) {
    long j;
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];
}

/* after code motion: n*i is computed once, outside the loop */
long j;
int ni = n*i;
for (j = 0; j < n; j++)
    a[ni + j] = b[j];

SLIDE 46

Reduction in Strength

  • Replace costly operation with simpler one
  • Shift, add instead of multiply or divide

16*x  ->  x << 4

  • Depends on cost of multiply or divide instruction
  • On Intel Nehalem, integer multiply requires 3 CPU cycles
  • https://www.agner.org/optimize/instruction_tables.pdf
  • Recognize sequence of products

for (i = 0; i < n; i++) {
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}

/* strength-reduced version */
int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;
}

We can replace the multiply operation with an add.

SLIDE 47

Share Common Subexpressions

  • Reuse portions of expressions
  • GCC will do this with –O1

/* Sum neighbors of i,j */
up    = val[(i-1)*n + j  ];
down  = val[(i+1)*n + j  ];
left  = val[ i*n    + j-1];
right = val[ i*n    + j+1];
sum = up + down + left + right;

/* shared common subexpression */
long inj = i*n + j;
up    = val[inj - n];
down  = val[inj + n];
left  = val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;

3 multiplications: i*n, (i-1)*n, (i+1)*n  vs.  1 multiplication: i*n

# 3 multiplications
leaq 1(%rsi), %rax      # i+1
leaq -1(%rsi), %r8      # i-1
imulq %rcx, %rsi        # i*n
imulq %rcx, %rax        # (i+1)*n
imulq %rcx, %r8         # (i-1)*n
addq %rdx, %rsi         # i*n+j
addq %rdx, %rax         # (i+1)*n+j
addq %rdx, %r8          # (i-1)*n+j

# 1 multiplication
imulq %rcx, %rsi        # i*n
addq %rdx, %rsi         # i*n+j
movq %rsi, %rax         # i*n+j
subq %rcx, %rax         # i*n+j-n
leaq (%rsi,%rcx), %rcx  # i*n+j+n

SLIDE 48

Share Common Subexpressions

  • Reuse portions of expressions
  • GCC will do this with –O1

/* Sum neighbors of i,j */
up    = val[(i-1)*n + j  ];
down  = val[(i+1)*n + j  ];
left  = val[ i*n    + j-1];
right = val[ i*n    + j+1];
sum = up + down + left + right;

/* shared common subexpression */
long inj = i*n + j;
up    = val[inj - n];
down  = val[inj + n];
left  = val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;

3 multiplications: i*n, (i-1)*n, (i+1)*n  vs.  1 multiplication: i*n

Distribute the n: (i ± 1)*n + j = (i*n + j) ± n

SLIDE 49

Write Compiler-Friendly Code: Times When the Compiler Needs Help

SLIDE 50

Optimization Blocker #1: Procedure Calls

  • Procedure to Convert String to Lower Case

void lower(char *s) {
    size_t i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

ASCII codes: 'A' = 65, 'Z' = 90, 'a' = 97, 'z' = 122

SLIDE 51

Lower Case Conversion Performance

  • Time quadruples when string length doubles
  • Quadratic performance

(chart: CPU seconds vs. string length for lower1, lengths 50,000 to 500,000; quadratic growth)

SLIDE 52

Convert Loop To Goto Form

  • strlen executed every iteration

void lower(char *s) {
    size_t i = 0;
    if (i >= strlen(s))
        goto done;
  loop:
    if (s[i] >= 'A' && s[i] <= 'Z')
        s[i] -= ('A' - 'a');
    i++;
    if (i < strlen(s))
        goto loop;
  done: ;
}

SLIDE 53

Calling Strlen

  • strlen performance
  • Only way to determine length of string is to scan its entire length, looking for null character.
  • Overall performance, string of length N
  • N calls to strlen
  • Required times: N, N-1, N-2, …, 1
  • Overall O(N²) performance

/* My version of strlen */
size_t strlen(const char *s) {
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    return length;
}

SLIDE 54

Improving Performance

  • Move call to strlen outside of loop
  • Since result does not change from one iteration to another
  • Form of code motion

void lower(char *s) {
    size_t i;
    size_t len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

SLIDE 55

Lower Case Conversion Performance

  • Time doubles when string length doubles
  • Linear performance of lower2

(chart: CPU seconds vs. string length for lower1 and lower2; lower1 is quadratic, lower2 linear)

SLIDE 56

Optimization Blocker: Procedure Calls

  • Why couldn’t compiler move strlen out of inner loop?
  • Procedure may have side effects
  • Alters global state each time called
  • Function may not return same value for given arguments
  • Depends on other parts of global state
  • Procedure lower could interact with strlen
  • Warning:
  • Compiler treats procedure call as a black box
  • Remedies:
  • Do your own code motion

size_t lencnt = 0;
size_t strlen(const char *s) {
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    lencnt += length;  /* side effect: alters global state */
    return length;
}

SLIDE 57

SLIDE 58

SLIDE 59

SLIDE 60

SLIDE 61

Memory Aliasing

SLIDE 62

SLIDE 63

SLIDE 64

Another example of Aliasing

SLIDE 65

Memory Aliasing

  • Code updates b[i] on every iteration
  • Why couldn’t compiler optimize this away?

/* Sum rows of n x n matrix a and store in vector b */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

SLIDE 66

Memory Aliasing

  • Code updates b[i] on every iteration
  • Must consider possibility that these updates will affect program behavior

/* Sum rows of n x n matrix a and store in vector b */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

double A[9] = { 0, 1, 2, 4, 8, 16, 32, 64, 128 };
double *B = A + 3;   /* B aliases A[3..5] */
sum_rows1(A, B, 3);

Value of B:
init:  [4, 8, 16]
i = 0: [3, 8, 16]
i = 1: [3, 22, 16]
i = 2: [3, 22, 224]

SLIDE 67

Memory Aliasing

  • Code updates b[i] on every iteration
  • Must consider possibility that these updates will affect program behavior

double A[9] = { 0, 1, 2, 4, 8, 16, 32, 64, 128 };
double *B = A + 3;   /* B aliases A[3..5] */
sum_rows1(A, B, 3);

Value of B:
init:  [4, 8, 16]
i = 0: [3, 8, 16]
i = 1: b[1] is A[4], so each update is visible to the next read:
       b[1] = 0             ->  [3, 0, 16]
       b[1] += a[3] (= 3)   ->  [3, 3, 16]
       b[1] += a[4] (= 3)   ->  [3, 6, 16]
       b[1] += a[5] (= 16)  ->  [3, 22, 16]
i = 2: [3, 22, 224]

SLIDE 68

Memory Matters

  • Code updates b[i] on every iteration
  • Why couldn’t compiler optimize this away?

/* Sum rows of n x n matrix a and store in vector b */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

/* Accumulate in a local variable, store once per row */
void sum_rows2(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double sum = 0;
        for (j = 0; j < n; j++)
            sum += a[i*n + j];
        b[i] = sum;
    }
}

SLIDE 69

Optimization Blocker: Memory Aliasing

  • Aliasing
  • Two different memory references specify single location
  • Easy to have happen in C
  • Since allowed to do address arithmetic
  • Direct access to storage structures
  • Get in habit of introducing local variables
  • Accumulating within loops
  • Your way of telling compiler not to check for aliasing
SLIDE 70

Loop unrolling

SLIDE 71

SLIDE 72

SLIDE 73

SLIDE 74

SLIDE 75