Page 1 Ridges of Temporal Locality Ridges of Temporal Locality - PDF document

Writing Cache Friendly Code Writing Cache Friendly Code The Memory Mountain The Memory Mountain Repeated references to variables are good (temporal Repeated references to variables are good (temporal Read throughput (read bandwidth) Read throughput (read bandwidth) locality) locality) � Number of bytes read from memory per second (MB/s) Stride Stride- -1 reference patterns are good (spatial locality) 1 reference patterns are good (spatial locality) Memory mountain Memory mountain � Measured read throughput as a function of spatial and Examples: Examples: temporal locality. � cold cache, 4-byte words, 4-word cache blocks � Compact way to characterize memory system performance � Compact way to characterize memory system performance. int sumarrayrows(int a[M][N]) int sumarraycols(int a[M][N]) { { int i, j, sum = 0; int i, j, sum = 0; for (i = 0; i < M; i++) for (j = 0; j < N; j++) for (j = 0; j < N; j++) for (i = 0; i < M; i++) sum += a[i][j]; sum += a[i][j]; return sum; return sum; } } Miss rate = 1/4 = 25% Miss rate = 100% – 1 – – 2 – Memory Mountain Test Function Memory Mountain Test Function Memory Mountain Main Routine Memory Mountain Main Routine /* mountain.c - Generate the memory mountain. */ #define MINBYTES (1 << 10) /* Working set size ranges from 1 KB */ /* The test function */ #define MAXBYTES (1 << 23) /* ... up to 8 MB */ void test(int elems, int stride) { #define MAXSTRIDE 16 /* Strides range from 1 to 16 */ int i, result = 0; #define MAXELEMS MAXBYTES/sizeof(int) volatile int sink; int data[MAXELEMS]; /* The array we'll be traversing */ for (i = 0; i < elems; i += stride) result += data[i]; int main() sink = result; /* So compiler doesn't optimize away the loop */ { } } i t int size; /* Working set size (in bytes) */ i /* W ki t i (i b t ) */ int stride; /* Stride (in array elements) */ /* Run test(elems, stride) and return read throughput (MB/s) */ double Mhz; /* Clock frequency */ double run(int size, int stride, double Mhz) { init_data(data, MAXELEMS); /* Initialize each element in data to 1 */ double cycles; Mhz = mhz(0); /* Estimate the clock frequency */ int elems = size / sizeof(int); for (size = MAXBYTES; size >= MINBYTES; size >>= 1) { for (stride = 1; stride <= MAXSTRIDE; stride++) test(elems, stride); /* warm up the cache */ printf("%.1f\t", run(size, stride, Mhz)); cycles = fcyc2(test, elems, stride, 0); /* call test(elems,stride) */ printf("\n"); return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */ } } exit(0); } – 3 – – 4 – The Memory Mountain The Memory Mountain Pentium 4 – 2.4 GHz Pentium 4 – 2.4 GHz Pentium III Xeon 5000 1200 550 MHz 4500 16 KB on-chip L1 d-cache hroughput (MB/s) 16 KB on-chip L1 i-cache 1000 4000 L1 512 KB off-chip unified put (MB/s) 4500-5000 L2 cache 3500 800 4000-4500 3500-4000 3000 read th Read throughp 3000 3500 3000-3500 600 2500 2500-3000 Ridges of 2000-2500 400 2000 xe Temporal 1500-2000 Slopes of L2 Locality 1500 1000-1500 Spatial 200 500-1000 Locality 1000 0-500 2k 0 500 16k s1 mem s3 2k 0 128k Working set s5 8k s7 32k s1 size (bytes) s9 s2 128k s3 1024k s4 s5 s11 s6 stride (words) s7 512k s8 s13 working set size (bytes) s9 s10 2m s11 8m s15 s12 s13 s14 s15 8m s16 Stride (words) – 5 – – 6 – Page 1

Ridges of Temporal Locality Ridges of Temporal Locality Pentium 4 Pentium 4 Memory performance (stride = 6) Slice through the memory mountain with stride=1 Slice through the memory mountain with stride=1 � illuminates read throughputs of different caches and 4500 memory 4000 1200 main memory L2 cache L1 cache 3500 (MB/s) region region region 1000 3000 Read throughput read througput (MB/s) 2500 800 2000 600 1500 400 1000 200 500 0 0 8m 4m 2m 1024k 512k 256k 128k 64k 32k 16k 8k 4k 2k 1k 8m 4m 2m 1024k 512k 256k 128k 64k 32k 16k 8k 4k 2k Working set size (bytes) – 7 – working set size (bytes) – 8 – A Slope of Spatial Locality A Slope of Spatial Locality Pentium 4 Pentium 4 Memory performance (working set size = 512 Kbytes) Slice through memory mountain with size=256KB Slice through memory mountain with size=256KB � shows cache block size. 3500 800 3000 700 (MB/s) 2500 600 B/s) read throughput (MB Read throughput 2000 500 one access per cache line 400 1500 300 1000 200 100 500 0 0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 stride (words) – 9 – – 10 – Stride (words) Matrix Multiplication Example Matrix Multiplication Example Miss Rate Analysis for Matrix Multiply Miss Rate Analysis for Matrix Multiply Assume: Assume: Major Cache Effects to Consider Major Cache Effects to Consider � Line size = 32B (big enough for 4 64-bit words) � Total cache size � Matrix dimension (N) is very large � Exploit temporal locality and keep the working set small (e.g., by using � Approximate 1/N as 0.0 blocking) /* ijk */ Variable sum � Cache is not even big enough to hold multiple rows � Block size for (i=0; i<n; i++) { held in register � Exploit spatial locality � Exploit spatial locality for (j=0; j<n; j++) { for (j=0; j<n; j++) { Analysis Method: Analysis Method: Analysis Method: Analysis Method: sum = 0.0; � Look at access pattern of inner loop for (k=0; k<n; k++) Description: Description: sum += a[i][k] * b[k][j]; c[i][j] = sum; k j j � Multiply N x N matrices } � O(N3) total operations i k i } � Accesses C A B � N reads per source element � N values summed per destination » but may be able to hold in register – 11 – – 12 – Page 2

Layout of C Arrays in Memory Layout of C Arrays in Memory Matrix Multiplication (ijk) Matrix Multiplication (ijk) (review) (review) C arrays allocated in row C arrays allocated in row- -major order major order � each row in contiguous memory locations /* ijk */ Inner loop: for (i=0; i<n; i++) { Stepping through columns in one row: Stepping through columns in one row: (*,j) for (j=0; j<n; j++) { � for (i = 0; i < N; i++) (i,j) sum = 0.0; sum += a[0][i]; (i,*) for (k=0; k<n; k++) � accesses successive elements A A B B C C sum += a[i][k] * b[k][j]; sum += a[i][k] * b[k][j]; � if block size (B) > 4 bytes, exploit spatial locality c[i][j] = sum; � compulsory miss rate = 4 bytes / B } Stepping through rows in one column: Stepping through rows in one column: } Row-wise Column- Fixed wise � for (i = 0; i < n; i++) sum += a[i][0]; Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: � accesses distant elements A B C � no spatial locality! 0.25 1.0 0.0 � compulsory miss rate = 1 (i.e. 100%) – 13 – – 14 – Matrix Multiplication (jik) Matrix Multiplication (jik) Matrix Multiplication (kij) Matrix Multiplication (kij) /* jik */ /* kij */ Inner loop: Inner loop: for (k=0; k<n; k++) { for (j=0; j<n; j++) { for (i=0; i<n; i++) { (*,j) for (i=0; i<n; i++) { (i,k) (k,*) (i,*) (i,j) sum = 0.0; r = a[i][k]; (i,*) for (j=0; j<n; j++) A B C for (k=0; k<n; k++) A A B B C C sum += a[i][k] * b[k][j]; + [i][k] * b[k][j] c[i][j] += r * b[k][j]; c[i][j] += r * b[k][j]; c[i][j] = sum } } } Fixed Row-wise Row-wise } Row-wise Column- Fixed wise Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: A B C A B C 0.25 1.0 0.0 0.0 0.25 0.25 – 15 – – 16 – Matrix Multiplication (ikj) Matrix Multiplication (ikj) Matrix Multiplication (jki) Matrix Multiplication (jki) /* ikj */ /* jki */ Inner loop: Inner loop: for (i=0; i<n; i++) { for (j=0; j<n; j++) { (*,k) (*,j) for (k=0; k<n; k++) { for (k=0; k<n; k++) { (i,k) (k,*) (i,*) (k,j) r = a[i][k]; r = b[k][j]; for (j=0; j<n; j++) A B C for (i=0; i<n; i++) A A B B C C c[i][j] += r * b[k][j]; c[i][j] += r * b[k][j]; c[i][j] += a[i][k] * r; c[i][j] += a[i][k] * r; } } } } Fixed Row-wise Row-wise Column - Fixed Column- wise wise Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: Misses per Inner Loop Iteration: A B C A B C 0.0 0.25 0.25 1.0 0.0 1.0 – 17 – – 18 – Page 3

Page 1 Ridges of Temporal Locality Ridges of Temporal Locality - PDF document

Writing Cache Friendly Code Writing Cache Friendly Code The Memory Mountain The Memory Mountain Repeated references to variables are good (temporal Repeated references to variables are good (temporal Read throughput (read bandwidth) Read

Agenda Item 7 Page 107 Page 108 Page 109 Page 110 Page 111 Page 112 Page 113 Page 114 Page

Page 1 of 36 Page 2 of 36 Page 3 of 36 Page 4 of 36 Page 5 of 36 Page 6 of 36 Page 7 of 36

Agenda Item 7 Page 1 Page 2 Page 3 Page 4 Page 5 Page 6 Page 7 Page 8 Page 9 Page 10

Wednesday, November 30, 2016 3:41 PM General Page 1 General Page 2 General Page 3 General Page

Lecture 8 Friday, June 2, 2017 5:38 PM slide_8 Page 1 slide_8 Page 2 slide_8 Page 3 slide_8

177 Hudson Street Manhattan, NY 10013 Block 219 Lot 21 Historic Photos Page 1 Page 2 Page 3

PAGE 1 PAGE 2 PAGE 3 PAGE 4 Vision PAGE 5 Desire Lines of Cow Paths? PAGE 6

1. Test page This page is for testing. This page is for testing. This page is for testing.

Lecture 12 Sunday, January 27, 2019 5:25 PM Lecture12 Page 1 Lecture12 Page 2 Lecture12 Page 3

KAMPARO page 9 page 16 page 19 page 27 page 34 2 INHOUDSOPGA VE page 4 Cables Chargers

Page 35 Page 36 Page 37 Page 38 Page 39 This page is intentionally left blank

May 26, 2015 Presentation to Council and School Board Page 1 of 24 Page 2 of 24 Page 3 of 24

BRIGHT-LINE TEST Table of Contents page page page page page 3 5 11 15 19 What is the

Contents Nordea Page 3 Integration Page 16 Highlights and market development Page 24

Contents Summary presentation Q3/02 Page 3 Nordea Page 43 Integration Page 54

Contents Press slides Page 3 Financial highlights Q3/03 Page 17 Credit quality Page

Computer Graphics Seminar MTAT.03.305 Spring 2020 Raimond Tunnel Computer Graphics

Parallel Programming http://www.cs.bham.ac.uk/~hxt/2013/ parallel-programming/ based on: David

Get Out of the Valley: Power-Efficient Address Mapping for GPUs The 45 th International Symposium

Performance Engineering for Algorithmic Building Blocks in the GHOST Library Georg Hager, Moritz

ManyCore ManyCore Computing: ManyCore ManyCore Computing: Computing: Computing: The Impact on

Nested Lists Motivating this Course Module In previous video series, introduced lists All

1 Arrays Arrays The array's element type is the type of values it stores The size of an array

Arrays CS180 15Feb2008 Announcements Exam1GradesonBlackboard