Cache Performance


slide-1
SLIDE 1

Cache Performance

1

slide-2
SLIDE 2

C and cache misses (1)

int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2) {
    even_sum += array[i + 0];
    odd_sum += array[i + 1];
}

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 2KB direct-mapped cache with 16B cache blocks?

2

slide-3
SLIDE 3

C and cache misses (2)

int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 2KB direct-mapped cache with 16B cache blocks? Would a set-associative cache be better?

3

slide-4
SLIDE 4

thinking about cache storage (1)

2KB direct-mapped cache with 16B blocks:

set 0: address 0 to 15, (0 to 15) + 2KB, (0 to 15) + 4KB, …
    block at 0: array[0] through array[3]
    block at 0+2KB: array[512] through array[515]
set 1: address 16 to 31, (16 to 31) + 2KB, (16 to 31) + 4KB, …
    block at 16: array[4] through array[7]
    block at 16+2KB: array[516] through array[519]
…
set 127: address 2032 to 2047, (2032 to 2047) + 2KB, …
    block at 2032: array[508] through array[511]
    block at 2032+2KB: array[1020] through array[1023]

4


slide-8
SLIDE 8

thinking about cache storage (2)

2KB 2-way set associative cache with 16B blocks: block addresses —

set 0: address 0, 0 + 1KB, 0 + 2KB, …
    block at 0: array[0] through array[3]
    block at 0+1KB: array[256] through array[259]
    block at 0+2KB: array[512] through array[515]
    …
set 1: address 16, 16 + 1KB, 16 + 2KB, …
    block at 16: array[4] through array[7]
…
set 63: address 1008, 1008 + 1KB, 1008 + 2KB, …
    block at 1008: array[252] through array[255]

5


slide-12
SLIDE 12

C and cache misses (3)

typedef struct {
    int a_value, b_value;
    int boring_values[126];
} item;
item items[8]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 8; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 8; ++i)
    b_sum += items[i].b_value;

Assume everything but items is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 2KB direct-mapped cache with 16B cache blocks?

6

slide-13
SLIDE 13

C and cache misses (3, rewritten?)

int array[1024]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 1024; i += 128)
    a_sum += array[i];
for (int i = 1; i < 1024; i += 128)
    b_sum += array[i];

7

slide-14
SLIDE 14

C and cache misses (4)

typedef struct {
    int a_value, b_value;
    int boring_values[126];
} item;
item items[8]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 8; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 8; ++i)
    b_sum += items[i].b_value;

Assume everything but items is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 4-way set associative 2KB cache with 16B cache blocks?

8

slide-15
SLIDE 15

a note on matrix storage

A — an N × N matrix; representing it as a flat array makes dynamic sizes easier:

float A_2d_array[N][N];
float *A_flat = malloc(N * N * sizeof(float));

A_flat[i * N + j] is the same element as A_2d_array[i][j]

9

slide-16
SLIDE 16

matrix squaring

Bij = Σ (k = 1 to n) Aik × Akj

/* version 1: inner loop is k, middle is j */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            B[i * N + j] += A[i * N + k] * A[k * N + j];

10

slide-17
SLIDE 17

matrix squaring

Bij = Σ (k = 1 to n) Aik × Akj

/* version 1: inner loop is k, middle is j */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            B[i*N+j] += A[i * N + k] * A[k * N + j];

/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i*N+j] += A[i * N + k] * A[k * N + j];

11


slide-19
SLIDE 19

performance

[graphs: billions of instructions and billions of cycles vs. N (100 to 500), for the "k inner" and "k outer" versions]

12

slide-20
SLIDE 20

alternate view 1: cycles/instruction

[graph: cycles/instruction vs. N (100 to 500)]

13

slide-21
SLIDE 21

alternate view 2: cycles/operation

[graph: cycles per multiply or add vs. N (100 to 500)]

14

slide-22
SLIDE 22

loop orders and locality

loop body: Bij += Aik × Akj

kij order: Bij, Akj have spatial locality
kij order: Aik has temporal locality
… better than …
ijk order: Aik has spatial locality
ijk order: Bij has temporal locality

15


slide-24
SLIDE 24

matrix squaring

Bij = Σ (k = 1 to n) Aik × Akj

/* version 1: inner loop is k, middle is j */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            B[i*N+j] += A[i * N + k] * A[k * N + j];

/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i*N+j] += A[i * N + k] * A[k * N + j];

16


slide-27
SLIDE 27

L1 misses

[graph: read misses per 1K instructions vs. N (100 to 500), "k inner" vs. "k outer"]

17

slide-28
SLIDE 28

L1 miss detail (1)

[graph: read misses per 1K instructions vs. N (50 to 200); the region where the matrix is smaller than the L1 cache is marked]

18

slide-29
SLIDE 29

L1 miss detail (2)

[graph: read misses per 1K instructions vs. N (50 to 200); the region where the matrix is smaller than the L1 cache is marked; miss-rate spikes at N = 93 (93 × 11 ≈ 2^10), N = 114 (114 × 9 ≈ 2^10), and N = 2^7]

19

slide-30
SLIDE 30

addresses

A[k*114+j]     is at 10 0000 0000 0100
A[k*114+j+1]   is at 10 0000 0000 1000
A[(k+1)*114+j] is at 10 0011 1001 0100
A[(k+2)*114+j] is at 10 0101 0101 1100
…
A[(k+9)*114+j] is at 11 0000 0000 1100

recall: 6 index bits, 6 block offset bits (L1)

20


slide-32
SLIDE 32

conflict misses

powers of two — lower order bits unchanged

A[k*93+j] and A[(k+11)*93+j]:
    1023 elements apart (4092 bytes; 63.9 cache blocks)
    64 sets in L1 cache: usually maps to the same set

A[k*93+(j+1)] will not be cached (next i loop) even if in the same block as A[k*93+j]

21

slide-33
SLIDE 33

reasoning about loop orders

changing loop order changed locality

how do we tell which loop order will be best?

besides running each one?

22

slide-34
SLIDE 34

systematic approach (1)

for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i*N+j] += A[i*N+k] * A[k*N+j];

goal: get the most out of each cache miss

if N is larger than the cache:
    miss for Bij — 1 computation
    miss for Aik — N computations
    miss for Akj — 1 computation
effectively caching just 1 element

23

slide-35
SLIDE 35

keeping values in cache

can’t explicitly ensure values are kept in cache
…but reusing values effectively does this

cache will try to keep recently used values

cache optimization ideas: choose what’s in the cache

for thinking about it: load values explicitly
for implementing it: access only values we want loaded

24

slide-36
SLIDE 36

a transformation

for (int kk = 0; kk < N; kk += 2)
    for (int k = kk; k < kk + 2; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                B[i*N+j] += A[i*N+k] * A[k*N+j];

split the loop over k — should be exactly the same

(assuming even N)

25



slide-39
SLIDE 39

simple blocking

for (int kk = 0; kk < N; kk += 2)
    /* was here: for (int k = kk; k < kk + 2; ++k) */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            /* load Aik, Aik+1 into cache and process: */
            for (int k = kk; k < kk + 2; ++k)
                B[i*N+j] += A[i*N+k] * A[k*N+j];

now reorder the split loop — same calculations

now handle Bij for k + 1 right after Bij for k (previously: Bi,j+1 for k right after Bij for k)

26


slide-41
SLIDE 41

simple blocking – expanded

for (int kk = 0; kk < N; kk += 2) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            /* process a "block" of 2 k values: */
            B[i*N+j] += A[i*N+kk+0] * A[(kk+0)*N+j];
            B[i*N+j] += A[i*N+kk+1] * A[(kk+1)*N+j];
        }
    }
}

27

slide-42
SLIDE 42

simple blocking – expanded

for (int kk = 0; kk < N; kk += 2) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            /* process a "block" of 2 k values: */
            B[i*N+j] += A[i*N+kk+0] * A[(kk+0)*N+j];
            B[i*N+j] += A[i*N+kk+1] * A[(kk+1)*N+j];
        }
    }
}

temporal locality in Bij
more spatial locality in Aik
still good spatial locality in Akj, Bij

27

slide-45
SLIDE 45

improvement in read misses

[graph: read misses per 1K instructions vs. N (100 to 600), "blocked (kk += 2)" vs. "unblocked"]

28

slide-46
SLIDE 46

simple blocking (2)

same thing for i in addition to k?

for (int kk = 0; kk < N; kk += 2) {
    for (int ii = 0; ii < N; ii += 2) {
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            for (int k = kk; k < kk + 2; ++k)
                for (int i = ii; i < ii + 2; ++i)
                    B[i*N+j] += A[i*N+k] * A[k*N+j];
        }
    }
}

29

slide-47
SLIDE 47

simple blocking — expanded

for (int k = 0; k < N; k += 2) {
    for (int i = 0; i < N; i += 2) {
        /* load a block around Aik */
        for (int j = 0; j < N; ++j) {
            /* process a "block": */
            Bi+0,j += Ai+0,k+0 * Ak+0,j
            Bi+0,j += Ai+0,k+1 * Ak+1,j
            Bi+1,j += Ai+1,k+0 * Ak+0,j
            Bi+1,j += Ai+1,k+1 * Ak+1,j
        }
    }
}

Now Akj reused in the inner loop — more calculations per load!

30

slide-49
SLIDE 49

generalizing cache blocking

for (int kk = 0; kk < N; kk += K) {
    for (int ii = 0; ii < N; ii += I) {
        /* with an I × K block of A hopefully cached: */
        for (int jj = 0; jj < N; jj += J) {
            /* with a K × J block of A, an I × J block of B cached: */
            for i in ii to ii+I:
                for j in jj to jj+J:
                    for k in kk to kk+K:
                        B[i * N + j] += A[i * N + k] * A[k * N + j];

Bij used K times for one miss — N²/K misses
Aik used J times for one miss — N²/J misses
Akj used I times for one miss — N²/I misses
catch: IK + KJ + IJ elements must fit in cache

31


slide-53
SLIDE 53

view 2: divide and conquer

partial_square(float *A, float *B,
               int startI, int endI, ...) {
    for (int i = startI; i < endI; ++i) {
        for (int j = startJ; j < endJ; ++j) {
            ...
}

square(float *A, float *B, int N) {
    for (int ii = 0; ii < N; ii += BLOCK)
        ...
        /* segment of A, B in use fits in cache! */
        partial_square(A, B, ii, ii + BLOCK,
                       jj, jj + BLOCK, ...);
}

32

slide-54
SLIDE 54

array usage: kij order

[diagram: A with element Aik and row Ak0 to AkN highlighted; B with row Bi0 to BiN highlighted]

for all k: for all i: for all j: Bij += Aik × Akj

N calculations for Aik; 1 for Akj, Bij

Aik reused in innermost loop (over j): definitely cached (plus rest of cache block)
Akj reused in next middle loop (over i): cached only if entire row fits
Bij reused in next outer loop: probably not still in cache next time (but, at least some spatial locality)

33

slide-59
SLIDE 59

inefficiencies

if a row doesn’t fit in cache — cache effectively holds one element

everything else — too much other stuff between accesses

if a row does fit in cache — cache effectively holds one row + one element

everything else — too much other stuff between accesses

34

slide-60
SLIDE 60

array usage (better)

[diagram: Aik to Ai+1,k+1; Ak0 to Ak+1,N; Bi0 to Bi+1,N]

more temporal locality:
    N calculations for each Aik
    2 calculations for each Bij (for k, k + 1)
    2 calculations for each Akj (for k, k + 1)
more spatial locality:
    calculate on Ai,k and Ai,k+1 together
    both in the same cache block — same number of cache loads

35

slide-62
SLIDE 62

array usage: block

[diagram: Aik block (I × K), Akj block (K × J), Bij block (I × J)]

inner loop keeps “blocks” from A, B in cache

Bij calculation uses strips from A: K calculations for one load (cache miss)
Aik calculation uses strips from A, B: J calculations for one load (cache miss)
(approx.) KIJ fully cached calculations for KI + IJ + KJ loads (assuming everything stays in cache)

36

slide-66
SLIDE 66

cache blocking efficiency

load I × K elements of Aik: do J multiplies with each
load K × J elements of Akj: do I multiplies with each
load I × J elements of Bij: do K adds with each

bigger blocks — more work per load!
catch: IK + KJ + IJ elements must fit in cache

37

slide-67
SLIDE 67

cache blocking rule of thumb

fill most of the cache with useful data and do as much work as possible from that

example: my desktop has a 32KB L1 cache
I = J = K = 48 uses 48² × 3 elements, or 27KB

assumption: conflict misses aren’t important

38