
Cache Memories, Cache Complexity
Marc Moreno Maza, University of Western Ontario, London, Ontario (Canada)
CS3101 and CS4402-9535



  1. Plan

Cache Memories, Cache Complexity
Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
CS3101 and CS4402-9535

1. Hierarchical memories and their impact on our programs
2. Cache Analysis in Practice
3. The Ideal-Cache Model
4. Cache Complexity of some Basic Operations
5. Matrix Transposition
6. A Cache-Oblivious Matrix Multiplication Algorithm

(Moreno Maza) Cache Memories, Cache Complexity CS3101 and CS4402-9535 1 / 100

Hierarchical memories and their impact on our programs

The memory hierarchy, from the upper level (fastest, smallest, most expensive per byte) to the lower level (slowest, largest, cheapest per byte):

- CPU registers: 100s of bytes; 300-500 ps (0.3-0.5 ns) access; transfer unit: instruction operands of 1-8 bytes; staged by the program/compiler.
- L1 cache: 10s-100s of KBytes; ~1 ns access; transfer unit: blocks of 32-64 bytes; ~$1000s/GByte; staged by the cache controller.
- L2 cache: ~10 ns access; transfer unit: blocks of 64-128 bytes; staged by the cache controller.
- Main memory: GBytes; 80 ns - 200 ns access; transfer unit: OS pages of 4K-8K bytes; ~$100/GByte; staged by the OS.
- Disk: 10s of TBytes; 10 ms access (10,000,000 ns); transfer unit: files of MBytes; ~$1/GByte; staged by the user/operator.
- Tape: effectively infinite capacity; sec-min access; ~$1/GByte.

Moving up the hierarchy, access gets faster; moving down, capacity grows and cost per byte falls.

  2. CPU Cache (1/7)

A CPU cache is an auxiliary memory which is smaller and faster than the main memory, and which stores copies of the main memory locations that are expected to be frequently used. Most modern desktop and server CPUs have at least three independent caches: the data cache, the instruction cache and the translation look-aside buffer (TLB).

CPU Cache (2/7)

Each location in each memory (main or cache) has:
- a datum (cache line), which ranges between 8 and 512 bytes in size, while a datum requested by a CPU instruction ranges between 1 and 16 bytes;
- a unique index (called an address in the case of the main memory).
In the cache, each location also has a tag (storing the address of the corresponding cached datum).

CPU Cache (3/7)

When the CPU needs to read or write a location, it checks the cache:
- if it finds it there, we have a cache hit;
- if not, we have a cache miss and (in most cases) the processor needs to create a new entry in the cache.
Making room for a new entry requires a replacement policy: Least Recently Used (LRU) discards the least recently used items first; this requires the use of age bits.

CPU Cache (4/7)

Read latency (the time to read a datum from the main memory) requires keeping the CPU busy with something else:
- out-of-order execution: attempt to execute independent instructions arising after the instruction that is waiting for the cache miss;
- hyper-threading (HT): allow an alternate thread to use the CPU.

  3. CPU Cache (5/7)

Modifying data in the cache requires a write policy for updating the main memory:
- write-through cache: writes are immediately mirrored to main memory;
- write-back cache: the main memory is updated only when the data is evicted from the cache.
The cache copy may become out-of-date or stale if other processors modify the original entry in the main memory.

CPU Cache (6/7)

The placement policy decides where in the cache a copy of a particular entry of main memory will go:
- fully associative: any entry in the cache can hold it;
- direct mapped: only one possible entry in the cache can hold it;
- N-way set associative: N possible entries can hold it.

Cache issues

- Cold miss: the first time the data is accessed. Cure: prefetching may be able to reduce this type of cost.
- Capacity miss: the previous copy has been evicted because too much data was touched in between, the working data set being too large. Cure: reorganize the data accesses so that reuse occurs before eviction.
- Conflict miss: multiple data items are mapped to the same location, with eviction before the cache is full. Cure: rearrange data and/or pad arrays.
- True sharing miss: occurs when a thread on another processor wants the same data. Cure: minimize sharing.
- False sharing miss: occurs when another processor uses different data in the same cache line. Cure: pad data.

See Cache Performance for SPEC CPU2000 by J.F. Cantin and M.D. Hill. The SPEC CPU2000 suite is a collection of 26 compute-intensive, non-trivial programs used to evaluate the performance of a computer's CPU, memory system, and compilers (http://www.spec.org/osg/cpu2000).

  4. A typical matrix multiplication C code

#define IND(A, x, y, d) A[(x)*(d)+(y)]

float testMM(const int x, const int y, const int z)
{
  double *A; double *B; double *C;
  long started, ended;
  float timeTaken;
  int i, j, k;
  srand(getSeed());
  A = (double *)malloc(sizeof(double)*x*y);
  B = (double *)malloc(sizeof(double)*x*z);
  C = (double *)malloc(sizeof(double)*y*z);
  for (i = 0; i < x*z; i++) B[i] = (double) rand();
  for (i = 0; i < y*z; i++) C[i] = (double) rand();
  for (i = 0; i < x*y; i++) A[i] = 0;
  started = example_get_time();
  for (i = 0; i < x; i++)
    for (j = 0; j < y; j++)
      for (k = 0; k < z; k++)
        // A[i][j] += B[i][k] * C[k][j];
        IND(A,i,j,y) += IND(B,i,k,z) * IND(C,k,j,y);
  ended = example_get_time();
  timeTaken = (ended - started)/1.f;
  return timeTaken;
}

Issues with matrix representation

(Figure: A = B × C, where A is x × y, B is x × z and C is z × y, all stored in row-major order.)

Contiguous accesses are better:
- data is fetched as cache lines (64 bytes per cache line on a Core 2 Duo);
- with contiguous data, a single cache fetch supports 8 reads of doubles;
- transposing the matrix C should reduce L1 cache misses!

Transposing for optimizing spatial locality

float testMM(const int x, const int y, const int z)
{
  double *A; double *B; double *C; double *Cx;
  long started, ended;
  float timeTaken;
  int i, j, k;
  A = (double *)malloc(sizeof(double)*x*y);
  B = (double *)malloc(sizeof(double)*x*z);
  C = (double *)malloc(sizeof(double)*y*z);
  Cx = (double *)malloc(sizeof(double)*y*z);
  srand(getSeed());
  for (i = 0; i < x*z; i++) B[i] = (double) rand();
  for (i = 0; i < y*z; i++) C[i] = (double) rand();
  for (i = 0; i < x*y; i++) A[i] = 0;
  started = example_get_time();
  for (j = 0; j < y; j++)
    for (k = 0; k < z; k++)
      IND(Cx,j,k,z) = IND(C,k,j,y);
  for (i = 0; i < x; i++)
    for (j = 0; j < y; j++)
      for (k = 0; k < z; k++)
        IND(A,i,j,y) += IND(B,i,k,z) * IND(Cx,j,k,z);
  ended = example_get_time();
  timeTaken = (ended - started)/1.f;
  return timeTaken;
}

Issues with data reuse

(Figure: a 1024 × 1024 matrix A computed as the product of a 1024 × 384 matrix B and a 384 × 1024 matrix C, with a 32 × 32 block of A highlighted.)

- Naive calculation of a row of A, so computing 1024 coefficients: 1024 accesses in A, 384 in B and 1024 × 384 = 393,216 in C. Total = 394,624.
- Computing a 32 × 32 block of A, so computing again 1024 coefficients: 1024 accesses in A, 384 × 32 in B and 32 × 384 in C. Total = 25,600.

The iteration space is traversed so as to reduce memory accesses.
